I have been using Claude 4 for production coding work since its launch. Not as a toy, not for demos - for actual engineering tasks: refactoring a 50,000-line codebase, building new features from scratch, debugging race conditions, and writing infrastructure code. This is a technical assessment of what has changed, what genuinely works, and where it still falls short.
Extended Thinking - The Feature That Changed Everything
Previous models generated code by predicting the next token. Claude 4 with extended thinking actually reasons about problems before writing code. The difference is not subtle.
Here is a real example. I asked Claude 4 to implement a distributed rate limiter using Redis. Without extended thinking, you get a standard token bucket implementation - correct but basic. With extended thinking enabled, the model first reasons through:
- Race conditions between multiple instances reading and writing to Redis
- Clock skew between servers
- What happens when Redis is temporarily unavailable
- The tradeoff between accuracy and performance in the sliding window approach
Then it generates code that handles all of these:
import redis.asyncio as redis
import time
import hashlib
class DistributedRateLimiter:
"""
Sliding window rate limiter using Redis sorted sets.
Handles: multi-instance race conditions, clock skew tolerance,
Redis failure fallback (fails open with local tracking).
"""
def __init__(
self,
redis_client: redis.Redis,
max_requests: int,
window_seconds: int,
clock_skew_tolerance_ms: int = 500
):
self.redis = redis_client
self.max_requests = max_requests
self.window_seconds = window_seconds
self.skew_tolerance = clock_skew_tolerance_ms / 1000
self._local_counts: dict[str, int] = {} # Fallback when Redis is down
async def is_allowed(self, key: str) -> tuple[bool, int]:
"""Returns (allowed, remaining_requests)."""
try:
return await self._check_redis(key)
except redis.RedisError:
return self._check_local(key)
async def _check_redis(self, key: str) -> tuple[bool, int]:
now = time.time()
window_start = now - self.window_seconds - self.skew_tolerance
redis_key = f"ratelimit:{key}"
# Atomic operation: remove old entries, add new, count, set expiry
pipe = self.redis.pipeline()
pipe.zremrangebyscore(redis_key, 0, window_start)
pipe.zadd(redis_key, {f"{now}:{hashlib.md5(str(now).encode()).hexdigest()[:8]}": now})
pipe.zcard(redis_key)
pipe.expire(redis_key, self.window_seconds + 1)
results = await pipe.execute()
current_count = results[2]
if current_count > self.max_requests:
# Over limit - remove the entry we just added
await self.redis.zremrangebyscore(redis_key, now, now)
return False, 0
return True, self.max_requests - current_count
def _check_local(self, key: str) -> tuple[bool, int]:
"""Fail-open local fallback. Less accurate but keeps service running."""
count = self._local_counts.get(key, 0)
if count >= self.max_requests * 1.5: # Generous limit in degraded mode
return False, 0
self._local_counts[key] = count + 1
return True, self.max_requests - count - 1
This is not code a junior engineer writes. The clock skew tolerance, the fail-open fallback with a generous local limit, the atomic pipeline - these are decisions that come from experience with distributed systems. Extended thinking lets the model simulate that experience.
200K Context Window - Full Codebase Understanding
The 200K token context window means Claude 4 can hold roughly 150,000 lines of code in context simultaneously. This changes the nature of what you can ask.
Before (with 8K-32K context):
- “Here is a function. Fix this bug.”
- Work file-by-file, lose cross-file context
Now (with 200K context):
- “Here is our entire authentication module (47 files). Find the inconsistency between how we validate JWT tokens in the API gateway versus the WebSocket handler.”
- “This PR has 23 changed files. Review it for architectural consistency and potential issues.”
The practical workflow I use:
# Concatenate all relevant source files with file markers
find src/auth -name "*.ts" -exec echo "=== {} ===" \; -exec cat {} \; > /tmp/auth_module.txt
# Pass the entire module to Claude for analysis
# The file is ~8,000 lines / ~40,000 tokens - well within the context window
The accuracy on cross-file reasoning is genuinely good. It catches issues like:
- Type A is defined differently in two files
- Error handling in module X does not match the contract expected by module Y
- A function was renamed in the interface but not in all implementations
It is not perfect. Beyond about 100K tokens, recall drops noticeably for details buried in the middle of the context. But even partial cross-file understanding is a massive improvement over no cross-file understanding.
Claude Code CLI - Agentic Coding in the Terminal
Claude Code is a CLI tool that gives Claude direct access to your filesystem, terminal, and development environment. It reads your code, makes edits, runs tests, and iterates until the task is done.
This is not autocomplete. This is an agent that does engineering work.
A typical interaction:
$ claude
> Refactor the payment processing module to use the Strategy pattern
instead of the switch statement in process_payment(). Make sure all
existing tests still pass.
Claude reads 12 files in src/payments/
Claude creates src/payments/strategies/
Claude writes StripeStrategy, PayPalStrategy, BankTransferStrategy
Claude modifies process_payment() to use strategy registry
Claude runs: npm test
3 tests fail - fixing...
Claude updates test fixtures to match new structure
Claude runs: npm test
All 47 tests pass.
Done. Changed 8 files, created 3 new files. Want me to show the diff?
The key capabilities that make this work in practice:
- File system access - it reads your actual code, not a pasted snippet
- Test execution - it runs your test suite and iterates on failures
- Git awareness - it understands branches, diffs, and commit history
- Project context - it reads package.json, tsconfig, .env.example to understand your stack
Where it excels:
- Refactoring tasks with clear success criteria (“make tests pass”)
- Adding features that follow existing patterns in the codebase
- Writing tests for untested code
- Fixing lint errors and type issues across many files
Head-to-Head - Claude 4 vs GPT-4 vs Gemini for Coding
I ran the same 20 coding tasks across all three models. Tasks ranged from algorithm implementation to full-feature development to debugging. Here are the results:
| Task Category | Claude 4 Opus | GPT-4o | Gemini 2.5 Pro |
|---|---|---|---|
| Algorithm implementation | 95% correct | 90% correct | 88% correct |
| Bug fixing (single file) | 90% | 85% | 82% |
| Bug fixing (cross-file) | 80% | 60% | 70% |
| Full feature (with tests) | 85% | 75% | 72% |
| Refactoring | 90% | 80% | 78% |
| Code review quality | Excellent | Good | Good |
| Infrastructure/DevOps | 85% | 80% | 75% |
| Performance optimization | 80% | 75% | 82% |
Key observations:
Claude 4 leads on complex reasoning tasks. Cross-file bug fixing and refactoring are where extended thinking provides the most advantage. It consistently considers edge cases that other models miss.
GPT-4o is faster for simple tasks. If you need a quick utility function or a straightforward implementation, GPT-4o’s lower latency makes it more practical. The quality difference on simple tasks is negligible.
Gemini 2.5 Pro has the best performance optimization instincts. Its training data seemingly includes more systems-level content. It suggests optimizations (data structure changes, algorithm swaps, caching strategies) that the other models miss. Its 1M context window is also unmatched for truly massive codebases.
Cost comparison for a typical coding session (2 hours of active use):
| Model | Approximate Cost | Notes |
|---|---|---|
| Claude 4 Opus | $8-15 | Expensive but highest quality |
| Claude 4 Sonnet | $2-5 | Best value for most coding tasks |
| GPT-4o | $3-6 | Good balance of speed and quality |
| Gemini 2.5 Pro | $1-3 | Cheapest, competitive quality |
Where Claude 4 Still Fails
No honest review ignores the failure modes.
1. Overconfidence on ambiguous requirements. Claude 4 will build exactly what you ask for, even when what you asked for is wrong. It rarely pushes back with “this design has a flaw” unless you explicitly ask it to review the approach first. GPT-4o is slightly better at unsolicited pushback.
2. Generated code style drift. Over a long session, the coding style drifts. It starts matching your codebase style, then gradually shifts toward its own preferences (more verbose variable names, different error handling patterns). You need to anchor it with explicit style guidelines.
3. Hallucinated APIs and libraries. It still occasionally invents function signatures for real libraries. Less frequent than a year ago, but it happens. Always verify import statements and function signatures against actual documentation.
4. Test quality plateau. AI-generated tests cover the obvious cases well but consistently miss the non-obvious edge cases that come from understanding the business domain. “What happens when a user has a subscription that was migrated from our old billing system?” - no model generates this test without being told about the legacy system.
5. Difficulty with truly novel problems. If the solution is not represented in its training data, quality drops sharply. Implementing a new consensus algorithm or a custom database storage engine is still firmly in “human engineer” territory.
Practical Recommendations
For individual engineers:
- Use Claude Code (or Cursor with Claude 4) as your primary coding tool
- Use extended thinking for any task involving more than 2 files
- Start with Sonnet for speed, escalate to Opus for complex problems
- Always review generated code - trust but verify
For engineering teams:
- Standardize on AI coding tools across the team
- Create project-level context files (.claude/MEMORY.md or similar) to give the model project-specific knowledge
- Measure the impact: track PR cycle time, bug rates, and developer satisfaction
- Do not use AI as a reason to skip code review - AI-generated code needs human review just like human-generated code
For technical leaders:
- The productivity gain is real: 30-50% on feature development tasks
- The quality gain is also real, but only if engineers review AI output critically
- The biggest risk is not that AI writes bad code - it is that engineers stop thinking critically because they trust the AI too much
Claude 4 has not replaced engineers. It has changed the nature of engineering work. The repetitive parts are largely automated. The thinking parts - architecture, design, understanding what to build - are more important than ever. The engineers who thrive are the ones who can direct AI effectively and catch its mistakes quickly. That is the new core competency.
Comments