Most teams using AI in production make the same mistake: they pick one model and use it for everything. Every autocomplete suggestion, every code generation task, every architecture review goes through the same frontier model. This is like hiring a senior architect to fix typos. It works, but the cost-to-value ratio is absurd.
Model routing is the practice of directing each task to the model best suited for it - matching task complexity to model capability. Done right, it cuts API costs by 5-10x while maintaining or improving output quality, because smaller models are often better at simple tasks than large models that overthink them.
## The Model Tier Landscape in 2026
The major model families have settled into clear capability tiers. The names vary by provider, but the pattern is consistent:
| Tier | Claude | OpenAI | Google | Typical Use |
|---|---|---|---|---|
| Fast/Small | Haiku | GPT-4o mini | Gemini Flash | Search, autocomplete, classification, simple transforms |
| Mid/Balanced | Sonnet | GPT-4o | Gemini Pro | Code generation, refactoring, summarization, test writing |
| Frontier | Opus | o3 | Gemini Ultra | Architecture, complex debugging, multi-file reasoning, novel problems |
The cost differences between tiers are dramatic. A typical comparison for 1 million tokens of input + output:
| Model Tier | Approximate Cost per 1M Tokens | Relative Speed | Context Window |
|---|---|---|---|
| Haiku-class | $0.25 - $1.00 | Fastest (< 500ms for most tasks) | 200K |
| Sonnet-class | $3.00 - $10.00 | Moderate (1-3 seconds typical) | 200K |
| Opus-class | $15.00 - $60.00 | Slowest (3-15 seconds typical) | 200K+ |
Using Opus for a task that Haiku handles perfectly is a 15-60x cost multiplier for zero quality gain. Across thousands of daily requests, this adds up fast.
## What Each Tier Does Best
### Fast Models (Haiku-class)
Fast models excel at tasks with clear patterns and limited ambiguity. They are not dumber - they are more focused. For well-defined tasks, this focus is an advantage.
Ideal tasks:
- Code autocomplete and inline suggestions
- Variable and function name generation
- Simple code transformations (rename, extract, inline)
- Classification (is this a bug report or feature request?)
- Search query reformulation
- Commit message generation
- Documentation string generation
```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Perfect Haiku task: generate a docstring for an existing function
def route_to_haiku(task):
    """Simple, well-defined tasks with clear correct answers."""
    return client.messages.create(
        model="claude-haiku",  # tier placeholder; substitute the current Haiku model ID
        max_tokens=256,
        messages=[{"role": "user", "content": f"Write a Python docstring for this function:\n{task.code}"}],
    )
```
Fast models also work well as filters. Before sending a complex task to an expensive model, a fast model can classify whether the task actually needs the expensive model.
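One minimal sketch of that filter pattern, with the model call injected as a callable so the gating logic is testable without a live API (`needs_frontier_model` and the prompt wording are hypothetical, not from any SDK):

```python
# Hypothetical pre-filter: ask a fast model to classify a task before
# committing to an expensive one. `call_model` is any function that sends
# a prompt to a model and returns its text; swap in a real client call.
def needs_frontier_model(task_text, call_model):
    """Return True if a fast model judges the task complex enough for Opus."""
    verdict = call_model(
        model="claude-haiku",  # tier placeholder from this article
        prompt=(
            "Answer SIMPLE or COMPLEX only. Does this task need "
            "multi-file reasoning or architectural judgment?\n" + task_text
        ),
    )
    return verdict.strip().upper().startswith("COMPLEX")
```

The fast model's classification costs a fraction of a cent, so even when it answers COMPLEX and the request escalates anyway, the overhead is negligible.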
### Mid-Tier Models (Sonnet-class)
The workhorse tier. Mid-tier models handle most day-to-day coding tasks with good quality and reasonable cost. They understand context, follow instructions reliably, and generate code that is correct for standard patterns.
Ideal tasks:
- Function and class implementation from specifications
- Unit and integration test generation
- Code refactoring within a single file
- Bug fixes with clear reproduction steps
- API endpoint implementation following existing patterns
- Database query generation
```python
# Good Sonnet task: implement a function given a spec and context
def route_to_sonnet(task):
    """Standard implementation tasks with clear patterns."""
    return client.messages.create(
        model="claude-sonnet",
        max_tokens=4096,
        system=task.project_context,
        messages=[{"role": "user", "content": task.specification}],
    )
```
### Frontier Models (Opus-class)
Frontier models are for tasks where the additional reasoning capability justifies the cost. These are tasks involving ambiguity, multi-step reasoning, large context windows, or novel problems without clear patterns.
Ideal tasks:
- System architecture design and review
- Complex debugging across multiple files
- Security audit and vulnerability analysis
- Performance optimization with tradeoff analysis
- Migrating between frameworks or major versions
- Designing APIs and data models from requirements
- Code review requiring deep domain understanding
```python
# Opus-justified task: debug a complex distributed system issue
def route_to_opus(task):
    """Complex reasoning, ambiguity, or multi-file coordination."""
    return client.messages.create(
        model="claude-opus",
        max_tokens=8192,
        system=task.full_project_context,
        messages=task.conversation_history,
    )
```
## Building a Model Router
A practical model router classifies incoming tasks and directs them to the appropriate tier. The simplest effective approach is rule-based:
```python
class ModelRouter:
    def __init__(self):
        self.rules = {
            "haiku": [
                lambda t: t.type in ("autocomplete", "docstring", "commit_message"),
                lambda t: t.estimated_output_tokens < 200,
                lambda t: t.type == "classification",
            ],
            "opus": [
                lambda t: t.type in ("architecture", "security_audit", "debug_complex"),
                lambda t: t.files_involved > 5,
                lambda t: t.type == "migration",
            ],
        }

    def route(self, task) -> str:
        for rule in self.rules.get("opus", []):
            if rule(task):
                return "claude-opus"
        for rule in self.rules.get("haiku", []):
            if rule(task):
                return "claude-haiku"
        return "claude-sonnet"  # default to mid-tier
```
The router checks Opus conditions first (to avoid underserving complex tasks), then Haiku conditions (to save cost on simple tasks), and defaults to Sonnet for everything else. This default-to-middle approach is safe - Sonnet handles the widest range of tasks acceptably.
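The rules assume each task arrives with a little metadata. A minimal sketch of that task object, with hypothetical defaults for a typical mid-complexity request:

```python
from dataclasses import dataclass

# Hypothetical task record carrying the fields the routing rules inspect.
# Field names match the lambdas in ModelRouter; the defaults are
# illustrative, not prescribed by any API.
@dataclass
class Task:
    type: str
    estimated_output_tokens: int = 500
    files_involved: int = 1
```

Populating `estimated_output_tokens` and `files_involved` at enqueue time is what makes rule-based routing possible; if that metadata is unavailable, the cascading pattern below is the better fit.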
## The Cascading Pattern
Rule-based routing requires knowing the task complexity upfront. The cascading pattern handles uncertainty by starting cheap and escalating:
```python
async def cascading_generate(task, quality_threshold=0.8):
    # Try Haiku first
    result = await generate(model="claude-haiku", task=task)
    if evaluate_quality(result) >= quality_threshold:
        return result  # cost: $0.001

    # Haiku wasn't good enough, try Sonnet
    result = await generate(model="claude-sonnet", task=task)
    if evaluate_quality(result) >= quality_threshold:
        return result  # cost: $0.01

    # Fall through to Opus
    result = await generate(model="claude-opus", task=task)
    return result  # cost: $0.05
```
The key is the evaluate_quality function. For code generation, this can be concrete: does the code compile? Do existing tests pass? Does the type checker accept it? For less structured output, a fast model can evaluate whether the response adequately addresses the prompt - using Haiku to judge whether Haiku’s output was sufficient is effective and cheap.
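For generated Python, a minimal structural version of that check might look like the following. This is a sketch only: it uses parse success and docstring coverage as weak quality proxies, whereas a production gate would also run the test suite and a type checker.

```python
import ast

def evaluate_quality(code: str) -> float:
    """Score generated Python from 0.0 to 1.0 using cheap structural checks."""
    score = 0.0
    try:
        tree = ast.parse(code)  # half credit: the output is at least valid Python
        score += 0.5
    except SyntaxError:
        return 0.0
    # Remaining credit: every defined function carries a docstring,
    # a weak but free proxy for care in the generated code.
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    if funcs and all(ast.get_docstring(f) for f in funcs):
        score += 0.5
    return score
```

The threshold then becomes a tuning knob: a strict threshold escalates more tasks (higher cost, higher quality floor), a loose one keeps more traffic on the cheap tier.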
The cascading pattern is particularly powerful because it is self-optimizing. Most tasks resolve at the cheapest tier. Only genuinely complex tasks escalate. The average cost per task drops dramatically compared to routing everything through a frontier model.
## Cost Impact in Practice
Consider a development team generating 1,000 AI requests per day with this distribution:
| Task Type | Daily Count | Without Routing (all Opus) | With Routing | Savings |
|---|---|---|---|---|
| Autocomplete/simple | 600 | 600 x $0.05 = $30 | 600 x $0.001 = $0.60 | 98% |
| Code generation | 300 | 300 x $0.05 = $15 | 300 x $0.01 = $3.00 | 80% |
| Architecture/complex | 100 | 100 x $0.05 = $5 | 100 x $0.05 = $5.00 | 0% |
| Daily total | 1,000 | $50.00 | $8.60 | 83% |
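The table's arithmetic can be reproduced in a few lines; the per-request costs are this article's rough figures, not quoted prices:

```python
# Blended daily cost with and without routing, using the article's numbers.
# Each entry: task type -> (daily count, routed cost per request in USD).
DAILY_MIX = {
    "simple": (600, 0.001),      # handled by the Haiku tier
    "generation": (300, 0.01),   # handled by the Sonnet tier
    "complex": (100, 0.05),      # genuinely needs the Opus tier
}
OPUS_COST = 0.05  # per-request cost if every task went to Opus

all_opus = sum(count * OPUS_COST for count, _ in DAILY_MIX.values())
routed = sum(count * cost for count, cost in DAILY_MIX.values())
savings = 1 - routed / all_opus  # fraction of spend eliminated by routing
```

Shifting the mix changes the answer quickly: the more traffic that lands in the cheap tier, the closer the blended cost gets to the Haiku price.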
That is roughly $1,250/month saved for a single team. Across an organization with multiple teams, model routing is often the single highest-impact cost optimization available.
## Routing in AI Coding Tools
Most AI coding tools now support model routing natively or through configuration. In Claude Code, the model can be specified per task type. In Cursor, different models can be assigned to tab completion versus chat versus agent mode. The principle is the same regardless of the tool: match the model to the task.
For teams building custom AI tooling, the router should be a first-class component - not an afterthought. It sits between the task queue and the model API, and every request flows through it. Logging which model handled which task and the resulting quality score provides data for continuously tuning the routing rules.
## The Bottom Line
Model routing is not about using cheap models to save money. It is about using the right model for each task to optimize the cost-quality tradeoff across the entire workload. Fast models are better at simple tasks. Mid-tier models handle standard work reliably. Frontier models justify their cost on genuinely complex problems. A routing layer that makes this selection automatically - whether through rules, cascading, or learned classifiers - is foundational infrastructure for any team using AI at scale.