When you enable extended thinking on Claude or switch to an o-series model, the price per request jumps 3 to 10x. That is not because you are getting a smarter model. You are getting the same model spending more computation on your specific question at inference time. Understanding the difference between training-time compute and inference-time compute changes how you decide when these models are worth using.
How a Standard Model Generates Output
A standard LLM takes all the input tokens (system prompt, history, user message), runs them through the transformer layers in a single forward pass to predict the next token, then samples that token and appends it to the sequence. It repeats this process until it generates an end token. The cost of each forward pass scales with model size, so the total cost of a response grows roughly linearly with the number of output tokens.
The model cannot revise what it has already written. Once a token is generated, it becomes part of the context for all future tokens. If the model starts down a wrong path in step two of a twelve-step proof, it has no mechanism to erase that and try a different approach. It can only continue forward.
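The loop described above can be sketched in a few lines. This is a toy illustration, not how any real inference stack is written: `toy_next_token` and its lookup table are made-up stand-ins for a forward pass through the model. The point is the shape of the loop: one call per token, and every token is appended to the context for good.

```python
from typing import Callable

END = "<end>"


def generate(
    prompt: list[str],
    next_token: Callable[[list[str]], str],
    max_new_tokens: int = 50,
) -> list[str]:
    """Append-only decoding loop: one next_token call per generated token."""
    context = list(prompt)
    generated: list[str] = []
    for _ in range(max_new_tokens):
        token = next_token(context)  # stands in for one forward pass
        if token == END:
            break
        context.append(token)  # committed: nothing is ever removed or edited
        generated.append(token)
    return generated


# Hypothetical "model": the next token depends only on the last token in context.
TOY_TABLE = {"count:": "1", "1": "2", "2": "3", "3": END}


def toy_next_token(context: list[str]) -> str:
    return TOY_TABLE.get(context[-1], END)


print(generate(["count:"], toy_next_token))  # ['1', '2', '3']
```

Note that nothing in the loop ever pops from `context`. Any "revision" has to take the form of more appended tokens.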
What Reasoning Models Actually Do
Reasoning models generate a chain of thought before producing the final response. This internal chain of thought:
- Is generated token by token, exactly like regular output
- Is billed as output tokens (sometimes at a slightly different rate)
- Can be 5 to 50 times longer than the final answer
- Functions as a working scratchpad that the model can read as it generates more of it
The crucial difference from standard chain-of-thought prompting (“Let’s think step by step”): when you put the reasoning in the visible output, the model still has to commit to every token. It cannot explore and discard approaches. Reasoning models with dedicated thinking space can generate “wait, this approach breaks down at step 4, let me try a different decomposition” and actually follow through on that correction. The rejected path is in the thinking tokens, which you either see (Claude’s extended thinking) or do not see (OpenAI’s o-series), but either way the model benefits from having taken it.
The other factor: reasoning models are fine-tuned specifically on data designed to teach effective scratchpad use. The training reinforces which problem types benefit from extended exploration and how to structure that exploration.
The Token Math
Here is a concrete cost comparison for a hard coding problem:
```
Standard Claude Sonnet request:
  Input:  1,400 tokens
  Output:   600 tokens
  Total:  2,000 tokens
  Cost (input $3/M, output $15/M): $0.0042 input + $0.0090 output = $0.0132

Same request with extended thinking (budget_tokens=8000):
  Input:           1,400 tokens
  Thinking tokens: 5,200 tokens (actually used)
  Output:            520 tokens
  Total:           7,120 tokens
  Cost: $0.0042 input + $0.0858 thinking+output = $0.0900
  Ratio: ~6.8x more expensive
```
The thinking budget is a maximum, not a fixed charge. Set budget_tokens=8000 and if the problem resolves in 800 thinking tokens, that is all you pay for. But on genuinely complex problems, the model will often use most of the budget you give it.
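If you want to run this arithmetic for your own workload, a small helper is enough. The sketch below assumes the prices used in this example ($3/M input, $15/M output) and assumes thinking tokens are billed at the output rate; `request_cost` is a name made up for illustration, so check current pricing before relying on the numbers.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    thinking_tokens: int = 0,
    input_price_per_m: float = 3.0,    # assumed price, USD per million tokens
    output_price_per_m: float = 15.0,  # assumed price, USD per million tokens
) -> float:
    """Cost in USD for one request, billing thinking tokens as output tokens."""
    billed_output = output_tokens + thinking_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


standard = request_cost(1_400, 600)
with_thinking = request_cost(1_400, 520, thinking_tokens=5_200)
worst_case = request_cost(1_400, 520, thinking_tokens=8_000)  # full budget used

print(f"standard:      ${standard:.4f}")       # standard:      $0.0132
print(f"with thinking: ${with_thinking:.4f}")  # with thinking: $0.0900
print(f"ratio:         {with_thinking / standard:.1f}x")  # ratio:         6.8x
print(f"worst case:    ${worst_case:.4f}")     # worst case:    $0.1320
```

The worst-case line is the one to budget against: it is what the request costs if the model spends the entire thinking budget.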
Training Scaling vs Inference Scaling
Two levers for getting better model output:
- Training scaling: train a larger model on more data with more compute. The cost is amortized across every request the model ever handles. Each inference is cheap.
- Inference scaling (test-time compute): spend more computation on each individual request at inference time. The cost is per request.
These are not substitutes for each other on all tasks. A 10x inference compute budget does not generally match a 10x larger model. But for a specific class of problems - ones with checkable correct answers and multi-step reasoning requirements - inference scaling is surprisingly effective.
The intuition: if you are solving a math problem, you can verify your work. A model that spends 5x more compute working through the proof and checking each step will catch errors that a single-pass model commits to and cannot revise. For problems without checkable answers, the same extra compute does not have the same payoff because there is no verification signal to act on.
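A toy probability model makes the intuition concrete. Everything here is an assumption for illustration, not a measurement of any real model: each step fails independently with probability `step_error`, a check catches a failed step with probability `catch_rate`, and a caught step gets exactly one retry.

```python
def single_pass_success(step_error: float, steps: int) -> float:
    """Chance that every step is right when nothing is ever checked."""
    return (1.0 - step_error) ** steps


def checked_success(step_error: float, steps: int, catch_rate: float) -> float:
    """Chance that every step is right when each step is checked once.

    A step stays wrong if the error is missed, or if it is caught
    and the single retry is wrong as well.
    """
    missed = step_error * (1.0 - catch_rate)
    caught_but_retry_failed = step_error * catch_rate * step_error
    residual_error = missed + caught_but_retry_failed
    return (1.0 - residual_error) ** steps


# Twelve-step problem, 5% chance of a mistake at each step (made-up numbers).
print(single_pass_success(0.05, 12))                # roughly 0.54
print(checked_success(0.05, 12, catch_rate=0.8))    # roughly 0.87
print(checked_success(0.05, 12, catch_rate=0.0))    # same as single pass
```

With `catch_rate=0.0`, which is the situation when there is no way to verify a step, the extra work buys nothing. That matches the pattern above: the payoff comes from the verification signal, not from the extra tokens themselves.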
What Happens During the Thinking Phase
```
Standard model:
  [Input tokens] ---> [Transformer layers] ---> [Output tokens]
  One forward pass per output token. Linear cost.

Reasoning model:
  [Input tokens] ---> [Thinking phase] ---> [Answer phase]
                            |
                   Token by token, same
                   transformer forward pass

  Model generates:
    "Let me break this into parts...
     Part 1: [works out subproblem]
     Wait, that assumes X but the input
     says not-X. Let me revisit...
     Actually the right approach is Y..."

  Then produces final answer
  incorporating the corrected reasoning.

  Cost: O(input + thinking + output)
  Thinking tokens are billed output tokens.
```
The thinking phase is sequential token generation just like regular output. There is no explicit backtracking in the traditional programming sense. What looks like backtracking is the model generating text that acknowledges a mistake and then generating the corrected continuation. The key is that the model can “read” everything it has written in the thinking block so far as context for each new thinking token, which gives it the ability to self-correct across hundreds or thousands of tokens of intermediate work.
The Decision Framework
Two questions determine whether extended thinking is worth the cost:
- Does the task have a verifiable or clearly evaluable correct answer?
- Does reaching that answer require multiple non-obvious reasoning steps?
If both answers are yes, extended thinking typically earns its cost. If either is no, you are probably paying a premium for marginal gain.
| Task Type | Verifiable | Multi-step | Use Thinking |
|---|---|---|---|
| Competition math (AIME, Olympiad) | Yes | Yes | Yes - 30-60% improvement |
| Hard algorithm implementation | Yes | Yes | Yes - catches edge cases |
| Security / logic vulnerability analysis | Yes | Yes | Yes - explores more attack paths |
| Multi-step planning with constraints | Partly | Yes | Usually yes |
| Complex SQL with nested conditions | Yes | Partly | Often worth it |
| Standard API integration | Yes | No | No - standard model fine |
| Factual Q&A (with source material) | Yes | No | No - lookup not reasoning |
| Classification / extraction | Yes | No | No - overkill |
| Creative writing | No | No | No - over-thinking hurts fluency |
| Simple code generation (CRUD, boilerplate) | Yes | No | No - cost not justified |
| Summarization | No | No | No |
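The two questions can be encoded directly. This is a minimal sketch of the table's logic, with one interpretation of my own: the "Partly" rows ("Usually yes", "Often worth it") are mapped to a "test first" outcome rather than a firm yes. `Level` and `thinking_recommendation` are names invented for this example.

```python
from enum import Enum


class Level(Enum):
    NO = "no"
    PARTLY = "partly"
    YES = "yes"


def thinking_recommendation(verifiable: Level, multi_step: Level) -> str:
    """Map the two framework questions to 'use', 'test first', or 'skip'."""
    if Level.NO in (verifiable, multi_step):
        return "skip"  # either answer is no: premium for marginal gain
    if verifiable is Level.YES and multi_step is Level.YES:
        return "use"
    return "test first"  # at least one 'partly': measure before committing


print(thinking_recommendation(Level.YES, Level.YES))     # use
print(thinking_recommendation(Level.YES, Level.NO))      # skip
print(thinking_recommendation(Level.PARTLY, Level.YES))  # test first
```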
Configuring Token Budgets
Setting budget_tokens too low can be worse than not using extended thinking at all. If the model hits the budget mid-reasoning, it is forced to produce a final answer from an incomplete chain of thought, which can be less accurate than a plain model’s direct response.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def call_with_thinking(prompt: str, budget: int = 5000) -> dict:
    """Send one request with extended thinking and split the response into parts."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,  # must be larger than budget_tokens
        thinking={
            "type": "enabled",
            "budget_tokens": budget,
        },
        messages=[{"role": "user", "content": prompt}],
    )

    # The response is a list of content blocks: thinking blocks, then text blocks
    thinking_text = ""
    answer_text = ""
    for block in response.content:
        if block.type == "thinking":
            thinking_text += block.thinking
        elif block.type == "text":
            answer_text += block.text

    return {
        "thinking": thinking_text,
        "answer": answer_text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,  # includes thinking tokens
    }


# hard_algo_problem, logic_puzzle, and simple_task are prompt strings you supply.

# Hard algorithmic problem: give it room to explore
result = call_with_thinking(hard_algo_problem, budget=12000)

# Logic puzzle: medium budget
result = call_with_thinking(logic_puzzle, budget=4000)

# Simple code task: skip thinking entirely
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": simple_task}],
)
```
A reasonable starting point: 5,000 budget_tokens for tasks you believe need reasoning. In production, log the actual thinking tokens consumed per request. If the model consistently uses less than 30% of the budget, either the task is simpler than you thought, or it does not benefit from this approach. If it consistently hits the ceiling, raise the budget or accept that you are truncating the reasoning.
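That logging rule can be turned into a small check you run over a batch of requests. The sketch below takes thinking-token counts you have already recorded; how you obtain a thinking-only count from the API is left open here. The 30% threshold comes from the paragraph above, while the 95% "ceiling" cutoff and the 80% definition of "consistently" are my assumptions, as is the function name.

```python
def budget_advice(
    thinking_tokens_used: list[int],
    budget: int,
    low_frac: float = 0.30,      # from the rule of thumb above
    ceiling_frac: float = 0.95,  # assumption: this close to budget counts as a hit
    consistency: float = 0.80,   # assumption: share of requests meaning "consistently"
) -> str:
    """Return 'lower_or_skip', 'raise', or 'keep' for one request type."""
    if not thinking_tokens_used:
        raise ValueError("need at least one logged request")
    n = len(thinking_tokens_used)
    low_share = sum(u < low_frac * budget for u in thinking_tokens_used) / n
    ceiling_share = sum(u >= ceiling_frac * budget for u in thinking_tokens_used) / n
    if low_share >= consistency:
        return "lower_or_skip"  # task is simpler than expected, or thinking does not help
    if ceiling_share >= consistency:
        return "raise"  # reasoning is probably being truncated
    return "keep"


print(budget_advice([400, 650, 900, 1_100], budget=5_000))      # lower_or_skip
print(budget_advice([4_900, 5_000, 4_800, 5_000], budget=5_000))  # raise
print(budget_advice([2_000, 3_000, 2_500], budget=5_000))       # keep
```

Run it per request type rather than over the whole workload, since a mixed workload will almost always come back as "keep".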
How o-Series Differs From Extended Thinking
OpenAI’s o-series models (o1, o3, o4-mini) and Claude’s extended thinking share the inference-scaling premise but differ in how they expose it:
| Aspect | Claude extended thinking | OpenAI o-series |
|---|---|---|
| Thinking visibility | Returned via API (full or summarized, depending on model) | Hidden, summary only |
| Budget control | Explicit budget_tokens parameter | reasoning_effort (low/medium/high) |
| Opt-in | Per request | Always on (varies by model) |
| Debuggability | High - you can read the reasoning | Low - black box |
| Cost predictability | High - you see token counts | Lower - effort levels are approximate |
The case for hidden thinking (o-series): the model cannot optimize its scratchpad to look impressive to you rather than to be useful. There is some evidence that visible thinking causes models to write more for the audience, which slightly degrades actual reasoning quality.
The case for visible thinking (Claude): you can validate the reasoning chain, catch systematic errors, and understand why a model reached a surprising conclusion. In high-stakes applications, this matters.
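To see the two control surfaces side by side, here is a sketch that builds request arguments for each provider without calling either API. Treat the details as assumptions to verify against current docs: the effort-to-budget mapping and `ANSWER_TOKENS` are invented for illustration, and passing `reasoning_effort` as a top-level field reflects my understanding of OpenAI's Chat Completions API.

```python
# Hypothetical mapping from a coarse effort level to a Claude thinking budget.
EFFORT_TO_BUDGET = {"low": 2_000, "medium": 5_000, "high": 12_000}
ANSWER_TOKENS = 4_096  # assumed headroom for the final answer


def build_request(provider: str, prompt: str, effort: str = "medium") -> dict:
    """Return keyword arguments for one reasoning request; makes no API call."""
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown effort level: {effort!r}")
    messages = [{"role": "user", "content": prompt}]

    if provider == "anthropic":
        budget = EFFORT_TO_BUDGET[effort]
        return {
            "model": "claude-sonnet-4-6",
            "max_tokens": budget + ANSWER_TOKENS,  # must exceed budget_tokens
            "thinking": {"type": "enabled", "budget_tokens": budget},
            "messages": messages,
        }
    if provider == "openai":
        return {
            "model": "o4-mini",
            "reasoning_effort": effort,  # coarse level, not a token count
            "messages": messages,
        }
    raise ValueError(f"unknown provider: {provider!r}")


print(build_request("anthropic", "Find the bug.", effort="high")["thinking"])
print(build_request("openai", "Find the bug.", effort="high")["reasoning_effort"])
```

The resulting dicts are meant to be splatted into the respective client call, for example `client.messages.create(**kwargs)` on the Anthropic side. The asymmetry is visible in the output: one side gets a number you can tune and log against, the other gets a label.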
Benchmark Performance Reality
From published evals and internal measurements across a range of tasks:
| Task category | Improvement over base model |
|---|---|
| Competition math (AIME/AMC) | 30-60% |
| Competitive programming | 20-40% |
| Graduate-level science Q&A | 15-30% |
| Complex software debugging | 10-25% |
| Standard software engineering | 5-15% (often not worth 4-8x cost) |
| API documentation tasks | 0-5% |
| Creative writing | -5% to +2% (often worse) |
| Simple classification | 0-3% |
The consistent pattern: the more a task resembles a formal problem with checkable correctness, the more inference compute helps. Tasks with fuzzy success criteria improve little or not at all.
Building a Two-Path System
The practical implementation is a router that sends most requests to a standard model and escalates to a reasoning model when the task warrants it. The routing signal is usually available in the task metadata or content:
```python
class ThinkingRouter:
    THINKING_SIGNALS = [
        "prove", "verify", "find all edge cases",
        "is this mathematically correct",
        "debug why this algorithm fails",
        "optimize this query", "identify vulnerabilities",
        "step by step", "formal proof",
    ]

    SKIP_THINKING = [
        "summarize", "translate", "classify",
        "what is", "list the", "format this",
        "write a docstring", "rename",
    ]

    def route(self, task: str, force_thinking: bool = False) -> dict:
        task_lower = task.lower()

        if force_thinking:
            return {"thinking": True, "budget": 8000}

        # Skip list takes priority
        if any(sig in task_lower for sig in self.SKIP_THINKING):
            return {"thinking": False}

        # Check for signals that benefit from reasoning
        if any(sig in task_lower for sig in self.THINKING_SIGNALS):
            return {"thinking": True, "budget": 6000}

        # Default: no thinking
        return {"thinking": False}
```
In production, replace keyword matching with a fast classifier model (Haiku-class) that predicts whether a task benefits from extended thinking. The classifier adds one cheap API call per request but routes more accurately than keyword heuristics.
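One way to structure the classifier version is to inject the classifier as a plain callable, so the routing logic can be tested without any API call. This is a sketch under assumptions: `route_with_classifier`, the 0.5 threshold, and the stub classifier are all hypothetical, and in a real system `classify` would wrap a call to a small, fast model that returns a score between 0 and 1.

```python
from typing import Callable


def route_with_classifier(
    task: str,
    classify: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff; tune it on your own traffic
    budget: int = 6000,
) -> dict:
    """Route on a classifier score; fall back to no thinking if it fails."""
    try:
        score = classify(task)
    except Exception:
        # A classifier outage should not block the request: use the cheap path.
        return {"thinking": False}
    if score >= threshold:
        return {"thinking": True, "budget": budget}
    return {"thinking": False}


def stub_classifier(task: str) -> float:
    """Stand-in for a model call; returns a made-up score."""
    return 0.9 if "prove" in task.lower() else 0.1


print(route_with_classifier("Prove this invariant holds", stub_classifier))
# {'thinking': True, 'budget': 6000}
print(route_with_classifier("Summarize this thread", stub_classifier))
# {'thinking': False}
```

The return shape matches the keyword router, so the two can be swapped behind the same interface while you compare their routing decisions.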
Honest Assessment
What works:
Extended thinking genuinely solves problems that standard models fail on. If your eval baseline shows a model failing at complex algorithmic tasks or multi-step logic, try a reasoning model before concluding the task is impossible. The improvement on hard math and coding problems is real and consistent.
The token budget control in Claude’s API is genuinely useful. You can measure actual thinking consumption per request type and dial the budget to match the real complexity distribution in your workload.
What does not work:
Using reasoning models as a default upgrade. A reasoning model on a simple task is not better than a standard model. It is often slightly worse (more mechanical, over-engineered answers) and always more expensive. Extended thinking is not a general quality multiplier.
Assuming more thinking tokens always correlates with a better answer. On tasks without a checkable correct answer, extra thinking often produces longer, more hedge-y answers rather than better ones. Log and evaluate actual output quality, not token count.
What to actually do:
Run your workload through both a standard model and a reasoning model and measure the quality delta on your actual tasks. Do not assume benchmark results transfer directly to your use case. If you find a category of requests where reasoning models consistently produce better results, route those specifically. If you find the delta is marginal, stick to the standard model and use the budget difference to make more API calls, run evals, or give users faster responses.
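The comparison itself needs very little code once you have graded results. A minimal sketch, assuming you already have a pass/fail grade for each task from both models; the record format, the `min_delta` and `min_n` cutoffs, and the sample records are all invented for illustration and are not real measurements.

```python
from collections import defaultdict


def quality_delta(records: list[tuple[str, bool, bool]]) -> dict[str, dict]:
    """records: (category, standard_passed, reasoning_passed), one per task."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, standard passes, reasoning passes
    for category, standard_ok, reasoning_ok in records:
        entry = totals[category]
        entry[0] += 1
        entry[1] += int(standard_ok)
        entry[2] += int(reasoning_ok)
    return {
        category: {
            "n": n,
            "standard": s / n,
            "reasoning": r / n,
            "delta": (r - s) / n,
        }
        for category, (n, s, r) in totals.items()
    }


def categories_to_route(report: dict, min_delta: float = 0.10, min_n: int = 2) -> list[str]:
    """Categories where the reasoning model wins by at least min_delta."""
    return sorted(
        c for c, row in report.items()
        if row["n"] >= min_n and row["delta"] >= min_delta
    )


# Made-up grades, for illustration only.
records = [
    ("algorithms", False, True),
    ("algorithms", True, True),
    ("summaries", True, True),
    ("summaries", True, False),
]
report = quality_delta(records)
print(report["algorithms"]["delta"])  # 0.5
print(categories_to_route(report))    # ['algorithms']
```

In practice `min_n` should be far larger than 2. It is set low here only so the toy data produces a result.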
The economics are simple: reasoning models are a premium tool for a specific kind of problem. Build your system to use that tool precisely, not broadly.