What 'Thinking' Actually Costs - Reasoning Models and Test-Time Compute

When you enable extended thinking on Claude or switch to an o-series model, the price per request can jump 3 to 10x. That is not because you are getting a bigger model. You are paying for a model that spends more computation on your specific question at inference time. Understanding the difference between training-time compute and inference-time compute changes how you decide when these models are worth using.

How a Standard Model Generates Output

A standard LLM takes all the input tokens (system prompt, history, user message), runs them through the transformer layers in a single forward pass to predict the next token, then samples that token and appends it to the sequence. It repeats this process until it generates an end token. The cost of each forward pass scales with model size, and total generation cost grows roughly linearly with output token count.

The model cannot revise what it has already written. Once a token is generated, it becomes part of the context for all future tokens. If the model starts down a wrong path in step two of a twelve-step proof, it has no mechanism to erase that and try a different approach. It can only continue forward.
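The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for a transformer forward pass, invented for this example. What matters is the shape of the loop: one call per generated token, with each token appended to the context and never revised.

```python
from typing import Callable, List, Tuple

END = "<end>"

# Hypothetical stand-in for a model: maps the last token to the next one.
TOY_TABLE = {"the": "cat", "cat": "sat", "sat": END}

def toy_next_token(context: List[str]) -> str:
    """Pretend forward pass that looks only at the last token of the context."""
    return TOY_TABLE.get(context[-1], END)

def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int = 50) -> Tuple[List[str], int]:
    """Autoregressive decoding: one forward pass per generated token."""
    context = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        token = next_token(context)
        forward_passes += 1
        if token == END:
            break
        # The token is now committed: later steps can only build on it.
        context.append(token)
    return context[len(prompt):], forward_passes

output, passes = generate(["the"], toy_next_token)
print(output, passes)
```

Note that the number of forward passes tracks the number of generated tokens, which is why output length drives cost.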

What Reasoning Models Actually Do

Reasoning models generate a chain of thought before producing the final response. This internal chain of thought:

Is generated token by token, exactly like regular output
Is billed as output tokens (sometimes at a slightly different rate)
Can be 5 to 50 times longer than the final answer
Functions as a working scratchpad that the model can read as it generates more of it

The crucial difference from standard chain-of-thought prompting (“Let’s think step by step”): when you put the reasoning in the visible output, the model still has to commit to every token. It cannot explore and discard approaches. Reasoning models with dedicated thinking space can generate “wait, this approach breaks down at step 4, let me try a different decomposition” and actually follow through on that correction. The rejected path is in the thinking tokens, which you either see (Claude’s extended thinking) or do not see (OpenAI’s o-series), but either way the model benefits from having taken it.

The other factor: reasoning models are fine-tuned specifically on data designed to teach effective scratchpad use. The training reinforces which problem types benefit from extended exploration and how to structure that exploration.

The Token Math

Here is a concrete cost comparison for a hard coding problem:

Standard Claude Sonnet request:
  Input:              1,400 tokens
  Output:               600 tokens
  Total:              2,000 tokens
  Cost (input $3/M, output $15/M): $0.0042 input + $0.0090 output = $0.0132

Same request with extended thinking (budget_tokens=8000):
  Input:              1,400 tokens
  Thinking tokens:    5,200 tokens  (actually used)
  Output:               520 tokens
  Total:              7,120 tokens
  Cost:               $0.0042 input + $0.0858 thinking+output = $0.0900

  Ratio: ~6.8x more expensive

The thinking budget is a maximum, not a fixed charge. Set budget_tokens=8000 and if the problem resolves in 800 thinking tokens, that is all you pay for. But on genuinely complex problems, the model will often use most of the budget you give it.
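The arithmetic is easy to recompute for your own workload. The sketch below recalculates the comparison from the token counts and per-million prices quoted above, and adds the worst case where the full 8,000-token budget is consumed. The function name and parameters are illustrative, and it assumes thinking tokens are billed at the same rate as output tokens; check current pricing before relying on the defaults.

```python
def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0,
                 input_price_per_m: float = 3.0,
                 output_price_per_m: float = 15.0) -> float:
    """Dollar cost of one request, assuming thinking is billed at the output rate."""
    input_cost = input_tokens * input_price_per_m / 1_000_000
    output_cost = (output_tokens + thinking_tokens) * output_price_per_m / 1_000_000
    return input_cost + output_cost

standard = request_cost(1_400, 600)
with_thinking = request_cost(1_400, 520, thinking_tokens=5_200)
budget_ceiling = request_cost(1_400, 520, thinking_tokens=8_000)

print(f"standard:       ${standard:.4f}")
print(f"with thinking:  ${with_thinking:.4f} ({with_thinking / standard:.1f}x)")
print(f"budget ceiling: ${budget_ceiling:.4f} ({budget_ceiling / standard:.1f}x)")
```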

Training Scaling vs Inference Scaling

Two levers for getting better model output:

Training scaling: train a larger model on more data with more compute. The cost is amortized across every request the model ever handles. Each inference is cheap.
Inference scaling (test-time compute): spend more computation on each individual request at inference time. The cost is per request.

These are not substitutes for each other on all tasks. A 10x inference compute budget does not generally match a 10x larger model. But for a specific class of problems - ones with checkable correct answers and multi-step reasoning requirements - inference scaling is surprisingly effective.

The intuition: if you are solving a math problem, you can verify your work. A model that spends 5x more compute working through the proof and checking each step will catch errors that a single-pass model commits to and cannot revise. For problems without checkable answers, the same extra compute does not have the same payoff because there is no verification signal to act on.

What Happens During the Thinking Phase

Standard model:
  [Input tokens] ---> [Transformer layers] ---> [Output tokens]
  One forward pass per output token. Linear cost.

Reasoning model:
  [Input tokens] ---> [Thinking phase] ---> [Answer phase]
                            |
                    Token by token, same
                    transformer forward pass
                    
                    Model generates:
                    "Let me break this into parts...
                     Part 1: [works out subproblem]
                     Wait, that assumes X but the input
                     says not-X. Let me revisit...
                     Actually the right approach is Y..."
                    
                    Then produces final answer
                    incorporating the corrected reasoning.

  Cost: O(input + thinking + output)
  Thinking tokens are billed output tokens.

The thinking phase is sequential token generation just like regular output. There is no explicit backtracking in the traditional programming sense. What looks like backtracking is the model generating text that acknowledges a mistake and then generating the corrected continuation. The key is that the model can “read” everything it has written in the thinking block so far as context for each new thinking token, which gives it the ability to self-correct across hundreds or thousands of tokens of intermediate work.

The Decision Framework

Two questions determine whether extended thinking is worth the cost:

Does the task have a verifiable or clearly evaluable correct answer?
Does reaching that answer require multiple non-obvious reasoning steps?

If both answers are yes, extended thinking typically earns its cost. If either is no, you are probably paying a premium for marginal gain.

Task Type	Verifiable	Multi-step	Use Thinking
Competition math (AIME, Olympiad)	Yes	Yes	Yes - 30-60% improvement
Hard algorithm implementation	Yes	Yes	Yes - catches edge cases
Security / logic vulnerability analysis	Yes	Yes	Yes - explores more attack paths
Multi-step planning with constraints	Partly	Yes	Usually yes
Complex SQL with nested conditions	Yes	Partly	Often worth it
Standard API integration	Yes	No	No - standard model fine
Factual Q&A (with source material)	Yes	No	No - lookup not reasoning
Classification / extraction	Yes	No	No - overkill
Creative writing	No	No	No - over-thinking hurts fluency
Simple code generation (CRUD, boilerplate)	Yes	No	No - cost not justified
Summarization	No	No	No
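The two questions can be encoded as a small helper, which is handy when tagging task types in a config file. This is a sketch of the table's logic rather than a rule from any vendor: the "maybe" outcome for partly met criteria is an assumption that covers the "usually yes" and "often worth it" rows, and the function name is made up.

```python
LEVELS = {"no": 0, "partly": 1, "yes": 2}

def thinking_recommendation(verifiable: str, multi_step: str) -> str:
    """Map the two framework answers ('yes', 'partly', 'no') to a recommendation."""
    v = LEVELS[verifiable.lower()]
    m = LEVELS[multi_step.lower()]
    if v == 0 or m == 0:
        return "no"      # either criterion fails: premium for marginal gain
    if v == 2 and m == 2:
        return "yes"     # both criteria clearly met
    return "maybe"       # assumption: partly met means measure before committing

print(thinking_recommendation("Yes", "Yes"))     # competition math
print(thinking_recommendation("Partly", "Yes"))  # multi-step planning
print(thinking_recommendation("Yes", "No"))      # classification / extraction
```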

Configuring Token Budgets

Setting budget_tokens too low can be worse than not using extended thinking at all. If the model hits the budget mid-reasoning, it is forced to produce a final answer from an incomplete chain of thought, which can be less accurate than a plain model’s direct response.

import anthropic

client = anthropic.Anthropic()

def call_with_thinking(prompt: str, budget: int = 5000) -> dict:
    # max_tokens covers thinking plus the final answer, so it must stay above budget.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=16000,
        thinking={
            "type": "enabled",
            "budget_tokens": budget
        },
        messages=[{"role": "user", "content": prompt}]
    )

    thinking_text = ""
    answer_text = ""
    for block in response.content:
        if block.type == "thinking":
            thinking_text += block.thinking
        elif block.type == "text":
            answer_text += block.text

    return {
        "thinking": thinking_text,
        "answer": answer_text,
        "input_tokens": response.usage.input_tokens,
        # output_tokens counts thinking tokens as well as the final answer
        "output_tokens": response.usage.output_tokens,
    }

# Hard algorithmic problem: give it room to explore
result = call_with_thinking(hard_algo_problem, budget=12000)

# Logic puzzle: medium budget
result = call_with_thinking(logic_puzzle, budget=4000)

# Simple code task: skip thinking entirely
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    messages=[{"role": "user", "content": simple_task}]
)

A reasonable starting point: 5,000 budget_tokens for tasks you believe need reasoning. In production, log the actual thinking tokens consumed per request. If the model consistently uses less than 30% of the budget, either the task is simpler than you thought, or it does not benefit from this approach. If it consistently hits the ceiling, raise the budget or accept that you are truncating the reasoning.
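The tuning rule above can be turned into a small check over logged usage. The sketch assumes you already record thinking-token counts per request for each task type. The function name, the use of the median for "consistently uses", and the 95% and 50% thresholds for "consistently hits the ceiling" are all assumptions to adjust, not recommendations from the API documentation.

```python
from statistics import median
from typing import List

def budget_advice(thinking_used: List[int], budget: int,
                  low_ratio: float = 0.30,
                  ceiling_ratio: float = 0.95,
                  ceiling_share: float = 0.5) -> str:
    """Suggest a budget change from logged thinking-token counts for one task type."""
    if not thinking_used:
        return "no data"
    # Share of requests that ran into (or very near) the budget ceiling.
    near_ceiling = sum(1 for t in thinking_used if t >= ceiling_ratio * budget)
    if near_ceiling / len(thinking_used) >= ceiling_share:
        return "raise budget"
    # Typical usage far below the budget: task may not need this much thinking.
    if median(thinking_used) < low_ratio * budget:
        return "lower budget or drop thinking"
    return "keep budget"

print(budget_advice([800, 900, 1200], budget=5000))
print(budget_advice([4900, 5000, 4800, 2000], budget=5000))
print(budget_advice([2500, 3000, 2000], budget=5000))
```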

How o-Series Differs From Extended Thinking

OpenAI’s o-series models (o1, o3, o4-mini) and Claude’s extended thinking share the inference-scaling premise but differ in how they expose it:

Aspect	Claude extended thinking	OpenAI o-series
Thinking visibility	Streamed to you via API	Hidden, summary only
Budget control	Explicit budget_tokens parameter	reasoning_effort (low/medium/high)
Opt-in	Per request	Always on (varies by model)
Debuggability	High - you can read the reasoning	Low - black box
Cost predictability	High - you see token counts	Lower - effort levels are approximate

The case for hidden thinking (o-series): the model cannot optimize its scratchpad to look impressive to you rather than to be useful. There is some evidence that visible thinking causes models to write more for the audience, which slightly degrades actual reasoning quality.

The case for visible thinking (Claude): you can validate the reasoning chain, catch systematic errors, and understand why a model reached a surprising conclusion. In high-stakes applications, this matters.
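The budget-control row of the table shows up directly in the request parameters. The helper below only builds the keyword arguments and makes no network call. The `thinking` shape matches the Claude example earlier, and `reasoning_effort` is the parameter named in the table. The mapping from effort levels to Claude budgets is a hypothetical choice made for this sketch; the two controls are not documented as equivalent.

```python
# Hypothetical mapping from a coarse effort level to a Claude thinking budget.
EFFORT_TO_BUDGET = {"low": 2000, "medium": 6000, "high": 12000}

def reasoning_kwargs(provider: str, effort: str) -> dict:
    """Build the provider-specific request arguments that control reasoning depth."""
    if effort not in EFFORT_TO_BUDGET:
        raise ValueError(f"unknown effort level: {effort!r}")
    if provider == "anthropic":
        # Explicit token ceiling on the thinking phase.
        return {"thinking": {"type": "enabled",
                             "budget_tokens": EFFORT_TO_BUDGET[effort]}}
    if provider == "openai":
        # Coarse level; the actual token count is chosen by the model.
        return {"reasoning_effort": effort}
    raise ValueError(f"unknown provider: {provider!r}")

print(reasoning_kwargs("anthropic", "medium"))
print(reasoning_kwargs("openai", "medium"))
```

The returned dictionary can be unpacked into the corresponding client call alongside the model and messages arguments.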

Benchmark Performance Reality

From published evals and internal measurements across a range of tasks:

Task category                     Improvement over base model
----------------------------------------------------------
Competition math (AIME/AMC)       30-60%
Competitive programming           20-40%
Graduate-level science Q&A        15-30%
Complex software debugging        10-25%
Standard software engineering     5-15%  (often not worth 4-8x cost)
API documentation tasks           0-5%
Creative writing                  -5% to +2%  (often worse)
Simple classification             0-3%

The consistent pattern: the more a task resembles a formal problem with checkable correctness, the more inference compute helps. Tasks with fuzzy success criteria improve little or not at all.
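One way to weigh an accuracy gain against the price increase is the extra spend per additional correct answer. The sketch below is plain arithmetic. The per-request costs and accuracies in the usage lines are hypothetical placeholders chosen for illustration, not measurements, and the function name is made up.

```python
def cost_per_extra_correct(std_cost: float, think_cost: float,
                           std_accuracy: float, think_accuracy: float) -> float:
    """Extra dollars spent per additional correct answer when switching to thinking."""
    gain = think_accuracy - std_accuracy
    if gain <= 0:
        return float("inf")  # paying more for no improvement
    return (think_cost - std_cost) / gain

# Hypothetical inputs: (standard cost, thinking cost, standard acc., thinking acc.)
hard_math = cost_per_extra_correct(0.013, 0.090, 0.40, 0.70)
routine_code = cost_per_extra_correct(0.013, 0.090, 0.80, 0.85)
creative = cost_per_extra_correct(0.013, 0.090, 0.90, 0.88)

print(f"hard math:    ${hard_math:.2f} per extra correct answer")
print(f"routine code: ${routine_code:.2f} per extra correct answer")
print(f"creative:     {creative}")
```

With the same price gap, a small accuracy gain makes each additional correct answer several times more expensive than a large one, which is the pattern the table describes.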

Building a Two-Path System

The practical implementation is a router that sends most requests to a standard model and escalates to a reasoning model when the task warrants it. The routing signal is usually available in the task metadata or content:

class ThinkingRouter:
    THINKING_SIGNALS = [
        "prove", "verify", "find all edge cases",
        "is this mathematically correct",
        "debug why this algorithm fails",
        "optimize this query", "identify vulnerabilities",
        "step by step", "formal proof",
    ]

    SKIP_THINKING = [
        "summarize", "translate", "classify",
        "what is", "list the", "format this",
        "write a docstring", "rename",
    ]

    def route(self, task: str, force_thinking: bool = False) -> dict:
        task_lower = task.lower()

        if force_thinking:
            return {"thinking": True, "budget": 8000}

        # Skip list takes priority
        if any(sig in task_lower for sig in self.SKIP_THINKING):
            return {"thinking": False}

        # Check for signals that benefit from reasoning
        if any(sig in task_lower for sig in self.THINKING_SIGNALS):
            return {"thinking": True, "budget": 6000}

        # Default: no thinking
        return {"thinking": False}

In production, replace keyword matching with a small, fast classifier model (Haiku-class) that predicts whether a task benefits from extended thinking. The classifier adds one cheap API call per request but routes more accurately than keyword heuristics.
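A classifier-based router can keep the same return shape as the keyword version. In the sketch below, `score_task` is a hypothetical callable that returns the estimated probability that a task benefits from extended thinking; in practice it would wrap a call to a small model. The threshold, the budgets, and the fall-back-to-cheap behaviour on classifier errors are assumptions.

```python
from typing import Callable, Dict

def route_with_classifier(task: str,
                          score_task: Callable[[str], float],
                          threshold: float = 0.5,
                          budget: int = 6000,
                          force_thinking: bool = False) -> Dict:
    """Route on a classifier score (0.0 to 1.0) instead of keyword matches."""
    if force_thinking:
        return {"thinking": True, "budget": 8000}
    try:
        score = score_task(task)
    except Exception:
        # Assumption: if the classifier call fails, take the cheap path.
        return {"thinking": False}
    if score >= threshold:
        return {"thinking": True, "budget": budget}
    return {"thinking": False}

def stub_score(task: str) -> float:
    """Stub for demonstration only; a real scorer would call a model."""
    return 0.9 if "prove" in task.lower() else 0.1

print(route_with_classifier("Prove this invariant holds", stub_score))
print(route_with_classifier("Summarize this thread", stub_score))
```

Passing the scorer in as a callable keeps the router testable without network access and lets you swap classifiers without touching the routing logic.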

Honest Assessment

What works:

Extended thinking genuinely solves problems that standard models fail on. If your eval baseline shows a model failing at complex algorithmic tasks or multi-step logic, try a reasoning model before concluding the task is impossible. The improvement on hard math and coding problems is real and consistent.

The token budget control in Claude’s API is genuinely useful. You can measure actual thinking consumption per request type and dial the budget to match the real complexity distribution in your workload.

What does not work:

Using reasoning models as a default upgrade. A reasoning model on a simple task is not better than a standard model. It is often slightly worse (more mechanical, over-engineered answers) and always more expensive. Extended thinking is not a general quality multiplier.

Assuming that more thinking tokens always mean a better answer. On tasks without a checkable correct answer, extra thinking often produces longer, more hedged answers rather than better ones. Log and evaluate actual output quality, not token count.

What to actually do:

Run your workload through both a standard model and a reasoning model and measure the quality delta on your actual tasks. Do not assume benchmark results transfer directly to your use case. If you find a category of requests where reasoning models consistently produce better results, route those specifically. If you find the delta is marginal, stick to the standard model and use the budget difference to make more API calls, run evals, or give users faster responses.
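A minimal version of that comparison needs only graded results for both models on the same requests. The sketch below assumes each record is a (category, standard_correct, thinking_correct) tuple produced by your own grading. The sample grades are made up, and the 10-point routing threshold is an arbitrary starting value.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (category, standard model correct?, reasoning model correct?)
Record = Tuple[str, bool, bool]

def quality_delta_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Reasoning-model accuracy minus standard-model accuracy, per category."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, standard hits, thinking hits]
    for category, std_ok, think_ok in records:
        row = totals[category]
        row[0] += 1
        row[1] += int(std_ok)
        row[2] += int(think_ok)
    return {cat: (t - s) / n for cat, (n, s, t) in totals.items()}

def categories_to_route(deltas: Dict[str, float], min_delta: float = 0.10) -> List[str]:
    """Categories whose measured gain clears the threshold you choose."""
    return sorted(cat for cat, d in deltas.items() if d >= min_delta)

# Made-up grades for illustration only.
sample = [
    ("algorithms", False, True), ("algorithms", True, True),
    ("algorithms", False, True), ("algorithms", False, False),
    ("summaries", True, True), ("summaries", True, False),
]
deltas = quality_delta_by_category(sample)
print(deltas)
print(categories_to_route(deltas))
```

A real evaluation needs far more than a handful of examples per category before the deltas mean anything.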

The economics are simple: reasoning models are a premium tool for a specific kind of problem. Build your system to use that tool precisely, not broadly.
