In December 2024, OpenAI's o3 scored 87.5% on ARC-AGI — a benchmark that previous models couldn't crack past 30%. The underlying model wasn't dramatically larger than its predecessors. What changed was how long it was allowed to think.
That's inference-time scaling. And it's reshaping how you should approach prompting.
What inference-time scaling actually means
For most of AI's history, the dominant lever for better performance was training-time compute: bigger models, more data, longer training runs. Inference was cheap and fast. You sent a query, the model ran a forward pass, you got an answer.
Inference-time scaling flips this. Instead of squeezing all intelligence into the model's weights during training, you allocate additional compute at generation time — letting the model reason through a problem before committing to an answer.
Think of it as the difference between someone answering a math question immediately versus taking scratch paper to work through it. The scratch paper version is slower and uses more resources, but the answer is dramatically more reliable.
The key insight is that reasoning quality and reasoning time trade off in a predictable way. Give a capable model more steps to think, and it explores more solution paths, catches more errors, and produces more coherent outputs.
The two flavors of inference-time scaling
Extended chain-of-thought (reasoning tokens)
This is what o1, o3, Claude 3.7 Sonnet with extended thinking, and Gemini 2.0 Flash Thinking all use. The model generates an internal "thinking" trace — sometimes thousands of tokens — before producing its final answer. You often don't see this trace (or see a summarized version), but it's doing real work.
Claude 3.7's extended thinking budget goes up to 128K tokens of internal reasoning. For a complex architectural decision or a proof, that's on the order of a couple hundred pages of working notes before the model gives you its conclusion.
The mechanism is essentially teaching models to reason step-by-step through reinforcement learning on reasoning traces, then letting them generate longer traces at inference time. More steps mean more opportunities to self-correct.
Search-based scaling
The second flavor uses explicit search or sampling strategies at inference time:
- Tree of Thoughts: Generate multiple reasoning branches, evaluate each, and continue from the most promising ones
- Self-consistency: Sample N answers independently, return the majority or synthesize
- Best-of-N sampling: Generate N responses, use a verifier or reward model to pick the best
Search-based approaches are older and can be applied on top of existing models without special training. But they're computationally expensive in proportion to N — if you're sampling 32 responses to pick one, you're paying 32x inference cost.
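Self-consistency is the simplest of these to implement. A minimal sketch, with the model call stubbed out by a deterministic `fake_sampler` for illustration (a real sampler would hit a model API at nonzero temperature):

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_answer, prompt, n=8):
    """Sample n independent answers and return the majority answer
    together with its agreement rate."""
    answers = [sample_answer(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Deterministic stand-in for a model sampler; a real one would call
# a model API at nonzero temperature. This one answers "42" 3 times in 4.
_canned = cycle(["42", "41", "42", "42"])
def fake_sampler(prompt):
    return next(_canned)

answer, agreement = self_consistency(fake_sampler, "What is 6 * 7?", n=32)
# answer == "42", agreement == 0.75
```

The agreement rate is a useful free byproduct: low agreement across samples is a signal that the problem deserves a bigger N or a reasoning model. Note the cost point above applies directly — `n=32` means 32 full inference calls.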
Why it works
The core reason is that generation is autoregressive. Once a model commits to a token, that token becomes context for all subsequent tokens. If the first few tokens of an answer push toward a wrong approach, the rest of the answer follows that path.
Extended reasoning gives the model a chance to explore multiple approaches before that commitment becomes irreversible. It's not that the model "knows more" — it's that it has more processing budget to explore the answer space before finalizing.
There's also a verification asymmetry: verifying a solution is often easier than generating it. A model with extended thinking can generate a candidate answer, then allocate tokens to checking that answer against its stated criteria. Chain-of-thought prompting does this manually; inference-time scaling does it automatically at scale.
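The generate-then-verify loop can be sketched in a few lines. Here `generate` and `verify` are hypothetical stand-ins for model calls; the toy instance uses plain arithmetic as the verifier to keep the example self-contained:

```python
def solve_with_verification(generate, verify, problem, max_attempts=3):
    """Generate a candidate, then spend extra compute checking it;
    regenerate if the check fails. `generate` and `verify` stand in
    for model calls in a real system."""
    candidate = None
    for _ in range(max_attempts):
        candidate = generate(problem)
        if verify(problem, candidate):  # checking is cheaper than generating
            return candidate
    return candidate  # best effort once the attempt budget is spent

# Toy instance: propose divisors of 48; verification is plain arithmetic.
proposals = iter([7, 9, 12])
result = solve_with_verification(
    generate=lambda p: next(proposals),
    verify=lambda p, c: p % c == 0,
    problem=48,
)
# result == 12 (7 and 9 fail the divisibility check, 12 passes)
```

Reasoning models internalize this loop: the thinking trace plays both the generator and verifier roles within a single response.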
The models and their reasoning modes
OpenAI o1 and o3: The originals. o1 made reasoning tokens mainstream; o3 demonstrated that scaling inference-time compute on a capable base model produces genuinely surprising capability jumps. o3-mini is the cost-optimized version.
Claude 3.7 Sonnet with extended thinking: Anthropic's implementation lets you set a thinking budget in tokens (up to 128K). Extended thinking is an explicit API parameter — you opt in and pay for it. The output includes the thinking summary alongside the answer.
Gemini 2.0 Flash Thinking: Google's entry, optimized for speed. Flash Thinking is notably faster than o1/o3 at comparable reasoning tasks, though typically weaker on the hardest problems.
DeepSeek R1: The open-source shock from early 2025. Competitive with o1 on most benchmarks, weights available, significantly cheaper to run via API. Changed the cost calculus for reasoning models substantially.
Practical implications for prompting
Don't rush the reasoning models. The biggest prompting mistake I see is treating o3 or Claude extended thinking like a fast chat model. Don't ask follow-up questions mid-stream. Don't break complex problems into tiny pieces to "help" the model. Give it the full problem context upfront and let the reasoning run.
For a reasoning model, a single comprehensive prompt like this:

> I need to design a caching strategy for a distributed system with 50M daily active users,
> sub-100ms P99 latency requirements, multi-region deployment across US/EU/APAC,
> and an eventual consistency tolerance for non-critical reads. Consider Redis, Memcached,
> CDN-level caching, and application-level patterns. What's the architecture?

...outperforms giving it one piece at a time.
Match the model to the task type. This is the most important practical decision:
| Task type | Use reasoning model? |
|---|---|
| Multi-step math/proofs | Yes |
| Complex code debugging | Yes |
| Legal/contract analysis | Yes |
| Strategic decisions with tradeoffs | Yes |
| Creative writing | No |
| Simple Q&A / lookups | No |
| Summarization | No |
| Code generation (straightforward) | Maybe |
| Brainstorming | No |
Reasoning models can actually be worse at creative tasks. They tend to overthink structure and produce competent but uninteresting output. A standard Claude 3.5 Sonnet or GPT-4o will write a better short story than o3 because creativity doesn't benefit from extended verification.
Set explicit thinking budgets. With Claude's extended thinking, you control the token budget. For most tasks, 8K-16K thinking tokens hits a sweet spot. Diminishing returns kick in past 32K for most problems — reserve the 128K budget for genuinely complex reasoning chains like formal proofs or comprehensive architectural decisions.
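In the Anthropic API, the budget is the `budget_tokens` field of the `thinking` parameter on `messages.create`. A sketch of the request body — the model alias and the 4K answer headroom are illustrative choices, not prescribed values:

```python
def thinking_request(prompt, budget_tokens=16_000):
    """Build a Messages API request body with an explicit thinking budget.

    The `thinking` block opts in to extended thinking; `budget_tokens`
    caps the internal reasoning. `max_tokens` must exceed the budget,
    since thinking tokens count against the output limit.
    """
    return {
        "model": "claude-3-7-sonnet-latest",  # illustrative model alias
        "max_tokens": budget_tokens + 4_000,  # headroom for the final answer
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

req = thinking_request("Prove that sqrt(2) is irrational.", budget_tokens=8_000)
# Send with: anthropic.Anthropic().messages.create(**req)
```

Because you pay for thinking tokens like output tokens, parameterizing the budget per call like this makes the accuracy/cost dial explicit in your code.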
Don't add explicit reasoning instructions. With standard models, "think step by step" helps significantly. With dedicated reasoning models, it's redundant and can interfere with their internal reasoning process. Let them reason on their own.
When to use inference-time scaling
Use a reasoning model or extended thinking when:
- The problem has a verifiable correct answer (math, code, logic)
- Mistakes are expensive (production architecture, legal review, financial analysis)
- The problem requires holding multiple constraints simultaneously
- You need the model to catch its own errors before you do
- Intermediate steps matter, not just the final answer
A rule I use: if a human expert would take out scratch paper to solve it, use a reasoning model.
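That rule of thumb can be encoded as a trivial router. The task categories mirror the table above; the model names are placeholders, not real identifiers:

```python
# Task types that benefit from extended reasoning, per the table above.
REASONING_TASKS = {"math_proof", "debugging", "legal_analysis", "strategic_decision"}

def pick_model(task_type: str) -> str:
    """Route verifiable, high-stakes work to a reasoning model and
    everything else to a fast default. Model names are placeholders."""
    if task_type in REASONING_TASKS:
        return "reasoning-model"  # e.g. o3 or Claude with extended thinking
    return "fast-model"           # e.g. a standard chat model

model = pick_model("debugging")         # "reasoning-model"
fallback = pick_model("summarization")  # "fast-model"
```

Even a static lookup like this, sitting in front of your inference layer, prevents the two most common mistakes: paying reasoning-model prices for summarization, and shipping fast-model answers on problems where mistakes are expensive.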
When not to use it
Inference-time scaling has a real cost. o3 API calls are 10-40x more expensive than GPT-4o for comparable output length, depending on the thinking budget used. Claude's extended thinking adds cost in proportion to the thinking token budget.
Skip it when:
- You're doing high-volume, low-stakes processing (document classification, tagging, summarization at scale)
- Latency matters more than accuracy (user-facing chat, autocomplete)
- The task is fundamentally creative rather than analytical
- You're iterating quickly on prompt design — fast feedback loops beat reasoning depth in early stages
More detailed guidance on working with these models is in the prompting reasoning models post — specifically the section on framing problems to match reasoning model strengths.
The bigger picture
Inference-time scaling doesn't replace training-time scaling. The best systems combine both: train capable base models, then allow extended reasoning at inference time. The capability frontier is moving on both axes simultaneously.
What it means for practitioners is that raw model size is no longer the only dimension that matters. A smaller model with extended reasoning can outperform a larger model without it on the right tasks. This changes how you should evaluate and select models — and how you should budget compute for different parts of your pipeline.
The advanced prompting track goes deeper into the prompting techniques that matter most for reasoning models: meta-prompting, prompt chaining, and automatic prompt optimization. If you're building systems that use o3 or Claude extended thinking seriously, those techniques matter more than ever.
The practical summary: stop treating inference compute as a fixed cost. It's a dial you can turn up when you need accuracy. Turn it up for the problems where accuracy is worth paying for. Leave it at default for everything else.