When OpenAI released o1, I spent the first few hours prompting it the same way I'd prompt GPT-4. I kept getting worse results than I expected.
Then I read the docs more carefully and realized: reasoning models require a genuinely different prompting approach. The intuitions built up from years of prompting GPT-4 and Claude don't all transfer.
Here's what I've learned about getting the most out of reasoning models.
## What's Actually Different
Standard models generate tokens left-to-right, producing the visible response as they go. The response is the generation.
Reasoning models (o1, o3, Claude extended thinking) have an additional step: they first produce an internal chain of reasoning — sometimes called a "thinking" process or scratchpad — before generating the visible response. This internal reasoning is often hidden from users.
The practical effects:
- Much better on hard reasoning tasks — math, complex code, multi-step logic, scientific analysis
- Slower — they need time for that internal thinking
- More expensive — you're paying for all that internal compute
- Less sensitive to prompting tricks — you don't need to guide the reasoning process because it happens automatically
## What to Stop Doing
### Stop saying "think step by step"
This is the biggest habit to break. With standard models, adding "think step by step" or "let's work through this carefully" triggers chain-of-thought reasoning. With reasoning models, they're already thinking step by step internally. Repeating this instruction is redundant and may interfere with the model's natural reasoning process.
### Stop adding reasoning scaffolding
With standard models, you might structure a prompt like:
```
First, identify the key factors. Then, evaluate each factor.
Finally, synthesize a recommendation.
```
Reasoning models don't need this scaffolding — they do it internally. Prescribing the reasoning steps can actually constrain the model's ability to reason flexibly.
### Stop using few-shot examples for reasoning demonstrations
With standard models, few-shot examples that show reasoning ("Input: X → Let me think... A, then B, then C → Output: Y") help the model learn to reason. With reasoning models, this is less necessary. They can reason without being shown how.
## What to Start Doing
### Write clear, complete problem specifications
Reasoning models respond well to thorough problem descriptions. Include all constraints, requirements, and edge cases upfront. The model will handle reasoning about them — your job is to make sure it knows what it needs to know.
Bad: "Write a function to sort a list."

Better: "Write a Python function that sorts a list of integers in ascending order. It must handle: empty lists, lists with duplicates, negative numbers, and very large lists (100M+ elements) efficiently. Return a new list rather than sorting in place. Include type hints and a brief docstring."
### Specify output format explicitly
Reasoning models produce variable-length outputs. Be specific about what you want:
- Should the answer be a number, a paragraph, a bulleted list?
- How long?
- What should and shouldn't be included?
```
Provide your answer as a JSON object with these exact keys:
- "recommendation": one sentence
- "confidence": "high" | "medium" | "low"
- "key_risks": array of strings (max 3)
- "rationale": 2-3 sentences
```
### Trust the model to structure its reasoning
Rather than prescribing how to think, describe what you need. The model will figure out how to reason about it.
Bad: "First, analyze the requirements. Second, design the architecture. Third, identify potential issues."

Better: "Design the database schema for a multi-tenant SaaS application with these requirements: [requirements]. I need the final schema, the key design decisions you made, and the main tradeoffs."
## Claude Extended Thinking
Claude's extended thinking (enabled per-request via the API) works similarly to o1: Claude reasons internally before producing a visible response.
A few Claude-specific notes:
### Budget tokens for thinking
When using extended thinking via the API, you set a `budget_tokens` parameter that caps how much internal thinking the model does. More budget = more thorough reasoning = higher cost and latency. Match the budget to the task complexity.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,  # must be larger than the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # how much to think before responding
    },
    messages=[{"role": "user", "content": your_prompt}],
)
```
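With thinking enabled, the response's `content` list can contain thinking blocks alongside the final text block, so you need to filter by block type to get the visible answer. A sketch operating on plain dicts (the real SDK returns typed objects, but the `type` field works the same way):

```python
def final_text(content_blocks: list[dict]) -> str:
    """Concatenate the visible text blocks, skipping internal thinking."""
    return "".join(
        block["text"] for block in content_blocks if block["type"] == "text"
    )
```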
Extended thinking is best for:
- Complex, multi-step math or science problems
- Code requiring careful architecture planning
- Legal or contract analysis requiring precise logical reasoning
- Problems where you're currently getting inconsistent results
Extended thinking adds limited value for:
- Simple factual questions
- Creative writing
- Basic summarization
- Tasks that are already easy for Claude
## When Reasoning Models Are Worth the Cost
Reasoning models cost 5–10× more than standard models and have higher latency. When is this justified?
Justified:
- The task is one where reasoning models measurably outperform standard models (complex math, careful code)
- The cost of an error is high (legal, medical, financial analysis)
- You've tested a standard model and it's failing consistently on this task type
- Volume is low enough that the price difference is acceptable
Not justified:
- Simple tasks that standard models handle well
- High-volume, latency-sensitive applications
- Creative tasks where reasoning isn't the bottleneck
- When you haven't first tested whether a standard model does well enough
The honest answer is: try the standard model first. If it's failing on a specific task type, then try a reasoning model. Don't assume you need reasoning mode for everything — the cost adds up fast.
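That "standard first, escalate on failure" policy is easy to encode as a router. A sketch with stubbed model calls (the callables and the acceptance check are placeholders, not a real API):

```python
from typing import Callable

def solve(
    prompt: str,
    standard_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; escalate only if its answer fails the check."""
    answer = standard_model(prompt)
    if is_acceptable(answer):
        return answer, "standard"
    return reasoning_model(prompt), "reasoning"
```

In practice `is_acceptable` might be a schema validator, a unit test, or a length check; the point is that the expensive model only runs on the cases the cheap one demonstrably gets wrong.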
## Practical Benchmarks
Rough guidance on task categories (as of early 2026):
| Task | Standard model | Reasoning model | Use reasoning model? |
|---|---|---|---|
| Complex math problems | Often fails | Significantly better | Yes |
| Competitive programming | Variable | Substantially better | Yes |
| Multi-step logical deduction | Moderate | Better | Usually yes |
| Code review of complex code | Good | Marginally better | Sometimes |
| Scientific reasoning | Moderate | Better | Yes |
| Simple Q&A | Good | Marginally better | No |
| Creative writing | Good | Same or worse | No |
| Basic summarization | Good | Same | No |
## The Bottom Line
Reasoning models are a genuine step change for hard reasoning tasks. But they're not universal upgrades — they require different prompting, cost more, and are slower.
The key adjustments:
- Remove reasoning scaffolding (they don't need "think step by step")
- Write thorough, complete problem specifications
- Specify output format precisely
- Use them selectively for tasks where they actually improve on standard models
Get those right and reasoning models can handle problems that used to require significant prompt engineering work to get anywhere close to correct.



