When OpenAI released o1, I spent the first few hours prompting it the same way I'd prompt GPT-4. I kept getting worse results than I expected.
Then I read the docs more carefully and realized: reasoning models require a genuinely different prompting approach. The intuitions built up from years of prompting GPT-4 and Claude don't all transfer.
Here's what I've learned about getting the most out of reasoning models.
## What's Actually Different
Standard models generate tokens left-to-right, producing the visible response as they go. The response is the generation.
Reasoning models (o1, o3, Claude extended thinking) have an additional step: they first produce an internal chain of reasoning — sometimes called a "thinking" process or scratchpad — before generating the visible response. This internal reasoning is often hidden from users.
The practical effects:
- Much better on hard reasoning tasks — math, complex code, multi-step logic, scientific analysis
- Slower — they need time for that internal thinking
- More expensive — you're paying for all that internal compute
- Less sensitive to prompting tricks — you don't need to guide the reasoning process because it happens automatically
## What to Stop Doing
### Stop saying "think step by step"
This is the biggest habit to break. With standard models, adding "think step by step" or "let's work through this carefully" triggers chain-of-thought reasoning. With reasoning models, they're already thinking step by step internally. Repeating this instruction is redundant and may interfere with the model's natural reasoning process.
### Stop adding reasoning scaffolding
With standard models, you might structure a prompt like:
```
First, identify the key factors. Then, evaluate each factor.
Finally, synthesize a recommendation.
```
Reasoning models don't need this scaffolding — they do it internally. Prescribing the reasoning steps can actually constrain the model's ability to reason flexibly.
### Stop using few-shot examples for reasoning demonstrations
With standard models, few-shot examples that show reasoning ("Input: X → Let me think... A, then B, then C → Output: Y") help the model learn to reason. With reasoning models, this is less necessary. They can reason without being shown how.
## What to Start Doing
### Write clear, complete problem specifications
Reasoning models respond well to thorough problem descriptions. Include all constraints, requirements, and edge cases upfront. The model will handle reasoning about them — your job is to make sure it knows what it needs to know.
Bad: "Write a function to sort a list."

Better: "Write a Python function that sorts a list of integers in ascending order. It must handle: empty lists, lists with duplicates, negative numbers, and very large lists (100M+ elements) efficiently. Return a new list rather than sorting in place. Include type hints and a brief docstring."
### Specify output format explicitly
Reasoning models produce variable-length outputs. Be specific about what you want:
- Should the answer be a number, a paragraph, a bulleted list?
- How long?
- What should and shouldn't be included?
```
Provide your answer as a JSON object with these exact keys:
- "recommendation": one sentence
- "confidence": "high" | "medium" | "low"
- "key_risks": array of strings (max 3)
- "rationale": 2-3 sentences
```
### Trust the model to structure its reasoning
Rather than prescribing how to think, describe what you need. The model will figure out how to reason about it.
Bad: "First, analyze the requirements. Second, design the architecture. Third, identify potential issues."

Better: "Design the database schema for a multi-tenant SaaS application with these requirements: [requirements]. I need the final schema, the key design decisions you made, and the main tradeoffs."
## Claude Extended Thinking
Claude's extended thinking (enabled per-request via the API) works similarly to o1: Claude reasons internally before producing a visible response.
A few Claude-specific notes:
### Budget tokens for thinking
When using extended thinking via the API, you set a `budget_tokens` parameter that caps how much internal thinking the model does. More budget = more thorough reasoning = higher cost and latency. Match the budget to the task complexity.
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16000,  # must be larger than the thinking budget
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # how much to think before responding
    },
    messages=[{"role": "user", "content": your_prompt}],
)
```
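With thinking enabled, the response's `content` list can contain thinking blocks alongside the final text block, so you need to filter by block type to get the visible answer. A sketch operating on plain dicts (the real SDK returns typed objects, but the `type` field works the same way):

```python
def final_text(content_blocks: list[dict]) -> str:
    """Concatenate the visible text blocks, skipping internal thinking."""
    return "".join(
        block["text"] for block in content_blocks if block["type"] == "text"
    )
```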
Extended thinking is best for:
- Complex, multi-step math or science problems
- Code requiring careful architecture planning
- Legal or contract analysis requiring precise logical reasoning
- Problems where you're currently getting inconsistent results
Extended thinking adds limited value for:
- Simple factual questions
- Creative writing
- Basic summarization
- Tasks that are already easy for Claude
## When Reasoning Models Are Worth the Cost
Reasoning models cost 5–10× more than standard models and have higher latency. When is this justified?
Justified:
- The task is one where reasoning models measurably outperform standard models (complex math, careful code)
- The cost of an error is high (legal, medical, financial analysis)
- You've tested a standard model and it's failing consistently on this task type
- Volume is low enough that the price difference is acceptable
Not justified:
- Simple tasks that standard models handle well
- High-volume, latency-sensitive applications
- Creative tasks where reasoning isn't the bottleneck
- When you haven't first tested whether a standard model does well enough
The honest answer is: try the standard model first. If it's failing on a specific task type, then try a reasoning model. Don't assume you need reasoning mode for everything — the cost adds up fast.
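That "standard first, escalate on failure" policy is easy to encode as a router. A sketch with stubbed model calls (the callables and the acceptance check are placeholders, not a real API):

```python
from typing import Callable

def solve(
    prompt: str,
    standard_model: Callable[[str], str],
    reasoning_model: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; escalate only if its answer fails the check."""
    answer = standard_model(prompt)
    if is_acceptable(answer):
        return answer, "standard"
    return reasoning_model(prompt), "reasoning"
```

In practice `is_acceptable` might be a schema validator, a unit test, or a length check; the point is that the expensive model only runs on the cases the cheap one demonstrably gets wrong.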
## Practical Benchmarks
Rough guidance on task categories (as of early 2026):
| Task | Standard model | Reasoning model | Use reasoning model? |
|---|---|---|---|
| Complex math problems | Often fails | Significantly better | Yes |
| Competitive programming | Variable | Substantially better | Yes |
| Multi-step logical deduction | Moderate | Better | Usually yes |
| Code review of complex code | Good | Marginally better | Sometimes |
| Scientific reasoning | Moderate | Better | Yes |
| Simple Q&A | Good | Marginally better | No |
| Creative writing | Good | Same or worse | No |
| Basic summarization | Good | Same | No |
## The Bottom Line
Reasoning models are a genuine step change for hard reasoning tasks. But they're not universal upgrades — they require different prompting, cost more, and are slower.
The key adjustments:
- Remove reasoning scaffolding (they don't need "think step by step")
- Write thorough, complete problem specifications
- Specify output format precisely
- Use them selectively for tasks where they actually improve on standard models
Get those right and reasoning models can handle problems that used to require significant prompt engineering work to get anywhere close to correct.



