Most Claude outputs come back in seconds. Extended thinking can take 30 seconds, sometimes longer. That latency isn't a bug — it's Claude actually working through a problem before committing to an answer, the same way you'd want a surgeon to review scans before operating.
Extended thinking is Claude's internal reasoning mode, introduced with Claude 3.7 Sonnet and now deeply integrated into claude-sonnet-4-6. You set a token budget, Claude thinks privately in a scratchpad, then delivers a final answer. The thinking comes back in its own block, separate from the answer, so it stays hidden from end users unless you choose to show it. The quality difference on hard problems is significant enough that once you've used it on a genuinely complex task, you'll find it hard to go back.
This guide covers how it works under the hood, how to enable it via the API, which problems benefit (and which don't), and how to write prompts that get the most out of it.
What extended thinking actually does
Standard prompting works like this: you write a prompt, Claude predicts the next token, repeats until done. Even when you use chain-of-thought prompting — asking Claude to "think step by step" — you're steering the visible output. Claude is reasoning in the response itself.
Extended thinking is different. Claude reasons in a private scratchpad before generating the visible output. This scratchpad is used to explore dead ends, backtrack, reconsider assumptions, and verify intermediate steps — things that don't fit neatly into a forward-only text generation process. The final answer comes only after that internal process completes.
Think of it as the difference between watching someone solve a problem out loud versus waiting for them to finish thinking and then explain their answer. The second version is often more coherent and more accurate, because the solver isn't constrained to make every intermediate statement sound polished.
The thinking block uses the same transformer under the hood. There's no separate model. What changes is that Claude is given budget to generate tokens that aren't shown to you, which frees it to reason more exploratorily without worrying about whether the intermediate steps read well.
Enabling extended thinking via the API
You enable it with a thinking parameter on the messages endpoint. The key setting is budget_tokens — how many tokens Claude can spend on internal reasoning before writing the final answer.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=16000,  # must cover the thinking budget plus the final answer
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,
    },
    messages=[{
        "role": "user",
        "content": "A snail is at the bottom of a 30-foot well. Each day it climbs 3 feet. Each night it slides back 2 feet. How many days does it take to reach the top?",
    }],
)

# response.content is a list of typed blocks:
#   type "thinking": the internal scratchpad
#   type "text": the final answer
# Select blocks by type rather than by position, since order and count can vary.
thinking_text = "".join(b.thinking for b in response.content if b.type == "thinking")
answer_text = "".join(b.text for b in response.content if b.type == "text")

print(f"Thinking (first 100 chars): {thinking_text[:100]}...")
print(f"Answer: {answer_text}")
The minimum budget_tokens is 1,024. The maximum depends on the model: claude-sonnet-4-6 supports up to 32,000+ thinking tokens on complex tasks. max_tokens must be larger than budget_tokens, because it has to cover both the thinking budget and the final response.
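A small guard can catch both constraints before a request is sent. This is a minimal sketch, not part of the Anthropic SDK: build_thinking_params and answer_tokens are hypothetical names, and the 1,024 minimum is the figure quoted above.

```python
MIN_BUDGET_TOKENS = 1024  # minimum quoted in this guide


def build_thinking_params(budget_tokens: int, answer_tokens: int) -> dict:
    """Return keyword arguments with a max_tokens that covers thinking plus answer.

    Hypothetical helper: answer_tokens is the room reserved for the visible
    answer on top of the thinking budget.
    """
    if budget_tokens < MIN_BUDGET_TOKENS:
        raise ValueError(f"budget_tokens must be at least {MIN_BUDGET_TOKENS}")
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return {
        "max_tokens": budget_tokens + answer_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
    }


params = build_thinking_params(budget_tokens=10000, answer_tokens=6000)
print(params["max_tokens"])  # 16000
```

The returned dictionary can be unpacked into the request, for example client.messages.create(model=..., messages=..., **params).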
One thing that trips people up: budget_tokens is a budget, not a fixed allocation. Claude won't always use all of it. On a simpler problem it might spend 800 tokens thinking even if you gave it 10,000. This is fine — you're charged for what's actually used.
Cost math
Thinking tokens are billed the same as regular output tokens. With Sonnet 4.6 at ~$0.015 per 1K output tokens:
- 1,024 thinking tokens → ~$0.015 per call
- 5,000 thinking tokens → ~$0.075 per call
- 10,000 thinking tokens → ~$0.15 per call
- 30,000 thinking tokens → ~$0.45 per call
For one-off tasks, this is negligible. For high-volume pipelines processing thousands of documents, it adds up fast. Model your expected usage before defaulting to a high budget.
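To model expected usage, the table above can be turned into a few lines of arithmetic. The sketch below assumes the approximate $0.015 per 1K rate quoted above; the 50,000-document batch is a made-up volume for illustration. Because Claude may use less than the full budget, treat the results as an upper bound.

```python
PRICE_PER_1K_OUTPUT_USD = 0.015  # approximate Sonnet 4.6 rate quoted above


def thinking_cost_usd(thinking_tokens: int, calls: int = 1,
                      price_per_1k: float = PRICE_PER_1K_OUTPUT_USD) -> float:
    """Estimated spend on thinking tokens alone, rounded to a tenth of a cent."""
    return round(thinking_tokens / 1000 * price_per_1k * calls, 3)


for tokens in (1024, 5000, 10000, 30000):
    print(f"{tokens:>6} thinking tokens: ~${thinking_cost_usd(tokens):.3f} per call")

# Hypothetical pipeline: 50,000 documents, each using a full 10,000-token budget
print(f"Batch upper bound: ${thinking_cost_usd(10000, calls=50000):,.2f}")
```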
Also worth knowing: thinking tokens don't count toward your context window for the next turn in a conversation. They're ephemeral. If you want Claude to "remember" its reasoning for follow-up questions, you'd need to capture the thinking block and pass it back manually.
Problems that benefit most from extended thinking
Not every task gets better with a thinking budget. Here's where the difference is actually meaningful:
Math and logic puzzles. Anything requiring multiple inference steps where an early mistake compounds. The snail-and-well problem above is a classic example — most models get it wrong on the first try because they don't account for the final day correctly. Extended thinking handles this reliably.
Complex code debugging. When you paste in a stack trace and surrounding code, extended thinking lets Claude map the execution path, form hypotheses about root causes, and test them before responding. Particularly useful for asynchronous bugs or race conditions that require reasoning about state over time.
Multi-step planning. "Design a database schema for a SaaS app with these requirements" involves considering normalization, query patterns, scaling tradeoffs, and tenant isolation simultaneously. Extended thinking lets Claude hold all of those in tension before recommending anything.
Decision analysis with real tradeoffs. "Should we use Postgres or DynamoDB for this use case" — the right answer depends on access patterns, team familiarity, scaling needs, and budget. Extended thinking tends to surface the actual decision drivers rather than hedging with "it depends."
Structured argument generation. Legal memos, investment theses, technical design documents — anywhere the final output needs internal logical consistency. Claude can check its own reasoning before committing.
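For problems with a checkable answer, it helps to keep an independent check next to the model's output. As a sketch, here is the snail-and-well puzzle from the math example simulated day by day; days_to_escape is an illustrative helper, not something the API provides.

```python
def days_to_escape(depth: int, climb: int, slide: int) -> int:
    """Simulate the puzzle: climb by day, slide back at night unless already out."""
    if climb <= 0:
        raise ValueError("climb must be positive")
    if climb < depth and climb <= slide:
        raise ValueError("the snail never makes net progress")
    position = 0
    day = 0
    while True:
        day += 1
        position += climb
        if position >= depth:
            return day  # reached the top during the day, so no slide tonight
        position -= slide


print(days_to_escape(depth=30, climb=3, slide=2))  # 28
```

The tempting shortcut of dividing 30 feet by a net gain of 1 foot per day gives 30, which is wrong because the snail climbs out on day 28 before it can slide back.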
Problems where extended thinking doesn't help (or hurts)
Simple Q&A and factual lookup. If you're asking "What's the capital of France," extended thinking adds latency and cost with zero benefit. The answer doesn't require deliberation.
Classification tasks. Sentiment analysis, category tagging, intent detection — these are pattern matching, not reasoning chains. Standard prompting wins on speed and cost.
Summarization. Compressing a document into bullet points doesn't benefit from internal deliberation. Claude already has all the information it needs in the context window.
Creative writing. Counterintuitively, extended thinking can make creative output worse by over-deliberating on choices that should feel spontaneous. A character's voice doesn't need a formal reasoning chain to sound authentic.
Real-time chat. If a user is waiting for a response in a live conversation, 30-second latency kills the experience. Extended thinking is better suited for async tasks, batch processing, and background jobs where quality matters more than speed.
Prompting patterns that work best
Extended thinking changes what Claude does internally, but your prompt still determines what problem it's working on. A few patterns that consistently work well:
State the problem completely upfront. Don't hint at the answer or embed assumptions. If you write "Given that X is true, how do we handle Y," you've already constrained the reasoning. Write "Here's the situation: [facts]. What's the best approach to Y?" and let Claude form its own premises.
Reinforce the thinking mode explicitly. Add "Think through this carefully before giving your final answer" or "Consider all relevant factors before recommending." This is redundant with the API setting, but it seems to help Claude allocate thinking budget more deliberately on hard problems.
Ask for confidence. "How confident are you in this answer, and what would change your recommendation?" This forces Claude to flag uncertainty rather than paper over it. Particularly useful for factual claims where hallucination is a risk.
For math: ask for verification. "After solving, verify your answer is correct by checking it against the original problem." This uses part of the thinking budget for a second-pass check — basically getting Claude to double-check its own work.
For planning: ask about failure modes. "Before recommending, consider what could go wrong with each option." This pushes the thinking toward adversarial analysis rather than just selecting the best-case path.
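These patterns compose well, so it can be convenient to assemble prompts from them. The sketch below is one possible arrangement: the suffix wording is taken from the patterns above, while build_prompt and the pattern keys are hypothetical names.

```python
# Suffix text quoted from the patterns described in this section
SUFFIXES = {
    "careful": "Think through this carefully before giving your final answer.",
    "confidence": "How confident are you in this answer, and what would change your recommendation?",
    "verify": "After solving, verify your answer is correct by checking it against the original problem.",
    "failure_modes": "Before recommending, consider what could go wrong with each option.",
}


def build_prompt(facts: list[str], question: str, patterns: list[str]) -> str:
    """State the facts neutrally, ask the question, then append the chosen patterns."""
    lines = ["Here's the situation:"]
    lines += [f"- {fact}" for fact in facts]
    lines.append("")
    lines.append(question)
    lines += [SUFFIXES[name] for name in patterns]
    return "\n".join(lines)


prompt = build_prompt(
    facts=["Reads outnumber writes 50 to 1", "The team has only run Postgres in production"],
    question="What's the best approach to choosing a primary datastore?",
    patterns=["failure_modes", "confidence"],
)
print(prompt)
```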
Tuning the budget
There's no universal right answer for budget_tokens. A rough heuristic:
- 1,024–2,000: Simple logic problems, short proofs, quick tradeoff analysis
- 5,000–10,000: Complex code debugging, multi-step math, design decisions with several competing factors
- 15,000–32,000: Very hard reasoning, multi-document synthesis, problems that require backtracking significantly
Start at 5,000 for most production use cases and monitor token usage in your responses. If Claude is consistently using less than half the budget, reduce it. If the answers still feel shallow, increase it — though you'll hit diminishing returns on most problems somewhere around 15,000 tokens.
One pattern that works well: set a moderate budget (8,000) and add "If you need more reasoning space, say so in your answer and I'll rerun with a higher budget." Claude can tell you when it's hitting the ceiling.
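The reduce-if-under-half rule can be automated once you log how many thinking tokens each call used. In this sketch, suggest_budget is a hypothetical helper, halving is an assumed step size (the heuristic above does not specify one), and the observed counts are whatever estimate you collect from your own responses. Increasing the budget is left as a manual decision, since "answers feel shallow" is a judgment call.

```python
MIN_BUDGET_TOKENS = 1024  # minimum quoted in this guide


def suggest_budget(current_budget: int, observed_thinking_tokens: list[int]) -> int:
    """Halve the budget when every observed call used less than half of it."""
    if not observed_thinking_tokens:
        return current_budget  # no data yet, so leave the budget alone
    if max(observed_thinking_tokens) < current_budget / 2:
        return max(MIN_BUDGET_TOKENS, current_budget // 2)
    return current_budget


print(suggest_budget(5000, [800, 1200, 2000]))  # 2500
print(suggest_budget(5000, [800, 4000]))        # 5000
```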
Streaming and displaying thinking
You can stream extended thinking with stream=True or, as in the example here, with the Python SDK's messages.stream() helper. The thinking block streams first, then the text block. This is useful for showing progress in a UI: even if you don't display the raw thinking, you can show a "Claude is reasoning..." indicator while the thinking stream runs.
your_prompt = "Your question here"  # placeholder prompt; client is the one created earlier

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": your_prompt}],
) as stream:
    for event in stream:
        if getattr(event, "type", None) == "content_block_start":
            # "thinking" or "text"
            block_type = event.content_block.type
            # handle the start of each block here, e.g. toggle a progress indicator
    # After the loop, fetch the fully assembled message
    final_message = stream.get_final_message()
Whether to show the thinking block to users is a product decision. The raw thinking is often messy — Claude explores dead ends, changes its mind, writes things that look wrong in isolation. Showing it can build trust ("look how carefully Claude reasoned through this") or undermine it ("Claude considered the wrong approach for a while"). Most production apps hide the thinking and just show the final answer.
If you do surface it, consider showing a condensed summary rather than the raw text. The thinking block can be thousands of tokens of exploratory reasoning that doesn't read well as prose.
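To hide the thinking while still collecting it, one option is to split the stream by delta type. The sketch below assumes the streamed events arrive as content_block_delta events whose delta is either a thinking_delta (with a thinking field) or a text_delta (with a text field); verify this against the current SDK documentation. The SimpleNamespace objects are stand-ins so the example runs without a network call, and split_stream is a hypothetical helper.

```python
from types import SimpleNamespace


def split_stream(events) -> tuple[str, str]:
    """Accumulate thinking and answer text separately from streamed events."""
    thinking_parts, text_parts = [], []
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # ignore start/stop and other event types
        delta = event.delta
        if delta.type == "thinking_delta":
            thinking_parts.append(delta.thinking)
        elif delta.type == "text_delta":
            text_parts.append(delta.text)
    return "".join(thinking_parts), "".join(text_parts)


def fake_delta(kind: str, **fields) -> SimpleNamespace:
    """Build a stand-in for a streamed delta event (illustration only)."""
    return SimpleNamespace(type="content_block_delta",
                           delta=SimpleNamespace(type=kind, **fields))


events = [
    SimpleNamespace(type="content_block_start",
                    content_block=SimpleNamespace(type="thinking")),
    fake_delta("thinking_delta", thinking="Net gain is 1 foot per day, "),
    fake_delta("thinking_delta", thinking="but the last climb has no slide."),
    fake_delta("text_delta", text="28 days."),
]

thinking, answer = split_stream(events)
print(answer)  # 28 days.
```

In a real app, the same loop could run over the stream object from the earlier example, showing only the answer text and keeping the thinking for logs or a condensed summary.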
Combining with context engineering
Extended thinking and context engineering compound each other. When you give Claude a well-structured context — relevant documents, clear problem framing, explicit constraints — the thinking budget goes toward reasoning rather than extracting and organizing information.
The worst use of extended thinking is a vague prompt with a huge budget. Claude will spend tokens figuring out what you're even asking. The best use is a precise, complete problem statement with all relevant context attached — then a generous thinking budget to actually solve it.
For document-heavy tasks, load your source material, structure it clearly (XML tags work well here), then enable extended thinking. The combination of rich context and extended reasoning is where you see outputs that look qualitatively different from standard API calls.
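As an illustration of structuring source material with XML tags, the sketch below wraps each document and the task in labeled tags before the text is sent as the user message. The tag names (documents, document, source, content, task) and the build_context helper are assumptions chosen for this example, not a required schema.

```python
def build_context(documents: dict[str, str], task: str) -> str:
    """Wrap each source document and the task in XML-style tags."""
    parts = ["<documents>"]
    for index, (source, body) in enumerate(documents.items(), start=1):
        parts.append(f'<document index="{index}">')
        parts.append(f"<source>{source}</source>")
        parts.append(f"<content>\n{body.strip()}\n</content>")
        parts.append("</document>")
    parts.append("</documents>")
    parts.append("")
    parts.append(f"<task>\n{task.strip()}\n</task>")
    return "\n".join(parts)


context = build_context(
    {
        "q3_report.txt": "Revenue grew 12% quarter over quarter.",
        "q4_plan.txt": "Headcount is frozen through December.",
    },
    task="Identify any conflicts between the report and the plan.",
)
print(context)
```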
When to use extended thinking vs other approaches
Use it when:
- The problem has a verifiably correct answer (math, logic, code) and you need that answer to be right
- You're processing tasks async or in batch where latency doesn't matter
- The downstream cost of a wrong answer exceeds the cost of extra thinking tokens
- You need Claude to catch its own mistakes before you see the output
Don't use it when:
- Users are waiting in real-time
- The task is mostly retrieval or pattern-matching, not reasoning
- You're running at high volume with cost constraints
- The problem is genuinely ambiguous and no amount of reasoning produces a "correct" answer
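In a pipeline that handles mixed workloads, the checklist above can be encoded as a routing rule. This is only a sketch: the Task fields and use_extended_thinking are hypothetical, and the cost figures are estimates you would supply yourself.

```python
from dataclasses import dataclass


@dataclass
class Task:
    user_waiting: bool        # someone is waiting on the reply in real time
    needs_reasoning: bool     # False for retrieval or pattern matching
    error_cost_usd: float     # estimated downstream cost of a wrong answer
    thinking_cost_usd: float  # estimated cost of the extra thinking tokens


def use_extended_thinking(task: Task) -> bool:
    """Apply the checklist: skip for live or non-reasoning work, else compare costs."""
    if task.user_waiting or not task.needs_reasoning:
        return False
    return task.error_cost_usd > task.thinking_cost_usd


batch_audit = Task(user_waiting=False, needs_reasoning=True,
                   error_cost_usd=50.0, thinking_cost_usd=0.15)
live_chat = Task(user_waiting=True, needs_reasoning=True,
                 error_cost_usd=50.0, thinking_cost_usd=0.15)

print(use_extended_thinking(batch_audit))  # True
print(use_extended_thinking(live_chat))    # False
```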
Extended thinking isn't a general upgrade — it's a targeted tool. Used on the right problems with well-structured prompts, it's the difference between Claude guessing and Claude actually working through the answer. That gap is worth understanding and exploiting deliberately.



