Gemini 2.5 Pro has a thinking mode that changes how the model works at a fundamental level. Before it writes a single word of the visible response, it reasons through the problem in an internal scratchpad. The result is meaningfully better on hard problems. But enable it on the wrong tasks and you're paying 3× more for the same answer, arriving 3 seconds later.
The model itself doesn't warn you when you've misused it. It'll happily think for 8,000 tokens about whether "Paris" is the capital of France, then bill you accordingly.
What's actually happening under the hood
When you enable thinking, the model generates an internal reasoning trace before producing its final response. This trace is not a chain-of-thought prompt — you're not telling it how to reason. The model decides its own reasoning path. By default, you get a compressed summary of the thinking process via the API, not the raw trace.
The result: for problems where the path to the answer isn't obvious, the model arrives at significantly better final responses. It can backtrack, explore alternatives, catch its own errors mid-reasoning, and synthesize conclusions across multiple steps before committing to an output. On straightforward tasks, it does the same thing it always would — just slower and more expensively.
It's conceptually identical to Claude's extended thinking. Same pattern, different implementation details, similar use case profile.
When to turn it on
The practical test: would a smart human reach for paper to work through this problem? If yes, thinking mode helps. If no, it's overhead.
Enable thinking for:
- Math word problems with multiple steps and constraints
- Logic puzzles where the answer requires tracking several conditions simultaneously
- Step-by-step code debugging where the cause isn't immediately obvious
- Security reviews of non-trivial code
- Contract analysis: "Given this 50-page contract, what are all the termination conditions and their effective dates?"
- Architecture tradeoff analysis: competing approaches with different consequences
- Anything phrased as "figure out why X is happening" rather than "do X"
Don't bother for:
- Summarizing a paragraph
- Spam classification
- Translation
- JSON reformatting
- Simple factual retrieval
- Template-filling tasks where the structure is predetermined
The failure mode isn't bad output — it's wasted spend. Thinking mode on a simple task returns the right answer, just expensively. At low volume this doesn't matter. At production scale, enabling thinking indiscriminately will double or triple your Gemini bill without corresponding quality gains.
The budget_tokens parameter
The thinking budget controls how many tokens the model can use for its internal reasoning. It's a ceiling, not a target — the model won't necessarily use all of it, but it can't exceed it.
| Budget range | Use case |
|---|---|
| 2,000–4,000 | Light reasoning tasks, reduces latency vs. default |
| 8,000–16,000 | Standard complex tasks — good default |
| 32,000+ | Very hard problems: research synthesis, adversarial analysis |
Start at 8,000 and only increase it if answer quality is consistently poor. More budget doesn't automatically mean better answers — on most tasks, the model finds the answer well within 8,000 tokens and the extra budget goes unused. Raising the ceiling to 32,000 for a problem that resolves in 4,000 thinking tokens accomplishes nothing except granting headroom the model never touches.
One calibration approach: run 50 representative hard queries at budget 4,000, 8,000, and 16,000. Compare answer quality. For most problem categories, you'll see a plateau where quality stops improving before you hit 16,000.
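A minimal sketch of that sweep, assuming the aicredits.in-configured client from the API example below and a hand-collected list of representative queries (the scoring step is left manual):

BUDGETS = [4_000, 8_000, 16_000]

def run_budget_sweep(queries):
    # Run every query at every budget so answers can be compared side by side.
    answers = {}
    for budget in BUDGETS:
        answers[budget] = []
        for query in queries:
            response = client.chat.completions.create(
                model="google/gemini-2.5-pro",
                messages=[{"role": "user", "content": query}],
                extra_body={"thinking": {"type": "enabled", "budget_tokens": budget}},
            )
            answers[budget].append(response.choices[0].message.content)
    return answers

# hard_queries is a placeholder for your 50 representative problems.
# results = run_budget_sweep(hard_queries)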
Prompting patterns that work with thinking enabled
The temptation when using a reasoning model is to over-scaffold — walk it through the steps yourself, tell it exactly how to approach the problem. Resist this. The thinking mode's advantage is that the model finds its own reasoning path. If you specify the path, you're bypassing the thing that makes it useful.
State the problem, don't prescribe the method. Instead of "Use dynamic programming to solve this" — just give it the problem. If dynamic programming is the right approach, it'll find that. If there's a better approach, it can find that too.
Use constraints, not instructions. "The solution must run in O(n log n) or better" tells the model what outcome to produce. "Use a heap sort" tells it how to produce it. Constraints work better because they leave the reasoning path open while preventing unacceptable solutions.
Ask for confidence alongside the answer. "What's your answer, and how certain are you?" The thinking process allows the model to internally assess its own uncertainty. Without asking, it often presents uncertain answers with the same confidence as certain ones.
For debugging: full context, no leading. Paste the complete error message and the relevant code. Ask what's wrong. Don't say "I think the issue might be in the initialization" — that primes the model to look there first and potentially miss the real root cause. Let it reason from the full picture.
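A sketch of what that looks like in practice; the traceback and file name are stand-ins for your own:

# Full-context debugging prompt: complete error output plus the relevant
# code, and a neutral question with no guesses about where the bug lives.
error_output = """Traceback (most recent call last):
  File "pipeline.py", line 42, in <module>
    ...
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'"""

source_code = open("pipeline.py").read()  # placeholder path

prompt = (
    f"Here is the full error output:\n{error_output}\n\n"
    f"And the relevant code:\n{source_code}\n\n"
    "What's wrong, and what is the root cause?"
)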
API example via aicredits.in
Indian developers: access all models via AICredits.in — INR billing, UPI top-up, single API key for Claude, GPT-4o, Gemini and more.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1"
)

response = client.chat.completions.create(
    model="google/gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": "A factory produces widgets at 120/hour for the first 3 hours, then 95/hour for the next 5 hours, then stops for a 45-minute break, then runs at 110/hour for 4 more hours. A shipment of 1,200 widgets needs to leave by hour 10. Will it make it? If not, by how many widgets does it fall short?"
        }
    ],
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 8000
        }
    }
)

print(response.choices[0].message.content)
The extra_body field passes thinking configuration through the OpenAI-compatible client. This is an aicredits.in extension — the parameter gets forwarded to the Gemini API correctly on the backend.
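To sanity-check what a thinking call consumed, the standard usage fields on the OpenAI-compatible response are available. Whether thinking tokens are folded into completion_tokens depends on how the gateway reports them, so treat this as a rough check rather than a billing source of truth:

# Rough token accounting for the call above. How thinking tokens are
# reported (and whether they land in completion_tokens) depends on the
# gateway, so confirm against your billing dashboard.
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")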
Without thinking enabled (same client, no extra_body):
response = client.chat.completions.create(
    model="google/gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": "Summarize this paragraph in two sentences."
        }
    ]
)
Use the simple form for tasks that don't need reasoning. Reserve the extra_body form for the hard problems.
Gemini thinking vs. Claude extended thinking
|  | Gemini 2.5 Pro | Claude (extended thinking) |
|---|---|---|
| Thinking visible? | Summary only (default) | Full thinking blocks |
| Budget control | budget_tokens int | budget_tokens int |
| Latency overhead | +2–8s | +3–15s |
| Best task types | Math, code, logic | Analysis, writing, reasoning |
| Access | Via API (aicredits.in) | Via API (aicredits.in) |
The key difference in practice: Claude gives you the full thinking trace, which lets you debug the reasoning when it goes wrong. Gemini gives you a summary. For development and debugging, Claude's full trace is useful — you can see exactly where the reasoning diverged. For production deployments where you only care about the final answer, the difference is minimal.
Both models are available through aicredits.in on the same API key, so you can test both on your specific workload and pick the one that works better for your use case. See the reasoning models guide for a broader comparison of thinking model behavior across providers.
Cost considerations at scale
Thinking tokens are billed as output tokens. An 8,000 budget_tokens call that uses all 8,000 thinking tokens plus a 500-token final response is billed as 8,500 output tokens. Output tokens cost more than input tokens on most model pricing schedules — this adds up fast.
At 100,000 requests per day with thinking enabled on all of them at 8,000 budget: that's up to 800 million thinking tokens per day before counting your actual responses. At Gemini 2.5 Pro output pricing, that's a significant daily cost. The break-even question is: does the quality improvement from thinking reduce downstream costs (user churn, support tickets, retry rates) by more than it increases API spend?
For customer-facing hard tasks — technical support, contract analysis, financial modeling — the answer is usually yes. For background batch jobs with soft quality requirements, the answer is usually no.
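A back-of-the-envelope estimator makes that break-even question concrete. The per-million-token price below is a placeholder; plug in the current Gemini 2.5 Pro output rate from your provider's pricing page:

# Rough daily cost for thinking-enabled traffic. PRICE_PER_M_OUTPUT is a
# placeholder, not the actual Gemini 2.5 Pro rate.
REQUESTS_PER_DAY = 100_000
AVG_THINKING_TOKENS = 8_000   # assumes the full budget is consumed
AVG_RESPONSE_TOKENS = 500
PRICE_PER_M_OUTPUT = 10.00    # USD per million output tokens (placeholder)

daily_output_tokens = REQUESTS_PER_DAY * (AVG_THINKING_TOKENS + AVG_RESPONSE_TOKENS)
daily_cost = daily_output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
print(f"{daily_output_tokens:,} output tokens/day -> ${daily_cost:,.2f}/day")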
The practical approach: whitelist specific request types for thinking mode rather than enabling it globally. In code:
THINKING_TASKS = {"architecture_review", "contract_analysis", "complex_debugging", "security_review"}

def call_gemini(task_type: str, query: str) -> str:
    params = {
        "model": "google/gemini-2.5-pro",
        "messages": [{"role": "user", "content": query}]
    }
    # Only whitelisted task types pay the thinking-token overhead.
    if task_type in THINKING_TASKS:
        params["extra_body"] = {"thinking": {"type": "enabled", "budget_tokens": 8000}}
    response = client.chat.completions.create(**params)
    return response.choices[0].message.content
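Call it with whatever task type your routing layer assigns; anything outside the whitelist falls through to a plain request:

# contract_question is a placeholder for the actual user query.
answer = call_gemini("contract_analysis", contract_question)  # thinking enabled
summary = call_gemini("summarization", "Summarize this paragraph in two sentences.")  # thinking off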
Debugging example: thinking mode vs. standard
Here's a real scenario where the difference is clear. Take this Python code:
def process_records(records):
    results = []
    for i in range(len(records)):
        if records[i]["status"] == "active":
            results.append({
                "id": records[i]["id"],
                "value": records[i]["value"] * 1.1
            })
    return results

records = [
    {"id": 1, "status": "active", "value": 100},
    {"id": 2, "status": "inactive", "value": 200},
    {"id": 3, "status": "active", "value": None},
]

print(process_records(records))
With thinking disabled, asking "what's wrong with this code?" typically returns: "the code will raise a TypeError when it encounters None as a value because you can't multiply None by 1.1."
Correct. But incomplete.
With thinking enabled, the model reasons through it more carefully: it catches the None multiplication, but also notes that iterating with range(len(records)) is un-Pythonic and fragile (would fail if records were a generator), that there's no error handling for missing keys (records without a "status" or "value" key would raise KeyError), and that the 1.1 multiplier is a magic number with no explanation.
Standard mode finds the obvious crash. Thinking mode finds the obvious crash plus the three bugs waiting to happen. That's the practical difference — not that standard mode is wrong, but that it reasons shallowly on complex inputs.
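For reference, a cleanup that addresses all four findings might look something like this; it's a sketch, not the model's output:

ACTIVE_VALUE_MULTIPLIER = 1.1  # name the magic number and document why it exists

def process_records(records):
    results = []
    for record in records:                 # works for any iterable, not just lists
        if record.get("status") != "active":
            continue                       # missing "status" no longer raises KeyError
        value = record.get("value")
        if value is None:
            continue                       # skip records with no value instead of crashing
        results.append({
            "id": record.get("id"),
            "value": value * ACTIVE_VALUE_MULTIPLIER,
        })
    return results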
Working with the Gemini 2.0 Flash baseline
If you're evaluating whether to upgrade to 2.5 Pro with thinking, start with the Gemini 2.0 Flash guide to understand where the previous tier lands. Flash is faster and cheaper, and for structured tasks (extraction, classification, summarization) it's often good enough that 2.5 Pro with thinking is overkill.
The right mental model: Flash for high-volume structured work, 2.5 Pro standard for moderate-complexity generation and reasoning, 2.5 Pro with thinking for hard problems where quality is the constraint. Use them as tiers, not as a ladder where newer always means better for your use case.
That tiered thinking maps directly to the routing strategies in the LLM model routing guide — if you're building a system that handles mixed query types, thinking mode is one more lever in your tier configuration, not a setting you toggle globally.
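A minimal sketch of that tier selection; the model identifiers are assumptions based on aicredits.in's naming, so check the catalog for the exact IDs:

def pick_model(tier: str):
    # Model IDs are assumptions; verify the exact names in the catalog.
    if tier == "structured":   # extraction, classification, summarization
        return "google/gemini-2.0-flash", None
    if tier == "standard":     # moderate-complexity generation and reasoning
        return "google/gemini-2.5-pro", None
    # hard problems where quality is the constraint
    return "google/gemini-2.5-pro", {"thinking": {"type": "enabled", "budget_tokens": 8000}}

model, extra = pick_model("hard")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": query}],  # query is a placeholder
    extra_body=extra,
)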



