Gemini 2.5 Pro has a thinking mode that changes how the model works at a fundamental level. Before it writes a single word of the visible response, it reasons through the problem in an internal scratchpad. The result is meaningfully better on hard problems. But enable it on the wrong tasks and you're paying 3× more for the same answer, arriving 3 seconds later.
The model itself doesn't warn you when you've misused it. It'll happily think for 8,000 tokens about whether "Paris" is the capital of France, then bill you accordingly.
What's actually happening under the hood
When you enable thinking, the model generates an internal reasoning trace before producing its final response. This trace is not a chain-of-thought prompt — you're not telling it how to reason. The model decides its own reasoning path. By default, you get a compressed summary of the thinking process via the API, not the raw trace.
The result: for problems where the path to the answer isn't obvious, the model arrives at significantly better final responses. It can backtrack, explore alternatives, catch its own errors mid-reasoning, and synthesize conclusions across multiple steps before committing to an output. On straightforward tasks, it does the same thing it always would — just slower and more expensively.
It's conceptually identical to Claude's extended thinking. Same pattern, different implementation details, similar use case profile.
When to turn it on
The practical test: would a smart human reach for paper to work through this problem? If yes, thinking mode helps. If no, it's overhead.
Enable thinking for:
- Math word problems with multiple steps and constraints
- Logic puzzles where the answer requires tracking several conditions simultaneously
- Step-by-step code debugging where the cause isn't immediately obvious
- Security reviews of non-trivial code
- Contract analysis: "Given this 50-page contract, what are all the termination conditions and their effective dates?"
- Architecture tradeoff analysis: competing approaches with different consequences
- Anything phrased as "figure out why X is happening" rather than "do X"
Don't bother for:
- Summarizing a paragraph
- Spam classification
- Translation
- JSON reformatting
- Simple factual retrieval
- Template-filling tasks where the structure is predetermined
The failure mode isn't bad output — it's wasted spend. Thinking mode on a simple task returns the right answer, just expensively. At low volume this doesn't matter. At production scale, enabling thinking indiscriminately will double or triple your Gemini bill without corresponding quality gains.
The budget_tokens parameter
The thinking budget controls how many tokens the model can use for its internal reasoning. It's a ceiling, not a target — the model won't necessarily use all of it, but it can't exceed it.
| Budget range | Use case |
|---|---|
| 2,000–4,000 | Light reasoning tasks, reduces latency vs. default |
| 8,000–16,000 | Standard complex tasks — good default |
| 32,000+ | Very hard problems: research synthesis, adversarial analysis |
Start at 8,000 and only increase it if answer quality is consistently poor. More budget doesn't automatically mean better answers — on most tasks, the model finds the answer well within 8,000 tokens and the extra budget goes unused. Raising the ceiling to 32,000 for a problem that resolves in 4,000 thinking tokens accomplishes nothing except granting headroom the model never touches.
One calibration approach: run 50 representative hard queries at budget 4,000, 8,000, and 16,000. Compare answer quality. For most problem categories, you'll see a plateau where quality stops improving before you hit 16,000.
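A minimal sketch of that sweep, assuming the aicredits.in-configured client from the API example below and a hand-collected list of representative queries (the scoring step is left manual):

BUDGETS = [4_000, 8_000, 16_000]

def run_budget_sweep(queries):
    # Run every query at every budget so answers can be compared side by side.
    answers = {}
    for budget in BUDGETS:
        answers[budget] = []
        for query in queries:
            response = client.chat.completions.create(
                model="google/gemini-2.5-pro",
                messages=[{"role": "user", "content": query}],
                extra_body={"thinking": {"type": "enabled", "budget_tokens": budget}},
            )
            answers[budget].append(response.choices[0].message.content)
    return answers

# hard_queries is a placeholder for your 50 representative problems.
# results = run_budget_sweep(hard_queries)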
Prompting patterns that work with thinking enabled
The temptation when using a reasoning model is to over-scaffold — walk it through the steps yourself, tell it exactly how to approach the problem. Resist this. The thinking mode's advantage is that the model finds its own reasoning path. If you specify the path, you're bypassing the thing that makes it useful.
State the problem, don't prescribe the method. Instead of "Use dynamic programming to solve this" — just give it the problem. If dynamic programming is the right approach, it'll find that. If there's a better approach, it can find that too.
Use constraints, not instructions. "The solution must run in O(n log n) or better" tells the model what outcome to produce. "Use a heap sort" tells it how to produce it. Constraints work better because they leave the reasoning path open while preventing unacceptable solutions.
Ask for confidence alongside the answer. "What's your answer, and how certain are you?" The thinking process allows the model to internally assess its own uncertainty. Without asking, it often presents uncertain answers with the same confidence as certain ones.
For debugging: full context, no leading. Paste the complete error message and the relevant code. Ask what's wrong. Don't say "I think the issue might be in the initialization" — that primes the model to look there first and potentially miss the real root cause. Let it reason from the full picture.
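A sketch of what that looks like in practice; the traceback and file name are stand-ins for your own:

# Full-context debugging prompt: complete error output plus the relevant
# code, and a neutral question with no guesses about where the bug lives.
error_output = """Traceback (most recent call last):
  File "pipeline.py", line 42, in <module>
    ...
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'"""

source_code = open("pipeline.py").read()  # placeholder path

prompt = (
    f"Here is the full error output:\n{error_output}\n\n"
    f"And the relevant code:\n{source_code}\n\n"
    "What's wrong, and what is the root cause?"
)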
API example via aicredits.in
Indian developers: access all models via AICredits.in — INR billing, UPI top-up, single API key for Claude, GPT-4o, Gemini and more.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1"
)

response = client.chat.completions.create(
    model="google/gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": "A factory produces widgets at 120/hour for the first 3 hours, then 95/hour for the next 5 hours, then stops for a 45-minute break, then runs at 110/hour for 4 more hours. A shipment of 1,200 widgets needs to leave by hour 10. Will it make it? If not, by how many widgets does it fall short?"
        }
    ],
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 8000
        }
    }
)

print(response.choices[0].message.content)
The extra_body field passes thinking configuration through the OpenAI-compatible client. This is an aicredits.in extension — the parameter gets forwarded to the Gemini API correctly on the backend.
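To sanity-check what a thinking call consumed, the standard usage fields on the OpenAI-compatible response are available. Whether thinking tokens are folded into completion_tokens depends on how the gateway reports them, so treat this as a rough check rather than a billing source of truth:

# Rough token accounting for the call above. How thinking tokens are
# reported (and whether they land in completion_tokens) depends on the
# gateway, so confirm against your billing dashboard.
usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")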
Without thinking enabled (same client, no extra_body):
response = client.chat.completions.create(
    model="google/gemini-2.5-pro",
    messages=[
        {
            "role": "user",
            "content": "Summarize this paragraph in two sentences."
        }
    ]
)
Use the simple form for tasks that don't need reasoning. Reserve the extra_body form for the hard problems.
Gemini thinking vs. Claude extended thinking
|  | Gemini 2.5 Pro | Claude (extended thinking) |
|---|---|---|
| Thinking visible? | Summary only (default) | Full thinking blocks |
| Budget control | budget_tokens int | budget_tokens int |
| Latency overhead | +2–8s | +3–15s |
| Best task types | Math, code, logic | Analysis, writing, reasoning |
| Access | Via API (aicredits.in) | Via API (aicredits.in) |
The key difference in practice: Claude gives you the full thinking trace, which lets you debug the reasoning when it goes wrong. Gemini gives you a summary. For development and debugging, Claude's full trace is useful — you can see exactly where the reasoning diverged. For production deployments where you only care about the final answer, the difference is minimal.
Both models are available through aicredits.in on the same API key, so you can test both on your specific workload and pick the one that works better for your use case. See the reasoning models guide for a broader comparison of thinking model behavior across providers.
Cost considerations at scale
Thinking tokens are billed as output tokens. An 8,000 budget_tokens call that uses all 8,000 thinking tokens plus a 500-token final response is billed as 8,500 output tokens. Output tokens cost more than input tokens on most model pricing schedules — this adds up fast.
At 100,000 requests per day with thinking enabled on all of them at 8,000 budget: that's up to 800 million thinking tokens per day before counting your actual responses. At Gemini 2.5 Pro output pricing, that's a significant daily cost. The break-even question is: does the quality improvement from thinking reduce downstream costs (user churn, support tickets, retry rates) by more than it increases API spend?
For customer-facing hard tasks — technical support, contract analysis, financial modeling — the answer is usually yes. For background batch jobs with soft quality requirements, the answer is usually no.
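A back-of-the-envelope estimator makes that break-even question concrete. The per-million-token price below is a placeholder; plug in the current Gemini 2.5 Pro output rate from your provider's pricing page:

# Rough daily cost for thinking-enabled traffic. PRICE_PER_M_OUTPUT is a
# placeholder, not the actual Gemini 2.5 Pro rate.
REQUESTS_PER_DAY = 100_000
AVG_THINKING_TOKENS = 8_000   # assumes the full budget is consumed
AVG_RESPONSE_TOKENS = 500
PRICE_PER_M_OUTPUT = 10.00    # USD per million output tokens (placeholder)

daily_output_tokens = REQUESTS_PER_DAY * (AVG_THINKING_TOKENS + AVG_RESPONSE_TOKENS)
daily_cost = daily_output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
print(f"{daily_output_tokens:,} output tokens/day -> ${daily_cost:,.2f}/day")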
The practical approach: whitelist specific request types for thinking mode rather than enabling it globally. In code:
THINKING_TASKS = {"architecture_review", "contract_analysis", "complex_debugging", "security_review"}

def call_gemini(task_type: str, query: str) -> str:
    params = {
        "model": "google/gemini-2.5-pro",
        "messages": [{"role": "user", "content": query}]
    }
    # Only whitelisted task types pay the thinking-token overhead.
    if task_type in THINKING_TASKS:
        params["extra_body"] = {"thinking": {"type": "enabled", "budget_tokens": 8000}}
    response = client.chat.completions.create(**params)
    return response.choices[0].message.content
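Call it with whatever task type your routing layer assigns; anything outside the whitelist falls through to a plain request:

# contract_question is a placeholder for the actual user query.
answer = call_gemini("contract_analysis", contract_question)  # thinking enabled
summary = call_gemini("summarization", "Summarize this paragraph in two sentences.")  # thinking off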
Debugging example: thinking mode vs. standard
Here's a real scenario where the difference is clear. Take this Python code:
def process_records(records):
    results = []
    for i in range(len(records)):
        if records[i]["status"] == "active":
            results.append({
                "id": records[i]["id"],
                "value": records[i]["value"] * 1.1
            })
    return results

records = [
    {"id": 1, "status": "active", "value": 100},
    {"id": 2, "status": "inactive", "value": 200},
    {"id": 3, "status": "active", "value": None},
]

print(process_records(records))
With thinking disabled, asking "what's wrong with this code?" typically returns: "the code will raise a TypeError when it encounters None as a value because you can't multiply None by 1.1."
Correct. But incomplete.
With thinking enabled, the model reasons through it more carefully: it catches the None multiplication, but also notes that iterating with range(len(records)) is un-Pythonic and fragile (would fail if records were a generator), that there's no error handling for missing keys (records without a "status" or "value" key would raise KeyError), and that the 1.1 multiplier is a magic number with no explanation.
Standard mode finds the obvious crash. Thinking mode finds the obvious crash plus the three bugs waiting to happen. That's the practical difference — not that standard mode is wrong, but that it reasons shallowly on complex inputs.
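For reference, a cleanup that addresses all four findings might look something like this; it's a sketch, not the model's output:

ACTIVE_VALUE_MULTIPLIER = 1.1  # name the magic number and document why it exists

def process_records(records):
    results = []
    for record in records:                 # works for any iterable, not just lists
        if record.get("status") != "active":
            continue                       # missing "status" no longer raises KeyError
        value = record.get("value")
        if value is None:
            continue                       # skip records with no value instead of crashing
        results.append({
            "id": record.get("id"),
            "value": value * ACTIVE_VALUE_MULTIPLIER,
        })
    return results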
Working with the Gemini 2.0 Flash baseline
If you're evaluating whether to upgrade to 2.5 Pro with thinking, start with the Gemini 2.0 Flash guide to understand where the previous tier lands. Flash is faster and cheaper, and for structured tasks (extraction, classification, summarization) it's often good enough that 2.5 Pro with thinking is overkill.
The right mental model: Flash for high-volume structured work, 2.5 Pro standard for moderate-complexity generation and reasoning, 2.5 Pro with thinking for hard problems where quality is the constraint. Use them as tiers, not as a ladder where newer always means better for your use case.
That tiered thinking maps directly to the routing strategies in the LLM model routing guide — if you're building a system that handles mixed query types, thinking mode is one more lever in your tier configuration, not a setting you toggle globally.
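A minimal sketch of that tier selection; the model identifiers are assumptions based on aicredits.in's naming, so check the catalog for the exact IDs:

def pick_model(tier: str):
    # Model IDs are assumptions; verify the exact names in the catalog.
    if tier == "structured":   # extraction, classification, summarization
        return "google/gemini-2.0-flash", None
    if tier == "standard":     # moderate-complexity generation and reasoning
        return "google/gemini-2.5-pro", None
    # hard problems where quality is the constraint
    return "google/gemini-2.5-pro", {"thinking": {"type": "enabled", "budget_tokens": 8000}}

model, extra = pick_model("hard")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": query}],  # query is a placeholder
    extra_body=extra,
)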



