Most teams pick a model that works for their hardest use cases and use it for everything. It's the path of least resistance. It also means you're paying GPT-4o prices for tasks that GPT-4o-mini handles just as well — and waiting 3 seconds for responses that could come back in 300ms.
LLM routing is the practice of matching each request to the model best suited for it. Done right, it cuts costs by 60–80% on mixed workloads without any quality degradation on the tasks that matter.
Why one model doesn't fit all
The cost difference between models is enormous and frequently underestimated.
GPT-4o runs around $2.50 per million input tokens and $10 per million output tokens. GPT-4o-mini is $0.15 and $0.60, roughly 17x cheaper on both. Claude Opus 4 costs $15/$75 per million tokens. Haiku 3.5 costs $0.80/$4. Gemini 1.5 Pro is about $1.25/$5. Gemini 2.0 Flash is $0.10/$0.40.
For a system handling 1 million messages per month at an average of 500 input tokens and 200 output tokens each, using GPT-4o across the board runs about $3,250/month. Routing 70% of those requests to GPT-4o-mini drops the bill to roughly $1,100/month, without touching the 30% of complex tasks that actually need the larger model.
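If you want to sanity-check numbers like these against your own traffic, a back-of-the-envelope calculation in a few lines of Python is enough. The prices below are the ones quoted above and the split is illustrative; plug in your own.

# Rough monthly cost in USD. Prices are per million tokens; check current pricing pages.
PRICES = {
    "gpt-4o":      {"in": 2.50, "out": 10.00},
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
}

def monthly_cost(requests, in_tokens, out_tokens, split):
    """split maps model name -> fraction of traffic; fractions should sum to 1."""
    return sum(
        requests * share * (in_tokens * PRICES[m]["in"] + out_tokens * PRICES[m]["out"]) / 1e6
        for m, share in split.items()
    )

print(monthly_cost(1_000_000, 500, 200, {"gpt-4o": 1.0}))                       # ~3250
print(monthly_cost(1_000_000, 500, 200, {"gpt-4o-mini": 0.7, "gpt-4o": 0.3}))   # ~1110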
Latency matters too. GPT-4o-mini and Gemini Flash regularly return responses in 300-600ms. GPT-4o and Claude Sonnet average 2-5 seconds. For a real-time chat interface, that gap is the difference between a fluid conversation and something that feels like it's thinking.
The routing decision framework
Three axes determine which model a request should go to:
Task complexity. Does this require multi-step reasoning, handling ambiguous instructions, or synthesizing information from many sources? Or is it a clearly defined extraction task with a predictable output format? Complex reasoning needs a capable model. Simple classification or reformatting doesn't.
Latency requirement. Is this a user-facing interaction where response time affects perceived quality? Or a background batch job where nobody's watching the clock? User-facing usually means choosing speed. Batch jobs can absorb slower, more capable models.
Cost sensitivity. How many of these requests are you running per day? A researcher running 50 deep analysis tasks can absorb high model costs. A SaaS product running 500,000 classification requests cannot.
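Combined, the three axes reduce to a small decision function. A minimal sketch, where the model names and the volume cutoff are illustrative rather than fixed rules:

def pick_model(complexity: str, user_facing: bool, daily_requests: int) -> str:
    """Combine the three routing axes; thresholds and models are illustrative."""
    if complexity == "complex":
        return "claude-opus-4"       # quality drives the business outcome; pay for it
    if complexity == "simple" or (user_facing and daily_requests > 100_000):
        return "gpt-4o-mini"         # fast and cheap for the bulk of traffic
    return "claude-sonnet-4-5"       # standard generation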
Practical routing rules with examples
The routing map I use for most production systems:
Simple extraction and classification → small model. Extracting structured fields from text, categorizing incoming messages, translating short snippets, generating short summaries of known documents. Haiku 3.5, GPT-4o-mini, or Gemini 2.0 Flash handle these without breaking a sweat. These tasks are usually well-defined and forgiving of minor errors.
Standard generation → medium model. Writing emails, answering FAQs, summarizing lengthy documents, generating code for moderately complex functions, structured Q&A with a knowledge base. Claude Sonnet or GPT-4o are the right tier here — capable enough for nuanced tasks, not overkill for most.
Complex reasoning → large model. Multi-step analysis, debugging subtle code issues, synthesizing conflicting information, tasks where output quality is directly tied to business outcomes. This is where Claude Opus or o3 earns its price tag.
Speed-critical → fastest available. Real-time autocomplete, streaming chat where perceived latency matters most, interactive coding assistants. Gemini 2.0 Flash and GPT-4o-mini are the go-to tier.
A customer support system is a good example. Ticket classification (routing a ticket to the right team) → Gemini Flash. Drafting a canned response for a common question → GPT-4o-mini. Resolving a complex escalation with context from 20 previous messages → GPT-4o or Claude Sonnet. A refund dispute that needs policy interpretation → Claude Sonnet or Opus.
Three ways to implement routing
Option A: Rule-based routing
The simplest approach. You classify each request by task type before it hits the LLM layer, then send it to the appropriate model.
def route_request(task_type: str, request: dict) -> str:
    routing_map = {
        "classification": "claude-haiku-3-5",
        "extraction": "gpt-4o-mini",
        "generation": "claude-sonnet-4-5",
        "analysis": "claude-opus-4",
    }
    return routing_map.get(task_type, "claude-sonnet-4-5")
This works well when your system has clearly defined task types. A pipeline with three distinct steps — extract, transform, summarize — can route each step to a different model. The downside is that you need to know the task type upfront, which doesn't work for open-ended user input.
Option B: Meta-model routing
Use a small, cheap model to classify the incoming request, then route based on that classification. The classifier itself is cheap to run (a few hundred milliseconds and a fraction of a cent), and it handles the ambiguity of open-ended input.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_complexity(user_input: str) -> str:
    # Use GPT-4o-mini to classify complexity
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify this request as: simple (factual lookup, short extraction), "
                       "standard (generation, summarization), or complex (reasoning, analysis, debugging). "
                       "Reply with one word only."
        }, {
            "role": "user",
            "content": user_input
        }]
    )
    return response.choices[0].message.content.strip().lower()

def route_to_model(complexity: str) -> str:
    return {
        "simple": "gpt-4o-mini",
        "standard": "claude-sonnet-4-5",
        "complex": "claude-opus-4"
    }.get(complexity, "claude-sonnet-4-5")
The meta-classifier adds about 200-400ms latency, which may or may not matter depending on your use case. For batch processing, it's essentially free.
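Wiring the two functions together only takes a dispatcher that knows which SDK serves which model. The sketch below reuses openai_client from above and assumes an anthropic_client created with the anthropic SDK; the model IDs must match whatever identifiers your providers currently accept.

import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def handle(user_input: str) -> str:
    complexity = classify_complexity(user_input)   # cheap meta-classifier call
    model = route_to_model(complexity)
    if model.startswith("gpt"):
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": user_input}],
        )
        return response.choices[0].message.content
    response = anthropic_client.messages.create(    # Claude models
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.content[0].text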
Option C: Difficulty-based escalation
Try the small model first. If it signals uncertainty — through a low confidence score, a refusal, or an output that fails your validation step — escalate to a larger model automatically.
This is particularly good when you can't classify complexity upfront, and when errors are detectable. A JSON extraction task either produces valid JSON or it doesn't. A classification task that returns "I'm not sure" needs escalation. A structured output that fails schema validation needs another shot with a more capable model.
Structured outputs and JSON mode make this pattern cleaner — you can validate the output programmatically and escalate on failure rather than trying to judge quality subjectively.
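A minimal sketch of the escalate-on-failure loop for a JSON extraction task, reusing openai_client from above; the escalation ladder, the required fields, and the system prompt are placeholders for your own task.

import json

ESCALATION_LADDER = ["gpt-4o-mini", "gpt-4o"]   # cheapest first
REQUIRED_FIELDS = {"name", "email", "order_id"}

def is_valid(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

def extract_with_escalation(document: str) -> dict:
    for model in ESCALATION_LADDER:
        response = openai_client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},  # JSON mode
            messages=[
                {"role": "system", "content": "Extract name, email, and order_id as JSON."},
                {"role": "user", "content": document},
            ],
        )
        output = response.choices[0].message.content
        if is_valid(output):
            return json.loads(output)   # first model that passes validation wins
    raise ValueError("Extraction failed even after escalation")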
Real cost savings, roughly
A customer support platform at 1 million messages per month, averaging 400 input tokens and 150 output tokens:
Without routing (all GPT-4o): ~$2,500/month.
With routing (60% GPT-4o-mini for simple classification/FAQ, 40% GPT-4o for standard generation and complex cases): ~$1,100/month, a reduction of just over half. Push the mini share to 80% and the bill falls to about $620/month, a 75% reduction. At scale, savings like these quickly pay back the engineering time to build the router.
These numbers are rough estimates — actual costs depend heavily on your specific token counts, model versions, and request distribution. But the directional savings are consistent: on mixed workloads, most requests can safely go to cheaper models.
Context caching compounds these savings. If you're routing to a medium or large model with a long system prompt, caching that prompt across requests dramatically reduces costs further.
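One concrete example: Anthropic's prompt caching lets you mark the stable system prompt as cacheable, so repeat requests pay the much lower cache-read rate for that prefix. A minimal sketch, assuming the anthropic SDK and a system prompt long enough to be cached (roughly 1,024+ tokens on Sonnet-class models):

import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # your multi-thousand-token policy and instruction block

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix across requests
    }],
    messages=[{"role": "user", "content": "..."}],  # the per-request part stays outside the cache
)

OpenAI applies prompt caching automatically once the prompt prefix passes roughly 1,024 tokens, so there the equivalent move is simply keeping the stable part of the prompt at the front.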
Tools that help
LiteLLM is the most practical starting point. It gives you a unified interface to 100+ models and lets you define routing rules without managing each provider's SDK separately. You can set fallbacks, load balance across models, and log costs out of the box.
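A minimal sketch of what that looks like, assuming the litellm package; the group names, model identifiers, and fallback chain are illustrative and follow LiteLLM's provider/model naming.

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "smart", "litellm_params": {"model": "anthropic/claude-sonnet-4-5"}},
    ],
    fallbacks=[{"cheap": ["smart"]}],  # if the cheap tier errors, retry on the bigger model
)

response = router.completion(
    model="cheap",
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
print(response.choices[0].message.content)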
RouteLLM (open source, from LMSYS) uses trained classifiers specifically for routing between strong and weak models. It's more sophisticated than a hand-rolled classifier but requires setup and fine-tuning.
Custom routing logic is often the right answer. A few dozen lines of Python using the task type and a simple heuristic (input length, presence of certain keywords, step type in your pipeline) beat a complex framework for most real systems.
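Something along these lines is often enough; the keyword list and length thresholds are assumptions to tune against your own traffic.

# Long inputs and "reasoning" vocabulary go to the big model; everything else stays cheap.
ESCALATION_KEYWORDS = ("debug", "root cause", "compare", "trade-off", "explain why")

def heuristic_route(user_input: str) -> str:
    text = user_input.lower()
    if len(text) > 4000 or any(kw in text for kw in ESCALATION_KEYWORDS):
        return "claude-opus-4"       # complex: long context or reasoning cues
    if len(text) > 500:
        return "claude-sonnet-4-5"   # standard generation
    return "gpt-4o-mini"             # short, simple requests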
When not to route
Routing introduces complexity. Sometimes that trade-off isn't worth it.
If consistency matters more than cost, use one model. If your product's quality signature is tied to a specific model's output characteristics, mixing models will introduce variance users notice.
If your volume is low, the cost savings don't justify the engineering overhead. Below roughly 100,000 requests per month, the savings from routing are usually in the hundreds of dollars — rarely worth building a routing layer for.
If you're still figuring out what your system needs to do, don't optimize yet. Routing is an optimization for a known, stable workload. Premature routing on a product you're still shaping creates refactoring work.
Monitoring routing quality
A routing system that silently sends the wrong requests to the wrong models is worse than no routing — you get bad outputs and don't know why.
Log which model handled each request alongside the outcome (did the user ask for a correction? did validation fail? did the task complete?). Track escalation rate if you're using difficulty-based routing. A sudden spike in escalations means either your traffic is changing or your classifier is degrading.
Set up quality sampling: randomly route 2-5% of "simple" requests to your medium model and compare outputs. If quality differences are negligible, you're routing correctly. If there's meaningful degradation, your classifier is over-routing to the cheap tier.
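A sketch of that shadow-sampling loop, reusing openai_client from earlier; the sample rate, the comparison model, and the JSONL sink are placeholders for your own evaluation setup.

import json
import random

SAMPLE_RATE = 0.03  # shadow-route ~3% of "simple" traffic to the medium tier

def call_model(model: str, user_input: str) -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

def handle_simple_request(user_input: str) -> str:
    primary = call_model("gpt-4o-mini", user_input)
    if random.random() < SAMPLE_RATE:
        reference = call_model("gpt-4o", user_input)   # the tier you'd otherwise use
        with open("routing_samples.jsonl", "a") as f:  # review these pairs offline
            f.write(json.dumps({"input": user_input,
                                "cheap": primary,
                                "reference": reference}) + "\n")
    return primary  # the user always gets the cheap model's answer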
The goal isn't just lower costs — it's the same quality at lower costs. Keep measuring both.



