Most teams pick a model that works for their hardest use cases and use it for everything. It's the path of least resistance. It also means you're paying GPT-4o prices for tasks that GPT-4o-mini handles just as well — and waiting 3 seconds for responses that could come back in 300ms.
LLM routing is the practice of matching each request to the model best suited for it. Done right, it cuts costs by 60–80% on mixed workloads without any quality degradation on the tasks that matter.
Why one model doesn't fit all
The cost difference between models is enormous and frequently underestimated.
GPT-4o runs around $2.50 per million input tokens and $10 per million output tokens. GPT-4o-mini is $0.15 and $0.60, roughly 17x cheaper on both. Claude Opus 4 costs $15/$75 per million tokens. Haiku 3.5 costs $0.80/$4. Gemini 1.5 Pro is about $1.25/$5. Gemini 2.0 Flash is $0.10/$0.40.
For a system handling 1 million messages per month at an average of 500 input tokens and 200 output tokens each, using GPT-4o across the board runs about $3,250/month. Routing 70% of those requests to GPT-4o-mini drops the bill to roughly $1,100/month, without touching the 30% of complex tasks that actually need the larger model.
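If you want to sanity-check numbers like these against your own traffic, a back-of-the-envelope calculation in a few lines of Python is enough. The prices below are the ones quoted above and the split is illustrative; plug in your own.

# Rough monthly cost in USD. Prices are per million tokens; check current pricing pages.
PRICES = {
    "gpt-4o":      {"in": 2.50, "out": 10.00},
    "gpt-4o-mini": {"in": 0.15, "out": 0.60},
}

def monthly_cost(requests, in_tokens, out_tokens, split):
    """split maps model name -> fraction of traffic; fractions should sum to 1."""
    return sum(
        requests * share * (in_tokens * PRICES[m]["in"] + out_tokens * PRICES[m]["out"]) / 1e6
        for m, share in split.items()
    )

print(monthly_cost(1_000_000, 500, 200, {"gpt-4o": 1.0}))                       # ~3250
print(monthly_cost(1_000_000, 500, 200, {"gpt-4o-mini": 0.7, "gpt-4o": 0.3}))   # ~1110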
Latency matters too. GPT-4o-mini and Gemini Flash regularly return responses in 300-600ms. GPT-4o and Claude Sonnet average 2-5 seconds. For a real-time chat interface, that gap is the difference between a fluid conversation and something that feels like it's thinking.
The routing decision framework
Three axes determine which model a request should go to:
Task complexity. Does this require multi-step reasoning, handling ambiguous instructions, or synthesizing information from many sources? Or is it a clearly defined extraction task with a predictable output format? Complex reasoning needs a capable model. Simple classification or reformatting doesn't.
Latency requirement. Is this a user-facing interaction where response time affects perceived quality? Or a background batch job where nobody's watching the clock? User-facing usually means choosing speed. Batch jobs can absorb slower, more capable models.
Cost sensitivity. How many of these requests are you running per day? A researcher running 50 deep analysis tasks can absorb high model costs. A SaaS product running 500,000 classification requests cannot.
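Combined, the three axes reduce to a small decision function. A minimal sketch, where the model names and the volume cutoff are illustrative rather than fixed rules:

def pick_model(complexity: str, user_facing: bool, daily_requests: int) -> str:
    """Combine the three routing axes; thresholds and models are illustrative."""
    if complexity == "complex":
        return "claude-opus-4"       # quality drives the business outcome; pay for it
    if complexity == "simple" or (user_facing and daily_requests > 100_000):
        return "gpt-4o-mini"         # fast and cheap for the bulk of traffic
    return "claude-sonnet-4-5"       # standard generation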
Practical routing rules with examples
The routing map I use for most production systems:
Simple extraction and classification → small model. Extracting structured fields from text, categorizing incoming messages, translating short snippets, generating short summaries of known documents. Haiku 3.5, GPT-4o-mini, or Gemini 2.0 Flash handle these without breaking a sweat. These tasks are usually well-defined and forgiving of minor errors.
Standard generation → medium model. Writing emails, answering FAQs, summarizing lengthy documents, generating code for moderately complex functions, structured Q&A with a knowledge base. Claude Sonnet or GPT-4o are the right tier here — capable enough for nuanced tasks, not overkill for most.
Complex reasoning → large model. Multi-step analysis, debugging subtle code issues, synthesizing conflicting information, tasks where output quality is directly tied to business outcomes. This is where Claude Opus or o3 earns its price tag.
Speed-critical → fastest available. Real-time autocomplete, streaming chat where perceived latency matters most, interactive coding assistants. Gemini 2.0 Flash and GPT-4o-mini are the go-to tier.
A customer support system is a good example. Ticket classification (routing a ticket to the right team) → Gemini Flash. Drafting a canned response for a common question → GPT-4o-mini. Resolving a complex escalation with context from 20 previous messages → GPT-4o or Claude Sonnet. A refund dispute that needs policy interpretation → Claude Sonnet or Opus.
Three ways to implement routing
Option A: Rule-based routing
The simplest approach. You classify each request by task type before it hits the LLM layer, then send it to the appropriate model.
def route_request(task_type: str, request: dict) -> str:
    routing_map = {
        "classification": "claude-haiku-3-5",
        "extraction": "gpt-4o-mini",
        "generation": "claude-sonnet-4-5",
        "analysis": "claude-opus-4",
    }
    return routing_map.get(task_type, "claude-sonnet-4-5")
This works well when your system has clearly defined task types. A pipeline with three distinct steps — extract, transform, summarize — can route each step to a different model. The downside is that you need to know the task type upfront, which doesn't work for open-ended user input.
Option B: Meta-model routing
Use a small, cheap model to classify the incoming request, then route based on that classification. The classifier itself is cheap to run (a few hundred milliseconds and a fraction of a cent), and it handles the ambiguity of open-ended input.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_complexity(user_input: str) -> str:
    # Use GPT-4o-mini to classify complexity
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Classify this request as: simple (factual lookup, short extraction), "
                       "standard (generation, summarization), or complex (reasoning, analysis, debugging). "
                       "Reply with one word only."
        }, {
            "role": "user",
            "content": user_input
        }]
    )
    return response.choices[0].message.content.strip().lower()

def route_to_model(complexity: str) -> str:
    return {
        "simple": "gpt-4o-mini",
        "standard": "claude-sonnet-4-5",
        "complex": "claude-opus-4"
    }.get(complexity, "claude-sonnet-4-5")
The meta-classifier adds about 200-400ms latency, which may or may not matter depending on your use case. For batch processing, it's essentially free.
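Wiring the two functions together only takes a dispatcher that knows which SDK serves which model. The sketch below reuses openai_client from above and assumes an anthropic_client created with the anthropic SDK; the model IDs must match whatever identifiers your providers currently accept.

import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def handle(user_input: str) -> str:
    complexity = classify_complexity(user_input)   # cheap meta-classifier call
    model = route_to_model(complexity)
    if model.startswith("gpt"):
        response = openai_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": user_input}],
        )
        return response.choices[0].message.content
    response = anthropic_client.messages.create(    # Claude models
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.content[0].text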
Option C: Difficulty-based escalation
Try the small model first. If it signals uncertainty — through a low confidence score, a refusal, or an output that fails your validation step — escalate to a larger model automatically.
This is particularly good when you can't classify complexity upfront, and when errors are detectable. A JSON extraction task either produces valid JSON or it doesn't. A classification task that returns "I'm not sure" needs escalation. A structured output that fails schema validation needs another shot with a more capable model.
Structured outputs and JSON mode make this pattern cleaner — you can validate the output programmatically and escalate on failure rather than trying to judge quality subjectively.
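A minimal sketch of the escalate-on-failure loop for a JSON extraction task, reusing openai_client from above; the escalation ladder, the required fields, and the system prompt are placeholders for your own task.

import json

ESCALATION_LADDER = ["gpt-4o-mini", "gpt-4o"]   # cheapest first
REQUIRED_FIELDS = {"name", "email", "order_id"}

def is_valid(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

def extract_with_escalation(document: str) -> dict:
    for model in ESCALATION_LADDER:
        response = openai_client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},  # JSON mode
            messages=[
                {"role": "system", "content": "Extract name, email, and order_id as JSON."},
                {"role": "user", "content": document},
            ],
        )
        output = response.choices[0].message.content
        if is_valid(output):
            return json.loads(output)   # first model that passes validation wins
    raise ValueError("Extraction failed even after escalation")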
Real cost savings, roughly
A customer support platform at 1 million messages per month, averaging 400 input tokens and 150 output tokens:
Without routing (all GPT-4o): ~$2,500/month.
With routing (60% GPT-4o-mini for simple classification/FAQ, 40% GPT-4o for standard generation and complex cases): ~$1,100/month, a reduction of just over half. Push the mini share to 80% and the bill falls to about $620/month, a 75% reduction. At scale, savings like these quickly pay back the engineering time to build the router.
These numbers are rough estimates — actual costs depend heavily on your specific token counts, model versions, and request distribution. But the directional savings are consistent: on mixed workloads, most requests can safely go to cheaper models.
Context caching compounds these savings. If you're routing to a medium or large model with a long system prompt, caching that prompt across requests dramatically reduces costs further.
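One concrete example: Anthropic's prompt caching lets you mark the stable system prompt as cacheable, so repeat requests pay the much lower cache-read rate for that prefix. A minimal sketch, assuming the anthropic SDK and a system prompt long enough to be cached (roughly 1,024+ tokens on Sonnet-class models):

import anthropic

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = "..."  # your multi-thousand-token policy and instruction block

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # cache this prefix across requests
    }],
    messages=[{"role": "user", "content": "..."}],  # the per-request part stays outside the cache
)

OpenAI applies prompt caching automatically once the prompt prefix passes roughly 1,024 tokens, so there the equivalent move is simply keeping the stable part of the prompt at the front.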
Tools that help
LiteLLM is the most practical starting point. It gives you a unified interface to 100+ models and lets you define routing rules without managing each provider's SDK separately. You can set fallbacks, load balance across models, and log costs out of the box.
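A minimal sketch of what that looks like, assuming the litellm package; the group names, model identifiers, and fallback chain are illustrative and follow LiteLLM's provider/model naming.

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "smart", "litellm_params": {"model": "anthropic/claude-sonnet-4-5"}},
    ],
    fallbacks=[{"cheap": ["smart"]}],  # if the cheap tier errors, retry on the bigger model
)

response = router.completion(
    model="cheap",
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
)
print(response.choices[0].message.content)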
RouteLLM (open source, from LMSYS) uses trained classifiers specifically for routing between strong and weak models. It's more sophisticated than a hand-rolled classifier but requires setup and fine-tuning.
Custom routing logic is often the right answer. A few dozen lines of Python using the task type and a simple heuristic (input length, presence of certain keywords, step type in your pipeline) beat a complex framework for most real systems.
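Something along these lines is often enough; the keyword list and length thresholds are assumptions to tune against your own traffic.

# Long inputs and "reasoning" vocabulary go to the big model; everything else stays cheap.
ESCALATION_KEYWORDS = ("debug", "root cause", "compare", "trade-off", "explain why")

def heuristic_route(user_input: str) -> str:
    text = user_input.lower()
    if len(text) > 4000 or any(kw in text for kw in ESCALATION_KEYWORDS):
        return "claude-opus-4"       # complex: long context or reasoning cues
    if len(text) > 500:
        return "claude-sonnet-4-5"   # standard generation
    return "gpt-4o-mini"             # short, simple requests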
When not to route
Routing introduces complexity. Sometimes that trade-off isn't worth it.
If consistency matters more than cost, use one model. If your product's quality signature is tied to a specific model's output characteristics, mixing models will introduce variance users notice.
If your volume is low, the cost savings don't justify the engineering overhead. Below roughly 100,000 requests per month, the savings from routing are usually in the hundreds of dollars — rarely worth building a routing layer for.
If you're still figuring out what your system needs to do, don't optimize yet. Routing is an optimization for a known, stable workload. Premature routing on a product you're still shaping creates refactoring work.
Monitoring routing quality
A routing system that silently sends the wrong requests to the wrong models is worse than no routing — you get bad outputs and don't know why.
Log which model handled each request alongside the outcome (did the user ask for a correction? did validation fail? did the task complete?). Track escalation rate if you're using difficulty-based routing. A sudden spike in escalations means either your traffic is changing or your classifier is degrading.
Set up quality sampling: randomly route 2-5% of "simple" requests to your medium model and compare outputs. If quality differences are negligible, you're routing correctly. If there's meaningful degradation, your classifier is over-routing to the cheap tier.
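A sketch of that shadow-sampling loop, reusing openai_client from earlier; the sample rate, the comparison model, and the JSONL sink are placeholders for your own evaluation setup.

import json
import random

SAMPLE_RATE = 0.03  # shadow-route ~3% of "simple" traffic to the medium tier

def call_model(model: str, user_input: str) -> str:
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_input}],
    )
    return response.choices[0].message.content

def handle_simple_request(user_input: str) -> str:
    primary = call_model("gpt-4o-mini", user_input)
    if random.random() < SAMPLE_RATE:
        reference = call_model("gpt-4o", user_input)   # the tier you'd otherwise use
        with open("routing_samples.jsonl", "a") as f:  # review these pairs offline
            f.write(json.dumps({"input": user_input,
                                "cheap": primary,
                                "reference": reference}) + "\n")
    return primary  # the user always gets the cheap model's answer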
The goal isn't just lower costs — it's the same quality at lower costs. Keep measuring both.



