Most teams pick one model and use it for everything. The logic is understandable: pick something capable, ship the feature, move on. But that choice compounds fast. Claude Opus 4.7 costs roughly 60× more per token than Claude Haiku 4.5. If you're running a SaaS pushing 10 million tokens a day, you're spending around $4,200/day using Opus across the board. Tier routing can bring that number down to roughly $310/day, with the same quality on the vast majority of tasks.
That gap is the entire business case for model routing.
The three tiers and what belongs in each
Not all queries are equal. A yes/no spam-classification call doesn't need the same model as a "design me a fault-tolerant event sourcing architecture" request. Tiering works like this:
Tier 1 — Nano: anthropic/claude-haiku-4-5, openai/gpt-4o-mini, google/gemini-2.0-flash
This tier handles anything that's structurally simple. Intent classification ("is this a billing question or a technical question?"), PII extraction, yes/no decisions, simple factual lookups, JSON reformatting, spam detection, language detection. These tasks don't need deep reasoning — they need speed and low cost. Latency under 500ms is typical here.
Tier 2 — Mid: anthropic/claude-sonnet-4-6, openai/gpt-4o, google/gemini-2.5-pro
The workhorse tier. Code review, content generation, multi-step reasoning, customer support responses, summarization of long documents, moderately complex SQL generation. The vast majority of production use cases belong here — models are capable, latency is acceptable (1–4 seconds), and cost is reasonable.
Tier 3 — Frontier: anthropic/claude-opus-4-7, openai/o3, google/gemini-2.5-pro with thinking enabled
Reserve this for requests where quality is the only thing that matters. Architecture decisions with long-term consequences, complex multi-file debugging where the root cause isn't obvious, adversarial reasoning, novel research synthesis. You're paying 10–60× more per token — that needs to be justified by the task.
The cost math
Here's a concrete scenario: 10 million tokens per day routed across tiers vs. a flat strategy.
All Tier 3 (Opus): ~$4,200/day
All Tier 2 (Sonnet): ~$300/day
80% Tier 1 / 15% Tier 2 / 5% Tier 3: ~$310/day
Those numbers assume mixed input/output. The exact figures depend on your token ratios, but the order of magnitude is accurate. The 80/15/5 split is realistic for a typical SaaS — most requests are classifications, lookups, or short-form responses that never needed Sonnet in the first place.
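The arithmetic is easy to sanity-check. The per-million-token rates below are the blended figures implied by the flat-rate numbers above, not real price-sheet values; substitute your provider's current pricing:

DAILY_TOKENS = 10_000_000

# Blended $/1M tokens implied by the flat-rate figures above (mixed
# input/output). Placeholders, not published prices.
PRICE_PER_M = {"nano": 7.0, "mid": 30.0, "frontier": 420.0}
MIX = {"nano": 0.80, "mid": 0.15, "frontier": 0.05}

daily_cost = sum(
    share * DAILY_TOKENS / 1_000_000 * PRICE_PER_M[tier]
    for tier, share in MIX.items()
)
print(f"${daily_cost:,.0f}/day")  # ~$311/day for the 80/15/5 mix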
Building the router with LiteLLM
LiteLLM gives you a unified router interface across providers. Pair it with aicredits.in as the backend and you have one API key that covers all three tiers without juggling separate Anthropic, OpenAI, and Google billing dashboards.
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "nano",
            "litellm_params": {
                "model": "openai/gpt-4o-mini",
                "api_key": os.environ["AICREDITS_API_KEY"],
                "api_base": "https://api.aicredits.in/v1"
            }
        },
        {
            "model_name": "mid",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["AICREDITS_API_KEY"],
                "api_base": "https://api.aicredits.in/v1"
            }
        },
        {
            "model_name": "frontier",
            "litellm_params": {
                "model": "anthropic/claude-opus-4-7",
                "api_key": os.environ["AICREDITS_API_KEY"],
                "api_base": "https://api.aicredits.in/v1"
            }
        }
    ],
    # On error or timeout, escalate one tier up.
    fallbacks=[{"nano": ["mid"]}, {"mid": ["frontier"]}]
)
The fallbacks config is what makes this production-safe. If the nano tier returns an error or times out, it automatically retries on mid. If mid fails, it escalates to frontier. You don't drop requests — you just pay a bit more for that call.
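A call against the router looks like any other completion; the fallback chain is transparent to the caller. The spam prompt here is just an illustration:

# If "nano" errors or times out, LiteLLM retries on "mid" behind the
# scenes; the caller only sees the final response.
resp = router.completion(
    model="nano",
    messages=[{"role": "user", "content": "Yes or no: is this email spam? ..."}]
)
print(resp.choices[0].message.content)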
The classifier-based router
With the router defined, you need something to decide which tier each query goes to. The simplest approach: use a nano model as the classifier itself.
def classify_query(query: str) -> str:
    resp = router.completion(
        model="nano",
        messages=[{
            "role": "user",
            "content": f"""Classify this query into one tier:
- nano: simple factual lookup, yes/no, extraction, formatting
- mid: reasoning, analysis, generation, code review
- frontier: novel architecture, complex multi-step, adversarial
Query: {query}
Reply with exactly one word: nano, mid, or frontier."""
        }]
    )
    tier = resp.choices[0].message.content.strip().lower()
    # Safety valve: anything unexpected falls back to mid, never nano.
    return tier if tier in ("nano", "mid", "frontier") else "mid"

def route(query: str) -> str:
    tier = classify_query(query)
    return router.completion(
        model=tier,
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
The safety valve at the end of classify_query is important: if the model returns anything unexpected, default to mid. Not nano. You'd rather spend a bit more than give a bad answer.
aicredits.in: one key for everything
Indian developers: access all models via AICredits.in — INR billing, UPI top-up, single API key for Claude, GPT-4o, Gemini and more.
This matters specifically for the routing use case. Managing three separate international billing accounts — Anthropic, OpenAI, Google — means three international payment methods, three cost dashboards, three sets of rate limits to track. With aicredits.in as the backend, the LiteLLM router above works with a single AICREDITS_API_KEY regardless of which provider a request routes to. The unified billing view also makes it easier to see actual per-tier costs across providers.
The standard setup:
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ["AICREDITS_API_KEY"],
    base_url="https://api.aicredits.in/v1"
)
Models: anthropic/claude-sonnet-4-6, openai/gpt-4o, google/gemini-2.5-pro — use these exact strings.
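A quick sanity check against that client, using one of the strings above (the prompt is only an illustration):

resp = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-6",  # exact model string from the list above
    messages=[{"role": "user", "content": "Reply with the single word: ok"}]
)
print(resp.choices[0].message.content)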
OpenRouter as an alternative
If you want provider-level routing without writing a classifier, OpenRouter has a different approach. You can route by model capability tags — :nitro suffix for speed-optimized instances, specific open-source models like meta-llama/llama-3.3-70b-instruct for cost-sensitive tasks, frontier models for hard tasks.
OpenRouter's advantage is a built-in model leaderboard and the ability to define fallbacks declaratively. The tradeoff: you lose the fine-grained programmatic control of LiteLLM, and for India-based teams, payment via INR is simpler through aicredits.in.
Both are worth knowing. For a system where you're writing routing logic yourself and want maximum control, LiteLLM wins. For a system where you want pre-built routing heuristics and open-model access, OpenRouter is better.
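For reference, a minimal sketch of OpenRouter's declarative fallbacks through its OpenAI-compatible endpoint. The models parameter is OpenRouter's documented fallback list; verify the current parameter name and model slugs against their docs before relying on this:

from openai import OpenAI
import os

or_client = OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1"
)

resp = or_client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",  # cost-sensitive primary
    extra_body={
        # Fallback chain, tried in order if the primary fails.
        "models": ["meta-llama/llama-3.3-70b-instruct", "openai/gpt-4o"]
    },
    messages=[{"role": "user", "content": "Summarize: ..."}]
)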
Adding few-shot examples to the classifier
Zero-shot classification ("here are three tiers, which is this?") is a reasonable starting point. But after running the system for a week, you'll have real examples of misclassified queries. That's your training data.
Few-shot classifier with real examples from your logs:
CLASSIFIER_SYSTEM = """You classify user queries into routing tiers for an LLM system.
nano: simple, single-step, predictable output
- "What timezone is Mumbai in?"
- "Extract all email addresses from this text"
- "Is this sentence positive or negative?"
mid: requires reasoning, generation, or analysis
- "Review this Python function for bugs and edge cases"
- "Write a product description for this camera given these specs"
- "Explain why this SQL query is slow"
frontier: complex, multi-step, high-stakes
- "Design a database schema for a multi-tenant SaaS with audit logging requirements"
- "This microservice is producing intermittent 503s — here are 200 lines of logs"
- "Analyze whether this contract clause creates liability exposure"
Reply with exactly one word: nano, mid, or frontier."""
def classify_query(query: str) -> str:
    resp = router.completion(
        model="nano",
        messages=[
            {"role": "system", "content": CLASSIFIER_SYSTEM},
            {"role": "user", "content": query}
        ]
    )
    tier = resp.choices[0].message.content.strip().lower()
    return tier if tier in ("nano", "mid", "frontier") else "mid"
Update the few-shot examples every few weeks with real misclassification cases from your logs. After three iterations, classification accuracy is typically 90%+ on your specific use case. A generic classifier without domain-specific examples usually lands around 70–75%.
When routing backfires
Three failure modes to know before you deploy this.
Classifier latency kills the gain. A nano classification call takes 200–400ms. If your mid tier responds in 600ms, the combined classification + routed call is now 800–1,000ms — slower than just calling mid directly. The overhead only becomes negligible once mid-tier latency reaches roughly 1.5 seconds, at which point an extra 200–400ms is a small fraction of the total. If your tasks tend to be fast, skip the classifier and use rule-based routing instead (keyword matching, request metadata, user plan tier).
Misclassification compounds. The classifier routes a "simple" question to nano. The question turns out to require multi-step reasoning. Nano gives a bad answer. Now you've paid twice (classifier + routed call) and returned garbage. Fix this by making the classifier conservative: when ambiguous, default to mid, not nano. The marginal cost difference between mid and nano is small compared to the cost of a bad answer reaching a user.
Hard tier boundaries miss the middle. A strict nano/mid/frontier split works, but in practice many queries are borderline. Consider a complexity score instead — route anything scoring below 0.3 to nano, above 0.7 to frontier, and everything else to mid. You can generate this score in the same classifier call:
"Reply with JSON: {\"tier\": \"nano|mid|frontier\", \"confidence\": 0.0-1.0}"
Then use confidence to decide whether to escalate borderline classifications.
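A minimal sketch of that escalation, with an illustrative threshold (0.7 here) rather than a fixed value:

import json

# One-tier escalation map for borderline classifications.
ESCALATE = {"nano": "mid", "mid": "frontier"}

def classify_with_confidence(query: str) -> str:
    prompt = (
        'Classify this query into a routing tier. '
        'Reply with JSON: {"tier": "nano|mid|frontier", "confidence": 0.0-1.0}'
        f"\nQuery: {query}"
    )
    resp = router.completion(
        model="nano",
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        result = json.loads(resp.choices[0].message.content)
        tier = result["tier"]
        # Low confidence: escalate one tier rather than risk an
        # underpowered answer.
        if result.get("confidence", 0.0) < 0.7:
            tier = ESCALATE.get(tier, tier)
    except (json.JSONDecodeError, KeyError, TypeError):
        tier = "mid"  # same safety valve as the plain classifier
    return tier if tier in ("nano", "mid", "frontier") else "mid"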
Monitoring the router in production
Log three things per request: tier_selected, tokens_used, cost_usd. After two weeks, run a breakdown (a minimal logging sketch follows the list):
- What percentage of requests land in each tier?
- What's the average cost per tier?
- Which tier has the highest error rate?
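One way to capture those three fields per request, assuming LiteLLM's completion_cost helper for the dollar figure (it may not recognize custom or proxied model strings, so verify against your setup or compute cost from resp.usage and your provider's rates):

import json
import time
import litellm

def route_logged(query: str, log_path: str = "router_log.jsonl") -> str:
    tier = classify_query(query)
    start = time.time()
    resp = router.completion(
        model=tier,
        messages=[{"role": "user", "content": query}]
    )
    record = {
        "tier_selected": tier,
        "tokens_used": resp.usage.total_tokens,
        # LiteLLM's built-in price lookup; treat as a sketch.
        "cost_usd": litellm.completion_cost(completion_response=resp),
        "latency_s": round(time.time() - start, 3)
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return resp.choices[0].message.content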
The common failure you'll catch: too many requests landing in frontier. This usually means the classifier prompt is too vague, or your application is sending queries that are genuinely hard to classify (long, mixed-intent messages). Tighten the classifier prompt with concrete examples from your actual traffic — few-shot classification is more accurate than zero-shot for this use case.
A second thing to watch: whether nano-tier requests are actually a net win once you account for classifier overhead. If nano's average response latency plus the classifier's latency exceeds mid's latency, you've built a slower system for a modest saving. Real cost optimization sometimes means pruning the nano tier entirely and routing everything as mid/frontier.
Combining routing with caching
Routing reduces cost by matching model to complexity. Prompt caching reduces cost by reusing expensive computation across similar requests. The two strategies stack.
For a multi-tier system: cache at the tier level. Repeated classification queries hit the nano cache. Repeated mid-tier requests with the same system prompt hit the mid cache. See the context caching guide for implementation details — the savings from caching frequently-used system prompts can exceed the savings from routing on high-traffic endpoints.
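As an illustration, an exact-match response cache keyed per tier. This is application-level caching, separate from (and stacking with) provider-side prompt caching; a production system would use Redis or similar with TTLs rather than a process-local dict:

import hashlib

# In-memory, exact-match cache: {(tier, query_hash): response_text}.
_response_cache: dict[tuple[str, str], str] = {}

def route_cached(query: str) -> str:
    tier = classify_query(query)
    key = (tier, hashlib.sha256(query.encode()).hexdigest())
    if key not in _response_cache:
        _response_cache[key] = router.completion(
            model=tier,
            messages=[{"role": "user", "content": query}]
        ).choices[0].message.content
    return _response_cache[key]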
Rule-based routing as a complement
Not everything needs an LLM classifier. Some routing decisions are deterministic and should stay that way.
Route by user tier: paying enterprise customers might always get mid or frontier. Free-tier users get nano. This is a business decision, not a complexity decision, and it's better encoded as a rule than delegated to a classifier.
Route by endpoint: your /api/search/suggest autocomplete endpoint should never hit frontier. Your /api/review/architecture endpoint probably always should. Hardcode these. Don't let the classifier override endpoint-level intent.
Route by token count: requests with more than 4,000 input tokens are probably complex enough to warrant mid at minimum. Add a simple check before calling the classifier at all:
# ALWAYS_FRONTIER and ALWAYS_NANO are endpoint allowlists you define,
# e.g. {"/api/review/architecture"} and {"/api/search/suggest"}.
def get_tier(query: str, endpoint: str | None = None) -> str:
    if endpoint in ALWAYS_FRONTIER:
        return "frontier"
    if endpoint in ALWAYS_NANO:
        return "nano"
    # ~4,000 tokens is roughly 3,000 English words.
    if len(query.split()) > 3000:
        return "mid"
    return classify_query(query)
Rule-based checks are free — zero latency, zero tokens. Use them to handle the obvious cases before the classifier touches anything.
What to do this week
Start with a cost audit. Pull the last 7 days of API spend, break it down by endpoint if you can. Find the top 3 endpoints by total token cost. For each one, ask: what percentage of these requests are actually complex? If the answer is "less than 30%", that endpoint is a routing candidate.
Then build the classifier for that one endpoint first. Not a general-purpose router — a specific classifier trained on that endpoint's actual traffic patterns. Add few-shot examples from your logs. Run it in shadow mode (log the tier decision without actually routing) for a week before switching it live.
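Shadow mode can be a thin wrapper: classify and log the proposed tier, but keep serving every request from your current default (assumed here to be mid):

import json

def route_shadow(query: str) -> str:
    # Record what the router would have done, without acting on it.
    proposed = classify_query(query)
    with open("shadow_log.jsonl", "a") as f:
        f.write(json.dumps({"proposed_tier": proposed}) + "\n")
    # Every request still goes to the current default tier.
    return router.completion(
        model="mid",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content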
That one endpoint will tell you whether the approach works for your system before you invest in infrastructure for all of them.



