LLM Routing: What It Is and How to Cut Costs With It

Does this request actually need your most expensive model? Semantic routing answers that question automatically — before the expensive model ever sees it.

Shubham Yadav

Machine Learning Researcher

June 8, 2026 · 10 min read

There's a version of LLM cost optimization that doesn't require you to touch your prompts, compress your context, or negotiate with your provider. It just requires asking one question before every API call: does this request actually need your most expensive model?

Most of the time, the answer is no. Semantic routing is how you act on that answer automatically.

Quick answer: LLM semantic routing is a technique that classifies incoming queries by complexity or intent before they reach a model, then directs each query to the cheapest model capable of handling it well. Simple queries go to a fast, cheap model. Complex ones escalate to a powerful model. Teams that implement routing typically see 40–70% reductions in blended API cost with no quality degradation on well-routed requests.

What Is LLM Semantic Routing?

LLM semantic routing is the practice of analyzing an incoming query and directing it to the most appropriate model — before the expensive model ever sees it.

Instead of sending every request to GPT-4o or Claude Sonnet by default, a semantic router evaluates what the query is asking for, estimates how hard it is, and routes accordingly. Simple queries go to a fast, cheap model. Complex ones escalate to the powerful model. Users get the same quality of answer either way, because the powerful model still runs whenever a query actually needs it.

The word "semantic" is important. This isn't rule-based routing where you match keywords and write an if-else tree. It's meaning-based — the router understands the intent and complexity of the query, not just its surface words. "What's 15% of 80?" and "Can you help me calculate the tip?" are phrased differently but are the same kind of request. A semantic router treats them the same way.

At its core, the routing decision comes from a lightweight classifier that reads the incoming query and outputs a routing decision. That classifier runs fast and costs almost nothing. The savings come from the requests it correctly identifies as simple, which never reach the expensive model at all.

How Does LLM Semantic Routing Reduce API Costs?

Semantic routing reduces LLM API costs by ensuring that only the minority of queries that genuinely require frontier-level reasoning are billed at frontier-model rates.

The distribution of queries in most production applications is heavily skewed toward simple requests. Users ask factual questions, request short summaries, ask for reformatting, or make conversational small talk. These tasks don't require frontier-level reasoning. A model like GPT-4o Mini, Claude Haiku, or Gemini Flash typically handles them about as well as GPT-4o, at a fraction of the cost.

The genuinely hard queries — multi-step reasoning, ambiguous instructions, tasks requiring deep domain knowledge, long-form generation with high quality requirements — are a minority of production traffic. Often 20–30%, sometimes less. Routing means you pay premium prices for that 20–30% while the rest runs cheaply. The blended cost drops substantially. Quality on hard queries stays exactly the same because those queries still reach the best model.
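To make the arithmetic concrete, here is a minimal sketch of the blended-cost calculation. The prices are hypothetical relative units (a cheap model at one tenth of the frontier price), not real provider rates, and the calculation deliberately ignores the classifier call and any fallback re-runs.

```python
def blended_cost(hard_share: float, cheap_price: float, frontier_price: float) -> float:
    """Average per-request cost when only `hard_share` of traffic uses the frontier model."""
    if not 0.0 <= hard_share <= 1.0:
        raise ValueError("hard_share must be between 0 and 1")
    return hard_share * frontier_price + (1.0 - hard_share) * cheap_price


# Hypothetical relative prices, for illustration only.
FRONTIER_PRICE = 1.0
CHEAP_PRICE = 0.1

baseline = blended_cost(1.0, CHEAP_PRICE, FRONTIER_PRICE)  # everything on the frontier model
routed = blended_cost(0.25, CHEAP_PRICE, FRONTIER_PRICE)   # 25% of traffic is hard
savings = 1.0 - routed / baseline

print(f"routed cost: {routed:.3f}, savings: {savings:.1%}")
# routed cost: 0.325, savings: 67.5%
```

Under these assumed numbers the saving depends on only two things: the share of hard traffic and the price gap between tiers. Plug in your own measured share and your provider's actual prices to get a realistic estimate.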

The error case, a hard query incorrectly routed to the cheap model, is recoverable. Most routing implementations include a fallback: if the cheap model's output fails a quality check, the request escalates automatically. Misroutes that fail the check are caught before they reach the user, so the fallback is only as reliable as the check behind it.

What Are the Three Types of LLM Semantic Routing?

There are three main approaches to LLM semantic routing, each trading off simplicity against accuracy and engineering overhead.

1. Embedding-based routing: the most common approach. You embed the incoming query using a fast embedding model and compare it against a set of labeled example queries. The router finds the closest match and assigns the route. Works well when query types are well-defined and you have examples to compare against. Fast, cheap to run, and straightforward to implement with sentence-transformers or OpenAI's embedding API.

2. LLM-as-classifier routing: uses a small, cheap language model to read the query and output a routing decision. Prompt it with: "Classify this query as simple, moderate, or complex. Respond with one word." This handles novel phrasing and edge cases better than embedding similarity, since the classifier can reason about the query rather than just pattern-match against examples. The tradeoff is a small added latency for the classification call, though for a cheap model making a one-token decision that cost is negligible.

3. Rule-based hybrid routing: combines a fast heuristic layer (query length, certain patterns, conversation depth) with a semantic layer for ambiguous cases. Often the most practical approach in production: obvious cases resolve instantly (a two-word query is almost certainly simple, a 500-word multi-part question is almost certainly complex) and the classifier only fires on ambiguous queries.
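The hybrid approach can be sketched in a few lines. This is an illustration, not a reference implementation: the thresholds (three words, 500 words, two multi-step cues) and the cue list are assumptions you would tune on your own traffic, and `classify` stands for whatever semantic classifier you plug in.

```python
import re
from typing import Callable

# Hypothetical cue words that hint at dependent, multi-step instructions.
MULTI_STEP_CUES = re.compile(r"\b(first|then|after that|step \d+|based on)\b", re.IGNORECASE)


def hybrid_route(query: str, history_turns: int, classify: Callable[[str], str]) -> str:
    """Resolve obvious cases with heuristics; defer ambiguous ones to `classify`."""
    word_count = len(query.split())

    # Very short opening messages are almost certainly simple.
    if word_count <= 3 and history_turns == 0:
        return "simple"

    # Very long queries, or queries with several multi-step cues, are almost certainly complex.
    if word_count >= 500 or len(MULTI_STEP_CUES.findall(query)) >= 2:
        return "complex"

    # Everything else is ambiguous: pay for the semantic classifier.
    return classify(query)


if __name__ == "__main__":
    stub_classifier = lambda q: "moderate"  # stand-in for a real classifier
    print(hybrid_route("hello there", 0, stub_classifier))
    print(hybrid_route("First pull the logs, then based on what you find, summarize the root cause", 0, stub_classifier))
    print(hybrid_route("why is this query slow", 4, stub_classifier))
```

Note that a short message deep in a conversation ("and then?") deliberately skips the shortcut and goes to the classifier, since its meaning depends on earlier turns.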

How Do You Implement LLM Semantic Routing?

The core loop is: classify the query, pick a model, call it, optionally validate the output and escalate if the quality check fails.

The simplest version can be implemented in a few dozen lines:

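# classify_query, call_llm, and passes_quality_check are application-specific
# helpers you supply; they are not part of any library.

# Model and output-token budget for each complexity tier.
ROUTES = {
    "simple": ("gpt-4o-mini", 300),
    "moderate": ("gpt-4o-mini", 600),
    "complex": ("gpt-4o", 1500),
}

async def routed_completion(query: str, conversation_history: list) -> str:
    complexity = classify_query(query)
    # Any unrecognized label falls through to the most capable tier.
    model, max_tokens = ROUTES.get(complexity, ROUTES["complex"])

    response = await call_llm(model, query, conversation_history, max_tokens)

    # Fallback: if a cheap-tier output fails the quality check, escalate.
    # Use the complex tier's token budget too, so the retry is not truncated
    # at the cheap tier's limit.
    if model != ROUTES["complex"][0] and not passes_quality_check(response):
        model, max_tokens = ROUTES["complex"]
        response = await call_llm(model, query, conversation_history, max_tokens)

    return response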

The classify_query function is where the work lives. A simple version prompts a cheap model with the query and asks for a complexity label. A more robust version adds conversation history, considers query length, and checks whether the request involves multi-step reasoning or specialized knowledge.
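A minimal sketch of the simple version might look like this. The `complete` parameter is a hypothetical callable that sends a prompt to your cheap model and returns its text; it is not a real library function. The one design choice worth copying is the fail-safe: anything that doesn't parse as a known label is treated as complex.

```python
from typing import Callable

LABELS = ("simple", "moderate", "complex")

PROMPT_TEMPLATE = (
    "Classify this query as simple, moderate, or complex. "
    "Respond with one word.\n\nQuery: {query}"
)


def classify_with_llm(query: str, complete: Callable[[str], str]) -> str:
    """Ask a cheap model for a complexity label and parse its reply defensively."""
    raw = complete(PROMPT_TEMPLATE.format(query=query))

    # Models sometimes add punctuation or extra words; keep only the first word.
    words = raw.strip().lower().split()
    label = words[0].strip(".,:;!\"'") if words else ""

    # Fail safe: an unparseable reply is routed as complex rather than guessed.
    return label if label in LABELS else "complex"


if __name__ == "__main__":
    fake_model = lambda prompt: "Simple."  # stand-in for a real API call
    print(classify_with_llm("What's 15% of 80?", fake_model))
```

Defaulting to complex costs a little more on parse failures, but it means a broken classifier degrades your savings rather than your answer quality.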

What you log matters as much as what you route. Every call should record which model was used, whether it escalated, and the estimated cost. After a week of production data you'll know your routing accuracy, your actual cost distribution, and where the fallback is firing — which tells you whether your classifier needs more examples or a different framing.
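One way to capture this is a single structured log line per request. The field names below are assumptions, not a standard schema, and the per-1K-token prices are passed in as parameters because they depend on your provider and change over time.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass

logger = logging.getLogger("llm_routing")


@dataclass
class RoutingRecord:
    timestamp: float
    complexity: str          # label produced by the classifier
    model: str               # model that produced the final response
    escalated: bool          # True if the fallback fired
    input_tokens: int
    output_tokens: int
    estimated_cost_usd: float


def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate spend from token counts and caller-supplied prices."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def log_routing(record: RoutingRecord) -> str:
    """Emit the record as one JSON line and return it for convenience."""
    line = json.dumps(asdict(record), sort_keys=True)
    logger.info(line)
    return line


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Prices here are placeholders; substitute your provider's current rates.
    cost = estimate_cost(1000, 500, input_price_per_1k=0.2, output_price_per_1k=0.4)
    log_routing(RoutingRecord(time.time(), "simple", "gpt-4o-mini", False, 1000, 500, cost))
```

JSON lines are easy to load into whatever analysis tool you already use, which matters more than the exact schema.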

What Signals Indicate a Query Needs a More Capable Model?

A query should escalate to a more capable model when it involves multi-step dependencies, specialized domain knowledge, ambiguous context, or accurate numerical reasoning.

Signals that reliably indicate a complex request:

  • Multi-step instructions where steps depend on each other ("first do X, then based on what you find, do Y")
  • Requests involving ambiguous references that require resolving context across a long conversation
  • Tasks requiring accurate numerical reasoning, especially multi-step calculations
  • Questions where the user signals uncertainty ("I'm not sure how to frame this, but...")
  • Requests involving specialized domain knowledge: legal, medical, financial, highly technical

Signals that reliably indicate a simple request:

  • Short factual questions with a clear, single answer
  • Reformatting or transformation tasks ("turn this into bullet points", "fix the grammar")
  • Sentiment classification, intent detection, or categorization
  • Conversational small talk or simple acknowledgements
  • Summarization of a short, well-structured document

These aren't rules to hard-code — they're patterns to use when building classifier training examples and test cases.
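As a rough sketch of turning those patterns into test cases, the harness below scores any classifier against a small labeled set. The example queries and labels are hypothetical; in practice you would draw them from your own production traffic. It reports under-routing separately, since a complex query sent to a cheap tier is the costly kind of mistake.

```python
from collections import Counter
from typing import Callable

# Hypothetical labeled examples modeled on the signal lists above.
TEST_CASES = [
    ("What is the capital of France?", "simple"),
    ("Turn these notes into bullet points", "simple"),
    ("Fix the grammar in this paragraph", "simple"),
    ("First audit the schema, then based on what you find, propose a migration", "complex"),
    ("Work out the compounded tax liability across these three scenarios", "complex"),
]


def evaluate(classify: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Return accuracy, under-routed count, and a confusion table for `classify`."""
    if not cases:
        raise ValueError("need at least one test case")

    confusion = Counter((expected, classify(query)) for query, expected in cases)
    correct = sum(n for (expected, got), n in confusion.items() if expected == got)
    under_routed = sum(
        n for (expected, got), n in confusion.items()
        if expected == "complex" and got != "complex"
    )
    return {
        "accuracy": correct / len(cases),
        "under_routed": under_routed,
        "confusion": dict(confusion),
    }


if __name__ == "__main__":
    always_simple = lambda q: "simple"  # deliberately bad baseline classifier
    print(evaluate(always_simple, TEST_CASES))
```

Running a trivial baseline like `always_simple` first gives you a floor to compare a real classifier against.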

How Much Cost Reduction Can You Expect from LLM Semantic Routing?

Teams with mixed workloads typically see blended cost reductions of 40–70% after implementing routing. Teams with simple-skewed workloads — customer support, FAQ bots, form-filling assistants — see reductions at the higher end of that range.

Latency often improves as well. The cheap model path isn't just cheaper — it's faster. GPT-4o Mini and similar models respond in a fraction of the time of frontier models. For users asking simple questions, routing means faster answers, not just cheaper ones. That's a quality improvement, not a tradeoff.

The quality impact on correctly routed requests is essentially zero: by definition, the cheap model is handling requests it can fully satisfy. Misrouted requests are caught by the fallback layer before they reach the user, as long as one is in place.

Where Does Semantic Routing Fit in a Broader LLM Cost Strategy?

Semantic routing is the highest-leverage starting point in a broader LLM cost stack, because the model tier is the single biggest variable in per-request cost.

Think of it as a hierarchy: routing decides which model runs, prompt optimization decides how many tokens it consumes, output constraints decide how long it responds. Each layer compounds the savings of the others.

A well-routed, well-prompted, output-constrained application can run at a fraction of the cost of a naively built one, with the same or better user-facing quality. Route first, then refine from there.

Frequently Asked Questions: LLM Semantic Routing

What is the difference between semantic routing and rule-based routing?

Rule-based routing matches on surface features — keywords, query length, explicit conditions. Semantic routing classifies by meaning and intent. A semantic router recognizes that "can you help me figure out the tip?" and "calculate 18% of $47.50" are the same type of request, even though they share no keywords. Rule-based routing would treat them differently unless you explicitly write a rule for conversational phrasing.

Does semantic routing require training data?

Embedding-based routing requires labeled example queries to compare against. LLM-as-classifier routing requires only a well-written prompt — no training data needed, just a few good examples in the prompt itself. The hybrid approach typically starts with heuristic rules (no training data) and adds a classifier for ambiguous cases as you collect production queries.
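To show what "labeled example queries to compare against" means in practice, here is a minimal nearest-example router. It is a sketch under stated assumptions: `embed` is any function that returns a vector, and `toy_embed` is a hashed bag-of-words stand-in that is not semantic at all. It exists only so the example runs without dependencies; in a real system you would pass in a proper embedding model.

```python
import math
import zlib
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Placeholder embedding (hashed word counts). Replace with a real model."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dims] += 1.0
    return vec


class EmbeddingRouter:
    def __init__(self, embed: Callable[[str], Vector], examples: list[tuple[str, str]]):
        self.embed = embed
        # Embed the labeled examples once, up front.
        self.index = [(embed(text), route) for text, route in examples]

    def route(self, query: str, min_similarity: float = 0.0, default: str = "complex") -> str:
        """Return the route of the closest example, or `default` if nothing is close enough."""
        if not self.index:
            return default
        q = self.embed(query)
        score, route = max(
            ((cosine(q, vec), r) for vec, r in self.index),
            key=lambda pair: pair[0],
        )
        return route if score >= min_similarity else default


if __name__ == "__main__":
    router = EmbeddingRouter(toy_embed, [
        ("fix the grammar in this sentence", "simple"),
        ("design a migration plan for our sharded database", "complex"),
    ])
    print(router.route("fix the grammar in this sentence"))
```

The `min_similarity` threshold is worth tuning: a query that resembles none of your examples is exactly the kind you want to send to the capable model by default.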

What happens when a query is misrouted to the cheap model?

Most production implementations include a fallback: the cheap model's output is validated (schema check for structured outputs, LLM-as-judge for open-ended generation), and if it fails, the request escalates to the capable model automatically. The user sees a slightly slower response on the rare misroute, but not a degraded one.
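For structured outputs, the schema check can be very small. The sketch below assumes the cheap model was asked to return a JSON object and that you know which keys and types to expect; the function name and the example schema are illustrative. An LLM-as-judge check for open-ended text would need a separate model call and is not shown here.

```python
import json


def passes_schema_check(raw_output: str, required: dict[str, type]) -> bool:
    """True if `raw_output` is a JSON object containing every required key with the right type."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False

    # A valid JSON array or scalar is still a failure if an object was expected.
    if not isinstance(data, dict):
        return False

    return all(key in data and isinstance(data[key], expected)
               for key, expected in required.items())


if __name__ == "__main__":
    schema = {"intent": str, "confidence": float}  # hypothetical schema
    print(passes_schema_check('{"intent": "refund", "confidence": 0.9}', schema))
    print(passes_schema_check("Sure! Here is the JSON you asked for...", schema))
```

A check like this catches malformed or incomplete output cheaply, but it says nothing about whether the content is right, so treat it as a first filter rather than a full quality gate.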

How do you measure whether LLM routing is working?

Track three metrics: (1) pool distribution — what percentage of requests hit each model tier; (2) fallback rate — how often the cheap model's output fails quality checks and escalates; (3) cost per successful output — total spend including failed calls divided by outputs that passed validation. If fallback rate is above ~10%, the classifier needs better training examples or a different framing.
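These three numbers are straightforward to compute from per-request logs. The sketch below assumes one record per request with hypothetical field names: `tier` (where the request was first routed), `escalated`, `passed` (whether the final output passed validation), and `cost` (total spend including any failed call). It also assumes fallback rate is measured over requests that started on a cheap tier, which is one reasonable reading of the definition above.

```python
from collections import Counter


def routing_metrics(records: list[dict], top_tier: str = "complex") -> dict:
    """Compute tier distribution, fallback rate, and cost per successful output."""
    if not records:
        raise ValueError("need at least one record")

    total = len(records)
    tiers = Counter(r["tier"] for r in records)

    # Only requests that started below the top tier can escalate.
    cheap = [r for r in records if r["tier"] != top_tier]
    escalated = sum(1 for r in cheap if r["escalated"])

    passed = sum(1 for r in records if r["passed"])
    total_cost = sum(r["cost"] for r in records)

    return {
        "tier_distribution": {tier: n / total for tier, n in tiers.items()},
        "fallback_rate": escalated / len(cheap) if cheap else 0.0,
        "cost_per_successful_output": total_cost / passed if passed else float("inf"),
    }


if __name__ == "__main__":
    sample = [  # made-up records for illustration
        {"tier": "simple", "escalated": False, "passed": True, "cost": 1.0},
        {"tier": "simple", "escalated": True, "passed": True, "cost": 6.0},
        {"tier": "complex", "escalated": False, "passed": True, "cost": 5.0},
        {"tier": "simple", "escalated": False, "passed": False, "cost": 1.0},
    ]
    print(routing_metrics(sample))
```

Dividing total spend by passing outputs, rather than by all requests, is what keeps a cheap-but-failing tier from looking better than it is.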

Can semantic routing work alongside LiteLLM or other routing infrastructure?

Yes — semantic routing is the classification layer that decides which model alias to call. LiteLLM's Router (or any provider-abstraction layer) is the infrastructure layer that handles load balancing, fallbacks, and provider switching within each alias. They operate at different levels and complement each other. See the LiteLLM Router Setup Guide for how to wire the two together.