RouteLLM vs vLLM Semantic Router: Which Should You Use?

RouteLLM, semantic-router, and vLLM each solve a different layer of the routing problem. Here's what each tool actually does, where they overlap, and how to choose.

Shubham Yadav

Machine Learning Researcher

June 8, 2026 · 11 min read

LiteLLM's router handles fallbacks, load balancing, and cost routing across providers — but it doesn't make the routing decision for you. It executes the route you specify; you still have to write the logic that classifies a query as simple, complex, math, or code.

Two open-source tools take a different approach: they own the routing decision itself. RouteLLM trains a classifier on human preference data to decide which queries need a strong model. The semantic-router library routes by intent using embedding similarity. They solve overlapping problems with fundamentally different philosophies.

This post covers:

  • What RouteLLM does — how it works, all four router types, and when to use it
  • What the semantic-router library does — embedding-based intent routing with full code
  • Where vLLM actually fits — speculative decoding and multi-LoRA serving vs. routing
  • Three-tier routing — adding a reasoning model layer on top of the stack
  • Side-by-side comparison — which tool to pick for your use case

1. RouteLLM: Classifier-Based Routing from Human Preference Data

RouteLLM is an open-source LLM routing library from LMSYS that trains a classifier on Chatbot Arena human preference data to predict, for a given query, whether a strong model will produce a meaningfully better answer than a weak one.

RouteLLM's central claim is that the binary routing problem — strong model or weak model — is best solved by learning from human judgments about model quality, not by engineering features. Their training data is hundreds of thousands of human-evaluated model comparisons. The classifier learns which types of queries consistently produce better outcomes on the strong model.

If the probability of needing the strong model is above a calibrated threshold, the query goes to the strong model. Below it, the weak model handles it. The threshold directly controls the cost-quality tradeoff.
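The decision rule fits in a few lines. This is a minimal sketch, assuming the router has already produced a score between 0 and 1 for each query; `route_by_threshold`, `strong_share`, and the sample scores are hypothetical and not part of RouteLLM's API.

```python
def route_by_threshold(strong_win_prob: float, threshold: float) -> str:
    """Return which tier handles the query under a simple cutoff rule."""
    return "strong" if strong_win_prob >= threshold else "weak"


def strong_share(scores: list[float], threshold: float) -> float:
    """Fraction of queries that would be sent to the strong model."""
    routed = [route_by_threshold(s, threshold) for s in scores]
    return routed.count("strong") / len(routed)


# Hypothetical router scores for eight queries.
scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 0.95]

low_cutoff = strong_share(scores, threshold=0.25)   # more traffic to the strong model
high_cutoff = strong_share(scores, threshold=0.70)  # less traffic to the strong model
```

Raising the cutoff shrinks the share of strong-model calls, which is the cost-quality dial described above.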

RouteLLM's four router architectures:

| Router type | How it works | Best for |
|---|---|---|
| mf (matrix factorization) | Low-rank decomposition of query-model quality relationship | Recommended default; strongest benchmark results |
| bert | Fine-tuned BERT classifier | More interpretable than mf, slightly slower |
| causal_llm | Small causal LM as classifier | Handles novel phrasing best, most inference overhead |
| sw_ranking (similarity-weighted ranking) | Retrieves labeled preference comparisons and weights them by similarity to the query | Simplest; no trained weights, just retrieval |

from routellm.controller import Controller

client = Controller(
    routers=["mf"],
    strong_model="gpt-4o",
    weak_model="gpt-4o-mini",
)

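# The threshold is encoded in the model string: router-<name>-<threshold>.
# It is a cutoff on the router's score, not a traffic percentage.
# RouteLLM's README obtains 0.11593 by calibrating the mf router so that
# roughly 50% of Chatbot Arena queries go to the strong model.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": query}],  # query: the user's prompt string
)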

The threshold parameter is the main lever. RouteLLM publishes calibration curves showing the quality-cost tradeoff at different thresholds — you pick the operating point that matches your acceptable quality loss.
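One way to pick an operating point is to score a sample of your own traffic and take the cutoff that matches the share of queries you are willing to send to the strong model. The sketch below illustrates that idea, assuming higher scores mean "more likely to need the strong model"; `calibrate_threshold` here is a hypothetical helper written for this post, not RouteLLM's own calibration routine, and the scores are made up.

```python
def calibrate_threshold(scores: list[float], strong_fraction: float) -> float:
    """Return a cutoff such that about `strong_fraction` of scores are >= it."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0.0 < strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of queries allowed onto the strong model (at least one).
    k = max(1, round(strong_fraction * len(ranked)))
    return ranked[k - 1]


# Hypothetical router scores collected from ten sample queries.
sample_scores = [0.02, 0.08, 0.11, 0.15, 0.22, 0.35, 0.41, 0.58, 0.73, 0.90]

threshold = calibrate_threshold(sample_scores, strong_fraction=0.3)
strong_count = sum(s >= threshold for s in sample_scores)
```

Because the cutoff is derived from a sample, it only holds as long as production traffic resembles that sample; tied scores can also push the realized share slightly above the target.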

Benchmark results: at 50% cost reduction relative to always using GPT-4, the matrix factorization router achieves around 95% of GPT-4's score on MT-Bench. The performance drop is real but concentrated on the specific queries the router misjudged — not spread uniformly across all outputs.
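To see how the share of strong-model calls translates into spend, a back-of-the-envelope calculation helps. The per-request costs below are made-up placeholders in arbitrary units, not published prices for any model.

```python
def blended_cost(strong_share: float, strong_cost: float, weak_cost: float) -> float:
    """Average cost per request when `strong_share` of traffic uses the strong model."""
    return strong_share * strong_cost + (1.0 - strong_share) * weak_cost


def savings_vs_always_strong(strong_share: float, strong_cost: float, weak_cost: float) -> float:
    """Fractional saving relative to sending every request to the strong model."""
    return 1.0 - blended_cost(strong_share, strong_cost, weak_cost) / strong_cost


# Hypothetical per-request costs in arbitrary units.
STRONG_COST = 10.0
WEAK_COST = 0.5

saving = savings_vs_always_strong(0.5, STRONG_COST, WEAK_COST)
```

With these placeholder numbers, routing half the traffic to the strong model saves a little under half the bill, because weak-model calls are cheap but not free.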

Limitation: when RouteLLM sends a query to the wrong model, diagnosing why is difficult. The classifier is a black box. And because it's trained on Chatbot Arena data, it may not reflect your specific application's quality standards.

2. semantic-router: Explicit Intent Routing Without Training Data

The semantic-router library (from Aurelio AI, separate from vLLM) routes incoming queries by matching them against predefined intent routes using embedding similarity — no training required, just labeled example utterances.

Rather than training a classifier on preference data, semantic routing is explicit and designed: you define the routes, you control the mapping, and you can inspect exactly why a query landed where it did.

from semantic_router import Route, RouteLayer
from semantic_router.encoders import OpenAIEncoder

math_route = Route(
    name="math",
    utterances=[
        "what is the derivative of x squared",
        "calculate compound interest over 10 years",
        "solve this system of equations",
        "how many combinations are there if I pick 3 from 10",
    ],
)

code_route = Route(
    name="code",
    utterances=[
        "write a Python function to parse JSON",
        "debug this SQL query",
        "refactor this class to use composition",
        "what does this regex match",
    ],
)

general_route = Route(
    name="general",
    utterances=[
        "summarize this article",
        "what is the capital of France",
        "help me write a professional email",
    ],
)

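encoder = OpenAIEncoder()  # reads OPENAI_API_KEY from the environment
layer = RouteLayer(encoder=encoder, routes=[math_route, code_route, general_route])

route_result = layer("integrate x squared from 0 to 5")
# e.g. RouteChoice(name='math', similarity_score=0.89)
# The exact score depends on the encoder; name is None when no route matches.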

Once you have the route, send the query to the appropriate model:

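import litellm

MODEL_MAP = {
    "math":    "anthropic/claude-sonnet-4-6",
    "code":    "gpt-4o",
    "general": "gpt-4o-mini",
}

async def complete(query: str, history: list) -> str:
    route = layer(query)
    # route.name is None when no route matches, so fall back to the cheap model.
    model = MODEL_MAP.get(route.name, "gpt-4o-mini")
    response = await litellm.acompletion(
        model=model,
        messages=history + [{"role": "user", "content": query}],
    )
    # acompletion returns a response object; extract the text to match -> str.
    return response.choices[0].message.content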

Key advantage over RouteLLM: fully interpretable, no training required, add new routes without retraining, and easy to tune without touching model weights.

Key limitation: a sparse or poorly-designed route set produces frequent misroutes. Out-of-distribution queries fall to a default route. The quality of your utterance examples matters significantly.
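To make the mechanics and the failure mode concrete, here is a toy version of similarity-based routing with a score cutoff and a default route. It is a sketch only: `embed` is a bag-of-words stand-in for a real embedding model, and `ROUTES`, `route`, and `min_score` are hypothetical names, not the semantic-router API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


ROUTES = {
    "math": ["solve this system of equations", "calculate compound interest over 10 years"],
    "code": ["write a python function to parse json", "debug this sql query"],
}


def route(query: str, min_score: float = 0.3, default: str = "general") -> tuple[str, float]:
    """Pick the route whose best utterance is most similar; fall back below min_score."""
    q = embed(query)
    best_name, best_score = default, 0.0
    for name, utterances in ROUTES.items():
        score = max(cosine(q, embed(u)) for u in utterances)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return default, best_score
    return best_name, best_score


in_scope = route("debug this python function")
out_of_scope = route("tell me a joke about penguins")
```

The first query overlaps strongly with a "code" utterance and is routed there; the second shares almost nothing with any utterance, scores below the cutoff, and lands on the default. Sparse utterance sets push more real traffic into that second case.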

3. vLLM: High-Throughput Serving (Not a Routing Library)

vLLM is a high-performance inference serving engine — not a routing library. Its relevance to routing comes through two specific features: speculative decoding and multi-LoRA serving.

This distinction matters because vLLM gets referenced in "LLM routing" discussions in a way that conflates infrastructure with decision-making.

Speculative decoding, vLLM's most routing-adjacent feature, uses a small draft model to propose several tokens at a time, which the larger target model then verifies in a single forward pass and accepts or rejects. The output still follows the target model's distribution, but each accepted token saves a sequential step on the large model. At a 70–80% acceptance rate, most tokens are drafted by the cheap model, although the target model still has to verify every one of them. This is not routing in the application sense, because both models run on every query:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.1-8B-Instruct \
    --num-speculative-tokens 5
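A rough way to reason about the speedup: under the simplifying assumption that each drafted token is accepted independently with probability a, the expected number of tokens produced per target-model verification pass with k drafted tokens is (1 - a^(k+1)) / (1 - a). The sketch below only evaluates that formula; real acceptance rates vary by prompt and token position, so treat the result as illustrative.

```python
def expected_tokens_per_pass(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model pass, assuming i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token accepted, plus one token from the target model itself.
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


# 75% acceptance with 5 drafted tokens, matching --num-speculative-tokens 5 above.
tokens = expected_tokens_per_pass(0.75, 5)
```

At a 75% acceptance rate with 5 drafted tokens, that works out to about 3.3 tokens per target-model pass instead of 1, which is where the latency gain comes from. The draft model's own compute is not included in this figure.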

Multi-LoRA serving is closer to traditional routing. A single vLLM instance serves multiple fine-tuned adapters on top of a base model, switching between them per request. If you have task-specific fine-tuned models — a customer support LoRA, a coding LoRA — vLLM routes requests to the appropriate adapter at inference time:

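from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="mistralai/Mistral-7B-v0.1", enable_lora=True)

# LoRARequest takes (adapter name, unique integer ID, path to the adapter weights).
response = llm.generate(
    prompts=[query],
    sampling_params=SamplingParams(temperature=0.0, max_tokens=256),
    lora_request=LoRARequest("coding-adapter", 1, "/adapters/code"),
)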

You still need external logic to decide which adapter to invoke — vLLM is the infrastructure, not the decision layer.
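That external logic can be as small as a lookup from a task label to the adapter's name, ID, and path. The sketch below uses a keyword check as a placeholder classifier; apart from `coding-adapter` at `/adapters/code`, every name, path, and keyword here is hypothetical. The three fields of `AdapterSpec` line up with the three positional arguments passed to `LoRARequest` in the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdapterSpec:
    name: str
    adapter_id: int
    path: str


ADAPTERS = {
    "code": AdapterSpec("coding-adapter", 1, "/adapters/code"),
    # Hypothetical second adapter, for illustration only.
    "support": AdapterSpec("support-adapter", 2, "/adapters/support"),
}

# Placeholder keyword lists; a real system would use a proper classifier.
CODE_HINTS = ("function", "bug", "stack trace", "sql", "python")
SUPPORT_HINTS = ("refund", "invoice", "my order", "cancel", "subscription")


def pick_adapter(query: str) -> Optional[AdapterSpec]:
    """Return the adapter to use, or None to fall back to the base model."""
    text = query.lower()
    if any(hint in text for hint in CODE_HINTS):
        return ADAPTERS["code"]
    if any(hint in text for hint in SUPPORT_HINTS):
        return ADAPTERS["support"]
    return None
```

In practice the keyword check would be replaced by whichever decision layer you already run, such as a semantic route name mapped to an adapter.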

When vLLM is relevant: you're self-hosting and need high-throughput inference, or you need multi-LoRA serving for task-specific fine-tuned models. For teams building on API providers rather than self-hosting, vLLM's serving optimizations don't apply directly.

Three-Tier LLM Routing: Adding a Reasoning Model Layer

Neither RouteLLM nor semantic routing addresses the increasingly important question of whether a query needs extended chain-of-thought reasoning — which adds a third tier to any modern routing stack.

Reasoning models (o1, o3, DeepSeek-R1, Claude with extended thinking) produce significantly better results on multi-step logical inference, math, and code debugging. They also cost 5–10× more than non-reasoning counterparts and respond more slowly. Sending every query through extended reasoning is wasteful; skipping it on genuinely hard queries degrades quality in ways that matter.

When extended reasoning helps vs. doesn't:

| Routing decision | Helps significantly | Rarely helps |
|---|---|---|
| Use extended reasoning | Formal math, logical proofs, multi-hop reasoning, algorithmic debugging | Factual questions, summarization, sentiment analysis, reformatting |
| Signal to look for | Multi-step dependencies, "why is this wrong", "prove that" | Retrieval-like answers, single-hop questions |

In practice, this adds a third tier:

REASONING_TRIGGERS = [
    "prove that", "derive", "step by step", "walk me through why",
    "debug", "why is this wrong", "find the flaw",
    "optimize this algorithm",
]

def needs_reasoning(query: str) -> bool:
    if any(t in query.lower() for t in REASONING_TRIGGERS):
        return True
    if len(query.split()) > 80 and "?" in query:
        return True
    return False

async def complete(query: str, history: list) -> str:
    messages = history + [{"role": "user", "content": query}]

    if needs_reasoning(query):
        # Tier 3: reasoning model with an extended-thinking budget.
        response = await litellm.acompletion(
            model="anthropic/claude-opus-4-8",
            messages=messages,
            thinking={"type": "enabled", "budget_tokens": 8000},
        )
    else:
        route = layer(query)  # semantic router handles the rest
        model = MODEL_MAP.get(route.name, "gpt-4o-mini")
        response = await litellm.acompletion(model=model, messages=messages)

    # acompletion returns a response object; extract the text to match -> str.
    return response.choices[0].message.content

RouteLLM's binary strong/weak framing predates reasoning models as a distinct tier and doesn't map cleanly onto this three-tier structure.

RouteLLM vs Semantic Router vs vLLM: Side-by-Side Comparison

| | RouteLLM | semantic-router | vLLM |
|---|---|---|---|
| What it decides | Strong vs. weak model | Which task/intent pool | Which LoRA adapter (multi-LoRA) |
| Decision method | Trained classifier (preference data) | Embedding similarity | External logic required |
| Setup required | Minimal; pre-trained classifiers | Curate example utterances per route | Build separate classification layer |
| Interpretability | Low; black box classifier | High; inspect route matches | N/A |
| Handles novel queries | Well; generalizes from training data | Falls back to default route | N/A |
| Task-specific routing | No; binary only | Yes; explicit per-task pools | Yes; per-LoRA adapter |
| Works with API providers | Yes | Yes | No; self-hosted only |
| Best for | Automatic quality-based routing | Explicit task-type routing | High-throughput self-hosted serving |

Use RouteLLM when: you want automatic routing without defining routes manually, you don't have labeled query data, and your primary concern is quality-appropriate model selection rather than task-specific selection.

Use semantic-router when: you have well-defined task types, want explicit control over which model handles each, and value being able to inspect and evolve routing decisions without retraining.

Use vLLM when: you're self-hosting and need high-throughput inference or multi-LoRA task routing on top of a base model.

The Complete LLM Routing Stack

In a well-instrumented production system, these tools layer rather than compete:

  1. LiteLLM Router — provider abstraction, fallback chains, cost/latency routing within a model tier
  2. RouteLLM or semantic-router — decides which tier and which task pool receives the request
  3. Reasoning gate — determines whether the request warrants extended thinking before it's sent

None of these tools makes the others redundant. RouteLLM doesn't handle provider failover. Semantic router doesn't handle rate limits. LiteLLM doesn't make the routing decision. They solve adjacent problems, and the cost and quality benefits of combining them are larger than any one layer provides on its own.

The choice between RouteLLM and semantic routing is a choice between learned routing and designed routing. Learned routing generalizes well and requires less upfront work. Designed routing is predictable, debuggable, and easier to evolve as your application changes. Which one you reach for depends on how well-defined your query taxonomy is and how much you value being able to explain individual routing decisions.

LLM Routing Tool Selection Checklist

  • Identify whether your routing decision is binary (strong vs. weak) or task-based (math, code, creative) — binary favors RouteLLM, task-based favors semantic-router
  • Install RouteLLM if going the learned route: pip install routellm and use the mf router as default
  • Build a semantic-router RouteLayer with 4–6 example utterances per task type if going the explicit route
  • Wire either tool into LiteLLM Router so provider fallbacks and rate limit handling are covered
  • If your application includes multi-step reasoning, add a reasoning-model gate (needs_reasoning()) as a third tier before the main routing logic
  • For vLLM multi-LoRA serving: write the classification layer yourself — vLLM executes the route, it doesn't decide it
  • Log which tool made each routing decision and which model was selected — you need this to tune the classifier or improve utterance coverage
  • After one week of production data: check what percentage of requests hit each tier and whether fallback rate is above ~10%
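The last two checklist items can be covered with a small summary over your routing log. This is a sketch under assumptions: the log schema (`tier`, `decided_by`, `model`, `fallback`) is hypothetical, and `fallback` is taken to mean the request fell through to a default route or a backup provider.

```python
from collections import Counter


def summarize_routing(log: list[dict]) -> dict:
    """Compute the share of requests per tier and the overall fallback rate."""
    total = len(log)
    if total == 0:
        return {"tier_share": {}, "fallback_rate": 0.0}
    tiers = Counter(entry["tier"] for entry in log)
    fallbacks = sum(1 for entry in log if entry.get("fallback", False))
    return {
        "tier_share": {tier: count / total for tier, count in tiers.items()},
        "fallback_rate": fallbacks / total,
    }


def make_entry(tier: str, decided_by: str, model: str, fallback: bool = False) -> dict:
    """Build one log record; field names are illustrative."""
    return {"tier": tier, "decided_by": decided_by, "model": model, "fallback": fallback}


# Hypothetical log of ten requests.
log = (
    [make_entry("weak", "semantic-router", "gpt-4o-mini") for _ in range(5)]
    + [make_entry("weak", "semantic-router", "gpt-4o-mini", fallback=True)]
    + [make_entry("strong", "semantic-router", "gpt-4o") for _ in range(2)]
    + [make_entry("strong", "semantic-router", "gpt-4o", fallback=True)]
    + [make_entry("reasoning", "needs_reasoning", "anthropic/claude-opus-4-8")]
)

summary = summarize_routing(log)
needs_attention = summary["fallback_rate"] > 0.10  # the ~10% guideline from the checklist
```

A fallback rate above the guideline usually points at thin utterance coverage or a miscalibrated threshold, so the `decided_by` field tells you which layer to tune first.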