LiteLLM Router Setup: Fallback, Cost Routing & Model Pools
A step-by-step walkthrough of LiteLLM's Router class — defining model pools, configuring multi-provider fallbacks, enabling cost-based routing, and adding task-specific pools for math, code, and creative tasks.
Shubham Yadav
Machine Learning Researcher
LiteLLM's Router is the fastest way to get multi-provider LLM routing into production. It handles provider abstraction, load balancing, automatic fallback, and cost-based routing behind a unified API — without requiring custom infrastructure.
This guide covers the complete setup:
- Installation and basic Router configuration
- Defining model pools for tier-based and task-specific routing
- Multi-provider fallback — automatic retries on provider failure or rate limits
- Cost-based and latency-based routing strategies
- Task-specific pools for math, code, and creative workloads
- Logging and cost tracking via LiteLLM callbacks
- A minimal production configuration you can deploy immediately
1. Why Use LiteLLM Router: Provider Abstraction, Fallback, and Cost Routing
LiteLLM solves the core problem of production LLM routing: provider lock-in. You write one call signature, and LiteLLM handles the translation to OpenAI, Anthropic, Google, Azure, Mistral, or any other provider. Swapping providers or adding a new one doesn't require changing application code.
The Router layer adds:
- Load balancing across multiple deployments of the same model
- Automatic fallback when a provider fails, rate-limits, or times out
- Built-in routing strategies for cost and latency optimization
- Context window fallbacks — automatically escalates to a larger-context model if the prompt is too long
LiteLLM is maintained by a team that tracks provider API changes, which matters when you're running across multiple providers in production.
2. How to Install and Configure LiteLLM Router
pip install litellm
For production use, install the extras that enable Redis-backed rate limit tracking and caching:
pip install 'litellm[proxy]' redis
The core object is Router. At minimum it takes a model_list — a list of model configurations that tells the router what's available, under what alias, and with what credentials:
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "claude-sonnet",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ]
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
The model_name field is your internal alias — what your application code refers to. The litellm_params.model is the actual provider model string. This separation is what lets you change providers without changing application code.
3. How to Define LiteLLM Model Pools
A model pool is a group of deployments behind the same alias. The router treats them as interchangeable and distributes load according to your routing strategy. This is the foundation for fallback and cost routing.
The most common pool structure groups deployments by capability tier:
router = Router(
    model_list=[
        # Fast, cheap pool
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "anthropic/claude-haiku-4-5-20251001",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        # Capable, expensive pool
        {
            "model_name": "capable",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "capable",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="cost-based-routing",
)
Both fast deployments compete for requests routed to the fast alias. The router picks between them based on the active routing strategy. Your application code calls router.completion(model="fast", ...) and the router handles the rest.
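To check which deployments sit behind each alias, a small helper can group a model_list by model_name. This is a plain-Python sketch: pools_by_alias is a hypothetical helper, not a LiteLLM function, and credentials are left out for brevity.

```python
from collections import defaultdict


def pools_by_alias(model_list):
    """Group provider model strings by their Router alias (model_name)."""
    pools = defaultdict(list)
    for deployment in model_list:
        pools[deployment["model_name"]].append(deployment["litellm_params"]["model"])
    return dict(pools)


# Same shape as the model_list above, with api_key entries omitted.
model_list = [
    {"model_name": "fast", "litellm_params": {"model": "gpt-4o-mini"}},
    {"model_name": "fast", "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001"}},
    {"model_name": "capable", "litellm_params": {"model": "gpt-4o"}},
    {"model_name": "capable", "litellm_params": {"model": "anthropic/claude-sonnet-4-6"}},
]

for alias, models in pools_by_alias(model_list).items():
    print(alias, "->", models)
```

Running a check like this before deploying makes it easy to spot a pool that accidentally has only one deployment.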
4. Configuring Multi-Provider Fallback and Context Window Escalation
Fallbacks are configured separately from the model list and operate at the alias level. When a request to one alias fails — provider outage, rate limit, timeout — the router automatically retries against the fallback alias:
router = Router(
    model_list=[...],
    fallbacks=[
        {"fast": ["capable"]},  # if fast fails, try capable
        {"capable": ["fast"]},  # if capable fails, try fast
    ],
    context_window_fallbacks=[
        # "capable-32k" must be defined as its own alias in model_list
        {"capable": ["capable-32k"]},  # if context is too long, try larger window
    ],
    num_retries=2,
    retry_after=0.5,  # minimum seconds to wait before retrying
    timeout=30,
)
The context_window_fallbacks key handles a specific failure mode worth calling out: prompt too long for the model's context window. Rather than returning an error, the router automatically escalates to a model with a larger context. This is particularly useful for document processing pipelines where input length varies unpredictably.
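The retry-then-fallback order described above can be pictured with a simplified control-flow sketch. It is an illustration only, not LiteLLM's internal implementation: call_with_fallbacks and fake_call are hypothetical names, and a real router also distinguishes error types (rate limit, timeout, context window) rather than catching every exception the same way.

```python
def call_with_fallbacks(call, alias, fallbacks, num_retries=2):
    """Try `alias` up to num_retries + 1 times, then each fallback alias in order.

    `call` is any function that takes an alias and returns a response or raises.
    Returns (alias_that_served, response).
    """
    last_error = None
    for candidate in [alias] + fallbacks.get(alias, []):
        for _attempt in range(num_retries + 1):
            try:
                return candidate, call(candidate)
            except Exception as exc:  # simplified: every error is treated as retryable
                last_error = exc
    raise last_error


# Fake provider call: the "fast" pool is down, "capable" works.
def fake_call(alias):
    if alias == "fast":
        raise TimeoutError("fast pool timed out")
    return f"response from {alias}"


served_by, response = call_with_fallbacks(fake_call, "fast", {"fast": ["capable"]})
print(served_by, response)
```

The point of the sketch is the ordering: retries are spent on the requested alias first, and only then does the request move to the fallback alias.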
For rate limit handling, LiteLLM puts a deployment into cooldown once it exceeds the allowed number of failures and routes around it until the cooldown expires:
router = Router(
    model_list=[...],
    cooldown_time=60,  # seconds to skip a deployment once it is cooled down
    allowed_fails=2,  # failures tolerated before a deployment is cooled down
)
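The cooldown behaviour can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not LiteLLM's code: CooldownTracker is a hypothetical class, it benches a deployment when failures exceed allowed_fails, and it ignores the time window over which a real router counts failures.

```python
import time


class CooldownTracker:
    """Counts failures per deployment and benches one that fails too often."""

    def __init__(self, allowed_fails=2, cooldown_time=60, clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.fail_counts = {}
        self.cooled_until = {}

    def record_failure(self, deployment):
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        # Bench the deployment once it exceeds the allowed number of failures.
        if self.fail_counts[deployment] > self.allowed_fails:
            self.cooled_until[deployment] = self.clock() + self.cooldown_time
            self.fail_counts[deployment] = 0

    def available(self, deployments):
        now = self.clock()
        return [
            d for d in deployments
            if d not in self.cooled_until or self.cooled_until[d] <= now
        ]


# Drive the tracker with a fake clock instead of real time.
now = [0.0]
tracker = CooldownTracker(allowed_fails=2, cooldown_time=60, clock=lambda: now[0])
pool = ["gpt-4o", "anthropic/claude-sonnet-4-6"]

for _ in range(3):
    tracker.record_failure("gpt-4o")
print(tracker.available(pool))  # gpt-4o is benched

now[0] = 61.0
print(tracker.available(pool))  # cooldown has expired
```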
5. LiteLLM Routing Strategies: Cost-Based, Latency-Based, and Simple Shuffle
LiteLLM supports three primary routing strategies. Choosing the right one per pool is more effective than applying a single strategy globally:
| Routing strategy | How it works | Best for |
|---|---|---|
| cost-based-routing | Routes to the cheapest available deployment in the pool | High-volume pools where spend matters |
| latency-based-routing | Routes to the deployment with the lowest recently observed latency | Latency-sensitive user-facing requests |
| simple-shuffle (default) | Picks a deployment at random, weighted by rpm/tpm limits when they are set | Equal-capability pools, even load distribution |
With routing_strategy="cost-based-routing", set explicit cost overrides if the built-in LiteLLM pricing table doesn't match your negotiated rates:
router = Router(
    model_list=[
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
                "input_cost_per_token": 0.00000015,
                "output_cost_per_token": 0.0000006,
            },
        },
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "anthropic/claude-haiku-4-5-20251001",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "input_cost_per_token": 0.00000025,
                "output_cost_per_token": 0.00000125,
            },
        },
    ],
    routing_strategy="cost-based-routing",
)
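To sanity-check the overrides, it helps to work out what a typical request would cost on each deployment. The sketch below only illustrates the arithmetic with the per-token prices from the example; request_cost is a hypothetical helper, and LiteLLM's own ranking logic may weigh costs differently.

```python
def request_cost(deployment, input_tokens, output_tokens):
    """Cost in USD of one request, from the deployment's per-token overrides."""
    params = deployment["litellm_params"]
    return (
        input_tokens * params["input_cost_per_token"]
        + output_tokens * params["output_cost_per_token"]
    )


# The two "fast" deployments from the example, credentials omitted.
fast_pool = [
    {
        "model_name": "fast",
        "litellm_params": {
            "model": "gpt-4o-mini",
            "input_cost_per_token": 0.00000015,
            "output_cost_per_token": 0.0000006,
        },
    },
    {
        "model_name": "fast",
        "litellm_params": {
            "model": "anthropic/claude-haiku-4-5-20251001",
            "input_cost_per_token": 0.00000025,
            "output_cost_per_token": 0.00000125,
        },
    },
]

# Compare the pool on an assumed request of 1,000 input and 500 output tokens.
for deployment in fast_pool:
    cost = request_cost(deployment, input_tokens=1000, output_tokens=500)
    print(deployment["litellm_params"]["model"], f"${cost:.6f}")

cheapest = min(fast_pool, key=lambda d: request_cost(d, 1000, 500))
print("cheapest:", cheapest["litellm_params"]["model"])
```

With these prices the first deployment costs roughly half as much per request, so a cost-based strategy would be expected to favour it while it stays healthy.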
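A practical default: use latency-based routing on the fast pool (cheap models already vary in response time) and cost-based routing on the capable pool (expensive models have more pricing variation worth optimizing). Note that routing_strategy is an argument to Router itself, so it applies to every alias that Router serves; running different strategies per pool means creating one Router instance per pool.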
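One way to apply a different strategy to each pool is to build a separate Router per pool. The sketch below is a possible layout rather than an official pattern: ROUTER_CONFIGS, api_key_for, build_model_list, and build_routers are hypothetical names, and only the Router arguments already shown in this guide (model_list and routing_strategy) are used.

```python
import os

# One entry per pool, each with its own routing strategy.
ROUTER_CONFIGS = {
    "fast": {
        "routing_strategy": "latency-based-routing",
        "models": ["gpt-4o-mini", "anthropic/claude-haiku-4-5-20251001"],
    },
    "capable": {
        "routing_strategy": "cost-based-routing",
        "models": ["gpt-4o", "anthropic/claude-sonnet-4-6"],
    },
}


def api_key_for(model):
    """Pick the credential by provider prefix (hypothetical helper)."""
    env_var = "ANTHROPIC_API_KEY" if model.startswith("anthropic/") else "OPENAI_API_KEY"
    return os.environ.get(env_var)


def build_model_list(alias, models):
    """Build Router model_list entries that all share one alias."""
    return [
        {"model_name": alias, "litellm_params": {"model": m, "api_key": api_key_for(m)}}
        for m in models
    ]


def build_routers():
    """Create one Router per pool; litellm is imported lazily here."""
    from litellm import Router

    return {
        alias: Router(
            model_list=build_model_list(alias, cfg["models"]),
            routing_strategy=cfg["routing_strategy"],
        )
        for alias, cfg in ROUTER_CONFIGS.items()
    }
```

Calling routers = build_routers() and then routers["fast"].completion(model="fast", ...) would route within that pool only. The trade-off is that fallbacks configured on one Router cannot reach aliases owned by another, so cross-pool fallback has to be handled in application code.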
6. Task-Specific Model Pools: Math, Code, and Creative Routing
Tier-based routing handles complexity. Task-specific pools go further — they route by what the request is doing, not just how hard it is. A math reasoning task and a creative writing task of similar complexity benefit from different models.
Define pools by task type:
router = Router(
    model_list=[
        # Math and reasoning pool
        {
            "model_name": "math",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "math",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        # Code generation pool
        {
            "model_name": "code",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        # Creative and long-form writing pool
        {
            "model_name": "creative",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        # Default fallback pool
        {
            "model_name": "default",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[
        {"math": ["default"]},
        {"code": ["default"]},
        {"creative": ["default"]},
    ],
)
The router itself doesn't know which pool to use — that's your classification logic. A two-layer classifier works well: fast keyword heuristics first, LLM classifier only for ambiguous cases:
def classify_task(query: str) -> str:
    query_lower = query.lower()
    # Fast heuristic layer: zero cost
    math_signals = ["calculate", "solve", "equation", "proof", "derivative", "integral", "%", "how many"]
    code_signals = ["write a function", "debug", "refactor", "implement", "sql", "regex", "script"]
    creative_signals = ["write a story", "poem", "essay", "blog post", "marketing copy", "tone"]
    if any(s in query_lower for s in math_signals):
        return "math"
    if any(s in query_lower for s in code_signals):
        return "code"
    if any(s in query_lower for s in creative_signals):
        return "creative"
    # LLM classifier only fires on ambiguous cases
    return classify_with_llm(query)


def classify_with_llm(query: str) -> str:
    response = router.completion(
        model="default",
        messages=[
            {
                "role": "system",
                "content": "Classify the user's request as one of: math, code, creative, default. Respond with one word only.",
            },
            {"role": "user", "content": query},
        ],
        max_tokens=5,
    )
    # content can be None, so fall back to an empty string before normalizing
    label = (response.choices[0].message.content or "").strip().lower()
    return label if label in {"math", "code", "creative"} else "default"


async def complete(query: str, history: list) -> str:
    # classify_task may make a blocking LLM call on ambiguous queries
    pool = classify_task(query)
    response = await router.acompletion(
        model=pool,
        messages=history + [{"role": "user", "content": query}],
    )
    # Return the text so the result matches the declared return type
    return response.choices[0].message.content
The heuristic layer runs at zero cost. The LLM classifier only fires on ambiguous queries — and uses the default pool (cheapest model), so classification cost is negligible. Expensive models only run on requests specifically routed to them.
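Plain substring checks can misfire: "tone" also matches "limestone", and a bare "%" matches any percentage in any request. A stricter variant, sketched below, matches signals only as whole words or phrases. matches_signal is a hypothetical helper, and symbol signals such as "%" would still need separate handling because they sit next to digits.

```python
import re


def matches_signal(query: str, signals: list[str]) -> bool:
    """True if any signal appears as a whole word or phrase in the query."""
    for signal in signals:
        # Lookarounds reject matches that are embedded inside a longer word.
        pattern = r"(?<!\w)" + re.escape(signal) + r"(?!\w)"
        if re.search(pattern, query, flags=re.IGNORECASE):
            return True
    return False


creative_signals = ["write a story", "poem", "essay", "blog post", "marketing copy", "tone"]

print(matches_signal("Adjust the tone of this email", creative_signals))
print(matches_signal("How heavy is a limestone block?", creative_signals))
```

The first query matches on the standalone word "tone"; the second does not, because "tone" only appears inside "limestone". Fewer false positives in the heuristic layer means fewer requests sent to the wrong pool.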
7. LiteLLM Cost Tracking and Logging Setup
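LiteLLM exposes a callback system for logging. The fields that matter for cost monitoring are the model, the pool alias, input and output token counts, cost, and latency:
import litellm

def log_completion(kwargs, completion_response, start_time, end_time):
    # On failure the response can be None or lack usage data, so read it defensively.
    usage = getattr(completion_response, "usage", None)
    cost = litellm.completion_cost(completion_response=completion_response) if usage else None
    # For Router calls the alias is expected under litellm_params metadata as
    # "model_group"; kwargs["model"] is used as a fallback if it is missing.
    metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
    print({
        "model": getattr(completion_response, "model", None) or kwargs.get("model"),
        "pool": metadata.get("model_group", kwargs.get("model")),  # the alias you called with
        "input_tokens": getattr(usage, "prompt_tokens", None),
        "output_tokens": getattr(usage, "completion_tokens", None),
        "cost_usd": cost,
        "latency_ms": (end_time - start_time).total_seconds() * 1000,
        # Custom flag: set it in your own request metadata if your setup does not populate it.
        "fallback_triggered": kwargs.get("metadata", {}).get("fallback_triggered", False),
    })

litellm.success_callback = [log_completion]
litellm.failure_callback = [log_completion]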
Route these events to your logging pipeline — Postgres, ClickHouse, Datadog, whatever you're already using. The pool field (the alias) is what lets you break down cost per routing tier, which is the key metric for evaluating whether routing is working.
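Once events are stored, the per-pool breakdown is a simple aggregation. The sketch below assumes each event is a dict with the pool and cost_usd keys logged by the callback; cost_per_pool is a hypothetical helper and the sample events are made up for illustration.

```python
from collections import defaultdict


def cost_per_pool(events):
    """Sum cost and count requests per pool alias from logged events."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    for event in events:
        bucket = totals[event["pool"]]
        bucket["requests"] += 1
        bucket["cost_usd"] += event["cost_usd"] or 0.0  # failed calls may log no cost
    return dict(totals)


# Made-up sample events, for illustration only.
events = [
    {"pool": "fast", "cost_usd": 0.0004},
    {"pool": "fast", "cost_usd": 0.0006},
    {"pool": "capable", "cost_usd": 0.0120},
    {"pool": "capable", "cost_usd": None},
]

for pool, stats in cost_per_pool(events).items():
    print(pool, stats["requests"], f"${stats['cost_usd']:.4f}")
```

In a real pipeline the same grouping would usually be a GROUP BY on the pool column in whichever store receives the events.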
For Redis-backed rate limit tracking across multiple workers — necessary if you're running more than one process:
router = Router(
    model_list=[...],
    redis_host=os.environ["REDIS_HOST"],
    redis_port=6379,
    redis_password=os.environ["REDIS_PASSWORD"],
)
Without Redis, each worker maintains its own rate limit state. For single-process development it doesn't matter; for production with multiple workers it does.
8. Complete LiteLLM Production Router Configuration
A full production setup with tiered pools, task-specific routing, fallbacks, cost-based strategy, Redis, and logging:
import os
import litellm
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "fast", "litellm_params": {"model": "gpt-4o-mini", "api_key": os.environ["OPENAI_API_KEY"]}},
        {"model_name": "fast", "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001", "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "capable", "litellm_params": {"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}},
        {"model_name": "capable", "litellm_params": {"model": "anthropic/claude-sonnet-4-6", "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "math", "litellm_params": {"model": "anthropic/claude-sonnet-4-6", "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "code", "litellm_params": {"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}},
        {"model_name": "creative", "litellm_params": {"model": "anthropic/claude-sonnet-4-6", "api_key": os.environ["ANTHROPIC_API_KEY"]}},
    ],
    routing_strategy="cost-based-routing",
    fallbacks=[
        {"math": ["capable"]},
        {"code": ["capable"]},
        {"creative": ["capable"]},
        {"capable": ["fast"]},
    ],
    # A context-window fallback must point at a different, larger-context alias;
    # mapping "capable" to itself would only retry the same models. Enable this
    # once such an alias (for example the "capable-32k" alias from section 4)
    # is defined in model_list:
    # context_window_fallbacks=[{"capable": ["capable-32k"]}],
    num_retries=2,
    timeout=30,
    cooldown_time=60,
    allowed_fails=2,
    redis_host=os.environ.get("REDIS_HOST"),
    redis_port=6379,
)

# log_completion is the callback defined in section 7
litellm.success_callback = [log_completion]
litellm.failure_callback = [log_completion]
LiteLLM Router Production Monitoring: What to Track
After the first week in production, these four metrics tell you whether the setup is working:
| Metric | What it reveals | Action if off |
|---|---|---|
| Pool distribution (% per alias) | If >90% hits capable, classifier is too conservative | Lower escalation threshold |
| Fallback rate per pool | Frequent fallbacks = unreliable provider or under-sized pool | Add a second deployment to the pool |
| Cost per pool vs baseline | Which pools are carrying the savings | Tune pool assignments to move more traffic to fast |
| Escalation accuracy | Sample escalated requests — wrong classification or genuinely hard? | Wrong: fix training examples. Hard: threshold is correct |
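The first two metrics in the table can be computed directly from the logged events. The sketch below assumes each event carries the pool and fallback_triggered keys from the logging callback; pool_distribution and fallback_rate are hypothetical helpers, and the sample events are invented for illustration.

```python
from collections import Counter


def pool_distribution(events):
    """Share of requests per pool alias, as fractions that sum to 1."""
    counts = Counter(e["pool"] for e in events)
    total = sum(counts.values())
    return {pool: n / total for pool, n in counts.items()}


def fallback_rate(events):
    """Fraction of requests per pool that triggered a fallback."""
    totals, fallbacks = Counter(), Counter()
    for e in events:
        totals[e["pool"]] += 1
        fallbacks[e["pool"]] += bool(e.get("fallback_triggered"))
    return {pool: fallbacks[pool] / totals[pool] for pool in totals}


# Invented sample: 6 requests to fast (1 fallback), 2 to capable (1 fallback).
events = (
    [{"pool": "fast", "fallback_triggered": False}] * 5
    + [{"pool": "fast", "fallback_triggered": True}]
    + [{"pool": "capable", "fallback_triggered": False}]
    + [{"pool": "capable", "fallback_triggered": True}]
)

distribution = pool_distribution(events)
rates = fallback_rate(events)
print(distribution)
print(rates)

# Flag the condition from the table: more than 90% of traffic on capable.
if distribution.get("capable", 0) > 0.9:
    print("classifier may be too conservative")
```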
The configuration above is a starting point. After a week of production data you'll have enough to tune pool assignments, adjust fallback chains, and decide whether task-specific routing is delivering enough accuracy to justify the classification overhead. For most applications it does. For simple workloads with low task diversity, tier-based routing alone is sufficient — and simpler to maintain.
LiteLLM Router Setup Checklist
- Install litellm and litellm[proxy] redis for production
- Define at minimum a fast and capable pool with two providers each
- Configure fallbacks so every pool has at least one fallback alias
- Set cooldown_time and allowed_fails to route around rate-limited deployments
- Add context_window_fallbacks for any pipeline handling variable-length inputs
- Set routing_strategy="cost-based-routing" and add explicit cost overrides for negotiated rates
- Add task-specific pools (math, code, creative) and a two-layer classifier if task diversity warrants it
- Wire up logging callbacks and route events to your observability pipeline
- Configure Redis if running more than one worker process
- Monitor pool distribution and fallback rate in the first week; tune from real data