LiteLLM Router Setup: Fallback, Cost Routing & Model Pools

A practical, code-first guide to setting up the LiteLLM Router in production - covering model pools, all six routing strategies, three fallback types, cost-based routing, and Redis-backed reliability.

Shubham Yadav

Machine Learning Researcher

June 13, 2026

14 min read

Your LLM provider just returned a 429. Your app is down. Your users are angry.

That's the exact problem the LiteLLM Router solves - and it does it in fewer than 20 lines of Python. With 51,000+ GitHub stars on BerriAI/litellm and adoption at Stripe, Netflix, and Google ADK, it's become the default reliability layer for production LLM applications.

This guide covers everything: model pools, all six routing strategies, three fallback types, cost-based routing, routing groups, and the production settings that actually matter.

TL;DR

Install with pip install litellm

Define a model_list with multiple deployments under the same model_name alias

Use simple-shuffle (default) for most cases - it's the fastest

Add fallbacks, context_window_fallbacks, and content_policy_fallbacks for resilience

Use cost-based-routing when mixing GPT-4o with cheaper models like Claude Haiku or Gemini Flash

Add Redis in distributed deployments so cooldowns and TPM/RPM usage are tracked across instances

What Is the LiteLLM Router?

LiteLLM is an open-source Python library that gives you a single, unified interface to call 100+ LLM providers - OpenAI, Anthropic, Azure, Bedrock, Gemini, Groq, and more - using the OpenAI format.

The LiteLLM Router is its production-grade load balancing layer. It sits in front of your model deployments and handles:

Load balancing across multiple deployments of the same model
Fallback chains that trigger automatically on failure
Retries with fixed and exponential backoff
Cost tracking and cost-based routing to the cheapest available deployment
Cooldowns that isolate failing deployments without taking down the whole model group

The router lives at litellm/router.py and can be used through the Python SDK or behind the LiteLLM Proxy Server. Its job is load balancing across deployments of the same model. That is a different job from semantic routing, a decision layer that picks which model handles a query based on the query's meaning.

Quick Setup: Your First Router in 5 Minutes

Install the library:

pip install litellm

Here's a working router with two deployments - one Azure, one OpenAI - both aliased under "gpt-3.5-turbo":

import os
from litellm import Router

model_list = [
    {
        "model_name": "gpt-3.5-turbo",       # model alias (what you call in your code)
        "litellm_params": {
            "model": "azure/chatgpt-v-2",     # actual model name
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",         # OpenAI direct
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    },
]

router = Router(model_list=model_list)

response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)

Key fields in litellm_params:

model - the actual provider model string (e.g., "azure/chatgpt-v-2", "gpt-3.5-turbo")
api_key - provider API key
api_base - endpoint URL (required for Azure)
api_version - API version string (Azure-specific)

That's it. Every call to router.completion(model="gpt-3.5-turbo", ...) will now load-balance across both deployments automatically.

Model Pools: How to Group and Manage Deployments

A model pool is just multiple deployments sharing the same model_name alias. The router treats them as a single logical model and picks between them based on your routing strategy.

Here's a pool of three deployments under one alias, with weights and rate limits:

import os
from litellm import Router

model_list = [
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-v-2",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
            "weight": 9,       # ~90% of traffic
            "rpm": 900,        # requests per minute limit
            "tpm": 100000,     # tokens per minute limit
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-functioncalling",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
            "weight": 1,       # ~10% of traffic
            "rpm": 100,
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "gpt-3.5-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
            "order": 2,        # lower priority - used as last resort in pool
            "rpm": 1000,
        },
    },
]

router = Router(model_list=model_list, routing_strategy="simple-shuffle")

Three parameters that control pool behavior:

weight - relative pick frequency. weight: 9 on one deployment and weight: 1 on another means a ~90%/10% split.
rpm / tpm - rate limits per deployment. The router uses these to avoid overloading any single endpoint.
order - priority within the pool. order: 1 is tried first; order: 2 is tried only when order: 1 fails or is on cooldown.
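
To build intuition for how weight shapes traffic, here's a minimal sketch of a weighted random pick in plain Python. It is not LiteLLM's implementation: pick_deployment, simulate, and the deployment ids are hypothetical names, and the sketch assumes every deployment in the pool sets a weight.

```python
import random
from collections import Counter

# Hypothetical pool: deployment id -> weight (mirrors the 9:1 split above).
POOL = {"azure-chatgpt-v-2": 9, "azure-functioncalling": 1}


def pick_deployment(pool: dict[str, int], rng: random.Random) -> str:
    """Pick one deployment id with probability proportional to its weight."""
    ids = list(pool)
    weights = [pool[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


def simulate(pool: dict[str, int], n: int = 10_000, seed: int = 0) -> Counter:
    """Count how often each deployment is picked over n simulated requests."""
    rng = random.Random(seed)
    return Counter(pick_deployment(pool, rng) for _ in range(n))


counts = simulate(POOL)
share = counts["azure-chatgpt-v-2"] / sum(counts.values())
print(f"azure-chatgpt-v-2 share: {share:.1%}")
```

Over enough requests the share settles near 90%, but any short window can drift from it. That's why the split is written as approximate.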

Routing Strategies: Which One Should You Use?

The default simple-shuffle works for most teams. Here's when to switch.

These strategies all balance load within a model group; choosing between models of different capability is a separate decision - see our RouteLLM vs semantic-router comparison.

Set your strategy at router initialization:

router = Router(model_list=model_list, routing_strategy="latency-based-routing")

Strategy comparison

Strategy	How It Works	Best For	Pros	Cons
`simple-shuffle` (default)	Random pick, weighted by `rpm`/`weight`	General purpose, high throughput	Minimal latency overhead, no state	Doesn't optimize for latency or cost
`least-busy`	Routes to deployment with fewest active requests	High-concurrency scenarios	Prevents overload on any single deployment	Slight overhead for request tracking
`latency-based-routing`	Routes to deployment with lowest cached response time	Latency-critical user-facing apps	Optimizes speed over time	Initial overhead; needs warm-up calls
`usage-based-routing`	Routes to deployment with lowest current TPM usage	Strict rate-limit compliance	Respects limits evenly	Not recommended for production - adds significant Redis latency
`cost-based-routing`	Routes to deployment with lowest cost per token	Mixed-model pools (e.g., GPT-4o + Groq)	Automatic cost minimization	Async only; unknown cost for unrecognized models
`weighted-pick`	RPM/TPM-aware distribution using `weight` param	Controlled traffic splits	Predictable distribution	Manual weight management

Warning: The LiteLLM docs explicitly flag usage-based-routing as bad for production performance due to Redis overhead on every request. Stick to simple-shuffle unless you have a specific reason to switch.

For latency-based-routing, you can tune the time window and add a buffer to prevent one fast deployment from absorbing all traffic:

router = Router(
    model_list=model_list,
    routing_strategy="latency-based-routing",
    routing_strategy_args={
        "ttl": 3600,                  # latency window in seconds
        "lowest_latency_buffer": 0.5, # consider deployments within 50% of the fastest
    },
)
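
The effect of lowest_latency_buffer can be sketched in a few lines of plain Python. This is an illustration of the idea described above, not the router's code: candidates_within_buffer and the latency numbers are hypothetical.

```python
import random


def candidates_within_buffer(latencies: dict[str, float], buffer: float) -> list[str]:
    """Return deployments whose latency is within `buffer` (a fraction) of the fastest."""
    lowest = min(latencies.values())
    cutoff = lowest * (1 + buffer)
    return sorted(name for name, lat in latencies.items() if lat <= cutoff)


# Hypothetical average latencies, in seconds.
latencies = {"azure-east": 0.40, "azure-west": 0.55, "openai": 0.90}

# With a 0.5 buffer the cutoff is 0.60s, so two deployments qualify.
pool = candidates_within_buffer(latencies, buffer=0.5)

# Picking at random inside the near-fastest set spreads the load.
chosen = random.choice(pool)
print(pool, chosen)
```

With a buffer of 0, only the single fastest deployment qualifies, which is exactly the traffic pile-up the buffer is meant to avoid.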

Fallback Configuration: Never Let a Failed Model Break Your App

Fallbacks let you define a backup model chain that triggers automatically when the primary fails. After num_retries are exhausted on the primary, the router moves to the next model in the fallback list - in order.

The three fallback types

LiteLLM has three distinct fallback parameters, each targeting a different error class:

fallbacks - catches general errors: 429 rate limits, 500 server errors, connection failures
content_policy_fallbacks - catches ContentPolicyViolationError specifically
context_window_fallbacks - catches ContextWindowExceededError when input tokens exceed the model's limit

Critical rule: Every model referenced in a fallback must also exist in model_list. If it's missing, the router raises a BadRequestError.
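
You can catch that mistake before the router ever starts. Here's a small pre-flight check, assuming the list-of-dicts fallback format used throughout this guide. missing_fallback_targets is a hypothetical helper, not part of LiteLLM.

```python
def missing_fallback_targets(model_list, *fallback_configs):
    """Return sorted model names used in fallback configs but absent from model_list."""
    known = {deployment["model_name"] for deployment in model_list}
    referenced = set()
    for config in fallback_configs:
        for mapping in config or []:
            for source, targets in mapping.items():
                referenced.add(source)
                referenced.update(targets)
    return sorted(referenced - known)


model_list = [
    {"model_name": "gpt-3.5-turbo", "litellm_params": {"model": "azure/chatgpt-v-2"}},
    {"model_name": "gpt-4", "litellm_params": {"model": "azure/gpt-4-ca"}},
]
fallbacks = [{"gpt-3.5-turbo": ["gpt-4"]}]
context_window_fallbacks = [{"gpt-3.5-turbo": ["gpt-4-large-context"]}]

# gpt-4-large-context is referenced but never defined in model_list.
missing = missing_fallback_targets(model_list, fallbacks, context_window_fallbacks)
print(missing)
```

Running a check like this in CI or at startup turns a runtime routing error into a configuration error you see immediately.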

Python SDK example

import os
from litellm import Router

model_list = [
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {
            "model": "azure/chatgpt-v-2",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE"),
            "rpm": 6,
        },
    },
    {
        "model_name": "gpt-4",
        "litellm_params": {
            "model": "azure/gpt-4-ca",
            "api_base": "https://my-endpoint-canada-berri992.openai.azure.com/",
            "api_key": os.getenv("AZURE_API_KEY"),
            "rpm": 6,
        },
    },
    {
        "model_name": "gpt-4-large-context",
        "litellm_params": {
            "model": "gpt-4-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
    },
]

router = Router(
    model_list=model_list,
    fallbacks=[{"gpt-3.5-turbo": ["gpt-4"]}],               # general errors
    context_window_fallbacks=[{"gpt-3.5-turbo": ["gpt-4-large-context"]}],
    num_retries=3,
)

response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)

Proxy YAML config example

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2
      api_key: os.environ/AZURE_API_KEY
      api_version: os.environ/AZURE_API_VERSION
      api_base: os.environ/AZURE_API_BASE
      rpm: 6

  - model_name: gpt-4
    litellm_params:
      model: azure/gpt-4-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      rpm: 6

  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{"gpt-3.5-turbo": ["gpt-4"]}]
  context_window_fallbacks: [{"gpt-3.5-turbo": ["gpt-4"]}]
  content_policy_fallbacks: [{"gpt-3.5-turbo": ["claude-opus"]}]
  enable_pre_call_checks: true   # filter out deployments with too-small context windows before the call
  num_retries: 3
  allowed_fails: 3
  cooldown_time: 30

Chain execution order

Request hits gpt-3.5-turbo
Fails → retried num_retries times within the same model group (using order-based priority if set)
Still failing → router checks the fallbacks list and tries gpt-4
If the error was a ContextWindowExceededError, context_window_fallbacks is used instead
If the error was a ContentPolicyViolationError, content_policy_fallbacks is used instead
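
The selection logic in the steps above can be sketched as a single function. This is a simplified model of the order described here, not the router's internals: the two exception classes are local stand-ins named after the LiteLLM errors, and fallback_chain is a hypothetical helper. Whether the router falls through to the general list when a specialised list is unset is not covered by this sketch.

```python
class ContextWindowExceededError(Exception):
    """Local stand-in for the LiteLLM exception of the same name."""


class ContentPolicyViolationError(Exception):
    """Local stand-in for the LiteLLM exception of the same name."""


def fallback_chain(model, error, fallbacks=None,
                   context_window_fallbacks=None, content_policy_fallbacks=None):
    """Return the ordered fallback models to try for `model` after `error`."""
    if isinstance(error, ContextWindowExceededError):
        configs = context_window_fallbacks
    elif isinstance(error, ContentPolicyViolationError):
        configs = content_policy_fallbacks
    else:
        configs = fallbacks  # rate limits, server errors, timeouts, ...
    for mapping in configs or []:
        if model in mapping:
            return list(mapping[model])
    return []


settings = {
    "fallbacks": [{"gpt-3.5-turbo": ["gpt-4"]}],
    "context_window_fallbacks": [{"gpt-3.5-turbo": ["gpt-4-large-context"]}],
    "content_policy_fallbacks": [{"gpt-3.5-turbo": ["claude-opus"]}],
}

print(fallback_chain("gpt-3.5-turbo", TimeoutError(), **settings))
print(fallback_chain("gpt-3.5-turbo", ContextWindowExceededError(), **settings))
print(fallback_chain("gpt-3.5-turbo", ContentPolicyViolationError(), **settings))
```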

enable_pre_call_checks: true adds a pre-flight check that filters out deployments whose context window is smaller than the incoming message - before the API call is even made.

Cost Routing: Route to the Cheapest Model Automatically

Cost-based routing sends each request to the deployment with the lowest cost per token at that moment. It's one of the simplest levers in a broader toolkit for cutting your LLM API costs.

The formula is simple:

cost = (input_tokens × input_cost_per_token) + (output_tokens × output_cost_per_token)

LiteLLM looks up pricing in its built-in model_prices_and_context_window.json. A deployment that isn't in that map is assigned a default cost of $1, so it gets deprioritized automatically.
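
Here's the formula worked through for a single request of 1,000 input tokens and 500 output tokens. The per-token prices are the illustrative ones used in this guide, not live pricing, and request_cost is a hypothetical helper.

```python
def request_cost(input_tokens, output_tokens,
                 input_cost_per_token, output_cost_per_token):
    """Cost of one request in USD, using the formula above."""
    return (input_tokens * input_cost_per_token
            + output_tokens * output_cost_per_token)


# Illustrative prices: (input_cost_per_token, output_cost_per_token).
PRICES = {
    "gpt-4o": (0.000005, 0.000015),
    "claude-haiku": (0.00000025, 0.00000125),
}

costs = {name: request_cost(1_000, 500, *price) for name, price in PRICES.items()}
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    print(f"{name}: ${cost:.6f}")
print("cheapest:", cheapest)
```

At these prices the request costs about $0.0125 on gpt-4o and about $0.000875 on claude-haiku, so the cheaper deployment wins by roughly 14×.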

Code example with custom pricing

import os
import asyncio
from litellm import Router

model_list = [
    {
        "model_name": "my-model",
        "litellm_params": {
            "model": "gpt-4o",
            "api_key": os.getenv("OPENAI_API_KEY"),
            "input_cost_per_token": 0.000005,    # $5 per 1M input tokens
            "output_cost_per_token": 0.000015,   # $15 per 1M output tokens
        },
        "model_info": {"id": "openai-gpt4o"},
    },
    {
        "model_name": "my-model",
        "litellm_params": {
            "model": "anthropic/claude-3-haiku-20240307",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
            "input_cost_per_token": 0.00000025,  # $0.25 per 1M input tokens
            "output_cost_per_token": 0.00000125, # $1.25 per 1M output tokens
        },
        "model_info": {"id": "claude-haiku"},
    },
    {
        "model_name": "my-model",
        "litellm_params": {
            "model": "gemini/gemini-1.5-flash",
            "api_key": os.getenv("GEMINI_API_KEY"),
            # Uses LiteLLM's built-in pricing from model_prices_and_context_window.json
        },
        "model_info": {"id": "gemini-flash"},
    },
]

router = Router(model_list=model_list, routing_strategy="cost-based-routing")

async def main():
    response = await router.acompletion(
        model="my-model",
        messages=[{"role": "user", "content": "Summarize this document."}],
    )
    print(response)
    print("Picked deployment:", response._hidden_params["model_id"])
    # Expect claude-haiku or gemini-flash - whichever is cheapest

asyncio.run(main())

Practical tip: This strategy shines when you're mixing GPT-4o (expensive) with Claude Haiku (~20× cheaper on input tokens) or Gemini Flash. Routine queries - summarization, classification, simple Q&A - can be routed to the cheaper model automatically, without changing a line of application code. The flip side: reserve your most capable tier for the queries that truly need it - see when routing to reasoning models for hard tasks actually pays off.

Note that cost-based-routing is async only - use router.acompletion(), not router.completion().

Routing Groups: Apply Different Strategies Per Model

Routing groups let you assign different LiteLLM routing strategies to different model groups within the same router - without spinning up a second instance.

The classic use case: latency-based routing for your expensive gpt-4o calls, but plain simple-shuffle for cheap utility models. (For routing logic driven by query signals rather than deployment health, see advanced signal-driven routing patterns.)

Rules:

Each model_name can belong to at most one group. Overlap raises a ValueError at init.
Models not in any group use the top-level routing_strategy (the implicit default group).
The name "default" is reserved.
Groups can be updated at runtime via Router.update_settings(routing_groups=[...]).
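
To make those rules concrete, here's a plain-Python sketch that enforces them on a routing_groups list. It illustrates the rules rather than the router's own validation. validate_routing_groups and strategy_for are hypothetical helpers, and raising ValueError for the reserved name is an assumption (the rules above only specify ValueError for overlap).

```python
def validate_routing_groups(routing_groups):
    """Map each model_name to its group, enforcing the rules listed above."""
    model_to_group = {}
    for group in routing_groups:
        name = group["group_name"]
        if name == "default":
            # Assumption: the reserved name is rejected with ValueError.
            raise ValueError('group name "default" is reserved')
        for model in group["models"]:
            if model in model_to_group:
                raise ValueError(
                    f"{model!r} is in both {model_to_group[model]!r} and {name!r}"
                )
            model_to_group[model] = name
    return model_to_group


def strategy_for(model, routing_groups, default_strategy="simple-shuffle"):
    """Return the group's strategy, or the top-level default for ungrouped models."""
    for group in routing_groups:
        if model in group["models"]:
            return group["routing_strategy"]
    return default_strategy


groups = [
    {
        "group_name": "latency-sensitive",
        "models": ["gpt-4o"],
        "routing_strategy": "latency-based-routing",
    },
]

print(validate_routing_groups(groups))
print(strategy_for("gpt-4o", groups), strategy_for("gpt-4o-mini", groups))
```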

Python SDK example

from litellm import Router

router = Router(
    model_list=[
        {"model_name": "gpt-4o", "litellm_params": {"model": "openai/gpt-4o", "api_key": "..."}},
        {"model_name": "gpt-4o", "litellm_params": {"model": "azure/gpt-4o", "api_base": "...", "api_key": "..."}},
        {"model_name": "gpt-4o-mini", "litellm_params": {"model": "openai/gpt-4o-mini", "api_key": "..."}},
    ],
    routing_strategy="simple-shuffle",   # default for ungrouped models
    routing_groups=[
        {
            "group_name": "latency-sensitive",
            "models": ["gpt-4o"],
            "routing_strategy": "latency-based-routing",
            "routing_strategy_args": {"ttl": 3600},
        },
    ],
)

Proxy YAML example (a separate, two-group configuration)

router_settings:
  routing_strategy: simple-shuffle   # default for ungrouped models
  routing_groups:
    - group_name: hot-path
      models: [gpt-4o, claude-sonnet]
      routing_strategy: latency-based-routing
      routing_strategy_args:
        ttl: 60           # short window - react quickly to latency spikes

    - group_name: batch
      models: [gpt-4o-mini, llama-70b]
      routing_strategy: usage-based-routing-v2
      routing_strategy_args:
        rpm: 10000

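Result: in the Python SDK example, gpt-4o uses latency-based routing across its OpenAI and Azure deployments, and gpt-4o-mini uses the top-level simple-shuffle. In the YAML config, the hot-path group uses latency-based routing with a 60-second window, and the batch group uses usage-based-routing-v2.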

Production Tips: Cooldowns, Retries & Redis

Cooldowns

Cooldowns apply to individual deployments, not entire model groups. When a deployment crosses the failure threshold, it's temporarily removed from the pool while healthy alternatives keep serving requests.

Default values (from constants.py):

allowed_fails: 3 failures per minute before cooldown
cooldown_time: 5 seconds

Automatic cooldown triggers:

Condition	Trigger	Duration
Rate limit (429)	Immediate	5s default
High failure rate	>50% failures in current minute	5s default
Non-retryable errors	401, 404, 408	5s default

router = Router(
    model_list=model_list,
    allowed_fails=3,       # cooldown after 3 failures/min
    cooldown_time=30,      # stay cooled down for 30 seconds
)
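
The trigger table can be read as a single decision function. Here's a simplified sketch of that logic. should_cooldown is a hypothetical helper, not LiteLLM's code, and it assumes a cooldown starts once failures in the current minute exceed allowed_fails.

```python
NON_RETRYABLE_STATUS = {401, 404, 408}


def should_cooldown(status_code, fails_this_minute, requests_this_minute,
                    allowed_fails=3):
    """Decide whether a deployment should be cooled down after a failed call."""
    if status_code == 429:
        return True  # rate limited: cool down immediately
    if status_code in NON_RETRYABLE_STATUS:
        return True  # retrying this deployment will not help
    if requests_this_minute and fails_this_minute / requests_this_minute > 0.5:
        return True  # more than half of this minute's calls failed
    # Assumption: the threshold is crossed when failures exceed allowed_fails.
    return fails_this_minute > allowed_fails


print(should_cooldown(429, fails_this_minute=1, requests_this_minute=50))
print(should_cooldown(500, fails_this_minute=1, requests_this_minute=50))
print(should_cooldown(500, fails_this_minute=6, requests_this_minute=10))
```

Because the decision is made per deployment, a single 429 from one endpoint never removes its healthy peers from the pool.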

Retries

LiteLLM supports both fixed and exponential backoff retries:

router = Router(
    model_list=model_list,
    num_retries=3,
    retry_after=5,   # wait at least 5s before retrying (for RateLimitErrors, exponential backoff applies)
)
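
To see what exponential backoff does to wait times, here's a generic sketch. backoff_delays, factor, and max_delay are hypothetical, and LiteLLM's actual schedule may differ. The point is only that each retry waits longer than the last, up to a cap.

```python
def backoff_delays(num_retries, retry_after=5.0, factor=2.0, max_delay=60.0):
    """Delay before each retry: starts at retry_after, grows by factor, capped at max_delay."""
    return [
        min(max_delay, retry_after * factor ** attempt)
        for attempt in range(num_retries)
    ]


print(backoff_delays(3))   # three retries, starting at 5 seconds
print(backoff_delays(5))   # later retries hit the 60-second cap
```

With num_retries=3 and retry_after=5, the waits are 5, 10, and 20 seconds. That gives a rate-limited deployment progressively more room to recover.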

For fine-grained control, use RetryPolicy to set different retry counts per error type:

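from litellm import Router
from litellm.router import RetryPolicy

retry_policy = RetryPolicy(
    RateLimitErrorRetries=3,
    ContentPolicyViolationErrorRetries=3,
    AuthenticationErrorRetries=0,   # don't retry auth errors - they won't self-heal
    TimeoutErrorRetries=2,
)

# Pass the policy to the router; on its own the object has no effect
router = Router(model_list=model_list, retry_policy=retry_policy)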

Redis for distributed deployments

In a multi-instance deployment (Kubernetes, auto-scaling), you need Redis to share cooldown state and TPM/RPM tracking across all router instances:

import os

router = Router(
    model_list=model_list,
    routing_strategy="simple-shuffle",
    redis_host=os.environ["REDIS_HOST"],
    redis_password=os.environ["REDIS_PASSWORD"],
    redis_port=int(os.environ["REDIS_PORT"]),   # env vars are strings; the port is an integer
    cache_responses=True,                        # enable response caching
)

Or in your proxy config.yaml:

router_settings:
  routing_strategy: simple-shuffle
  redis_host: <your-redis-host>
  redis_password: <your-redis-password>
  redis_port: 6379
  enable_pre_call_checks: true

Max parallel requests

Use max_parallel_requests to cap concurrent calls per deployment - critical for preventing a single deployment from getting hammered during traffic spikes:

# Per deployment
model_list = [{
    "model_name": "gpt-4",
    "litellm_params": {
        "model": "azure/gpt-4",
        "api_key": os.getenv("AZURE_API_KEY"),
        "max_parallel_requests": 10,   # max 10 concurrent calls to this deployment
    },
}]

# Or set a global default
router = Router(model_list=model_list, default_max_parallel_requests=20)
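
Conceptually, a per-deployment concurrency cap behaves like a semaphore around each call. The sketch below shows that idea with asyncio alone. It is not the router's implementation, and run_with_limit and fake_call are hypothetical names.

```python
import asyncio


async def run_with_limit(num_requests: int, max_parallel: int) -> int:
    """Run fake requests under a semaphore; return the peak concurrency observed."""
    semaphore = asyncio.Semaphore(max_parallel)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)   # stands in for the provider call
            in_flight -= 1

    await asyncio.gather(*(fake_call() for _ in range(num_requests)))
    return peak


peak = asyncio.run(run_with_limit(num_requests=25, max_parallel=10))
print("peak concurrency:", peak)
```

Even with 25 requests arriving at once, no more than 10 are in flight at any moment. The rest wait their turn instead of hammering the deployment.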

Pre-call checks

enable_pre_call_checks=True filters out deployments before the API call based on:

Context window - skips deployments whose limit is smaller than the incoming message
Region - skips deployments outside a required region (e.g., EU-only)

This pairs with context_window_fallbacks to route long prompts to larger-context models automatically, with zero provider API calls wasted on requests that would fail anyway.
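
The context-window check amounts to a filter that runs before any network call. Here's a rough sketch, assuming a crude estimate of about 4 characters per token instead of a real tokenizer. estimate_tokens, filter_by_context_window, and the deployment records are all hypothetical.

```python
def estimate_tokens(messages):
    """Very rough estimate: ~4 characters per token (an assumption, not a tokenizer)."""
    return sum(len(message["content"]) for message in messages) // 4


def filter_by_context_window(deployments, messages):
    """Keep only the deployments whose input limit can fit the prompt."""
    needed = estimate_tokens(messages)
    return [d["id"] for d in deployments if d["max_input_tokens"] >= needed]


# Hypothetical deployments with their input-token limits.
deployments = [
    {"id": "small-4k", "max_input_tokens": 4_096},
    {"id": "large-128k", "max_input_tokens": 128_000},
]

long_prompt = [{"role": "user", "content": "x" * 40_000}]   # roughly 10,000 tokens
short_prompt = [{"role": "user", "content": "Hello!"}]

print(filter_by_context_window(deployments, long_prompt))
print(filter_by_context_window(deployments, short_prompt))
```

The long prompt never reaches the 4k deployment, so there is no failed call to pay for or wait on.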

Key Takeaways

simple-shuffle is the right default for most production workloads - lowest latency overhead, no Redis dependency.
Model pools group multiple deployments under one alias; use weight, rpm/tpm, and order to control traffic distribution and priority.
Three fallback types cover different failure modes: fallbacks (general), content_policy_fallbacks, and context_window_fallbacks - all models referenced must be in model_list.
cost-based-routing is async-only and automatically picks the cheapest deployment - ideal when mixing GPT-4o with Claude Haiku or Gemini Flash.
Routing groups let you assign per-model strategies (e.g., latency-based for gpt-4o, simple-shuffle for gpt-4o-mini) without running multiple routers.
Cooldowns isolate individual deployments, not entire model groups - healthy peers keep serving while a failing one recovers.
Redis is required for multi-instance deployments to share cooldown state and rate limit tracking across router instances.

FAQ

What is the LiteLLM Router and what does it do?

The LiteLLM Router is a Python class (from litellm import Router) that load-balances requests across multiple LLM deployments. It handles retries, cooldowns, fallbacks, and cost tracking automatically. You define a model_list with your deployments, initialize the router, and call router.completion() or router.acompletion() with the same model and messages arguments you'd pass to the OpenAI SDK.

What is the difference between fallbacks, context_window_fallbacks, and content_policy_fallbacks?

They target different error types. fallbacks catches general errors like 429 rate limits and 500 server errors. context_window_fallbacks triggers on ContextWindowExceededError, when the input exceeds a model's token limit; pair it with enable_pre_call_checks: true so deployments that are too small are filtered out before the call is made. content_policy_fallbacks triggers on ContentPolicyViolationError, letting you route flagged content to a different provider (e.g., from Azure OpenAI to Anthropic). You can configure all three simultaneously.

How does LiteLLM cost routing work?

With routing_strategy="cost-based-routing", the router calculates (input_tokens × input_cost_per_token) + (output_tokens × output_cost_per_token) for each healthy deployment and picks the cheapest one. Pricing comes from LiteLLM's built-in model_prices_and_context_window.json, or you can override it per deployment with input_cost_per_token and output_cost_per_token in litellm_params. Deployments not found in the price map are assigned a cost of $1 and deprioritized.

When should I use routing groups?

Use routing groups when different models in the same router need different strategies. For example: gpt-4o with latency-based-routing (because users are waiting) and gpt-4o-mini with simple-shuffle (because it's a background batch job). Without routing groups, you'd need to run two separate router instances. Each model_name can belong to at most one group - overlap raises a ValueError at initialization.

Do I need Redis for the LiteLLM Router?

Not for single-instance deployments. Redis becomes necessary when you're running multiple LiteLLM Proxy instances (e.g., in Kubernetes with horizontal scaling) and need to share cooldown state, TPM/RPM tracking, and response caching across all instances. Without Redis in that scenario, each instance tracks state independently and you'll get inconsistent rate limit behavior.
