LiteLLM Router Setup: Fallback, Cost Routing & Model Pools

A step-by-step walkthrough of LiteLLM's Router class — defining model pools, configuring multi-provider fallbacks, enabling cost-based routing, and adding task-specific pools for math, code, and creative tasks.

Shubham Yadav

Machine Learning Researcher

June 8, 2026 · 12 min read

LiteLLM's Router is the fastest way to get multi-provider LLM routing into production. It handles provider abstraction, load balancing, automatic fallback, and cost-based routing behind a unified API — without requiring custom infrastructure.

This guide covers the complete setup:

  • Installation and basic Router configuration
  • Defining model pools for tier-based and task-specific routing
  • Multi-provider fallback — automatic retries on provider failure or rate limits
  • Cost-based and latency-based routing strategies
  • Task-specific pools for math, code, and creative workloads
  • Logging and cost tracking via LiteLLM callbacks
  • A minimal production configuration you can deploy immediately

1. Why Use LiteLLM Router: Provider Abstraction, Fallback, and Cost Routing

LiteLLM solves the core problem of production LLM routing: provider lock-in. You write one call signature, and LiteLLM handles the translation to OpenAI, Anthropic, Google, Azure, Mistral, or any other provider. Swapping providers or adding a new one doesn't require changing application code.

The Router layer adds:

  • Load balancing across multiple deployments of the same model
  • Automatic fallback when a provider fails, rate-limits, or times out
  • Built-in routing strategies for cost and latency optimization
  • Context window fallbacks — automatically escalates to a larger-context model if the prompt is too long

LiteLLM is maintained by a team that tracks provider API changes, which matters when you're running across multiple providers in production.

2. How to Install and Configure LiteLLM Router

pip install litellm

For production use, install the extras that enable Redis-backed rate limit tracking and caching:

pip install 'litellm[proxy]' redis

The core object is Router. At minimum it takes a model_list — a list of model configurations that tells the router what's available, under what alias, and with what credentials:

import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "claude-sonnet",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ]
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

The model_name field is your internal alias — what your application code refers to. The litellm_params.model is the actual provider model string. This separation is what lets you change providers without changing application code.
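Every entry repeats the same alias, model, and key shape, so a small helper can cut the boilerplate. The sketch below is one possible way to do it. `deployment`, `key_env`, and the `env` parameter are names made up for this example and are not part of LiteLLM.

```python
import os
from typing import Mapping


def deployment(alias: str, model: str, key_env: str,
               env: Mapping[str, str] = os.environ) -> dict:
    """Build one model_list entry; `alias` is the name application code calls."""
    api_key = env.get(key_env)
    if not api_key:
        # Fail at startup instead of on the first request.
        raise RuntimeError(f"environment variable {key_env} is not set")
    return {
        "model_name": alias,
        "litellm_params": {"model": model, "api_key": api_key},
    }


# Demo with a placeholder key. Swapping providers only changes the second
# argument; callers keep using the "chat" alias.
demo_env = {"OPENAI_API_KEY": "sk-placeholder"}
model_list = [deployment("chat", "gpt-4o", "OPENAI_API_KEY", env=demo_env)]
print(model_list[0]["model_name"], "->", model_list[0]["litellm_params"]["model"])
```

The resulting list can be passed as `model_list` to `Router` exactly like the hand-written version.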

3. How to Define LiteLLM Model Pools

A model pool is a group of deployments behind the same alias. The router treats them as interchangeable and distributes load according to your routing strategy. This is the foundation for fallback and cost routing.

The most common pool structure groups deployments by capability tier:

router = Router(
    model_list=[
        # Fast, cheap pool
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "anthropic/claude-haiku-4-5-20251001",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        # Capable, expensive pool
        {
            "model_name": "capable",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        {
            "model_name": "capable",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    routing_strategy="cost-based-routing",
)

Both fast deployments compete for requests routed to the fast alias. The router picks between them based on the active routing strategy. Your application code calls router.completion(model="fast", ...) and the router handles the rest.
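To see which deployments actually sit behind each alias, it can help to group the list before handing it to the router. This is a plain-Python sketch for inspection only. `pools_by_alias` is a hypothetical helper, and the API keys are omitted because only the structure matters here.

```python
from collections import defaultdict


def pools_by_alias(model_list: list) -> dict:
    """Map each alias to the provider model strings that share it."""
    pools = defaultdict(list)
    for entry in model_list:
        pools[entry["model_name"]].append(entry["litellm_params"]["model"])
    return dict(pools)


model_list = [
    {"model_name": "fast", "litellm_params": {"model": "gpt-4o-mini"}},
    {"model_name": "fast", "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001"}},
    {"model_name": "capable", "litellm_params": {"model": "gpt-4o"}},
    {"model_name": "capable", "litellm_params": {"model": "anthropic/claude-sonnet-4-6"}},
]

for alias, models in pools_by_alias(model_list).items():
    print(f"{alias}: {len(models)} deployments -> {models}")
```

A pool with a single deployment has nothing to balance across, which is worth catching before it reaches production.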

4. Configuring Multi-Provider Fallback and Context Window Escalation

Fallbacks are configured separately from the model list and operate at the alias level. When a request to one alias fails — provider outage, rate limit, timeout — the router automatically retries against the fallback alias:

router = Router(
    model_list=[...],
    fallbacks=[
        {"fast": ["capable"]},       # if fast fails, try capable
        {"capable": ["fast"]},        # if capable fails, try fast
    ],
    context_window_fallbacks=[
        {"capable": ["capable-32k"]}, # if context is too long, try larger window
    ],
    num_retries=2,
    retry_after=0.5,  # seconds between retries
    timeout=30,
)

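The context_window_fallbacks key handles a specific failure mode worth calling out: prompt too long for the model's context window. Rather than returning an error, the router automatically escalates to a model with a larger context. This is particularly useful for document processing pipelines where input length varies unpredictably. The capable-32k alias in the example is a placeholder: it has to exist in model_list as its own pool, backed by a model with a larger context window than the capable deployments.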
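Because fallbacks refer to aliases by name, a typo or a missing pool is easy to overlook. The check below is a minimal sketch that could run at startup, before the Router is built. `check_fallbacks` is a hypothetical helper, not a LiteLLM function, and it only inspects the plain dict and list structures shown above.

```python
def check_fallbacks(model_list, fallbacks, context_window_fallbacks=()):
    """Return a list of human-readable problems in the fallback configuration."""
    known = {entry["model_name"] for entry in model_list}
    problems = []
    groups = (
        ("fallbacks", fallbacks),
        ("context_window_fallbacks", context_window_fallbacks),
    )
    for label, rules in groups:
        for rule in rules:
            for source, targets in rule.items():
                if source not in known:
                    problems.append(f"{label}: unknown source alias {source!r}")
                for target in targets:
                    if target not in known:
                        problems.append(f"{label}: {source!r} -> unknown alias {target!r}")
                    elif target == source:
                        problems.append(f"{label}: {source!r} falls back to itself")
    return problems


model_list = [
    {"model_name": "fast", "litellm_params": {"model": "gpt-4o-mini"}},
    {"model_name": "capable", "litellm_params": {"model": "gpt-4o"}},
]
issues = check_fallbacks(
    model_list,
    fallbacks=[{"fast": ["capable"]}, {"capable": ["fast"]}],
    context_window_fallbacks=[{"capable": ["capable-32k"]}],
)
for issue in issues:
    print(issue)
```

With this sample input the check reports that capable-32k is not defined, which is the kind of gap that otherwise only surfaces when a long prompt arrives.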

For rate limit handling, LiteLLM puts a deployment on cooldown once it exceeds the allowed number of failures, then routes around it until the cooldown expires:

router = Router(
    model_list=[...],
    cooldown_time=60,  # seconds to avoid a deployment after rate limit
    allowed_fails=2,   # failures before a deployment is cooled down
)
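The interaction between the two settings is easier to reason about with a toy model. The class below is an illustration of the idea only, not LiteLLM's implementation. It assumes `allowed_fails` is the number of failures tolerated before the next one triggers the cooldown, and LiteLLM's own bookkeeping (for example, how it windows failures over time) may differ. `CooldownTracker` and the injectable `clock` are hypothetical.

```python
import time


class CooldownTracker:
    """Toy model of 'skip a deployment for a while after repeated failures'."""

    def __init__(self, allowed_fails=2, cooldown_time=60.0, clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.clock = clock
        self._fails = {}         # deployment id -> failures since last success/cooldown
        self._cooled_until = {}  # deployment id -> time when it becomes usable again

    def record_failure(self, deployment):
        self._fails[deployment] = self._fails.get(deployment, 0) + 1
        if self._fails[deployment] > self.allowed_fails:
            self._cooled_until[deployment] = self.clock() + self.cooldown_time
            self._fails[deployment] = 0

    def record_success(self, deployment):
        self._fails[deployment] = 0

    def available(self, deployments):
        now = self.clock()
        return [d for d in deployments if self._cooled_until.get(d, 0.0) <= now]


# Demo with a fake clock so the example does not need to sleep.
fake_now = [0.0]
tracker = CooldownTracker(allowed_fails=2, cooldown_time=60, clock=lambda: fake_now[0])
for _ in range(3):
    tracker.record_failure("openai/gpt-4o-mini")
print(tracker.available(["openai/gpt-4o-mini", "anthropic/haiku"]))
fake_now[0] = 61.0
print(tracker.available(["openai/gpt-4o-mini", "anthropic/haiku"]))
```

A short cooldown_time recovers quickly from transient errors but keeps hitting a rate-limited provider. A long one does the opposite.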

5. LiteLLM Routing Strategies: Cost-Based, Latency-Based, and Round-Robin

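LiteLLM supports three primary routing strategies. Choosing the right one per pool is more effective than applying a single strategy globally. Note that routing_strategy is set on the Router instance, so mixing strategies means running more than one Router: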

| Routing strategy | How it works | Best for |
| --- | --- | --- |
| cost-based-routing | Routes to the cheapest available deployment in the pool | High-volume pools where spend matters |
| latency-based-routing | Routes to the lowest p50 latency deployment | Latency-sensitive user-facing requests |
| simple-shuffle (default) | Picks a deployment at random, weighted by rpm/tpm if those limits are set | Equal-capability pools, even load distribution |

With routing_strategy="cost-based-routing", set explicit cost overrides if the built-in LiteLLM pricing table doesn't match your negotiated rates:

router = Router(
    model_list=[
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
                "input_cost_per_token": 0.00000015,
                "output_cost_per_token": 0.0000006,
            },
        },
        {
            "model_name": "fast",
            "litellm_params": {
                "model": "anthropic/claude-haiku-4-5-20251001",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "input_cost_per_token": 0.00000025,
                "output_cost_per_token": 0.00000125,
            },
        },
    ],
    routing_strategy="cost-based-routing",
)
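To make the override values concrete, the arithmetic below prices one hypothetical request of 1,000 input tokens and 500 output tokens against both deployments. It only shows the per-token calculation. How the router estimates token counts when ranking deployments is not covered here, and `request_cost` is a made-up helper.

```python
def request_cost(input_tokens, output_tokens,
                 input_cost_per_token, output_cost_per_token):
    """Cost in USD of a single request at the given per-token prices."""
    return input_tokens * input_cost_per_token + output_tokens * output_cost_per_token


# (input_cost_per_token, output_cost_per_token) copied from the overrides above
overrides = {
    "gpt-4o-mini": (0.00000015, 0.0000006),
    "anthropic/claude-haiku-4-5-20251001": (0.00000025, 0.00000125),
}

costs = {
    model: request_cost(1000, 500, *prices)
    for model, prices in overrides.items()
}
cheapest = min(costs, key=costs.get)

for model, cost in costs.items():
    print(f"{model}: ${cost:.6f}")
print("cheapest:", cheapest)
```

At these rates the request costs roughly $0.00045 on gpt-4o-mini and $0.000875 on the Haiku deployment, so the first is the cheaper target for this token mix.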

A practical default: use latency-based routing on the fast pool (cheap models already vary in response time) and cost-based routing on the capable pool (expensive models have more pricing variation worth optimizing).
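One way to get that split is a separate Router per pool, since routing_strategy is a constructor argument. The sketch below only prepares the constructor arguments and stops short of creating routers. `STRATEGY_BY_POOL` and `router_kwargs_by_pool` are hypothetical names, and API keys are left out for brevity.

```python
STRATEGY_BY_POOL = {
    "fast": "latency-based-routing",
    "capable": "cost-based-routing",
}


def router_kwargs_by_pool(model_list, strategy_by_pool,
                          default_strategy="simple-shuffle"):
    """Split one model_list into per-alias keyword arguments for Router."""
    kwargs = {}
    for entry in model_list:
        alias = entry["model_name"]
        pool = kwargs.setdefault(alias, {
            "model_list": [],
            "routing_strategy": strategy_by_pool.get(alias, default_strategy),
        })
        pool["model_list"].append(entry)
    return kwargs


model_list = [
    {"model_name": "fast", "litellm_params": {"model": "gpt-4o-mini"}},
    {"model_name": "fast", "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001"}},
    {"model_name": "capable", "litellm_params": {"model": "gpt-4o"}},
]
per_pool = router_kwargs_by_pool(model_list, STRATEGY_BY_POOL)
for alias, kw in per_pool.items():
    print(alias, kw["routing_strategy"], len(kw["model_list"]))

# With litellm installed, the routers could then be built as:
#   routers = {alias: Router(**kw) for alias, kw in per_pool.items()}
```

The trade-off is that each router only knows its own aliases, so fallbacks between pools would have to be handled in application code.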

6. Task-Specific Model Pools: Math, Code, and Creative Routing

Tier-based routing handles complexity. Task-specific pools go further — they route by what the request is doing, not just how hard it is. A math reasoning task and a creative writing task of similar complexity benefit from different models.

Define pools by task type:

router = Router(
    model_list=[
        # Math and reasoning pool
        {
            "model_name": "math",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        {
            "model_name": "math",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        # Code generation pool
        {
            "model_name": "code",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
        # Creative and long-form writing pool
        {
            "model_name": "creative",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
        # Default fallback pool
        {
            "model_name": "default",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[
        {"math": ["default"]},
        {"code": ["default"]},
        {"creative": ["default"]},
    ],
)

The router itself doesn't know which pool to use — that's your classification logic. A two-layer classifier works well: fast keyword heuristics first, LLM classifier only for ambiguous cases:

def classify_task(query: str) -> str:
    query_lower = query.lower()

    # Fast heuristic layer — zero cost
    math_signals = ["calculate", "solve", "equation", "proof", "derivative", "integral", "%", "how many"]
    code_signals = ["write a function", "debug", "refactor", "implement", "sql", "regex", "script"]
    creative_signals = ["write a story", "poem", "essay", "blog post", "marketing copy", "tone"]

    if any(s in query_lower for s in math_signals):
        return "math"
    if any(s in query_lower for s in code_signals):
        return "code"
    if any(s in query_lower for s in creative_signals):
        return "creative"

    # LLM classifier only fires on ambiguous cases
    return classify_with_llm(query)


def classify_with_llm(query: str) -> str:
    response = router.completion(
        model="default",
        messages=[
            {
                "role": "system",
                "content": "Classify the user's request as one of: math, code, creative, default. Respond with one word only.",
            },
            {"role": "user", "content": query},
        ],
        max_tokens=5,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in {"math", "code", "creative"} else "default"


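async def complete(query: str, history: list) -> str:
    pool = classify_task(query)
    response = await router.acompletion(
        model=pool,
        messages=history + [{"role": "user", "content": query}],
    )
    # Return the text so the result matches the declared str return type
    return response.choices[0].message.content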

The heuristic layer runs at zero cost. The LLM classifier only fires on ambiguous queries — and uses the default pool (cheapest model), so classification cost is negligible. Expensive models only run on requests specifically routed to them.
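Plain substring checks can misfire: "script" matches inside "description" and "tone" inside "milestone". A possible refinement, sketched below with the same signal lists, is to match whole words only. `classify_by_keywords` and `_compile` are hypothetical names. The function returns None when no signal matches, which is where the LLM classifier would take over.

```python
import re

SIGNALS = {
    "math": ["calculate", "solve", "equation", "proof", "derivative", "integral", "%", "how many"],
    "code": ["write a function", "debug", "refactor", "implement", "sql", "regex", "script"],
    "creative": ["write a story", "poem", "essay", "blog post", "marketing copy", "tone"],
}


def _compile(signal):
    # Require a word boundary only on sides that start or end with a letter
    # or digit, so a symbol such as "%" still matches right after a number.
    left = r"(?<!\w)" if signal[0].isalnum() else ""
    right = r"(?!\w)" if signal[-1].isalnum() else ""
    return re.compile(left + re.escape(signal) + right)


PATTERNS = {task: [_compile(s) for s in signals] for task, signals in SIGNALS.items()}


def classify_by_keywords(query):
    """Return 'math', 'code', or 'creative', or None if no signal matches."""
    text = query.lower()
    for task, patterns in PATTERNS.items():  # dict order keeps math > code > creative
        if any(p.search(text) for p in patterns):
            return task
    return None


for q in ["What is 15% of 80?", "Debug this script", "Update the product description"]:
    print(repr(q), "->", classify_by_keywords(q))
```

The cost of whole-word matching is that inflected forms such as "debugging" or "solved" no longer hit the heuristics and fall through to the LLM layer. Adding those variants to the lists is a reasonable follow-up if that happens often.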

7. LiteLLM Cost Tracking and Logging Setup

LiteLLM exposes a callback system for logging. The key fields for cost monitoring are the model, the pool alias, input and output token counts, cost, and latency:

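import litellm

def log_completion(kwargs, completion_response, start_time, end_time):
    # On failure the response can be missing or carry no usage data
    usage = getattr(completion_response, "usage", None)
    try:
        cost = litellm.completion_cost(completion_response=completion_response)
    except Exception:
        cost = None  # nothing to price, for example after a failed call

    print({
        "model": getattr(completion_response, "model", None),
        # Intended to be the alias you called with. Some versions may report the
        # resolved deployment model here instead, so verify against your install.
        "pool": kwargs.get("model"),
        "input_tokens": getattr(usage, "prompt_tokens", None),
        "output_tokens": getattr(usage, "completion_tokens", None),
        "cost_usd": cost,
        "latency_ms": (end_time - start_time).total_seconds() * 1000,
        "fallback_triggered": (kwargs.get("metadata") or {}).get("fallback_triggered", False),
    })

litellm.success_callback = [log_completion]
litellm.failure_callback = [log_completion]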

Route these events to your logging pipeline — Postgres, ClickHouse, Datadog, whatever you're already using. The pool field (the alias) is what lets you break down cost per routing tier, which is the key metric for evaluating whether routing is working.
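Once the events are stored, the per-tier breakdown is a simple aggregation. The sketch below assumes each event is a dict with the pool and cost_usd keys printed by the callback. `cost_by_pool` is a hypothetical helper and the sample numbers are invented for illustration.

```python
from collections import defaultdict


def cost_by_pool(events):
    """Sum request counts and cost per pool alias."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0})
    for event in events:
        bucket = totals[event["pool"]]
        bucket["requests"] += 1
        # Failed calls may have no cost; count them as zero spend.
        bucket["cost_usd"] += event.get("cost_usd") or 0.0
    return dict(totals)


# Invented sample events, shaped like the callback output above
events = [
    {"pool": "fast", "cost_usd": 0.0004},
    {"pool": "fast", "cost_usd": 0.0006},
    {"pool": "capable", "cost_usd": 0.0120},
    {"pool": "capable", "cost_usd": None},  # a failed request
]

for pool, stats in cost_by_pool(events).items():
    print(f"{pool}: {stats['requests']} requests, ${stats['cost_usd']:.4f}")
```

In a real pipeline the same grouping would usually be a SQL GROUP BY on the pool column rather than Python.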

For Redis-backed rate limit tracking across multiple workers — necessary if you're running more than one process:

router = Router(
    model_list=[...],
    redis_host=os.environ["REDIS_HOST"],
    redis_port=6379,
    redis_password=os.environ["REDIS_PASSWORD"],
)

Without Redis, each worker maintains its own rate limit state. For single-process development it doesn't matter; for production with multiple workers it does.
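To keep one code path for both cases, the Redis arguments can be built conditionally and only passed when a host is configured. This is a sketch under assumptions: `redis_kwargs` is a hypothetical helper, and the REDIS_PORT variable is an addition that does not appear in the configuration above.

```python
import os
from typing import Mapping


def redis_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Return Router keyword arguments for Redis, or {} if no host is set."""
    host = env.get("REDIS_HOST")
    if not host:
        return {}  # single-process development: in-memory state is enough
    kwargs = {
        "redis_host": host,
        "redis_port": int(env.get("REDIS_PORT", "6379")),  # REDIS_PORT is assumed
    }
    password = env.get("REDIS_PASSWORD")
    if password:
        kwargs["redis_password"] = password
    return kwargs


print(redis_kwargs({}))
print(redis_kwargs({"REDIS_HOST": "redis.internal", "REDIS_PASSWORD": "secret"}))

# With litellm installed:
#   router = Router(model_list=[...], **redis_kwargs())
```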

8. Complete LiteLLM Production Router Configuration

A full production setup with tiered pools, task-specific routing, fallbacks, cost-based strategy, Redis, and logging:

import os
import litellm
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "fast",     "litellm_params": {"model": "gpt-4o-mini",                      "api_key": os.environ["OPENAI_API_KEY"]}},
        {"model_name": "fast",     "litellm_params": {"model": "anthropic/claude-haiku-4-5-20251001", "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "capable",  "litellm_params": {"model": "gpt-4o",                            "api_key": os.environ["OPENAI_API_KEY"]}},
        {"model_name": "capable",  "litellm_params": {"model": "anthropic/claude-sonnet-4-6",       "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "math",     "litellm_params": {"model": "anthropic/claude-sonnet-4-6",       "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "code",     "litellm_params": {"model": "gpt-4o",                            "api_key": os.environ["OPENAI_API_KEY"]}},
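        {"model_name": "creative", "litellm_params": {"model": "anthropic/claude-sonnet-4-6",       "api_key": os.environ["ANTHROPIC_API_KEY"]}},
        {"model_name": "default",  "litellm_params": {"model": "gpt-4o-mini",                      "api_key": os.environ["OPENAI_API_KEY"]}},  # pool used by the section 6 classifier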
    ],
    routing_strategy="cost-based-routing",
    fallbacks=[
        {"math":     ["capable"]},
        {"code":     ["capable"]},
        {"creative": ["capable"]},
        {"capable":  ["fast"]},
    ],
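    # context_window_fallbacks needs a separate larger-context alias (such as the
    # "capable-32k" placeholder from section 4) defined in model_list. Pointing
    # "capable" at itself would not escalate anything, so it is left disabled here.
    # context_window_fallbacks=[{"capable": ["capable-32k"]}],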
    num_retries=2,
    timeout=30,
    cooldown_time=60,
    allowed_fails=2,
    redis_host=os.environ.get("REDIS_HOST"),
    redis_port=6379,
)

# log_completion is the callback defined in section 7
litellm.success_callback = [log_completion]
litellm.failure_callback = [log_completion]

LiteLLM Router Production Monitoring: What to Track

After the first week in production, these four metrics tell you whether the setup is working:

| Metric | What it reveals | Action if off |
| --- | --- | --- |
| Pool distribution (% per alias) | If >90% hits capable, classifier is too conservative | Lower escalation threshold |
| Fallback rate per pool | Frequent fallbacks = unreliable provider or under-sized pool | Add a second deployment to the pool |
| Cost per pool vs baseline | Which pools are carrying the savings | Tune pool assignments to move more traffic to fast |
| Escalation accuracy | Sample escalated requests: wrong classification or genuinely hard? | Wrong: fix training examples. Hard: threshold is correct |
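The first two metrics can be computed straight from logged events. The sketch below assumes each event is a dict with a pool key and an optional fallback_triggered flag, as in the logging callback. `pool_report` is a hypothetical helper and the 0.9 threshold mirrors the ">90%" rule of thumb in the table.

```python
from collections import Counter


def pool_report(events, capable_alias="capable", threshold=0.9):
    """Return per-pool share and fallback rate, plus a 'too conservative' flag."""
    total = len(events)
    counts = Counter(e["pool"] for e in events)
    fallbacks = Counter(e["pool"] for e in events if e.get("fallback_triggered"))
    report = {
        alias: {"share": n / total, "fallback_rate": fallbacks[alias] / n}
        for alias, n in counts.items()
    }
    capable_share = counts[capable_alias] / total if total else 0.0
    return report, capable_share > threshold


# Invented sample: 19 requests to "capable", 1 to "fast" that needed a fallback
events = [{"pool": "capable"}] * 19 + [{"pool": "fast", "fallback_triggered": True}]
report, too_conservative = pool_report(events)

for alias, stats in report.items():
    print(f"{alias}: share={stats['share']:.0%}, fallback_rate={stats['fallback_rate']:.0%}")
print("classifier too conservative:", too_conservative)
```

Escalation accuracy has no such shortcut. It still requires sampling requests and reading them.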

The configuration above is a starting point. After a week of production data you'll have enough to tune pool assignments, adjust fallback chains, and decide whether task-specific routing is delivering enough accuracy to justify the classification overhead. For most applications it does. For simple workloads with low task diversity, tier-based routing alone is sufficient — and simpler to maintain.

LiteLLM Router Setup Checklist

  • Install litellm and litellm[proxy] redis for production
  • Define at minimum a fast and capable pool with two providers each
  • Configure fallbacks so every pool has at least one fallback alias
  • Set cooldown_time and allowed_fails to route around rate-limited deployments
  • Add context_window_fallbacks for any pipeline handling variable-length inputs
  • Set routing_strategy="cost-based-routing" and add explicit cost overrides for negotiated rates
  • Add task-specific pools (math, code, creative) and a two-layer classifier if task diversity warrants it
  • Wire up logging callbacks and route events to your observability pipeline
  • Configure Redis if running more than one worker process
  • Monitor pool distribution and fallback rate in the first week; tune from real data