Prefill Activation Routing: Predicting Model Failure Early

Most routing systems decide before the model does any work. Activation routing flips that — it reads what happens inside the model during prefill and uses those signals to decide whether to escalate.

Shubham Yadav

Machine Learning Researcher

June 8, 2026 · 10 min read

Most routing systems make their decisions before the model does any work — classifying the query by length, keywords, or embedding similarity, then sending it to a model pool. The routing decision is made entirely from the outside.

Prefill activation routing flips that. Instead of classifying the query before sending it to a model, you send it to a small model first, read what happens inside the model during the prefill phase, and use those internal signals to decide whether that model can handle the request — or whether it needs to escalate.

Quick answer: Prefill activation routing is a technique that runs an incoming query through a small model's forward pass, reads the internal activation states produced during prefill, and uses a lightweight probe to predict whether the small model will succeed or fail, before committing to full generation. Its advantage over embedding-based classifiers should be largest on the ambiguous middle-ground queries where surface-level routing makes the most mistakes.

What Is the Prefill Phase in an LLM?

The prefill phase is the step where an LLM processes your entire input prompt in a single forward pass before generating any output tokens.

When you send a message to an LLM, the model doesn't generate tokens immediately. It first processes every token in your prompt in one pass. This is the prefill phase. Every token gets converted into internal representations called activations — vectors that capture how the model is "understanding" the input at each layer of the network. These activations encode not just the surface meaning of the text but the model's internal state as it builds up a representation of what's being asked.

After prefill, the model enters the decode phase: generating output tokens one at a time, each conditioned on everything that came before. The prefill activations set the foundation for everything that follows.

The key insight behind activation routing: those prefill activations contain information about whether the model is likely to succeed or fail at the task. A query that looks simple on the surface might produce activations indicating the model is uncertain or operating near the edge of its capability. A query that looks complex might produce activations indicating the model has seen this pattern many times and will handle it confidently.

Why Do Query-Level Classifiers Fail on Ambiguous Queries?

Embedding-based and keyword classifiers make routing decisions from surface features of the query — they don't know how hard the query is for the specific model they're routing to.

A keyword-based or embedding-based classifier can easily distinguish "what is 2+2" from "prove the Riemann hypothesis." The hard cases are queries in the middle — the ones that look moderate but are actually at the edge of a small model's capability, or the ones that look complex but are routine for a model with the right training.

Consider: "what were the downstream effects of the 1997 Asian financial crisis on Southeast Asian manufacturing supply chains?" An embedding-based classifier might route this to a premium model because it's long and contains complex-sounding phrases. But a model trained heavily on economics content might handle it comfortably. The classifier is making a judgment based on surface features. It doesn't know how hard the question is for the specific model it's routing to.
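To make the limitation concrete, here is a minimal sketch of a surface-feature router. The keyword list, the word-count cutoff, and the function name `surface_route` are all made up for illustration. Note that the function takes no information about the target model, so it returns the same decision no matter which small model sits behind it.

```python
# Hypothetical surface-feature router: word count and keyword hits only.
COMPLEX_KEYWORDS = {"downstream", "crisis", "supply", "prove", "hypothesis"}  # assumed list


def surface_route(query: str, max_words: int = 12) -> str:
    """Return "premium" or "small" using only features of the query text."""
    words = [w.strip("?.,!\"'").lower() for w in query.split()]
    keyword_hits = sum(1 for w in words if w in COMPLEX_KEYWORDS)
    # Long queries or queries with several "complex" words go to the premium pool.
    if len(words) > max_words or keyword_hits >= 2:
        return "premium"
    return "small"


query = (
    "what were the downstream effects of the 1997 Asian financial crisis "
    "on Southeast Asian manufacturing supply chains?"
)
print(surface_route("what is 2+2"))  # small
print(surface_route(query))          # premium
```

The second query is sent to the premium pool purely because of its length and vocabulary, even if the small model in question would have answered it well.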

Activation routing sidesteps this problem. It doesn't try to estimate difficulty in the abstract. It asks a more specific question: is this model going to handle this query well? The answer comes from observing the model's internal response to the input — not from pattern-matching on the text.

How Does Prefill Activation Routing Work?

Activation routing works in three steps: run the query through a small model's prefill pass to collect activations, feed those activations through a lightweight probe trained to predict success or failure, then route or escalate based on the probe's output.

Step 1: Collect prefill activations. Run the incoming query through a small, fast model — not to generate a response, but to collect the activations from the prefill phase. Specifically, look at activations from middle-to-late layers of the model, where the internal representation of the task is most developed. These activations form a vector representing the model's internal state after processing the input.
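As a rough sketch of what "a vector representing the model's internal state" can look like, the snippet below mean-pools per-token activations from a few chosen layers and concatenates them. The nested-list layout and the names `mean_pool` and `probe_features` are assumptions for illustration. In a real stack the activations would come from the serving framework, for example the hidden states that Hugging Face Transformers returns when a model is called with `output_hidden_states=True`.

```python
from typing import List, Sequence

Layer = List[List[float]]  # one row per prompt token, one column per hidden unit


def mean_pool(layer: Layer) -> List[float]:
    """Average a layer's activations over the token dimension."""
    n_tokens = len(layer)
    hidden = len(layer[0])
    return [sum(tok[d] for tok in layer) / n_tokens for d in range(hidden)]


def probe_features(activations: Sequence[Layer], layers: Sequence[int]) -> List[float]:
    """Concatenate the mean-pooled activations of the selected layers."""
    features: List[float] = []
    for idx in layers:
        features.extend(mean_pool(activations[idx]))
    return features


# Toy stand-in for a 4-layer model, 3 prompt tokens, hidden size 2.
toy = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
]
# Use the two later layers, in the spirit of "middle-to-late" above.
features = probe_features(toy, layers=[2, 3])
print(features)  # [0.5, 0.5, 4.0, 0.0]
```

Which layers to pool is an empirical choice; the indices here are placeholders, not a recommendation.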

Step 2: Run the probe. Pass those activations through a lightweight probe, usually a small linear classifier or shallow neural network trained to predict whether the small model will succeed or fail on this type of input. The probe has learned which activation patterns correlate with confident, correct outputs versus uncertain or incorrect ones.

Step 3: Route or escalate. If the probe predicts the small model will handle the query well, the request stays with that model and generation proceeds. If the probe predicts failure, the request escalates to a stronger model. The user sees the output of whichever model actually handled the request.
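The three steps can be sketched as a single function. Everything here is illustrative: `route`, `RoutingResult`, the 0.5 threshold, and the `toy_*` stubs are hypothetical names, and the stub probe keys off query length only so the example runs without a model. A real probe would score activations, not word counts.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutingResult:
    model: str        # "small" or "large"
    output: str
    p_success: float  # probe's estimate that the small model succeeds


def route(
    query: str,
    prefill: Callable[[str], List[float]],
    probe: Callable[[List[float]], float],
    small_generate: Callable[[str], str],
    large_generate: Callable[[str], str],
    threshold: float = 0.5,
) -> RoutingResult:
    features = prefill(query)      # Step 1: collect prefill activations
    p_success = probe(features)    # Step 2: run the probe
    if p_success >= threshold:     # Step 3: stay on the small model...
        return RoutingResult("small", small_generate(query), p_success)
    return RoutingResult("large", large_generate(query), p_success)  # ...or escalate


# Hypothetical stubs standing in for real models and a trained probe.
def toy_prefill(query: str) -> List[float]:
    return [float(len(query.split()))]


def toy_probe(features: List[float]) -> float:
    return 0.9 if features[0] < 6 else 0.2


def toy_small(query: str) -> str:
    return "small:" + query


def toy_large(query: str) -> str:
    return "large:" + query


easy = route("what is 2+2", toy_prefill, toy_probe, toy_small, toy_large)
hard = route(
    "summarize the tradeoffs between these three database designs",
    toy_prefill, toy_probe, toy_small, toy_large,
)
print(easy.model, hard.model)  # small large
```

In a real deployment the small model would presumably continue generation from the prefill it has already computed instead of reprocessing the prompt; that detail is left out of the sketch.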

The cost is the prefill pass through the small model plus probe inference, both fast and cheap. If the probe correctly keeps a request on the cheap path, you've paid a small upfront cost to avoid a large downstream one. If the probe escalates, you've paid slightly more than you would have by routing directly to the large model, because the small model's prefill and the probe were wasted work. That overhead is the price of the cases where the probe correctly keeps requests cheap.
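The tradeoff can be written as a simple expected-cost calculation. This is a back-of-the-envelope sketch: the function names and the cost figures are made-up units chosen for illustration, and it compares cost only, ignoring quality differences between the two models.

```python
def expected_cost(overhead: float, small_cost: float, large_cost: float,
                  escalation_rate: float) -> float:
    """Average cost per request when a probe escalates a fraction of traffic.

    overhead:   small-model prefill plus probe inference, paid on every request
    small_cost: cost of finishing generation on the small model
    large_cost: cost of a full request on the large model
    """
    return (overhead
            + (1 - escalation_rate) * small_cost
            + escalation_rate * large_cost)


def break_even_escalation_rate(overhead: float, small_cost: float,
                               large_cost: float) -> float:
    """Escalation rate at which routing costs the same as always using the large model."""
    return 1 - overhead / (large_cost - small_cost)


# Illustrative numbers only (arbitrary units, not measurements).
cost = expected_cost(overhead=1.0, small_cost=2.0, large_cost=20.0, escalation_rate=0.25)
limit = break_even_escalation_rate(overhead=1.0, small_cost=2.0, large_cost=20.0)
print(cost)             # 7.5, versus 20.0 for always using the large model
print(round(limit, 3))  # 0.944
```

With these assumed numbers, routing stays cheaper than always calling the large model until roughly 94% of requests escalate. The smaller the overhead relative to the gap between the two models, the higher that limit.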

How Do You Train the Activation Probe?

The probe is trained on (activations, success/failure) pairs collected from production: examples of queries where the small model succeeded and examples where it failed, labeled accordingly.

For structured output tasks, labeling is straightforward — did the model return valid JSON? Did it extract the right fields? For open-ended generation, you need either human ratings or an automated quality metric. LLM-as-judge works well here: use a strong model to score outputs from the weak model, and use those scores as training labels.
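For the structured-output case, a labeler can be as small as the sketch below. The function name `label_json_output` and the example field names are hypothetical. It checks only that the output parses as a JSON object and contains the required keys; checking that the extracted values are correct would need task-specific ground truth.

```python
import json
from typing import Iterable


def label_json_output(raw_output: str, required_fields: Iterable[str]) -> int:
    """Return 1 if the output is a JSON object with all required fields, else 0."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0  # not valid JSON: count as a failure
    if not isinstance(parsed, dict):
        return 0  # valid JSON, but not an object with named fields
    return int(all(field in parsed for field in required_fields))


fields = ["name", "email"]  # hypothetical extraction schema
print(label_json_output('{"name": "Ada", "email": "ada@example.com"}', fields))  # 1
print(label_json_output('{"name": "Ada"}', fields))                              # 0
print(label_json_output("Sure! Here is the JSON you asked for", fields))         # 0
```

Each label is then paired with the activations recorded for the same request to form one training example for the probe.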

The probe itself doesn't need to be complex. A logistic regression over the mean-pooled activations from a few key layers often performs surprisingly well. More sophisticated approaches use small neural networks that attend over the full activation sequence, capturing positional information about where the model is most uncertain — sometimes uncertainty is concentrated at specific tokens, which is a stronger signal than aggregate uncertainty.
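A logistic-regression probe is small enough to write out in full. The sketch below trains one with plain batch gradient descent on toy two-dimensional "pooled activation" vectors. The data, learning rate, and epoch count are invented for illustration; in practice you would more likely reach for an off-the-shelf implementation and real activation features with hundreds or thousands of dimensions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_success(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probe output: estimated probability that the small model succeeds."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_probe(X: List[List[float]], y: List[int],
                lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_success(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy pooled activations: label 1 = small model succeeded, 0 = it failed.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_probe(X, y)
print(predict_success(w, b, [0.85, 0.15]) > 0.5)  # True
print(predict_success(w, b, [0.15, 0.85]) > 0.5)  # False
```

The output of `predict_success` is the score that gets compared against a routing threshold.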

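One critical constraint: the probe is model-specific and dataset-specific. A probe trained on one small model's activations doesn't transfer to a different model, even for the same task, and a probe trained on customer support queries won't generalize to code generation. Reading activations also requires access to the model's hidden states, which in practice means an open-weight or self-hosted model; hosted APIs such as GPT-4o Mini or Claude Haiku return text (and at most token log-probabilities), not hidden-state activations. You're training a probe for a specific (model, task domain) combination, which requires sufficient production data for each combination. This is the main practical constraint on adoption: it requires a data flywheel that smaller deployments may not have yet.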
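One way to keep the (model, task domain) pairing explicit in code is to key probes by that pair and fall back to escalation when no probe exists. This is a sketch under assumptions: `ProbeRegistry`, `should_escalate`, the model name, and the escalate-by-default policy are all hypothetical choices, not something the technique prescribes.

```python
from typing import Callable, Dict, List, Optional, Tuple

Probe = Callable[[List[float]], float]  # features -> estimated P(small model succeeds)
ProbeKey = Tuple[str, str]              # (model name, task domain)


class ProbeRegistry:
    """Holds one trained probe per (model, task domain) combination."""

    def __init__(self) -> None:
        self._probes: Dict[ProbeKey, Probe] = {}

    def register(self, model: str, domain: str, probe: Probe) -> None:
        self._probes[(model, domain)] = probe

    def lookup(self, model: str, domain: str) -> Optional[Probe]:
        return self._probes.get((model, domain))


def should_escalate(registry: ProbeRegistry, model: str, domain: str,
                    features: List[float], threshold: float = 0.5) -> bool:
    probe = registry.lookup(model, domain)
    if probe is None:
        # Assumed policy: with no probe for this combination, play it safe and escalate.
        return True
    return probe(features) < threshold


registry = ProbeRegistry()
registry.register("small-model-a", "customer_support", lambda feats: 0.8)  # stub probe

print(should_escalate(registry, "small-model-a", "customer_support", [0.1]))  # False
print(should_escalate(registry, "small-model-a", "code_generation", [0.1]))   # True
```

The second call escalates because no probe has been trained for code generation on that model, which mirrors the point that probes don't carry over across domains.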

What Does the Research Show About Activation Routing Accuracy?

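The RouteLLM paper (LMSYS, 2024) is the usual reference point for learned routing, but the routers it evaluates (similarity-weighted ranking, matrix factorization, a BERT classifier, and a causal LLM classifier) all predict from the query text; none of them reads the small model's activations. The claim that activation probes outperform embedding-based classifiers near the capability boundary of the smaller model, exactly where surface-level routing makes the most mistakes, rests on the probing intuition below and should be validated on your own traffic rather than taken as a RouteLLM result.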

The intuition behind why is direct. An embedding classifier sees the query as a point in semantic space and compares it to known examples. An activation probe sees the model's actual response to the query — the uncertainty, the attention patterns, the internal representations — which is a richer signal for predicting whether that specific model will do well.

The gap between approaches is largest on ambiguous queries. On clearly easy or clearly hard queries, most routing methods agree. The value of activation routing concentrates in the middle — the moderate queries that are hard to classify from the outside but easy to identify from the inside once you know what to look for in the activations.

Where Does Activation Routing Fit in a Production LLM Stack?

Activation routing is a more sophisticated routing decision mechanism that replaces the classifier in an existing routing architecture — model pools, fallbacks, and cost tracking stay the same.

For most teams, the practical path is to start with a simpler routing approach — keyword heuristics, embedding classifiers, or an LLM-based classifier — and layer in activation routing later, once you have enough production data to train probes and enough traffic to justify the engineering investment. The simpler approaches get you 60–70% of the cost savings with a fraction of the complexity. Activation routing closes the remaining gap, particularly on the ambiguous middle-ground queries that simpler approaches misroute most often.

The teams for whom it makes immediate sense: those operating at high volume with well-defined task domains, where the cost of misrouting is significant and the data to train probes is already available. At that scale, the improvement in routing accuracy translates directly into meaningful cost reduction and quality gains.

Prefill activation routing is still an active research area and the tooling is immature compared to embedding-based routing. The ideas are solid and the empirical results are promising, but the gap between research result and production-ready open-source library hasn't fully closed yet. Treat it as a technique to understand and plan for, not necessarily one to deploy next sprint.

Frequently Asked Questions: Prefill Activation Routing

What is a prefill activation in an LLM?

A prefill activation is the internal vector representation produced at a given layer of an LLM when it processes an input token during the prefill phase. Every token in your prompt produces a set of activations — one per layer — that encode how the model is "understanding" that token in context. Middle-to-late layer activations are most useful for routing because they reflect the model's developed understanding of the task, not just low-level token features.

How is activation routing different from embedding-based routing?

Embedding-based routing classifies the query text using a separate embedding model and compares it against labeled examples. It never involves the model being routed to. Activation routing runs the query through the small model's prefill pass and reads the model's own internal response to the input. The key difference: activation routing tells you how hard the query is for that specific model, not just how it compares to other queries in embedding space.

Does activation routing require retraining the small model?

No. The small model is used as-is. Only the lightweight probe is trained — and it's trained on activations extracted from the model's existing forward passes, not on the model weights themselves. The probe is typically a simple linear classifier or small neural network, not another LLM.

What is the RouteLLM paper?

RouteLLM is a research paper from LMSYS (published 2024) that evaluates multiple LLM routing strategies, including similarity-weighted ranking, matrix factorization, a BERT classifier, and a causal LLM classifier, trained on human preference data and tested on standard benchmarks. It's one of the most comprehensive empirical comparisons of learned LLM routers currently available. It does not evaluate activation probes, so treat it as the baseline that activation routing has to beat rather than as a source of activation routing benchmark results.

When should I use activation routing instead of semantic routing?

Use semantic routing (embedding-based or LLM-as-classifier) when you're starting out or have limited production data. Use activation routing when: you have enough production data to train a probe (thousands of labeled examples per task domain), you're operating at high enough volume that routing accuracy meaningfully affects costs, and you've already extracted most of the savings from simpler approaches. Activation routing closes the gap on ambiguous queries — if those are a small fraction of your traffic, the simpler approach is probably sufficient.