Prefill Activation Routing: Predicting Model Failure Early

Prefill activation routing reads a model's internal hidden states before a single token is generated - predicting failure in advance, cutting inference costs by up to 74%, and sending more queries to the model that will actually answer them correctly.

Shubham Yadav

Machine Learning Researcher

June 17, 2026

TL;DR

Most LLM routers look at what you asked. Prefill activation routing looks at how the model is reacting to it - before generating a single token. By reading hidden-state geometry during the prefill phase, a lightweight probe can predict whether a target model will succeed or fail on a given query. NVIDIA's March 2026 research showed this approach closes 45.58% of the accuracy gap to a theoretical oracle while cutting costs by 74.31%. Oxford researchers confirmed the underlying mechanism: LLMs literally encode their own likelihood of failure in pre-generation activations. Here's exactly how it works and how to build it into your pipeline.

What Is Prefill Activation Routing?

Prefill activation routing is a mechanistic approach to LLM selection that uses a model's internal hidden states - captured during the prefill phase - as a predictive signal for correctness, before any output token is generated.

Traditional LLM routing relies on semantic query features: embeddings, perplexity, question length, TF-IDF. (For a primer on that approach, see semantic routing versus this activation-based method.) These signals describe what the user asked. They don't describe whether a specific model will answer it correctly.

Prefill activation routing flips the frame entirely. Instead of analyzing the query text, it analyzes the transformer's activation geometry as it processes that query. The signal is mechanistic, not semantic. And it's dramatically more predictive.

The key insight, formalized in NVIDIA's March 2026 paper "LLM Router: Rethinking Routing with Prefill Activations" (arXiv:2603.20895v2), is that a model's hidden states during prefill already contain a linearly decodable signal about whether it will succeed. You can read that signal with a lightweight linear probe - and use it to route traffic to the right model before spending a single inference dollar on generation.

This is the foundation of mechanistic routing: routing decisions grounded in the model's internal computational geometry, not surface-level text features.

Why Do LLMs Fail - and Why Can't We See It Coming?

LLMs fail silently. The query looks fine, the model looks capable, and the output is wrong. Semantic routing can't catch this because it doesn't look inside the model.

Here's the structural problem. Frontier models like Claude Opus 4.6, GPT-5.4, and Qwen 3.5 122B achieve similar average benchmark accuracies. But their per-query performance is highly complementary. On any given query, one model might succeed where another fails - even if both look equally capable from the outside.

In NVIDIA's frontier pool of 11 models, 61.52% of queries fell into the "model disagreement" regime - at least one model succeeded while others failed. That's the routing opportunity. The oracle accuracy ceiling was 89.31%. The best single model only hit 65.4%. A gap of nearly 24 percentage points sits there, waiting to be captured.
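
To make these three quantities concrete, here is a minimal sketch that computes best-single-model accuracy, oracle accuracy, and the disagreement rate from a per-query correctness matrix. The matrix is a made-up toy example, not data from the paper.

```python
# Toy correctness matrix: rows are queries, columns are models (1 = correct).
# Assumption: these values are invented for illustration only.
correct = [
    [1, 1, 1],  # every model succeeds (agreement)
    [1, 0, 0],  # only model 0 succeeds (disagreement)
    [0, 1, 0],  # only model 1 succeeds (disagreement)
    [0, 0, 0],  # every model fails (agreement)
    [1, 1, 0],  # models 0 and 1 succeed (disagreement)
]

n_queries = len(correct)
n_models = len(correct[0])

# Accuracy of each model when used on its own for every query.
single_acc = [sum(row[m] for row in correct) / n_queries for m in range(n_models)]
best_single = max(single_acc)

# A perfect (oracle) router is right whenever at least one model is right.
oracle_acc = sum(1 for row in correct if any(row)) / n_queries

# Disagreement regime: at least one model succeeds while at least one fails.
disagreement = sum(1 for row in correct if 0 < sum(row) < n_models) / n_queries

print(f"best single: {best_single:.2f}")
print(f"oracle:      {oracle_acc:.2f}")
print(f"disagree:    {disagreement:.2f}")
```

The difference between `oracle_acc` and `best_single` is the headroom a router can capture, and it exists only because of the queries counted in `disagreement`.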

Semantic routers try to capture it by matching query embeddings to historical performance patterns. The problem:

Semantic similarity ≠ task difficulty. A query that looks simple can be hard for a specific model.
Model-specific failure modes are invisible to embeddings. A model's weakness on a particular reasoning pattern doesn't show up in the query text.
Agreement inflation. Routing evaluations are often inflated by queries where all models agree - masking how poorly routers perform when it actually matters.

The Oxford Internet Institute's February 2026 paper "LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations" (arXiv:2602.09924v2) confirmed this directly. Linear probes trained on pre-generation activations substantially outperform surface features - question length, TF-IDF - for predicting whether a model will succeed. The signal was there all along. We just weren't looking in the right place.

How Prefill Activations Encode Model Failure

What Are Prefill Activations?

Prefill activations are the hidden states a transformer produces while processing the input prompt - before it generates any output tokens.

When a query hits a transformer model, the model runs a forward pass through all its layers to build the KV cache. At each layer, every token position produces a hidden-state vector. These vectors encode the model's internal representation of the input.

The NVIDIA team focused on the upper half of transformer layers (L/2 to L). That's where task-relevant representations concentrate. Lower layers tend to capture surface syntax; upper layers encode semantic and reasoning structure - the stuff that actually predicts whether the model will get the answer right.

The Key Finding: LLMs Know When They're About to Fail

LLMs encode a model-specific notion of difficulty in their pre-generation activations. This signal is linearly decodable - and it predicts failure before generation begins.

The Oxford team ran a controlled experiment using the E2H-AMC dataset, which provides both human IRT difficulty scores and model performance data on identical math problems. They trained linear probes on pre-generation activations and measured how well those probes predicted:

Human IRT difficulty - how hard humans find the problem
Model difficulty - whether the specific model will succeed

Results (Spearman ρ correlation):

Target	Spearman ρ
Human IRT difficulty	0.83–0.87
Model difficulty (Qwen2.5-Math 1.5B/7B)	0.64
Model difficulty (GPT-OSS-20B, low reasoning)	0.58
Model difficulty (GPT-OSS-20B, high reasoning)	0.40

Human difficulty is highly linearly accessible. Model difficulty is also decodable - but with a critical twist: model difficulty becomes less linearly accessible as reasoning capability increases. For GPT-OSS-20B, the Spearman ρ drops from 0.58 to 0.40 as reasoning depth increases, even as accuracy improves. Extended chain-of-thought reasoning obscures the pre-generation difficulty signal.
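
For readers who want to see what the Spearman ρ numbers measure, here is a small standard-library sketch of rank correlation between probe outputs and a difficulty score. The two input lists are hypothetical values, and the helper names are ours, not the paper's.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical probe outputs and difficulty scores for five problems.
probe_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
difficulty = [1, 2, 3, 4, 5]

rho = spearman(probe_scores, difficulty)
print(f"Spearman rho: {rho:.2f}")
```

Because ρ only compares orderings, a probe can score well here without its outputs being calibrated probabilities.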

This has direct implications for LLM failure prediction in production: the signal is strongest for base and lightly-tuned models, and degrades for heavy reasoning models. Plan your probe strategy accordingly.

How Linear Probes Read the Signal

A linear probe is a simple classifier, typically logistic regression, trained on activation vectors to predict binary success (correct/incorrect). A shallow MLP is the usual non-linear alternative.

The Oxford team achieved AUROC > 0.7 across most models, with several exceeding 0.8 on math and coding tasks. For LiveCodeBench (code generation), Qwen2.5-Coder models hit AUROC 0.90–0.91. That's a strong discriminative signal from a single forward pass.
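
As a reference point for reading those AUROC figures, the sketch below computes AUROC directly from its definition: the probability that a randomly chosen correct example gets a higher probe score than a randomly chosen incorrect one. The labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. This is O(n^2) and meant for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical probe outputs: label 1 means the target model answered correctly.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

print(f"AUROC: {auroc(labels, scores):.3f}")
```

An AUROC of 0.5 means the probe ranks no better than chance, and 1.0 means every correct example outranks every incorrect one.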

The NVIDIA team went further, using Fisher Separability (J) as a layer-selection criterion:

J = ||μ₁ - μ₀||² / (tr(Σ₀) + tr(Σ₁))

Where μ₁ and μ₀ are the mean activation vectors for correct and incorrect responses, and Σ₁ and Σ₀ are the corresponding within-class covariance matrices. Higher J = better class separation = more useful layer for routing.

Fisher J-based layer selection consistently matched empirical probe performance, giving you a principled, efficient way to identify which transformer layers carry the most model routing signal - without brute-force sweeping every layer.
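
The formula is easy to compute without ever forming a covariance matrix, because the trace of a covariance matrix is just the sum of per-dimension variances. Here is a minimal sketch under two assumptions of ours: population variance (dividing by n), and toy two-dimensional activations with hypothetical layer names.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def trace_cov(rows):
    """Trace of the covariance matrix, i.e. the sum of per-dimension variances."""
    mu = mean_vec(rows)
    return sum(
        sum((r[d] - mu[d]) ** 2 for r in rows) / len(rows) for d in range(len(mu))
    )

def fisher_j(acts, labels):
    """J = ||mu_1 - mu_0||^2 / (tr(Sigma_0) + tr(Sigma_1))."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    mu1, mu0 = mean_vec(pos), mean_vec(neg)
    gap = sum((a - b) ** 2 for a, b in zip(mu1, mu0))
    return gap / (trace_cov(neg) + trace_cov(pos))

# Toy activations for two hypothetical layers; labels: 1 = correct, 0 = incorrect.
labels = [1, 1, 0, 0]
layers = {
    "layer_a": [[2.0, 0.0], [4.0, 0.0], [-2.0, 0.0], [-4.0, 0.0]],  # well separated
    "layer_b": [[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]],  # overlapping
}

j_by_layer = {name: fisher_j(acts, labels) for name, acts in layers.items()}
best_layer = max(j_by_layer, key=j_by_layer.get)
print(j_by_layer, "->", best_layer)
```

In a real pipeline the same loop would run over the upper-half layers of the encoder, with last-token hidden states in place of the toy vectors.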

Encoder-Target Decoupling: The Architecture That Makes It Work

What Is Encoder-Target Decoupling?

Encoder-Target Decoupling separates the model that produces the predictive signal (the Encoder) from the model whose correctness is being predicted (the Target).

This is the central innovation in NVIDIA's work. Instead of asking "what does GPT-5's own hidden state say about whether GPT-5 will succeed?", you ask "what does Qwen 3.5 122B's hidden state say about whether GPT-5 will succeed?"

The Encoder runs the query through its own forward pass and produces activation features. Those features are then used to predict the Target model's correctness probability. The Target model itself never needs to run until after the routing decision is made.

This decoupling has two massive practical consequences:

You can predict closed-source model performance using open-weight encoders. You don't need access to GPT-5's internals to route to it intelligently.
A single encoder forward pass can predict correctness across an entire pool of target models. One pass, K predictions.

Why Foreign Encoders Outperform Self-Encoding

In several cases, a different model's hidden states predict a target model's correctness better than the target model's own hidden states.

This counterintuitive finding held consistently across all three model pools in NVIDIA's experiments. Qwen3.5-122B achieved the highest AUC on every target in the frontier pool - including closed-source targets like GPT-5 and Claude Sonnet 4.

The explanation lies in representation geometry. The NVIDIA team measured three properties of encoder hidden states:

Effective Dimensionality (d_eff): How widely information is distributed across dimensions. Higher = less PCA information loss.
Representational Anisotropy (α): Mean pairwise cosine similarity. Lower = more isotropic = better class separation.
Fisher Separability (J): Direct measure of class-discriminative separation.

Large open-weight models like Qwen3.5-122B tend to have higher effective dimensionality and lower anisotropy in their upper layers - producing richer, more separable representations of query difficulty. Their activations are geometrically better suited to serve as routing signals.
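
Of the three properties, anisotropy is the simplest to compute from its definition above: take the mean cosine similarity over all pairs of hidden-state vectors. The sketch below does that for two toy sets of vectors. The data is invented, and effective dimensionality is left out because its exact definition is not given here.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anisotropy(vectors):
    """Mean pairwise cosine similarity; lower means more isotropic."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy hidden states: one set crowded into a narrow cone, one spread out.
narrow = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(f"narrow cone: {anisotropy(narrow):.3f}")
print(f"spread out:  {anisotropy(spread):.3f}")
```

Vectors packed into a narrow cone score close to 1, which leaves little room to separate correct from incorrect queries. Spread-out vectors score near or below 0.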

This is why encoder-target decoupling isn't just an architectural trick. It's grounded in the representational geometry of transformer models.

SharedTrunkNet: Routing Across Multiple Models at Once

SharedTrunkNet is a joint multi-output MLP that predicts simultaneous correctness probabilities across K candidate models in a single forward pass. (This is one way to wire activation signals into a mixture-of-models router.)

Architecture:

For each target model, extract PCA-reduced encoder hidden-state features from the Fisher-selected layers.
Concatenate feature vectors across all K targets: x = [f₁ | f₂ | ··· | f_K]
Feed the concatenated vector into a shared MLP trunk with K output heads - one per target model.
Each head outputs P(correct_k) for its target model.
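
The four steps above can be sketched as a tiny forward pass: one shared hidden layer, then one sigmoid output per target. This is an illustration of the shape of the computation only. The class name, layer sizes, and random untrained weights are our assumptions, not the paper's configuration.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

class SharedTrunkSketch:
    """Hypothetical minimal trunk-plus-heads network with untrained weights."""

    def __init__(self, in_dim, hidden_dim, n_targets, seed=0):
        rng = random.Random(seed)
        self.trunk_w = random_matrix(rng, hidden_dim, in_dim)
        self.trunk_b = [0.0] * hidden_dim
        self.head_w = random_matrix(rng, n_targets, hidden_dim)  # one row per head
        self.head_b = [0.0] * n_targets

    def forward(self, x):
        hidden = [max(0.0, h) for h in linear(x, self.trunk_w, self.trunk_b)]  # ReLU
        logits = linear(hidden, self.head_w, self.head_b)
        return [sigmoid(z) for z in logits]  # P(correct_k) for each target k

# Per-target feature vectors f_1..f_3 (toy values), concatenated into x.
features = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.9]]
x = [value for f in features for value in f]

net = SharedTrunkSketch(in_dim=len(x), hidden_dim=4, n_targets=len(features))
p_correct = net.forward(x)
print([round(p, 3) for p in p_correct])
```

Because every head reads the same hidden layer, evidence in one target's features can shift the prediction for another target, which is the cross-model context that independent probes lack.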

Training: 10 independently seeded instances with BCEWithLogitsLoss and Adam. Top 5 seeds by validation BCE loss are retained. Predictions are averaged at inference.
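
The keep-top-5-and-average recipe is simple to express in code. In this sketch the validation losses and logits are synthetic placeholders, and averaging probabilities (rather than logits) is our assumption, since the text only says predictions are averaged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Ten hypothetical training runs: seed, validation BCE, and logits for K = 3
# targets on a single query. The numbers are synthetic placeholders.
runs = [
    {
        "seed": s,
        "val_bce": 0.40 + 0.01 * ((7 * s) % 10),
        "logits": [0.1 * s, -0.1 * s, 0.5],
    }
    for s in range(10)
]

# Keep the five runs with the lowest validation loss.
kept = sorted(runs, key=lambda r: r["val_bce"])[:5]

# Average the per-target probabilities across the retained runs.
probs = [[sigmoid(z) for z in r["logits"]] for r in kept]
p_ensemble = [sum(col) / len(col) for col in zip(*probs)]

print("kept seeds:", sorted(r["seed"] for r in kept))
print("ensemble P(correct):", [round(p, 3) for p in p_ensemble])
```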

Two key advantages over independent per-model probes:

Cross-model context. Features from one encoder supply evidence about query difficulty for other models. The trunk learns shared representations of difficulty that generalize across targets.
Inherent calibration. Joint optimization keeps all output probabilities on a comparable scale, reducing calibration drift.

The routing score for model k on query q is:

s_{k,q} = λ · p̂_k(q) - (1-λ) · C̃_{k,q}

Where C̃ is normalized cost and λ ∈ [0,1] controls the accuracy-cost tradeoff. Setting λ = 1 maximizes accuracy; λ = 0 minimizes cost. Sweeping λ traces the full Pareto frontier.
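
Here is a short sketch of the scoring rule and a λ sweep. The model names, probabilities, and costs are hypothetical, and min-max scaling is our assumption for how costs are normalized.

```python
def route(p_correct, costs, lam):
    """Return (index of best model, scores) for s_k = lam * p_k - (1 - lam) * cost_k."""
    lo, hi = min(costs), max(costs)
    # Assumption: min-max normalization of cost to [0, 1].
    norm = [(c - lo) / (hi - lo) if hi > lo else 0.0 for c in costs]
    scores = [lam * p - (1 - lam) * c for p, c in zip(p_correct, norm)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical pool: predicted P(correct) and relative cost per query.
models = ["small", "medium", "large"]
p_correct = [0.55, 0.70, 0.90]
costs = [1.0, 4.0, 20.0]

for lam in (0.0, 0.5, 0.8, 1.0):
    best, _ = route(p_correct, costs, lam)
    print(f"lambda={lam:.1f} -> {models[best]}")
```

In this toy pool the choice moves from the cheapest model to the most accurate one as λ rises. Recording accuracy and cost at each λ on a held-out set gives the points of the trade-off curve.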

Real Results: What the Data Actually Shows

Three model pools. Three tiers of evidence. SharedTrunkNet wins every time.

Benchmarks used: MMLU-Pro, Humanity's Last Exam (HLE), LiveCodeBench

Model pool includes: Claude Opus 4.6, GPT-5.4, GPT-5.2, Qwen 3.5 122B, Nemotron Super v3 120B, and others

Per-Model Accuracy and Cost Savings

Pool	Oracle Acc.	Best Single Model	SharedTrunkNet Gain	Headroom Captured	Cost Savings
Frontier (11 models)	89.31%	65.4%	+10.9pp	45.58%	74.31%
Small (20 models, 7B–9B)	91.97%	71.2%	+4.2pp	20.4%	64.2%
Mixed (9 models)	89.35%	82.2%	+1.2pp	17.3%	29.3%
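
The "Headroom Captured" column can be sanity-checked from the other columns, assuming it is defined as the accuracy gain divided by the gap between the oracle and the best single model. That definition is our reading of "closes 45.58% of the accuracy gap", not a formula quoted from the paper.

```python
def headroom_captured(oracle_acc, best_single_acc, gain_pp):
    """Share of the oracle gap (in percent) recovered by the router."""
    return 100.0 * gain_pp / (oracle_acc - best_single_acc)

# (oracle %, best single model %, router gain in pp), copied from the table above.
pools = {
    "Frontier": (89.31, 65.4, 10.9),
    "Small": (91.97, 71.2, 4.2),
    "Mixed": (89.35, 82.2, 1.2),
}

recomputed = {name: headroom_captured(*row) for name, row in pools.items()}
for name, value in recomputed.items():
    print(f"{name}: {value:.2f}% of headroom captured")
```

Recomputing from the rounded table values gives figures within about half a percentage point of the reported column. The small differences for the Small and Mixed pools are what you would expect if the published gains were rounded to one decimal.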

Calibration and Discrimination Quality

Pool	Architecture	Mean AUC ↑	Mean Brier ↓
Frontier	SharedTrunkNet	0.8560	0.1509
Frontier	Semantic baseline (Unified Multitask)	0.8040	0.1756
Frontier	kNN	0.7888	0.1808
Small	SharedTrunkNet	0.8260	0.1642
Mixed	SharedTrunkNet	0.8817	0.1111

Oracle Distance Reduction

SharedTrunkNet reduced Oracle Distance by 53.62% in the frontier pool - meaning the routing curve moved dramatically closer to the theoretical perfect-router corner in normalized accuracy-cost space.

The P-AUCCC (Padded Area Under the Cost Coverage Curve) for SharedTrunkNet was 0.4377 vs. 0.3817 for the model-only Pareto baseline - a 14.67% increase in routing efficiency.

The bottom line: Across 1,300+ evaluated configurations spanning kNN, GraphRouter, Matrix Factorization, DeBERTa, and Unified Multitask architectures, SharedTrunkNet led on every single metric in every single pool. Mechanistic routing isn't marginally better. It's categorically better.

How to Apply Prefill Activation Routing in Your AI Pipeline

Step 01 - Select Your Encoder Model

Choose a large open-weight model with high effective dimensionality and low representational anisotropy in its upper layers.

Qwen3.5-122B was the strongest encoder in NVIDIA's experiments, achieving the highest AUC on every frontier target. Qwen3.5-35B was a close second and more cost-efficient for inference.

Practical guidance:

For multi-model AI pipelines with frontier targets (GPT-5, Claude Opus 4.6): Use Qwen3.5-122B or Qwen3.5-35B as encoder.
For small-model pools (7B–9B): Qwen3.5-35B dominates.
For latency-constrained deployments: Nemotron-Nano-30B offers a reasonable tradeoff (AUC 0.8501 on claude-sonnet-4 target).

The encoder never has to generate. The hidden states are extracted during a standard prefill forward pass, and in vLLM deployments they can be cached alongside the KV cache, making the overhead negligible.

Step 02 - Extract and Probe Prefill Activations

Extract hidden states from layers L/2 to L of your encoder. Use Fisher Separability (J) to identify the most discriminative layer. Apply PCA to reduce dimensionality.

Implementation steps:

Run the encoder forward pass on the query (no generation).
Extract last-token hidden states from upper-half layers (L/2 to L). Last-token pooling consistently outperformed mean pooling in NVIDIA's experiments.
Compute Fisher J for each candidate layer using labeled (correct/incorrect) training data. Select the layer with highest J.
Apply PCA to reduce to d ∈ {50, ..., 300} dimensions. d=100 was the default in NVIDIA's experiments.
Train a linear probe (logistic regression with L2 regularization) on the PCA-reduced features using 5-fold stratified cross-validation.

For LLM inference optimization, this probe is cheap. A logistic regression forward pass on a 100-dimensional vector is microseconds. The encoder forward pass is the main cost - and it's a single pass with no generation.
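
To show how small the probe really is, here is a standard-library sketch of the last step: L2-regularized logistic regression evaluated with 5-fold stratified cross-validation. It is not the paper's code. The features are synthetic stand-ins for PCA-reduced activations, and the hyperparameters (`l2`, `lr`, `epochs`) are arbitrary choices of ours. In practice you would likely reach for an ML library instead.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_probe(X, y, l2=0.1, lr=0.5, epochs=200):
    """Logistic regression with an L2 penalty, fitted by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w = [l2 * wj for wj in w]  # gradient of the (l2 / 2) * ||w||^2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            grad_w = [g + err * xj / n for g, xj in zip(grad_w, xi)]
            grad_b += err / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def stratified_folds(y, k, seed=0):
    """Split indices into k folds that preserve the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, yi in enumerate(y) if yi == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

# Synthetic stand-in for PCA-reduced activations (assumption: 4 dims, 100 queries).
rng = random.Random(0)
y = [i % 2 for i in range(100)]  # 1 = target model answered correctly
X = [[rng.gauss(1.0 if yi else -1.0, 1.0) for _ in range(4)] for yi in y]

fold_acc = []
for held_out in stratified_folds(y, k=5):
    held = set(held_out)
    train = [i for i in range(len(y)) if i not in held]
    w, b = fit_probe([X[i] for i in train], [y[i] for i in train])
    hits = sum((predict(w, b, X[i]) >= 0.5) == (y[i] == 1) for i in held_out)
    fold_acc.append(hits / len(held_out))

mean_acc = sum(fold_acc) / len(fold_acc)
print(f"mean held-out accuracy: {mean_acc:.2f}")
```

At inference time the whole probe is one dot product and one sigmoid per query, which is why the encoder pass dominates the cost.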

Step 03 - Build the Routing Decision Layer

Concatenate per-target features into a single vector and train SharedTrunkNet to predict simultaneous correctness probabilities.

# Pseudocode: SharedTrunkNet inference
# Run the encoder once per query; every target reuses the same hidden states.
hidden = encoder.forward(query)             # prefill only: hidden states, upper layers

features = []
for target_k in model_pool:
    f_k = pca[target_k].transform(hidden)   # PCA reduction fitted for target k
    features.append(f_k)

x = concat(features)                        # joint feature vector
p_correct = shared_trunk_net(x)             # [P(correct_1), ..., P(correct_K)]

# Routing score with cost penalty
scores = lambda_ * p_correct - (1 - lambda_) * normalized_costs
selected_model = model_pool[argmax(scores)] # argmax returns an index into the pool

Train SharedTrunkNet with:

BCEWithLogitsLoss
Adam optimizer
85/15 train/validation split
10 independently seeded runs, keep top 5 by validation BCE
Average predictions at inference

Step 04 - Monitor and Retrain

Track probe AUROC per model over time. Retrain when distribution shifts or new models enter the pool.

Key monitoring signals:

Per-model AUROC degradation - indicates distribution shift or model update.
Brier score increase - indicates calibration drift.
Routing delta - accuracy on queries routed to vs. away from each model. A shrinking delta means the probe is losing discriminative power.
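
Two of these signals are easy to compute from a routing log. The sketch below shows the Brier score and the routing delta. The log format is hypothetical, and it assumes you have correctness labels for each model even on queries routed elsewhere (for example from periodic shadow evaluation).

```python
def brier(labels, probs):
    """Mean squared error between predicted P(correct) and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

def routing_delta(log, model):
    """Accuracy of `model` on queries routed to it minus accuracy on the rest."""
    routed_to = [r["correct"][model] for r in log if r["routed_to"] == model]
    routed_away = [r["correct"][model] for r in log if r["routed_to"] != model]
    return sum(routed_to) / len(routed_to) - sum(routed_away) / len(routed_away)

# Hypothetical routing log with shadow-evaluation labels for both models.
log = [
    {"routed_to": "A", "correct": {"A": 1, "B": 0}},
    {"routed_to": "A", "correct": {"A": 1, "B": 1}},
    {"routed_to": "B", "correct": {"A": 0, "B": 1}},
    {"routed_to": "B", "correct": {"A": 1, "B": 1}},
]

# Probe outputs for model A on the same four queries (toy values).
labels_a = [r["correct"]["A"] for r in log]
probs_a = [0.9, 0.7, 0.2, 0.6]

print(f"Brier (A): {brier(labels_a, probs_a):.3f}")
print(f"Routing delta (A): {routing_delta(log, 'A'):+.2f}")
```

Tracked per model over a rolling window, a rising Brier score or a delta drifting toward zero is a reasonable trigger for retraining.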

One critical caveat from the Oxford research: probe reliability degrades with extended reasoning. For models using high reasoning budgets (chain-of-thought, o1-style), the linear accessibility of difficulty signals decreases. If you're routing to heavy reasoning models, expect lower AUROC and plan for more frequent retraining or non-linear probe architectures. (Whether a task warrants a reasoning model at all is a separate, upstream routing decision, and there capability matters as much as cost.)

Prefill Routing vs. Semantic Routing: Which Wins?

Prefill activation routing wins on every accuracy, calibration, and cost metric in the comparison below. Semantic routing wins on deployment simplicity and on stability across reasoning depth.

Dimension	Semantic Routing	Prefill Activation Routing
Signal source	Query text features (embeddings, perplexity, length)	Transformer hidden states during prefill
Model-specific failure detection	❌ Blind to model internals	✅ Directly reads model state
Closed-source model support	✅ No model access needed	✅ Via encoder-target decoupling
Mean AUC (frontier pool)	0.8040 (best semantic)	0.8560 (SharedTrunkNet)
Brier score (frontier pool)	0.1756 (best semantic)	0.1509 (SharedTrunkNet)
Accuracy gain over best model	+8.3pp (best semantic)	+10.9pp
Cost savings	69.5% (best semantic)	74.31%
Oracle Distance reduction	~44% (best semantic)	53.62%
Deployment complexity	Low - text features only	Medium - requires encoder forward pass
Latency overhead	Minimal	Low (cacheable with KV cache)
Sensitivity to reasoning depth	Stable	Degrades for high-reasoning models
Cross-model context	❌ Per-model independent	✅ Joint optimization via SharedTrunkNet

The verdict: if you're running a multi-model AI pipeline with frontier models and care about accuracy-cost optimization, prefill activation routing is the clear choice. (If you're weighing more conventional tools first, compare RouteLLM and the semantic-router approach.) The performance gap is consistent, large, and reproducible across 1,300+ configurations.

If you're routing between two models on a simple classification task and can't afford the encoder overhead, semantic routing is fine. But that's a narrow use case.

Key Takeaways

01. LLMs encode their own failure. Pre-generation activations contain a linearly decodable signal predicting whether the model will succeed. Oxford confirmed this with Spearman ρ = 0.64 for model difficulty on math tasks.

02. Encoder-Target Decoupling is the unlock. You don't need access to a closed-source model's internals. An open-weight encoder (Qwen3.5-122B) predicts GPT-5's correctness better than GPT-5's own hidden states in some cases.

03. Fisher Separability (J) is your layer-selection tool. It identifies which transformer layers carry the most discriminative signal for routing - efficiently and interpretably.

04. SharedTrunkNet closes 45.58% of the oracle gap. In the frontier pool (11 models including Claude Opus 4.6 and GPT-5.4), it delivered +10.9pp accuracy over the best single model at 74.31% cost savings.

05. The signal degrades with reasoning depth. For GPT-OSS-20B, probe AUROC drops from 0.78 to 0.64 as reasoning level increases from low to high. Extended chain-of-thought obscures pre-generation difficulty signals.

06. Semantic routing is not competitive at scale. Across 1,300+ configurations, no semantic architecture - kNN, GraphRouter, DeBERTa, Matrix Factorization - matched SharedTrunkNet on AUC, Brier score, or Oracle Distance.

FAQ

Q: What is prefill activation routing, and how is it different from semantic routing?

Prefill activation routing uses a transformer model's internal hidden states - captured during the input processing (prefill) phase - to predict whether a target model will answer a query correctly, before any output is generated. Semantic routing uses text-level features like query embeddings, perplexity, or question length. The key difference: semantic routing describes the query; prefill activation routing describes how a specific model is reacting to it. The mechanistic signal is more predictive of model-specific failure modes.

Q: Do I need access to the target model's weights to use prefill activation routing?

No. Encoder-Target Decoupling means you run the forward pass on an open-weight encoder (like Qwen3.5-35B or Qwen3.5-122B) and use its hidden states to predict the target model's correctness. The target model - even a closed-source API like GPT-5 or Claude Opus 4.6 - never needs to expose its internals.

Q: How much does the encoder forward pass add to latency?

In practice, very little. The encoder forward pass is a single prefill-only run - no generation, no decoding. In vLLM-based deployments, the prefill activations can be cached alongside the KV cache. The routing overhead (PCA transform + SharedTrunkNet forward pass) is microseconds. The dominant cost is the encoder forward pass itself, which is comparable to a standard prefill operation.

Q: Why does probe reliability degrade for high-reasoning models?

Oxford's research showed that as reasoning depth increases, chain-of-thought length becomes correlated with human difficulty rather than model difficulty. Extended reasoning causes models to spend more tokens on problems humans find hard - even when those problems are well within the model's competence. This decouples generation dynamics from model-relative uncertainty, making the pre-generation signal less linearly accessible. The signal is still there; it's just less linearly separable, suggesting non-linear probes or intermediate-position probing may help.

Q: What benchmarks were used to validate these results?

NVIDIA's experiments used MMLU-Pro (multi-task language understanding), Humanity's Last Exam (HLE, expert-level academic questions), and LiveCodeBench (code generation with contamination-free temporal splits). The Oxford team used MATH, GSM8K, AIME, and LiveCodeBench. Both teams used binary correctness labels per (model, question) pair as ground truth.

Sources

Varshney, T., Surla, A., Xu, M., et al. (NVIDIA). LLM Router: Rethinking Routing with Prefill Activations. arXiv:2603.20895v2, March 2026. https://arxiv.org/abs/2603.20895
Lugoloobi, W., Foster, T., Bankes, W., Russell, C. (Oxford Internet Institute, University of Oxford). LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations. arXiv:2602.09924v2. Accepted at ICLR 2026 Workshop on Latent and Implicit Thinking. https://arxiv.org/abs/2602.09924

Ready to build prefill activation routing into your AI agent platform? The architecture is open, the math is clear, and the performance gap is real. The question isn't whether to move beyond semantic routing - it's how fast.
