All posts

PagedAttention in vLLM: 14× Throughput with KV Caching

How PagedAttention borrows OS virtual-memory paging to eliminate KV cache fragmentation, and why it lets vLLM reach up to 14× higher throughput.

MK

Mohammed Kafeel

Machine Learning Researcher

June 8, 202611 min read

Quick answer: PagedAttention, the core innovation inside vLLM, borrows OS virtual-memory paging to eliminate KV cache fragmentation. Traditional LLM serving pre-allocates a contiguous memory block per sequence — wasting 20–40% of GPU memory on reserved but unused space. PagedAttention splits the KV cache into small, fixed-size blocks that can live anywhere in GPU memory. That near-zero waste means more sequences fit simultaneously, enabling larger batches and up to 14–24× higher throughput over systems without continuous batching.


What is the KV cache and why does it dominate GPU memory?

The KV cache stores the Key and Value tensors computed at every attention layer so they don't have to be recomputed on each generation step. During autoregressive decoding, a transformer generates one token at a time. Without caching, every new token would require recomputing attention over the entire prompt and all prior outputs — an O(n²) cost per step. The KV cache collapses that to O(n) by reusing prior work.

The memory cost is large. For LLaMA-2 7B (32 layers, 32 heads, 128 head_dim, fp16):

Quantity Calculation Result
K + V per token per layer 2 × 32 heads × 128 × 2 bytes 16,384 bytes
K + V per token across all 32 layers 16,384 × 32 512 KB
K + V for one 4096-token sequence 512 KB × 4,096 2 GB

On an 80 GB A100 with model weights consuming ~14 GB, you have roughly 66 GB left for KV caches — enough for about 33 sequences at maximum context length. Every byte wasted on fragmentation directly reduces that number, shrinks the maximum batch size, and cuts throughput.


The problem: memory fragmentation under traditional serving

Before PagedAttention, production LLM servers handled the KV cache one of two ways, both wasteful:

Static pre-allocation: Reserve max_sequence_length × kv_size of contiguous GPU memory per sequence at request arrival. A request that ends at 200 tokens still held memory for 4,096 tokens the entire time. Utilization in practice: 60–80% — meaning 20–40% of GPU memory was permanently idle.

Dynamic allocation: Grow a contiguous buffer as tokens are generated. This avoids over-reservation but creates external fragmentation: free blocks of different sizes scattered across GPU memory that can't be merged into a new large contiguous allocation. Over time the allocator stalls and throughput degrades.

Both failures share the same root cause: forcing a contiguous memory layout onto data that has no physical reason to be contiguous.


What is PagedAttention?

PagedAttention is a KV cache memory manager that divides GPU memory into fixed-size blocks (pages), assigns pages to sequences on demand, and uses a block table — analogous to a CPU page table — to map logical sequence positions to physical memory locations.

The inspiration is direct: this is exactly how operating systems manage RAM for processes. A process's virtual address space is contiguous; the physical RAM pages backing it are not. The MMU (memory management unit) translates addresses at runtime. PagedAttention applies the same idea to transformer KV caches.

The three building blocks

1. Physical blocks GPU memory is carved into fixed-size blocks of B tokens each (vLLM's default: B = 16). Each block holds the K and V tensors for B tokens, for all layers and heads. Blocks are allocated from a free-block pool as needed and returned to the pool when a sequence finishes.

2. Logical blocks Each sequence has a logical view of its KV cache divided into sequential logical blocks: block 0 (tokens 0–15), block 1 (tokens 16–31), and so on. From the sequence's perspective the cache is contiguous.

3. Block table A per-sequence mapping from logical block number → physical block number. The attention kernel consults the block table at runtime to find where each page of K/V actually lives in GPU memory.

Sequence A: [logical block 0] → [physical block 7]
            [logical block 1] → [physical block 2]
            [logical block 2] → [physical block 15]  ← in-progress, partially filled

Physical blocks 7, 2, and 15 may be nowhere near each other in memory. That is fine — the block table handles all address translation.

How token generation works step by step

  1. A new request arrives. The scheduler allocates the first physical block from the free pool and maps it to logical block 0 of the sequence.
  2. Tokens are generated. K and V are written into the current block sequentially.
  3. When the current block is full (B tokens written), the scheduler allocates the next physical block and updates the block table.
  4. The attention computation iterates over logical blocks, looks up each physical block address in the table, loads K and V, and computes attention. A custom CUDA kernel handles this non-contiguous access pattern.
  5. When the sequence finishes, all its physical blocks are returned to the free pool immediately.

The only internal fragmentation is in the last partial block of each active sequence — at most B − 1 wasted token slots. With B = 16, that is never more than 15 token slots per sequence. In practice, memory utilization rises from the traditional 60–80% to above 96%.


Copy-on-Write: sharing KV cache blocks across beams and prefixes

PagedAttention enables multiple sequences to share physical KV cache blocks read-only, with a copy-on-write (CoW) mechanism for divergence points.

This matters in two key scenarios:

Parallel sampling and beam search

In beam search with k = 4, the four beams share the same prompt. Without sharing, you'd store the prompt's KV cache four times. With PagedAttention:

  1. The prompt is processed once. Its physical blocks get a reference count of 4.
  2. All four beams initially point to the same physical blocks via their block tables.
  3. When a beam generates a new token and its in-progress block diverges from another beam's, the allocator triggers a copy: it allocates a new physical block, copies the existing content, and decrements the original block's reference count.

For a 2,000-token shared prompt across 10 beams, CoW means storing the prompt KV cache once instead of ten times — a 10× reduction in memory for the shared prefix.

System prompt / prefix sharing

When hundreds of requests share the same system prompt, vLLM's automatic prefix caching (APC) hashes the token IDs of the prefix and maps them to a reusable set of physical blocks. The KV values for that prefix are computed once and then served from those shared blocks for every subsequent request that starts with the same prefix — with no re-computation and no extra memory per request.


Continuous batching: the throughput multiplier that PagedAttention enables

PagedAttention solves memory fragmentation. Continuous batching is the scheduling policy that turns that into throughput. The two work together.

Traditional static batching: A batch of N requests is loaded onto the GPU. The scheduler waits until every sequence in the batch has finished generating before loading the next batch. If request 1 finishes at token 50 and request 8 finishes at token 800, the GPU idles for the 750-token gap after request 1 completes.

Continuous batching (iteration-level scheduling): After each generation step, the scheduler checks which sequences just finished and immediately slots in new requests from the queue. The batch size fluctuates token by token, and the GPU is never waiting on stragglers.

This only works at scale if memory can be allocated and freed in small increments without fragmentation. With a contiguous-allocation model, freeing one sequence mid-batch creates a hole that can't be reused for variable-length new arrivals. PagedAttention's block-level allocator makes the memory side of continuous batching trivially safe.


Where the 14× throughput number comes from

The vLLM paper (Kwon et al., UC Berkeley, 2023) benchmarks against three baselines on OPT and LLaMA models under Poisson-distributed request arrivals:

Comparison baseline Throughput gain reported
HuggingFace Transformers (no continuous batching) 14× – 24×
FasterTransformer (static batching) 3× – 4×
Orca (continuous batching, chunked memory) 1.5× – 2.2×

The 14× figure is the floor of the range against HuggingFace Transformers under a moderate request rate on LLaMA-class models. The gain has two independent components:

  1. Continuous batching vs. static batching: Even without PagedAttention, switching from waiting-for-batch-to-finish to iteration-level scheduling can deliver 3–5× on its own, depending on request-length variance.

  2. PagedAttention memory efficiency: With ~96%+ memory utilization instead of ~60–80%, you fit roughly 1.5–2× more sequences in memory simultaneously. More concurrent sequences → larger effective batch size → better GPU utilization on each step → more tokens per second.

The combined effect — more sequences in memory, each served continuously without GPU idle time — produces the headline throughput numbers.


How to use vLLM in practice

Installing and running vLLM is straightforward. The library exposes an OpenAI-compatible API server and a Python client.

Installation

pip install vllm

Start the server (PagedAttention and continuous batching on by default)

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --max-model-len 4096

Python client — batch inference

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

prompts = [
    "Explain the Attention is All You Need paper in one paragraph.",
    "What is the difference between RLHF and DPO?",
    "Write a Python function to merge two sorted lists.",
]

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

vLLM handles batching internally. You don't need to split your list into fixed-size batches — the scheduler manages that via continuous batching, assigning GPU capacity dynamically as requests arrive and finish.

Key configuration knobs

Parameter Default What it controls
--block-size 16 Tokens per KV cache block. Larger = less overhead, more waste.
--gpu-memory-utilization 0.90 Fraction of GPU memory reserved for KV cache. Lower = safer OOM margin.
--max-num-seqs 256 Maximum concurrent sequences in the scheduler.
--enable-prefix-caching False Enable automatic prefix caching (APC) for shared system prompts.

Common reasons throughput underperforms expectations

Symptom Likely cause Fix
Low GPU utilization despite many requests Requests queue but don't batch — output length too variable Set --max-num-seqs higher; pre-sort by expected length
OOM during warmup --gpu-memory-utilization too high for model + KV cache combined Reduce to 0.85 and re-profile
Throughput plateaus with more requests Bandwidth-bound, not compute-bound — model is small relative to batch Use tensor parallelism (--tensor-parallel-size)
Prefix caching not helping System prompts are not byte-identical (trailing spaces, formatting) Normalize prompts before sending; verify hash hits

PagedAttention vs. Anthropic prompt caching — key differences

These are often confused because both involve "caching" and "KV cache." They solve different problems at different layers:

Dimension PagedAttention (vLLM) Anthropic prompt caching (Claude API)
Where it runs Inside your inference server, on your GPU On Anthropic's servers
What it manages GPU memory layout for KV blocks Re-use of computed KV values across API calls
Who controls it You (by running vLLM) Anthropic (you opt in with cache_control)
Billing impact Reduces infrastructure cost (more throughput/GPU) Directly reduces per-token API cost (~90% on cached prefix)
Transparency Internal — no per-request metadata exposed Explicit — cache_read_input_tokens in the response

If you self-host with vLLM, PagedAttention is always on. If you call the Claude API, prompt caching is a separate pricing feature you explicitly enable.


Frequently asked questions

What is PagedAttention in simple terms? PagedAttention stores the transformer's KV cache in small, fixed-size memory blocks (pages) that can live anywhere in GPU memory, rather than requiring one large contiguous allocation per sequence. A block table maps each sequence's logical positions to physical memory locations, the same way a CPU's memory management unit maps virtual addresses to physical RAM pages.

Why does memory fragmentation reduce LLM throughput? More fragmentation means fewer sequences fit in GPU memory at the same time. Fewer concurrent sequences means smaller effective batch sizes. Smaller batches mean worse GPU utilization per generation step — more time spent loading model weights relative to tokens produced, because weight-loading cost is amortized across fewer outputs per step.

How does vLLM achieve 14× throughput over HuggingFace Transformers? Two mechanisms multiply together: continuous batching (never waiting for a whole batch to finish — slot new requests in as soon as a sequence completes) and PagedAttention (near-zero memory waste means far more sequences fit in GPU memory simultaneously). The 14× figure from the 2023 vLLM paper reflects HuggingFace Transformers without continuous batching; vs. Orca (which has continuous batching but chunked memory management) the gain is 1.5–2.2×.

What block size should I use in vLLM? The default of 16 is a good starting point. Smaller blocks (8) reduce internal fragmentation and are better when sequences have very variable lengths. Larger blocks (32) reduce overhead per block and are better for long, uniform sequences. The throughput difference is usually under 5%; leave it at 16 unless profiling shows a clear reason to change.

Does PagedAttention work with tensor parallelism and multi-GPU serving? Yes. vLLM distributes both model weights and KV cache across GPUs via tensor parallelism (--tensor-parallel-size). Each GPU manages its own shard of the KV cache using PagedAttention. The block table is maintained per-shard and synchronized at the scheduler level.

What is automatic prefix caching (APC) and how does it relate to PagedAttention? APC is a feature built on top of PagedAttention. When multiple requests share the same prefix (e.g., a system prompt), vLLM hashes the prefix token IDs and maps them to a set of reusable physical blocks. The KV values are computed once and the blocks are kept alive for future requests. It is opt-in (--enable-prefix-caching) and most beneficial when you serve many requests with a shared, long system prompt.


Key takeaways

  • The KV cache is the dominant consumer of GPU memory during inference — 512 KB per token for LLaMA-2 7B, scaling linearly with context length.
  • Traditional contiguous allocation wastes 20–40% of GPU memory through over-reservation and fragmentation, limiting batch size and throughput.
  • PagedAttention divides the KV cache into fixed-size blocks mapped via a block table, lifting memory utilization above 96% and allowing near-zero fragmentation.
  • Copy-on-write lets multiple beams or requests share physical KV cache blocks, eliminating redundant storage for shared prefixes.
  • Continuous batching fills freed sequence slots immediately — PagedAttention makes the memory management behind this zero-overhead.
  • The 14× throughput gain over HuggingFace Transformers is the combined effect of continuous batching plus PagedAttention memory efficiency; vs. Orca (continuous batching only) the marginal gain is 1.5–2.2×.
  • Use --enable-prefix-caching when many requests share a long system prompt; verify hits are occurring via vLLM's metrics endpoint.