PagedAttention in vLLM: 14× Throughput with KV Caching

PagedAttention is the memory management algorithm inside vLLM that eliminates KV cache fragmentation, cuts GPU memory waste from 60–80% to under 4%, and delivers up to 24x higher throughput than HuggingFace Transformers.

Mohammed Kafeel

Machine Learning Researcher

June 17, 2026

Your GPU is 60–80% wasted right now. Not because of your model. Because of how memory is managed.

Every LLM serving system that pre-allocates a contiguous memory block per request is burning GPU capacity on slots that will never be used. The KV cache for a single LLaMA-13B sequence takes up to 1.7 GB. On an A100 40GB, that leaves room for maybe 7 concurrent sequences - even though the GPU could physically handle far more.

PagedAttention fixes this. It's the core algorithm inside vLLM, and it's why vLLM achieves 14x–24x higher throughput than HuggingFace Transformers on the same hardware, without touching the model at all.

Here's the full breakdown.

TL;DR

Traditional LLM serving wastes 60–80% of GPU memory due to KV cache fragmentation and over-reservation.

PagedAttention borrows OS virtual memory paging to store KV cache in non-contiguous fixed-size blocks - cutting waste to under 4%.

vLLM with PagedAttention achieves 14x–24x higher throughput than HuggingFace Transformers and 2.2x–3.5x over TGI (benchmarked on LLaMA-7B/A10G and LLaMA-13B/A100 40GB).

Memory sharing for parallel sampling and beam search cuts memory usage by up to 55% and boosts throughput by up to 2.2x on top of that.

The Problem: Why LLM Serving Wastes 60–80% of GPU Memory

The short answer: traditional systems pre-allocate memory for the worst case, then hold it hostage for the entire request lifetime.

What is the KV cache, and why does it grow?

During autoregressive generation, every transformer layer computes key and value vectors for each token. These get cached so the model doesn't recompute them at every decoding step - that's the KV cache. Without it, generating 100 tokens from a 1,000-token prompt would mean reprocessing the entire, growing sequence at each of the 100 decoding steps.

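The KV cache is large and dynamic. For LLaMA-13B in FP16, a single token's KV data is roughly 0.78 MB (800 KB). A full 2,048-token sequence hits around 1.6 GB (1,600 MB in binary units, or about 1.7 GB in decimal units, which is the figure quoted earlier). On an A100 40GB - where model weights already consume ~26 GB and activations and runtime overhead take a further slice - that leaves roughly 12 GB for KV cache. That's enough for about 7 concurrent sequences at 2,048 tokens each.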
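The figures above follow from simple arithmetic. Here is a minimal sketch, assuming LLaMA-13B's published shape (40 layers, hidden size 5,120) and 2 bytes per FP16 value. The 12 GB budget is this article's estimate, and the function names are illustrative only:

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token needs: a key and a value vector per layer."""
    return 2 * num_layers * hidden_size * bytes_per_value


def max_sequences(budget_bytes: int, seq_len: int, per_token: int) -> int:
    """How many sequences fit if each one reserves seq_len tokens."""
    return budget_bytes // (seq_len * per_token)


# Assumed LLaMA-13B shape: 40 layers, hidden size 5120, FP16 (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)
per_sequence = per_token * 2048
budget = 12 * 1024**3  # roughly 12 GB left for KV cache (the article's estimate)

print(f"per token:    {per_token / 1024**2:.2f} MiB")
print(f"per sequence: {per_sequence / 1024**3:.2f} GiB")
print(f"sequences that fit: {max_sequences(budget, 2048, per_token)}")
```

Swapping in another model's layer count and hidden size gives a quick first estimate of how many worst-case sequences a given memory budget can hold.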

The fragmentation problem

Here's where it gets wasteful. Traditional serving systems don't know how long a response will be when a request arrives. So they do the safe thing: pre-allocate a contiguous memory block for the maximum possible sequence length - say, 2,048 tokens - for every request.

Two things go wrong:

Internal fragmentation: A request that generates 200 tokens still holds a 2,048-token block. The other 1,848 slots are reserved but empty.
External fragmentation: Blocks of different sizes leave gaps in GPU memory that can't be filled by new requests.
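To make internal fragmentation concrete, here is a small sketch that reserves a fixed 2,048-slot region per request and measures how much of it is actually used. The request lengths are made-up values for illustration, not measurements:

```python
MAX_SEQ_LEN = 2048  # slots reserved per request under worst-case pre-allocation


def utilization(actual_lengths: list[int], reserved_per_request: int = MAX_SEQ_LEN) -> float:
    """Fraction of reserved KV slots that actually hold token state."""
    used = sum(actual_lengths)
    reserved = reserved_per_request * len(actual_lengths)
    return used / reserved


# Hypothetical total lengths (prompt + output) for four requests.
lengths = [200, 512, 1024, 2048]
print(f"used slots: {utilization(lengths):.1%}")
```

Only requests that run all the way to the maximum length use their reservation fully; every shorter request drags the average down.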

The result? In profiling from the original vLLM paper, only 20.4%–38.2% of KV cache memory was actually storing token states in existing systems. The rest was pure waste.

Think of it like booking a hotel room for 7 nights but checking out after 2. The room stays blocked. No one else can use it. Multiply that across hundreds of concurrent requests and you've got a GPU that's mostly idle.

What Is PagedAttention?

PagedAttention is an attention algorithm that stores KV cache in fixed-size, non-contiguous memory blocks - inspired directly by how operating systems manage virtual memory.

Instead of one giant contiguous buffer per request, PagedAttention breaks each sequence's KV cache into small KV blocks, each holding the keys and values for a fixed number of tokens (default: 16 tokens per block). These blocks can live anywhere in GPU memory. A lightweight block table per request maps logical block indices to their actual physical GPU addresses.

The OS virtual memory analogy

The mapping is almost one-to-one:

OS Virtual Memory	vLLM PagedAttention
Virtual address space	Logical KV block sequence
Physical memory pages	Physical KV blocks in GPU DRAM
Page table	Block table (per request)
Process	LLM inference request
Bytes	Tokens

An OS lets a program behave as if it has a large, contiguous address space while physically scattering data across RAM. PagedAttention does the same for LLM serving: requests see a logically contiguous KV cache, while the actual data is scattered in fixed-size blocks across GPU memory.

The block table mechanism

Each request gets a block table - a small mapping structure that records:

Which logical block (0, 1, 2…) maps to which physical block in GPU memory
How many slots in each block are currently filled
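A block table can be pictured as a short list plus a token count. The sketch below is a hypothetical data structure for illustration, not vLLM's internal class; the names `BlockTable`, `locate`, and `filled_slots` are assumptions:

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (the default cited above)


@dataclass
class BlockTable:
    """Per-request mapping from logical block index to physical block id."""
    physical_blocks: list[int] = field(default_factory=list)
    num_tokens: int = 0  # total tokens stored; implies the last block's fill

    def locate(self, token_pos: int) -> tuple[int, int]:
        """Return (physical block id, slot offset) for a token position."""
        if not 0 <= token_pos < self.num_tokens:
            raise IndexError(f"token {token_pos} is not in the cache")
        logical_block, offset = divmod(token_pos, BLOCK_SIZE)
        return self.physical_blocks[logical_block], offset

    def filled_slots(self, logical_block: int) -> int:
        """Number of occupied slots in the given logical block."""
        start = logical_block * BLOCK_SIZE
        return max(0, min(BLOCK_SIZE, self.num_tokens - start))


# A 20-token sequence whose two logical blocks live at physical blocks 7 and 3.
table = BlockTable(physical_blocks=[7, 3], num_tokens=20)
print(table.locate(5))        # token 5 sits in logical block 0
print(table.locate(18))       # token 18 sits in logical block 1
print(table.filled_slots(1))  # the second block is only partly filled
```

The lookup is a single integer division, which is why the indirection adds little cost per token.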

When the attention kernel needs to compute attention scores, it walks the block table to find where each chunk of KV data lives, then fetches those blocks. The math is identical to standard attention - just computed block-by-block instead of over one contiguous buffer.
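The claim that the math is unchanged can be checked on a toy example. The following pure-Python sketch computes single-query attention twice: once over contiguous keys and values, and once by walking a block table and accumulating the softmax incrementally (a running-maximum "online softmax", chosen here for illustration). It is an assumption-laden sketch of the idea, not vLLM's CUDA kernel:

```python
import math

BLOCK_SIZE = 4  # small block size so the example stays readable


def attention(query, keys, values):
    """Standard softmax attention for one query over contiguous keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def paged_attention(query, key_pool, value_pool, block_table, num_tokens):
    """Same result, computed one physical block at a time via the block table."""
    scale = 1.0 / math.sqrt(len(query))
    dim = len(value_pool[block_table[0]][0])
    running_max, denom, acc = -math.inf, 0.0, [0.0] * dim
    for logical, physical in enumerate(block_table):
        filled = min(BLOCK_SIZE, num_tokens - logical * BLOCK_SIZE)
        for slot in range(filled):
            key, value = key_pool[physical][slot], value_pool[physical][slot]
            score = scale * sum(q * k for q, k in zip(query, key))
            new_max = max(running_max, score)
            # Rescale what has been accumulated so far to the new maximum.
            correction = math.exp(running_max - new_max)
            weight = math.exp(score - new_max)
            denom = denom * correction + weight
            acc = [a * correction + weight * v for a, v in zip(acc, value)]
            running_max = new_max
    return [a / denom for a in acc]


# Six tokens with 2-dimensional keys and values (arbitrary illustrative numbers).
keys = [[0.1 * i, 0.2 - 0.05 * i] for i in range(6)]
values = [[float(i), float(-i)] for i in range(6)]
query = [0.3, -0.7]

# Scatter the tokens into physical blocks 2 and 0 of a 3-block pool.
block_table = [2, 0]
empty = [[0.0, 0.0]] * BLOCK_SIZE
key_pool = [list(empty) for _ in range(3)]
value_pool = [list(empty) for _ in range(3)]
for pos in range(6):
    logical, slot = divmod(pos, BLOCK_SIZE)
    key_pool[block_table[logical]][slot] = keys[pos]
    value_pool[block_table[logical]][slot] = values[pos]

contiguous = attention(query, keys, values)
paged = paged_attention(query, key_pool, value_pool, block_table, num_tokens=6)
print(all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(contiguous, paged)))
```

Both paths visit the same tokens in the same logical order; only the place the data is fetched from differs.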

Key insight: physical blocks can be scattered anywhere. The logical view is always contiguous. Memory waste is bounded to at most 15 unused slots per sequence, all in its last partially-filled block - that's where the <4% waste figure comes from.
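The 15-slot bound is easy to verify numerically. This sketch (the function name is an assumption) computes the unused slots and their share of allocated slots for a few sequence lengths; note that the relative share depends on how long the sequence is, so very short sequences waste a larger fraction of their single block:

```python
BLOCK_SIZE = 16


def paged_waste(num_tokens: int, block_size: int = BLOCK_SIZE) -> tuple[int, float]:
    """Unused slots in the last block, and their share of allocated slots."""
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    allocated = num_blocks * block_size
    unused = allocated - num_tokens
    return unused, unused / allocated


for n in (7, 200, 401, 2048):
    unused, share = paged_waste(n)
    print(f"{n:>5} tokens -> {unused:>2} unused slots ({share:.1%} of allocated)")
```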

How PagedAttention Manages KV Cache Step by Step

vLLM allocates KV blocks on demand, one at a time, and frees them immediately when a request finishes.

Here's the exact flow for a 7-token prompt with block size B=16:

Prefill phase: The prompt has 7 tokens. vLLM needs ⌈7÷16⌉ = 1 physical block. It maps logical block 0 to, say, physical block 7. The KV cache for all 7 prompt tokens is computed and stored there, leaving 9 of the block's 16 slots free for generated tokens.
First decode step: The model generates token 8. It goes into the next free slot of physical block 7. The block table's fill count updates from 7 to 8. No new block needed yet.
Block fills up: Once all 16 slots in physical block 7 are used, vLLM allocates a new physical block (say, block 3) and adds it to the block table as logical block 1. Generation continues.
Request completes: All physical blocks are returned to the free pool immediately. Any waiting request can use them.
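The flow above can be sketched with a free list and a per-sequence block table. This is a simplified, hypothetical model (class and method names are assumptions, and prefill is written token by token for brevity), not vLLM's scheduler:

```python
BLOCK_SIZE = 16


class BlockAllocator:
    """Hypothetical free-list allocator over a fixed pool of physical blocks."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


class Sequence:
    """Tracks one request's block table and token count."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens == len(self.block_table) * BLOCK_SIZE:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        # Return every block to the free pool as soon as the request ends.
        self.allocator.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(7):       # prefill: 7 prompt tokens
    seq.append_token()
blocks_after_prefill = len(seq.block_table)
for _ in range(10):      # decode: tokens 8 through 17
    seq.append_token()
blocks_after_decode = len(seq.block_table)
seq.finish()
print(blocks_after_prefill, blocks_after_decode, len(allocator.free))
```

The second block is allocated only when token 17 arrives, and all eight blocks are back in the pool once the request finishes.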

Compare this to traditional systems, which would have pre-allocated a 2,048-token block at step 1 and held it until step 4 - regardless of actual output length.

Copy-on-write for shared blocks

When multiple sequences share the same physical block (more on this in the Memory Sharing section below), vLLM tracks a reference count per block. If one sequence needs to write a new token to a shared block, vLLM:

Allocates a fresh physical block
Copies the existing content into it
Decrements the reference count on the original

The original block stays intact for the other sequences. This is copy-on-write at block granularity - the same mechanism Linux uses when forking processes.
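The three steps map directly onto a few lines of code. The sketch below uses a hypothetical `BlockPool` class with string placeholders standing in for KV data; it illustrates the bookkeeping only and is not vLLM's implementation:

```python
class BlockPool:
    """Hypothetical pool of physical blocks with per-block reference counts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.ref_count = {}
        self.data = {}  # physical block id -> list of token entries

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_count[block] = 1
        self.data[block] = []
        return block

    def share(self, block: int) -> int:
        """Let another sequence point at an existing block."""
        self.ref_count[block] += 1
        return block

    def writable(self, block: int) -> int:
        """Return a block the caller may write to, copying if it is shared."""
        if self.ref_count[block] == 1:
            return block                           # sole owner: write in place
        fresh = self.allocate()                    # 1. allocate a fresh block
        self.data[fresh] = list(self.data[block])  # 2. copy the existing content
        self.ref_count[block] -= 1                 # 3. drop the reference to the original
        return fresh


pool = BlockPool(num_blocks=4)
original = pool.allocate()
pool.data[original].extend(["tok0", "tok1"])

seq_a = [original]
seq_b = [pool.share(original)]   # both sequences share one physical block

# Sequence B wants to append a token, so copy-on-write kicks in.
seq_b[0] = pool.writable(seq_b[0])
pool.data[seq_b[0]].append("tok2_b")

print(pool.data[seq_a[0]], pool.data[seq_b[0]], pool.ref_count)
```

Once the reference count on the original falls back to 1, the remaining owner can write to it in place without any copy.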

The 14x Throughput: What the Numbers Actually Mean

vLLM with PagedAttention achieves 14x–24x higher throughput than HuggingFace Transformers and 2.2x–3.5x over TGI, measured on real-world request distributions.

Benchmark conditions

The numbers come from the original vLLM launch benchmarks (June 2023):

Models: LLaMA-7B and LLaMA-13B
Hardware: NVIDIA A10G (LLaMA-7B) and NVIDIA A100 40GB (LLaMA-13B)
Dataset: ShareGPT - real user conversations with variable input/output lengths
Metric: Serving throughput (requests per second)

Single completion (n=1)

Framework	Throughput relative to vLLM	Notes
HuggingFace Transformers	14x–24x lower	No KV cache memory optimization; contiguous pre-allocation
Text Generation Inference (TGI)	2.2x–2.5x lower	Better than HF, but still fragmented memory management
vLLM (PagedAttention)	1x (reference)	Near-zero memory waste, on-demand block allocation

Parallel sampling (n=3 outputs per request)

When each request generates 3 parallel completions, memory sharing kicks in hard:

vs. HuggingFace Transformers: 8.5x–15x faster
vs. TGI: 3.3x–3.5x faster

Against TGI, the gap widens with parallel sampling (from 2.2x–2.5x to 3.3x–3.5x) because vLLM can share the prompt's KV blocks across all 3 outputs, while HF and TGI store 3 full copies. The multiplier over HF is smaller than at n=1 but still large. (For the full framework breakdown, see vLLM's PagedAttention advantage.)

LMSYS switched Chatbot Arena from a HuggingFace backend to vLLM in April 2023. The results:

Average daily requests: 30,000
Peak daily requests: 60,000
GPU reduction: 50% fewer GPUs needed to handle the same traffic

That's not a benchmark. That's a production system serving millions of users, cutting its compute bill in half. (For the full picture, see the self-hosting economics with vLLM.)

Memory Sharing: The Hidden Superpower

PagedAttention's block table enables multiple sequences to point to the same physical KV blocks - no copying required.

This is where the paged KV cache design pays off beyond just fragmentation reduction. (For how this fits a broader stack, see PagedAttention in multi-tier caching.)

Prefix caching (shared system prompts)

Many production deployments use a fixed system prompt for every request - a long instruction block prepended to each user message. In traditional systems, every request stores its own copy of that system prompt's KV cache.

With vLLM, you can pre-compute and cache the KV blocks for that shared prefix. (This is the basis for KV cache reuse built on PagedAttention.) All incoming requests map their logical blocks 0 through N to the same physical blocks. 10 users hitting the same system prompt = 1 copy of those KV blocks in GPU memory, not 10.
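One way to picture prefix sharing is a lookup keyed by the tokens of each full block together with everything before it, since a token's keys and values depend on the whole preceding context. The sketch below is a hypothetical illustration of that idea (the `PrefixCache` class and the token ids are made up), not vLLM's prefix caching code:

```python
BLOCK_SIZE = 16


class PrefixCache:
    """Hypothetical cache mapping a full block's token prefix to a physical block."""

    def __init__(self):
        self.block_for_prefix: dict[tuple[int, ...], int] = {}
        self.next_physical = 0

    def blocks_for(self, token_ids: list[int]) -> list[int]:
        """Return physical block ids for every full block of the prompt."""
        table = []
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            # The key is the entire prefix up to this block, so a block is
            # reused only when everything before it matches as well.
            key = tuple(token_ids[:end])
            if key not in self.block_for_prefix:
                self.block_for_prefix[key] = self.next_physical
                self.next_physical += 1
            table.append(self.block_for_prefix[key])
        return table


cache = PrefixCache()
system_prompt = list(range(32))               # 32 tokens = 2 full blocks
table_a = cache.blocks_for(system_prompt + [100] * 16)
table_b = cache.blocks_for(system_prompt + [200] * 16)
print(table_a, table_b, cache.next_physical)
```

Both requests point at the same two physical blocks for the system prompt and differ only in the block holding their own user message, so four physical blocks serve what would otherwise take six.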

Parallel sampling

When you ask for n=3 outputs from the same prompt, all 3 output sequences share the prompt's KV blocks. They diverge only when they start generating different tokens. Copy-on-write handles the divergence cleanly.

The result: up to 55% memory reduction for parallel sampling, translating to up to 2.2x additional throughput improvement on top of the base PagedAttention gains.
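A rough block count shows where savings of this size can come from. The sketch below is illustrative arithmetic under assumed prompt and output lengths, not a reproduction of the benchmark; the function name and parameters are assumptions:

```python
def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def blocks_needed(prompt_len: int, output_len: int, n: int,
                  block_size: int = 16, share_prompt: bool = True) -> int:
    """KV blocks for n samples of one prompt, with or without prompt sharing."""
    if not share_prompt:
        return n * ceil_div(prompt_len + output_len, block_size)
    # Full prompt blocks are stored once. Each sample ends up with its own copy
    # of the partially filled last prompt block (after copy-on-write) plus the
    # blocks that hold its own output.
    shared_full = prompt_len // block_size
    tail = prompt_len % block_size
    private = ceil_div(tail + output_len, block_size)
    return shared_full + n * private


# Hypothetical workload: 512-token prompt, 128-token outputs, n=3 samples.
unshared = blocks_needed(512, 128, n=3, share_prompt=False)
shared = blocks_needed(512, 128, n=3, share_prompt=True)
print(unshared, shared, f"{1 - shared / unshared:.0%} fewer blocks")
```

The saving grows with the ratio of prompt length to output length, because only the prompt portion is shared.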

Beam search

Beam search candidates share prefix blocks dynamically. As the search tree evolves, candidates that share a common prefix share physical blocks. When a candidate is pruned, its blocks are freed immediately. vLLM's reference counting handles all of this automatically.
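The bookkeeping for beam search reduces to incrementing counts when a candidate is created and decrementing them when it is pruned. A minimal sketch, with a hypothetical `RefCountedBlocks` class and hand-picked block ids:

```python
from collections import Counter


class RefCountedBlocks:
    """Hypothetical reference counter for physical KV blocks."""

    def __init__(self):
        self.refs = Counter()
        self.freed: list[int] = []

    def add_candidate(self, block_table: list[int]) -> list[int]:
        """Register a candidate that references every block in block_table."""
        for block in block_table:
            self.refs[block] += 1
        return list(block_table)

    def append_block(self, block_table: list[int], block: int) -> None:
        """Give one candidate a private block for its diverging tokens."""
        self.refs[block] += 1
        block_table.append(block)

    def prune(self, block_table: list[int]) -> None:
        """Drop a candidate; blocks nobody else references are freed."""
        for block in block_table:
            self.refs[block] -= 1
            if self.refs[block] == 0:
                del self.refs[block]
                self.freed.append(block)


blocks = RefCountedBlocks()
beam_a = blocks.add_candidate([0, 1])     # first candidate: blocks 0 and 1
beam_b = blocks.add_candidate(beam_a)     # forked candidate shares both blocks
blocks.append_block(beam_b, 2)            # beam_b diverges into its own block

blocks.prune(beam_b)                      # beam_b is dropped from the search
freed_after_prune = list(blocks.freed)
print(freed_after_prune, dict(blocks.refs))
```

Pruning the forked candidate frees only its private block; the shared prefix blocks stay alive because the surviving candidate still references them.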

The Research Behind It: Kwon et al., SOSP 2023

The paged attention paper was published at SOSP 2023 - one of the top systems conferences in computer science - by a team from UC Berkeley.

Full citation details:

Title: "Efficient Memory Management for Large Language Model Serving with PagedAttention"
Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
Affiliations: UC Berkeley (primary), Stanford University, UC San Diego
Venue: SOSP 2023 (ACM Symposium on Operating Systems Principles), Koblenz, Germany, October 23–26, 2023
arXiv: 2309.06180 (submitted September 12, 2023)
ACM DOI: 10.1145/3600006.3613165

Key findings from the paper

The paper benchmarks vLLM against FasterTransformer and Orca - the state-of-the-art systems at the time:

2–4x higher throughput at the same latency level
The improvement grows with longer sequences, larger models, and more complex decoding algorithms (beam search, parallel sampling)
Memory waste drops from 60–80% (existing systems) to <4% (vLLM)

The paper also quantifies exactly where memory goes in existing systems: only 20.4%–38.2% of KV cache memory stores actual token states, with 20.4% being the worst-performing configuration. The rest is reserved slots, internal fragmentation, and external fragmentation.

When Should You Use vLLM?

vLLM is the right choice for high-throughput serving of concurrent requests - especially with long contexts, parallel sampling, or shared system prompts.

Best fit

High-concurrency serving: The more requests you batch, the more PagedAttention's memory efficiency compounds.
Long context windows: Memory fragmentation gets worse as sequences grow. PagedAttention scales cleanly.
Parallel sampling / beam search: Memory sharing cuts costs dramatically.
Shared system prompts: Prefix caching means you compute that KV cache once, not per request.
Production deployments: LMSYS, LinkedIn, Amazon (Rufus), and Roblox all run vLLM in production.

Not the ideal fit

Single-request, latency-critical edge inference: If you're serving one user at a time and care about time-to-first-token above all else, the overhead of the block management layer adds marginal cost. Tools like llama.cpp or Ollama may be simpler.
Highly quantized, memory-constrained edge devices: vLLM loads models in FP16 by default and uses more idle VRAM than quantized alternatives.

The tradeoffs are real. But for any team running LLMs at scale - multiple concurrent users, production traffic, cost pressure - vLLM with paged attention is the default choice for a reason.

Key Takeaways

Traditional LLM serving wastes 60–80% of GPU memory through KV cache fragmentation and worst-case pre-allocation.

PagedAttention stores KV cache in fixed-size, non-contiguous 16-token blocks, cutting waste to under 4%.

vLLM achieves 14x–24x higher throughput than HuggingFace Transformers and 2.2x–3.5x over TGI (LLaMA-7B/13B on A10G/A100, ShareGPT dataset).

Memory sharing via the block table enables parallel sampling and beam search to cut memory usage by up to 55%, adding up to 2.2x more throughput.

LMSYS ran vLLM in production at 30K avg / 60K peak daily requests and cut their GPU count by 50%.

The paged attention paper (Kwon et al., SOSP 2023, arXiv:2309.06180) shows 2–4x gains over FasterTransformer and Orca at the same latency - gains that grow with model size and sequence length.

FAQ

What is paged attention in vLLM?

Paged attention is vLLM's core memory management algorithm. It divides each request's KV cache into fixed-size blocks (default: 16 tokens per block) that can be stored non-contiguously in GPU memory. A per-request block table maps logical block indices to physical GPU addresses. This eliminates both internal and external memory fragmentation, cutting KV cache waste from 60–80% (traditional systems) to under 4%.

How does PagedAttention improve throughput?

By eliminating memory fragmentation, PagedAttention lets vLLM fit far more concurrent requests into GPU memory. More concurrent requests mean larger effective batch sizes, and larger batches mean better GPU utilization. The result is 14x–24x higher throughput than HuggingFace Transformers on LLaMA-7B and LLaMA-13B benchmarks. Memory sharing for parallel sampling and beam search adds up to another 2.2x on top.

What is a KV cache in LLMs?

The KV cache stores the key and value vectors computed during the attention mechanism for all previously generated tokens. During autoregressive generation, each new token needs to attend to all prior tokens - recomputing those vectors from scratch at every step would be prohibitively expensive. The KV cache avoids that recomputation. The cost is GPU memory: for LLaMA-13B in FP16, a single token's KV data is ~0.78 MB, and a full 2,048-token sequence hits ~1.6 GB.

How much faster is vLLM than HuggingFace Transformers?

In the original vLLM benchmarks (June 2023), using LLaMA-7B on an NVIDIA A10G and LLaMA-13B on an NVIDIA A100 40GB with ShareGPT request distributions: 14x–24x faster for single-completion requests (n=1), and 8.5x–15x faster for parallel sampling (n=3). The exact multiplier depends on request rate and model size.

What is the PagedAttention paper?

The paged attention paper is "Efficient Memory Management for Large Language Model Serving with PagedAttention" by Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. from UC Berkeley. It was published at SOSP 2023 (ACM Symposium on Operating Systems Principles) and is available at arXiv:2309.06180. The paper introduces the PagedAttention algorithm and the vLLM serving system, demonstrating 2–4x throughput improvements over FasterTransformer and Orca at the same latency.

What is paged KV cache?

Paged KV cache is the memory layout that PagedAttention uses. Instead of storing a request's entire KV cache as one contiguous tensor, it stores it in fixed-size, non-contiguous blocks (pages) scattered across GPU memory. A block table tracks where each page lives. This is the same principle as OS virtual memory paging - it gives the illusion of a contiguous memory space while physically scattering data wherever free blocks exist. The result is near-zero memory fragmentation and flexible sharing of KV blocks across requests.

Conclusion

The core insight behind paged attention is simple: treat GPU memory the way operating systems treat RAM. Stop pre-allocating worst-case slabs. Use fixed-size pages, allocate on demand, and share blocks wherever possible.

That one shift - from contiguous pre-allocation to paged KV cache - is what turns a 60–80% memory waste problem into a <4% one, and what makes a 14x throughput improvement possible without changing a single model weight.

If you're serving LLMs in production, the question isn't whether to use vLLM paged attention. It's whether you've tuned it correctly for your workload.

Are you serving LLMs in production? What throughput gains have you seen after switching to vLLM - and what's still bottlenecking you?
