vLLM vs Ollama vs TGI: LLM Serving Framework Comparison
A framework decision that's easy to get wrong — they look similar on the surface but are built for fundamentally different use cases. Plus a step-by-step guide to running Llama 4 Scout on a single GPU.
Shubham Yadav
Machine Learning Researcher
Once you've decided to self-host an LLM, the next question is which serving framework to use. This is a choice most teams make once and live with for a long time, because migrating between serving frameworks after you've built infrastructure around one is painful.
The three frameworks that come up in every conversation are vLLM, Ollama, and HuggingFace's Text Generation Inference (TGI). They all do roughly the same thing at a surface level — take a model, expose an API, serve requests — but they're built for fundamentally different operating contexts.
This post covers:
- vLLM vs Ollama vs TGI at a glance — the key architectural difference that determines which one fits your use case
- vLLM deep dive — PagedAttention, continuous batching, throughput benchmarks, and configuration
- Ollama deep dive — GGUF quantization, platform support, and where simplicity is a genuine production advantage
- TGI deep dive — Docker-first setup, HuggingFace Hub integration, and when TGI beats both alternatives
- Full side-by-side comparison table — all three frameworks across 9 dimensions
- Running Llama 4 Scout on a single GPU — 4 complete setup options with working commands
- Production decision guide — which framework to use based on your concurrency, team, and hardware
vLLM vs Ollama vs TGI: Key Differences at a Glance
The core difference between these frameworks isn't features — it's the concurrency model each is built around.
| | vLLM | Ollama | TGI |
|---|---|---|---|
| Throughput at scale | Best-in-class | Lowest | Good |
| Concurrent batching | Continuous batching | None by default | Dynamic batching |
| Ease of setup | Moderate (pip/Linux) | Simplest (one command) | Simple (Docker) |
| Platform support | Linux + CUDA required | macOS, Linux, Windows, CPU | Linux + CUDA |
| Consumer GPU / CPU support | Limited | Excellent (GGUF) | Good (with quantization) |
| Quantization | AWQ, GPTQ | Auto (GGUF 4-bit/8-bit) | AWQ, GPTQ, bitsandbytes |
| HuggingFace Hub integration | Good | Good | Native |
| Production reliability | High | Low at concurrency | Medium |
| Best for | Production serving (20+ concurrent users) | Dev / local / small teams | Mid-scale / HF ecosystem |
The decision rule: if you're serving more than ~20 concurrent users, use vLLM. If you're setting up a development environment or small internal tool, use Ollama. If you're in the HuggingFace ecosystem and want something between the two, use TGI.
1. vLLM: Maximum Throughput for High-Concurrency Production Serving
vLLM delivers 2–4× higher throughput than naively configured alternatives at 50 concurrent requests, and the gap grows further above 200 concurrent users. It's the right framework when GPU utilization efficiency directly determines whether self-hosting economics work.
vLLM's performance advantage comes down to two mechanisms working together:
PagedAttention manages the KV cache — the memory that stores previous token states for each active request — in small non-contiguous blocks rather than pre-allocated contiguous slabs. Borrowed from OS virtual memory design, this dramatically reduces memory waste from padding and fragmentation, allowing significantly more requests to be active simultaneously on the same GPU.
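To make the block-based idea concrete, here is a minimal toy allocator in Python. It is a sketch of the general paging technique, not vLLM's implementation: the class name, the block size of 16 tokens, and the request ids are all assumptions made for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM default)


class BlockAllocator:
    """Toy paged KV-cache allocator: requests grab blocks only as they grow."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))      # physical block ids not in use
        self.tables: dict[str, list[int]] = {}   # request id -> its block ids
        self.lengths: dict[str, int] = {}        # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        n = self.lengths.get(req_id, 0)
        if n % BLOCK_SIZE == 0:
            # The request has no block yet, or its last block is full:
            # take any free block; it need not be adjacent to the others.
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id: str) -> None:
        # A finished request returns all of its blocks to the free pool.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = BlockAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("b")   # 5 tokens need 1 block
```

A request holding 20 tokens reserves 32 token slots here, rather than a slab sized for the maximum context length, which is where the memory savings come from.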
Continuous batching means the engine doesn't wait for a whole batch to complete before starting new requests: as soon as one request finishes, a queued request takes its slot. Self-hosting only looks economical when the GPU stays busy, and continuous batching is what sustains high utilization under variable load.
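The difference between the two scheduling styles can be shown with a small simulation. This is a simplified sketch, assuming each active request produces one token per decode step and ignoring prefill; the function names and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths: list[int], slots: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths: list[int], slots: int) -> int:
    """A freed slot is refilled from the queue before the next decode step."""
    queue = deque(lengths)
    active: list[int] = []  # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished ones leave the batch.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [8, 2, 2, 2, 8, 2]  # output tokens per request (hypothetical)
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
```

With these inputs the static schedule needs 18 decode steps and the continuous schedule needs 14, because short requests no longer wait on a long neighbor in the same batch.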
pip install vllm
# Serve a model
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90
# OpenAI-compatible API at :8000
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
"messages": [{"role": "user", "content": "Hello"}]
}'
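The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the command above is listening on localhost:8000 and returns the usual OpenAI-style response shape; the helper names build_chat_request and chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's message text."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

For example, chat("http://localhost:8000", "meta-llama/Llama-4-Scout-17B-16E-Instruct", "Hello") would return the reply text once the server is up. Because the endpoint shape is shared, pointing base_url at the Ollama or TGI ports shown later in this post should work the same way.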
The --gpu-memory-utilization flag sets the fraction of GPU memory vLLM is allowed to claim, and it is the main lever for trading throughput against stability. Setting it to 0.90 leaves 10% headroom for spikes. Setting it lower reduces max concurrency but makes OOM crashes less likely during unexpected load bursts.
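A rough way to reason about this setting is to treat it as a memory budget and subtract the model weights; what remains is approximately what the KV cache can use. The sketch below is back-of-the-envelope arithmetic only: it ignores activation memory and framework overhead, and the 80GB card and 40GB of weights are hypothetical inputs.

```python
def kv_cache_budget_gb(total_vram_gb: float, utilization: float, weights_gb: float) -> float:
    """Approximate memory left for the KV cache after loading weights."""
    return total_vram_gb * utilization - weights_gb


# Hypothetical 80GB GPU with 40GB of weights loaded.
budget_at_090 = kv_cache_budget_gb(80, 0.90, 40)
budget_at_080 = kv_cache_budget_gb(80, 0.80, 40)
```

In this example, dropping the setting from 0.90 to 0.80 shrinks the KV cache budget from about 32GB to about 24GB, which is the concurrency cost of the extra headroom.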
vLLM limitations: Linux + CUDA required; no consumer CPU or macOS support; more involved setup than Ollama; aggressive memory management means configuration errors produce OOM crashes that are harder to debug.
2. Ollama: Simplest Setup for Local Development and Single-User Production
Ollama handles one request at a time by default (no continuous batching), which makes it unsuitable for 10+ concurrent interactive users — but for local development, developer environments, and internal tools with low concurrency, its simplicity is a genuine production advantage, not just a convenience.
Ollama's ease of use is built on GGUF-format quantized models via the llama.cpp backend. GGUF quantization (4-bit or 8-bit) reduces model size by 4–8× and allows models to run on consumer hardware that wouldn't support the full-precision version:
| Model size | fp16 VRAM | 4-bit GGUF VRAM | Fits on |
|---|---|---|---|
| 7B / 8B | ~14GB | ~4–5GB | RTX 3060, M1 Mac |
| 13B | ~26GB | ~8GB | RTX 3090, M2 Mac |
| 70B | ~140GB | ~40GB | M3 Max (96GB), 2× RTX 4090 |
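The fp16 column in the table follows from simple arithmetic: parameter count times bits per weight. The sketch below reproduces it. The function name is an assumption, and the result covers weights only, which is why real 4-bit GGUF files land somewhat above the raw 4-bit figure (quantization metadata, mixed-precision layers, and the KV cache all add to it).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_7b = weight_memory_gb(7e9, 16)    # 14.0
q4_7b = weight_memory_gb(7e9, 4)       # 3.5
fp16_70b = weight_memory_gb(70e9, 16)  # 140.0
q4_70b = weight_memory_gb(70e9, 4)     # 35.0
```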
# Install (one command, all platforms)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run a model
ollama run llama4:scout
# OpenAI-compatible API at localhost:11434
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:scout",
"messages": [{"role": "user", "content": "Hello"}]
}'
Where Ollama specifically wins: internal tools, developer environments, single-user applications, batch jobs without concurrent execution, and teams evaluating whether self-hosting makes sense before committing to the infrastructure investment of vLLM.
Typical throughput: ~15–25 tokens/second on an M3 Max (96GB), compared to 300–600 tokens/second for an A100 with vLLM. Acceptable for a single developer or low-concurrency internal tool; not suitable for serving users in production.
3. TGI (Text Generation Inference): Best for HuggingFace Ecosystem Teams
TGI implements continuous batching and Flash Attention, giving it substantially better multi-user performance than Ollama, while remaining more operationally approachable than vLLM. It typically lands 20–40% below vLLM throughput at high concurrency — a tradeoff that's worth making for teams in the HuggingFace ecosystem.
TGI's Docker-first approach is its most distinctive feature. The container handles CUDA dependencies and environment setup in a reproducible way across machines — one of the most common sources of broken installs when setting up vLLM from scratch.
# Pull and run TGI via Docker
docker run --gpus all \
--shm-size 1g \
-p 8080:80 \
-v $HOME/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
--max-concurrent-requests 128
# Add --quantize awq only when --model-id points at weights already quantized with AWQ
# OpenAI-compatible API at :8080
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "tgi",
"messages": [{"role": "user", "content": "Hello"}]
}'
Where TGI specifically wins:
- Teams already using HuggingFace Inference Endpoints — you test locally with TGI and deploy with the same config to managed infrastructure
- Teams who want AWQ quantization without custom tooling
- Teams who prefer Docker over pip-installed Python processes for infrastructure consistency and reproducibility
TGI limitation: the 20–40% throughput gap versus vLLM at high concurrency means more GPUs for the same traffic. Whether it matters depends on your traffic levels — at moderate concurrency (20–80 concurrent users), TGI's operational simplicity may outweigh the throughput cost.
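The throughput gap translates into hardware as follows. This is a simple capacity sketch that assumes throughput scales linearly with GPU count; the function name and the four-GPU baseline are hypothetical.

```python
import math
from fractions import Fraction


def gpus_needed(baseline_gpus: int, gap_percent: int) -> int:
    """GPUs required to match a baseline fleet when per-GPU throughput is gap_percent lower."""
    # Exact rational arithmetic avoids floating-point surprises before rounding up.
    return math.ceil(Fraction(baseline_gpus * 100, 100 - gap_percent))


at_20_percent_gap = gpus_needed(4, 20)  # 4 / 0.8 = 5
at_40_percent_gap = gpus_needed(4, 40)  # 4 / 0.6 = 6.67, rounded up to 7
```

So traffic that four vLLM GPUs could absorb would need five to seven GPUs at the 20–40% gap quoted above, under these assumptions.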
How to Run Llama 4 Scout on a Single GPU: 4 Options
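Llama 4 Scout is Meta's most accessible large model: 17 billion active parameters with a mixture-of-experts architecture (109B total parameters, but only ~17B activate per token). This MoE structure keeps per-token cost low, because inference compute scales with active parameters, not total parameters. Weight memory still scales with the total, since every expert must be loaded, so whether Scout fits on a single GPU depends on the precision you serve it at and the VRAM available.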
Hardware compatibility:
| GPU | VRAM | Scout config | Recommended framework |
|---|---|---|---|
| H100 / A100 80GB | 80GB | fp16, 65k context | vLLM |
| A100 40GB / RTX A6000 | 40–48GB | fp16, 32k context | vLLM |
| RTX 4090 | 24GB | 4-bit AWQ, 16k context | vLLM (quantized) |
| M3 Max (96GB unified) | 96GB | GGUF Q4, 16k context | Ollama |
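The context lengths in the table shrink with VRAM because the KV cache grows linearly with the number of tokens held. The sketch below shows the standard sizing formula for attention with grouped KV heads. The layer count, KV head count, and head dimension used in the example are placeholder values, not Scout's published configuration; read the real ones from the model's config.json.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for a given number of cached tokens."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * tokens / 2**30


# Placeholder architecture values, chosen only to illustrate the scaling.
at_32k = kv_cache_gib(layers=48, kv_heads=8, head_dim=128, tokens=32768)
at_64k = kv_cache_gib(layers=48, kv_heads=8, head_dim=128, tokens=65536)
```

With these placeholder values, 32k cached tokens take 6 GiB and 64k take 12 GiB, so halving --max-model-len halves the worst-case cache needed per sequence.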
Option 1: vLLM on A100 80GB (Full-Precision Production Setup)
# Install vLLM (the default wheel is built against a specific CUDA version; check the vLLM install docs if yours differs)
pip install vllm
# Authenticate with HuggingFace (Llama 4 requires license acceptance)
huggingface-cli login
# Serve Scout with full fp16 precision
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--max-model-len 65536 \
--gpu-memory-utilization 0.88 \
--enable-prefix-caching \
--served-model-name llama4-scout
# For a 48GB GPU, reduce max-model-len to fit the KV cache
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--gpu-memory-utilization 0.85
The --enable-prefix-caching flag is worth enabling if your system prompt is consistent across requests — it caches the KV states for repeated prompt prefixes, which reduces cost and latency for multi-turn conversations significantly.
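The effect of prefix caching can be sketched with a toy cache keyed on block-aligned token prefixes. This illustrates the general idea rather than vLLM's actual data structures: the function name, the block size, and the token ids are all assumptions, and a real engine would hash blocks instead of storing whole prefixes.

```python
def tokens_to_prefill(prompt_tokens: list[int], cache: set, block_size: int = 16) -> int:
    """Return how many prompt tokens still need computing, updating the cache."""
    reused = 0
    matching = True
    for end in range(block_size, len(prompt_tokens) + 1, block_size):
        prefix = tuple(prompt_tokens[:end])
        if matching and prefix in cache:
            reused = end          # this whole prefix was computed before
        else:
            matching = False      # first miss: everything after is new work
            cache.add(prefix)
    return len(prompt_tokens) - reused


cache: set = set()
system_prompt = list(range(64))                  # shared 64-token system prompt
first = tokens_to_prefill(system_prompt + [1000, 1001, 1002], cache)
second = tokens_to_prefill(system_prompt + [2000] * 10, cache)
```

The first request computes all 67 tokens; the second reuses the 64-token system prompt and only computes its 10 new tokens.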
Option 2: vLLM with AWQ Quantization on RTX 4090 (24GB)
AWQ (Activation-aware Weight Quantization) is one of the higher-quality 4-bit quantization methods, and it often preserves accuracy better than GPTQ at the same compression ratio. The tradeoff is quality degradation relative to full precision, roughly 10–15% on reasoning-heavy tasks, though the exact figure depends on the model and the benchmark.
# Use a pre-quantized AWQ version from HuggingFace Hub (confirm the repository name exists before relying on it)
vllm serve bartowski/Llama-4-Scout-17B-16E-Instruct-AWQ \
--quantization awq \
--max-model-len 16384 \
--gpu-memory-utilization 0.90
# Or quantize yourself with AutoAWQ (takes 1-2 hours, needs full-model VRAM temporarily)
pip install autoawq
python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Llama-4-Scout-17B-16E-Instruct'
quant_path = 'llama4-scout-awq'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map='cuda')
quant_config = {
'zero_point': True,
'q_group_size': 128,
'w_bit': 4,
'version': 'GEMM'
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
"
vllm serve ./llama4-scout-awq \
--quantization awq \
--max-model-len 16384
Option 3: Ollama on MacBook Pro M3 Max (96GB Unified Memory)
For local development or a team of a few engineers sharing a machine, Apple Silicon with large unified memory is a practical option. No discrete GPU required.
ollama run llama4:scout
# Check what's running
ollama ps
# Set the context window from inside the interactive session
# (ollama run does not accept a --num-ctx flag)
ollama run llama4:scout
>>> /set parameter num_ctx 16384
Throughput: ~15–25 tokens/second on an M3 Max. Suitable for a single developer or low-concurrency internal tool, not for serving users in production.
Option 4: TGI with Docker on a Single A100
# Pre-download the model (speeds up container startup)
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct
docker run --gpus '"device=0"' \
--shm-size 2g \
-p 8080:80 \
-v $HOME/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
--max-concurrent-requests 64 \
--max-total-tokens 65536 \
--max-input-tokens 32768
TGI's Docker setup is the most reproducible of the three — the container handles CUDA versioning, which is one of the most common sources of broken installs when setting up vLLM from scratch on a new machine.
Which LLM Serving Framework Should You Use? Production Decision Guide
| Use case | Recommended framework | Why |
|---|---|---|
| High-traffic production (20+ concurrent users) | vLLM | 2–4× throughput advantage; continuous batching makes GPU utilization economics work |
| Local development / single developer | Ollama | One-command install; works on CPU, Mac, Windows; no CUDA required |
| Small internal tool (under 10 concurrent users) | Ollama | Simplicity outweighs performance gap at this concurrency level |
| HuggingFace Inference Endpoints migration path | TGI | Local TGI config deploys directly to HF managed infrastructure |
| Teams preferring Docker infrastructure | TGI | Container handles CUDA deps; reproducible across machines |
| 24GB consumer GPU (RTX 4090) | vLLM + AWQ | AWQ quantization fits Scout; maintains continuous batching |
| Apple Silicon (M-series Mac) | Ollama | Only framework with native Metal support and CPU offload |
| Evaluation before infrastructure commit | Ollama → vLLM | Start cheap, migrate when traffic justifies it |
For production deployments, wrap vLLM behind LiteLLM for provider-agnostic routing and automatic API fallback:
import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "llama4-scout",
            "litellm_params": {
                "model": "openai/llama4-scout",  # LiteLLM treats local vLLM as OpenAI-compatible
                "api_base": "http://your-gpu-host:8000/v1",
                "api_key": "none",
            },
        },
        # API fallback for when the self-hosted instance is down.
        # It needs its own model_name so the fallbacks mapping below can refer to it.
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"llama4-scout": ["gpt-4o"]}],
)
Sitting LiteLLM in front of vLLM gives you automatic fallback to an API provider if the self-hosted instance goes down — one of the simplest mitigations for the reliability gap between self-hosted and managed APIs.
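The underlying pattern is an ordered list of backends tried in turn. The sketch below shows that pattern in plain Python with stand-in functions. It is not LiteLLM's implementation, and the names call_with_fallback, self_hosted, and managed_api are hypothetical.

```python
from typing import Callable


def call_with_fallback(backends: list[tuple[str, Callable[[str], str]]],
                       prompt: str) -> tuple[str, str]:
    """Try each backend in order; return (backend name, reply) from the first that works."""
    errors = []
    for name, fn in backends:
        try:
            return name, fn(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


def self_hosted(prompt: str) -> str:
    # Stand-in for the vLLM endpoint, simulated here as being down.
    raise ConnectionError("gpu host unreachable")


def managed_api(prompt: str) -> str:
    # Stand-in for the hosted API provider.
    return f"echo: {prompt}"


used, reply = call_with_fallback(
    [("self-hosted", self_hosted), ("managed-api", managed_api)], "Hello"
)
```

A production router adds retries, timeouts, and cooldowns on top of this, which is the main reason to use an existing library rather than hand-rolling the loop.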
vLLM vs Ollama vs TGI Setup Checklist
- Confirm your hardware: Linux + CUDA for vLLM/TGI, any platform for Ollama
- Determine your concurrent user target: above ~20, vLLM is the default choice
- Accept the Llama 4 license on HuggingFace Hub before attempting to pull the model
- For vLLM on a 24GB GPU: use a pre-quantized AWQ model from HuggingFace Hub — do not try to run fp16
- For vLLM: set --gpu-memory-utilization 0.88–0.90 and --enable-prefix-caching if your system prompt is consistent
- For TGI: pre-download the model with huggingface-cli download before starting the Docker container to avoid timeout on first pull
- For Ollama: verify ollama ps shows the model loaded before testing the API
- Wrap vLLM in LiteLLM Router with an API provider fallback before going to production
- Monitor tokens/second and GPU utilization in the first week — below 60% utilization, the self-hosting economics don't hold
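The utilization item in the checklist can be turned into a quick cost check. The sketch below assumes cost per token is simply the GPU's hourly price divided by the tokens it actually produces; the $2/hour price and the 500 tokens/second peak are hypothetical inputs, so substitute your own measurements.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            peak_tokens_per_second: float,
                            utilization: float) -> float:
    """Effective USD cost per million generated tokens at a given utilization."""
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical GPU: $2.00/hour, 500 tokens/second at full load.
at_full = cost_per_million_tokens(2.0, 500, 1.0)
at_60_percent = cost_per_million_tokens(2.0, 500, 0.6)
at_30_percent = cost_per_million_tokens(2.0, 500, 0.3)
```

Because the GPU bill is fixed, cost per token rises in inverse proportion to utilization: halving utilization doubles the effective price.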