
vLLM vs Ollama vs TGI: LLM Serving Framework Comparison

A framework decision that's easy to get wrong — they look similar on the surface but are built for fundamentally different use cases. Plus a step-by-step guide to running Llama 4 Scout on a single GPU.

Shubham Yadav

Machine Learning Researcher

June 8, 2026 · 13 min read

Once you've decided to self-host an LLM, the next question is which serving framework to use. This is a choice most teams make once and live with for a long time, because migrating between serving frameworks after you've built infrastructure around one is painful.

The three frameworks that come up in every conversation are vLLM, Ollama, and HuggingFace's Text Generation Inference (TGI). They all do roughly the same thing at a surface level — take a model, expose an API, serve requests — but they're built for fundamentally different operating contexts.

This post covers:

  • vLLM vs Ollama vs TGI at a glance — the key architectural difference that determines which one fits your use case
  • vLLM deep dive — PagedAttention, continuous batching, throughput benchmarks, and configuration
  • Ollama deep dive — GGUF quantization, platform support, and where simplicity is a genuine production advantage
  • TGI deep dive — Docker-first setup, HuggingFace Hub integration, and when TGI beats both alternatives
  • Full side-by-side comparison table — all three frameworks across 9 dimensions
  • Running Llama 4 Scout on a single GPU — 4 complete setup options with working commands
  • Production decision guide — which framework to use based on your concurrency, team, and hardware

vLLM vs Ollama vs TGI: Key Differences at a Glance

The core difference between these frameworks isn't features — it's the concurrency model each is built around.

| Dimension | vLLM | Ollama | TGI |
|---|---|---|---|
| Throughput at scale | Best-in-class | Lowest | Good |
| Concurrent batching | Continuous batching | None by default | Continuous batching |
| Ease of setup | Moderate (pip/Linux) | Simplest (one command) | Simple (Docker) |
| Platform support | Linux + CUDA required | macOS, Linux, Windows, CPU | Linux + CUDA |
| Consumer GPU / CPU support | Limited | Excellent (GGUF) | Good (with quantization) |
| Quantization | AWQ, GPTQ | Auto (GGUF 4-bit/8-bit) | AWQ, GPTQ, bitsandbytes |
| HuggingFace Hub integration | Good | Good | Native |
| Production reliability | High | Low at concurrency | Medium |
| Best for | Production serving (20+ concurrent users) | Dev / local / small teams | Mid-scale / HF ecosystem |

The decision rule: if you're serving more than ~20 concurrent users, use vLLM. If you're setting up a development environment or small internal tool, use Ollama. If you're in the HuggingFace ecosystem and want something between the two, use TGI.

1. vLLM: Maximum Throughput for High-Concurrency Production Serving

vLLM delivers 2–4× higher throughput than naively configured alternatives at 50 concurrent requests, and the gap grows further above 200 concurrent users. It's the right framework when GPU utilization efficiency directly determines whether self-hosting economics work.

vLLM's performance advantage comes down to two mechanisms working together:

PagedAttention manages the KV cache — the memory that stores previous token states for each active request — in small non-contiguous blocks rather than pre-allocated contiguous slabs. Borrowed from OS virtual memory design, this dramatically reduces memory waste from padding and fragmentation, allowing significantly more requests to be active simultaneously on the same GPU.
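To make the memory saving concrete, here is a minimal sketch that counts reserved KV-cache slots under the two allocation schemes. It is a toy accounting model, not vLLM's allocator; the function names, the example sequence lengths, and the block size of 16 tokens are assumptions chosen for illustration.

```python
import math


def contiguous_slots(seq_lens, max_model_len):
    """Slots reserved when every request pre-allocates max_model_len tokens."""
    return len(seq_lens) * max_model_len


def paged_slots(seq_lens, block_size=16):
    """Slots reserved when each request only holds the blocks it has started filling."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


# Hypothetical workload: tokens currently held by four active requests.
seq_lens = [100, 900, 2500, 37]
used = sum(seq_lens)

reserved_contiguous = contiguous_slots(seq_lens, max_model_len=4096)
reserved_paged = paged_slots(seq_lens, block_size=16)

print(f"used={used} contiguous={reserved_contiguous} paged={reserved_paged}")
```

With paging, the only waste is the unfilled tail of each request's last block, so the freed slots can hold additional concurrent requests.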

Continuous batching means the engine doesn't wait for a whole batch to complete before admitting new requests: as soon as one request finishes, a queued request takes its slot. Self-hosting economics assume the GPU stays busy, and continuous batching is what sustains near-100% utilization under variable load.
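The scheduling difference can be sketched with a toy simulation that counts decode steps. It assumes each active request produces one token per step and ignores prefill; the function names and the workload are hypothetical, and real schedulers are considerably more involved.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes before the next batch starts."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """A finished request frees its slot, and the next queued request fills it immediately."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; requests on their last token leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths (in tokens) for six queued requests, two slots.
lengths = [8, 2, 2, 2, 8, 2]
static_steps = static_batching_steps(lengths, slots=2)
continuous_steps = continuous_batching_steps(lengths, slots=2)
print(static_steps, continuous_steps)
```

The gap comes from short requests no longer waiting on the longest request in their batch, which is also why the benefit grows with more variable output lengths.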

pip install vllm

# Serve a model
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90

# OpenAI-compatible API at :8000
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
    }'

The --gpu-memory-utilization flag is the main lever for trading throughput against stability. Setting it to 0.90 leaves 10% headroom for spikes. Setting it lower reduces max concurrency but makes OOM crashes less likely during unexpected load bursts.
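A back-of-the-envelope calculation shows why this flag moves max concurrency. The sketch below is a simplification that ignores activation memory and framework overhead; the model shape (32 layers, 8 KV heads, head dimension 128) and the 16GB weight size are hypothetical values, not Scout's.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_token_budget(total_vram_gb, utilization, weights_gb, per_token_bytes):
    """Tokens of KV cache that fit after the weights are loaded (activations ignored)."""
    budget_bytes = (total_vram_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# Hypothetical model shape and weight size, chosen only for illustration.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget_at_90 = kv_token_budget(80, 0.90, weights_gb=16, per_token_bytes=per_token)
budget_at_80 = kv_token_budget(80, 0.80, weights_gb=16, per_token_bytes=per_token)

print(per_token, budget_at_90, budget_at_80)
```

Dividing the token budget by your typical sequence length gives a rough ceiling on simultaneously active requests at each setting.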

vLLM limitations: Linux + CUDA required; no consumer CPU or macOS support; more involved setup than Ollama; aggressive memory management means configuration errors produce OOM crashes that are harder to debug.

2. Ollama: Simplest Setup for Local Development and Single-User Production

Ollama handles one request at a time by default (no continuous batching), which makes it unsuitable for 10+ concurrent interactive users — but for local development, developer environments, and internal tools with low concurrency, its simplicity is a genuine production advantage, not just a convenience.

Ollama's ease of use is built on GGUF-format quantized models via the llama.cpp backend. GGUF quantization (4-bit or 8-bit) cuts weight size to roughly a quarter to a half of the fp16 footprint and allows models to run on consumer hardware that wouldn't support the full-precision version:

| Model size | fp16 VRAM | 4-bit GGUF VRAM | Fits on |
|---|---|---|---|
| 7B / 8B | ~14GB | ~4–5GB | RTX 3060, M1 Mac |
| 13B | ~26GB | ~8GB | RTX 3090, M2 Mac |
| 70B | ~140GB | ~40GB | 2× RTX 4090, M3 Max (96GB) |

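The fp16 column follows from simple arithmetic: parameters times bits per weight, divided by eight. The helper below is a rough sketch for weights only; the function name is hypothetical, and it does not account for KV cache or runtime overhead.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


for size in (7, 13, 70):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size}B: fp16 ~{fp16:.1f}GB, nominal 4-bit ~{q4:.1f}GB")
```

The nominal 4-bit figures come out somewhat below the table's values because practical 4-bit GGUF variants store more than 4 bits per weight on average and the runtime needs extra memory for context.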
# Install (one command, all platforms)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama4:scout

# OpenAI-compatible API at localhost:11434
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama4:scout",
        "messages": [{"role": "user", "content": "Hello"}]
    }'

Where Ollama specifically wins: internal tools, developer environments, single-user applications, batch jobs without concurrent execution, and teams evaluating whether self-hosting makes sense before committing to the infrastructure investment of vLLM.

Typical throughput: ~15–25 tokens/second on an M3 Max (96GB), compared to 300–600 tokens/second for an A100 with vLLM. Acceptable for a single developer or low-concurrency internal tool; not suitable for serving users in production.

3. TGI (Text Generation Inference): Best for HuggingFace Ecosystem Teams

TGI implements continuous batching and Flash Attention, giving it substantially better multi-user performance than Ollama, while remaining more operationally approachable than vLLM. It typically lands 20–40% below vLLM throughput at high concurrency — a tradeoff that's worth making for teams in the HuggingFace ecosystem.

TGI's Docker-first approach is its most distinctive feature. The container handles CUDA dependencies and environment setup in a reproducible way across machines — one of the most common sources of broken installs when setting up vLLM from scratch.

# Pull and run TGI via Docker (--quantize awq expects a checkpoint that is already AWQ-quantized)
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --max-concurrent-requests 128 \
    --quantize awq

# OpenAI-compatible API at :8080
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tgi",
        "messages": [{"role": "user", "content": "Hello"}]
    }'
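Because all three servers expose the same OpenAI-style chat endpoint, a single client can target any of them. The sketch below uses only the Python standard library; the ports and model names are taken from the commands in this post, while the `ENDPOINTS` table and helper names are hypothetical. It assumes the server returns the usual OpenAI chat-completion response shape.

```python
import json
import urllib.request

# Base URLs and model names as used in the curl examples in this post.
ENDPOINTS = {
    "vllm": ("http://localhost:8000/v1/chat/completions",
             "meta-llama/Llama-4-Scout-17B-16E-Instruct"),
    "ollama": ("http://localhost:11434/v1/chat/completions", "llama4:scout"),
    "tgi": ("http://localhost:8080/v1/chat/completions", "tgi"),
}


def build_request(framework, prompt):
    """Build (but do not send) a chat-completions POST for the chosen server."""
    url, model = ENDPOINTS[framework]
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def chat(framework, prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(framework, prompt), timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling `chat("ollama", "Hello")` requires the corresponding server to be running locally; switching frameworks then only changes the first argument.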

Where TGI specifically wins:

  • Teams already using HuggingFace Inference Endpoints — you test locally with TGI and deploy with the same config to managed infrastructure
  • Teams who want AWQ quantization without custom tooling
  • Teams who prefer Docker over pip-installed Python processes for infrastructure consistency and reproducibility

TGI limitation: the 20–40% throughput gap versus vLLM at high concurrency means more GPUs for the same traffic. Whether it matters depends on your traffic levels — at moderate concurrency (20–80 concurrent users), TGI's operational simplicity may outweigh the throughput cost.
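The GPU cost of that gap is easy to estimate. The sketch below assumes throughput scales linearly with GPU count, which is a simplification; the function name and the 10-GPU baseline are hypothetical.

```python
import math


def gpus_needed(vllm_gpus, relative_throughput):
    """GPUs required to match a vLLM fleet when each GPU delivers a fraction of its throughput."""
    return math.ceil(vllm_gpus / relative_throughput)


# A 20-40% throughput gap means each TGI GPU delivers 80% down to 60% of a vLLM GPU.
best_case = gpus_needed(10, 0.80)
worst_case = gpus_needed(10, 0.60)
print(best_case, worst_case)
```

For a hypothetical fleet that needs 10 GPUs under vLLM, the same traffic would need between 13 and 17 GPUs under these assumptions.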

How to Run Llama 4 Scout on a Single GPU: 4 Options

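Llama 4 Scout is Meta's most accessible large model: a mixture-of-experts architecture with 109B total parameters, of which only ~17B activate per token. This MoE structure keeps per-token cost low, because inference compute scales with active parameters, not total parameters. Weight memory still scales with the total, since every expert must be resident (about 2 bytes per parameter at 16-bit precision), so whether a single GPU is enough depends on its VRAM and the precision you serve at.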

Hardware compatibility:

| GPU | VRAM | Scout config | Recommended framework |
|---|---|---|---|
| H100 / A100 80GB | 80GB | fp16, 65k context | vLLM |
| A100 40GB / RTX A6000 | 40–48GB | fp16, 32k context | vLLM |
| RTX 4090 | 24GB | 4-bit AWQ, 16k context | vLLM (quantized) |
| M3 Max (96GB unified) | 96GB | GGUF Q4, 16k context | Ollama |

Option 1: vLLM on A100 80GB (Full-Precision Production Setup)

# Install vLLM (the PyPI wheel ships a prebuilt CUDA backend; check the vLLM install docs for the CUDA version it targets)
pip install vllm

# Authenticate with HuggingFace (Llama 4 requires license acceptance)
huggingface-cli login

# Serve Scout with full fp16 precision
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.88 \
    --enable-prefix-caching \
    --served-model-name llama4-scout

# For a 48GB GPU, reduce max-model-len to fit the KV cache
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85

The --enable-prefix-caching flag is worth enabling if your system prompt is consistent across requests — it caches the KV states for repeated prompt prefixes, which reduces cost and latency for multi-turn conversations significantly.
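The idea behind prefix caching can be sketched as a block-level lookup keyed by everything that precedes each block. This is a toy illustration, not vLLM's implementation: the `PrefixCache` class, its block size of 4, and the integer "tokens" are all hypothetical, and a real engine would store hashes and KV tensors instead of token tuples.

```python
class PrefixCache:
    """Toy block-level prefix cache: a block is reusable if its whole prefix matches."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = set()

    def _block_keys(self, tokens):
        # One key per full block; each key covers all tokens up to that block's end.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def process(self, tokens):
        """Return how many leading tokens were already cached, then cache this prompt."""
        keys = self._block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break
            hits += 1
        self.blocks.update(keys)
        return hits * self.block_size


system_prompt = list(range(10))  # stand-in for a shared 10-token system prompt
cache = PrefixCache(block_size=4)
first = cache.process(system_prompt + [100, 101])
second = cache.process(system_prompt + [200, 201, 202])
print(first, second)
```

The second request reuses the two full blocks that lie entirely inside the shared system prompt; the block that mixes system-prompt and user tokens has to be recomputed.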

Option 2: vLLM with AWQ Quantization on RTX 4090 (24GB)

AWQ (Activation-aware Weight Quantization) is one of the higher-quality 4-bit quantization methods and often holds up better than GPTQ at the same compression ratio. The tradeoff is approximately 10–15% quality degradation relative to full precision on reasoning-heavy tasks.

# Use a pre-quantized AWQ version from HuggingFace Hub (confirm the repository name on the Hub first)
vllm serve bartowski/Llama-4-Scout-17B-16E-Instruct-AWQ \
    --quantization awq \
    --max-model-len 16384 \
    --gpu-memory-utilization 0.90

# Or quantize yourself with AutoAWQ (takes 1-2 hours, needs full-model VRAM temporarily)
pip install autoawq

python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'meta-llama/Llama-4-Scout-17B-16E-Instruct'
quant_path = 'llama4-scout-awq'

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map='cuda')

quant_config = {
    'zero_point': True,
    'q_group_size': 128,
    'w_bit': 4,
    'version': 'GEMM'
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
"

vllm serve ./llama4-scout-awq \
    --quantization awq \
    --max-model-len 16384

Option 3: Ollama on MacBook Pro M3 Max (96GB Unified Memory)

For local development or a team of a few engineers sharing a machine, Apple Silicon with large unified memory is a practical option. No discrete GPU required.

ollama run llama4:scout

# Check what's running
ollama ps

# Run with a specific context window (set num_ctx inside the interactive session)
ollama run llama4:scout
>>> /set parameter num_ctx 16384

Throughput: ~15–25 tokens/second on an M3 Max. Suitable for a single developer or low-concurrency internal tool, not for serving users in production.

Option 4: TGI with Docker on a Single A100

# Pre-download the model (speeds up container startup)
huggingface-cli download meta-llama/Llama-4-Scout-17B-16E-Instruct

docker run --gpus '"device=0"' \
    --shm-size 2g \
    -p 8080:80 \
    -v $HOME/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --max-concurrent-requests 64 \
    --max-total-tokens 65536 \
    --max-input-tokens 32768

TGI's Docker setup is the most reproducible of the three — the container handles CUDA versioning, which is one of the most common sources of broken installs when setting up vLLM from scratch on a new machine.

Which LLM Serving Framework Should You Use? Production Decision Guide

| Use case | Recommended framework | Why |
|---|---|---|
| High-traffic production (20+ concurrent users) | vLLM | 2–4× throughput advantage; continuous batching makes GPU utilization economics work |
| Local development / single developer | Ollama | One-command install; works on CPU, Mac, Windows; no CUDA required |
| Small internal tool (under 10 concurrent users) | Ollama | Simplicity outweighs performance gap at this concurrency level |
| HuggingFace Inference Endpoints migration path | TGI | Local TGI config deploys directly to HF managed infrastructure |
| Teams preferring Docker infrastructure | TGI | Container handles CUDA deps; reproducible across machines |
| 24GB consumer GPU (RTX 4090) | vLLM + AWQ | AWQ quantization fits Scout; maintains continuous batching |
| Apple Silicon (M-series Mac) | Ollama | Only framework with native Metal support and CPU offload |
| Evaluation before infrastructure commit | Ollama → vLLM | Start cheap, migrate when traffic justifies it |
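The table can be read as a small decision procedure. The function below is one possible encoding, not a definitive rule: the function name, parameter names, the precedence between rows, and the handling of the 10 to 19 user range (which the table does not cover) are all assumptions.

```python
def recommend_framework(concurrent_users, platform="linux-cuda",
                        hf_endpoints=False, prefers_docker=False):
    """Map the decision guide's rows onto a single recommendation."""
    if platform != "linux-cuda":
        # Apple Silicon, Windows, or CPU-only machines: only Ollama runs there.
        return "Ollama"
    if hf_endpoints or prefers_docker:
        return "TGI"
    if concurrent_users >= 20:
        return "vLLM"
    if concurrent_users < 10:
        return "Ollama"
    # Assumption: between 10 and 19 users, Ollama's lack of batching starts to hurt.
    return "vLLM"


print(recommend_framework(50))
print(recommend_framework(3))
print(recommend_framework(30, prefers_docker=True))
```

Encoding the rows this way makes the precedence explicit: hardware constraints come first, ecosystem preferences second, and concurrency decides the rest.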

For production deployments, wrap vLLM behind LiteLLM for provider-agnostic routing and automatic API fallback:

import os

from litellm import Router

router = Router(
    model_list=[
        # Primary: self-hosted vLLM; LiteLLM treats it as an OpenAI-compatible endpoint
        {
            "model_name": "llama4-scout",
            "litellm_params": {
                "model": "openai/llama4-scout",  # must match vLLM's --served-model-name
                "api_base": "http://your-gpu-host:8000/v1",
                "api_key": "none",
            },
        },
        # API fallback for when the self-hosted instance is down.
        # It needs its own model_name so the fallbacks mapping below can reference it.
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "gpt-4o",
                "api_key": os.environ["OPENAI_API_KEY"],
            },
        },
    ],
    fallbacks=[{"llama4-scout": ["gpt-4o"]}],
)

Placing LiteLLM in front of vLLM gives you automatic fallback to an API provider if the self-hosted instance goes down, which is one of the simplest mitigations for the reliability gap between self-hosted and managed APIs.

vLLM vs Ollama vs TGI Setup Checklist

  • Confirm your hardware: Linux + CUDA for vLLM/TGI, any platform for Ollama
  • Determine your concurrent user target — above ~20, vLLM is required
  • Accept the Llama 4 license on HuggingFace Hub before attempting to pull the model
  • For vLLM on a 24GB GPU: use a pre-quantized AWQ model from HuggingFace Hub — do not try to run fp16
  • For vLLM: set --gpu-memory-utilization 0.88–0.90 and --enable-prefix-caching if your system prompt is consistent
  • For TGI: pre-download the model with huggingface-cli download before starting the Docker container to avoid timeout on first pull
  • For Ollama: verify ollama ps shows the model loaded before testing the API
  • Wrap vLLM in LiteLLM Router with an API provider fallback before going to production
  • Monitor tokens/second and GPU utilization in the first week — below 60% utilization, the self-hosting economics don't hold
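The last checklist item can be turned into a simple check over sampled GPU utilization. This is a minimal sketch: the function name and the sample values are hypothetical, and how you collect the samples (for example from your existing GPU monitoring) is left open.

```python
from statistics import mean


def utilization_supports_self_hosting(samples, threshold=0.60):
    """True if average GPU utilization over the sampling window meets the threshold."""
    return mean(samples) >= threshold


# Hypothetical utilization samples (fractions of 1.0) from a first week in production.
busy_week = [0.9, 0.8, 0.7]
quiet_week = [0.5, 0.4, 0.6]
print(utilization_supports_self_hosting(busy_week))
print(utilization_supports_self_hosting(quiet_week))
```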