Kubernetes LLM Inference with llm-d: Deploy & Autoscale
How to deploy, scale, and manage open-source LLM inference workloads on Kubernetes using llm-d — the operator-based framework built for production GPU clusters.
Shubham Yadav
Machine Learning Researcher
llm-d is a Kubernetes operator for LLM inference that treats model deployments the way Kubernetes treats application workloads — as declarative specs that the platform reconciles against. Rather than scripting kubectl commands and managing GPU scheduling manually, you define what you want: which model, how many replicas, what hardware, what autoscaling policy. llm-d handles the rest.
This matters because LLM inference has requirements Kubernetes wasn't designed for: GPU affinity, large model download times, KV-cache state that doesn't survive preemption, and throughput autoscaling that doesn't map cleanly to CPU or memory HPA metrics. llm-d adds first-class primitives for each of these on top of standard Kubernetes.
This guide covers:
- llm-d architecture — CRDs, operator, and how it differs from raw Kubernetes deployments
- Installation and cluster prerequisites — GPU operator, storage, namespace setup
- Deploying your first model — an LLMInferenceService manifest from scratch
- Autoscaling configuration — request-queue and throughput-based scaling
- Multi-model serving — multiple models on shared GPU infrastructure with LLMInferencePool
- Observability — the metrics that matter for LLM workloads
- Decision guide — when llm-d is the right choice vs alternatives
1. llm-d Architecture: Kubernetes Operator, CRDs, and the vLLM Engine
llm-d extends Kubernetes with a custom operator and two core CRDs: LLMInferenceService for individual model deployments and LLMInferencePool for multi-model resource groups. The operator watches these resources and reconciles them against cluster state — creating pods, configuring GPU affinity, managing model downloads, and updating routing.
The inference engine is vLLM. llm-d manages the lifecycle; vLLM handles the actual token generation. This separation keeps the inference stack up to date (vLLM releases independently) while the operator handles Kubernetes-specific concerns.
| Component | What it does |
|---|---|
| llm-d operator | Watches CRDs, reconciles pod state, manages GPU scheduling |
| LLMInferenceService CRD | Declares a single model deployment: model ID, hardware, replicas, autoscaling |
| LLMInferencePool CRD | Groups services for shared resource management and load balancing |
| vLLM (inference engine) | Runs inside each pod — tokenization, KV cache, batching |
| Model storage | PVC or object storage (S3/GCS) — model weights downloaded at pod start |
The key architectural decision is statelessness at the pod level: model weights live in persistent storage, and pods pull them at startup. Pods can be preempted and rescheduled without losing model state, because the replacement pod loads the same weights from storage and resumes serving. In-flight requests and their KV cache are still lost when a pod is preempted.
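The reconcile pattern described above can be sketched in a few lines. This is a toy illustration, not llm-d's implementation: DesiredState, ObservedPod, and reconcile are hypothetical names, and the action strings stand in for real Kubernetes API calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesiredState:
    """Hypothetical, simplified view of an LLMInferenceService spec."""
    model_id: str
    replicas: int


@dataclass(frozen=True)
class ObservedPod:
    """Hypothetical view of a running inference pod."""
    name: str
    model_id: str
    ready: bool


def reconcile(desired: DesiredState, pods: list[ObservedPod]) -> list[str]:
    """Return the actions needed to move observed state toward the spec."""
    actions: list[str] = []
    # Pods serving a different model than the spec are stale and get removed.
    stale = [p for p in pods if p.model_id != desired.model_id]
    current = [p for p in pods if p.model_id == desired.model_id]
    actions += [f"delete {p.name}" for p in stale]

    diff = desired.replicas - len(current)
    if diff > 0:
        actions += [f"create pod for {desired.model_id}"] * diff
    elif diff < 0:
        # Remove not-ready pods first so serving capacity is preserved.
        victims = sorted(current, key=lambda p: p.ready)[:-diff]
        actions += [f"delete {p.name}" for p in victims]
    return actions
```

Calling reconcile repeatedly with fresh observations converges on the spec: once the observed pods match the desired state, it returns an empty list.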
2. Cluster Prerequisites and llm-d Installation
llm-d requires a Kubernetes cluster with GPU nodes and the NVIDIA GPU Operator installed. The GPU Operator handles driver installation, device plugin configuration, and GPU health monitoring across nodes.
# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace
Verify GPUs are allocatable:
kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu")'
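If jq is not available, the same check can be scripted. The sketch below parses the JSON printed by kubectl get nodes -o json and reports allocatable GPUs per node; the function name allocatable_gpus is illustrative, and the field paths assume the standard Node object layout.

```python
import json


def allocatable_gpus(nodes_json: str) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count, skipping non-GPU nodes."""
    result: dict[str, int] = {}
    for item in json.loads(nodes_json).get("items", []):
        alloc = item.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resources as strings, e.g. "4".
        count = int(alloc.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[item["metadata"]["name"]] = count
    return result
```

An empty result means the device plugin has not advertised any GPUs yet, which usually points at the GPU Operator rollout rather than at llm-d.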
Install llm-d via Helm:
helm repo add llmd https://charts.llm-d.ai
helm repo update
helm install llm-d llmd/llm-d \
  --namespace llm-d-system \
  --create-namespace \
  --set global.storageClass=standard
| Prerequisite | Minimum version | Notes |
|---|---|---|
| Kubernetes | 1.27+ | 1.29+ recommended |
| NVIDIA GPU Operator | 23.9+ | Required for GPU device plugin |
| Helm | 3.10+ | For operator installation |
| Storage class | Any | PVC for model weight caching |
The operator runs in llm-d-system and immediately registers both CRDs. It begins watching for LLMInferenceService and LLMInferencePool resources across all namespaces.
3. Deploying Your First LLMInferenceService
The LLMInferenceService manifest declares everything about a model deployment. The minimum viable spec for Llama 3.1 8B on a single A10G:
apiVersion: inference.llm-d.ai/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: inference
spec:
  model:
    id: meta-llama/Meta-Llama-3.1-8B-Instruct
    source:
      type: HuggingFace
      huggingFaceTokenSecret: hf-token-secret
  engine:
    type: vLLM
    args:
      - "--dtype=bfloat16"
      - "--max-model-len=32768"
  resources:
    requests:
      nvidia.com/gpu: "1"
      memory: "24Gi"
    limits:
      nvidia.com/gpu: "1"
  replicas: 1
Apply and watch:
kubectl apply -f llama-3-8b.yaml
kubectl get llminferenceservice llama-3-8b -n inference -w
First startup takes 3–8 minutes as model weights download. Subsequent restarts use the PVC cache and initialize in 30–90 seconds. The operator creates a Deployment, a Service, and a PVC automatically.
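Before applying a manifest like the one above, it can help to sanity-check that the chosen --max-model-len fits the GPU. The following is a rough lower-bound estimate only, ignoring activations and runtime overhead. The model shape used here (32 layers, 8 KV heads, head dimension 128, about 8.03B parameters) is an assumption based on the published Llama 3.1 8B configuration and should be verified against the model card; the function names are illustrative.

```python
GIB = 2**30


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def estimate_vram_gib(params_billion: float, max_model_len: int, layers: int,
                      kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> dict[str, float]:
    """Rough lower bound: weights plus KV cache for one full-length sequence."""
    weights = params_billion * 1e9 * dtype_bytes
    kv_one_seq = kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes) * max_model_len
    return {
        "weights_gib": weights / GIB,
        "kv_cache_one_seq_gib": kv_one_seq / GIB,
        "total_gib": (weights + kv_one_seq) / GIB,
    }


# Assumed Llama 3.1 8B shape; bfloat16 means 2 bytes per value.
estimate = estimate_vram_gib(8.03, 32768, layers=32, kv_heads=8, head_dim=128)
```

Under these assumptions the weights take roughly 15 GiB and a single 32,768-token sequence needs 4 GiB of KV cache, so a 24GB card has limited headroom left for concurrent long sequences.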
The service exposes an OpenAI-compatible /v1/chat/completions endpoint:
from openai import OpenAI

# In-cluster DNS name of the Service created by the operator
client = OpenAI(
    base_url="http://llama-3-8b.inference.svc.cluster.local/v1",
    api_key="not-needed",  # placeholder value
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain gradient descent."}],
)
print(response.choices[0].message.content)
Any OpenAI-compatible client works against the service endpoint without modification.
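For environments where the openai package is not installed, the same request can be built with the standard library. This sketch only assumes the OpenAI chat completions wire format; build_chat_request and extract_reply are illustrative helper names, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send it, pass the request to urllib.request.urlopen with a timeout and feed the bytes returned by read() into extract_reply.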
4. Autoscaling LLM Inference: Request Queue and Throughput Metrics
Out of the box, the Kubernetes Horizontal Pod Autoscaler (HPA) scales on CPU and memory. Neither maps to LLM inference: a GPU can be 100% utilized while the CPU sits idle. llm-d exposes two autoscaling mechanisms suited to inference workloads.
Request queue depth — scale up when pending requests per pod exceed a threshold. Reacts to burst traffic before latency degrades.
Decode throughput — scale down when tokens-per-second per pod falls below a floor. Prevents over-provisioning during low-traffic periods.
spec:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 8
    metrics:
      - type: RequestQueue
        requestQueue:
          targetAverageQueueDepth: 5
      - type: Throughput
        throughput:
          targetDecodeTokensPerSecond: 400
    scaleDownStabilizationSeconds: 300
| Autoscaling metric | When to use | Caution |
|---|---|---|
| RequestQueue | Primary scaling signal; reacts to demand directly | Set scaleDownStabilizationSeconds >120 to avoid flapping |
| Throughput | Secondary; prevents idle over-provisioning | Varies with batch size; tune from production data |
| GPU utilization (via DCGM) | Useful for capacity planning | Lags demand — poor scaling trigger |
The scaleDownStabilizationSeconds setting is important: pod startup takes minutes, so premature scale-down followed immediately by a traffic burst creates latency spikes. Start at 300s and tune from actual traffic patterns.
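To make the interaction between queue depth and the stabilization window concrete, here is a minimal sketch. It assumes the autoscaler follows the proportional rule used by the standard Kubernetes HPA and holds scale-down by taking the highest recommendation seen inside the window; whether llm-d implements exactly this is an assumption, and both function names are illustrative.

```python
import math


def desired_replicas(current: int, avg_queue_depth: float, target_queue_depth: float,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA-style proportional rule: replicas scale with the observed/target ratio."""
    raw = math.ceil(current * avg_queue_depth / target_queue_depth)
    return max(min_replicas, min(max_replicas, raw))


def stabilized_replicas(history: list[tuple[float, int]], now: float,
                        window_seconds: float) -> int:
    """Return the highest recommendation within the window.

    history holds (timestamp_seconds, recommendation) pairs, oldest first.
    Scale-up takes effect immediately because the newest value is the maximum;
    scale-down waits until older, higher recommendations age out of the window.
    """
    recent = [rec for ts, rec in history if now - ts <= window_seconds]
    # Fall back to the latest recommendation if everything has aged out.
    return max(recent) if recent else history[-1][1]
```

With two replicas, an average queue depth of 12, and a target of 5, desired_replicas recommends 5 pods. A later recommendation of 2 only takes effect once no higher recommendation remains inside the 300-second window.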
5. Multi-Model Serving with LLMInferencePool
LLMInferencePool groups multiple LLMInferenceService resources and manages shared GPU resources across them. This enables two patterns: shared-hardware multi-model serving and priority-based resource allocation.
apiVersion: inference.llm-d.ai/v1alpha1
kind: LLMInferencePool
metadata:
  name: production-pool
  namespace: inference
spec:
  services:
    - name: llama-3-8b
      priority: high
      weight: 60
    - name: phi-3-mini
      priority: normal
      weight: 40
  resourcePolicy:
    gpuBudget: "4"
    evictionPolicy: LowPriority
With weight: 60 and weight: 40, the pool allocates 60% of GPU capacity to llama-3-8b and 40% to phi-3-mini during normal operation. Under pressure, evictionPolicy: LowPriority evicts phi-3-mini replicas first to free capacity for the high-priority service.
The operator validates that co-located services don't exceed node VRAM. For example, a 70B model (40GB) and a 13B model (28GB) need 68GB for weights alone on an 80GB node; once KV cache and runtime overhead are added, the operator refuses the placement if the combined footprint exceeds available memory. Plan per-model VRAM allocations before writing pool specs.
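The weight split and the co-location check can both be worked through by hand before writing a pool spec. The sketch below assumes whole GPUs are divided with a largest-remainder rule, which is one plausible way to round a 60/40 split of four GPUs and not necessarily what the operator does; split_gpu_budget, fits_on_node, and the reserve_gb parameter are illustrative.

```python
def split_gpu_budget(gpu_budget: int, weights: dict[str, int]) -> dict[str, int]:
    """Largest-remainder split of whole GPUs in proportion to weights."""
    total = sum(weights.values())
    exact = {name: gpu_budget * w / total for name, w in weights.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = gpu_budget - sum(alloc.values())
    # Hand remaining GPUs to the services with the largest fractional share.
    by_remainder = sorted(weights, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def fits_on_node(footprints_gb: list[float], node_vram_gb: float,
                 reserve_gb: float = 0.0) -> bool:
    """True if the combined footprints plus a reserve fit in node VRAM."""
    return sum(footprints_gb) + reserve_gb <= node_vram_gb
```

Under this rounding rule a budget of 4 GPUs with weights 60 and 40 yields 2 GPUs each, since 2.4 and 1.6 cannot both be honored with whole devices. For the 40GB and 28GB models above, fits_on_node([40, 28], 80) passes on weights alone, while adding a 16GB reserve for KV cache makes it fail.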
6. Observability: The Metrics That Matter for LLM Inference
vLLM exposes Prometheus metrics natively. llm-d surfaces additional operator-level metrics. The signals that matter for production:
| Metric | Source | What it tells you |
|---|---|---|
| vllm:decode_tokens_per_second | vLLM | Generation throughput; primary efficiency signal |
| vllm:num_requests_waiting | vLLM | Queue depth; primary scaling signal |
| vllm:gpu_cache_usage_perc | vLLM | KV-cache utilization; high values indicate memory pressure |
| llmd:pod_startup_seconds | llm-d operator | Cold start latency; affects autoscaling responsiveness |
| llmd:model_download_seconds | llm-d operator | Storage performance for cold starts |
| nvidia_smi_gpu_utilization_rate | DCGM | GPU duty cycle; for capacity planning |
Configure Prometheus scraping via ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: inference
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: llm-d
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
The three-metric dashboard to build first: vllm:num_requests_waiting (scaling signal), vllm:decode_tokens_per_second per pod (efficiency), vllm:gpu_cache_usage_perc (memory pressure). When KV-cache usage exceeds 80% sustained, throughput degrades — either tighten --max-model-len or add replicas.
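For a quick check without a dashboard, the /metrics output of a pod can be parsed directly. This sketch reads the Prometheus text exposition format and extracts the metrics named above. It assumes vllm:gpu_cache_usage_perc is reported as a fraction where 1.0 means 100%, which should be confirmed against the running vLLM version; parse_metrics and kv_cache_pressure are illustrative names.

```python
def parse_metrics(text: str, wanted: set[str]) -> dict[str, float]:
    """Extract the first sample of each wanted metric from Prometheus text output."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # A sample line is: name{labels} value [timestamp]
        if "{" in line:
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name in wanted and fields and name not in values:
            values[name] = float(fields[0])
    return values


def kv_cache_pressure(values: dict[str, float], threshold: float = 0.8) -> bool:
    """True if KV-cache usage is above the threshold (fraction of 1.0, assumed)."""
    return values.get("vllm:gpu_cache_usage_perc", 0.0) > threshold
```

Fed the text returned by a pod's metrics endpoint, parse_metrics gives a dictionary that can be logged or compared against the 80% guideline with kv_cache_pressure.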
llm-d vs Alternatives: Decision Guide
| Situation | Recommendation |
|---|---|
| Single-node, single model, prototype | Ollama or Docker directly — no Kubernetes overhead needed |
| Multi-model on one node, low traffic | vLLM directly with Docker Compose |
| Multi-node cluster, one model in production | llm-d with a single LLMInferenceService |
| Multiple models on a shared GPU fleet | llm-d with LLMInferencePool — primary use case |
| Managed Kubernetes (EKS, GKE, AKS) | llm-d works natively — install GPU operator first |
| On-premises bare-metal GPU cluster | llm-d on Talos or k3s — best option for Kubernetes-native on-prem |
| Traffic too variable to provision fixed GPUs | API endpoints (Groq, Together AI) for burst; llm-d for base load |
llm-d is the right choice when you have a real Kubernetes cluster and need GPU workloads treated as first-class platform concerns. It's not the right choice for prototyping, single-model low-scale serving, or environments where Kubernetes isn't already in use.
llm-d Production Deployment Checklist
- Install NVIDIA GPU Operator and verify GPUs are allocatable: kubectl get nodes -o json | jq '.items[].status.allocatable'
- Install llm-d operator into a dedicated llm-d-system namespace via Helm
- Create a Kubernetes Secret for HuggingFace token (or S3 credentials) before deploying any LLMInferenceService
- Set --max-model-len in vLLM args explicitly; leave it unset and the full model context window is allocated in KV cache
- Verify the first pod reaches Ready state before applying autoscaling config
- Check llmd:model_download_seconds on first startup to establish baseline cold-start time
- Configure autoscaling with RequestQueue as the primary metric; set scaleDownStabilizationSeconds: 300
- Deploy a Prometheus ServiceMonitor to scrape vLLM metrics from all inference pods
- Build a dashboard with queue depth, decode throughput, and KV-cache utilization before going to production
- Alert when vllm:gpu_cache_usage_perc exceeds 80% for more than 2 minutes sustained
- Run a synthetic load burst to validate the scale-up path before routing real traffic
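The "sustained for 2 minutes" condition in the checklist is worth pinning down, since a single spike should not page anyone. In Prometheus this is normally expressed with an alerting rule's for: clause; the sketch below shows the same logic over raw samples, with sustained_above as an illustrative name and timestamps in seconds.

```python
def sustained_above(samples: list[tuple[float, float]], threshold: float,
                    duration_s: float) -> bool:
    """True if the trailing run of samples above threshold spans duration_s.

    samples holds (timestamp_seconds, value) pairs, oldest first.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    streak_start = None
    # Walk backwards from the newest sample until the value drops to the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        streak_start = ts
    return streak_start is not None and latest_ts - streak_start >= duration_s
```

With a 15-second scrape interval, nine consecutive samples above 0.8 cover 120 seconds and trigger the alert, while a single dip inside that span resets the streak.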