Kubernetes LLM Inference with llm-d: Deploy & Autoscale

How to deploy, scale, and manage open-source LLM inference workloads on Kubernetes using llm-d — the operator-based framework built for production GPU clusters.

Shubham Yadav

Machine Learning Researcher

June 8, 2026 · 13 min read

llm-d is a Kubernetes operator for LLM inference that treats model deployments the way Kubernetes treats application workloads: as declarative specs against which the platform reconciles cluster state. Rather than scripting kubectl commands and managing GPU scheduling manually, you define what you want: which model, how many replicas, what hardware, what autoscaling policy. llm-d handles the rest.

This matters because LLM inference has requirements Kubernetes wasn't designed for: GPU affinity, large model download times, KV-cache state that doesn't survive preemption, and throughput autoscaling that doesn't map cleanly to CPU or memory HPA metrics. llm-d adds first-class primitives for each of these on top of standard Kubernetes.

This guide covers:

  • llm-d architecture — CRDs, operator, and how it differs from raw Kubernetes deployments
  • Installation and cluster prerequisites — GPU operator, storage, namespace setup
  • Deploying your first model — an LLMInferenceService manifest from scratch
  • Autoscaling configuration — request-queue and throughput-based scaling
  • Multi-model serving — multiple models on shared GPU infrastructure with LLMInferencePool
  • Observability — the metrics that matter for LLM workloads
  • Decision guide — when llm-d is the right choice vs alternatives

1. llm-d Architecture: Kubernetes Operator, CRDs, and the vLLM Engine

llm-d extends Kubernetes with a custom operator and two core CRDs: LLMInferenceService for individual model deployments and LLMInferencePool for multi-model resource groups. The operator watches these resources and reconciles them against cluster state — creating pods, configuring GPU affinity, managing model downloads, and updating routing.

The inference engine is vLLM. llm-d manages the lifecycle; vLLM handles the actual token generation. Because vLLM releases independently, the engine can be upgraded without changing the operator, which stays focused on Kubernetes-specific concerns.

| Component | What it does |
| --- | --- |
| llm-d operator | Watches CRDs, reconciles pod state, manages GPU scheduling |
| LLMInferenceService CRD | Declares a single model deployment: model ID, hardware, replicas, autoscaling |
| LLMInferencePool CRD | Groups services for shared resource management and load balancing |
| vLLM (inference engine) | Runs inside each pod: tokenization, KV cache, batching |
| Model storage | PVC or object storage (S3/GCS); model weights downloaded at pod start |

The key architectural decision is statelessness at the pod level: model weights live in persistent storage, and pods pull them at startup. Pods can be preempted and rescheduled without losing the model itself; the replacement pod loads the same weights from storage and resumes serving. In-flight KV-cache state is still lost on preemption.
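
The reconcile pattern described in this section can be sketched in a few lines of Python. This is an illustrative toy, not llm-d's controller code: `ServiceSpec`, `reconcile`, and the action strings are hypothetical names, and a real operator would call the Kubernetes API instead of returning strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical stand-in for the desired state declared in a manifest."""
    name: str
    replicas: int


def reconcile(spec: ServiceSpec, running_pods: list[str]) -> list[str]:
    """Return the actions needed to move observed state toward the spec."""
    actions = []
    diff = spec.replicas - len(running_pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        for _ in range(diff):
            actions.append(f"create pod for {spec.name}")
    elif diff < 0:
        # Too many pods: delete the surplus, last-listed first.
        for pod in reversed(running_pods[diff:]):
            actions.append(f"delete {pod}")
    return actions


spec = ServiceSpec(name="llama-3-8b", replicas=3)
print(reconcile(spec, ["llama-3-8b-0"]))          # two create actions
print(reconcile(spec, ["a", "b", "c", "d"]))      # one delete action
```

The point of the pattern is that the function is idempotent: calling it again once observed state matches the spec returns no actions.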

2. Cluster Prerequisites and llm-d Installation

llm-d requires a Kubernetes cluster with GPU nodes and the NVIDIA GPU Operator installed. The GPU Operator handles driver installation, device plugin configuration, and GPU health monitoring across nodes.

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace

Verify GPUs are allocatable:

kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu")'
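
If you prefer to check this from a script, the same information can be read from the JSON that kubectl returns. The sketch below is one way to do it; `gpu_counts` and `fetch_nodes` are hypothetical helper names, and the sample payload is made up for illustration.

```python
import json
import subprocess


def gpu_counts(node_list: dict) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count (GPU nodes only)."""
    counts = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        gpus = int(allocatable.get("nvidia.com/gpu", "0"))
        if gpus > 0:
            counts[node["metadata"]["name"]] = gpus
    return counts


def fetch_nodes() -> dict:
    """Run `kubectl get nodes -o json` and parse its output (needs cluster access)."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


# Made-up payload in the shape kubectl returns, so the parsing can run offline.
sample = {"items": [
    {"metadata": {"name": "gpu-node-a"},
     "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "4"}}},
    {"metadata": {"name": "cpu-node-a"},
     "status": {"allocatable": {"cpu": "8"}}},
]}
print(gpu_counts(sample))
```

Against a live cluster, `gpu_counts(fetch_nodes())` returns the real per-node counts; an empty result means the device plugin has not registered any GPUs yet.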

Install llm-d via Helm:

helm repo add llmd https://charts.llm-d.ai
helm repo update
helm install llm-d llmd/llm-d \
  --namespace llm-d-system \
  --create-namespace \
  --set global.storageClass=standard

| Prerequisite | Minimum version | Notes |
| --- | --- | --- |
| Kubernetes | 1.27+ | 1.29+ recommended |
| NVIDIA GPU Operator | 23.9+ | Required for GPU device plugin |
| Helm | 3.10+ | For operator installation |
| Storage class | Any | PVC for model weight caching |

The operator runs in llm-d-system and immediately registers both CRDs. It begins watching for LLMInferenceService and LLMInferencePool resources across all namespaces.

3. Deploying Your First LLMInferenceService

The LLMInferenceService manifest declares everything about a model deployment. The minimum viable spec for Llama 3.1 8B on a single A10G:

apiVersion: inference.llm-d.ai/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: inference
spec:
  model:
    id: meta-llama/Meta-Llama-3.1-8B-Instruct
    source:
      type: HuggingFace
      huggingFaceTokenSecret: hf-token-secret
  engine:
    type: vLLM
    args:
      - "--dtype=bfloat16"
      - "--max-model-len=32768"
  resources:
    requests:
      nvidia.com/gpu: "1"
      memory: "24Gi"
    limits:
      nvidia.com/gpu: "1"
  replicas: 1

Apply and watch:

kubectl apply -f llama-3-8b.yaml
kubectl get llminferenceservice llama-3-8b -n inference -w

First startup takes 3–8 minutes as model weights download. Subsequent restarts use the PVC cache and initialize in 30–90 seconds. The operator creates a Deployment, a Service, and a PVC automatically.
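
As a rough sanity check on the first-startup figure, the download portion can be estimated from weight size and sustained bandwidth. The bandwidth values below are illustrative assumptions, and the estimate ignores image pull, scheduling, and the time vLLM needs to load weights onto the GPU.

```python
def download_seconds(params_billion: float, bytes_per_param: int,
                     mb_per_second: float) -> float:
    """Estimate weight download time from parameter count and bandwidth."""
    size_mb = params_billion * 1e9 * bytes_per_param / 1e6
    return size_mb / mb_per_second


# 8B parameters in bfloat16 (2 bytes each) is roughly 16 GB of weights.
for bandwidth in (35, 50, 90):  # MB/s, illustrative values only
    minutes = download_seconds(8, 2, bandwidth) / 60
    print(f"{bandwidth} MB/s -> {minutes:.1f} min")
```

At these assumed rates the download alone lands in the 3 to 8 minute range, which suggests that storage and network throughput dominate first-start time for a model of this size.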

The service exposes an OpenAI-compatible /v1/chat/completions endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="http://llama-3-8b.inference.svc.cluster.local/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain gradient descent."}]
)

# Print the generated reply text
print(response.choices[0].message.content)

Any OpenAI-compatible client works against the service endpoint without modification.

4. Autoscaling LLM Inference: Request Queue and Throughput Metrics

By default, the Kubernetes HPA scales on CPU and memory. Neither maps to LLM inference: a GPU can be 100% utilized while CPU sits idle. llm-d exposes two autoscaling mechanisms suited to inference workloads.

Request queue depth — scale up when pending requests per pod exceed a threshold. Reacts to burst traffic before latency degrades.

Decode throughput — scale down when tokens-per-second per pod falls below a floor. Prevents over-provisioning during low-traffic periods.

spec:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 8
    metrics:
      - type: RequestQueue
        requestQueue:
          targetAverageQueueDepth: 5
      - type: Throughput
        throughput:
          targetDecodeTokensPerSecond: 400
    scaleDownStabilizationSeconds: 300

| Autoscaling metric | When to use | Caution |
| --- | --- | --- |
| RequestQueue | Primary scaling signal; reacts to demand directly | Set scaleDownStabilizationSeconds >120 to avoid flapping |
| Throughput | Secondary; prevents idle over-provisioning | Varies with batch size; tune from production data |
| GPU utilization (via DCGM) | Useful for capacity planning | Lags demand; poor scaling trigger |

The scaleDownStabilizationSeconds setting is important: pod startup takes minutes, so premature scale-down followed immediately by a traffic burst creates latency spikes. Start at 300s and tune from actual traffic patterns.
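
To make the interaction between queue depth and the stabilization window concrete, here is a toy scaler in Python. It borrows the ratio rule and scale-down stabilization behavior of the standard Kubernetes HPA; whether llm-d computes replicas the same way is an assumption, and `QueueDepthScaler` is a hypothetical name.

```python
import math
from collections import deque


class QueueDepthScaler:
    """Toy autoscaler: HPA-style ratio rule plus scale-down stabilization."""

    def __init__(self, target_depth, min_replicas, max_replicas, stabilization_s):
        self.target_depth = target_depth
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.stabilization_s = stabilization_s
        self.history = deque()  # (timestamp, recommended replicas)

    def desired(self, now, current_replicas, avg_queue_depth):
        # Ratio rule: replicas scale with observed / target queue depth.
        raw = math.ceil(current_replicas * avg_queue_depth / self.target_depth)
        raw = max(self.min_replicas, min(self.max_replicas, raw))
        # Record the recommendation and drop entries older than the window.
        self.history.append((now, raw))
        while self.history[0][0] < now - self.stabilization_s:
            self.history.popleft()
        # Scale up immediately; scale down only as far as the highest
        # recommendation seen inside the stabilization window.
        if raw >= current_replicas:
            return raw
        return min(current_replicas, max(r for _, r in self.history))


scaler = QueueDepthScaler(target_depth=5, min_replicas=1, max_replicas=8,
                          stabilization_s=300)
print(scaler.desired(now=0, current_replicas=2, avg_queue_depth=12))   # burst
print(scaler.desired(now=60, current_replicas=5, avg_queue_depth=1))   # held
print(scaler.desired(now=400, current_replicas=5, avg_queue_depth=1))  # released
```

In this run the burst scales 2 replicas to 5 at once, the quiet reading 60 seconds later is held at 5 because the burst is still inside the window, and the scale-down to 1 only happens after the window has passed.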

5. Multi-Model Serving with LLMInferencePool

LLMInferencePool groups multiple LLMInferenceService resources and manages shared GPU resources across them. This enables two patterns: shared-hardware multi-model serving and priority-based resource allocation.

apiVersion: inference.llm-d.ai/v1alpha1
kind: LLMInferencePool
metadata:
  name: production-pool
  namespace: inference
spec:
  services:
    - name: llama-3-8b
      priority: high
      weight: 60
    - name: phi-3-mini
      priority: normal
      weight: 40
  resourcePolicy:
    gpuBudget: "4"
    evictionPolicy: LowPriority

With weight: 60 and weight: 40, the pool allocates 60% of GPU capacity to llama-3-8b and 40% to phi-3-mini during normal operation. Under pressure, evictionPolicy: LowPriority evicts phi-3-mini replicas first to free capacity for the high-priority service.
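
Weights rarely divide a GPU budget into whole numbers, so it is worth working the arithmetic before writing a pool spec. The source does not say how llm-d rounds fractional shares, so the largest-remainder rounding below is an assumption, and `split_gpu_budget` is a hypothetical helper.

```python
def split_gpu_budget(budget: int, weights: dict[str, int]) -> dict[str, int]:
    """Split whole GPUs by weight using the largest-remainder method."""
    total = sum(weights.values())
    exact = {name: budget * w / total for name, w in weights.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = budget - sum(alloc.values())
    # Hand remaining GPUs to the services with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


print(split_gpu_budget(4, {"llama-3-8b": 60, "phi-3-mini": 40}))
print(split_gpu_budget(10, {"llama-3-8b": 60, "phi-3-mini": 40}))
```

With a budget of 4, the exact shares are 2.4 and 1.6 GPUs, which this rounding turns into 2 and 2; only at a budget of 10 does the 60/40 split come out as 6 and 4. If whole-GPU allocation matters, pick a budget that the weights divide evenly.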

The operator validates that co-located services don't exceed node VRAM. For example, a 70B model (40GB) and a 13B model (28GB) need 68GB for weights alone, which leaves only 12GB of an 80GB node for KV cache and activations; the operator won't schedule them together if their combined footprint exceeds available memory. Use the open-source LLM comparison to plan VRAM allocations before writing pool specs.
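
A co-location check of this kind can be approximated before you write the pool spec. The sketch below is not the operator's validation logic; `fits_on_node` is a hypothetical helper, and the 20% KV-cache reserve is an assumed planning figure to tune for your workload.

```python
def fits_on_node(node_vram_gb: float, model_footprints_gb: list[float],
                 kv_reserve_fraction: float = 0.2) -> bool:
    """Check whether models fit after reserving a share of VRAM for KV cache."""
    usable = node_vram_gb * (1 - kv_reserve_fraction)
    return sum(model_footprints_gb) <= usable


# The 40GB + 28GB pairing from the text, on an 80GB node.
print(fits_on_node(80, [40, 28]))                           # 20% reserve
print(fits_on_node(80, [40, 28], kv_reserve_fraction=0.1))  # 10% reserve
```

The same pair of models passes or fails depending on how much headroom you reserve, which is why weight size alone is not enough to plan co-location.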

6. Observability: The Metrics That Matter for LLM Inference

vLLM exposes Prometheus metrics natively. llm-d surfaces additional operator-level metrics. The signals that matter for production:

| Metric | Source | What it tells you |
| --- | --- | --- |
| vllm:decode_tokens_per_second | vLLM | Generation throughput; primary efficiency signal |
| vllm:num_requests_waiting | vLLM | Queue depth; primary scaling signal |
| vllm:gpu_cache_usage_perc | vLLM | KV-cache utilization; high values indicate memory pressure |
| llmd:pod_startup_seconds | llm-d operator | Cold start latency; affects autoscaling responsiveness |
| llmd:model_download_seconds | llm-d operator | Storage performance for cold starts |
| DCGM_FI_DEV_GPU_UTIL | DCGM exporter | GPU utilization; for capacity planning |

Configure Prometheus scraping via ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: inference
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: llm-d
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

The three-metric dashboard to build first: vllm:num_requests_waiting (scaling signal), vllm:decode_tokens_per_second per pod (efficiency), vllm:gpu_cache_usage_perc (memory pressure). When KV-cache usage exceeds 80% sustained, throughput degrades — either tighten --max-model-len or add replicas.
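
To see why --max-model-len has such a direct effect on KV-cache pressure, it helps to compute the cache size for one full-length sequence. The sketch below uses the standard per-token formula (keys and values for every layer) with the Llama 3.1 8B shape taken from its published model config, which is an assumption not stated in this post; `kv_cache_gib` is a hypothetical helper.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, dtype_bytes: int = 2) -> float:
    """KV-cache size in GiB for one sequence: K and V tensors for every layer."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return bytes_per_token * seq_len / 2**30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (GQA), head dimension 128,
# bfloat16 cache entries (2 bytes).
for max_len in (8192, 32768, 131072):
    size = kv_cache_gib(32, 8, 128, max_len)
    print(f"max-model-len {max_len}: {size:.1f} GiB per full-length sequence")
```

Under these assumptions each token costs 128 KiB of cache, so one 32,768-token sequence needs 4 GiB and a 131,072-token sequence needs 16 GiB. On a 24GB GPU that already holds about 16GB of bfloat16 weights, that difference decides whether long requests fit at all.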

llm-d vs Alternatives: Decision Guide

| Situation | Recommendation |
| --- | --- |
| Single-node, single model, prototype | Ollama or Docker directly; no Kubernetes overhead needed |
| Multi-model on one node, low traffic | vLLM directly with Docker Compose |
| Multi-node cluster, one model in production | llm-d with a single LLMInferenceService |
| Multiple models on a shared GPU fleet | llm-d with LLMInferencePool (primary use case) |
| Managed Kubernetes (EKS, GKE, AKS) | llm-d works natively; install GPU operator first |
| On-premises bare-metal GPU cluster | llm-d on Talos or k3s; best option for Kubernetes-native on-prem |
| Traffic too variable to provision fixed GPUs | API endpoints (Groq, Together AI) for burst; llm-d for base load |

llm-d is the right choice when you have a real Kubernetes cluster and need GPU workloads treated as first-class platform concerns. It's not the right choice for prototyping, single-model low-scale serving, or environments where Kubernetes isn't already in use.

llm-d Production Deployment Checklist

  • Install NVIDIA GPU Operator and verify GPUs are allocatable: kubectl get nodes -o json | jq '.items[].status.allocatable | select(."nvidia.com/gpu")'
  • Install llm-d operator into a dedicated llm-d-system namespace via Helm
  • Create a Kubernetes Secret for HuggingFace token (or S3 credentials) before deploying any LLMInferenceService
  • Set --max-model-len in vLLM args explicitly; if left unset, vLLM defaults to the model's full context window, and the KV cache must be able to hold a sequence of that length
  • Verify the first pod reaches Ready state before applying autoscaling config
  • Check llmd:model_download_seconds on first startup to establish baseline cold-start time
  • Configure autoscaling with RequestQueue as the primary metric; set scaleDownStabilizationSeconds: 300
  • Deploy a Prometheus ServiceMonitor to scrape vLLM metrics from all inference pods
  • Build a dashboard with queue depth, decode throughput, and KV-cache utilization before going to production
  • Alert when vllm:gpu_cache_usage_perc exceeds 80% for more than 2 minutes sustained
  • Run a synthetic load burst to validate the scale-up path before routing real traffic
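
For the last checklist item, a synthetic burst does not need a dedicated load-testing tool. The sketch below uses only the Python standard library; `send_chat` and `burst` are hypothetical helpers, the request body follows the OpenAI chat completions format used earlier in this post, and the concurrency numbers are placeholders to tune.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def send_chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> float:
    """POST one chat completion and return its latency in seconds."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def burst(send, concurrency: int, total: int) -> dict:
    """Fire `total` requests with `concurrency` workers and summarize latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: send(), range(total)))
    return {
        "count": len(latencies),
        "p50": latencies[len(latencies) // 2],  # approximate median
        "max": latencies[-1],
    }


# Example call from inside the cluster (not executed here):
# burst(lambda: send_chat("http://llama-3-8b.inference.svc.cluster.local/v1",
#                         "meta-llama/Meta-Llama-3.1-8B-Instruct", "ping"),
#       concurrency=32, total=256)
```

Run it while watching vllm:num_requests_waiting and the replica count: the queue should rise, new pods should appear, and the maximum latency shows what users would experience during the cold-start gap.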