Cloud GPU Pricing for LLM Hosting: A100 vs H100 vs Spot Instances (2026)
On-demand and spot GPU rental rates across AWS, GCP, Azure, Lambda Labs, CoreWeave, and RunPod — with per-token cost estimates for the most common self-hosted LLM configurations.
Shubham Yadav
Machine Learning Researcher
GPU rental pricing is one of the most volatile numbers in the LLM infrastructure stack. Providers adjust rates frequently, new hardware generations shift the value curve, and spot pricing can swing 50% in either direction within a week. This page tracks current rates for the GPUs that matter for LLM workloads.
Last verified: June 2026. Always confirm at provider pricing pages before budgeting — AWS, GCP, Azure, Lambda Labs, CoreWeave, RunPod.
Quick answer: Lambda Labs and CoreWeave are 40–60% cheaper than AWS and GCP for equivalent GPU hardware. A single A100 80GB on Lambda Labs costs ~$1.99/hr vs ~$5.07/hr on GCP: the same card at less than half the price. For prototyping, RunPod community cloud is the cheapest option. For production with reliability SLAs, use AWS or GCP and move steady workloads to 1-year reserved instances for a 30–40% discount. H100s are worth the premium only when decode throughput directly limits user-facing latency; they generate roughly 2× as many tokens per second as A100s on the same model.
This resource covers:
- GPU quick reference — which card for which model size and why memory bandwidth matters
- On-demand pricing — AWS EC2, GCP, and Azure rate tables
- Specialist providers — Lambda Labs, CoreWeave, and RunPod with cost comparison
- Spot and preemptible pricing — when to use spot and the realistic discount range
- Reserved instance discounts — 1-year and 3-year commitment savings
- Cost-per-token estimates — real $/M token costs at 60% GPU utilization
- Provider selection guide — which provider for which situation
1. GPU Quick Reference: Which Card for Which Model Size
Memory bandwidth, not raw compute, is the primary bottleneck for LLM decode throughput, because every generated token requires streaming the model weights from VRAM. An H100 generates roughly 2× the tokens per second of an A100 on the same model, which matters directly for cost-per-token at production scale.
| GPU | VRAM | Best for | Memory bandwidth |
|---|---|---|---|
| A10G / L4 | 24 GB | 7–8B models (fp16), 13B (int8), 30B (int4) | 600 GB/s (A10G), 300 GB/s (L4) |
| A100 40GB | 40 GB | 13B models (fp16), 34B (int8), 70B (int4) | 1,555 GB/s |
| A100 80GB | 80 GB | 70B (fp16) on 2 cards, 34B single | 2,000 GB/s |
| H100 80GB | 80 GB | 70B (fp8) single card, 70B (fp16) on 2 cards, 405B multi | 3,350 GB/s |
| H100 NVL 94GB | 94 GB | Large models, highest throughput | 3,900 GB/s |
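The "which card for which model" column can be sanity-checked with a rough rule: weight memory is about parameter count times bytes per parameter. The sketch below is illustrative only. The function names and the 20% headroom reserved for KV cache and activations are assumptions, not figures from any provider.

```python
# Bytes per parameter for common weight formats (weights only, no KV cache).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float,
         headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving `headroom` of VRAM free.

    The 20% default headroom is an assumption; real KV-cache needs depend
    on batch size and context length.
    """
    usable_gb = vram_gb * (1.0 - headroom)
    return weight_memory_gb(params_billions, precision) <= usable_gb


def decode_ceiling_tok_s(params_billions: float, precision: str,
                         bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling for single-stream decode (batch size 1).

    Each token requires reading all weights once, so the ceiling is
    bandwidth divided by weight bytes.
    """
    return bandwidth_gb_s / weight_memory_gb(params_billions, precision)


if __name__ == "__main__":
    print(weight_memory_gb(70, "fp16"))   # 70B at fp16, weights only
    print(fits(70, "int4", vram_gb=80))   # 70B int4 on one 80 GB card
    print(fits(70, "fp16", vram_gb=80))   # 70B fp16 on one 80 GB card
    print(decode_ceiling_tok_s(70, "fp16", 3350)
          / decode_ceiling_tok_s(70, "fp16", 2000))  # H100 vs A100 ratio
```

The ceiling applies to a single request stream. Batched serving reuses each weight read across many requests, which is why aggregate throughput figures are far higher than this per-stream bound.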
2. On-Demand Pricing: AWS EC2, GCP, and Azure
AWS EC2
| Instance | GPU | VRAM | On-demand / hr |
|---|---|---|---|
| g5.xlarge | 1× A10G | 24 GB | ~$1.01 |
| g5.12xlarge | 4× A10G | 96 GB | ~$5.67 |
| g5.48xlarge | 8× A10G | 192 GB | ~$16.29 |
| p4d.24xlarge | 8× A100 40GB | 320 GB | ~$32.77 |
| p4de.24xlarge | 8× A100 80GB | 640 GB | ~$40.97 |
| p5.48xlarge | 8× H100 80GB SXM | 640 GB | ~$98.32 |
Google Cloud (GCP)
| Instance | GPU | VRAM | On-demand / hr |
|---|---|---|---|
| g2-standard-4 | 1× L4 | 24 GB | ~$0.70 |
| g2-standard-48 | 4× L4 | 96 GB | ~$2.82 |
| a2-highgpu-1g | 1× A100 40GB | 40 GB | ~$3.67 |
| a2-highgpu-8g | 8× A100 40GB | 320 GB | ~$29.39 |
| a2-ultragpu-1g | 1× A100 80GB | 80 GB | ~$5.07 |
| a2-ultragpu-8g | 8× A100 80GB | 640 GB | ~$40.54 |
| a3-highgpu-8g | 8× H100 80GB | 640 GB | ~$98.32 |
Azure
| Instance | GPU | VRAM | On-demand / hr |
|---|---|---|---|
| NC6s v3 | 1× V100 16GB | 16 GB | ~$3.06 |
| ND96asr v4 | 8× A100 40GB | 320 GB | ~$32.77 |
| ND96amsr A100 v4 | 8× A100 80GB | 640 GB | ~$40.97 |
| ND96isr H100 v5 | 8× H100 80GB | 640 GB | ~$98.32 |
3. Specialist Providers: Lambda Labs, CoreWeave, and RunPod
These providers are optimized specifically for ML workloads and typically offer lower prices than hyperscalers for GPU-only deployments, usually at the cost of tighter capacity.
Lambda Labs
Lambda Labs is consistently 40–60% cheaper than AWS/GCP for equivalent hardware. The tradeoff is lower availability — H100 and A100 clusters sell out frequently.
| GPU | VRAM | On-demand / hr |
|---|---|---|
| A10 (24GB) | 24 GB | ~$0.75 |
| A100 SXM4 40GB | 40 GB | ~$1.29 |
| A100 SXM4 80GB | 80 GB | ~$1.99 |
| 8× A100 SXM4 80GB | 640 GB | ~$15.92 |
| H100 SXM5 80GB | 80 GB | ~$2.49 |
| 8× H100 SXM5 80GB | 640 GB | ~$19.92 |
CoreWeave
CoreWeave has strong Kubernetes-native infrastructure and better availability than Lambda for reserved capacity. On the rates listed here, its A100 80GB is about 60% cheaper than GCP's a2-ultragpu-1g (~$2.06/hr vs ~$5.07/hr).
| GPU | VRAM | On-demand / hr |
|---|---|---|
| A100 80GB SXM4 | 80 GB | ~$2.06 |
| H100 80GB PCIe | 80 GB | ~$2.93 |
| H100 80GB SXM5 | 80 GB | ~$3.89 |
RunPod (community cloud / spot)
RunPod community cloud pricing is the cheapest available for interruptible workloads. Not suitable for serving live traffic — use for batch jobs and offline inference.
| GPU | VRAM | Secure cloud / hr | Community (spot) / hr |
|---|---|---|---|
| A10G 24GB | 24 GB | ~$0.49 | ~$0.22–0.38 |
| A100 80GB SXM | 80 GB | ~$2.49 | ~$1.49–1.99 |
| H100 80GB PCIe | 80 GB | ~$2.99 | ~$1.99–2.49 |
| H100 80GB SXM | 80 GB | ~$3.49 | ~$2.29–2.99 |
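Hyperscaler rates are mostly quoted per 8-GPU instance, while specialist rates are quoted per GPU, so a fair comparison needs a per-GPU normalization first. A minimal sketch using rates from the tables above (the helper names are illustrative):

```python
def per_gpu_rate(instance_rate: float, gpu_count: int) -> float:
    """Hourly rate per GPU for a multi-GPU instance."""
    return instance_rate / gpu_count


def percent_cheaper(specialist_rate: float, hyperscaler_rate: float) -> float:
    """How much cheaper the specialist is, as a percentage of the hyperscaler rate."""
    return (1.0 - specialist_rate / hyperscaler_rate) * 100.0


if __name__ == "__main__":
    # A100 80GB: AWS p4de.24xlarge (8 GPUs) vs Lambda Labs single GPU.
    aws_a100 = per_gpu_rate(40.97, 8)
    print(f"AWS A100 80GB per GPU: ${aws_a100:.2f}/hr")
    print(f"Lambda vs AWS: {percent_cheaper(1.99, aws_a100):.0f}% cheaper")

    # A100 80GB: GCP a2-ultragpu-1g is already a single-GPU rate.
    print(f"Lambda vs GCP: {percent_cheaper(1.99, 5.07):.0f}% cheaper")
    print(f"CoreWeave vs GCP: {percent_cheaper(2.06, 5.07):.0f}% cheaper")
```

Per-GPU rates ignore the CPU, RAM, and networking bundled with each instance type, so treat the percentages as a first-pass comparison rather than a like-for-like quote.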
4. Spot and Preemptible Pricing: When to Use and What to Expect
Spot instances offer 50–70% discounts over on-demand but can be interrupted with minimal notice (about two minutes on AWS, 30 seconds on GCP). The use cases where they make sense for LLM workloads:
- Batch inference jobs (document processing, offline evaluation, dataset generation) where retries are acceptable
- Model evaluation runs where interruption just means re-queuing
- Dev/test environments where occasional interruption is tolerable
Do not use spot for production serving. The savings are not worth the availability risk for user-facing traffic.
| Provider | GPU | On-demand / hr | Spot / hr | Discount |
|---|---|---|---|---|
| AWS | 1× A10G (g5.xlarge) | ~$1.01 | ~$0.30–0.45 | ~55–70% |
| AWS | 8× A100 40GB (p4d.24xlarge) | ~$32.77 | ~$9.83–16.39 | ~50–70% |
| AWS | 8× H100 (p5.48xlarge) | ~$98.32 | ~$29–49 | ~50–70% |
| GCP | 1× L4 (g2-standard-4) | ~$0.70 | ~$0.21–0.28 | ~60–70% |
| GCP | 1× A100 40GB (a2-highgpu-1g) | ~$3.67 | ~$1.10–1.47 | ~60–70% |
Spot availability varies significantly by region and time. AWS us-east-1 and GCP us-central1 typically have the best spot availability for GPU instances.
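The discount column follows directly from the on-demand rate and the spot range. The sketch below reproduces that arithmetic and adds a simple batch-job cost estimate. The `rework_fraction` parameter, which models work repeated after interruptions, is an assumption for illustration, not a measured value.

```python
def discount_range(on_demand: float, spot_low: float, spot_high: float):
    """Return (min_discount_pct, max_discount_pct) for a spot price range."""
    min_discount = (1.0 - spot_high / on_demand) * 100.0
    max_discount = (1.0 - spot_low / on_demand) * 100.0
    return min_discount, max_discount


def job_cost(gpu_hours: float, hourly_rate: float,
             rework_fraction: float = 0.0) -> float:
    """Cost of a batch job, inflating hours by the share of work redone.

    rework_fraction is a hypothetical knob: 0.2 means 20% of the work
    is repeated because of interruptions.
    """
    return gpu_hours * (1.0 + rework_fraction) * hourly_rate


if __name__ == "__main__":
    lo, hi = discount_range(32.77, 9.83, 16.39)  # AWS p4d.24xlarge
    print(f"p4d spot discount: {lo:.0f}% to {hi:.0f}%")

    # 100 GPU-hours on g5.xlarge: on-demand vs spot with 20% rework.
    print(f"on-demand: ${job_cost(100, 1.01):.2f}")
    print(f"spot:      ${job_cost(100, 0.45, rework_fraction=0.2):.2f}")
```

Even with a pessimistic rework assumption, the spot job stays well below the on-demand cost in this example, which is why interruptible batch work is the natural fit.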
5. Reserved Instance Discounts: 1-Year and 3-Year Commitments
For sustained production workloads, 1-year reserved instances typically cut on-demand rates by 30–40%, and 3-year reservations by 50–60%. Only commit to reserved capacity if you have clear evidence that the instance will be running at least 60–70% of the time. A reservation bills for every hour whether or not the GPU is in use, so the break-even point is roughly one minus the discount; below it, on-demand capacity that you shut down when idle is more economical.
| Provider | Commitment | Typical discount vs on-demand |
|---|---|---|
| AWS | 1 year, no upfront | ~30–37% |
| AWS | 1 year, all upfront | ~38–42% |
| AWS | 3 year, all upfront | ~52–60% |
| GCP | 1 year CUD | ~37% |
| GCP | 3 year CUD | ~55% |
| Azure | 1 year reserved | ~36–40% |
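One way to reason about these discounts: a reservation bills for every hour, while on-demand bills only for hours actually run. The break-even share of hours is therefore one minus the discount. A minimal sketch under that simplified model (it ignores upfront payment timing and partial-month effects):

```python
def break_even_utilization(discount: float) -> float:
    """Share of hours an on-demand instance must run to cost as much as a reservation.

    discount is a fraction, e.g. 0.37 for a 37% reserved discount.
    """
    return 1.0 - discount


def cheaper_option(running_share: float, discount: float) -> str:
    """Pick the cheaper billing model for a given share of hours in use."""
    if running_share > break_even_utilization(discount):
        return "reserved"
    return "on-demand"


if __name__ == "__main__":
    for label, discount in [("GCP 1-year CUD", 0.37), ("GCP 3-year CUD", 0.55)]:
        share = break_even_utilization(discount)
        print(f"{label}: break-even at {share:.0%} of hours")
    print(cheaper_option(0.50, 0.37))
    print(cheaper_option(0.80, 0.37))
```

With 1-year discounts in the 30–40% range, the break-even lands at 60–70% of hours, which matches the utilization thresholds used elsewhere on this page.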
6. Cost-Per-Token Estimates for Common Self-Hosted Configurations
These estimates use on-demand pricing. Cost is shown at 100% utilization of the listed throughput (the best case) and at 60% GPU utilization, a realistic production average. Per-token cost is inversely proportional to utilization, so it falls as utilization rises.
| Config | Model | Throughput (vLLM) | On-demand rate | Cost / 1M tokens (100% util.) | Cost / 1M tokens (60% util.) |
|---|---|---|---|---|---|
| 1× A10G (AWS g5.xlarge) | Llama 3.1 8B fp16 | ~2,000 tok/s | ~$1.01/hr | ~$0.14/M | ~$0.23/M |
| 1× A100 80GB (Lambda) | Llama 3.1 70B int4 | ~600 tok/s | ~$1.99/hr | ~$0.92/M | ~$1.54/M |
| 2× A100 80GB (Lambda) | Llama 3.1 70B fp16 | ~800 tok/s | ~$3.98/hr | ~$1.38/M | ~$2.30/M |
| 1× H100 80GB (Lambda) | Llama 3.1 70B fp8 | ~1,400 tok/s | ~$2.49/hr | ~$0.49/M | ~$0.82/M |
| 8× H100 80GB (Lambda) | Llama 3.1 405B fp8 | ~350 tok/s | ~$19.92/hr | ~$15.81/M | ~$26.35/M |
Throughput numbers are approximate and vary with batch size, sequence length, and quantization. vLLM with PagedAttention is assumed. The single-H100 70B and 8× H100 405B rows assume 8-bit (fp8) weights, because fp16 weights alone need about 140 GB and 810 GB respectively, more than those configurations provide. Measure your actual throughput before committing to a configuration.
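The cost-per-token arithmetic behind these estimates can be sketched as a small function: hourly rate divided by the tokens actually produced in an hour, scaled to one million tokens. The function name and the 0.6 default are illustrative choices.

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float,
                            utilization: float = 0.6) -> float:
    """Dollars per 1M generated tokens for one serving configuration.

    utilization is the fraction of peak throughput actually sustained,
    in the range (0, 1].
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Lambda H100 at ~$2.49/hr and ~1,400 tok/s.
    print(f"{cost_per_million_tokens(2.49, 1400, utilization=1.0):.2f}")
    print(f"{cost_per_million_tokens(2.49, 1400, utilization=0.6):.2f}")
```

Plugging in the Lambda H100 figures gives about $0.49/M at full utilization and about $0.82/M at 60%, so the utilization assumption moves the result as much as the choice of provider does.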
GPU Cloud Provider Selection Guide
| Situation | Recommendation |
|---|---|
| Evaluating or prototyping | RunPod community cloud — cheapest per-hour, no commitment |
| Batch jobs and offline inference | AWS/GCP spot or RunPod community — interruption acceptable |
| Production serving, cost matters most | Lambda Labs or CoreWeave — 40–60% cheaper than hyperscalers |
| Production serving, need reliability SLA | AWS or GCP on-demand with reserved instances |
| Compliance or data residency requirements | AWS or Azure — broadest regional coverage and compliance certifications |
| Sustained utilization above 70% | 1-year reserved on AWS/GCP — discount amortizes well above that threshold |
| Kubernetes-native ML platform | CoreWeave — best K8s-native GPU infrastructure among specialists |
| Need H100 capacity but Lambda is sold out | CoreWeave has stronger H100 reserved availability than Lambda |
For most teams starting with self-hosting, Lambda Labs on-demand is the right starting point — lower price than hyperscalers, no commitment, simpler billing. Move to reserved instances or CoreWeave contracts once you have production data on actual utilization.
GPU Infrastructure Setup Checklist
- Identify which model you're deploying and check its VRAM requirement at your target quantization (see open-source LLM comparison)
- Select GPU tier based on VRAM: A10G/L4 for ≤24GB, A100 40GB for ≤40GB, A100 80GB or H100 for 70B models
- Compare Lambda Labs vs AWS/GCP for your GPU tier — Lambda is typically 40–60% cheaper for identical hardware
- Run a throughput benchmark (vLLM with your model and batch size) before committing to a configuration
- Calculate cost-per-token at your expected utilization rate: $/M tokens = (hourly_rate / 3600) / (tokens_per_second × utilization) × 1,000,000
- Compare self-hosted cost-per-token against the equivalent API (see LLM cost per token)
- Use spot instances for any batch inference workloads — 50–70% savings for interruptible jobs
- Do not use spot for production serving; keep live traffic on on-demand or reserved capacity
- If GPU utilization will exceed 60% sustained, evaluate 1-year reserved instances before settling on on-demand
- Set up DCGM metrics and alert on GPU utilization falling below 40% — idle GPUs are pure waste
Frequently Asked Questions: Cloud GPU Pricing for LLM Hosting
Is it cheaper to self-host LLMs or use an API like OpenAI?
It depends on scale and model size. For a 70B model on a Lambda Labs H100, the ~$0.49/M figure applies only at full utilization, which is about 3.6B tokens/month at ~1,400 tok/s. At 1B tokens/month, an always-on H100 (~$1,790/month) works out to roughly $1.80/M tokens: cheaper than GPT-4o ($4.38/M) but not cheaper than GPT-4o mini ($0.26/M). The crossover point is typically 3–5B tokens/month on a flagship-quality model. Below that, the API wins on total cost of ownership once engineering overhead is factored in. See self-hosting vs API TCO for the full analysis.
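A rough way to locate the crossover for your own volume is to divide the monthly bill for an always-on GPU by the tokens you actually serve. The sketch below covers GPU rental only; the 720-hour month and the helper names are assumptions, and engineering overhead is deliberately left out.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, instance never shut down


def self_hosted_cost_per_million(hourly_rate: float,
                                 monthly_tokens_millions: float,
                                 gpus: int = 1) -> float:
    """Effective $/M tokens when the GPU bill is spread over actual volume."""
    monthly_cost = hourly_rate * HOURS_PER_MONTH * gpus
    return monthly_cost / monthly_tokens_millions


def break_even_tokens_millions(hourly_rate: float,
                               api_price_per_million: float,
                               gpus: int = 1) -> float:
    """Monthly volume (in millions of tokens) where GPU rental equals the API bill."""
    return hourly_rate * HOURS_PER_MONTH * gpus / api_price_per_million


if __name__ == "__main__":
    # One Lambda H100 at ~$2.49/hr serving 1B tokens/month.
    print(f"{self_hosted_cost_per_million(2.49, 1000):.2f} $/M")
    # Volume at which that GPU matches an API priced at $4.38/M.
    print(f"{break_even_tokens_millions(2.49, 4.38):.0f}M tokens/month")
```

The break-even volume must also fit within what the hardware can serve, so check it against measured throughput before drawing conclusions.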
Why are Lambda Labs and CoreWeave so much cheaper than AWS?
Hyperscalers (AWS, GCP, Azure) price GPU instances with margins that subsidize their broader cloud portfolio and SLA commitments. Specialist providers like Lambda Labs and CoreWeave operate GPU-only infrastructure with tighter margins, passing the savings to customers. The tradeoff is less availability — H100 and A100 capacity at Lambda sells out regularly, and SLAs are less comprehensive than AWS.
When should you use spot instances for LLM workloads?
Use spot for batch inference, offline document processing, model evaluation, and dataset generation — any workload where interruption means re-queuing rather than a user-facing error. Never use spot for production serving: the 50–70% cost savings aren't worth the availability risk for live traffic. AWS us-east-1 and GCP us-central1 have the best spot availability for GPU instances.
Is an H100 worth the premium over an A100 for LLM inference?
Yes, if decode throughput is the bottleneck for user-facing latency. The H100 delivers roughly 2× as many tokens per second as the A100 80GB, driven mainly by higher memory bandwidth (3,350 vs 2,000 GB/s, about 1.7× on its own). At Lambda Labs, the H100 costs $2.49/hr vs $1.99/hr for the A100 80GB: 25% more for roughly 2× the throughput. The cost-per-token at high utilization is therefore roughly 37% lower on the H100, or about 25% lower if your workload only realizes the 1.7× bandwidth gain. If your serving workload is throughput-constrained, the H100 is the better value.
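The comparison reduces to one ratio: relative cost per token equals the price ratio divided by the throughput ratio. A small sketch for testing how sensitive the conclusion is to the speedup you actually measure (the function name is illustrative):

```python
def relative_cost_per_token(price_ratio: float, throughput_ratio: float) -> float:
    """Cost per token of the faster card relative to the slower one.

    Values below 1.0 mean the faster card is cheaper per token.
    """
    return price_ratio / throughput_ratio


if __name__ == "__main__":
    price_ratio = 2.49 / 1.99  # Lambda H100 vs A100 80GB hourly rates
    for speedup in (2.0, 3350 / 2000, 1.25):
        rel = relative_cost_per_token(price_ratio, speedup)
        print(f"speedup {speedup:.2f}x -> {(1 - rel) * 100:.0f}% lower cost per token")
```

The H100 stops paying for itself once the measured speedup drops to the price ratio, about 1.25× at these rates.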
What GPU is best for serving a 70B model like Llama 3.3 70B?
A single H100 80GB with fp8 weights is the best single-card option: the ~70 GB of weights fit on the card, and it delivers ~1,400 tok/s decode throughput. At fp16 the weights need about 140 GB, so two 80 GB cards are required. Two A100 80GBs run the model at fp16 for ~$3.98/hr (Lambda) with ~800 tok/s, which costs more per hour and per token than the single H100. A single A100 80GB at int4 also works ($1.99/hr, ~600 tok/s) with a small quality tradeoff from quantization. For cost-per-token, the H100 single card wins at high utilization; for the lowest hourly rate, the A100 80GB int4 is the cheapest entry point.