Cloud GPU Pricing for LLM Hosting: A100 vs H100 vs Spot Instances (2026)

On-demand and spot GPU rental rates across AWS, GCP, Azure, Lambda Labs, CoreWeave, and RunPod — with per-token cost estimates for the most common self-hosted LLM configurations.

Shubham Yadav

Machine Learning Researcher

Updated June 8, 2026

On this page

1. GPU Quick Reference: Which Card for Which Model Size
2. On-Demand Pricing: AWS EC2, GCP, and Azure
3. Specialist Providers: Lambda Labs, CoreWeave, and RunPod
4. Spot and Preemptible Pricing: When to Use and What to Expect
5. Reserved Instance Discounts: 1-Year and 3-Year Commitments
6. Cost-Per-Token Estimates for Common Self-Hosted Configurations
GPU Cloud Provider Selection Guide
GPU Infrastructure Setup Checklist
Frequently Asked Questions: Cloud GPU Pricing for LLM Hosting

GPU rental pricing is one of the most volatile numbers in the LLM infrastructure stack. Providers adjust rates frequently, new hardware generations shift the value curve, and spot pricing can swing 50% in either direction within a week. This page tracks current rates for the GPUs that matter for LLM workloads.

Last verified: June 2026. Always confirm at provider pricing pages before budgeting — AWS, GCP, Azure, Lambda Labs, CoreWeave, RunPod.

Quick answer: Lambda Labs and CoreWeave are 40–60% cheaper than AWS and GCP for equivalent GPU hardware. A single A100 80GB on Lambda Labs costs ~$1.99/hr vs ~$5.07/hr on GCP — the same card at less than half the price. For prototyping, RunPod community cloud is the cheapest option. For production with reliability SLAs, use AWS or GCP on-demand with 1-year reserved instances for a 30–40% discount. H100s are worth the premium only when decode throughput directly limits user-facing latency — they generate roughly 2× more tokens per second than A100s on the same model.

This resource covers:

GPU quick reference — which card for which model size and why memory bandwidth matters
On-demand pricing — AWS EC2, GCP, and Azure rate tables
Specialist providers — Lambda Labs, CoreWeave, and RunPod with cost comparison
Spot and preemptible pricing — when to use spot and the realistic discount range
Reserved instance discounts — 1-year and 3-year commitment savings
Cost-per-token estimates — real $/M token costs at 60% GPU utilization
Provider selection guide — which provider for which situation

1. GPU Quick Reference: Which Card for Which Model Size

Memory bandwidth, not raw compute, is the primary bottleneck for LLM decode throughput, because generating each token requires streaming the model weights out of VRAM. An H100 generates roughly 2× the tokens per second of an A100 on the same model, which matters directly for cost-per-token at production scale.

GPU	VRAM	Best for	Memory bandwidth
A10G / L4	24 GB	7–8B models (fp16), 13B (int8), 30B (int4)	600 GB/s (A10G), 300 GB/s (L4)
A100 40GB	40 GB	13B models (fp16), 34B (int8), 70B (int4)	1,555 GB/s
A100 80GB	80 GB	70B (fp16) on 2 cards, 34B (fp16) single	2,000 GB/s
H100 80GB	80 GB	70B (fp8 or int4) single card, 70B (fp16) on 2 cards, 405B multi	3,350 GB/s
H100 NVL 94GB	94 GB	Large models, highest throughput	3,900 GB/s
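The "best for" column follows from simple arithmetic on weight size. The sketch below is a rough sizing aid, not a benchmark: it assumes weights dominate memory, uses a hypothetical 90% headroom factor to leave room for the KV cache, and treats single-stream decode as reading every weight once per token. The function names are illustrative.

```python
# Rough sizing helpers. Assumptions (not from any vendor tool):
# - weights dominate memory; KV cache and activations live in the headroom
# - single-stream decode reads every weight once per generated token

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, vram_gb: float,
         headroom: float = 0.9) -> bool:
    """True if the weights fit within `headroom` of the card's VRAM."""
    return weight_gb(params_billion, precision) <= vram_gb * headroom


def decode_ceiling_tok_s(params_billion: float, precision: str,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed: bandwidth / bytes per token."""
    return bandwidth_gb_s / weight_gb(params_billion, precision)


if __name__ == "__main__":
    print(weight_gb(70, "fp16"))                 # 70B at fp16
    print(fits(70, "fp16", 80), fits(70, "int4", 80))
    print(decode_ceiling_tok_s(8, "fp16", 600))  # 8B fp16 on an A10G
```

The ceiling applies to one request at a time. Batched serving shares each weight read across many requests, which is why aggregate vLLM throughput can be far higher than this per-stream bound.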

2. On-Demand Pricing: AWS EC2, GCP, and Azure

AWS EC2

Instance	GPU	VRAM	On-demand / hr
g5.xlarge	1× A10G	24 GB	~$1.01
g5.12xlarge	4× A10G	96 GB	~$5.67
g5.48xlarge	8× A10G	192 GB	~$16.29
p4d.24xlarge	8× A100 40GB	320 GB	~$32.77
p4de.24xlarge	8× A100 80GB	640 GB	~$40.97
p5.48xlarge	8× H100 80GB SXM	640 GB	~$98.32

Google Cloud (GCP)

Instance	GPU	VRAM	On-demand / hr
g2-standard-4	1× L4	24 GB	~$0.70
g2-standard-48	4× L4	96 GB	~$2.82
a2-highgpu-1g	1× A100 40GB	40 GB	~$3.67
a2-highgpu-8g	8× A100 40GB	320 GB	~$29.39
a2-ultragpu-1g	1× A100 80GB	80 GB	~$5.07
a2-ultragpu-8g	8× A100 80GB	640 GB	~$40.54
a3-highgpu-8g	8× H100 80GB	640 GB	~$98.32

Azure

Instance	GPU	VRAM	On-demand / hr
NC6s v3	1× V100 16GB	16 GB	~$3.06
ND96asr v4	8× A100 40GB	320 GB	~$32.77
ND96amsr A100 v4	8× A100 80GB	640 GB	~$40.97
ND96isr H100 v5	8× H100 80GB	640 GB	~$98.32

3. Specialist Providers: Lambda Labs, CoreWeave, and RunPod

These providers are optimized specifically for ML workloads and typically offer lower prices than hyperscalers for GPU-only deployments. The tradeoff is tighter capacity and less comprehensive SLAs.

Lambda Labs

Lambda Labs is consistently 40–60% cheaper than AWS/GCP for equivalent hardware. The tradeoff is lower availability — H100 and A100 clusters sell out frequently.

GPU	VRAM	On-demand / hr
A10 (24GB)	24 GB	~$0.75
A100 SXM4 40GB	40 GB	~$1.29
A100 SXM4 80GB	80 GB	~$1.99
8× A100 SXM4 80GB	640 GB	~$15.92
H100 SXM5 80GB	80 GB	~$2.49
8× H100 SXM5 80GB	640 GB	~$19.92
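To check the "40–60% cheaper" claim against the rates on this page, compare prices per GPU rather than per instance. The snippet below is a minimal sketch using the A100 80GB rates from the tables; `pct_cheaper` is a hypothetical helper name, and hyperscaler instance prices also bundle CPU, RAM, and networking that this comparison ignores.

```python
def pct_cheaper(specialist_hr: float, hyperscaler_hr: float) -> float:
    """Percentage saved by the specialist rate relative to the hyperscaler rate."""
    return (1 - specialist_hr / hyperscaler_hr) * 100


# Per-GPU hourly rates for an A100 80GB, taken from the tables on this page.
lambda_a100_80 = 1.99        # Lambda Labs, 1x A100 SXM4 80GB
gcp_a100_80 = 5.07           # GCP a2-ultragpu-1g
aws_a100_80 = 40.97 / 8      # AWS p4de.24xlarge, divided across 8 GPUs

if __name__ == "__main__":
    print(f"Lambda vs GCP: {pct_cheaper(lambda_a100_80, gcp_a100_80):.0f}% cheaper")
    print(f"Lambda vs AWS: {pct_cheaper(lambda_a100_80, aws_a100_80):.0f}% cheaper")
```

Both comparisons land at about 61%, the top of the quoted range.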

CoreWeave

CoreWeave has strong Kubernetes-native infrastructure and better availability than Lambda for reserved capacity. At the rates listed here, its A100 80GB (~$2.06/hr) is about 60% cheaper per GPU than GCP's a2-ultragpu-1g (~$5.07/hr).

GPU	VRAM	On-demand / hr
A100 80GB SXM4	80 GB	~$2.06
H100 80GB PCIe	80 GB	~$2.93
H100 80GB SXM5	80 GB	~$3.89

RunPod (community cloud / spot)

RunPod community cloud pricing is the cheapest available for interruptible workloads. Not suitable for serving live traffic — use for batch jobs and offline inference.

GPU	VRAM	Secure cloud / hr	Community (spot) / hr
A10G 24GB	24 GB	~$0.49	~$0.22–0.38
A100 80GB SXM	80 GB	~$2.49	~$1.49–1.99
H100 80GB PCIe	80 GB	~$2.99	~$1.99–2.49
H100 80GB SXM	80 GB	~$3.49	~$2.29–2.99

4. Spot and Preemptible Pricing: When to Use and What to Expect

Spot instances offer 50–70% discounts over on-demand but can be interrupted with minimal notice. The use cases where they make sense for LLM workloads:

Batch inference jobs (document processing, offline evaluation, dataset generation) where retries are acceptable
Model evaluation runs where interruption just means re-queuing
Dev/test environments where occasional interruption is tolerable

Do not use spot for production serving. The savings are not worth the availability risk for user-facing traffic.

Provider	GPU	On-demand / hr	Spot / hr	Discount
AWS	A10G (g5.xlarge)	~$1.01	~$0.30–0.45	~55–70%
AWS	8× A100 40GB (p4d.24xlarge)	~$32.77	~$9.83–16.39	~50–70%
AWS	8× H100 (p5.48xlarge)	~$98.32	~$29–49	~50–70%
GCP	L4 (g2)	~$0.70	~$0.21–0.28	~60–70%
GCP	A100 40GB (a2)	~$3.67	~$1.10–1.47	~60–70%

Spot availability varies significantly by region and time. AWS us-east-1 and GCP us-central1 typically have the best spot availability for GPU instances.
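Interruptions erode part of the headline discount, because some work is redone after each preemption. A minimal sketch of that adjustment follows. `rework_fraction` is a hypothetical input, the share of compute that has to be repeated, which you would measure from your own job logs.

```python
def effective_spot_rate(spot_hr: float, rework_fraction: float) -> float:
    """Spot rate inflated by the share of compute redone after interruptions."""
    return spot_hr * (1 + rework_fraction)


def breakeven_rework_fraction(on_demand_hr: float, spot_hr: float) -> float:
    """Rework share at which spot stops being cheaper than on-demand."""
    return on_demand_hr / spot_hr - 1


if __name__ == "__main__":
    # AWS g5.xlarge at the upper end of the spot range, with 20% of work redone.
    print(effective_spot_rate(0.45, 0.20))
    print(breakeven_rework_fraction(1.01, 0.45))
```

With the g5.xlarge rates above, spot stays cheaper until more than half of the work has to be repeated. Checkpointing batch jobs keeps the rework share far below that.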

5. Reserved Instance Discounts: 1-Year and 3-Year Commitments

For sustained production workloads, 1-year reserved instances typically cut on-demand rates by 30–40%, and 3-year reservations by 50–60%. Only commit to reserved capacity if you have clear evidence of sustained utilization above 60%. A reservation bills every hour of the term whether the GPU is busy or not, so below the break-even point (roughly 1 minus the discount, or 60–70% for a 1-year term) on-demand capacity that you release when idle is cheaper.

Provider	Commitment	Typical discount vs on-demand
AWS	1 year, no upfront	~30–37%
AWS	1 year, all upfront	~38–42%
AWS	3 year, all upfront	~52–60%
GCP	1 year CUD	~37%
GCP	3 year CUD	~55%
Azure	1 year reserved	~36–40%
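The utilization threshold for reserving can be derived from the discount itself. The sketch below assumes on-demand instances are shut down when idle and billed only for busy hours, while a reservation is billed for every hour. The function names and the 730-hour month are illustrative choices.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def breakeven_utilization(discount: float) -> float:
    """Utilization above which a reservation beats on-demand.

    Assumes idle on-demand instances are released (no charge when idle).
    """
    return 1 - discount


def monthly_on_demand(rate_hr: float, utilization: float) -> float:
    """On-demand bill when paying only for busy hours."""
    return rate_hr * HOURS_PER_MONTH * utilization


def monthly_reserved(rate_hr: float, discount: float) -> float:
    """Reserved bill: discounted rate, charged for every hour of the month."""
    return rate_hr * HOURS_PER_MONTH * (1 - discount)


if __name__ == "__main__":
    rate, discount = 40.97, 0.37   # AWS p4de.24xlarge, GCP-style 1-year discount
    print(breakeven_utilization(discount))
    for u in (0.5, 0.8):
        print(u, monthly_on_demand(rate, u), monthly_reserved(rate, discount))
```

If the instance cannot be released when idle, for example because it serves live traffic around the clock, on-demand is billed for every hour too and the reservation wins at any utilization.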

6. Cost-Per-Token Estimates for Common Self-Hosted Configurations

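These estimates divide the on-demand hourly rate by the tokens produced in an hour at the listed throughput, so they assume the GPU sustains that throughput for the full hour. If your fleet averages 60% utilization, a realistic production figure, divide the listed cost by 0.6 (for example, ~$0.14/M becomes ~$0.23/M). Per-token cost is inversely proportional to utilization.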

Config	Model	Throughput (vLLM)	On-demand rate	Cost / 1M tokens
1× A10G (AWS g5.xlarge)	Llama 3.1 8B fp16	~2,000 tok/s	~$1.01/hr	~$0.14/M
1× A100 80GB (Lambda)	Llama 3.1 70B int4	~600 tok/s	~$1.99/hr	~$0.92/M
2× A100 80GB (Lambda)	Llama 3.1 70B fp16	~800 tok/s	~$3.98/hr	~$1.38/M
1× H100 80GB (Lambda)	Llama 3.1 70B fp8	~1,400 tok/s	~$2.49/hr	~$0.49/M
8× H100 80GB (Lambda)	Llama 3.1 405B fp8	~350 tok/s	~$19.92/hr	~$15.81/M

Throughput numbers are approximate and vary with batch size, sequence length, and quantization. vLLM with PagedAttention is assumed. Measure your actual throughput before committing to a configuration.
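The cost column can be reproduced from the hourly rate and throughput. The function below is a minimal sketch of that calculation; `cost_per_million_tokens` is a hypothetical name, and `utilization` is the fraction of each hour the GPU actually spends generating at the stated throughput. The table's figures correspond to `utilization=1.0`.

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per 1M generated tokens for one instance."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    configs = {
        "1x A10G, 8B": (1.01, 2000),
        "1x A100 80GB, 70B int4": (1.99, 600),
        "1x H100 80GB, 70B": (2.49, 1400),
    }
    for name, (rate, tps) in configs.items():
        full = cost_per_million_tokens(rate, tps)
        at_60 = cost_per_million_tokens(rate, tps, utilization=0.6)
        print(f"{name}: ${full:.2f}/M sustained, ${at_60:.2f}/M at 60% utilization")
```

Swap in your own measured throughput before relying on the result, since batch size and sequence length move it substantially.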

GPU Cloud Provider Selection Guide

Situation	Recommendation
Evaluating or prototyping	RunPod community cloud — cheapest per-hour, no commitment
Batch jobs and offline inference	AWS/GCP spot or RunPod community — interruption acceptable
Production serving, cost matters most	Lambda Labs or CoreWeave — 40–60% cheaper than hyperscalers
Production serving, need reliability SLA	AWS or GCP on-demand with reserved instances
Compliance or data residency requirements	AWS or Azure — broadest regional coverage and compliance certifications
Sustained utilization above 60–70%	1-year reserved on AWS/GCP; the 30–40% discount pays for itself above that range
Kubernetes-native ML platform	CoreWeave — best K8s-native GPU infrastructure among specialists
Need reserved H100 capacity	CoreWeave has stronger H100 reserved availability than Lambda

For most teams starting with self-hosting, Lambda Labs on-demand is the right starting point — lower price than hyperscalers, no commitment, simpler billing. Move to reserved instances or CoreWeave contracts once you have production data on actual utilization.

GPU Infrastructure Setup Checklist

Identify which model you're deploying and check its VRAM requirement at your target quantization (see open-source LLM comparison)
Select GPU tier based on VRAM: A10G/L4 for ≤24GB, A100 40GB for ≤40GB, A100 80GB or H100 for 70B models
Compare Lambda Labs vs AWS/GCP for your GPU tier — Lambda is typically 40–60% cheaper for identical hardware
Run a throughput benchmark (vLLM with your model and batch size) before committing to a configuration
Calculate cost per 1M tokens at your expected utilization rate: (hourly_rate / 3600) / (tokens_per_second × utilization) × 1,000,000
Compare self-hosted cost-per-token against the equivalent API (see LLM cost per token)
Use spot instances for any batch inference workloads — 50–70% savings for interruptible jobs
Do not use spot for production serving — set a minimum of on-demand or reserved for live traffic
If GPU utilization will exceed 60% sustained, evaluate 1-year reserved instances before committing to on-demand
Set up DCGM metrics and alert on GPU utilization falling below 40% — idle GPUs are pure waste

Frequently Asked Questions: Cloud GPU Pricing for LLM Hosting

Is it cheaper to self-host LLMs or use an API like OpenAI?

It depends on scale and model size. A 70B model on a Lambda Labs H100 costs ~$0.49/M tokens only when the GPU is saturated, which at ~1,400 tok/s is about 3.7B tokens/month. That is cheaper than GPT-4o ($4.38/M) but not cheaper than GPT-4o mini ($0.26/M). At 1B tokens/month on an always-on H100, the same ~$1,820 monthly bill is spread over fewer tokens and the effective cost rises to ~$1.82/M. The crossover point is typically 3–5B tokens/month on a flagship-quality model. Below that, the API wins on total cost of ownership once engineering overhead is factored in. See self-hosting vs API TCO for the full analysis.
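How monthly volume changes the effective self-hosted rate can be sketched as follows. This assumes always-on GPUs billed for every hour of a 730-hour month and adds another GPU whenever the existing ones are saturated. The names are illustrative, and engineering overhead is not modeled.

```python
import math

HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def monthly_capacity_tokens(tokens_per_second: float) -> float:
    """Tokens one always-on GPU can generate in a month at full throughput."""
    return tokens_per_second * 3600 * HOURS_PER_MONTH


def self_hosted_cost_per_million(hourly_rate: float, monthly_tokens: float,
                                 tokens_per_second: float) -> float:
    """Effective $/M tokens when GPUs are billed around the clock."""
    capacity = monthly_capacity_tokens(tokens_per_second)
    gpus = max(1, math.ceil(monthly_tokens / capacity))
    monthly_bill = gpus * hourly_rate * HOURS_PER_MONTH
    return monthly_bill / (monthly_tokens / 1_000_000)


if __name__ == "__main__":
    # Lambda H100 at $2.49/hr and ~1,400 tok/s, per the tables on this page.
    for volume in (0.5e9, 1e9, 3.6e9):
        rate = self_hosted_cost_per_million(2.49, volume, 1400)
        print(f"{volume / 1e9:.1f}B tokens/month -> ${rate:.2f}/M")
```

Comparing the printed rate with an API's per-token price at your actual volume gives a first-pass crossover estimate.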

Why are Lambda Labs and CoreWeave so much cheaper than AWS?

Hyperscalers (AWS, GCP, Azure) price GPU instances with margins that subsidize their broader cloud portfolio and SLA commitments. Specialist providers like Lambda Labs and CoreWeave operate GPU-only infrastructure with tighter margins, passing the savings to customers. The tradeoff is less availability — H100 and A100 capacity at Lambda sells out regularly, and SLAs are less comprehensive than AWS.

When should you use spot instances for LLM workloads?

Use spot for batch inference, offline document processing, model evaluation, and dataset generation — any workload where interruption means re-queuing rather than a user-facing error. Never use spot for production serving: the 50–70% cost savings aren't worth the availability risk for live traffic. AWS us-east-1 and GCP us-central1 have the best spot availability for GPU instances.

Is an H100 worth the premium over an A100 for LLM inference?

Yes, if decode throughput is the bottleneck for user-facing latency. The H100 delivers roughly 2× the tokens per second of the A100 80GB, helped by higher memory bandwidth (3,350 vs 2,000 GB/s). At Lambda Labs, the H100 costs $2.49/hr vs $1.99/hr for the A100 80GB: 25% more for roughly 2× the throughput. At equal utilization, the cost-per-token is therefore roughly 37% lower on the H100 (1.25 / 2 ≈ 0.63). If your serving workload is throughput-constrained, the H100 is the better value.

What GPU is best for serving a 70B model like Llama 3.3 70B?

A single H100 80GB is the best single-card option, delivering ~1,400 tok/s decode throughput. A 70B model at fp16 is ~140 GB of weights, so fitting it on one 80 GB card requires fp8 or int4 quantization; full fp16 needs two 80 GB cards. Two A100 80GBs run fp16 at ~$3.98/hr (Lambda) with ~800 tok/s, which costs more per hour and per token than the single H100. A single A100 80GB at int4 also works (~$1.99/hr, ~600 tok/s) with a small quality tradeoff from quantization. For cost-per-token, the single H100 wins at high utilization; for the lowest hourly spend, the A100 80GB at int4 is the cheapest entry point.
