Run LLMs Locally vs OpenAI API: Real Cost Comparison
Every team scaling an LLM product eventually runs this comparison. Most get it wrong because they only count compute. Here's the full cost stack — and the exact token volume where the math flips.
Shubham Yadav
Machine Learning Researcher
The self-hosting vs API comparison usually starts with a back-of-envelope calculation: GPU rental cost per hour divided by tokens per second looks cheaper than OpenAI's pricing page. The conclusion seems obvious. It almost never is.
The fundamental error in most LLM self-hosting cost comparisons is treating it as a compute arbitrage problem. The question isn't whether raw GPU compute costs less than API tokens — it often does. The question is whether the total system cost, compute plus engineering plus operations plus reliability, comes out ahead. That's a different calculation, and it changes the answer significantly.
This post covers:
- Current LLM API pricing — GPT-4o, GPT-4o Mini, Claude Sonnet, Claude Haiku blended rates
- The full self-hosting cost stack — the 4 layers, including the 2 most teams skip
- The break-even calculation — the exact monthly token volume where self-hosting becomes cheaper
- 5 factors that change the math — compliance, fine-tuning, utilization, team leverage, jurisdiction
- The hybrid architecture — how most teams at scale actually structure this
1. LLM API Pricing in 2026: The Baseline for the Comparison
Before calculating whether self-hosting saves money, you need an accurate picture of what you're actually paying for API access.
Current pricing for the models most teams use in production:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Blended at 3:1 ratio |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$4.40/M |
| GPT-4o Mini | $0.15 | $0.60 | ~$0.26/M |
| Claude Sonnet | $3.00 | $15.00 | ~$6.00/M |
| Claude Haiku | $0.25 | $1.25 | ~$0.50/M |
The blended rate (at a typical 3:1 input-to-output ratio) is what matters for comparisons. Your self-hosting total cost of ownership needs to beat the blended rate — not just the compute rate — to justify the investment.
One important note: LLM API pricing has dropped significantly over the past two years and continues to fall. Any break-even analysis should use current prices, not benchmarks from six months ago.
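Because prices move, it helps to recompute the blended rate rather than copy it. The following is a minimal sketch of that calculation; the function name `blended_rate` and the `PRICES` dictionary are illustrative, and the prices are the ones quoted in the table above, which should be replaced with current list prices.

```python
def blended_rate(input_price: float, output_price: float, input_ratio: float = 3.0) -> float:
    """Blended $/1M tokens, assuming `input_ratio` input tokens per 1 output token."""
    return (input_price * input_ratio + output_price) / (input_ratio + 1.0)


# (input $/1M, output $/1M) copied from the pricing table; update before use.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o Mini": (0.15, 0.60),
}

for model, (inp, out) in PRICES.items():
    print(f"{model}: ${blended_rate(inp, out):.4f}/M blended at 3:1")
```

If your workload is more output-heavy than 3:1, pass a smaller `input_ratio`; the blended rate rises because output tokens are the expensive side.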
2. The Self-Hosting LLM Cost Stack: 4 Layers Most Analyses Miss
Self-hosting LLM costs have four distinct layers. Most comparisons only count the first one — which is why most comparisons reach the wrong conclusion.
Layer 1: Compute (the layer everyone counts)
GPU requirements and cloud rental costs for the most commonly self-hosted models:
| Model | GPU requirement | Cloud rental rate | Throughput (vLLM) | Cost per 1M tokens |
|---|---|---|---|---|
| Llama 3.1 8B / Mistral 7B | 1× A10G (24GB) | ~$1.10–1.50/hr | ~2,000 tok/sec | ~$0.15–0.20/M |
| Llama 3.1 70B / Qwen 2.5 72B | 2× A100 80GB | ~$5–7/hr | ~800–1,200 tok/sec | ~$1.40–2.10/M |
| Llama 3.1 405B | 8× A100 or 4× H100 | ~$20–32/hr | ~200–400 tok/sec | ~$14–44/M |
At first glance this looks like a compelling case for self-hosting — a 70B model costs $1.40–2.10/M tokens vs GPT-4o at $4.40/M. But throughput utilization is the variable that breaks this math.
At 100% GPU utilization, self-hosting looks economical. At 40% utilization, which is realistic for most teams outside of peak hours, the effective cost per token is 2.5× higher, because the same GPU bill is spread over 40% as many tokens. GPU time costs the same whether you're generating tokens or idling. With managed APIs, you only pay for tokens you use. For applications with spiky or unpredictable traffic, idle capacity cost can wipe out the per-token savings entirely.
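The utilization effect is easy to check numerically. This is a rough sketch, assuming a $6/hr rental rate and 1,000 tok/sec (the midpoints of the 70B row above); the function name and the example inputs are illustrative, not measured values.

```python
def compute_cost_per_million(hourly_rate: float, tokens_per_sec: float,
                             utilization: float = 1.0) -> float:
    """Compute-only $/1M tokens for a GPU rented by the hour.

    `utilization` is the fraction of rented time spent generating tokens.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    useful_tokens_per_hour = tokens_per_sec * 3600 * utilization
    return hourly_rate / (useful_tokens_per_hour / 1_000_000)


# Assumed midpoints for a 70B model on 2x A100: $6/hr, 1,000 tok/sec.
for util in (1.0, 0.6, 0.4):
    cost = compute_cost_per_million(6.0, 1000, util)
    print(f"{util:.0%} utilization: ${cost:.2f}/M tokens")
```

Under these assumed inputs, compute alone costs about $1.67/M at full utilization and about $4.17/M at 40%, which is already close to GPT-4o's blended rate before any engineering cost is added.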
Layer 2: Engineering (the layer most analyses skip — and the largest cost)
Running an LLM in production is not a deploy-and-forget operation. Recurring engineering tasks include:
- Model updates as new versions release, plus evaluation to confirm updates don't regress quality
- Serving infrastructure maintenance: vLLM upgrades, CUDA compatibility, dependency drift
- Incident response when the serving layer fails, runs OOM, or starts producing degraded outputs
- Quantization and optimization work
- Monitoring, alerting, and capacity planning
A realistic estimate: 0.5–1.0 FTE of ongoing maintenance for a single 70B model in production. At $150–200k/year fully loaded, that's $6,250–16,700 per month in engineering cost — before writing a single line of application code.
This cost is fixed. It doesn't scale with token usage. At low volume it completely dominates. At very high volume it amortizes to near-zero per token. The break-even point is largely determined by when this amortized engineering cost per token falls below the API cost per token.
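To see how a fixed engineering cost amortizes, here is a small sketch using the FTE and salary ranges quoted above. The function names and the three sample volumes are illustrative choices.

```python
def monthly_engineering_cost(fte: float, loaded_salary_per_year: float) -> float:
    """Monthly cost of the engineering time dedicated to the deployment."""
    return fte * loaded_salary_per_year / 12


def amortized_cost_per_million(fixed_monthly_cost: float, tokens_per_month: float) -> float:
    """Fixed monthly cost spread over the tokens served that month, in $/1M tokens."""
    return fixed_monthly_cost / (tokens_per_month / 1_000_000)


low = monthly_engineering_cost(0.5, 150_000)    # low end of the range above
high = monthly_engineering_cost(1.0, 200_000)   # high end of the range above

for volume in (100e6, 1e9, 10e9):
    lo_rate = amortized_cost_per_million(low, volume)
    hi_rate = amortized_cost_per_million(high, volume)
    print(f"{volume / 1e6:,.0f}M tokens/month: ${lo_rate:.2f}-${hi_rate:.2f}/M engineering")
```

With these inputs, the low-end engineering cost alone is $62.50/M at 100M tokens/month and $0.625/M at 10B tokens/month, which is the amortization effect described above.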
Layer 3: Infrastructure and Operations
Beyond the GPU instance, a production self-hosted deployment has additional infrastructure costs:
| Component | Monthly cost range |
|---|---|
| Load balancer and networking | $50–200 |
| Storage for model weights (70B fp16 ≈ 140GB) | $15–30 |
| Monitoring stack (Grafana, Prometheus) | $50–200 |
| Redis for rate limiting and caching | $30–100 |
| Logging and observability pipeline | $100–500 |
| Total supporting infrastructure | $250–1,000+ |
Individually small, collectively real.
Layer 4: Reliability and Tail Risk
OpenAI, Anthropic, and Google maintain multi-region deployments, dedicated reliability engineering teams, and SLAs that most self-hosted setups can't replicate without significant investment. Reliability failures have costs — user-facing errors, potential churn, engineering time on incident response instead of product work. For applications where uptime is critical, the hidden cost of self-hosted reliability incidents can be substantial.
3. Self-Hosting vs OpenAI API Break-Even: The Exact Token Volume
With the full cost stack in hand, the break-even calculation becomes concrete. In a 70B model vs GPT-4o comparison, the math doesn't flip until ~3 billion tokens per month.
Monthly self-hosting costs for a 70B-class model (comparable quality to GPT-4o on most tasks) on 2× A100s:
| Component | Low estimate | High estimate |
|---|---|---|
| 2× A100 compute (720 hrs/month) | $3,600 | $5,040 |
| Engineering (0.75 FTE) | $9,375 | $12,500 |
| Supporting infrastructure | $250 | $1,000 |
| Total fixed monthly cost | $13,225 | $18,540 |
At 60% average GPU utilization, a realistic production number, 2× A100s sustaining ~1,000 tok/sec (the midpoint of the Layer 1 throughput range) deliver about 1.56B tokens/month of effective throughput: 1,000 tok/sec × 2,592,000 seconds/month × 0.60.
Effective cost per token at 1.56B tokens/month:
- Low estimate: $13,225 ÷ 1,555M = $8.50/M tokens (1.9× more expensive than GPT-4o)
- High estimate: $18,540 ÷ 1,555M = $11.92/M tokens (2.7× more expensive than GPT-4o)
The break-even volume where self-hosting cost per token equals GPT-4o at $4.40/M:
- Low estimates: $13,225 ÷ $4.40 = ~3 billion tokens/month
- High estimates: $18,540 ÷ $4.40 = ~4.2 billion tokens/month
3 billion tokens/month is approximately 10 million conversations at roughly 300 tokens each, which means sustained heavy usage from tens of thousands of active daily users. That's not startup scale. That's significant production scale.
For GPT-4o Mini at $0.26/M blended, the break-even point moves even further out. Against a self-hosted Llama 8B, the engineering overhead still dominates: the low-estimate engineering cost of $9,375/month alone buys roughly 36 billion GPT-4o Mini tokens.
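The break-even arithmetic in this section can be sketched in a few lines. The function name is illustrative, and the inputs are the monthly totals and blended rate quoted above. Like the calculation in the text, it treats the monthly cost as fixed; if the break-even volume exceeds what the hardware can actually serve, the compute line would have to grow and the true break-even would sit higher.

```python
def break_even_tokens_per_month(fixed_monthly_cost: float, api_rate_per_million: float) -> float:
    """Monthly token volume at which a fixed self-hosting bill equals the API bill."""
    return fixed_monthly_cost / api_rate_per_million * 1_000_000


GPT4O_BLENDED = 4.40  # $/1M tokens, from the pricing table

for label, fixed_cost in (("low", 13_225), ("high", 18_540)):
    tokens = break_even_tokens_per_month(fixed_cost, GPT4O_BLENDED)
    print(f"{label} estimate: {tokens / 1e9:.1f}B tokens/month")
```

Swapping in your own monthly total and your own blended API rate gives a first-pass threshold to compare against actual usage.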
4. Five Factors That Change the Self-Hosting LLM Cost Calculation
The break-even calculation above is the baseline, not the final answer. Five factors legitimately shift the math — sometimes enough to make self-hosting the clearly correct choice regardless of volume.
1. Data privacy and compliance requirements. If your application handles data that legally cannot be sent to third-party APIs — HIPAA-regulated health information, certain financial data, classified enterprise content — self-hosting may not be a cost optimization at all. It may be the only legal option. The cost comparison becomes self-hosting vs. not building the feature. Break-even is irrelevant.
2. Custom fine-tuning as a competitive moat. If your advantage depends on a fine-tuned model (specialized domain knowledge, proprietary data, specific behavioral patterns), you cannot get that from a stock API model. Fine-tuned models must be self-hosted or deployed through a provider's fine-tuning service (which adds its own costs and limitations). The question shifts to: is the quality advantage worth the self-hosting premium?
3. High utilization with predictable traffic. The calculation above used 60% GPU utilization. At 85–90% sustained utilization, effective compute cost per token drops substantially. Batch processing workloads — nightly document analysis, scheduled summarization, bulk extraction — are the clearest self-hosting case because they keep GPUs at near-full utilization with no idle waste.
4. Existing ML infrastructure ownership. The 0.5–1.0 FTE estimate assumes starting from scratch. If you already have ML infrastructure engineers maintaining other systems, the marginal cost of adding an LLM serving layer is much lower. The fixed cost distributes across a broader set of systems.
5. Regulatory data residency requirements. Some jurisdictions require data to remain in specific geographic regions at a level API providers may not satisfy with standard regional deployments. Self-hosting in a specific region may be necessary independent of cost.
Should You Self-Host LLMs? A Decision Guide by Token Volume
| Monthly token volume | Recommendation |
|---|---|
| Under 500M tokens/month | Use the API. The math isn't close: at 500M tokens the entire GPT-4o bill is about $2,200/month, well below the engineering cost alone. |
| 500M–3B tokens/month | Depends on specifics: utilization, engineering leverage, traffic predictability. Run your own numbers. |
| Over 3B tokens/month | Self-hosting worth serious evaluation. Cost savings are real; engineering overhead amortizes. |
| Any volume + compliance requirement | Self-hosting may be the only option regardless of cost. |
| Any volume + custom fine-tuning need | Self-hosting or a provider's fine-tuning service required; evaluate whether quality advantage justifies the premium. |
| Any volume + batch-only workloads | Self-hosting case strengthens significantly — high utilization makes the economics work. |
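The table can be encoded as a small helper for a first-pass screen. This is a sketch only: the function name, the return strings, and the order in which overlapping rows are checked (hard constraints before volume) are assumptions made for illustration, not something the table specifies.

```python
def recommend(tokens_per_month: float, *, compliance_required: bool = False,
              needs_fine_tuning: bool = False, batch_only: bool = False) -> str:
    """Map a workload onto the decision table. Precedence order is an assumption."""
    # "Any volume" rows are checked first because they override the volume bands.
    if compliance_required:
        return "self-host: may be the only option regardless of cost"
    if needs_fine_tuning:
        return "custom deployment: check the quality advantage justifies the premium"
    if batch_only:
        return "lean self-host: high utilization strengthens the case"
    # Volume bands from the table.
    if tokens_per_month < 500e6:
        return "api: engineering cost dominates at this volume"
    if tokens_per_month <= 3e9:
        return "depends: run your own numbers"
    return "evaluate self-hosting: overhead amortizes at this volume"


print(recommend(200e6))
print(recommend(1.5e9))
print(recommend(200e6, compliance_required=True))
```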
The Hybrid LLM Architecture: Combining API and Self-Hosted Models
For most teams at scale, the cost-optimal architecture isn't API or self-hosting — it's both, with traffic routed between them by workload type.
- API providers for variable, latency-sensitive, or unpredictable workloads — where the idle-cost problem of reserved GPU capacity would dominate
- Self-hosted models for high-volume, batch, or compliance-constrained workloads — where sustained utilization and predictable traffic make the economics work
LiteLLM's router makes this practical: define both self-hosted and API deployments as model pools and route between them based on task type, system load, or cost thresholds. Latency-sensitive requests go to the API; batch jobs queue against the self-hosted cluster.
This hybrid approach captures most of the cost savings of self-hosting without betting your production reliability on infrastructure you maintain yourself.
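The routing rule itself is simple enough to sketch without any library. The code below is plain Python and does not call LiteLLM; the `Request` fields and the two pool names are hypothetical, standing in for whatever model groups a router configuration would define.

```python
from dataclasses import dataclass

API_POOL = "api-pool"                  # hypothetical name for the managed-API deployments
SELF_HOSTED_POOL = "self-hosted-pool"  # hypothetical name for the self-hosted cluster


@dataclass(frozen=True)
class Request:
    latency_sensitive: bool
    is_batch: bool
    compliance_constrained: bool = False


def choose_pool(req: Request) -> str:
    """Pick a deployment pool by workload type, following the split described above."""
    # Compliance-constrained data must stay on infrastructure you control.
    if req.compliance_constrained:
        return SELF_HOSTED_POOL
    # Batch work that can wait in a queue keeps the GPUs busy.
    if req.is_batch and not req.latency_sensitive:
        return SELF_HOSTED_POOL
    # Everything else (interactive, spiky, unpredictable) goes to the API.
    return API_POOL


print(choose_pool(Request(latency_sensitive=True, is_batch=False)))
print(choose_pool(Request(latency_sensitive=False, is_batch=True)))
```

A real deployment would likely add load and cost thresholds on top of this, but keeping the decision in one function makes it easy to shift traffic incrementally.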
The One Number to Track Before Making the Decision
The single most important number to establish before evaluating self-hosting is your current fully-loaded cost per million tokens — including engineering time, not just API invoices.
Most teams undercount this because engineering time doesn't show up on the infrastructure bill. A developer spending 20% of their time on LLM-related work at $150k/year loaded is adding $30k/year — or $2.50/M tokens at 1B tokens/month — to your true cost basis.
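A fully-loaded figure can be computed from two invoices you already have: the API bill and payroll. This is a minimal sketch; the function name is illustrative, and the example invoice of $4,400 simply assumes 1B tokens at the $4.40/M GPT-4o blended rate.

```python
def fully_loaded_cost_per_million(api_invoice_per_month: float, tokens_per_month: float,
                                  engineer_time_fraction: float,
                                  loaded_salary_per_year: float) -> float:
    """API spend plus the engineering time spent on LLM work, in $/1M tokens."""
    engineering_per_month = engineer_time_fraction * loaded_salary_per_year / 12
    total_per_month = api_invoice_per_month + engineering_per_month
    return total_per_month / (tokens_per_month / 1_000_000)


# Assumed example: 1B tokens/month, $4,400 API invoice,
# one developer at 20% of a $150k/year loaded salary.
rate = fully_loaded_cost_per_million(4_400, 1e9, 0.20, 150_000)
print(f"Fully loaded: ${rate:.2f}/M tokens")
```

In this example the engineering share adds $2.50/M on top of the $4.40/M invoice rate, matching the figure in the paragraph above.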
Once you have that number, the analysis is straightforward: at what monthly token volume does a fully-loaded self-hosted deployment cost less? That's your threshold. Below it, optimize your API usage. Above it, evaluate self-hosting infrastructure in earnest.
The teams that get this wrong invest in self-hosting infrastructure before they have the volume to justify it, then discover that engineering overhead costs more than the tokens they were trying to save.
Self-Hosting vs API Decision Checklist
- Calculate your current monthly token volume — below 500M, the API is almost certainly cheaper when engineering cost is included
- Calculate your fully-loaded cost per million tokens — add engineering time, not just the API invoice
- Check whether you have a hard compliance requirement (HIPAA, GDPR data residency, sector-specific law) — if yes, self-hosting may be mandatory regardless of cost
- Estimate GPU utilization at your traffic levels — below 60% average utilization, idle compute cost erases the per-token savings
- Identify whether your traffic is predictable/batch (self-hosting-friendly) or spiky/user-driven (API-friendly)
- Factor engineering overhead: 0.5–1.0 FTE for a single production 70B model — does your team have this capacity?
- If above 3B tokens/month: run the full cost stack comparison (compute + engineering + infra + reliability) against your blended API rate
- Consider hybrid architecture: API for latency-sensitive traffic, self-hosted for high-volume batch workloads
- Use LiteLLM Router to route between self-hosted and API deployments so you can migrate incrementally