LLM Cost Per Token: GPT-4o vs Claude vs Gemini (2026)
Current pricing for every major LLM — flagship, mid-tier, budget, and reasoning models. Blended rates, caching discounts, and context tier pricing in one place.
Shubham Yadav
Machine Learning Researcher
LLM pricing has dropped dramatically since 2023 and continues to fall. This page tracks current rates for the models teams actually use in production — updated whenever providers change their pricing.
Last verified: June 2026. Providers change pricing without notice. Always confirm at the official pricing pages before budgeting — OpenAI, Anthropic, Google AI.
Quick answer: GPT-4o mini ($0.26/M blended) and Gemini Flash ($0.13/M) are the cheapest capable models for high-volume production. Flagship models (GPT-4o at $4.38/M, Claude 3.5 Sonnet at $6.00/M) cost 17–46× more. For most applications, routing 70% of traffic to mid-tier models and reserving flagships for complex requests cuts blended costs by 41–55%. Prompt caching — Anthropic 90% off reads, Google 75%, OpenAI 50% — is the highest-leverage single optimization for workloads with large, stable system prompts.
All prices are per 1 million tokens. Blended rate uses a 3:1 input-to-output ratio — typical for chat and instruction-following tasks. For extraction-heavy workloads (more input, less output), real cost skews toward the input price. Calculate your own ratio from a week of API logs before budgeting.
This resource covers:
- Flagship models — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro blended rates and context tiers
- Mid-tier models — the cheapest capable options for high-volume workloads
- Legacy and budget models — Claude 3 Haiku, Claude 3 Opus
- Reasoning models — o1, o1-mini, o3-mini effective cost including thinking tokens
- Hosted open-source — Groq, Together AI, Fireworks inference API rates
- Prompt caching discounts — how to cut input costs by 50–90%
- 1B token cost table — what each model costs at production scale
- Model selection guide — which tier to use for which workload
1. Flagship Models: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro
GPT-4o and Claude 3.5 Sonnet are the most widely deployed flagship models in production. Gemini 1.5 Pro has the largest context window at the lowest cost within 128k — past that threshold it becomes the most expensive option in this tier.
| Model | Input / 1M | Output / 1M | Blended (3:1) | Context |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $4.38 | 128k |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $6.00 | 200k |
| Gemini 1.5 Pro (≤128k) | $1.25 | $5.00 | $2.19 | 2M |
| Gemini 1.5 Pro (>128k) | $5.00 | $10.00 | $8.75 | 2M |
2. Mid-Tier Models: GPT-4o mini vs Claude 3.5 Haiku vs Gemini Flash
GPT-4o mini is the most common starting point for cost-optimized routing — capable enough for the majority of production workloads at 6% of GPT-4o's blended cost. Claude 3.5 Haiku costs ~6× more than GPT-4o mini but outperforms it on instruction-following and structured output tasks. Gemini Flash is the cheapest capable option in this tier for high-volume use cases.
| Model | Input / 1M | Output / 1M | Blended (3:1) | Context |
|---|---|---|---|---|
| GPT-4o mini | $0.15 | $0.60 | $0.26 | 128k |
| Claude 3.5 Haiku | $0.80 | $4.00 | $1.60 | 200k |
| Gemini 1.5 Flash (≤128k) | $0.075 | $0.30 | $0.13 | 1M |
| Gemini 1.5 Flash-8B | $0.0375 | $0.15 | $0.07 | 1M |
3. Legacy and Budget Models
Claude 3 Haiku remains useful as a low-cost fallback. Claude 3 Opus is now largely superseded by Claude 3.5 Sonnet — lower price, better performance — and is only worth running for specific legacy compatibility reasons.
| Model | Input / 1M | Output / 1M | Blended (3:1) |
|---|---|---|---|
| Claude 3 Haiku | $0.25 | $1.25 | $0.50 |
| Claude 3 Opus | $15.00 | $75.00 | $26.25 |
4. Reasoning Models: o1, o1-mini, and o3-mini Effective Cost
Reasoning models generate internal "thinking" tokens that are billed as output tokens but not returned in the response. The effective cost per task is significantly higher than the per-token rate implies — factor in 2–5× the visible output token count for reasoning-heavy tasks when estimating budget.
| Model | Input / 1M | Output / 1M | Notes |
|---|---|---|---|
| OpenAI o1 | $15.00 | $60.00 | Full reasoning, highest quality |
| OpenAI o1-mini | $1.10 | $4.40 | Lighter reasoning, faster |
| OpenAI o3-mini | $1.10 | $4.40 | Newer, stronger than o1-mini |
Use reasoning models when the task genuinely requires multi-step deliberation — complex math, multi-constraint planning, debugging subtle logic errors. For anything a standard model handles correctly, reasoning models are expensive overkill. See when to use reasoning models for routing guidance.
5. Hosted Open-Source Inference APIs: Groq, Together AI, and Fireworks
Open-source models hosted by third-party inference providers can be substantially cheaper than proprietary APIs for high-volume workloads. The quality ceiling is lower, but for well-defined tasks the cost gap is hard to ignore.
| Model | Provider | Input / 1M | Output / 1M |
|---|---|---|---|
| Llama 3.1 8B | Groq | ~$0.05 | ~$0.08 |
| Llama 3.1 70B | Groq | ~$0.59 | ~$0.79 |
| Llama 3.1 70B | Together AI | ~$0.54 | ~$0.54 |
| Mistral 7B | Fireworks | ~$0.10 | ~$0.10 |
Inference API pricing for open-source models is significantly more volatile than proprietary API pricing — providers adjust rates frequently and run promotional pricing. Treat these as rough benchmarks, not billing projections.
6. Prompt Caching Discounts: Anthropic vs Google vs OpenAI
All three major providers offer significant discounts for repeated prompt prefixes. For applications with large, stable system prompts — RAG pipelines, document analysis, agents with long context — prompt caching is often the highest-leverage cost optimization available.
| Provider | Cache write cost | Cache read cost | Min cacheable tokens |
|---|---|---|---|
| Anthropic (Claude 3.5 Sonnet) | $3.75/M (1.25× base input) | $0.30/M (~10% of base) | 1,024 |
| Google (Gemini 1.5 Pro) | Storage fee only | $0.3125/M (25% of base) | 32,768 |
| OpenAI (GPT-4o) | No write fee | $1.25/M (50% of base) | 1,024 |
Anthropic's cache read discount is the deepest — 90% off base input pricing — making it the strongest option for workloads with long, stable system prompts that repeat on every request.
7. What 1 Billion Tokens Per Month Actually Costs
At 1B tokens/month (moderate production scale) and a 3:1 input-to-output ratio:
| Model | Monthly cost at 1B tokens |
|---|---|
| Gemini 1.5 Flash-8B | ~$70 |
| Gemini 1.5 Flash | ~$130 |
| GPT-4o mini | ~$260 |
| Claude 3 Haiku | ~$500 |
| Claude 3.5 Haiku | ~$1,600 |
| Gemini 1.5 Pro (≤128k) | ~$2,190 |
| GPT-4o | ~$4,380 |
| Claude 3.5 Sonnet | ~$6,000 |
| OpenAI o1 | ~$26,250 |
This is API cost only. Engineering time, retry waste, and hallucination-driven re-calls are not reflected here — see hidden LLM costs in production for the full picture.
Model Tier Selection Guide
| Workload | Recommended model | Rationale |
|---|---|---|
| Simple Q&A, extraction, summarization | GPT-4o mini or Gemini Flash | Sufficient quality at 6–17% of flagship cost |
| Instruction-following, structured output | Claude 3.5 Haiku | Stronger than GPT-4o mini on complex instructions |
| Complex reasoning, multi-step tasks | GPT-4o or Claude 3.5 Sonnet | Flagship quality required |
| Math, proof, constraint planning | OpenAI o3-mini or o1 | Reasoning model advantage is measurable |
| Long-document processing (>128k) | Gemini 1.5 Pro | Largest context window, cheapest within 128k |
| High-volume, latency-sensitive | Groq (Llama 3.1 8B or 70B) | Fastest inference, lowest cost for open-source quality |
| Workload with large stable system prompt | Any model with caching | Enable prompt caching — Anthropic gives 90% read discount |
| Mixed workload (simple + complex) | Cascade: mid-tier → flagship | 41–55% cost savings vs all-flagship routing |
LLM API Cost Optimization Checklist
- Pull a week of API logs and calculate your actual input:output ratio — do not assume 3:1
- Identify your top 5 endpoints by token volume and calculate their individual blended rates
- Check whether any high-volume endpoints are on flagship models when mid-tier would suffice
- Enable prompt caching on any endpoint with a system prompt exceeding 1,024 tokens
- Set explicit
max_tokenson every production call — verbose outputs at $10–15/M output tokens are expensive - Run both the flagship and mid-tier model on a 200-request sample from your top workload; measure quality gap
- If quality gap is small, migrate to mid-tier or build a cascade router
- Add cost attribution tags (
team,workload) to every API call for per-team visibility - Set a monthly spend alert at 120% of prior month to catch cost spikes early
- Review pricing quarterly — model costs continue to fall and the optimal tier shifts
Frequently Asked Questions: LLM API Pricing
What is the cheapest LLM API for production use in 2026?
Gemini 1.5 Flash-8B at $0.07/M blended is the cheapest capable model from a major provider. For open-source models via inference API, Groq's Llama 3.1 8B at ~$0.07/M (blended) is equivalent in cost and significantly faster. GPT-4o mini at $0.26/M is the cheapest option with strong instruction-following quality across a broad range of production tasks.
How much cheaper is GPT-4o mini than GPT-4o?
About 17× cheaper at blended 3:1 rates — $0.26/M vs $4.38/M. At 1B tokens/month, that's $260 vs $4,380. For most SaaS workloads where 70% of requests are routine, routing those to GPT-4o mini and reserving GPT-4o for complex requests cuts blended spend by ~55%.
Do reasoning models like o1 really cost that much more?
Yes. OpenAI o1 generates 2,000–8,000 internal thinking tokens per request at $60/M output — a typical request costs $0.12–0.49 in thinking tokens alone, before the visible response. On the same prompt, GPT-4o mini costs $0.001–0.003. Only use reasoning models for tasks where multi-step deliberation demonstrably improves output quality.
How does prompt caching reduce LLM costs?
Caching works by storing a repeated prompt prefix server-side. On subsequent requests that use the same prefix, you pay the cache read rate instead of the full input rate. Anthropic charges $0.30/M for cache reads vs $3.00/M for fresh input — a 90% reduction. If your system prompt is 2,000 tokens and you send 1M requests per month, caching saves ~$5,400/month at Claude 3.5 Sonnet rates.
What is a blended LLM rate and how do I calculate mine?
A blended rate weights input and output token costs by your actual traffic ratio. If your workload sends 3 input tokens for every 1 output token, your blended rate = (input_price × 0.75) + (output_price × 0.25). The 3:1 ratio used in this table is a starting assumption — measure your actual ratio from API logs before making budget decisions.