LLM Cost Per Token: GPT-4o vs Claude vs Gemini (2026)

Current pricing for every major LLM — flagship, mid-tier, budget, and reasoning models. Blended rates, caching discounts, and context tier pricing in one place.

Shubham Yadav

Machine Learning Researcher

Updated June 8, 2026

On this page

1. Flagship Models: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro
2. Mid-Tier Models: GPT-4o mini vs Claude 3.5 Haiku vs Gemini Flash
3. Legacy and Budget Models
4. Reasoning Models: o1, o1-mini, and o3-mini Effective Cost
5. Hosted Open-Source Inference APIs: Groq, Together AI, and Fireworks
6. Prompt Caching Discounts: Anthropic vs Google vs OpenAI
7. What 1 Billion Tokens Per Month Actually Costs
Model Tier Selection Guide
LLM API Cost Optimization Checklist
Frequently Asked Questions: LLM API Pricing

LLM pricing has dropped dramatically since 2023 and continues to fall. This page tracks current rates for the models teams actually use in production — updated whenever providers change their pricing.

Last verified: June 2026. Providers change pricing without notice. Always confirm at the official pricing pages before budgeting — OpenAI, Anthropic, Google AI.

Quick answer: GPT-4o mini ($0.26/M blended) and Gemini Flash ($0.13/M) are the cheapest capable models for high-volume production. Flagship models (GPT-4o at $4.38/M, Claude 3.5 Sonnet at $6.00/M) cost 17–46× more. For most applications, routing 70% of traffic to mid-tier models and reserving flagships for complex requests cuts blended costs by 41–55%. Prompt caching — Anthropic 90% off reads, Google 75%, OpenAI 50% — is the highest-leverage single optimization for workloads with large, stable system prompts.

All prices are per 1 million tokens. Blended rate uses a 3:1 input-to-output ratio — typical for chat and instruction-following tasks. For extraction-heavy workloads (more input, less output), real cost skews toward the input price. Calculate your own ratio from a week of API logs before budgeting.

This resource covers:

Flagship models — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro blended rates and context tiers
Mid-tier models — the cheapest capable options for high-volume workloads
Legacy and budget models — Claude 3 Haiku, Claude 3 Opus
Reasoning models — o1, o1-mini, o3-mini effective cost including thinking tokens
Hosted open-source — Groq, Together AI, Fireworks inference API rates
Prompt caching discounts — how to cut input costs by 50–90%
1B token cost table — what each model costs at production scale
Model selection guide — which tier to use for which workload

1. Flagship Models: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro

GPT-4o and Claude 3.5 Sonnet are the most widely deployed flagship models in production. Gemini 1.5 Pro has the largest context window at the lowest cost within 128k — past that threshold it becomes the most expensive option in this tier.

Model	Input / 1M	Output / 1M	Blended (3:1)	Context
GPT-4o	$2.50	$10.00	$4.38	128k
Claude 3.5 Sonnet	$3.00	$15.00	$6.00	200k
Gemini 1.5 Pro (≤128k)	$1.25	$5.00	$2.19	2M
Gemini 1.5 Pro (>128k)	$5.00	$10.00	$8.75	2M

2. Mid-Tier Models: GPT-4o mini vs Claude 3.5 Haiku vs Gemini Flash

GPT-4o mini is the most common starting point for cost-optimized routing — capable enough for the majority of production workloads at 6% of GPT-4o's blended cost. Claude 3.5 Haiku costs ~6× more than GPT-4o mini but outperforms it on instruction-following and structured output tasks. Gemini Flash is the cheapest capable option in this tier for high-volume use cases.

Model	Input / 1M	Output / 1M	Blended (3:1)	Context
GPT-4o mini	$0.15	$0.60	$0.26	128k
Claude 3.5 Haiku	$0.80	$4.00	$1.60	200k
Gemini 1.5 Flash (≤128k)	$0.075	$0.30	$0.13	1M
Gemini 1.5 Flash-8B	$0.0375	$0.15	$0.07	1M

3. Legacy and Budget Models

Claude 3 Haiku remains useful as a low-cost fallback. Claude 3 Opus is now largely superseded by Claude 3.5 Sonnet — lower price, better performance — and is only worth running for specific legacy compatibility reasons.

Model	Input / 1M	Output / 1M	Blended (3:1)
Claude 3 Haiku	$0.25	$1.25	$0.50
Claude 3 Opus	$15.00	$75.00	$26.25

4. Reasoning Models: o1, o1-mini, and o3-mini Effective Cost

Reasoning models generate internal "thinking" tokens that are billed as output tokens but not returned in the response. The effective cost per task is significantly higher than the per-token rate implies — factor in 2–5× the visible output token count for reasoning-heavy tasks when estimating budget.

Model	Input / 1M	Output / 1M	Notes
OpenAI o1	$15.00	$60.00	Full reasoning, highest quality
OpenAI o1-mini	$1.10	$4.40	Lighter reasoning, faster
OpenAI o3-mini	$1.10	$4.40	Newer, stronger than o1-mini

Use reasoning models when the task genuinely requires multi-step deliberation — complex math, multi-constraint planning, debugging subtle logic errors. For anything a standard model handles correctly, reasoning models are expensive overkill. See when to use reasoning models for routing guidance.

5. Hosted Open-Source Inference APIs: Groq, Together AI, and Fireworks

Open-source models hosted by third-party inference providers can be substantially cheaper than proprietary APIs for high-volume workloads. The quality ceiling is lower, but for well-defined tasks the cost gap is hard to ignore.

Model	Provider	Input / 1M	Output / 1M
Llama 3.1 8B	Groq	~$0.05	~$0.08
Llama 3.1 70B	Groq	~$0.59	~$0.79
Llama 3.1 70B	Together AI	~$0.54	~$0.54
Mistral 7B	Fireworks	~$0.10	~$0.10

Inference API pricing for open-source models is significantly more volatile than proprietary API pricing — providers adjust rates frequently and run promotional pricing. Treat these as rough benchmarks, not billing projections.

6. Prompt Caching Discounts: Anthropic vs Google vs OpenAI

All three major providers offer significant discounts for repeated prompt prefixes. For applications with large, stable system prompts — RAG pipelines, document analysis, agents with long context — prompt caching is often the highest-leverage cost optimization available.

Provider	Cache write cost	Cache read cost	Min cacheable tokens
Anthropic (Claude 3.5 Sonnet)	$3.75/M (1.25× base input)	$0.30/M (~10% of base)	1,024
Google (Gemini 1.5 Pro)	Storage fee only	$0.3125/M (25% of base)	32,768
OpenAI (GPT-4o)	No write fee	$1.25/M (50% of base)	1,024

Anthropic's cache read discount is the deepest — 90% off base input pricing — making it the strongest option for workloads with long, stable system prompts that repeat on every request.

7. What 1 Billion Tokens Per Month Actually Costs

At 1B tokens/month (moderate production scale) and a 3:1 input-to-output ratio:

Model	Monthly cost at 1B tokens
Gemini 1.5 Flash-8B	~$70
Gemini 1.5 Flash	~$130
GPT-4o mini	~$260
Claude 3 Haiku	~$500
Claude 3.5 Haiku	~$1,600
Gemini 1.5 Pro (≤128k)	~$2,190
GPT-4o	~$4,380
Claude 3.5 Sonnet	~$6,000
OpenAI o1	~$26,250

This is API cost only. Engineering time, retry waste, and hallucination-driven re-calls are not reflected here — see hidden LLM costs in production for the full picture.

Model Tier Selection Guide

Workload	Recommended model	Rationale
Simple Q&A, extraction, summarization	GPT-4o mini or Gemini Flash	Sufficient quality at 6–17% of flagship cost
Instruction-following, structured output	Claude 3.5 Haiku	Stronger than GPT-4o mini on complex instructions
Complex reasoning, multi-step tasks	GPT-4o or Claude 3.5 Sonnet	Flagship quality required
Math, proof, constraint planning	OpenAI o3-mini or o1	Reasoning model advantage is measurable
Long-document processing (>128k)	Gemini 1.5 Pro	Largest context window, cheapest within 128k
High-volume, latency-sensitive	Groq (Llama 3.1 8B or 70B)	Fastest inference, lowest cost for open-source quality
Workload with large stable system prompt	Any model with caching	Enable prompt caching — Anthropic gives 90% read discount
Mixed workload (simple + complex)	Cascade: mid-tier → flagship	41–55% cost savings vs all-flagship routing

LLM API Cost Optimization Checklist

Pull a week of API logs and calculate your actual input:output ratio — do not assume 3:1
Identify your top 5 endpoints by token volume and calculate their individual blended rates
Check whether any high-volume endpoints are on flagship models when mid-tier would suffice
Enable prompt caching on any endpoint with a system prompt exceeding 1,024 tokens
Set explicit max_tokens on every production call — verbose outputs at $10–15/M output tokens are expensive
Run both the flagship and mid-tier model on a 200-request sample from your top workload; measure quality gap
If quality gap is small, migrate to mid-tier or build a cascade router
Add cost attribution tags (team, workload) to every API call for per-team visibility
Set a monthly spend alert at 120% of prior month to catch cost spikes early
Review pricing quarterly — model costs continue to fall and the optimal tier shifts

Frequently Asked Questions: LLM API Pricing

What is the cheapest LLM API for production use in 2026?

Gemini 1.5 Flash-8B at $0.07/M blended is the cheapest capable model from a major provider. For open-source models via inference API, Groq's Llama 3.1 8B at ~$0.07/M (blended) is equivalent in cost and significantly faster. GPT-4o mini at $0.26/M is the cheapest option with strong instruction-following quality across a broad range of production tasks.

How much cheaper is GPT-4o mini than GPT-4o?

About 17× cheaper at blended 3:1 rates — $0.26/M vs $4.38/M. At 1B tokens/month, that's $260 vs $4,380. For most SaaS workloads where 70% of requests are routine, routing those to GPT-4o mini and reserving GPT-4o for complex requests cuts blended spend by ~55%.

Do reasoning models like o1 really cost that much more?

Yes. OpenAI o1 generates 2,000–8,000 internal thinking tokens per request at $60/M output — a typical request costs $0.12–0.49 in thinking tokens alone, before the visible response. On the same prompt, GPT-4o mini costs $0.001–0.003. Only use reasoning models for tasks where multi-step deliberation demonstrably improves output quality.

How does prompt caching reduce LLM costs?

Caching works by storing a repeated prompt prefix server-side. On subsequent requests that use the same prefix, you pay the cache read rate instead of the full input rate. Anthropic charges $0.30/M for cache reads vs $3.00/M for fresh input — a 90% reduction. If your system prompt is 2,000 tokens and you send 1M requests per month, caching saves ~$5,400/month at Claude 3.5 Sonnet rates.

What is a blended LLM rate and how do I calculate mine?

A blended rate weights input and output token costs by your actual traffic ratio. If your workload sends 3 input tokens for every 1 output token, your blended rate = (input_price × 0.75) + (output_price × 0.25). The 3:1 ratio used in this table is a starting assumption — measure your actual ratio from API logs before making budget decisions.

Back to resources