LLM Cost Per Token: GPT-4o vs Claude vs Gemini (2026)

Current pricing for every major LLM — flagship, mid-tier, budget, and reasoning models. Blended rates, caching discounts, and context tier pricing in one place.

Shubham Yadav

Machine Learning Researcher

Updated June 8, 2026

LLM pricing has dropped dramatically since 2023 and continues to fall. This page tracks current rates for the models teams actually use in production — updated whenever providers change their pricing.

Last verified: June 2026. Providers change pricing without notice. Always confirm at the official pricing pages before budgeting — OpenAI, Anthropic, Google AI.

Quick answer: GPT-4o mini ($0.26/M blended) and Gemini Flash ($0.13/M) are the cheapest capable models for high-volume production. Flagship models (GPT-4o at $4.38/M, Claude 3.5 Sonnet at $6.00/M) cost 17–46× more. For most applications, routing 70% of traffic to mid-tier models and reserving flagships for complex requests cuts blended costs by 41–55%. Prompt caching — Anthropic 90% off reads, Google 75%, OpenAI 50% — is the highest-leverage single optimization for workloads with large, stable system prompts.

All prices are per 1 million tokens. Blended rate uses a 3:1 input-to-output ratio — typical for chat and instruction-following tasks. For extraction-heavy workloads (more input, less output), real cost skews toward the input price. Calculate your own ratio from a week of API logs before budgeting.
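As a rough sketch of the blended-rate arithmetic used throughout this page, the Python below weights input and output prices by a configurable ratio. The function name `blended_rate` and its parameters are illustrative and not part of any provider SDK; the prices are the GPT-4o figures listed on this page, and the 9:1 ratio is an assumed example of an extraction-heavy workload.

```python
def blended_rate(input_price: float, output_price: float,
                 input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted price per 1M tokens for a given input:output token ratio."""
    total = input_parts + output_parts
    if total <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return (input_price * input_parts + output_price * output_parts) / total


# GPT-4o at the rates listed on this page ($2.50 in, $10.00 out), 3:1 ratio.
gpt4o_blended = blended_rate(2.50, 10.00)

# Assumed extraction-heavy workload at 9:1: the result moves toward the input price.
gpt4o_extraction = blended_rate(2.50, 10.00, input_parts=9, output_parts=1)

print(f"3:1 blended: ${gpt4o_blended:.3f}/M")
print(f"9:1 blended: ${gpt4o_extraction:.3f}/M")
```

Swapping in a ratio measured from your own logs is the only change needed to reuse the tables below for a different workload.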

This resource covers:

  • Flagship models — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro blended rates and context tiers
  • Mid-tier models — the cheapest capable options for high-volume workloads
  • Legacy and budget models — Claude 3 Haiku, Claude 3 Opus
  • Reasoning models — o1, o1-mini, o3-mini effective cost including thinking tokens
  • Hosted open-source — Groq, Together AI, Fireworks inference API rates
  • Prompt caching discounts — how to cut input costs by 50–90%
  • 1B token cost table — what each model costs at production scale
  • Model selection guide — which tier to use for which workload

1. Flagship Models: GPT-4o vs Claude 3.5 Sonnet vs Gemini 1.5 Pro

GPT-4o and Claude 3.5 Sonnet are the most widely deployed flagship models in production. Gemini 1.5 Pro has the largest context window (2M tokens) and the lowest price in this tier for prompts up to 128k tokens. Past that threshold it becomes the most expensive option in this tier.

| Model | Input / 1M | Output / 1M | Blended (3:1) | Context |
| --- | --- | --- | --- | --- |
| GPT-4o | $2.50 | $10.00 | $4.38 | 128k |
| Claude 3.5 Sonnet | $3.00 | $15.00 | $6.00 | 200k |
| Gemini 1.5 Pro (≤128k) | $1.25 | $5.00 | $2.19 | 2M |
| Gemini 1.5 Pro (>128k) | $5.00 | $10.00 | $6.25 | 2M |

2. Mid-Tier Models: GPT-4o mini vs Claude 3.5 Haiku vs Gemini Flash

GPT-4o mini is the most common starting point for cost-optimized routing — capable enough for the majority of production workloads at 6% of GPT-4o's blended cost. Claude 3.5 Haiku costs ~6× more than GPT-4o mini but outperforms it on instruction-following and structured output tasks. Gemini Flash is the cheapest capable option in this tier for high-volume use cases.

| Model | Input / 1M | Output / 1M | Blended (3:1) | Context |
| --- | --- | --- | --- | --- |
| GPT-4o mini | $0.15 | $0.60 | $0.26 | 128k |
| Claude 3.5 Haiku | $0.80 | $4.00 | $1.60 | 200k |
| Gemini 1.5 Flash (≤128k) | $0.075 | $0.30 | $0.13 | 1M |
| Gemini 1.5 Flash-8B | $0.0375 | $0.15 | $0.07 | 1M |

3. Legacy and Budget Models

Claude 3 Haiku remains useful as a low-cost fallback. Claude 3 Opus is now largely superseded by Claude 3.5 Sonnet — lower price, better performance — and is only worth running for specific legacy compatibility reasons.

| Model | Input / 1M | Output / 1M | Blended (3:1) |
| --- | --- | --- | --- |
| Claude 3 Haiku | $0.25 | $1.25 | $0.50 |
| Claude 3 Opus | $15.00 | $75.00 | $26.25 |

4. Reasoning Models: o1, o1-mini, and o3-mini Effective Cost

Reasoning models generate internal "thinking" tokens that are billed as output tokens but not returned in the response. The effective cost per task is significantly higher than the per-token rate implies — factor in 2–5× the visible output token count for reasoning-heavy tasks when estimating budget.
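A minimal sketch of that budgeting rule follows. It reads the guidance as 2 to 5 hidden thinking tokens per visible output token, which is one interpretation rather than a provider-documented formula. The function name, the request sizes (1,500 input tokens, 500 visible output tokens), and the multipliers are hypothetical; the prices are the o1 rates listed in the table below.

```python
def reasoning_request_cost(input_tokens: int, visible_output_tokens: int,
                           thinking_multiplier: float,
                           input_price: float, output_price: float) -> float:
    """Dollar cost of one request, billing hidden thinking tokens as output.

    Prices are per 1M tokens. thinking_multiplier is the assumed number of
    hidden thinking tokens generated per visible output token.
    """
    thinking_tokens = visible_output_tokens * thinking_multiplier
    billed_output = visible_output_tokens + thinking_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


# Hypothetical request priced at o1 rates ($15.00 in, $60.00 out).
naive_cost = reasoning_request_cost(1_500, 500, 0, 15.00, 60.00)  # ignores thinking tokens
low_cost = reasoning_request_cost(1_500, 500, 2, 15.00, 60.00)    # 2x thinking overhead
high_cost = reasoning_request_cost(1_500, 500, 5, 15.00, 60.00)   # 5x thinking overhead

print(f"naive: ${naive_cost:.4f}  low: ${low_cost:.4f}  high: ${high_cost:.4f}")
```

Under these assumptions the per-request cost is roughly two to four times the naive estimate, which is why the per-token rate alone understates the budget.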

| Model | Input / 1M | Output / 1M | Notes |
| --- | --- | --- | --- |
| OpenAI o1 | $15.00 | $60.00 | Full reasoning, highest quality |
| OpenAI o1-mini | $1.10 | $4.40 | Lighter reasoning, faster |
| OpenAI o3-mini | $1.10 | $4.40 | Newer, stronger than o1-mini |

Use reasoning models when the task genuinely requires multi-step deliberation — complex math, multi-constraint planning, debugging subtle logic errors. For anything a standard model handles correctly, reasoning models are expensive overkill. See when to use reasoning models for routing guidance.


5. Hosted Open-Source Inference APIs: Groq, Together AI, and Fireworks

Open-source models hosted by third-party inference providers can be substantially cheaper than proprietary APIs for high-volume workloads. The quality ceiling is lower, but for well-defined tasks the cost gap is hard to ignore.

| Model | Provider | Input / 1M | Output / 1M |
| --- | --- | --- | --- |
| Llama 3.1 8B | Groq | ~$0.05 | ~$0.08 |
| Llama 3.1 70B | Groq | ~$0.59 | ~$0.79 |
| Llama 3.1 70B | Together AI | ~$0.54 | ~$0.54 |
| Mistral 7B | Fireworks | ~$0.10 | ~$0.10 |

Inference API pricing for open-source models is significantly more volatile than proprietary API pricing — providers adjust rates frequently and run promotional pricing. Treat these as rough benchmarks, not billing projections.


6. Prompt Caching Discounts: Anthropic vs Google vs OpenAI

All three major providers offer significant discounts for repeated prompt prefixes. For applications with large, stable system prompts — RAG pipelines, document analysis, agents with long context — prompt caching is often the highest-leverage cost optimization available.

| Provider | Cache write cost | Cache read cost | Min cacheable tokens |
| --- | --- | --- | --- |
| Anthropic (Claude 3.5 Sonnet) | $3.75/M (1.25× base input) | $0.30/M (~10% of base) | 1,024 |
| Google (Gemini 1.5 Pro) | Storage fee only | $0.3125/M (25% of base) | 32,768 |
| OpenAI (GPT-4o) | No write fee | $1.25/M (50% of base) | 1,024 |

Anthropic's cache read discount is the deepest — 90% off base input pricing — making it the strongest option for workloads with long, stable system prompts that repeat on every request.
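To see how the write premium and read discount interact, here is a simplified sketch that averages the price of the cached prefix over a given cache hit rate. It assumes every cache miss triggers a cache write and every hit pays the read rate. Both function names are illustrative, and Google is left out because its storage fee is not quantified on this page.

```python
def effective_prefix_price(write_price: float, read_price: float,
                           hit_rate: float) -> float:
    """Average price per 1M prefix tokens: misses pay write_price, hits pay read_price."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_price + (1.0 - hit_rate) * write_price


def break_even_hit_rate(base_price: float, write_price: float,
                        read_price: float) -> float:
    """Hit rate above which caching is cheaper than paying base_price every time."""
    return (write_price - base_price) / (write_price - read_price)


# Claude 3.5 Sonnet: $3.75/M write, $0.30/M read, $3.00/M base input.
anthropic_at_90 = effective_prefix_price(3.75, 0.30, 0.90)
anthropic_break_even = break_even_hit_rate(3.00, 3.75, 0.30)

# GPT-4o: no write fee, so a miss pays the base $2.50/M; reads cost $1.25/M.
openai_at_90 = effective_prefix_price(2.50, 1.25, 0.90)

print(f"Anthropic prefix price at 90% hits: ${anthropic_at_90:.3f}/M")
print(f"Anthropic break-even hit rate: {anthropic_break_even:.1%}")
print(f"OpenAI prefix price at 90% hits: ${openai_at_90:.3f}/M")
```

Under this model, Anthropic's write premium means caching only pays off once a meaningful share of requests hit the cache, while a provider with no write fee never loses from enabling it.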


7. What 1 Billion Tokens Per Month Actually Costs

At 1B tokens/month (moderate production scale) and a 3:1 input-to-output ratio:

| Model | Monthly cost at 1B tokens |
| --- | --- |
| Gemini 1.5 Flash-8B | ~$70 |
| Gemini 1.5 Flash | ~$130 |
| GPT-4o mini | ~$260 |
| Claude 3 Haiku | ~$500 |
| Claude 3.5 Haiku | ~$1,600 |
| Gemini 1.5 Pro (≤128k) | ~$2,190 |
| GPT-4o | ~$4,380 |
| Claude 3.5 Sonnet | ~$6,000 |
| OpenAI o1 | ~$26,250 |

This is API cost only. Engineering time, retry waste, and hallucination-driven re-calls are not reflected here — see hidden LLM costs in production for the full picture.


8. Model Tier Selection Guide

| Workload | Recommended model | Rationale |
| --- | --- | --- |
| Simple Q&A, extraction, summarization | GPT-4o mini or Gemini Flash | Sufficient quality at roughly 3–6% of GPT-4o's blended cost |
| Instruction-following, structured output | Claude 3.5 Haiku | Stronger than GPT-4o mini on complex instructions |
| Complex reasoning, multi-step tasks | GPT-4o or Claude 3.5 Sonnet | Flagship quality required |
| Math, proof, constraint planning | OpenAI o3-mini or o1 | Reasoning model advantage is measurable |
| Long-document processing (>128k) | Gemini 1.5 Pro | Largest context window (2M); budget for the higher >128k rate |
| High-volume, latency-sensitive | Groq (Llama 3.1 8B or 70B) | Fastest inference, lowest cost for open-source quality |
| Workload with large stable system prompt | Any model with caching | Enable prompt caching; Anthropic gives a 90% read discount |
| Mixed workload (simple + complex) | Cascade: mid-tier → flagship | 41–55% cost savings vs all-flagship routing |

LLM API Cost Optimization Checklist

  • Pull a week of API logs and calculate your actual input:output ratio — do not assume 3:1
  • Identify your top 5 endpoints by token volume and calculate their individual blended rates
  • Check whether any high-volume endpoints are on flagship models when mid-tier would suffice
  • Enable prompt caching on any endpoint whose system prompt exceeds the provider's minimum cacheable length (1,024 tokens for Anthropic and OpenAI, 32,768 for Gemini 1.5 Pro)
  • Set explicit max_tokens on every production call — verbose outputs at $10–15/M output tokens are expensive
  • Run both the flagship and mid-tier model on a 200-request sample from your top workload; measure quality gap
  • If quality gap is small, migrate to mid-tier or build a cascade router
  • Add cost attribution tags (team, workload) to every API call for per-team visibility
  • Set a monthly spend alert at 120% of prior month to catch cost spikes early
  • Review pricing quarterly — model costs continue to fall and the optimal tier shifts
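The first and ninth checklist items can be sketched in a few lines of Python. The record layout (`input_tokens` and `output_tokens` keys), the sample records, and the function names are made up for illustration; map them to whatever fields your own API logs or billing exports provide.

```python
from typing import Iterable, Mapping


def input_output_ratio(records: Iterable[Mapping[str, int]]) -> float:
    """Total input tokens divided by total output tokens across log records."""
    total_in = 0
    total_out = 0
    for record in records:
        total_in += record["input_tokens"]
        total_out += record["output_tokens"]
    if total_out == 0:
        raise ValueError("sample contains no output tokens")
    return total_in / total_out


def spend_alert(current_month: float, prior_month: float,
                threshold: float = 1.20) -> bool:
    """True when current spend exceeds the threshold multiple of prior spend."""
    return current_month > threshold * prior_month


# Made-up sample records standing in for a week of API logs.
sample_logs = [
    {"input_tokens": 1200, "output_tokens": 300},
    {"input_tokens": 800, "output_tokens": 400},
    {"input_tokens": 3000, "output_tokens": 300},
]

ratio = input_output_ratio(sample_logs)
print(f"input:output ratio = {ratio:.1f}:1")
print("alert:", spend_alert(current_month=1300.0, prior_month=1000.0))
```

A measured ratio well above 3:1 means the blended rates on this page overstate your cost; a ratio below it means they understate it.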

Frequently Asked Questions: LLM API Pricing

What is the cheapest LLM API for production use in 2026?

Gemini 1.5 Flash-8B at $0.07/M blended is the cheapest capable model from a major provider. For open-source models via inference API, Groq's Llama 3.1 8B at ~$0.06/M blended (0.75 × $0.05 + 0.25 × $0.08) is comparable in cost and significantly faster. GPT-4o mini at $0.26/M is the cheapest option with strong instruction-following quality across a broad range of production tasks.

How much cheaper is GPT-4o mini than GPT-4o?

About 17× cheaper at blended 3:1 rates — $0.26/M vs $4.38/M. At 1B tokens/month, that's $260 vs $4,380. For most SaaS workloads where 70% of requests are routine, routing those to GPT-4o mini and reserving GPT-4o for complex requests cuts blended spend by ~55%.
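The routing arithmetic can be checked with a short sketch. It works in token share, not request share, which is an assumption: if routine requests make up 70% of tokens as well as 70% of requests, the saving comes out nearer 66%, and a figure around 55% corresponds to routine traffic carrying closer to 60% of tokens (complex requests tend to be longer). The function names are illustrative; prices are the unrounded 3:1 blended rates for GPT-4o mini and GPT-4o.

```python
def routed_blended_cost(mid_price: float, flagship_price: float,
                        mid_token_share: float) -> float:
    """Blended $/1M tokens when mid_token_share of tokens go to the mid-tier model."""
    if not 0.0 <= mid_token_share <= 1.0:
        raise ValueError("mid_token_share must be between 0 and 1")
    return mid_token_share * mid_price + (1.0 - mid_token_share) * flagship_price


def savings_vs_all_flagship(mid_price: float, flagship_price: float,
                            mid_token_share: float) -> float:
    """Fractional saving relative to sending every token to the flagship."""
    routed = routed_blended_cost(mid_price, flagship_price, mid_token_share)
    return 1.0 - routed / flagship_price


MINI, GPT4O = 0.2625, 4.375  # unrounded 3:1 blended rates from this page

saving_at_70 = savings_vs_all_flagship(MINI, GPT4O, 0.70)
saving_at_60 = savings_vs_all_flagship(MINI, GPT4O, 0.60)

print(f"70% of tokens to mini: {saving_at_70:.1%} saved")
print(f"60% of tokens to mini: {saving_at_60:.1%} saved")
```

Measuring the token share of routine traffic, not just its request count, is what determines where in that range a given workload lands.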

Do reasoning models like o1 really cost that much more?

Yes. OpenAI o1 generates 2,000–8,000 internal thinking tokens per request at $60/M output, so a typical request costs $0.12–0.48 in thinking tokens alone, before the visible response. On the same prompt, GPT-4o mini costs $0.001–0.003. Only use reasoning models for tasks where multi-step deliberation demonstrably improves output quality.

How does prompt caching reduce LLM costs?

Caching works by storing a repeated prompt prefix server-side. On subsequent requests that use the same prefix, you pay the cache read rate instead of the full input rate. Anthropic charges $0.30/M for cache reads vs $3.00/M for fresh input — a 90% reduction. If your system prompt is 2,000 tokens and you send 1M requests per month, caching saves ~$5,400/month at Claude 3.5 Sonnet rates.
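The worked example above can be reproduced with the sketch below. The ~$5,400 figure ignores cache writes; the optional `cache_writes` parameter is an added assumption showing how the 1.25× write premium trims the saving when the cache expires and has to be rewritten. The function name and the 50,000-write scenario are hypothetical.

```python
def monthly_caching_savings(prompt_tokens: int, requests: int,
                            base_price: float, read_price: float,
                            write_price: float = 0.0,
                            cache_writes: int = 0) -> float:
    """Monthly dollars saved on the cached prefix versus paying base_price every time.

    Prices are per 1M tokens. cache_writes is the number of requests that
    miss the cache and pay write_price instead of base_price.
    """
    hits = requests - cache_writes
    saved_on_hits = hits * prompt_tokens / 1_000_000 * (base_price - read_price)
    extra_on_writes = cache_writes * prompt_tokens / 1_000_000 * (write_price - base_price)
    return saved_on_hits - extra_on_writes


# 2,000-token system prompt, 1M requests/month, Claude 3.5 Sonnet rates, writes ignored.
ideal_savings = monthly_caching_savings(2_000, 1_000_000, 3.00, 0.30)

# Same workload, assuming 50,000 requests per month trigger a cache write at $3.75/M.
realistic_savings = monthly_caching_savings(
    2_000, 1_000_000, 3.00, 0.30, write_price=3.75, cache_writes=50_000
)

print(f"ideal: ${ideal_savings:,.0f}/month  with writes: ${realistic_savings:,.0f}/month")
```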

What is a blended LLM rate and how do I calculate mine?

A blended rate weights input and output token costs by your actual traffic ratio. If your workload sends 3 input tokens for every 1 output token, your blended rate = (input_price × 0.75) + (output_price × 0.25). The 3:1 ratio used in this table is a starting assumption — measure your actual ratio from API logs before making budget decisions.