Open-Source LLM Comparison: Llama vs Mistral vs Qwen vs Phi (2026)

Specs, VRAM requirements, benchmark scores, and recommended use cases for every major open-source LLM worth self-hosting — updated as new models release.

Shubham Yadav

Machine Learning Researcher

Updated June 8, 2026

New open-source models release every few weeks. This page tracks the models actually worth self-hosting — covering VRAM requirements, benchmark scores, context windows, and the workloads each model handles best.

Last verified: June 2026. The open-source model landscape moves fast. Benchmark scores and hardware requirements are based on published numbers at time of update — always check the model card before deploying.

Quick answer: Llama 3.3 70B and Qwen 2.5 72B are the strongest open-source models worth self-hosting in 2026 — both score 86+ on MMLU and fit on a single H100 80GB at int4. For single-GPU deployments (≤24GB VRAM), Llama 3.1 8B at int8 outperforms Mistral 7B on every benchmark. Qwen 2.5 14B is the best choice for math-heavy workloads on mid-tier hardware. Mistral Large 2 leads on code generation at the 123B scale. All major families now support 128k context except Mistral 7B and Mistral Small (32k).

If you know your GPU: Use the VRAM table to find every model that fits your hardware, then check benchmarks to pick the strongest one.

If you know your task: Use the model selection guide to find models optimized for your workload, then check the VRAM table to see what hardware you need.

If you're deciding between self-hosting and API: Pair this with the cloud GPU pricing and self-hosting vs API TCO pages.

This resource covers:

  • VRAM requirements — fp16, int8, and int4 memory footprints for every model
  • Benchmark scores — MMLU, HumanEval, MATH, and MT-Bench comparisons
  • Context windows — which models support 128k and which are limited to 32k
  • License summary — Apache 2.0, MIT, and Meta community license restrictions
  • Throughput benchmarks — decode tokens per second on A100 80GB with vLLM
  • Model selection guide — best model per use case and hardware tier

1. VRAM Requirements by Model and Quantization

Lower quantization (int4 < int8 < fp16) reduces VRAM at a small quality cost. For most production inference workloads, int8 is the right default — negligible quality loss, roughly half the VRAM of fp16.

| Model | Params | fp16 VRAM | int8 VRAM | int4 VRAM |
| --- | --- | --- | --- | --- |
| Phi-3.5 Mini | 3.8B | ~8 GB | ~4 GB | ~2.5 GB |
| Mistral 7B v0.3 | 7B | ~14 GB | ~8 GB | ~5 GB |
| Llama 3.1 8B | 8B | ~16 GB | ~9 GB | ~5.5 GB |
| Llama 3.2 11B Vision | 11B | ~22 GB | ~12 GB | ~7 GB |
| Qwen 2.5 14B | 14B | ~28 GB | ~15 GB | ~9 GB |
| Mistral Small 22B | 22B | ~44 GB | ~23 GB | ~13 GB |
| Llama 3.3 70B | 70B | ~140 GB | ~75 GB | ~42 GB |
| Qwen 2.5 72B | 72B | ~144 GB | ~77 GB | ~44 GB |
| Mistral Large 2 | 123B | ~246 GB | ~130 GB | ~74 GB |
| Llama 3.1 405B | 405B | ~810 GB | ~430 GB | ~243 GB |
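
The fp16 column follows directly from parameter count: each parameter takes 2 bytes, so a 70B model needs about 140 GB for weights alone. Below is a minimal sketch of that arithmetic. The function name is illustrative, and the nominal 1 and 0.5 bytes per parameter for int8 and int4 are assumptions; the table's int8 and int4 figures run somewhat higher, presumably because quantized checkpoints also store scales and keep some layers in higher precision.

```python
# Nominal bytes per parameter for each precision (weights only).
# The int8/int4 values are idealized assumptions; real quantized
# checkpoints are usually somewhat larger.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and runtime overhead.
    """
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name, params in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70), ("Llama 3.1 405B", 405)]:
        sizes = {p: weight_gb(params, p) for p in BYTES_PER_PARAM}
        print(name, sizes)
```

Treat the result as a lower bound when sizing hardware, and use the table values (or the model card) for the actual quantized footprint.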

Practical GPU configurations:

  • Single A10G / L4 (24 GB): Phi-3.5 Mini (fp16), Mistral 7B (fp16), Llama 3.1 8B (int8)
  • Single A100 40GB: Qwen 2.5 14B (fp16), Mistral Small 22B (int8)
  • Single A100 80GB / H100 80GB: Llama 3.3 70B (int4), Qwen 2.5 72B (int4)
  • 2× A100 80GB: Llama 3.3 70B (fp16), Qwen 2.5 72B (fp16)
  • 8× H100 80GB: Llama 3.1 405B (int4), Mistral Large 2 (fp16)
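
To apply the VRAM table to your own hardware, a small lookup is enough. The sketch below copies the footprints from the table and returns every model and precision that fits a given amount of GPU memory. The `headroom_gb` parameter is a hypothetical knob for reserving space for KV cache and activations; how much to reserve depends on your context length and batch size.

```python
# Footprints in GB, copied from the VRAM table in this section.
VRAM_GB = {
    "Phi-3.5 Mini": {"fp16": 8, "int8": 4, "int4": 2.5},
    "Mistral 7B v0.3": {"fp16": 14, "int8": 8, "int4": 5},
    "Llama 3.1 8B": {"fp16": 16, "int8": 9, "int4": 5.5},
    "Llama 3.2 11B Vision": {"fp16": 22, "int8": 12, "int4": 7},
    "Qwen 2.5 14B": {"fp16": 28, "int8": 15, "int4": 9},
    "Mistral Small 22B": {"fp16": 44, "int8": 23, "int4": 13},
    "Llama 3.3 70B": {"fp16": 140, "int8": 75, "int4": 42},
    "Qwen 2.5 72B": {"fp16": 144, "int8": 77, "int4": 44},
    "Mistral Large 2": {"fp16": 246, "int8": 130, "int4": 74},
    "Llama 3.1 405B": {"fp16": 810, "int8": 430, "int4": 243},
}


def fits(gpu_gb: float, headroom_gb: float = 0.0) -> dict:
    """Return {model: [precisions]} whose footprint plus headroom fits in gpu_gb."""
    result = {}
    for model, sizes in VRAM_GB.items():
        ok = [p for p, gb in sizes.items() if gb + headroom_gb <= gpu_gb]
        if ok:
            result[model] = ok
    return result


if __name__ == "__main__":
    # Total memory across cards; assumes the model is sharded evenly.
    for gpu in (24, 40, 80, 160):
        print(gpu, "GB:", fits(gpu, headroom_gb=4))
```

With zero headroom, Llama 3.3 70B at int8 (~75 GB) nominally fits an 80 GB card; reserving even a few GB for KV cache rules it out, which is why the configurations above list it at int4 on a single card.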

2. Benchmark Scores: MMLU, HumanEval, MATH, and MT-Bench

Higher is better on all benchmarks. Scores are from published evals on standard test sets.

| Model | MMLU | HumanEval | MATH | MT-Bench |
| --- | --- | --- | --- | --- |
| Phi-3.5 Mini 3.8B | 69.0 | 62.8 | 46.4 | 8.0 |
| Mistral 7B v0.3 | 64.2 | 30.5 | 13.1 | 7.6 |
| Llama 3.1 8B | 66.7 | 72.6 | 51.9 | 8.2 |
| Qwen 2.5 14B | 79.7 | 74.4 | 79.5 | 8.7 |
| Mistral Small 22B | 77.2 | 72.6 | 62.3 | 8.4 |
| Llama 3.3 70B | 86.0 | 88.4 | 77.0 | 9.1 |
| Qwen 2.5 72B | 86.1 | 86.6 | 83.1 | 9.1 |
| Mistral Large 2 123B | 84.0 | 92.1 | 74.2 | 9.0 |
| Llama 3.1 405B | 88.6 | 89.0 | 73.8 | 9.4 |

Key takeaways:

  • Qwen 2.5 72B and Llama 3.3 70B trade blows at the 70B tier: Qwen edges ahead on math (83.1 vs 77.0 MATH), Llama on code (88.4 vs 86.6 HumanEval), and MMLU is a tie within 0.1 points
  • Qwen 2.5 14B punches well above its weight on math tasks — worth considering for structured/analytical workloads where a smaller model is preferred
  • Phi-3.5 Mini at 3.8B outperforms Mistral 7B on all four benchmarks despite roughly half the parameters, making it the better of the two for single-GPU budget setups
  • Mistral Large 2 (123B) beats Llama 405B on HumanEval but loses on MMLU — the better code-focused choice at that scale
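
The takeaways above can be checked mechanically against the table. The sketch below copies the scores and defines two illustrative helpers (the names are not from any library): one finds the top model per benchmark, the other tests whether one model beats another on every benchmark listed.

```python
# Scores copied from the benchmark table in this section.
SCORES = {
    "Phi-3.5 Mini": {"MMLU": 69.0, "HumanEval": 62.8, "MATH": 46.4, "MT-Bench": 8.0},
    "Mistral 7B v0.3": {"MMLU": 64.2, "HumanEval": 30.5, "MATH": 13.1, "MT-Bench": 7.6},
    "Llama 3.1 8B": {"MMLU": 66.7, "HumanEval": 72.6, "MATH": 51.9, "MT-Bench": 8.2},
    "Qwen 2.5 14B": {"MMLU": 79.7, "HumanEval": 74.4, "MATH": 79.5, "MT-Bench": 8.7},
    "Mistral Small 22B": {"MMLU": 77.2, "HumanEval": 72.6, "MATH": 62.3, "MT-Bench": 8.4},
    "Llama 3.3 70B": {"MMLU": 86.0, "HumanEval": 88.4, "MATH": 77.0, "MT-Bench": 9.1},
    "Qwen 2.5 72B": {"MMLU": 86.1, "HumanEval": 86.6, "MATH": 83.1, "MT-Bench": 9.1},
    "Mistral Large 2": {"MMLU": 84.0, "HumanEval": 92.1, "MATH": 74.2, "MT-Bench": 9.0},
    "Llama 3.1 405B": {"MMLU": 88.6, "HumanEval": 89.0, "MATH": 73.8, "MT-Bench": 9.4},
}


def leader(benchmark: str) -> str:
    """Model with the highest score on the given benchmark."""
    return max(SCORES, key=lambda m: SCORES[m][benchmark])


def beats_on_all(a: str, b: str) -> bool:
    """True if model a scores strictly higher than model b on every benchmark."""
    return all(SCORES[a][k] > SCORES[b][k] for k in SCORES[a])


if __name__ == "__main__":
    for bench in ("MMLU", "HumanEval", "MATH", "MT-Bench"):
        print(bench, "->", leader(bench))
    print(beats_on_all("Llama 3.1 8B", "Mistral 7B v0.3"))
```

Published scores are a starting point only; the same comparison is more useful when rerun with scores measured on your own task.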

3. Context Windows: Which Models Support 128k

128k is now standard across most production-grade models. Mistral 7B and Mistral Small remain 32k — relevant if your workload uses long documents.

| Model | Context window | Notes |
| --- | --- | --- |
| Phi-3.5 Mini | 128k | |
| Mistral 7B v0.3 | 32k | |
| Llama 3.1 8B | 128k | |
| Llama 3.2 11B Vision | 128k | Multimodal (text + image) |
| Qwen 2.5 14B | 128k | |
| Mistral Small 22B | 32k | |
| Llama 3.3 70B | 128k | |
| Qwen 2.5 72B | 128k | |
| Mistral Large 2 | 128k | |
| Llama 3.1 405B | 128k | |

Avoid Mistral 7B and Mistral Small for long-document workloads (RAG, summarization, document Q&A). For pure retrieval-augmented generation where the context is large but the task is simple, Llama 3.1 8B at 128k handles most cases well at low cost.
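
A long context window is only usable if the KV cache for it fits alongside the weights. As a rough sketch, the cache per token is 2 (keys and values) × layers × KV heads × head dimension × bytes per value. The architecture numbers below for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) are assumptions taken from the published model configuration; verify them against the model's `config.json` before relying on the result.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size in bytes for one sequence of `tokens` tokens.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to an fp16/bf16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    # Assumed Llama 3.1 8B architecture: 32 layers, 8 KV heads (GQA), head_dim 128.
    arch = dict(layers=32, kv_heads=8, head_dim=128)
    for tokens in (8 * 1024, 32 * 1024, 128 * 1024):
        gib = kv_cache_bytes(tokens, **arch) / 2**30
        print(f"{tokens:>7} tokens: {gib:.1f} GiB of KV cache")
```

Under these assumptions a single full 128k-token sequence needs about 16 GiB of fp16 KV cache, roughly as much as the fp16 weights themselves, so budget memory for the context length you actually use, not just for the model.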


4. License Summary: Apache 2.0, MIT, and Meta Community License

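| Model family | License | Commercial use | Fine-tuning allowed |
| --- | --- | --- | --- |
| Llama 3.x | Meta Llama 3 Community License | Yes (with restrictions above 700M MAU) | Yes |
| Mistral 7B v0.3 | Apache 2.0 | Yes, unrestricted | Yes |
| Mistral Small 22B, Mistral Large 2 | Mistral Research License | Requires a separate commercial license from Mistral | Yes (research and non-commercial use) |
| Qwen 2.5 (most sizes, including 14B) | Apache 2.0 | Yes, unrestricted | Yes |
| Qwen 2.5 72B | Qwen License | Yes (usage threshold clause; see model card) | Yes |
| Phi-3.5 | MIT | Yes, unrestricted | Yes |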

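Mistral 7B, most Qwen 2.5 sizes, and Phi-3.5 are the most permissive for commercial deployment. Mistral Small 22B and Mistral Large 2 ship under the Mistral Research License, which requires a separate commercial license for production use, and Qwen 2.5 72B uses the Qwen License rather than Apache 2.0. Llama 3's community license is broadly permissive but has a usage threshold clause that matters at very high scale (over 700 million monthly active users). Licenses differ by model size within a family, so confirm the license on each model card.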


5. Throughput Benchmarks: vLLM on A100 80GB

Approximate token throughput for single-card inference at batch size 1, measured with vLLM. Higher is better. Decode throughput (tokens per second during generation) is the number that matters for user-facing latency.

| Model | Tokens/sec (prefill) | Tokens/sec (decode) |
| --- | --- | --- |
| Llama 3.1 8B fp16 | ~4,500 | ~2,200 |
| Qwen 2.5 14B int8 | ~2,800 | ~1,400 |
| Llama 3.3 70B int4 | ~1,100 | ~620 |
| Qwen 2.5 72B int4 | ~1,050 | ~600 |

At batch size 1, an H100 delivers roughly 2× these numbers. Serving at higher batch sizes increases throughput substantially — vLLM's continuous batching handles this automatically. For GPU hardware costs, see cloud GPU pricing.
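
To turn these throughput figures into a per-request latency estimate, divide the prompt length by the prefill rate and the output length by the decode rate, then add the two. The sketch below uses the table's numbers as-is; the function name is illustrative, and the model ignores queueing, network time, and batching effects.

```python
# (prefill tokens/sec, decode tokens/sec), copied from the throughput table.
THROUGHPUT = {
    "Llama 3.1 8B fp16": (4500, 2200),
    "Qwen 2.5 14B int8": (2800, 1400),
    "Llama 3.3 70B int4": (1100, 620),
    "Qwen 2.5 72B int4": (1050, 600),
}


def request_seconds(model: str, prompt_tokens: int, output_tokens: int) -> float:
    """Estimated time for one request: prefill time plus decode time."""
    prefill_tps, decode_tps = THROUGHPUT[model]
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    for model in THROUGHPUT:
        t = request_seconds(model, prompt_tokens=2000, output_tokens=500)
        print(f"{model}: {t:.2f} s for 2000 prompt + 500 output tokens")
```

For chat-style workloads with short prompts, the decode term dominates, which is why decode throughput is the figure to compare first.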


6. Model Selection Guide by Use Case

| Use case | Best model | Budget option | Notes |
| --- | --- | --- | --- |
| General instruction-following | Llama 3.3 70B or Qwen 2.5 72B | Llama 3.1 8B | Both 70B models score similarly; pick by license preference |
| Code generation and debugging | Mistral Large 2 (best HumanEval) | Qwen 2.5 14B | Mistral Large 2 has the highest HumanEval score in this comparison (92.1); smaller Mistral models do not lead at their size |
| Math and structured reasoning | Qwen 2.5 72B (83.1 MATH) | Qwen 2.5 14B (79.5 MATH) | Qwen holds the two highest MATH scores in this comparison |
| Single-GPU deployment (≤24GB) | Llama 3.1 8B int8 | Phi-3.5 Mini fp16 | Llama 3.1 8B outperforms Mistral 7B on every benchmark |
| Long-document processing (RAG) | Llama 3.3 70B or Qwen 2.5 72B | Llama 3.1 8B | Avoid Mistral 7B and Mistral Small because of their 32k context limit |
| Multimodal (text + image) | Llama 3.2 11B Vision | | Only model in this table with native vision support |
| Unrestricted commercial license | Mistral 7B or Qwen 2.5 14B | Phi-3.5 Mini | Apache 2.0 and MIT impose no usage restrictions; confirm the license on each model card |
| Budget self-hosting on A100 40GB | Qwen 2.5 14B fp16 or Mistral Small int8 | | Both fit comfortably on 40GB |

Self-Hosting Model Deployment Checklist

  • Check the model's VRAM requirement at your target quantization before selecting hardware
  • Confirm the model's context window supports your workload — avoid Mistral 7B/Small for long-document tasks
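  • Verify license compatibility on the model card: Apache 2.0 (Mistral 7B, most Qwen 2.5 sizes) and MIT (Phi) are unrestricted; Llama 3 has a 700M MAU clause; Mistral Small 22B, Mistral Large 2, and Qwen 2.5 72B use vendor-specific licenses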
  • Install vLLM and run a throughput benchmark at your expected batch size before provisioning production hardware
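  • Set --max-model-len to the maximum sequence length your workload needs. Left unset, vLLM defaults to the model's full context window and must be able to hold a sequence of that length in KV cache, which can fail at startup on memory-constrained GPUs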
  • Use int8 quantization as the default — negligible quality loss at roughly half the VRAM of fp16
  • Only use int4 if int8 doesn't fit your GPU — measure benchmark degradation for your specific task before using in production
  • Run the model on a 200-request sample from your actual workload and compare output quality against your API baseline before migrating
  • Calculate cost-per-token at your expected utilization and compare against the API equivalent (see LLM cost per token)
  • Set up Prometheus + DCGM metrics to monitor GPU utilization and KV-cache pressure from day one
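
The cost-per-token item in the checklist comes down to one division: hourly GPU price over the tokens you actually generate in an hour. A minimal sketch follows. The function name and the example price, throughput, and utilization are placeholders, not quotes from any provider; substitute your own measured throughput and billing rate.

```python
def cost_per_million_tokens(gpu_usd_per_hour: float, tokens_per_sec: float,
                            utilization: float) -> float:
    """USD per one million generated tokens.

    utilization is the fraction of each hour the GPU spends serving
    requests (0 < utilization <= 1). Idle time is still billed, so lower
    utilization raises the effective cost per token.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: a hypothetical $2.50/hour GPU sustaining
    # 1,500 tokens/sec aggregate throughput across batched requests.
    for util in (1.0, 0.5, 0.2):
        cost = cost_per_million_tokens(2.50, 1500, util)
        print(f"utilization {util:.0%}: ${cost:.2f} per million tokens")
```

Run this with throughput measured at your expected batch size, since aggregate throughput under batching is much higher than the batch-size-1 figures and drives the result.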

Frequently Asked Questions: Open-Source LLM Selection

What is the best open-source LLM for production use in 2026?

Llama 3.3 70B and Qwen 2.5 72B are the strongest general-purpose options — both score 86+ on MMLU and handle a wide range of tasks. For code specifically, Mistral Large 2 leads on HumanEval at 92.1. For math, Qwen 2.5 72B leads at 83.1 MATH. For single-GPU deployment on a 24GB card, Llama 3.1 8B int8 is the best balance of quality and memory efficiency.

How does Llama 3.3 70B compare to GPT-4o?

Llama 3.3 70B scores 86.0 on MMLU vs GPT-4o's approximate 88–90. On most practical instruction-following tasks, the gap is small — and self-hosting at ~$0.49/M tokens (Lambda Labs H100) is significantly cheaper than GPT-4o's $4.38/M blended rate at production volume. The difference matters most on complex multi-step reasoning, where GPT-4o has a meaningful edge.

What is the minimum VRAM needed to run a 70B LLM?

About 42GB at int4 quantization, which fits on a single A100 80GB or H100 80GB. At int8 (better quality), you need ~75GB for weights; that leaves too little headroom on an 80GB card for KV cache and activations, so plan on two A100 80GB cards or an H100 NVL 94GB. At fp16 (full precision), you need ~140GB across multiple cards. For most production use cases, int4 on a single 80GB card or int8 on an H100 NVL 94GB is the right choice.

Is Mistral still worth using now that Llama 3.1 is available?

Yes, for code at the top end. Mistral Large 2 scores 92.1 on HumanEval vs Llama 3.1 405B's 89.0, the highest score in this comparison. The smaller Mistral models do not lead: Llama 3.1 8B outperforms Mistral 7B on every benchmark, and Qwen 2.5 14B edges out Mistral Small 22B on HumanEval (74.4 vs 72.6). For general-purpose tasks, Llama 3.3 70B scores higher on MMLU than Mistral Large 2 (86.0 vs 84.0) despite having fewer parameters. Mistral Large 2 remains the strongest open-source choice in this comparison specifically for code generation workloads.

Can you use Llama commercially without restrictions?

Yes, for most applications. The Meta Llama 3 Community License permits commercial use including fine-tuning and distribution. The main commercial restriction is a usage threshold clause that applies above 700 million monthly active users, which is effectively irrelevant for all but a handful of global consumer applications. The license also carries an acceptable use policy and attribution requirements, so read the full text before shipping.