Open-Source LLM Comparison: Llama vs Mistral vs Qwen vs Phi (2026)
Specs, VRAM requirements, benchmark scores, and recommended use cases for every major open-source LLM worth self-hosting — updated as new models release.
Shubham Yadav
Machine Learning Researcher
New open-source models release every few weeks. This page tracks the models actually worth self-hosting — covering VRAM requirements, benchmark scores, context windows, and the workloads each model handles best.
Last verified: June 2026. The open-source model landscape moves fast. Benchmark scores and hardware requirements are based on published numbers at time of update — always check the model card before deploying.
Quick answer: Llama 3.3 70B and Qwen 2.5 72B are the strongest open-source models worth self-hosting in 2026: both score 86+ on MMLU and fit on a single H100 80GB at int4. For single-GPU deployments (≤24GB VRAM), Llama 3.1 8B at int8 outperforms Mistral 7B on every benchmark. Qwen 2.5 14B is the best choice for math-heavy workloads on mid-tier hardware. Mistral Large 2 (123B) has the highest HumanEval score of any model compared here. All major families now support 128k context except Mistral 7B and Mistral Small (32k).
If you know your GPU: Use the VRAM table to find every model that fits your hardware, then check benchmarks to pick the strongest one.
If you know your task: Use the model selection guide to find models optimized for your workload, then check the VRAM table to see what hardware you need.
If you're deciding between self-hosting and API: Pair this with the cloud GPU pricing and self-hosting vs API TCO pages.
This resource covers:
- VRAM requirements — fp16, int8, and int4 memory footprints for every model
- Benchmark scores — MMLU, HumanEval, MATH, and MT-Bench comparisons
- Context windows — which models support 128k and which are limited to 32k
- License summary — Apache 2.0, MIT, and Meta community license restrictions
- Throughput benchmarks — decode tokens per second on A100 80GB with vLLM
- Model selection guide — best model per use case and hardware tier
1. VRAM Requirements by Model and Quantization
Lower-precision quantization (int4 < int8 < fp16) reduces VRAM at a small quality cost. For most production inference workloads, int8 is the right default: negligible quality loss at roughly half the VRAM of fp16.
| Model | Params | fp16 VRAM | int8 VRAM | int4 VRAM |
|---|---|---|---|---|
| Phi-3.5 Mini | 3.8B | ~8 GB | ~4 GB | ~2.5 GB |
| Mistral 7B v0.3 | 7B | ~14 GB | ~8 GB | ~5 GB |
| Llama 3.1 8B | 8B | ~16 GB | ~9 GB | ~5.5 GB |
| Llama 3.2 11B Vision | 11B | ~22 GB | ~12 GB | ~7 GB |
| Qwen 2.5 14B | 14B | ~28 GB | ~15 GB | ~9 GB |
| Mistral Small 22B | 22B | ~44 GB | ~23 GB | ~13 GB |
| Llama 3.3 70B | 70B | ~140 GB | ~75 GB | ~42 GB |
| Qwen 2.5 72B | 72B | ~144 GB | ~77 GB | ~44 GB |
| Mistral Large 2 | 123B | ~246 GB | ~130 GB | ~74 GB |
| Llama 3.1 405B | 405B | ~810 GB | ~430 GB | ~243 GB |
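The fp16 column follows directly from the parameter count at two bytes per weight. Below is a minimal sketch of that arithmetic; the function name is illustrative and not part of any library. The int8 and int4 figures in the table sit somewhat above the raw result, presumably because they include quantization metadata and runtime overhead, so treat the output as a lower bound.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and any quantization metadata,
    so real deployments need more than this.
    """
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight


for name, params in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70)]:
    for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
        gb = weight_memory_gb(params, bits)
        print(f"{name} {label}: at least ~{gb:.1f} GB of weights")
```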
Practical GPU configurations:
- Single A10G / L4 (24 GB): Phi-3.5 Mini (fp16), Mistral 7B (fp16), Llama 3.1 8B (int8)
- Single A100 40GB: Qwen 2.5 14B (fp16), Mistral Small 22B (int8)
- Single A100 80GB / H100 80GB: Llama 3.3 70B (int4), Qwen 2.5 72B (int4)
- 2× A100 80GB: Llama 3.3 70B (fp16), Qwen 2.5 72B (fp16)
- 8× H100 80GB: Llama 3.1 405B (int4), Mistral Large 2 (fp16)
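The same lookup can be done programmatically. The sketch below copies the footprints from the VRAM table and returns, for each model, the highest precision that fits a given card. The `headroom` parameter (a fraction of VRAM reserved for KV cache and runtime overhead) and its 10% default are assumptions for illustration, not measured values.

```python
# Footprints in GB, copied from the VRAM table above.
VRAM_GB = {
    "Phi-3.5 Mini": {"fp16": 8, "int8": 4, "int4": 2.5},
    "Mistral 7B v0.3": {"fp16": 14, "int8": 8, "int4": 5},
    "Llama 3.1 8B": {"fp16": 16, "int8": 9, "int4": 5.5},
    "Llama 3.2 11B Vision": {"fp16": 22, "int8": 12, "int4": 7},
    "Qwen 2.5 14B": {"fp16": 28, "int8": 15, "int4": 9},
    "Mistral Small 22B": {"fp16": 44, "int8": 23, "int4": 13},
    "Llama 3.3 70B": {"fp16": 140, "int8": 75, "int4": 42},
    "Qwen 2.5 72B": {"fp16": 144, "int8": 77, "int4": 44},
    "Mistral Large 2": {"fp16": 246, "int8": 130, "int4": 74},
    "Llama 3.1 405B": {"fp16": 810, "int8": 430, "int4": 243},
}
PRECISIONS = ("fp16", "int8", "int4")  # highest precision first


def models_that_fit(gpu_vram_gb: float, headroom: float = 0.10) -> dict:
    """Map each model that fits to the highest precision that fits.

    `headroom` is an assumed fraction of VRAM kept free for KV cache
    and runtime overhead; tune it for your context length and batch size.
    """
    budget = gpu_vram_gb * (1 - headroom)
    fits = {}
    for model, footprints in VRAM_GB.items():
        for precision in PRECISIONS:
            if footprints[precision] <= budget:
                fits[model] = precision
                break
    return fits


print(models_that_fit(24))  # 24 GB card, e.g. A10G or L4
print(models_that_fit(80))  # 80 GB card, e.g. A100 80GB or H100 80GB
```

Raising `headroom` for long-context or high-batch workloads pushes borderline models down to a lower precision or out of the result entirely.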
2. Benchmark Scores: MMLU, HumanEval, MATH, and MT-Bench
Higher is better on all benchmarks. Scores are from published evals on standard test sets.
| Model | MMLU | HumanEval | MATH | MT-Bench |
|---|---|---|---|---|
| Phi-3.5 Mini 3.8B | 69.0 | 62.8 | 46.4 | 8.0 |
| Mistral 7B v0.3 | 64.2 | 30.5 | 13.1 | 7.6 |
| Llama 3.1 8B | 66.7 | 72.6 | 51.9 | 8.2 |
| Qwen 2.5 14B | 79.7 | 74.4 | 79.5 | 8.7 |
| Mistral Small 22B | 77.2 | 72.6 | 62.3 | 8.4 |
| Llama 3.3 70B | 86.0 | 88.4 | 77.0 | 9.1 |
| Qwen 2.5 72B | 86.1 | 86.6 | 83.1 | 9.1 |
| Mistral Large 2 123B | 84.0 | 92.1 | 74.2 | 9.0 |
| Llama 3.1 405B | 88.6 | 89.0 | 73.8 | 9.4 |
Key takeaways:
- Qwen 2.5 72B and Llama 3.3 70B trade blows at the 70B tier: Qwen leads on math (83.1 vs 77.0 MATH), Llama on code (88.4 vs 86.6 HumanEval), and the two are effectively tied on MMLU and MT-Bench
- Qwen 2.5 14B punches well above its weight on math tasks — worth considering for structured/analytical workloads where a smaller model is preferred
- Phi-3.5 Mini at 3.8B outperforms Mistral 7B on all four benchmarks despite roughly half the parameters, making it the better default for single-GPU budget setups
- Mistral Large 2 (123B) beats Llama 405B on HumanEval but loses on MMLU — the better code-focused choice at that scale
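These comparisons can be checked directly against the table. The sketch below copies the scores into a dictionary and exposes two small helpers, `leader` and `wins`; both names are illustrative.

```python
# Scores copied from the benchmark table above.
SCORES = {
    "Phi-3.5 Mini 3.8B": {"MMLU": 69.0, "HumanEval": 62.8, "MATH": 46.4, "MT-Bench": 8.0},
    "Mistral 7B v0.3": {"MMLU": 64.2, "HumanEval": 30.5, "MATH": 13.1, "MT-Bench": 7.6},
    "Llama 3.1 8B": {"MMLU": 66.7, "HumanEval": 72.6, "MATH": 51.9, "MT-Bench": 8.2},
    "Qwen 2.5 14B": {"MMLU": 79.7, "HumanEval": 74.4, "MATH": 79.5, "MT-Bench": 8.7},
    "Mistral Small 22B": {"MMLU": 77.2, "HumanEval": 72.6, "MATH": 62.3, "MT-Bench": 8.4},
    "Llama 3.3 70B": {"MMLU": 86.0, "HumanEval": 88.4, "MATH": 77.0, "MT-Bench": 9.1},
    "Qwen 2.5 72B": {"MMLU": 86.1, "HumanEval": 86.6, "MATH": 83.1, "MT-Bench": 9.1},
    "Mistral Large 2 123B": {"MMLU": 84.0, "HumanEval": 92.1, "MATH": 74.2, "MT-Bench": 9.0},
    "Llama 3.1 405B": {"MMLU": 88.6, "HumanEval": 89.0, "MATH": 73.8, "MT-Bench": 9.4},
}


def leader(benchmark: str) -> str:
    """Model with the highest score on the given benchmark."""
    return max(SCORES, key=lambda model: SCORES[model][benchmark])


def wins(model_a: str, model_b: str) -> list:
    """Benchmarks on which model_a scores strictly higher than model_b."""
    return [b for b in SCORES[model_a] if SCORES[model_a][b] > SCORES[model_b][b]]


for benchmark in ("MMLU", "HumanEval", "MATH", "MT-Bench"):
    print(f"{benchmark}: {leader(benchmark)}")
print(wins("Phi-3.5 Mini 3.8B", "Mistral 7B v0.3"))
```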
3. Context Windows: Which Models Support 128k
128k is now standard across most production-grade models. Mistral 7B and Mistral Small remain 32k — relevant if your workload uses long documents.
| Model | Context window | Notes |
|---|---|---|
| Phi-3.5 Mini | 128k | |
| Mistral 7B v0.3 | 32k | |
| Llama 3.1 8B | 128k | |
| Llama 3.2 11B Vision | 128k | Multimodal (text + image) |
| Qwen 2.5 14B | 128k | |
| Mistral Small 22B | 32k | |
| Llama 3.3 70B | 128k | |
| Qwen 2.5 72B | 128k | |
| Mistral Large 2 | 128k | |
| Llama 3.1 405B | 128k | |
Avoid Mistral 7B and Mistral Small for long-document workloads (RAG, summarization, document Q&A). For pure retrieval-augmented generation where the context is large but the task is simple, Llama 3.1 8B at 128k handles most cases well at low cost.
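A quick way to apply this rule is to check whether prompt plus expected output fits the window before picking a model. The sketch below is a rough pre-check under two assumptions: that "32k" and "128k" mean 32,768 and 131,072 tokens (confirm on the model card), and that about four characters make one token, which is only a rule of thumb for English text. Use the model's own tokenizer for an exact count.

```python
import math

# Assumed token limits for the nominal window sizes; confirm on the model card.
CONTEXT_WINDOW_TOKENS = {"32k": 32_768, "128k": 131_072}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is a heuristic, not a measurement."""
    return math.ceil(len(text) / chars_per_token)


def fits_context(prompt_tokens: int, max_new_tokens: int, window_tokens: int) -> bool:
    """True if the prompt and the generated output together fit in the window."""
    return prompt_tokens + max_new_tokens <= window_tokens


document = "x" * 240_000  # stand-in for a long document of ~240k characters
prompt_tokens = estimate_tokens(document)
for label, window in CONTEXT_WINDOW_TOKENS.items():
    print(label, fits_context(prompt_tokens, 1_000, window))
```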
4. License Summary: Apache 2.0, MIT, and Meta Community License
| Model family | License | Commercial use | Fine-tuning allowed |
|---|---|---|---|
| Llama 3.x | Meta Llama 3 Community License | Yes (with restrictions above 700M MAU) | Yes |
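| Mistral models | Apache 2.0 (Mistral 7B); Mistral Research License (Mistral Small 22B, Mistral Large 2) | Apache 2.0: yes, unrestricted; Research License: requires a separate commercial license | Yes |
| Qwen 2.5 | Apache 2.0 (most sizes) | Yes, unrestricted | Yes |
| Phi-3.5 | MIT | Yes, unrestricted | Yes |
Apache 2.0 and MIT releases (Mistral 7B, most Qwen 2.5 sizes, Phi-3.5) are the most permissive for commercial deployment. Mistral Small 22B and Mistral Large 2 were released under the Mistral Research License, and the Qwen 2.5 sizes outside Apache 2.0 carry their own terms, so confirm the license on each model card before deploying. Llama 3's community license is broadly permissive but has a usage threshold clause that matters at very high scale (over 700 million monthly active users).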
5. Throughput Benchmarks: vLLM on A100 80GB
Approximate token throughput for single-card inference at batch size 1, measured with vLLM. Higher is better. Decode throughput (tokens per second during generation) is the number that matters for user-facing latency.
| Model | Tokens/sec (prefill) | Tokens/sec (decode) |
|---|---|---|
| Llama 3.1 8B fp16 | ~4,500 | ~2,200 |
| Qwen 2.5 14B int8 | ~2,800 | ~1,400 |
| Llama 3.3 70B int4 | ~1,100 | ~620 |
| Qwen 2.5 72B int4 | ~1,050 | ~600 |
At batch size 1, an H100 delivers roughly 2× these numbers. Serving at higher batch sizes increases throughput substantially — vLLM's continuous batching handles this automatically. For GPU hardware costs, see cloud GPU pricing.
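To turn the table into a per-request latency figure, divide the prompt length by prefill throughput and the output length by decode throughput. The sketch below is a simplified model that ignores queueing, scheduling, and network overhead; the function name is illustrative.

```python
def response_time_s(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Approximate end-to-end time for one request, in seconds.

    Prefill processes the prompt; decode generates the output token by token.
    """
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Llama 3.3 70B int4 row from the table: ~1,100 prefill and ~620 decode tokens/sec.
print(response_time_s(prompt_tokens=1_100, output_tokens=620,
                      prefill_tps=1_100, decode_tps=620))
```

For chat-style workloads with short prompts and long answers, the decode term dominates, which is why decode throughput is the figure to watch.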
6. Model Selection Guide by Use Case
| Use case | Best model | Budget option | Notes |
|---|---|---|---|
| General instruction-following | Llama 3.3 70B or Qwen 2.5 72B | Llama 3.1 8B | Both 70B models score similarly — pick by license preference |
| Code generation and debugging | Mistral Large 2 (best HumanEval) | Qwen 2.5 14B | Mistral Large 2 has the top HumanEval score (92.1); at smaller scales, Qwen 2.5 14B (74.4) edges out Mistral Small 22B (72.6) |
| Math and structured reasoning | Qwen 2.5 72B (83.1 MATH) | Qwen 2.5 14B (79.5 MATH) | Qwen dominates math benchmarks across all sizes |
| Single-GPU deployment (≤24GB) | Llama 3.1 8B int8 | Phi-3.5 Mini fp16 | Llama 3.1 8B outperforms Mistral 7B on every benchmark |
| Long-document processing (RAG) | Llama 3.3 70B or Qwen 2.5 72B | Llama 3.1 8B | Avoid Mistral 7B and Mistral Small — 32k context limit |
| Multimodal (text + image) | Llama 3.2 11B Vision | — | Only model in this table with native vision support |
| Unrestricted commercial license | Mistral 7B or Qwen 2.5 (Apache 2.0 sizes) | Phi-3.5 Mini | Apache 2.0 and MIT impose no usage thresholds; licenses vary by model size, so check each model card |
| Budget self-hosting on A100 40GB | Qwen 2.5 14B fp16 or Mistral Small int8 | — | Both fit comfortably on 40GB |
Self-Hosting Model Deployment Checklist
- Check the model's VRAM requirement at your target quantization before selecting hardware
- Confirm the model's context window supports your workload — avoid Mistral 7B/Small for long-document tasks
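- Verify license compatibility on the model card: Apache 2.0 (Mistral 7B, most Qwen 2.5 sizes) and MIT (Phi) are unrestricted; Mistral Small 22B and Mistral Large 2 use the Mistral Research License; Llama 3 has a 700M MAU clause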
- Install vLLM and run a throughput benchmark at your expected batch size before provisioning production hardware
- Set `--max-model-len` to the maximum sequence length your workload needs; leaving it unset allocates the full context window in KV cache
- Use int8 quantization as the default: negligible quality loss at roughly half the VRAM of fp16
- Only use int4 if int8 doesn't fit your GPU — measure benchmark degradation for your specific task before using in production
- Run the model on a 200-request sample from your actual workload and compare output quality against your API baseline before migrating
- Calculate cost-per-token at your expected utilization and compare against the API equivalent (see LLM cost per token)
- Set up Prometheus + DCGM metrics to monitor GPU utilization and KV-cache pressure from day one
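The `--max-model-len` item is easier to justify with numbers. The sketch below estimates the KV cache needed for one full-length sequence from the standard per-token formula (keys plus values, per layer, per KV head). The architecture values for Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) are taken from its published config and should be checked against the model card; the function name is illustrative. It shows the per-sequence requirement, not the exact amount vLLM reserves.

```python
def kv_cache_gib(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # fp16 / bf16 cache
) -> float:
    """KV cache size in GiB for a single sequence of `seq_len` tokens."""
    # Factor of 2: one key tensor and one value tensor per layer.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return seq_len * bytes_per_token / 2**30


# Assumed Llama 3.1 8B architecture; verify against the model's config.
LLAMA_3_1_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

print(kv_cache_gib(131_072, **LLAMA_3_1_8B))  # full 128k window
print(kv_cache_gib(8_192, **LLAMA_3_1_8B))    # capped at 8k tokens
```

Under these assumptions a single 128k-token sequence needs 16 GiB of cache on top of the weights, while an 8k cap needs 1 GiB, which is the difference between fitting and not fitting on a 24GB card.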
Frequently Asked Questions: Open-Source LLM Selection
What is the best open-source LLM for production use in 2026?
Llama 3.3 70B and Qwen 2.5 72B are the strongest general-purpose options — both score 86+ on MMLU and handle a wide range of tasks. For code specifically, Mistral Large 2 leads on HumanEval at 92.1. For math, Qwen 2.5 72B leads at 83.1 MATH. For single-GPU deployment on a 24GB card, Llama 3.1 8B int8 is the best balance of quality and memory efficiency.
How does Llama 3.3 70B compare to GPT-4o?
Llama 3.3 70B scores 86.0 on MMLU vs GPT-4o's approximate 88–90. On most practical instruction-following tasks, the gap is small — and self-hosting at ~$0.49/M tokens (Lambda Labs H100) is significantly cheaper than GPT-4o's $4.38/M blended rate at production volume. The difference matters most on complex multi-step reasoning, where GPT-4o has a meaningful edge.
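A self-hosting cost per million tokens comes from the GPU's hourly price divided by the tokens it actually produces in an hour. The sketch below shows that arithmetic; the inputs in the example call are placeholders rather than quoted prices or measured throughput, and `utilization` (the fraction of each hour the GPU spends serving requests) is an assumption you need to supply.

```python
def cost_per_million_tokens(
    gpu_hourly_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """USD per one million generated tokens at the given sustained throughput."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Placeholder inputs: $2.00/hour, 1,000 tokens/sec aggregate, GPU busy half the time.
print(f"${cost_per_million_tokens(2.00, 1_000, utilization=0.5):.2f} per 1M tokens")
```

Halving utilization doubles the cost per token, so a figure computed at full load understates the real cost of a lightly used deployment.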
What is the minimum VRAM needed to run a 70B LLM?
About 42GB at int4 quantization, which fits on a single A100 80GB or H100 80GB. At int8 (better quality), you need ~75GB, which leaves too little room for KV cache on a single 80GB card and so requires either two A100 80GB cards or an H100 NVL 94GB. At fp16 (full precision), you need ~140GB across multiple cards. For most production use cases, int4 on a single 80GB card or int8 on an H100 NVL 94GB is the right choice.
Is Mistral still worth using now that Llama 3.1 is available?
Yes, for code at the high end. Mistral Large 2 has the highest HumanEval score in this comparison: 92.1 vs Llama 3.1 405B's 89.0. For general-purpose tasks, Llama 3.3 70B outscores Mistral Large 2 on MMLU (86.0 vs 84.0) with fewer parameters. For single-GPU budget deployments, Llama 3.1 8B outperforms Mistral 7B on every benchmark, including HumanEval (72.6 vs 30.5). Mistral Large 2 remains the strongest open-source choice specifically for code generation workloads.
Can you use Llama commercially without restrictions?
Yes, for most applications. The Meta Llama 3 Community License permits commercial use, including fine-tuning and distribution, subject to its attribution and acceptable-use terms. The restriction most relevant to scale is a usage threshold clause that applies above 700 million monthly active users, which is effectively irrelevant for all but a handful of global consumer applications.