Open-Source LLM Comparison: Llama vs Mistral vs Qwen vs Phi (2026)

Specs, VRAM requirements, benchmark scores, and recommended use cases for every major open-source LLM worth self-hosting — updated as new models release.

Shubham Yadav

Machine Learning Researcher

Updated June 8, 2026

On this page

1. VRAM Requirements by Model and Quantization
2. Benchmark Scores: MMLU, HumanEval, MATH, and MT-Bench
3. Context Windows: Which Models Support 128k
4. License Summary: Apache 2.0, MIT, and Meta Community License
5. Throughput Benchmarks: vLLM on A100 80GB
Model Selection Guide by Use Case
Self-Hosting Model Deployment Checklist
Frequently Asked Questions: Open-Source LLM Selection

New open-source models release every few weeks. This page tracks the models actually worth self-hosting — covering VRAM requirements, benchmark scores, context windows, and the workloads each model handles best.

Last verified: June 2026. The open-source model landscape moves fast. Benchmark scores and hardware requirements are based on published numbers at time of update — always check the model card before deploying.

Quick answer: Llama 3.3 70B and Qwen 2.5 72B are the strongest open-source models worth self-hosting in 2026: both score 86+ on MMLU and fit on a single H100 80GB at int4. For single-GPU deployments (≤24GB VRAM), Llama 3.1 8B at int8 outperforms Mistral 7B on every benchmark. Qwen 2.5 14B is the best choice for math-heavy workloads on mid-tier hardware. Mistral Large 2 (123B) posts the highest HumanEval score of any model listed. All major families now support 128k context, with two exceptions: Mistral 7B and Mistral Small (32k).

If you know your GPU: Use the VRAM table to find every model that fits your hardware, then check benchmarks to pick the strongest one.

If you know your task: Use the model selection guide to find models optimized for your workload, then check the VRAM table to see what hardware you need.

If you're deciding between self-hosting and API: Pair this with the cloud GPU pricing and self-hosting vs API TCO pages.

This resource covers:

VRAM requirements — fp16, int8, and int4 memory footprints for every model
Benchmark scores — MMLU, HumanEval, MATH, and MT-Bench comparisons
Context windows — which models support 128k and which are limited to 32k
License summary — Apache 2.0, MIT, and Meta community license restrictions
Throughput benchmarks — decode tokens per second on A100 80GB with vLLM
Model selection guide — best model per use case and hardware tier

1. VRAM Requirements by Model and Quantization

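Lower-precision quantization (int4 < int8 < fp16) reduces VRAM at some quality cost. For most production inference workloads, int8 is the right default: negligible quality loss at roughly half the VRAM of fp16. The figures below cover model weights only; KV cache and activations need additional headroom on top.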

Model	Params	fp16 VRAM	int8 VRAM	int4 VRAM
Phi-3.5 Mini	3.8B	~8 GB	~4 GB	~2.5 GB
Mistral 7B v0.3	7B	~14 GB	~8 GB	~5 GB
Llama 3.1 8B	8B	~16 GB	~9 GB	~5.5 GB
Llama 3.2 11B Vision	11B	~22 GB	~12 GB	~7 GB
Qwen 2.5 14B	14B	~28 GB	~15 GB	~9 GB
Mistral Small 22B	22B	~44 GB	~23 GB	~13 GB
Llama 3.3 70B	70B	~140 GB	~75 GB	~42 GB
Qwen 2.5 72B	72B	~144 GB	~77 GB	~44 GB
Mistral Large 2	123B	~246 GB	~130 GB	~74 GB
Llama 3.1 405B	405B	~810 GB	~430 GB	~243 GB
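
The fp16 column follows from parameter count alone: two bytes per weight. The sketch below reproduces that arithmetic. The function name is illustrative, and the comparison rows are copied from the table above. The raw int4 numbers come out below the table's figures, presumably because quantized checkpoints also store scales and keep some layers unquantized, so treat the raw result as a lower bound rather than a sizing guarantee.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: int) -> float:
    """Raw weight footprint in GB: parameters x bytes per parameter.

    Ignores KV cache, activations, and quantization metadata, so the
    result is a lower bound on what the model needs in practice.
    """
    return params_billions * bits_per_weight / 8


# (params in billions, int4 figure from the table above in GB)
TABLE_INT4_GB = {
    "Llama 3.1 8B": (8, 5.5),
    "Llama 3.3 70B": (70, 42),
    "Qwen 2.5 72B": (72, 44),
}

for name, (params, table_gb) in TABLE_INT4_GB.items():
    raw = weight_vram_gb(params, 4)
    print(f"{name}: raw int4 {raw:.1f} GB, table ~{table_gb} GB")
```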

Practical GPU configurations:

Single A10G / L4 (24 GB): Phi-3.5 Mini (fp16), Mistral 7B (fp16), Llama 3.1 8B (int8)
Single A100 40GB: Qwen 2.5 14B (fp16), Mistral Small 22B (int8)
Single A100 80GB / H100 80GB: Llama 3.3 70B (int4), Qwen 2.5 72B (int4)
2× A100 80GB: Llama 3.3 70B (fp16), Qwen 2.5 72B (fp16)
8× H100 80GB: Llama 3.1 405B (int4), Mistral Large 2 (fp16)
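
A rough way to sanity-check configurations like these is to divide a model's footprint by the usable memory per card. This is a minimal sketch: the 10% headroom reserved for KV cache and activations is an assumption, not a measured requirement, and the function name is hypothetical. Tensor-parallel serving generally wants a GPU count that divides the model's attention heads evenly, so in practice the result is usually rounded up to 2, 4, or 8.

```python
import math


def gpus_needed(footprint_gb: float, gpu_vram_gb: float = 80,
                headroom_fraction: float = 0.10) -> int:
    """Minimum number of identical GPUs whose usable memory covers the weights.

    headroom_fraction is the share of each card held back for KV cache and
    activations (an assumed value; tune it for your workload).
    """
    usable_per_gpu = gpu_vram_gb * (1 - headroom_fraction)
    return math.ceil(footprint_gb / usable_per_gpu)


# Footprints for Llama 3.3 70B from the VRAM table above, on 80 GB cards
for label, gb in (("int4", 42), ("int8", 75), ("fp16", 140)):
    print(f"Llama 3.3 70B {label}: {gpus_needed(gb)} x 80 GB")
```

With these assumptions, int8 at ~75 GB spills onto a second 80 GB card even though the weights nominally fit, because almost nothing would remain for the KV cache.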

2. Benchmark Scores: MMLU, HumanEval, MATH, and MT-Bench

Higher is better on all benchmarks. MMLU, HumanEval, and MATH are percentages (0–100); MT-Bench is a score from 1 to 10. Scores are from published evals on standard test sets.

Model	MMLU	HumanEval	MATH	MT-Bench
Phi-3.5 Mini 3.8B	69.0	62.8	46.4	8.0
Mistral 7B v0.3	64.2	30.5	13.1	7.6
Llama 3.1 8B	66.7	72.6	51.9	8.2
Qwen 2.5 14B	79.7	74.4	79.5	8.7
Mistral Small 22B	77.2	72.6	62.3	8.4
Llama 3.3 70B	86.0	88.4	77.0	9.1
Qwen 2.5 72B	86.1	86.6	83.1	9.1
Mistral Large 2 123B	84.0	92.1	74.2	9.0
Llama 3.1 405B	88.6	89.0	73.8	9.4

Key takeaways:

Qwen 2.5 72B and Llama 3.3 70B trade blows at the 70B tier: Qwen is ahead on math (83.1 vs 77.0 MATH), Llama on code (88.4 vs 86.6 HumanEval), and the two are effectively tied on MMLU and MT-Bench
Qwen 2.5 14B punches well above its weight on math tasks — worth considering for structured/analytical workloads where a smaller model is preferred
Phi-3.5 Mini at 3.8B outperforms Mistral 7B on all four benchmarks despite having roughly half the parameters, which makes it the better of the two for single-GPU budget setups
Mistral Large 2 (123B) beats Llama 405B on HumanEval but loses on MMLU — the better code-focused choice at that scale
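
The comparisons above can be checked mechanically. The sketch below copies the scores from the benchmark table into a dictionary and answers two questions: which model leads a given benchmark, and on which benchmarks one model beats another. The helper names are illustrative.

```python
# Scores copied from the benchmark table above
BENCHMARKS = ("MMLU", "HumanEval", "MATH", "MT-Bench")
SCORES = {
    "Phi-3.5 Mini 3.8B": (69.0, 62.8, 46.4, 8.0),
    "Mistral 7B v0.3": (64.2, 30.5, 13.1, 7.6),
    "Llama 3.1 8B": (66.7, 72.6, 51.9, 8.2),
    "Qwen 2.5 14B": (79.7, 74.4, 79.5, 8.7),
    "Mistral Small 22B": (77.2, 72.6, 62.3, 8.4),
    "Llama 3.3 70B": (86.0, 88.4, 77.0, 9.1),
    "Qwen 2.5 72B": (86.1, 86.6, 83.1, 9.1),
    "Mistral Large 2 123B": (84.0, 92.1, 74.2, 9.0),
    "Llama 3.1 405B": (88.6, 89.0, 73.8, 9.4),
}


def leader(benchmark: str) -> str:
    """Model with the highest score on the given benchmark."""
    i = BENCHMARKS.index(benchmark)
    return max(SCORES, key=lambda model: SCORES[model][i])


def wins(a: str, b: str) -> list[str]:
    """Benchmarks on which model a scores strictly higher than model b."""
    return [name for name, x, y in zip(BENCHMARKS, SCORES[a], SCORES[b]) if x > y]


for bench in BENCHMARKS:
    print(f"{bench}: {leader(bench)}")
print(wins("Phi-3.5 Mini 3.8B", "Mistral 7B v0.3"))
```

A strict comparison is used in wins, so ties (such as the two 70B-class models on MT-Bench) count for neither side.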

3. Context Windows: Which Models Support 128k

128k is now standard across most production-grade models. Mistral 7B and Mistral Small remain 32k — relevant if your workload uses long documents.

Model	Context window	Notes
Phi-3.5 Mini	128k
Mistral 7B v0.3	32k
Llama 3.1 8B	128k
Llama 3.2 11B Vision	128k	Multimodal (text + image)
Qwen 2.5 14B	128k
Mistral Small 22B	32k
Llama 3.3 70B	128k
Qwen 2.5 72B	128k
Mistral Large 2	128k
Llama 3.1 405B	128k

Avoid Mistral 7B and Mistral Small for long-document workloads (RAG, summarization, document Q&A). For pure retrieval-augmented generation where the context is large but the task is simple, Llama 3.1 8B at 128k handles most cases well at low cost.
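
A 128k window is only usable if the KV cache for it fits next to the weights. The sketch below estimates KV-cache size per sequence from the standard formula (one key and one value vector per layer per token). The Llama 3.1 8B shape used here (32 layers, 8 KV heads, head dimension 128) is not stated on this page; it is an assumption to confirm against the model's config.json, as is the fp16 cache precision.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int, head_dim: int,
                 bytes_per_value: int = 2) -> float:
    """KV-cache size in GiB for a single sequence of the given length."""
    # Factor of 2: each token stores both a key and a value per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * tokens
    return total_bytes / 2**30


# Assumed architecture for Llama 3.1 8B; verify in the model's config.json
LLAMA_3_1_8B = dict(layers=32, kv_heads=8, head_dim=128)

for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_gib(tokens, **LLAMA_3_1_8B)
    print(f"{tokens:>7} tokens: {size:.1f} GiB")
```

Under these assumptions a single full-length 128k sequence needs about as much memory for its cache as the fp16 weights themselves, which is why capping the context length matters on smaller cards.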

4. License Summary: Apache 2.0, MIT, and Meta Community License

Model family	License	Commercial use	Fine-tuning allowed
Llama 3.x	Meta Llama 3 Community License	Yes (with restrictions above 700M MAU)	Yes
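Mistral models	Apache 2.0 (Mistral 7B); Mistral Research License (Mistral Small 22B, Mistral Large 2)	Yes, unrestricted under Apache 2.0; research-licensed models need a separate commercial license	Yes
Qwen 2.5	Apache 2.0 (most sizes; the 3B and 72B use Qwen-specific licenses)	Yes, unrestricted under Apache 2.0; check the model card for 3B and 72B	Yes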
Phi-3.5	MIT	Yes, unrestricted	Yes

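Phi-3.5 (MIT) and the Apache 2.0 releases from Mistral and Qwen are the most permissive for commercial deployment. A license does not always cover a whole family: Mistral Small 22B and Mistral Large 2 were released under the Mistral Research License, and Qwen 2.5 72B under a Qwen-specific license, so check each model card before deploying. Llama 3's community license is broadly permissive but has a usage threshold clause that matters at very high scale (over 700 million monthly active users).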

5. Throughput Benchmarks: vLLM on A100 80GB

Approximate token throughput for single-card inference at batch size 1, measured with vLLM. Higher is better. Decode throughput (tokens per second during generation) is the number that matters for user-facing latency.

Model	Tokens/sec (prefill)	Tokens/sec (decode)
Llama 3.1 8B fp16	~4,500	~2,200
Qwen 2.5 14B int8	~2,800	~1,400
Llama 3.3 70B int4	~1,100	~620
Qwen 2.5 72B int4	~1,050	~600

At batch size 1, an H100 delivers roughly 2× these numbers. Serving at higher batch sizes increases throughput substantially — vLLM's continuous batching handles this automatically. For GPU hardware costs, see cloud GPU pricing.
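
To turn these rates into a per-request latency estimate, add the time to process the prompt (prefill) to the time to generate the reply (decode). This is a back-of-the-envelope sketch using the table's approximate numbers; the function name and the example token counts are illustrative, and real latency also includes queuing and network time.

```python
# (prefill tokens/sec, decode tokens/sec) copied from the throughput table above
THROUGHPUT = {
    "Llama 3.1 8B fp16": (4500, 2200),
    "Qwen 2.5 14B int8": (2800, 1400),
    "Llama 3.3 70B int4": (1100, 620),
    "Qwen 2.5 72B int4": (1050, 600),
}


def request_seconds(model: str, prompt_tokens: int, output_tokens: int) -> float:
    """Estimated time for one request: prefill time plus decode time."""
    prefill_tps, decode_tps = THROUGHPUT[model]
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Example workload (assumed): a 2,200-token prompt with a 620-token reply
for model in THROUGHPUT:
    print(f"{model}: {request_seconds(model, 2200, 620):.2f} s")
```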

Model Selection Guide by Use Case

Use case	Best model	Budget option	Notes
General instruction-following	Llama 3.3 70B or Qwen 2.5 72B	Llama 3.1 8B	Both 70B models score similarly — pick by license preference
Code generation and debugging	Mistral Large 2 (best HumanEval, 92.1)	Qwen 2.5 14B (74.4)	Mistral leads on code only at the largest scale; among smaller models, Qwen 2.5 14B has the top HumanEval score
Math and structured reasoning	Qwen 2.5 72B (83.1 MATH)	Qwen 2.5 14B (79.5 MATH)	Qwen dominates math benchmarks across all sizes
Single-GPU deployment (≤24GB)	Llama 3.1 8B int8	Phi-3.5 Mini fp16	Llama 3.1 8B outperforms Mistral 7B on every benchmark
Long-document processing (RAG)	Llama 3.3 70B or Qwen 2.5 72B	Llama 3.1 8B	Avoid Mistral 7B and Mistral Small — 32k context limit
Multimodal (text + image)	Llama 3.2 11B Vision	—	Only model in this table with native vision support
Unrestricted commercial license	Mistral 7B or Qwen 2.5 14B	Phi-3.5 Mini	Apache 2.0 and MIT impose no usage restrictions; confirm the license per model, since it varies by size within a family
Budget self-hosting on A100 40GB	Qwen 2.5 14B fp16 or Mistral Small int8	—	Both fit comfortably on 40GB
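
The "find what fits, then pick the strongest" workflow can be expressed as a small lookup. The sketch below combines the int8 footprints from the VRAM table with scores from the benchmark table; the class, field, and function names are illustrative. It filters on weight footprint only and reserves no headroom for KV cache, so treat a borderline fit as a prompt to measure rather than as a recommendation.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    int8_gb: float    # from the VRAM table
    mmlu: float       # from the benchmark table
    humaneval: float
    math: float


MODELS = [
    Model("Phi-3.5 Mini", 4, 69.0, 62.8, 46.4),
    Model("Mistral 7B v0.3", 8, 64.2, 30.5, 13.1),
    Model("Llama 3.1 8B", 9, 66.7, 72.6, 51.9),
    Model("Qwen 2.5 14B", 15, 79.7, 74.4, 79.5),
    Model("Mistral Small 22B", 23, 77.2, 72.6, 62.3),
    Model("Llama 3.3 70B", 75, 86.0, 88.4, 77.0),
    Model("Qwen 2.5 72B", 77, 86.1, 86.6, 83.1),
]


def strongest_that_fits(vram_gb: float, benchmark: str) -> Optional[Model]:
    """Highest-scoring model on `benchmark` whose int8 weights fit in vram_gb.

    `benchmark` is one of the Model field names: "mmlu", "humaneval", "math".
    Returns None when nothing fits.
    """
    candidates = [m for m in MODELS if m.int8_gb <= vram_gb]
    return max(candidates, key=lambda m: getattr(m, benchmark), default=None)


print(strongest_that_fits(40, "math"))
print(strongest_that_fits(12, "humaneval"))
```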

Self-Hosting Model Deployment Checklist

Check the model's VRAM requirement at your target quantization before selecting hardware
Confirm the model's context window supports your workload — avoid Mistral 7B/Small for long-document tasks
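Verify license compatibility on each model card: Apache 2.0 (Mistral 7B, most Qwen 2.5 sizes) and MIT (Phi) are unrestricted; Llama 3 has a 700M MAU clause; some larger Mistral and Qwen models ship under their own licenses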
Install vLLM and run a throughput benchmark at your expected batch size before provisioning production hardware
Set --max-model-len to the maximum sequence length your workload needs. Left unset, vLLM uses the model's full context window and can refuse to start if the KV cache cannot hold a sequence of that length
Use int8 quantization as the default — negligible quality loss at roughly half the VRAM of fp16
Only use int4 if int8 doesn't fit your GPU — measure benchmark degradation for your specific task before using in production
Run the model on a 200-request sample from your actual workload and compare output quality against your API baseline before migrating
Calculate cost-per-token at your expected utilization and compare against the API equivalent (see LLM cost per token)
Set up Prometheus + DCGM metrics to monitor GPU utilization and KV-cache pressure from day one
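
As one way to keep the context cap and memory settings from the checklist explicit, the sketch below assembles a vllm serve command line in Python. The flags shown (--max-model-len, --gpu-memory-utilization, --quantization) are vLLM options; the model ID, the 16,384-token cap, and the 0.90 memory fraction are placeholder assumptions to replace with your own values, and the helper name is hypothetical.

```python
import shlex
from typing import Optional


def vllm_serve_command(model: str, max_model_len: int,
                       gpu_memory_utilization: float = 0.90,
                       quantization: Optional[str] = None) -> list[str]:
    """Build an argument list for `vllm serve` with an explicit context cap."""
    args = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if quantization:
        # Only pass a method that vLLM supports for the chosen checkpoint.
        args += ["--quantization", quantization]
    return args


# Placeholder model ID and context cap; substitute your own
cmd = vllm_serve_command("meta-llama/Llama-3.1-8B-Instruct", max_model_len=16384)
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting issues.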

Frequently Asked Questions: Open-Source LLM Selection

What is the best open-source LLM for production use in 2026?

Llama 3.3 70B and Qwen 2.5 72B are the strongest general-purpose options — both score 86+ on MMLU and handle a wide range of tasks. For code specifically, Mistral Large 2 leads on HumanEval at 92.1. For math, Qwen 2.5 72B leads at 83.1 MATH. For single-GPU deployment on a 24GB card, Llama 3.1 8B int8 is the best balance of quality and memory efficiency.

How does Llama 3.3 70B compare to GPT-4o?

Llama 3.3 70B scores 86.0 on MMLU vs GPT-4o's approximate 88–90. On most practical instruction-following tasks, the gap is small — and self-hosting at ~$0.49/M tokens (Lambda Labs H100) is significantly cheaper than GPT-4o's $4.38/M blended rate at production volume. The difference matters most on complex multi-step reasoning, where GPT-4o has a meaningful edge.
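
A self-hosted cost-per-token figure like the one above depends on three inputs: the hourly GPU price, the aggregate tokens per second the server sustains, and how much of each hour it is actually busy. This page does not list the inputs behind its ~$0.49/M number, so the values in the sketch below are placeholders chosen for round arithmetic, not a reconstruction of that figure.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per million generated tokens for a self-hosted GPU.

    utilization is the fraction of each hour the GPU spends serving traffic.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical inputs: $3.60/hour GPU sustaining 2,000 tokens/sec in aggregate
for utilization in (1.0, 0.5, 0.25):
    cost = cost_per_million_tokens(3.60, 2000, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million tokens")
```

Halving utilization doubles the effective cost per token, which is why the comparison against an API rate should use expected rather than peak load.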

What is the minimum VRAM needed to run a 70B LLM?

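Roughly 42GB at int4 quantization, which fits on a single A100 80GB or H100 80GB with room left for the KV cache. At int8 (better quality), the weights alone take ~75GB, which leaves too little headroom on an 80GB card, so plan for either two A100 80GB cards or an H100 NVL 94GB. At fp16 (full precision), you need ~140GB across multiple cards. For most production use cases, int4 on a single 80GB card is the practical choice, with int8 reserved for setups that have the extra memory.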

Is Mistral still worth using now that Llama 3.1 is available?

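Yes, for code at the largest scale. Mistral Large 2 has the highest HumanEval score in the table: 92.1 vs Llama 405B's 89.0. That lead does not carry down to smaller sizes: Mistral 7B scores 30.5 on HumanEval against Llama 3.1 8B's 72.6, and Mistral Small 22B (72.6) trails Qwen 2.5 14B (74.4). For general-purpose tasks, Llama 3.3 70B outscores the larger Mistral Large 2 on MMLU (86.0 vs 84.0) and MATH (77.0 vs 74.2). For single-GPU budget deployments, Llama 3.1 8B outperforms Mistral 7B on every benchmark. Mistral Large 2 remains the strongest open-source choice specifically for code generation if you have the hardware for a 123B model.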

Can you use Llama commercially without restrictions?

Yes, for most applications. The Meta Llama 3 Community License permits commercial use, including fine-tuning and distribution. Its most significant restriction is a usage threshold clause that applies above 700 million monthly active users, which is effectively irrelevant for all but a handful of global consumer applications. The license also carries attribution and acceptable-use terms, so read it in full before shipping.
