
LoRA Fine-Tuning vs Full Fine-Tuning: Which Should You Use?

LoRA vs full fine-tuning: how they differ in GPU cost, trainable parameters, and accuracy — and when each is the right choice.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 12 min read

Quick answer: Full fine-tuning updates every weight in the model and needs enough GPU memory to hold the weights, their gradients, and the optimizer states: roughly 16 bytes per parameter with mixed-precision Adam, so a 7B model needs ~112 GB before activations and has to be sharded across multiple high-end GPUs. LoRA freezes the original weights and trains only small "adapter" matrices, typically under 1% of the parameters, which cuts training memory by roughly 3–4× and lets a 7B model fit on a single consumer GPU. On most tasks LoRA reaches within 1–2% of full fine-tuning accuracy; the gap only widens when you're teaching the model genuinely new knowledge or a new language at scale. For style, format, and domain adaptation, LoRA is the default; full fine-tuning is reserved for large behavioral or knowledge changes where you have the hardware.


What is fine-tuning, and why does the method matter?

Fine-tuning is the process of taking a pre-trained model and continuing to train it on your own data so it adapts to a specific task, domain, or style. The pre-trained model already knows language and general reasoning; fine-tuning specializes it.

The question isn't whether to fine-tune but how much of the model to change. That single decision — update all the weights, or just a tiny slice — drives everything: GPU cost, training time, storage, how many tasks you can serve, and how much accuracy you ultimately get. Full fine-tuning and LoRA sit at the two ends of that spectrum.


How full fine-tuning works (and why it's expensive)

Full fine-tuning updates every parameter in the model via backpropagation, exactly like the original training but on new data. It's conceptually simple and gives the model maximum freedom to change its behavior.

The cost comes from what you must hold in GPU memory during training. It's not just the weights — it's the weights plus everything the optimizer needs:

Component (per parameter, FP16/mixed precision with Adam) | Bytes
Model weights (FP16) | 2
Gradients (FP16) | 2
Optimizer state: Adam momentum (FP32) | 4
Optimizer state: Adam variance (FP32) | 4
Master weights copy (FP32, for mixed precision) | 4
Total | ~16

That's the headline number: full fine-tuning needs roughly 16 bytes per parameter, before counting activations. For a 7B model that's ~112 GB just for the training state — more than a single 80 GB A100 holds. You need multiple GPUs and distributed training (FSDP, DeepSpeed ZeRO) to fit it.

Model size | Approx. memory for full fine-tuning | Hardware reality
7B | ~112 GB+ | 2× A100 80GB (with sharding)
13B | ~210 GB+ | 4× A100 80GB
70B | ~1.1 TB+ | 16× A100 / a node cluster
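
The figures in the table above follow directly from the per-parameter byte counts. A minimal sketch of that arithmetic, assuming 1 GB = 10^9 bytes and ignoring activations (the function name is illustrative, not from any library):

```python
# Bytes held per parameter during mixed-precision Adam training,
# taken from the component table above.
BYTES_PER_PARAM = {
    "weights_fp16": 2,
    "gradients_fp16": 2,
    "adam_momentum_fp32": 4,
    "adam_variance_fp32": 4,
    "master_weights_fp32": 4,
}


def full_ft_memory_gb(num_params: float) -> float:
    """Training-state memory in GB (1 GB = 1e9 bytes), excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{full_ft_memory_gb(n):,.0f} GB")
```

Activations, which scale with batch size and sequence length, come on top of these numbers, which is why the table marks each figure with a "+".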

How LoRA works: train small, freeze the rest

LoRA (Low-Rank Adaptation) freezes the entire pre-trained model and injects small, trainable matrices alongside the existing weight matrices. Only those small matrices are trained.

The insight behind LoRA is that the change a model needs during fine-tuning has low "intrinsic rank" — it can be captured by a much smaller set of numbers than the full weight matrix. So instead of learning a full update matrix ΔW (which is as big as the original weight W), LoRA learns two skinny matrices whose product approximates it.

The math, simply

A weight matrix W has dimensions d × d (say 4096 × 4096 ≈ 16.8M values). Full fine-tuning learns a full-size update ΔW of the same 16.8M values. LoRA instead represents that update as the product of two low-rank matrices:

ΔW ≈ B · A
where  A is r × d   and   B is d × r

r is the rank — a small number like 8, 16, or 64. With d = 4096 and r = 16:

Approach | Trainable values for this one matrix
Full fine-tuning | 4096 × 4096 = 16,777,216
LoRA (r = 16) | (16 × 4096) + (4096 × 16) = 131,072

That's a 128× reduction in trainable parameters for that layer. During the forward pass, the output becomes W·x + (α/r)·(B·A)·x: the frozen original plus the small learned adjustment, scaled by the factor α/r, where α is a fixed hyperparameter (lora_alpha in the code later in this post).
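
To make the forward pass concrete, here is a toy sketch in plain Python, with no deep-learning framework. The helper names, the tiny dimensions, and the zero initialization of B are assumptions for illustration; initializing B to zero is the usual LoRA convention, so the adapter starts out as a no-op.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x). W is frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A is r x d, B is d x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Trainable values in one adapter: A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


random.seed(0)
d, r, alpha = 8, 2, 4  # toy sizes, far smaller than a real layer
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the update B·A starts at zero
x = [random.gauss(0, 1) for _ in range(d)]

# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

print(lora_param_count(4096, 4096, 16))                 # 131072
print(4096 * 4096 // lora_param_count(4096, 4096, 16))  # 128
```

Note that B·A is never materialized as a d × d matrix here: applying A first and then B keeps the extra compute proportional to r.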

Why this slashes memory

Because only A and B are trained, you need gradients and optimizer states only for those, typically under 1% of the parameters. The huge frozen W needs just its 2 bytes of weight storage per parameter: no gradient, no optimizer state. The 16-bytes-per-parameter burden now applies only to the tiny adapter. That's how LoRA training of a 7B model (about 14 GB of FP16 weights, plus the adapter state and activations) fits on a single 24 GB consumer GPU where full fine-tuning needs multiple 80 GB GPUs.
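
A rough sketch of that accounting, under stated assumptions: the 8,388,608 adapter parameters correspond to r = 16 adapters on two 4096 × 4096 projections in each of 32 layers (an illustrative configuration), and activations are again left out.

```python
def lora_memory_gb(base_params, adapter_params, base_bytes=2, trainable_bytes=16):
    """Return (frozen base GB, adapter training-state GB), with 1 GB = 1e9 bytes."""
    frozen = base_params * base_bytes            # weights only: no grads, no optimizer
    trainable = adapter_params * trainable_bytes  # full 16 bytes/param, but tiny
    return frozen / 1e9, trainable / 1e9


frozen_gb, adapter_gb = lora_memory_gb(base_params=7e9, adapter_params=8_388_608)
print(f"frozen base: {frozen_gb:.1f} GB, adapter state: {adapter_gb:.2f} GB")
# frozen base: 14.0 GB, adapter state: 0.13 GB
```

Almost all of the remaining footprint is the frozen base itself, which is the part QLoRA goes after next.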

QLoRA: going even smaller

QLoRA pushes this further by loading the frozen base model in 4-bit (NF4 quantization) while training the LoRA adapters in higher precision. The frozen weights now take about 0.5 bytes each instead of 2, so a 70B model's weights shrink from ~140 GB to ~35 GB and can be fine-tuned on a single 48 GB GPU. The trainable adapters stay in 16-bit, preserving quality. QLoRA is what made fine-tuning large models accessible to individuals.
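
The effect of the 4-bit base is easiest to see side by side. This sketch only counts weight storage; it ignores the small overhead of quantization constants, the adapter state, and activations, so treat the output as a lower bound rather than a measured footprint.

```python
# Approximate storage per frozen weight, as quoted in the text.
BYTES_PER_WEIGHT = {"fp16": 2.0, "nf4": 0.5}


def base_weights_gb(num_params: float, precision: str) -> float:
    """Storage for the frozen base weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = base_weights_gb(n, "fp16")
    nf4 = base_weights_gb(n, "nf4")
    print(f"{name}: {fp16:.1f} GB in FP16 -> {nf4:.1f} GB in 4-bit")
```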


LoRA vs full fine-tuning: the head-to-head

Dimension | Full fine-tuning | LoRA
Parameters trained | 100% | Typically <1%
Memory (per param) | ~16 bytes | ~2 bytes (frozen) + tiny adapter
7B hardware | 2× A100 80GB | 1× RTX 4090 24GB (or less with QLoRA)
Training speed | Slower (all weights + optimizer) | Faster (few trainable params)
Storage per fine-tune | Full model copy (~14 GB for 7B) | Just the adapter (~10–200 MB)
Serving many tasks | One full model per task | One base + many swappable adapters
Catastrophic forgetting | Higher risk (all weights move) | Lower risk (base frozen)
Max behavioral change | Highest | High, but bounded by rank
Best for | New knowledge, large domain shifts | Style, format, task/domain adaptation

The accuracy question: how close does LoRA get?

This is the crux. Here's the honest picture, separating where LoRA matches full fine-tuning from where it falls behind.

Where LoRA matches full fine-tuning (within ~1–2%)

  • Style and format adaptation — teaching tone, output structure, JSON formatting, persona. LoRA is essentially indistinguishable here.
  • Task specialization — classification, extraction, summarization in a known domain.
  • Instruction tuning — adapting a base model to follow instructions. The original LoRA and QLoRA papers showed parity with full fine-tuning on these.
  • Domain adaptation where the knowledge already exists in the model and just needs surfacing (legal, medical phrasing the model has seen).

For the large majority of practical fine-tuning jobs, LoRA lands within 1–2% of full fine-tuning on benchmark accuracy — a gap most applications can't distinguish from noise.

Where full fine-tuning pulls ahead

  • Injecting substantial new knowledge the model never saw in pre-training. Low-rank updates have limited capacity to write large amounts of new factual content.
  • Learning a new language or domain at scale — a big distribution shift needs more representational capacity than a small adapter provides.
  • Very large, diverse fine-tuning datasets (hundreds of thousands+ examples) where the extra capacity of full fine-tuning is actually used.
  • Squeezing the last 1–3% on a leaderboard where every point counts and budget is no object.

The rank trade-off

LoRA's capacity is set by the rank r. Raising it narrows the accuracy gap at the cost of more trainable parameters:

Rank (r) | Capacity | Typical use
4–8 | Low: lightweight adaptation | Simple style/format tasks
16–32 | Moderate: the common default | Most instruction tuning and domain adaptation
64–128 | High: closes gap on hard tasks | Complex tasks, larger datasets
256+ | Approaching full FT capacity | When LoRA underperforms and you have the memory

If LoRA underperforms, the first lever is to raise the rank before abandoning it for full fine-tuning. Often r = 64 recovers most of the gap.
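
The cost side of that trade-off is linear in r. A small sketch, assuming a 7B-class model with 32 layers, hidden size 4096, and adapters on two square projections per layer (these shapes are assumptions for illustration; the default arguments can be changed for other models):

```python
def lora_trainable_params(r, hidden=4096, layers=32, modules_per_layer=2):
    """Trainable LoRA values when each adapted module is a hidden x hidden matrix."""
    per_module = r * hidden + hidden * r  # A (r x d) plus B (d x r)
    return per_module * modules_per_layer * layers


for r in (8, 16, 64, 256):
    print(f"r={r:>3}: {lora_trainable_params(r):>11,} trainable params")
# r=  8:   4,194,304 trainable params
# r= 16:   8,388,608 trainable params
# r= 64:  33,554,432 trainable params
# r=256: 134,217,728 trainable params
```

Even at r = 256 this configuration trains only about 2% of a ~6.7B-parameter model, which is why raising the rank is a cheap experiment compared with switching to full fine-tuning.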


When should you use each? A decision guide

Reach for LoRA (or QLoRA) when…

  • You're on a single GPU or a tight budget — this covers most individuals and teams.
  • You're adapting style, format, tone, or a specific task rather than teaching new facts.
  • You need to serve many specialized variants — keep one base model and hot-swap lightweight adapters per task or customer.
  • You want fast iteration — LoRA trains quicker and adapters are tiny to store and version.

This is why LoRA is the right default for roughly 80% of fine-tuning needs.

Reach for full fine-tuning when…

  • You're injecting significant new knowledge or doing a large domain/language shift.
  • You have multi-GPU hardware and the budget to use it.
  • You have a large, diverse dataset that can actually exploit the extra capacity.
  • You need the absolute maximum accuracy and have measured that LoRA (even at high rank) leaves a meaningful gap.

A useful rule of thumb

Start with LoRA. Measure. Only escalate to full fine-tuning if a high-rank LoRA still leaves an accuracy gap you can't tolerate — and you have the hardware to close it. The cost asymmetry is so large that LoRA-first is almost always the economically correct order.
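
As a sketch, the escalation order can be written down as a small helper. Everything here is hypothetical: the function name, the 4× rank step (mirroring the 16 → 64 jump suggested earlier), and the rank ceiling of 256 are illustrative choices, and `gap` stands for whatever accuracy shortfall you measure on your own evaluation set.

```python
def next_step(gap, tolerance, rank, has_multi_gpu, max_rank=256):
    """Suggest the next experiment given the measured accuracy gap to the target."""
    if gap <= tolerance:
        return "keep LoRA"
    if rank < max_rank:
        # Cheapest lever first: more adapter capacity.
        return f"raise LoRA rank to {min(rank * 4, max_rank)}"
    if has_multi_gpu:
        return "escalate to full fine-tuning"
    return "keep LoRA: no hardware for full fine-tuning"


print(next_step(gap=0.01, tolerance=0.02, rank=16, has_multi_gpu=False))
print(next_step(gap=0.05, tolerance=0.02, rank=16, has_multi_gpu=True))
print(next_step(gap=0.05, tolerance=0.02, rank=256, has_multi_gpu=True))
# keep LoRA
# raise LoRA rank to 64
# escalate to full fine-tuning
```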


How to fine-tune with LoRA (code example)

Using Hugging Face PEFT, LoRA is a few lines on top of a normal training loop:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                       # rank — capacity of the adapter
    lora_alpha=32,             # scaling factor (alpha/r is applied to the update)
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g. "trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243"

QLoRA — same thing, but load the base in 4-bit

# requires the bitsandbytes package to be installed
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,   # frozen base in 4-bit
    device_map="auto",
)
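# optionally call peft.prepare_model_for_kbit_training(model) first (commonly recommended for k-bit training),
# then wrap with get_peft_model(model, lora_config) exactly as above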

Serving: swap adapters without reloading the base

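from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# the first adapter is registered under the name "A" and becomes the active one
model = PeftModel.from_pretrained(base, "my-customer-A-adapter", adapter_name="A")
# later, load another task's adapter against the same base in memory...
model.load_adapter("my-customer-B-adapter", adapter_name="B")
# ...and activate it; load_adapter alone does not switch the active adapter
model.set_adapter("B")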

This adapter-swapping is a major operational advantage: one base model in memory, many tasks served.
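
The storage side of that advantage is simple arithmetic. A sketch, assuming a ~14 GB FP16 base and a 50 MB adapter (an illustrative value inside the 10–200 MB range quoted earlier), with 1 GB taken as 1000 MB:

```python
def storage_gb(num_tasks, base_gb=14.0, adapter_mb=50.0):
    """Return (full fine-tuning storage, LoRA storage) in GB for num_tasks variants."""
    full_ft = num_tasks * base_gb                     # one full model copy per task
    lora = base_gb + num_tasks * adapter_mb / 1000    # one shared base + small adapters
    return full_ft, lora


full_ft, lora = storage_gb(num_tasks=20)
print(f"20 tasks: {full_ft:.0f} GB with full copies vs {lora:.0f} GB with adapters")
# 20 tasks: 280 GB with full copies vs 15 GB with adapters
```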


Common pitfalls and how to avoid them

Pitfall | Why it hurts | Fix
Rank too low for a hard task | Adapter lacks capacity → underfits, blamed on "LoRA can't" | Raise r (try 32 → 64) before switching to full FT
Targeting too few modules | Only q_proj/v_proj limits expressiveness on complex tasks | Add k_proj, o_proj, and MLP layers (gate/up/down)
Expecting LoRA to teach large new knowledge | Low-rank updates can't store much new factual content | Use full FT, or RAG for knowledge instead of fine-tuning
Wrong alpha/rank ratio | Update scaled too weakly or too strongly | Common practice: alpha ≈ 2×r; tune if unstable
Merging adapter then quantizing (QLoRA) | Merging into 4-bit base degrades quality | Merge into the FP16 base, or serve adapter unmerged
Comparing LoRA to full FT on different data/epochs | Apples to oranges | Hold dataset, epochs, and eval fixed when comparing
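
To see what the "targeting too few modules" fix costs, this sketch counts adapter parameters for different target_modules lists. The projection shapes are assumptions based on Llama-2-7B (hidden size 4096, MLP width 11008, 32 layers); other models will differ.

```python
# (d_in, d_out) of each linear projection in one decoder layer.
# Assumed Llama-2-7B shapes; check your model's config before relying on them.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 4096),
    "v_proj": (4096, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 11008),
    "up_proj": (4096, 11008),
    "down_proj": (11008, 4096),
}


def adapter_params(target_modules, r=16, layers=32):
    """Total trainable LoRA values: r * (d_in + d_out) per adapted module per layer."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return layers * per_layer


attention_only = adapter_params(["q_proj", "v_proj"])
all_linear = adapter_params(list(SHAPES))
print(f"q_proj + v_proj: {attention_only:,}")  # 8,388,608
print(f"all linear:      {all_linear:,}")      # 39,976,960
```

Adapting every linear layer costs roughly five times as many trainable parameters as q_proj and v_proj alone at the same rank, yet still well under 1% of the base model.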

Frequently asked questions

What is the main difference between LoRA and full fine-tuning?

Full fine-tuning updates every parameter in the model and needs memory for all weights plus their gradients and optimizer states (~16 bytes per parameter). LoRA freezes the original weights and trains only small low-rank adapter matrices, typically under 1% of the parameters, so it needs a fraction of the memory and fits on a single consumer GPU. Full fine-tuning offers maximum flexibility; LoRA offers near-equal accuracy at a tiny fraction of the cost.

How much accuracy do you lose with LoRA?

On most practical tasks (style adaptation, formatting, task specialization, instruction tuning) LoRA lands within about 1–2% of full fine-tuning, often indistinguishable from noise. The gap widens when you're injecting substantial new knowledge, adapting to a new language, or training on very large diverse datasets, where full fine-tuning's extra capacity matters. Raising the LoRA rank narrows the gap on harder tasks.

What is the rank (r) in LoRA?

The rank controls the size and capacity of the adapter matrices. A higher rank means more trainable parameters and more ability to capture complex changes, at higher memory cost. Common values are 8–32 for typical tasks and 64–128 for harder tasks or larger datasets. If LoRA underperforms, raising the rank is the first thing to try.

What is QLoRA and how is it different from LoRA?

QLoRA loads the frozen base model in 4-bit quantization (NF4) while training the LoRA adapters in 16-bit. This shrinks the frozen weights from 2 bytes to about 0.5 bytes per parameter, letting you fine-tune very large models, even 70B, on a single GPU. The adapters stay high-precision, so quality is preserved. In short, QLoRA is standard LoRA plus a quantized base.

Can I get full fine-tuning's accuracy at LoRA's cost?

Often, yes, for most tasks. By choosing an adequate rank (e.g., 64) and targeting more modules (attention + MLP layers), LoRA closes most of the accuracy gap while keeping the cost advantage. The cases where you genuinely can't are large new-knowledge injection or big distribution shifts, where full fine-tuning's capacity is actually required.

Why does LoRA let me serve many tasks cheaply?

Because the base model is frozen and shared, you store only the small adapter (10–200 MB) per task instead of a full model copy (~14 GB for 7B). At serving time, one base model sits in GPU memory and you hot-swap adapters per task or customer. Full fine-tuning would require a separate full model per task, which is far more expensive to store and serve.


Key takeaways

  • Full fine-tuning costs ~16 bytes per parameter (weights + gradients + optimizer states), needing multiple high-end GPUs even for a 7B model.
  • LoRA trains <1% of parameters by learning low-rank adapter matrices, fitting on a single consumer GPU; QLoRA adds a 4-bit frozen base to reach 70B on one GPU.
  • On most tasks LoRA is within 1–2% of full fine-tuning; the gap widens only for large new-knowledge injection or big distribution shifts.
  • Rank (r) is the capacity dial — raise it (16 → 64) before abandoning LoRA for full fine-tuning.
  • LoRA's operational win: one frozen base + many swappable lightweight adapters, versus a full model copy per task.
  • Default to LoRA first, measure, and escalate to full fine-tuning only when a high-rank LoRA leaves an intolerable gap and you have the hardware.