Context Engineering: Improve LLM Accuracy Without Fine-Tuning

Context engineering — deciding what goes into the model's context window, in what form and order — and why it closes most of the accuracy gap teams reach for fine-tuning to fix.

Mohammed Kafeel

Machine Learning Researcher

June 8, 2026 · 13 min read

Quick answer: Context engineering is the practice of deliberately deciding what information goes into the model's context window, in what form, and in what order — so the model produces better answers without any fine-tuning or weight changes. It works because an LLM's accuracy is bounded less by what it knows and more by what you put in front of it at inference time. The highest-leverage moves are: give the model the right facts (retrieval), cut the irrelevant ones (because long contexts degrade accuracy), put critical information at the start or end (models lose the middle), structure the input so the relevant parts are easy to find, and show worked examples of the format you want. Done well, context engineering closes most of the accuracy gap people reach for fine-tuning to fix — at a fraction of the cost.


What is context engineering?

Context engineering is the discipline of designing everything that enters the model's context window for a given request — system instructions, retrieved documents, examples, tool outputs, conversation history, and the user's query — to maximize answer quality. It treats the context window as a scarce, carefully managed resource rather than a dumping ground.

It's broader than prompt engineering. Prompt engineering is about wording a single instruction well. Context engineering is about the entire information payload: which documents to retrieve, how much history to keep, where to place the key fact, what to leave out, and how to format it all. The model's weights never change — only what you feed it does.

The premise is simple but underappreciated: a frozen model with excellent context beats a fine-tuned model with poor context on most real tasks. Before spending on fine-tuning, fix the context.


Why context determines accuracy more than you'd think

An LLM is a function from context to output. Given identical weights and decoding settings, the only thing that varies between a correct and an incorrect answer is the context you provide. Three properties of how LLMs use context make this the dominant lever:

  1. Models can't use what they aren't given. If the answer depends on a fact not in the model's training data and not in the context, the model will guess — confidently. Retrieval fixes this by supplying the fact.

  2. Models degrade with irrelevant context. Counterintuitively, more context often means worse answers. Padding the window with marginally-relevant documents dilutes attention and introduces distractors. Precision beats volume.

  3. Models don't read the context uniformly. Information at the start and end of a long context is used far more reliably than information buried in the middle — the "lost in the middle" effect. Where you place a fact changes whether it's used.

Each of these is a knob you control entirely from outside the model.


The core techniques of context engineering

1. Retrieval: give the model the facts it lacks

Retrieval-augmented generation (RAG) fetches relevant documents from an external source and inserts them into the context so the model answers from real data instead of parametric memory. This is the single biggest accuracy lever for knowledge-dependent tasks.

The accuracy gains come from two things: the model now has the actual facts, and you can ground its answer in citable sources (reducing hallucination). But retrieval quality is everything — retrieve the wrong chunks and you've added confident distractors. Key levers:

  • Chunk size and boundaries. Chunks that split a fact across two pieces, or bundle unrelated facts, hurt retrieval. Chunk on semantic boundaries (sections, paragraphs).
  • Re-ranking. A first-pass vector search is recall-oriented; a re-ranker reorders the top candidates by true relevance, so the best chunks land in context.
  • Retrieve few, high-quality chunks rather than many mediocre ones — see the long-context degradation point above.
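
The chunking and re-ranking levers above can be sketched in a few lines. This is a minimal illustration, not a production retriever: `chunk_by_paragraph` and `rerank_by_overlap` are hypothetical helper names, blank lines are assumed to mark semantic boundaries, and the word-overlap score is only a stand-in for a real re-ranking model.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Assumption: blank lines mark semantic boundaries. A single paragraph
    longer than max_chars is kept whole rather than split mid-fact.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the chunk on a paragraph boundary
            current = para
    if current:
        chunks.append(current)
    return chunks


def rerank_by_overlap(query: str, chunks: list[str], top_k: int = 4) -> list[str]:
    """Toy re-ranker: order chunks by how many query words they share.

    Stand-in only; a real system would score (query, chunk) pairs with a model.
    """
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(query_terms & set(chunk.lower().split()))

    # sorted() is stable, so ties keep their first-pass retrieval order.
    return sorted(chunks, key=score, reverse=True)[:top_k]
```

Keeping `top_k` small is the point: the second pass exists to let you pass fewer chunks into the context, not more.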

2. Context compression: cut the irrelevant

Because long contexts degrade accuracy, removing irrelevant content is itself an accuracy technique — not just a cost optimization. Strategies:

  • Filter before you fill. Only include retrieved chunks above a relevance threshold; don't pad to a token budget for its own sake.
  • Summarize stale history. In long conversations, replace old turns with a running summary instead of carrying every message verbatim.
  • Strip boilerplate. Remove headers, footers, navigation, and repeated legal text from retrieved documents — they're pure distraction tokens.

The mental model: every token you add competes for the model's attention. Irrelevant tokens have negative value.
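
A rough sketch of the first two strategies, under stated assumptions: relevance scores are taken to be floats where higher is better, the `0.5` threshold is an arbitrary placeholder you would tune on your own eval set, and `summarize` is a hypothetical callable (for example, a call to your model) that you supply.

```python
from typing import Callable


def filter_chunks(
    scored_chunks: list[tuple[str, float]],
    min_score: float = 0.5,
    max_chunks: int = 4,
) -> list[str]:
    """Keep only chunks at or above min_score, best first, capped at max_chunks.

    If few chunks clear the threshold, fewer are returned; the budget is a
    ceiling, not a target to pad up to.
    """
    kept = [(text, score) for text, score in scored_chunks if score >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_chunks]]


def compact_history(
    turns: list[str],
    summarize: Callable[[list[str]], str],
    keep_recent: int = 4,
) -> list[str]:
    """Replace all but the last keep_recent turns with one summary line."""
    if len(turns) <= keep_recent:
        return list(turns)
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    return [f"Summary of earlier conversation: {summarize(old)}"] + recent
```

Both functions only ever shrink the context, which matches the mental model: a token that does not earn its place is removed rather than tolerated.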

3. Positioning: exploit "lost in the middle"

Models attend most reliably to the beginning and end of the context, and least reliably to the middle. Use this deliberately:

  • Put the most critical instruction or fact at the very start or very end of the context — not buried among retrieved documents.
  • In RAG, place the highest-relevance chunk last (closest to the question) or first, not in the middle of the pack.
  • Repeat a critical constraint at both the top (in the system prompt) and bottom (just before the answer) for high-stakes requirements.
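
One way to apply the second point is to reorder chunks so the strongest sit at the edges and the weakest fall in the middle. The helper below is an illustrative sketch; `order_for_edges` is a hypothetical name, and it assumes the input list is already sorted best-first by your re-ranker.

```python
def order_for_edges(chunks_best_first: list[str]) -> list[str]:
    """Place the strongest chunks at the start and end, weakest in the middle.

    Even ranks (0, 2, 4, ...) fill the front in order; odd ranks (1, 3, ...)
    fill the back in reverse, so rank 0 is first and rank 1 is last.
    """
    front: list[str] = []
    back: list[str] = []
    for rank, chunk in enumerate(chunks_best_first):
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


# Ranked A (best) through E (worst):
print(order_for_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Whether this beats a simple best-last ordering depends on the model and context length, so treat it as a variable to test rather than a default.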

4. Structure: make the relevant parts easy to find

Formatting the context so its structure is explicit helps the model locate and use the right information.

  • Use clear delimiters (XML-style tags, markdown headers, or labeled sections) to separate instructions from data from examples, for example <document>...</document> and <instructions>...</instructions>.
  • Label retrieved sources so the model can cite them and you can trace its reasoning.
  • Put instructions before data for most models, so the model knows what to do with the data as it reads it.

5. Few-shot examples: show, don't just tell

Including a few worked examples of input→output in the exact format you want is one of the most reliable accuracy boosts, especially for output structure and edge-case handling.

  • 2–5 well-chosen examples usually outperform a long prose description of the format.
  • Include edge cases in your examples (empty fields, ambiguous inputs) to show the desired handling.
  • Make examples representative of the real distribution — examples that don't match production inputs can mislead.
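
As a sketch of what "showing" looks like in practice, the helper below assembles an instruction, a handful of worked examples, and the real input into one prompt. The function name, the `Input:`/`Output:` labels, and the sample extraction task are all assumptions made for illustration; use whatever labels match your own format.

```python
def build_few_shot_prompt(
    instruction: str,
    examples: list[tuple[str, str]],
    query: str,
) -> str:
    """Render the instruction, worked (input, output) pairs, then the real input.

    The prompt ends on an open "Output:" label so the model continues
    in the same format the examples demonstrate.
    """
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts += [f"Input: {example_input}", f"Output: {example_output}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


examples = [
    ("Order 1182 from Ana Ruiz, 3 units", '{"order_id": 1182, "name": "Ana Ruiz", "qty": 3}'),
    # Edge case: a missing field is shown as null rather than guessed.
    ("Order 2040, 1 unit", '{"order_id": 2040, "name": null, "qty": 1}'),
]
prompt = build_few_shot_prompt(
    "Extract the order as JSON with keys order_id, name, qty.",
    examples,
    "Order 3375 from Li Wei, 2 units",
)
print(prompt)
```

The second example carries the edge-case handling: the model sees that a missing name becomes `null`, which is hard to convey as reliably in prose.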

6. Decomposition: shape the reasoning path

Restructuring a hard task into a sequence of smaller, well-scoped contexts often beats cramming everything into one prompt.

  • Chain-of-thought prompting ("think step by step") improves reasoning by giving the model space to work before answering.
  • Task decomposition — break a multi-part task into separate calls, each with a focused context — reduces the chance of the model dropping a requirement.
  • Provide a scratchpad or structured output schema that forces the model to address each sub-part.
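
Task decomposition can be sketched as one focused call per sub-question, with the results collected into a fixed structure so no part is silently dropped. This is an illustration under assumptions: `call_model` is a hypothetical callable standing in for your LLM client, and the prompt wording is a placeholder.

```python
from typing import Callable


def answer_in_steps(
    document: str,
    sub_questions: list[str],
    call_model: Callable[[str], str],
) -> dict[str, str]:
    """Ask each sub-question in its own call, each with a focused context.

    Returns a dict keyed by sub-question, so every requirement has a slot
    that is either filled or visibly missing.
    """
    answers: dict[str, str] = {}
    for question in sub_questions:
        prompt = (
            f"<document>\n{document}\n</document>\n\n"
            f"Answer only this question: {question}"
        )
        answers[question] = call_model(prompt)
    return answers


# Stub model for demonstration only: echoes the last line of the prompt.
def fake_model(prompt: str) -> str:
    return "stub answer to -> " + prompt.splitlines()[-1]


result = answer_in_steps(
    "Contract text goes here.",
    ["Who are the parties?", "What is the termination notice period?"],
    fake_model,
)
```

The trade-off is more calls and more latency in exchange for shorter, single-purpose contexts, so it is worth measuring against the single-prompt version before adopting it.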

Context engineering vs. fine-tuning: when to use which

Dimension | Context engineering | Fine-tuning
--- | --- | ---
Changes the model? | No (weights frozen) | Yes (updates weights)
Cost | Low (prompt/retrieval infra) | High (GPUs, data labeling, training)
Iteration speed | Minutes (edit and re-run) | Hours to days per training run
Best for | Supplying knowledge, format, reasoning | Deep behavior change, new skills, style at scale
Knowledge freshness | Always current (retrieve live data) | Frozen at training time
Risk | Low (easily reversible) | Catastrophic forgetting, overfitting
Per-task overhead | None (same model, different context) | A model (or adapter) per task

The practical order

Exhaust context engineering before fine-tuning. Most accuracy problems blamed on the model are actually context problems: missing facts (fix with retrieval), distracting facts (fix with compression), badly placed facts (fix with positioning), or unclear format (fix with examples). Fine-tuning is the right tool for teaching genuinely new behavior or style at scale — not for fixing what good context would solve.


A worked diagnostic: why is my LLM getting this wrong?

When accuracy is poor, walk this checklist before touching the model:

Symptom | Likely context cause | Fix
--- | --- | ---
Confidently wrong facts (hallucination) | The fact isn't in context | Add retrieval; ground answers in sources
Right facts available but ignored | Buried in the middle of a long context | Reposition to start/end; cut surrounding noise
Accuracy drops as you add more documents | Long-context degradation / distractors | Retrieve fewer, higher-relevance chunks; re-rank
Wrong output format | Format described in prose, not shown | Add 2–5 few-shot examples in the target format
Drops requirements on complex tasks | Too much packed into one prompt | Decompose into focused sub-calls; use a schema
Inconsistent answers to the same question | Irrelevant/variable context bleeding in | Stabilize and minimize the context; remove noise
Good on short chats, bad on long ones | History overflow pushing out key info | Summarize old turns; keep the system prompt anchored

Most "the model isn't smart enough" complaints resolve at one of these rows.


How to apply context engineering (code example)

A structured-context RAG prompt that uses several techniques at once — delimiters, source labeling, positioning, and an explicit instruction:

def build_context(query, retrieved_chunks):
    """Build (system, user) prompt strings for a grounded RAG answer.

    Assumes each chunk exposes .source and .text, and that rerank() is
    defined elsewhere and returns the chunks sorted best-first.
    """
    # 1. Re-rank and keep only the few most relevant chunks (compression)
    top_chunks = rerank(retrieved_chunks, query)[:4]

    # 2. Label sources and use clear delimiters (structure)
    documents = "\n\n".join(
        f"<document id='{i}' source='{c.source}'>\n{c.text}\n</document>"
        for i, c in enumerate(top_chunks)
    )

    # 3. Critical instruction at the START, question at the END (positioning)
    system = (
        "You are a precise assistant. Answer ONLY from the documents below. "
        "If the answer is not in the documents, say 'Not found in sources.' "
        "Cite the document id you used."
    )

    user = (
        f"<documents>\n{documents}\n</documents>\n\n"
        "<instructions>Answer using only the documents above. Cite ids.</instructions>\n\n"
        f"Question: {query}"  # question last, closest to the model's answer
    )
    return system, user

The structure does the work: scoped instruction, only the top re-ranked chunks, explicit delimiters, labeled sources for citation, and the question placed last where the model attends most.


Measuring whether your context engineering works

Context engineering is empirical — you must measure, not guess. A lightweight loop:

  1. Build an eval set. 50–200 real queries with known-correct answers. This is non-negotiable; without it you're tuning blind.
  2. Define a metric. Exact match, factual correctness (LLM-as-judge), citation accuracy, or task success rate — whatever maps to your goal.
  3. Change one variable at a time. Chunk size, number of retrieved docs, positioning, example count. Isolate the effect.
  4. A/B the context, not the model. Same model, two context strategies, compare scores.
  5. Watch for regressions. A change that helps one query class can hurt another — eval across the full set.

The discipline mirrors good experimentation: hold the model fixed, vary the context, measure the delta.
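
The loop above fits in a small harness. The sketch below assumes exact match as the metric and represents each context strategy as a hypothetical callable that maps a query to the model's answer (same model behind each one, different context-building code); the names `exact_match_rate` and `compare_strategies` are illustrative.

```python
from typing import Callable

EvalSet = list[tuple[str, str]]  # (query, known-correct answer)


def exact_match_rate(eval_set: EvalSet, answer_fn: Callable[[str], str]) -> float:
    """Fraction of queries whose answer matches exactly (case and whitespace ignored)."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = sum(
        answer_fn(query).strip().lower() == expected.strip().lower()
        for query, expected in eval_set
    )
    return hits / len(eval_set)


def compare_strategies(
    eval_set: EvalSet,
    strategies: dict[str, Callable[[str], str]],
) -> dict[str, float]:
    """Score every strategy on the same eval set so the deltas are comparable."""
    return {name: exact_match_rate(eval_set, fn) for name, fn in strategies.items()}


# Stubbed strategies for demonstration; real ones would build a context and call the model.
eval_set = [("2+2", "4"), ("capital of France", "Paris")]
scores = compare_strategies(
    eval_set,
    {
        "baseline": lambda q: "4",
        "reranked": lambda q: {"2+2": "4", "capital of France": "paris"}[q],
    },
)
print(scores)  # {'baseline': 0.5, 'reranked': 1.0}
```

Exact match is the strictest of the metrics listed and suits short factual answers; for free-form answers you would swap the comparison for whichever metric you chose in step 2.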


Common pitfalls and how to avoid them

Pitfall | Why it hurts | Fix
--- | --- | ---
"Just add more context" | Long contexts degrade accuracy and add distractors | Retrieve fewer, higher-quality, re-ranked chunks
Key instruction buried in the middle | Lost-in-the-middle effect: it gets ignored | Move critical content to start/end
Describing format in prose only | Models follow shown examples better than told rules | Add few-shot examples in the exact target format
No eval set | You can't tell if changes help or hurt | Build 50–200 labeled queries before tuning
Carrying full history in long chats | Overflow pushes out the system prompt and key facts | Summarize old turns; anchor the system prompt
Trusting first-pass vector search | Recall-oriented; best chunk may not be on top | Add a re-ranking step
Reaching for fine-tuning first | Expensive fix for what context usually solves | Exhaust context engineering, measure, then decide

Frequently asked questions

What is context engineering? Context engineering is the practice of deliberately designing what goes into an LLM's context window — instructions, retrieved documents, examples, history, and the query — and how it's formatted and ordered, to maximize answer quality. The model's weights are never changed; only the input is engineered. It's broader than prompt engineering, which focuses on wording a single instruction.

How does context engineering improve accuracy without fine-tuning? An LLM's output is a function of its context, so improving the context improves the output even with frozen weights. The main mechanisms are: supplying missing facts via retrieval, removing irrelevant content that degrades long-context accuracy, positioning critical information where the model attends most (start and end), structuring the input with clear delimiters, and showing the desired output format with few-shot examples.

Why do long contexts sometimes make accuracy worse? Two reasons. First, irrelevant content acts as a set of distractors that pull the model's attention away from the relevant facts. Second, the "lost in the middle" effect means models use information at the start and end of the context far more reliably than information in the middle, so a long context can bury the key fact in the least-attended region. Retrieving fewer, higher-relevance chunks usually beats stuffing the window.

Is context engineering the same as prompt engineering? No — prompt engineering is a subset. Prompt engineering optimizes the wording of instructions. Context engineering is the broader discipline of managing the entire context payload: which documents to retrieve, how much history to keep, where to place key facts, what to compress or remove, and how to structure it all. Prompt wording is one piece of a larger information-design problem.

When should I fine-tune instead of doing context engineering? Fine-tune when you need to teach genuinely new behavior, a new skill, or a consistent style at scale that can't be specified through instructions and examples. Use context engineering — which is cheaper, faster to iterate, and keeps knowledge current — for supplying facts, controlling output format, and improving reasoning. The practical rule is to exhaust context engineering and measure before deciding fine-tuning is necessary.

What is the "lost in the middle" problem? It's the observed tendency of LLMs to use information placed at the beginning and end of a long context more reliably than information in the middle. A relevant fact buried among many retrieved documents in the middle of the window may be effectively ignored. The fix is to position critical content at the start or end and to keep the context short enough that there is no neglected middle.


Key takeaways

  • Context engineering improves accuracy by designing the input, not changing the model — instructions, retrieval, examples, ordering, and what to leave out.
  • Models can't use facts they aren't given — retrieval (RAG) is the biggest lever for knowledge-dependent tasks.
  • More context often means worse accuracy — irrelevant tokens are distractors; retrieve fewer, higher-quality, re-ranked chunks.
  • Position matters — exploit "lost in the middle" by placing critical facts at the start or end, not buried in the middle.
  • Show, don't tell — 2–5 few-shot examples in the target format beat a prose description.
  • Measure with an eval set — change one context variable at a time and A/B the context, not the model.
  • Exhaust context engineering before fine-tuning — most accuracy problems are context problems, and context fixes are cheaper, faster, and reversible.