The question engineers ask most often when an LLM pipeline underperforms is: “how should I reword this prompt?” That is almost never the right question. The right question is: “what am I putting in the context window, and is it the right information in the right order?”
This shift - from prompt wording to context construction - is what context engineering means. It is not a rebranding. The two disciplines require different skills and different tooling, and they produce different categories of wins. A 10% improvement from rewording a prompt is about as good as it gets. A 40-60% quality improvement from restructuring how you build context is routine.
Why Prompt Wording Lost Its Leverage
Three years ago, “think step by step” measurably improved reasoning. “You are an expert software engineer” made code better. XML tags helped Claude parse structure. These tricks worked because models were less capable and more sensitive to exact phrasing.
Modern frontier models have internalized most of those patterns. They think step by step without being told. They handle structured tasks without XML delimiters. The marginal return on prompt wording has dropped close to zero for most tasks.
The variable that still has massive leverage is what is in the context window when the model generates. The model cannot reason about facts it cannot see. It cannot follow patterns that are absent from context. It cannot answer questions that require knowledge that is not in the window. The quality of the answer is largely determined by the quality of what you put in - not how politely you ask.
Context engineering is the discipline of making that construction systematic and measurable.
The Four Pillars
1. Retrieval - Getting the Right Chunks In
The naive RAG pipeline embeds a query, runs a vector search, and dumps the top-5 chunks into context. This works well enough on clean test sets. It breaks in production for one central reason: semantic similarity is not the same as relevance.
A chunk about “pricing” can be highly similar to a query about “how to change subscription pricing” while containing only marketing copy, not the actual procedure. Embedding search finds chunks that use similar words. It does not find chunks that answer the question.
The naive retriever:
def retrieve(query: str, k: int = 5) -> list[str]:
    query_embedding = embed(query)
    results = vector_store.search(query_embedding, k=k)
    return [r.text for r in results]
The better approach uses reranking:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def retrieve_with_rerank(query: str, k_initial: int = 20, k_final: int = 5) -> list[str]:
    query_embedding = embed(query)
    candidates = vector_store.search(query_embedding, k=k_initial)
    # Cross-encoder reads query and chunk together - far more accurate
    pairs = [(query, c.text) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c.text for c, _ in ranked[:k_final]]
The cross-encoder reads the query and the candidate chunk together, which is dramatically more accurate than comparing independent embeddings. The tradeoff is latency - so the pattern is coarse retrieval (top-20) with embeddings, then fine-ranking (top-5) with the cross-encoder. Reranking alone consistently produces a 15-25% improvement in answer quality on RAG benchmarks.
Chunk size also matters more than most teams realize. Standard 512-token chunks are a guess. The right chunk size depends on document structure:
| Document type | Chunk size | Overlap | Reason |
|---|---|---|---|
| API documentation | 256-512 tokens | 10% | Each endpoint is self-contained |
| Narrative text / blog posts | 512-1024 tokens | 20% | Context spans paragraphs |
| Code files | Function/class boundaries | Syntax-aware | Semantic units beat token counts |
| Structured data (tables, configs) | Full row / full section | 0% | Splitting mid-row destroys meaning |
For code specifically: a function that is 800 tokens should stay as one chunk. A function split at line 400 is almost useless - the model needs the signature, body, and return together to understand what it does.
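For Python files, one way to get function-level chunks is to let the parser find the boundaries. The sketch below uses the standard-library `ast` module. The name `chunk_python_source` is made up for illustration, and it only captures top-level definitions (decorator lines and module-level statements are left out), so treat it as a starting point rather than a complete chunker.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class, never splitting a definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        # Only whole definitions become chunks; imports and loose statements are skipped.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text spanned by the node
            # (starting at the def/class keyword, so decorators are not included).
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks
```

An 800-token function comes back as a single chunk here, because the boundary is the definition itself, not a token count. Other languages would need their own parser for the same idea.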
2. Compaction - Fitting More Signal Into Fewer Tokens
The context window is finite. Every token of noise displaces a token of signal. Compaction is the process of transforming raw information into a denser, more useful form before it enters the window.
The naive version of compaction is truncation: keep the first N tokens. This is almost always wrong. Relevant information is distributed throughout a document, not concentrated at the top.
Map-reduce summarization:
import asyncio
async def compact_for_query(document: str, query: str, max_tokens: int = 800) -> str:
    chunks = split_into_chunks(document, chunk_size=1000)
    async def extract_relevant(chunk: str) -> str:
        return await cheap_llm.generate(
            f"Extract only the information relevant to: '{query}'\n"
            f"Be concise. If nothing is relevant, return NONE.\n\n{chunk}"
        )
    extractions = await asyncio.gather(*[extract_relevant(c) for c in chunks])
    relevant = [e for e in extractions if e.strip() != "NONE"]
    if not relevant:
        return ""
    combined = "\n\n".join(relevant)
    if count_tokens(combined) <= max_tokens:
        return combined
    # Second-pass reduction if still too long
    return await cheap_llm.generate(
        f"Summarize this into {max_tokens} tokens or fewer, "
        f"prioritizing information about '{query}':\n\n{combined}"
    )
This produces 60-80% compression while preserving information relevant to the specific query. The compaction calls use a cheap model (Haiku, Flash), and the savings in the main call typically exceed the compaction cost.
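The function above leans on two helpers, `split_into_chunks` and `count_tokens`, that are not defined in this post. A minimal sketch of what they might look like follows. It assumes whitespace-separated words are a good enough stand-in for tokens, which is only an approximation; a real pipeline would use the tokenizer of the model it calls.

```python
def count_tokens(text: str) -> int:
    """Approximate token count as whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def split_into_chunks(document: str, chunk_size: int = 1000) -> list[str]:
    """Group whole paragraphs into chunks of at most roughly chunk_size approximate tokens."""
    chunks: list[str] = []
    current: list[str] = []
    current_size = 0
    for paragraph in document.split("\n\n"):
        if not paragraph.strip():
            continue  # ignore empty paragraphs
        size = count_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_size + size > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each extraction call looking at complete thoughts. A single paragraph longer than `chunk_size` is kept whole in this sketch, so the budget is a target, not a hard cap.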
Progressive disclosure is a lighter-weight alternative. Instead of summarizing everything upfront, structure retrieval so the model gets metadata first and can request details:
AVAILABLE SECTIONS:
1. [Auth module] JWT token validation and session management - 2,400 tokens
2. [Rate limiter] Sliding window rate limiting per API key - 1,800 tokens
3. [Error handler] Global error handler with structured logging - 900 tokens
Which sections are needed to answer the current question?
This reduces context by 50-70% on tasks where most retrieved content is not relevant to the specific query. The model self-selects, which is more accurate than the retriever guessing.
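A menu like the one above can be generated from section metadata, and the model's reply mapped back to the sections it chose. The sketch below is one possible shape. The `Section` dataclass, both function names, and the assumption that the model replies with bare section numbers such as "1, 3" are all hypothetical; a production version would more likely use a tool call with a structured argument.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    summary: str
    text: str
    token_count: int


def build_section_menu(sections: list[Section]) -> str:
    """Render the metadata-only menu that goes into context instead of the full text."""
    lines = ["AVAILABLE SECTIONS:"]
    for number, section in enumerate(sections, start=1):
        lines.append(
            f"{number}. [{section.name}] {section.summary} - {section.token_count:,} tokens"
        )
    lines.append("Which sections are needed to answer the current question?")
    return "\n".join(lines)


def select_sections(reply: str, sections: list[Section]) -> list[Section]:
    """Return the sections whose numbers appear in the reply, skipping invalid numbers."""
    picked: list[Section] = []
    for match in re.findall(r"\d+", reply):
        index = int(match) - 1
        if 0 <= index < len(sections) and sections[index] not in picked:
            picked.append(sections[index])
    return picked
```

Only the text of the selected sections is then added to the context for the follow-up call.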
3. Tool-Result Shaping - Controlling What Comes Back
Every tool call adds tokens to the context. A database query returning a full ORM object is almost always returning far more than the model needs. The discipline of tool-result shaping means designing tool implementations to return the minimum useful output, formatted for the model, not for humans.
Unshaped tool result:
def get_user(user_id: int) -> dict:
    user = db.query(User).get(user_id)
    return user.to_dict()
# Returns all 30+ fields: password_hash, notification_preferences,
# billing_address, profile_image_url, raw timestamps, metadata...
# 600-800 tokens
Shaped tool result:
from datetime import datetime
from typing import Optional
def get_user(user_id: int, fields: Optional[list[str]] = None) -> dict:
    user = db.query(User).get(user_id)
    default_fields = ["id", "email", "name", "status", "account_age_days"]
    selected = fields or default_fields
    # Copy only the requested attributes that exist on the model
    result = {k: getattr(user, k) for k in selected if hasattr(user, k)}
    # Computed fields are more useful than raw timestamps
    if "account_age_days" in selected:
        result["account_age_days"] = (datetime.now() - user.created_at).days
    return result
# Returns 5 fields: id, email, name, status, account_age_days
# 80-100 tokens
That is roughly an 8x reduction per tool call, or about 500-700 tokens saved each time. If an agent makes 10 tool calls per task, that is roughly 5,000-7,000 tokens saved per task run. At Claude Sonnet pricing, that is a meaningful cost reduction across thousands of runs.
Error messages are context too. A verbose Python traceback gives the model more to reason about than it needs. Return structured errors instead:
# Instead of: full stack trace + exception chain + locals dump
{"error": "Database connection failed", "code": "DB_CONN_FAILED", "retry_after": 30}
The model needs to know what failed and what to do next - not the full call stack.
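One way to apply this across many tools is a decorator that catches exceptions and converts them into the compact shape shown above, while the full traceback still goes to the logs for humans. The sketch below is illustrative only: the `ERROR_CODES` table, the codes in it, and the retry hints are hypothetical and would be replaced by whatever your tools actually raise.

```python
import functools
import logging

logger = logging.getLogger(__name__)

# Hypothetical mapping: exception type -> (code, message for the model, retry hint in seconds)
ERROR_CODES = {
    ConnectionError: ("DB_CONN_FAILED", "Database connection failed", 30),
    TimeoutError: ("TIMEOUT", "Operation timed out", 10),
}


def structured_errors(tool):
    """Wrap a tool so failures come back as a small dict instead of a traceback."""

    @functools.wraps(tool)
    def wrapper(*args, **kwargs):
        try:
            return tool(*args, **kwargs)
        except Exception as exc:
            # Keep the full stack trace for operators, out of the model's context.
            logger.exception("tool %s failed", tool.__name__)
            for exc_type, (code, message, retry_after) in ERROR_CODES.items():
                if isinstance(exc, exc_type):
                    return {"error": message, "code": code, "retry_after": retry_after}
            return {"error": type(exc).__name__, "code": "UNKNOWN", "retry_after": None}

    return wrapper
```

The model sees a stable code it can branch on, and the detail needed for debugging lives in the log line rather than in the context window.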
4. Ordering - Position Matters More Than You Think
Context is not flat. Models pay differential attention to different positions in the context window, and the pattern is consistent across model families.
Two effects:
- Primacy: Content near the beginning of the context gets higher model attention. System instructions and critical constraints belong at the top.
- Recency: Content near the end also gets high attention - more than the middle. The user’s current question belongs just before the model’s turn.
The “lost in the middle” problem is real and well-measured. When you put 20 retrieved chunks into a context window, the model is most likely to use the ones at position 0 and position 19, and least likely to use the ones packed in the middle. For retrieval applications, this means you should deliberately place your best chunks at the top and bottom of the retrieved content block, not in the middle:
def order_chunks_for_context(chunks: list[Chunk]) -> list[Chunk]:
    # Sort by relevance score descending
    ranked = sorted(chunks, key=lambda c: c.relevance_score, reverse=True)
    if len(ranked) <= 2:
        return ranked
    front, back = [], []
    # Deal chunks alternately to the two ends: most relevant to the back
    # (recency - model generates immediately after), second most relevant
    # to the front (primacy), third to the back, and so on. The least
    # relevant chunks end up deepest in the middle.
    for i, chunk in enumerate(ranked):
        (back if i % 2 == 0 else front).append(chunk)
    # back was filled best-first, so reverse it to put the best chunk last
    return front + back[::-1]
Reordering retrieved chunks without changing their content produces 10-15% quality improvement on question-answering benchmarks. It is a free win once you understand the mechanism.
The overall structure of a context window for an agent matters too. A well-ordered assembly looks like this:
[System instructions] <-- primacy: model reads this first, weights heavily
[Tool definitions]
[Background / long-term memory] <-- stable reference material
[Retrieved chunks] <-- strongest chunks at top and bottom of this block
[Conversation history] <-- recent turns
[Current user message] <-- recency: immediately before generation
Flipping this - retrieved content at the top, instructions at the bottom - measurably degrades performance on instruction-following tasks. The model can still read everything, but the attention distribution works against you.
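The ordering above can be pinned down in code so it does not drift as a system grows. The sketch below assembles a single flat string purely for illustration. `ContextParts`, `assemble_context`, and the bracketed labels are hypothetical, and a real chat API would usually take a list of messages instead, but the same ordering applies.

```python
from dataclasses import dataclass, field


@dataclass
class ContextParts:
    system_instructions: str
    tool_definitions: str = ""
    background: str = ""
    retrieved_chunks: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)
    user_message: str = ""


def assemble_context(parts: ContextParts) -> str:
    """Join the parts in a fixed order: instructions first, current user message last."""
    sections = [
        ("SYSTEM", parts.system_instructions),
        ("TOOLS", parts.tool_definitions),
        ("BACKGROUND", parts.background),
        ("RETRIEVED", "\n\n".join(parts.retrieved_chunks)),
        ("HISTORY", "\n".join(parts.history)),
        ("USER", parts.user_message),
    ]
    # Empty sections are dropped so they do not spend tokens on bare labels.
    return "\n\n".join(f"[{label}]\n{body}" for label, body in sections if body)
```

Because the order lives in one function, an experiment that moves a section becomes a one-line change that can be measured.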
What the Full Pipeline Looks Like
A system applying all four pillars:
User query
|
v
[Query expansion]
-- generate 2-3 rephrasings to improve recall
|
v
[Retrieval]
-- top-20 by embedding similarity from vector store
|
v
[Reranking]
-- cross-encoder scores, keep top-8 chunks
|
v
[Compaction]
-- map-reduce extraction against the specific query
-- fit to token budget
|
v
[Context assembly]
-- instructions first
-- chunks ordered: best at start and end
-- recent history
-- current query last
|
v
[LLM call]
|
v
[Tool calls, shaped results]
-- each tool projects to needed fields
-- errors return structured codes, not tracebacks
|
v
Response
None of these steps require a new model or a better prompt. They all operate on what goes into the window.
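The stages that run before the LLM call can be wired together in a few lines. In the sketch below every stage is passed in as a plain callable, so the signatures of `expand`, `search`, `rerank`, and `compact` are assumptions made for illustration, not the interface of any particular library. The de-duplicated union across rephrasings is just one simple way to merge query-expansion results.

```python
from typing import Callable


def run_context_pipeline(
    query: str,
    expand: Callable[[str], list[str]],
    search: Callable[[str, int], list[str]],
    rerank: Callable[[str, list[str], int], list[str]],
    compact: Callable[[str, str], str],
    k_initial: int = 20,
    k_final: int = 8,
) -> list[str]:
    """Return compacted chunks, best first, ready for ordering and context assembly."""
    # Query expansion: the original query plus its rephrasings.
    queries = [query] + expand(query)

    # Retrieval: union of candidates across phrasings, de-duplicated in first-seen order.
    candidates: list[str] = []
    seen: set[str] = set()
    for q in queries:
        for chunk in search(q, k_initial):
            if chunk not in seen:
                seen.add(chunk)
                candidates.append(chunk)

    # Reranking: always score against the original query, not the rephrasings.
    top = rerank(query, candidates, k_final)

    # Compaction: keep only what is relevant to this query; drop empty results.
    compacted = [compact(chunk, query) for chunk in top]
    return [c for c in compacted if c]
```

Passing the stages in as callables keeps each one replaceable, which matters when you want to measure the effect of a single change.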
Measuring the Wins
Treating context construction as engineering means measuring it. Each pillar produces specific, trackable improvements:
| Pillar | Metric | Typical Improvement |
|---|---|---|
| Retrieval | Recall@5 on eval set | +15-25% with reranking |
| Compaction | Input tokens per request | -40-60% with map-reduce |
| Tool-result shaping | Tokens per tool call | -60-80% with field projection |
| Ordering | Answer quality on mid-ranked chunks | +10-15% with position optimization |
Set up an evaluation set of 50-100 representative queries with known correct answers. Measure before and after each change. This turns context engineering from intuition into a feedback loop with clear direction.
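For the retrieval pillar, the before-and-after measurement can be as small as the sketch below. It assumes each eval item is a query paired with the set of chunk IDs known to be relevant, and that the retriever returns chunk IDs in ranked order; `recall_at_k` and `evaluate` are illustrative names.

```python
from typing import Callable


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the known-relevant chunks that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def evaluate(
    retriever: Callable[[str], list[str]],
    eval_set: list[tuple[str, set[str]]],
    k: int = 5,
) -> float:
    """Mean recall@k of a retriever over the whole evaluation set."""
    scores = [recall_at_k(retriever(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```

Run `evaluate` once with the current retriever and once with the candidate change, on the same eval set, and compare the two numbers.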
The most useful telemetry to add to production systems: log, for every request, which retrieved chunks were cited in the final answer. You will almost always find that 30-50% of retrieved content is never referenced. That is your headroom - tokens being wasted on content that did not contribute to the response.
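Given such logs, the unused share is a short computation. The sketch below assumes each log record is a dict with hypothetical `retrieved_ids` and `cited_ids` keys; how citations are detected in the answer is left out and depends on your answer format.

```python
def unused_fraction(request_logs: list[dict]) -> float:
    """Share of retrieved chunks, across all requests, that the final answer never cited."""
    retrieved_total = 0
    cited_total = 0
    for record in request_logs:
        retrieved = set(record["retrieved_ids"])
        # Count only citations that point at chunks retrieved for this request.
        cited = set(record["cited_ids"]) & retrieved
        retrieved_total += len(retrieved)
        cited_total += len(cited)
    if retrieved_total == 0:
        return 0.0
    return 1 - cited_total / retrieved_total
```

Tracking this number per retrieval strategy shows whether a change brings in chunks the model actually uses or just more unused tokens.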
The Honest Assessment
Context engineering is harder than prompt engineering. The number of variables is higher - chunk size, retrieval strategy, reranking model, compaction approach, ordering algorithm, tool result schema - and each has to be evaluated against your actual query distribution, not a toy test set.
What works reliably and is low-risk to adopt: reranking over pure embedding search, field projection in tool results, ordering chunks with best at start and end. These three changes alone outperform any amount of prompt rewording.
What requires more investment: map-reduce compaction adds latency and cost from extra LLM calls. Progressive disclosure requires careful tool design and works better with larger models. Query expansion improves recall but can introduce noise for highly specific queries.
Where to start: ship the reranker first. It is a single code change, no new infrastructure, and it consistently produces the biggest single improvement on any RAG pipeline. After that, add field projection to your tool results. Both together will produce more visible quality improvement than any prompt you could write.
The framing shift matters as much as the techniques. Teams that ask “how do I prompt this better” hit a ceiling fast because prompt wording is not the bottleneck anymore. Teams that ask “what should be in the context, is it there, in the right form, in the right order” keep finding improvements because context construction has many more degrees of freedom. That is what context engineering is: treating the window as the primary design surface, not an afterthought.