Most developers treat Large Language Models like magic 8-balls: ask a question, shake the API, and hope for a good answer. They obsess over "perfect phrasing," trying to coax the model into obedience with magic words like "think step-by-step." This is prompt engineering — and it's not enough anymore.
In June 2025, Shopify CEO Tobi Lutke posted a single tweet that crystallized what the industry was already feeling: "I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM." Andrej Karpathy amplified it days later, adding that in every industrial-strength LLM application, the real work is "the delicate art and science of filling the context window with just the right information for the next step."
That shift — from crafting clever sentences to architecting information systems — is what this article is about.
From prompt engineering to context engineering
Prompt engineering focuses on optimizing the textual instruction (the "query") to get a specific response. Context engineering focuses on architecting the entire input payload — retrieved documents, conversation history, tool definitions, schema constraints, few-shot examples, and system-level instructions — to maximize reasoning accuracy.
The difference? Prompting is about wording. Context engineering is about information architecture.
Pro Tip: Think of prompt engineering as writing a good SQL query. Think of context engineering as designing the database schema, the indexing strategy, and the view definitions that query runs against.
Karpathy's mental model captures it perfectly: think of an LLM as a CPU, and its context window as RAM. Your job as a developer is like writing an operating system — you decide what code and data gets loaded into working memory for each task. Load the wrong data, and even the best CPU produces garbage.
And that RAM is getting very large. In 2026, context windows have exploded:
| Model | Context Window | Release |
|---|---|---|
| GPT-5 | 400K tokens | Aug 2025 |
| Claude Opus 4.6 | 1M tokens | Feb 2026 |
| Gemini 2.5 Pro | 1M tokens | Mar 2025 |
| Llama 4 Scout | 10M tokens | Apr 2025 |
The challenge is no longer fitting data into the model. It's structuring that data so the model pays attention to the right signals amidst the noise.
Why attention creates the problem
To understand why context engineering is necessary, you need to understand the bottleneck: the attention mechanism. We covered the full mathematics of self-attention in our guide on How Large Language Models Actually Work — the Query, Key, Value projections, the scaled dot product, the softmax normalization. Here's the part that matters for context engineering:
When you feed an LLM 100 pages of text, every token competes for attention with every other token. The softmax function converts raw attention scores into probabilities that sum to 1. If you fill the context window with 10,000 irrelevant tokens, the attention probability for the right token gets diluted. The model gets distracted by "noise" that looks statistically similar to the answer but isn't.
This is why "dump everything into the context window" fails. You're flattening the probability distribution, making the model less confident in the correct answer.
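You can see the dilution with a toy calculation. The NumPy sketch below uses made-up attention scores (one relevant token scoring slightly higher than a crowd of near-miss distractors) rather than real model activations:

```python
import numpy as np

def softmax(scores):
    """Convert raw attention scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

# Toy scores: one highly relevant token vs. "noise" tokens that look almost as good.
relevant_score = 5.0
noise_score = 4.0

for n_noise in [10, 100, 1_000, 10_000]:
    scores = np.array([relevant_score] + [noise_score] * n_noise)
    attention = softmax(scores)
    print(f"{n_noise:>6} noise tokens -> attention on the right token: {attention[0]:.4f}")
```

The relevant token's raw score never changes, yet its share of attention collapses as the window fills with plausible-looking noise.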
The "Lost in the Middle" problem
Stanford researchers Liu et al. (2023) discovered a striking pattern: LLMs recall information at the beginning and end of a context window far better than information buried in the middle. This U-shaped performance curve — called "Lost in the Middle" — occurs because attention mechanisms naturally prioritize primacy (initial instructions) and recency (the immediate question), often skimming the bulk in between.
Place a critical fact at the 50% mark of a 100K token context, and retrieval accuracy can drop by over 20%. Newer models like Claude Opus 4.6 and GPT-5 have partially mitigated this through improved positional encodings and training techniques, but the pattern still holds under stress.
This has a direct architectural implication — your context should be arranged like this:
[SYSTEM INSTRUCTIONS] <-- Highest Attention (Primacy)
- Role definition
- Output format
- Critical constraints
[REFERENCE DATA] <-- Lowest Attention (The "Trough")
- Retrieved documents (RAG)
- CSV data
- Long conversation history
[USER QUERY] <-- Highest Attention (Recency)
- Specific question
- Immediate task
Common Pitfall: Developers often append "Remember to output JSON" at the very end of a massive prompt. While this leverages recency, if the reasoning instructions are buried in the middle, the model might hallucinate the content even if it gets the format right.
Structuring context with XML fencing
Structure is the antidote to attention drift. LLMs trained on code — which is virtually all modern models — are exceptionally good at parsing structured data formats like XML tags, JSON, and Markdown headers. By wrapping different types of context in distinct delimiters, you create "semantic islands" that the attention mechanism can index.
Anthropic and OpenAI both recommend this approach. It prevents context bleeding, where the model confuses a retrieved document's content with its own instructions.
def build_context(user_query, documents, history):
"""Constructs a semantically fenced context string."""
context = []
# 1. System Block (Primacy)
context.append("<system_instructions>")
context.append("You are a financial analyst. Answer based ONLY on the provided documents.")
context.append("</system_instructions>")
# 2. Data Block (The "Middle" - Protected by tags)
context.append("<retrieved_documents>")
for i, doc in enumerate(documents):
context.append(f'<doc id="{i}">\n{doc}\n</doc>')
context.append("</retrieved_documents>")
# 3. History Block
context.append("<history>")
for msg in history:
context.append(f"{msg['role']}: {msg['content']}")
context.append("</history>")
# 4. Task Block (Recency)
context.append("<user_query>")
context.append(user_query)
context.append("</user_query>")
return "\n".join(context)
Key Insight: Tags like <doc id="1"> aren't just formatting — they give the model a handle to reference. When you ask the model to "cite sources," it can attend specifically to the id attribute within the retrieved_documents block.
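Here's a minimal sketch of putting that handle to work; the [doc:ID] citation format and the regex are illustrative conventions, not something the model enforces on its own:

```python
import re

CITATION_INSTRUCTION = (
    "When you state a fact from a document, append a citation in the form "
    "[doc:ID], where ID matches the id attribute of the <doc> tag it came from."
)

def resolve_citations(response_text, documents):
    """Map [doc:N] markers in the model's answer back to the source documents."""
    cited_ids = {int(m) for m in re.findall(r"\[doc:(\d+)\]", response_text)}
    return {i: documents[i] for i in sorted(cited_ids) if i < len(documents)}

# Example with made-up documents and a hypothetical model response:
documents = ["Q3 revenue was $4.2M.", "Headcount grew 12%.", "Churn fell to 3%."]
answer = "Revenue reached $4.2M [doc:0] while churn dropped to 3% [doc:2]."
print(resolve_citations(answer, documents))
# {0: 'Q3 revenue was $4.2M.', 2: 'Churn fell to 3%.'}
```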
Dynamic context injection
Static system prompts break the moment your application needs to handle diverse user intents. Dynamic Context Injection solves this: instead of a single "God Prompt," you build a Context Assembler that programmatically constructs the context window at runtime based on what the user actually needs.
This is the engine behind RAG (Retrieval-Augmented Generation) and agentic workflows:
- Intent Classification — Determine what the user wants (e.g., "Debug code" vs. "Explain concept")
- Asset Retrieval — Fetch relevant rows from a database, vector store, or API
- Pruning — Rank retrieved assets by relevance and cut before the token limit (a minimal sketch follows this list)
- Assembly — Format using the fencing technique, respecting primacy and recency
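Step 3, pruning, is often the difference between a focused context and a bloated one. Below is a minimal sketch; it assumes the retriever attaches a relevance score to each asset and approximates token counts with a rough characters-per-token heuristic (swap in a real tokenizer such as tiktoken for production use):

```python
def prune_assets(assets, token_budget):
    """Keep the highest-relevance assets that fit within the token budget.

    assets: list of dicts like {"text": str, "relevance": float}
    token_budget: maximum tokens to spend on retrieved context
    """
    def estimate_tokens(text):
        # Rough heuristic: ~4 characters per token for English text.
        return len(text) // 4

    kept, used = [], 0
    for asset in sorted(assets, key=lambda a: a["relevance"], reverse=True):
        cost = estimate_tokens(asset["text"])
        if used + cost > token_budget:
            continue  # skip anything that would blow the budget
        kept.append(asset)
        used += cost
    return kept
```

Whatever survives the cut feeds straight into assembly, which the next section builds out in full.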
Putting it together: a production context builder
Let's build a support bot that combines structured data (customer orders) with unstructured data (return policies) to answer a specific question:
import pandas as pd
# Mock Data: What we retrieve from our database
orders_data = {
'order_id': ['ORD-101', 'ORD-102', 'ORD-103'],
'status': ['Shipped', 'Processing', 'Delivered'],
'items': ['Laptop', 'Mouse', 'Monitor'],
'customer_id': ['CUST-A', 'CUST-A', 'CUST-B']
}
df_orders = pd.DataFrame(orders_data)
# Mock Data: Retrieved from Vector DB (RAG)
knowledge_base = [
"Policy 1: Returns are accepted within 30 days of delivery.",
"Policy 2: Electronic items typically take 24-48 hours to process before shipping.",
"Policy 3: Refunds are processed to the original payment method."
]
class ContextBuilder:
def __init__(self, system_role):
self.parts = []
self.parts.append(f"<system_role>{system_role}</system_role>")
def add_data(self, data_name, dataframe):
"""Converts DataFrame to compact XML-like string"""
csv_string = dataframe.to_csv(index=False)
section = f"<{data_name}>\n{csv_string}</{data_name}>"
self.parts.append(section)
def add_documents(self, documents):
doc_str = "\n".join([f"- {doc}" for doc in documents])
section = f"<policies>\n{doc_str}\n</policies>"
self.parts.append(section)
def add_instructions(self, instructions):
self.parts.append(f"<instructions>\n{instructions}\n</instructions>")
def build(self, user_query):
self.parts.append(f"<user_query>\n{user_query}\n</user_query>")
return "\n\n".join(self.parts)
# 1. Initialize with Primacy (System Role)
builder = ContextBuilder("You are a helpful support agent for TechStore.")
# 2. Inject Context (The Middle) - filter for current user to reduce noise
current_user = "CUST-A"
user_orders = df_orders[df_orders['customer_id'] == current_user]
builder.add_data("customer_orders", user_orders)
builder.add_documents(knowledge_base)
# 3. Add Reasoning Instructions
builder.add_instructions(
"Check the order status first. If status is Processing, cite Policy 2."
)
# 4. Build Final Payload
final_prompt = builder.build("Where is my mouse?")
print("--- FINAL CONTEXT PAYLOAD ---\n")
print(final_prompt)
Output:
--- FINAL CONTEXT PAYLOAD ---
<system_role>You are a helpful support agent for TechStore.</system_role>
<customer_orders>
order_id,status,items,customer_id
ORD-101,Shipped,Laptop,CUST-A
ORD-102,Processing,Mouse,CUST-A
</customer_orders>
<policies>
- Policy 1: Returns are accepted within 30 days of delivery.
- Policy 2: Electronic items typically take 24-48 hours to process before shipping.
- Policy 3: Refunds are processed to the original payment method.
</policies>
<instructions>
Check the order status first. If status is Processing, cite Policy 2.
</instructions>
<user_query>
Where is my mouse?
</user_query>
Notice what happened: we filtered out CUST-B's data (noise reduction), fenced each data type with tags (structure), put system instructions first (primacy), and the user query last (recency). The model now has everything it needs — and nothing it doesn't.
Context caching: the cost multiplier
Every token you send to an LLM costs money and adds latency. When your context assembler sends the same system prompt, tool definitions, and few-shot examples on every API call, you're paying for the same tokens over and over.
All three major providers now offer prompt caching to solve this:
| Provider | Approach | Cost Savings | Cache Duration |
|---|---|---|---|
| Anthropic | Explicit breakpoints you control | Cached reads 10x cheaper | 5 min (default), 1 hour (extended) |
| OpenAI | Automatic prefix matching | Cached reads up to 10x cheaper (model-dependent) | 5–10 min of inactivity (automatic) |
| Google | Implicit (auto) + Explicit (manual) | Guaranteed discounts on explicit caches | 1 hour default, configurable (explicit) |
The practical implication: design your context with caching in mind. Put the stable parts (system prompt, tool schemas, few-shot examples) at the beginning of your payload, and the dynamic parts (retrieved documents, user query) at the end. This way, the prefix stays cached across requests.
Key Insight: Context caching isn't just an optimization — it changes your architecture. With 1-hour caching, you can afford to include 50-page reference manuals in every request, because you only pay full price once. This turns "context engineering" from a compression problem into an information design problem.
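Here's what an explicit cache breakpoint looks like with Anthropic's Python SDK, as a sketch: the model ID is a placeholder, the system prompt stands in for a long, stable block (caching requires a minimum prefix length), and final_prompt is the payload assembled by the ContextBuilder above.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stand-in for a long, stable block: role, policies, tool docs, few-shot examples.
STABLE_SYSTEM_PROMPT = "You are a helpful support agent for TechStore. ..."

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: substitute your current model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # Everything up to and including this block is cached; later requests
            # that share the same prefix read it at the discounted rate.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        # Dynamic content (retrieved orders, policies, user query) comes last.
        {"role": "user", "content": final_prompt},
    ],
)
print(response.content[0].text)
```

OpenAI's caching needs no markup at all: keep the stable prefix byte-identical across calls and the discount applies automatically.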
Agentic context: tools, memory, and MCP
The most demanding test of context engineering isn't a single API call — it's an agent that makes dozens of calls in a loop, each time deciding what tools to invoke, what results to keep, and what to discard.
In agentic workflows, context management becomes state management. Every tool call returns data that must be selectively injected into the next reasoning step. Include too much, and the agent loses focus. Include too little, and it forgets critical results.
Anthropic's Model Context Protocol (MCP) — an open standard adopted across the industry — addresses this by standardizing how agents connect to external tools and data sources. Instead of each tool having a bespoke integration, MCP provides a universal interface that the agent's context assembler can work with.
The key principles of agentic context engineering:
- Selective memory — Summarize completed subtask results rather than carrying raw tool outputs forward (see the sketch after this list)
- Tool definitions as context — Each tool's description competes for attention, so keep definitions concise and precise
- Context budgeting — Reserve tokens for reasoning by aggressively pruning old conversation turns
- Structured handoffs — When an agent delegates to a sub-agent, pass a focused context snapshot, not the full history
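A minimal sketch of the first principle, assuming a caller-supplied summarize() helper (typically a cheap model call) and an arbitrary size threshold:

```python
MAX_RAW_OUTPUT_CHARS = 2_000  # arbitrary threshold for this sketch

def record_tool_result(working_context, tool_name, raw_output, summarize):
    """Append a tool result to the agent's working context, compressing large outputs.

    summarize: a caller-supplied function (e.g. a cheap LLM call) that condenses
    raw tool output into the few facts the next reasoning step actually needs.
    """
    if len(raw_output) > MAX_RAW_OUTPUT_CHARS:
        content = summarize(raw_output)
        detail = "summarized"
    else:
        content = raw_output
        detail = "raw"
    working_context.append(
        f'<tool_result tool="{tool_name}" detail="{detail}">\n{content}\n</tool_result>'
    )
    return working_context
```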
Common mistakes in context engineering
1. The "Soup" approach. Dumping raw JSON, unformatted text, and Python code into a single string without separators. The attention mechanism can't distinguish what is data to analyze from what is instruction to follow.
2. Neglecting output schemas. If you need JSON output, provide a one-shot example of that exact JSON structure inside your instructions. Don't assume the model "knows" what you mean by "clean JSON." Few-shot examples are among the highest-leverage context engineering techniques — one concrete example outperforms paragraphs of description (a sketch follows this list).
3. Assuming more context means better results. Bigger context windows don't mean you should fill them. As context grows, reasoning quality often degrades — the "distraction" effect. Always prefer 5 relevant documents over 50 loosely related ones. Information density matters more than information volume.
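For mistake 2, the fix is mechanical: embed one concrete example of the target structure in your instructions. A minimal sketch, with illustrative field names tied to the support-bot example above:

```python
# One-shot example of the exact output structure. This string would be passed to
# builder.add_instructions() before build(), alongside the reasoning instructions.
# The field names are illustrative, not a fixed schema.
JSON_FORMAT_INSTRUCTION = """Respond with JSON matching this example exactly (same keys and types):
{
  "order_id": "ORD-102",
  "status": "Processing",
  "eta_hours": 48,
  "cited_policies": ["Policy 2"]
}"""
```

The model now copies a shape instead of guessing at one.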
Conclusion
Context engineering marks the maturation of AI development. We've moved from the superstition of "prompt whispering" toward the engineering rigor of information architecture — understanding attention dynamics, respecting the Lost in the Middle phenomenon, structuring inputs with semantic fencing, caching stable context for cost efficiency, and managing state across agentic tool calls.
The shift from prompts to production isn't about writing better sentences. It's about building better systems around the model — systems that deliver the right information, in the right structure, at the right time.
To see how these context principles power enterprise ML infrastructure, check out our guide on Google Vertex AI. To understand the benchmarks pushing these systems to their limits, read about Humanity's Last Exam.