Ask anyone building agents what their biggest problem is and “memory” comes up within the first two sentences. The agent solved the bug yesterday and has no idea today. It learned the user prefers TypeScript, then suggested JavaScript an hour later. It re-read the same 4,000-line file three times in one session because nothing told it that it already knew the answer.
The confusing part is that everyone thinks they already have memory. “I have a 1M token context window.” That is not memory. That is a desk. Memory is the filing cabinet you reach for when the desk is full and the meeting has ended. This post is about how agents actually remember in 2026 - starting from the naive version everyone builds first, showing exactly where it breaks, and ending at the hybrid architecture that production agents converge on.
The Context Window Is Not Memory
The single most common mistake is treating the context window as the agent’s memory. It feels like memory because the model can refer back to anything in it. But it has four properties that disqualify it as long-term storage:
It is finite. Even a 1M token window fills up. A coding agent working through a real task can burn 1M tokens in an afternoon. Once it is full, something has to go.
It is volatile. When the session ends, the window is gone. Open a new session tomorrow and the agent starts from zero unless you rebuild the context yourself.
It degrades with size. Models get worse at finding information buried in the middle of a long context - the “lost in the middle” effect. A fact at token 600,000 is not as reliable as the same fact at token 2,000. More context is not linearly more useful.
It is expensive. Every token in the window is re-read on every single turn. A 500K token context that you carry across 40 turns is 20M tokens of input billing for one task. Memory exists partly to keep the context small, not just to remember things.
So the real goal is not “give the agent a bigger window.” It is: keep the working context tight, and pull in exactly the right past information at exactly the right moment. That is a storage and retrieval problem, which is what memory architectures actually solve.
Level 0 - Just Append Everything
The first thing everyone builds. Keep the full conversation in a list and pass it back every turn.
class BufferMemory:
    def __init__(self):
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def context(self):
        return self.messages  # send the entire history every turn
This works beautifully for a 10-message chat. Then reality arrives:
- A long task crosses the context limit and the API rejects the request.
- Cumulative cost grows quadratically - turn 50 re-sends turns 1 through 49, so the total billed over a task scales with the square of the number of turns.
- The model starts losing early instructions because they are now buried under tool output.
Buffer memory is not wrong. It is the correct choice for short, bounded conversations. It just does not survive contact with a real agent that runs for hours or comes back tomorrow. Everything that follows is a response to one of its three failures: it gets too big, it forgets across sessions, and it cannot find what matters.
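To put a number on the quadratic growth, here is a minimal sketch. It assumes every turn adds the same number of tokens (the 2,000-token figure is made up for illustration) and that the whole history is re-sent each turn, as `BufferMemory` does.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed when every turn re-sends the full history.

    Turn t sends t * tokens_per_turn tokens (its own message plus all earlier
    ones), so the total is tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


# Assumed figure for illustration only: each turn adds about 2,000 tokens.
for turns in (10, 50, 100):
    print(turns, cumulative_input_tokens(turns, 2_000))
```

Under those assumptions, going from 50 to 100 turns roughly quadruples the total input tokens rather than doubling it.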
Level 1 - Summarize the Old Stuff
The obvious fix for “too big” is to compress. Keep the last N messages verbatim, and replace everything older with a running summary.
class SummarizingMemory:
    def __init__(self, llm, keep_recent=10):
        self.llm = llm
        self.keep_recent = keep_recent
        self.summary = ""
        self.recent = []

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > self.keep_recent:
            # fold the oldest messages into the running summary
            overflow = self.recent[:-self.keep_recent]
            self.recent = self.recent[-self.keep_recent:]
            self.summary = self.llm.generate(
                system="Update the running summary with the new messages. "
                       "Preserve decisions, facts, and open questions. Be terse.",
                messages=[{"role": "user",
                           "content": f"Summary so far:\n{self.summary}\n\n"
                                      f"New messages:\n{overflow}"}]
            ).text

    def context(self):
        return ([{"role": "system", "content": f"Conversation so far: {self.summary}"}]
                + self.recent)
This is what most “memory” features in chat products actually are. It keeps the context bounded and is cheap to run. But summarization is lossy by design, and the losses are not random:
- Specifics die first. Summaries keep themes and drop exact values. The error code, the file path, the version number - precisely the things the agent needs later - get smoothed into “we discussed the deployment issue.”
- It compounds. Each summary is a summary of a summary. After ten folds, detail from turn 3 has been through ten lossy passes. It is a game of telephone with your own context.
- It still does not survive sessions unless you persist the summary and reload it - and even then, you reload one blob whether or not it is relevant to today’s task.
Summarization solves “too big.” It does not solve “find the one fact I need” or “remember the right thing next week.”
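Persisting the summary between sessions is mostly a serialization problem. The sketch below is one way to do it, not the only one: `SummaryState`, `save_state`, and `load_state` are hypothetical names, and the state is assumed to be just the running summary plus the recent messages.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SummaryState:
    """The two pieces of state a summarizing memory needs to survive a restart."""
    summary: str = ""
    recent: list = field(default_factory=list)


def save_state(state: SummaryState, path: Path) -> None:
    # Write to a temp file, then rename, so a crash never leaves a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"summary": state.summary, "recent": state.recent}))
    tmp.replace(path)


def load_state(path: Path) -> SummaryState:
    if not path.exists():
        return SummaryState()  # first session: start empty
    data = json.loads(path.read_text())
    return SummaryState(summary=data["summary"], recent=data["recent"])
```

Note what this does not fix: `load_state` hands back the same blob every time, whether or not it has anything to do with today's task.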
Level 2 - Store It and Retrieve Semantically
Now treat past interactions as a searchable store instead of a linear log. Embed each memory, drop it in a vector database, and at the start of each turn retrieve the most relevant entries for the current query. This is RAG pointed at the agent’s own history.
class VectorMemory:
    def __init__(self, embed, vector_db):
        self.embed = embed
        self.db = vector_db

    def remember(self, text: str, metadata: dict):
        self.db.upsert(vector=self.embed(text), text=text, metadata=metadata)

    def recall(self, query: str, k=5):
        hits = self.db.search(vector=self.embed(query), top_k=k)
        return [h.text for h in hits]
This is the first design that genuinely survives across sessions and scales to a huge history, because you only ever pull the top-k relevant entries into context. It is the backbone of “the assistant remembers things about you” features.
But semantic retrieval has sharp edges that people discover in production:
- Relevance is not recency or importance. The vector search returns what is similar to the query, not what matters. “What did we decide about auth?” surfaces every message mentioning auth, including the three you rejected.
- It has no structure. “User’s name is Priya” and “User said their colleague Priya is wrong” embed close together. Retrieval cannot tell a fact from a mention.
- Updates are messy. When a fact changes (“we moved from Postgres to DynamoDB”), the old vector is still sitting there, equally retrievable. You get both, and the model has to guess which is current.
- Retrieval misses are silent. If the embedding does not match, the agent does not know the memory exists. It confidently acts as if it never happened.
Vector memory is necessary but not sufficient. It is great for “things that were said” and bad for “things that are true.”
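The stale-update problem is easy to reproduce without a real vector database. The following toy sketch uses a bag-of-words counter as a stand-in for an embedding model and an append-only list as a stand-in for the index; `ToyVectorStore` and `embed` are illustrative names, not any library's API.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. A real system would call an embedding model."""
    return Counter(text.lower().replace(".", "").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    def __init__(self):
        self.rows = []  # list of (vector, text)

    def upsert(self, text: str) -> None:
        self.rows.append((embed(text), text))  # append-only: nothing is replaced

    def search(self, query: str, top_k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


store = ToyVectorStore()
store.upsert("We use Postgres for the database")
store.upsert("We moved the database from Postgres to DynamoDB")
store.upsert("The user prefers TypeScript")

hits = store.search("which database do we use")
print(hits)
```

Both database statements come back, and in this toy setup the outdated one ranks first because it shares more words with the query. Nothing in the similarity score knows which statement is current.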
The Insight - There Is No Single Memory
Every level above tried to build one memory system. The reason none of them is enough is that “memory” is not one thing. Cognitive science split it decades ago, and agent architectures in 2026 have landed on the same split because it actually maps to different storage and retrieval needs.
| Type | What it holds | Example | Best store |
|---|---|---|---|
| Working | The current task’s active context | The file being edited right now | Context window |
| Episodic | Specific past events, time-stamped | “On June 3 the user rejected the Redis approach” | Vector DB |
| Semantic | Durable facts and preferences | “User deploys on Cloudflare, prefers TypeScript” | Structured store / key-value |
| Procedural | How to do recurring tasks | “To deploy: run build, check, then push” | Files / prompt templates / skills |
The mistake is forcing one mechanism to do all four jobs. A running summary is bad at durable facts. A vector DB is bad at procedures. The context window is bad at everything that needs to outlive the session. The production answer is to use the right store for each type and wire them together.
Level 3 - Structured Memory for Things That Are True
Episodic memory (“what happened”) wants similarity search. Semantic memory (“what is true”) wants something closer to a database row you can read, write, and overwrite deterministically. This is the part vector search gets wrong, and it is where structured memory earns its place.
The simplest version that works is editable fact records - which is, not coincidentally, exactly what Claude Code’s own memory does. Each fact is one record with a short slug and a description, plus an index so the agent can see what it knows without loading every fact:
memory/
  index.md                          # one line per fact - loaded every session
  user-prefers-typescript.md
  deploys-on-cloudflare-workers.md
  project-uses-hugo-papermod.md
class FactMemory:
    """Deterministic, overwritable facts. No embeddings, no guessing."""

    def __init__(self, store):
        self.store = store  # any key-value store: files, SQLite, Redis

    def write(self, slug: str, fact: str):
        # writing the same slug overwrites - facts update in place
        self.store.put(slug, fact)

    def forget(self, slug: str):
        self.store.delete(slug)

    def all(self):
        # the index is small enough to always carry in context
        return self.store.values()
What this buys you that vectors cannot:
- Updates are real updates. Change a preference and you overwrite the record. There is no stale duplicate lurking in an index.
- No retrieval miss for core facts. The index is small enough to keep in context every session, so the agent always knows its own preferences and constraints. You only fall back to search for the long tail.
- It is inspectable and editable by humans. A user can read what the agent believes about them and correct it. That is impossible with an opaque vector blob and it matters enormously for trust.
Structured facts are the cheapest, highest-leverage memory most agents are missing. You do not need a vector database to remember the user’s name. You need a row.
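As one possible backing store, here is a file-based sketch that exposes `put`, `delete`, and `values` and keeps an `index.md` with one line per fact. `FileFactStore` is a hypothetical name, the index line format is an assumption, and slugs are assumed to be filename-safe already (a real implementation should validate them).

```python
import tempfile
from pathlib import Path


class FileFactStore:
    """One Markdown file per fact, plus an index.md with one line per fact."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, slug: str) -> Path:
        return self.root / f"{slug}.md"

    def _fact_files(self) -> list[Path]:
        return sorted(p for p in self.root.glob("*.md") if p.name != "index.md")

    def _rebuild_index(self) -> None:
        lines = [f"- {p.stem}: {p.read_text().strip()}" for p in self._fact_files()]
        (self.root / "index.md").write_text("\n".join(lines) + "\n")

    def put(self, slug: str, fact: str) -> None:
        self._path(slug).write_text(fact + "\n")  # same slug -> same file -> overwrite
        self._rebuild_index()

    def delete(self, slug: str) -> None:
        self._path(slug).unlink(missing_ok=True)
        self._rebuild_index()

    def values(self) -> list[str]:
        return [p.read_text().strip() for p in self._fact_files()]


store = FileFactStore(Path(tempfile.mkdtemp()) / "memory")
store.put("user-location", "User lives in Bangalore")
store.put("user-location", "User moved to Berlin")  # overwrites, no duplicate
store.put("user-prefers-typescript", "User prefers TypeScript")
print(store.values())
```

Because each fact is a plain file, a user can open the directory, read what the agent believes, and fix or delete a record by hand.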
Level 4 - The Hybrid Architecture
Put it together and a real 2026 agent memory system looks like this. Working memory in the context, three persistent stores behind it, and a memory manager mediating reads and writes.
      ┌──────────────────────────────┐
      │        Working Memory        │
      │ (context window - this turn) │
      └───────────────┬──────────────┘
                      │ read / write
      ┌───────────────▼──────────────┐
      │        Memory Manager        │
      │ route, retrieve, write, prune│
      └───┬──────────┬──────────┬────┘
          │          │          │
┌─────────▼──┐ ┌─────▼─────┐ ┌──▼──────────┐
│ Semantic   │ │ Episodic  │ │ Procedural  │
│ (facts,    │ │ (events,  │ │ (skills,    │
│  KV store) │ │ vector DB)│ │  playbooks) │
└────────────┘ └───────────┘ └─────────────┘
The two paths that matter are the write path (deciding what is worth remembering) and the read path (deciding what to pull back in). Most teams obsess over the read path and neglect the write path, which is backwards. A memory store full of junk retrieves junk.
The Write Path - What Is Worth Keeping
Not every message deserves to become a memory. Writing everything just rebuilds buffer memory in a slower database. The write path needs three decisions: extract, deduplicate, and resolve conflicts.
import json


class MemoryWriter:
    def __init__(self, llm, facts: FactMemory, episodes: VectorMemory):
        self.llm = llm
        self.facts = facts
        self.episodes = episodes

    def process(self, conversation_chunk: str):
        # 1. Extract candidate memories - is there anything durable here?
        extracted = self.llm.generate(
            system="Extract durable facts, preferences, and decisions worth "
                   "remembering long term. Skip transient chit-chat. "
                   "Classify each as 'fact' (always true) or 'event' (happened once). "
                   "Return JSON: [{type, slug, text}]. Empty list if nothing.",
            messages=[{"role": "user", "content": conversation_chunk}]
        )
        # assumes the model returned valid JSON; production code should validate
        for item in json.loads(extracted.text):
            if item["type"] == "fact":
                self._write_fact(item)
            else:
                self.episodes.remember(item["text"], {"kind": "event"})

    def _write_fact(self, item):
        # 2 + 3. Dedup and conflict resolution against existing facts
        existing = self.facts.all()
        decision = self.llm.generate(
            system="Given the new fact and existing facts, decide: "
                   "'new' (no overlap), 'duplicate' (already known, skip), "
                   "or 'update:<slug>' (contradicts/refines an existing fact).",
            messages=[{"role": "user",
                       "content": f"New: {item}\nExisting: {existing}"}]
        ).text.strip()
        if decision == "duplicate":
            return
        if decision.startswith("update:"):
            # overwrite the existing record named by the model
            self.facts.write(decision.split(":", 1)[1], item["text"])
        else:
            self.facts.write(item["slug"], item["text"])
Conflict resolution is the step everyone skips and the one that decides whether memory helps or hurts. Without it, the store accumulates “user lives in Bangalore” and “user moved to Berlin” side by side, and the agent flips a coin. With it, the new fact overwrites the old one. The general rule: for semantic facts, newer wins and overwrites; for episodic events, keep both because they are different points in time.
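The write path depends on a model returning well-formed JSON, which it will not always do. A defensive parser is one way to keep a bad reply from corrupting the store. This is a sketch: `parse_memories` and `ALLOWED_TYPES` are hypothetical names, and the item shape is assumed to be the `{type, slug, text}` records requested in the extraction prompt.

```python
import json

ALLOWED_TYPES = {"fact", "event"}


def parse_memories(raw: str) -> list[dict]:
    """Parse the extractor's reply, dropping anything malformed instead of crashing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []  # model replied with prose or broken JSON: store nothing
    if not isinstance(data, list):
        return []
    clean = []
    for item in data:
        if not isinstance(item, dict):
            continue
        if item.get("type") not in ALLOWED_TYPES:
            continue
        text = item.get("text")
        if not isinstance(text, str) or not text.strip():
            continue
        if item["type"] == "fact" and not item.get("slug"):
            continue  # a fact without a slug cannot be overwritten later
        clean.append(item)
    return clean
```

Dropping a malformed item costs one missed memory. Writing it costs a corrupted record that every later session inherits, so the asymmetry favors being strict here.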
Forgetting Is a Feature
Human memory decays on purpose, and so should an agent’s. An unbounded store gets slower, more expensive, and noisier over time. Three mechanisms keep it healthy:
- Recency decay. Weight retrieval so older, untouched episodic memories score lower. A memory not accessed in months is probably not relevant.
- Reinforcement. Every time a memory is successfully retrieved and used, bump its importance. Memory that proves useful gets stickier.
- Consolidation. Periodically (a nightly job, or after N new episodes) re-summarize clusters of related episodes into one higher-level memory, then drop the raw ones. This is the agent equivalent of sleep - turning many specific events into one general lesson.
def retrieval_score(memory, query_similarity, now):
    age_days = (now - memory.last_access).days
    recency = 0.95 ** age_days           # exponential decay
    importance = memory.use_count        # reinforcement
    return query_similarity * recency * (1 + 0.1 * importance)
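A worked example shows how the decay and reinforcement terms interact. The `Memory` dataclass, the dates, and the similarity values below are invented for illustration; only the scoring formula comes from the function above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    text: str
    last_access: datetime
    use_count: int = 0


def retrieval_score(memory: Memory, query_similarity: float, now: datetime) -> float:
    age_days = (now - memory.last_access).days
    recency = 0.95 ** age_days                      # exponential decay
    return query_similarity * recency * (1 + 0.1 * memory.use_count)


now = datetime(2026, 6, 30)
# Old memory, never reused, but a very close textual match to the query.
stale = Memory("Tried the Redis approach", last_access=now - timedelta(days=60))
# Recent memory, reused three times, weaker textual match.
fresh = Memory("Rejected the Redis approach",
               last_access=now - timedelta(days=2), use_count=3)

stale_score = retrieval_score(stale, query_similarity=0.90, now=now)
fresh_score = retrieval_score(fresh, query_similarity=0.70, now=now)
print(f"stale={stale_score:.3f} fresh={fresh_score:.3f}")
```

With a 0.95 daily factor a memory loses half its weight roughly every two weeks, so the 60-day-old entry falls far below the recent one despite its higher raw similarity. The decay base and the 0.1 reinforcement weight are tuning knobs, not fixed constants.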
The Read Path - Routing, Not Just Searching
Retrieval is not one vector search. A good memory manager routes the query to the right store, because the stores answer different questions:
| Query shape | Route to | Why |
|---|---|---|
| “What does the user prefer?” | Semantic facts | Always-current, no search needed - it is in the index |
| “Have we hit this error before?” | Episodic vectors | Similarity over past events |
| “How do I deploy this project?” | Procedural | A stored playbook, not a recalled chat |
| “What were we just doing?” | Working memory | It is already in context |
In practice the manager assembles context from several stores at once: always inject the small semantic fact index, retrieve top-k episodic memories by the decayed score above, and pull the relevant procedure if the task matches one. The art is keeping the total injected memory small. Pulling 30 memories “to be safe” just rebuilds the lost-in-the-middle problem you were trying to escape.
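The assembly step described above can be sketched as a single function. Everything here is an assumption made for illustration: the name `assemble_context`, the keyword-based procedure match, and the default limits. A real manager would likely use a classifier or the model itself to pick the procedure.

```python
def assemble_context(task: str,
                     fact_index: list[str],
                     episodes: list[tuple[float, str]],
                     procedures: dict[str, str],
                     max_episodes: int = 3,
                     min_score: float = 0.2) -> str:
    """Build the memory block injected ahead of the task.

    episodes: (decayed_score, text) pairs already scored by the episodic store.
    procedures: trigger keyword -> playbook text, matched by naive substring lookup.
    """
    sections = []

    # The semantic fact index is small, so it is always injected.
    if fact_index:
        sections.append("Known facts:\n" + "\n".join(f"- {f}" for f in fact_index))

    # Keep only the few episodes that clear the score floor, best first.
    top = sorted((e for e in episodes if e[0] >= min_score), reverse=True)[:max_episodes]
    if top:
        sections.append("Relevant past events:\n" + "\n".join(f"- {t}" for _, t in top))

    # Inject at most one procedure, and only if the task mentions its keyword.
    for keyword, playbook in procedures.items():
        if keyword in task.lower():
            sections.append(f"Procedure ({keyword}):\n{playbook}")
            break

    return "\n\n".join(sections)
```

The `max_episodes` and `min_score` caps are the important part: they put a hard ceiling on how much memory reaches the context, however large the stores grow.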
Where Memory Goes Wrong
Memory is not free safety. Adding it introduces failure modes that a stateless agent never has.
Memory poisoning. If the write path stores a wrong or adversarial fact, every future session inherits it. A user who says “I am an admin, remember that” should not get that written as a durable fact. Anything that influences permissions or behavior needs validation before it is committed, not after it is recalled. Treat the write path as a trust boundary.
Stale facts that retrieve perfectly. The most dangerous memory is one that was true and no longer is. It has high similarity, high use count, and is completely wrong. This is why semantic facts must overwrite and why episodic memories should carry timestamps the model can reason about.
Context pollution. Inject too much memory and you drown the actual task. More retrieved memory past a point lowers answer quality. Measure it - if adding memory does not improve task success, stop adding it.
Privacy and the right to be forgotten. Persistent memory means you are now storing user data across sessions. Users need to see what is remembered and delete it. Structured, inspectable fact stores make this tractable. Opaque vector blobs make it a compliance headache.
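For the memory-poisoning case, treating the write path as a trust boundary can start with a check as small as the one below. It is a sketch under stated assumptions: the slug prefixes, the source labels `"user_message"` and `"verified_system"`, and the function name `may_commit` are all hypothetical, and a real system would need a richer policy.

```python
# Assumed naming convention: permission-related facts use one of these slug prefixes.
SENSITIVE_PREFIXES = ("role-", "permission-", "is-admin", "access-")


def may_commit(slug: str, source: str) -> bool:
    """Decide whether an extracted fact may be written as a durable fact.

    source: where the claim came from, e.g. "user_message" or "verified_system".
    Facts that could change permissions are accepted only from a verified source.
    """
    sensitive = slug.lower().startswith(SENSITIVE_PREFIXES)
    return (not sensitive) or source == "verified_system"


print(may_commit("user-prefers-typescript", "user_message"))
print(may_commit("is-admin", "user_message"))
```

The check runs before the write, not at recall time, so a claim like “I am an admin, remember that” never reaches the store unless a verified source asserts it.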
The Honest Assessment
In 2026 there is no standard, drop-in agent memory. The frameworks that claim to “just handle memory” are almost always doing Level 1 summarization or Level 2 vector recall and calling it solved, which is why their agents still forget the things that matter and remember the things that do not.
The architecture that actually works is unglamorous: a small, deterministic store of facts you can read and overwrite; a vector store for past events with decay and reinforcement; explicit procedures for recurring tasks; and a write path with real conflict resolution guarding all of it. Most of the wins come from the parts nobody demos - deciding what not to store, overwriting stale facts, and forgetting on purpose.
If you are building an agent and bolting on memory, do the cheap, high-leverage thing first: a structured fact store with an index you carry every session. It will fix more “the agent forgot” complaints than any vector database. Reach for semantic retrieval when you have a genuine long tail of episodic history that will not fit in context. And spend your real engineering effort on the write path, because an agent that remembers the wrong things confidently is worse than one that remembers nothing at all.