9 minute read

TL;DR

Agent memory research focuses on what to remember. The harder problem is what to forget. Context rot degrades LLM output quality as accumulated context grows — a Chroma study found 30%+ accuracy drops from the lost-in-the-middle effect across 18 frontier models. In production Mem0 systems, 33% of stored facts became incorrect within 90 days. Here is how I think about memory eviction: what to keep, what to prune, and why forgetting is a feature, not a bug.

An overflowing filing cabinet beside a single clean document on a desk, representing the contrast between accumulated context and disciplined memory eviction

Why is forgetting harder than remembering?

Every agent framework tutorial shows you how to store memories. Add a vector database. Embed conversations. Retrieve relevant context. The implicit assumption: more memory equals better performance.

The data says otherwise.

Chroma’s 2025 research on context rot tested 18 frontier models — GPT-4.1, Claude Opus 4, Gemini 2.5, Qwen 3, and 14 others. Every model exhibited accuracy degradation as context length increased. The lost-in-the-middle effect showed 30%+ accuracy drops when relevant information sat in positions 5–15 versus the beginning or end of the context window. In typical agent contexts, the ratio of relevant to irrelevant tokens drops sharply as accumulated memories grow.

More memory doesn’t help if the model can’t find the signal in the noise.

What is truth decay and why does it kill agent quality?

MemMachine research formalized a problem practitioners already knew: stored facts rot over time. In production Mem0 systems, 33% of stored facts became incorrect within 90 days.

Facts decay for three reasons:

The world changes. A user’s job title, project status, preferred tools, or team composition shifts. The agent’s memory says “user works at Company X on Project Y” — but that was true three months ago. Now the agent confidently references outdated context, and the user loses trust.

The original extraction was imprecise. LLM-extracted summaries lose nuance. “User prefers Python” might have been “user prefers Python for data scripts but uses TypeScript for web apps.” The lossy compression creates a false memory that the agent treats as ground truth.

Conflicting information accumulates without resolution. Session 1: “User wants dark mode.” Session 47: “User prefers light backgrounds for presentations.” Both are stored. When the agent retrieves both, it has a contradiction with no mechanism to resolve which is current.

Research on reasoning shift under context load makes this worse. Work on intelligence degradation in long-context LLMs (arXiv:2601.15300) shows that verification accuracy declines substantially as context grows: models become progressively worse at catching contradictions in longer documents. So the more memories an agent accumulates, the less able it is to notice that those memories contradict each other.

How should you structure memory eviction?

I use three strategies, layered:

Tiered TTL (time-to-live). Not all memories deserve the same lifespan.

| Memory type | Example | TTL | Eviction behavior |
| --- | --- | --- | --- |
| Immutable facts | Allergies, identity, legal constraints | Infinite | Never auto-evict; manual override only |
| Stable preferences | Communication style, tool preferences | 90 days | Decay unless reinforced |
| Project context | Current tasks, recent decisions, active threads | 7–30 days | Aggressive pruning |
| Session context | Current conversation state | Session end | Delete on session close |

Refresh-on-read. When the agent retrieves and uses a memory, boost its relevance score and reset the decay timer. Memories that the agent actively references stay alive. Memories that sit untouched decay naturally. This creates a Darwinian selection pressure: useful memories survive, stale ones don’t.

Contradiction detection. Before storing a new memory, check if it contradicts an existing one. If “user prefers dark mode” already exists and the new memory says “user prefers light backgrounds,” flag it for resolution — don’t store both silently. The MemMachine approach stores entire conversational episodes rather than extracted summaries, preserving the context that explains why a preference changed.

```mermaid
flowchart TD
    NEW[New memory candidate] --> CHECK{Contradicts<br/>existing memory?}
    CHECK -->|No| STORE[Store with TTL<br/>based on type]
    CHECK -->|Yes| RESOLVE{Can conflict<br/>be resolved?}
    RESOLVE -->|Yes, newer wins| UPDATE[Update existing<br/>memory + log change]
    RESOLVE -->|No, ambiguous| FLAG[Store both +<br/>flag for human review]

    STORE --> DECAY[Relevance decays<br/>over time]
    DECAY --> USE{Retrieved by<br/>agent?}
    USE -->|Yes| REFRESH[Reset decay timer]
    USE -->|No| PRUNE{Below relevance<br/>threshold?}
    PRUNE -->|Yes| DELETE[Evict memory]
    PRUNE -->|No| DECAY

    style DELETE fill:#8b1a1a,stroke:#333,color:#fff
    style STORE fill:#2d5016,stroke:#333,color:#fff
    style FLAG fill:#8b6914,stroke:#333,color:#fff
```
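The store path of that flowchart fits in a few lines. In this sketch `contradicts` is a stand-in — production systems would use an LLM call or an NLI model, so the keyword rule here is purely hypothetical, as is the "newer timestamp wins" resolution policy:

```python
def contradicts(new: str, old: str) -> bool:
    # Hypothetical toy rule: known pairs of mutually exclusive preferences.
    pairs = [("dark mode", "light backgrounds")]
    return any((a in new and b in old) or (b in new and a in old)
               for a, b in pairs)

def store_memory(store: list[dict], text: str, timestamp: int) -> str:
    for item in store:
        if contradicts(text, item["text"]):
            if timestamp > item["timestamp"]:
                # Newer wins: update in place and log the superseded value.
                item.update(text=text, timestamp=timestamp,
                            history=item.get("history", []) + [item["text"]])
                return "updated"
            # Ambiguous: keep both, but flag for human review.
            store.append({"text": text, "timestamp": timestamp, "flagged": True})
            return "flagged"
    store.append({"text": text, "timestamp": timestamp})
    return "stored"
```

Keeping the superseded value in a `history` list preserves the "why did this change" context that episode-based approaches like MemMachine retain by design.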

When does accumulated memory become a security liability?

This is the angle most memory discussions ignore entirely.

The AgentLeak benchmark found 68.9% exposure rates in multi-agent systems. Sensitive data from one user’s context bleeds into another’s through shared memory stores. The most common vector isn’t sophisticated — it’s debug logging. A credential leakage study analyzing 17,000 agent skills found that 73.5% of leaks came from print and console.log statements. Agent frameworks capture stdout into LLM context, which means a logged API key or database credential becomes queryable in natural language.

Accumulated memory amplifies this risk. An agent that stores every tool call response, every API output, and every intermediate result creates a searchable archive of sensitive data. Even with access controls on the memory store, the LLM has already seen the data — and can be prompted to retrieve it.

Practical mitigations:

  • Scrub credentials before memory storage. Run a regex pattern matcher on every memory candidate. Reject anything that matches API key, token, password, or connection string patterns.
  • Separate reasoning and execution memory. The agent’s reasoning context (what it’s thinking) and execution context (what tools returned) should be stored separately with different access controls. The reasoning context needs the conclusions, not the raw tool outputs.
  • Scope memory by user. Never share memory across users, even in multi-tenant systems. A shared memory store is a cross-user data leak waiting to happen.

When should you use no persistent memory at all?

Sometimes the right eviction strategy is “evict everything.”

Stateless by design. Document processing, one-shot question answering, and code generation tasks don’t benefit from persistent memory. Each request is independent. Accumulated memory from previous requests adds noise without helping the current task.

Privacy requirements prohibit storage. If GDPR, HIPAA, or internal policies prevent storing user data between sessions, the cleanest architecture is stateless. Per-session context works, but nothing persists.

When clean context outperforms accumulated context. I’ve seen production chatbots where disabling persistent memory improved user satisfaction scores. The stale and contradictory memories were actively hurting response quality. Fresh context per session, built from the user’s current inputs, performed better than a year of accumulated (and partially rotten) memories.

Key takeaways

  • Context rot is real and universal. All 18 frontier models tested by Chroma showed accuracy degradation as context grew. More memory ≠ better performance.
  • In production Mem0 systems, 33% of stored facts became incorrect within 90 days. Memory needs maintenance, not just storage.
  • Use tiered TTL: immutable facts live forever, stable preferences decay over 90 days, project context prunes aggressively at 7–30 days.
  • Refresh-on-read keeps useful memories alive and lets stale ones die naturally.
  • Accumulated memory is a security liability. Debug logging causes 73.5% of credential leaks in agent systems. Scrub before storing.
  • Sometimes no persistent memory is the right architecture. Stateless agents outperform memory-laden ones when the memory has rotted.

FAQ

What is context rot in agent memory systems?

Context rot is the measurable degradation in LLM output quality as accumulated context grows. Chroma’s 2025 research found that across 18 frontier models (GPT-4.1, Claude Opus 4, Gemini 2.5), all exhibited accuracy drops at every context increment. The lost-in-the-middle effect shows 30%+ accuracy drops when relevant information sits in positions 5–15 versus the beginning or end of context.

How fast do stored facts become incorrect in agent memory?

In production Mem0 systems, 33% of stored facts became incorrect within 90 days. Facts decay because the world changes (a user’s job title, a project status, a preference), because the original extraction was imprecise, or because conflicting information was stored without resolving the contradiction.

How should agent memory handle eviction?

Use tiered TTL (time-to-live): immutable facts (allergies, identity) get infinite TTL, stable preferences get 90-day TTL, transient context (current project, recent decisions) gets 7–30 day TTL. Combine with refresh-on-read — boost a memory’s relevance score when the agent retrieves and uses it, resetting the decay timer. Memories that aren’t accessed decay naturally.

Is accumulated agent memory a security risk?

Yes. AgentLeak benchmark found 68.9% exposure rates in multi-agent systems where sensitive data from one context bleeds into another through shared memory. Debug logging is the worst vector — 73.5% of credential leaks in agent systems come from print/console.log statements that get captured into LLM context, making logged credentials queryable in natural language.

When should an agent system use no persistent memory?

When conversations are stateless by design (one-shot question answering, document processing), when privacy requirements prohibit storing user data between sessions, or when the cost of maintaining accurate memory exceeds the benefit of personalization. Many production chatbots perform better with clean context per session than with accumulated memory that includes stale or contradictory facts.


Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch