Agent memory in 2026: the 47-author taxonomy that maps what’s built and what’s missing

TL;DR — A 47-author survey maps the full landscape of agent memory architecture. Mapping current tools against the taxonomy reveals a clear production gap: procedural memory. Agents remember what happened and what’s true, but not how to do things. This is the memory type with the highest benchmark impact and the least production tooling.
Forty-seven researchers walk into a taxonomy
Agent memory has a naming problem. Working memory, episodic memory, semantic memory, long-term memory, short-term memory, parametric memory, external memory. The same terms mean different things across papers, frameworks, and product docs. A team implementing “long-term memory” in LangMem gets something architecturally different from what Mem0 calls long-term memory, which differs again from what MemGPT ships under the same label.
In December 2025, 47 researchers published “Memory in the Age of AI Agents” (arXiv:2512.13564). It is the field’s first serious attempt at consensus. The paper does not propose a new memory system. It proposes a shared language for describing all of them.
The taxonomy organizes agent memory along three axes.
Forms describe what memory looks like at the implementation level. Token-level memory is raw text in the context window, which is what most “memory” implementations actually are. Parametric memory is knowledge baked into model weights through fine-tuning or RLHF; changes require retraining. Latent memory sits between the two: compressed representations like embeddings, hidden states, and KV cache entries.
Functions describe what memory is for. Factual memory stores stable knowledge (“the user’s timezone is IST,” “the codebase uses PostgreSQL”). Experiential memory records what happened: past interactions, task outcomes, conversation history. Working memory holds active task state: the current plan, intermediate reasoning steps, tool call results. The scratchpad. Procedural memory captures how to do things: reusable strategies, workflows, and tool sequences.
Dynamics describe how memory changes. Formation is how new memories get created from experience. Evolution is how stored memories update, consolidate, or decay over time. Retrieval is how relevant memories surface when needed.
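The three axes can be modeled as a small type system. This is purely illustrative; the survey proposes the vocabulary, not this code, and the enum values are my own shorthand for its terms:

```python
from dataclasses import dataclass
from enum import Enum

class Form(Enum):
    TOKEN = "token-level"      # raw text in the context window
    PARAMETRIC = "parametric"  # knowledge baked into model weights
    LATENT = "latent"          # embeddings, hidden states, KV cache

class Function(Enum):
    FACTUAL = "factual"            # stable knowledge
    EXPERIENTIAL = "experiential"  # what happened
    WORKING = "working"            # active task state
    PROCEDURAL = "procedural"      # how to do things

class Dynamic(Enum):
    FORMATION = "formation"    # creating memories from experience
    EVOLUTION = "evolution"    # updating, consolidating, decaying
    RETRIEVAL = "retrieval"    # surfacing relevant memories

@dataclass(frozen=True)
class MemoryCell:
    """One cell in the taxonomy: a (form, function, dynamic) coordinate."""
    form: Form
    function: Function
    dynamic: Dynamic

# Example: a vector store of user preferences queried at inference time
# is latent-form factual memory exercised through retrieval.
pref_lookup = MemoryCell(Form.LATENT, Function.FACTUAL, Dynamic.RETRIEVAL)
```

Placing a tool at a coordinate like this is what makes gaps visible: some (form, function, dynamic) cells have multiple production systems, others have none.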
This matters beyond academic bookkeeping. The three-axis system exposes gaps that a simpler “short-term vs. long-term” framing hides.
Mapping production tooling to the taxonomy
The survey gives us the map. Here’s what’s built in each territory.
| Memory function | What it stores | Production tools | Benchmark state | Gap severity |
|---|---|---|---|---|
| Factual (semantic) | User preferences, domain facts, stable knowledge | Mem0 (68.4% LLM score), Memori (20x token compression), pgvector | Mature — LoCoMo, LongMemEval | Low |
| Experiential (episodic) | Interaction history, task outcomes, conversation records | MemPalace (96.6% LongMemEval), Zep, LangMem | Improving — but MemPalace scores questioned | Medium |
| Working | Current plan, intermediate state, active reasoning | PRIME (gradient-free evolution), MemGPT paging | Emerging — no standard benchmark | Medium-High |
| Procedural | How to perform tasks, reusable strategies, tool sequences | Almost nothing in production | Early — ALFWorld, TravelPlanner | Critical |
Read the table left to right and the coverage drops off a cliff. Factual memory has multiple production systems with real benchmarks. Episodic memory has a growing tool ecosystem, boosted by MemPalace going viral (23,000 GitHub stars in 48 hours). Working memory has promising research. Procedural memory has papers.
MemPalace: episodic memory’s moment
MemPalace launched on April 5, 2026 and hit 23,000 GitHub stars in two days. That velocity says something about demand. Teams want episodic memory infrastructure and don’t have it.
The architecture borrows from the ancient memory palace mnemonic. Memories organize into five spatial levels:
```
Wings (people/projects)
└── Halls (memory types: facts, events, reasoning, updates, synthesis)
    └── Rooms (conversation threads)
        └── Tunnels (cross-wing connections)
            └── Drawers (individual entries in ChromaDB)
```
The two-pass retrieval system classifies queries into hall categories for precision, then searches the full corpus with category-aware bonuses. Everything runs locally. ChromaDB for vectors, SQLite for metadata, optional Llama reranking. Zero cloud calls, zero API costs.
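The two-pass shape can be sketched in a few lines. This is a toy stand-in, not MemPalace's implementation: the keyword classifier and the pluggable `similarity` function are placeholders for whatever learned components the real system uses, and `bonus` is an invented parameter:

```python
from dataclasses import dataclass

HALLS = ("facts", "events", "reasoning", "updates", "synthesis")

@dataclass
class Entry:
    text: str
    hall: str  # one of HALLS

def classify_hall(query: str) -> str:
    """Pass 1: route the query to a hall category.
    Toy keyword routing; a real system would use a classifier or LLM."""
    if any(w in query.lower() for w in ("when", "last", "yesterday", "happened")):
        return "events"
    return "facts"

def two_pass_search(query, corpus, similarity, top_k=5, bonus=0.15):
    """Pass 2: score the FULL corpus, boosting entries whose hall
    matches the pass-1 classification (the category-aware bonus)."""
    hall = classify_hall(query)
    scored = [(similarity(query, e.text) + (bonus if e.hall == hall else 0.0), e)
              for e in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]
```

The design point is that pass 1 narrows intent without narrowing the search space: every entry is still scored, so a misclassified query degrades ranking rather than losing the answer entirely.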
The honest benchmark numbers: 96.6% on LongMemEval, 34% retrieval improvement over flat ChromaDB search. The 100% LoCoMo claim that circulated on launch day was inflated. Setting top_k=50 on pools of 19-32 items is not a meaningful retrieval test. Independent reviews found real-world accuracy drops to 84.2% with AAAK compression enabled. One developer reported 17% accuracy when routing MemPalace output through an LLM for downstream tasks.
The hype is real and the benchmarks need scrutiny. But the underlying architecture (spatial hierarchy over flat vector search) does address a genuine retrieval-quality problem for episodic memory at scale.
MemMachine: when memories stop being true
Storing memories accurately is half the problem. MemMachine (arXiv:2604.04853) goes after the other half: truth decay.
You tell an agent you work at Company A. You change jobs. The agent keeps drafting emails introducing you as working at Company A. The memory system stored the fact correctly, retrieves it fast, and acts on it with confidence. The fact is just wrong.
MemMachine introduces ground-truth preservation: periodic verification of stored facts against available evidence. When contradictions surface between a stored fact and new information, MemMachine reconciles them rather than overwriting. It preserves the fact’s history while updating the current state.
On the taxonomy’s dynamics axis, MemMachine operates at the evolution layer. Most production tools handle formation (creating memories) and retrieval (finding them) well. Almost none handle evolution, the ongoing maintenance of memory accuracy over time.
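The evolve-rather-than-overwrite idea can be sketched as a reconciling fact store. This is not MemMachine's API, just a minimal illustration of preserving a fact's history while updating its current state:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Fact:
    """A stored fact plus the history of values it has held."""
    key: str
    value: str
    history: list = field(default_factory=list)  # superseded (value, timestamp) pairs

class ReconcilingStore:
    """Evolution-layer sketch: a contradiction updates the current value
    but never erases what was previously believed."""
    def __init__(self):
        self._facts = {}

    def assert_fact(self, key: str, value: str):
        now = datetime.now(timezone.utc)
        fact = self._facts.get(key)
        if fact is None:
            self._facts[key] = Fact(key, value)
        elif fact.value != value:
            # Contradiction detected: archive the old value, promote the new one.
            fact.history.append((fact.value, now))
            fact.value = value

    def current(self, key: str) -> str:
        return self._facts[key].value

    def provenance(self, key: str) -> list:
        return self._facts[key].history
```

Keeping provenance is what separates reconciliation from overwriting: the agent can still answer "where did the user work before?" after the employer fact changes.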
PRIME: working memory that learns without training
PRIME (arXiv:2604.07645) from Carnegie Mellon works a different cell in the taxonomy: working memory evolution. During multi-turn interactions, agents need to refine their understanding of what the user actually wants. Not recall facts or replay past events, but actively update their model of the current task.
The standard approach is reinforcement learning: expensive, and bad at credit assignment across long interaction horizons. PRIME sidesteps this entirely with a gradient-free framework. It distills interaction trajectories into structured experiences across three zones: successful strategies, failure patterns, and user preferences. These experiences evolve through meta-level operations and guide future behavior through retrieval-augmented generation.
In practice, this means an agent that gets better at serving you across conversations without anyone retraining the model. The experiences are human-readable. You can inspect what the agent learned, edit incorrect lessons, and understand why it made specific decisions. That kind of interpretability is rare when most memory systems compress everything into opaque embeddings.
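A gradient-free experience store in the spirit of PRIME might look like the sketch below. The three zones come from the paper's description; the distillation heuristics and method names here are placeholders for PRIME's LLM-driven meta-level operations:

```python
ZONES = ("successful_strategies", "failure_patterns", "user_preferences")

class ExperienceStore:
    """Structured, human-readable lessons instead of weight updates."""
    def __init__(self):
        self.zones = {z: [] for z in ZONES}

    def distill(self, trajectory: dict):
        """Turn one interaction trajectory into zone entries.
        Illustrative rules; the real system distills with a model."""
        if trajectory.get("succeeded"):
            self.zones["successful_strategies"].append(trajectory["strategy"])
        else:
            self.zones["failure_patterns"].append(trajectory["strategy"])
        for pref in trajectory.get("preferences", []):
            if pref not in self.zones["user_preferences"]:
                self.zones["user_preferences"].append(pref)

    def as_prompt_context(self) -> str:
        """Experiences guide future behavior via retrieval-augmented
        generation: render them as plain text for the prompt."""
        return "\n".join(f"[{zone}] {entry}"
                         for zone, entries in self.zones.items()
                         for entry in entries)
```

Because the store is plain text, the interpretability claim follows directly: inspecting or editing a bad lesson is a list operation, not a weight surgery.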
The real gap: procedural memory
The taxonomy makes this obvious once you look at it.
The agent memory ecosystem is pouring resources into episodic memory (what happened) and semantic memory (what’s true). MemPalace, Mem0, Memori, Zep, LangMem. Every major memory framework focuses on storing and retrieving facts and events.
Nobody is building production tooling for procedural memory: teaching agents how to do things.
The Mem^p paper (arXiv:2508.06433) defines procedural memory as “distilled, chronicled, and reusable lessons from experiential trajectories.” Not just “User prefers Python” (factual) or “Last Tuesday, the deployment failed” (episodic), but “When a deployment fails with exit code 137, check memory limits first, then container resource quotas, then the OOM killer logs, in that order, because the first two are faster to verify and cover 80% of cases.”
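Concretely, a stored procedure could look like the sketch below, using the exit-code-137 lesson above as the entry. The `Procedure` shape and the `recall_procedure` matcher are hypothetical; Mem^p's actual representation may differ, and a real system would match triggers semantically rather than by substring:

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    """A distilled, reusable lesson: when <trigger>, do <steps>, because <rationale>."""
    trigger: str
    steps: list
    rationale: str

PLAYBOOK = [
    Procedure(
        trigger="deployment failed with exit code 137",
        steps=["check memory limits",
               "check container resource quotas",
               "check OOM killer logs"],
        rationale="the first two are faster to verify and cover most cases",
    ),
]

def recall_procedure(observation: str):
    """Crude substring matching against stored triggers."""
    obs = observation.lower()
    for proc in PLAYBOOK:
        if proc.trigger in obs:
            return proc
    return None
```

The point of the structure: the steps are ordered by the rationale, so the agent replays a proven diagnostic sequence instead of re-deriving one per incident.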
The benchmark numbers make the case. On ALFWorld with GPT-4o:
- Without procedural memory: 42.14% task completion, 23.76 average steps
- With procedural memory: 77.86% task completion, 15.01 average steps
That is a 35.7 percentage point improvement in task completion and 37% fewer steps. The agent succeeds more often and finishes faster because it has learned reusable strategies instead of reasoning from scratch every time.
The paper also shows cross-model transfer: procedural memories extracted from GPT-4o improve Qwen2.5-14B’s task completion by 5 percentage points. The learned strategies are model-agnostic. They transfer because they encode workflow knowledge, not model-specific reasoning patterns.
Yet production tooling for procedural memory is almost nonexistent. Most agents encode procedural knowledge through handwritten prompts, hardcoded tool chains, or function definitions. When the procedure needs to change, a developer rewrites the prompt. The agent never learns from its own experience.
```mermaid
graph TB
    subgraph "Memory Types — Current Tooling Coverage"
        F["Factual (Semantic)<br/>Mem0, Memori, pgvector<br/>Coverage: HIGH"] --> R["Retrieval Layer"]
        E["Experiential (Episodic)<br/>MemPalace, Zep, LangMem<br/>Coverage: MEDIUM-HIGH"] --> R
        W["Working Memory<br/>PRIME, MemGPT paging<br/>Coverage: MEDIUM"] --> R
        P["Procedural Memory<br/>Mem^p (research only)<br/>Coverage: CRITICAL GAP"] -.-> R
    end
    R --> A["Agent Execution"]
    subgraph "Dynamics (Evolution)"
        D["Truth Verification<br/>MemMachine"] --> F
        D --> E
        EV["Experience Distillation<br/>PRIME"] --> W
        PL["Skill Learning<br/>???"] -.-> P
    end
    style P fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px
    style PL fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px
```
What practitioners should build first
The taxonomy gives us a prioritized roadmap. Not every memory type matters equally for every use case, but the sequencing holds across most production agent deployments.
Start with factual memory in weeks 1-2. Use Mem0 or Memori. Store user preferences, domain facts, configuration state. This is the fastest path to personalization that persists across sessions. Mem0's graph-enhanced mode achieves 68.4% accuracy at roughly 1,800 tokens per conversation; its vector-only mode cuts p95 latency by 91% versus full context for a 6 percentage point accuracy trade-off. Memori offers 20x token compression via semantic triples if token budget is the binding constraint.
Add episodic memory in weeks 3-4. Bring in interaction history with MemPalace, Zep, or a pgvector-backed store. This gives you multi-session coherence: the agent knows what you discussed last Tuesday, what tasks succeeded or failed, what decisions were made and why. The three-tier memory stack provides the architecture. MemPalace’s spatial hierarchy or Zep’s temporal indexing provides the implementation.
Layer in truth verification around weeks 5-6. Apply MemMachine-style ground-truth checks over your factual and episodic stores. Without this, memory accuracy degrades over time as user contexts change. The verification layer is not a separate system. It is a constraint applied to your existing memory stores.
Implement working memory evolution in weeks 7-8. PRIME-style experience distillation matters most for agents that repeatedly perform similar tasks for the same user. The agent should get better at the 50th interaction than it was at the 5th.
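The layers in this roadmap compose. Here is a toy facade showing how the first three fit together; the class, method names, and the `verify` hook are illustrative, not any tool's API:

```python
class AgentMemory:
    """Composition sketch of the roadmap's layers: a factual store
    (weeks 1-2), an episodic log (weeks 3-4), and a truth-verification
    hook applied over the factual layer (weeks 5-6)."""
    def __init__(self, verify=None):
        self.facts = {}       # factual memory: key -> current value
        self.episodes = []    # episodic memory: append-only event log
        self.verify = verify  # callable(key, value) -> bool, or None

    def remember_fact(self, key, value):
        self.facts[key] = value

    def log_episode(self, event):
        self.episodes.append(event)

    def recall_fact(self, key):
        value = self.facts.get(key)
        if value is not None and self.verify and not self.verify(key, value):
            # Stale fact: drop it rather than act on it confidently.
            del self.facts[key]
            return None
        return value
```

Note that verification lives inside `recall_fact` rather than as a separate system, matching the point above: it is a constraint on the existing store, checked at the moment the agent is about to act on a fact.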
Procedural memory is the frontier, and the field needs builders here. The gap between “agent has episodic memory” and “agent learns reusable task strategies” is where the next order-of-magnitude improvement lives. The Mem^p benchmarks show the impact. The production tooling does not exist yet.
The numbers that matter
The Mem0 team’s State of AI Agent Memory 2026 report provides production baselines:
| Approach | Accuracy (LLM score) | p95 latency | Tokens per conversation |
|---|---|---|---|
| Full context | 72.9% | 17.12s | ~26,000 |
| Mem0 (graph-enhanced) | 68.4% | 2.59s | ~1,800 |
| Mem0 (vector-only) | 66.9% | 1.44s | ~1,800 |
| RAG | 61.0% | 0.70s | varies |
| OpenAI Memory | 52.9% | varies | varies |
The latency-accuracy trade-off is the architectural decision that defines production memory systems. Full context is the most accurate but up to 12x slower at p95. Selective memory (Mem0, Memori) accepts a small accuracy hit for dramatic latency and token reduction. The right choice depends on whether your agent is interactive (latency matters) or batch (accuracy matters).
Models that score near-perfectly on LoCoMo plummet to 40-60% on MemoryArena. That gap between passive recall and active, decision-relevant memory use is the real benchmark problem. Retrieval accuracy and task-relevant memory use are different capabilities, and most tools optimize for the former.
Memory is architecture, not a feature
The 47-author taxonomy settles an important question: agent memory is not a feature you bolt onto an agent framework. It is an architectural domain with its own type system, its own failure modes, and its own evolution dynamics.
The field has spent 2024-2025 building factual and episodic memory infrastructure. That work is maturing. MemMachine and PRIME are pushing into evolution dynamics, making memories that stay true and working memory that learns. MemPalace is making episodic retrieval spatially intelligent.
But the taxonomy’s biggest gap, procedural memory, remains wide open. Agents that learn how to do things, not just what happened or what’s true. The memory architectures we described as foundational are exactly that: foundations. The building that goes on top needs procedural memory as a load-bearing wall.
Build factual and episodic memory now. Add truth verification. Layer in working memory evolution. Then turn your attention to the procedural gap. That is where the 35 percentage point improvements are waiting.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch