If you're adding persistent memory to an LLM app, mem0 is the path of least resistance. It's an open-source memory layer that sits between your app and your model, exposes four primitives — add, search, update, delete — over an embeddings store, and handles the awkward parts: extracting what's worth remembering, scoping memories per user or session, and serving them back through similarity search. You don't rebuild the agent loop; you call memory.add(...) after each turn and memory.search(...) before the next one.
> **Tip:** mem0 sits between your app and your LLM as a thin memory layer — that's the gain when you don't want to rebuild the agent loop, and the friction when memory ergonomics need to be load-bearing rather than incidental.
Key takeaways
- mem0 is a memory layer, not an agent framework. It composes with whatever orchestration you already have.
- Four primitives — `add`, `search`, `update`, `delete` — plus `get_all` for inspection. That's the whole API surface most apps touch.
- The differentiator is the LLM extraction step on `add`: raw input becomes structured atomic memories before storage.
- Multi-level scoping (`user_id`, `agent_id`, `run_id`) decides what's recalled where. Forgetting to pass them is the most common production bug.
- Graph memory is optional and Neo4j-backed. Default off unless you actually need to traverse relationships.
- mem0 vs Letta is a real choice: thin layer over your loop vs. agent framework where memory is load-bearing.
- mem0 is not a RAG replacement. They compose — mem0 for user-specific memory, RAG for static corpus knowledge.
What mem0 is
mem0 is an open-source library for adding persistent memory to LLM applications. It ships as a Python package and a TypeScript package, and the core abstraction is a Memory class. You instantiate Memory, point it at an embedding model and a vector store, and call its methods to write and read memories.
The four primitives:
- `add` — write a new turn or piece of text into memory. Triggers LLM extraction.
- `search` — retrieve memories relevant to a query, scoped by IDs.
- `update` — modify an existing memory by ID.
- `delete` — remove a memory by ID.
There's also get_all for listing memories under a scope, useful for debugging and admin views.
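For example, dumping everything stored for one user (assuming the dict-shaped return of recent mem0 versions; older releases returned a bare list):

```python
# List every memory under Alice's user scope -- useful for checking
# what the extraction step actually kept.
all_memories = memory.get_all(user_id="alice")
for m in all_memories["results"]:
    print(m["id"], "->", m["memory"])
```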
Three scoping IDs control what's recalled where:
- `user_id` — memories tied to a specific human end user. Cross-session.
- `run_id` — memories tied to a specific session or run. Resets per session.
- `agent_id` — memories tied to a specific agent identity in a multi-agent system.
You can run mem0 self-hosted (you operate the vector store and the LLM endpoint) or on the hosted Mem0 platform (mem0 operates the infrastructure for you). The library ships an MCP server option so tool-using agents can call mem0's primitives as tools without your code being the middleman.
That's the surface. Everything else is about how the extraction step and scoping IDs interact with your prompt assembly.
The architecture: embeddings + LLM extraction
The thing that separates mem0 from a hand-rolled vector store is what happens on add.
When you write a raw turn — say, "user said: I prefer Postgres for everything except analytics, where I use ClickHouse" — mem0 doesn't just chunk it and embed it. It runs an LLM extraction pass first. The LLM reads the input and emits structured "memories" — atomic facts worth remembering, deduplicated against what's already stored, sometimes phrased to be retrieval-friendly rather than transcript-faithful. Those facts then get embedded and stored.
On search, mem0 embeds the query and runs a similarity search over the stored memory embeddings, returning the top matches. You inject those into the next prompt.
Two consequences of this design.
First, memories are dense. A long turn becomes a handful of facts; a transcript-style RAG would store the same content as a chunk full of filler. When the prompt is tight, atomic memories pay for themselves in tokens-per-useful-fact.
Second, every add costs an LLM call. That's the honest tradeoff. If your app does a lot of writes and few reads, the extraction step is expensive overhead. If reads dominate writes — as in most chat apps — the cost is fine and the retrieval quality goes up.
The extraction LLM is configurable. You can point mem0 at OpenAI, Anthropic, a local model via Ollama, or whatever else fits your stack. Same for the embedding model. The vector store is also pluggable — Qdrant, Chroma, pgvector, others.
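A sketch of that configuration, using mem0's dict-based config shape (provider names and model IDs here are illustrative, not recommendations):

```python
from mem0 import Memory

# Each component swaps independently. The "llm" entry is the extraction
# model only -- it never generates user-facing replies.
config = {
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-4o-mini"},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {"collection_name": "memories"},
    },
}

memory = Memory.from_config(config)
```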
Multi-level memory: user / session / agent
Scoping IDs are the mechanism by which mem0 keeps memories from different contexts from leaking into each other. They're also the mechanism most often misused.
User-level memory uses user_id. It persists across sessions. Use it for stable facts about a person — preferences, recurring projects, names of their teammates, the project they're shipping. Anything that should still be true a month from now belongs at user scope.
memory.add("Prefers concise summaries with bullet points",
user_id="alice")
memory.search("what summary style?", user_id="alice")
Session-level memory uses run_id. It scopes a memory to a single conversation or task run. Use it for transient context — what the user is currently working on, scratchpad notes, partial state that only matters for this conversation.
memory.add("Currently debugging a CORS issue with the staging API",
user_id="alice", run_id="session-2026-05-04-abc")
Agent-level memory uses agent_id. In multi-agent systems, this lets a coding agent and a research agent each have their own memories about the same user. A fact recalled by the coding agent doesn't pollute the research agent's prompt.
memory.add("User prefers Python over TypeScript for backend",
user_id="alice", agent_id="code-agent")
You can combine IDs. user_id="alice" plus agent_id="code-agent" plus run_id="session-X" scopes a memory three ways. On search, mem0 returns memories matching the IDs you pass — pass user_id only and you'll see all that user's memories regardless of agent or session; pass user + agent and you'll see only that agent's slice.
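A sketch of how the scopes narrow on search, reusing the hypothetical IDs from above:

```python
# Broadest: everything mem0 knows about Alice, across agents and sessions.
memory.search("deployment preferences", user_id="alice")

# Narrower: only the coding agent's memories about Alice.
memory.search("deployment preferences",
              user_id="alice", agent_id="code-agent")

# Narrowest: this user, this agent, this session.
memory.search("deployment preferences", user_id="alice",
              agent_id="code-agent", run_id="session-X")
```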
The discipline this requires is real. The chat handler that calls mem0 needs to know who the user is, what session it's in, and which agent is acting. Most production failures with mem0 are scoping failures: forgetting to pass user_id and watching memories pool into a single global namespace, or passing the wrong run_id and never recalling the session's own context. See the AI memory systems guide for the broader taxonomy this fits inside.
The four primitives in practice
The typical loop with mem0 looks like this:
1. Search for relevant memories before generating.
2. Inject them into the prompt.
3. Generate the assistant's reply.
4. Add the user turn and the assistant turn to memory.
5. Repeat.
```python
from mem0 import Memory

memory = Memory()

def handle_turn(user_id: str, run_id: str, user_message: str) -> str:
    # 1. Search: recall the most relevant memories for this user and session.
    relevant = memory.search(query=user_message,
                             user_id=user_id, run_id=run_id, limit=5)

    # 2. Inject: build a clearly labeled block for the system prompt.
    memory_block = "\n".join(m["memory"] for m in relevant["results"])
    system_prompt = f"Relevant context:\n{memory_block}\n\nAnswer the user."

    # 3. Generate: `llm` stands in for your own client -- any provider works.
    reply = llm.generate(system=system_prompt, user=user_message)

    # 4. Add both sides of the turn; mem0's extraction step decides what to keep.
    memory.add([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ], user_id=user_id, run_id=run_id)

    return reply
```
update and delete matter when memories go wrong.
update is the corrective primitive. The user says "actually, my favorite framework is Svelte, not React" — you find the contradicting memory and update it rather than letting both versions sit in the store as competing facts.
```python
memory.update(memory_id="mem_abc123",
              data="Favorite frontend framework: Svelte")
```
delete is the right-to-be-forgotten primitive. The user asks you to forget something, the agent realizes a memory was wrong, or a memory has aged out — call delete and it's gone.
```python
memory.delete(memory_id="mem_abc123")
```
In practice, a lot of teams skip update and just keep adding new memories on top of old ones. That accumulates contradictions. The extraction LLM will sometimes deduplicate or supersede automatically, but not reliably. If correctness matters, update is not optional — see the SurePrompts Quality Rubric on why correctness needs an explicit story.
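A minimal sketch of that correction flow, assuming the top search hit is the memory to supersede (real code should verify the match before overwriting; `correct_memory` is a hypothetical helper, not a mem0 API):

```python
def correct_memory(user_id: str, query: str, corrected_fact: str) -> None:
    # Find the stored memory the user's correction contradicts.
    hits = memory.search(query=query, user_id=user_id, limit=1)
    results = hits["results"]
    if results:
        # Overwrite in place instead of adding a competing fact.
        memory.update(memory_id=results[0]["id"], data=corrected_fact)
    else:
        memory.add(corrected_fact, user_id=user_id)

correct_memory("alice", "favorite frontend framework",
               "Favorite frontend framework: Svelte")
```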
Graph memory
mem0 ships an optional graph memory mode backed by Neo4j. Instead of storing memories as flat embeddings, the LLM extraction step pulls out entities and relationships and writes them into a graph. "Alice manages Bob" becomes nodes for Alice and Bob with a manages edge between them.
When does this matter? When traversal queries are real. "Who reports to Alice's directs?" "Which projects share a lead with Project X?" "What customers belong to the same account as this one?" Embeddings won't answer those well; a graph will.
When doesn't it matter? Most of the time. Conversational memory, user preferences, recurring task context, working notes — embeddings are enough. Graph mode adds a database to operate, slows down add (entity extraction is heavier than fact extraction), and only helps if your retrieval actually traverses.
Default decision: leave graph mode off. Turn it on when you have a concrete query you can't express without it. See memory recall for how retrieval quality changes with the underlying store.
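When you do flip it on, it's a config addition rather than an API change. A sketch of the shape, with placeholder Neo4j connection values:

```python
from mem0 import Memory

# Adding a graph_store entry enables graph memory alongside the vector store.
# The connection values are placeholders for your own Neo4j instance.
config = {
    "graph_store": {
        "provider": "neo4j",
        "config": {
            "url": "neo4j+s://your-instance.databases.neo4j.io",
            "username": "neo4j",
            "password": "...",
        },
    },
}

memory = Memory.from_config(config)
# add and search keep the same signatures; extraction now also
# writes entities and relationships into the graph.
```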
mem0 vs Letta
mem0 and Letta both add memory to LLM apps, but they make different bets about where memory lives in the stack.
| Dimension | mem0 | Letta |
|---|---|---|
| Architecture | Thin memory layer | Full agent framework |
| Memory primitives | CRUD (add/search/update/delete) | Tool-based block editing |
| Integration | Drop-in to existing app | Build agent inside Letta |
| Cost model | LLM extraction on every add | Reflection during idle / on context pressure |
| State surface | Flat list of memories + optional graph | Memory blocks (core, archival, recall) |
| Best fit | Adding memory to an existing chat app | Designing an agent where memory is load-bearing |
| Self-hosted | Yes | Yes |
| Hosted option | Mem0 platform | Letta Cloud |
The honest read: pick mem0 when you already have an agent loop and you don't want to rewrite it. Pick Letta when you're starting from a memory-first design and you want the memory model and the agent model to be the same model. The sibling piece, agent memory architectures compared, lays this out across more frameworks.
A useful tell: how much work is it to remove memory from your app? If the answer is "delete a few mem0 calls," mem0 is the right shape. If the answer is "I'd have to rebuild the agent," that's the Letta shape.
mem0 vs vector-only RAG
People sometimes ask if mem0 replaces RAG. It doesn't. They solve different problems.
| Dimension | mem0 | Vector-only RAG |
|---|---|---|
| Read or write | Read + write | Read-only over a corpus |
| Source | Conversation, user input, runtime events | Indexed documents |
| Storage unit | Atomic extracted memories | Chunks of source text |
| Update story | First-class via update / delete | Reindex |
| Scoping | Per user / session / agent | Per index / namespace |
| Lifetime | Long-lived, mutable | Mostly static |
| When to use | User-specific, conversational memory | Static knowledge corpus |
In a real app you'll usually want both. RAG handles the docs corpus — your product manual, your codebase, whatever the model needs to look up. mem0 handles what the model has learned about this user from their interactions. The prompt assembly step combines them: retrieve docs, retrieve memories, inject both, generate. See hybrid search and chunking for the RAG side and conversation memory for the mem0 side.
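A sketch of that assembly step, with `retrieve_docs` standing in for whatever RAG retriever you already run (a hypothetical helper, not part of mem0):

```python
def build_prompt(user_id: str, user_message: str) -> str:
    # Static corpus knowledge from your RAG pipeline (hypothetical helper).
    docs = retrieve_docs(user_message)
    # User-specific memories from mem0.
    mems = memory.search(query=user_message, user_id=user_id, limit=5)
    memory_block = "\n".join(m["memory"] for m in mems["results"])
    return (
        f"Reference docs:\n{docs}\n\n"
        f"Relevant memories about this user:\n{memory_block}\n\n"
        "Answer the user."
    )
```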
Integration patterns
The basic pattern fits cleanly into any chat handler. The shape:
```
incoming user turn
  → search mem0 with the user message
  → inject memories into the system prompt
  → call your LLM
  → add the user turn + assistant turn to mem0
  → return the reply
```
This works with OpenAI, Anthropic, local models — mem0 doesn't care which LLM generates the reply. The only model mem0 cares about is the one you pointed it at for extraction.
A few patterns worth pulling out:
Inject memories as a system block, not as fake history. The temptation is to inject memories as fake user or assistant turns. Don't. Put them in a clearly-labeled section in the system prompt — "Relevant memories about this user:" — so the model knows what they are and the conversation history stays clean. This composes with the agentic prompt stack layout.
Cap the number of injected memories. Set limit on search and stick to it. Five to ten relevant memories is usually plenty; injecting fifty bloats the prompt and dilutes signal.
Tag the scope on every call. Every add and search should pass user_id at minimum. If you're in a multi-agent system, agent_id too. If the conversation has a session boundary, run_id too. Treat missing IDs as a programming error, not a default.
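One way to enforce that is a thin guard at the handler boundary (`scoped_add` is a hypothetical wrapper, not a mem0 API):

```python
def scoped_add(messages, *, user_id: str, run_id: str | None = None,
               agent_id: str | None = None):
    # Fail loudly rather than letting memories pool into a global namespace.
    if not user_id:
        raise ValueError("user_id is required for every memory write")
    return memory.add(messages, user_id=user_id,
                      run_id=run_id, agent_id=agent_id)
```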
Use the MCP server option for tool-using agents. If your agent already speaks MCP, mem0's MCP server lets it call add, search, update, and delete as tools directly, with no glue code in your handler. The agent decides when to remember and when to recall. This pairs well with the OpenAI Agents SDK and Mastra patterns.
Stage the extraction LLM. You don't need your most expensive model doing extraction. A cheaper model — a small Claude, a small GPT, a local model via Ollama — handles fact extraction fine, and the savings add up over thousands of writes.
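A sketch of pointing extraction at a local model, assuming an Ollama server on its default port (the model name and `ollama_base_url` key reflect mem0's config shape as commonly documented; verify against the docs for your version):

```python
config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:8b",  # small local model handles extraction fine
            "ollama_base_url": "http://localhost:11434",
        },
    },
}

memory = Memory.from_config(config)
```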
For where mem0 fits in the broader maturity progression, see the Context Engineering Maturity Model. Memory is one of the things that moves you up the levels.
Common failure modes
A short tour of the bugs that show up in production.
Noisy or duplicate memories. Symptom: get_all returns ten variations of the same fact. Cause: extraction LLM is too eager, or the prompt encourages restating. Fix: configure the extraction prompt to deduplicate, periodically run a cleanup pass that calls update to merge variants, or upgrade the extraction model.
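A sketch of a cleanup pass, deliberately simplified: exact-match dedup only, where real variants need fuzzy or LLM-assisted comparison before merging via `update`:

```python
def cleanup_user_memories(user_id: str) -> None:
    memories = memory.get_all(user_id=user_id)["results"]
    seen: dict[str, str] = {}
    for m in memories:
        key = m["memory"].strip().lower()
        if key in seen:
            # Drop exact duplicates; near-duplicates need smarter merging.
            memory.delete(memory_id=m["id"])
        else:
            seen[key] = m["id"]
```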
Search recall is too broad. Symptom: irrelevant memories injected into prompts, model gets confused or contradicts itself. Cause: similarity threshold too loose, or limit too high. Fix: tighten limit, raise the similarity threshold if your store exposes one, and inspect what's coming back via get_all for the user.
update never called. Symptom: contradictions accumulate — the user said "I prefer X" and later "I prefer Y" and now both are in memory. Cause: handler writes new memories but never modifies old ones. Fix: when the user explicitly corrects a fact, search for the old memory and call update on it. Build this into the handler, not as an afterthought.
Scoping IDs forgotten. Symptom: a user sees memories that aren't theirs. Cause: user_id not passed on add or search. Fix: assert presence of user_id at the entry point of your handler. Treat missing as a 500. This is non-negotiable in multi-tenant SaaS.
Graph memory enabled and ignored. Symptom: latency on add is higher than expected, Neo4j sits idle. Cause: someone turned graph mode on for a "real-world graphs" demo and it stayed on. Fix: turn it off if you're not actually traversing. The cost is real and the benefit is zero if reads don't use the graph.
Extraction model too aggressive. Symptom: the model is "remembering" things the user never said — inferences turned into memories. Cause: the extraction prompt is encouraging the LLM to extrapolate. Fix: tighten the extraction prompt to "facts explicitly stated" and review a sample weekly. This connects to the RCAF discipline of constraining outputs.
Episodic and semantic memory conflated. Symptom: every turn becomes a "memory," and search returns episode-level details that should have been forgotten. Fix: be selective about what you add. Not every turn is worth remembering. See episodic vs semantic memory for agents for the distinction.
When mem0 is the right tool
mem0 fits when:
- You have an existing LLM app and want persistent memory without rebuilding.
- You're running multi-tenant SaaS where user-scoped memory matters and isolation is mandatory.
- You want CRUD memory ergonomics — explicit add, search, update, delete — rather than the model managing memory through tools.
- You need cross-session continuity for end users and don't want to reinvent it.
- You want a layer that composes with whatever orchestration you already have (OpenAI Agents SDK, Mastra, plain handlers, custom).
mem0 is overkill or wrong-shape when:
- You're shipping a single-session prototype where memory doesn't need to persist past the conversation.
- You're building an agent where memory is load-bearing in the design itself — Letta is the better bet there.
- You have only static corpus knowledge and no per-user memory — pure RAG is enough.
- You can't afford the extraction LLM cost on every write — a hand-rolled vector store with raw chunks may be cheaper for write-heavy, read-light workloads.
The decision rule: if removing mem0 would mean editing a handful of call sites in your handler, it's the right shape. If removing it would mean redesigning the agent, you should have used Letta.
What to read next
If you're zooming out from mem0 specifically:
- AI memory systems guide — the three shapes of memory (within-session, provider-managed, application-managed) and where mem0 fits.
- Agent memory architectures compared — mem0, Letta, Zep, and the rest, side by side.
- Episodic vs semantic memory for agents — the distinction that decides what's worth remembering.
If you're picking a different memory model:
- Letta walkthrough — the agent-framework-with-memory alternative.
If you're zooming out from memory to the larger stack:
- Agentic Prompt Stack — where memory sits in the agent prompt layout.
- Context Engineering Maturity Model — what level adding memory moves you to.
- SurePrompts Quality Rubric — how to evaluate whether memory is actually helping.