If you're adding persistent memory to an LLM app, mem0 is the path of least resistance. It's an open-source memory layer that sits between your app and your model, exposes four primitives — add, search, update, delete — over an embeddings store, and handles the awkward parts: extracting what's worth remembering, scoping memories per user or session, and serving them back through similarity search. You don't rebuild the agent loop; you call memory.add(...) after each turn and memory.search(...) before the next one.
> **Tip:** mem0 sits between your app and your LLM as a thin memory layer — that's the gain when you don't want to rebuild the agent loop, and the friction when memory ergonomics need to be load-bearing rather than incidental.
Key takeaways
- mem0 is a memory layer, not an agent framework. It composes with whatever orchestration you already have.
- Four primitives — `add`, `search`, `update`, `delete` — plus `get_all` for inspection. That's the whole API surface most apps touch.
- The differentiator is the LLM extraction step on `add`: raw input becomes structured atomic memories before storage.
- Multi-level scoping (`user_id`, `agent_id`, `run_id`) decides what's recalled where. Forgetting to pass them is the most common production bug.
- Graph memory is optional and Neo4j-backed. Default off unless you actually need to traverse relationships.
- mem0 vs Letta is a real choice: thin layer over your loop vs. agent framework where memory is load-bearing.
- mem0 is not a RAG replacement. They compose — mem0 for user-specific memory, RAG for static corpus knowledge.
What mem0 is
mem0 is an open-source library for adding persistent memory to LLM applications. It ships as a Python package and a TypeScript package, and the core abstraction is a Memory class. You instantiate Memory, point it at an embedding model and a vector store, and call its methods to write and read memories.
The four primitives:
- `add` — write a new turn or piece of text into memory. Triggers LLM extraction.
- `search` — retrieve memories relevant to a query, scoped by IDs.
- `update` — modify an existing memory by ID.
- `delete` — remove a memory by ID.
There's also get_all for listing memories under a scope, useful for debugging and admin views.
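For example, dumping everything stored for one user (assuming the dict-shaped return of recent mem0 versions; older releases returned a bare list):

```python
# List every memory under Alice's user scope -- useful for checking
# what the extraction step actually kept.
all_memories = memory.get_all(user_id="alice")
for m in all_memories["results"]:
    print(m["id"], "->", m["memory"])
```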
Three scoping IDs control what's recalled where:
- `user_id` — memories tied to a specific human end user. Cross-session.
- `run_id` — memories tied to a specific session or run. Resets per session.
- `agent_id` — memories tied to a specific agent identity in a multi-agent system.
You can run mem0 self-hosted (you operate the vector store and the LLM endpoint) or on the hosted Mem0 platform (mem0 operates the infrastructure for you). The library ships an MCP server option so tool-using agents can call mem0's primitives as tools without your code being the middleman.
That's the surface. Everything else is about how the extraction step and scoping IDs interact with your prompt assembly.
The architecture: embeddings + LLM extraction
The thing that separates mem0 from a hand-rolled vector store is what happens on add.
When you write a raw turn — say, "user said: I prefer Postgres for everything except analytics, where I use ClickHouse" — mem0 doesn't just chunk it and embed it. It runs an LLM extraction pass first. The LLM reads the input and emits structured "memories" — atomic facts worth remembering, deduplicated against what's already stored, sometimes phrased to be retrieval-friendly rather than transcript-faithful. Those facts then get embedded and stored.
On search, mem0 embeds the query and runs a similarity search over the stored memory embeddings, returning the top matches. You inject those into the next prompt.
Two consequences of this design.
First, memories are dense. A long turn becomes a handful of facts; a transcript-style RAG would store the same content as a chunk full of filler. When the prompt is tight, atomic memories pay for themselves in tokens-per-useful-fact.
Second, every add costs an LLM call. That's the honest tradeoff. If your app does a lot of writes and few reads, the extraction step is expensive overhead. If reads dominate writes — as in most chat apps — the cost is fine and the retrieval quality goes up.
The extraction LLM is configurable. You can point mem0 at OpenAI, Anthropic, a local model via Ollama, or whatever else fits your stack. Same for the embedding model. The vector store is also pluggable — Qdrant, Chroma, pgvector, others.
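A sketch of that configuration, using mem0's dict-based config shape (provider names and model IDs here are illustrative, not recommendations):

```python
from mem0 import Memory

# Each component swaps independently. The "llm" entry is the extraction
# model only -- it never generates user-facing replies.
config = {
    "llm": {
        "provider": "openai",
        "config": {"model": "gpt-4o-mini"},
    },
    "embedder": {
        "provider": "openai",
        "config": {"model": "text-embedding-3-small"},
    },
    "vector_store": {
        "provider": "qdrant",
        "config": {"collection_name": "memories"},
    },
}

memory = Memory.from_config(config)
```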
Multi-level memory: user / session / agent
Scoping IDs are the mechanism by which mem0 keeps memories from different contexts from leaking into each other. They're also the mechanism most often misused.
User-level memory uses user_id. It persists across sessions. Use it for stable facts about a person — preferences, recurring projects, names of their teammates, the project they're shipping. Anything that should still be true a month from now belongs at user scope.
memory.add("Prefers concise summaries with bullet points",
user_id="alice")
memory.search("what summary style?", user_id="alice")
Session-level memory uses run_id. It scopes a memory to a single conversation or task run. Use it for transient context — what the user is currently working on, scratchpad notes, partial state that only matters for this conversation.
memory.add("Currently debugging a CORS issue with the staging API",
user_id="alice", run_id="session-2026-05-04-abc")
Agent-level memory uses agent_id. In multi-agent systems, this lets a coding agent and a research agent each have their own memories about the same user. A fact recalled by the coding agent doesn't pollute the research agent's prompt.
memory.add("User prefers Python over TypeScript for backend",
user_id="alice", agent_id="code-agent")
You can combine IDs. user_id="alice" plus agent_id="code-agent" plus run_id="session-X" scopes a memory three ways. On search, mem0 returns memories matching the IDs you pass — pass user_id only and you'll see all that user's memories regardless of agent or session; pass user + agent and you'll see only that agent's slice.
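A sketch of how the scopes narrow on search, reusing the hypothetical IDs from above:

```python
# Broadest: everything mem0 knows about Alice, across agents and sessions.
memory.search("deployment preferences", user_id="alice")

# Narrower: only the coding agent's memories about Alice.
memory.search("deployment preferences",
              user_id="alice", agent_id="code-agent")

# Narrowest: this user, this agent, this session.
memory.search("deployment preferences", user_id="alice",
              agent_id="code-agent", run_id="session-X")
```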
The discipline this requires is real. The chat handler that calls mem0 needs to know who the user is, what session it's in, and which agent is acting. Most production failures with mem0 are scoping failures: forgetting to pass user_id and watching memories pool into a single global namespace, or passing the wrong run_id and never recalling the session's own context. See the AI memory systems guide for the broader taxonomy this fits inside.
The four primitives in practice
The typical loop with mem0 looks like this:
1. Search for relevant memories before generating.
2. Inject them into the prompt.
3. Generate the assistant's reply.
4. Add the user turn and the assistant turn to memory.
5. Repeat.
```python
from mem0 import Memory

memory = Memory()

def handle_turn(user_id: str, run_id: str, user_message: str) -> str:
    # 1. Search: recall the most relevant memories for this user and session.
    relevant = memory.search(query=user_message,
                             user_id=user_id, run_id=run_id, limit=5)

    # 2. Inject: build a clearly labeled block for the system prompt.
    memory_block = "\n".join(m["memory"] for m in relevant["results"])
    system_prompt = f"Relevant context:\n{memory_block}\n\nAnswer the user."

    # 3. Generate: `llm` stands in for your own client -- any provider works.
    reply = llm.generate(system=system_prompt, user=user_message)

    # 4. Add both sides of the turn; mem0's extraction step decides what to keep.
    memory.add([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply},
    ], user_id=user_id, run_id=run_id)

    return reply
```
update and delete matter when memories go wrong.
update is the corrective primitive. The user says "actually, my favorite framework is Svelte, not React" — you find the contradicting memory and update it rather than letting both versions sit in the store as competing facts.
```python
memory.update(memory_id="mem_abc123",
              data="Favorite frontend framework: Svelte")
```
delete is the right-to-be-forgotten primitive. The user asks you to forget something, the agent realizes a memory was wrong, or a memory has aged out — call delete and it's gone.
```python
memory.delete(memory_id="mem_abc123")
```
In practice, a lot of teams skip update and just keep adding new memories on top of old ones. That accumulates contradictions. The extraction LLM will sometimes deduplicate or supersede automatically, but not reliably. If correctness matters, update is not optional — see the SurePrompts Quality Rubric on why correctness needs an explicit story.
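A minimal sketch of that correction flow, assuming the top search hit is the memory to supersede (real code should verify the match before overwriting; `correct_memory` is a hypothetical helper, not a mem0 API):

```python
def correct_memory(user_id: str, query: str, corrected_fact: str) -> None:
    # Find the stored memory the user's correction contradicts.
    hits = memory.search(query=query, user_id=user_id, limit=1)
    results = hits["results"]
    if results:
        # Overwrite in place instead of adding a competing fact.
        memory.update(memory_id=results[0]["id"], data=corrected_fact)
    else:
        memory.add(corrected_fact, user_id=user_id)

correct_memory("alice", "favorite frontend framework",
               "Favorite frontend framework: Svelte")
```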
Graph memory
mem0 ships an optional graph memory mode backed by Neo4j. Instead of storing memories as flat embeddings, the LLM extraction step pulls out entities and relationships and writes them into a graph. "Alice manages Bob" becomes nodes for Alice and Bob with a manages edge between them.
When does this matter? When traversal queries are real. "Who reports to Alice's directs?" "Which projects share a lead with Project X?" "What customers belong to the same account as this one?" Embeddings won't answer those well; a graph will.
When doesn't it matter? Most of the time. Conversational memory, user preferences, recurring task context, working notes — embeddings are enough. Graph mode adds a database to operate, slows down add (entity extraction is heavier than fact extraction), and only helps if your retrieval actually traverses.
Default decision: leave graph mode off. Turn it on when you have a concrete query you can't express without it. See memory recall for how retrieval quality changes with the underlying store.
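When you do flip it on, it's a config addition rather than an API change. A sketch of the shape, with placeholder Neo4j connection values:

```python
from mem0 import Memory

# Adding a graph_store entry enables graph memory alongside the vector store.
# The connection values are placeholders for your own Neo4j instance.
config = {
    "graph_store": {
        "provider": "neo4j",
        "config": {
            "url": "neo4j+s://your-instance.databases.neo4j.io",
            "username": "neo4j",
            "password": "...",
        },
    },
}

memory = Memory.from_config(config)
# add and search keep the same signatures; extraction now also
# writes entities and relationships into the graph.
```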
mem0 vs Letta
mem0 and Letta both add memory to LLM apps, but they make different bets about where memory lives in the stack.
| Dimension | mem0 | Letta |
|---|---|---|
| Architecture | Thin memory layer | Full agent framework |
| Memory primitives | CRUD (add/search/update/delete) | Tool-based block editing |
| Integration | Drop-in to existing app | Build agent inside Letta |
| Cost model | LLM extraction on every add | Reflection during idle / on context pressure |
| State surface | Flat list of memories + optional graph | Memory blocks (core, archival, recall) |
| Best fit | Adding memory to an existing chat app | Designing an agent where memory is load-bearing |
| Self-hosted | Yes | Yes |
| Hosted option | Mem0 platform | Letta Cloud |
The honest read: pick mem0 when you already have an agent loop and you don't want to rewrite it. Pick Letta when you're starting from a memory-first design and you want the memory model and the agent model to be the same model. The sibling piece, agent memory architectures compared, lays this out across more frameworks.
A useful tell: how much work is it to remove memory from your app? If the answer is "delete a few mem0 calls," mem0 is the right shape. If the answer is "I'd have to rebuild the agent," that's the Letta shape.
mem0 vs vector-only RAG
People sometimes ask if mem0 replaces RAG. It doesn't. They solve different problems.
| Dimension | mem0 | Vector-only RAG |
|---|---|---|
| Read or write | Read + write | Read-only over a corpus |
| Source | Conversation, user input, runtime events | Indexed documents |
| Storage unit | Atomic extracted memories | Chunks of source text |
| Update story | First-class via update / delete | Reindex |
| Scoping | Per user / session / agent | Per index / namespace |
| Lifetime | Long-lived, mutable | Mostly static |
| When to use | User-specific, conversational memory | Static knowledge corpus |
In a real app you'll usually want both. RAG handles the docs corpus — your product manual, your codebase, whatever the model needs to look up. mem0 handles what the model has learned about this user from their interactions. The prompt assembly step combines them: retrieve docs, retrieve memories, inject both, generate. See hybrid search and chunking for the RAG side and conversation memory for the mem0 side.
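A sketch of that assembly step, with `retrieve_docs` standing in for whatever RAG retriever you already run (a hypothetical helper, not part of mem0):

```python
def build_prompt(user_id: str, user_message: str) -> str:
    # Static corpus knowledge from your RAG pipeline (hypothetical helper).
    docs = retrieve_docs(user_message)
    # User-specific memories from mem0.
    mems = memory.search(query=user_message, user_id=user_id, limit=5)
    memory_block = "\n".join(m["memory"] for m in mems["results"])
    return (
        f"Reference docs:\n{docs}\n\n"
        f"Relevant memories about this user:\n{memory_block}\n\n"
        "Answer the user."
    )
```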
Integration patterns
The basic pattern fits cleanly into any chat handler. The shape:
```
incoming user turn
  → search mem0 with the user message
  → inject memories into the system prompt
  → call your LLM
  → add the user turn + assistant turn to mem0
  → return the reply
```
This works with OpenAI, Anthropic, local models — mem0 doesn't care which LLM generates the reply. The only model mem0 cares about is the one you pointed it at for extraction.
A few patterns worth pulling out:
Inject memories as a system block, not as fake history. The temptation is to inject memories as fake user or assistant turns. Don't. Put them in a clearly-labeled section in the system prompt — "Relevant memories about this user:" — so the model knows what they are and the conversation history stays clean. This composes with the agentic prompt stack layout.
Cap the number of injected memories. Set limit on search and stick to it. Five to ten relevant memories is usually plenty; injecting fifty bloats the prompt and dilutes signal.
Tag the scope on every call. Every add and search should pass user_id at minimum. If you're in a multi-agent system, agent_id too. If the conversation has a session boundary, run_id too. Treat missing IDs as a programming error, not a default.
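One way to enforce that is a thin guard at the handler boundary (`scoped_add` is a hypothetical wrapper, not a mem0 API):

```python
def scoped_add(messages, *, user_id: str, run_id: str | None = None,
               agent_id: str | None = None):
    # Fail loudly rather than letting memories pool into a global namespace.
    if not user_id:
        raise ValueError("user_id is required for every memory write")
    return memory.add(messages, user_id=user_id,
                      run_id=run_id, agent_id=agent_id)
```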
Use the MCP server option for tool-using agents. If your agent already speaks MCP, mem0's MCP server lets it call add, search, update, and delete as tools directly, with no glue code in your handler. The agent decides when to remember and when to recall. This pairs well with the OpenAI Agents SDK and Mastra patterns.
Stage the extraction LLM. You don't need your most expensive model doing extraction. A cheaper model — a small Claude, a small GPT, a local model via Ollama — handles fact extraction fine, and the savings add up over thousands of writes.
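A sketch of pointing extraction at a local model, assuming an Ollama server on its default port (the model name and `ollama_base_url` key reflect mem0's config shape as commonly documented; verify against the docs for your version):

```python
config = {
    "llm": {
        "provider": "ollama",
        "config": {
            "model": "llama3.1:8b",  # small local model handles extraction fine
            "ollama_base_url": "http://localhost:11434",
        },
    },
}

memory = Memory.from_config(config)
```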
For where mem0 fits in the broader maturity progression, see the Context Engineering Maturity Model. Memory is one of the things that moves you up the levels.
Common failure modes
A short tour of the bugs that show up in production.
Noisy or duplicate memories. Symptom: get_all returns ten variations of the same fact. Cause: extraction LLM is too eager, or the prompt encourages restating. Fix: configure the extraction prompt to deduplicate, periodically run a cleanup pass that calls update to merge variants, or upgrade the extraction model.
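A sketch of a cleanup pass, deliberately simplified: exact-match dedup only, where real variants need fuzzy or LLM-assisted comparison before merging via `update`:

```python
def cleanup_user_memories(user_id: str) -> None:
    memories = memory.get_all(user_id=user_id)["results"]
    seen: dict[str, str] = {}
    for m in memories:
        key = m["memory"].strip().lower()
        if key in seen:
            # Drop exact duplicates; near-duplicates need smarter merging.
            memory.delete(memory_id=m["id"])
        else:
            seen[key] = m["id"]
```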
Search recall is too broad. Symptom: irrelevant memories injected into prompts, model gets confused or contradicts itself. Cause: similarity threshold too loose, or limit too high. Fix: tighten limit, raise the similarity threshold if your store exposes one, and inspect what's coming back via get_all for the user.
update never called. Symptom: contradictions accumulate — the user said "I prefer X" and later "I prefer Y" and now both are in memory. Cause: handler writes new memories but never modifies old ones. Fix: when the user explicitly corrects a fact, search for the old memory and call update on it. Build this into the handler, not as an afterthought.
Scoping IDs forgotten. Symptom: a user sees memories that aren't theirs. Cause: user_id not passed on add or search. Fix: assert presence of user_id at the entry point of your handler. Treat missing as a 500. This is non-negotiable in multi-tenant SaaS.
Graph memory enabled and ignored. Symptom: latency on add is higher than expected, Neo4j sits idle. Cause: someone turned graph mode on for a "real-world graphs" demo and it stayed on. Fix: turn it off if you're not actually traversing. The cost is real and the benefit is zero if reads don't use the graph.
Extraction model too aggressive. Symptom: the model is "remembering" things the user never said — inferences turned into memories. Cause: the extraction prompt is encouraging the LLM to extrapolate. Fix: tighten the extraction prompt to "facts explicitly stated" and review a sample weekly. This connects to the RCAF discipline of constraining outputs.
Episodic and semantic memory conflated. Symptom: every turn becomes a "memory," and search returns episode-level details that should have been forgotten. Fix: be selective about what you add. Not every turn is worth remembering. See episodic vs semantic memory for agents for the distinction.
When mem0 is the right tool
mem0 fits when:
- You have an existing LLM app and want persistent memory without rebuilding.
- You're running multi-tenant SaaS where user-scoped memory matters and isolation is mandatory.
- You want CRUD memory ergonomics — explicit add, search, update, delete — rather than the model managing memory through tools.
- You need cross-session continuity for end users and don't want to reinvent it.
- You want a layer that composes with whatever orchestration you already have (OpenAI Agents SDK, Mastra, plain handlers, custom).
mem0 is overkill or wrong-shape when:
- You're shipping a single-session prototype where memory doesn't need to persist past the conversation.
- You're building an agent where memory is load-bearing in the design itself — Letta is the better bet there.
- You have only static corpus knowledge and no per-user memory — pure RAG is enough.
- You can't afford the extraction LLM cost on every write — a hand-rolled vector store with raw chunks may be cheaper for write-heavy, read-light workloads.
The decision rule: if removing mem0 would mean editing a handful of call sites in your handler, it's the right shape. If removing it would mean redesigning the agent, you should have used Letta.
What to read next
If you're zooming out from mem0 specifically:
- AI memory systems guide — the three shapes of memory (within-session, provider-managed, application-managed) and where mem0 fits.
- Agent memory architectures compared — mem0, Letta, Zep, and the rest, side by side.
- Episodic vs semantic memory for agents — the distinction that decides what's worth remembering.
If you're picking a different memory model:
- Letta walkthrough — the agent-framework-with-memory alternative.
If you're zooming out from memory to the larger stack:
- Agentic Prompt Stack — where memory sits in the agent prompt layout.
- Context Engineering Maturity Model — what level adding memory moves you to.
- SurePrompts Quality Rubric — how to evaluate whether memory is actually helping.