Which memory architecture should you use for your agent? It depends on three things: whether you control the model and orchestration loop, whether memory is load-bearing for the product (the agent is useless without it) or incidental, and whether you need user-scoped persistence across sessions. Five architectures dominate in 2026 — provider-managed memory (ChatGPT memory, Claude Projects, Gemini saved info), self-managing agents (Letta-style), CRUD memory layers (mem0-style), vector-only RAG-as-memory, and custom in-app schemas. Each has a real fit and real friction. This guide compares them on the dimensions that matter and gives a decision rule that ends with a single recommendation, not a "depends."
Tip: There is no universally right memory architecture — there is the one that matches your control surface, your durability needs, and your token budget. Picking the wrong one means rewriting later, and rewriting memory means migrating user data, which is the one migration users actually notice.
Key takeaways:
- Five architectures cover the field: provider-managed, self-managing (Letta), memory layer (mem0), vector RAG, custom schema. The right one depends on control surface, persistence needs, and token budget — not on which is "best."
- Provider memory is zero effort and zero control. It is the right choice when your product is the provider platform integration, and the wrong choice the moment you need to query, audit, or migrate.
- Letta puts memory hygiene inside the agent loop. mem0 puts it outside as a CRUD API. The choice is about whether the agent should reason about its own memory or whether memory should be a service the agent calls.
- Vector-only RAG-as-memory is cheap and read-only. It is the wrong tool the moment you need to update or contradict a stored fact.
- Custom schemas give the most control and cost the most engineering time. They are usually the right end state for mature products, not the right starting point.
- Mixing architectures works if every memory type has exactly one source of truth. Two systems writing the same user facts produces contradictions you cannot fix.
- Use the AI memory systems guide for the underlying memory taxonomy and the Agentic Prompt Stack Layer 4 for how memory connects to the rest of the agent prompt.
The 5 agent memory architectures in 2026
The field has consolidated around five distinct camps. Each makes a different bet about who owns the memory, who edits it, and where the work happens.
Provider-managed memory
The model provider stores facts about the user across sessions and injects them into the context window automatically. ChatGPT memory does this for ChatGPT users. Claude Projects gives a project-scoped persistent context. Gemini saved info plays a similar role inside Google's surfaces.
Fit. Zero engineering effort to enable. The provider handles storage, retrieval, dedup, and conflict resolution. If you ship a custom GPT, a Claude Project workspace, or any agent that lives inside a provider's first-party app, provider memory is what your users already expect, and it is what they can manage from their own settings.
Friction. Opaque. You cannot enumerate what is stored, query it programmatically, scope it to a sub-feature of your product, or migrate it to a different model. The schema is the provider's. The retrieval rules are the provider's. The privacy posture is the provider's. If you build on top of provider memory and later need to move to your own backend, there is no export path — you are starting conversation memory from scratch for every user.
Self-managing agent (Letta-style)
The agent edits its own memory blocks via tool calls. Letta (originally MemGPT) gives the agent a structured memory area — typed blocks like persona, human, and a working set — and tools that let it rewrite those blocks during normal turns. Memory hygiene happens inside the agent loop, not as a separate pre- or post-processing step.
Fit. When memory is load-bearing and the right write decisions are themselves a reasoning task. The agent decides what is worth remembering and when to consolidate, the same way a human does between meetings. Persistence and structured editing are built in. Multi-agent setups inherit per-agent memory blocks naturally.
Friction. Opinionated. You adopt Letta's runtime, its block schema, and its loop shape. Token cost goes up on most turns because the agent spends some of each turn reasoning about and editing its own memory. If your agent does not need that reasoning — if it just needs "remember this fact" — Letta is heavier than the problem.
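The pattern is easier to see in code than in prose. Below is a minimal sketch of in-loop memory editing: the agent is handed tools that rewrite its own typed blocks, and the orchestrator executes whichever tool the model calls mid-turn. The function names echo MemGPT-style memory tools, but the implementation is a toy, not Letta's API.

```python
# Illustrative sketch of the self-managing pattern -- not Letta's real API.
# The agent is given tool functions that rewrite its own typed memory blocks;
# the orchestrator executes whichever tool the model calls during a turn.

memory_blocks = {
    "persona": "You are a concise coaching assistant.",
    "human": "",          # facts the agent has learned about the user
    "working": "",        # scratch space for the current task
}

def core_memory_append(block: str, text: str) -> str:
    """Tool: append a new fact to a memory block."""
    memory_blocks[block] = (memory_blocks[block] + "\n" + text).strip()
    return f"appended to block '{block}'"

def core_memory_replace(block: str, old: str, new: str) -> str:
    """Tool: rewrite part of a memory block (the agent decides when)."""
    memory_blocks[block] = memory_blocks[block].replace(old, new)
    return f"updated block '{block}'"

# In a real loop these are exposed to the model as tool schemas and the
# model emits the calls. Simulated here:
core_memory_append("human", "User's name is Sam; prefers morning sessions.")
core_memory_replace("human", "morning sessions", "evening sessions")

# The blocks are then serialized into the system prompt on the next turn.
print(memory_blocks["human"])
```

The token cost described above is visible here: every `core_memory_*` call is a tool round-trip the model has to reason its way into, on top of the turn's actual work.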
Memory layer (mem0-style)
A drop-in CRUD layer over embeddings plus an LLM extraction step. mem0 and similar layers expose add, search, update, delete against a memory store with multi-level scoping (user, session, agent). Your application code calls add after a turn, search before a turn, and the layer handles extraction, embedding, dedup, and retrieval.
Fit. Bolting memory onto an existing app with the least architectural disruption. The agent loop does not change; you add two API calls around it. Multi-level scoping handles the common case (per-user facts, per-session working memory, per-agent persona) without you designing a schema. Migration from no-memory to mem0 is usually a week, not a quarter.
Friction. The LLM extraction step on every add costs tokens. Reads are cheap, writes are not. The layer makes opinionated decisions about what to extract — sometimes it stores facts you did not want stored, or fails to store ones you did. You debug those decisions through prompts to the extraction model, which is a layer of indirection between your intent and the stored result.
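The calling pattern is the whole point: two calls wrapped around an otherwise unchanged agent loop. A toy sketch of that shape follows; the class and its trivial keyword matcher are illustrative stand-ins, not mem0's implementation.

```python
# Toy sketch of the memory-layer calling pattern -- illustrative, not mem0.
# A real layer runs LLM extraction + dedup on add() (the token cost) and
# embedding similarity on search(); both are stubbed with keyword overlap.
from dataclasses import dataclass, field

@dataclass
class MemoryLayer:
    store: list = field(default_factory=list)

    def add(self, text: str, user_id: str) -> None:
        # Real layers: LLM extraction decides what is worth storing here.
        self.store.append({"text": text, "user_id": user_id})

    def search(self, query: str, user_id: str, k: int = 3) -> list:
        # Real layers: embedding similarity. Stub: shared-word overlap.
        scored = [
            (len(set(query.lower().split()) & set(m["text"].lower().split())), m)
            for m in self.store
            if m["user_id"] == user_id
        ]
        return [m for score, m in sorted(scored, key=lambda s: -s[0]) if score][:k]

memory = MemoryLayer()

# After a turn: write what the user said.
memory.add("User is vegetarian and allergic to peanuts.", user_id="u42")

# Before the next turn: read relevant facts into the prompt.
hits = memory.search("suggest a dinner recipe for this user", user_id="u42")
```

Note that the agent loop itself never appears in this sketch, which is exactly the architectural claim: memory is a service the app calls, not something the agent reasons about.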
Vector-only RAG-as-memory
Embed every turn (or every important utterance), store in a vector index, retrieve by semantic similarity. No structured fields, no updates, no dedup. The store is append-only and the retrieval is "find the k most similar past turns to the current one."
Fit. Cheap, simple, and fits patterns your team already uses if you have RAG in production. For agents where "memory" really means "remember the gist of past conversations so we do not re-explain things," it is enough. Storage and read costs are dominated by the embedding model, which is a known quantity.
Friction. No structured updates. If a user changes their preferences, both the old and new statement live in the index, and retrieval might surface either one. No contradiction resolution. Recall quality depends entirely on the embedding model and the chunking strategy — a mistake in either silently degrades memory recall without throwing an error. Treating vector memory as the only memory architecture is a common entry-level mistake.
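The contradiction problem is easy to reproduce. In the stdlib-only sketch below, a bag-of-words cosine stands in for a real embedding model; the store and retrieval logic are the actual append-only pattern.

```python
# Append-only vector memory with a toy bag-of-words "embedding" --
# illustrative; a real system uses an embedding model and an ANN index.
import math
from collections import Counter

index = []  # append-only list of (vector, original text)

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def remember(text: str) -> None:
    index.append((embed(text), text))  # no update, no dedup

def recall(query: str, k: int = 2) -> list:
    q = embed(query)
    return [t for _, t in sorted(index, key=lambda e: -cosine(q, e[0]))][:k]

remember("User said they are vegetarian.")       # three months ago
remember("User said they are pescatarian now.")  # yesterday

# Both statements surface on recall; nothing in the store says
# which one is current.
print(recall("what diet does the user follow?"))
```

Fixing this requires either an update path (mem0, custom schema) or a delete-and-rewrite convention your application enforces, at which point you are no longer "vector-only."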
Custom in-app schema
Postgres or Redis tables holding typed memory rows, plus an LLM summarizer running outside the request path on a background job. The agent reads from your tables on each turn (joined to the user, the session, whatever scope you need) and writes through your application code with whatever validation you want.
Fit. Mature products where memory shape is well understood and the team already runs a database. Maximum control: you own the schema, the indexes, the scoping, the privacy story, the export format, and the migration path. You can answer "what does the system remember about this user" with a SQL query, which provider memory and Letta cannot.
Friction. Maximum maintenance. You build extraction, dedup, conflict resolution, scoping, decay, and retrieval yourself. Teams often start here, ship a thin version, then realize they have rebuilt mem0 badly. The right time to pick custom is after you know what your memory schema actually wants to be — usually after a year on a layer like mem0 or Letta.
The big comparison
The five architectures across the dimensions that drive the decision.
| Dimension | Provider-managed | Letta (self-managing) | mem0 (memory layer) | Vector-only RAG | Custom schema |
|---|---|---|---|---|---|
| Control surface | None — provider owns it | High — you own the runtime | Medium — you own scopes, layer owns extraction | High on writes, none on retrieval logic unless you build it | Maximum — you own everything |
| Persistence model | Provider-managed, opaque | Persistent memory blocks edited in-loop | Persistent CRUD store with scoping | Append-only vector index | Whatever you build |
| Scoping (user / session / agent) | User-only, provider-defined | Per-agent blocks, scope is the agent | Built-in multi-level scoping | None native — you tag and filter | Whatever you model |
| Structured updates | No | Yes — blocks are typed and rewritable | Yes — update/delete API | No — append-only | Yes — SQL updates |
| Cost per turn | Bundled into provider price | Higher — memory hygiene runs in-loop | Read cheap, write costs an LLM extraction call | Cheap reads, embedding cost on writes | Whatever you build |
| Vendor lock-in | High — provider-specific | Medium — Letta runtime | Low — open layer, swappable | Low — vector DB is portable | None |
| Learning curve | None — toggle a setting | Steep — adopt a runtime and loop shape | Low — three API calls | Low if you have RAG, medium otherwise | High — you design it all |
| When to pick | Your product is the provider integration | Memory is load-bearing and write decisions are themselves reasoning | You need persistent user-scoped memory and want it shipped this week | You need "remember the gist" and nothing more | You know your memory schema and want to own everything |
The table is the centerpiece because the five architectures really do separate cleanly along these dimensions. There is no column that is best on every row.
The decision framework
Three questions, asked in order. Each cuts the field.
Question 1: Do you control the model and orchestration loop?
If you are building inside a provider's first-party app — a custom GPT, a Claude Project, an Apps SDK integration — you do not fully control the loop, and provider-managed memory is the answer because it is the only memory your users can see and manage. Stop here.
If you are building on a model API and you own the orchestration, continue.
Question 2: Is memory load-bearing for the product, or incidental?
Load-bearing means the agent is useless without memory. A coaching agent that has no idea what you talked about last week is not a coaching agent. A long-running research agent that cannot recall what it found two hours ago is not doing research. For load-bearing memory, pick Letta (if write decisions are themselves a reasoning task) or a custom schema (if your team is mature enough to own the design).
Incidental means memory makes the agent better but not necessary. A support agent that can recall the user's plan tier is nicer than one that cannot, but it can ask. For incidental memory, mem0 is the lowest-friction answer; vector RAG is fine if you only need similarity-based recall.
Question 3: Do you need user-scoped persistence across sessions, agents, and devices?
If yes — the same user gets the same memory whether they hit your web app, your iOS app, or your Slack bot — you need explicit scoping. mem0, Letta, and custom schemas all support this. Provider memory does too, but only inside the provider's surface. Vector RAG does not, unless you build the scoping yourself with metadata filters.
If no — memory is per-session and disappears when the session ends — you do not need a memory architecture at all. Pass the conversation history in the request and stop.
The combinations that fall out: provider memory for first-party platform integrations; Letta for reasoning-heavy load-bearing memory you want to ship; mem0 for incidental user-scoped memory you want to ship this week; custom for mature products that have outgrown mem0; vector RAG for the narrow case where similarity is the only retrieval you need. The decision is rarely all five — it is usually two candidates, and the third question picks between them.
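The three questions compress into a small decision function. This is a sketch of the rule, not a substitute for the judgment calls above; every parameter name is illustrative.

```python
# The decision framework as a function. The three positional inputs map to
# the three questions above; all names are illustrative.
def pick_memory_architecture(
    controls_loop: bool,       # Q1: do you own the model + orchestration?
    load_bearing: bool,        # Q2: is the agent useless without memory?
    cross_session: bool,       # Q3: user-scoped persistence across surfaces?
    write_decisions_are_reasoning: bool = False,
    schema_is_known: bool = False,
) -> str:
    if not controls_loop:
        return "provider-managed"              # Q1: stop here
    if load_bearing:
        if schema_is_known:
            return "custom schema"
        if write_decisions_are_reasoning:
            return "Letta-style"
        return "custom schema"
    if cross_session:
        return "mem0-style layer"
    return "session-only (no memory architecture)"  # history in the request

assert pick_memory_architecture(False, True, True) == "provider-managed"
assert pick_memory_architecture(True, False, True) == "mem0-style layer"
assert pick_memory_architecture(True, False, False).startswith("session-only")
```

The function deliberately omits vector RAG: it substitutes for the mem0 branch only in the narrow similarity-only case, and that is a judgment call a boolean cannot encode.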
Cost model differences
Cost shape matters more than dollar amounts because the dollar amount depends on your traffic. The shape determines what scales linearly with users and what scales with conversation length.
Provider-managed. Bundled into the provider's per-token price. You pay nothing extra. You also cannot break the cost out, which means you cannot optimize it. If your product is paying provider per-seat or per-call pricing, memory is "free" in the sense that you cannot reduce the bill by reducing memory use.
Letta. Token cost amortized across in-loop memory hygiene. Most turns include some tokens spent on the agent reasoning about or editing its memory blocks. There is no separate memory pipeline to budget; the cost is folded into the agent's normal token spend. The trade is that you cannot turn it off cheaply when you do not need it on a given turn — the agent will still consider memory hygiene as part of its loop.
mem0. Reads are a single embedding call plus a similarity lookup — cheap. Writes invoke an LLM extraction call on every add to decide what is worth storing and how to phrase it. The extraction LLM is usually a small model, but it is real money at scale. The cost shape is per-conversation-turn, not per-user, so chatty users cost more than quiet ones in a sublinear but noticeable way.
Vector-only RAG. Reads are cheap. Writes are an embedding call on whatever you are storing. Storage scales linearly with bytes embedded; query cost scales with index size, mitigated by approximate nearest neighbor. The dominant cost is the embedding model — switching to a smaller embedder can cut cost in half with modest recall loss, which is a knob you control.
Custom. Engineering cost is the cost. Per-turn runtime cost is whatever you build — usually low because you can avoid LLM calls in the memory path entirely (relying on SQL queries, summaries computed on background jobs, and indexed retrieval). The bill shows up in headcount: someone owns the schema, the migrations, the dedup logic, and the privacy review.
The honest read is that no architecture is universally cheapest. Your read-write ratio and your conversation length determine which dominates. A high-write, short-conversation product (every turn produces new facts) makes mem0's extraction cost large and Letta's in-loop cost small. A low-write, long-conversation product (occasional facts, lots of recall) inverts that.
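The inversion is arithmetic, and you can check it on your own numbers. In the sketch below, every dollar figure is a placeholder assumption; only the relative shape between the two cost curves is the point.

```python
# Back-of-envelope cost shapes per conversation. All per-step dollar figures
# are placeholder assumptions; only the relative shape is the point.
def mem0_cost(turns: int, writes: int) -> float:
    read = 0.0001       # assumed: embedding call + lookup, per turn
    extraction = 0.002  # assumed: small-model LLM extraction, per write
    return turns * read + writes * extraction

def letta_cost(turns: int) -> float:
    hygiene = 0.0008    # assumed: in-loop memory reasoning, per turn
    return turns * hygiene

# High-write, short conversation: every turn produces a fact.
# mem0's extraction dominates; Letta's fixed per-turn overhead stays small.
print(mem0_cost(turns=5, writes=5), letta_cost(turns=5))

# Low-write, long conversation: occasional facts, lots of recall.
# Letta pays hygiene on every turn; mem0 pays extraction only three times.
print(mem0_cost(turns=100, writes=3), letta_cost(turns=100))
```

Swap in your own per-call prices and your observed read-write ratio before trusting either conclusion.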
Migration paths
Most products do not pick once and stay. The common arcs are predictable.
Provider memory → custom. Teams prototype on a custom GPT or Claude Project, validate demand, then move to their own model and need to bring memory with them. The migration is hard because there is no export. You start fresh and accept that early users lose continuity. Mitigations: warn users before the cutover, prompt them to re-state important facts, and seed the new memory store from your own conversation logs (not from the provider's memory, which you cannot read).
Vector RAG → mem0. A team built RAG-as-memory because they had RAG in production already, then hit the contradiction problem (user said "vegetarian" three months ago and "pescatarian" yesterday, and both surface). Migrating to mem0 is straightforward because mem0 also uses embeddings under the hood — you reindex your existing turns through mem0's extraction pipeline and let mem0 own the writes from then on. Friction is mostly in deciding which historical turns to seed.
mem0 → Letta. When write decisions become reasoning. The migration is harder because Letta's runtime model is different — you are not just swapping a memory layer, you are restructuring the agent loop. Plan for it as a re-architecture, not a swap.
mem0 → custom. When the team has learned what the schema wants to be. This is the most common end-state migration for mature products. The migration is incremental: start by replacing mem0's writes with writes to your own tables (keep mem0 reads working in parallel), validate that your tables produce equivalent retrieval, then cut over reads. The "rebuild mem0 badly" failure mode comes from skipping the learning phase, not from the migration itself.
Letta → custom. Rare, but happens when the team needs control Letta does not give them — usually around scoping or privacy. Painful because Letta's block model does not map cleanly onto a relational schema. Plan for an extraction phase where you decide which Letta block contents become which custom rows.
The arc that does not work is jumping straight to custom from no memory at all. Teams that try this rebuild mem0 badly within six months. Pick a layer first, learn from running it, then decide if you need to own the schema.
Anti-patterns
Things that look like good ideas and are not.
Picking provider memory because it's "free" and then needing to query it. Provider memory is free in dollars and expensive in optionality. The day a customer asks "what does your product remember about me?" or compliance asks "show me everything stored about user X," you cannot answer. Pick provider memory only when the product is the provider integration, not because you want to skip the work of building memory.
Picking Letta for a stateless tool that does not need memory. Letta is heavy. The runtime, the block model, the in-loop hygiene — all of it is justified when memory is load-bearing and write decisions are themselves reasoning. For an agent that needs zero memory or session-only memory, Letta is overkill and adds tokens you did not need to spend.
Picking custom because everything else feels too magic, then rebuilding mem0 badly. This is the single most common architectural failure in agent memory. Engineers see mem0's extraction step and think "I can write that," then six months in they have a worse version with fewer tests. The right move is to ship on mem0, learn what you actually need, then build custom against that learning — not against your guesses.
Mixing two architectures without a single source of truth. Provider memory is on (because the platform provides it) and your app schema also stores user facts. Both write. Both have stale data. Neither is authoritative. There is no merge story. The fix is to pick one source of truth per memory type before you mix architectures, not after.
Treating RAG as memory because RAG is what you have. Vector RAG is a long-term memory substitute only when "memory" really means "find similar past content." It is not a substitute when memory means "remember this user said X yesterday and updated to Y today." If you find yourself adding metadata fields and update logic to your vector index, you have started rebuilding mem0 inside your vector store. Stop and use mem0 or a custom schema.
Not naming a memory architecture at all. Some teams ship agents with conversation history in the request and call it done. That is a memory architecture — it is "session-only, no persistence" — and it is fine if that is the deliberate choice. The anti-pattern is not naming it. An unnamed architecture is one nobody owns, and the day a user asks "why doesn't it remember me from yesterday" there is no one to ask.
What to read next
- AI memory systems guide — the underlying memory taxonomy: short-term, long-term, episodic, semantic, procedural. This guide compares architectures; that one explains the memory types each architecture stores.
- Letta walkthrough — concrete walkthrough of Letta's block model and the in-loop memory edits.
- mem0 implementation guide — practical add/search/update/delete patterns and scoping strategies.
- Episodic vs semantic memory for agents — the cognitive-science distinction translated into architectural choices about what to store and when.
- Agentic Prompt Stack — Layer 4 (Memory access) is where the memory architecture surfaces in the agent prompt itself.
- Context Engineering Maturity Model — memory is one input to context assembly; the maturity model shows where it sits in the broader pipeline.
- SurePrompts Quality Rubric and RCAF — for the prompt-level structure inside any memory-aware agent.
- LangGraph prompting guide and OpenAI Agents SDK prompting guide — how memory plugs into the two most common orchestration runtimes.