Which memory architecture should you use for your agent? It depends on three things: whether you control the model and orchestration loop, whether memory is load-bearing for the product (the agent is useless without it) or incidental, and whether you need user-scoped persistence across sessions. Five architectures dominate in 2026 — provider-managed memory (ChatGPT memory, Claude Projects, Gemini saved info), self-managing agents (Letta-style), CRUD memory layers (mem0-style), vector-only RAG-as-memory, and custom in-app schemas. Each has a real fit and real friction. This guide compares them on the dimensions that matter and gives a decision rule that ends with a single recommendation, not a "depends."
Tip: There is no universally right memory architecture — there is the one that matches your control surface, your durability needs, and your token budget. Picking the wrong one means rewriting later, and rewriting memory means migrating user data, which is the one migration users actually notice.
Key takeaways:
- Five architectures cover the field: provider-managed, self-managing (Letta), memory layer (mem0), vector RAG, custom schema. The right one depends on control surface, persistence needs, and token budget — not on which is "best."
- Provider memory is zero effort and zero control. It is the right choice when your product is the provider platform integration, and the wrong choice the moment you need to query, audit, or migrate.
- Letta puts memory hygiene inside the agent loop. mem0 puts it outside as a CRUD API. The choice is about whether the agent should reason about its own memory or whether memory should be a service the agent calls.
- Vector-only RAG-as-memory is cheap and read-only. It is the wrong tool the moment you need to update or contradict a stored fact.
- Custom schemas give the most control and cost the most engineering time. They are usually the right end state for mature products, not the right starting point.
- Mixing architectures works if every memory type has exactly one source of truth. Two systems writing the same user facts produces contradictions you cannot fix.
- Use the AI memory systems guide for the underlying memory taxonomy and the Agentic Prompt Stack Layer 4 for how memory connects to the rest of the agent prompt.
The 5 agent memory architectures in 2026
The field has consolidated around five distinct camps. Each makes a different bet about who owns the memory, who edits it, and where the work happens.
Provider-managed memory
The model provider stores facts about the user across sessions and injects them into the context window automatically. ChatGPT memory does this for ChatGPT users. Claude Projects gives a project-scoped persistent context. Gemini saved info plays a similar role inside Google's surfaces.
Fit. Zero engineering effort to enable. The provider handles storage, retrieval, dedup, and conflict resolution. If you ship a custom GPT, a Claude Project workspace, or any agent that lives inside a provider's first-party app, provider memory is what your users already expect, and it is what they can manage from their own settings.
Friction. Opaque. You cannot enumerate what is stored, query it programmatically, scope it to a sub-feature of your product, or migrate it to a different model. The schema is the provider's. The retrieval rules are the provider's. The privacy posture is the provider's. If you build on top of provider memory and later need to move to your own backend, there is no export path — you are starting conversation memory from scratch for every user.
Self-managing agent (Letta-style)
The agent edits its own memory blocks via tool calls. Letta (originally MemGPT) gives the agent a structured memory area — typed blocks like persona, human, and a working set — and tools that let it rewrite those blocks during normal turns. Memory hygiene happens inside the agent loop, not as a separate pre- or post-processing step.
Fit. When memory is load-bearing and the right write decisions are themselves a reasoning task. The agent decides what is worth remembering and when to consolidate, the same way a human does between meetings. Persistence and structured editing are built in. Multi-agent setups inherit per-agent memory blocks naturally.
Friction. Opinionated. You adopt Letta's runtime, its block schema, and its loop shape. Token cost goes up on most turns because the agent spends some of each turn reasoning about and editing its own memory. If your agent does not need that reasoning — if it just needs "remember this fact" — Letta is heavier than the problem.
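The pattern is easier to see in code than in prose. Below is a minimal sketch of in-loop memory editing: the agent is handed tools that rewrite its own typed blocks, and the orchestrator executes whichever tool the model calls mid-turn. The function names echo MemGPT-style memory tools, but the implementation is a toy, not Letta's API.

```python
# Illustrative sketch of the self-managing pattern -- not Letta's real API.
# The agent is given tool functions that rewrite its own typed memory blocks;
# the orchestrator executes whichever tool the model calls during a turn.

memory_blocks = {
    "persona": "You are a concise coaching assistant.",
    "human": "",          # facts the agent has learned about the user
    "working": "",        # scratch space for the current task
}

def core_memory_append(block: str, text: str) -> str:
    """Tool: append a new fact to a memory block."""
    memory_blocks[block] = (memory_blocks[block] + "\n" + text).strip()
    return f"appended to block '{block}'"

def core_memory_replace(block: str, old: str, new: str) -> str:
    """Tool: rewrite part of a memory block (the agent decides when)."""
    memory_blocks[block] = memory_blocks[block].replace(old, new)
    return f"updated block '{block}'"

# In a real loop these are exposed to the model as tool schemas and the
# model emits the calls. Simulated here:
core_memory_append("human", "User's name is Sam; prefers morning sessions.")
core_memory_replace("human", "morning sessions", "evening sessions")

# The blocks are then serialized into the system prompt on the next turn.
print(memory_blocks["human"])
```

The token cost described above is visible here: every `core_memory_*` call is a tool round-trip the model has to reason its way into, on top of the turn's actual work.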
Memory layer (mem0-style)
A drop-in CRUD layer over embeddings plus an LLM extraction step. mem0 and similar layers expose add, search, update, delete against a memory store with multi-level scoping (user, session, agent). Your application code calls add after a turn, search before a turn, and the layer handles extraction, embedding, dedup, and retrieval.
Fit. Bolting memory onto an existing app with the least architectural disruption. The agent loop does not change; you add two API calls around it. Multi-level scoping handles the common case (per-user facts, per-session working memory, per-agent persona) without you designing a schema. Migration from no-memory to mem0 is usually a week, not a quarter.
Friction. The LLM extraction step on every add costs tokens. Reads are cheap, writes are not. The layer makes opinionated decisions about what to extract — sometimes it stores facts you did not want stored, or fails to store ones you did. You debug those decisions through prompts to the extraction model, which is a layer of indirection between your intent and the stored result.
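The calling pattern is the whole point: two calls wrapped around an otherwise unchanged agent loop. A toy sketch of that shape follows; the class and its trivial keyword matcher are illustrative stand-ins, not mem0's implementation.

```python
# Toy sketch of the memory-layer calling pattern -- illustrative, not mem0.
# A real layer runs LLM extraction + dedup on add() (the token cost) and
# embedding similarity on search(); both are stubbed with keyword overlap.
from dataclasses import dataclass, field

@dataclass
class MemoryLayer:
    store: list = field(default_factory=list)

    def add(self, text: str, user_id: str) -> None:
        # Real layers: LLM extraction decides what is worth storing here.
        self.store.append({"text": text, "user_id": user_id})

    def search(self, query: str, user_id: str, k: int = 3) -> list:
        # Real layers: embedding similarity. Stub: shared-word overlap.
        scored = [
            (len(set(query.lower().split()) & set(m["text"].lower().split())), m)
            for m in self.store
            if m["user_id"] == user_id
        ]
        return [m for score, m in sorted(scored, key=lambda s: -s[0]) if score][:k]

memory = MemoryLayer()

# After a turn: write what the user said.
memory.add("User is vegetarian and allergic to peanuts.", user_id="u42")

# Before the next turn: read relevant facts into the prompt.
hits = memory.search("suggest a dinner recipe for this user", user_id="u42")
```

Note that the agent loop itself never appears in this sketch, which is exactly the architectural claim: memory is a service the app calls, not something the agent reasons about.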
Vector-only RAG-as-memory
Embed every turn (or every important utterance), store in a vector index, retrieve by semantic similarity. No structured fields, no updates, no dedup. The store is append-only and the retrieval is "find the k most similar past turns to the current one."
Fit. Cheap, simple, and fits patterns your team already uses if you have RAG in production. For agents where "memory" really means "remember the gist of past conversations so we do not re-explain things," it is enough. Storage and read costs are dominated by the embedding model, which is a known quantity.
Friction. No structured updates. If a user changes their preferences, both the old and new statement live in the index, and retrieval might surface either one. No contradiction resolution. Recall quality depends entirely on the embedding model and the chunking strategy — a mistake in either silently degrades memory recall without throwing an error. Treating vector memory as the only memory architecture is a common entry-level mistake.
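The contradiction problem is easy to reproduce. In the stdlib-only sketch below, a bag-of-words cosine stands in for a real embedding model; the store and retrieval logic are the actual append-only pattern.

```python
# Append-only vector memory with a toy bag-of-words "embedding" --
# illustrative; a real system uses an embedding model and an ANN index.
import math
from collections import Counter

index = []  # append-only list of (vector, original text)

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def remember(text: str) -> None:
    index.append((embed(text), text))  # no update, no dedup

def recall(query: str, k: int = 2) -> list:
    q = embed(query)
    return [t for _, t in sorted(index, key=lambda e: -cosine(q, e[0]))][:k]

remember("User said they are vegetarian.")       # three months ago
remember("User said they are pescatarian now.")  # yesterday

# Both statements surface on recall; nothing in the store says
# which one is current.
print(recall("what diet does the user follow?"))
```

Fixing this requires either an update path (mem0, custom schema) or a delete-and-rewrite convention your application enforces, at which point you are no longer "vector-only."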
Custom in-app schema
Postgres or Redis tables holding typed memory rows, plus an LLM summarizer running outside the request path on a background job. The agent reads from your tables on each turn (joined to the user, the session, whatever scope you need) and writes through your application code with whatever validation you want.
Fit. Mature products where memory shape is well understood and the team already runs a database. Maximum control: you own the schema, the indexes, the scoping, the privacy story, the export format, and the migration path. You can answer "what does the system remember about this user" with a SQL query, which provider memory and Letta cannot.
Friction. Maximum maintenance. You build extraction, dedup, conflict resolution, scoping, decay, and retrieval yourself. Teams often start here, ship a thin version, then realize they have rebuilt mem0 badly. The right time to pick custom is after you know what your memory schema actually wants to be — usually after a year on a layer like mem0 or Letta.
The big comparison
The five architectures across the dimensions that drive the decision.
| Dimension | Provider-managed | Letta (self-managing) | mem0 (memory layer) | Vector-only RAG | Custom schema |
|---|---|---|---|---|---|
| Control surface | None — provider owns it | High — you own the runtime | Medium — you own scopes, layer owns extraction | High on writes, none on retrieval logic unless you build it | Maximum — you own everything |
| Persistence model | Provider-managed, opaque | Persistent memory blocks edited in-loop | Persistent CRUD store with scoping | Append-only vector index | Whatever you build |
| Scoping (user / session / agent) | User-only, provider-defined | Per-agent blocks, scope is the agent | Built-in multi-level scoping | None native — you tag and filter | Whatever you model |
| Structured updates | No | Yes — blocks are typed and rewritable | Yes — update/delete API | No — append-only | Yes — SQL updates |
| Cost per turn | Bundled into provider price | Higher — memory hygiene runs in-loop | Read cheap, write costs an LLM extraction call | Cheap reads, embedding cost on writes | Whatever you build |
| Vendor lock-in | High — provider-specific | Medium — Letta runtime | Low — open layer, swappable | Low — vector DB is portable | None |
| Learning curve | None — toggle a setting | Steep — adopt a runtime and loop shape | Low — three API calls | Low if you have RAG, medium otherwise | High — you design it all |
| When to pick | Your product is the provider integration | Memory is load-bearing and write decisions are themselves reasoning | You need persistent user-scoped memory and want it shipped this week | You need "remember the gist" and nothing more | You know your memory schema and want to own everything |
The table is the centerpiece because the five architectures really do separate cleanly along these dimensions. There is no column that is best on every row.
The decision framework
Three questions, asked in order. Each cuts the field.
Question 1: Do you control the model and orchestration loop?
If you are building inside a provider's first-party app — a custom GPT, a Claude Project, an Apps SDK integration — you do not fully control the loop, and provider-managed memory is the answer because it is the only memory your users can see and manage. Stop here.
If you are building on a model API and you own the orchestration, continue.
Question 2: Is memory load-bearing for the product, or incidental?
Load-bearing means the agent is useless without memory. A coaching agent that has no idea what you talked about last week is not a coaching agent. A long-running research agent that cannot recall what it found two hours ago is not doing research. For load-bearing memory, pick Letta (if write decisions are themselves a reasoning task) or a custom schema (if your team is mature enough to own the design).
Incidental means memory makes the agent better but not necessary. A support agent that can recall the user's plan tier is nicer than one that cannot, but it can ask. For incidental memory, mem0 is the lowest-friction answer; vector RAG is fine if you only need similarity-based recall.
Question 3: Do you need user-scoped persistence across sessions, agents, and devices?
If yes — the same user gets the same memory whether they hit your web app, your iOS app, or your Slack bot — you need explicit scoping. mem0, Letta, and custom schemas all support this. Provider memory does too, but only inside the provider's surface. Vector RAG does not, unless you build the scoping yourself with metadata filters.
If no — memory is per-session and disappears when the session ends — you do not need a memory architecture at all. Pass the conversation history in the request and stop.
The combinations that fall out: provider memory for first-party platform integrations; Letta for reasoning-heavy load-bearing memory you want to ship; mem0 for incidental user-scoped memory you want to ship this week; custom for mature products that have outgrown mem0; vector RAG for the narrow case where similarity is the only retrieval you need. The decision is rarely all five — it is usually two candidates, and the third question picks between them.
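The three questions compress into a small decision function. This is a sketch of the rule, not a substitute for the judgment calls above; every parameter name is illustrative.

```python
# The decision framework as a function. The three positional inputs map to
# the three questions above; all names are illustrative.
def pick_memory_architecture(
    controls_loop: bool,       # Q1: do you own the model + orchestration?
    load_bearing: bool,        # Q2: is the agent useless without memory?
    cross_session: bool,       # Q3: user-scoped persistence across surfaces?
    write_decisions_are_reasoning: bool = False,
    schema_is_known: bool = False,
) -> str:
    if not controls_loop:
        return "provider-managed"              # Q1: stop here
    if load_bearing:
        if schema_is_known:
            return "custom schema"
        if write_decisions_are_reasoning:
            return "Letta-style"
        return "custom schema"
    if cross_session:
        return "mem0-style layer"
    return "session-only (no memory architecture)"  # history in the request

assert pick_memory_architecture(False, True, True) == "provider-managed"
assert pick_memory_architecture(True, False, True) == "mem0-style layer"
assert pick_memory_architecture(True, False, False).startswith("session-only")
```

The function deliberately omits vector RAG: it substitutes for the mem0 branch only in the narrow similarity-only case, and that is a judgment call a boolean cannot encode.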
Cost model differences
Cost shape matters more than dollar amounts because the dollar amount depends on your traffic. The shape determines what scales linearly with users and what scales with conversation length.
Provider-managed. Bundled into the provider's per-token price. You pay nothing extra. You also cannot break the cost out, which means you cannot optimize it. If your product is paying provider per-seat or per-call pricing, memory is "free" in the sense that you cannot reduce the bill by reducing memory use.
Letta. Token cost amortized across in-loop memory hygiene. Most turns include some tokens spent on the agent reasoning about or editing its memory blocks. There is no separate memory pipeline to budget; the cost is folded into the agent's normal token spend. The trade is that you cannot turn it off cheaply when you do not need it on a given turn — the agent will still consider memory hygiene as part of its loop.
mem0. Reads are a single embedding call plus a similarity lookup — cheap. Writes invoke an LLM extraction call on every add to decide what is worth storing and how to phrase it. The extraction LLM is usually a small model, but it is real money at scale. The cost shape is per-conversation-turn, not per-user, so chatty users cost more than quiet ones in a sublinear but noticeable way.
Vector-only RAG. Reads are cheap. Writes are an embedding call on whatever you are storing. Storage scales linearly with bytes embedded; query cost scales with index size, mitigated by approximate nearest neighbor. The dominant cost is the embedding model — switching to a smaller embedder can cut cost in half with modest recall loss, which is a knob you control.
Custom. Engineering cost is the cost. Per-turn runtime cost is whatever you build — usually low because you can avoid LLM calls in the memory path entirely (relying on SQL queries, summaries computed on background jobs, and indexed retrieval). The bill shows up in headcount: someone owns the schema, the migrations, the dedup logic, and the privacy review.
The honest read is that no architecture is universally cheapest. Your read-write ratio and your conversation length determine which dominates. A high-write, short-conversation product (every turn produces new facts) makes mem0's extraction cost large and Letta's in-loop cost small. A low-write, long-conversation product (occasional facts, lots of recall) inverts that.
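The inversion is arithmetic, and you can check it on your own numbers. In the sketch below, every dollar figure is a placeholder assumption; only the relative shape between the two cost curves is the point.

```python
# Back-of-envelope cost shapes per conversation. All per-step dollar figures
# are placeholder assumptions; only the relative shape is the point.
def mem0_cost(turns: int, writes: int) -> float:
    read = 0.0001       # assumed: embedding call + lookup, per turn
    extraction = 0.002  # assumed: small-model LLM extraction, per write
    return turns * read + writes * extraction

def letta_cost(turns: int) -> float:
    hygiene = 0.0008    # assumed: in-loop memory reasoning, per turn
    return turns * hygiene

# High-write, short conversation: every turn produces a fact.
# mem0's extraction dominates; Letta's fixed per-turn overhead stays small.
print(mem0_cost(turns=5, writes=5), letta_cost(turns=5))

# Low-write, long conversation: occasional facts, lots of recall.
# Letta pays hygiene on every turn; mem0 pays extraction only three times.
print(mem0_cost(turns=100, writes=3), letta_cost(turns=100))
```

Swap in your own per-call prices and your observed read-write ratio before trusting either conclusion.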
Migration paths
Most products do not pick once and stay. The common arcs are predictable.
Provider memory → custom. Teams prototype on a custom GPT or Claude Project, validate demand, then move to their own model and need to bring memory with them. The migration is hard because there is no export. You start fresh and accept that early users lose continuity. Mitigations: warn users before the cutover, prompt them to re-state important facts, and seed the new memory store from your own conversation logs (not from the provider's memory, which you cannot read).
Vector RAG → mem0. A team built RAG-as-memory because they had RAG in production already, then hit the contradiction problem (user said "vegetarian" three months ago and "pescatarian" yesterday, and both surface). Migrating to mem0 is straightforward because mem0 also uses embeddings under the hood — you reindex your existing turns through mem0's extraction pipeline and let mem0 own the writes from then on. Friction is mostly in deciding which historical turns to seed.
mem0 → Letta. When write decisions become reasoning. The migration is harder because Letta's runtime model is different — you are not just swapping a memory layer, you are restructuring the agent loop. Plan for it as a re-architecture, not a swap.
mem0 → custom. When the team has learned what the schema wants to be. This is the most common end-state migration for mature products. The migration is incremental: start by replacing mem0's writes with writes to your own tables (keep mem0 reads working in parallel), validate that your tables produce equivalent retrieval, then cut over reads. The "rebuild mem0 badly" failure mode comes from skipping the learning phase, not from the migration itself.
Letta → custom. Rare, but happens when the team needs control Letta does not give them — usually around scoping or privacy. Painful because Letta's block model does not map cleanly onto a relational schema. Plan for an extraction phase where you decide which Letta block contents become which custom rows.
The arc that does not work is jumping straight to custom from no memory at all. Teams that try this rebuild mem0 badly within six months. Pick a layer first, learn from running it, then decide if you need to own the schema.
Anti-patterns
Things that look like good ideas and are not.
Picking provider memory because it's "free" and then needing to query it. Provider memory is free in dollars and expensive in optionality. The day a customer asks "what does your product remember about me?" or compliance asks "show me everything stored about user X," you cannot answer. Pick provider memory only when the product is the provider integration, not because you want to skip the work of building memory.
Picking Letta for a stateless tool that does not need memory. Letta is heavy. The runtime, the block model, the in-loop hygiene — all of it is justified when memory is load-bearing and write decisions are themselves reasoning. For an agent that needs zero memory or session-only memory, Letta is overkill and adds tokens you did not need to spend.
Picking custom because everything else feels too magic, then rebuilding mem0 badly. This is the single most common architectural failure in agent memory. Engineers see mem0's extraction step and think "I can write that," then six months in they have a worse version with fewer tests. The right move is to ship on mem0, learn what you actually need, then build custom against that learning — not against your guesses.
Mixing two architectures without a single source of truth. Provider memory is on (because the platform provides it) and your app schema also stores user facts. Both write. Both have stale data. Neither is authoritative. There is no merge story. The fix is to pick one source of truth per memory type before you mix architectures, not after.
Treating RAG as memory because RAG is what you have. Vector RAG is a long-term memory substitute only when "memory" really means "find similar past content." It is not a substitute when memory means "remember this user said X yesterday and updated to Y today." If you find yourself adding metadata fields and update logic to your vector index, you have started rebuilding mem0 inside your vector store. Stop and use mem0 or a custom schema.
Not naming a memory architecture at all. Some teams ship agents with conversation history in the request and call it done. That is a memory architecture — it is "session-only, no persistence" — and it is fine if that is the deliberate choice. The anti-pattern is not naming it. An unnamed architecture is one nobody owns, and the day a user asks "why doesn't it remember me from yesterday" there is no one to ask.
What to read next
- AI memory systems guide — the underlying memory taxonomy: short-term, long-term, episodic, semantic, procedural. This guide compares architectures; that one explains the memory types each architecture stores.
- Letta walkthrough — concrete walkthrough of Letta's block model and the in-loop memory edits.
- mem0 implementation guide — practical add/search/update/delete patterns and scoping strategies.
- Episodic vs semantic memory for agents — the cognitive-science distinction translated into architectural choices about what to store and when.
- Agentic Prompt Stack — Layer 4 (Memory access) is where the memory architecture surfaces in the agent prompt itself.
- Context Engineering Maturity Model — memory is one input to context assembly; the maturity model shows where it sits in the broader pipeline.
- SurePrompts Quality Rubric and RCAF — for the prompt-level structure inside any memory-aware agent.
- LangGraph prompting guide and OpenAI Agents SDK prompting guide — how memory plugs into the two most common orchestration runtimes.