Letta, MemGPT, agent memory, agent frameworks, LLM memory, Python

Letta (MemGPT) Walkthrough: How Self-Managing Agent Memory Works (2026)

How Letta's memory-block model, tool-based memory editing, and archival memory let an agent manage its own context — and when it beats vector-only RAG.

SurePrompts Team
May 4, 2026
15 min read

TL;DR

Letta (formerly MemGPT) is an open-source stateful agent framework where the agent itself manages its memory via tool calls. It uses an OS-inspired hierarchy — main context, recall storage, archival storage — and labeled memory blocks the model edits in its normal loop. The right tool when context must survive sessions; overkill for stateless RAG.

Letta (formerly MemGPT) is an open-source stateful agent framework where the agent itself manages its working memory through tool calls. It originated in the MemGPT paper from Berkeley in 2023, which framed an LLM as a process running on a memory-constrained operating system, and was rebranded Letta as the project grew into a general-purpose agent framework with persistence as a default rather than an afterthought.

The shorter way to say it: most agent frameworks ship orchestration and let you bolt memory on. Letta ships memory and lets you bolt orchestration on. If your problem is "this agent has to remember things across sessions and decide what's worth remembering," Letta is opinionated about how that should work. If your problem is "this pipeline needs three retrieval calls in a graph," you probably want something else.

Tip

Letta's central trick is putting the model in charge of memory hygiene — it decides what's worth remembering, what to summarize, and what to archive, all via tool calls inside its normal loop.

Key takeaways

  • Letta is an open-source agent framework with persistence baked in; the agent manages its own memory via tool use, not via an external retrieval pipeline.
  • The memory model is OS-inspired: main context (working memory the model sees every turn), recall storage (recent message history searchable on demand), archival storage (long-term searchable knowledge).
  • The load-bearing primitive is the memory block — a labeled, persistent string the agent edits with core_memory_append and core_memory_replace.
  • Conventional blocks are human (what the agent knows about the user) and persona (the agent's self-description); custom blocks are common for task and project state.
  • Letta beats vector-only RAG when memory has to evolve — write, rewrite, consolidate — not just be retrieved.
  • The cost is real: more tokens per turn and slower loops, because the agent spends some of its reasoning budget on memory housekeeping.
  • The framework is the wrong tool for stateless single-turn pipelines and prototypes; the right tool for long-running personal-assistant agents and multi-session products.

What Letta is

Letta is a Python framework for building stateful agents. The agent runs as a service: state is persisted in a database by default, so an agent's memory survives process restarts, machine moves, and long idle stretches between sessions. You instantiate an agent with a model, a set of tools, and an initial set of memory blocks. From that point on, the agent is addressable — you send it a message, it loops, it responds, and any memory it edited along the way is durable.
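A minimal sketch of that lifecycle, assuming the letta_client Python SDK pointed at a self-hosted Letta server. Parameter names and message shapes have shifted across versions, so treat this as the shape of the API rather than a copy-paste recipe:

code
from letta_client import Letta

# Assumes a Letta server running locally; agent state lives in its database, not in this process.
client = Letta(base_url="http://localhost:8283")

# Create a persistent agent: a model, embeddings for archival search, and initial memory blocks.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "human", "value": "Name: Priya\nRole: ML engineer"},
        {"label": "persona", "value": "You are a terse engineering pair-programmer."},
    ],
)

# Later, from any process, address the same agent by id; memory edits persist server-side.
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "What do you already know about me?"}],
)
for message in response.messages:
    print(message)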

That persistence-by-default posture is the part that distinguishes Letta from frameworks where state is something you manage yourself. In a typical LangGraph or LangChain pipeline, "memory" is whatever you choose to read and write at each step; if you forget to write, nothing is remembered. In Letta, the agent has a memory and the model has tools to edit it. Forgetting to remember requires actively not calling the tools.

The framework descends from the MemGPT paper, which proposed an OS analogy for managing an LLM's limited context: treat the context window as RAM, give the model tools to page information in and out, and let the model itself decide what's worth keeping hot. The rebrand to Letta reflected the project broadening into a general agent framework rather than only an implementation of that one paper's idea.

A practical consequence of running the agent as a service: you address it the way you'd address any other long-lived process. You don't reconstruct the agent every request from a stack of strings; you talk to an existing instance that already knows who it is, who you are, and what you've discussed. That changes how you think about deployment. The agent isn't a function you call; it's a process you supervise.

The MemGPT memory hierarchy

The OS analogy gives you the mental model. Three tiers, each with different latency and capacity tradeoffs.

Main context is RAM. It's what the model sees on every turn — the system prompt, the memory blocks, and whatever recent conversation is in scope. Cheap to read (it's just in the prompt), but tightly bounded by the context window and by your token budget. This is where memory blocks live.

Recall storage is SSD. It holds recent message history — the conversation log — and the agent can search it via tool calls when something earlier in the session matters. Slower than main context (you pay a tool call), but much larger.

Archival storage is disk. It's the long-term, searchable knowledge store. The agent inserts items into archival memory with a tool call and searches them with another tool call, typically backed by a vector index. Capacity is effectively unlimited; latency is the highest of the three tiers because the agent has to formulate a query and the result has to come back through the loop.

The point of the hierarchy isn't elegance. It's that the model has to make choices about which tier each piece of information lives in, and those choices are the agent's job, not the framework's. A fact that matters every turn belongs in a memory block. A fact that matters once a week belongs in archival memory. A line of conversation that might come up later belongs in recall storage by default — it's already there, just not in main context.
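To make the tiering concrete, here is an illustrative sketch in plain Python (not Letta's internal schema) of how the three tiers differ in how they are reached: blocks are serialized into every prompt, while the other two tiers sit behind tool calls the agent has to choose to make.

code
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Main context: labeled blocks rendered into every prompt. Zero retrieval cost, tight budget.
    blocks: dict[str, str] = field(default_factory=dict)
    # Recall storage: the full message log. Reached via a search tool call when needed.
    recall: list[dict] = field(default_factory=list)
    # Archival storage: long-term items behind a (typically vector) index. Also tool-call only.
    archival: list[str] = field(default_factory=list)

    def render_main_context(self) -> str:
        # Only this tier is pushed into the prompt on every turn.
        return "\n\n".join(f"[{label}]\n{value}" for label, value in self.blocks.items())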

Memory blocks — the load-bearing primitive

A memory block is a labeled string that lives in main context. The label gives the agent a stable handle. The string is editable through tool calls. The size is bounded — typically a few hundred to a couple thousand tokens per block — which forces the agent to curate rather than hoard.

The two conventional blocks:

code
[human]
Name: Priya
Role: ML engineer at a mid-stage fintech
Working on: latency reduction for inference pipeline
Communication style: terse, prefers code over prose
Known constraints: GPU budget locked through Q3

code
[persona]
You are an engineering pair-programmer agent.
You stay in scope on the user's current task.
You ask one clarifying question at a time, never multiple.
You prefer to show a small change and get reaction before suggesting the next.

The human block is what the agent knows about the user. It accretes over time as the agent learns. The persona block is what the agent knows about itself — its identity, voice, and standing instructions. Both are visible to the model on every turn, which is why blocks are powerful: there's no retrieval to fail. The cost is the token count, which is why blocks are bounded.

Custom blocks are common in production. A coding agent might keep a current_project block with the repo, branch, and ticket. A long-running research agent might keep a findings block summarizing what's been established so far. The pattern is the same: stable labels, bounded size, agent-edited.
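A custom block in the same format as the two conventional ones — the values here are hypothetical; the shape is what matters:

code
[current_project]
Repo: payments-service
Branch: feat/p99-latency
Ticket: PAY-1482
State: profiling done; batching decision pending
Open question: batch at the gateway or in the model server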

Tool-based memory editing

The agent edits memory blocks the same way it does everything else — by calling tools.

  • core_memory_append(label, content) — add to the named block.
  • core_memory_replace(label, old_substring, new_substring) — rewrite a span.
  • archival_memory_insert(content) — push an item into long-term storage.
  • archival_memory_search(query) — search archival memory and return matches.

The names matter less than the shape: write, rewrite, insert, search. The agent uses these in the middle of normal reasoning. Mid-conversation, the model might decide a fact about the user is worth keeping and emit core_memory_append("human", "Prefers TypeScript over JavaScript for new services"). Later, when archival storage gets too noisy, the model might consolidate by reading several items, summarizing, and inserting a single rolled-up entry.
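Stripped of framework plumbing, the two block-editing tools reduce to string operations over a dict of blocks plus a bounded size check. This is an illustrative re-implementation, not Letta's source. The string return values matter: they go back into the loop as tool results the model reads.

code
# Illustrative only: the shape of the core-memory tools, not Letta's implementation.
BLOCK_LIMIT = 2000  # per-block character budget; an assumption for this sketch

def core_memory_append(blocks: dict[str, str], label: str, content: str) -> str:
    updated = (blocks[label] + "\n" + content).strip()
    if len(updated) > BLOCK_LIMIT:
        return f"Error: [{label}] is at capacity ({len(updated)} chars). Consolidate before appending."
    blocks[label] = updated
    return f"Appended to [{label}]."

def core_memory_replace(blocks: dict[str, str], label: str, old: str, new: str) -> str:
    if old not in blocks[label]:
        return f"Error: '{old}' not found in [{label}]. Nothing changed."
    blocks[label] = blocks[label].replace(old, new)
    return f"Updated [{label}]."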

The prompt-engineering implication is the part most people underestimate. The agent's system prompt has to teach when to write versus when not to write, and when to read archival storage versus when to assume it's not there. Without that teaching, you get one of two failure modes: the agent never edits memory and the human block stays empty for weeks, or the agent edits memory on every turn and the blocks become a stream-of-consciousness journal that fills the context window.

A prompt pattern that tends to work:

code
Memory protocol:
- Append to the [human] block when you learn a stable fact about the user
  (preferences, role, ongoing projects). Never append speculation.
- Replace existing lines when a fact changes. Don't append a contradiction.
- Insert into archival memory only for items that won't fit in [human]
  but you might want later — past project decisions, prior conversations
  worth referencing.
- Search archival memory before answering questions about past work.
- If the [human] block grows past ~200 lines, consolidate.

This is the same discipline the Agentic Prompt Stack describes for tool-calling agents in general — explicit, narrow rules about when each tool fires. Memory tools are tools, and they need the same discipline.

Sleep-time compute / heartbeat

Letta also supports the agent running reflective passes when no user message is pending. The mental model is a background heartbeat: every so often, the agent gets a turn with no user input, and it can use that turn to reorganize memory — consolidate archival items, rewrite a human block that's grown messy, summarize recent conversation into a stable note.

Conceptually, this is "sleep-time compute" applied to an agent. During the day the agent reacts; during idle periods it does the housekeeping that keeps memory durable. At the architectural level, what matters is the loop: the agent gets autonomous turns, and you give it instructions about what to do with them. The specific mechanics — how often the heartbeat fires, what the trigger looks like in code — vary by version, and the discipline is the same as for any other agent loop. Tell the model what you want it to do with the turn, or it'll improvise.
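What a reflective turn actually does can be sketched without the framework. The helpers here (fetch_recent, llm_summarize, retire) are hypothetical stand-ins; in Letta the agent itself would make the same moves through its memory tools during an autonomous turn.

code
def consolidate_archival(storage, llm_summarize, batch_size: int = 50) -> None:
    # Hypothetical storage interface: fetch the newest archival items.
    recent = storage.fetch_recent(batch_size)
    if len(recent) < batch_size:
        return  # not enough accumulation to be worth a pass
    rollup = llm_summarize(
        "Consolidate these memory items into the fewest entries that preserve every "
        "durable fact. Drop duplicates, speculation, and conversational chatter:\n\n"
        + "\n".join(recent)
    )
    storage.insert(rollup)   # one rolled-up entry replaces many noisy ones
    storage.retire(recent)   # keep the originals out of the hot search index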

Letta vs vector-only RAG

The clearest framing is what each system controls.

Dimension                     | Vector-only RAG             | Letta
Who decides what to remember  | The ingestion pipeline      | The agent
Who decides what to retrieve  | The retriever               | The agent (via tool calls)
Memory is                     | Read-only at runtime        | Read + write + reflect
Storage shape                 | Embedded chunks             | Edited semantic blocks plus archival
Cost per query                | Cheap                       | Higher (tool turns spent on memory)
Best fit                      | Lookup over a fixed corpus  | Long-running stateful agents
Failure mode                  | Stale or irrelevant chunks  | Bad memory hygiene by the agent

Vector RAG is a great answer when the corpus is essentially static and the question is "given this user message, what context should I prepend?" Letta is a great answer when the corpus is the conversation itself, or when "what to remember" is itself a decision the agent has to make. The two aren't mutually exclusive — a Letta agent can call a RAG tool over a separate document corpus alongside its own archival memory.

The full taxonomy of memory shapes — within-session, provider-managed, application-managed — lives in the AI memory systems guide. Letta is a specific, opinionated implementation of the application-managed shape with the agent in charge.

Letta vs LangGraph + custom memory

LangGraph and Letta solve overlapping problems from opposite directions.

Dimension          | LangGraph + custom memory                          | Letta
Abstraction level  | Lower (you design the schema)                      | Higher (memory blocks + tools baked in)
Flexibility        | Maximum                                            | Constrained to the framework's model
Boilerplate        | More                                               | Less
Persistence        | Whatever you build                                 | Default, durable across runs
State shape        | Whatever you design                                | Memory blocks + recall + archival
Best fit           | Custom memory models, complex graph orchestration  | Stateful agents that match the framework's pattern

LangGraph is the right answer when your memory model doesn't fit a "blocks plus archival" shape, when you need precise control over state transitions across many nodes, or when you're embedding agent behavior inside a larger pipeline that already owns persistence. The LangGraph prompting guide covers what those agents look like.

Letta is the right answer when the stateful-agent pattern is exactly what you want and you'd rather inherit the discipline than rebuild it. The two also compose: a Letta agent can be a node inside a LangGraph workflow, with the outer graph owning multi-agent orchestration and the inner Letta agent owning per-conversation memory.
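A sketch of that composition, assuming the letta_client SDK and LangGraph's StateGraph API. The exact shape of Letta's returned message list varies by version, so the reply extraction here is approximate.

code
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from letta_client import Letta

class State(TypedDict):
    user_message: str
    agent_reply: str

letta = Letta(base_url="http://localhost:8283")
AGENT_ID = "agent-..."  # an existing, persistent Letta agent

def letta_node(state: State) -> dict:
    # The node delegates to the Letta agent; memory lives with the agent, not in graph state.
    response = letta.agents.messages.create(
        agent_id=AGENT_ID,
        messages=[{"role": "user", "content": state["user_message"]}],
    )
    replies = [
        m.content for m in response.messages
        if getattr(m, "message_type", "") == "assistant_message"
    ]
    return {"agent_reply": "\n".join(replies)}

graph = StateGraph(State)
graph.add_node("letta_agent", letta_node)
graph.add_edge(START, "letta_agent")
graph.add_edge("letta_agent", END)
app = graph.compile()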

The same comparison applies more loosely to the OpenAI Agents SDK, which gives you a lighter-weight framework with fewer opinions about persistence. If you want explicit memory primitives, Letta is the more opinionated choice.

Common failure modes

The framework gives you the primitives. The failures come from how the agent uses them.

The empty human block. The agent runs for weeks and never appends a fact. The system prompt didn't tell it to, or didn't make the trigger concrete. Fix: add explicit memory-protocol instructions, with examples of what to append and when.

The runaway human block. The agent appends every micro-observation. The block balloons past usable size and starts to crowd out other context. Fix: tighten the protocol — append only stable facts, replace when facts change, consolidate on a schedule.

Archival memory as a junk drawer. The agent inserts everything into archival storage with no consolidation pass. Search starts returning low-signal hits because the index is mostly noise. Fix: a periodic reflective turn that reads recent archival items and consolidates duplicates or near-duplicates into single entries.

Archival search misses. The agent stored a fact in one phrasing and queried in another, semantically related but lexically distant. The retriever returns nothing useful. Fix: on insert, store both a literal version and a paraphrase; on search, have the agent try multiple query phrasings before concluding the fact isn't there. The deeper issue is also covered in the memory recall guide.
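One way to implement the multi-phrasing fix outside the prompt, with archival_search and rephrase as hypothetical stand-ins for whatever search tool and rephrasing call you actually expose:

code
def search_with_rephrasings(archival_search, rephrase, question: str) -> list[str]:
    # Try the literal phrasing first, then a couple of LLM-generated paraphrases.
    queries = [question] + rephrase(question, n=2)
    seen: set[str] = set()
    hits: list[str] = []
    for q in queries:
        for item in archival_search(q):
            if item not in seen:
                seen.add(item)
                hits.append(item)
        if hits:
            break  # stop as soon as one phrasing lands
    return hits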

Cross-user memory pollution. The agent's memory bleeds across users because scoping is wrong at the storage layer. A fact appended during user A's session shows up in user B's long-term memory. Fix: per-user agent instances, hard scoping at the database layer, and integration tests that exercise the boundary.
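The scoping fix is mechanical: never resolve an agent except through a user-keyed lookup, so no code path leads from one user's request to another user's agent. A sketch, with registry and create_agent_for as hypothetical application-side helpers:

code
def send_message(client, registry: dict[str, str], create_agent_for, user_id: str, text: str):
    # Hard scoping: the agent id can only come from this user's registry entry.
    agent_id = registry.get(user_id)
    if agent_id is None:
        agent_id = create_agent_for(user_id)  # creates and persists a fresh per-user agent
        registry[user_id] = agent_id
    return client.agents.messages.create(
        agent_id=agent_id,
        messages=[{"role": "user", "content": text}],
    )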

Memory edits as a substitute for thinking. A subtler one. The agent learns that calling core_memory_append makes the loop feel productive, and starts doing it instead of actually answering. The output gets thin while the memory blocks grow. Fix: in the system prompt, separate "what you do for the user" from "what you do to memory," and require the user-facing answer to come first.

These failure modes aren't unique to Letta — they show up in any system where memory is written, read, and consolidated. They are more visible in Letta because the agent itself does the writing, so bad behavior shows up as bad agent behavior rather than as a silent retrieval bug.

When Letta is the right tool

Letta earns its weight when persistence and memory hygiene are first-class requirements:

  • Long-running personal-assistant agents. Same user, weeks or months of interaction, memory has to accumulate and stay coherent.
  • Multi-session products. A chat agent in a product where context is supposed to survive between sessions, not reset every time.
  • Agents that learn from interaction. Where the agent's understanding of the user, the project, or the domain has to evolve based on what it observes.
  • Cases where you want the agent to own memory decisions. Because the alternative — building all the curation logic outside the model — is more brittle than letting the model do it inline.

Letta is overkill when:

  • Single-turn tools. A summarizer, a translator, a one-shot query agent. There's nothing to remember.
  • Stateless RAG. A read-only pipeline over a fixed corpus, where retrieval is the only memory operation.
  • Prototypes. Where you're still figuring out whether memory matters at all. Start with a flat scratchpad, see what hurts, then graduate.
  • Pipelines where memory is owned elsewhere. If your application already has a memory layer (database of past conversations, structured user profiles), bolting Letta on top duplicates state.

The deeper framing — when application-managed memory is the right shape at all — is the topic of the AI memory systems guide. Letta is one implementation of that shape; mem0 is another with a different posture.

Build prompts like these in seconds

Use the Template Builder to customize 350+ expert templates with real-time preview, then export for any AI model.

Open Template Builder