Letta, MemGPT, agent memory, agent frameworks, LLM memory, Python

Letta (MemGPT) Walkthrough: How Self-Managing Agent Memory Works (2026)

How Letta's memory-block model, tool-based memory editing, and archival memory let an agent manage its own context — and when it beats vector-only RAG.

SurePrompts Team
May 4, 2026
15 min read

TL;DR

Letta (formerly MemGPT) is an open-source stateful agent framework where the agent itself manages its memory via tool calls. It uses an OS-inspired hierarchy — main context, recall storage, archival storage — and labeled memory blocks the model edits in its normal loop. The right tool when context must survive sessions; overkill for stateless RAG.

Letta (formerly MemGPT) is an open-source stateful agent framework where the agent itself manages its working memory through tool calls. It originated in the MemGPT paper from Berkeley in 2023, which framed an LLM as a process running on a memory-constrained operating system, and was rebranded Letta as the project grew into a general-purpose agent framework with persistence as a default rather than an afterthought.

The shorter way to say it: most agent frameworks ship orchestration and let you bolt memory on. Letta ships memory and lets you bolt orchestration on. If your problem is "this agent has to remember things across sessions and decide what's worth remembering," Letta is opinionated about how that should work. If your problem is "this pipeline needs three retrieval calls in a graph," you probably want something else.

Tip

Letta's central trick is putting the model in charge of memory hygiene — it decides what's worth remembering, what to summarize, and what to archive, all via tool calls inside its normal loop.

Key takeaways

  • Letta is an open-source agent framework with persistence baked in; the agent manages its own memory via tool use, not via an external retrieval pipeline.
  • The memory model is OS-inspired: main context (working memory the model sees every turn), recall storage (recent message history searchable on demand), archival storage (long-term searchable knowledge).
  • The load-bearing primitive is the memory block — a labeled, persistent string the agent edits with core_memory_append and core_memory_replace.
  • Conventional blocks are human (what the agent knows about the user) and persona (the agent's self-description); custom blocks are common for task and project state.
  • Letta beats vector-only RAG when memory has to evolve — write, rewrite, consolidate — not just be retrieved.
  • The cost is real: more tokens per turn and slower loops, because the agent spends some of its reasoning budget on memory housekeeping.
  • The framework is the wrong tool for stateless single-turn pipelines and prototypes; the right tool for long-running personal-assistant agents and multi-session products.

What Letta is

Letta is a Python framework for building stateful agents. The agent runs as a service: state is persisted in a database by default, so an agent's memory survives process restarts, machine moves, and long idle stretches between sessions. You instantiate an agent with a model, a set of tools, and an initial set of memory blocks. From that point on, the agent is addressable — you send it a message, it loops, it responds, and any memory it edited along the way is durable.
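A minimal sketch of that lifecycle, assuming the letta_client Python SDK pointed at a self-hosted Letta server. Parameter names and message shapes have shifted across versions, so treat this as the shape of the API rather than a copy-paste recipe:

code
from letta_client import Letta

# Assumes a Letta server running locally; agent state lives in its database, not in this process.
client = Letta(base_url="http://localhost:8283")

# Create a persistent agent: a model, embeddings for archival search, and initial memory blocks.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "human", "value": "Name: Priya\nRole: ML engineer"},
        {"label": "persona", "value": "You are a terse engineering pair-programmer."},
    ],
)

# Later, from any process, address the same agent by id; memory edits persist server-side.
response = client.agents.messages.create(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "What do you already know about me?"}],
)
for message in response.messages:
    print(message)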

That persistence-by-default posture is the part that distinguishes Letta from frameworks where state is something you manage yourself. In a typical LangGraph or LangChain pipeline, "memory" is whatever you choose to read and write at each step; if you forget to write, nothing is remembered. In Letta, the agent has a memory and the model has tools to edit it. Forgetting to remember requires actively not calling the tools.

The framework descends from the MemGPT paper, which proposed an OS analogy for managing an LLM's limited context: treat the context window as RAM, give the model tools to page information in and out, and let the model itself decide what's worth keeping hot. The rebrand to Letta reflected the project broadening into a general agent framework rather than only an implementation of that one paper's idea.

A practical consequence of running the agent as a service: you address it the way you'd address any other long-lived process. You don't reconstruct the agent every request from a stack of strings; you talk to an existing instance that already knows who it is, who you are, and what you've discussed. That changes how you think about deployment. The agent isn't a function you call; it's a process you supervise.

The MemGPT memory hierarchy

The OS analogy gives you the mental model. Three tiers, each with different latency and capacity tradeoffs.

Main context is RAM. It's what the model sees on every turn — the system prompt, the memory blocks, and whatever recent conversation is in scope. Cheap to read (it's just in the prompt), but tightly bounded by the context window and by your token budget. This is where memory blocks live.

Recall storage is SSD. It holds recent message history — the conversation log — and the agent can search it via tool calls when something earlier in the session matters. Slower than main context (you pay a tool call), but much larger.

Archival storage is disk. It's the long-term, searchable knowledge store. The agent inserts items into archival memory with a tool call and searches them with another tool call, typically backed by a vector index. Capacity is effectively unlimited; latency is the highest of the three tiers because the agent has to formulate a query and the result has to come back through the loop.

The point of the hierarchy isn't elegance. It's that the model has to make choices about which tier each piece of information lives in, and those choices are the agent's job, not the framework's. A fact that matters every turn belongs in a memory block. A fact that matters once a week belongs in archival memory. A line of conversation that might come up later belongs in recall storage by default — it's already there, just not in main context.
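To make the tiering concrete, here is an illustrative sketch in plain Python (not Letta's internal schema) of how the three tiers differ in how they are reached: blocks are serialized into every prompt, while the other two tiers sit behind tool calls the agent has to choose to make.

code
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    # Main context: labeled blocks rendered into every prompt. Zero retrieval cost, tight budget.
    blocks: dict[str, str] = field(default_factory=dict)
    # Recall storage: the full message log. Reached via a search tool call when needed.
    recall: list[dict] = field(default_factory=list)
    # Archival storage: long-term items behind a (typically vector) index. Also tool-call only.
    archival: list[str] = field(default_factory=list)

    def render_main_context(self) -> str:
        # Only this tier is pushed into the prompt on every turn.
        return "\n\n".join(f"[{label}]\n{value}" for label, value in self.blocks.items())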

Memory blocks — the load-bearing primitive

A memory block is a labeled string that lives in main context. The label gives the agent a stable handle. The string is editable through tool calls. The size is bounded — typically a few hundred to a couple thousand tokens per block — which forces the agent to curate rather than hoard.

The two conventional blocks:

code
[human]
Name: Priya
Role: ML engineer at a mid-stage fintech
Working on: latency reduction for inference pipeline
Communication style: terse, prefers code over prose
Known constraints: GPU budget locked through Q3

code
[persona]
You are an engineering pair-programmer agent.
You stay in scope on the user's current task.
You ask one clarifying question at a time, never multiple.
You prefer to show a small change and get reaction before suggesting the next.

The human block is what the agent knows about the user. It accretes over time as the agent learns. The persona block is what the agent knows about itself — its identity, voice, and standing instructions. Both are visible to the model on every turn, which is why blocks are powerful: there's no retrieval to fail. The cost is the token count, which is why blocks are bounded.

Custom blocks are common in production. A coding agent might keep a current_project block with the repo, branch, and ticket. A long-running research agent might keep a findings block summarizing what's been established so far. The pattern is the same: stable labels, bounded size, agent-edited.
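A custom block in the same format as the two conventional ones — the values here are hypothetical; the shape is what matters:

code
[current_project]
Repo: payments-service
Branch: feat/p99-latency
Ticket: PAY-1482
State: profiling done; batching decision pending
Open question: batch at the gateway or in the model server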

Tool-based memory editing

The agent edits memory blocks the same way it does everything else — by calling tools.

  • core_memory_append(label, content) — add to the named block.
  • core_memory_replace(label, old_substring, new_substring) — rewrite a span.
  • archival_memory_insert(content) — push an item into long-term storage.
  • archival_memory_search(query) — search archival memory and return matches.

The names matter less than the shape: write, rewrite, insert, search. The agent uses these in the middle of normal reasoning. Mid-conversation, the model might decide a fact about the user is worth keeping and emit core_memory_append("human", "Prefers TypeScript over JavaScript for new services"). Later, when archival storage gets too noisy, the model might consolidate by reading several items, summarizing, and inserting a single rolled-up entry.
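Stripped of framework plumbing, the two block-editing tools reduce to string operations over a dict of blocks plus a bounded size check. This is an illustrative re-implementation, not Letta's source. The string return values matter: they go back into the loop as tool results the model reads.

code
# Illustrative only: the shape of the core-memory tools, not Letta's implementation.
BLOCK_LIMIT = 2000  # per-block character budget; an assumption for this sketch

def core_memory_append(blocks: dict[str, str], label: str, content: str) -> str:
    updated = (blocks[label] + "\n" + content).strip()
    if len(updated) > BLOCK_LIMIT:
        return f"Error: [{label}] is at capacity ({len(updated)} chars). Consolidate before appending."
    blocks[label] = updated
    return f"Appended to [{label}]."

def core_memory_replace(blocks: dict[str, str], label: str, old: str, new: str) -> str:
    if old not in blocks[label]:
        return f"Error: '{old}' not found in [{label}]. Nothing changed."
    blocks[label] = blocks[label].replace(old, new)
    return f"Updated [{label}]."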

The prompt-engineering implication is the part most people underestimate. The agent's system prompt has to teach when to write versus when not to write, and when to read archival storage versus when to assume it's not there. Without that teaching, you get one of two failure modes: the agent never edits memory and the human block stays empty for weeks, or the agent edits memory on every turn and the blocks become a stream-of-consciousness journal that fills the context window.

A prompt pattern that tends to work:

code
Memory protocol:
- Append to the [human] block when you learn a stable fact about the user
  (preferences, role, ongoing projects). Never append speculation.
- Replace existing lines when a fact changes. Don't append a contradiction.
- Insert into archival memory only for items that won't fit in [human]
  but you might want later — past project decisions, prior conversations
  worth referencing.
- Search archival memory before answering questions about past work.
- If the [human] block grows past ~200 lines, consolidate.

This is the same discipline the Agentic Prompt Stack describes for tool-calling agents in general — explicit, narrow rules about when each tool fires. Memory tools are tools, and they need the same discipline.

Sleep-time compute / heartbeat

Letta also supports the agent running reflective passes when no user message is pending. The mental model is a background heartbeat: every so often, the agent gets a turn with no user input, and it can use that turn to reorganize memory — consolidate archival items, rewrite a human block that's grown messy, summarize recent conversation into a stable note.

Conceptually, this is "sleep-time compute" applied to an agent. During the day the agent reacts; during idle periods it does the housekeeping that keeps memory durable. At the architectural level, what matters is the loop: the agent gets autonomous turns, and you give it instructions about what to do with them. The specific mechanics — how often the heartbeat fires, what the trigger looks like in code — vary by version, and the discipline is the same as for any other agent loop. Tell the model what you want it to do with the turn, or it'll improvise.
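What a reflective turn actually does can be sketched without the framework. The helpers here (fetch_recent, llm_summarize, retire) are hypothetical stand-ins; in Letta the agent itself would make the same moves through its memory tools during an autonomous turn.

code
def consolidate_archival(storage, llm_summarize, batch_size: int = 50) -> None:
    # Hypothetical storage interface: fetch the newest archival items.
    recent = storage.fetch_recent(batch_size)
    if len(recent) < batch_size:
        return  # not enough accumulation to be worth a pass
    rollup = llm_summarize(
        "Consolidate these memory items into the fewest entries that preserve every "
        "durable fact. Drop duplicates, speculation, and conversational chatter:\n\n"
        + "\n".join(recent)
    )
    storage.insert(rollup)   # one rolled-up entry replaces many noisy ones
    storage.retire(recent)   # keep the originals out of the hot search index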

Letta vs vector-only RAG

The clearest framing is what each system controls.

Dimension                     | Vector-only RAG             | Letta
Who decides what to remember  | The ingestion pipeline      | The agent
Who decides what to retrieve  | The retriever               | The agent (via tool calls)
Memory is                     | Read-only at runtime        | Read + write + reflect
Storage shape                 | Embedded chunks             | Edited semantic blocks plus archival
Cost per query                | Cheap                       | Higher (tool turns spent on memory)
Best fit                      | Lookup over a fixed corpus  | Long-running stateful agents
Failure mode                  | Stale or irrelevant chunks  | Bad memory hygiene by the agent

Vector RAG is a great answer when the corpus is essentially static and the question is "given this user message, what context should I prepend?" Letta is a great answer when the corpus is the conversation itself, or when "what to remember" is itself a decision the agent has to make. The two aren't mutually exclusive — a Letta agent can call a RAG tool over a separate document corpus alongside its own archival memory.

The full taxonomy of memory shapes — within-session, provider-managed, application-managed — lives in the AI memory systems guide. Letta is a specific, opinionated implementation of the application-managed shape with the agent in charge.

Letta vs LangGraph + custom memory

LangGraph and Letta solve overlapping problems from opposite directions.

Dimension          | LangGraph + custom memory                          | Letta
Abstraction level  | Lower (you design the schema)                      | Higher (memory blocks + tools baked in)
Flexibility        | Maximum                                            | Constrained to the framework's model
Boilerplate        | More                                               | Less
Persistence        | Whatever you build                                 | Default, durable across runs
State shape        | Whatever you design                                | Memory blocks + recall + archival
Best fit           | Custom memory models, complex graph orchestration  | Stateful agents that match the framework's pattern

LangGraph is the right answer when your memory model doesn't fit a "blocks plus archival" shape, when you need precise control over state transitions across many nodes, or when you're embedding agent behavior inside a larger pipeline that already owns persistence. The LangGraph prompting guide covers what those agents look like.

Letta is the right answer when the stateful-agent pattern is exactly what you want and you'd rather inherit the discipline than rebuild it. The two also compose: a Letta agent can be a node inside a LangGraph workflow, with the outer graph owning multi-agent orchestration and the inner Letta agent owning per-conversation memory.
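A sketch of that composition, assuming the letta_client SDK and LangGraph's StateGraph API. The exact shape of Letta's returned message list varies by version, so the reply extraction here is approximate.

code
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from letta_client import Letta

class State(TypedDict):
    user_message: str
    agent_reply: str

letta = Letta(base_url="http://localhost:8283")
AGENT_ID = "agent-..."  # an existing, persistent Letta agent

def letta_node(state: State) -> dict:
    # The node delegates to the Letta agent; memory lives with the agent, not in graph state.
    response = letta.agents.messages.create(
        agent_id=AGENT_ID,
        messages=[{"role": "user", "content": state["user_message"]}],
    )
    replies = [
        m.content for m in response.messages
        if getattr(m, "message_type", "") == "assistant_message"
    ]
    return {"agent_reply": "\n".join(replies)}

graph = StateGraph(State)
graph.add_node("letta_agent", letta_node)
graph.add_edge(START, "letta_agent")
graph.add_edge("letta_agent", END)
app = graph.compile()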

The same comparison applies more loosely to the OpenAI Agents SDK, which gives you a lighter-weight framework with fewer opinions about persistence. If you want explicit memory primitives, Letta is the more opinionated choice.

Common failure modes

The framework gives you the primitives. The failures come from how the agent uses them.

The empty human block. The agent runs for weeks and never appends a fact. The system prompt didn't tell it to, or didn't make the trigger concrete. Fix: add explicit memory-protocol instructions, with examples of what to append and when.

The runaway human block. The agent appends every micro-observation. The block balloons past usable size and starts to crowd out other context. Fix: tighten the protocol — append only stable facts, replace when facts change, consolidate on a schedule.

Archival memory as a junk drawer. The agent inserts everything into archival storage with no consolidation pass. Search starts returning low-signal hits because the index is mostly noise. Fix: a periodic reflective turn that reads recent archival items and consolidates duplicates or near-duplicates into single entries.

Archival search misses. The agent stored a fact in one phrasing and queried in another, semantically related but lexically distant. The retriever returns nothing useful. Fix: on insert, store both a literal version and a paraphrase; on search, have the agent try multiple query phrasings before concluding the fact isn't there. The deeper issue is also covered in the memory recall guide.
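One way to implement the multi-phrasing fix outside the prompt, with archival_search and rephrase as hypothetical stand-ins for whatever search tool and rephrasing call you actually expose:

code
def search_with_rephrasings(archival_search, rephrase, question: str) -> list[str]:
    # Try the literal phrasing first, then a couple of LLM-generated paraphrases.
    queries = [question] + rephrase(question, n=2)
    seen: set[str] = set()
    hits: list[str] = []
    for q in queries:
        for item in archival_search(q):
            if item not in seen:
                seen.add(item)
                hits.append(item)
        if hits:
            break  # stop as soon as one phrasing lands
    return hits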

Cross-user memory pollution. The agent's memory bleeds across users because scoping is wrong at the storage layer. A fact appended during user A's session shows up in user B's long-term memory. Fix: per-user agent instances, hard scoping at the database layer, and integration tests that exercise the boundary.
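The scoping fix is mechanical: never resolve an agent except through a user-keyed lookup, so no code path leads from one user's request to another user's agent. A sketch, with registry and create_agent_for as hypothetical application-side helpers:

code
def send_message(client, registry: dict[str, str], create_agent_for, user_id: str, text: str):
    # Hard scoping: the agent id can only come from this user's registry entry.
    agent_id = registry.get(user_id)
    if agent_id is None:
        agent_id = create_agent_for(user_id)  # creates and persists a fresh per-user agent
        registry[user_id] = agent_id
    return client.agents.messages.create(
        agent_id=agent_id,
        messages=[{"role": "user", "content": text}],
    )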

Memory edits as a substitute for thinking. A subtler one. The agent learns that calling core_memory_append makes the loop feel productive, and starts doing it instead of actually answering. The output gets thin while the memory blocks grow. Fix: in the system prompt, separate "what you do for the user" from "what you do to memory," and require the user-facing answer to come first.

These failure modes aren't unique to Letta — they show up in any system where memory is written, read, and consolidated. They are more visible in Letta because the agent itself does the writing, so bad behavior shows up as bad agent behavior rather than as a silent retrieval bug.

When Letta is the right tool

Letta earns its weight when persistence and memory hygiene are first-class requirements:

  • Long-running personal-assistant agents. Same user, weeks or months of interaction, memory has to accumulate and stay coherent.
  • Multi-session products. A chat agent in a product where context is supposed to survive between sessions, not reset every time.
  • Agents that learn from interaction. Where the agent's understanding of the user, the project, or the domain has to evolve based on what it observes.
  • Cases where you want the agent to own memory decisions. Because the alternative — building all the curation logic outside the model — is more brittle than letting the model do it inline.

Letta is overkill when:

  • Single-turn tools. A summarizer, a translator, a one-shot query agent. There's nothing to remember.
  • Stateless RAG. A read-only pipeline over a fixed corpus, where retrieval is the only memory operation.
  • Prototypes. Where you're still figuring out whether memory matters at all. Start with a flat scratchpad, see what hurts, then graduate.
  • Pipelines where memory is owned elsewhere. If your application already has a memory layer (database of past conversations, structured user profiles), bolting Letta on top duplicates state.

The deeper framing — when application-managed memory is the right shape at all — is the topic of the AI memory systems guide. Letta is one implementation of that shape; mem0 is another with a different posture.

Build prompts like these in seconds

Use the Template Builder to customize 350+ expert templates with real-time preview, then export for any AI model.

Open Template Builder