Tags: Mastra · TypeScript · agent frameworks · AI workflows · JavaScript · Node.js

Mastra Prompting Guide: The TypeScript Framework for AI Agents (2026)

How to prompt Mastra agents and workflows: instructions, tools, memory, RAG, and evals in a TypeScript-native framework built on the Vercel AI SDK.

SurePrompts Team
May 4, 2026
15 min read

TL;DR

Mastra is an open-source TypeScript framework for building AI agents and workflows on the Vercel AI SDK. This guide covers the six primitives — agents, workflows, tools, memory, RAG, evals — with concrete prompting patterns, a comparison against LangGraph and CrewAI, common failure modes, and when to pick it over Python-heavy alternatives.

Mastra is an open-source TypeScript framework for building AI agents and workflows, built on top of the Vercel AI SDK and organised around six primitives: agents, workflows, tools, memory, RAG, and evals. If your backend is already TypeScript, it lets you build agentic systems without bolting a Python service onto your stack.

Tip

The compounding win of staying in one language is not the syntax — it is that your Zod schemas, tool signatures, workflow state, and API responses are all the same types end-to-end, so the model's outputs and your application's expectations cannot quietly disagree.

Key takeaways

  • Mastra is TypeScript-native and built on the Vercel AI SDK, so streaming, tool calling, and structured output share the same abstractions as a Next.js app.
  • Agents and workflows are separate primitives — agents are model-driven loops, workflows are deterministic step graphs. Use both, compose them.
  • Tool schemas are written in Zod, which means the schema doubles as the prompt-facing description the model sees.
  • Memory is a first-class primitive (working memory plus conversation memory), not an afterthought you wire up later.
  • Evals ship with the framework. Use them from day one or you will regret it by month two.
  • The honest tradeoff is ecosystem: Python still leads on niche integrations and research-grade tooling. If your stack is TypeScript, Mastra usually wins on ergonomics; if it is Python, LangGraph usually wins on community.
  • Treat the agent's instructions field as a load-bearing prompt slot, not a comment. It is the difference between a useful agent and a hallucinating one.

What Mastra is

Mastra is an open-source TypeScript framework for building AI applications. It runs on Node.js, deploys to Vercel, Cloudflare Workers, or any standard Node server, and is model-agnostic via the Vercel AI SDK — you can swap from OpenAI to Anthropic to Google to a local model without rewriting application logic. The framework comes from the team behind Gatsby, which shows in the developer experience: there is a local dev playground for inspecting agent runs, a CLI for scaffolding, and the typing is genuinely tight rather than nominal.

The six primitives are:

  • Agents — a model plus instructions, optional tools, optional memory. The agent decides which tool to call and when, in a loop, until it produces a final response.
  • Workflows — deterministic step graphs. You define steps, declare next-step transitions (optionally conditional), and the framework runs them in order. Workflows support suspend and resume.
  • Tools — typed functions the model can call. Inputs and outputs use Zod schemas, which the framework also serialises into the tool descriptions the model sees.
  • Memory — working memory (scratchpad-style state across turns) and conversation memory (history with pruning and summarisation).
  • RAG — primitives for chunking, embedding, retrieving, and grounding agent responses in external documents.
  • Evals — a first-class scoring framework, including LLM-as-judge scorers, that you run in CI or as a regression suite.

You do not have to use all six on day one. A first Mastra app is often one agent with a couple of tools and a memory store. The rest comes in as you discover you need it.

Why TypeScript-first matters

The headline reason to pick Mastra is that you stay in one language. The deeper reason is that types flow end-to-end. The Zod schema that defines a tool's input is the same schema that types the model's structured output, the same schema that types the workflow state, and the same schema that types your API response. There is no gap between "what the model promised" and "what your code expects." When the model returns malformed output, the Zod parse fails at a single, well-defined boundary instead of corrupting state three steps downstream.

The practical effect is fewer integration shims. A typical Python agent in a TypeScript shop sits behind a FastAPI service, with hand-written request/response models on both sides, two deployment pipelines, and two sets of logs. With Mastra, the agent is a function in your existing repository. Your existing testing harness runs its evals. Your existing observability sees its tool calls.

The honest counterpoint: the Python ecosystem still leads on research-grade integrations. New vector stores, new academic eval methodologies, and one-off connectors land in Python first and get ported (or never get ported) to TypeScript. If your agent's value depends on a niche integration that only exists in Python, the language-alignment win does not pay for the missing connector. For mainstream stacks — OpenAI, Anthropic, Pinecone, pgvector, generic HTTP tools — Mastra holds up.

Mastra vs LangGraph vs CrewAI

Dimension         | Mastra                                           | LangGraph                                          | CrewAI
Primary language  | TypeScript                                       | Python                                             | Python
Mental model      | Agents + workflows as separate primitives        | Stateful graph with typed state                    | Crew of role-based agents
State model       | Zod-typed workflow state, agent memory           | Typed state object mutated by nodes                | Implicit, role-driven
Evals built-in    | Yes, first-class                                 | Available via LangSmith / external                 | Available via external tools
Deployment target | Node.js, Vercel, Cloudflare Workers              | Python service (FastAPI, Lambda)                   | Python service
Learning curve    | Low if you know the Vercel AI SDK                | Moderate, graph-thinking required                  | Low, role metaphor is intuitive
When to pick      | TypeScript backends, mixed agent/workflow needs  | Research-heavy Python stacks, complex graph state  | Quick role-based prototypes

This is a directional comparison, not a benchmark. The right framework depends on your existing stack, your team's language fluency, and whether your problem is more "deterministic graph" (LangGraph), "model-driven loop" (Mastra agents or OpenAI Agents SDK), or "narrative crew of specialists" (CrewAI). For a deeper look at when to use which, see the multi-agent prompting guide.

Writing a Mastra agent

A Mastra agent is configured with a name, a model, instructions, and optional tools and memory. The instructions field is the prompt slot — it is the system prompt the model sees on every turn, and it is the single most important thing you write.

```ts
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

// lookupAccount is defined in the Tools section below;
// createTicket follows the same pattern.
export const supportAgent = new Agent({
  name: "support-agent",
  instructions: `
You are a customer support agent for Acme Cloud Storage.

Tone: warm, concise, technically precise. Never apologise more than once per response.

You can do exactly three things:
1. Look up a customer's plan and usage with the lookupAccount tool.
2. Create a support ticket with the createTicket tool.
3. Explain documented features in your own words.

You cannot:
- Issue refunds (escalate via createTicket with priority="billing").
- Make promises about future features.
- Answer questions about other Acme products.

If the user asks something outside your scope, say so plainly and offer to create a ticket.
  `,
  model: openai("gpt-4o"),
  tools: { lookupAccount, createTicket },
});
```

Three things to notice. First, the instructions name the role, the tone, the allowed actions, and the forbidden actions in roughly that order — this matches the RCAF structure (Role, Context, Action, Format) and is not accidental. Second, the tool list is short — three is roughly the upper bound before tool selection accuracy starts to drop. Third, the escalation path is explicit. Vague agents do vague things; explicit agents stay in scope.

For a structured-output agent — say, an extractor that pulls invoice fields from raw text — you skip tools entirely and use the Vercel AI SDK's generateObject with a Zod schema. The agent's instructions become the extraction rubric, and the schema becomes the contract. The output cannot be malformed; it either parses or the call fails, which is exactly the failure mode you want at a system boundary.

For a RAG-grounded answerer, the instructions name the rule ("answer only from retrieved context, cite source IDs, say 'I don't know' if the context is silent") and a retrieval tool fetches the relevant chunks. The agent loop handles the back-and-forth.

Workflows — deterministic step graphs

Workflows are Mastra's answer to "I want a fixed pipeline, not a free-running agent." A workflow is a graph of Step nodes with declared transitions. Each step has typed input and output (Zod again), and you wire them together with .then() for sequential flow or branching configuration for conditional paths.

The right mental model: agents decide, workflows orchestrate. If the path through your system is "fetch data, summarise it, write a draft, route to human review, send if approved," that is a workflow. The model is used at specific steps (summarise, draft) but the path itself is fixed. If the path is "answer the user, possibly by calling tools, possibly by asking clarifying questions, possibly by chaining tool calls," that is an agent.
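The "fetch, summarise, draft" shape can be sketched as a fixed sequence of typed steps. This is a framework-agnostic illustration of the workflow mental model, not Mastra's actual workflow API, and the summarise step is stubbed where a real workflow would call a model:

```typescript
// A deterministic pipeline: the path is fixed, only the work inside
// each step varies. Step names and shapes are illustrative.
type Report = { raw: string; summary?: string; draft?: string };

const fetchData = (r: Report): Report => ({ ...r, raw: r.raw.trim() });

// In a real workflow this step would call a model; here it is stubbed.
const summarise = (r: Report): Report => ({
  ...r,
  summary: r.raw.slice(0, 40),
});

const draft = (r: Report): Report => ({
  ...r,
  draft: `Summary for review: ${r.summary}`,
});

// Sequential flow — in Mastra you would declare these transitions
// on the workflow rather than reduce over an array.
const steps = [fetchData, summarise, draft];
const result = steps.reduce((state, step) => step(state), {
  raw: "  Q2 storage usage grew 18% quarter over quarter.  ",
} as Report);

console.log(result.draft);
```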

Workflows support suspend and resume, which makes them the right primitive for long-running or human-in-the-loop processes. A draft step can suspend the workflow, the human reviews out-of-band, and the resume call picks up with the human's verdict in state. This is much harder to do cleanly inside an agent loop.

The two primitives compose. A workflow step can call an agent (use the agent for a single bounded decision). An agent's tool can trigger a workflow (use the agent as the conversational front-end, the workflow as the deterministic backbone). Most production systems end up using both, and the framework's cleanest property is that it does not pretend they are the same thing.

For a deeper treatment of how to think about agent orchestration across primitives, see the Agentic Prompt Stack canonical.

Tools

A Mastra tool is a typed function with a name, a description, an input schema, an output schema, and an execute function. The schemas are Zod, and the framework serialises them into the tool descriptions the model sees when deciding whether to call the tool.

```ts
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

export const lookupAccount = createTool({
  id: "lookupAccount",
  description: "Fetch a customer's plan, current storage usage, and billing status by their email address. Use this when the user asks about their account, usage, or billing.",
  inputSchema: z.object({
    email: z.string().email().describe("The customer's account email"),
  }),
  outputSchema: z.object({
    plan: z.enum(["free", "pro", "enterprise"]),
    usageGB: z.number(),
    billingStatus: z.enum(["active", "past_due", "cancelled"]),
  }),
  execute: async ({ context }) => {
    // `db` stands in for your application's data layer.
    return await db.accounts.findByEmail(context.email);
  },
});
```

The description field is a prompt. The model uses it to decide when to call the tool, so write it for the model, not for your future self. "Fetch a customer's plan, current storage usage, and billing status by their email address. Use this when the user asks about their account, usage, or billing." is a description. "Account lookup endpoint" is not. The Zod .describe() calls on individual fields show up in the schema the model sees, so use them for any field whose name is not self-explanatory.

This is what people mean when they say tool use is a prompting problem. The function logic is plumbing; the description and schema are the prompt. For structured output more broadly, the same principle holds — the schema is the contract.

Memory

Mastra ships two kinds of memory. Conversation memory is the rolling history of user and assistant turns. Working memory is a model-managed scratchpad — a persistent state the agent reads and updates across turns, used for things like "the user's name," "the project they are working on," or "the unresolved question from three turns ago."

Both are first-class. You configure them on the agent, not as an afterthought, and the framework handles persistence (by default in-memory, with adapters for SQL stores).

The thing to watch is token economics. Conversation memory grows linearly with turns, and naive approaches will silently push your context window past the model's limit or — more commonly — past the point where the model starts dropping information from the middle. Mastra's memory primitives support pruning (drop oldest turns) and summarisation (replace old turns with a generated summary). Pick a strategy at design time, not after your first production incident. For a structured way to think about this growth, see the Context Engineering Maturity Model.
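The "drop oldest turns" strategy is simple enough to sketch directly. This is an illustration of the pruning idea, not Mastra's memory API, and the four-characters-per-token estimate is a rough heuristic:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token for English text.
const estimateTokens = (t: Turn) => Math.ceil(t.content.length / 4);

function pruneToBudget(history: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk from newest to oldest, keeping turns until the budget is spent.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i]);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}

const history: Turn[] = [
  { role: "user", content: "a".repeat(400) },      // ~100 tokens
  { role: "assistant", content: "b".repeat(400) }, // ~100 tokens
  { role: "user", content: "c".repeat(200) },      // ~50 tokens
];
console.log(pruneToBudget(history, 160).length); // 2 — the oldest turn is dropped
```

Summarisation replaces the dropped turns with a generated summary instead of discarding them outright; the budget-enforcement logic is the same.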

Working memory is more interesting. The model can update it as part of its response, which lets you build agents that genuinely remember things across long conversations without re-reading the entire transcript every turn. The cost is that you trust the model to keep working memory clean — write the instructions for it carefully and treat it as a prompt slot.

Evals — first-class

The evals framework is one of Mastra's better decisions. Scoring functions (called scorers) attach to agent or workflow outputs and run as part of CI or as a standalone regression suite. Built-in scorers cover correctness, faithfulness to retrieved context, tone, and safety. Custom scorers are TypeScript functions that take the input and output and return a score — and the framework supports LLM-as-judge scorers, where the scorer is itself a model call against a rubric.

The reason this matters: most agent regressions are silent. The model returns plausible text, the JSON parses, the user does not complain immediately, and three weeks later you discover that one of your tools has been hallucinating account IDs. Evals catch this on the next CI run, not in production. Use them from day one. The discipline is not "add evals later when we have time" — it is "write the first scorer the day you write the first agent."

For a complete treatment of the eval discipline, see the Prompt Evaluation Complete Guide. For LLM-as-judge specifically — including how to write rubrics that produce stable scores — see the LLM-as-judge prompting guide. And for the broader rubric we apply to any prompt before shipping, see the SurePrompts Quality Rubric.

A practical note: pair every scorer with a golden set of inputs and expected behaviours. A scorer without a golden set is just a metric; a scorer with a golden set is a regression test. Use the framework's eval harness to run the scorers against the golden set on every change.
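A minimal scorer-plus-golden-set pairing can be sketched like this. The shapes are illustrative rather than Mastra's scorer API, and the agent is stubbed where you would call the real one:

```typescript
type GoldenCase = { input: string; mustContain: string };

// A correctness scorer: 1 if the output contains the expected fact, else 0.
const containsFact = (output: string, expected: string): number =>
  output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;

// Stand-in for the agent under test; in practice this calls the real agent.
const agent = (input: string): string =>
  input.includes("plan") ? "You are on the Pro plan." : "I don't know.";

const goldenSet: GoldenCase[] = [
  { input: "What plan am I on?", mustContain: "pro plan" },
  { input: "What's the weather?", mustContain: "i don't know" },
];

const scores = goldenSet.map((c) => containsFact(agent(c.input), c.mustContain));
const passRate = scores.reduce((a, b) => a + b, 0) / scores.length;
console.log(passRate); // 1 — every golden case passed
```

Run against the golden set on every change, the scorer becomes a regression test: a prompt edit that breaks the refusal behaviour drops the pass rate immediately instead of three weeks later.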

Common failure modes

Five patterns show up over and over:

Vague instructions. The agent's instructions field is a system prompt, not a code comment. "You are a helpful assistant" produces a helpful-but-generic assistant. Name the role, the scope, the tone, the allowed actions, and the forbidden actions. If you cannot say what the agent should refuse to do, you have not finished writing the instructions. Fix: rewrite using RCAF (Role, Context, Action, Format) and check it against the SurePrompts Quality Rubric.

Tool descriptions written like internal documentation. The model reads the tool description to decide whether to call it. If the description says "wraps the v2 account API," the model has no idea when to call it. Symptom: the model either never calls the tool or calls it at random. Fix: rewrite the description as instructions to the model: "Use this tool when X. It returns Y. Do not use it for Z."

Workflows-vs-agents misuse. Forcing a deterministic pipeline through an agent loop is slow, expensive, and brittle. Forcing an open-ended conversation through a workflow is rigid and frustrating. Symptom: an agent that "almost always does the same five things in the same order" should be a workflow; a workflow with a step labelled "model decides what to do next" should be an agent. Fix: factor along the boundary — workflow on the outside, agent inside one step (or vice versa).

Default memory ignored until it breaks. Conversation history grows, the context window starts dropping middle turns, the agent forgets what the user said three turns ago. Symptom: quality degrades after long conversations. Fix: configure pruning or summarisation at design time. Pick a token budget and enforce it.

Evals deferred. "We will add evals once the agent works." The agent never works in the sense you mean — it works on the inputs you remember to test by hand. Symptom: silent regressions, "it worked yesterday," no way to compare two prompt versions. Fix: write the first scorer the day you write the first agent. Even a single correctness scorer on three golden inputs is infinitely better than zero.

For a deeper failure-mode taxonomy across multi-agent system designs and agent tool loop failures, see the AI agents prompting guide and the broader agentic AI prompting guide.

When Mastra is right — and when it is not

Pick Mastra when:

  • Your backend is TypeScript and you want to keep it that way.
  • You already use the Vercel AI SDK in a Next.js app and want the same abstractions on the server.
  • Your problem mixes deterministic workflows and model-driven agents, and you want a clean primitive boundary between them.
  • You value evals as a first-class concept rather than an external dependency.

Pick something else when:

  • Your stack is Python — LangGraph will integrate more cleanly and your team will hit fewer dead ends.
  • Your agent depends on a niche Python-only integration (a specific vector store, a specific eval methodology, a research-grade scorer).
  • Your team has invested significantly in LangChain or CrewAI and the migration cost outweighs the language-alignment win.
  • You need the largest possible community of prior art and StackOverflow answers — Python frameworks still have more.

The framework is a tool, not an identity. Pick the one that lets your team ship the most reliable agent in the time you have, and revisit the choice in six months.
