Tags: Mastra · TypeScript · agent frameworks · AI workflows · JavaScript · Node.js

Mastra Prompting Guide: The TypeScript Framework for AI Agents (2026)

How to prompt Mastra agents and workflows: instructions, tools, memory, RAG, and evals in a TypeScript-native framework built on the Vercel AI SDK.

SurePrompts Team
May 4, 2026
15 min read

TL;DR

Mastra is an open-source TypeScript framework for building AI agents and workflows on the Vercel AI SDK. This guide covers the six primitives — agents, workflows, tools, memory, RAG, evals — with concrete prompting patterns, a comparison against LangGraph and CrewAI, common failure modes, and when to pick it over Python-heavy alternatives.

Mastra is an open-source TypeScript framework for building AI agents and workflows, built on top of the Vercel AI SDK and organised around six primitives: agents, workflows, tools, memory, RAG, and evals. If your backend is already TypeScript, it lets you build agentic systems without bolting a Python service onto your stack.

Tip

The compounding win of staying in one language is not the syntax — it is that your Zod schemas, tool signatures, workflow state, and API responses are all the same types end-to-end, so the model's outputs and your application's expectations cannot quietly disagree.

Key takeaways

  • Mastra is TypeScript-native and built on the Vercel AI SDK, so streaming, tool calling, and structured output share the same abstractions as a Next.js app.
  • Agents and workflows are separate primitives — agents are model-driven loops, workflows are deterministic step graphs. Use both, compose them.
  • Tool schemas are written in Zod, which means the schema doubles as the prompt-facing description the model sees.
  • Memory is a first-class primitive (working memory plus conversation memory), not an afterthought you wire up later.
  • Evals ship with the framework. Use them from day one or you will regret it by month two.
  • The honest tradeoff is ecosystem: Python still leads on niche integrations and research-grade tooling. If your stack is TypeScript, Mastra usually wins on ergonomics; if it is Python, LangGraph usually wins on community.
  • Treat the agent's instructions field as a load-bearing prompt slot, not a comment. It is the difference between a useful agent and a hallucinating one.

What Mastra is

Mastra is an open-source TypeScript framework for building AI applications. It runs on Node.js, deploys to Vercel, Cloudflare Workers, or any standard Node server, and is model-agnostic via the Vercel AI SDK — you can swap from OpenAI to Anthropic to Google to a local model without rewriting application logic. The framework comes from the team behind Gatsby, which shows in the developer experience: there is a local dev playground for inspecting agent runs, a CLI for scaffolding, and the typing is genuinely tight rather than nominal.

The six primitives are:

  • Agents — a model plus instructions, optional tools, optional memory. The agent decides which tool to call and when, in a loop, until it produces a final response.
  • Workflows — deterministic step graphs. You define steps, declare next-step transitions (optionally conditional), and the framework runs them in order. Workflows support suspend and resume.
  • Tools — typed functions the model can call. Inputs and outputs use Zod schemas, which the framework also serialises into the tool descriptions the model sees.
  • Memory — working memory (scratchpad-style state across turns) and conversation memory (history with pruning and summarisation).
  • RAG — primitives for chunking, embedding, retrieving, and grounding agent responses in external documents.
  • Evals — a first-class scoring framework, including LLM-as-judge scorers, that you run in CI or as a regression suite.

You do not have to use all six on day one. A first Mastra app is often one agent with a couple of tools and a memory store. The rest comes in as you discover you need it.

Why TypeScript-first matters

The headline reason to pick Mastra is that you stay in one language. The deeper reason is that types flow end-to-end. The Zod schema that defines a tool's input is the same schema that types the model's structured output, the same schema that types the workflow state, and the same schema that types your API response. There is no gap between "what the model promised" and "what your code expects." When the model returns malformed output, the Zod parse fails at a single, well-defined boundary instead of corrupting state three steps downstream.

The practical effect is fewer integration shims. A typical Python agent in a TypeScript shop sits behind a FastAPI service, with hand-written request/response models on both sides, two deployment pipelines, and two sets of logs. With Mastra, the agent is a function in your existing repository. Your existing testing harness runs its evals. Your existing observability sees its tool calls.

The honest counterpoint: the Python ecosystem still leads on research-grade integrations. New vector stores, new academic eval methodologies, and one-off connectors land in Python first and get ported (or never get ported) to TypeScript. If your agent's value depends on a niche integration that only exists in Python, the language-alignment win does not pay for the missing connector. For mainstream stacks — OpenAI, Anthropic, Pinecone, pgvector, generic HTTP tools — Mastra holds up.

Mastra vs LangGraph vs CrewAI

Dimension         | Mastra                                           | LangGraph                                          | CrewAI
Primary language  | TypeScript                                       | Python                                             | Python
Mental model      | Agents + workflows as separate primitives        | Stateful graph with typed state                    | Crew of role-based agents
State model       | Zod-typed workflow state, agent memory           | Typed state object mutated by nodes                | Implicit, role-driven
Evals built-in    | Yes, first-class                                 | Available via LangSmith / external                 | Available via external tools
Deployment target | Node.js, Vercel, Cloudflare Workers              | Python service (FastAPI, Lambda)                   | Python service
Learning curve    | Low if you know the Vercel AI SDK                | Moderate, graph-thinking required                  | Low, role metaphor is intuitive
When to pick      | TypeScript backends, mixed agent/workflow needs  | Research-heavy Python stacks, complex graph state  | Quick role-based prototypes

This is a directional comparison, not a benchmark. The right framework depends on your existing stack, your team's language fluency, and whether your problem is more "deterministic graph" (LangGraph), "model-driven loop" (Mastra agents or OpenAI Agents SDK), or "narrative crew of specialists" (CrewAI). For a deeper look at when to use which, see the multi-agent prompting guide.

Writing a Mastra agent

A Mastra agent is configured with a name, a model, instructions, and optional tools and memory. The instructions field is the prompt slot — it is the system prompt the model sees on every turn, and it is the single most important thing you write.

```ts
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

// lookupAccount is defined in the Tools section below;
// createTicket follows the same pattern.
export const supportAgent = new Agent({
  name: "support-agent",
  instructions: `
You are a customer support agent for Acme Cloud Storage.

Tone: warm, concise, technically precise. Never apologise more than once per response.

You can do exactly three things:
1. Look up a customer's plan and usage with the lookupAccount tool.
2. Create a support ticket with the createTicket tool.
3. Explain documented features in your own words.

You cannot:
- Issue refunds (escalate via createTicket with priority="billing").
- Make promises about future features.
- Answer questions about other Acme products.

If the user asks something outside your scope, say so plainly and offer to create a ticket.
  `,
  model: openai("gpt-4o"),
  tools: { lookupAccount, createTicket },
});
```

Three things to notice. First, the instructions name the role, the tone, the allowed actions, and the forbidden actions in roughly that order — this matches the RCAF structure (Role, Context, Action, Format) and is not accidental. Second, the tool list is short — three is roughly the upper bound before tool selection accuracy starts to drop. Third, the escalation path is explicit. Vague agents do vague things; explicit agents stay in scope.

For a structured-output agent — say, an extractor that pulls invoice fields from raw text — you skip tools entirely and use the Vercel AI SDK's generateObject with a Zod schema. The agent's instructions become the extraction rubric, and the schema becomes the contract. The output cannot be malformed; it either parses or the call fails, which is exactly the failure mode you want at a system boundary.

For a RAG-grounded answerer, the instructions name the rule ("answer only from retrieved context, cite source IDs, say 'I don't know' if the context is silent") and a retrieval tool fetches the relevant chunks. The agent loop handles the back-and-forth.

Workflows — deterministic step graphs

Workflows are Mastra's answer to "I want a fixed pipeline, not a free-running agent." A workflow is a graph of Step nodes with declared transitions. Each step has typed input and output (Zod again), and you wire them together with .then() for sequential flow or branching configuration for conditional paths.

The right mental model: agents decide, workflows orchestrate. If the path through your system is "fetch data, summarise it, write a draft, route to human review, send if approved," that is a workflow. The model is used at specific steps (summarise, draft) but the path itself is fixed. If the path is "answer the user, possibly by calling tools, possibly by asking clarifying questions, possibly by chaining tool calls," that is an agent.
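The "fetch, summarise, draft" shape can be sketched as a fixed sequence of typed steps. This is a framework-agnostic illustration of the workflow mental model, not Mastra's actual workflow API, and the summarise step is stubbed where a real workflow would call a model:

```typescript
// A deterministic pipeline: the path is fixed, only the work inside
// each step varies. Step names and shapes are illustrative.
type Report = { raw: string; summary?: string; draft?: string };

const fetchData = (r: Report): Report => ({ ...r, raw: r.raw.trim() });

// In a real workflow this step would call a model; here it is stubbed.
const summarise = (r: Report): Report => ({
  ...r,
  summary: r.raw.slice(0, 40),
});

const draft = (r: Report): Report => ({
  ...r,
  draft: `Summary for review: ${r.summary}`,
});

// Sequential flow — in Mastra you would declare these transitions
// on the workflow rather than reduce over an array.
const steps = [fetchData, summarise, draft];
const result = steps.reduce((state, step) => step(state), {
  raw: "  Q2 storage usage grew 18% quarter over quarter.  ",
} as Report);

console.log(result.draft);
```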

Workflows support suspend and resume, which makes them the right primitive for long-running or human-in-the-loop processes. A draft step can suspend the workflow, the human reviews out-of-band, and the resume call picks up with the human's verdict in state. This is much harder to do cleanly inside an agent loop.

The two primitives compose. A workflow step can call an agent (use the agent for a single bounded decision). An agent's tool can trigger a workflow (use the agent as the conversational front-end, the workflow as the deterministic backbone). Most production systems end up using both, and the framework's cleanest property is that it does not pretend they are the same thing.

For a deeper treatment of how to think about agent orchestration across primitives, see the Agentic Prompt Stack canonical.

Tools

A Mastra tool is a typed function with a name, a description, an input schema, an output schema, and an execute function. The schemas are Zod, and the framework serialises them into the tool descriptions the model sees when deciding whether to call the tool.

```ts
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

export const lookupAccount = createTool({
  id: "lookupAccount",
  description: "Fetch a customer's plan, current storage usage, and billing status by their email address. Use this when the user asks about their account, usage, or billing.",
  inputSchema: z.object({
    email: z.string().email().describe("The customer's account email"),
  }),
  outputSchema: z.object({
    plan: z.enum(["free", "pro", "enterprise"]),
    usageGB: z.number(),
    billingStatus: z.enum(["active", "past_due", "cancelled"]),
  }),
  execute: async ({ context }) => {
    // `db` stands in for your application's data layer.
    return await db.accounts.findByEmail(context.email);
  },
});
```

The description field is a prompt. The model uses it to decide when to call the tool, so write it for the model, not for your future self. "Fetch a customer's plan, current storage usage, and billing status by their email address. Use this when the user asks about their account, usage, or billing." is a description. "Account lookup endpoint" is not. The Zod .describe() calls on individual fields show up in the schema the model sees, so use them for any field whose name is not self-explanatory.

This is what people mean when they say tool use is a prompting problem. The function logic is plumbing; the description and schema are the prompt. For structured output more broadly, the same principle holds — the schema is the contract.

Memory

Mastra ships two kinds of memory. Conversation memory is the rolling history of user and assistant turns. Working memory is a model-managed scratchpad — a persistent state the agent reads and updates across turns, used for things like "the user's name," "the project they are working on," or "the unresolved question from three turns ago."

Both are first-class. You configure them on the agent, not as an afterthought, and the framework handles persistence (by default in-memory, with adapters for SQL stores).

The thing to watch is token economics. Conversation memory grows linearly with turns, and naive approaches will silently push your context window past the model's limit or — more commonly — past the point where the model starts dropping information from the middle. Mastra's memory primitives support pruning (drop oldest turns) and summarisation (replace old turns with a generated summary). Pick a strategy at design time, not after your first production incident. For a structured way to think about this growth, see the Context Engineering Maturity Model.
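The "drop oldest turns" strategy is simple enough to sketch directly. This is an illustration of the pruning idea, not Mastra's memory API, and the four-characters-per-token estimate is a rough heuristic:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Rough heuristic: ~4 characters per token for English text.
const estimateTokens = (t: Turn) => Math.ceil(t.content.length / 4);

function pruneToBudget(history: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  // Walk from newest to oldest, keeping turns until the budget is spent.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = estimateTokens(history[i]);
    if (used + cost > budget) break;
    kept.unshift(history[i]);
    used += cost;
  }
  return kept;
}

const history: Turn[] = [
  { role: "user", content: "a".repeat(400) },      // ~100 tokens
  { role: "assistant", content: "b".repeat(400) }, // ~100 tokens
  { role: "user", content: "c".repeat(200) },      // ~50 tokens
];
console.log(pruneToBudget(history, 160).length); // 2 — the oldest turn is dropped
```

Summarisation replaces the dropped turns with a generated summary instead of discarding them outright; the budget-enforcement logic is the same.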

Working memory is more interesting. The model can update it as part of its response, which lets you build agents that genuinely remember things across long conversations without re-reading the entire transcript every turn. The cost is that you trust the model to keep working memory clean — write the instructions for it carefully and treat it as a prompt slot.

Evals — first-class

The evals framework is one of Mastra's better decisions. Scoring functions (called scorers) attach to agent or workflow outputs and run as part of CI or as a standalone regression suite. Built-in scorers cover correctness, faithfulness to retrieved context, tone, and safety. Custom scorers are TypeScript functions that take the input and output and return a score — and the framework supports LLM-as-judge scorers, where the scorer is itself a model call against a rubric.

The reason this matters: most agent regressions are silent. The model returns plausible text, the JSON parses, the user does not complain immediately, and three weeks later you discover that one of your tools has been hallucinating account IDs. Evals catch this on the next CI run, not in production. Use them from day one. The discipline is not "add evals later when we have time" — it is "write the first scorer the day you write the first agent."

For a complete treatment of the eval discipline, see the Prompt Evaluation Complete Guide. For LLM-as-judge specifically — including how to write rubrics that produce stable scores — see the LLM-as-judge prompting guide. And for the broader rubric we apply to any prompt before shipping, see the SurePrompts Quality Rubric.

A practical note: pair every scorer with a golden set of inputs and expected behaviours. A scorer without a golden set is just a metric; a scorer with a golden set is a regression test. Use the framework's eval harness to run the scorers against the golden set on every change.
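A minimal scorer-plus-golden-set pairing can be sketched like this. The shapes are illustrative rather than Mastra's scorer API, and the agent is stubbed where you would call the real one:

```typescript
type GoldenCase = { input: string; mustContain: string };

// A correctness scorer: 1 if the output contains the expected fact, else 0.
const containsFact = (output: string, expected: string): number =>
  output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;

// Stand-in for the agent under test; in practice this calls the real agent.
const agent = (input: string): string =>
  input.includes("plan") ? "You are on the Pro plan." : "I don't know.";

const goldenSet: GoldenCase[] = [
  { input: "What plan am I on?", mustContain: "pro plan" },
  { input: "What's the weather?", mustContain: "i don't know" },
];

const scores = goldenSet.map((c) => containsFact(agent(c.input), c.mustContain));
const passRate = scores.reduce((a, b) => a + b, 0) / scores.length;
console.log(passRate); // 1 — every golden case passed
```

Run against the golden set on every change, the scorer becomes a regression test: a prompt edit that breaks the refusal behaviour drops the pass rate immediately instead of three weeks later.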

Common failure modes

Five patterns show up over and over:

Vague instructions. The agent's instructions field is a system prompt, not a code comment. "You are a helpful assistant" produces a helpful-but-generic assistant. Name the role, the scope, the tone, the allowed actions, and the forbidden actions. If you cannot say what the agent should refuse to do, you have not finished writing the instructions. Fix: rewrite using RCAF (Role, Context, Action, Format) and check it against the SurePrompts Quality Rubric.

Tool descriptions written like internal documentation. The model reads the tool description to decide whether to call it. If the description says "wraps the v2 account API," the model has no idea when to call it. Symptom: the model either never calls the tool or calls it at random. Fix: rewrite the description as instructions to the model: "Use this tool when X. It returns Y. Do not use it for Z."

Workflows-vs-agents misuse. Forcing a deterministic pipeline through an agent loop is slow, expensive, and brittle. Forcing an open-ended conversation through a workflow is rigid and frustrating. Symptom: an agent that "almost always does the same five things in the same order" should be a workflow; a workflow with a step labelled "model decides what to do next" should be an agent. Fix: factor along the boundary — workflow on the outside, agent inside one step (or vice versa).

Default memory ignored until it breaks. Conversation history grows, the context window starts dropping middle turns, the agent forgets what the user said three turns ago. Symptom: quality degrades after long conversations. Fix: configure pruning or summarisation at design time. Pick a token budget and enforce it.

Evals deferred. "We will add evals once the agent works." The agent never works in the sense you mean — it works on the inputs you remember to test by hand. Symptom: silent regressions, "it worked yesterday," no way to compare two prompt versions. Fix: write the first scorer the day you write the first agent. Even a single correctness scorer on three golden inputs is infinitely better than zero.

For a deeper failure-mode taxonomy across multi-agent system designs and agent tool loop failures, see the AI agents prompting guide and the broader agentic AI prompting guide.

When Mastra is right — and when it is not

Pick Mastra when:

  • Your backend is TypeScript and you want to keep it that way.
  • You already use the Vercel AI SDK in a Next.js app and want the same abstractions on the server.
  • Your problem mixes deterministic workflows and model-driven agents, and you want a clean primitive boundary between them.
  • You value evals as a first-class concept rather than an external dependency.

Pick something else when:

  • Your stack is Python — LangGraph will integrate more cleanly and your team will hit fewer dead ends.
  • Your agent depends on a niche Python-only integration (a specific vector store, a specific eval methodology, a research-grade scorer).
  • Your team has invested significantly in LangChain or CrewAI and the migration cost outweighs the language-alignment win.
  • You need the largest possible community of prior art and StackOverflow answers — Python frameworks still have more.

The framework is a tool, not an identity. Pick the one that lets your team ship the most reliable agent in the time you have, and revisit the choice in six months.
