
CrewAI Prompting Guide: How to Build Role-Based Multi-Agent Systems (2026)

CrewAI prompting guide: how to design role/goal/backstory, write tasks with structured expected_output, pick sequential vs hierarchical, and avoid common failure modes.

SurePrompts Team
May 4, 2026
16 min read

TL;DR

CrewAI is a Python framework for multi-agent systems built on a role/goal/backstory metaphor. This guide covers how to design those three fields so they actually move quality, how to write tasks with structured expected_output, when to pick sequential over hierarchical process, and the failure modes that bite teams in production.

CrewAI is an open-source Python framework for building multi-agent systems. It models a system as a small team — agents with role, goal, and backstory, working through Tasks under a Crew — and gives you a configuration-first way to ship a pipeline without writing graph code.

This guide covers how to design those three fields so they actually move quality, how to write tasks the framework can execute, when sequential beats hierarchical, and the failure modes that bite teams in production.

Tip

The role/goal/backstory pattern is essentially RCAF baked into the framework: Role, plus a stripped-down Context (the backstory), plus an Action (the goal). When you write a CrewAI agent, you are writing an RCAF prompt — just with the slots renamed and persisted as fields rather than inline in a string. Treat them with the same discipline.

Key takeaways:

  • CrewAI's three primitives are Agent (role/goal/backstory + tools), Task (description + expected_output), and Crew (process + agents + tasks). Everything else is configuration on top.
  • The single highest-leverage thing in any CrewAI system is the expected_output field. Vague expected_output is the most common reason CrewAI pipelines feel unreliable.
  • Role/goal/backstory is load-bearing for some agents (writers, critics, strategists) and overrated for others (extractors, routers, arithmetic). Decide deliberately, not by default.
  • Sequential process is cheaper, more predictable, and the right default. Hierarchical process is for genuine runtime branching, not for "the manager pattern feels more impressive."
  • allow_delegation=True on every agent is how you get delegation loops in production. Turn it on per agent only when the pipeline genuinely needs it.
  • CrewAI is right for pipeline-shaped work with real specialization. It is overkill for tasks a single LLM call can do, and the wrong shape for arbitrary state-machine workflows where LangGraph fits better.
  • The fastest debugging move is to log the actual prompt strings the framework sends to the model. Most CrewAI bugs are visible in those strings.

What CrewAI is

CrewAI is a Python framework for building multi-agent systems. The mental model is a small team: each agent has a role and a goal, the team has a workflow, and the framework handles per-step prompting, tool calls, and (optionally) memory.

The three primitives are intentionally minimal:

  • Agent — a configured LLM persona with role, goal, backstory, an optional tools list, an optional allow_delegation flag, and an optional llm setting if you want to mix models across agents. The role/goal/backstory fields are concatenated into the system prompt the framework sends.
  • Task — a unit of work with a description (what to do) and an expected_output (what the result should look like). Tasks are assigned to agents either explicitly via agent=... or implicitly under hierarchical process.
  • Crew — the container that holds agents, tasks, and a process (sequential or hierarchical). When you call crew.kickoff(), the Crew is what runs.
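The claim that role, goal, and backstory are concatenated into the system prompt can be sketched in a few lines. This is a simplified illustration of the shape of that concatenation, not CrewAI's exact internal template:

```python
# Simplified sketch of how role/goal/backstory become a system prompt.
# This mirrors the shape of CrewAI's assembly, not its exact wording.
def build_system_prompt(role: str, goal: str, backstory: str) -> str:
    return (
        f"You are {role}.\n"
        f"{backstory}\n"
        f"Your personal goal is: {goal}"
    )

prompt = build_system_prompt(
    role="Senior research analyst",
    goal="Surface the strongest primary sources on the requested topic.",
    backstory="You distrust secondary citations and flag disagreements.",
)
```

The practical point: every token you put in those three fields is re-sent on every step, which is why the discipline sections below keep pushing you to cut.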

Configuration can be inline Python or YAML files (agents.yaml, tasks.yaml) loaded by a CrewBase class. The YAML pattern is the one most production teams converge on because it separates prompt content from orchestration code, which makes prompt iteration faster and review easier.

CrewAI also exposes Flows — a more recent layer that lets you wire Crews together with explicit control flow when you need branching across crews. Flows are the framework's answer to "what if I do need a graph after all," and they sit one layer above the Crew abstraction. If you find yourself reaching for Flows constantly, that is a signal your problem might be a LangGraph shape rather than a CrewAI shape.

CrewAI vs LangGraph

The two frameworks solve overlapping problems differently. The choice is mostly about what shape your workflow actually has.

| Dimension | CrewAI | LangGraph |
| --- | --- | --- |
| Mental model | Team of specialized agents | State graph of nodes and edges |
| Control flow | Sequential or hierarchical-via-manager | Explicit edges, conditional routing, loops |
| State | Implicit (task outputs forwarded) | Explicit shared state object |
| Human-in-the-loop | Possible, but not first-class | First-class checkpoints and interrupts |
| Learning curve | Lower — write configs, run | Higher — design graph, manage state |
| When to pick | Pipeline-shaped work with role specialization | Branching, retries, HITL, complex state |

The honest read: CrewAI gets you to a working multi-agent pipeline faster when the work is essentially linear and the value is in who does what. LangGraph wins when the value is in what runs next under which condition. Plenty of teams start with CrewAI, hit a control-flow ceiling, and migrate the orchestration layer to LangGraph while keeping the underlying agent prompts. That is a reasonable progression, not a failure of either framework.

For the broader landscape, see the multi-agent prompting guide and the AI agents prompting guide. The Agentic Prompt Stack gives you the layer-by-layer model for designing the prompt that runs each individual agent inside the crew.

Designing role, goal, and backstory

This is the load-bearing decision in any CrewAI system. Role, goal, and backstory are concatenated into the system prompt the framework sends on every step. They are doing the work of role prompting — and role prompting moves quality on tasks where frame matters and is mostly noise on tasks where it does not.

The discipline is to write each field for what it actually controls.

Role. A short noun phrase. The job title. "Senior research analyst." "Technical editor." "SQL extraction specialist." Not a sentence, not a paragraph. The role anchors the model in a specific frame; longer roles dilute the anchor.

Goal. A single sentence stating what the agent is trying to accomplish across the whole crew run. Not the per-task goal — that lives in Task.description. The agent goal is the standing purpose. "Find credible primary sources for the requested topic and summarize their key claims." "Convert raw research notes into publication-ready drafts that match the house voice."

Backstory. Two to four sentences of context that change how the agent should behave. This is where role prompting earns its keep — or wastes its tokens. Good backstories include: domain experience that should bias which considerations the agent surfaces, a perspective or stance that should shape tone, and explicit constraints on what the agent should not do. Bad backstories include: marketing copy ("you are a world-class expert"), generic resume bullets, and praise of the agent's own abilities, none of which change behavior.

Three concrete examples.

A research agent:

```yaml
role: Senior research analyst
goal: >
  Surface the strongest primary sources on the requested topic
  and summarize their core claims with quotable detail.
backstory: >
  You spent a decade as a research librarian before moving to industry
  analysis. You distrust secondary citations, you check publication
  dates, and you flag when sources disagree rather than averaging them.
  You never invent sources or fabricate quotes — if you cannot find
  something, you say so.
```

A writing agent:

```yaml
role: Technical editor
goal: >
  Turn raw research notes into a publication-ready draft that
  reads in the requested voice and length.
backstory: >
  You came up writing for working engineers, not executives. You cut
  filler. You replace abstractions with concrete examples. You never
  use marketing words like "leverage" or "robust" unless they earn
  their place. You write in second person when giving instructions.
```

A SQL extraction agent — where backstory matters less:

```yaml
role: SQL extraction specialist
goal: >
  Translate the user's question into a single, valid Postgres
  SELECT query against the provided schema.
backstory: >
  You only return SELECT statements — never DDL or DML. You always
  qualify column names with table aliases. You return the query and
  nothing else.
```

The third example is short on purpose. A SQL agent does not benefit from a paragraph of persona. A writing agent does. Decide per agent, not per project.

Designing tasks

Tasks have two fields that matter: description and expected_output. The single most common CrewAI failure mode lives in expected_output.

Task.description tells the agent what to do for this step. Be specific about inputs, scope, and constraints. The description has access to inputs you pass into crew.kickoff(inputs={...}) via {variable} placeholders, which is how you parameterize a crew across runs.
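The placeholder mechanics behave like Python string formatting. A minimal sketch of the interpolation step (the `interpolate` helper is illustrative, not CrewAI API):

```python
# Sketch of how {variable} placeholders in a task description are filled
# from crew.kickoff(inputs={...}). `interpolate` is illustrative only.
def interpolate(description: str, inputs: dict) -> str:
    return description.format(**inputs)

desc = 'Research the topic "{topic}" and produce a research brief.'
filled = interpolate(desc, {"topic": "vector databases"})
# filled == 'Research the topic "vector databases" and produce a research brief.'
```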

Task.expected_output tells the agent what the result should look like. This is what most teams under-write. Compare:

Vague — what most first drafts look like:

```yaml
description: Research the user's topic and produce a brief.
expected_output: A research brief on the topic.
```

Structured — what production-quality looks like:

```yaml
description: >
  Research the topic "{topic}" and produce a research brief
  for a working engineer audience. Use only sources from the
  last three years unless a foundational source is genuinely
  required. Cite at least five distinct sources.
expected_output: >
  A markdown document with these sections in order:
  1. One-paragraph executive summary.
  2. Five to seven key findings as a bulleted list, each with
     a one-sentence claim and a citation.
  3. A "What is contested" section listing two to four
     points where credible sources disagree.
  4. A bibliography with title, author, publication, and
     date for every cited source.
  Total length: 600 to 900 words.
```

The structured version is doing the work of an output contract. The agent now has a target shape, and the framework can validate against it. The vague version pushes the burden onto the next agent in the pipeline, which is how you get reports that look different on every run.

If you take one thing from this guide: rewrite every expected_output field to look like the structured version above. It is the highest-leverage change you can make to an existing CrewAI system.
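A structured contract also becomes checkable. CrewAI does not validate this for you; the sketch below is a post-hoc check you could bolt on, with section names assumed from the contract above:

```python
import re

# Illustrative validator for the structured expected_output contract.
# Section names are assumed from the contract; adjust to your own.
REQUIRED_SECTIONS = [
    "executive summary",
    "key findings",
    "what is contested",
    "bibliography",
]

def check_brief(markdown: str) -> list[str]:
    problems = []
    lowered = markdown.lower()
    for section in REQUIRED_SECTIONS:
        if section not in lowered:
            problems.append(f"missing section: {section}")
    words = len(re.findall(r"\w+", markdown))
    if not 600 <= words <= 900:
        problems.append(f"length {words} words, expected 600-900")
    return problems
```

Running this after each crew step turns "the report looks different every run" from a vibe into a failing assertion.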

For deeper output design, the SurePrompts Quality Rubric gives you a five-axis scoring model for prompt quality, and the constraint-tightness and output-format axes map directly onto expected_output.

Sequential vs hierarchical process

Process.sequential is the default and should stay the default unless you have a specific reason to change it. Tasks run in the order you defined them. Each task's output is forwarded as context to the next task. The flow is deterministic — the framework does not decide who does what next; you did, when you wrote the task list.

Sequential is cheap (no manager LLM), predictable (same task order every run), and easy to debug (you can read the trace top to bottom). It is the right shape for any workflow you can name as a pipeline: research → outline → draft → edit; extract → enrich → validate → store; classify → route → respond.
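The forwarding behavior can be mocked in a few lines. `run_task` here stands in for the agent's LLM call; the point is only the data flow, not CrewAI internals:

```python
# Mock of sequential process: each task's output becomes context
# for the next task. run_task stands in for the per-task LLM call.
def run_task(description: str, context: str) -> str:
    return f"[output of: {description} | given: {context or 'nothing'}]"

def run_sequential(task_descriptions: list[str]) -> str:
    context = ""
    for description in task_descriptions:
        context = run_task(description, context)
    return context

result = run_sequential(["research", "outline", "draft", "edit"])
```

Reading the nested result makes the debugging story concrete: the trace is a straight line you can read top to bottom.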

Process.hierarchical adds a manager_llm (or a full manager_agent) that decides which agent should handle each task, can re-route work, and can ask for revisions. The manager runs an extra LLM call per decision, which adds latency and cost. In return, you get runtime agent orchestration — the system can branch based on intermediate results.

Pick hierarchical only when the workflow genuinely needs runtime branching. Tells:

  • You do not know in advance whether the legal-review agent or the marketing-review agent should handle the next task; it depends on what the previous agent produced.
  • You want the system to retry with a different agent if the first attempt fails a quality bar.
  • You have a coordinator role that genuinely needs to inspect intermediate state and decide.

Tells you do not need hierarchical:

  • The pipeline is the same every run.
  • You reached for it because "manager pattern" sounds more impressive.
  • You wanted parallel execution — that is a separate concern (see Crew's async_execution per task) and does not require hierarchical process.

Most teams reach for hierarchical too early. Start sequential. Move to hierarchical when the failure mode you are seeing is "wrong agent picked the work," not "agents did the work badly."
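What the manager buys you, stripped to its essence, is a routing function computed at runtime from intermediate output. A mock (agent names are illustrative, and a real manager is an LLM call, not a keyword check):

```python
# Mock of hierarchical routing: the manager inspects intermediate
# output and picks the next agent at runtime. Names are illustrative.
def route(intermediate_output: str) -> str:
    text = intermediate_output.lower()
    if "contract" in text or "liability" in text:
        return "legal_review_agent"
    return "marketing_review_agent"

next_agent = route("draft mentions a liability clause")
# next_agent == "legal_review_agent"
```

If this function would return the same agent on every run, you are paying a manager LLM to compute a constant, and sequential is the right process.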

Tool use and delegation

Each agent has its own tools list. This is how CrewAI handles tool use — per-agent, not crew-wide. The research agent gets a search tool and maybe a fetch tool. The writing agent gets nothing. The SQL agent gets the database tool and only the database tool.

The prompt-engineering implication: when an agent has tools, its backstory should include tool-use discipline. "Always check the schema with the schema tool before writing a query." "Search before answering — never answer from training data on questions about current events." Without those constraints, the agent will sometimes skip the tool call and hallucinate, especially on tasks where the answer feels easy.

allow_delegation=True lets an agent hand work off to another agent in the crew mid-task — an agent handoff initiated by the agent rather than by the workflow. This is powerful and risky. Powerful because it lets a generalist agent recognize when a specialist is needed. Risky because two agents with delegation enabled can ping-pong, each deciding the other should handle it. This is the agent tool loop failure mode in delegation form.

The discipline:

  • Default allow_delegation=False. Turn it on per agent only when the pipeline genuinely benefits.
  • If you do enable delegation, write the backstory to be specific about when to delegate and when to handle the work directly.
  • Hierarchical process and per-agent delegation are different mechanisms. You usually want one, not both.
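One way to reason about the loop risk is a handoff cap: after N delegations, the current agent must finish the task itself. This is a sketch of the guard, not a CrewAI setting (CrewAI has its own iteration caps):

```python
# Sketch of a delegation-loop guard: cap handoffs per task, then
# force the current owner to finish. Illustrative, not CrewAI config.
MAX_HANDOFFS = 2

def next_owner(current: str, requested: str, handoffs: int) -> tuple[str, int]:
    if handoffs >= MAX_HANDOFFS:
        return current, handoffs  # refuse further delegation
    return requested, handoffs + 1

owner, count = next_owner("agent_a", "agent_b", handoffs=0)  # handed off
owner, count = next_owner(owner, "agent_a", handoffs=count)  # handed back
owner, count = next_owner(owner, "agent_b", handoffs=count)  # refused; stays put
```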

Memory and context discipline

CrewAI offers built-in memory: short-term (within a crew run), long-term (across runs, persisted), and entity memory (facts about specific entities the agent encounters). Memory is opt-in via memory=True on the Crew.

Token economics matter here. Memory is added to the prompt every step. On a long crew run with rich memory, you can spend more tokens on memory than on actual task content, and you start hitting context limits or paying real money for marginal recall. The decision is task-dependent:

  • Short-term only is enough for most pipeline workflows where each task feeds the next directly.
  • Long-term memory earns its tokens for agents that should learn user preferences across runs (a personal-assistant crew, an editing crew that should learn house style over time).
  • Entity memory is useful when the same entities recur across runs (a research crew that should remember which sources it has already cited).
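The token-economics claim is easy to check with back-of-envelope arithmetic. The numbers below are illustrative assumptions, not CrewAI measurements:

```python
# Back-of-envelope: share of each step's prompt spent on recalled memory.
# All numbers are assumed for illustration.
memory_tokens_per_step = 1200   # recalled memory injected each step
task_tokens_per_step = 900      # actual task content each step

memory_share = memory_tokens_per_step / (
    memory_tokens_per_step + task_tokens_per_step
)
# memory_share ≈ 0.57: under these assumptions, more than half the
# prompt budget goes to memory, on every step of the run.
```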

For a deeper treatment of how to assemble context efficiently across agents, see the Context Engineering Maturity Model. For agents using a reasoning model under the hood, the cost calculus shifts further — reasoning tokens are not free, and stuffing memory into a reasoning model's context wastes both.

Common CrewAI failure modes

Five recur often enough to call out by name. Each is diagnosable by reading the actual prompts the framework sends to the model — which is the first thing to log when something feels wrong.

Backstory bloat. Symptom: every agent has three paragraphs of persona, the system prompt is 1,500 tokens before the task even loads, and the agents do not behave noticeably better than a single-paragraph version. Fix: cut every backstory to two to four sentences. If a sentence does not change behavior, delete it.

Vague expected_output. Symptom: the agent returns something different on every run, and the next agent in the pipeline has to guess at the shape. Fix: rewrite every expected_output to specify sections, format, and length. The structured example above is the template.

Hierarchical overuse. Symptom: a linear pipeline runs under hierarchical process, the manager LLM is making the same routing decision every time, and the run is slow and expensive for no quality lift. Fix: switch to sequential. Reach for hierarchical only when the routing genuinely varies by run.

Delegation loops. Symptom: two agents with allow_delegation=True keep handing the same task back and forth, the trace is full of "I think Agent B should handle this" / "I think Agent A should handle this," and the run eventually times out or hits a max-iteration cap. Fix: turn delegation off by default; enable it only on specific agents with backstories that name the delegation criteria explicitly.

Role/task contradiction. Symptom: the agent role says "concise editor" and the task description says "write a 2,000-word piece," or the agent's tools are read-only but the task asks the agent to update the database. Fix: read every (agent, task) pairing for compatibility before shipping. If the role contradicts the task, change one of them.

The shared diagnostic move for all five: log the prompt strings the framework sends, on every step, and read them. Most CrewAI bugs are visible in the prompts. The fix is almost always smaller than rewriting the agent — usually a backstory cut, an expected_output rewrite, or a flag flip.
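A minimal way to capture those strings is to wrap whatever LLM callable your agents use so every prompt is logged before it is sent. This is a framework-agnostic sketch (`llm_call` is a stand-in client); CrewAI also has verbose logging you can enable directly:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crew.prompts")

# Wrap the model client so every outgoing prompt string is logged.
def with_prompt_logging(llm_call):
    def wrapped(prompt: str, **kwargs):
        log.info("PROMPT >>>\n%s", prompt)
        return llm_call(prompt, **kwargs)
    return wrapped

def llm_call(prompt: str) -> str:  # stand-in model client
    return "ok"

logged_call = with_prompt_logging(llm_call)
```

Once the prompts are in your logs, backstory bloat, contradictory (agent, task) pairings, and vague expected_output are all visible by reading, not guessing.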

When CrewAI is right and when it is overkill

CrewAI is right when:

  • The work has a real pipeline shape — distinct steps with hand-offs.
  • The value is in specialization — different agents bring different frames or different tool access.
  • The control flow is mostly known in advance — you can name the steps in order.
  • You want configuration-driven prompts (YAML files) so prompt iteration does not require code changes.

CrewAI is overkill when:

  • One LLM call can plausibly do the task. Adding a multi-agent framework on top buys you latency, cost, and surface area for failure with no quality lift.
  • The work is conversational rather than pipeline-shaped — a chatbot is a single-agent task with tools, not a crew.
  • The work needs arbitrary state-machine control flow — that is a LangGraph shape, not a Crew shape.
  • You need first-class human-in-the-loop checkpoints across many turns. CrewAI supports HITL but it is not the design center; LangGraph or a hand-rolled wrapper fits better.

The honest test: can you describe your work as "a small team of specialists running a pipeline"? If yes, CrewAI fits. If you find yourself describing it as "a state machine" or "a conversation," reach for a different tool.

For the broader picture of agent frameworks and how prompts move across them, see the agentic AI prompting guide. For the layer-by-layer model for the prompts inside any single agent (in CrewAI or anywhere else), see the Agentic Prompt Stack.

  • The Agentic Prompt Stack — the six-layer model for designing the prompt that runs any individual agent. Apply it to each agent in your crew.
  • The multi-agent prompting guide — patterns for prompts that flow across multiple agents, regardless of framework.
  • The LangGraph prompting guide — sibling guide for the graph-shaped framework. Useful when you outgrow the team metaphor.
  • The OpenAI Agents SDK prompting guide and the Mastra prompting guide — sibling guides for two more frameworks in the same space, with their own opinions on what an agent system should look like.
  • The SurePrompts Quality Rubric — score your CrewAI prompts on five axes before shipping. The constraint-tightness and output-format axes are where most CrewAI prompts have the most room to improve.
