
OpenAI Agents SDK Prompting Guide: Tools, Handoffs, Guardrails, Tracing (2026)

A working-engineer guide to the OpenAI Agents SDK: agents, tools, handoffs, guardrails, tracing, structured outputs, and when not to use it.

SurePrompts Team
May 4, 2026
16 min read

TL;DR

The OpenAI Agents SDK is OpenAI's official Python framework for production agents — successor to the experimental Swarm cookbook. Built on the Responses API, it ships four primitives (agents, tools, handoffs, guardrails) plus tracing and structured outputs. This guide walks through the SDK as a prompt-engineering surface: where instructions live, how handoffs and guardrails change your prompt design, and when to pick it over LangGraph or CrewAI.

The OpenAI Agents SDK is OpenAI's official Python framework for building production agents — the supported successor to the experimental Swarm cookbook, rebuilt on the Responses API around four primitives: agents, tools, handoffs, and guardrails. It is the lightest-weight serious option for OpenAI-first teams who want a typed, async, tool-calling agent runtime with built-in tracing and structured outputs.

This guide walks through the SDK as a prompt-engineering surface. Where do instructions actually live? How does a handoff change the prompt you write? What does a guardrail do that a system message can't? And when should you skip it for LangGraph, CrewAI, or Mastra?

Tip

The SDK is intentionally small. Four primitives, one runner, one decorator. The hard work moves into your agent instructions, your tool docstrings, and your handoff criteria — exactly where prompt engineering lives.

Key takeaways

  • The OpenAI Agents SDK is OpenAI's production-grade framework, not a research toy. Swarm is the predecessor; if you're starting today, start here.
  • Four primitives only: Agent, function tool, handoff, and guardrail. Everything else (tracing, structured outputs, MCP) is supporting infrastructure.
  • The instructions field on an agent is the load-bearing prompt. Most failures in the SDK trace back to instructions that don't tell the model when to use which tool or when to hand off.
  • Tool docstrings aren't documentation — they are the description the model reads to decide whether to call the tool. Write them for the agent, not the human.
  • Handoffs transfer control of the loop. They are not function calls that return — once you hand off, the new agent owns the conversation.
  • Guardrails are not generic content filters. Treat them as single-purpose tripwires that run concurrently with the agent.
  • Tracing is the most underrated feature — every run produces an inspectable trace, which is how you actually debug agent prompts before they reach users.

What the OpenAI Agents SDK actually is

Strip away the marketing and the SDK is a small, opinionated Python library with these properties:

  • One runtime loop. Runner.run(agent, input) (async, with a Runner.run_sync wrapper) drives a tool-calling loop, following handoffs along the way, until the active agent produces a final response or hits a stop condition such as the turn limit.
  • Built on the Responses API. Not Chat Completions. Responses is OpenAI's newer, agent-oriented endpoint that handles tool calls, hosted tools, and stateful turn management more cleanly.
  • OpenAI-model-first, but pluggable. Designed around OpenAI models. You can route to other providers through Responses-API-compatible adapters or LiteLLM, but the polish is on the OpenAI path.
  • Hosted tool support. Web search, file search, and computer use are first-class hosted tools you can attach without standing up your own infrastructure.
  • MCP servers. Model Context Protocol servers can be registered as tool sources, so the same agent can talk to local tools and remote MCP-exposed services.
  • Tracing on platform.openai.com. Every run logs an inspectable trace: model calls, tool calls, handoffs, guardrail outcomes, timing.

The point of the SDK is not to model your control flow as a graph or to enforce a role-based crew. It is to give the LLM a clean tool-calling surface and let the agent's tool-calling loop do the orchestration.
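
To make the loop concrete, here is a minimal sketch; the agent name and prompt are illustrative, and it assumes an OpenAI API key is configured in your environment:

python
import asyncio
from agents import Agent, Runner

assistant = Agent(
    name="assistant",
    instructions="Answer concisely. Use tools only when the question needs live data.",
    model="gpt-4o-mini",
)

async def main():
    # Runner.run drives the tool-calling loop until the agent returns a final response.
    result = await Runner.run(assistant, "Summarize our refund policy in one sentence.")
    print(result.final_output)

asyncio.run(main())

If you are not in an async context, Runner.run_sync(assistant, "...") does the same thing synchronously.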

Swarm vs Agents SDK

Swarm was an experimental cookbook that demonstrated a handoff-driven multi-agent pattern in a few hundred lines of Python. It was useful as a teaching artifact, but it was never positioned for production. The Agents SDK keeps the handoff idea and adds the surfaces real systems need.

| Dimension | Swarm (experimental, archived path) | OpenAI Agents SDK (production) |
| --- | --- | --- |
| Status | Experimental cookbook, not for production | Officially supported framework |
| API foundation | Chat Completions | Responses API |
| Typing | Untyped helpers, dict-based | Typed Agent, Runner, RunResult |
| Async | Sync only | Async-first, sync wrapper available |
| Structured outputs | Manual JSON parsing | output_type with Pydantic, SDK validates |
| Guardrails | None | Input and output guardrails as primitives |
| Tracing | None | Built-in, surfaced on platform.openai.com |
| Hosted tools | None | Web search, file search, computer use |
| MCP support | None | First-class MCP server registration |
| Handoffs | Yes (the headline feature) | Yes, generalized and typed |

If you have a Swarm prototype, the SDK is the supported migration target. If you are starting today, skip Swarm — you can read it as a reference for the handoff pattern, but the SDK is what you ship.

Primitive 1: the Agent

An Agent is a model plus instructions plus a tool list plus an optional handoff list plus an optional output type. Conceptually:

python
from agents import Agent

triage = Agent(
    name="triage",
    instructions="""You are the entry point for support requests.
Classify the request as one of: billing, technical, refund, other.
- For billing or refund, hand off to the billing agent.
- For technical, hand off to the engineering agent.
- For other, answer directly in 2-3 sentences and stop.
Never attempt to resolve billing or refund issues yourself.""",
    model="gpt-4o-mini",
    handoffs=[billing_agent, engineering_agent],
)

The instructions field is where most of your prompt engineering effort goes. Three things matter:

Be explicit about when to use which tool or handoff. "Hand off to the billing agent for billing or refund issues" is useful. "Use your judgment" is not. The model is making a tool-call-shaped decision at every step; it needs criteria, not vibes.

Bound the agent. "Answer in 2-3 sentences and stop" prevents the agent from wandering. "Never attempt to resolve billing yourself" prevents it from undercutting your handoff design.

Keep instructions short and structured. A wall of prose buries the actual rules. Short paragraphs, bullets, or a numbered protocol read better — both for the model and for the human debugging the trace later.

For more on what good instructions look like, see RCAF and the SurePrompts Quality Rubric.

Sloppy vs sharp instructions

Sloppy:

You are a helpful assistant that handles customer support. Use your tools when needed and hand off to other agents when appropriate.

Sharp:

You handle inbound support. Step 1: classify (billing | technical | refund | other). Step 2: for billing/refund, hand off to billing_agent; for technical, hand off to engineering_agent. Step 3: for other, answer in 2-3 sentences. Never resolve billing/refund yourself. Never call tools other than handoffs.

The sharp version reads like a runbook. That is the goal.

Primitive 2: the Tool

Function tools are Python functions exposed to the agent via a decorator:

python
from agents import function_tool

@function_tool
def lookup_order(order_id: str) -> dict:
    """Look up an order by its ID. Returns status, amount, and customer email.

    Args:
        order_id: The order identifier in the format ORD-XXXXX.
    """
    # Illustrative stub: orders_db stands in for your real data access layer.
    return orders_db.get(order_id)

Two things to internalize:

The docstring is the tool description. The SDK uses your function's docstring (and parameter annotations) to build the tool schema the model sees. The agent decides whether to call this tool based on what the docstring says. So the docstring is a prompt — write it for the agent.

Type hints become the parameter schema. order_id: str becomes a typed string parameter. Pydantic models work too. The model is much better at calling tools when the schema is precise.
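
As a sketch of the Pydantic path (the RefundRequest model and request_refund tool are illustrative names, not part of the SDK):

python
from pydantic import BaseModel
from agents import function_tool

class RefundRequest(BaseModel):
    order_id: str  # e.g. "ORD-12345"
    amount: float  # refund amount in USD
    reason: str

@function_tool
def request_refund(refund: RefundRequest) -> str:
    """Submit a refund for an existing order.

    Args:
        refund: The order ID, refund amount in USD, and customer-stated reason.
    """
    # Illustrative stub: call into your real billing system here.
    return f"Refund of ${refund.amount:.2f} queued for order {refund.order_id}"

The model now sees one nested object with named, typed fields instead of a bag of loose strings, which makes miscalls easier to spot in the trace.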

A useful discipline: imagine you only have the tool name, the docstring, and the parameter list. Could the agent figure out when to call this and what to pass? If not, the docstring is wrong.

One agent, fewer tools

A common failure mode is to attach 12 tools to one agent and hope it picks correctly. It usually doesn't. Two heuristics:

  • If two tools could plausibly be called for the same input, the agent will be confused. Either merge them or split the agent.
  • If you find yourself adding a tool and then writing a long instruction explaining when not to use it, that's a smell — the prompt is doing the work the tool selection should do.

When the tool surface gets wide, that's the cue to introduce a handoff and split into two agents.

Primitive 3: the Handoff

A handoff transfers control of the agent loop to another agent. From the model's perspective, it looks like a tool call. From the SDK's perspective, the active agent is swapped and the loop continues.

python
from agents import Agent

billing_agent = Agent(
    name="billing",
    instructions="""You handle billing and refund issues.
Available tools: lookup_invoice, issue_refund (max $200 without manager approval).
For refunds over $200, hand off to manager_agent.
Always confirm the refund amount with the customer before issuing.""",
    tools=[lookup_invoice, issue_refund],
    handoffs=[manager_agent],
)

triage_agent = Agent(
    name="triage",
    instructions="...",
    handoffs=[billing_agent, engineering_agent],
)

The key prompt-engineering implication: the model decides when to hand off. That decision is driven by the instructions on the source agent. So every handoff target needs an explicit criterion in the source agent's instructions. "Hand off to the billing agent for billing or refund issues" works. "Hand off when appropriate" does not.

This is also where the SDK differs from delegation patterns in CrewAI. In CrewAI, a manager agent delegates a subtask to a worker and gets a result back. In the Agents SDK, a handoff transfers ownership — the new agent owns the conversation until it responds, hands off again, or completes. There is no implicit return.

The practical consequence: if you want a "consultant" pattern (one agent asks another for help and incorporates the answer), you do not want a handoff. You want a tool that wraps a sub-agent run.
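
A minimal sketch of that consultant pattern (the policy_expert agent and ask_policy_expert tool are illustrative): the sub-agent runs inside a function tool, so its answer comes back into the calling agent's context instead of taking over the conversation.

python
from agents import Agent, Runner, function_tool

# A narrow specialist whose answer we want returned to the caller.
policy_expert = Agent(
    name="policy_expert",
    instructions="Answer questions about refund policy in at most three sentences.",
    model="gpt-4o-mini",
)

@function_tool
async def ask_policy_expert(question: str) -> str:
    """Ask the refund-policy expert a question and return its answer.

    Args:
        question: A single, specific question about refund policy.
    """
    result = await Runner.run(policy_expert, question)
    return str(result.final_output)

support_agent = Agent(
    name="support",
    instructions="Handle support requests. Call ask_policy_expert when you need a policy ruling, then answer the customer yourself.",
    tools=[ask_policy_expert],
)

Recent SDK versions also expose an as_tool() helper on Agent that packages this wrapper for you; if your version has it, prefer it over hand-rolling the tool.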

For the broader pattern catalog, see the multi-agent prompting guide and the AI agents prompting guide.

Primitive 4: the Guardrail

Guardrails are validators that run alongside an agent. There are two kinds:

  • Input guardrails inspect the input before the agent runs. They can short-circuit the run if a policy is violated.
  • Output guardrails inspect the final output before it is returned to the caller.

Each guardrail is a function (often one that wraps a small, fast classifier agent) that returns a structured verdict — typically including whether the guardrail tripped and a reason.

python
from agents import Agent, Runner, input_guardrail, GuardrailFunctionOutput
from pydantic import BaseModel

class ScopeCheck(BaseModel):
    is_in_scope: bool
    reason: str

scope_agent = Agent(
    name="scope_check",
    instructions="Decide if this message is about billing/support. Return JSON.",
    output_type=ScopeCheck,
    model="gpt-4o-mini",
)

@input_guardrail
async def in_scope(ctx, agent, input_text):
    result = await Runner.run(scope_agent, input_text)
    return GuardrailFunctionOutput(
        output_info=result.final_output,
        tripwire_triggered=not result.final_output.is_in_scope,
    )

main_agent = Agent(
    name="support",
    instructions="...",
    input_guardrails=[in_scope],
)

Two design rules that save pain later:

One guardrail, one failure mode. A guardrail that checks scope, PII, and prompt injection at once is hard to debug when it trips. Three small guardrails are easier to reason about than one big one.

Guardrails are not your only line of defense. They are tripwires, not gates. Real safety lives in your tool implementations (authorization checks, rate limits, audit logs), not in the guardrail layer.

Output guardrails follow the same shape but inspect the agent's final response. A common use is checking that a structured output passes business rules the schema can't express — for example, "the refund amount is within policy."
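
A sketch of that refund check as an output guardrail; it assumes the agent's output_type exposes a refund_amount field, and the field name and limit are illustrative:

python
from agents import Agent, GuardrailFunctionOutput, output_guardrail

REFUND_LIMIT = 200  # business rule the schema alone can't express

@output_guardrail
async def refund_within_policy(ctx, agent, output):
    # `output` is the agent's final response; refund_amount is an assumed
    # field on the output_type model (illustrative).
    amount = getattr(output, "refund_amount", 0)
    return GuardrailFunctionOutput(
        output_info={"refund_amount": amount},
        tripwire_triggered=amount > REFUND_LIMIT,
    )

billing_agent = Agent(
    name="billing",
    instructions="...",
    output_guardrails=[refund_within_policy],
)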

Tracing — the underrated production feature

Every run of Runner.run produces a trace: the sequence of model calls, tool calls, handoffs, guardrail outcomes, and timing. Traces show up on platform.openai.com under your project.

This sounds mundane. It is the single biggest reason to use the SDK over hand-rolled tool-calling code.

In practice, you debug agents by reading traces. You see which tool the model picked, what it passed, what came back, why it handed off, where it looped. You can identify the exact step where instructions were ambiguous and the model made the wrong call. Without tracing, you are guessing from final outputs.

Two habits worth building:

  • Read your traces during development. Not just when something breaks. Skim a few traces of "successful" runs and you'll find decisions the model made for the wrong reason — they happened to work this time.
  • Add trace metadata for filtering. Tag traces with session ID, user ID, or experiment name so you can slice them later; a sketch of this follows below.
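
A minimal sketch of the second habit, assuming the SDK's trace() context manager with its group_id and metadata parameters, and reusing triage_agent from earlier (the workflow name and metadata keys are illustrative):

python
from agents import Runner, trace

async def handle_request(user_id: str, session_id: str, message: str):
    # Group every run from one session under a named workflow so traces can be
    # filtered on platform.openai.com by session, user, or experiment.
    with trace(
        "support-session",
        group_id=session_id,
        metadata={"user_id": user_id, "experiment": "triage-v2"},
    ):
        result = await Runner.run(triage_agent, message)
    return result.final_output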

Tracing also pairs well with the Context Engineering Maturity Model — the trace is what tells you whether your agent is operating at level 1 (single-shot) or level 3+ (managed memory and tool orchestration).

Structured outputs with output_type

When you need the agent's final response to be a typed object, set output_type to a Pydantic model:

python
from pydantic import BaseModel
from agents import Agent

class Triage(BaseModel):
    category: str  # "billing" | "technical" | "refund" | "other"
    urgency: str   # "low" | "medium" | "high"
    summary: str

triage = Agent(
    name="triage",
    instructions="Classify the inbound request and summarize it in one sentence.",
    output_type=Triage,
)

The SDK enforces the schema and gives you a typed, validated object back; if the model's response doesn't match, you get a validation error instead of silently malformed text — a small but real productivity win. This is the same structured output pattern that frameworks like LangGraph and Mastra also support, just with less ceremony.

When you should use it: any time the next step in your code branches on a field of the agent's response. Don't parse natural language — declare a schema.
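
A sketch of that branching, reusing the Triage agent defined above (the queue names are illustrative):

python
from agents import Runner

async def route(message: str) -> str:
    result = await Runner.run(triage, message)
    ticket = result.final_output  # a validated Triage instance

    # Branch on typed fields instead of parsing prose.
    if ticket.category in ("billing", "refund"):
        return f"[{ticket.urgency}] billing queue: {ticket.summary}"
    if ticket.category == "technical":
        return f"[{ticket.urgency}] engineering queue: {ticket.summary}"
    return ticket.summary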

When you shouldn't: open-ended generation (drafts, summaries, explanations) where forcing a schema flattens the output.

Agents SDK vs LangGraph vs CrewAI

| Dimension | OpenAI Agents SDK | LangGraph | CrewAI |
| --- | --- | --- | --- |
| Mental model | Tool-calling agent + handoffs | Explicit graph of nodes and edges | Crew of role-based agents with tasks |
| Control flow | LLM decides (tool calls + handoffs) | You declare (nodes, edges, conditional routing) | You declare (task list, sequential or hierarchical) |
| Framework footprint | Smallest of the three | Larger; graph + state + checkpointing | Medium; role/task abstractions |
| State management | In-context + tool-managed | First-class typed state, checkpointing, time travel | Per-task context, shared crew memory |
| Tracing | Built-in on platform.openai.com | LangSmith integration | Built-in observability hooks |
| Model coverage | OpenAI-first; others via adapters | Provider-agnostic | Provider-agnostic |
| Best fit | OpenAI-first stacks; lightest setup | Workflows where control flow must be explicit and debuggable | Role/task-shaped workflows |
| Worst fit | Multi-provider stacks; explicit DAG workflows | "I just want a tool-using assistant" | Workflows where roles are a fiction |

These are sibling tools, not competitors fighting for one slot. Most serious teams end up with more than one in their stack — for example, an Agents SDK assistant for end-user interactions and a LangGraph pipeline for the offline workflow that produces the data the assistant uses. See the Agentic Prompt Stack for how these fit into a layered design.

Common failure modes

The SDK is small, so most failures are prompt failures. The patterns repeat.

Vague instructions. By far the most common. "You are a helpful agent that handles X" is not enough. The model needs criteria for when to call which tool, when to hand off, when to stop. Symptom: the agent calls the wrong tool, hands off prematurely, or wanders. Fix: rewrite instructions as a numbered protocol with explicit conditions. Read three traces and see if you can predict the model's next action — if you can't, neither can the model.

Too many tools per agent. A single agent with 10+ tools is a confused agent. Symptom: the agent picks plausible-but-wrong tools, especially when two tools have overlapping descriptions. Fix: split into multiple agents connected by handoffs, with each agent owning a coherent slice of the toolset.

Handoffs without clear when-to-handoff criteria. "Hand off when appropriate" is not a criterion. Symptom: flapping (agent A hands off to B, B hands back to A), premature handoffs, missed handoffs. Fix: in the source agent's instructions, name each handoff target and the exact condition that should trigger it.

Over-aggressive guardrails. A guardrail that trips on edge cases the user actually cares about kills the experience. Symptom: legitimate requests blocked; users frustrated. Fix: scope each guardrail tightly. Run the guardrail's classifier agent against a held-out set of real inputs before shipping. Keep tripwires per-purpose.

Ignoring tracing during development. Building an agent without reading traces is like writing code without running a debugger. Symptom: mysterious failures in production that "worked locally." Fix: make trace-reading part of every PR. Skim five traces — three successes, two failures — before you call an agent done.

Treating handoffs as function calls. They aren't. Once you hand off, the new agent owns the conversation. Symptom: you expect the original agent to "come back" with a result and it doesn't. Fix: if you want delegation-with-return, wrap the sub-agent run in a @function_tool instead of using a handoff.

When the SDK is the right pick — and when it isn't

Right pick when:

  • Your stack is OpenAI-first and you want to take advantage of Responses API + hosted tools + platform tracing.
  • You want the lightest framework footprint that still gives you guardrails, structured outputs, and tracing.
  • Your workflow is genuinely model-driven — the agent decides what to do at each step, with handoffs where ownership should transfer between agents.
  • You are building an end-user-facing assistant rather than an offline DAG pipeline.

Wrong pick when:

  • Your control flow is naturally a graph with explicit branching, retries, and parallelism — LangGraph makes those legible in a way the SDK does not.
  • Your team thinks in roles and tasks rather than agents and tools — CrewAI maps to that mental model.
  • You are TypeScript-first or you want a multi-agent system with first-class evals built in — Mastra is the better starting point.
  • You need to support multiple model providers as first-class citizens — the SDK can be coerced, but you'll be fighting it.

For agent orchestration at scale, none of these tools is a complete answer. Most production agent systems blend a framework like the SDK with custom orchestration, observability, and evals.

The SDK rewards taking prompt engineering seriously. Four primitives is not a lot of framework — which means your instructions, your tool docstrings, and your handoff criteria are doing most of the work. Read your traces, write sharp instructions, keep guardrails focused, and the small surface becomes a feature, not a limitation.
