
OpenAI Agents SDK Prompting Guide: Tools, Handoffs, Guardrails, Tracing (2026)

A working-engineer guide to the OpenAI Agents SDK: agents, tools, handoffs, guardrails, tracing, structured outputs, and when not to use it.

SurePrompts Team
May 4, 2026
16 min read

TL;DR

The OpenAI Agents SDK is OpenAI's official Python framework for production agents — successor to the experimental Swarm cookbook. Built on the Responses API, it ships four primitives (agents, tools, handoffs, guardrails) plus tracing and structured outputs. This guide walks through the SDK as a prompt-engineering surface: where instructions live, how handoffs and guardrails change your prompt design, and when to pick it over LangGraph or CrewAI.

The OpenAI Agents SDK is OpenAI's official Python framework for building production agents — the supported successor to the experimental Swarm cookbook, rebuilt on the Responses API around four primitives: agents, tools, handoffs, and guardrails. It is the lightest-weight serious option for OpenAI-first teams who want a typed, async, tool-calling agent runtime with built-in tracing and structured outputs.

This guide walks through the SDK as a prompt-engineering surface. Where do instructions actually live? How does a handoff change the prompt you write? What does a guardrail do that a system message can't? And when should you skip it for LangGraph, CrewAI, or Mastra?

Tip

The SDK is intentionally small. Four primitives, one runner, one decorator. The hard work moves into your agent instructions, your tool docstrings, and your handoff criteria — exactly where prompt engineering lives.

Key takeaways

  • The OpenAI Agents SDK is OpenAI's production-grade framework, not a research toy. Swarm is the predecessor; if you're starting today, start here.
  • Four primitives only: Agent, function tool, handoff, and guardrail. Everything else (tracing, structured outputs, MCP) is supporting infrastructure.
  • The instructions field on an agent is the load-bearing prompt. Most failures in the SDK trace back to instructions that don't tell the model when to use which tool or when to hand off.
  • Tool docstrings aren't documentation — they are the description the model reads to decide whether to call the tool. Write them for the agent, not the human.
  • Handoffs transfer control of the loop. They are not function calls that return — once you hand off, the new agent owns the conversation.
  • Guardrails are not generic content filters. Treat them as single-purpose tripwires that run concurrently with the agent.
  • Tracing is the most underrated feature — every run produces an inspectable trace, which is how you actually debug agent prompts before they reach users.

What the OpenAI Agents SDK actually is

Strip away the marketing and the SDK is a small, opinionated Python library with these properties:

  • One runtime loop. Runner.run(agent, input) (async, with a Runner.run_sync wrapper) drives a tool-calling loop, following handoffs along the way, until the active agent produces a final response or hits a stop condition such as the turn limit.
  • Built on the Responses API. Not Chat Completions. Responses is OpenAI's newer, agent-oriented endpoint that handles tool calls, hosted tools, and stateful turn management more cleanly.
  • OpenAI-model-first, but pluggable. Designed around OpenAI models. You can route to other providers through Responses-API-compatible adapters or LiteLLM, but the polish is on the OpenAI path.
  • Hosted tool support. Web search, file search, and computer use are first-class hosted tools you can attach without standing up your own infrastructure.
  • MCP servers. Model Context Protocol servers can be registered as tool sources, so the same agent can talk to local tools and remote MCP-exposed services.
  • Tracing on platform.openai.com. Every run logs an inspectable trace: model calls, tool calls, handoffs, guardrail outcomes, timing.

The point of the SDK is not to model your control flow as a graph or to enforce a role-based crew. It is to give the LLM a clean tool-calling surface and let the agent's tool-calling loop do the orchestration.
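
To make the loop concrete, here is a minimal sketch; the agent name and prompt are illustrative, and it assumes an OpenAI API key is configured in your environment:

python
import asyncio
from agents import Agent, Runner

assistant = Agent(
    name="assistant",
    instructions="Answer concisely. Use tools only when the question needs live data.",
    model="gpt-4o-mini",
)

async def main():
    # Runner.run drives the tool-calling loop until the agent returns a final response.
    result = await Runner.run(assistant, "Summarize our refund policy in one sentence.")
    print(result.final_output)

asyncio.run(main())

If you are not in an async context, Runner.run_sync(assistant, "...") does the same thing synchronously.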

Swarm vs Agents SDK

Swarm was an experimental cookbook that demonstrated a handoff-driven multi-agent pattern in a few hundred lines of Python. It was useful as a teaching artifact, but it was never positioned for production. The Agents SDK keeps the handoff idea and adds the surfaces real systems need.

| Dimension | Swarm (experimental, archived path) | OpenAI Agents SDK (production) |
| --- | --- | --- |
| Status | Experimental cookbook, not for production | Officially supported framework |
| API foundation | Chat Completions | Responses API |
| Typing | Untyped helpers, dict-based | Typed Agent, Runner, RunResult |
| Async | Sync only | Async-first, sync wrapper available |
| Structured outputs | Manual JSON parsing | output_type with Pydantic, SDK validates |
| Guardrails | None | Input and output guardrails as primitives |
| Tracing | None | Built-in, surfaced on platform.openai.com |
| Hosted tools | None | Web search, file search, computer use |
| MCP support | None | First-class MCP server registration |
| Handoffs | Yes (the headline feature) | Yes, generalized and typed |

If you have a Swarm prototype, the SDK is the supported migration target. If you are starting today, skip Swarm — you can read it as a reference for the handoff pattern, but the SDK is what you ship.

Primitive 1: the Agent

An Agent is a model plus instructions plus a tool list plus an optional handoff list plus an optional output type. Conceptually:

python
from agents import Agent

triage = Agent(
    name="triage",
    instructions="""You are the entry point for support requests.
Classify the request as one of: billing, technical, refund, other.
- For billing or refund, hand off to the billing agent.
- For technical, hand off to the engineering agent.
- For other, answer directly in 2-3 sentences and stop.
Never attempt to resolve billing or refund issues yourself.""",
    model="gpt-4o-mini",
    handoffs=[billing_agent, engineering_agent],
)

The instructions field is where most of your prompt engineering effort goes. Three things matter:

Be explicit about when to use which tool or handoff. "Hand off to the billing agent for billing or refund issues" is useful. "Use your judgment" is not. The model is making a tool-call-shaped decision at every step; it needs criteria, not vibes.

Bound the agent. "Answer in 2-3 sentences and stop" prevents the agent from wandering. "Never attempt to resolve billing yourself" prevents it from undercutting your handoff design.

Keep instructions short and structured. A wall of prose buries the actual rules. Short paragraphs, bullets, or a numbered protocol read better — both for the model and for the human debugging the trace later.

For more on what good instructions look like, see RCAF and the SurePrompts Quality Rubric.

Sloppy vs sharp instructions

Sloppy:

You are a helpful assistant that handles customer support. Use your tools when needed and hand off to other agents when appropriate.

Sharp:

You handle inbound support. Step 1: classify (billing | technical | refund | other). Step 2: for billing/refund, hand off to billing_agent; for technical, hand off to engineering_agent. Step 3: for other, answer in 2-3 sentences. Never resolve billing/refund yourself. Never call tools other than handoffs.

The sharp version reads like a runbook. That is the goal.

Primitive 2: the Tool

Function tools are Python functions exposed to the agent via a decorator:

python
from agents import function_tool

@function_tool
def lookup_order(order_id: str) -> dict:
    """Look up an order by its ID. Returns status, amount, and customer email.

    Args:
        order_id: The order identifier in the format ORD-XXXXX.
    """
    # Illustrative stub: orders_db stands in for your real data access layer.
    return orders_db.get(order_id)

Two things to internalize:

The docstring is the tool description. The SDK uses your function's docstring (and parameter annotations) to build the tool schema the model sees. The agent decides whether to call this tool based on what the docstring says. So the docstring is a prompt — write it for the agent.

Type hints become the parameter schema. order_id: str becomes a typed string parameter. Pydantic models work too. The model is much better at calling tools when the schema is precise.
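
As a sketch of the Pydantic path (the RefundRequest model and request_refund tool are illustrative names, not part of the SDK):

python
from pydantic import BaseModel
from agents import function_tool

class RefundRequest(BaseModel):
    order_id: str  # e.g. "ORD-12345"
    amount: float  # refund amount in USD
    reason: str

@function_tool
def request_refund(refund: RefundRequest) -> str:
    """Submit a refund for an existing order.

    Args:
        refund: The order ID, refund amount in USD, and customer-stated reason.
    """
    # Illustrative stub: call into your real billing system here.
    return f"Refund of ${refund.amount:.2f} queued for order {refund.order_id}"

The model now sees one nested object with named, typed fields instead of a bag of loose strings, which makes miscalls easier to spot in the trace.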

A useful discipline: imagine you only have the tool name, the docstring, and the parameter list. Could the agent figure out when to call this and what to pass? If not, the docstring is wrong.

One agent, fewer tools

A common failure mode is to attach 12 tools to one agent and hope it picks correctly. It usually doesn't. Two heuristics:

  • If two tools could plausibly be called for the same input, the agent will be confused. Either merge them or split the agent.
  • If you find yourself adding a tool and then writing a long instruction explaining when not to use it, that's a smell — the prompt is doing the work the tool selection should do.

When the tool surface gets wide, that's the cue to introduce a handoff and split into two agents.

Primitive 3: the Handoff

A handoff transfers control of the agent loop to another agent. From the model's perspective, it looks like a tool call. From the SDK's perspective, the active agent is swapped and the loop continues.

python
from agents import Agent

billing_agent = Agent(
    name="billing",
    instructions="""You handle billing and refund issues.
Available tools: lookup_invoice, issue_refund (max $200 without manager approval).
For refunds over $200, hand off to manager_agent.
Always confirm the refund amount with the customer before issuing.""",
    tools=[lookup_invoice, issue_refund],
    handoffs=[manager_agent],
)

triage_agent = Agent(
    name="triage",
    instructions="...",
    handoffs=[billing_agent, engineering_agent],
)

The key prompt-engineering implication: the model decides when to hand off. That decision is driven by the instructions on the source agent. So every handoff target needs an explicit criterion in the source agent's instructions. "Hand off to the billing agent for billing or refund issues" works. "Hand off when appropriate" does not.

This is also where the SDK differs from delegation patterns in CrewAI. In CrewAI, a manager agent delegates a subtask to a worker and gets a result back. In the Agents SDK, a handoff transfers ownership — the new agent owns the conversation until it responds, hands off again, or completes. There is no implicit return.

The practical consequence: if you want a "consultant" pattern (one agent asks another for help and incorporates the answer), you do not want a handoff. You want a tool that wraps a sub-agent run.
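
A minimal sketch of that consultant pattern (the policy_expert agent and ask_policy_expert tool are illustrative): the sub-agent runs inside a function tool, so its answer comes back into the calling agent's context instead of taking over the conversation.

python
from agents import Agent, Runner, function_tool

# A narrow specialist whose answer we want returned to the caller.
policy_expert = Agent(
    name="policy_expert",
    instructions="Answer questions about refund policy in at most three sentences.",
    model="gpt-4o-mini",
)

@function_tool
async def ask_policy_expert(question: str) -> str:
    """Ask the refund-policy expert a question and return its answer.

    Args:
        question: A single, specific question about refund policy.
    """
    result = await Runner.run(policy_expert, question)
    return str(result.final_output)

support_agent = Agent(
    name="support",
    instructions="Handle support requests. Call ask_policy_expert when you need a policy ruling, then answer the customer yourself.",
    tools=[ask_policy_expert],
)

Recent SDK versions also expose an as_tool() helper on Agent that packages this wrapper for you; if your version has it, prefer it over hand-rolling the tool.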

For the broader pattern catalog, see the multi-agent prompting guide and the AI agents prompting guide.

Primitive 4: the Guardrail

Guardrails are validators that run alongside an agent. There are two kinds:

  • Input guardrails inspect the input before the agent runs. They can short-circuit the run if a policy is violated.
  • Output guardrails inspect the final output before it is returned to the caller.

Each guardrail is a function (often one that wraps a small, fast classifier agent) that returns a structured verdict — typically including whether the guardrail tripped and a reason.

python
from agents import Agent, Runner, input_guardrail, GuardrailFunctionOutput
from pydantic import BaseModel

class ScopeCheck(BaseModel):
    is_in_scope: bool
    reason: str

scope_agent = Agent(
    name="scope_check",
    instructions="Decide if this message is about billing/support. Return JSON.",
    output_type=ScopeCheck,
    model="gpt-4o-mini",
)

@input_guardrail
async def in_scope(ctx, agent, input_text):
    result = await Runner.run(scope_agent, input_text)
    return GuardrailFunctionOutput(
        output_info=result.final_output,
        tripwire_triggered=not result.final_output.is_in_scope,
    )

main_agent = Agent(
    name="support",
    instructions="...",
    input_guardrails=[in_scope],
)

Two design rules that save pain later:

One guardrail, one failure mode. A guardrail that checks scope, PII, and prompt injection at once is hard to debug when it trips. Three small guardrails are easier to reason about than one big one.

Guardrails are not your only line of defense. They are tripwires, not gates. Real safety lives in your tool implementations (authorization checks, rate limits, audit logs), not in the guardrail layer.

Output guardrails follow the same shape but inspect the agent's final response. A common use is checking that a structured output passes business rules the schema can't express — for example, "the refund amount is within policy."
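
A sketch of that refund check as an output guardrail; it assumes the agent's output_type exposes a refund_amount field, and the field name and limit are illustrative:

python
from agents import Agent, GuardrailFunctionOutput, output_guardrail

REFUND_LIMIT = 200  # business rule the schema alone can't express

@output_guardrail
async def refund_within_policy(ctx, agent, output):
    # `output` is the agent's final response; refund_amount is an assumed
    # field on the output_type model (illustrative).
    amount = getattr(output, "refund_amount", 0)
    return GuardrailFunctionOutput(
        output_info={"refund_amount": amount},
        tripwire_triggered=amount > REFUND_LIMIT,
    )

billing_agent = Agent(
    name="billing",
    instructions="...",
    output_guardrails=[refund_within_policy],
)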

Tracing — the underrated production feature

Every run of Runner.run produces a trace: the sequence of model calls, tool calls, handoffs, guardrail outcomes, and timing. Traces show up on platform.openai.com under your project.

This sounds mundane. It is the single biggest reason to use the SDK over hand-rolled tool-calling code.

In practice, you debug agents by reading traces. You see which tool the model picked, what it passed, what came back, why it handed off, where it looped. You can identify the exact step where instructions were ambiguous and the model made the wrong call. Without tracing, you are guessing from final outputs.

Two habits worth building:

  • Read your traces during development. Not just when something breaks. Skim a few traces of "successful" runs and you'll find decisions the model made for the wrong reason — they happened to work this time.
  • Add trace metadata for filtering. Tag traces with session ID, user ID, or experiment name so you can slice them later; a sketch of this follows below.
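
A minimal sketch of the second habit, assuming the SDK's trace() context manager with its group_id and metadata parameters, and reusing triage_agent from earlier (the workflow name and metadata keys are illustrative):

python
from agents import Runner, trace

async def handle_request(user_id: str, session_id: str, message: str):
    # Group every run from one session under a named workflow so traces can be
    # filtered on platform.openai.com by session, user, or experiment.
    with trace(
        "support-session",
        group_id=session_id,
        metadata={"user_id": user_id, "experiment": "triage-v2"},
    ):
        result = await Runner.run(triage_agent, message)
    return result.final_output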

Tracing also pairs well with the Context Engineering Maturity Model — the trace is what tells you whether your agent is operating at level 1 (single-shot) or level 3+ (managed memory and tool orchestration).

Structured outputs with output_type

When you need the agent's final response to be a typed object, set output_type to a Pydantic model:

python
from pydantic import BaseModel
from agents import Agent

class Triage(BaseModel):
    category: str  # "billing" | "technical" | "refund" | "other"
    urgency: str   # "low" | "medium" | "high"
    summary: str

triage = Agent(
    name="triage",
    instructions="Classify the inbound request and summarize it in one sentence.",
    output_type=Triage,
)

The SDK enforces the schema and gives you a typed, validated object back; if the model's response doesn't match, you get a validation error instead of silently malformed text — a small but real productivity win. This is the same structured output pattern that frameworks like LangGraph and Mastra also support, just with less ceremony.

When you should use it: any time the next step in your code branches on a field of the agent's response. Don't parse natural language — declare a schema.
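
A sketch of that branching, reusing the Triage agent defined above (the queue names are illustrative):

python
from agents import Runner

async def route(message: str) -> str:
    result = await Runner.run(triage, message)
    ticket = result.final_output  # a validated Triage instance

    # Branch on typed fields instead of parsing prose.
    if ticket.category in ("billing", "refund"):
        return f"[{ticket.urgency}] billing queue: {ticket.summary}"
    if ticket.category == "technical":
        return f"[{ticket.urgency}] engineering queue: {ticket.summary}"
    return ticket.summary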

When you shouldn't: open-ended generation (drafts, summaries, explanations) where forcing a schema flattens the output.

Agents SDK vs LangGraph vs CrewAI

| Dimension | OpenAI Agents SDK | LangGraph | CrewAI |
| --- | --- | --- | --- |
| Mental model | Tool-calling agent + handoffs | Explicit graph of nodes and edges | Crew of role-based agents with tasks |
| Control flow | LLM decides (tool calls + handoffs) | You declare (nodes, edges, conditional routing) | You declare (task list, sequential or hierarchical) |
| Framework footprint | Smallest of the three | Larger; graph + state + checkpointing | Medium; role/task abstractions |
| State management | In-context + tool-managed | First-class typed state, checkpointing, time travel | Per-task context, shared crew memory |
| Tracing | Built-in on platform.openai.com | LangSmith integration | Built-in observability hooks |
| Model coverage | OpenAI-first; others via adapters | Provider-agnostic | Provider-agnostic |
| Best fit | OpenAI-first stacks; lightest setup | Workflows where control flow must be explicit and debuggable | Role/task-shaped workflows |
| Worst fit | Multi-provider stacks; explicit DAG workflows | "I just want a tool-using assistant" | Workflows where roles are a fiction |

These are sibling tools, not competitors fighting for one slot. Most serious teams end up with more than one in their stack — for example, an Agents SDK assistant for end-user interactions and a LangGraph pipeline for the offline workflow that produces the data the assistant uses. See the Agentic Prompt Stack for how these fit into a layered design.

Common failure modes

The SDK is small, so most failures are prompt failures. The patterns repeat.

Vague instructions. By far the most common. "You are a helpful agent that handles X" is not enough. The model needs criteria for when to call which tool, when to hand off, when to stop. Symptom: the agent calls the wrong tool, hands off prematurely, or wanders. Fix: rewrite instructions as a numbered protocol with explicit conditions. Read three traces and see if you can predict the model's next action — if you can't, neither can the model.

Too many tools per agent. A single agent with 10+ tools is a confused agent. Symptom: the agent picks plausible-but-wrong tools, especially when two tools have overlapping descriptions. Fix: split into multiple agents connected by handoffs, with each agent owning a coherent slice of the toolset.

Handoffs without clear when-to-handoff criteria. "Hand off when appropriate" is not a criterion. Symptom: flapping (agent A hands off to B, B hands back to A), premature handoffs, missed handoffs. Fix: in the source agent's instructions, name each handoff target and the exact condition that should trigger it.

Over-aggressive guardrails. A guardrail that trips on edge cases the user actually cares about kills the experience. Symptom: legitimate requests blocked; users frustrated. Fix: scope each guardrail tightly. Run the guardrail's classifier agent against a held-out set of real inputs before shipping. Keep tripwires per-purpose.

Ignoring tracing during development. Building an agent without reading traces is like writing code without running a debugger. Symptom: mysterious failures in production that "worked locally." Fix: make trace-reading part of every PR. Skim five traces — three successes, two failures — before you call an agent done.

Treating handoffs as function calls. They aren't. Once you hand off, the new agent owns the conversation. Symptom: you expect the original agent to "come back" with a result and it doesn't. Fix: if you want delegation-with-return, wrap the sub-agent run in a @function_tool instead of using a handoff.

When the SDK is the right pick — and when it isn't

Right pick when:

  • Your stack is OpenAI-first and you want to take advantage of Responses API + hosted tools + platform tracing.
  • You want the lightest framework footprint that still gives you guardrails, structured outputs, and tracing.
  • Your workflow is genuinely model-driven — the agent decides what to do at each step, with handoffs where ownership should transfer between agents.
  • You are building an end-user-facing assistant rather than an offline DAG pipeline.

Wrong pick when:

  • Your control flow is naturally a graph with explicit branching, retries, and parallelism — LangGraph makes those legible in a way the SDK does not.
  • Your team thinks in roles and tasks rather than agents and tools — CrewAI maps to that mental model.
  • You are TypeScript-first or you want a multi-agent system with first-class evals built in — Mastra is the better starting point.
  • You need to support multiple model providers as first-class citizens — the SDK can be coerced, but you'll be fighting it.

For agent orchestration at scale, none of these tools is a complete answer. Most production agent systems blend a framework like the SDK with custom orchestration, observability, and evals.

The SDK rewards taking prompt engineering seriously. Four primitives is not a lot of framework — which means your instructions, your tool docstrings, and your handoff criteria are doing most of the work. Read your traces, write sharp instructions, keep guardrails focused, and the small surface becomes a feature, not a limitation.
