Key takeaways:
- The Realtime API is a speech-to-speech model, not a chat model with audio bolted on. Audio in, audio out, model reasoning over the audio directly. That fact is why response latency lands around 300 ms instead of the 1.5-3 seconds typical of an STT-LLM-TTS pipeline.
- The session-config payload is where the agent is configured: voice, instructions, turn-detection mode, audio formats, modalities, and tools. Get the shape right and the rest is content.
- Voice system prompts are a different discipline from text. Bullets become rambling. Markdown becomes literal asterisks. Long enumerations become turns the user will interrupt. Voice-shaped instructions are short, speakable, and tolerant of being cut off.
- Server VAD makes interruption a first-class behavior, not an exception. Prompts have to assume the user will cut in — which changes confirmation patterns, refusal phrasing, and how you handle partial information.
- Tool calls over voice need a verbal cover — "let me check that for you" — to prevent dead air during the tool round-trip. The cover is a prompted behavior.
- The Realtime API is not always the right model. For pure TTS, voice quality, or sub-150 ms latency, other voice models are often better.
Why the Realtime API Is a Different Tool from TTS
Before the Realtime API, building a voice agent meant three models in series. A speech-to-text model transcribed the user. A language model produced a text response. A text-to-speech model synthesized that response. The whole pipeline added latency at every hop, and the round-trip from end-of-user-speech to start-of-agent-audio typically landed somewhere between 1.5 and 3 seconds. That gap is the difference between an interaction that feels conversational and one that feels like a phone IVR.
The Realtime API collapses the pipeline. The model takes raw audio as input and produces raw audio as output, all over a single persistent WebSocket session. There is no transcription step the model reasons against. The audio itself is the input modality, which means tone, pacing, hesitation, and background noise are all available to the model in a way they are not when the input has already been flattened to text. End-to-end response latency lands around 300 ms in practice.
This is a different tool from gpt-4o-tts and the family of one-shot TTS models. A TTS model is a synthesizer: hand it a string, get back an audio buffer. It has no notion of listening, no concept of being interrupted, no conversational state. The Realtime API is bidirectional and stateful — it streams audio both ways and maintains a session. If your use case is reading a fixed script aloud, a dedicated TTS model is cheaper. If your use case is a back-and-forth conversation, the Realtime API is the right shape.
The cost is real. The model is more expensive per minute than text alternatives. The voice library is smaller than what dedicated TTS providers offer. And debugging a session means listening to recordings rather than reading logs. None of these are deal-breakers, but they shape when the Realtime API is the right pick versus when something like ElevenLabs, Cartesia, or Hume wins on a different axis — the voice generation models comparison walks through the trade-offs.
The Session-Config Anatomy
A Realtime session opens with a session.update event that configures everything about the conversation. Most of what makes an agent feel right or wrong lives in this payload. A representative shape:
{
  "type": "session.update",
  "session": {
    "modalities": ["audio", "text"],
    "voice": "alloy",
    "instructions": "You are a support agent for Acme Software...",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 500
    },
    "tools": [
      {
        "type": "function",
        "name": "lookup_order",
        "description": "Look up an order by its ID. Use when the caller asks about a specific order.",
        "parameters": {
          "type": "object",
          "properties": {
            "order_id": { "type": "string" }
          },
          "required": ["order_id"]
        }
      }
    ]
  }
}
Field by field.
modalities controls what the model can produce. ["audio", "text"] gives you both — the model speaks and you also receive a text transcript. For a voice agent, audio is required; the text companion is what you log and analyze later.
voice picks one of the available voices. The set is smaller than what dedicated TTS providers offer, and each voice has a distinct personality the system prompt cannot fully override. A reserved voice will sound reserved even if the prompt asks for warmth. Audition each one against your use case before committing.
instructions is the system prompt. This is the field that makes or breaks the agent and is the focus of the next section. Treat it the way you would treat any system prompt — but written for speech, not for reading.
input_audio_format and output_audio_format define what the WebSocket is sending. pcm16 is the default and highest-fidelity option. g711_ulaw and g711_alaw are the telephony codecs — pick those when the agent is on a phone bridge and audio is already constrained to 8 kHz. Mismatching formats between the client and the session config is a classic first-day bug; the agent sounds like static, and the cause is one wrong string.
turn_detection controls how the model decides the user is done talking. "server_vad" runs voice activity detection on the inbound audio and emits a speech_stopped event after a tunable silence window. The other option is to disable server VAD entirely and commit each user turn from the client — what you reach for when you have your own VAD or a push-to-talk button.
tools is the function-calling array, the same shape used in Chat Completions. The voice-specific consideration is what the model says while the tool runs; that gets its own section.
There are more fields — temperature, max output tokens, audio transcription configuration — but these six are the load-bearing ones.
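The payload above can be assembled programmatically before the session opens. A minimal Python sketch under the document's field names — `build_session_update` is a hypothetical helper, and the defaults are illustrative rather than a complete schema:

```python
import json

def build_session_update(instructions, tools, voice="alloy"):
    """Assemble a session.update event matching the shape shown above.

    Field names follow the Realtime API session config; the default
    values here are illustrative, not an exhaustive schema.
    """
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "voice": voice,
            "instructions": instructions,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            "turn_detection": {
                "type": "server_vad",
                "threshold": 0.5,
                "prefix_padding_ms": 300,
                "silence_duration_ms": 500,
            },
            "tools": tools,
        },
    }

lookup_order = {
    "type": "function",
    "name": "lookup_order",
    "description": "Look up an order by its ID. Use when the caller asks about a specific order.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}

event = build_session_update("You are a support agent for Acme Software...", [lookup_order])
payload = json.dumps(event)  # the string that actually goes over the WebSocket
```

Centralizing the payload in one builder also gives you a single place to catch the mismatched-audio-format bug: assert the configured format against what your client capture code produces before the session opens.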
Writing System Prompts That Sound Right Out Loud
A system prompt that works for text chat will produce a voice agent that sounds wrong, and the failures are not subtle. The voice prompting glossary entry names the discipline; this section walks through the specific failure modes.
The bullet-list-becomes-rambling problem. Voice models read lists out loud. A prompt that says "When the caller asks about pricing, cover: tier names, monthly cost, annual cost, included features, overage rates, trial availability" produces a 30-second monologue that the user will interrupt halfway through. They were asking the price of one tier. Fix: prompt for one piece of information per turn, with optional follow-up.
The markdown-becomes-asterisks problem. Voice models occasionally vocalize markdown literally — pronouncing "asterisk" or pausing strangely where bold text would render. Even when they handle it gracefully, "format as a bulleted list" produces stilted phrasing. Strip markdown from the instructions entirely. Write in prose. If you must structure output, structure it as conversational beats: "first acknowledge the issue, then offer the next step, then ask if that works."
The short-turn discipline. Voice turns should be short. Two to three sentences per turn max is a useful default, with anything longer broken across multiple turns where the user can interject. Long turns actively damage the conversation — users cannot scan-skim a voice response the way they scan-skim a paragraph, and after about 15 seconds the user has either lost track or interrupted.
Refusal phrasing. A text refusal can be polite and slightly verbose. A voice refusal needs to be shorter and clearly directional, because the user will talk over the explanation. "I can't help with that. Want to ask about something else?" works — clear handoff, clear next move, no buried explanation.
When to suggest a switch. Some information does not work as speech — a long list of options, a configuration string, a multi-step technical procedure. "If the answer requires more than four discrete items, offer to email it instead of reading it aloud" is the kind of guardrail that prevents monologuing through a list nobody can remember.
A worked voice system prompt for a support agent might look like this:
You are a phone support agent for Acme Software. You help callers with
account questions, order lookups, and basic troubleshooting.
Speak conversationally. Keep each turn to two or three sentences.
If the caller interrupts, stop talking immediately and listen.
When confirming actions that change the account, repeat the key detail
back before acting — for example, "I'll cancel order number 4471, is
that right?" — and wait for confirmation.
If the answer requires reading more than three items aloud, offer to
email the full list instead. Do not read long lists.
You cannot help with billing disputes or refunds — for those, offer
to transfer the caller to a human agent.
Never describe yourself as an AI unless the caller asks directly.
It is short on purpose. Voice system prompts get worse as they get longer, partly because more rules mean more conflicting guidance for the model to navigate in real time, and partly because every rule that does not directly map to spoken behavior is dead weight.
Turn Detection, Interruption, and Backchannel
In a text chat, a turn ends when the user hits send. In a voice conversation, a turn ends when the user stops talking — and the system has to figure out when.
Server VAD is the Realtime API's answer. The server analyzes inbound audio, detects when the user starts speaking, and emits a speech_stopped event after a configurable silence window. The model then commits the audio buffer as a user turn and starts generating. The threshold and prefix_padding_ms parameters tune sensitivity — how loud counts as speech, and how much audio before the detected start should be included to avoid clipped first words.
The defaults are reasonable for most environments but not all. A noisy call center needs a higher threshold to avoid false triggers on background chatter. A user who pauses mid-sentence to think needs a longer silence window. Tune per deployment against real recordings, not intuition. The alternative is to disable server VAD entirely and commit each user turn from the client — the right call when you cannot trust the audio environment or when the UX is push-to-talk.
Interruption is where voice agents most obviously diverge from text chats. When server VAD is enabled and the user starts talking while the model is mid-response, the model truncates its audio output and switches to listening. The conversation history reflects only what was actually played to the user, not the full unspoken response. Two implications for prompt design.
First, confirmation patterns must tolerate being cut off. "Always confirm the order number by repeating it back before acting" works only if the agent does not act until it has heard "yes" — because the user might interrupt the confirmation itself with "yes that's right" or "no, wait, the other one." Confirmation has to be a two-turn protocol, not a single-turn habit.
Second, important information should come first in each turn. A voice agent that buries the answer at the end of a long preamble loses to interruption every time. Lead with the answer; offer detail after.
Backchannel — the small "uh-huh" sounds that signal listening — is a deliberate choice. Some agents emit them for warmth; others avoid them because they sound performative. The Realtime API will produce them if the system prompt encourages it and avoid them if it does not. Pick one explicitly.
Tool Calling Without Awkward Silence
Tool calling over voice has the same mechanics as in Chat Completions — the model emits a function-call event, the client runs the function, the client sends a function-output event, the model continues. The new problem is what happens to the audio stream while the tool runs.
A 200 ms tool call is fine as silence. A 1.5 second tool call is the agent freezing mid-sentence. A 4 second tool call without a verbal cover gets the user saying "hello? are you still there?" before the result comes back.
The fix is to prompt the model to acknowledge verbally before the tool runs:
When you need to look up information, say a short acknowledgment first
— "let me check that for you" or "one moment" — and then call the tool.
Keep the acknowledgment under one second.
The model produces the acknowledgment audio, then emits the function-call event. The client runs the function while the acknowledgment is still playing. By the time the acknowledgment finishes, the tool result is usually back and the model can continue. The dead air is hidden inside the cover.
Reading tool results back aloud is its own discipline. Results are usually structured — a JSON blob from order lookup, a list of objects from search. Without explicit instructions, the model will sometimes read field names or punctuation. Prompt the shape of the read-back: "When you receive an order lookup result, summarize it in one sentence — order ID, status, and expected delivery date. Do not read field names or punctuation."
For tool calls that take longer than a verbal cover can mask, escalation is the move. "This is taking longer than usual — would you like me to follow up by email when I have the answer?" preserves the conversation and surfaces a useful signal: a tool that consistently triggers the escalation path needs optimization.
Parallel tool calls work the same way as in Chat Completions — multiple function-call events in one turn, all results sent back. The system prompt should describe when parallel calls are appropriate ("when the caller asks about multiple orders, look them all up at once") to nudge the model toward them.
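The client side of the round-trip can be sketched as a small dispatch function. The `conversation.item.create` / `function_call_output` shape follows the Realtime API's event vocabulary but does not appear earlier in this walkthrough, so verify it against the API reference; `fake_lookup_order` is a stand-in for a real backend call:

```python
import json

def fake_lookup_order(order_id):
    # Stand-in for the real order-lookup backend.
    return {"order_id": order_id, "status": "shipped", "expected_delivery": "April 25"}

HANDLERS = {"lookup_order": fake_lookup_order}

def handle_function_call(event):
    """Given a completed function-call event, return the two events to send
    back: the tool output, then a request for the model to keep speaking.
    """
    args = json.loads(event["arguments"])
    result = HANDLERS[event["name"]](**args)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event.get("call_id", ""),
                "output": json.dumps(result),
            },
        },
        # Ask the model to resume generation with the result in context.
        {"type": "response.create"},
    ]

out = handle_function_call({
    "type": "response.function_call_arguments.done",
    "name": "lookup_order",
    "call_id": "call_123",
    "arguments": "{\"order_id\": \"4471\"}",
})
```

In a real client this runs concurrently with the acknowledgment audio playing out, which is what makes the verbal-cover pattern work.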
Worked Example: A Support Voice Agent
A short conversation end to end. The agent uses the system prompt above, with one tool — lookup_order — wired in. Session opens; the client sends session.update with the prompt, the function definition, server VAD enabled, and the alloy voice.
Agent: "Acme Software support, this is Alex. How can I help?"
User: "Hey, I want to check on an order I placed last week."
Agent: "Sure — what's the order number?"
User: "It's, uh, four-four-seven-one."
The model needs to call lookup_order. The system prompt instructs a verbal cover, so it speaks first:
Agent: "Let me check that for you."
While the audio plays, the model emits the function call:
{
  "type": "response.function_call_arguments.done",
  "name": "lookup_order",
  "arguments": "{\"order_id\": \"4471\"}"
}
The client runs the function, gets back {"order_id": "4471", "status": "shipped", "expected_delivery": "April 25"}, and sends the function-output event. The model continues:
Agent: "Order 4471 shipped already. Expected delivery is April 25 — does that match what you're seeing in your tracking?"
The user interrupts the last word:
User: "Wait, can you cancel that one and—"
Server VAD fires speech_started. The model truncates its in-flight audio and switches to listening. The user continues:
User: "—reorder it with overnight shipping?"
The model now has two things to navigate: an account-changing action (cancellation, which the prompt requires confirming) and a refund-adjacent reorder (explicitly out of scope). Its response:
Agent: "I can cancel order 4471, but I'm not able to handle refunds or reorders here. Want me to transfer you to a billing agent who can do both?"
That single turn confirms what it can do, declines what it cannot, and offers a graceful escalation — all under three sentences. The user accepts the transfer; the conversation ends with a handoff event the client routes to its actual transfer mechanism. The prompt did all the work: the verbal cover, the confirmation requirement, the refusal phrasing, and the escalation move all live in the system prompt. The model is executing the contract.
Latency, Quality, and When to Pick a Different Model
The Realtime API's headline number is the ~300 ms response latency, end-to-end from end of user speech to start of agent audio. What eats into that budget: network latency between the client and the OpenAI WebSocket (typically 30-80 ms near-region, more across continents), audio buffering on the client (especially over telephony codecs that batch into 20 ms frames), tool call round-trips (the reason the verbal-cover pattern exists), and the model's own time-to-first-audio. The model is the smallest contributor.
What you can do: run the WebSocket from a backend in the same region as your users; use pcm16 when audio quality matters and g711 only when telephony bandwidth is constrained (it is lower quality, not faster); and optimize tool latency aggressively, because a 1.5 second tool call the user notices is more painful than a 50 ms model-side improvement.
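The one number worth instrumenting is the end-of-speech-to-first-audio gap itself. A sketch of measuring it from a timestamped client-side event log — the event type names here follow this walkthrough's shorthand (`speech_stopped`, plus an assumed `response.audio.delta` for the first audio chunk), so adapt them to whatever your client actually records:

```python
def first_audio_latency_ms(events):
    """Measure end-of-user-speech to start-of-agent-audio from a log of
    (timestamp_seconds, event_type) tuples. Event type names follow the
    document's shorthand; adapt to the exact types your client records.
    """
    t_stop = next(t for t, typ in events if typ == "speech_stopped")
    t_first_audio = next(
        t for t, typ in events if typ == "response.audio.delta" and t >= t_stop
    )
    return (t_first_audio - t_stop) * 1000

# Illustrative log: user stops at 2.40 s, first audio chunk at 2.71 s.
log = [
    (0.00, "speech_started"),
    (2.40, "speech_stopped"),
    (2.71, "response.audio.delta"),
    (2.75, "response.audio.delta"),
]
latency = first_audio_latency_ms(log)
```

Tracked per turn in production, this metric separates the contributors the paragraph above lists: a regional move shows up as a uniform shift, a slow tool as a bimodal distribution.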
Three cases where the Realtime API is the wrong pick.
Pure narration or one-shot synthesis. Voicing an explainer video, generating a notification, producing audiobook chapters — a dedicated TTS model is cheaper and offers more voices. The Realtime API's bidirectional architecture is dead weight when there is no conversation.
Voice quality is the dominant requirement. ElevenLabs, Cartesia, and Hume each specialize in voice expressiveness in ways the Realtime API does not match. If your product hinges on a distinctive voice — a branded character, an emotional read, a non-English accent — those providers are usually a better fit. The trade-off is that you give up speech-to-speech reasoning and get a more expressive synthesizer. The voice generation models comparison covers the specifics.
Sub-150 ms latency. The Realtime API's floor is fast for a conversation but not the fastest possible. Cartesia's TTS, paired with aggressive STT and a small LLM at the edge, can land below 150 ms in some configurations. If your use case demands that — competitive gaming voice, real-time interpretation — build the pipeline yourself and pay in glue code.
Pick the model for the use case. The Realtime API is the right default for conversational voice agents where speech-to-speech reasoning matters and 300 ms is acceptable. It is not the universal answer.
Evaluating a Voice Agent
Voice agents fail in ways transcripts hide. A turn that reads fine on the page might sound stilted, rushed, or weirdly emphatic when actually spoken. A confirmation pattern that works in writing might fall apart when the user interrupts at the wrong moment. A 30-second tool call that looks reasonable in a log is an interminable pause to anyone listening. Transcript review is necessary but not sufficient. Voice agents have to be evaluated by listening.
Three layers.
Live conversation tests. Have humans talk to the agent — engineers, internal users, eventually support staff — under realistic conditions, including bad network, background noise, and users who interrupt. Record everything. Listen back. The failures that matter show up in real audio: dead air, weird emphasis, agents that talk over users, agents that take confirmations the user did not actually give. None of these are visible in a transcript.
Scripted regression tests. Build 20-50 canned conversations covering the cases you already ship — order lookups, refusals, escalations, awkward edge cases. Run them as recorded user audio against the agent in CI, capture responses, compare against expected behavior. This catches regressions in known failure modes, not new ones.
Rubric-based scoring. The SurePrompts Quality Rubric applies to spoken output with adaptations. Specificity and grounding are the same. Faithfulness to instructions is the same. Brevity has a stricter standard — a turn that is "appropriately concise" in writing might be a monologue out loud. Voice-specific dimensions worth adding: speakability (does it sound natural read aloud?), interruptibility (does it front-load the important information?), and tool-call coverage (was a verbal acknowledgment present where required?).
The temptation is to lean entirely on transcript-based metrics because they are cheap and automatable. Resist. The most expensive voice-agent failures are the ones nobody catches in the transcript.
What's Next
This walkthrough covered the speech-to-speech direction — agents that listen and respond in real time. The other direction is agents that take audio as input and reason over it without necessarily speaking back. Long-form podcast analysis, meeting transcription with structure, audio understanding for accessibility — different problems, different tools. The audio understanding walkthrough covers the long-context side, where Gemini's 1M+ context plus native audio support makes whole-call or whole-meeting reasoning practical in a way streaming voice agents are not designed for.
For broader context, the multimodal prompting pillar puts voice, vision, audio understanding, and video generation in one frame and traces the prompt patterns that compose across them.
Three reading paths. If the next thing on your roadmap is picking a TTS provider, the voice generation models comparison is the sister tutorial in this wave. If you are building voice agents that call into longer reasoning chains, the agentic prompt stack and the RCAF prompt structure name those patterns. And if you have an agent in production, the SurePrompts Quality Rubric adapts cleanly to spoken output.
The Realtime API is a meaningful change in what voice agents can be. Speech-to-speech with 300 ms latency and tool calling is different from the STT-LLM-TTS pipelines that came before — different in ways the prompt has to respect. Get the session config right, write instructions for speech rather than reading, design for interruption rather than against it, and the rest of the work is the work it always was: choosing what the agent should and should not do, then writing that down clearly enough that the model can execute it.