Realtime Voice API

A realtime voice API is a speech-to-speech architecture that accepts streaming audio input and returns streaming audio output directly, without the classical pipeline of speech-to-text (STT), then a large language model (LLM), then text-to-speech (TTS). By skipping the intermediate text representation, these systems target sub-second end-to-end latency suitable for natural conversational turn-taking, including barge-in (the user interrupting mid-response). OpenAI's Realtime API is a leading example. Cartesia's Sonic is often cited in the same context, although it is a low-latency streaming TTS model rather than a full speech-to-speech system.
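Barge-in, as described above, comes down to discarding agent audio that has not been played yet as soon as the user starts speaking. The following is a minimal client-side sketch of that idea, not any vendor's API: `PlaybackController` and its method names are hypothetical, and the "user started speaking" signal is assumed to come from some external voice-activity detector.

```python
from collections import deque
from typing import Optional


class PlaybackController:
    """Hypothetical buffer for agent audio chunks with barge-in support."""

    def __init__(self) -> None:
        self.queue: deque = deque()  # agent audio chunks awaiting playback
        self.interrupted = False     # True while the current response is cut off

    def on_response_started(self) -> None:
        # A new agent response begins, so accept audio again.
        self.interrupted = False

    def on_agent_audio(self, chunk: bytes) -> None:
        # Ignore late chunks from a response the user already interrupted.
        if not self.interrupted:
            self.queue.append(chunk)

    def on_user_speech_started(self) -> int:
        # Barge-in: drop everything not yet played and mute the rest
        # of the current response. Returns the number of dropped chunks.
        dropped = len(self.queue)
        self.queue.clear()
        self.interrupted = True
        return dropped

    def next_chunk(self) -> Optional[bytes]:
        # Called by the audio output loop; None means nothing to play.
        return self.queue.popleft() if self.queue else None
```

In a real client the same event would usually also tell the server to stop generating the interrupted response; how that is signalled depends on the API in use.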

The trade-offs versus a pipelined architecture are real: tool use, structured output, and inspectable transcripts are more constrained, and the speech-to-speech model is typically distinct from (and often less capable than) the text model in the same vendor's lineup.

Example

A customer-support voice agent built on the OpenAI Realtime API streams the caller's microphone audio into a single WebSocket and streams the agent's spoken response back over the same connection, with target round-trip latency in the few-hundred-millisecond range. The same agent built as STT-then-GPT-4-then-TTS would incur the latency of all three stages in sequence, and clean mid-response barge-in would be harder to support because an interruption has to be coordinated across three separate stages.
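The latency comparison in this example is simple addition, which the sketch below makes explicit. The per-stage figures are placeholders chosen purely for illustration, not measurements of any product, and the function name `pipeline_latency_ms` is hypothetical.

```python
from typing import Mapping


def pipeline_latency_ms(stage_latencies_ms: Mapping[str, float]) -> float:
    """Total delay when each stage must finish before the next one starts."""
    return sum(stage_latencies_ms.values())


# Placeholder numbers for illustration only, not benchmarks.
pipelined = {"stt": 300.0, "llm": 500.0, "tts": 200.0}
speech_to_speech = {"model": 400.0}

pipelined_total = pipeline_latency_ms(pipelined)
direct_total = pipeline_latency_ms(speech_to_speech)
gap = pipelined_total - direct_total

print(f"pipelined:        {pipelined_total:.0f} ms")
print(f"speech-to-speech: {direct_total:.0f} ms")
print(f"difference:       {gap:.0f} ms")
```

Real pipelines can overlap stages by streaming partial results, so the plain sum is an upper bound for a strictly sequential design rather than a prediction.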

Frequently asked questions

What is a realtime voice API? A speech-to-speech architecture that accepts streaming audio input and returns streaming audio output directly, without the classical STT-then-LLM-then-TTS pipeline.
How does a realtime voice API work? Audio streams in and audio streams out, with no intermediate text representation, which is what makes sub-second latency and barge-in achievable. The cost relative to a pipelined architecture is that tool use, structured output, and inspectable transcripts are more constrained, and the speech-to-speech model is typically distinct from (and often less capable than) the text model in the same vendor's lineup.
Can you give an example of a realtime voice API? A customer-support voice agent built on the OpenAI Realtime API streams the caller's microphone audio into a single WebSocket and streams the agent's spoken response back over the same connection, with target round-trip latency in the few-hundred-millisecond range. The same agent built as STT-then-GPT-4-then-TTS would incur the latency of all three stages in sequence and would find clean mid-response barge-in harder to support.

Related Resources

Blog Post

AI Voice and Audio Prompting: The Complete 2026 Guide

The canonical 2026 guide to voice and audio prompting for OUTPUT — TTS, voice cloning, realtime conversational voice, and voice agents. Covers the model landscape, the universal anatomy, three architectures, voice-agent system prompts, and the boundary with the multimodal pillar (which covers audio INPUT).

Blog Post

Prompting GPT-4o Realtime Voice: A Speech-to-Speech Walkthrough

The OpenAI Realtime API skips the STT-LLM-TTS pipeline and treats voice as a first-class modality. This walkthrough covers the session-config payload, voice-shaped system prompts, turn detection, tool calls without awkward silence, and a worked support-agent example.
