Realtime Voice API
A realtime voice API is a speech-to-speech architecture that accepts streaming audio input and returns streaming audio output directly, without the conventional STT-then-LLM-then-TTS pipeline. By skipping the intermediate text representation, these systems target sub-second end-to-end latency suitable for natural conversational turn-taking, including barge-in (the user interrupting mid-response). OpenAI's Realtime API and Cartesia's Sonic are leading examples. The trade-offs versus a pipelined architecture are real: tool use, structured output, and inspectable transcripts are more constrained, and the model running speech-to-speech is typically distinct from (and often less capable than) the text model in the same vendor's lineup.
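To make the architecture concrete, here is a minimal sketch of the kind of session-configuration event a speech-to-speech client sends before streaming audio. The field names mirror OpenAI's published Realtime API event shapes (`session.update`, `turn_detection`, `server_vad`), but treat the exact keys and defaults as assumptions for illustration, not an authoritative schema.

```python
import json

def build_session_update(voice="alloy",
                         instructions="You are a concise support agent."):
    """Build a session.update client event for a speech-to-speech session.

    Field names follow OpenAI's documented Realtime event shapes;
    they are illustrative here, not a guaranteed schema.
    """
    return {
        "type": "session.update",
        "session": {
            # Audio in, audio out; a text transcript can ride alongside.
            "modalities": ["audio", "text"],
            "voice": voice,
            "instructions": instructions,
            "input_audio_format": "pcm16",
            "output_audio_format": "pcm16",
            # Server-side voice activity detection drives turn-taking
            # and makes barge-in possible without client-side endpointing.
            "turn_detection": {"type": "server_vad"},
        },
    }

event = build_session_update()
print(json.dumps(event, indent=2))
```

Note that turn detection lives in the session config rather than in application code: because there is no STT stage emitting final transcripts, the server's VAD is what decides when a "turn" has ended.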
Example
A customer-support voice agent built on the OpenAI Realtime API streams the caller's microphone audio into a single WebSocket and streams the agent's spoken response back over the same connection, with target round-trip latency in the few-hundred-millisecond range. The same agent built as STT-then-GPT-4-then-TTS would add the latency of all three stages sequentially and would lose the ability to barge in cleanly mid-response.
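The barge-in behavior described above can be sketched as a small event dispatcher on the client side of the WebSocket. The event type names (`response.audio.delta`, `input_audio_buffer.speech_started`, `response.cancel`) mirror OpenAI's documented Realtime event types, but this handler is a simplified assumption-laden sketch, not a complete client: it decodes streamed audio chunks and, when the server signals that the caller started speaking, cancels the in-flight response and drops unplayed audio.

```python
import base64

def handle_server_event(event, playback_buffer, outgoing):
    """Dispatch one Realtime-style server event.

    playback_buffer: bytearray of raw PCM not yet played to the caller.
    outgoing: list collecting client events to send back over the socket.
    Event type names mirror OpenAI's docs but are assumptions here.
    """
    etype = event.get("type")
    if etype == "response.audio.delta":
        # The agent's speech arrives as base64-encoded PCM chunks.
        playback_buffer.extend(base64.b64decode(event["delta"]))
    elif etype == "input_audio_buffer.speech_started":
        # Barge-in: the caller started talking mid-response. Cancel the
        # in-flight response and discard audio queued but not yet played,
        # so the agent falls silent immediately instead of talking over them.
        outgoing.append({"type": "response.cancel"})
        playback_buffer.clear()

# Simulate a short exchange: two audio chunks arrive, then the caller interrupts.
buf, out = bytearray(), []
chunk = base64.b64encode(b"\x00\x01\x02\x03").decode()
handle_server_event({"type": "response.audio.delta", "delta": chunk}, buf, out)
handle_server_event({"type": "input_audio_buffer.speech_started"}, buf, out)
print(out, len(buf))  # the cancel event was queued and the buffer emptied
```

In a pipelined STT-then-LLM-then-TTS agent, this clean interruption is much harder: the TTS stage is already committed to synthesizing a full utterance, whereas here cancellation is a first-class protocol event.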
Related Resources
AI Voice and Audio Prompting: The Complete 2026 Guide
The canonical 2026 guide to voice and audio prompting for OUTPUT — TTS, voice cloning, realtime conversational voice, and voice agents. Covers the model landscape, the universal anatomy, three architectures, voice-agent system prompts, and the boundary with the multimodal pillar (which covers audio INPUT).
Prompting GPT-4o Realtime Voice: A Speech-to-Speech Walkthrough
The OpenAI Realtime API skips the STT-LLM-TTS pipeline and treats voice as a first-class modality. This walkthrough covers the session-config payload, voice-shaped system prompts, turn detection, tool calls without awkward silence, and a worked support-agent example.