Realtime Voice API

A realtime voice API is a speech-to-speech interface that accepts streaming audio input and returns streaming audio output directly, without the classical STT-then-LLM-then-TTS pipeline (speech-to-text, then a language model, then text-to-speech). By skipping the intermediate text representation, these systems target sub-second end-to-end latency suitable for natural conversational turn-taking, including barge-in (the user interrupting mid-response). OpenAI's Realtime API is a leading example; Cartesia's Sonic is often mentioned alongside it for low-latency streaming voice, although Sonic itself is a text-to-speech model rather than a full speech-to-speech system. The trade-offs versus a pipelined architecture are real: tool use, structured output, and inspectable transcripts are more constrained, and the speech-to-speech model is typically distinct from (and often less capable than) the text model in the same vendor's lineup.
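To make barge-in concrete, here is a minimal client-side sketch, assuming the agent's audio arrives in small chunks and that some event signals when the user starts speaking. The class and method names (`PlaybackSession`, `on_agent_audio`, `on_user_speech_started`) are hypothetical and do not correspond to any vendor's API; a real client would also tell the server to cancel the in-flight response.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class PlaybackSession:
    """Hypothetical helper: holds agent audio that is queued but not yet played."""

    queue: Deque[bytes] = field(default_factory=deque)
    interrupted_count: int = 0

    def on_agent_audio(self, chunk: bytes) -> None:
        # Agent audio streams in as small chunks; queue each one for the speaker.
        self.queue.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        # Called by the audio output loop; returns None when nothing is queued.
        return self.queue.popleft() if self.queue else None

    def on_user_speech_started(self) -> int:
        # Barge-in: discard all unplayed agent audio so the agent falls silent.
        dropped = len(self.queue)
        self.queue.clear()
        if dropped:
            self.interrupted_count += 1
        return dropped


session = PlaybackSession()
for chunk in (b"chunk-1", b"chunk-2", b"chunk-3"):
    session.on_agent_audio(chunk)

session.next_chunk()                        # the first chunk reaches the speaker
dropped = session.on_user_speech_started()  # the caller interrupts
print(f"dropped {dropped} unplayed chunks")
```

The point of the sketch is that barge-in is mostly about discarding audio that has been generated but not yet heard; the shorter the chunks, the faster the agent can stop.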

Example

A customer-support voice agent built on the OpenAI Realtime API streams the caller's microphone audio into a single WebSocket connection and streams the agent's spoken response back over the same connection, with a target round-trip latency of a few hundred milliseconds. The same agent built as STT-then-GPT-4-then-TTS would, unless each stage streams into the next, add the latency of all three stages sequentially. Clean barge-in would also be harder, because an interruption has to be coordinated across three separate components.
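The latency argument can be sketched as simple arithmetic. The snippet below assumes a fully sequential pipeline with no streaming overlap between stages; every number in it is an illustrative placeholder, not a measurement of any real system, and the function names are hypothetical.

```python
def pipeline_latency_ms(stt_ms: float, llm_ms: float, tts_ms: float) -> float:
    """Latency of a sequential STT -> LLM -> TTS pipeline with no stage overlap."""
    return stt_ms + llm_ms + tts_ms


def meets_budget(latency_ms: float, budget_ms: float = 1000.0) -> bool:
    """True if the latency fits a sub-second conversational budget."""
    return latency_ms < budget_ms


# Placeholder stage timings chosen only to illustrate the arithmetic.
pipelined = pipeline_latency_ms(stt_ms=300, llm_ms=700, tts_ms=300)
speech_to_speech = 400.0  # single-stage placeholder, "few hundred ms" range

print(pipelined, meets_budget(pipelined))                # 1300 False
print(speech_to_speech, meets_budget(speech_to_speech))  # 400.0 True
```

In practice, pipelined systems narrow this gap by streaming partial results between stages, so the sum is an upper bound for the sequential case rather than a prediction.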
