Text to Speech (TTS)
Text to speech, or TTS, is the synthesis of spoken audio from written text. It is the inverse of speech-to-text and the older of the two disciplines, with roots in concatenative and parametric synthesis that long predate modern AI. Contemporary TTS is dominated by neural models from vendors such as ElevenLabs, OpenAI, Google, and Cartesia, which produce near-natural prosody and speaker timbre from text alone. TTS is distinct from realtime speech-to-speech architectures: it operates as a one-shot text-in/audio-out call, typically in settings that tolerate higher latency, and is the right primitive for narration, voiceover, audiobook generation, and asynchronous voice output where conversational turn-taking is not required.
Example
A documentation site generates an audio version of every published article by sending the article body to an ElevenLabs TTS endpoint with a chosen voice ID, then storing the returned MP3 alongside the post. The synthesis happens once at publish time, not per request, because TTS is a batch-friendly one-shot call rather than a conversational primitive.
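The publish-time flow above might be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the helper names (`elevenlabs_synthesize`, `publish_audio`), the file-naming scheme, and the content-hash check are assumptions made for the example, and the endpoint URL, `xi-api-key` header, and JSON body reflect the ElevenLabs REST API as commonly documented, so they should be checked against the current API reference before use.

```python
import hashlib
import json
import urllib.request
from pathlib import Path
from typing import Callable

# Assumed endpoint shape; verify against the vendor's current API reference.
ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def elevenlabs_synthesize(text: str, voice_id: str, api_key: str) -> bytes:
    """One-shot call: send text, receive encoded audio bytes (assumed MP3)."""
    request = urllib.request.Request(
        ELEVENLABS_URL.format(voice_id=voice_id),
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def publish_audio(
    slug: str,
    body: str,
    out_dir: Path,
    synthesize: Callable[[str], bytes],
) -> Path:
    """Store an audio file alongside a post, synthesizing at most once per body.

    The file name embeds a hash of the article body (a hypothetical scheme),
    so republishing unchanged text reuses the stored file instead of calling
    the TTS vendor again.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
    mp3_path = out_dir / f"{slug}.{digest}.mp3"
    if not mp3_path.exists():
        mp3_path.write_bytes(synthesize(body))
    return mp3_path
```

A publish hook could bind the voice and key once, for example `functools.partial(elevenlabs_synthesize, voice_id=..., api_key=...)`, and pass the result as `synthesize`. Keeping the vendor call behind a plain callable also makes it easy to swap providers or to substitute a stub in tests.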