Prosody
Prosody is the rhythm, stress, intonation, and pacing of speech: the suprasegmental layer above individual phonemes that carries emotion, emphasis, the distinction between question and statement, and conversational intent. It is the dimension on which modern neural TTS most clearly distinguishes itself from older concatenative or parametric systems, which produced intelligible but flat, robotic-sounding output. Vendors approach prosody differently: ElevenLabs and Cartesia infer it largely from the text and the voice model, while Hume explicitly optimizes for prosodic emotion as a first-class signal. Prosody is also what voice cloning struggles with most: timbre transfers from a short sample, but a speaker's characteristic phrasing rhythm often does not.
Example
The same sentence, "I didn't say she stole the money", carries seven different meanings depending on which of its seven words is stressed. A TTS system with strong prosodic control can place that stress correctly when given SSML hints or context; an older flat-prosody system reads the sentence with uniform emphasis and loses the intended meaning.
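As a rough illustration of the SSML hints mentioned above, the sketch below generates the seven stress variants of the sentence by wrapping one word at a time in the standard SSML `<emphasis>` element. The function name `stress_variants` is hypothetical, and whether a given TTS engine honors `<emphasis>` (or accepts SSML at all) is an assumption to check against that vendor's documentation.

```python
from xml.sax.saxutils import escape


def stress_variants(sentence: str) -> list[str]:
    """Return one SSML document per word, each stressing a different word.

    Hypothetical helper for illustration; it only builds markup and does
    not call any TTS service.
    """
    words = sentence.split()
    variants = []
    for stressed in range(len(words)):
        parts = [
            # Wrap only the word at the stressed position; escape XML specials.
            f'<emphasis level="strong">{escape(word)}</emphasis>'
            if index == stressed
            else escape(word)
            for index, word in enumerate(words)
        ]
        variants.append("<speak>" + " ".join(parts) + "</speak>")
    return variants


variants = stress_variants("I didn't say she stole the money")
for ssml in variants:
    print(ssml)
```

Each resulting string could be sent to an engine that accepts SSML input. Comparing the rendered audio across the variants is one simple way to hear how much prosodic control a given system actually offers.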