Tags: voice generation, TTS, ElevenLabs, OpenAI TTS, Hume AI, Cartesia, PlayHT, voice cloning, comparison

Voice Generation Models Compared (2026): ElevenLabs, OpenAI TTS, Hume, Cartesia, PlayHT

Voice generation in 2026 is no longer a one-vendor question — ElevenLabs, OpenAI TTS, Hume, Cartesia, PlayHT, Gemini TTS, and the open-weights tier each win different shots. This tutorial maps the landscape and gives you a per-shot picking framework.

SurePrompts Team
April 23, 2026
19 min read

TL;DR

Voice generation models in 2026 split sharply by use case — quality leaders, latency leaders, emotion-aware models, long-form narrators, and open-weights options each win different shots. This tutorial walks through the model landscape, the universal voice-prompt anatomy, and a per-shot picking framework for audiobook, realtime support, podcast, and dubbing workloads.

Key takeaways:

  • Voice generation in 2026 is no longer a single-vendor question. ElevenLabs leads on overall quality and cloning, OpenAI leads on instructable voice character, Hume leads on emotion, Cartesia leads on latency, PlayHT leads on conversational long-form, Google's Gemini TTS and NotebookLM cover the Google stack, and the open-weights tier (XTTS-v2, Bark, MeloTTS, Kokoro) covers self-hosted control.
  • Model choice is per-shot, not per-project. Audiobook chapter, realtime customer support, podcast intro, and multilingual dubbing each have different right answers, and committing the whole project to one vendor leaves quality on the table.
  • The universal voice-prompt anatomy is five slots — voice character, tone and emotion, pacing, format-for-speech, and pronunciation overrides. Skipping any of them lets the platform pick a default that almost never matches the brief.
  • Latency, naturalness, and emotion control are three separate axes that do not move together. The realtime tier trades naturalness for speed; the quality tier trades speed for naturalness; the emotion-aware tier optimizes prosody.
  • This tutorial covers voice generation specifically. Audio understanding is the input side, covered in the audio understanding walkthrough, and full speech-to-speech is covered in the GPT-4o realtime walkthrough. The multimodal pillar frames the input side end-to-end.

Voice generation in 2026 stopped being a one-vendor question once three things happened at once: ElevenLabs' quality made cloned voices hard to distinguish from the source on short clips, OpenAI shipped instructable TTS where you can steer the speaker with a prompt, and Cartesia drove time-to-first-byte under 100 milliseconds for realtime applications. None of those moves came from the same team and none of them are interchangeable. A workflow that picks one vendor and stops thinking is leaving real capability on the table.

The honest framing is that voice-generation model choice is per-shot, not per-project. An audiobook chapter, a realtime customer-support agent, a podcast intro, and a multilingual dubbing pass have different right answers — the provider that wins one rarely wins the others. Treating "TTS" as a single category is the same mistake as treating "image generation" as a single category, and the AI image prompting pillar and AI video prompting pillar make the same point on their respective surfaces.

This tutorial maps the 2026 voice-generation landscape, walks through the universal voice-prompt anatomy, and gives you a per-shot picking framework. It pairs with two sister tutorials in the same wave — the realtime walkthrough and the audio-understanding walkthrough — that the closing section bridges to.

What "voice generation" actually covers in 2026

The phrase "voice AI" gets used loosely. Pulling apart what it actually contains:

Text-to-speech (TTS). Text in, audio out. The classical voice-generation problem: you have a script, you want it spoken. Quality is judged on naturalness, prosody, breath placement, and consistency across long inputs.

Voice cloning. A short or long audio sample of a target speaker in, a model that synthesizes new audio in that speaker's voice out. The same TTS interface, but the voice identity is custom. Consent is the load-bearing concern, not the technology.

Conversational voice (speech-to-speech). Audio in, audio out, with model reasoning in between. The user speaks, the model understands, decides, and replies in voice — all in one bidirectional stream. OpenAI's Realtime API is the canonical 2026 implementation; the GPT-4o realtime walkthrough covers it in depth.

Audio understanding. Audio in, text out. Transcription, summarization, sentiment analysis, speaker diarization. This is the input side, not the generation side, and it is covered separately in the audio understanding walkthrough. Multimodal models that take audio as a first-class input — GPT-4o, Gemini 2.5 Pro — sit in this category. The multimodal pillar frames the broader input surface.

Music generation. Adjacent and not covered here. Suno, Udio, and similar models generate vocal music from prompts; the prompting discipline overlaps but the model landscape and evaluation criteria are different enough that they belong in their own treatment.

This tutorial is about the first two — TTS and voice cloning — with brief coverage of where conversational voice fits and pointers to the dedicated tutorials for the input side.

For terminology, the voice-prompting glossary entry covers the prompt-side vocabulary in one place.

The 2026 voice-generation model landscape

| Model | Strength | Latency profile | Voice cloning | Emotion control | Languages | Commercial terms |
|---|---|---|---|---|---|---|
| ElevenLabs Multilingual v2 / v3 | Naturalness, voice identity, cloning quality | Quality tier (offline-friendly) | Yes — instant and professional | Strong; v3 adds tag-based controls | Broad multilingual coverage | Subscription with usage tiers; commercial use included |
| ElevenLabs Turbo v2.5 / Flash v2.5 | Low-latency variants of the same voice library | Interactive (Turbo) and realtime (Flash) | Yes — same cloning library | Reduced vs. quality tier | Same multilingual coverage | Same |
| OpenAI gpt-4o-mini-tts / gpt-4o-tts | Instructable voice character via prompt | Interactive tier | No public custom voices | Strong via free-form instructions | Many languages, English-first | Pay-per-use API |
| OpenAI Realtime API (GPT-4o-realtime) | End-to-end speech-to-speech | Realtime (sub-second TTFB) | No public custom voices | Same instructable surface | English-first, expanding | Pay-per-use API |
| Hume Octave TTS / EVI | Prosodic emotion modeling, expressive TTS | Quality and interactive tiers | Yes | Best-in-class for explicit emotion steering | English-first, expanding | Pay-per-use API |
| Cartesia Sonic | Sub-100ms TTFB latency leader | Realtime tier | Yes | Adequate for realtime use | Multilingual, growing | Pay-per-use API |
| PlayHT PlayDialog / Play 3.0 | Long-form narration, two-voice dialogue | Quality tier | Yes | Adequate; explicit dialogue controls | Multilingual | Subscription and API |
| Google Gemini TTS | Google-stack TTS, broad language coverage | Quality and interactive tiers | Limited | Adequate | Broad multilingual | Google AI Studio / Vertex pricing |
| NotebookLM Audio Overviews | Two-host podcast-style audio from documents | Offline | No | Pre-styled hosts | English-first, expanding | Free in NotebookLM |
| Open-weights (XTTS-v2, Bark, MeloTTS, Kokoro) | Self-hosted, no per-call cost, full control | Varies by hardware | Yes (XTTS-v2 especially) | Limited | Varies by model | Apache / MIT-class licenses; check each |

Two things to read out of the table. First, no single row wins every column — that is what "per-shot picking" means in practice. Second, latency and naturalness are not on the same axis; the same vendor often offers a quality model and a latency model, and the right one is the one that matches the shot.

Per-model personality

ElevenLabs

ElevenLabs is the current quality leader in voice generation, and the gap is widest on cloned voices and on long-form naturalness. The Multilingual v2 line — and the Eleven v3 line where it is available — produces audio where breath placement, prosody, and pacing read as performed rather than synthesized over multi-minute inputs. On short clips of cloned voices, blind A/B tests against the source often come down to context cues rather than acoustic differences.

The lineup splits by latency. Multilingual v2 and v3 sit in the quality tier — best for offline rendering, audiobooks, and anything where time-to-first-byte does not matter. Turbo v2.5 is the interactive sibling with slightly reduced naturalness and meaningfully lower latency. Flash v2.5 is the realtime sibling, designed for sub-second TTFB. The cloning library is shared across all three; the same custom voice is available regardless of which model you call.

The dubbing product runs on the same substrate and is a credible pick for translating long-form video while keeping the original speaker's voice identity across languages. The main tradeoff is cost — ElevenLabs is at the premium end, and high-volume workloads need a deliberate model-tier choice (Flash where Multilingual is overkill) to keep the bill reasonable.
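For orientation, here is a minimal sketch of an ElevenLabs text-to-speech call over the REST API. It assumes you already have an API key and a voice ID (from the library or a clone you own); swapping the model_id moves between the quality and latency tiers without changing anything else in the request.

    import requests

    ELEVEN_API_KEY = "YOUR_API_KEY"     # account key
    VOICE_ID = "your-voice-id"          # hypothetical ID: any library voice or a clone you own

    resp = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
        headers={"xi-api-key": ELEVEN_API_KEY, "Content-Type": "application/json"},
        json={
            "text": "Welcome back. Let's pick up where chapter three left off.",
            # eleven_multilingual_v2 is the quality tier; eleven_turbo_v2_5 or
            # eleven_flash_v2_5 trade some naturalness for latency.
            "model_id": "eleven_multilingual_v2",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
        },
    )
    resp.raise_for_status()

    with open("narration.mp3", "wb") as f:   # default response is MP3 audio bytes
        f.write(resp.content)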

OpenAI TTS (gpt-4o-mini-tts and gpt-4o-tts)

OpenAI's TTS surface is the instructable one. Beyond picking a voice from the named set — alloy, echo, fable, onyx, nova, shimmer, plus the newer named voices — you can prompt for character, tone, and delivery as part of the request: "speak like a calm museum docent", "sound mildly exasperated", "deliver this as a quick news bulletin". The steerability via natural-language instruction is the strongest of any closed provider in 2026.

The interactive tier (gpt-4o-mini-tts) is the right default for most application-layer voice replies. Custom voices are not exposed publicly at the time of writing, which makes OpenAI's TTS a poor pick for branded-voice workflows where the brand owns a specific speaker. The Realtime API (GPT-4o-realtime) is the speech-to-speech endpoint where the same model handles understanding and synthesis in one bidirectional stream — covered in the GPT-4o realtime walkthrough. For TTS-only workloads, the regular gpt-4o-mini-tts endpoint is the entry point.
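As a rough sketch of what the instructable surface looks like in the OpenAI Python SDK (the voice name and instruction text here are examples, not recommendations):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # gpt-4o-mini-tts takes a named voice plus a free-form `instructions` field
    # that steers character, tone, and delivery for this one render.
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        input="Your refund has been processed and should arrive within five business days.",
        instructions="Speak like a calm, reassuring support agent, slightly slower than usual.",
    ) as response:
        response.stream_to_file("reply.mp3")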

Hume AI (Octave TTS and EVI)

Hume's distinguishing feature is explicit prosodic emotion modeling. Octave TTS exposes emotion controls — happiness, sadness, calm, intensity, and more — as steerable parameters rather than something you nudge with prose. For workloads where the emotional register has to land precisely (mental-health applications, expressive narration, character voices with consistent affect), Hume is the model worth evaluating first.

The Empathic Voice Interface (EVI) extends the same emotion-modeling approach to conversational voice — a speech-to-speech endpoint where the model also reads emotion in the user's voice and adjusts accordingly. It sits alongside OpenAI's Realtime API with different strengths: EVI optimizes for emotional fidelity, OpenAI Realtime for general conversational capability backed by GPT-4o's reasoning.

The honest tradeoff is that the explicit emotion surface is genuinely useful when emotion is the load-bearing requirement, but for most workloads ElevenLabs or OpenAI deliver enough emotion through voice choice and prompting that Hume's explicit controls are overkill.

Cartesia (Sonic)

Cartesia is the latency leader. Sonic targets time-to-first-byte under 100 milliseconds — fast enough that the model's response can begin before the user has finished hearing their own last word in a turn-taking conversation. The architectural choice that enables this is a state-space-model approach rather than the autoregressive transformer that dominates the rest of the field; the practical consequence is that Sonic feels different in interactive use than the rest of the lineup.

The shot Cartesia wins is realtime conversational voice where you bring your own STT and LLM. ElevenLabs Flash v2.5 sits in the same lane from the quality side; the choice between them often comes down to voice library and pricing rather than raw speed. Naturalness on Sonic is good — not at the level of ElevenLabs Multilingual v2 for long-form, but more than adequate for the realtime applications it targets. The mismatch would be using Sonic for an audiobook, where you do not need its latency and you do want the quality tier's prosody.

PlayHT (PlayDialog and Play 3.0)

PlayHT's strength is long-form narration and conversational dialogue between two voices. PlayDialog is built specifically around the two-speaker case — useful for podcast-style scripts, dramatized passages in audiobooks, or any workflow where a single render needs to alternate between speakers naturally. Play 3.0 is the general-purpose long-form model. The voice library is large and includes voices that work well for narration registers; cloning is supported with workflows comparable to ElevenLabs.

Where PlayHT slots in: when the brief is "narrate this script with two voices in dialogue" or "produce a 30-minute long-form piece that feels like a podcast segment", PlayDialog is the model worth trying first. For standard single-narrator long-form, ElevenLabs Multilingual v2 is a stronger default; PlayHT becomes the right call when its specific voice library or the dialogue feature wins the brief.

Google Gemini TTS and NotebookLM Audio Overviews

Google's TTS surface has two distinct entry points that often get conflated.

Gemini TTS (via Google AI Studio and Vertex AI) is the general-purpose API endpoint. Quality is competitive with the rest of the closed-provider field; language coverage is broad; integration is the natural pick for anything already running on the Google stack. It does not lead on any single axis but it is a credible default for Google-native workflows.
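A minimal sketch with the google-genai SDK, assuming the preview TTS model name and the "Kore" prebuilt voice that Google documents at the time of writing; verify both against the current model list before relying on them.

    import wave
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_API_KEY")

    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",   # assumed preview model name; check the current list
        contents="Welcome to the Tuesday briefing. Three stories worth your time today.",
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
        ),
    )

    # The API returns raw 16-bit PCM at 24 kHz; wrap it in a WAV container to play it.
    pcm = response.candidates[0].content.parts[0].inline_data.data
    with wave.open("briefing.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)
        wf.writeframes(pcm)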

NotebookLM Audio Overviews is a different product entirely. Upload documents, sources, or notes, and NotebookLM generates a two-host podcast-style audio summary — two distinct voices in conversation about the source material, with intonation, banter, and pacing that read as edited radio. The voices and format are pre-styled rather than configurable. The shot it wins is "I have a stack of documents and I want a 10-minute audio briefing on them I can listen to during a commute". For that shot, nothing else in the landscape currently matches it. For traditional script-driven TTS, NotebookLM is not the right tool.

Open-weights and self-hosted options

The open-weights tier is meaningful in 2026 even though quality trails the closed leaders. Four models worth knowing:

XTTS-v2 (Coqui). Multilingual TTS with voice cloning from a short sample. Strongest quality in the open-weights tier and the closest open-source analogue to ElevenLabs' instant cloning. Self-hostable on a single GPU.

Bark (Suno). Generates more than just speech — handles music, sound effects, and non-verbal vocalizations from prompts. Plain-TTS quality trails XTTS-v2, but the broader generation surface is unique in this tier.

MeloTTS. Lightweight multilingual TTS optimized for speed and CPU inference. Lower quality than XTTS-v2, but the right pick for self-hosted realtime where GPU is not available.

Kokoro. A more recent open-weights TTS model with surprisingly competitive quality given its small size. Worth evaluating for self-hosted deployments where model footprint matters.

Reasons to pick open-weights: data residency requirements that forbid sending audio off-prem, per-call cost economics that break at very high volume, and latency where network round-trip to a hosted API is the bottleneck. Reasons not to: quality genuinely trails the closed leaders on long-form naturalness and cloned-voice fidelity. Match tier to requirement honestly rather than picking open or closed on principle.
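Of the four, XTTS-v2 is the one most teams try first. A minimal Coqui TTS sketch, assuming a single-GPU box and a reference clip you have consent to clone:

    # pip install TTS   (Coqui TTS; downloads the XTTS-v2 weights on first run)
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    tts.tts_to_file(
        text="This take never leaves our own hardware.",
        speaker_wav="consented_reference.wav",  # short sample of the consenting target speaker
        language="en",
        file_path="out.wav",
    )

As the table notes, check each model's license before shipping commercially; the open-weights tier is not uniformly permissive.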

The universal voice-prompt anatomy

Five slots that carry across providers. A voice prompt that omits any of them falls back to a platform default that almost never matches the brief.

1. Voice character — the speaker. Who is speaking? On ElevenLabs and PlayHT this is a voice ID from the library or a cloned custom voice. On OpenAI it is a named voice (alloy, nova, shimmer, etc.) plus an instruction about the speaker's character ("a calm museum docent in her fifties"). On Hume it is a voice plus a baseline emotional state. The voice is the single biggest determinant of how the output lands; treat picking it as a real decision, not a default.

2. Tone and emotion. What emotional register should the speaker be in for this specific render? Calm, urgent, warm, exasperated, encouraging. On Hume this is a parameter. On OpenAI it is part of the instruction prompt. On ElevenLabs v3 it is tag-based controls inside the script. On platforms without an explicit emotion control, this slot lives in the script itself — punctuation, sentence length, and word choice steer prosody indirectly. The rule of thumb: if you do not specify, you get the voice's default register, which is usually neutral-professional.

3. Pacing. How fast should the speaker talk, where should they pause, and where should breaks fall? On most platforms pacing comes from punctuation, line breaks, and SSML or platform-specific pause tags. A script written for the page does not pace correctly for the ear — sentences are too long, paragraphs run together, and natural breath points are missing. Format the script for the speaker, not the reader.

4. Format-for-speech. TTS models read the script literally. Markdown formatting, bullet lists, code blocks, and inline citations either render badly (the model speaks the asterisks) or get misinterpreted. Strip formatting before sending. Convert lists to spoken prose ("first... second... third..."), expand abbreviations the model might mispronounce, and write numbers as the speaker should say them ($1.2M as "one point two million dollars" rather than as the literal characters). For conversational use, write in short turns — long monologues sound unnatural in a back-and-forth context. A preprocessing sketch after this list shows one way to automate this pass.

5. Pronunciation overrides. Proper nouns, technical terms, and uncommon words are the most common source of TTS errors. Most platforms support some form of override — SSML <phoneme> tags with IPA notation, platform-specific phonetic spellings, or a pronunciation dictionary attached to the request. Override the words you know the model will get wrong before you render. Auditing the output for mispronunciations and patching them in a dictionary is faster than re-rendering whole takes.
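A minimal preprocessing sketch that covers slots four and five: strip markup, expand a money figure, and apply a small pronunciation dictionary before the script goes to any TTS API. The helper name and the phonetic respellings are illustrative, not canonical.

    import re

    # Illustrative per-project pronunciation dictionary; SSML <phoneme> tags with
    # IPA are the stricter option on platforms that support them.
    PRONUNCIATIONS = {
        "iOS": "eye oh ess",
        "SQL": "sequel",
    }

    def prepare_for_speech(script: str) -> str:
        text = re.sub(r"[*_`#>]+", "", script)       # strip markdown the voice would read aloud
        text = re.sub(                               # "$1.2M" -> "1.2 million dollars"
            r"\$(\d+(?:\.\d+)?)M\b",
            lambda m: f"{m.group(1)} million dollars",
            text,
        )
        for word, spoken in PRONUNCIATIONS.items():  # whole-word pronunciation overrides
            text = re.sub(rf"\b{re.escape(word)}\b", spoken, text)
        return text

    print(prepare_for_speech("**Update:** revenue hit $1.2M after the iOS launch."))
    # -> Update: revenue hit 1.2 million dollars after the eye oh ess launch.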

The five slots are the same shape as the universal multimodal anatomy and the same shape as the prompt frameworks the RCAF prompt structure and SurePrompts Quality Rubric describe for text. A voice prompt is a prompt with one extra dimension — the audible delivery — and the discipline that makes text prompts good makes voice prompts good.

Picking the right model for the shot

Five common shots and the right starting point for each.

Short-form social — TikTok, Reels, Shorts narration. Quality matters more than latency because rendering is offline. ElevenLabs Multilingual v2 with a voice that fits the brand register is the default; a cloned voice on the same model for a brand voice you own. OpenAI gpt-4o-mini-tts with a strong character instruction is a credible alternative when the brief calls for an explicitly directed delivery.

Realtime customer support. Latency matters more than naturalness because users will not wait. Cartesia Sonic for TTS-only realtime where you bring your own STT and LLM. ElevenLabs Flash v2.5 if you want the same voice library as your offline workloads. OpenAI's Realtime API for full speech-to-speech where the model also handles understanding — covered in the GPT-4o realtime walkthrough. Hume EVI when emotional fidelity in both directions is the load-bearing requirement.

Long-form audiobook. Quality dominates. ElevenLabs Multilingual v2 with a cloned narrator voice is the default. PlayHT Play 3.0 if its specific voice library wins the brief or you need PlayDialog's two-voice support. Avoid latency-tier models — you do not need their speed and they cost you naturalness.

Multilingual dubbing. Voice identity across languages matters. ElevenLabs' dubbing product (built on Multilingual v2) is the strongest pick for keeping a single speaker across languages. For Google-stack workflows, Gemini TTS is the natural alternative. For very-high-volume workflows where per-minute economics matter, the open-weights tier (XTTS-v2 with cloned voice) is worth evaluating, accepting the quality tradeoff.

Document-to-podcast briefings. NotebookLM Audio Overviews wins this shot outright. Upload the source material, generate a two-host podcast summary, ship. The constraint is that you take NotebookLM's pre-styled hosts as given.

The general rule: pick per-shot. A workflow using ElevenLabs for long-form narration, Cartesia for the realtime support agent, OpenAI for in-app voice replies, NotebookLM for briefings, and XTTS-v2 for high-volume bulk dubs is not over-engineered — it is appropriately matched.
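If the routing lives in code, the per-shot rule can be as plain as a lookup table. The shot names and model identifiers below are illustrative labels for this section's defaults, not vendor-official strings.

    # Illustrative per-shot defaults; every entry mirrors the recommendations above.
    DEFAULT_MODEL_BY_SHOT = {
        "short_form_social": "elevenlabs/eleven_multilingual_v2",
        "realtime_support": "cartesia/sonic",
        "audiobook_chapter": "elevenlabs/eleven_multilingual_v2",
        "multilingual_dub": "elevenlabs/dubbing",
        "document_briefing": "notebooklm/audio_overviews",
    }

    def pick_model(shot: str) -> str:
        # Fall back to the instructable OpenAI tier for shots the table does not cover.
        return DEFAULT_MODEL_BY_SHOT.get(shot, "openai/gpt-4o-mini-tts")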

Honest evaluation

Voice generation is harder to evaluate than text. The output is acoustic, evaluation is at least partly subjective, and the failure modes are sneakier than text failures because the script looks fine on the page even when the audio does not land.

Listening tests. Subjective evaluation is unavoidable. For production-grade voice work, run blind A/B tests where listeners hear takes from two or three providers without knowing which is which, rating on naturalness, character fit, and pleasantness. Five to ten listeners is enough to surface the strongest preferences. Do this on real scripts from your actual workload, not on demo content the providers chose.
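A small sketch of the setup step, assuming you have already rendered the same script with each provider: copy the takes to anonymized filenames so raters never see the vendor, and keep the answer key with the test runner.

    import pathlib
    import random
    import shutil

    # Takes of the same script, one per provider (paths are illustrative).
    takes = {
        "elevenlabs": "takes/chapter1_elevenlabs.mp3",
        "openai": "takes/chapter1_openai.mp3",
        "playht": "takes/chapter1_playht.mp3",
    }

    providers = list(takes)
    random.shuffle(providers)

    pathlib.Path("blind").mkdir(exist_ok=True)
    answer_key = {}
    for label, provider in zip("ABC", providers):
        shutil.copy(takes[provider], f"blind/take_{label}.mp3")
        answer_key[label] = provider

    # Raters score blind/take_A.mp3 etc. on naturalness, character fit, pleasantness.
    print("Answer key (reveal only after scores are in):", answer_key)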

Naturalness vs. accuracy. A model can sound natural and still be wrong. The script said "Dr. Lee" and the model speaks "Doctor Lee" when the brand convention is "Doctor L-E-E"; the script said "iOS" and the model speaks "ee-ohs" instead of "eye-O-S". Audit the audio against the script word by word on the first few takes; pronunciation errors are systematic and patching them once via the pronunciation dictionary fixes them forever.

Hallucinated pronunciation. TTS models sometimes invent pronunciations for words they have never seen — proper nouns, technical jargon, made-up product names. The output is fluent and confident, which means a casual listen will not catch it. Build a list of brand-critical and domain-critical words, render them once across providers, and confirm each one before scaling.

Voice consistency across long sessions. On multi-chapter audiobooks or multi-segment narration, voice identity can drift. The cloned narrator at chapter one and chapter twelve may not sound like the same person to an attentive listener. Sample takes from the start, middle, and end of long projects; if the model supports it, render in segments with consistent context rather than as one giant input.

The discipline that catches all of this is what the SurePrompts Quality Rubric describes for text outputs and the Context Engineering Maturity Model describes for production systems — explicit success criteria, evaluation on real workload not demos, observed quality over time rather than assumed quality at launch. Audio is harder to instrument than text, but the principle is the same.

What's next

This tutorial is the model landscape. Two sister tutorials in the same wave cover the adjacent surfaces: the GPT-4o realtime walkthrough covers conversational speech-to-speech, and the audio understanding walkthrough covers the input side.

For the broader input surface across modalities, the multimodal pillar is the canonical reference. The agentic prompt stack covers how voice surfaces compose into larger systems. The RCAF prompt structure and SurePrompts Quality Rubric are the framework canonicals that the prompt-anatomy section builds on.

A consolidated voice and audio pillar is in the editorial calendar; this tutorial and the two sister walkthroughs are its source material.

Browse ChatGPT Prompts