Speech to Text (STT)

Speech to text (STT), also called automatic speech recognition (ASR), is the transcription of spoken audio into written text. Modern STT is dominated by neural systems such as OpenAI's Whisper model and the hosted services from Deepgram and AssemblyAI, which handle accents, background noise, and domain vocabulary with significantly higher accuracy than the hidden Markov model (HMM) systems that preceded them. STT is the inverse of text to speech (TTS) and is the first stage of the classical voice-agent pipeline (STT, then LLM, then TTS). It is distinct from realtime speech-to-speech architectures, which skip the intermediate text representation entirely and operate audio-in/audio-out, typically at lower latency.

Example

A meeting-notes product uploads each recorded call to Whisper or Deepgram, receives a timestamped transcript back, then passes the transcript to an LLM for summarization. STT is the front of the pipeline: the audio is converted to text before any reasoning happens, which is why latency in this architecture is the sum of its stages (STT plus LLM here; STT, LLM, and TTS in a full voice agent) rather than a single round trip.
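
The flow above can be sketched in Python as two sequential stages with per-stage timing. This is a minimal illustration, not any vendor's API: `Segment`, `run_pipeline`, `fake_transcribe`, and `fake_summarize` are hypothetical names, and the two fake functions stand in for a real STT service and a real LLM call, which would be passed in through the same parameters.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Segment:
    """One timestamped piece of a transcript (hypothetical shape)."""
    start: float  # seconds from the start of the recording
    end: float
    text: str


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], List[Segment]],
    summarize: Callable[[str], str],
) -> Tuple[str, Dict[str, float]]:
    """Run STT, then the LLM, and record how long each stage took."""
    timings: Dict[str, float] = {}

    # Stage 1: audio -> timestamped text. Nothing downstream can start
    # until this stage has produced a transcript.
    t0 = time.perf_counter()
    segments = transcribe(audio)
    timings["stt"] = time.perf_counter() - t0

    transcript = "\n".join(
        f"[{s.start:.1f}-{s.end:.1f}] {s.text}" for s in segments
    )

    # Stage 2: text -> summary.
    t1 = time.perf_counter()
    summary = summarize(transcript)
    timings["llm"] = time.perf_counter() - t1

    # The stages run back to back, so their latencies add up.
    timings["total"] = timings["stt"] + timings["llm"]
    return summary, timings


# Stand-ins for real services (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> List[Segment]:
    return [
        Segment(0.0, 2.5, "Let's ship on Friday."),
        Segment(2.5, 4.0, "Agreed."),
    ]


def fake_summarize(transcript: str) -> str:
    return f"Summary of {len(transcript.splitlines())} segments"


if __name__ == "__main__":
    summary, timings = run_pipeline(b"fake-audio", fake_transcribe, fake_summarize)
    print(summary)
    print(timings)
```

Passing the stages in as callables keeps the sketch independent of any particular provider. A voice agent would add a third TTS stage in the same way, and its time would add to the total.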
