Question 1

What is Speech to Text (STT)?

Accepted Answer

Speech to text, or STT — also called automatic speech recognition (ASR) — is the transcription of spoken audio into written text.

Question 2

How does Speech to Text (STT) work?

Accepted Answer

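STT is the inverse of text to speech (TTS). A recognition model takes in audio, typically converted into spectrogram-style acoustic features, and decodes it into a sequence of words; Whisper, for example, uses an encoder-decoder Transformer for this step. STT is the first stage of the classical cascaded voice-agent pipeline (STT, then an LLM, then TTS). It is distinct from realtime speech-to-speech architectures, which avoid a separate transcription step and operate audio-in/audio-out, which can reduce latency.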
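To make the latency difference concrete, here is a minimal Python sketch of the cascaded STT, LLM, TTS pipeline. The stage functions `fake_stt`, `fake_llm` and `fake_tts` are hypothetical stubs that sleep to simulate work; a real system would call actual STT, LLM and TTS services instead. Because each stage waits for the previous one to finish, the end-to-end time is the sum of the stage times.

```python
import time
from typing import Callable

# A stage takes the previous stage's output and returns its own.
Stage = Callable[[object], object]


def run_cascade(audio: bytes, stt: Stage, llm: Stage, tts: Stage) -> tuple[object, dict[str, float]]:
    """Run STT -> LLM -> TTS strictly in sequence, timing each stage."""
    timings: dict[str, float] = {}
    value: object = audio
    for name, stage in (("stt", stt), ("llm", llm), ("tts", tts)):
        start = time.perf_counter()
        value = stage(value)  # the next stage cannot start until this returns
        timings[name] = time.perf_counter() - start
    # With no overlap between stages, end-to-end latency is the sum of the parts.
    timings["total"] = sum(timings[k] for k in ("stt", "llm", "tts"))
    return value, timings


# Hypothetical stubs standing in for real services; sleep() simulates work.
def fake_stt(audio: bytes) -> str:
    time.sleep(0.02)
    return "what's the weather"


def fake_llm(text: str) -> str:
    time.sleep(0.03)
    return f"You asked: {text}"


def fake_tts(text: str) -> bytes:
    time.sleep(0.01)
    return text.encode("utf-8")


reply, timings = run_cascade(b"\x00" * 320, fake_stt, fake_llm, fake_tts)
print(reply, f"{timings['total']:.3f}s")
```

The sketch runs the stages strictly one after another. Production voice agents commonly stream partial transcripts and tokens between stages so the work overlaps, which this example does not model.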

Question 3

Can you give an example of Speech to Text (STT)?

Accepted Answer

A meeting-notes product uploads each recorded call to an STT service such as Whisper or Deepgram, receives a timestamped transcript back, then passes the transcript to an LLM for summarization. STT sits at the front of the pipeline: the audio is converted to text before any reasoning happens. The same ordering explains why latency in a cascaded voice agent is the sum of all three stages (STT, LLM, TTS) rather than a single round trip.
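A minimal sketch of the hand-off between the STT and LLM stages in this example, assuming the STT service returns a list of segments with start/end times and text. The `Segment` type and its field names are hypothetical; Whisper and Deepgram each return their own response formats, which you would map into this shape. The code only builds the prompt text; the call to the summarizing LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timestamped piece of a transcript (hypothetical field names)."""
    start: float  # seconds from the start of the recording
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_summary_prompt(segments: list[Segment]) -> str:
    """Turn STT output into the text an LLM summarizer would receive."""
    lines = [f"[{format_timestamp(s.start)}] {s.text.strip()}" for s in segments]
    return "Summarize this meeting transcript:\n" + "\n".join(lines)


# Example transcript, as it might look after mapping an STT response.
segments = [
    Segment(0.0, 4.2, " Let's review the Q3 roadmap."),
    Segment(4.2, 9.8, "Shipping slips two weeks."),
]
print(build_summary_prompt(segments))
```

Keeping the timestamps in the prompt lets the summary point back to where in the call a decision was made.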
