
Audio Understanding with Gemini Long-Context: A Walkthrough

Gemini 2.5 Pro takes long-form audio as a native input — meetings, podcasts, calls, lectures — and reasons over it directly. This tutorial walks through the upload flow, prompt anatomy, five shippable patterns, and the failure modes that make audio harder to evaluate than text.

SurePrompts Team
April 23, 2026
21 min read

TL;DR

Gemini 2.5 Pro accepts hours of audio as direct input and reasons over it without an intermediate transcript step. This walkthrough covers the upload flow, the audio-prompt anatomy, five production patterns (meeting, podcast, customer call, earnings call, lecture), timestamp handling, diarization realities, and how to evaluate audio output without confusing fluent prose for accurate analysis.

Key takeaways:

  • Gemini 2.5 Pro takes audio as a native input — no internal transcribe-then-prompt step. The model reasons over the signal directly, so prosody, pauses, and overlapping speech all reach the analysis.
  • Audio prompts have three slots: instruction, audio reference, output shape. The instruction does heavy lifting because the audio is opaque to anyone reading the prompt back later.
  • Five patterns ship reliably: meeting → action items, podcast → chaptered summary, customer call → sentiment + objections, earnings call → financial figures to JSON, lecture → concept-by-concept summary.
  • Timestamps are usable for navigation but not frame-accurate. Native diarization is good enough for clean two-to-three-speaker audio and breaks on heavy crosstalk; preprocess for high-stakes attribution.
  • The dominant failure mode is fluent-but-wrong summaries the reader cannot fact-check without re-listening. Golden-set evaluation against extracted facts is the only durable defense — see llm-as-judge and self-critique.
  • Long-context audio understanding is a different surface from realtime speech-to-speech and from voice generation. Pair this with the GPT-4o Realtime voice walkthrough and the voice generation models comparison — three tools, three jobs.

Why Direct Audio Input Beats Transcribe-then-Prompt

The default architecture for "do something with this audio" used to be a two-step pipeline: send the audio to a speech-to-text API, then send the transcript to a text model. For analysis tasks, that pipeline throws away most of the signal.

Speech is not text with timing metadata. Pace, pitch, pauses, overlapping voices, laughter, and emphasis all carry meaning the words alone do not. An engineer who says "yeah, the deploy went fine" with a four-second pause and a flat tone before "fine" is a different signal from the same sentence said quickly. A customer who says "that's interesting" with rising pitch and a short exhale is signaling skepticism the transcript renders as enthusiasm. A CFO who answers an analyst question with three seconds of silence followed by a hedge is a different data point from the same words delivered immediately.

A transcript flattens all of that. It also makes intractable whole categories of work that are routine when the model has the audio: sentiment on conversational audio, tone on a sales call, polite assent versus genuine agreement.

The trade: direct audio input costs more per second than transcription plus a small text prompt, and it does not give you a clean transcript artifact for compliance archives, captioning, or search indexing. Use transcribe-first when you need the transcript itself. Use direct audio input when the output is analysis and the audio carries information the transcript would lose. The multimodal prompting pillar names this trade; the rest of this walkthrough is the implementation specifics for the Gemini surface.

The 2026 Audio Understanding Surface (Gemini Specifically)

A few things to know before writing prompts.

Native audio input. Gemini 2.5 Pro and 2.5 Flash both accept audio as a first-class modality, sent in the same request as the text instruction. No separate transcription step.

Long-form support. Audio fits inside Gemini's million-token context window using the model's audio tokenization, comfortably covering half-hour standups, hour-long podcast episodes, and multi-hour earnings calls. The exact per-prompt maximum is in the Gemini API reference and shifts with vendor updates; check current docs before architecting around a specific ceiling.

Common formats. mp3, wav, aiff, aac, ogg, flac. Transcode unusual containers with ffmpeg first. Sample rate and channel count rarely matter at the prompting layer.

Multilingual. Gemini handles non-English audio natively and code-switches gracefully. The instruction can be in a different language from the audio — English instruction, Japanese audio, English output — without an intermediate translation step.

Two upload modes. Smaller files inline as base64 in the request. Larger files (and any file you want to reuse) through the Files API, which uploads once and returns a handle you can reference in subsequent prompts. The Files API is the right default for anything beyond a quick test — it separates upload from prompt and lets you re-prompt the same audio with different instructions without re-uploading.
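The inline-versus-Files-API decision can be made mechanically on file size. A sketch, assuming a roughly 20 MB inline ceiling (the commonly documented total-request-size limit; verify against current Gemini docs before relying on the exact number):

```python
import os

# Assumed inline ceiling (~20 MB total request size); check the
# current Gemini docs before relying on this exact number.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024

def upload_mode(path: str) -> str:
    """Pick 'inline' for small clips, 'files_api' for everything else."""
    return "inline" if os.path.getsize(path) < INLINE_LIMIT_BYTES else "files_api"
```

In practice, defaulting everything to the Files API is also fine; the size check only matters when you want to skip the extra upload round-trip for short clips.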

Flash is the cost-efficient sibling for high-volume routine work where Pro is overkill. Same API; switching is a model-name change, not an architectural one.

The Upload-and-Prompt Flow

A minimal end-to-end example using the Python SDK and the Files API. The shape is the same in JavaScript and in raw REST; the SDK is just the most readable.

python
from google import genai
from google.genai import types

client = genai.Client()

# 1. Upload the audio file once.
audio_file = client.files.upload(file="meeting-2026-04-22.mp3")

# 2. Prompt over the uploaded file.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        "You are an analyst summarizing a sprint retrospective. "
        "The recording is a 45-minute meeting with five engineers and "
        "one engineering manager (the EM speaks first). "
        "Output: (1) a structured summary with named sections "
        "(what went well, what didn't, action items with owners), "
        "(2) overall sentiment with one supporting quote, "
        "(3) any segment where the audio was unclear, with timestamps. "
        "Mark inaudible segments as [inaudible] rather than guessing. "
        "Do not invent action items that were not explicitly assigned.",
        audio_file,
    ],
)

print(response.text)

A few things about that shape.

Instruction first, audio reference second. Gemini reads the contents list in order; instruction-first produces the most consistent behavior. The model reads the instruction, then reads the audio with the instruction in mind.

The file handle is reusable. Upload once, prompt many times. The same audio_file handle works for follow-up prompts ("now extract every commitment with a date") without re-uploading — useful when iterating against a fixed recording.

Inline path for short clips. For audio under the inline threshold, skip the Files API:

python
from google.genai import types

with open("clip.mp3", "rb") as f:
    audio_bytes = f.read()

# Reuses `client` from the snippet above. The SDK base64-encodes
# the inline bytes for you; no manual encoding needed.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        "Summarize this voice memo in three bullet points.",
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/mp3"),
    ],
)

The REST shape is the same idea: a multipart/related upload to the Files API, then a generateContent call referencing the file URI. The mechanics are not the interesting part. What you say in the instruction is.

Audio-Prompt Anatomy

A strong audio prompt has three slots.

Instruction. The verb-bearing description of what the model should do — analyze, summarize, extract, classify, transcribe, redact, evaluate — paired with the specific aspects to attend to. "Summarize this call" produces generic prose. "Extract every commitment the customer made, with timestamp, who said it, and the deadline if one was mentioned" produces pipeline output. The instruction does heavy lifting in audio prompting because the audio itself is opaque to anyone reviewing the prompt back later — six months from now the only readable artifact is the instruction, so it has to fully describe what was asked.

Audio reference. The Files API handle or the inline base64 payload. The SDK handles the wiring. One note: when sending multiple audio files in one prompt, label them in the instruction ("Audio 1 is the main interview; audio 2 is a 30-second reference recording for diarization").

Output shape. Exactly what the response should look like. JSON matching a named schema for pipeline work. Markdown with named sections for human consumption. A bulleted list with a maximum count for triage. Vague output asks produce vague output. The shape also doubles as a success criterion — a JSON schema is also a checklist of what the analysis must cover.

An optional fourth slot is success criteria — explicit standards the model can use to self-check. "Mark inaudible segments as [inaudible] rather than guessing." "Do not invent action items that were not explicitly assigned." These are anti-hallucination rails, and they pay off most when the audio is long enough that the reader cannot easily verify the output by re-listening.

This is the voice-prompting discipline at the prompt-anatomy layer. The patterns below fill these slots for the most common audio analysis jobs.
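The slots can be assembled mechanically, which keeps instruction-first ordering and success criteria from being forgotten under deadline. A minimal sketch; `build_audio_prompt` is a hypothetical helper, not part of the SDK:

```python
def build_audio_prompt(instruction, output_shape, success_criteria=(), audio_ref=None):
    """Assemble a contents list from the three prompt slots plus the
    optional success-criteria slot, in instruction-first order.

    `audio_ref` is a Files API handle or an inline Part; it is appended
    after the text so the model reads the instruction first.
    """
    parts = [instruction.strip(), "Output shape:\n" + output_shape.strip()]
    if success_criteria:
        parts.append("Success criteria:\n" + "\n".join(f"- {c}" for c in success_criteria))
    text = "\n\n".join(parts)
    return [text, audio_ref] if audio_ref is not None else [text]

contents = build_audio_prompt(
    instruction="Extract every commitment the customer made.",
    output_shape="JSON list of {who, what, timestamp, deadline|null}",
    success_criteria=["Mark inaudible segments as [inaudible] rather than guessing."],
)
```

The payoff is auditability: six months later, the helper's arguments are a readable record of exactly what was asked of each recording.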

Five Patterns That Ship

Each has a prompt template, a what-works note, and a what-fails note.

Meeting → Action Items + Decisions + Next Steps

The most common audio job: team meetings, internal reviews, planning sessions. A clean record of what was decided and who owns what.

code
You are summarizing a recorded team meeting.

Audio: [meeting recording, ~45 minutes, 5 participants]
Context: This is a weekly product planning meeting. The
participants are the PM (Sarah), two engineers (Kenji, Ana),
the designer (Ravi), and the QA lead (Mei). Use these names
when you can identify the speakers; otherwise label as
Speaker 1 through Speaker 5.

Output a markdown document with these sections:
1. Decisions — every decision made in the meeting, one bullet
   each, with the speaker who proposed it (if identifiable)
2. Action items — every commitment made, with owner, deliverable,
   and deadline if one was named (omit deadline rather than
   guess if not stated)
3. Open questions — anything raised but not resolved
4. Next meeting topics — anything explicitly deferred to a
   follow-up

Do not invent action items that were not explicitly assigned.
If an item was discussed but no owner was named, list it under
"Open questions," not "Action items."

What works: structured sections force the model to separate decisions from discussion, which is the distinction that matters for follow-up. Produces a usable artifact for teams who do not currently take meeting notes.

What fails: when no one clearly assigned ownership for an item, the model occasionally invents a plausible owner. The "do not invent" rail reduces this but does not eliminate it. Spot-check the action items section against the audio for the first dozen meetings before trusting the output downstream.

Podcast → Chaptered Summary with Timestamps

Podcast episodes, interview recordings, long-form conversations. A navigable summary that doubles as a jump-to-the-good-parts table of contents.

code
You are creating chaptered show notes for a podcast episode.

Audio: [podcast episode, ~60 minutes, 2 speakers]
Context: This is an interview-format podcast. The host introduces
themselves at the top; the guest is named in the introduction.
Use their names throughout.

Output a markdown document with:
1. A 2-sentence episode summary at the top
2. A chapter list: 5-10 chapters, each with a starting timestamp
   in mm:ss format, a short title, and a 1-2 sentence description
   of what is discussed in that chapter
3. Three pull-quotes from the guest — verbatim, with timestamp —
   that capture the most distinctive ideas
4. A list of any books, papers, products, or people mentioned
   by name, with the timestamp of first mention

Use mm:ss timestamps. If a quote is approximate (you are not 100%
sure of the exact wording), mark it [paraphrase] instead of
quoting it verbatim.

What works: the chapter list is the most-requested artifact for any podcast workflow, and the model produces it cleanly when the prompt names the format. Timestamps land within a few seconds — the right granularity for navigation.

What fails: pull-quotes are where verbatim drift shows up. The model will sometimes "clean up" a quote into smoother prose than what was said. The [paraphrase] rail helps; for anything reproduced in published show notes, verify against the audio. The named-references list is also where the model occasionally hallucinates a plausible-but-wrong title.

Customer Call → Sentiment + Objections + Commitments

Sales calls, support calls, customer-success check-ins. The structured artifacts underneath: how it went, what the customer pushed back on, what got agreed to.

code
You are analyzing a recorded customer call for a SaaS sales team.

Audio: [customer call, ~25 minutes, 2 speakers]
Context: The first speaker is the account executive (AE); the
second speaker is the customer (a potential buyer evaluating
our product). Use "AE" and "Customer" as labels.

Output JSON matching this shape:
{
  "overall_sentiment": "positive" | "neutral" | "negative",
  "sentiment_evidence": "<one sentence with timestamp>",
  "objections": [
    { "summary": "...", "timestamp": "mm:ss", "ae_response": "..." }
  ],
  "commitments": [
    { "who": "AE" | "Customer", "what": "...", "timestamp": "mm:ss",
      "deadline": "..." | null }
  ],
  "next_steps": ["..."],
  "unclear_segments": [{ "timestamp": "mm:ss", "note": "..." }]
}

Sentiment must be inferred from tone, pace, and word choice —
not from polite phrases alone. A customer who says "that's
interesting" with a flat tone and follows it with a hedge is
neutral or negative, not positive. Mark sentiment as "neutral"
when the signal is mixed; do not force a strong label.

What works: this is where direct audio input most clearly beats transcribe-then-prompt. Sentiment is genuinely better when the model has the audio — polite-but-not-actually-interested is hard to distinguish from genuine interest in a transcript, easy in audio. JSON output makes the result a CRM row, not a wall of prose.

What fails: the model over-confidently classifies mixed signals. The "do not force a strong label" rail helps. Obliquely raised objections (a long pause, a topic-change away from pricing) sometimes get missed; a second pass with a more specific prompt — "list every moment the customer redirected the conversation" — catches what the first pass missed.
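Because the JSON shape doubles as a checklist, it is worth validating mechanically before the result becomes a CRM row. A sketch that mirrors the template above (the helper itself is hypothetical):

```python
ALLOWED_SENTIMENT = {"positive", "neutral", "negative"}
REQUIRED_KEYS = {"overall_sentiment", "sentiment_evidence", "objections",
                 "commitments", "next_steps", "unclear_segments"}

def validate_call_analysis(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape checks out."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS - doc.keys()]
    if doc.get("overall_sentiment") not in ALLOWED_SENTIMENT:
        problems.append("overall_sentiment not in allowed set")
    for c in doc.get("commitments", []):
        if c.get("who") not in {"AE", "Customer"}:
            problems.append(f"bad commitment owner: {c.get('who')!r}")
    return problems
```

Shape validation does not catch wrong content, but it catches the model drifting off-schema (a fourth sentiment label, a commitment attributed to "Sales rep") before downstream code chokes on it.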

Earnings Call → Financial Figures Extracted to JSON

Earnings calls, analyst days, investor presentations. Every figure, every guidance update, every reaffirmed forecast pulled out as structured data.

code
You are extracting financial figures from an earnings call.

Audio: [earnings call recording, ~75 minutes]
Context: This is a public-company quarterly earnings call. The
CFO presents prepared remarks first, then takes analyst Q&A.
Speakers are the CEO, CFO, IR lead, and named analysts.

Output JSON matching this shape:
{
  "company_named_figures": [
    { "metric": "...", "value": "...", "period": "...",
      "speaker": "...", "timestamp": "mm:ss",
      "qualifier": "actual" | "guidance" | "reaffirmed" | null }
  ],
  "guidance_updates": [
    { "metric": "...", "previous": "...", "new": "...",
      "direction": "raised" | "lowered" | "reaffirmed",
      "timestamp": "mm:ss" }
  ],
  "analyst_questions": [
    { "analyst": "...", "firm": "...", "question_summary": "...",
      "answer_summary": "...", "timestamp": "mm:ss" }
  ]
}

Do not include figures you are not confident were stated. If a
number is mumbled or unclear, omit it rather than guess. Do not
extract figures from the prepared remarks slide deck — only from
spoken audio. If the speaker says "approximately" or "around,"
preserve that qualifier in the value field.

What works: structured extraction is reliable when the prompt is specific about what counts. The qualifier field (actual, guidance, reaffirmed) is the most useful column in the output and the one a generic "summarize this earnings call" prompt always omits.

What fails: a wrong number is materially worse than a missing one. The "omit rather than guess" rail is the right default; build a regression test on a known-clean recording before using the output for anything trade-related. This is the pattern most worth pairing with self-critique — a verification pass over each extracted figure reduces the false-positive rate further.
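The regression test can be as simple as comparing extracted figures to a hand-noted golden set, keyed on metric and period. A sketch, assuming the JSON shape above (the matching logic is illustrative):

```python
def check_figures(extracted: list[dict], golden: list[dict]) -> dict:
    """Compare extracted figures to a golden set, keyed on (metric, period).

    'wrong' is the worst case (the model asserted a different value),
    'missing' is a recall miss, 'extra' is a figure the golden notes
    never recorded and a likely hallucination.
    """
    def key(f):
        return (f["metric"].lower(), f["period"].lower())
    got = {key(f): f["value"] for f in extracted}
    want = {key(f): f["value"] for f in golden}
    return {
        "wrong": sorted(k for k in got.keys() & want.keys() if got[k] != want[k]),
        "missing": sorted(want.keys() - got.keys()),
        "extra": sorted(got.keys() - want.keys()),
    }
```

Treat any entry in "wrong" or "extra" as a hard CI failure; "missing" entries are recall losses you can budget for.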

Lecture → Concept-by-Concept Summary with Terminology Gloss

Recorded lectures, conference talks, educational audio. The goal is learning without linear listening, plus vocabulary the learner can look up.

code
You are summarizing a recorded lecture for a learner.

Audio: [lecture recording, ~50 minutes, single speaker]
Context: This is a university lecture on [domain]. The audience
is upper-undergraduate level; assume the listener has the
prerequisites the lecturer assumes.

Output a markdown document with:
1. A 3-sentence summary of the lecture's main argument
2. A concept-by-concept breakdown: 6-12 numbered concepts in
   the order they were introduced, each with a 2-3 sentence
   explanation in the lecturer's framing
3. A terminology gloss: every technical term the lecturer used
   that a learner might not know, with a short definition based
   on how the lecturer used it (not a generic textbook definition)
4. Three review questions a learner could use to self-test
   understanding, with the timestamp of the segment that
   answers each question

If the lecturer named a paper, book, or other reference, list
it separately at the end with the timestamp.

What works: the concept-by-concept structure mirrors how the lecture is taught, making the summary a usable study aid rather than a generic abstract. The terminology gloss in the lecturer's framing — not generic — is what makes this more useful than a transcript would be.

What fails: when the lecturer uses a term in a non-standard way (common in research talks), the model sometimes substitutes the standard definition. The "based on how the lecturer used it" rail helps but does not always hold. Review questions are the weakest output; for higher-quality questions, prompt for a specific style ("questions that require synthesis across two concepts").

Timestamps and Navigation

Timestamps are not free; you have to ask for them. "With timestamp in mm:ss format" or "with hh:mm:ss timestamps" produces them. Without that, the model defaults to prose without timing. The format you ask for is the format you get — mm:ss under an hour, hh:mm:ss above.

Accuracy is good enough for navigation, not good enough for editing. A returned timestamp of 14:32 will land within a few seconds of the actual quote when clicked in a player. For show notes, jump-to-the-good-parts links, and "here is where the customer raised the objection" annotations on a CRM record, that accuracy is sufficient. For frame-accurate alignment (subtitles, audio editing, legal verbatim with timecode), pair Gemini's analysis with a dedicated forced-alignment tool.

Practical pattern for long recordings: ask for timestamps on every extracted artifact, even ones you do not think you will navigate to. The first time a user disputes a model's claim ("did the customer really commit to that?"), the timestamp lets you re-listen to the relevant segment in seconds rather than scrubbing through an hour. This is the observability discipline the agentic prompt stack names — you cannot debug what you have not instrumented.
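Turning returned timestamps into player deep links means converting the mm:ss or hh:mm:ss strings to seconds. A small illustrative helper:

```python
def timestamp_to_seconds(ts: str) -> int:
    """Convert 'mm:ss' or 'hh:mm:ss' to total seconds."""
    parts = [int(p) for p in ts.split(":")]
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected timestamp format: {ts!r}")
    seconds = 0
    for p in parts:
        seconds = seconds * 60 + p
    return seconds
```

`timestamp_to_seconds("14:32")` gives 872, ready for a `?t=872` style player link; given the few-seconds accuracy noted above, subtracting a small buffer before linking is a sensible habit.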

Speaker Diarization in 2026

Diarization is figuring out who said what. Gemini's native handling is good for the easy cases and breaks predictably on the hard ones.

What works natively. Two-to-three-speaker recordings with distinct voices, clean audio, minimal crosstalk. The model labels speakers generically (Speaker 1, Speaker 2) and, if the prompt provides a participant list, attaches roles or names where it can confidently identify them. An interview with a host and a guest, a sales call with an AE and a customer, a meeting with three engineers and a PM — these diarize cleanly enough for the analysis tasks above.

Where it breaks. Five-plus-speaker recordings with similar voices. Heavy crosstalk. Poor-quality conference calls. Long recordings where a speaker drops out for ten minutes and rejoins (the model occasionally re-labels them as a new speaker). High-precision attribution use cases — legal, compliance, podcast credits — where every line matters.

When to preprocess. Run a dedicated diarization model first (pyannote is the open-source default; managed equivalents on the major clouds). The output is a sidecar file with per-segment speaker IDs and timestamps. Reference both audio and speaker map in the Gemini prompt: "Audio is attached. The accompanying JSON lists speaker segments with IDs (S1, S2, ...). Use these IDs in your output."
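Shaping diarizer output into the sidecar file is mechanical. A sketch that maps generic (start, end, speaker) segments, whatever pyannote or a managed service emits, into the JSON referenced in the prompt (the field names here are assumptions, not a standard):

```python
import json

def to_sidecar(segments):
    """Map raw diarizer segments (start_s, end_s, speaker_label) to a
    sidecar JSON with stable S1, S2, ... IDs, assigned in first-seen order."""
    ids = {}
    out = []
    for start, end, label in segments:
        sid = ids.setdefault(label, f"S{len(ids) + 1}")
        out.append({"speaker": sid, "start_s": round(start, 2), "end_s": round(end, 2)})
    return json.dumps({"segments": out}, indent=2)
```

Renaming the diarizer's internal labels to S1, S2, ... in first-seen order keeps the IDs stable across re-runs and matches the labels the prompt tells Gemini to use.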

Long-audio retrieval is the needle-in-a-haystack case — recall on specific facts buried in long recordings degrades the same way it does for long text. Telling the model exactly what speaker and what topic you are looking for outperforms generic summarization.

Honest Evaluation and the Hallucination Trap

The dominant failure mode of audio understanding is fluent-but-wrong output that reads competent and is not. It is worse for audio than for text because the reader cannot easily verify the output by glancing back at the input — checking a claim against a 90-minute earnings call requires re-listening, which nobody does. The model says "the CFO reaffirmed Q4 revenue guidance at $4.2 billion" and the reader believes it. If the actual figure was $4.4 billion, said about Q3, or hedged with "approximately," the error survives until someone with a transcript catches it weeks later.

Three disciplines mitigate this; none is optional for production use.

Golden-set evaluation. Build a small library of recordings (10-30 to start) with manually noted key facts — every decision, every action item, every financial figure, every named entity. Run your prompt over each recording and check the output fact-by-fact against your notes. Track per-claim precision (what fraction of the model's claims were correct?) and recall (what fraction of the facts in your notes did the model surface?). The numbers will be lower than your text-prompt evals; that is the baseline you tune against.
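Per-claim precision and recall reduce to set arithmetic once claims and noted facts are normalized to comparable strings. A sketch (the normalization here, lowercasing plus collapsed whitespace, is a naive assumption; production matching usually needs fuzzier or judge-based comparison):

```python
def normalize(s: str) -> str:
    """Naive normalization: lowercase, collapse whitespace."""
    return " ".join(s.lower().split())

def precision_recall(model_claims, golden_facts):
    """Per-claim precision and recall against a golden set of noted facts."""
    claims = {normalize(c) for c in model_claims}
    facts = {normalize(f) for f in golden_facts}
    hits = claims & facts
    precision = len(hits) / len(claims) if claims else 0.0
    recall = len(hits) / len(facts) if facts else 0.0
    return precision, recall
```

Track both numbers per recording and per pattern; a prompt change that lifts one while dropping the other is exactly the trade-off you want surfaced, not averaged away.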

Regression tests in CI. Once the golden set exists, prompts should regress in CI the same way any other code does. A prompt change that improves average prose at the cost of a 5-point drop in fact recall is a regression to catch before it ships, not after a customer notices. The eval-harness glossary entry covers the broader discipline.

Self-critique passes for high-stakes outputs. For earnings-call extraction or any pattern where a wrong claim is materially worse than a missing one, run a second pass over the model's own output: "here is the audio and here is your extracted JSON; for each figure, verify it was stated by the named speaker at the named timestamp. Mark any unverifiable figure for human review." This is the self-critique pattern applied to audio. The llm-as-judge pattern complements it as a continuous monitor — sample a fraction of production traces and have a separate model grade them against the golden-set rubric, flagging the bottom decile.
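The verification pass is just a second prompt sent alongside the same audio handle, with the first pass's output embedded. A sketch of assembling it (pure string assembly; the wording is illustrative):

```python
import json

def build_verification_prompt(extracted_json: dict) -> str:
    """Second-pass instruction; send with the same audio file handle."""
    return (
        "Here is the audio and here is the JSON previously extracted "
        "from it:\n\n" + json.dumps(extracted_json, indent=2) + "\n\n"
        "For each figure, verify it was stated by the named speaker at "
        "the named timestamp. Return the same JSON with an added "
        '"verified": true|false field per figure. Mark any figure you '
        'cannot verify with "verified": false for human review.'
    )
```

Because the Files API handle is reusable, the verification call costs a second inference pass but no second upload.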

The discipline is the same one the SurePrompts Quality Rubric names for any generative task: do not grade the prose for fluency. Grade the claims for accuracy. Audio understanding is just where the gap between the two is widest, because the input is least inspectable.

When NOT to Use Long-Audio Understanding

Four cases where a different tool wins.

Sub-30-second clips. A short voice memo, a single-question voicemail. A small dedicated transcription model plus a text prompt over the transcript is faster and cheaper. Long-context audio is overkill for anything that fits in a paragraph of text.

Realtime or streaming audio. Voice agents, live captioning with summarization, conversational assistants — anywhere the user expects a partial response while still speaking. That is the Realtime API surface, architecturally distinct from request-response analysis over a complete recording. The GPT-4o Realtime voice walkthrough covers the realtime side; do not bend Gemini long-audio into a use case it is not built for.

High-stakes legal verbatim. Court transcripts, regulatory depositions, medical dictation that becomes part of a patient record. Use a dedicated transcription service with a defensible human-in-the-loop chain. A general-purpose model is the wrong tool when verbatim correctness is a legal requirement.

Pure transcription as the deliverable. If you need a clean text transcript for captioning, archive search, or downstream RAG indexing, use a transcription model. Paying analysis rates for transcription is wasteful. The agentic RAG walkthrough is the right next stop if you are building a system that retrieves over many transcribed audio sources.

The general rule: long-context audio sits in the middle of the duration-and-analysis-depth spectrum. Too long for a small model, too analysis-heavy for a raw transcript, too request-response for realtime. In that middle band — meetings, podcasts, customer calls, earnings calls, lectures — Gemini long-audio is the right tool. Outside it, use the right tool instead.

What's Next

This walkthrough is one of three in the SurePrompts voice and audio cluster, and they are deliberately disjoint: this piece covers long-context analysis over complete recordings, the GPT-4o Realtime voice walkthrough covers realtime speech-to-speech, and the voice generation models comparison covers producing voice output.

If you build anything voice-shaped, you will end up needing more than one of these. The mistake to avoid is bending one into another's job: long-context audio for realtime, realtime for batch analysis, TTS for voice-cloning evaluation. Three tools, three jobs.

For the broader frame, the multimodal AI prompting pillar is the canonical entry point on the input side — the universal prompt anatomy, the per-modality dialects, and the model landscape across images, PDFs, audio, and video. This walkthrough is the audio-input deep-dive that pillar links to.
