Most people still prompt AI with text only. They type questions, paste paragraphs, maybe format a system prompt. Meanwhile, the best models in 2026 can see photographs, read PDFs, listen to audio, and watch video. If you're only sending text, you're leaving the most powerful capabilities of GPT-4o, Claude, and Gemini completely untouched.
What Is Multimodal Prompting?
Multimodal prompting means giving an AI model more than one type of input in a single interaction. Instead of describing an image with words and asking the model to imagine it, you attach the image directly. Instead of transcribing a meeting and pasting the transcript, you upload the audio file. Instead of summarizing a video yourself, you give the model the video and ask it to do the work.
This is a fundamental shift in how prompting works. Traditional prompt engineering is about crafting the perfect text instruction. Multimodal prompting adds a new dimension: choosing the right combination of input types to get the best result.
Info
Multimodal prompting is the practice of combining multiple input types — text, images, documents, audio, or video — in a single AI prompt. The text portion provides instructions and context, while the non-text inputs supply the raw material the model should analyze, interpret, or act upon.
The practical impact is significant. Consider a real estate agent who wants AI to write a property listing. With text-only prompting, they have to describe every room, every feature, every angle. With multimodal prompting, they upload 10 photos of the property and say "Write a compelling listing based on these photos." The model sees the granite countertops, the natural light, the backyard view — details the agent might forget to mention.
Which Models Support What?
Not every model handles every modality. Here's where things stand in 2026:
GPT-4o accepts text, images, and audio as inputs. It can process screenshots, photographs, diagrams, and charts. Its native audio input allows for voice-based interactions and audio analysis. It can also generate images directly within conversations.
Claude (Opus, Sonnet, and Haiku) accepts text, images, and PDF documents. Claude's PDF support is particularly strong — it can read multi-page documents, understand formatting and tables, and reason about document structure in ways that other models struggle with.
Gemini (2.5 Pro and Flash) has the broadest multimodal support. It accepts text, images, audio, and video. Gemini can process video files directly, analyzing visual content, speech, and on-screen text simultaneously. Its long context window (up to 1 million tokens) makes it especially suited for analyzing lengthy videos or large document collections.
| Capability | GPT-4o | Claude (Opus/Sonnet) | Gemini 2.5 Pro |
|---|---|---|---|
| Text input | Yes | Yes | Yes |
| Image input | Yes | Yes | Yes |
| PDF/document input | Via image/text | Native PDF support | Yes |
| Audio input | Yes | No | Yes |
| Video input | No | No | Yes |
| Image generation | Yes | No | Yes |
| Long context for multimodal | 128K tokens | 200K tokens | 1M tokens |
Image + Text Prompting
Image-plus-text is the most widely used form of multimodal prompting, and it's where you'll likely start. Every major model supports it, and the use cases are immediately practical.
The core principle: your text prompt tells the model what to do with the image. The image provides the raw visual information. Neither is useful without the other — an image without instructions just gets a generic description, and instructions without an image force the model to guess.
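Under the hood, pairing an instruction with an image usually means base64-encoding the file and sending both in one structured message. As a rough sketch in the OpenAI Python SDK's Chat Completions message format (the helper name is ours, and the payload shape should be checked against the current API docs for your SDK version):

```python
import base64

def build_image_prompt(image_path: str, instruction: str) -> list[dict]:
    """Pair one image with a text instruction in Chat Completions message format."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            # The MIME type in the data URL should match the actual file type.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```

The resulting list would then be passed as `messages` to something like `client.chat.completions.create(model="gpt-4o", messages=...)`.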
How to Structure Image Analysis Prompts
A strong image analysis prompt has three parts:
- Role or context — What perspective should the model take?
- The image — Attached directly to the conversation.
- Specific task — What exactly should the model do with the image?
Here's the difference between a weak and strong approach:
Weak prompt (with attached image):
What do you see in this image?
Strong prompt (with attached image):
You are a senior UI/UX designer reviewing a mobile app screenshot.
Analyze this screenshot and provide:
1. Three specific usability issues, ranked by severity
2. Whether the visual hierarchy guides the user toward the primary action
3. Accessibility concerns (contrast ratios, touch target sizes, text readability)
4. One concrete redesign suggestion with reasoning
The second prompt produces actionable feedback because it defines a role, specifies the output format, and tells the model exactly what aspects to evaluate.
Screenshot Analysis for UI/UX Review
One of the highest-value applications of image prompting is interface review. Designers and product managers can get instant feedback on screenshots without waiting for a formal design review.
I'm attaching a screenshot of our checkout page. Act as a conversion
rate optimization specialist.
Identify:
- Any points of friction that might cause cart abandonment
- Whether the trust signals (security badges, guarantees) are
positioned effectively
- If the call-to-action button has sufficient visual weight
- How the mobile experience would differ from what I'm showing
(this is the desktop version)
Prioritize your findings by estimated impact on conversion rate.
Document and Receipt Extraction
Multimodal models can read text from images — receipts, invoices, business cards, handwritten notes. The key is telling the model what data you need extracted.
This is a photo of a restaurant receipt. Extract the following into
a structured JSON format:
- Restaurant name
- Date and time
- Each line item (name, quantity, price)
- Subtotal, tax, and total
- Tip amount (if visible)
- Payment method
If any field is unclear or partially obscured, note it as
"unclear" rather than guessing.
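The model's reply still arrives as text, often wrapped in a markdown fence, so it pays to parse it defensively before feeding it into a pipeline. A minimal sketch (the "unclear" convention mirrors the prompt above; adapt the field handling to your own schema):

```python
import json

def parse_receipt_response(raw: str) -> dict:
    """Parse the model's JSON reply, tolerating a surrounding ```json fence."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (with its optional language tag)
        # and the closing fence.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    data = json.loads(text)
    # Surface any fields the model flagged as unreadable for manual review.
    unclear = [k for k, v in data.items() if v == "unclear"]
    return {"data": data, "needs_review": unclear}
```

If `json.loads` raises, the simplest recovery is to re-prompt the model with the error message and ask for corrected JSON.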
Diagram and Chart Interpretation
Models can read charts, graphs, flowcharts, and technical diagrams — but they need guidance on what level of analysis you want.
I'm attaching a bar chart showing our quarterly revenue by product
line for 2025.
Provide:
1. A plain-English summary of the overall trend
2. Which product line grew fastest (percentage, not just absolute)
3. Any anomalies or surprising patterns
4. Three questions a CFO would ask based on this data
Photo Analysis for Real Estate, Ecommerce, and More
Industry-specific image analysis is where multimodal prompting becomes a serious productivity tool.
Real estate:
I'm uploading 5 photos of a property listing. Write a compelling
real estate listing description that:
- Highlights the most visually striking features from the photos
- Mentions specific materials and finishes you can identify
- Describes the natural light quality
- Notes the approximate room sizes based on visual cues
- Uses language appropriate for a luxury market listing
- Keeps the description under 200 words
Ecommerce product listing:
Analyze these 3 product photos of a leather messenger bag.
Create an ecommerce product description that includes:
- Key features visible in the photos (stitching, hardware, pockets)
- Material quality assessment based on visual appearance
- Suggested product title (under 80 characters)
- 5 bullet points for the product highlights section
- A paragraph for the "Product Details" tab
Do not invent features you can't see. If you're unsure about a
material or feature, use language like "appears to be" rather than
stating definitively.
Tip
When uploading multiple images for analysis, number them or describe them in your prompt. "In image 1 (the kitchen)..." helps the model reference specific photos accurately and prevents confusion about which image you're asking about.
Comparing Multiple Images
One of the most underused multimodal techniques is asking the model to compare two or more images:
I'm attaching two versions of our landing page — the current design
(image 1) and the proposed redesign (image 2).
Compare them on:
- Visual hierarchy and where the eye is drawn first
- Use of whitespace
- Typography choices
- Color contrast and accessibility
- Which version better communicates our value proposition
(we sell project management software)
Be specific — point to exact elements in each design, not general
principles.
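A comparison request like the one above can be assembled programmatically by interleaving numbered text labels with the images, so "image 1" and "image 2" are unambiguous. A hedged sketch in the Chat Completions content-part format (helper name is illustrative; verify the payload shape against your SDK's docs):

```python
import base64

def build_comparison_prompt(image_paths: list[str], instruction: str) -> list[dict]:
    """Interleave numbered labels with images so the model can reference
    'image 1', 'image 2', and so on without ambiguity."""
    content: list[dict] = [{"type": "text", "text": instruction}]
    for i, path in enumerate(image_paths, start=1):
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({"type": "text", "text": f"Image {i}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]
```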
Document + Text Prompting
Document analysis is where Claude currently stands out. While you can screenshot a PDF and upload it as an image to any model, Claude's native PDF processing understands document structure — headers, tables, page numbers, footnotes — in ways that screenshot-based approaches miss.
PDF Analysis With Claude
Claude can process PDFs up to hundreds of pages, understanding layout, extracting data from tables, and reasoning across sections. The key is being specific about what you need.
I'm uploading a 45-page quarterly earnings report.
Provide:
1. A 3-sentence executive summary
2. Revenue and profit figures compared to the same quarter last year
3. Any forward-looking statements or guidance changes
4. The three most significant risks mentioned in the report
5. Data from the financial tables formatted as markdown tables
Focus on what's changed from previous quarters, not on boilerplate
language.
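Via the API, a PDF travels as a `document` content block next to the text task in Claude's Messages format. A rough sketch with the Anthropic Python SDK's request shape (the model ID is an example and the helper name is ours; check the current documentation before relying on either):

```python
import base64

def build_pdf_request(pdf_path: str, instruction: str) -> dict:
    """Assemble an Anthropic Messages API request pairing a base64 PDF
    with a text instruction."""
    with open(pdf_path, "rb") as f:
        pdf_b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": "claude-sonnet-4-5",  # example model ID -- check current docs
        "max_tokens": 2048,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": pdf_b64}},
                {"type": "text", "text": instruction},
            ],
        }],
    }
```

The dict would then be sent with something like `anthropic.Anthropic().messages.create(**build_pdf_request(...))`.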
Extracting Data From Scanned Documents
Scanned documents — old contracts, hand-filled forms, faxed pages — are a common pain point. Multimodal models handle these well, but you need to account for image quality issues.
This is a scanned copy of a signed contract from 2019. The scan
quality is medium — some text may be slightly blurry.
Extract:
- All party names and their roles (buyer, seller, guarantor, etc.)
- Key dates (signing date, effective date, expiration date)
- Payment terms and amounts
- Any termination or renewal clauses
- Signature names (if legible)
For any text you can't read clearly, write "[illegible]" and
describe what section it appears in so we know what to check
manually.
Comparing Multiple Documents
This is a powerful but often overlooked use case. Upload two or three related documents and ask the model to find differences or conflicts.
I'm uploading two versions of our Terms of Service — the current
version (document 1) and the proposed revision (document 2).
Identify:
- Every substantive change (not just formatting or rewording)
- Any new clauses added
- Any clauses removed
- Changes that affect user rights or obligations
- Changes that affect our liability
Present the differences in a table with columns: Section, Current
Language, Proposed Language, Impact Assessment.
Legal Document Review
While AI should never replace legal counsel, it can dramatically speed up initial document review.
I'm uploading a commercial lease agreement. I am not a lawyer —
I need you to help me understand this document, not provide legal
advice.
Summarize:
- Lease term and renewal options
- Monthly rent and how it escalates
- Who pays for what (maintenance, insurance, taxes, utilities)
- Restrictions on use, subleasing, or modifications
- Early termination conditions and penalties
- Anything unusual or non-standard that I should ask a lawyer about
Use plain English. Flag any clauses that are unusually one-sided.
Academic Paper Analysis
Researchers and students can use document prompting to quickly understand dense papers.
I'm uploading an academic paper on transformer architectures.
Provide:
1. The core research question in one sentence
2. The key finding or contribution
3. The methodology used (simplified for a non-expert)
4. How the results compare to the baselines mentioned
5. The three most important limitations acknowledged by the authors
6. Two follow-up research questions this paper suggests
Write at an undergraduate level — avoid jargon where possible,
and define technical terms where you must use them.
Warning
When uploading sensitive documents like contracts, financial reports, or medical records, always check the AI provider's data handling policy. Most providers don't train on API inputs, but consumer chat interfaces may have different policies. For sensitive documents, use the API directly or check the provider's enterprise privacy terms.
Audio + Text Prompting
Audio input is currently supported by GPT-4o and Gemini. This modality is less mature than image understanding, but the use cases are already compelling.
Transcription With Context
The simplest audio use case is transcription — but multimodal prompting lets you go far beyond raw transcription by adding context.
I'm uploading an audio recording of a client meeting (approximately
20 minutes).
Provide:
1. A clean transcript with speaker labels (Speaker 1, Speaker 2, etc.)
2. A bullet-point summary of key decisions made
3. A list of action items with who is responsible (if mentioned)
4. Any unresolved questions or disagreements noted
5. The overall sentiment — was this a positive, neutral, or
contentious meeting?
If any portion of the audio is unclear, mark it as [inaudible]
rather than guessing.
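Programmatically, audio rides along in the same message structure as images. A hedged sketch in the Chat Completions format, where the clip is sent as a base64 `input_audio` content part (audio-capable model names and supported formats vary, so verify both against the current OpenAI docs):

```python
import base64

def build_audio_prompt(audio_path: str, instruction: str,
                       fmt: str = "wav") -> list[dict]:
    """Pair a base64-encoded audio clip with a text instruction
    for an audio-capable model."""
    with open(audio_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instruction},
            # `format` must match the actual encoding of the file.
            {"type": "input_audio",
             "input_audio": {"data": b64, "format": fmt}},
        ],
    }]
```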
Meeting Note Analysis
When you have multiple meeting recordings, you can use audio prompting to track threads across conversations.
I'm uploading an audio recording of today's sprint retrospective.
Extract:
- What went well (specific items mentioned by team members)
- What didn't go well
- Process improvements suggested
- Any interpersonal dynamics worth noting (frustration,
enthusiasm, disengagement)
- Comparison to the themes from last sprint's retro (which focused
on deployment delays and testing gaps — has progress been made?)
Format as a standard retro document I can share with the team.
Podcast Summarization
Audio prompting is particularly useful for podcast consumption — getting key insights without listening to a full episode.
I'm uploading a 45-minute podcast episode about AI regulation
in the EU.
Provide:
1. A 3-sentence summary of the episode
2. The main arguments made by each speaker
3. Any specific regulations, dates, or deadlines mentioned
4. Direct quotes that capture the most important points
(with approximate timestamps if possible)
5. Whether the hosts reached a consensus or disagreed
I'm preparing for a presentation on this topic, so prioritize
actionable information over background context.
Voice-Based Workflows
Audio input enables new interaction patterns — dictating complex instructions, providing verbal context, or conducting voice-first workflows.
[Audio input: verbal description of a software bug]
Based on my verbal description of this bug:
1. Write a formal bug report with title, steps to reproduce,
expected behavior, and actual behavior
2. Suggest the likely root cause based on the symptoms I described
3. Recommend which team member or system component to investigate
4. Assign a severity level (P0-P3) with justification
Clean up any verbal tics or rambling from my description — the
bug report should be concise and technical.
Tip
When using audio input, speak clearly and mention proper nouns deliberately. AI models handle natural speech well, but unusual company names, technical terms, or acronyms may be misinterpreted. Spell out critical terms: "the CORS — C-O-R-S — configuration" rather than just saying "the CORS config."
Video + Text Prompting
Video understanding is Gemini's standout capability. As of 2026, Gemini 2.5 Pro and Flash are the only major models that accept video input directly. Other models require you to extract frames or transcribe audio separately — Gemini processes the visual, audio, and text content of a video simultaneously.
Video Summarization
The most straightforward video use case is summarization — particularly for long-form content you don't have time to watch.
I'm uploading a 30-minute product demo video from a competitor.
Provide:
1. A structured summary of every feature demonstrated
2. The order in which features were presented (this tells us
what they consider most important)
3. Any pricing or plan information shown on screen
4. UI/UX patterns they use that we should consider
5. Claims they make about performance, accuracy, or capabilities
6. Anything that appears to be in beta or coming soon
I'm preparing a competitive analysis — focus on objective
observations, not subjective quality judgments.
Content Moderation
Video analysis enables automated content review at a level that wasn't previously possible without dedicated computer vision systems.
Review this user-uploaded video for our community platform.
Check for:
- Inappropriate visual content (violence, explicit material)
- On-screen text that violates our community guidelines
(hate speech, harassment, personal information)
- Audio content that contains slurs, threats, or harassment
- Copyright-protected content (music, movie clips, TV shows)
- Misleading content (manipulated media, deepfakes if detectable)
For each issue found, provide:
- Timestamp
- Category of violation
- Severity (must-remove, review-needed, borderline)
- Specific description of what you found
If the video is clean, confirm that no violations were detected.
Educational Video Analysis
Students and educators can use video prompting to extract structured learning materials from lectures, tutorials, and presentations.
I'm uploading a university lecture on microeconomics (50 minutes).
Create:
1. A comprehensive set of lecture notes organized by topic
2. Key definitions with the exact wording used by the professor
3. Any formulas or equations shown on the slides or board
4. Examples or case studies discussed, with full context
5. Questions posed by students and the professor's answers
6. A list of 10 study questions based on the lecture content
Format the notes as if preparing them for a student who missed
the lecture.
Process Documentation From Video
Recording a process and asking AI to document it is faster than writing documentation from scratch.
I'm uploading a screen recording of our deployment process
(12 minutes).
Create:
1. Step-by-step documentation with numbered instructions
2. Screenshot descriptions (describe what should be visible at

each step, for when we add static screenshots later)
3. Common pitfalls or error states visible in the recording
4. The approximate time each step takes
5. Any keyboard shortcuts or commands used
Write the documentation for a new team member who has never
done this deployment before. Assume they have basic terminal
knowledge but no familiarity with our specific tools.
Info
Gemini's video processing works best with videos under 60 minutes. For longer videos, consider splitting them into logical segments and processing each separately. The model can handle longer content with its million-token context window, but analysis quality tends to be higher with focused segments.
Advanced Multimodal Techniques
Once you're comfortable with single-modality inputs (image+text, audio+text), you can combine techniques for more sophisticated workflows.
Combining Multiple Input Types in One Prompt
The most powerful multimodal prompts use several input types together.
I'm providing three inputs:
1. [Image] A photo of our current office layout
2. [Document] Our company's space planning guidelines (PDF)
3. [Text] We're adding 15 new employees in Q3 and need to
accommodate them without moving offices.
Based on the current layout visible in the photo and the
constraints in the guidelines document:
- Identify underutilized areas that could be reconfigured
- Suggest a revised layout that accommodates 15 additional
workstations
- Note any guideline violations in the current layout
- Estimate whether the current space can handle the growth
or if we'll need additional square footage
Chain-of-Modality Prompting
This technique processes one modality first, then uses the results to inform analysis of another. It's the multimodal equivalent of chain-of-thought prompting.
Step 1 — Process the image:
Analyze this architectural floor plan. Identify every room,
its approximate dimensions, and its labeled purpose. Output
as a structured list.
Step 2 — Reason with text using the results:
Based on your analysis of the floor plan, and given that we
need to convert this residential property into a co-working
space:
- Which rooms are suitable for private offices (minimum 100 sq ft)?
- Where should the common area be (needs natural light and
easy access from the entrance)?
- Where should the kitchen/break room go (needs plumbing access)?
- What building code issues might arise from the conversion?
This two-step approach produces better results than a single prompt because the model commits to specific observations before reasoning about them — reducing hallucination and improving accuracy.
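The two-step flow is easy to wire up as a small pipeline. A model-agnostic sketch where `call_model` stands for whatever function sends a request to your chosen model (injected here so the chaining logic carries no SDK dependency):

```python
from typing import Any, Callable

def chain_of_modality(
    call_model: Callable[[str, Any], str],
    image: Any,                # image attachment in whatever form your SDK expects
    extract_prompt: str,
    reason_prompt_template: str,  # must contain an {observations} placeholder
) -> str:
    """Step 1: extract structured observations from the image.
    Step 2: reason over those observations with a text-only prompt."""
    observations = call_model(extract_prompt, image)
    reasoning_prompt = reason_prompt_template.format(observations=observations)
    return call_model(reasoning_prompt, None)  # no image in the second step
```

Persisting the intermediate `observations` string is also useful for auditing: you can check what the model claimed to see before trusting its downstream reasoning.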
Multimodal Few-Shot Prompting
Just as few-shot prompting with text shows examples of desired input-output pairs, multimodal few-shot prompting provides examples with images.
I need you to classify product images into damage categories.
Example 1: [Image of a dented package]
Classification: Minor damage — cosmetic dent, product likely
unaffected. Action: Ship as-is with a note to the customer.
Example 2: [Image of a torn, open package]
Classification: Major damage — package integrity compromised,
product may be damaged. Action: Inspect product before shipping
or replace.
Example 3: [Image of a water-stained package]
Classification: Moderate damage — water exposure, product may
be affected depending on contents. Action: Open and inspect,
replace if contents are damaged.
Now classify this image:
[New product image to classify]
This approach dramatically improves consistency because the model sees exactly what your categories look like rather than interpreting text descriptions.
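Assembling such a prompt by hand gets tedious past a couple of examples, so it is worth generating the message from a list of (image, label) pairs. A sketch in the Chat Completions content-part format (helper names are illustrative; confirm the payload shape against your SDK's docs):

```python
import base64

def build_fewshot_messages(examples: list[tuple[str, str]],
                           query_image: str, task: str) -> list[dict]:
    """Interleave example images with their classifications,
    then append the image to classify."""
    def img_block(path: str) -> dict:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        return {"type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

    content: list[dict] = [{"type": "text", "text": task}]
    for i, (path, label) in enumerate(examples, start=1):
        content.append({"type": "text", "text": f"Example {i}:"})
        content.append(img_block(path))
        content.append({"type": "text", "text": f"Classification: {label}"})
    content.append({"type": "text", "text": "Now classify this image:"})
    content.append(img_block(query_image))
    return [{"role": "user", "content": content}]
```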
Cross-Modal Verification
Use one modality to fact-check another. This is particularly useful for catching errors and building trust in AI outputs.
I'm providing two inputs:
1. [Image] A photo of our warehouse inventory shelf
2. [Text] Our inventory system says this shelf should contain:
- 24 units of SKU-A100 (blue boxes)
- 12 units of SKU-B200 (red boxes)
- 6 units of SKU-C300 (yellow boxes)
Compare what's visible in the photo against the inventory data.
Report:
- Any discrepancies in quantity
- Any misplaced items
- Whether the items appear to match the expected packaging
- Any organizational issues visible (items not aligned,
labels not facing forward, etc.)
Warning
Cross-modal verification works best when you're checking specific, countable claims against visual evidence. Don't rely on it for precise measurements — AI models estimate spatial dimensions from images but are not calibrated instruments. Use it for qualitative checks and approximate counts.
Model Comparison: Choosing the Right Tool
Each model has distinct strengths in multimodal processing. Picking the right model for your specific use case matters more than which model is "best" overall.
| Use Case | Best Model | Why |
|---|---|---|
| UI/UX screenshot review | GPT-4o or Claude | Strong visual reasoning, detailed design feedback |
| PDF document analysis | Claude | Native PDF processing, understands document structure |
| Multi-page legal documents | Claude | Best at maintaining context across long documents |
| Photo analysis (products, real estate) | GPT-4o or Gemini | Strong object recognition and spatial understanding |
| Audio transcription + analysis | GPT-4o | Native audio input with nuanced interpretation |
| Video summarization | Gemini 2.5 Pro | Only major model with direct video input |
| Long video analysis (30+ min) | Gemini 2.5 Pro | Million-token context window handles extended content |
| Comparing multiple images | Claude or GPT-4o | Both handle multi-image prompts well |
| Chart/graph interpretation | GPT-4o or Claude | Both strong at data extraction from visualizations |
| Multimodal few-shot examples | Gemini 2.5 Pro | Handles many images in context efficiently |
| Scanned document OCR | Any model | All handle text extraction from images well |
Tip
Don't default to one model for everything. A practical workflow might use Claude for contract review (PDF strength), GPT-4o for design feedback (vision + generation strength), and Gemini for video analysis (only option). The best multimodal practitioners match models to modalities. For model-specific prompt tips, see our comparison of 9 AI models.
Common Mistakes in Multimodal Prompting
Even experienced prompt engineers make these errors when working with non-text inputs:
Mistake 1: Describing what you're showing. If you attach a screenshot of a login page and write "This is a screenshot of a login page with a username field, password field, and blue submit button" — you've wasted tokens and potentially biased the model. Let the model see for itself. Only describe what isn't visible: "This is our production login page. The submit button was recently changed from green to blue — we're testing whether this improves conversion."
Mistake 2: Uploading low-quality images. Models can't read blurry text or identify objects in dark, grainy photos. If you're taking a photo of a document, make sure the lighting is even and the text is sharp. If you're screenshotting a UI, capture at full resolution. A few seconds spent on image quality saves you from garbage results.
Mistake 3: Not specifying output format. This applies to all prompting, but it's especially critical with multimodal inputs because the model has so much it could say about an image or document. Without format constraints, you'll get a rambling description instead of the structured analysis you need.
Mistake 4: Ignoring model limitations. Asking GPT-4o to analyze a video or asking Claude to process audio will either fail or produce poor workarounds. Check the capability table above and route your task to the right model.
Mistake 5: Uploading too many images without structure. Dropping 20 images into a conversation without labels or instructions overwhelms the model. Number your images, group them logically, and be explicit about what you want from each one.
Building Multimodal Prompts With SurePrompts
Structuring multimodal prompts follows the same principles as text-only prompt engineering — clear role, specific task, defined output format — with the added dimension of specifying how non-text inputs should be processed. Our prompt generator helps you build structured prompts that work well as the text component of multimodal interactions. Start with a generated prompt framework, then attach your images, documents, or audio for a complete multimodal experience.
For more on the foundations of good prompting that apply across all modalities, see our prompt engineering basics guide. If you're specifically working with image generation rather than image analysis, our AI image prompts guide and ChatGPT image prompts guide cover that angle in depth.
The Future of Multimodal Prompting
Multimodal AI is moving fast. A year ago, video input was experimental. Today it's a standard capability in Gemini. Audio input has gone from novelty to a practical tool for meeting analysis and transcription. The trajectory is clear: within the next year, we'll likely see real-time multimodal interactions become standard — models that can simultaneously process a live video feed, listen to speech, and respond in real time.
What this means for prompt engineers: the skill of choosing the right modality for the task will become as important as writing good text instructions. Sometimes a photo communicates in one second what would take 500 words to describe. Sometimes a 30-second audio clip captures tone and nuance that a transcript destroys. The best prompts in 2026 and beyond won't just be well-written — they'll be well-composed, using the right combination of inputs to give the model exactly the information it needs.
Start simple. Upload a screenshot and ask for feedback. Attach a receipt and ask for data extraction. Record a voice memo and ask for a formatted summary. Once you see the difference between describing something to AI and showing it, you won't go back to text-only prompting.