GPT-5.5 vs. Claude Opus 4.8 vs. Gemini 3.1 Pro: 20 Copy-Paste Prompts, Three Models (2026)

20 copy-paste prompts run head-to-head on GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro — with notes on which model wins each category and why. Pick the right model for your task.

May 6, 2026
Updated June 17, 2026
34 min read

TL;DR

Twenty real prompts run on GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro with structured comparison — math, code, agents, long-context, writing, research, decisions. Includes a model-selection rubric so you can route work to the model that actually wins each kind of task instead of paying flagship rates for everything.

"Which model is best?" is the wrong question. The useful question is "which model wins for this specific task?" These 20 copy-paste prompts — run on GPT-5.5, Claude Opus 4.8, and Gemini 3.1 Pro — answer it category by category. The goal is a routing map, not a winner's podium.

How to Read These Comparisons

Each of the 20 prompts was run on all three models using the framing that model actually rewards: structured outputs and clean markdown for GPT-5.5 (with its reasoning-effort dial turned up for hard problems), XML-tagged context for Opus 4.8, and problem framings Gemini 3.1 Pro can explore across multiple hypotheses. Running the same literal text on all three would be a test of format tolerance, not capability — that's not useful. What's useful is: given a task, what does the best prompt on each model look like, and what does the output quality reveal?

"Winner" is a qualitative call. It's based on output quality for the specific task — logical coherence, format discipline, task completion, absence of hallucination, and usefulness without post-processing. It is not based on benchmark scores. Benchmarks are useful for comparing models in aggregate; they are nearly useless for deciding which model to use for a specific task you have today. We won't fabricate percentages.

Tradeoffs always exist and always matter. GPT-5.5 at high reasoning effort is slower and more expensive; reaching for that dial on a task it handles adequately at low effort wastes real money and latency. Its structured output discipline makes it the obvious choice for pipelines that need machine-parseable JSON — but that discipline can make freeform writing feel slightly mechanical. Opus 4.8's long-form writing has a voice and a rhythm the other two don't match, but voice doesn't matter in an API response schema. Gemini 3.1 Pro is the most exploratory of the three — that's an asset on open-ended research and a liability when you wanted a terse, single answer.

The practical implication is that most teams should route by task type rather than selecting one flagship model for everything. For a deeper look at how each model's prompting mechanics differ, read our guide to advanced prompt engineering in 2026.
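
One way to make task-type routing concrete is a small lookup table. The sketch below is a hypothetical illustration: the category keys, the model labels, and the `route` helper are our own assumptions rather than any vendor's API, and the mapping simply mirrors the per-prompt calls made in this post. Adjust it to your own results.

```python
from typing import Optional

# Hypothetical routing table. Category keys and model labels are illustrative
# placeholders, not real API model identifiers.
ROUTES = {
    "math_reasoning": "gpt-5.5",
    "code_generation": "gpt-5.5",
    "structured_pipeline": "gpt-5.5",
    "agentic_debugging": "claude-opus-4.8",
    "long_context_review": "claude-opus-4.8",
    "long_form_writing": "claude-opus-4.8",
    "cross_document_synthesis": "gemini-3.1-pro",
    "grounded_research": "gemini-3.1-pro",
}


def route(task_type: str, default: Optional[str] = None) -> str:
    """Return the model label for a task type, or the caller's default."""
    if task_type in ROUTES:
        return ROUTES[task_type]
    if default is None:
        # Fail loudly so that unrouted task types get an explicit decision.
        raise ValueError(f"no route configured for task type: {task_type!r}")
    return default
```

Raising on an unknown task type, rather than silently falling back to a flagship model, keeps the "paying flagship rates for everything" failure mode visible.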

We won't claim specific benchmark scores for any model in this post. Every observation here is about prompt-following behavior, output structure, verbosity patterns, and refusal tendencies — things you can observe yourself.

Math & Hard Reasoning (Prompts 1–3)

This category tests multi-step deduction and quantitative reasoning where intermediate errors compound. A model that makes a wrong turn in step 3 produces confident nonsense by step 8.

1. Multi-Step Optimization Problem

A company runs three factories: A, B, and C.
- Factory A produces 200 units/day at $12/unit cost; max capacity 500 units/day
- Factory B produces 150 units/day at $9/unit cost; max capacity 400 units/day
- Factory C produces 100 units/day at $15/unit cost; max capacity 300 units/day
- Total demand is 800 units/day
- Factory A has a $500/day fixed cost; B has $300/day fixed; C has $200/day fixed

Find the production allocation that minimizes total daily cost while meeting demand.
Show your work. If you use a greedy approach, verify it against the LP solution.

GPT-5.5: At high reasoning effort it frames this as a linear programming problem, sets up the objective function and constraints explicitly, solves correctly, then checks the greedy (fill cheapest-per-unit first) against the LP result and flags where they agree and diverge. The self-verification pass catches arithmetic slips before they surface, so error rate on intermediate steps is very low and the method is fully auditable.

Claude Opus 4.8: Solves correctly and narrates each step clearly — but on a problem this size the narrative adds length without adding insight. On harder problems with 10+ constraints, the narrative format helps catch errors.

Gemini 3.1 Pro: Reaches the right allocation and, true to its exploratory bent, spins up two or three candidate allocations in parallel before converging — useful when the optimum isn't obvious, but on a clean LP this small it adds breadth you don't need. The work is correct; the path is longer than it has to be.

Winner: GPT-5.5 — At high reasoning effort, the explicit LP setup and self-verification make errors auditable. On optimization problems, showing the method matters as much as the answer.
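
If you want to check any model's answer to this prompt yourself, the problem is small enough to brute-force. The sketch below is our own, not a model's output, and it rests on two assumptions the prompt leaves open: a factory's fixed cost is only paid when that factory runs, and the "produces N units/day" figures describe current output and do not constrain the new allocation. Within any set of open factories, filling the cheapest unit cost first is optimal because variable costs are linear.

```python
from itertools import combinations

# Figures taken from the prompt; "fixed" is assumed avoidable by closing the factory.
FACTORIES = {
    "A": {"unit_cost": 12, "capacity": 500, "fixed": 500},
    "B": {"unit_cost": 9, "capacity": 400, "fixed": 300},
    "C": {"unit_cost": 15, "capacity": 300, "fixed": 200},
}
DEMAND = 800


def cheapest_fill(open_names, demand):
    """Greedy fill across the open factories, cheapest unit cost first."""
    remaining = demand
    allocation = {}
    for name in sorted(open_names, key=lambda n: FACTORIES[n]["unit_cost"]):
        units = min(remaining, FACTORIES[name]["capacity"])
        allocation[name] = units
        remaining -= units
    # None signals that this set of factories cannot meet demand.
    return allocation if remaining == 0 else None


def total_cost(allocation):
    """Fixed cost for every open factory plus its variable cost."""
    return sum(
        FACTORIES[name]["fixed"] + units * FACTORIES[name]["unit_cost"]
        for name, units in allocation.items()
    )


def best_allocation(demand=DEMAND):
    """Try every non-empty set of open factories and keep the cheapest plan."""
    best = None
    for size in range(1, len(FACTORIES) + 1):
        for open_names in combinations(FACTORIES, size):
            allocation = cheapest_fill(open_names, demand)
            if allocation is None:
                continue
            cost = total_cost(allocation)
            if best is None or cost < best[1]:
                best = (allocation, cost)
    return best


allocation, cost = best_allocation()
```

Under these assumptions the cheapest plan runs A and B at 400 units each and closes C, for $9,200 per day. If all three fixed costs are treated as sunk instead, the allocation is the same and the total is $9,400.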


2. Proof by Contradiction

Prove that there are infinitely many prime numbers.
Use proof by contradiction. Show each logical step explicitly.
After the proof, explain in plain English why the assumption leads to contradiction.

GPT-5.5: At high reasoning effort it produces a clean, formal proof with clearly labeled steps. The assumption of a finite set of primes P₁…Pₙ, the construction N = P₁·P₂·…·Pₙ + 1, and the contradiction (N has a prime factor, yet no Pᵢ divides it) are all explicit, and the structured reasoning keeps each step independent. The plain-English follow-up is accurate and not dumbed down.

Claude Opus 4.8: Gets the math right and writes the plain-English section with more elegance than the other two. Proof formatting is slightly less terse than GPT-5.5's but more readable for a non-mathematician audience.

Gemini 3.1 Pro: Correct proof, and it volunteers an alternate framing (Euclid's classic vs. a factorial-based construction) before settling on one — a nice touch for a teaching context, but it makes the formal section longer than the proof needs to be. The plain-English explanation is solid.

Winner: GPT-5.5 — For formal proofs where each step needs to stand independently, the structured-reasoning terseness at high effort is a feature.
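
A common slip in this proof is claiming that N itself must be prime. The argument only needs N to have some prime factor that is missing from the assumed list. The short check below is our own illustration (the helper name is hypothetical) using the first six primes.

```python
from math import prod


def smallest_prime_factor(n: int) -> int:
    """Return the smallest prime factor of n (n >= 2) by trial division."""
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return divisor
        divisor += 1
    return n  # no divisor found, so n is prime


primes = [2, 3, 5, 7, 11, 13]
n = prod(primes) + 1          # 30031
factor = smallest_prime_factor(n)

# Every listed prime leaves remainder 1, so none of them divides n.
remainders = [n % p for p in primes]
```

Here n = 30031 = 59 × 509, which is composite, yet its smallest prime factor 59 is not in the list. That is exactly the contradiction the proof needs.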


3. Probability Chain

A software deployment pipeline has three sequential stages. Each stage fails 
independently with the following probabilities: Stage 1: 3%, Stage 2: 5%, Stage 3: 2%.

1. What is the probability a deployment succeeds end-to-end?
2. If you could reduce one stage's failure rate by half, which stage gives the 
   greatest improvement in overall success rate?
3. If you run 50 deployments per week, what is the expected number of full failures per week?
4. What sample size do you need to detect a 1% improvement in overall success rate 
   with 80% power and α=0.05?

GPT-5.5: At high reasoning effort it handles all four parts cleanly, including the power calculation in part 4 which requires setting up the appropriate z-test or binomial test. It commits to the calculation, shows the work, and doesn't hedge — the self-verification pass keeps intermediate rounding from drifting the answer.

Claude Opus 4.8: Completes the first three parts confidently. On part 4, sometimes qualifies heavily ("the exact sample size depends on the test formulation chosen") rather than committing to a calculation, which is technically careful but less useful in practice.

Gemini 3.1 Pro: Completes all four parts and, on part 4, lays out two test formulations (normal approximation vs. exact binomial) and computes both before recommending one. Thorough and correct, though when you just want the number it surfaces more options than the question asked for.

Winner: GPT-5.5 — Part 4 is the discriminator. At high reasoning effort GPT-5.5 commits to the calculation and shows the work.
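
The arithmetic for this prompt is easy to reproduce. The sketch below is ours, not a model transcript. Parts 1 to 3 follow directly from the prompt. Part 4 depends on choices the prompt does not specify, so treat that function as one reasonable formulation: we assume a two-sided test comparing two independent groups, a 1 percentage point absolute improvement, and the unpooled normal approximation.

```python
from math import ceil
from statistics import NormalDist

FAILURE_RATES = {"stage1": 0.03, "stage2": 0.05, "stage3": 0.02}


def success_probability(rates):
    """Stages are independent, so end-to-end success is the product."""
    p = 1.0
    for failure_rate in rates.values():
        p *= 1.0 - failure_rate
    return p


def success_if_halved(stage):
    """Overall success rate after halving one stage's failure rate."""
    rates = dict(FAILURE_RATES)
    rates[stage] /= 2
    return success_probability(rates)


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Two-proportion z-test sample size (assumed formulation, see lead-in)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


baseline = success_probability(FAILURE_RATES)             # part 1
best_stage = max(FAILURE_RATES, key=success_if_halved)    # part 2
expected_failures = 50 * (1.0 - baseline)                 # part 3
n_per_group = sample_size_per_group(baseline, baseline + 0.01)  # part 4
```

The baseline success rate is about 90.3%, halving Stage 2 helps most, and 50 deployments per week yield about 4.85 expected failures. Under the stated assumptions, part 4 comes out on the order of 13,000 deployments per group, which is why a model that hedges on the formulation is not being unreasonable, only less useful.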


Code Generation & Refactor (Prompts 4–6)

This category tests whether a model can write production-quality code, not just plausible code — meaning the output handles edge cases, follows conventions, and doesn't require immediate debugging.

4. API Rate Limiter Implementation

Implement a thread-safe rate limiter in Python using the token bucket algorithm.

Requirements:
- Class-based interface: RateLimiter(rate: float, capacity: float)
- Method: consume(tokens: float = 1.0) -> bool
- Must be safe for concurrent use across threads
- Use threading.Lock or threading.RLock appropriately
- Include a refill mechanism that is lazy (refill on consume, not on a background thread)
- Add type hints and a short docstring per method

Write the implementation, then write pytest tests covering:
- Normal consumption within capacity
- Consumption that exceeds capacity
- Concurrent consumption from 5 threads

GPT-5.5: Produces a correct token bucket with lazy refill using time.monotonic(), proper lock usage, and a clear consume() — and adds a production-practical bonus: it returns (bool, float) where the float is the wait time if the consume fails, which is often what you actually want downstream. The pytest suite covers all three specified scenarios, including a concurrency test using threading.Thread, and it handles fractional tokens correctly.

Claude Opus 4.8: Correct implementation, and the docstrings are notably better — they explain the algorithm, not just the parameters. Test suite is thorough. Occasionally proposes a slightly more complex refill formula that is more accurate at irregular intervals.

Gemini 3.1 Pro: Correct and complete, and it tends to volunteer a second design — a leaky-bucket variant alongside the token bucket — with notes on when each fits. Useful if you're still choosing the algorithm; extra surface area if you already decided. Tests are solid.

Winner: GPT-5.5 — The wait-time return value is the kind of production-practical addition that changes how you use the class downstream, without being asked.
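
For reference, here is a minimal sketch of what a lazy-refill token bucket with a wait-time return can look like. It is our own illustration, not any model's verbatim output. The `try_consume` method name and the injectable `clock` parameter are assumptions we added so the class is easy to test, and `consume` keeps the `bool` signature the prompt asks for.

```python
import threading
import time
from typing import Callable


class RateLimiter:
    """Token bucket that refills lazily whenever tokens are requested."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests (assumption)
    ) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()
        self._lock = threading.Lock()

    def try_consume(self, tokens: float = 1.0) -> tuple[bool, float]:
        """Return (granted, seconds until enough tokens would be available)."""
        with self._lock:
            now = self._clock()
            # Lazy refill: credit the time elapsed since the last call.
            elapsed = now - self._last
            self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
            self._last = now
            if tokens <= self._tokens:
                self._tokens -= tokens
                return True, 0.0
            # Note: a request larger than capacity can never succeed.
            return False, (tokens - self._tokens) / self._rate

    def consume(self, tokens: float = 1.0) -> bool:
        """Return True if the tokens were granted, False otherwise."""
        return self.try_consume(tokens)[0]
```

A plain `threading.Lock` is enough here because no method re-enters the lock while holding it. `RLock` would only be needed if `consume` acquired the lock and then called another locking method.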


5. Legacy Code Refactor

Refactor this Python function. Preserve identical behavior; improve readability, 
reduce cyclomatic complexity, and add type hints.

def process(data, mode, threshold=0.5, extra=None):
    result = []
    for i in range(len(data)):
        if mode == 'filter':
            if data[i] > threshold:
                result.append(data[i])
        elif mode == 'scale':
            if extra is not None:
                result.append(data[i] * extra)
            else:
                result.append(data[i] * 2)
        elif mode == 'flag':
            if data[i] > threshold:
                result.append((data[i], True))
            else:
                result.append((data[i], False))
        else:
            result.append(data[i])
    return result

After refactoring, explain each change and why it improves the code.

GPT-5.5: Splits into separate handlers, adds type hints with overloads where return types differ by mode, replaces the for i in range(len(data)) pattern with direct iteration, and introduces a dispatch dictionary (mapping mode strings to handlers) that flattens the conditional ladder. The result is the most idiomatic Python 3.10+ version of the three, and the explanation is methodical — one change per bullet with the rationale.

Claude Opus 4.8: Produces the cleanest function signatures and the most readable explanation, but tends to keep the three modes in one function with a more structured conditional block rather than splitting or dispatching. Valid choice; different opinion on cohesion.

Gemini 3.1 Pro: Refactors correctly and presents two alternative structures — a dispatch dict and a small strategy-pattern class — with tradeoffs for each. Thorough, but for a function this size it offers more architecture than the problem warrants.

Winner: GPT-5.5 — The dispatch pattern and idiomatic Python 3.10+ conventions produce code that's easiest to extend without touching existing behavior.
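
To show what the dispatch approach looks like in practice, here is a compact sketch of the refactor. It is our own version rather than a transcript of any model, and it assumes `data` is a sequence of numbers, as the indexing in the original implies. Unknown modes still return a copy of the input, matching the original `else` branch.

```python
from typing import Any, Callable, Optional, Sequence


def process(
    data: Sequence[float],
    mode: str,
    threshold: float = 0.5,
    extra: Optional[float] = None,
) -> list[Any]:
    """Filter, scale, or flag the values in data depending on mode."""
    # The original multiplied by 2 when no explicit factor was given.
    factor = 2 if extra is None else extra

    # Each handler is a zero-argument callable, so only the chosen one runs.
    handlers: dict[str, Callable[[], list[Any]]] = {
        "filter": lambda: [x for x in data if x > threshold],
        "scale": lambda: [x * factor for x in data],
        "flag": lambda: [(x, x > threshold) for x in data],
    }
    handler = handlers.get(mode)
    # Unknown modes pass the data through unchanged, as a new list.
    return handler() if handler else list(data)
```

Adding a mode now means adding one dictionary entry instead of another `elif` branch, which is the extensibility argument behind the winner call.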


6. SQL Query Optimization

This query takes 45+ seconds on a 10M-row orders table. Analyze and rewrite it.

SELECT 
    c.name,
    c.email,
    COUNT(o.id) as order_count,
    SUM(o.total) as lifetime_value,
    MAX(o.created_at) as last_order_date
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.created_at >= '2024-01-01'
    AND o.status != 'cancelled'
GROUP BY c.id, c.name, c.email
HAVING COUNT(o.id) > 2
ORDER BY lifetime_value DESC;

Identify every performance problem, rewrite the query, and specify the indexes needed.

GPT-5.5: Correctly identifies the LEFT JOIN combined with WHERE filters on columns of the orders table (the filters discard the NULL-extended rows, so the query silently gets INNER JOIN semantics), the missing composite index on (customer_id, created_at, status), and the sort on a computed column without a covering index. It then goes one step further with a partial index on (customer_id, created_at) WHERE status != 'cancelled', a specific, practical optimization that reduces index size. The rewrite is clean and the explanation of why the partial index helps is clear without being a lecture.

Claude Opus 4.8: Identifies all the same issues. Explanation is the most thorough — includes an estimated cost model discussion and explains the LEFT/INNER JOIN behavior in terms a developer who didn't write the original query would immediately understand.

Gemini 3.1 Pro: Catches the same core problems and adds breadth — it weighs a covering index against a partial index and against a materialized rollup for the aggregate, with a note on read/write tradeoffs. Genuinely useful exploration, but it stops short of committing to the single sharpest index recommendation.

Winner: GPT-5.5 — The partial index recommendation is a concrete win, and the explanation is sufficiently detailed without being a lecture.
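
The LEFT JOIN point is easy to verify on a toy dataset. The sketch below uses Python's built-in `sqlite3` module with made-up rows (the customers, the index name `idx_orders_active`, and the data are all hypothetical). It shows that the LEFT JOIN and INNER JOIN forms of the query return the same rows, and that a partial index of the kind described can be declared. It does not measure performance, and your production database's planner may behave differently from SQLite's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        total REAL,
        status TEXT,
        created_at TEXT
    );
    -- Ada has three valid orders and one cancelled; Bo has no orders at all.
    INSERT INTO customers VALUES
        (1, 'Ada', 'ada@example.com'),
        (2, 'Bo', 'bo@example.com');
    INSERT INTO orders VALUES
        (1, 1, 10, 'paid', '2024-02-01'),
        (2, 1, 20, 'paid', '2024-03-01'),
        (3, 1, 30, 'paid', '2024-04-01'),
        (4, 1, 99, 'cancelled', '2024-05-01');
    -- Partial index: only non-cancelled orders are indexed.
    CREATE INDEX idx_orders_active
        ON orders (customer_id, created_at)
        WHERE status != 'cancelled';
    """
)

QUERY = """
SELECT c.name,
       COUNT(o.id)       AS order_count,
       SUM(o.total)      AS lifetime_value,
       MAX(o.created_at) AS last_order_date
FROM customers c
{join} orders o ON c.id = o.customer_id
WHERE o.created_at >= '2024-01-01'
  AND o.status != 'cancelled'
GROUP BY c.id, c.name, c.email
HAVING COUNT(o.id) > 2
ORDER BY lifetime_value DESC
"""

left_rows = conn.execute(QUERY.format(join="LEFT JOIN")).fetchall()
inner_rows = conn.execute(QUERY.format(join="INNER JOIN")).fetchall()
```

Bo never appears in either result: the WHERE clause rejects the NULL-extended row the LEFT JOIN produces for a customer with no orders, so writing INNER JOIN states the actual intent and gives the planner more freedom.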


Agentic Tool-Use Loops (Prompts 7–9)

This category tests the model as an agent: given a goal and a set of tools, does it plan sensibly, avoid unnecessary calls, recover from errors, and stop when done?

7. Research-Then-Write Agent

You have access to: web_search(query: str) -> list[SearchResult], 
read_url(url: str) -> str, write_file(filename: str, content: str) -> None.

Task: Research the current state of RISC-V adoption in data center CPUs. 
Produce a 600-word briefing saved to "riscv-datacenter-2026.md".

Constraints:
- Use at least 3 distinct sources
- Do not use the same domain twice
- Flag any claim you couldn't verify with [UNVERIFIED]
- Stop when the file is written; do not ask for confirmation

GPT-5.5: Executes the tool loop efficiently with minimal redundant calls. The structured output discipline carries over — the briefing is well-formatted with clear headers. Follows the "no same domain twice" constraint reliably. The prose is clean but reads more like a report template than an analyst's voice.

Claude Opus 4.8: Plans explicitly before acting via Claude Code's interleaved thinking, which adds one reasoning step but reduces backtracking. The final briefing has the most editorial voice of the three — reads like a human analyst wrote it, not a pipeline — and the [UNVERIFIED] flags are applied judiciously. Stops when told to stop, unlike older Claude versions that asked "Shall I proceed?"

Gemini 3.1 Pro: Strong here on the research side — its Search grounding pulls genuinely current sources and it naturally satisfies the three-distinct-domains constraint. It tends to over-collect, reading more URLs than strictly needed, and the written briefing is more exploratory than tight.

Winner: Claude Opus 4.8 — The plan-before-acting behavior reduces wasted tool calls, and the editorial quality on the final document is meaningfully higher.
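
Constraints like "at least 3 distinct sources" and "no domain twice" are cheap to verify outside the model. The helper below is a hypothetical sketch of such a check (the function name and messages are ours). As a simplification it compares hostnames after stripping a leading `www.`, so it treats different subdomains of one site as different domains.

```python
from urllib.parse import urlparse


def source_violations(urls: list[str], min_sources: int = 3) -> list[str]:
    """Return human-readable constraint violations; an empty list means OK."""
    # Hostname comparison only: subdomains count as separate domains here.
    domains = [urlparse(url).netloc.lower().removeprefix("www.") for url in urls]
    problems = []
    if len(set(domains)) < min_sources:
        problems.append(
            f"only {len(set(domains))} distinct domains, need {min_sources}"
        )
    repeated = sorted({d for d in domains if domains.count(d) > 1})
    if repeated:
        problems.append("domain used more than once: " + ", ".join(repeated))
    return problems
```

Running a check like this on the agent's tool-call log turns "follows the constraint reliably" into something you can measure per run.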


8. Debugging Agent

You have access to: run_code(code: str, timeout: int = 10) -> dict[str, str], 
read_file(path: str) -> str.

A production script is crashing with: 
"KeyError: 'user_id'" at line 47 of user_processor.py.

The file contents are:
[PASTE FILE CONTENTS]

Diagnose the root cause, write a fix, test the fix by running the modified code, 
and confirm the specific line that caused the issue. Do not patch around the error — 
find why 'user_id' is absent and fix that.

GPT-5.5: At high reasoning effort it traces the KeyError to its source rather than stopping at a .get() band-aid, and the diagnosis is structured: hypothesis → test → confirmation. It is, however, more likely than the others to wrap the finished fix in defensive .get() calls "for safety" even after the root cause is addressed, which can obscure future bugs.

Claude Opus 4.8: Most thorough diagnosis — via Claude Code's interleaved thinking it explicitly rules out multiple hypotheses before committing to one. Uses run_code efficiently: one run to confirm the bug, one to confirm the fix. The explanation of why user_id was absent is the most complete, and it fixes the cause, not the symptom.

Gemini 3.1 Pro: Generates several competing hypotheses up front and weighs them, which is a genuine strength — but it can spend extra run_code calls testing branches a more disciplined investigation would prune. Lands on the right cause; the path is more exploratory.

Winner: Claude Opus 4.8 — The multi-hypothesis diagnosis prevents fixing the symptom instead of the cause, and the interleaved thinking between tool calls produces a cleaner, more efficient investigation.


9. Data Pipeline Agent

You have access to: read_csv(path: str) -> DataFrame, 
transform(df: DataFrame, ops: list[dict]) -> DataFrame,
validate_schema(df: DataFrame, schema: dict) -> ValidationResult,
write_parquet(df: DataFrame, path: str) -> None.

Task: Build a pipeline that reads "raw_sales.csv", cleans it 
(remove nulls in revenue column, deduplicate on order_id, 
cast date column to ISO8601), validates the result against 
the provided schema, and writes to "clean_sales.parquet".

If validation fails, return a structured error report instead of writing the file.
Schema: {order_id: int, date: str (ISO8601), revenue: float, region: str}

GPT-5.5: Handles this extremely well — the structured output discipline means the error report format is immediately machine-parseable, and it stays identical in shape whether validation passes or fails. The conditional branch (validate before write) is clean and it doesn't hallucinate tool parameters. This is exactly the case where consistent schema on both paths matters.

Claude Opus 4.8: Correct execution but occasionally adds extra validation passes not specified in the task. The error report is detailed and human-readable; less rigidly structured than GPT-5.5's.

Gemini 3.1 Pro: Executes the pipeline correctly and writes a thorough, readable error report — but the report's shape varies more between runs, which is a problem when downstream code expects a fixed schema. Its instinct to enrich the output works against rigid machine-parsing here.

Winner: GPT-5.5 — When pipelines need consistent output schema on both success and failure paths, GPT-5.5's structured output discipline is the right choice.
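
"Consistent schema on both paths" is worth spelling out. The sketch below is our own illustration, not a model transcript: it stands in for the prompt's hypothetical tools with plain lists of dicts, and the `PipelineResult` fields and the `finish` helper are assumed names. The point is that success and failure return the same shape, so downstream code never has to branch on which keys exist.

```python
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import Optional

# Column types from the prompt's schema; the ISO 8601 check is done separately.
SCHEMA = {"order_id": int, "date": str, "revenue": float, "region": str}


@dataclass
class PipelineResult:
    status: str                 # "ok" or "validation_failed"
    output_path: Optional[str]  # None when nothing was written
    row_count: int
    errors: list = field(default_factory=list)


def validate(rows):
    """Collect one error dict per failed check instead of raising."""
    errors = []
    for index, row in enumerate(rows):
        for column, expected in SCHEMA.items():
            if not isinstance(row.get(column), expected):
                errors.append(
                    {"row": index, "column": column,
                     "problem": f"expected {expected.__name__}"}
                )
        try:
            date.fromisoformat(str(row.get("date")))
        except ValueError:
            errors.append(
                {"row": index, "column": "date", "problem": "not ISO 8601"}
            )
    return errors


def finish(rows, output_path):
    """Validate before writing; return the same result shape either way."""
    errors = validate(rows)
    if errors:
        return PipelineResult("validation_failed", None, 0, errors)
    # A real pipeline would write the Parquet file at this point.
    return PipelineResult("ok", output_path, len(rows))
```

`asdict(result)` gives a JSON-serializable dictionary with identical keys on both paths, which is the property the comparison above rewards.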


Long-Context Document Analysis (Prompts 10–12)

This category tests comprehension and extraction across long inputs — legal contracts, codebases, research corpora.

10. Contract Risk Review

<contract>
[PASTE FULL CONTRACT TEXT — 15,000–40,000 words]
</contract>

Review this contract from the perspective of the party named [PARTY NAME].
Identify:
1. Clauses that expose [PARTY NAME] to uncapped liability
2. Termination provisions that favor the counterparty
3. IP assignment clauses and what rights [PARTY NAME] retains
4. Any defined terms that are unusually broad or ambiguous
5. Clauses absent that are standard in this contract type

For each finding: cite the clause number, quote the relevant language, 
and explain the risk in plain English. Return findings as a numbered list 
sorted by severity (High / Medium / Low).

GPT-5.5: Structured output discipline shines here — the severity-sorted list is clean and consistently formatted, and clause citations are accurate. At very long contract lengths (40K+ words), attention occasionally lapses on clauses buried late in the document unless you turn its reasoning effort up. Strong on IP and liability clauses.

Claude Opus 4.8: Best at finding the subtle risks — unusual defined terms, absent standard clauses, and ambiguous language that the other two treat as boilerplate. The 1M-token context window handles even sprawling contracts without degradation, and the genuine long-context attention means findings late in the document are as sharp as those early.

Gemini 3.1 Pro: Reads the full contract comfortably in its 1M window and is good at surfacing cross-references between clauses. It's less consistent on the most subtle risks — broad defined terms and missing standard clauses — and its severity ordering is less reliable than the other two.

Winner: Claude Opus 4.8 — The combination of genuine long-context attention and quality of analysis on subtle clause risks makes it the best choice for contract work.


11. Codebase Audit

<codebase>
[PASTE OR REFERENCE REPOSITORY — auth module, ~3,000 lines]
</codebase>

Audit this authentication module for:
1. Security vulnerabilities (OWASP Top 10 where applicable)
2. Logic errors that could bypass authentication
3. Missing input validation on user-supplied data
4. Token lifecycle issues (expiry, rotation, revocation)
5. Race conditions in concurrent session handling

For each finding: file, line range, vulnerability class, severity, and a 
one-sentence fix recommendation. Return as a markdown table.

GPT-5.5: Excellent structured output, with a clean and sortable table. Catches the OWASP Top 10 items reliably and, at high reasoning effort, is strong on logic errors and authentication weaknesses such as a missing constant-time comparison in token validation. Occasionally misses the more subtle concurrency issues.

Claude Opus 4.8: Most thorough on the concurrency and token lifecycle findings. The analysis of race conditions in session handling is notably more complete than the other two, and across a ~3,000-line module it holds context without losing the thread between files. Table output is clean when explicitly requested.

Gemini 3.1 Pro: Good breadth on the OWASP categories and willing to flag architectural concerns the prompt didn't ask about. Less consistent on the hardest findings — concurrency and token rotation — and its table formatting occasionally needs a cleanup pass.

Winner: Claude Opus 4.8 — The depth on concurrency and token lifecycle issues matters in production auth code where those are the hardest bugs to catch in code review.
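
For readers unfamiliar with the constant-time comparison finding mentioned above: comparing secrets with `==` can return sooner when the first bytes differ, which leaks timing information. Python's standard library provides `hmac.compare_digest` for this. The wrapper below is a minimal sketch, and the function name `tokens_match` is our own.

```python
import hmac


def tokens_match(presented: str, expected: str) -> bool:
    """Compare two tokens without short-circuiting on the first mismatch."""
    # Encode to bytes so that non-ASCII input is handled uniformly.
    return hmac.compare_digest(
        presented.encode("utf-8"), expected.encode("utf-8")
    )
```

In an audit table, the one-sentence fix for this class of finding is usually "replace `==` on secret values with a constant-time comparison".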


12. Research Synthesis Across Documents

<doc1>[PAPER 1 — ~8,000 words]</doc1>
<doc2>[PAPER 2 — ~7,000 words]</doc2>
<doc3>[PAPER 3 — ~9,000 words]</doc3>

These three papers examine [TOPIC] from different methodological perspectives.

Synthesize:
1. Points of consensus across all three papers
2. Points of direct disagreement — cite the specific claims and paper sections
3. Methodological differences that could explain the disagreements
4. Gaps none of the three papers address
5. A 300-word executive summary that someone who hasn't read the papers can act on

Do not summarize each paper individually. Synthesize across them.

GPT-5.5: Follows the "synthesize across, don't summarize individually" instruction well, and the structure of the output is clean — each of the five sections is clearly demarcated. Slightly less deep on the methodological analysis, and the gaps section tends toward the obvious.

Claude Opus 4.8: Produces a deeply insightful synthesis with the best prose of the three. The methodological difference analysis is strong, and the executive summary reads as a genuine synthesis rather than a stitched-together summary. On three separate documents it holds all of them in context cleanly.

Gemini 3.1 Pro: This is its home turf. The parallel multi-hypothesis approach lets it hold all three papers side by side and weigh competing readings of where they actually disagree — it traces each disagreement to a specific section and a specific methodological choice more precisely than the other two, and the gaps it identifies are the least obvious. Cross-document synthesis is exactly what its thinking levels are built for.

Winner: Gemini 3.1 Pro — Weighing the three papers' claims in parallel surfaces sharper points of disagreement and non-obvious gaps; cross-document synthesis is where it pulls ahead.


Long-Form Writing & Editing (Prompts 13–15)

This category tests whether the model can produce writing that doesn't need rewriting — voice consistency, structural logic, and not sounding like a language model wrote it.

13. Opinion Essay With a Clear Thesis

Write a 1,200-word opinion essay arguing that [POSITION].

Audience: [PUBLICATION TYPE — e.g., general business readers]
Tone: direct, slightly contrarian, willing to name what conventional wisdom gets wrong
Structure: 
- Opening that earns attention without clickbait
- Clear thesis stated in the first 150 words
- Three arguments, each with a specific example or piece of evidence
- Steel-man the opposing view in one paragraph, then respond to it
- Closing that returns to the opening and leaves the reader with something to act on

Do not use transition phrases like "Furthermore" or "In conclusion." 
Write like a person, not a structure diagram.

GPT-5.5: Strong structural compliance, clean prose. Tends to write in a polished-neutral register — professional but slightly anonymous. The steel-man paragraph is typically the best-written section, but the contrarian tone the brief asks for reads slightly applied rather than felt.

Claude Opus 4.8: The voice is distinctly better. The contrarian tone in the brief lands naturally rather than feeling performed, the examples are more specific, and the opening earns attention without tricks. The writing sounds like a person with a point of view.

Gemini 3.1 Pro: Produces a structurally correct essay with good range of evidence — its grounding helps it reach for concrete examples. But the prose runs longer and more exploratory than an opinion piece wants, and the thesis can get diffused across the breadth it generates.

Winner: Claude Opus 4.8 — For long-form opinion writing where voice and persuasion matter, the quality gap is large enough to be immediately visible.


14. Substantive Copy Edit

Edit this draft for clarity, concision, and argument structure. 
Do not rewrite it — edit it. Preserve my voice and my thesis.

<draft>
[PASTE 800–1,200 word draft]
</draft>

For each significant change:
- Quote the original passage
- Show the edit
- Explain why the change improves the writing

At the end, flag the two or three structural issues in the argument 
that no line-level edit can fix, and suggest how to address them.

GPT-5.5: Clean edits, good explanations, and it respects the instruction to edit rather than rewrite. The structural issues list at the end is consistently useful. Sometimes over-edits toward its own polished-neutral register, smoothing away voice the author chose on purpose.

Claude Opus 4.8: Best at detecting intentional stylistic choices versus genuine awkwardness — it edits the latter without touching the former. The structural issues analysis at the end is the most incisive of the three, and it preserves the author's voice rather than overwriting it.

Gemini 3.1 Pro: Thorough and well-reasoned per-change explanations, and it's generous with the structural feedback at the end. But it edits more heavily than asked, sometimes reworking passages that were fine, which is exactly the rewriter behavior the prompt forbids.

Winner: Claude Opus 4.8 — The ability to distinguish intentional voice from fixable awkwardness is exactly what distinguishes a good editor from a rewriter.


15. Technical Explainer for a General Audience

Write a 900-word explainer on [TECHNICAL TOPIC] for readers who are 
intelligent but not specialists in this field.

Rules:
- No jargon without an immediate plain-English definition
- Every abstract concept gets a concrete analogy before moving on
- Use second-person ("you") to keep the reader engaged
- Structure: problem → current approach → why it's hard → what's changing → what it means for you
- Closing: one specific thing the reader can do or look for because they read this

Do not condescend. Treat the reader as smart but uninitiated.

GPT-5.5: Excellent structure compliance and strong on technical accuracy. The analogies are polished and the "you" framing is consistent, though they occasionally feel constructed rather than intuitive. Tends toward slightly longer explanations than necessary for a general audience.

Claude Opus 4.8: Best analogies — they surface the key insight of a concept naturally rather than forcing a comparison. The writing respects the reader's intelligence without assuming background. The closing action item is typically more specific than the other two.

Gemini 3.1 Pro: Accurate and well-researched — its grounding keeps the "what's changing" section current. But it tends to cover more ground than a 900-word explainer should, and the analogies multiply rather than landing on the single clearest one.

Winner: Claude Opus 4.8 — Analogy quality and respect for reader intelligence are the hardest things to specify in a prompt and the easiest to notice in output.


Research & Synthesis (Prompts 16–17)

This category tests quality of analysis over breadth of retrieval — can the model evaluate what it finds, not just report it?

16. Competitive Analysis

Conduct a competitive analysis of [MARKET/PRODUCT CATEGORY] from the perspective 
of a founder deciding whether to enter this market.

Cover:
1. Market structure: number of competitors, concentration, consolidation trends
2. Differentiation axes: what dimensions competitors actually compete on
3. Where incumbents are weak (from customer reviews, public complaints, market gaps)
4. Barriers to entry: capital, switching costs, network effects, regulation
5. Where a new entrant could realistically win (specific segment or angle, not generic "innovation")
6. Biggest assumption you'd need to validate before committing

Be specific. Name competitors. Identify real weaknesses, not generic "large companies move slowly."

GPT-5.5: Strong at market structure analysis and the barrier-to-entry section is well-reasoned. Competitor naming is reliable. The "where incumbents are weak" section sometimes generalizes rather than citing specific complaints, and on "where a new entrant could win" it can stay cautious rather than directional.

Claude Opus 4.8: Most opinionated where it should be — the "where a new entrant could win" answer is a specific claim with reasoning, not a hedge, and the biggest-assumption section identifies the actual uncertainty. Its weakness is currency: without grounding it leans on what it already knows about the market.

Gemini 3.1 Pro: The Search grounding is decisive here — it pulls current competitor positioning, recent funding, and actual customer complaints rather than working from stale priors, and it explores several distinct entry angles in parallel before weighing them. For a founder deciding whether to enter a market today, that combination of fresh data and breadth of options is exactly what the question needs.

Winner: Gemini 3.1 Pro — Search-grounded current data plus parallel exploration of entry angles makes the competitive picture both up-to-date and genuinely strategic.


17. Systematic Comparison of Approaches

Compare three approaches to [PROBLEM — e.g., database schema versioning, 
microservice communication, LLM output validation].

For each approach:
1. How it works (2-3 sentences, no fluff)
2. When it's the right choice
3. When it breaks down
4. Operational complexity (low/medium/high + one sentence justification)
5. What teams actually end up regretting about it

End with a decision tree: given [Variable 1], [Variable 2], [Variable 3], 
which approach should you use?

GPT-5.5: Excellent format compliance. The decision tree at the end is clean and usable. The "when it breaks down" and "what teams regret" sections are accurate but sometimes generic — it reports the standard tradeoffs rather than the surprising ones.

Claude Opus 4.8: Strong on all sections, with the most readable prose. The "what teams end up regretting" observations are good. The decision tree has more nuance but can become harder to follow when there are many branches.

Gemini 3.1 Pro: Best at the core of this prompt — systematically comparing three approaches. Its parallel-hypothesis approach holds all three side by side and weighs them against each variable at once, so the "when it breaks down" and "what teams regret" observations are the most specific and least obvious, and the decision tree reflects genuine cross-approach reasoning rather than three separate write-ups stitched together.

Winner: Gemini 3.1 Pro — Holding the three approaches in parallel and weighing them against each variable produces the sharpest, least-generic comparison and a decision tree that actually reasons across options.


Strategic Decision Analysis (Prompts 18–19)

This category tests structured reasoning about decisions with competing considerations, uncertain information, and real consequences.

18. Build vs. Buy vs. Partner Decision

code
I need to decide whether to build, buy, or partner to add [CAPABILITY] to our product.

Context:
- Company stage: [SERIES A / B / BOOTSTRAPPED / etc.]
- Team size: [NUMBER] engineers
- Time to ship: [DEADLINE PRESSURE]
- Budget for acquisition: [RANGE]
- Strategic importance: [CORE / SUPPORTING / EXPERIMENTAL]
- Current state: [WHAT WE HAVE NOW]

Structure your analysis as:
1. Decision criteria and how to weight them for our situation
2. Build option: realistic effort estimate, risks, and long-term implications
3. Buy option: what to evaluate, red flags, integration cost
4. Partner option: what partnership structures exist, where they usually fail
5. Your recommendation with the two conditions that would change it

Do not give me a framework. Give me a recommendation.

GPT-5.5: At high reasoning effort it follows the structure, commits to a recommendation, and the "conditions that would change it" section is the strongest — it identifies the actual decision pivots rather than listing generic risks, and the structured reasoning keeps the criteria weighting explicit and auditable. It doesn't hedge the recommendation into uselessness.

Claude Opus 4.8: Gives a firm recommendation and is willing to take a counterintuitive position when the context warrants it. The partnership section — particularly where partnerships typically fail — is the most practically useful of the three.

Gemini 3.1 Pro: Explores the three options thoroughly and surfaces considerations the other two skip, and its grounding can pull in relevant market signals. But it tends to present a richer decision surface rather than landing hard on one recommendation — useful for thinking, less so when you asked for a call.

Winner: GPT-5.5 — The decision-pivot analysis is the most structurally rigorous, and at high reasoning effort it commits to a recommendation without hedging it into uselessness.
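The criteria weighting in step 1 of this prompt can be made explicit with a small weighted-score table, which also makes the "two conditions that would change it" easy to test. The sketch below is illustrative only: the criteria names, weights, and 1-5 scores are hypothetical placeholders, not values taken from any of the model outputs.

```python
# Hypothetical criteria and weights (assumptions for illustration); weights sum to 1.0.
WEIGHTS = {
    "time_to_ship": 0.35,
    "strategic_control": 0.30,
    "total_cost": 0.20,
    "integration_risk": 0.15,
}

# Hypothetical 1-5 scores per option, where 5 is the most favorable for the company.
SCORES = {
    "build":   {"time_to_ship": 2, "strategic_control": 5, "total_cost": 3, "integration_risk": 4},
    "buy":     {"time_to_ship": 4, "strategic_control": 3, "total_cost": 2, "integration_risk": 2},
    "partner": {"time_to_ship": 5, "strategic_control": 2, "total_cost": 4, "integration_risk": 3},
}


def rank_options(all_scores, weights):
    """Return (option, weighted total) pairs sorted from best to worst."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    totals = {
        option: round(sum(weights[c] * scores[c] for c in weights), 2)
        for option, scores in all_scores.items()
    }
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


baseline = rank_options(SCORES, WEIGHTS)

# A "condition that would change it": the capability becomes core and the deadline relaxes.
core_weights = dict(WEIGHTS, time_to_ship=0.15, strategic_control=0.50)
if_core = rank_options(SCORES, core_weights)

print("baseline:", baseline)
print("if core:", if_core)
```

With these made-up numbers, partner ranks first under deadline pressure, and shifting weight from time-to-ship to strategic control flips the result to build. That flip is the kind of decision pivot the prompt asks each model to name.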


19. Risk Scenario Planning

code
I'm considering [STRATEGIC DECISION — entering a new market, shipping a 
major feature, changing pricing, key hire].

Run a pre-mortem. Assume it's 18 months from now and this decision failed badly.

Step 1: Generate the 5 most plausible failure scenarios (specific, not generic)
Step 2: For each scenario, identify: 
   - What early indicator would have signaled this was coming (within 90 days)
   - What we could have done differently in the first 30 days
Step 3: Based on these scenarios, what 3 commitments should we make before proceeding?

Be specific. "Market conditions changed" is not a failure scenario. 
"We entered at the same time three incumbents cut price by 40%" is.

GPT-5.5: At high reasoning effort the pre-mortem scenarios are specific and uncomfortable — which is exactly what a useful pre-mortem requires. The 90-day early-indicator identification is where it's strongest; the indicators are concrete and monitorable, and the structured reasoning ties each one back to its scenario without drifting into generic risk.

Claude Opus 4.8: Strong on specificity. The 3 commitments at the end are the most actionable of the three — they read like decisions a real leadership team would actually make, not bullet points from a strategy template.

Gemini 3.1 Pro: Generates the widest range of failure scenarios, including a couple the other two miss, and its parallel exploration genuinely helps avoid anchoring on one failure mode. The tradeoff is that it's slightly more likely to slip in a generic scenario despite the explicit instruction, and its early indicators are less crisp.

Winner: GPT-5.5 — Specific failure scenarios with monitorable early indicators are the core value of a pre-mortem, and at high reasoning effort GPT-5.5 delivers the most concrete, traceable version.
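If you ask a model to return the pre-mortem as JSON instead of prose, a quick structural check can confirm that all three steps came back complete before anyone reads the content. This is a minimal sketch under that assumption: the key names (`scenarios`, `description`, `early_indicator`, `first_30_days`, `commitments`) are hypothetical and would need to match whatever schema you specify in your own prompt.

```python
import json
from typing import List

# Hypothetical field names; they must match the schema you request in the prompt.
SCENARIO_FIELDS = ("description", "early_indicator", "first_30_days")


def check_premortem(raw: str) -> List[str]:
    """Return a list of structural problems; an empty list means the shape is complete."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    scenarios = data.get("scenarios", [])
    if not isinstance(scenarios, list) or len(scenarios) != 5:
        problems.append("expected exactly 5 scenarios")
        scenarios = scenarios if isinstance(scenarios, list) else []
    for index, scenario in enumerate(scenarios, start=1):
        if not isinstance(scenario, dict):
            problems.append(f"scenario {index}: not an object")
            continue
        for field in SCENARIO_FIELDS:
            if not str(scenario.get(field, "")).strip():
                problems.append(f"scenario {index}: missing {field}")

    commitments = data.get("commitments", [])
    if not isinstance(commitments, list) or len(commitments) != 3:
        problems.append("expected exactly 3 commitments")
    return problems


# Usage with a deliberately incomplete reply (one scenario, no commitments).
incomplete = json.dumps({"scenarios": [{"description": "Three incumbents cut price by 40%"}]})
print(check_premortem(incomplete))
```

A check like this only verifies shape. Whether a scenario is specific or generic still takes a human read.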


Creative & Conceptual Work (Prompt 20)

Creative work tests whether a model can do something genuinely unexpected: not merely competent, but interesting.

20. Concept Development for a Novel Idea

code
Develop this half-formed concept into something real and specific.

The seed: [ONE-SENTENCE RAW IDEA]

I need:
1. Three very different directions this concept could go (distinct premises, not variations)
2. For the direction you think is most interesting: a fully developed version — 
   concrete world, specific characters or entities, the central tension, 
   and what makes it worth the reader's/viewer's/user's time
3. The one thing this concept is really about underneath the surface
4. What this concept is NOT — the adjacent bad version someone would make by default

Push it. The default obvious direction is not interesting.

GPT-5.5: The three directions are genuinely different and the fully developed version is polished and clean. The "what it's really about underneath" answer is typically the weakest section — it tends toward thematic generalities rather than a specific emotional core.

Claude Opus 4.8: The developed version has the most texture — specific details that make the concept feel inhabited rather than described. The "what it's really about underneath" answer is often the best part — it finds the emotional or philosophical core that isn't stated in the seed. The "what it's NOT" answer is the most useful because it's the most specific about the failure mode.

Gemini 3.1 Pro: Generates the most divergent set of directions — its parallel exploration is an asset when you want range — and the premises are genuinely distinct. But the fully developed version stays more described than inhabited, and the underneath-the-surface answer reaches for breadth where the seed wanted depth.

Winner: Claude Opus 4.8 — Creative and conceptual development is where voice, texture, and genuine surprise matter. Opus 4.8 produces work that requires less follow-up to become usable.


Model Selection Rubric

Use this as a routing guide, not a ranking. All three models are capable across categories; this is about where each model earns its cost. For a broader framework on which AI model to use for your task, the selection hub covers more task types than the eight below.

1

Hard math, proofs, multi-step optimization → GPT-5.5 (high reasoning effort). Turn the reasoning-effort dial to high or xhigh and it does the explicit structured reasoning and self-verification the old o-series was known for — arithmetic reliability on multi-step problems plus an auditable method. For the very hardest reasoning, GPT-5.5 Pro is the higher-compute option.

2

Long-context contracts, codebases, research corpora → Opus 4.8. The combination of genuine 1M-token attention and quality of analysis on subtle findings (not just top-10 pattern matching) makes it the right call for anything over 20K tokens where the hard insights are buried. For the toughest analyses, Claude Fable 5 sits above it.

3

Agentic tool-use loops with planning discipline → Opus 4.8. Via Claude Code's interleaved, adaptive thinking, it plans before acting, which reduces wasted actions and backtracking. For agents that need to plan and debug across multiple hypotheses, not just react, Opus 4.8 is the better choice.

4

Structured outputs and JSON pipelines → GPT-5.5. When your pipeline needs consistent schema — both on success and failure paths — GPT-5.5's structured output discipline is the production-practical choice. Use it for any workflow where downstream code parses the output.

5

Long-form writing, editing, creative development → Opus 4.8. The voice quality, analogy generation, and editorial judgment that distinguish Opus 4.8 writing from the other two are not subtle. If you can notice the difference between "polished" and "written by a person," Opus wins.

6

Speed- or cost-sensitive triage at scale → none of these models. GPT-5.5, Opus 4.8, and Gemini 3.1 Pro are all expensive. Classification, routing, summarization of short documents, simple transformations — route to smaller models (GPT-5.4 nano, Claude Haiku 4.5, Gemini 2.5 Flash-Lite). Using a flagship model for tasks a smaller model handles well is a cost and latency problem.

7

Cross-document synthesis, open-ended research, current-data analysis → Gemini 3.1 Pro. Its parallel multi-hypothesis "thinking levels" weigh competing readings before converging, and Google Search grounding keeps the data current — the right call for synthesizing across several documents, systematically comparing approaches, and competitive analysis that depends on fresh facts. It's also natively multimodal across text, image, audio, and video.

8

Strategic decisions and pre-mortems → GPT-5.5 (high reasoning effort). The structured reasoning approach and willingness to commit to specific, uncomfortable conclusions make GPT-5.5 the strongest strategic thinking partner when the stakes of being vague are high. Use Gemini 3.1 Pro instead when the decision hinges on current market data or you want the widest set of scenarios surfaced.
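The rubric above can be sketched as a small lookup table that a dispatcher consults before sending work anywhere. This is a rough illustration, not a production router: the task-type keys are hypothetical labels invented here, and the values are display names rather than real API model identifiers.

```python
# Task-type keys are hypothetical labels; values are (model display name, usage note).
ROUTES = {
    "hard_math": ("GPT-5.5", "high reasoning effort"),
    "long_context": ("Claude Opus 4.8", ""),
    "agentic_loop": ("Claude Opus 4.8", ""),
    "structured_output": ("GPT-5.5", ""),
    "long_form_writing": ("Claude Opus 4.8", ""),
    "triage_at_scale": ("smaller model", "e.g. GPT-5.4 nano, Claude Haiku 4.5, Gemini 2.5 Flash-Lite"),
    "research_synthesis": ("Gemini 3.1 Pro", "Search grounding"),
    "strategic_decision": ("GPT-5.5", "high reasoning effort"),
}


def route(task_type, needs_current_data=False):
    """Return (model, note) for a task type, following the rubric's one stated exception."""
    if task_type not in ROUTES:
        raise KeyError(f"unknown task type: {task_type}")
    # Rubric item 8: strategic decisions that hinge on current market data go to Gemini.
    if task_type == "strategic_decision" and needs_current_data:
        return ("Gemini 3.1 Pro", "Search grounding")
    return ROUTES[task_type]


print(route("strategic_decision"))
print(route("strategic_decision", needs_current_data=True))
print(route("triage_at_scale"))
```

Keeping the table as data rather than branching logic means you can update a route after your own head-to-head runs without touching the dispatcher.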


Before

Use GPT-5.5 for everything — it's the latest flagship, so it should handle all tasks well. Pay the same per-token rate regardless of task type.

After

Route by task type. Use GPT-5.5 (reasoning effort up) for hard math, structured pipelines, clean code, and strategic decisions; Opus 4.8 for long-context analysis, agentic loops, and writing; Gemini 3.1 Pro for cross-document synthesis and current-data research. Reserve flagships only for tasks where their specific strengths matter. Drop to smaller models for everything else. Same quality across the board, significantly lower cost and latency where it counts.

Run Your Own Comparisons

These 20 prompts are a starting point. The most useful thing you can do is take the two or three task types that dominate your actual workload, run them across models, and let the output quality decide — not benchmarks, not brand preference.
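One way to run such a comparison is a small harness that fills a template's bracketed placeholders, applies a framing per model, and collects the replies side by side. The sketch below makes several assumptions: `Target`, `fill`, and `run_comparison` are hypothetical helper names, the framing strings are rough stand-ins for the model-specific prompts described earlier, and `stub` takes the place of whatever client code you actually use to call each model.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


def fill(template: str, values: Dict[str, str]) -> str:
    """Replace [PLACEHOLDER] markers and fail loudly if any are left unfilled."""
    out = template
    for name, value in values.items():
        out = out.replace(f"[{name}]", value)
    leftover = re.findall(r"\[[^\]]+\]", out)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return out


@dataclass
class Target:
    frame: Callable[[str], str]  # wraps the task in the framing this model rewards
    call: Callable[[str], str]   # sends the prompt and returns the reply text


def run_comparison(task: str, targets: Dict[str, Target]) -> Dict[str, str]:
    """Send the same task, framed per model, to every target and collect the replies."""
    return {name: target.call(target.frame(task)) for name, target in targets.items()}


def stub(prompt: str) -> str:
    """Stand-in for a real API call so the sketch runs offline."""
    return f"(stub reply to a {len(prompt)}-character prompt)"


# Framings are illustrative assumptions, loosely following the article's per-model advice.
TARGETS = {
    "GPT-5.5": Target(frame=lambda t: f"## Task\n{t}", call=stub),
    "Claude Opus 4.8": Target(frame=lambda t: f"<task>\n{t}\n</task>", call=stub),
    "Gemini 3.1 Pro": Target(frame=lambda t: f"{t}\n\nWeigh several hypotheses before answering.", call=stub),
}

task = fill("Compare three approaches to [PROBLEM].", {"PROBLEM": "LLM output validation"})
results = run_comparison(task, TARGETS)
for name, reply in results.items():
    print(f"{name}: {reply}")
```

Swapping `stub` for real client calls leaves the rest unchanged, and the unfilled-placeholder check keeps a half-edited template from being sent to three models at once.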

If you want a faster way to build structured, production-ready prompts for any of the three models, the AI prompt generator builds model-specific prompts from plain English descriptions — it handles the structural framing so you spend time on the task, not the prompt syntax.

For deeper reading on each model's specific prompt mechanics, the guides on best GPT-5 prompts, best Claude prompts, and best Gemini prompts go further on model-specific techniques.
