
o3 vs. GPT-5 vs. Claude Opus 4.7: 20 Copy-Paste Prompts, Three Models (2026)

20 copy-paste prompts run head-to-head on o3, GPT-5, and Claude Opus 4.7 — with notes on which model wins each category and why. Pick the right model for your task.

SurePrompts Team
May 6, 2026
28 min read

TL;DR

Twenty real prompts run head-to-head on o3, GPT-5, and Claude Opus 4.7, compared across seven categories: math, code, agents, long context, writing, research, and decisions. Includes a model-selection rubric so you can route work to the model that actually wins each kind of task instead of paying flagship rates for everything.

"Which model is best?" is the wrong question. The useful question is "which model wins for this specific task?" These 20 copy-paste prompts — run on o3, GPT-5, and Claude Opus 4.7 — answer it category by category. The goal is a routing map, not a winner's podium.

How to Read These Comparisons

Each of the 20 prompts was run on all three models using the framing that model actually rewards: XML-tagged context for Opus 4.7, structured outputs and clean markdown for GPT-5, and lean problem statements for o3. Running the same literal text on all three would be a test of format tolerance, not capability — that's not useful. What's useful is: given a task, what does the best prompt on each model look like, and what does the output quality reveal?

"Winner" is a qualitative call. It's based on output quality for the specific task — logical coherence, format discipline, task completion, absence of hallucination, and usefulness without post-processing. It is not based on benchmark scores. Benchmarks are useful for comparing models in aggregate; they are nearly useless for deciding which model to use for a specific task you have today. We won't fabricate percentages.

Tradeoffs always exist and always matter. o3 is slow and expensive; reaching for it on a task that GPT-5 handles adequately wastes real money and latency. GPT-5's structured output discipline makes it the obvious choice for pipelines that need machine-parseable JSON — but that discipline can make freeform writing feel slightly mechanical. Opus 4.7's long-form writing has a voice and a rhythm the other two don't match, but voice doesn't matter in an API response schema.

The practical implication is that most teams should route by task type rather than selecting one flagship model for everything. For a deeper look at how each model's prompting mechanics differ, read our guide to advanced prompt engineering in 2026.

We won't claim specific benchmark scores for any model in this post. Every observation here is about prompt-following behavior, output structure, verbosity patterns, and refusal tendencies — things you can observe yourself.


Math & Hard Reasoning (Prompts 1–3)

This category tests multi-step deduction and quantitative reasoning where intermediate errors compound. A model that makes a wrong turn in step 3 produces confident nonsense by step 8.

1. Multi-Step Optimization Problem

```
A company runs three factories: A, B, and C.
- Factory A produces 200 units/day at $12/unit cost; max capacity 500 units/day
- Factory B produces 150 units/day at $9/unit cost; max capacity 400 units/day
- Factory C produces 100 units/day at $15/unit cost; max capacity 300 units/day
- Total demand is 800 units/day
- Factory A has a $500/day fixed cost; B has $300/day fixed; C has $200/day fixed

Find the production allocation that minimizes total daily cost while meeting demand.
Show your work. If you use a greedy approach, verify it against the LP solution.
```

o3: Frames this as a linear programming problem, sets up the objective function and constraints explicitly, solves correctly, then checks the greedy (fill cheapest-per-unit first) against the LP result and flags where they agree and diverge. Error rate on intermediate arithmetic is very low.

GPT-5: Gets the right answer but frames the work less formally. Tends to skip explicitly stating the LP formulation and jumps to solving, which makes it harder to audit. The final allocation is correct; the path is harder to follow.

Opus 4.7: Solves correctly and narrates each step clearly — but on a problem this size the narrative adds length without adding insight. On harder problems with 10+ constraints, the narrative format helps catch errors.

Winner: o3 — The explicit LP setup and self-verification make errors auditable. On optimization problems, showing the method matters as much as the answer.
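
The arithmetic here is small enough to verify yourself. Below is a minimal brute-force check (our sketch, not any model's output) that assumes a factory's fixed cost is only incurred on days it produces; the prompt leaves that ambiguous, and a good answer states the assumption explicitly.

```python
from itertools import product

# Capacities, per-unit costs, and daily fixed costs from the prompt.
CAPS = {"A": 500, "B": 400, "C": 300}
VAR = {"A": 12, "B": 9, "C": 15}
FIXED = {"A": 500, "B": 300, "C": 200}
DEMAND = 800

def daily_cost(alloc: dict[str, int]) -> int:
    # Assumption: fixed cost is paid only if the factory runs that day.
    return sum(VAR[f] * q + (FIXED[f] if q > 0 else 0) for f, q in alloc.items())

best = min(
    (
        {"A": a, "B": b, "C": DEMAND - a - b}
        for a, b in product(range(CAPS["A"] + 1), range(CAPS["B"] + 1))
        if 0 <= DEMAND - a - b <= CAPS["C"]
    ),
    key=daily_cost,
)
print(best, daily_cost(best))  # {'A': 400, 'B': 400, 'C': 0} 9200
```

Under that assumption, the greedy fill (cheapest per-unit first) and the exhaustive search agree on A=400, B=400, C=0 at $9,200/day, which is the agreement o3 verifies explicitly.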


2. Proof by Contradiction

```
Prove that there are infinitely many prime numbers.
Use proof by contradiction. Show each logical step explicitly.
After the proof, explain in plain English why the assumption leads to contradiction.
```

o3: Produces a clean, formal proof with clearly labeled steps. The "assume finite set of primes P₁…Pₙ" construction, the N+1 argument, and the contradiction are all explicit. The plain-English follow-up is accurate and not dumbed down.

GPT-5: Also correct. Slightly more verbose in the formal section — tends to add clarifying asides that could be footnotes. The plain-English explanation is good but occasionally redundant with the proof itself.

Opus 4.7: Gets the math right and writes the plain-English section with more elegance than the other two. Proof formatting is slightly less terse than o3 but more readable for a non-mathematician audience.

Winner: o3 — For formal proofs where each step needs to stand independently, o3's terseness is a feature.
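
For reference, the argument all three models produce, compressed into LaTeX:

```latex
\begin{proof}[Proof by contradiction]
Assume there are only finitely many primes $p_1, p_2, \ldots, p_n$.
Let $N = p_1 p_2 \cdots p_n + 1$. For each $i$, dividing $N$ by $p_i$
leaves remainder $1$, so no $p_i$ divides $N$. But $N > 1$, so $N$ has
some prime factor $q$, and $q \notin \{p_1, \ldots, p_n\}$. This
contradicts the assumption that the list contained every prime.
\end{proof}
```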


3. Probability Chain

```
A software deployment pipeline has three sequential stages. Each stage fails 
independently with the following probabilities: Stage 1: 3%, Stage 2: 5%, Stage 3: 2%.

1. What is the probability a deployment succeeds end-to-end?
2. If you could reduce one stage's failure rate by half, which stage gives the 
   greatest improvement in overall success rate?
3. If you run 50 deployments per week, what is the expected number of full failures per week?
4. What sample size do you need to detect a 1% improvement in overall success rate 
   with 80% power and α=0.05?
```

o3: Handles all four parts cleanly, including the power calculation in part 4, which requires setting up an appropriate z-test or binomial test. Doesn't skip part 4 or hedge excessively.

GPT-5: Also completes all four parts. On part 4, occasionally rounds intermediate values in ways that shift the answer by a few deployments — still in the right ballpark. The worked math is readable.

Opus 4.7: Completes the first three parts confidently. On part 4, sometimes qualifies heavily ("the exact sample size depends on the test formulation chosen") rather than committing to a calculation, which is technically careful but less useful in practice.

Winner: o3 — Part 4 is the discriminator. o3 commits to the calculation and shows the work.
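
The first three parts are one-liners if you want to check the models' answers yourself, and part 4 follows from a standard two-proportion z-test. This sketch uses the unpooled-variance formulation; other formulations are defensible, which is exactly the hedge Opus 4.7 reaches for.

```python
from math import ceil, prod
from scipy.stats import norm

fail = {1: 0.03, 2: 0.05, 3: 0.02}

# Part 1: end-to-end success probability (~0.9031)
p_success = prod(1 - f for f in fail.values())

# Part 2: halve each stage's failure rate; stage 2 gives the biggest lift
for stage in fail:
    halved = {**fail, stage: fail[stage] / 2}
    print(stage, prod(1 - f for f in halved.values()))

# Part 3: expected full failures across 50 deployments/week (~4.85)
print(50 * (1 - p_success))

# Part 4: sample size per group for a two-proportion z-test,
# detecting +0.01 in success rate, 80% power, two-sided alpha = 0.05
p1, p2 = p_success, p_success + 0.01
z = norm.ppf(0.975) + norm.ppf(0.80)
n = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
print(ceil(n))  # on the order of 13,100 per group
```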


Code Generation & Refactor (Prompts 4–6)

This category tests whether a model can write production-quality code, not just plausible code — meaning the output handles edge cases, follows conventions, and doesn't require immediate debugging.

4. API Rate Limiter Implementation

```
Implement a thread-safe rate limiter in Python using the token bucket algorithm.

Requirements:
- Class-based interface: RateLimiter(rate: float, capacity: float)
- Method: consume(tokens: float = 1.0) -> bool
- Must be safe for concurrent use across threads
- Use threading.Lock or threading.RLock appropriately
- Include a refill mechanism that is lazy (refill on consume, not on a background thread)
- Add type hints and a short docstring per method

Write the implementation, then write pytest tests covering:
- Normal consumption within capacity
- Consumption that exceeds capacity
- Concurrent consumption from 5 threads
```

o3: Produces a correct token bucket with lazy refill using time.monotonic(), proper lock usage, and a clear consume() that returns False instead of blocking. The pytest suite covers all three specified scenarios and includes a race condition test using threading.Thread. Edge case: handles fractional tokens correctly.

GPT-5: Also correct, and adds a small bonus — returns (bool, float), where the float is the time to wait before retrying when consumption fails, which is often what you actually want in production. Test coverage is complete. Slightly more verbose comments.

Opus 4.7: Correct implementation, and the docstrings are notably better — they explain the algorithm, not just the parameters. Test suite is thorough. Occasionally proposes a slightly more complex refill formula that is more accurate at irregular intervals.

Winner: GPT-5 — The wait-time return value is the kind of production-practical addition that changes how you use the class downstream, without being asked.
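
For calibration, here is one minimal version of what the prompt asks for: lazy refill on consume, a monotonic clock, and a non-blocking consume. This is our sketch, not any model's verbatim output; GPT-5's variant would return the wait time alongside the boolean.

```python
import threading
import time

class RateLimiter:
    """Token bucket rate limiter, safe for concurrent use across threads."""

    def __init__(self, rate: float, capacity: float) -> None:
        self._rate = rate          # tokens added per second
        self._capacity = capacity  # maximum bucket size
        self._tokens = capacity    # start full
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        """Lazily add tokens for elapsed time. Caller must hold the lock."""
        now = time.monotonic()
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now

    def consume(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False instead of blocking."""
        with self._lock:
            self._refill()
            if self._tokens >= tokens:
                self._tokens -= tokens
                return True
            return False
```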


5. Legacy Code Refactor

```
Refactor this Python function. Preserve identical behavior; improve readability, 
reduce cyclomatic complexity, and add type hints.

def process(data, mode, threshold=0.5, extra=None):
    result = []
    for i in range(len(data)):
        if mode == 'filter':
            if data[i] > threshold:
                result.append(data[i])
        elif mode == 'scale':
            if extra is not None:
                result.append(data[i] * extra)
            else:
                result.append(data[i] * 2)
        elif mode == 'flag':
            if data[i] > threshold:
                result.append((data[i], True))
            else:
                result.append((data[i], False))
        else:
            result.append(data[i])
    return result

After refactoring, explain each change and why it improves the code.
```

o3: Splits into three separate functions, adds type hints with overloads where return types differ by mode, and replaces the for i in range(len(data)) pattern with direct iteration. The explanation is methodical — one change per bullet with the rationale.

GPT-5: Takes a similar approach but also introduces a dispatch dictionary pattern (mapping mode strings to lambdas), which reduces the conditional ladder but adds indirection. The explanation is slightly shorter. Both approaches are defensible; GPT-5's is more idiomatic for Python 3.10+.

Opus 4.7: Produces the cleanest function signatures and the most readable explanation, but tends to keep the three modes in one function with a more structured conditional block rather than splitting or dispatching. Valid choice; different opinion on cohesion.

Winner: GPT-5 — The dispatch pattern and idiomatic Python 3.10+ conventions produce code that's easiest to extend without touching existing behavior.
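
As a concrete reference point, a behavior-preserving refactor in the dispatch style GPT-5 takes might look like this. It is our sketch, assuming numeric input data:

```python
from typing import Callable, Optional, Union

Item = Union[float, tuple[float, bool]]

def process(
    data: list[float],
    mode: str,
    threshold: float = 0.5,
    extra: Optional[float] = None,
) -> list[Item]:
    """Same behavior as the original, with the if/elif ladder
    replaced by a dispatch table."""
    scale = extra if extra is not None else 2
    # Each handler returns a list so 'filter' can drop items by returning [].
    handlers: dict[str, Callable[[float], list[Item]]] = {
        "filter": lambda x: [x] if x > threshold else [],
        "scale": lambda x: [x * scale],
        "flag": lambda x: [(x, x > threshold)],
    }
    handler = handlers.get(mode, lambda x: [x])
    return [out for x in data for out in handler(x)]
```

Adding a new mode becomes one new dictionary entry rather than another elif branch, which is the extensibility point the winner call rests on.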


6. SQL Query Optimization

```
This query takes 45+ seconds on a 10M-row orders table. Analyze and rewrite it.

SELECT 
    c.name,
    c.email,
    COUNT(o.id) as order_count,
    SUM(o.total) as lifetime_value,
    MAX(o.created_at) as last_order_date
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.created_at >= '2024-01-01'
    AND o.status != 'cancelled'
GROUP BY c.id, c.name, c.email
HAVING COUNT(o.id) > 2
ORDER BY lifetime_value DESC;

Identify every performance problem, rewrite the query, and specify the indexes needed.
```

o3: Correctly identifies that the WHERE predicates on orders columns silently convert the LEFT JOIN to INNER JOIN semantics (unmatched customers produce NULLs that fail both filters), the missing composite index on (customer_id, created_at, status), and the sort on a computed column without a covering index. Rewrite is clean and explained.

GPT-5: Identifies the same issues and additionally recommends a partial index on (customer_id, created_at) WHERE status != 'cancelled' — a specific and practical optimization that reduces index size. The explanation of why the partial index helps is clear.

Opus 4.7: Identifies all the same issues. Explanation is the most thorough — includes an estimated cost model discussion and explains the LEFT/INNER JOIN behavior in terms a developer who didn't write the original query would immediately understand.

Winner: GPT-5 — The partial index recommendation is a concrete win that o3 and Opus miss, and the explanation is sufficiently detailed without being a lecture.
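
To make the findings concrete, here is one plausible rewrite in the direction all three models converge on, plus the partial index GPT-5 suggests. PostgreSQL syntax; treat it as a sketch, not a drop-in migration:

```sql
-- The WHERE predicates on orders already discarded NULL-extended rows,
-- so say what you mean: an inner join with the filters as join conditions.
SELECT
    c.name,
    c.email,
    COUNT(o.id)       AS order_count,
    SUM(o.total)      AS lifetime_value,
    MAX(o.created_at) AS last_order_date
FROM customers c
JOIN orders o
  ON o.customer_id = c.id
 AND o.created_at >= '2024-01-01'
 AND o.status <> 'cancelled'
GROUP BY c.id, c.name, c.email
HAVING COUNT(o.id) > 2
ORDER BY lifetime_value DESC;

-- Partial covering index in the spirit of GPT-5's recommendation
-- (PostgreSQL 11+ for INCLUDE):
CREATE INDEX idx_orders_customer_recent
    ON orders (customer_id, created_at)
    INCLUDE (total)
    WHERE status <> 'cancelled';
```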


Agentic Tool-Use Loops (Prompts 7–9)

This category tests the model as an agent: given a goal and a set of tools, does it plan sensibly, avoid unnecessary calls, recover from errors, and stop when done?

7. Research-Then-Write Agent

```
You have access to: web_search(query: str) -> list[SearchResult], 
read_url(url: str) -> str, write_file(filename: str, content: str) -> None.

Task: Research the current state of RISC-V adoption in data center CPUs. 
Produce a 600-word briefing saved to "riscv-datacenter-2026.md".

Constraints:
- Use at least 3 distinct sources
- Do not use the same domain twice
- Flag any claim you couldn't verify with [UNVERIFIED]
- Stop when the file is written; do not ask for confirmation
```

o3: Tends to plan the search sequence before executing — identifies 3 distinct query angles, runs them, then reads the most promising URLs. Stops cleanly after writing the file. Occasionally runs one extra verification search.

GPT-5: Executes the tool loop efficiently with minimal redundant calls. The structured output discipline carries over — the briefing is well-formatted with clear headers. Follows the "no same domain twice" constraint reliably.

Opus 4.7: Plans explicitly before acting, which adds one reasoning step but reduces backtracking. The final briefing has more editorial voice than GPT-5's — reads like a human analyst wrote it, not a pipeline. Stops when told to stop, unlike older Claude versions that asked "Shall I proceed?"

Winner: Opus 4.7 — The planning-before-acting behavior reduces wasted tool calls, and the output quality on the final document is meaningfully higher.


8. Debugging Agent

```
You have access to: run_code(code: str, timeout: int = 10) -> dict[str, str], 
read_file(path: str) -> str.

A production script is crashing with: 
"KeyError: 'user_id'" at line 47 of user_processor.py.

The file contents are:
[PASTE FILE CONTENTS]

Diagnose the root cause, write a fix, test the fix by running the modified code, 
and confirm the specific line that caused the issue. Do not patch around the error — 
find why 'user_id' is absent and fix that.
```

o3: Traces the KeyError to its source rather than adding a .get() band-aid. When run_code confirms the fix works, it stops. The diagnosis is structured: hypothesis → test → confirmation.

GPT-5: Also finds the root cause and tests the fix. More likely to add defensive .get() calls around the fix "for safety" even when the root cause is addressed — which can obscure future bugs.

Opus 4.7: Most thorough diagnosis — explicitly rules out multiple hypotheses before committing to one. Uses run_code efficiently: one run to confirm the bug, one to confirm the fix. The explanation of why user_id was absent is the most complete.

Winner: Opus 4.7 — The multi-hypothesis diagnostic approach prevents fixing the symptom instead of the cause, and the interleaved thinking between tool calls produces a cleaner investigation.


9. Data Pipeline Agent

```
You have access to: read_csv(path: str) -> DataFrame, 
transform(df: DataFrame, ops: list[dict]) -> DataFrame,
validate_schema(df: DataFrame, schema: dict) -> ValidationResult,
write_parquet(df: DataFrame, path: str) -> None.

Task: Build a pipeline that reads "raw_sales.csv", cleans it 
(remove nulls in revenue column, deduplicate on order_id, 
cast date column to ISO8601), validates the result against 
the provided schema, and writes to "clean_sales.parquet".

If validation fails, return a structured error report instead of writing the file.
Schema: {order_id: int, date: str (ISO8601), revenue: float, region: str}
```

o3: Chains the tools correctly, handles the conditional (validate before write) cleanly, and the error report on validation failure is well-structured. Doesn't hallucinate tool parameters.

GPT-5: Handles this extremely well — the structured output discipline means the error report format is immediately machine-parseable. The conditional branch is clean. GPT-5 is particularly strong when the success and failure paths need a consistent schema.

Opus 4.7: Correct execution but occasionally adds extra validation passes not specified in the task. The error report is detailed and human-readable; less rigidly structured than GPT-5's.

Winner: GPT-5 — When pipelines need consistent output schema on both success and failure paths, GPT-5's structured output discipline is the right choice.
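
Here is a sketch of the control flow the winners get right, written against the hypothetical tool signatures the prompt defines. The ops dicts and the .ok/.errors fields on ValidationResult are illustrative assumptions, not a real library:

```python
# read_csv, transform, validate_schema, write_parquet are the prompt's
# hypothetical tools; the ops format below is an assumed convention.

SCHEMA = {"order_id": "int", "date": "str (ISO8601)",
          "revenue": "float", "region": "str"}

def run_pipeline() -> dict:
    df = read_csv("raw_sales.csv")
    df = transform(df, ops=[
        {"op": "dropna", "column": "revenue"},
        {"op": "dedupe", "key": "order_id"},
        {"op": "cast_date", "column": "date", "format": "ISO8601"},
    ])
    result = validate_schema(df, SCHEMA)
    if not result.ok:
        # Validation failed: return a structured report, write nothing.
        return {"status": "error",
                "failed_checks": result.errors,
                "rows_checked": len(df)}
    write_parquet(df, "clean_sales.parquet")
    return {"status": "ok", "rows_written": len(df)}
```

The detail GPT-5 wins on is that both return branches share a schema (a status field plus details), so downstream code can parse either outcome without special-casing.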


Long-Context Document Analysis (Prompts 10–12)

This category tests comprehension and extraction across long inputs — legal contracts, codebases, research corpora.

10. Contract Risk Review

```
<contract>
[PASTE FULL CONTRACT TEXT — 15,000–40,000 words]
</contract>

Review this contract from the perspective of the party named [PARTY NAME].
Identify:
1. Clauses that expose [PARTY NAME] to uncapped liability
2. Termination provisions that favor the counterparty
3. IP assignment clauses and what rights [PARTY NAME] retains
4. Any defined terms that are unusually broad or ambiguous
5. Clauses absent that are standard in this contract type

For each finding: cite the clause number, quote the relevant language, 
and explain the risk in plain English. Return findings as a numbered list 
sorted by severity (High / Medium / Low).
```

o3: Handles long contracts well within its context window. Clause citations are accurate. Occasionally miscategorizes a Medium risk as High. Strong on IP and liability clauses.

GPT-5: Structured output discipline shines here — the severity-sorted list is clean and consistently formatted. At very long contract lengths (40K+ words), attention occasionally lapses on clauses buried late in the document.

Opus 4.7: Best at finding the subtle risks — unusual defined terms, absent standard clauses, and ambiguous language that the other two treat as boilerplate. The 1M context window handles even sprawling contracts without degradation.

Winner: Opus 4.7 — The combination of genuine long-context attention and quality of analysis on subtle clause risks makes it the best choice for contract work.


11. Codebase Audit

```
<codebase>
[PASTE OR REFERENCE REPOSITORY — auth module, ~3,000 lines]
</codebase>

Audit this authentication module for:
1. Security vulnerabilities (OWASP Top 10 where applicable)
2. Logic errors that could bypass authentication
3. Missing input validation on user-supplied data
4. Token lifecycle issues (expiry, rotation, revocation)
5. Race conditions in concurrent session handling

For each finding: file, line range, vulnerability class, severity, and a 
one-sentence fix recommendation. Return as a markdown table.
```

o3: Strong on logic errors and authentication bypass patterns. Reliable at spotting things like missing constant-time comparison in token validation or session fixation risks. Generates the markdown table correctly.

GPT-5: Excellent structured output — the table is clean and sortable. Catches the OWASP Top 10 items reliably. Occasionally misses more subtle timing or concurrency issues.

Opus 4.7: Most thorough on the concurrency and token lifecycle findings. The analysis of race conditions in session handling is notably more complete than the other two. Table output is clean when explicitly requested.

Winner: Opus 4.7 — The depth on concurrency and token lifecycle issues matters in production auth code where those are the hardest bugs to catch in code review.
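
The constant-time comparison finding is worth making concrete, since it is the classic version of the bug this audit prompt should surface:

```python
import hmac

# Vulnerable: == short-circuits at the first differing byte, so response
# timing leaks how much of a token an attacker has guessed correctly.
def check_token_naive(provided: str, expected: str) -> bool:
    return provided == expected

# Fixed: constant-time comparison from the standard library.
def check_token(provided: str, expected: str) -> bool:
    return hmac.compare_digest(provided.encode(), expected.encode())
```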


12. Research Synthesis Across Documents

```
<doc1>[PAPER 1 — ~8,000 words]</doc1>
<doc2>[PAPER 2 — ~7,000 words]</doc2>
<doc3>[PAPER 3 — ~9,000 words]</doc3>

These three papers examine [TOPIC] from different methodological perspectives.

Synthesize:
1. Points of consensus across all three papers
2. Points of direct disagreement — cite the specific claims and paper sections
3. Methodological differences that could explain the disagreements
4. Gaps none of the three papers address
5. A 300-word executive summary that someone who hasn't read the papers can act on

Do not summarize each paper individually. Synthesize across them.
```

o3: Follows the "synthesize across, don't summarize individually" instruction. Good at identifying disagreements and tracing them to methodology. The executive summary is clean and non-redundant.

GPT-5: Also follows the instruction well. The structure of the output is cleanest — each of the five sections is clearly demarcated. Slightly less deep on the methodological analysis.

Opus 4.7: Produces the most insightful synthesis. The methodological difference analysis in particular goes further than the other two — it connects how each paper's data collection approach creates structural limitations. The executive summary reads as a genuine synthesis, not a stitched-together summary.

Winner: Opus 4.7 — Cross-document synthesis quality, especially on methodology and gaps, is noticeably higher.


Long-Form Writing & Editing (Prompts 13–15)

This category tests whether the model can produce writing that doesn't need rewriting — voice consistency, structural logic, and not sounding like a language model wrote it.

13. Opinion Essay With a Clear Thesis

```
Write a 1,200-word opinion essay arguing that [POSITION].

Audience: [PUBLICATION TYPE — e.g., general business readers]
Tone: direct, slightly contrarian, willing to name what conventional wisdom gets wrong
Structure: 
- Opening that earns attention without clickbait
- Clear thesis stated in the first 150 words
- Three arguments, each with a specific example or piece of evidence
- Steel-man the opposing view in one paragraph, then respond to it
- Closing that returns to the opening and leaves the reader with something to act on

Do not use transition phrases like "Furthermore" or "In conclusion." 
Write like a person, not a structure diagram.
```

o3: Produces a structurally correct essay. Follows the format instructions reliably. The voice tends toward factual and precise — good for technical opinion pieces, less good for persuasion aimed at general readers.

GPT-5: Strong structural compliance, clean prose. Tends to write in a polished-neutral register — professional but slightly anonymous. The steel-man paragraph is typically the best-written section.

Opus 4.7: The voice is distinctly better. The contrarian tone in the brief lands naturally rather than feeling performed, the examples are more specific, and the opening earns attention without tricks. The writing sounds like a person with a point of view.

Winner: Opus 4.7 — For long-form opinion writing where voice and persuasion matter, the quality gap is large enough to be immediately visible.


14. Substantive Copy Edit

```
Edit this draft for clarity, concision, and argument structure. 
Do not rewrite it — edit it. Preserve my voice and my thesis.

<draft>
[PASTE 800–1,200 word draft]
</draft>

For each significant change:
- Quote the original passage
- Show the edit
- Explain why the change improves the writing

At the end, flag the two or three structural issues in the argument 
that no line-level edit can fix, and suggest how to address them.
```

o3: Edits rather than rewrites — respects the instruction not to take over the piece. The explanations are technically accurate. Occasionally misses voice-specific word choices that are intentional.

GPT-5: Clean edits, good explanations. The structural issues list at the end is consistently useful. Sometimes over-edits toward GPT-5's own register.

Opus 4.7: Best at detecting intentional stylistic choices versus genuine awkwardness — it edits the latter without touching the former. The structural issues analysis at the end is the most incisive of the three.

Winner: Opus 4.7 — The ability to distinguish intentional voice from fixable awkwardness is exactly what distinguishes a good editor from a rewriter.


15. Technical Explainer for a General Audience

```
Write a 900-word explainer on [TECHNICAL TOPIC] for readers who are 
intelligent but not specialists in this field.

Rules:
- No jargon without an immediate plain-English definition
- Every abstract concept gets a concrete analogy before moving on
- Use second-person ("you") to keep the reader engaged
- Structure: problem → current approach → why it's hard → what's changing → what it means for you
- Closing: one specific thing the reader can do or look for because they read this

Do not condescend. Treat the reader as smart but uninitiated.
```

o3: Strong on technical accuracy. The analogies are correct but occasionally feel constructed rather than intuitive. Follows the structure reliably.

GPT-5: Excellent structure compliance. The analogies are polished. The "you" framing is consistent. Tends toward slightly longer explanations than necessary for a general audience.

Opus 4.7: Best analogies — they surface the key insight of a concept naturally rather than forcing a comparison. The writing respects the reader's intelligence without assuming background. The closing action item is typically more specific than the other two.

Winner: Opus 4.7 — Analogy quality and respect for reader intelligence are the hardest things to specify in a prompt and the easiest to notice in output.


Research & Synthesis (Prompts 16–17)

This category tests quality of analysis over breadth of retrieval — can the model evaluate what it finds, not just report it?

16. Competitive Analysis

```
Conduct a competitive analysis of [MARKET/PRODUCT CATEGORY] from the perspective 
of a founder deciding whether to enter this market.

Cover:
1. Market structure: number of competitors, concentration, consolidation trends
2. Differentiation axes: what dimensions competitors actually compete on
3. Where incumbents are weak (from customer reviews, public complaints, market gaps)
4. Barriers to entry: capital, switching costs, network effects, regulation
5. Where a new entrant could realistically win (specific segment or angle, not generic "innovation")
6. Biggest assumption you'd need to validate before committing

Be specific. Name competitors. Identify real weaknesses, not generic "large companies move slowly."
```

o3: Produces a structured analysis with real specificity. The barrier-to-entry section is typically well-reasoned. On the "where a new entrant could win" question, tends toward cautious framing rather than a clear directional recommendation.

GPT-5: Strong at market structure analysis. The competitor naming is reliable. The "where incumbents are weak" section sometimes generalizes from review patterns rather than citing specific complaints.

Opus 4.7: Most opinionated where it should be — the "where a new entrant could win" answer is a specific claim with reasoning, not a hedge. The biggest assumption section is the most useful of the three because it identifies the actual uncertainty rather than listing plausible risks.

Winner: Opus 4.7 — Strategic analysis that takes a position is more useful than analysis that documents without advising.


17. Systematic Comparison of Approaches

```
Compare three approaches to [PROBLEM — e.g., database schema versioning, 
microservice communication, LLM output validation].

For each approach:
1. How it works (2-3 sentences, no fluff)
2. When it's the right choice
3. When it breaks down
4. Operational complexity (low/medium/high + one sentence justification)
5. What teams actually end up regretting about it

End with a decision tree: given [Variable 1], [Variable 2], [Variable 3], 
which approach should you use?
```

o3: Handles the comparison table cleanly. The "what teams end up regretting" section is where o3 pulls ahead — it produces specific, non-obvious observations rather than generic drawbacks.

GPT-5: Excellent format compliance. The decision tree at the end is clean and usable. The "when it breaks down" sections are accurate but sometimes generic.

Opus 4.7: Strong on all sections. The "what teams end up regretting" observations match o3 in quality. The decision tree has more nuance but can become harder to follow when there are many branches.

Winner: o3 — The non-obvious observations in "what teams end up regretting" are the most valuable output in the whole prompt, and o3 produces the most specific version consistently.


Strategic Decision Analysis (Prompts 18–19)

This category tests structured reasoning about decisions with competing considerations, uncertain information, and real consequences.

18. Build vs. Buy vs. Partner Decision

```
I need to decide whether to build, buy, or partner to add [CAPABILITY] to our product.

Context:
- Company stage: [SERIES A / B / BOOTSTRAPPED / etc.]
- Team size: [NUMBER] engineers
- Time to ship: [DEADLINE PRESSURE]
- Budget for acquisition: [RANGE]
- Strategic importance: [CORE / SUPPORTING / EXPERIMENTAL]
- Current state: [WHAT WE HAVE NOW]

Structure your analysis as:
1. Decision criteria and how to weight them for our situation
2. Build option: realistic effort estimate, risks, and long-term implications
3. Buy option: what to evaluate, red flags, integration cost
4. Partner option: what partnership structures exist, where they usually fail
5. Your recommendation with the two conditions that would change it

Do not give me a framework. Give me a recommendation.
```

o3: Follows the structure and gives a recommendation. The "conditions that would change it" section is particularly strong — o3 identifies the actual decision pivots rather than listing generic risks.

GPT-5: Also gives a recommendation and structures the analysis cleanly. Occasionally hedges the recommendation more than necessary ("it depends on your priorities" type language after being explicitly told to recommend).

Opus 4.7: Gives a firm recommendation and is willing to take a counterintuitive position when the context warrants it. The partnership section — particularly where partnerships typically fail — is the most practically useful of the three.

Winner: o3 — The decision pivot analysis is the most structurally rigorous, and o3 commits to a recommendation without hedging it into uselessness.


19. Risk Scenario Planning

```
I'm considering [STRATEGIC DECISION — entering a new market, shipping a 
major feature, changing pricing, key hire].

Run a pre-mortem. Assume it's 18 months from now and this decision failed badly.

Step 1: Generate the 5 most plausible failure scenarios (specific, not generic)
Step 2: For each scenario, identify: 
   - What early indicator would have signaled this was coming (within 90 days)
   - What we could have done differently in the first 30 days
Step 3: Based on these scenarios, what 3 commitments should we make before proceeding?

Be specific. "Market conditions changed" is not a failure scenario. 
"We entered at the same time three incumbents cut price by 40%" is.
```

o3: The pre-mortem scenarios are specific and uncomfortable — which is exactly what a useful pre-mortem requires. The 90-day early indicator identification is where o3 is strongest; the indicators are concrete and monitorable.

GPT-5: Also produces specific scenarios. The "what we could have done differently" answers are practical. Slightly more likely than o3 to include a generic scenario despite the explicit instruction not to.

Opus 4.7: Strong on specificity. The 3 commitments at the end are the most actionable of the three — they read like decisions a real leadership team would actually make, not bullet points from a strategy template.

Winner: o3 — Specific failure scenarios with monitorable early indicators are the core value of a pre-mortem, and o3 delivers the most concrete version.


Creative & Conceptual Work (Prompt 20)

Creative work tests whether a model can do something genuinely unexpected — not competent, but interesting.

20. Concept Development for a Novel Idea

```
Develop this half-formed concept into something real and specific.

The seed: [ONE-SENTENCE RAW IDEA]

I need:
1. Three very different directions this concept could go (distinct premises, not variations)
2. For the direction you think is most interesting: a fully developed version — 
   concrete world, specific characters or entities, the central tension, 
   and what makes it worth the reader's/viewer's/user's time
3. The one thing this concept is really about underneath the surface
4. What this concept is NOT — the adjacent bad version someone would make by default

Push it. The default obvious direction is not interesting.
```

o3: Produces three structurally distinct directions reliably. The "what this concept is NOT" answer is sharp. The fully developed version is coherent but can lean analytical rather than generative — better at identifying what makes a concept work than at making the reader feel it.

GPT-5: The three directions are genuinely different. The fully developed version is polished and clean. The "what it's really about underneath" answer is typically the weakest section — tends toward thematic generalities.

Opus 4.7: The developed version has the most texture — specific details that make the concept feel inhabited rather than described. The "what it's really about underneath" answer is often the best part — it finds the emotional or philosophical core that isn't stated in the seed. The "what it's NOT" answer is the most useful because it's the most specific about the failure mode.

Winner: Opus 4.7 — Creative and conceptual development is where voice, texture, and genuine surprise matter. Opus 4.7 produces work that requires less follow-up to become usable.


Model Selection Rubric

Use this as a routing guide, not a ranking. All three models are capable across categories; this is about where each model earns its cost.

1. Hard math, proofs, multi-step optimization → o3. The explicit reasoning structure and arithmetic reliability on multi-step problems justify the cost. For pure calculation, GPT-5 is close, but o3's self-verification behavior reduces auditing burden.

2. Long-context contracts, codebases, research corpora → Opus 4.7. The combination of genuine long-context attention and quality of analysis on subtle findings (not just top-10 pattern matching) makes it the right call for anything over 20K tokens where the hard insights are buried.

3. Agentic tool-use loops with planning discipline → Opus 4.7. The interleaved thinking between tool calls reduces wasted actions and backtracking. For agents that need to plan before acting, not just react, Opus 4.7 is the better choice.

4. Structured outputs and JSON pipelines → GPT-5. When your pipeline needs a consistent schema — on both success and failure paths — GPT-5's structured output discipline is the production-practical choice. Use it for any workflow where downstream code parses the output.

5. Long-form writing, editing, creative development → Opus 4.7. The gap in voice quality, analogy generation, and editorial judgment between Opus 4.7 and the other two is not subtle. If you can notice the difference between "polished" and "written by a person," Opus wins.

6. Speed- or cost-sensitive triage at scale → none of these models. o3, GPT-5, and Opus 4.7 are all expensive. Classification, routing, summarization of short documents, simple transformations — route these to smaller models (GPT-4o mini, Claude Haiku, Gemini Flash). Using a flagship model for tasks a smaller model handles well is a cost and latency problem.

7. Strategic decisions and pre-mortems → o3. The structured reasoning approach and willingness to commit to specific, uncomfortable conclusions make o3 the strongest strategic thinking partner when the stakes of being vague are high.

8. Production code generation → GPT-5 or Opus 4.7, depending on the task. For code that needs to be clean, idiomatic, and immediately usable: GPT-5 when you need structured output or function-calling integration, Opus 4.7 when you need thorough documentation and edge case analysis.


Before

Use GPT-5 for everything — it's the latest flagship, so it should handle all tasks well. Pay the same per-token rate regardless of task type.

After

Route by task type. Use o3 for hard math and strategic decisions, Opus 4.7 for long-context analysis and writing, GPT-5 for structured pipelines and code with clean outputs. Reserve flagships only for tasks where their specific strengths matter. Drop to smaller models for everything else. Same quality across the board, significantly lower cost and latency where it counts.
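
In code, the routing layer can be as boring as a lookup table. The task labels and model IDs below are placeholders, not real API model names:

```python
# Placeholder model IDs; substitute your provider's actual model names.
ROUTES = {
    "math": "o3",
    "strategic_decision": "o3",
    "long_context": "claude-opus-4.7",
    "long_form_writing": "claude-opus-4.7",
    "agentic_loop": "claude-opus-4.7",
    "structured_pipeline": "gpt-5",
    "code_generation": "gpt-5",
}
DEFAULT = "small-fast-model"  # triage, classification, short summaries

def pick_model(task_type: str) -> str:
    return ROUTES.get(task_type, DEFAULT)
```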

Run Your Own Comparisons

These 20 prompts are a starting point. The most useful thing you can do is take the two or three task types that dominate your actual workload, run them across models, and let the output quality decide — not benchmarks, not brand preference.

If you want a faster way to build structured, production-ready prompts for any of the three models, the AI prompt generator builds model-specific prompts from plain English descriptions — it handles the structural framing so you spend time on the task, not the prompt syntax.

For deeper reading on each model's specific prompt mechanics, the guides on best GPT-5 prompts, best Claude Opus 4.7 prompts, and best o3 prompts go further on model-specific techniques.

Browse Claude Prompts