o3 · reasoning models · OpenAI o3 · AI prompts · prompt templates · 2026

50 Best o3 Prompts in 2026: Math, Code, Planning, Research (Copy-Paste)

50 copy-paste o3 prompts engineered for OpenAI's reasoning model — math problems, hard code, multi-step planning, research synthesis, and strategic decisions. Built around o3's thinking budget.

SurePrompts Team
May 6, 2026
43 min read

TL;DR

Fifty o3 copy-paste prompts written for the reasoning-model paradigm: math and proofs, hard coding problems, multi-step planning, research synthesis, scientific reasoning, decision analysis under uncertainty, and complex troubleshooting. Each prompt explicitly invokes o3's thinking budget rather than fighting it.

Most prompt advice on the internet was written for instruction-following models: GPT-4o, Claude Haiku, Gemini Flash. Give them a role, be specific about format, chain your steps. That playbook is wrong for o3. o3 is a reasoning model — it thinks in hidden tokens before it ever writes the first word of its response, and telling it how to think usually makes things worse, not better. These 50 prompts are built around that reality: state the problem cleanly, set the depth you need, get out of the way.

Why o3 Prompts Are Different

o3 already does the internal reasoning — don't ask for it. When you write "think step by step" or "show your work," you're asking o3 to narrate a process it has already completed internally. At best, you get redundant output. At worst, you force the internal reasoning to follow your script rather than find the best path on its own. The prompt for o3 is the problem statement, not the reasoning procedure.

State the problem and constraints, then ask the question. o3's internal thinking is most powerful when it has a well-defined problem space to search. That means: give it the full context first, specify what's in and out of scope, define the constraints, and only then pose the question. Prompts that lead with the question and fill in context after give the model a weaker starting point for its reasoning pass.

Calibrate thinking depth with reasoning_effort. When calling o3 via the API, the reasoning_effort parameter (values: low, medium, high) controls how many thinking tokens the model allocates. For a combinatorics proof or a formal algorithm correctness argument, use high. For a synthesis task or a planning outline, medium is usually sufficient. For quick triage or a first-pass classification, low saves cost without meaningful quality loss. The prompts below include explicit effort cues where the category warrants it — treat these as API-layer guidance, not text you must include verbatim in every environment.
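If you're calling o3 programmatically, the effort cue belongs in the request, not in the prompt text. Here's a minimal sketch: the category names and the `build_request` helper are our own illustration, and the payload shape follows OpenAI's Chat Completions API at the time of writing, so confirm against current docs before relying on it.

```python
# Sketch: pick a reasoning_effort per task category before calling the API.
# EFFORT_BY_CATEGORY and build_request are illustrative helpers, not SDK code;
# the payload shape follows the OpenAI Chat Completions API (verify against docs).

EFFORT_BY_CATEGORY = {
    "proof": "high",
    "algorithm_correctness": "high",
    "planning": "medium",
    "synthesis": "medium",
    "triage": "low",
}

def build_request(prompt: str, category: str) -> dict:
    """Return a request payload with reasoning_effort set from the task category."""
    return {
        "model": "o3",
        "reasoning_effort": EFFORT_BY_CATEGORY.get(category, "medium"),
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Prove the claim below...", "proof")
print(req["reasoning_effort"])  # high
```

In a chat UI with no effort parameter, drop the category line from the prompt entirely; the depth cue is only meaningful at the API layer.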

Verification framing works. Asking o3 to re-derive an answer from first principles after producing it is not the same as asking it to show its work mid-answer. Verification is a second pass — it catches algebraic slips, logic gaps, and faulty assumptions that even a strong internal reasoning pass can miss. On hard math, proofs, and algorithmic correctness problems, include a verification step in your prompt. On softer synthesis tasks, it's less necessary.

o3 is expensive and slower than GPT-4o — use it where reasoning earns its keep. o3 is the right choice when the problem is genuinely hard: multi-step proofs, correctness-critical code, scenarios with competing constraints, causal inference under uncertainty. It is the wrong choice for summarization, formatting, simple Q&A, and tasks where instruction-following matters more than reasoning. See the full reasoning-model prompting guide at /blog/ai-reasoning-models-prompting-complete-guide-2026 for a complete treatment.

50 copy-paste o3 prompts across math, code, planning, research, science, decisions, and troubleshooting

Math & Logic Prompts (1–8)

1. Multi-Step Word Problem

code
reasoning_effort: high

Problem:
A factory produces two product lines, A and B. Line A requires 3 
hours of machine time and 1 hour of labor per unit. Line B requires 
2 hours of machine time and 4 hours of labor per unit. Available 
capacity this week: 240 machine hours and 200 labor hours. 
Profit margin is $45 per unit of A and $70 per unit of B.

Find the production quantities of A and B that maximize total profit 
subject to the capacity constraints. State whether the optimal solution 
uses all available capacity on one or both resources.

After your answer, verify by substituting your solution back into 
both constraints and confirming feasibility and optimality.
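A nice property of prompts like this one: you can verify o3's answer independently. A stdlib-only sketch that enumerates the vertices of the feasible region (valid here because a linear program's optimum always sits at a vertex):

```python
from itertools import combinations

# Sanity check for Prompt 1: max 45A + 70B subject to
#   3A + 2B <= 240 (machine hours), A + 4B <= 200 (labor hours), A, B >= 0.
# Each constraint is stored as (a, b, rhs) meaning a*A + b*B <= rhs.

constraints = [
    (3, 2, 240),
    (1, 4, 200),
    (-1, 0, 0),   # A >= 0
    (0, -1, 0),   # B >= 0
]

def intersect(c1, c2):
    """Solve the 2x2 system where both constraints hold with equality."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None  # parallel boundary lines
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(pt):
    return all(a * pt[0] + b * pt[1] <= r + 1e-9 for a, b, r in constraints)

vertices = [p for c1, c2 in combinations(constraints, 2)
            if (p := intersect(c1, c2)) is not None and feasible(p)]
best = max(vertices, key=lambda p: 45 * p[0] + 70 * p[1])
print(best, 45 * best[0] + 70 * best[1])  # (56.0, 36.0) 5040.0
```

At (56, 36) both constraints hold with equality (3·56 + 2·36 = 240 machine hours, 56 + 4·36 = 200 labor hours), which also answers the prompt's capacity question: both resources are fully used.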

2. Proof Sketch

code
reasoning_effort: high

Claim: For any prime p > 2, the sum of all positive integers less 
than p that are coprime to p equals p(p-1)/2.

Provide a rigorous proof. You may use standard number theory results 
(Euler's totient function, properties of modular arithmetic) without 
reproving them, but cite each result you invoke.

Success criteria: the proof must handle the structure of the residues 
mod p, not just verify numerically for small cases. After completing 
the proof, identify the one step most likely to contain an error and 
re-examine it.
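For grading o3's answer, the core of the expected proof fits in two lines, because primality makes every residue below p coprime to p:

```latex
% Since p is prime, every k with 1 <= k <= p-1 is coprime to p, so the
% constrained sum is just 1 + 2 + ... + (p-1). Pairing k with p-k:
\sum_{\substack{1 \le k < p \\ \gcd(k,p)=1}} k
  \;=\; \sum_{k=1}^{p-1} k
  \;=\; \frac{1}{2}\sum_{k=1}^{p-1}\bigl(k + (p-k)\bigr)
  \;=\; \frac{p(p-1)}{2}
```

The interesting test is whether o3 notices that the residue structure is trivial for prime p (the general statement for composite n, with sum n·φ(n)/2, is where the pairing argument earns its keep).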

3. Constrained Optimization

code
reasoning_effort: high

I need to allocate a $500,000 annual marketing budget across four 
channels. Historical data gives the following estimated returns 
(incremental revenue per dollar spent, subject to diminishing returns):

Channel A: r(x) = 8 * sqrt(x), where x is dollars spent
Channel B: r(x) = 12 * x^0.4
Channel C: r(x) = 5 * ln(1 + x)  [base-e log]
Channel D: r(x) = 6 * x^0.5

Constraints:
- Minimum $30,000 per channel (contractual)
- Channel C capped at $150,000
- Total budget exactly $500,000

Find the allocation that maximizes total expected incremental revenue. 
State your method, the optimal allocation, and the marginal return 
at the optimum for each channel.
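Because all four return curves are concave, a greedy marginal-allocation loop converges on the constrained optimum, which gives you an independent check on o3's allocation. A rough sketch (the $1,000 step size is our choice; with any non-concave curve this greedy approach would not be valid):

```python
import math

# Numeric check for Prompt 3: allocate in $1,000 steps, always to the channel
# with the highest marginal return, respecting the $30k minimums and C's cap.
# Valid because sqrt, x^0.4, ln(1+x), and x^0.5 are all concave.

marginal = {  # derivative of each channel's return function from the prompt
    "A": lambda x: 4 / math.sqrt(x),        # d/dx 8*sqrt(x)
    "B": lambda x: 4.8 * x ** -0.6,         # d/dx 12*x^0.4
    "C": lambda x: 5 / (1 + x),             # d/dx 5*ln(1+x)
    "D": lambda x: 3 / math.sqrt(x),        # d/dx 6*x^0.5
}
cap = {"A": float("inf"), "B": float("inf"), "C": 150_000, "D": float("inf")}

alloc = {ch: 30_000.0 for ch in marginal}   # contractual minimums
step, remaining = 1_000, 500_000 - 4 * 30_000
for _ in range(remaining // step):
    # give the next $1,000 to the channel with the highest marginal return
    ch = max((c for c in marginal if alloc[c] + step <= cap[c]),
             key=lambda c: marginal[c](alloc[c]))
    alloc[ch] += step
print(alloc)
```

One thing worth knowing before reading o3's answer: Channel C's marginal return (5 per dollar of 1+x) is tiny at $30k and only falls from there, so C should stay pinned at its contractual minimum.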

4. Probability Puzzle

code
reasoning_effort: high

Setup:
A test for a rare condition has 98% sensitivity (true positive rate) 
and 95% specificity (true negative rate). The condition affects 0.3% 
of the general population. A patient tests positive.

Questions:
1. What is the probability the patient actually has the condition?
2. The doctor orders a second independent test with identical 
   characteristics. It also comes back positive. Now what is the 
   posterior probability?
3. How many positive tests in a row would be needed to push the 
   posterior above 90%?

Show Bayesian calculations. State all assumptions explicitly. After 
answering question 3, verify your answer is consistent with the 
pattern established in questions 1 and 2.
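The arithmetic here is easy to check yourself in odds form, where each independent positive test multiplies the prior odds by the same likelihood ratio:

```python
# Check of Prompt 4's numbers: posterior after n independent positive tests,
# using odds form with likelihood ratio = sensitivity / (1 - specificity).

def posterior_after(n_positives, prior=0.003, sens=0.98, spec=0.95):
    lr = sens / (1 - spec)                   # 19.6 per positive test
    odds = prior / (1 - prior) * lr ** n_positives
    return odds / (1 + odds)

for n in range(1, 4):
    print(n, round(posterior_after(n), 4))
```

One positive test lands around 5.6% (the classic base-rate surprise); a second jumps it past 50%; the third clears 90%, so the answer to question 3 is three.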

5. Combinatorics

code
reasoning_effort: high

A committee of 5 is to be chosen from 8 men and 6 women with these 
constraints:
- At least 2 women must be on the committee
- Two specific men (call them M1 and M2) refuse to serve together
- One specific woman (W1) will only serve if at least one of M1 or M2 
  is also on the committee

How many valid committees are possible?

Use casework. For each case, state the logic clearly before computing. 
After reaching your total, verify by checking at least one boundary 
case explicitly.
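Casework prompts are also the easiest to verify mechanically: with only C(14, 5) = 2002 committees, brute force settles the count. A sketch that applies the three constraints literally:

```python
from itertools import combinations

# Brute-force check for Prompt 5. People are labeled so the special members
# (M1, M2, W1) are identifiable.

men = [f"M{i}" for i in range(1, 9)]      # M1 and M2 refuse to serve together
women = [f"W{i}" for i in range(1, 7)]    # W1 only serves with M1 or M2

def valid(committee):
    s = set(committee)
    if sum(1 for p in s if p in women) < 2:
        return False                       # at least 2 women
    if {"M1", "M2"} <= s:
        return False                       # M1 and M2 together: forbidden
    if "W1" in s and not ({"M1", "M2"} & s):
        return False                       # W1 requires M1 or M2 present
    return True

count = sum(1 for c in combinations(men + women, 5) if valid(c))
print(count)
```

If o3's casework total disagrees with the brute-force count, the casework has a missing or double-counted case.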

6. Real Analysis

code
reasoning_effort: high

Let f: [0,1] → R be defined by:
  f(x) = x * sin(1/x) for x ∈ (0,1]
  f(0) = 0

Questions:
1. Is f continuous on [0,1]? Prove your answer.
2. Is f uniformly continuous on [0,1]? Prove your answer.
3. Is f differentiable at x = 0? Prove your answer using the 
   limit definition of the derivative.

Success criteria: each answer must include a complete epsilon-delta 
or limit argument, not just an appeal to intuition. Flag any step 
where the argument depends on a non-obvious interchange of limits.
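If you want to grade o3's answer, the key estimates are short:

```latex
% Continuity at 0 (squeeze): |f(x)| = |x|\,|\sin(1/x)| \le |x| \to 0.
% Uniform continuity: f is continuous on the compact interval [0,1],
% so uniform continuity follows (Heine-Cantor).
% Non-differentiability at 0: the difference quotient
\frac{f(x) - f(0)}{x - 0} = \sin\!\left(\frac{1}{x}\right)
% oscillates between -1 and 1 as x \to 0^+, so the limit does not exist.
```

The expected answers are yes, yes, no; a complete response should produce the squeeze bound and the oscillating difference quotient explicitly rather than gesturing at them.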

7. Applied Math — Compound Finance

code
reasoning_effort: medium

An investor deposits $10,000 into an account at the start of each 
year for 20 years (20 deposits total, first deposit at t=0). The 
account earns 6% annual interest, compounded monthly.

Calculate:
1. The exact value of the account immediately after the 20th deposit
2. The total interest earned (account value minus total principal)
3. The equivalent lump-sum amount at t=0 that would produce the same 
   final balance (present value of the deposit series at 6% monthly 
   compounding)

State the formula you use for each calculation before applying 
numbers. Verify calculation 1 against the formula for the future 
value of an annuity due.
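Calculation 1 can be cross-checked by two independent routes: a month-by-month simulation and the closed-form geometric series, both valued at the moment of the 20th deposit (t = 19 years). A sketch:

```python
# Two independent computations of Prompt 7's quantity (1). Deposits of $10,000
# at the start of each year for 20 years; 6% nominal annual, compounded monthly.

monthly = 0.06 / 12
annual = (1 + monthly) ** 12 - 1          # effective annual rate, about 6.168%

# Simulation: deposit, then grow 12 months; no growth after the 20th deposit.
balance = 0.0
for year in range(20):
    balance += 10_000
    if year < 19:
        balance *= (1 + monthly) ** 12

# Closed form: sum of 10,000 * (1+annual)^k for k = 0..19.
closed_form = 10_000 * ((1 + annual) ** 20 - 1) / annual
print(round(balance, 2), round(closed_form, 2))
```

Note the valuation-date subtlety the prompt is probing: the textbook annuity-due FV is stated one full period after the last deposit, so it exceeds the value "immediately after the 20th deposit" by a factor of (1 + annual rate). A careful answer should flag that.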

8. Constraint Satisfaction

code
reasoning_effort: high

Eight employees (A through H) need to be assigned to four two-person 
project teams (Teams 1–4). Constraints:

- A and B must be on the same team
- C and D cannot be on the same team
- E must be on Team 1 or Team 2
- F must be on a different team from both G and H
- H must be on Team 3 or Team 4
- Team 2 must include at least one of {C, D, G}

List all valid assignments. If there are more than 10, describe the 
complete structure of the solution space rather than enumerating. 
If there are fewer than 5, enumerate all of them explicitly and 
verify each satisfies every constraint.
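With only 8!/2⁴ distinct assignments, this one is also brute-forceable, which makes it a good self-grading prompt. A sketch (it assumes teams are labeled, i.e. Team 1 ≠ Team 2, which the constraints referencing specific team numbers imply):

```python
from itertools import permutations

# Exhaustive check for Prompt 8: assign A..H to labeled teams 1-4, two per
# team, and keep the assignments that satisfy every constraint.

people = "ABCDEFGH"

def valid(team_of):
    return (team_of["A"] == team_of["B"]                        # A with B
            and team_of["C"] != team_of["D"]                    # C apart from D
            and team_of["E"] in (1, 2)                          # E on 1 or 2
            and team_of["F"] != team_of["G"]                    # F apart from G
            and team_of["F"] != team_of["H"]                    # F apart from H
            and team_of["H"] in (3, 4)                          # H on 3 or 4
            and any(team_of[p] == 2 for p in "CDG"))            # Team 2 rule

solutions = set()
for perm in permutations(people):
    # positions 0-1 -> team 1, 2-3 -> team 2, 4-5 -> team 3, 6-7 -> team 4
    team_of = {p: i // 2 + 1 for i, p in enumerate(perm)}
    if valid(team_of):
        solutions.add(frozenset(team_of.items()))
print(len(solutions))
```

The count lands above 10, so a good o3 answer should describe the structure of the solution space (for instance, how the A-B team placement partitions the cases) rather than enumerate.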

Hard Coding Prompts (9–15)

9. Algorithm Design

code
reasoning_effort: high

Problem:
Given a directed graph with N nodes and weighted edges, find the 
minimum cost path from source S to destination T such that:
- The path visits exactly K distinct intermediate nodes (not counting 
  S and T)
- No node is visited twice
- The total weight does not exceed budget B

Constraints: N up to 500, K up to 20, edges can have negative weights 
(but no negative cycles).

Design an algorithm that solves this. Specify:
1. The data structure and state representation
2. The recurrence or search strategy
3. Time and space complexity with justification
4. Why your approach handles negative weights correctly

After describing the algorithm, identify the most likely source of 
an off-by-one or boundary error and explain how your design avoids it.
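The simple-path ("no node visited twice") constraint is what makes this problem genuinely hard, so whatever algorithm o3 designs, validate it against an exhaustive reference on small instances. A sketch of that oracle (exponential by construction, so it's a checker, not the efficient solution the prompt asks for; the adjacency-dict format is our own convention):

```python
# Reference checker for Prompt 9: exhaustive DFS over simple paths from s to t.
# Graph format: {node: [(neighbor, weight), ...]}. No pruning on cost, because
# edge weights may be negative; the budget is checked only at the destination.

def min_cost_path(graph, s, t, k, budget):
    best = None

    def dfs(node, visited, intermediates, cost):
        nonlocal best
        if node == t:
            if intermediates == k and cost <= budget and (
                    best is None or cost < best):
                best = cost
            return
        for nxt, w in graph.get(node, []):
            if nxt not in visited:
                dfs(nxt, visited | {nxt},
                    intermediates + (1 if nxt != t else 0), cost + w)

    dfs(s, {s}, 0, 0)
    return best

g = {0: [(1, 2), (2, 4), (3, 1)], 1: [(3, 3)], 2: [(3, 1)]}
print(min_cost_path(g, 0, 3, 1, 10))  # cheapest 0->3 path with 1 intermediate
```

Feeding a candidate algorithm and this oracle the same randomized small graphs is the fastest way to surface the boundary errors the prompt asks o3 to anticipate.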

10. Complexity Analysis

code
reasoning_effort: high

Analyze the time and space complexity of this algorithm:

[PASTE YOUR ALGORITHM OR PSEUDOCODE HERE]

I need:
1. Best, average, and worst-case time complexity with tight bounds 
   (not just O-notation — prove or disprove that the bound is tight)
2. Space complexity, distinguishing input space from auxiliary space
3. Whether the algorithm is cache-friendly and why that matters 
   for real-world performance at scale
4. The single most impactful optimization that would reduce the 
   dominant cost, with a rough estimate of the constant-factor 
   improvement

Success criteria: the complexity argument must reference specific 
loop iterations or recursion levels, not just pattern-match to 
known algorithms.

11. Concurrency Correctness

code
reasoning_effort: high

I have a multi-threaded cache implementation in [LANGUAGE]:

[PASTE YOUR CONCURRENT CACHE CODE]

Analyze this code for:
1. Race conditions — identify every shared state access that is not 
   atomically protected and describe the interleaving that produces 
   incorrect behavior
2. Deadlock potential — list all lock acquisition sequences and 
   identify any cycle that could deadlock under adversarial scheduling
3. ABA problems (if using compare-and-swap operations)
4. Memory visibility issues on architectures weaker than x86 (ARM, RISC-V)

For each issue found: name the problem, describe the exact thread 
interleaving or instruction reordering that triggers it, and provide 
the corrected code with a comment explaining the fix.

12. Race Condition Reasoning

code
reasoning_effort: high

Context:
Our distributed job queue uses optimistic locking. Workers read a 
job record, set status='processing', and update with a version check. 
We're seeing jobs processed twice roughly 0.1% of the time.

System details:
- PostgreSQL 15, READ COMMITTED isolation
- Workers run on 8 separate machines
- Job pickup query: SELECT ... WHERE status='pending' FOR UPDATE SKIP LOCKED
- Status update: UPDATE jobs SET status='processing', version=version+1 
  WHERE id=? AND version=?

Diagnose all plausible root causes for the double-processing. For 
each hypothesis:
1. Describe the exact event sequence that produces it
2. Rate likelihood given the symptoms (0.1% rate, distributed workers)
3. Provide a test that would confirm or eliminate this hypothesis
4. Provide the fix

After listing hypotheses, rank them by probability and recommend 
where to start the investigation.

13. Formal-Spec Coding

code
reasoning_effort: high

Implement a purely functional, persistent balanced BST (Red-Black Tree) 
in [LANGUAGE] with the following specification:

Operations:
- insert(tree, key, value) → new_tree  [no mutation of original]
- lookup(tree, key) → Option<value>
- delete(tree, key) → new_tree
- rank(tree, key) → int  [0-indexed position in sorted order]
- select(tree, k) → Option<(key, value)>  [k-th element by rank]

Invariants that must hold after every operation:
1. BST ordering property
2. Red-Black coloring rules (root black, no adjacent red nodes, 
   equal black height on all paths)
3. Rank and select are O(log n)

Include an invariant checker function. After writing the implementation, 
trace through an insert followed by delete on a tree of depth 4 and 
verify the invariants hold at each step.

14. Performance Optimization

code
reasoning_effort: medium

This function is called 50 million times per second in a hot path 
of our data pipeline:

[PASTE FUNCTION CODE]

Profile it analytically (no profiler available — reason from first 
principles):
1. Identify the dominant cost: memory allocations, branch mispredictions, 
   cache misses, or CPU-bound computation
2. Propose optimizations in order of expected impact
3. For the top optimization, write the revised code
4. Estimate the speedup using rough cycle counts or allocation counts
5. Identify any correctness risk the optimization introduces and 
   how to test for it

Success criteria: optimization recommendations must be grounded in 
specific CPU or memory behavior, not general "avoid allocations" advice.

15. Hard Debugging

code
reasoning_effort: high

I have an intermittent memory corruption bug in a C++ service. 
Symptoms:
- Crashes occur 2-5 times per day on production (not reproducible 
  in staging)
- Stack traces always point to a different location in the code
- ASAN and Valgrind clean on all test runs
- Crash rate increased 3x after we added a background thread that 
  compacts an in-memory index every 60 seconds

Relevant code:
[PASTE THE BACKGROUND COMPACTION THREAD AND THE DATA STRUCTURES IT TOUCHES]

Generate a differential diagnosis:
1. List all plausible root causes consistent with these symptoms
2. For each: the mechanism, why ASAN misses it in testing, and why 
   production triggers it
3. Ranked by probability given the background thread correlation
4. For the top hypothesis: a targeted test that would reproduce it 
   without relying on timing luck
5. The fix, with an explanation of why it addresses the root cause

Multi-Step Planning Prompts (16–22)

16. Project Plan With Dependency Analysis

code
reasoning_effort: medium

Project: Migrate our monolithic Rails application to a service-oriented 
architecture. Scope includes: user authentication, billing, 
notifications, and the core product domain.

Constraints:
- Zero downtime during migration
- 6 engineers available (2 senior, 4 mid-level)
- 9-month timeline
- Must maintain feature parity throughout
- Billing service requires PCI-DSS compliance review (8-10 weeks)

Produce:
1. A phased migration plan with explicit sequencing rationale
2. A dependency graph (described in text) — which phases block which
3. The critical path and where schedule risk lives
4. Resource allocation per phase
5. Three decision points where the plan should be re-evaluated
6. The single highest-risk assumption embedded in this plan

Do not produce a generic "identify services → extract → deploy" 
template. Make the plan specific to the constraints given.

17. Capacity Planning

code
reasoning_effort: medium

Our SaaS platform currently serves 15,000 daily active users with 
these resource profiles:
- API servers: 8 × c5.2xlarge (8 vCPU, 16 GB RAM), avg 45% CPU
- Database: 1 × db.r6g.4xlarge (16 vCPU, 128 GB RAM), avg 60% CPU, 
  200 GB storage used
- Cache: 2 × cache.r6g.xlarge Redis nodes, 70% memory utilized
- Background jobs: 4 × c5.xlarge, avg 30% CPU

Growth projections: 20% month-over-month for the next 6 months, 
then flattening to 5% per month.

Produce a 12-month capacity plan:
1. Month-by-month resource utilization projections for each tier
2. Recommended scaling actions with timing (which months to act)
3. Estimated cost at each milestone (use rough AWS on-demand pricing)
4. The tier that will hit a hard limit first and why
5. One architectural change that would meaningfully reduce the 
   cost of scaling (beyond just adding instances)

18. Scenario Tree

code
reasoning_effort: high

Decision context:
We are a 40-person B2B SaaS company with $3M ARR and 18 months 
runway. We have received three options from a potential acquirer:

Option A: $18M cash acquisition, close in 60 days
Option B: $25M deal, 60% cash + 40% acquirer stock, close in 6 months, 
  subject to standard reps & warranties
Option C: The acquirer invests $5M as a strategic round at $30M 
  pre-money valuation (we remain independent)

Key uncertainties:
- Acquirer stock (Option B) could be worth 0.5x to 2.5x its stated 
  value in 3 years depending on their trajectory
- Option C requires us to hit 2x ARR within 18 months to raise the 
  next round at a reasonable valuation
- Our current growth rate is 8% month-over-month but slowing

Build a scenario tree that maps outcomes across the key uncertainties. 
For each terminal node, calculate the approximate founder payout 
(assume 25% fully diluted founder ownership post-option-pool). 
Recommend the option and explain the assumptions that drive it.

19. Supply-Chain Logistics

code
reasoning_effort: medium

I run a hardware startup that ships physical devices. Current state:
- 3,000 units per month, growing to 8,000 units/month within 12 months
- Single CM in Shenzhen, 90-day lead time
- Single 3PL in New Jersey
- 35% of customers are in EU (customs delays causing 2-3 week delivery times)
- Current COGS: $42/unit, logistics adds $18/unit average

Design a supply chain architecture for 8,000 units/month that:
1. Reduces EU delivery time to under 10 days
2. Reduces logistics cost per unit by at least 20%
3. Reduces single-CM risk
4. Can be operationally stood up within 9 months

For each recommendation: describe the change, the cost to implement, 
the projected savings or time improvement, and the operational 
complexity it adds. Identify the two changes with the best ROI.

20. Multi-Stage Launch Sequence

code
reasoning_effort: medium

Product: A developer-focused API security tool (static analysis + 
runtime monitoring) targeting engineering teams at Series B–D companies.

Launch objective: 500 paying customers within 6 months of GA, 
with at least 50 customers generating $500+/month in ARR.

Assets available at launch:
- 200 beta users (60% active, primarily at Series A companies)
- 3 design partners at Series C companies (willing to be references)
- $200K marketing budget
- 4-person GTM team (2 sales, 1 demand gen, 1 partnerships)
- An integration with GitHub Actions (marketplace listing)

Design a month-by-month launch sequence. For each month include:
1. The primary motion (PLG, outbound, partner, content — pick the 
   dominant one)
2. Specific activation targets and leading indicators
3. Budget allocation
4. The single biggest risk to that month's plan and the contingency

Flag assumptions about sales cycle length and conversion rates 
explicitly rather than burying them in the plan.

21. GTM Rollout for New Market

code
reasoning_effort: medium

We sell project management software to professional services firms 
(law firms, consulting firms, accounting firms) in the US. We want 
to expand into the UK and Germany.

Current US state: 600 customers, $8M ARR, 92% gross retention, 
4 AEs closing deals.

Constraints for international expansion:
- Budget: $400K first year
- Cannot hire full-time staff in-country until we have $500K ARR 
  in that market
- GDPR compliance required before first EU sale
- Sales process currently takes 45 days on average (US)

Design a 12-month GTM rollout that:
1. Sequences UK before Germany (or makes the case for the reverse)
2. Identifies the top 3 ICP differences between US and UK/DE markets
3. Specifies the channel mix and why it differs from the US playbook
4. Sets go/no-go criteria for hiring in-country staff
5. Identifies the single most likely point of failure and how to 
   detect it early

22. Complex Migration Plan

code
reasoning_effort: high

We need to migrate 8 TB of production data from a self-managed 
PostgreSQL 13 cluster (3 nodes, synchronous replication) to 
Amazon Aurora PostgreSQL, with these requirements:

- Maximum acceptable downtime: 4 hours (maintenance window)
- Data must be consistent at the cut-over point
- Rollback must be possible within 2 hours of cut-over if issues arise
- 200+ application services connect to the database
- Several services use PostgreSQL-specific features: advisory locks, 
  listen/notify, and custom extensions (pg_trgm, uuid-ossp)

Produce:
1. Migration strategy (logical replication vs. physical vs. DMS — 
   justify your choice given the constraints)
2. Pre-migration checklist (schema, extension compatibility, 
   connection string changes)
3. Cut-over runbook with explicit steps, owners, and time estimates
4. Rollback runbook
5. The three highest-risk steps and your mitigation for each

Identify any constraint that is likely to be harder to meet than 
it appears and explain why.

Research & Synthesis Prompts (23–29)

23. Literature Synthesis With Reasoning

code
reasoning_effort: high

Topic: The effectiveness of retrieval-augmented generation (RAG) 
versus fine-tuning for domain adaptation of large language models.

Synthesize what is known about this comparison, covering:
1. The conditions under which RAG outperforms fine-tuning on 
   domain-specific tasks (and the reverse)
2. How knowledge recency interacts with each approach
3. The role of retrieval quality as a limiting factor for RAG
4. Compute cost tradeoffs at inference vs. training time
5. Evidence on catastrophic forgetting in fine-tuned models when 
   applied to out-of-distribution queries

For each claim, indicate whether it is (a) well-supported by 
multiple independent sources, (b) based on a single line of 
evidence, or (c) a reasonable inference with limited direct evidence. 
Flag claims where your training knowledge may be outdated.

24. Contradiction Resolution

code
reasoning_effort: high

I have two research findings that appear to contradict each other:

Finding A: [PASTE FINDING — include methodology summary]
Finding B: [PASTE FINDING — include methodology summary]

Analyze whether this is a genuine contradiction or an apparent one. 
Specifically:
1. Identify every methodological difference that could explain 
   the divergence (population, measurement, confounders, time horizon)
2. Assess whether the two findings are actually measuring the same 
   construct
3. If the contradiction is genuine, propose the most parsimonious 
   theoretical account that reconciles both
4. Describe the study design that would resolve the question 
   empirically, and what result would support each finding

State your confidence level in each part of your analysis and 
the assumptions that underpin it.

25. Hypothesis Generation With Priors

code
reasoning_effort: high

Observation:
In our longitudinal user study (n=340, 6 months), users who adopted 
our AI writing assistant in the first week retained at significantly 
higher rates at 3 months (78% vs. 54% for later adopters). The 
effect persists after controlling for prior tool usage and role.

Generate at least 6 hypotheses that could explain this pattern. 
For each hypothesis:
1. The causal mechanism (why early adoption would cause higher retention)
2. The prior probability you'd assign before seeing the data 
   (low/medium/high) and your reasoning
3. What observable data in our current dataset would support or 
   undercut this hypothesis
4. What additional data collection would most cleanly test it

Rank the hypotheses by a combination of explanatory power and 
testability. Identify which hypotheses are mutually exclusive and 
which can coexist.

26. Experimental Design

code
reasoning_effort: high

I want to run an experiment to test whether adding an AI-generated 
summary at the top of long-form B2B content (>2,000 words) increases 
time-on-page and downstream conversion rates.

Context:
- 80,000 monthly unique visitors to the content section
- Average content length: 2,400 words
- Current avg time-on-page: 3.2 minutes
- Current content-to-lead conversion: 2.1%
- 40% of traffic is mobile

Design the experiment:
1. Experimental unit (page, session, or user) — justify your choice
2. Randomization and assignment strategy
3. Required sample size for 80% power to detect a 15% relative 
   improvement in the primary metric, at p < 0.05
4. Primary metric, secondary metrics, and guardrail metrics
5. Minimum detectable effect given realistic traffic
6. Risks that could invalidate the experiment (contamination, 
   novelty effects, selection bias) and mitigations
7. Decision rule: what result triggers rollout, what triggers iteration
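Step 3 is worth pre-computing yourself before reading o3's design, since it constrains everything else. A back-of-envelope version with the standard two-proportion formula (z-values hard-coded for α = 0.05 two-sided and 80% power):

```python
import math

# Per-arm sample size for Prompt 26's primary conversion metric:
# detect a 15% relative lift on the 2.1% baseline, 80% power, alpha = 0.05.

def n_per_arm(p1, p2, alpha_z=1.9600, power_z=0.8416):
    # Standard normal-approximation formula for two independent proportions.
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((alpha_z + power_z) ** 2 * var / (p1 - p2) ** 2)

baseline = 0.021
lifted = baseline * 1.15
n = n_per_arm(baseline, lifted)
print(n)
```

The result is roughly 35,000 per arm, so two arms consume most of a month of the 80,000 monthly visitors. A good experimental design from o3 should confront that directly (longer runtime, a larger MDE, or a different primary metric).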

27. Evidence Grading

code
reasoning_effort: medium

I am evaluating whether to implement a structured onboarding 
checklist for new SaaS customers (vs. our current free-form 
onboarding calls) to improve 30-day activation rates.

I have collected the following evidence:
1. Our internal data: customers who completed 7+ product actions 
   in week 1 have 3x higher 90-day retention
2. A case study from a competitor blog claiming checklist onboarding 
   improved their activation by 40%
3. Two SaaS industry benchmarks reports (source and methodology unknown)
4. A randomized experiment by a SaaS research firm showing 
   structured onboarding improved 30-day activation by 12% 
   (n=1,200, p=0.03)
5. Three customer interviews where customers said they felt "lost" 
   in the first week

Grade each piece of evidence on: reliability, relevance to our 
situation, and the inference it supports. Then give an overall 
evidence-strength rating for the decision. Identify the weakest 
inference in the chain between evidence and recommendation.

28. Claim Verification

code
reasoning_effort: high

Claim to evaluate: "Transformer models trained on code are more 
sample-efficient than those trained on natural language because 
code has more regular syntactic structure, which reduces the 
hypothesis space the model must search during training."

Evaluate this claim by:
1. Breaking it into its constituent sub-claims
2. Assessing the logical validity of the argument structure 
   (does the conclusion follow from the premises, assuming 
   the premises are true?)
3. Evaluating the empirical status of each premise
4. Identifying the weakest link in the argument
5. Proposing a study that would provide strong evidence for 
   or against the core claim

Do not simply agree or disagree. Treat this as a structured 
argument analysis, not an opinion question.

29. Theoretical Framework Comparison

code
reasoning_effort: high

Compare two competing theoretical frameworks for explaining 
[PHENOMENON IN YOUR FIELD]:

Framework A: [NAME AND 2-3 SENTENCE DESCRIPTION]
Framework B: [NAME AND 2-3 SENTENCE DESCRIPTION]

For each framework:
1. Core ontological commitments (what entities and processes it posits)
2. Key predictions that differ from the competing framework
3. Empirical evidence it best explains
4. Evidence that strains or contradicts it
5. The methodological assumptions required to test it

Then assess:
- Whether the frameworks are genuinely competing or address 
  different aspects of the phenomenon
- The type of evidence that would most decisively favor one 
  over the other
- Whether a third framework or integration is suggested by 
  the gaps in both

Success criteria: the analysis should be useful to someone who 
already knows both frameworks and is trying to decide which to 
work within.

Scientific Reasoning Prompts (30–36)

30. Biomedical Reasoning

code
reasoning_effort: high

Clinical scenario:
A 58-year-old patient presents with progressive fatigue, 
unexplained 8 kg weight loss over 4 months, and mild right 
upper quadrant discomfort. Labs show: elevated alkaline 
phosphatase (3x upper limit of normal), mildly elevated ALT/AST 
(1.5x ULN), total bilirubin normal, CA 19-9 elevated at 180 U/mL. 
CBC normal. No history of liver disease or alcohol use.

This is a reasoning exercise, not a diagnostic consultation.

Analyze the differential diagnosis:
1. List the 4 most likely diagnoses consistent with this 
   presentation, ranked by probability
2. For each: the pathophysiology explaining each abnormal finding
3. The single most discriminating next test for the top two diagnoses
4. What feature of this presentation most constrains the differential

Identify any finding that is inconsistent with your top diagnosis 
and how you account for it.

31. Physics Problem

code
reasoning_effort: high

A thin uniform rod of mass M and length L is initially at rest, 
lying on a frictionless horizontal surface. An impulse J is applied 
perpendicular to the rod at a point P located at distance d from 
the center of mass (0 < d ≤ L/2).

Find:
1. The velocity of the center of mass immediately after the impulse
2. The angular velocity of the rod immediately after the impulse
3. The location along the rod that is instantaneously at rest 
   immediately after the impulse (the "center of percussion")
4. The value of d for which the center of percussion coincides 
   with the end of the rod

After solving, verify that your answer to (3) reduces to the 
known result for d = L/2 (impulse at the end of the rod). 
Check dimensional consistency throughout.
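The closed-form answers are easy to spot-check numerically. With v_cm = J/M, I_cm = ML²/12, and ω = Jd/I_cm, the instantaneously-at-rest point sits at distance x = v_cm/ω = L²/(12d) from the center of mass, on the opposite side from P. A dimensionless sketch:

```python
# Numeric spot-check for Prompt 31 with M = 1, L = 1, J = 1.
# v_cm = J/M; omega = J*d / (M*L^2/12); rest point at x = v_cm/omega from CM.

M, L, J = 1.0, 1.0, 1.0

def rest_point_offset(d):
    """Distance from the CM (opposite side from P) of the momentarily-at-rest point."""
    v_cm = J / M
    omega = J * d / (M * L**2 / 12)
    return v_cm / omega              # = L^2 / (12 d)

print(rest_point_offset(L / 2))      # impulse at the end -> rest point L/6 from CM
print(rest_point_offset(L / 6))      # d = L/6 -> rest point at L/2, i.e. the end
```

Both verification cases in the prompt fall out of the same formula: d = L/2 gives L/6 (the classic center-of-percussion result), and d = L/6 pushes the rest point to the rod's end.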

32. Chemistry Mechanism

code
reasoning_effort: high

Reaction:
A secondary alkyl bromide undergoes treatment with sodium ethoxide 
in ethanol at 55°C. The product distribution is 70% E2 elimination 
product and 30% SN2 substitution product.

Analyze this reaction:
1. Explain the mechanistic basis for the observed product ratio, 
   citing the factors that favor E2 vs. SN2 for this substrate 
   and these conditions
2. Predict how the product ratio would change if:
   a. The reaction is run at 25°C instead of 55°C
   b. Sodium methoxide in methanol is used instead
   c. A tertiary alkyl bromide is used instead
3. Draw the energy profile (describe in text) for the competing 
   pathways, noting which transition state is higher in energy 
   for this case

For each prediction in (2), state the specific mechanistic 
reason for the change, not just the direction of the effect.

33. Statistical Analysis

code
reasoning_effort: high

I ran a 2×2 factorial experiment testing two variables:
- Factor A: feature flag on/off
- Factor B: user segment (new vs. returning)

Primary metric: 7-day retention rate
Sample sizes and outcomes:
- A=off, B=new:      n=2,340, retention=28.3%
- A=on,  B=new:      n=2,287, retention=31.7%
- A=off, B=returning: n=4,102, retention=61.2%
- A=on,  B=returning: n=4,218, retention=64.9%

Analyze:
1. Main effect of Factor A (statistical test and interpretation)
2. Main effect of Factor B
3. Interaction effect — is the effect of A different for new 
   vs. returning users? Conduct the appropriate test.
4. Multiple comparisons adjustment — which corrections apply here 
   and why
5. Practical significance: is the effect size meaningful for 
   a product decision, and what would you recommend?

State your statistical assumptions and flag any that may be 
violated by the design or data.
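You can pre-compute the simple effects yourself with nothing but the standard library, which makes it easy to tell whether o3's test choices and arithmetic hold up. A sketch of the pooled two-proportion z-test applied to Factor A within each segment (the interaction and multiple-comparison questions still need the fuller analysis the prompt asks for):

```python
import math

# Pooled two-proportion z-test for Prompt 33: effect of the feature flag
# (Factor A) within each user segment, two-sided p-value via erfc.

def two_prop_z(p1, n1, p2, n2):
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided
    return z, p_value

print("new:       z=%.2f p=%.4f" % two_prop_z(0.283, 2340, 0.317, 2287))
print("returning: z=%.2f p=%.4f" % two_prop_z(0.612, 4102, 0.649, 4218))
```

Both simple effects come out significant at the 0.05 level; the open question for o3 is whether the two lifts differ from each other (the interaction), which this sketch deliberately does not answer.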

34. Causal Inference

code
reasoning_effort: high

Observational study:
We observed that users who receive in-app coaching messages 
have 40% higher 60-day retention than users who do not. 
The messages are triggered automatically when users show 
certain behavioral signals (low activity, skipped onboarding 
steps, etc.).

I want to claim that coaching messages cause higher retention. 
Analyze the causal inference challenge:

1. Identify all plausible confounders between "receives coaching" 
   and "retains at 60 days"
2. Explain why the trigger mechanism makes naive comparison 
   especially unreliable
3. Propose the cleanest observational study design that would 
   reduce confounding (assuming we cannot run an RCT)
4. Describe what a properly designed RCT would require, 
   including how you'd handle the ethical concern of 
   withholding coaching from struggling users
5. What estimate of the true causal effect would you be 
   comfortable defending given only observational data?

35. Experimental Result Interpretation

reasoning_effort: high

Experimental result:
We trained two versions of a text classifier:
- Model A: fine-tuned on 10,000 labeled examples from our domain
- Model B: zero-shot prompting of a large foundation model

Test set performance (held-out 2,000 examples, same distribution 
as training data):
- Model A: F1 = 0.91, precision = 0.93, recall = 0.89
- Model B: F1 = 0.84, precision = 0.88, recall = 0.80

However, on a stress-test set of 500 adversarial and out-of-distribution 
examples we collected separately:
- Model A: F1 = 0.62
- Model B: F1 = 0.79

Interpret these results:
1. What does the in-distribution gap tell us about fine-tuning vs. 
   prompting for this task?
2. What does the OOD reversal suggest about the nature of what 
   Model A learned?
3. What are the production deployment implications if roughly 
   15% of real-world inputs are OOD?
4. Design a follow-up experiment that would determine whether 
   Model A's OOD degradation is due to distribution shift or 
   adversarial manipulation specifically

State confidence levels for each interpretive claim.
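A rough blend of the two F1 scores at the stated 15% OOD share helps anchor question (3). Note the hedge: F1 does not decompose linearly across subsets (an exact number needs pooled TP/FP/FN counts), so this is strictly a back-of-envelope check:

```python
# Back-of-envelope only: F1 is not linearly decomposable across subsets,
# but a weighted blend gives a rough feel for the production mix.
ood_share = 0.15
f1_a = (1 - ood_share) * 0.91 + ood_share * 0.62   # Model A (fine-tuned)
f1_b = (1 - ood_share) * 0.84 + ood_share * 0.79   # Model B (zero-shot)

# OOD share at which the naive blend says B overtakes A:
# 0.91 - 0.29*s = 0.84 - 0.05*s  ->  s = 0.07 / 0.24
crossover = 0.07 / 0.24

print(round(f1_a, 3), round(f1_b, 3), round(crossover, 2))
```

Under this approximation Model A still edges out Model B at a 15% OOD mix, and the crossover sits near a 29% OOD share, which is exactly the kind of threshold the deployment question in (3) should surface.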

36. System Dynamics Modeling

reasoning_effort: high

I want to model the growth dynamics of a two-sided marketplace 
with these structural elements:
- Buyers and sellers who both exhibit network effects (more of 
  each side increases value for the other side)
- A "quality deterioration" effect: as low-quality sellers are 
  attracted by platform growth, average transaction quality declines
- A trust-based retention mechanism: buyers who have a bad 
  transaction churn at 3x the normal rate and take their 
  social network connections with them

Construct a system dynamics model:
1. Define the key stocks (state variables) and flows (rates of change)
2. Write out the causal relationships with sign and feedback loop type 
   (reinforcing vs. balancing)
3. Identify all feedback loops and classify them
4. Describe the qualitative behavior this system would exhibit 
   at different levels of seller quality enforcement
5. Identify the highest-leverage intervention point to prevent 
   quality-driven collapse at scale

You do not need to provide numerical equations — reason about 
the structure and behavior qualitatively, with precision.
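The qualitative behavior in step (4) can also be sketched numerically. Every coefficient below is illustrative (none come from the prompt); the point is only that the named loop structure grows under strong quality enforcement and collapses under weak enforcement:

```python
def simulate(enforcement, steps=120):
    """Toy Euler integration of the marketplace loops in prompt 36.

    All coefficients are illustrative. Reinforcing loop: buyers grow in
    proportion to quality. Balancing loop: bad transactions drive 3x
    churn. Growth erodes quality; enforcement pulls it back toward 0.9.
    """
    buyers, quality = 1000.0, 0.90
    for _ in range(steps):
        growth = 0.08 * buyers * quality                  # reinforcing loop
        churn = 0.05 * buyers * (1 + 3 * (1 - quality))   # trust-based churn
        quality += enforcement * (0.90 - quality) - 0.04 * growth / max(buyers, 1.0)
        quality = min(max(quality, 0.0), 1.0)
        buyers = max(buyers + growth - churn, 0.0)
    return buyers, quality

weak, strong = simulate(0.0), simulate(0.5)
print(round(weak[0]), round(strong[0]))  # collapse vs. growth
```

With enforcement off, quality erodes until the churn loop dominates and the buyer base collapses; with enforcement on, quality stabilizes near its target and the reinforcing loop wins. That is the quality-driven-collapse dynamic the prompt asks o3 to reason about in prose.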

Decision Analysis Prompts (37–43)

37. Decision Under Uncertainty

reasoning_effort: high

Decision:
We must decide by Friday whether to sign a 3-year office lease at 
$180,000/year. The alternative is month-to-month at $22,000/month. 
We have 28 employees and expect to grow, but there is real uncertainty.

Relevant uncertainties:
- 30% chance we raise a Series A within 12 months and need to move 
  to a larger space
- 20% chance of a significant market downturn in the next 18 months 
  that triggers headcount reductions (to ~15 employees)
- 70% chance we remain roughly stable and the long-term lease saves money

Sublease risk: if we need to exit the 3-year lease early, we estimate 
60% probability of finding a subtenant within 6 months (at break-even), 
and 40% probability of carrying 3-6 months of empty rent.

Build a decision tree with expected value calculations for both options. 
State all assumptions. Then do a sensitivity analysis: at what 
probability of Series A fundraising does the month-to-month option 
become preferable in expected value terms?
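A few branch costs are worth checking by hand before trusting the model's tree. This sketch covers only the stable scenario and one early-exit branch; note that the prompt's three scenario probabilities sum to 1.2, so o3 has to treat them as non-exclusive or normalize them, and a hand check of per-branch costs sidesteps that:

```python
LEASE_YEAR, MTM_MONTH = 180_000, 22_000

# Per-branch 3-year cost, stable scenario: the lease saves $252K.
stable_lease = 3 * LEASE_YEAR    # 540_000
stable_mtm = 36 * MTM_MONTH      # 792_000

# Early exit after 12 months (e.g. the Series A move): 60% sublease at
# break-even, 40% carry ~4.5 months of empty rent (midpoint of 3-6).
exit_cost = LEASE_YEAR + 0.40 * 4.5 * (LEASE_YEAR / 12)

print(stable_mtm - stable_lease, round(exit_cost))
```

The exit-timing assumption (12 months) is hypothetical; the sublease probabilities come from the prompt.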

38. Expected-Value Calculation

reasoning_effort: medium

I'm evaluating three potential product investments for next quarter. 
Each requires engineering and PM resources:

Investment A: New integration with Salesforce
- Development cost: 6 engineer-weeks + 2 PM-weeks
- Estimated ARR impact: $200K–$400K (uniform distribution over range)
- Probability of technical success: 90%
- Time to revenue: 3 months after ship

Investment B: Self-serve onboarding flow redesign
- Development cost: 4 engineer-weeks + 3 PM-weeks
- Estimated ARR impact: $100K–$600K (skewed right — most likely $150K, 
  max $600K)
- Probability that impact exceeds $200K: 25%
- Time to revenue: 1 month after ship

Investment C: Enterprise reporting dashboard
- Development cost: 10 engineer-weeks + 2 PM-weeks
- Estimated ARR impact: $0K or $500K (binary, depending on whether 
  we close 3 specific enterprise deals)
- Probability of closing all 3 deals given the feature: 40%
- Time to revenue: 6 months after ship

Available capacity: 12 engineer-weeks, 5 PM-weeks this quarter.

Calculate expected ARR per engineer-week for each investment. 
Identify the portfolio that maximizes expected ARR within the 
capacity constraint, and flag which investment the EV calculation 
most likely undervalues due to distribution shape.
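The expected values here are simple enough to verify directly. The point estimates below are assumptions: the midpoint of the uniform range for A, and the stated mode for B as a conservative stand-in, since the prompt's skewed distribution is underspecified:

```python
# Expected ARR using hypothetical point estimates (see lead-in):
ev_a = 0.90 * (200_000 + 400_000) / 2   # 270K over 6 eng-weeks + 2 PM-weeks
ev_b = 150_000                          # mode as conservative stand-in, 4 + 3
ev_c = 0.40 * 500_000                   # 200K over 10 eng-weeks + 2 PM-weeks

per_week = {"A": ev_a / 6, "B": ev_b / 4, "C": ev_c / 10}
print({k: round(v) for k, v in per_week.items()})
# A + B uses 10 of 12 eng-weeks and exactly 5 PM-weeks; C alone would
# consume 10 eng-weeks for the lowest EV per week.
```

This also illustrates the prompt's final question: B's right-skewed distribution means the mode understates its expectation, so B is the investment the point-estimate EV most likely undervalues.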

39. Risk-Weighted Strategy

reasoning_effort: high

Context:
We are a 15-person cybersecurity startup with $4M in the bank, 
$1.2M ARR, and 14 months of runway. We have an opportunity to 
pursue a large government contract ($2.8M, 18-month engagement) 
that would require us to:
- Hire 4 specialized engineers immediately
- Achieve FedRAMP authorization (12–18 months, $300K–$500K cost)
- Divert 60% of engineering capacity from our commercial product

Alternative: focus entirely on the commercial market where we 
have early traction.

Produce a risk-weighted strategic analysis:
1. Financial scenario modeling for both paths (at least 3 scenarios each)
2. Strategic option value: what does each path open or close?
3. Key risks that are not visible in the financial model
4. Decision criteria that would make the government path clearly right 
   or clearly wrong
5. A recommended decision and the conditions under which you'd reverse it

State the assumptions that most affect the recommendation.

40. Multi-Stakeholder Negotiation

reasoning_effort: high

Situation:
I am negotiating a software licensing agreement with a large enterprise 
customer. Three stakeholders are involved on their side:

- VP of Engineering (technical buyer): wants perpetual license, 
  worried about vendor lock-in, prefers on-premises deployment
- CFO: wants annual subscription to spread cost, concerned about 
  total cost of ownership over 5 years
- CISO: requires data residency guarantees, SOC 2 Type II, and 
  the right to audit our infrastructure

My constraints:
- Perpetual license is technically feasible but unfavorable — it 
  removes recurring revenue
- On-premises deployment is possible but costs us $80K in 
  setup and ongoing support
- We have SOC 2 Type II; audit rights are negotiable
- We need to close by end of quarter (6 weeks)

Design a negotiation strategy that:
1. Maps each stakeholder's stated and likely unstated interests
2. Identifies value trades that satisfy each party differently
3. Sequences the negotiation to resolve the highest-risk objection first
4. Proposes fallback positions for each key term
5. Identifies the concession that costs us least but has highest 
   perceived value to them

41. Build vs. Buy With TCO

reasoning_effort: medium

Decision: should we build or buy an internal data pipeline and 
transformation layer?

Build option:
- 3 engineers × 4 months to build v1
- Ongoing maintenance: 0.5 FTE per year
- Can be tailored exactly to our needs
- Full control over roadmap

Buy options (3 vendors evaluated):
- Vendor X: $3,500/month, covers 80% of our use cases, 2-week implementation
- Vendor Y: $8,000/month, covers 95% of use cases, 6-week implementation
- Vendor Z: $2,000/month, covers 65% of use cases, strong open-source 
  ecosystem but requires in-house customization (estimate 6 engineer-weeks)

Engineering cost: $180,000 fully loaded per engineer per year
Current data volume: 50M events/day, growing 15% per month

Build a 3-year TCO analysis for each option. Include:
1. Year-by-year cost breakdown
2. Hidden costs (integration, migration, vendor risk, coverage gaps)
3. Break-even point for the build option vs. each vendor
4. The factor that most changes the recommendation under sensitivity analysis
5. Your recommendation and the key assumption it depends on
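A minimal version of the TCO arithmetic, under stated simplifications (maintenance billed for all three years, no discounting, flat vendor pricing despite the 15%/month data growth, all of which the full analysis should revisit):

```python
ENG_YEAR = 180_000            # fully loaded engineer cost per year
ENG_WEEK = ENG_YEAR / 52

# Build: 3 engineers x 4 months = 12 eng-months = 1 eng-year up front,
# then 0.5 FTE/year maintenance (counted for all 3 years, which slightly
# overstates since maintenance starts after v1 ships).
build = 1.0 * ENG_YEAR + 3 * 0.5 * ENG_YEAR   # 450K

vendor_x = 36 * 3_500                         # 126K
vendor_y = 36 * 8_000                         # 288K
vendor_z = 36 * 2_000 + 6 * ENG_WEEK          # 72K + ~20.8K customization

print(build, vendor_x, vendor_y, round(vendor_z))
```

Even this crude version shows why the hidden costs in item (2) matter: on raw 3-year cost alone, every vendor beats building, so the recommendation hinges on coverage gaps, migration cost, and the usage-based pricing that 15%/month growth implies.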

42. Hiring Decision Matrix

reasoning_effort: medium

I need to hire a VP of Engineering. After a full process, I have 
two finalists:

Candidate A:
- 12 years experience, 8 as an engineering manager
- Grew engineering team from 8 to 60 at a B2B SaaS company
- Strong operator: excellent at process, hiring, performance management
- Lacks deep technical background in our stack (ML infrastructure)
- Cultural fit: methodical, process-driven; our culture is currently scrappy

Candidate B:
- 9 years experience, 4 as an engineering manager
- Technical depth in ML infrastructure is exactly what we need
- Built high-output teams but never at scale above 20 engineers
- Cultural fit: builder's mindset, aligns with current team
- Weak references on stakeholder management and cross-functional 
  alignment

Our context: 18-person engineering team, building ML-heavy product, 
Series B fundraise in 12 months, need to 3x team size in 18 months.

Build a decision framework:
1. Weight the relevant criteria for our situation
2. Score each candidate against each criterion with your reasoning
3. Identify the scenario where each candidate clearly outperforms
4. Flag the highest-risk assumption in choosing each
5. Recommend one — and state the one thing you'd verify before signing

Do not give a false balance. Make a call.

43. Board-Level Tradeoff

reasoning_effort: high

Context:
Our board is debating whether to pursue an aggressive growth strategy 
or extend runway by cutting costs. Company state:
- $12M ARR, growing 65% year-over-year
- $18M raised, $9M remaining, 14 months runway at current burn
- $1.8M monthly burn rate
- NRR: 118%, CAC payback: 11 months
- Market: competitive, two well-funded competitors raised large 
  rounds in the past 6 months

Option A (aggressive growth): Increase burn to $2.8M/month, 
hire 15 salespeople and 8 engineers in 90 days, target $22M ARR 
in 12 months. Requires raising $15M within 6 months.

Option B (default alive): Cut burn to $1.2M/month, reduce team 
by 20%, extend runway to 30+ months, grow to $17M ARR at 
reduced pace. Raises from position of strength.

Produce a board-quality analysis:
1. The capital market assumptions each option requires
2. The company's defensibility in each scenario if fundraising 
   is slower than planned
3. What the unit economics imply about the right growth rate
4. The competitive dynamics argument for each option
5. A recommendation with explicit conditions that would change it

Treat this as advice to a board, not a summary of the tradeoffs.
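The "default alive" asymmetry in this prompt can be sketched with a simple net-burn simulation. The Option B growth rate below is a hypothetical stand-in (the prompt only says "reduced pace"), and the sketch pessimistically assumes Option A's extra spend does not lift growth before cash-out; everything else comes from the prompt:

```python
def months_of_runway(cash_m, burn_m, mrr_m, monthly_growth, horizon=36):
    """Months until cash runs out with no new funding; None if cash
    never depletes within the horizon (i.e. 'default alive')."""
    for month in range(1, horizon + 1):
        cash_m -= burn_m - mrr_m          # net burn this month
        mrr_m *= 1 + monthly_growth       # revenue compounds monthly
        if cash_m <= 0:
            return month
    return None

# $12M ARR -> $1.0M MRR; 65%/yr is roughly 4.3%/mo. The 2.5%/mo rate
# for Option B's reduced pace is a hypothetical assumption.
option_a = months_of_runway(9.0, 2.8, 1.0, 0.043)
option_b = months_of_runway(9.0, 1.2, 1.0, 0.025)
print(option_a, option_b)
```

The asymmetry is stark: without the planned $15M raise, Option A's cash is gone in roughly half a year, while Option B's net burn crosses zero within months and never depletes the balance. That is the capital-market dependency item (1) asks the model to make explicit.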

Complex Troubleshooting Prompts (44–50)

44. Production Incident Root Cause

reasoning_effort: high

Incident timeline:
- 14:32 UTC: API error rate spikes from 0.1% to 18%
- 14:33 UTC: Alerting fires on P95 latency (2.1s vs. 200ms baseline)
- 14:35 UTC: On-call engineer begins investigation
- 14:41 UTC: Deployment rolled back (deployed at 14:28 UTC)
- 14:48 UTC: Error rate drops to 2% but does not fully recover
- 15:10 UTC: Error rate returns to baseline after database connection 
  pool restart

Deployment at 14:28 UTC: added a new database query to the 
user session validation path (runs on every authenticated request).

Database metrics during incident:
- Active connections: jumped from 45 to 195 (pool max: 200)
- Query duration for new query: median 850ms, P99 12s
- Existing queries: median latency increased 3x during incident
- No deadlocks or lock waits recorded

Application metrics:
- 12 out of 16 API pods showed elevated errors
- 4 pods that recently restarted showed no errors

Perform root cause analysis:
1. Primary root cause with supporting evidence
2. Contributing factors that explain why recovery was partial 
   after rollback
3. Why 4 pods were unaffected
4. Three systemic fixes (not just "add index") with priority order
5. Detection improvements that would have caught this before impact
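The connection-pool arithmetic in this incident is worth checking with Little's law (concurrency = arrival rate times hold time). The request rate below is a hypothetical assumption, since the prompt gives no traffic figure; the query duration is from the incident data:

```python
# Little's law check: concurrent connections ~ arrival rate x hold time.
req_per_s = 230                    # hypothetical authenticated request rate
hold_s = 0.850                     # median duration of the new query
concurrent = req_per_s * hold_s    # connections held at the median alone
print(round(concurrent))           # ~196, vs. a pool max of 200
```

At any plausible triple-digit request rate, an 850 ms query on every authenticated request saturates a 200-connection pool by itself, which is consistent with the observed jump from 45 to 195 active connections and is the arithmetic a good root-cause answer should make explicit.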

45. Intermittent Bug Hypothesis Tree

reasoning_effort: high

Bug report:
Intermittent payment processing failures affecting roughly 0.3% 
of transactions. Symptoms:
- Failures appear random across customers, amounts, and payment methods
- Error message: "Transaction timeout after 30000ms"
- Failures cluster slightly between 9:00–10:00 AM and 4:00–5:00 PM 
  in US Eastern time
- Rate has increased from 0.1% to 0.3% over the past 3 weeks
- Payment processor reports no incidents during these times
- Database shows no anomalies during the failure windows

Architecture:
- Node.js API servers → payment microservice → third-party payment processor
- Payment microservice runs on 6 pods (Kubernetes)
- Each pod maintains a connection pool to the payment processor API

Build a hypothesis tree:
1. Generate at least 8 distinct hypotheses for the root cause
2. For each hypothesis: the causal mechanism, why it produces 0.3% 
   failure rate, and why it would cluster at business-hours peaks
3. Group hypotheses by likely location (network, application, 
   external dependency, infrastructure)
4. Rank by probability given all symptoms, including the 3-week trend
5. The three cheapest diagnostic steps that would most rapidly 
   eliminate hypotheses

46. System-Level Outage Analysis

reasoning_effort: high

Post-mortem scenario:
A 47-minute complete service outage occurred. Reconstruct the 
failure cascade from these observations:

Observation 1: The outage began immediately after an auto-scaling 
event added 40 new application instances.

Observation 2: The service discovery system (Consul) showed all 
new instances as unhealthy within 90 seconds of launch.

Observation 3: Existing instances continued serving traffic for 
8 minutes before also failing.

Observation 4: Database CPU spiked to 100% at the 6-minute mark.

Observation 5: The load balancer continued routing traffic to all 
instances (healthy and unhealthy) throughout the outage.

Observation 6: Recovery occurred only after manually removing 
all new instances from the service discovery registry and 
restarting the database connection pool on old instances.

Reconstruct the most likely failure cascade step-by-step. For 
each step, identify what evidence supports it and what alternative 
explanations you are ruling out. Then identify the single control 
that, if implemented, would have prevented the cascade from 
progressing past the initial scaling event.

47. Security Incident Timeline

reasoning_effort: high

Suspected intrusion. Known facts:
- 03:17 UTC: Unusual API calls from an authenticated user account 
  (user ID 48291) — bulk export of customer records via a legitimate 
  but rarely-used admin endpoint
- 03:22 UTC: The same account initiates 14 password reset emails 
  to high-value customer accounts
- 03:31 UTC: SIEM alerts on the bulk export volume
- 03:45 UTC: Security team disables account 48291
- 04:12 UTC: Three of the targeted customers report unauthorized 
  login attempts from new IPs

Account 48291 belongs to a support engineer who was on PTO 
and denies any access. MFA is required for this account; 
MFA logs show a successful TOTP verification at 03:16 UTC 
from an IP in a country the employee has never accessed from.

Reconstruct the incident:
1. The most likely attack vector (how was the account compromised)
2. Timeline with attacker actions vs. defender actions
3. What the attacker's objective appears to have been
4. Evidence of what data may have been exfiltrated
5. Immediate containment actions that should have happened faster
6. Three detection controls that would have caught the initial 
   access in real time

Identify the assumption in your reconstruction that you are 
least confident about.

48. Data-Quality Root Cause

reasoning_effort: high

Issue:
Our data team noticed that monthly revenue reported by our 
billing system ($2.34M) and our data warehouse ($2.19M) diverged 
by $150K last month. This has never happened before at this scale 
(prior discrepancies under $5K).

System context:
- Billing system: Stripe, with custom subscription logic for 
  annual plan prorations
- Data warehouse: nightly ETL pulls from Stripe API, transforms, 
  and loads into Snowflake
- Revenue recognition: custom SQL that applies ASC 606 rules 
  on top of raw billing data
- Last month: we launched annual plans (previously only monthly)

Generate a systematic root cause investigation plan:
1. The most likely sources of the $150K discrepancy given the 
   context (at least 5 hypotheses)
2. For each hypothesis: the data query or check that would 
   confirm or eliminate it
3. The sequence in which to run those checks (fastest elimination first)
4. What a $150K discrepancy from annual plan launch specifically 
   suggests about proration handling
5. The permanent fix and the audit procedure to prevent recurrence

Flag any hypothesis that would indicate a systemic reporting 
error (not just a one-time variance).
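One of the hypotheses item (1) should generate can be pre-quantified. If Stripe reports annual invoices as full revenue in the booking month while the warehouse's ASC 606 layer recognizes only 1/12, the implied volume of annual bookings follows directly; this is one hypothesis among several, not a confirmed mechanism:

```python
# Hypothesis: billing counts full annual invoices in the booking month;
# the warehouse recognizes 1/12. If that alone explains the gap, the
# implied annual bookings X satisfy: X - X/12 = 150_000.
discrepancy = 150_000
implied_annual_bookings = discrepancy * 12 / 11
print(round(implied_annual_bookings))  # ~164K of annual plans booked
```

If last month's actual annual-plan bookings land anywhere near that figure, the deferred-revenue hypothesis jumps to the top of the check sequence in item (3).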

49. Financial Anomaly Investigation

reasoning_effort: high

Situation:
Our gross margin dropped from 71% to 58% in one month. Revenue 
was flat (no growth or decline). The margin drop appeared in the 
month-end close and was flagged by our CFO.

Known changes that month:
- We migrated from AWS to Google Cloud (partially — 40% of workloads 
  moved, 60% still on AWS during transition)
- We signed two new enterprise customers who required dedicated 
  infrastructure (separate instances, not shared)
- No new headcount
- COGS in the income statement: cloud infrastructure, third-party 
  APIs, and customer support labor

Investigate:
1. Generate all plausible explanations for a 13-point gross margin 
   decline with flat revenue
2. For each explanation: the account line it would appear in, 
   the dollar magnitude consistent with a 13-point drop, and 
   whether the known changes could produce it
3. The financial query or report that would isolate the source
4. If the cause is the cloud migration: what accounting treatment 
   should have been applied to the transition costs, and whether 
   this belongs in COGS or a separate line
5. The permanent reporting fix to prevent this from obscuring 
   underlying margin trends in future migrations
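The size of the COGS move implied by a 13-point margin drop is worth computing before the investigation starts. The revenue figure below is hypothetical (the prompt does not state one); the ratio is what matters:

```python
# With flat revenue, a 71% -> 58% margin drop is a pure COGS increase.
revenue = 3_000_000                  # hypothetical monthly revenue
cogs_before = (1 - 0.71) * revenue   # 870K
cogs_after = (1 - 0.58) * revenue    # 1.26M
increase = cogs_after - cogs_before  # 390K
print(round(increase), round(increase / cogs_before, 3))
```

Whatever the revenue base, COGS rose about 45% in a single month. That magnitude immediately constrains the hypothesis space in item (1): double-billed cloud spend during the parallel AWS/GCP transition and one-time dedicated-infrastructure setup are the known changes large enough to produce it.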

50. Multi-System Failure Analysis

reasoning_effort: high

Three systems failed in a 90-minute window. Determine whether 
these are related or independent failures:

System 1 (14:05 UTC): Order management system becomes unresponsive. 
Root cause identified: a full-table scan query in a nightly report 
job that was mistakenly triggered at 14:00 instead of 02:00.

System 2 (14:22 UTC): Customer notification service stops sending 
emails. On-call finds the service is up but the outbound email 
queue has grown to 400,000 messages (normal: under 5,000).

System 3 (14:51 UTC): Inventory sync service starts reporting 
stale data. All inventory levels in the UI are frozen at 
14:00 UTC values.

Additional context:
- Systems 1, 2, and 3 share a PostgreSQL database cluster
- System 2's email queue is populated by database triggers on 
  the orders table
- Inventory sync runs every 15 minutes and reads from the same 
  orders table
- The nightly report job that caused System 1's failure produces 
  a 2.3M row result set that it writes to a temp table

Analyze:
1. Is this one failure or three independent failures? Build 
   the causal graph.
2. The mechanism by which System 1's failure propagated (if it did)
3. Why System 3 shows stale data from exactly 14:00 UTC
4. The order in which these should have been diagnosed and resolved
5. The architectural change that would have contained System 1's 
   impact to System 1 alone

o3 Power Tips

1. State the full problem, then ask. Load all constraints, context, and data into the prompt before posing the question. o3's internal reasoning pass is strongest when it starts with a complete problem definition — not one it has to assemble from a vague question and follow-up clarifications.

2. Request the answer and a verification step — not chain-of-thought. Write "After your answer, verify by [specific check]" instead of "show your work" or "think step by step." You get a correctness check without forcing o3 to narrate reasoning it already did internally.

3. Calibrate reasoning_effort to the problem. Use high for proofs, formal algorithm correctness, and hard causal reasoning. Use medium for synthesis, planning, and analysis where depth matters but exhaustive search doesn't. Use low for triage, classification, and first-pass drafts where speed matters more than precision.

4. Define success criteria explicitly. Tell o3 what a good answer looks like — "the complexity argument must reference specific loop iterations," "the proof must handle the general case, not just small examples," "the recommendation must include the conditions under which you'd reverse it." This shapes the internal reasoning toward the right target.

5. Ask for confidence levels and explicit assumptions. Add "State your confidence in each major claim and the assumptions that underpin it" to any prompt where the answer involves inference under uncertainty. o3 surfaces hidden assumptions better when you request them directly.

6. Use re-derivation verification on hard problems. For math, proofs, and algorithmic correctness, the highest-value verification is "re-derive from first principles and check for inconsistencies with your first answer." This catches errors that a simple plausibility check misses.

Before

Let's work through this step by step. First, think about the constraints. Then consider the options. Walk me through your reasoning as you go. What's the best decision here given that we have 18 months runway, two acquisition offers, and a potential Series A?

After

reasoning_effort: high

Context: 18 months runway, $3M ARR, 65% YoY growth, NRR 118%.

Option A: $18M cash acquisition at close in 60 days.
Option B: $25M deal, 60% cash + 40% acquirer stock, 6-month close.
Option C: $5M strategic investment at $30M pre-money, remain independent.

Acquirer stock in Option B is worth 0.5x–2.5x stated value in 3 years depending on their trajectory. Option C requires 2x ARR in 18 months to raise at a reasonable valuation; current growth makes this plausible but not certain.

Assume 25% fully diluted founder ownership. Which option maximizes expected founder outcome? Build a decision tree across the key uncertainties, state your assumptions explicitly, and recommend one option. After your recommendation, identify the assumption that most changes the answer if wrong.

Build Better Prompts for Hard Problems

These 50 prompts follow one principle: give o3 a well-defined problem and get out of the way. The model's reasoning budget is the differentiator — your job is to aim it, not direct it step by step.

If you want to build prompts like these for your own hard problems without doing it from scratch, the AI prompt generator generates structured, reasoning-model-optimized prompts from plain English descriptions. For the full theoretical foundation on why reasoning models require a different approach, read the complete guide: Prompt Engineering for Reasoning Models. And if you're also working with GPT-5, see 50 Best GPT-5 Prompts in 2026 for the companion roundup.

Build prompts like these in seconds

Use the Template Builder to customize 350+ expert templates with real-time preview, then export for any AI model.
