Most prompt advice on the internet was written for instruction-following models: GPT-4o, Claude Haiku, Gemini Flash. Give them a role, be specific about format, chain your steps. That playbook is wrong for o3. o3 is a reasoning model — it thinks in hidden tokens before it ever writes the first word of its response, and telling it how to think usually makes things worse, not better. These 50 prompts are built around that reality: state the problem cleanly, set the depth you need, get out of the way.
Why o3 Prompts Are Different
o3 already does the internal reasoning — don't ask for it. When you write "think step by step" or "show your work," you're asking o3 to narrate a process it has already completed internally. At best, you get redundant output. At worst, you force the internal reasoning to follow your script rather than find the best path on its own. The prompt for o3 is the problem statement, not the reasoning procedure.
State the problem and constraints, then ask the question. o3's internal thinking is most powerful when it has a well-defined problem space to search. That means: give it the full context first, specify what's in and out of scope, define the constraints, and only then pose the question. Prompts that lead with the question and fill in context after give the model a weaker starting point for its reasoning pass.
Calibrate thinking depth with reasoning_effort. When calling o3 via the API, the reasoning_effort parameter (values: low, medium, high) controls how many thinking tokens the model allocates. For a combinatorics proof or a formal algorithm correctness argument, use high. For a synthesis task or a planning outline, medium is usually sufficient. For quick triage or a first-pass classification, low saves cost without meaningful quality loss. The prompts below include explicit effort cues where the category warrants it — treat these as API-layer guidance, not text you must include verbatim in every environment.
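In the Chat Completions API that is one request field. A minimal sketch (the message content is illustrative; model naming and availability depend on your account and SDK version):

```python
# Minimal o3 request with an explicit reasoning-effort setting.
# The message content below is illustrative, not part of the API.
request = {
    "model": "o3",
    "reasoning_effort": "high",   # "low" | "medium" | "high"
    "messages": [
        {"role": "user",
         "content": "Problem: ...\nConstraints: ...\nQuestion: ..."},
    ],
}
# With the openai package installed, this is sent as:
# client.chat.completions.create(**request)
```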
Verification framing works. Asking o3 to re-derive an answer from first principles after producing it is not the same as asking it to show its work mid-answer. Verification is a second pass — it catches algebraic slips, logic gaps, and faulty assumptions that even a strong internal reasoning pass can miss. On hard math, proofs, and algorithmic correctness problems, include a verification step in your prompt. On softer synthesis tasks, it's less necessary.
o3 is expensive and slower than GPT-4o — use it where reasoning earns its keep. o3 is the right choice when the problem is genuinely hard: multi-step proofs, correctness-critical code, scenarios with competing constraints, causal inference under uncertainty. It is the wrong choice for summarization, formatting, simple Q&A, and tasks where instruction-following matters more than reasoning. See the reasoning-model prompting guide at /blog/ai-reasoning-models-prompting-complete-guide-2026 for a complete treatment.
Math & Logic Prompts (1–8)
1. Multi-Step Word Problem
reasoning_effort: high
Problem:
A factory produces two product lines, A and B. Line A requires 3
hours of machine time and 1 hour of labor per unit. Line B requires
2 hours of machine time and 4 hours of labor per unit. Available
capacity this week: 240 machine hours and 200 labor hours.
Profit margin is $45 per unit of A and $70 per unit of B.
Find the production quantities of A and B that maximize total profit
subject to the capacity constraints. State whether the optimal solution
uses all available capacity on one or both resources.
After your answer, verify by substituting your solution back into
both constraints and confirming feasibility and optimality.
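The feasible region here is small enough to brute-force over integer quantities, which gives you a hard check on whatever o3 returns (a sketch; it relies on the optimum being integral, which holds for this instance):

```python
# Brute-force the factory LP: maximize 45A + 70B subject to
# 3A + 2B <= 240 (machine hours) and A + 4B <= 200 (labor hours).
best = max(
    (45 * a + 70 * b, a, b)
    for a in range(81)          # 3a <= 240  =>  a <= 80
    for b in range(51)          # 4b <= 200  =>  b <= 50
    if 3 * a + 2 * b <= 240 and a + 4 * b <= 200
)
profit, a, b = best
```

The maximum lands at A = 56, B = 36 with profit $5,040, and both constraints bind there, which is exactly the feasibility-and-optimality check the prompt asks o3 to perform.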
2. Proof Sketch
reasoning_effort: high
Claim: For any prime p > 2, the sum of all positive integers less
than p that are coprime to p equals p(p-1)/2.
Provide a rigorous proof. You may use standard number theory results
(Euler's totient function, properties of modular arithmetic) without
reproving them, but cite each result you invoke.
Success criteria: the proof must handle the structure of the residues
mod p, not just verify numerically for small cases. After completing
the proof, identify the one step most likely to contain an error and
re-examine it.
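Before spending high-effort tokens on the proof, the claim itself is cheap to verify numerically (a sanity check only; the prompt's success criteria explicitly demand more than small-case verification):

```python
from math import gcd

def coprime_sum(p):
    # Sum of the positive integers below p that are coprime to p.
    return sum(k for k in range(1, p) if gcd(k, p) == 1)

# For prime p, every integer in 1..p-1 is coprime to p,
# so the sum is 1 + 2 + ... + (p-1) = p(p-1)/2.
for p in (3, 5, 7, 11, 13, 97):
    assert coprime_sum(p) == p * (p - 1) // 2
```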
3. Constrained Optimization
reasoning_effort: high
I need to allocate a $500,000 annual marketing budget across four
channels. Historical data gives the following estimated returns
(incremental revenue per dollar spent, subject to diminishing returns):
Channel A: r(x) = 8 * sqrt(x), where x is dollars spent
Channel B: r(x) = 12 * x^0.4
Channel C: r(x) = 5 * ln(1 + x) [base-e log]
Channel D: r(x) = 6 * x^0.5
Constraints:
- Minimum $30,000 per channel (contractual)
- Channel C capped at $150,000
- Total budget exactly $500,000
Find the allocation that maximizes total expected incremental revenue.
State your method, the optimal allocation, and the marginal return
at the optimum for each channel.
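Because all four return functions are concave, a greedy allocation that repeatedly feeds the channel with the highest marginal return approximates the optimum, and gives you a number to compare against o3's answer. A sketch in $1,000 increments (the derivatives are hand-computed from the functions above):

```python
# Greedy marginal-return ("water-filling") allocation.
# Derivatives of the stated return functions:
#   A: d/dx 8*sqrt(x) = 4 / x^0.5      B: d/dx 12*x^0.4 = 4.8 / x^0.6
#   C: d/dx 5*ln(1+x) = 5 / (1+x)      D: d/dx 6*x^0.5  = 3 / x^0.5
marginal = {
    "A": lambda x: 4.0 / x ** 0.5,
    "B": lambda x: 4.8 / x ** 0.6,
    "C": lambda x: 5.0 / (1.0 + x),
    "D": lambda x: 3.0 / x ** 0.5,
}
cap = {"A": 500_000, "B": 500_000, "C": 150_000, "D": 500_000}
alloc = {c: 30_000 for c in marginal}        # contractual minimums
step = 1_000
for _ in range((500_000 - sum(alloc.values())) // step):
    # Feed the channel with the best marginal return that has headroom.
    best = max((c for c in alloc if alloc[c] + step <= cap[c]),
               key=lambda c: marginal[c](alloc[c]))
    alloc[best] += step
```

Channel C's logarithmic returns flatten so quickly that the greedy pass leaves it at the $30,000 floor, a useful detail to look for in o3's marginal-return analysis.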
4. Probability Puzzle
reasoning_effort: high
Setup:
A test for a rare condition has 98% sensitivity (true positive rate)
and 95% specificity (true negative rate). The condition affects 0.3%
of the general population. A patient tests positive.
Questions:
1. What is the probability the patient actually has the condition?
2. The doctor orders a second independent test with identical
characteristics. It also comes back positive. Now what is the
posterior probability?
3. How many positive tests in a row would be needed to push the
posterior above 90%?
Show Bayesian calculations. State all assumptions explicitly. After
answering question 3, verify your answer is consistent with the
pattern established in questions 1 and 2.
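The arithmetic here is mechanical, so you can check o3's Bayesian chain yourself. In odds form, each independent positive test multiplies the odds by the likelihood ratio, sensitivity over the false-positive rate:

```python
# Odds-form Bayes for the repeated-test question.
sens, spec, prev = 0.98, 0.95, 0.003
lr = sens / (1 - spec)          # likelihood ratio = 19.6 per positive
odds = prev / (1 - prev)

posteriors = []
n_needed = None
for n in range(1, 6):
    odds *= lr                  # one more independent positive result
    p = odds / (1 + odds)
    posteriors.append(p)
    if n_needed is None and p > 0.90:
        n_needed = n
# posteriors[0] ≈ 0.056: one positive is still probably a false alarm.
```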
5. Combinatorics
reasoning_effort: high
A committee of 5 is to be chosen from 8 men and 6 women with these
constraints:
- At least 2 women must be on the committee
- Two specific men (call them M1 and M2) refuse to serve together
- One specific woman (W1) will only serve if at least one of M1 or M2
is also on the committee
How many valid committees are possible?
Use casework. For each case, state the logic clearly before computing.
After reaching your total, verify by checking at least one boundary
case explicitly.
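This prompt is fully checkable: there are only C(14,5) = 2,002 possible committees, so a direct enumeration verifies o3's casework total:

```python
from itertools import combinations

men = [f"M{i}" for i in range(1, 9)]
women = [f"W{i}" for i in range(1, 7)]

def valid(committee):
    c = set(committee)
    if sum(1 for p in c if p in women) < 2:   # at least 2 women
        return False
    if {"M1", "M2"} <= c:                     # M1 and M2 won't serve together
        return False
    if "W1" in c and not ({"M1", "M2"} & c):  # W1 requires M1 or M2
        return False
    return True

count = sum(1 for c in combinations(men + women, 5) if valid(c))
```

Compare `count` against the casework total o3 reports; if they disagree, add a filter matching one of its cases to see which subtotal is off.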
6. Real Analysis
reasoning_effort: high
Let f: [0,1] → R be defined by:
f(x) = x * sin(1/x) for x ∈ (0,1]
f(0) = 0
Questions:
1. Is f continuous on [0,1]? Prove your answer.
2. Is f uniformly continuous on [0,1]? Prove your answer.
3. Is f differentiable at x = 0? Prove your answer using the
limit definition of the derivative.
Success criteria: each answer must include a complete epsilon-delta
or limit argument, not just an appeal to intuition. Flag any step
where the argument depends on a non-obvious interchange of limits.
7. Applied Math — Compound Finance
reasoning_effort: medium
An investor deposits $10,000 into an account at the start of each
year for 20 years (20 deposits total, first deposit at t=0). The
account earns 6% annual interest, compounded monthly.
Calculate:
1. The exact value of the account immediately after the 20th deposit
2. The total interest earned (account value minus total principal)
3. The equivalent lump-sum amount at t=0 that would produce the same
final balance (present value of the deposit series at 6% monthly
compounding)
State the formula you use for each calculation before applying
numbers. Verify calculation 1 against the formula for the future
value of an annuity due.
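Part 1 reduces to a geometric series once monthly compounding is converted into an annual growth factor. A quick numeric cross-check (valuing the account at the moment of the 20th deposit, so the final deposit has earned no interest yet):

```python
# Future value immediately after the 20th deposit.
# 6% nominal annual, compounded monthly => annual growth factor g.
g = (1 + 0.06 / 12) ** 12            # ≈ 1.061678
deposit, n = 10_000, 20

# Deposit k (k = 0..19) has grown for 19 - k full years at that point.
fv = sum(deposit * g ** (19 - k) for k in range(n))

# Same result in closed form as a geometric series.
fv_closed = deposit * (g ** n - 1) / (g - 1)
assert abs(fv - fv_closed) < 1e-6
```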
8. Constraint Satisfaction
reasoning_effort: high
Eight employees (A through H) need to be assigned to four two-person
project teams (Teams 1–4). Constraints:
- A and B must be on the same team
- C and D cannot be on the same team
- E must be on Team 1 or Team 2
- F must be on a different team from both G and H
- H must be on Team 3 or Team 4
- Team 2 must include at least one of {C, D, G}
List all valid assignments. If there are more than 10, describe the
complete structure of the solution space rather than enumerating.
If there are 10 or fewer, enumerate all of them explicitly and
verify each satisfies every constraint.
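The search space is only 8!/2^4 = 2,520 distinct assignments, so o3's description of the solution space can be verified exhaustively:

```python
from itertools import permutations

valid = set()
for perm in permutations("ABCDEFGH"):
    # perm[0:2] -> Team 1, perm[2:4] -> Team 2, and so on.
    t = {p: i // 2 + 1 for i, p in enumerate(perm)}
    if (t["A"] == t["B"]
            and t["C"] != t["D"]
            and t["E"] in (1, 2)
            and t["F"] not in (t["G"], t["H"])
            and t["H"] in (3, 4)
            and any(t[p] == 2 for p in "CDG")):
        valid.add(tuple(sorted(t.items())))
```

Run it once before reading o3's answer so you know which branch of the prompt (enumerate vs. describe the structure) the model should have taken.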
Hard Coding Prompts (9–15)
9. Algorithm Design
reasoning_effort: high
Problem:
Given a directed graph with N nodes and weighted edges, find the
minimum cost path from source S to destination T such that:
- The path visits exactly K distinct intermediate nodes (not counting
S and T)
- No node is visited twice
- The total weight does not exceed budget B
Constraints: N up to 500, K up to 20, edges can have negative weights
(but no negative cycles).
Design an algorithm that solves this. Specify:
1. The data structure and state representation
2. The recurrence or search strategy
3. Time and space complexity with justification
4. Why your approach handles negative weights correctly
After describing the algorithm, identify the most likely source of
an off-by-one or boundary error and explain how your design avoids it.
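Whatever algorithm o3 designs, the cheapest validation is a brute-force oracle on small instances. A sketch (exponential, so toy graphs only; the example graph at the bottom is illustrative):

```python
# Brute-force oracle: min-cost S->T simple path with exactly k
# intermediate nodes and total weight <= budget. No cost-based
# pruning, because negative edges could bring the total back down.
def min_cost_path(adj, s, t, k, budget):
    best = [None]
    def dfs(u, visited, cost, inter):
        if u == t:
            if inter == k and cost <= budget and \
                    (best[0] is None or cost < best[0]):
                best[0] = cost
            return
        if inter > k:
            return
        for v, w in adj.get(u, []):
            if v not in visited:
                dfs(v, visited | {v}, cost + w,
                    inter if v == t else inter + 1)
    dfs(s, {s}, 0, 0)
    return best[0]

# Illustrative toy graph: node -> [(neighbor, weight), ...]
adj = {"S": [("a", 2), ("b", 5)],
       "a": [("b", 1), ("T", 9)],
       "b": [("T", 1)]}
```

Generate a few dozen random small graphs, run both the oracle and o3's algorithm, and diff the answers; disagreements almost always expose exactly the boundary errors the prompt asks about.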
10. Complexity Analysis
reasoning_effort: high
Analyze the time and space complexity of this algorithm:
[PASTE YOUR ALGORITHM OR PSEUDOCODE HERE]
I need:
1. Best, average, and worst-case time complexity with tight bounds
(not just O-notation — prove or disprove that the bound is tight)
2. Space complexity, distinguishing input space from auxiliary space
3. Whether the algorithm is cache-friendly and why that matters
for real-world performance at scale
4. The single most impactful optimization that would reduce the
dominant cost, with a rough estimate of the constant-factor
improvement
Success criteria: the complexity argument must reference specific
loop iterations or recursion levels, not just pattern-match to
known algorithms.
11. Concurrency Correctness
reasoning_effort: high
I have a multi-threaded cache implementation in [LANGUAGE]:
[PASTE YOUR CONCURRENT CACHE CODE]
Analyze this code for:
1. Race conditions — identify every shared state access that is not
atomically protected and describe the interleaving that produces
incorrect behavior
2. Deadlock potential — list all lock acquisition sequences and
identify any cycle that could deadlock under adversarial scheduling
3. ABA problems (if using compare-and-swap operations)
4. Memory visibility issues on architectures weaker than x86 (ARM, RISC-V)
For each issue found: name the problem, describe the exact thread
interleaving or instruction reordering that triggers it, and provide
the corrected code with a comment explaining the fix.
12. Race Condition Reasoning
reasoning_effort: high
Context:
Our distributed job queue uses optimistic locking. Workers read a
job record, set status='processing', and update with a version check.
We're seeing jobs processed twice roughly 0.1% of the time.
System details:
- PostgreSQL 15, READ COMMITTED isolation
- Workers run on 8 separate machines
- Job pickup query: SELECT ... WHERE status='pending' FOR UPDATE SKIP LOCKED
- Status update: UPDATE jobs SET status='processing', version=version+1
WHERE id=? AND version=?
Diagnose all plausible root causes for the double-processing. For
each hypothesis:
1. Describe the exact event sequence that produces it
2. Rate its likelihood given the symptoms (0.1% rate, distributed workers)
3. Provide a test that would confirm or eliminate this hypothesis
4. Provide the fix
After listing hypotheses, rank them by probability and recommend
where to start the investigation.
13. Formal-Spec Coding
reasoning_effort: high
Implement a purely functional, persistent balanced BST (Red-Black Tree)
in [LANGUAGE] with the following specification:
Operations:
- insert(tree, key, value) → new_tree [no mutation of original]
- lookup(tree, key) → Option<value>
- delete(tree, key) → new_tree
- rank(tree, key) → int [0-indexed position in sorted order]
- select(tree, k) → Option<(key, value)> [k-th element by rank]
Invariants that must hold after every operation:
1. BST ordering property
2. Red-Black coloring rules (root black, no adjacent red nodes,
equal black height on all paths)
3. Rank and select are O(log n)
Include an invariant checker function. After writing the implementation,
trace through an insert followed by delete on a tree of depth 4 and
verify the invariants hold at each step.
14. Performance Optimization
reasoning_effort: medium
This function is called 50 million times per second in a hot path
of our data pipeline:
[PASTE FUNCTION CODE]
Profile it analytically (no profiler available — reason from first
principles):
1. Identify the dominant cost: memory allocations, branch mispredictions,
cache misses, or CPU-bound computation
2. Propose optimizations in order of expected impact
3. For the top optimization, write the revised code
4. Estimate the speedup using rough cycle counts or allocation counts
5. Identify any correctness risk the optimization introduces and
how to test for it
Success criteria: optimization recommendations must be grounded in
specific CPU or memory behavior, not general "avoid allocations" advice.
15. Hard Debugging
reasoning_effort: high
I have an intermittent memory corruption bug in a C++ service.
Symptoms:
- Crashes occur 2-5 times per day on production (not reproducible
in staging)
- Stack traces always point to a different location in the code
- ASAN and Valgrind clean on all test runs
- Crash rate increased 3x after we added a background thread that
compacts an in-memory index every 60 seconds
Relevant code:
[PASTE THE BACKGROUND COMPACTION THREAD AND THE DATA STRUCTURES IT TOUCHES]
Generate a differential diagnosis:
1. List all plausible root causes consistent with these symptoms
2. For each: the mechanism, why ASAN misses it in testing, and why
production triggers it
3. Ranked by probability given the background thread correlation
4. For the top hypothesis: a targeted test that would reproduce it
without relying on timing luck
5. The fix, with an explanation of why it addresses the root cause
Multi-Step Planning Prompts (16–22)
16. Project Plan With Dependency Analysis
reasoning_effort: medium
Project: Migrate our monolithic Rails application to a service-oriented
architecture. Scope includes: user authentication, billing,
notifications, and the core product domain.
Constraints:
- Zero downtime during migration
- 6 engineers available (2 senior, 4 mid-level)
- 9-month timeline
- Must maintain feature parity throughout
- Billing service requires PCI-DSS compliance review (8-10 weeks)
Produce:
1. A phased migration plan with explicit sequencing rationale
2. A dependency graph (described in text) — which phases block which
3. The critical path and where schedule risk lives
4. Resource allocation per phase
5. Three decision points where the plan should be re-evaluated
6. The single highest-risk assumption embedded in this plan
Do not produce a generic "identify services → extract → deploy"
template. Make the plan specific to the constraints given.
17. Capacity Planning
reasoning_effort: medium
Our SaaS platform currently serves 15,000 daily active users with
these resource profiles:
- API servers: 8 × c5.2xlarge (8 vCPU, 16 GB RAM), avg 45% CPU
- Database: 1 × db.r6g.4xlarge (16 vCPU, 128 GB RAM), avg 60% CPU,
200 GB storage used
- Cache: 2 × cache.r6g.xlarge Redis nodes, 70% memory utilized
- Background jobs: 4 × c5.xlarge, avg 30% CPU
Growth projections: 20% month-over-month for the next 6 months,
then flattening to 5% per month.
Produce a 12-month capacity plan:
1. Month-by-month resource utilization projections for each tier
2. Recommended scaling actions with timing (which months to act)
3. Estimated cost at each milestone (use rough AWS on-demand pricing)
4. The tier that will hit a hard limit first and why
5. One architectural change that would meaningfully reduce the
cost of scaling (beyond just adding instances)
18. Scenario Tree
reasoning_effort: high
Decision context:
We are a 40-person B2B SaaS company with $3M ARR and 18 months
runway. We have received three options from a potential acquirer:
Option A: $18M cash acquisition, close in 60 days
Option B: $25M deal, 60% cash + 40% acquirer stock, close in 6 months,
subject to standard reps & warranties
Option C: The acquirer invests $5M as a strategic round at $30M
pre-money valuation (we remain independent)
Key uncertainties:
- Acquirer stock (Option B) could be worth 0.5x to 2.5x its stated
value in 3 years depending on their trajectory
- Option C requires us to hit 2x ARR within 18 months to raise the
next round at a reasonable valuation
- Our current growth rate is 8% month-over-month but slowing
Build a scenario tree that maps outcomes across the key uncertainties.
For each terminal node, calculate the approximate founder payout
(assume 25% fully diluted founder ownership post-option-pool).
Recommend the option and explain the assumptions that drive it.
19. Supply-Chain Logistics
reasoning_effort: medium
I run a hardware startup that ships physical devices. Current state:
- 3,000 units per month, growing to 8,000 units/month within 12 months
- Single CM in Shenzhen, 90-day lead time
- Single 3PL in New Jersey
- 35% of customers are in the EU (customs delays cause 2-3 week delivery times)
- Current COGS: $42/unit, logistics adds $18/unit average
Design a supply chain architecture for 8,000 units/month that:
1. Reduces EU delivery time to under 10 days
2. Reduces logistics cost per unit by at least 20%
3. Reduces single-CM risk
4. Can be operationally stood up within 9 months
For each recommendation: describe the change, the cost to implement,
the projected savings or time improvement, and the operational
complexity it adds. Identify the two changes with the best ROI.
20. Multi-Stage Launch Sequence
reasoning_effort: medium
Product: A developer-focused API security tool (static analysis +
runtime monitoring) targeting engineering teams at Series B–D companies.
Launch objective: 500 paying customers within 6 months of GA,
with at least 50 customers generating $500+/month in ARR.
Assets available at launch:
- 200 beta users (60% active, primarily at Series A companies)
- 3 design partners at Series C companies (willing to be references)
- $200K marketing budget
- 4-person GTM team (2 sales, 1 demand gen, 1 partnerships)
- An integration with GitHub Actions (marketplace listing)
Design a month-by-month launch sequence. For each month include:
1. The primary motion (PLG, outbound, partner, content — pick the
dominant one)
2. Specific activation targets and leading indicators
3. Budget allocation
4. The single biggest risk to that month's plan and the contingency
Flag assumptions about sales cycle length and conversion rates
explicitly rather than burying them in the plan.
21. GTM Rollout for New Market
reasoning_effort: medium
We sell project management software to professional services firms
(law firms, consulting firms, accounting firms) in the US. We want
to expand into the UK and Germany.
Current US state: 600 customers, $8M ARR, 92% gross retention,
4 AEs closing deals.
Constraints for international expansion:
- Budget: $400K first year
- Cannot hire full-time staff in-country until we have $500K ARR
in that market
- GDPR compliance required before first EU sale
- Sales process currently takes 45 days on average (US)
Design a 12-month GTM rollout that:
1. Sequences UK before Germany (or makes the case for the reverse)
2. Identifies the top 3 ICP differences between US and UK/DE markets
3. Specifies the channel mix and why it differs from the US playbook
4. Sets go/no-go criteria for hiring in-country staff
5. Identifies the single most likely point of failure and how to
detect it early
22. Complex Migration Plan
reasoning_effort: high
We need to migrate 8 TB of production data from a self-managed
PostgreSQL 13 cluster (3 nodes, synchronous replication) to
Amazon Aurora PostgreSQL, with these requirements:
- Maximum acceptable downtime: 4 hours (maintenance window)
- Data must be consistent at the cut-over point
- Rollback must be possible within 2 hours of cut-over if issues arise
- 200+ application services connect to the database
- Several services use PostgreSQL-specific features: advisory locks,
listen/notify, and custom extensions (pg_trgm, uuid-ossp)
Produce:
1. Migration strategy (logical replication vs. physical vs. DMS —
justify your choice given the constraints)
2. Pre-migration checklist (schema, extension compatibility,
connection string changes)
3. Cut-over runbook with explicit steps, owners, and time estimates
4. Rollback runbook
5. The three highest-risk steps and your mitigation for each
Identify any constraint that is likely to be harder to meet than
it appears and explain why.
Research & Synthesis Prompts (23–29)
23. Literature Synthesis With Reasoning
reasoning_effort: high
Topic: The effectiveness of retrieval-augmented generation (RAG)
versus fine-tuning for domain adaptation of large language models.
Synthesize what is known about this comparison, covering:
1. The conditions under which RAG outperforms fine-tuning on
domain-specific tasks (and the reverse)
2. How knowledge recency interacts with each approach
3. The role of retrieval quality as a limiting factor for RAG
4. Compute cost tradeoffs at inference vs. training time
5. Evidence on catastrophic forgetting in fine-tuned models when
applied to out-of-distribution queries
For each claim, indicate whether it is (a) well-supported by
multiple independent sources, (b) based on a single line of
evidence, or (c) a reasonable inference with limited direct evidence.
Flag claims where your training knowledge may be outdated.
24. Contradiction Resolution
reasoning_effort: high
I have two research findings that appear to contradict each other:
Finding A: [PASTE FINDING — include methodology summary]
Finding B: [PASTE FINDING — include methodology summary]
Analyze whether this is a genuine contradiction or an apparent one.
Specifically:
1. Identify every methodological difference that could explain
the divergence (population, measurement, confounders, time horizon)
2. Assess whether the two findings are actually measuring the same
construct
3. If the contradiction is genuine, propose the most parsimonious
theoretical account that reconciles both
4. Describe the study design that would resolve the question
empirically, and what result would support each finding
State your confidence level in each part of your analysis and
the assumptions that underpin it.
25. Hypothesis Generation With Priors
reasoning_effort: high
Observation:
In our longitudinal user study (n=340, 6 months), users who adopted
our AI writing assistant in the first week retained at significantly
higher rates at 3 months (78% vs. 54% for later adopters). The
effect persists after controlling for prior tool usage and role.
Generate at least 6 hypotheses that could explain this pattern.
For each hypothesis:
1. The causal mechanism (why early adoption would cause higher retention)
2. The prior probability you'd assign before seeing the data
(low/medium/high) and your reasoning
3. What observable data in our current dataset would support or
undercut this hypothesis
4. What additional data collection would most cleanly test it
Rank the hypotheses by a combination of explanatory power and
testability. Identify which hypotheses are mutually exclusive and
which can coexist.
26. Experimental Design
reasoning_effort: high
I want to run an experiment to test whether adding an AI-generated
summary at the top of long-form B2B content (>2,000 words) increases
time-on-page and downstream conversion rates.
Context:
- 80,000 monthly unique visitors to the content section
- Average content length: 2,400 words
- Current avg time-on-page: 3.2 minutes
- Current content-to-lead conversion: 2.1%
- 40% of traffic is mobile
Design the experiment:
1. Experimental unit (page, session, or user) — justify your choice
2. Randomization and assignment strategy
3. Required sample size for 80% power to detect a 15% relative
improvement in the primary metric, at p < 0.05
4. Primary metric, secondary metrics, and guardrail metrics
5. Minimum detectable effect given realistic traffic
6. Risks that could invalidate the experiment (contamination,
novelty effects, selection bias) and mitigations
7. Decision rule: what result triggers rollout, what triggers iteration
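You can pre-compute the sample-size answer to item 3 with the standard two-proportion formula and compare it against what o3 derives (a sketch using the normal approximation; z-values come from the standard library):

```python
from statistics import NormalDist

# Two-proportion sample size, normal approximation.
# Baseline conversion 2.1%; detect a 15% relative lift,
# two-sided alpha = 0.05, 80% power.
p1 = 0.021
p2 = p1 * 1.15
alpha, power = 0.05, 0.80

z_a = NormalDist().inv_cdf(1 - alpha / 2)   # ≈ 1.96
z_b = NormalDist().inv_cdf(power)           # ≈ 0.84

n_per_arm = ((z_a + z_b) ** 2 *
             (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
# Roughly 35,000 visitors per arm, i.e. most of a month of traffic.
```

That per-arm figure against 80,000 monthly uniques is the tension the prompt's item 5 (minimum detectable effect given realistic traffic) should surface.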
27. Evidence Grading
reasoning_effort: medium
I am evaluating whether to implement a structured onboarding
checklist for new SaaS customers (vs. our current free-form
onboarding calls) to improve 30-day activation rates.
I have collected the following evidence:
1. Our internal data: customers who completed 7+ product actions
in week 1 have 3x higher 90-day retention
2. A case study from a competitor blog claiming checklist onboarding
improved their activation by 40%
3. Two SaaS industry benchmark reports (sources and methodology unknown)
4. A randomized experiment by a SaaS research firm showing
structured onboarding improved 30-day activation by 12%
(n=1,200, p=0.03)
5. Three customer interviews where customers said they felt "lost"
in the first week
Grade each piece of evidence on: reliability, relevance to our
situation, and the inference it supports. Then give an overall
evidence-strength rating for the decision. Identify the weakest
inference in the chain between evidence and recommendation.
28. Claim Verification
reasoning_effort: high
Claim to evaluate: "Transformer models trained on code are more
sample-efficient than those trained on natural language because
code has more regular syntactic structure, which reduces the
hypothesis space the model must search during training."
Evaluate this claim by:
1. Breaking it into its constituent sub-claims
2. Assessing the logical validity of the argument structure
(does the conclusion follow from the premises, assuming
the premises are true?)
3. Evaluating the empirical status of each premise
4. Identifying the weakest link in the argument
5. Proposing a study that would provide strong evidence for
or against the core claim
Do not simply agree or disagree. Treat this as a structured
argument analysis, not an opinion question.
29. Theoretical Framework Comparison
reasoning_effort: high
Compare two competing theoretical frameworks for explaining
[PHENOMENON IN YOUR FIELD]:
Framework A: [NAME AND 2-3 SENTENCE DESCRIPTION]
Framework B: [NAME AND 2-3 SENTENCE DESCRIPTION]
For each framework:
1. Core ontological commitments (what entities and processes it posits)
2. Key predictions that differ from the competing framework
3. Empirical evidence it best explains
4. Evidence that strains or contradicts it
5. The methodological assumptions required to test it
Then assess:
- Whether the frameworks are genuinely competing or address
different aspects of the phenomenon
- The type of evidence that would most decisively favor one
over the other
- Whether a third framework or integration is suggested by
the gaps in both
Success criteria: the analysis should be useful to someone who
already knows both frameworks and is trying to decide which to
work within.
Scientific Reasoning Prompts (30–36)
30. Biomedical Reasoning
reasoning_effort: high
Clinical scenario:
A 58-year-old patient presents with progressive fatigue,
unexplained 8 kg weight loss over 4 months, and mild right
upper quadrant discomfort. Labs show: elevated alkaline
phosphatase (3x upper limit of normal), mildly elevated ALT/AST
(1.5x ULN), total bilirubin normal, CA 19-9 elevated at 180 U/mL.
CBC normal. No history of liver disease or alcohol use.
This is a reasoning exercise, not a diagnostic consultation.
Analyze the differential diagnosis:
1. List the 4 most likely diagnoses consistent with this
presentation, ranked by probability
2. For each: the pathophysiology explaining each abnormal finding
3. The single most discriminating next test for the top two diagnoses
4. What feature of this presentation most constrains the differential
Identify any finding that is inconsistent with your top diagnosis
and how you account for it.
31. Physics Problem
reasoning_effort: high
A thin uniform rod of mass M and length L is initially at rest,
lying on a frictionless horizontal surface. An impulse J is applied
perpendicular to the rod at a point P located at distance d from
the center of mass (0 < d ≤ L/2).
Find:
1. The velocity of the center of mass immediately after the impulse
2. The angular velocity of the rod immediately after the impulse
3. The location along the rod that is instantaneously at rest
immediately after the impulse (the "center of percussion")
4. The value of d for which the center of percussion coincides
with the end of the rod
After solving, verify that your answer to (3) reduces to the
known result for d = L/2 (impulse at the end of the rod).
Check dimensional consistency throughout.
32. Chemistry Mechanism
reasoning_effort: high
Reaction:
A secondary alkyl bromide undergoes treatment with sodium ethoxide
in ethanol at 55°C. The product distribution is 70% E2 elimination
product and 30% SN2 substitution product.
Analyze this reaction:
1. Explain the mechanistic basis for the observed product ratio,
citing the factors that favor E2 vs. SN2 for this substrate
and these conditions
2. Predict how the product ratio would change if:
a. The reaction is run at 25°C instead of 55°C
b. Sodium methoxide in methanol is used instead
c. A tertiary alkyl bromide is used instead
3. Draw the energy profile (describe in text) for the competing
pathways, noting which transition state is higher in energy
for this case
For each prediction in (2), state the specific mechanistic
reason for the change, not just the direction of the effect.
33. Statistical Analysis
reasoning_effort: high
I ran a 2×2 factorial experiment testing two variables:
- Factor A: feature flag on/off
- Factor B: user segment (new vs. returning)
Primary metric: 7-day retention rate
Sample sizes and outcomes:
- A=off, B=new: n=2,340, retention=28.3%
- A=on, B=new: n=2,287, retention=31.7%
- A=off, B=returning: n=4,102, retention=61.2%
- A=on, B=returning: n=4,218, retention=64.9%
Analyze:
1. Main effect of Factor A (statistical test and interpretation)
2. Main effect of Factor B
3. Interaction effect — is the effect of A different for new
vs. returning users? Conduct the appropriate test.
4. Multiple comparisons adjustment — which corrections apply here
and why
5. Practical significance: is the effect size meaningful for
a product decision, and what would you recommend?
State your statistical assumptions and flag any that may be
violated by the design or data.
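A quick way to audit o3's output on items 1 and 3 is to run the segment-level two-proportion z-tests yourself (a sketch; a full factorial treatment would fit a logistic model with an interaction term instead):

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(p1, n1, p2, n2):
    # Pooled two-proportion z-test statistic.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Effect of the feature flag (Factor A) within each segment.
z_new = two_prop_z(0.283, 2340, 0.317, 2287)
z_ret = two_prop_z(0.612, 4102, 0.649, 4218)
p_new = 2 * (1 - NormalDist().cdf(abs(z_new)))
p_ret = 2 * (1 - NormalDist().cdf(abs(z_ret)))
# Both segment-level effects clear p < 0.05 on their own; whether
# 3.4 vs. 3.7 percentage points constitutes an interaction is the
# question this simple test does not answer.
```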
34. Causal Inference
reasoning_effort: high
Observational study:
We observed that users who receive in-app coaching messages
have 40% higher 60-day retention than users who do not.
The messages are triggered automatically when users show
certain behavioral signals (low activity, skipped onboarding
steps, etc.).
I want to claim that coaching messages cause higher retention.
Analyze the causal inference challenge:
1. Identify all plausible confounders between "receives coaching"
and "retains at 60 days"
2. Explain why the trigger mechanism makes naive comparison
especially unreliable
3. Propose the cleanest observational study design that would
reduce confounding (assuming we cannot run an RCT)
4. Describe what a properly designed RCT would require,
including how you'd handle the ethical concern of
withholding coaching from struggling users
5. What estimate of the true causal effect would you be
comfortable defending given only observational data?
35. Experimental Result Interpretation
reasoning_effort: high
Experimental result:
We trained two versions of a text classifier:
- Model A: fine-tuned on 10,000 labeled examples from our domain
- Model B: zero-shot prompting of a large foundation model
Test set performance (held-out 2,000 examples, same distribution
as training data):
- Model A: F1 = 0.91, precision = 0.93, recall = 0.89
- Model B: F1 = 0.84, precision = 0.88, recall = 0.80
However, on a stress-test set of 500 adversarial and out-of-distribution
examples we collected separately:
- Model A: F1 = 0.62
- Model B: F1 = 0.79
Interpret these results:
1. What does the in-distribution gap tell us about fine-tuning vs.
prompting for this task?
2. What does the OOD reversal suggest about the nature of what
Model A learned?
3. What are the production deployment implications if roughly
15% of real-world inputs are OOD?
4. Design a follow-up experiment that would determine whether
Model A's OOD degradation is due to distribution shift or
adversarial manipulation specifically
State confidence levels for each interpretive claim.
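For question 3 above, a rough blend makes the deployment stakes concrete. Note the caveat: F1 does not decompose linearly across subpopulations, so this is a back-of-envelope proxy, not a real estimate:

```python
# Crude proxy for prompt 35's 15%-OOD question. F1 is not additive
# across subsets, so treat these numbers as directional only.
def blended(in_dist_f1, ood_f1, ood_share=0.15):
    return (1 - ood_share) * in_dist_f1 + ood_share * ood_f1

print(round(blended(0.91, 0.62), 3))  # Model A
print(round(blended(0.84, 0.79), 3))  # Model B
```

Even on this crude blend, Model A still leads at 15% OOD; solving for the crossover share gives roughly 29% OOD before Model B wins, which is exactly the kind of threshold a good o3 answer should surface.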
36. System Dynamics Modeling
reasoning_effort: high
I want to model the growth dynamics of a two-sided marketplace
with these structural elements:
- Buyers and sellers who both exhibit network effects (more of
each side increases value for the other side)
- A "quality deterioration" effect: as low-quality sellers are
attracted by platform growth, average transaction quality declines
- A trust-based retention mechanism: buyers who have a bad
transaction churn at 3x the normal rate and take their
social network connections with them
Construct a system dynamics model:
1. Define the key stocks (state variables) and flows (rates of change)
2. Write out the causal relationships with sign and feedback loop type
(reinforcing vs. balancing)
3. Identify all feedback loops and classify them
4. Describe the qualitative behavior this system would exhibit
at different levels of seller quality enforcement
5. Identify the highest-leverage intervention point to prevent
quality-driven collapse at scale
You do not need to provide numerical equations — reason about
the structure and behavior qualitatively, with precision.
Decision Analysis Prompts (37–43)
37. Decision Under Uncertainty
reasoning_effort: high
Decision:
We must decide by Friday whether to sign a 3-year office lease at
$180,000/year. The alternative is month-to-month at $22,000/month.
We have 28 employees and expect to grow, but there is real uncertainty.
Relevant uncertainties:
- 30% chance we raise a Series A within 12 months and need to move
to a larger space
- 20% chance of a significant market downturn in the next 18 months
that triggers headcount reductions (to ~15 employees)
- 50% chance we remain roughly stable and the long-term lease saves money
Sublease risk: if we need to exit the 3-year lease early, we estimate
60% probability of finding a subtenant within 6 months (at break-even),
and 40% probability of carrying 3-6 months of empty rent.
Build a decision tree with expected value calculations for both options.
State all assumptions. Then do a sensitivity analysis: at what
probability of Series A fundraising does the month-to-month option
become preferable in expected value terms?
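The expected-value skeleton o3 should produce looks something like this. The lease and month-to-month figures come from the prompt; the exit-at-month-12 timing and the 4.5-month expected carry (midpoint of the 3-6 month range, weighted by the 40% no-subtenant probability) are my assumptions:

```python
# Simplified EV sketch for prompt 37. Exit timing and carry duration
# are assumptions layered on the prompt's stated numbers.
LEASE_YR, MTM_MO = 180_000, 22_000

def lease_ev(p_exit):
    stay = 3 * LEASE_YR                                  # full term: $540K
    exit_cost = LEASE_YR + 0.4 * (LEASE_YR / 12) * 4.5   # 12 mo paid + E[carry]
    return p_exit * exit_cost + (1 - p_exit) * stay

def mtm_ev(p_exit):
    # Pay month-to-month until the month-12 exit, or for the full 36 months
    return p_exit * (12 * MTM_MO) + (1 - p_exit) * (36 * MTM_MO)

for p in (0.3, 0.6, 0.9):
    print(p, round(lease_ev(p)), round(mtm_ev(p)))
```

Under these particular assumptions the lease wins at every exit probability, because break-even subleasing caps the downside at about $207K for 12 months versus $264K month-to-month. That tells you the sensitivity question really turns on the sublease assumptions, not the Series A probability, which is precisely the insight the verification step should catch.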
38. Expected-Value Calculation
reasoning_effort: medium
I'm evaluating three potential product investments for next quarter.
Each requires engineering and PM resources:
Investment A: New integration with Salesforce
- Development cost: 6 engineer-weeks + 2 PM-weeks
- Estimated ARR impact: $200K–$400K (uniform distribution over range)
- Probability of technical success: 90%
- Time to revenue: 3 months after ship
Investment B: Self-serve onboarding flow redesign
- Development cost: 4 engineer-weeks + 3 PM-weeks
- Estimated ARR impact: $100K–$600K (skewed right — most likely $150K,
max $600K)
- Probability that impact exceeds $200K: 25%
- Time to revenue: 1 month after ship
Investment C: Enterprise reporting dashboard
- Development cost: 10 engineer-weeks + 2 PM-weeks
- Estimated ARR impact: $0K or $500K (binary, depending on whether
we close 3 specific enterprise deals)
- Probability of closing all 3 deals given the feature: 40%
- Time to revenue: 6 months after ship
Available capacity: 12 engineer-weeks, 5 PM-weeks this quarter.
Calculate expected ARR per engineer-week for each investment.
Identify the portfolio that maximizes expected ARR within the
capacity constraint, and flag which investment the EV calculation
most likely undervalues due to distribution shape.
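The per-engineer-week arithmetic here is mechanical once you pick point estimates. Investments A and C follow directly from the prompt's numbers; the point estimate for B's right-skewed distribution (a crude two-point approximation giving roughly $200K expected ARR) is my assumption:

```python
# EV-per-engineer-week sketch for prompt 38. Investment B's expected
# ARR is a crude two-point estimate, not given in the prompt.
investments = {
    "A": {"eng_wk": 6,  "ev_arr": 0.90 * (200_000 + 400_000) / 2},  # uniform mean
    "B": {"eng_wk": 4,  "ev_arr": 0.75 * 150_000 + 0.25 * 350_000}, # 2-pt approx
    "C": {"eng_wk": 10, "ev_arr": 0.40 * 500_000},                  # binary
}
for name, inv in investments.items():
    inv["ev_per_wk"] = inv["ev_arr"] / inv["eng_wk"]
    print(name, round(inv["ev_per_wk"]))
```

A + B fits the capacity constraint exactly (10 engineer-weeks, 5 PM-weeks) with the highest combined EV, and B is the one the point estimate most likely undervalues: its long right tail means the mean sits well above the mode.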
39. Risk-Weighted Strategy
reasoning_effort: high
Context:
We are a 15-person cybersecurity startup with $4M in the bank,
$1.2M ARR, and 14 months of runway. We have an opportunity to
pursue a large government contract ($2.8M, 18-month engagement)
that would require us to:
- Hire 4 specialized engineers immediately
- Achieve FedRAMP authorization (12–18 months, $300K–$500K cost)
- Divert 60% of engineering capacity from our commercial product
Alternative: focus entirely on the commercial market where we
have early traction.
Produce a risk-weighted strategic analysis:
1. Financial scenario modeling for both paths (at least 3 scenarios each)
2. Strategic option value: what does each path open or close?
3. Key risks that are not visible in the financial model
4. Decision criteria that would make the government path clearly right
or clearly wrong
5. A recommended decision and the conditions under which you'd reverse it
State the assumptions that most affect the recommendation.
40. Multi-Stakeholder Negotiation
reasoning_effort: high
Situation:
I am negotiating a software licensing agreement with a large enterprise
customer. Three stakeholders are involved on their side:
- VP of Engineering (technical buyer): wants perpetual license,
worried about vendor lock-in, prefers on-premises deployment
- CFO: wants annual subscription to spread cost, concerned about
total cost of ownership over 5 years
- CISO: requires data residency guarantees, SOC 2 Type II, and
the right to audit our infrastructure
My constraints:
- Perpetual license is technically feasible but unfavorable — it
removes recurring revenue
- On-premises deployment is possible but costs us $80K in
setup and ongoing support
- We have SOC 2 Type II; audit rights are negotiable
- We need to close by end of quarter (6 weeks)
Design a negotiation strategy that:
1. Maps each stakeholder's stated and likely unstated interests
2. Identifies value trades that satisfy each party differently
3. Sequences the negotiation to resolve the highest-risk objection first
4. Proposes fallback positions for each key term
5. Identifies the concession that costs us least but has highest
perceived value to them
41. Build vs. Buy With TCO
reasoning_effort: medium
Decision: should we build or buy an internal data pipeline and
transformation layer?
Build option:
- 3 engineers × 4 months to build v1
- Ongoing maintenance: 0.5 FTE per year
- Can be exactly tailored to our needs
- Full control over roadmap
Buy options (3 vendors evaluated):
- Vendor X: $3,500/month, covers 80% of our use cases, 2-week implementation
- Vendor Y: $8,000/month, covers 95% of use cases, 6-week implementation
- Vendor Z: $2,000/month, covers 65% of use cases, strong open-source
ecosystem but requires in-house customization (estimate 6 engineer-weeks)
Engineering cost: $180,000 fully loaded per engineer per year
Current data volume: 50M events/day, growing 15% per month
Build a 3-year TCO analysis for each option. Include:
1. Year-by-year cost breakdown
2. Hidden costs (integration, migration, vendor risk, coverage gaps)
3. Break-even point for the build option vs. each vendor
4. The factor that most changes the recommendation under sensitivity analysis
5. Your recommendation and the key assumption it depends on
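The headline TCO numbers are easy to sketch before o3 layers on the hidden costs. Timing conventions here (maintenance starting in year 2, engineer-week cost derived as $180K / 52) are my assumptions, not the article's:

```python
# 3-year headline TCO sketch for prompt 41, before hidden costs.
ENG_YEAR = 180_000
ENG_WEEK = ENG_YEAR / 52

def tco_build():
    build = 3 * (4 / 12) * ENG_YEAR   # 3 engineers x 4 months
    maintain = 0.5 * ENG_YEAR * 2     # 0.5 FTE in years 2-3
    return build + maintain

def tco_vendor(monthly, custom_eng_weeks=0):
    return monthly * 12 * 3 + custom_eng_weeks * ENG_WEEK

print(round(tco_build()))           # build: ~$360K
print(round(tco_vendor(3_500)))     # Vendor X: $126K
print(round(tco_vendor(8_000)))     # Vendor Y: $288K
print(round(tco_vendor(2_000, 6)))  # Vendor Z: ~$93K
```

On headline numbers alone, build never breaks even against X or Z within 3 years, which is why the prompt forces the hidden-cost and coverage-gap analysis: the 20-35% use-case gaps are where the cheap options claw back cost.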
42. Hiring Decision Matrix
reasoning_effort: medium
I need to hire a VP of Engineering. After a full process, I have
two finalists:
Candidate A:
- 12 years experience, 8 as an engineering manager
- Grew engineering team from 8 to 60 at a B2B SaaS company
- Strong operator: excellent at process, hiring, performance management
- Lacks deep technical background in our stack (ML infrastructure)
- Cultural fit: methodical, process-driven; our culture is currently scrappy
Candidate B:
- 9 years experience, 4 as an engineering manager
- Technical depth in ML infrastructure is exactly what we need
- Built high-output teams but never at scale above 20 engineers
- Cultural fit: builder's mindset, aligns with current team
- Weak references on stakeholder management and cross-functional
alignment
Our context: 18-person engineering team, building ML-heavy product,
Series B fundraise in 12 months, need to 3x team size in 18 months.
Build a decision framework:
1. Weight the relevant criteria for our situation
2. Score each candidate against each criterion with your reasoning
3. Identify the scenario where each candidate clearly outperforms
4. Flag the highest-risk assumption in choosing each
5. Recommend one — and state the one thing you'd verify before signing
Do not give a false balance. Make a call.
43. Board-Level Tradeoff
reasoning_effort: high
Context:
Our board is debating whether to pursue an aggressive growth strategy
or extend runway by cutting costs. Company state:
- $12M ARR, growing 65% year-over-year
- $18M raised, $9M remaining, 14 months runway at current burn
- $1.8M monthly burn rate
- NRR: 118%, CAC payback: 11 months
- Market: competitive, two well-funded competitors raised large
rounds in the past 6 months
Option A (aggressive growth): Increase burn to $2.8M/month,
hire 15 salespeople and 8 engineers in 90 days, target $22M ARR
in 12 months. Requires raising $15M within 6 months.
Option B (default alive): Cut burn to $1.2M/month, reduce team
by 20%, extend runway to 30+ months, grow to $17M ARR at
reduced pace. Raises from position of strength.
Produce a board-quality analysis:
1. The capital market assumptions each option requires
2. The company's defensibility in each scenario if fundraising
is slower than planned
3. What the unit economics imply about the right growth rate
4. The competitive dynamics argument for each option
5. A recommendation with explicit conditions that would change it
Treat this as advice to a board, not a summary of the tradeoffs.
Complex Troubleshooting Prompts (44–50)
44. Production Incident Root Cause
reasoning_effort: high
Incident timeline:
- 14:32 UTC: API error rate spikes from 0.1% to 18%
- 14:33 UTC: Alerting fires on P95 latency (2.1s vs. 200ms baseline)
- 14:35 UTC: On-call engineer begins investigation
- 14:41 UTC: Deployment rolled back (deployed at 14:28 UTC)
- 14:48 UTC: Error rate drops to 2% but does not fully recover
- 15:10 UTC: Error rate returns to baseline after database connection
pool restart
Deployment at 14:28 UTC: added a new database query to the
user session validation path (runs on every authenticated request).
Database metrics during incident:
- Active connections: jumped from 45 to 195 (pool max: 200)
- Query duration for new query: median 850ms, P99 12s
- Existing queries: median latency increased 3x during incident
- No deadlocks or lock waits recorded
Application metrics:
- 12 out of 16 API pods showed elevated errors
- 4 pods that recently restarted showed no errors
Perform root cause analysis:
1. Primary root cause with supporting evidence
2. Contributing factors that explain why recovery was partial
after rollback
3. Why 4 pods were unaffected
4. Three systemic fixes (not just "add index") with priority order
5. Detection improvements that would have caught this before impact
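A good root-cause answer here will quantify the pool exhaustion with Little's law: concurrent connections needed is roughly request rate times query duration. The 850 ms median and 200-connection pool max come from the prompt; the request rates below are hypothetical:

```python
# Little's law back-of-envelope for prompt 44. Request rates are
# hypothetical; 850 ms and the pool max of 200 come from the incident data.
POOL_MAX, BASELINE_CONNS = 200, 45

def concurrent_connections(rps, query_seconds):
    return rps * query_seconds   # Little's law: L = lambda * W

for rps in (50, 100, 200):
    need = concurrent_connections(rps, 0.85)
    status = "exhausted" if need + BASELINE_CONNS > POOL_MAX else "ok"
    print(rps, need, status)
```

This arithmetic also explains the partial recovery and the four clean pods: rollback stopped new slow queries, but saturated pools held their connections until restarted, while recently restarted pods began with fresh pools.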
45. Intermittent Bug Hypothesis Tree
reasoning_effort: high
Bug report:
Intermittent payment processing failures affecting roughly 0.3%
of transactions. Symptoms:
- Failures appear random across customers, amounts, and payment methods
- Error message: "Transaction timeout after 30000ms"
- Failures cluster slightly between 9:00–10:00 AM and 4:00–5:00 PM
in US Eastern time
- Rate has increased from 0.1% to 0.3% over the past 3 weeks
- Payment processor reports no incidents during these times
- Database shows no anomalies during the failure windows
Architecture:
- Node.js API servers → payment microservice → third-party payment processor
- Payment microservice runs on 6 pods (Kubernetes)
- Each pod maintains a connection pool to the payment processor API
Build a hypothesis tree:
1. Generate at least 8 distinct hypotheses for the root cause
2. For each hypothesis: the causal mechanism, why it produces 0.3%
failure rate, and why it would cluster at business-hours peaks
3. Group hypotheses by likely location (network, application,
external dependency, infrastructure)
4. Rank by probability given all symptoms, including the 3-week trend
5. The three cheapest diagnostic steps that would most rapidly
eliminate hypotheses
46. System-Level Outage Analysis
reasoning_effort: high
Post-mortem scenario:
A 47-minute complete service outage occurred. Reconstruct the
failure cascade from these observations:
Observation 1: The outage began immediately after an auto-scaling
event added 40 new application instances.
Observation 2: The service discovery system (Consul) showed all
new instances as unhealthy within 90 seconds of launch.
Observation 3: Existing instances continued serving traffic for
8 minutes before also failing.
Observation 4: Database CPU spiked to 100% at the 6-minute mark.
Observation 5: The load balancer continued routing traffic to all
instances (healthy and unhealthy) throughout the outage.
Observation 6: Recovery occurred only after manually removing
all new instances from the service discovery registry and
restarting the database connection pool on old instances.
Reconstruct the most likely failure cascade step-by-step. For
each step, identify what evidence supports it and what alternative
explanations you are ruling out. Then identify the single control
that, if implemented, would have prevented the cascade from
progressing past the initial scaling event.
47. Security Incident Timeline
reasoning_effort: high
Suspected intrusion. Known facts:
- 03:17 UTC: Unusual API calls from an authenticated user account
(user ID 48291) — bulk export of customer records via a legitimate
but rarely-used admin endpoint
- 03:22 UTC: The same account initiates 14 password reset emails
to high-value customer accounts
- 03:31 UTC: SIEM alerts on the bulk export volume
- 03:45 UTC: Security team disables account 48291
- 04:12 UTC: Three of the targeted customers report unauthorized
login attempts from new IPs
Account 48291 belongs to a support engineer who was on PTO
and denies any access. MFA is required for this account;
MFA logs show a successful TOTP verification at 03:16 UTC
from an IP in a country the employee has never accessed from.
Reconstruct the incident:
1. The most likely attack vector (how was the account compromised)
2. Timeline with attacker actions vs. defender actions
3. What the attacker's objective appears to have been
4. Evidence of what data may have been exfiltrated
5. Immediate containment actions that should have happened faster
6. Three detection controls that would have caught the initial
access in real time
Identify the assumption in your reconstruction that you are
least confident about.
48. Data-Quality Root Cause
reasoning_effort: high
Issue:
Our data team noticed that monthly revenue reported by our
billing system ($2.34M) and our data warehouse ($2.19M) diverged
by $150K last month. This has never happened before at this scale
(prior discrepancies under $5K).
System context:
- Billing system: Stripe, with custom subscription logic for
annual plan prorations
- Data warehouse: nightly ETL pulls from Stripe API, transforms,
and loads into Snowflake
- Revenue recognition: custom SQL that applies ASC 606 rules
on top of raw billing data
- Last month: we launched annual plans (previously only monthly)
Generate a systematic root cause investigation plan:
1. The most likely sources of the $150K discrepancy given the
context (at least 5 hypotheses)
2. For each hypothesis: the data query or check that would
confirm or eliminate it
3. The sequence in which to run those checks (fastest elimination first)
4. What a $150K discrepancy from annual plan launch specifically
suggests about proration handling
5. The permanent fix and the audit procedure to prevent recurrence
Flag any hypothesis that would indicate a systemic reporting
error (not just a one-time variance).
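The fastest check in any investigation like this is a segmented reconciliation: break both totals down by plan type and see where the gap lives. A minimal sketch with hypothetical figures (the real version would query Stripe and Snowflake):

```python
# Segmented reconciliation sketch for prompt 48. All dollar figures
# are hypothetical, chosen to illustrate the $150K gap localizing
# to the newly launched annual plans.
billing   = {"monthly": 1_840_000, "annual": 500_000}  # billing system
warehouse = {"monthly": 1_840_000, "annual": 350_000}  # data warehouse

for plan in billing:
    gap = billing[plan] - warehouse[plan]
    if abs(gap) > 5_000:  # prior discrepancies were under $5K
        print(f"{plan}: ${gap:,} gap, investigate proration/ETL handling")
```

If the gap concentrates in annual plans, as sketched here, the prime suspects become proration handling in the custom subscription logic and the ETL's treatment of annual invoices (recognized ratably vs. loaded as a lump sum).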
49. Financial Anomaly Investigation
reasoning_effort: high
Situation:
Our gross margin dropped from 71% to 58% in one month. Revenue
was flat (no growth or decline). The margin drop appeared in the
month-end close and was flagged by our CFO.
Known changes that month:
- We migrated from AWS to Google Cloud (partially — 40% of workloads
moved, 60% still on AWS during transition)
- We signed two new enterprise customers who required dedicated
infrastructure (separate instances, not shared)
- No new headcount
- COGS in the income statement: cloud infrastructure, third-party
APIs, and customer support labor
Investigate:
1. Generate all plausible explanations for a 13-point gross margin
decline with flat revenue
2. For each explanation: the account line it would appear in,
the dollar magnitude consistent with a 13-point drop, and
whether the known changes could produce it
3. The financial query or report that would isolate the source
4. If the cause is the cloud migration: what accounting treatment
should have been applied to the transition costs, and whether
this belongs in COGS or a separate line
5. The permanent reporting fix to prevent this from obscuring
underlying margin trends in future migrations
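Before generating hypotheses, it helps to size the move: with flat revenue, a 13-point margin drop means COGS rose by 13% of revenue, roughly a 45% jump in COGS itself. The margin percentages come from the prompt; the $2M/month revenue figure is hypothetical:

```python
# Sizing the COGS move behind prompt 49's margin drop. Revenue is
# hypothetical; the 71% -> 58% margins come from the prompt.
REV = 2_000_000                 # assumed monthly revenue
old_cogs = (1 - 0.71) * REV     # $580K
new_cogs = (1 - 0.58) * REV     # $840K
delta = new_cogs - old_cogs
print(round(delta), round(delta / old_cogs * 100, 1))
```

Any single explanation must therefore account for roughly $260K of new monthly COGS at this revenue scale, and paying for two clouds at once during a partial migration is about the only known change sized right for that.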
50. Multi-System Failure Analysis
reasoning_effort: high
Three systems failed in a 90-minute window. Determine whether
these are related or independent failures:
System 1 (14:05 UTC): Order management system becomes unresponsive.
Root cause identified: a full-table scan query in a nightly report
job that was mistakenly triggered at 14:00 instead of 02:00.
System 2 (14:22 UTC): Customer notification service stops sending
emails. On-call finds the service is up but the outbound email
queue has grown to 400,000 messages (normal: under 5,000).
System 3 (14:51 UTC): Inventory sync service starts reporting
stale data. All inventory levels in the UI are frozen at
14:00 UTC values.
Additional context:
- Systems 1, 2, and 3 share a PostgreSQL database cluster
- System 2's email queue is populated by database triggers on
the orders table
- Inventory sync runs every 15 minutes and reads from the same
orders table
- The nightly report job that caused System 1's failure produces
a 2.3M row result set that it writes to a temp table
Analyze:
1. Is this one failure or three independent failures? Build
the causal graph.
2. The mechanism by which System 1's failure propagated (if it did)
3. Why System 3 shows stale data from exactly 14:00 UTC
4. The order in which these should have been diagnosed and resolved
5. The architectural change that would have contained System 1's
impact to System 1 alone
o3 Power Tips
State the full problem, then ask. Load all constraints, context, and data into the prompt before posing the question. o3's internal reasoning pass is strongest when it starts with a complete problem definition — not one it has to assemble from a vague question and follow-up clarifications.
Request the answer and a verification step — not chain-of-thought. Write "After your answer, verify by [specific check]" instead of "show your work" or "think step by step." You get a correctness check without forcing o3 to narrate reasoning it already did internally.
Calibrate reasoning_effort to the problem. Use high for proofs, formal algorithm correctness, and hard causal reasoning. Use medium for synthesis, planning, and analysis where depth matters but exhaustive search doesn't. Use low for triage, classification, and first-pass drafts where speed matters more than precision.
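In practice this is one parameter on the API call. A minimal sketch using the OpenAI Python SDK's chat-completions form; check your SDK version for the exact surface, as the parameter shape has varied across API versions:

```python
# Sketch of setting reasoning_effort for an o3 call. The request is
# built but not sent; sending requires an API key and the openai SDK.
def build_request(prompt: str, effort: str = "medium") -> dict:
    assert effort in ("low", "medium", "high")
    return {
        "model": "o3",
        "reasoning_effort": effort,  # controls the thinking-token budget
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Prove the loop terminates for all n >= 1.", effort="high")
# client.chat.completions.create(**req)  # not run here; needs credentials
print(req["reasoning_effort"])
```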
Define success criteria explicitly. Tell o3 what a good answer looks like — "the complexity argument must reference specific loop iterations," "the proof must handle the general case, not just small examples," "the recommendation must include the conditions under which you'd reverse it." This shapes the internal reasoning toward the right target.
Ask for confidence levels and explicit assumptions. Add "State your confidence in each major claim and the assumptions that underpin it" to any prompt where the answer involves inference under uncertainty. o3 surfaces hidden assumptions better when you request them directly.
Use re-derivation verification on hard problems. For math, proofs, and algorithmic correctness, the highest-value verification is "re-derive from first principles and check for inconsistencies with your first answer." This catches errors that a simple plausibility check misses.
Before and After: One Prompt, Rewritten
The before version narrates a reasoning procedure; the after version states the full problem up front and ends with a verification step.
Before:
Let's work through this step by step. First, think about the constraints. Then consider the options. Walk me through your reasoning as you go. What's the best decision here given that we have 18 months runway, two acquisition offers, and a potential Series A?
After:
reasoning_effort: high
Context: 18 months runway, $3M ARR, 65% YoY growth, NRR 118%.
Option A: $18M cash acquisition at close in 60 days.
Option B: $25M deal, 60% cash + 40% acquirer stock, 6-month close.
Option C: $5M strategic investment at $30M pre-money, remain independent.
Acquirer stock in Option B is worth 0.5x–2.5x stated value in 3 years depending on their trajectory. Option C requires 2x ARR in 18 months to raise at a reasonable valuation; current growth makes this plausible but not certain.
Assume 25% fully diluted founder ownership. Which option maximizes expected founder outcome? Build a decision tree across the key uncertainties, state your assumptions explicitly, and recommend one option. After your recommendation, identify the assumption that most changes the answer if wrong.
Build Better Prompts for Hard Problems
These 50 prompts follow one principle: give o3 a well-defined problem and get out of the way. The model's reasoning budget is the differentiator — your job is to aim it, not direct it step by step.
If you want to build prompts like these for your own hard problems without doing it from scratch, the AI prompt generator produces structured, reasoning-model-optimized prompts from plain-English descriptions. For the full theoretical foundation on why reasoning models require a different approach, read the complete guide: Prompt Engineering for Reasoning Models. And if you're also working with GPT-5, see 50 Best GPT-5 Prompts in 2026 for the companion roundup.