If the math problem is genuinely hard — the kind where a wrong step in line three quietly poisons the answer in line fourteen — the default pick in 2026 is o3. It is the most stable model we have for long chains of deduction and the most willing to catch its own errors. Gemini Deep Think is the cost-aware quantitative pick, especially when you need to reason across a long document or many tables. DeepSeek R1 is the budget option that punches well above its price tier on competition-style problems and on legible scratch work. Claude Opus 4.7 with extended thinking is the right call when math is embedded in a longer analytical narrative, or when you need a million-token window with disciplined step-by-step reasoning.
How We Evaluated
This is a working buyer's matrix, not a leaderboard. We focused on the dimensions that actually predict whether a model will get a hard quantitative problem right and whether a careful reader will be able to trust the answer.
The seven dimensions in the matrix are:
- Context window — how much math you can stuff in, including problem statements, reference material, prior steps, and data tables.
- Multi-step deduction stability — whether the model can sustain a long chain of reasoning without drifting, swapping signs, or losing a constraint introduced ten steps earlier.
- Symbolic manipulation — how reliably it handles algebra, calculus, combinatorics, and proof-style work without "looks plausible" hand-waving.
- Self-verification behavior — whether the model checks its own work, substitutes back, sanity-checks units, or flags uncertainty rather than confidently producing the wrong number.
- Showing work clearly — whether the output is a legible argument a human can audit, not a wall of dense notation.
- Latency — how long you wait for a hard problem. Reasoning models are slow by design; we treat this as a factor, not a deal-breaker.
- Cost tier — relative price for a representative hard-math turn.
Honesty disclaimer. AIME, USAMO, MATH, GPQA Diamond, and FrontierMath are real public benchmarks with published results from model providers and independent researchers. Those numbers move with every model revision, and we will not quote specific percentages in this post. Where prose references a benchmark, we name it and say results are published by the provider or by independent evaluators — no fabricated scores. Capability columns are qualitative buckets: Best-in-class, Strong, Adequate, Trailing.
The Decision Matrix
| Model | Context window | Multi-step deduction stability | Symbolic manipulation | Self-verification behavior | Showing work clearly | Latency | Cost tier |
|---|---|---|---|---|---|---|---|
| o3 | 200k tokens | Best-in-class | Best-in-class | Best-in-class | Strong | Trailing | Premium |
| Gemini Deep Think | 1M tokens | Strong | Strong | Strong | Strong | Trailing | Mid |
| DeepSeek R1 | 128k tokens | Strong | Strong | Adequate | Best-in-class | Adequate | Budget |
| Claude Opus 4.7 (extended thinking) | 1M tokens | Strong | Adequate | Strong | Best-in-class | Adequate | Premium |
The matrix says the boring true thing: there is no free lunch. o3 is the strongest at the math itself but is slow and expensive. Gemini Deep Think trades a notch of raw deductive horsepower for a much larger context and a friendlier price. DeepSeek R1 is shockingly capable for the money and writes the cleanest scratch work, with a softer track record on catching its own mistakes. Claude Opus 4.7 with extended thinking is the long-context analyst — its math is solid, its narrative work is exceptional, and its weak spot is dense symbolic manipulation.
o3: When It's the Right Call
o3 is the default for any math problem where being wrong is expensive. Inside OpenAI's reasoning family, it is the model that most reliably sustains a long deductive chain without drifting, and it has the strongest self-verification habits we have seen in production. On hard problem sets like AIME and on graduate-level science reasoning evaluations such as GPQA Diamond — results published by OpenAI and by independent evaluators — o3 has been the reference point that other 2026 reasoning models are measured against.
The thing o3 does that lesser models don't: it slows down on purpose. Give it a multi-step optimization problem or a proof and it will spend real time exploring, backtracking, substituting candidate answers back into the original constraints, and rejecting branches that don't close. The output is not a dazzling first draft. It is a verified one.
Pick o3 when:
- The problem is genuinely hard — competition math, non-trivial proofs, multi-constraint optimization, edge cases in numerical methods.
- The cost of a wrong answer is higher than the cost of a slow one. Quant research, model risk, exam prep at the top end, technical interview prep.
- You are verifying a human-written derivation and you want a model that will actually try to find the error rather than agree with the author.
- You can tolerate latency. o3 thinks for a while. That's the feature.
Avoid o3 when:
- You need throughput. o3 is wasted on batch grading thousands of arithmetic problems.
- You need a 1M-token window. o3's 200k is generous but not the widest in this matrix.
- The problem is genuinely easy. o3 will still answer correctly, but you are overpaying for headroom you didn't need.
The right prompt for o3 on a hard problem is short and clean. Strip the ornament. Reasoning models reward precision in the problem statement and tend to be distracted by verbose system instructions. See the sample prompt section below for what "lean" looks like.
One more practical note. o3's verification behavior is most useful when you actually ask for it. The model will often self-verify by default on hard problems, but stating "verify your answer before reporting it" makes the verification step robust rather than opportunistic — and gives you a clean signal in the output when verification succeeds versus when the model is still uncertain.
Gemini Deep Think: When It's the Right Call
Gemini Deep Think is the cost-aware quantitative pick. In Google's Gemini family, Deep Think is the higher-reasoning mode designed for problems where the model needs to spend more compute exploring solution branches. On hard math evaluations including IMO-style problems and on FrontierMath — results published by Google DeepMind and by independent researchers — Deep Think has been a credible peer to o3 across many problem types, with the practical advantages of a 1M-token context window and a friendlier price tier.
What makes Deep Think the right call is not that it beats o3 head-to-head on every category — it doesn't. It is that the gap is smaller than the cost gap, and that the long context lets you do things that simply aren't possible with a 200k window. You can feed Deep Think an entire textbook chapter, a full dataset description, every prior exam, and the student's submission, all in one turn. The reasoning quality holds up across that volume of material in ways smaller-context reasoning models cannot match.
Pick Gemini Deep Think when:
- You need serious reasoning at a meaningfully lower cost than o3, and you are okay trading a small amount of top-end accuracy for that savings.
- The problem involves a lot of material — long problem statements, reference PDFs, historical data, multi-document derivations.
- You are doing applied quant work where the math is hard but not olympiad-hard, and the bottleneck is integrating information across many sources.
- You want strong self-verification behavior without paying premium-tier rates.
Avoid Gemini Deep Think when:
- You are working on the absolute hardest end of competition or research-grade math. o3 still has the edge there.
- You need consistent low latency. Deep Think, like all top-tier reasoning models in 2026, is slow on hard problems by design.
Deep Think is the most defensible default when you are buying for a team and most of your math work is "hard, but tractable" rather than "frontier." It also pairs naturally with o3 in a two-pass setup: Deep Think drafts the solution across whatever sprawling reference material you have, then o3 verifies the critical steps. That stack costs less than running everything on o3 and gets you most of the way to o3-only reliability.
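A minimal sketch of that two-pass stack, assuming you already have your own client code for each provider. The `call_deep_think` and `call_o3` functions below are hypothetical placeholders, not real SDK calls, and the prompt wording is illustrative rather than prescriptive:

```python
# Illustrative two-pass stack: the long-context model drafts, o3 verifies.
# call_deep_think and call_o3 are placeholders for your own provider clients;
# they are NOT real SDK functions.
from typing import Callable

def two_pass_solve(
    problem: str,
    reference_material: str,
    call_deep_think: Callable[[str], str],
    call_o3: Callable[[str], str],
) -> str:
    # Pass 1: draft a solution over the full pile of reference material,
    # which only the long-context model can hold in one turn.
    draft = call_deep_think(
        f"{reference_material}\n\nProblem:\n{problem}\n\n"
        "Work the problem step by step and show your reasoning."
    )

    # Pass 2: the stronger verifier re-checks the critical steps against the
    # original problem statement, without the bulky reference material.
    return call_o3(
        f"Problem:\n{problem}\n\nProposed solution:\n{draft}\n\n"
        "Check each step. If a step is wrong, identify the error, redo the "
        "affected steps, and report the corrected final answer."
    )
```

The asymmetry is the point: the cheaper pass sees everything, while the expensive pass sees only the problem and the draft it has to judge.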
DeepSeek R1: When It's the Right Call
DeepSeek R1 is the budget pick that nobody serious gets to ignore. DeepSeek released R1 as an open-weights reasoning model — the model card and the original release post from DeepSeek describe a reinforcement-learning pipeline built specifically to incentivize long chain-of-thought reasoning. On MATH, AIME-style problems, and competition coding benchmarks — results published by DeepSeek and corroborated by independent evaluators — R1 has been the model that proved you don't need premium-tier pricing to get genuinely strong mathematical reasoning.
The standout property of R1 for math users isn't raw accuracy. It is legibility. R1 shows its work in clean, structured, human-readable scratch work — the kind of step-by-step a teacher would actually want to read. For tutoring, for explanation-heavy contexts, and for any use case where "show your work" is part of the deliverable, R1 is the best in this matrix.
Pick DeepSeek R1 when:
- Budget is a real constraint. You are running high-volume math workloads — tutoring platforms, grading assistance, problem-set generation — and premium-tier reasoning models would be cost-prohibitive.
- You want clean, well-structured scratch work that a student or junior analyst can actually follow.
- You can self-host. R1 being open-weights is a serious advantage for teams that want to control their reasoning stack and avoid per-token pricing entirely.
- You are okay with weaker self-verification. R1 is strong at deriving an answer but less aggressive than o3 about catching its own mistakes.
Avoid DeepSeek R1 when:
- The problem is at the frontier and you cannot tolerate a missed verification step.
- You need a guarantee that the model has checked its own work before reporting the answer. Pair R1 with a verification pass from a different model if that's the requirement.
For most education, tutoring, and high-volume quantitative reasoning workloads, R1 is the rational pick. Reserve premium-tier models for the genuinely hard tail.
Claude Opus 4.7 (extended thinking): When It's the Right Call
Claude Opus 4.7 with extended thinking enabled is the long-context analyst's pick. Anthropic's extended-thinking mode lets Opus 4.7 spend additional compute on reasoning before producing an answer, and combined with its 1M-token context window it occupies a specific niche: math that lives inside a longer narrative, or quantitative analysis that has to be defensible to a careful human reader.
On hard math benchmarks Opus 4.7 with extended thinking is not the top scorer in this matrix — that is generally o3. Anthropic's own published results put Opus 4.7's math performance in the strong tier, ahead of most non-reasoning models and competitive with other premium reasoning models on most problem types, with some softness on heavy symbolic manipulation. What it excels at is writing the math up. The output is structured, the assumptions are surfaced, the caveats are honest, and the prose is the kind a human reviewer can read and trust.
Pick Claude Opus 4.7 (extended thinking) when:
- The math is embedded in a longer analysis — a research memo, model documentation, an audit report, a long technical post-mortem.
- You need a million-token window. Long-context math reasoning is genuinely useful when you are working across many documents or a large codebase that contains the calculations.
- You are writing for a human reader who needs to follow the reasoning. Opus 4.7 produces some of the most legible quantitative prose in this matrix.
- You value disciplined uncertainty — surfacing assumptions, flagging where the math could go wrong, distinguishing what is computed from what is assumed.
Avoid Claude Opus 4.7 when:
- The bottleneck is pure symbolic manipulation. Opus 4.7's math is solid, but o3 and DeepSeek R1 will grind through dense algebra more reliably.
- Cost is the binding constraint. Opus 4.7 is in the premium tier; if you are paying premium prices and the work is pure math, o3 is the safer choice.
The cleanest way to think about Opus 4.7 with extended thinking is: it is the model you want writing the analysis your CFO, your regulator, or your editor is going to read.
Which to Pick by Sub-Segment
The matrix above is a starting point. Math work has texture, and the right pick shifts depending on which kind of math you are doing.
Olympiad-style competition math
For AIME, IMO, Putnam-style problems where the difficulty is concentrated in clever construction rather than length: o3 first, Gemini Deep Think as the cost-aware second choice. DeepSeek R1 is impressively strong here for the price — published results on competition problems from DeepSeek and independent evaluators show R1 holding its own on the easier end of competition difficulty — but the top of the distribution is still o3's territory in 2026.
Symbolic algebra and proofs
For dense symbolic manipulation, long algebraic derivations, and formal-style proofs: o3 is the default; DeepSeek R1 is the strong budget alternative because it grinds through symbolic work cleanly. Avoid Claude Opus 4.7 for the pure symbol-grinding case; its strength is writing about math, not manipulating it for many steps. Gemini Deep Think is solid here too, but slightly less consistent than o3 at the hardest end.
Applied statistics and data analysis reasoning
For reasoning about distributions, hypothesis tests, regression diagnostics, A/B test results, causal-inference assumptions: Claude Opus 4.7 with extended thinking is the top pick. Statistics is almost never just math — it is math plus a story about what the data means. Opus 4.7's combination of strong reasoning and exceptional narrative writing is hard to beat here. Gemini Deep Think is the cost-aware second choice when you need to reason across many data sources at once.
Quantitative finance and modeling
For pricing models, portfolio math, risk calculations, time-series reasoning, and the kind of work where being wrong is genuinely expensive: o3 for the model-checking and the hardest cases, Gemini Deep Think for the day-to-day analytical work where its long context and lower cost give it the edge. Pair them — Deep Think drafts, o3 verifies — when the cost of an error is high enough to justify two passes.
Physics word problems
For physics — mechanics, thermodynamics, electromagnetism, quantum, anything where unit-tracking and dimensional analysis are load-bearing: o3 is the safest pick because its self-verification habits catch unit errors and physically implausible answers more reliably than the others. Gemini Deep Think is a credible alternative. DeepSeek R1 writes clean physics scratch work and is the right call for high-volume tutoring contexts.
Verification of human-written math (grading, error-finding)
For checking someone else's derivation, finding the bug in a student's proof, or auditing a quantitative result: o3 is the right pick, full stop. Self-verification is its defining strength. Other reasoning models will often agree with a confident-sounding but wrong derivation; o3 is the most likely to push back, re-derive from scratch, and surface the actual error. If you are building a grading pipeline, route the verification step to o3 even if upstream generation runs on a cheaper model. The economics work out: cheap model for the easy bulk, premium model for the disagreement cases and the audit pass.
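Here is one hedged sketch of what that routing could look like. The grading functions are hypothetical stand-ins for your own model calls, and the two-vote escalation rule is an assumption to tune, not a published recipe:

```python
# Illustrative grading router: a cheap model handles the easy bulk, o3 handles
# disagreements and the audit pass. grade_with_cheap_model and grade_with_o3
# are placeholders for your own model-call code, NOT real APIs.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    correct: bool
    explanation: str

Grader = Callable[[str, str], Verdict]

def route_grading(
    problem: str,
    student_solution: str,
    grade_with_cheap_model: Grader,
    grade_with_o3: Grader,
) -> Verdict:
    # Two independent cheap passes; agreement is treated as the easy bulk.
    first = grade_with_cheap_model(problem, student_solution)
    second = grade_with_cheap_model(problem, student_solution)
    if first.correct == second.correct:
        return first

    # Disagreement is the signal to pay for the premium audit pass.
    return grade_with_o3(problem, student_solution)
```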
For the broader framework that informs these picks, see our AI model selection guide and our complete guide to prompting reasoning models in 2026.
Sample Prompt for the Recommended Winner
Here is a clean prompt for o3 on a multi-step optimization problem. Reasoning models reward lean problem statements over decorated ones, so the prompt deliberately avoids heavy formatting, persona setup, or instruction blocks. The bracketed placeholders are where you customize.
Problem.
[State the problem in one or two short paragraphs. Define all variables. State all constraints explicitly. If there are units, include them.]
Find: [exactly what you want the model to compute or prove].
Work the problem step by step. Show your reasoning. After you reach an answer, verify it by substituting back into the original constraints and confirming each one holds. If the verification fails, identify the error and redo the affected steps. Report the final answer only after verification succeeds.
A worked example of the placeholder fill: "Problem. A factory produces two products, A and B. Each unit of A requires 3 hours of labor and 2 kg of material; each unit of B requires 1 hour of labor and 4 kg of material. The factory has 120 labor-hours and 100 kg of material available per day. Profit per unit of A is $40; per unit of B is $30. Find: the production plan that maximizes daily profit, the maximum profit, and whether the optimum is unique."
Three things to notice about this prompt structure. First, no system message, no role assignment, no "you are a math expert" framing. o3 reasons more reliably when the problem statement is the prompt rather than buried inside a wrapper. Second, the verification instruction is explicit. o3 will self-verify by default, but stating it forces the behavior and gives you a clean stopping condition. Third, the output specification is minimal — "report the final answer only after verification succeeds" — because over-specifying the format pulls compute away from the reasoning.
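A side note on the factory example above: it is a small linear program, so you can check whatever answer the model reports with a few lines of off-the-shelf code. The sketch below uses SciPy's `linprog`; the variable names are ours, and the expected optimum follows directly from the constraints as stated:

```python
# Independent check of the factory example: maximize 40A + 30B subject to
# 3A + B <= 120 (labor) and 2A + 4B <= 100 (material), with A, B >= 0.
from scipy.optimize import linprog

res = linprog(
    c=[-40, -30],                   # linprog minimizes, so negate the profits
    A_ub=[[3, 1],                   # labor:    3A + 1B <= 120
          [2, 4]],                  # material: 2A + 4B <= 100
    b_ub=[120, 100],
    bounds=[(0, None), (0, None)],  # non-negativity
    method="highs",
)

a_units, b_units = res.x
print(f"A = {a_units:.0f}, B = {b_units:.0f}, profit = ${-res.fun:.0f}")
# Expected: A = 38, B = 6, profit = $1700. Both constraints bind at that
# vertex and the objective is not parallel to either constraint, so the
# optimum is unique; the solution also happens to be integral.
```

A check like this is the same substitution step the prompt asks the model to perform on its own answer, just run outside the model.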
For more patterns specific to o3, see our roundup of best o3 prompts for 2026.
Closing
If you only remember one rule from this matrix: the right model depends on whether the math is the hard part or whether the writing around it is. For genuinely hard math and for verification, o3 is the 2026 default. For cost-aware quantitative reasoning with a long context, Gemini Deep Think. For budget-constrained tutoring and high-volume scratch work, DeepSeek R1. For statistics, applied analysis, and math-inside-prose, Claude Opus 4.7 with extended thinking.
Pick the right model, then write the right prompt for it — lean for o3, structured for Opus 4.7, explicit for Deep Think, and let R1 show its work. The combination of model choice and prompt discipline is what separates "the AI got it wrong" from "the AI got it right and showed me why."
Keep reading:
- AI Model Selection Guide — the master framework for picking models across every category.
- AI Reasoning Models: Prompting Complete Guide 2026 — how to prompt o3, Deep Think, R1, and Opus 4.7 with extended thinking.
- Best o3 Prompts 2026 — production-tested prompts for the model at the top of this matrix.