If the math problem is genuinely hard — the kind where a wrong step in line three quietly poisons the answer in line fourteen — the default pick in 2026 is GPT-5.5 at high reasoning effort. It is the most stable model we have for long chains of deduction and the most willing to catch its own errors, with Claude Opus 4.8 and Gemini 3.1 Pro neck-and-neck right behind it. Gemini 3.1 Pro is the cost-aware quantitative pick, especially when you need to reason across a long document or many tables. DeepSeek V4-Pro is the budget option that punches well above its price tier on competition-style problems and on legible scratch work. Claude Opus 4.8 with extended thinking is the right call when math is embedded in a longer analytical narrative, or when you need a million-token window with disciplined step-by-step reasoning.
How We Evaluated
This is a working buyer's matrix, not a leaderboard. We focused on the dimensions that actually predict whether a model will get a hard quantitative problem right and whether a careful reader will be able to trust the answer.
The seven dimensions in the matrix are:
- Context window — how much math you can stuff in, including problem statements, reference material, prior steps, and data tables.
- Multi-step deduction stability — whether the model can sustain a long chain of reasoning without drifting, swapping signs, or losing a constraint introduced ten steps earlier.
- Symbolic manipulation — how reliably it handles algebra, calculus, combinatorics, and proof-style work without "looks plausible" hand-waving.
- Self-verification behavior — whether the model checks its own work, substitutes back, sanity-checks units, or flags uncertainty rather than confidently producing the wrong number.
- Showing work clearly — whether the output is a legible argument a human can audit, not a wall of dense notation.
- Latency — how long you wait for a hard problem. Reasoning models are slow by design; we treat this as a factor, not a deal-breaker.
- Cost tier — relative price for a representative hard-math turn.
Honesty disclaimer. AIME, USAMO, MATH, GPQA Diamond, and FrontierMath are real public benchmarks with published results from model providers and independent researchers. Those numbers move with every model revision, and we will not quote specific percentages in this post. Where prose references a benchmark, we name it and say results are published by the provider or by independent evaluators — no fabricated scores. If you want to chase down those source numbers yourself, our 50 best Perplexity prompts for cited research are built for exactly that kind of verifiable, citation-backed fact-finding. Capability columns are qualitative buckets: Best-in-class, Strong, Adequate, Trailing.
The Decision Matrix
| Model | Context window | Multi-step deduction stability | Symbolic manipulation | Self-verification behavior | Showing work clearly | Latency | Cost tier |
|---|---|---|---|---|---|---|---|
| GPT-5.5 (high reasoning effort) | 400K tokens | Best-in-class | Best-in-class | Best-in-class | Strong | Trailing | Premium |
| Gemini 3.1 Pro | 1M tokens | Strong | Strong | Strong | Strong | Trailing | Mid |
| DeepSeek V4-Pro | 1M tokens | Strong | Strong | Adequate | Best-in-class | Adequate | Budget |
| Claude Opus 4.8 (extended thinking) | 1M tokens | Strong | Adequate | Strong | Best-in-class | Adequate | Premium |
The matrix says the boring true thing: there is no free lunch. GPT-5.5 at high reasoning effort is the strongest at the math itself but is slow and expensive. Gemini 3.1 Pro trades a notch of raw deductive horsepower for a much larger context and a friendlier price. DeepSeek V4-Pro is shockingly capable for the money and writes the cleanest scratch work, with a softer track record on catching its own mistakes. Claude Opus 4.8 with extended thinking is the long-context analyst — its math is solid, its narrative work is exceptional, and its weak spot is dense symbolic manipulation.
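One way to make the matrix actionable is to encode it as data and filter on hard requirements first. The sketch below is illustrative only. The `ModelRow` type, the `shortlist` function, the worst-to-best ordering of the qualitative buckets, and the rounding of "400K" and "1M" to round token counts are assumptions made for this example. The row values themselves are copied from the table above.

```python
from dataclasses import dataclass

# Qualitative buckets from the matrix, ordered worst to best (ordering assumed).
BUCKETS = ["Trailing", "Adequate", "Strong", "Best-in-class"]


@dataclass(frozen=True)
class ModelRow:
    name: str
    context_tokens: int  # "400K" / "1M" rounded to 400_000 / 1_000_000
    deduction: str
    symbolic: str
    self_verification: str
    showing_work: str
    latency: str
    cost_tier: str


MATRIX = [
    ModelRow("GPT-5.5 (high reasoning effort)", 400_000, "Best-in-class",
             "Best-in-class", "Best-in-class", "Strong", "Trailing", "Premium"),
    ModelRow("Gemini 3.1 Pro", 1_000_000, "Strong",
             "Strong", "Strong", "Strong", "Trailing", "Mid"),
    ModelRow("DeepSeek V4-Pro", 1_000_000, "Strong",
             "Strong", "Adequate", "Best-in-class", "Adequate", "Budget"),
    ModelRow("Claude Opus 4.8 (extended thinking)", 1_000_000, "Strong",
             "Adequate", "Strong", "Best-in-class", "Adequate", "Premium"),
]


def at_least(bucket: str, floor: str) -> bool:
    """True when `bucket` is the same as or better than `floor`."""
    return BUCKETS.index(bucket) >= BUCKETS.index(floor)


def shortlist(min_context=0, min_self_verification="Trailing",
              min_symbolic="Trailing",
              allowed_cost_tiers=("Budget", "Mid", "Premium")):
    """Return the names of models that meet every stated requirement."""
    return [
        row.name
        for row in MATRIX
        if row.context_tokens >= min_context
        and at_least(row.self_verification, min_self_verification)
        and at_least(row.symbolic, min_symbolic)
        and row.cost_tier in allowed_cost_tiers
    ]


# Need a 1M window and at least Strong self-verification:
long_context_verified = shortlist(min_context=1_000_000,
                                  min_self_verification="Strong")
# Need at least Strong symbolic work without premium pricing:
affordable_symbolic = shortlist(min_symbolic="Strong",
                                allowed_cost_tiers=("Budget", "Mid"))
```

Filtering on non-negotiable constraints first, then reading the prose sections for the survivors, mirrors how the matrix is meant to be used: as a starting point rather than a ranking.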
GPT-5.5 (high reasoning effort): When It's the Right Call
GPT-5.5 at high reasoning effort is the default for any math problem where being wrong is expensive. Inside OpenAI's lineup, the high-reasoning-effort mode is the configuration that most reliably sustains a long deductive chain without drifting, and it has the strongest self-verification habits we have seen in production — it is now the home for the deep step-by-step reasoning that earlier reasoning-specialist models were built for. On hard problem sets like AIME and on graduate-level science reasoning evaluations such as GPQA Diamond — results published by OpenAI and by independent evaluators — GPT-5.5 at high reasoning effort has been the reference point that other 2026 reasoning configurations are measured against. Claude Opus 4.8 and Gemini 3.1 Pro sit neck-and-neck right behind it.
The thing high reasoning effort does that lesser configurations don't: it slows down on purpose. Give it a multi-step optimization problem or a proof and it will spend real time exploring, backtracking, substituting candidate answers back into the original constraints, and rejecting branches that don't close. The output is not a dazzling first draft. It is a verified one.
Pick GPT-5.5 at high reasoning effort when:
- The problem is genuinely hard — competition math, non-trivial proofs, multi-constraint optimization, edge cases in numerical methods.
- The cost of a wrong answer is higher than the cost of a slow one. Quant research, model risk, exam prep at the top end, technical interview prep.
- You are verifying a human-written derivation and you want a model that will actually try to find the error rather than agree with the author.
- You can tolerate latency. High reasoning effort thinks for a while. That's the feature.
Avoid GPT-5.5 at high reasoning effort when:
- You need throughput. Batch grading thousands of arithmetic problems is wasted on it.
- You need a 1M-token window. GPT-5.5's 400K is generous but not the widest in this matrix.
- The problem is genuinely easy. It will still answer correctly, but you are overpaying for headroom you didn't need.
The right prompt for GPT-5.5 at high reasoning effort on a hard problem is short and clean. Strip the ornament. Reasoning models reward precision in the problem statement and tend to be distracted by verbose system instructions. See the sample prompt section below for what "lean" looks like.
One more practical note. GPT-5.5's verification behavior is most useful when you actually ask for it. The model will often self-verify by default on hard problems, but stating "verify your answer before reporting it" makes the verification step robust rather than opportunistic — and gives you a clean signal in the output when verification succeeds versus when the model is still uncertain.
Gemini 3.1 Pro: When It's the Right Call
Gemini 3.1 Pro is the cost-aware quantitative pick. In Google's Gemini family, it is the flagship designed for problems where the model needs to spend more compute exploring solution branches and reasoning over large inputs. On hard math evaluations, including IMO-style problems and FrontierMath (results published by Google DeepMind and by independent researchers), Gemini 3.1 Pro has been a credible peer to GPT-5.5 at high reasoning effort across many problem types. It sits neck-and-neck with Opus 4.8 just behind the leader, and it adds the practical advantages of a 1M-token context window and a friendlier price tier.
What makes Gemini 3.1 Pro the right call is not that it beats GPT-5.5 at high reasoning effort head-to-head on every category — it doesn't. It is that the gap is smaller than the cost gap, and that the long context lets you do things that simply aren't possible with a 400K window. You can feed Gemini 3.1 Pro an entire textbook chapter, a full dataset description, every prior exam, and the student's submission, all in one turn. The reasoning quality holds up across that volume of material in ways smaller-context configurations cannot match.
Pick Gemini 3.1 Pro when:
- You need serious reasoning at a meaningfully lower cost than GPT-5.5 at high reasoning effort, and you are okay trading a small amount of top-end accuracy for that savings.
- The problem involves a lot of material — long problem statements, reference PDFs, historical data, multi-document derivations.
- You are doing applied quant work where the math is hard but not olympiad-hard, and the bottleneck is integrating information across many sources.
- You want strong self-verification behavior without paying premium-tier rates.
Avoid Gemini 3.1 Pro when:
- You are working on the absolute hardest end of competition or research-grade math. GPT-5.5 at high reasoning effort still has the edge there.
- You need consistent low latency. Gemini 3.1 Pro, like all top-tier reasoning configurations in 2026, is slow on hard problems by design.
Gemini 3.1 Pro is the most defensible default when you are buying for a team and most of your math work is "hard, but tractable" rather than "frontier." It also pairs naturally with GPT-5.5 at high reasoning effort in a two-pass setup: Gemini 3.1 Pro drafts the solution across whatever sprawling reference material you have, then GPT-5.5 at high reasoning effort verifies the critical steps. That stack costs less than running everything on GPT-5.5 at high reasoning effort and gets you most of the way to its reliability.
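Whether the two-pass stack actually comes out cheaper depends on how much of the work the verifier redoes. As a rough sketch, with symbols introduced here purely for illustration: let c_G be the cost of a Gemini 3.1 Pro draft, c_O the cost of a full solve on GPT-5.5 at high reasoning effort, and f (between 0 and 1) the fraction of a full solve that the verification pass costs. The stack is cheaper than running everything on GPT-5.5 when:

```latex
c_G + f\,c_O < c_O
\quad\Longleftrightarrow\quad
f < 1 - \frac{c_G}{c_O}
```

For example, if a draft cost a quarter of a full premium solve, the stack would come out ahead only while verification cost less than three quarters of a full solve. Those ratios are placeholders, not measured prices. The point is that the saving comes from verifying only the critical steps rather than re-solving the whole problem.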
DeepSeek V4-Pro: When It's the Right Call
DeepSeek V4-Pro is the budget pick that nobody serious gets to ignore. DeepSeek released V4 as an open-weight reasoning model — the model card and the original release post from DeepSeek describe a reinforcement-learning pipeline trained specifically to incentivize long chain-of-thought. On MATH, AIME-style problems, and competition coding benchmarks — results published by DeepSeek and corroborated by independent evaluators — V4-Pro has been the model that proved you don't need premium-tier pricing to get genuinely strong mathematical reasoning.
The standout property of V4-Pro for math users isn't raw accuracy. It is legibility. V4-Pro shows its work in clean, structured, human-readable scratch — the kind of step-by-step a teacher would actually want to read. For tutoring, for explanation-heavy contexts, and for any use case where "show your work" is part of the deliverable, V4-Pro is the best in this matrix.
Pick DeepSeek V4-Pro when:
- Budget is a real constraint. You are running high-volume math workloads — tutoring platforms, grading assistance, problem-set generation — and premium-tier reasoning models would be cost-prohibitive.
- You want clean, well-structured scratch work that a student or junior analyst can actually follow.
- You can self-host. V4 being open-weight is a serious advantage for teams that want to control their reasoning stack and avoid per-token pricing entirely.
- You are okay with weaker self-verification. V4-Pro is strong at deriving an answer but less aggressive than GPT-5.5 at high reasoning effort about catching its own mistakes.
Avoid DeepSeek V4-Pro when:
- The problem is at the frontier and you cannot tolerate a missed verification step.
- You need a guarantee that the model has checked its own work before reporting the answer. Pair V4-Pro with a verification pass from a different model if that's the requirement.
For most education, tutoring, and high-volume quantitative reasoning workloads, V4-Pro is the rational pick. Reserve premium-tier models for the genuinely hard tail. If you are building on a tight budget, V4-Pro also slots neatly alongside the rest of the best free AI prompt tools in 2026 for teams that want strong reasoning without a subscription.
Claude Opus 4.8 (extended thinking): When It's the Right Call
Claude Opus 4.8 with extended thinking enabled is the long-context analyst's pick. Anthropic's extended-thinking mode lets Opus 4.8 spend additional compute on reasoning before producing an answer, and combined with its 1M-token context window it occupies a specific niche: math that lives inside a longer narrative, or quantitative analysis that has to be defensible to a careful human reader. On the math itself it is neck-and-neck with Gemini 3.1 Pro, just behind GPT-5.5 at high reasoning effort.
On hard math benchmarks Opus 4.8 with extended thinking is not the top scorer in this matrix — that is generally GPT-5.5 at high reasoning effort. Anthropic's own published results put Opus 4.8's math performance in the strong tier, ahead of most non-reasoning models and competitive with other premium reasoning models on most problem types, with some softness on heavy symbolic manipulation. What it excels at is writing the math up. The output is structured, the assumptions are surfaced, the caveats are honest, and the prose is the kind a human reviewer can read and trust.
Pick Claude Opus 4.8 (extended thinking) when:
- The math is embedded in a longer analysis: a research memo, model documentation, an audit report, or a long technical post-mortem.
- You need a million-token window. Long-context math reasoning is genuinely useful when you are working across many documents or a large codebase that contains the calculations.
- You are writing for a human reader who needs to follow the reasoning. Opus 4.8 produces some of the most legible quantitative prose in this matrix.
- You value disciplined uncertainty — surfacing assumptions, flagging where the math could go wrong, distinguishing what is computed from what is assumed.
Avoid Claude Opus 4.8 when:
- The bottleneck is pure symbolic manipulation. Opus 4.8's math is solid, but on dense algebra GPT-5.5 at high reasoning effort and DeepSeek V4-Pro grind through more reliably.
- Cost is the binding constraint. Opus 4.8 is in the premium tier; if you are paying premium prices and the work is pure math, GPT-5.5 at high reasoning effort is the safer choice.
The cleanest way to think about Opus 4.8 with extended thinking is: it is the model you want writing the analysis your CFO, your regulator, or your editor is going to read.
Which to Pick by Sub-Segment
The matrix above is a starting point. Math work has texture, and the right pick shifts depending on which kind of math you are doing.
Olympiad-style competition math
For AIME, IMO, and Putnam-style problems where the difficulty is concentrated in clever construction rather than length: GPT-5.5 at high reasoning effort first, with Gemini 3.1 Pro as the cost-aware second choice and Opus 4.8 right alongside it. DeepSeek V4-Pro is impressively strong here for the price. Published results on competition problems from DeepSeek and independent evaluators show V4-Pro holding its own on the easier end of competition difficulty, but the top of the distribution still belongs to GPT-5.5 at high reasoning effort in 2026.
Symbolic algebra and proofs
For dense symbolic manipulation, long algebraic derivations, and formal-style proofs: GPT-5.5 at high reasoning effort is the default, and DeepSeek V4-Pro is the strong budget alternative because it grinds through symbolic work cleanly. Avoid Claude Opus 4.8 for the pure symbol-grinding case; its strength is writing about math, not manipulating it for many steps. Gemini 3.1 Pro is solid here too, but slightly less consistent than GPT-5.5 at high reasoning effort at the hardest end.
Applied statistics and data analysis reasoning
For reasoning about distributions, hypothesis tests, regression diagnostics, A/B test results, causal-inference assumptions: Claude Opus 4.8 with extended thinking is the top pick. Statistics is almost never just math — it is math plus a story about what the data means. Opus 4.8's combination of strong reasoning and exceptional narrative writing is hard to beat here. Gemini 3.1 Pro is the cost-aware second choice when you need to reason across many data sources at once.
Quantitative finance and modeling
For pricing models, portfolio math, risk calculations, time-series reasoning, and the kind of work where being wrong is genuinely expensive: GPT-5.5 at high reasoning effort for the model-checking and the hardest cases, Gemini 3.1 Pro for the day-to-day analytical work where its long context and lower cost give it the edge. Pair them — Gemini 3.1 Pro drafts, GPT-5.5 at high reasoning effort verifies — when the cost of an error is high enough to justify two passes.
Physics word problems
For physics — mechanics, thermodynamics, electromagnetism, quantum, anything where unit-tracking and dimensional analysis are load-bearing: GPT-5.5 at high reasoning effort is the safest pick because its self-verification habits catch unit errors and physically-implausible answers more reliably than the others. Gemini 3.1 Pro is a credible alternative. DeepSeek V4-Pro writes clean physics scratch work and is the right call for high-volume tutoring contexts.
Verification of human-written math (grading, error-finding)
For checking someone else's derivation, finding the bug in a student's proof, or auditing a quantitative result: GPT-5.5 at high reasoning effort is the right pick, full stop. Self-verification is its defining strength. Other models will often agree with a confident-sounding but wrong derivation; GPT-5.5 at high reasoning effort is the most likely to push back, re-derive from scratch, and surface the actual error. If you are building a grading pipeline, route the verification step to GPT-5.5 at high reasoning effort even if upstream generation runs on a cheaper model. The economics work out: cheap model for the easy bulk, premium model for the disagreement cases and the audit pass.
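The routing described above can be sketched in a few lines. Everything here is hypothetical scaffolding: `route`, `stub_cheap_grade`, and `stub_premium_verify` are made-up names, and the model calls are plain Python callables rather than any real SDK. The escalation rule is also only one possible assumption: two cheap passes that disagree, plus every `audit_every`-th item as an audit sample.

```python
from typing import Callable, List, Tuple

Verdict = str  # "correct" or "incorrect"


def route(
    submissions: List[str],
    cheap_grade: Callable[[str, int], Verdict],
    premium_verify: Callable[[str], Verdict],
    audit_every: int = 10,
) -> List[Tuple[str, Verdict, str]]:
    """Grade each submission; escalate disagreements and a periodic audit sample."""
    results = []
    for index, submission in enumerate(submissions):
        # Two independent cheap passes; the second argument is a sample id.
        first = cheap_grade(submission, 0)
        second = cheap_grade(submission, 1)
        is_audit = index % audit_every == 0
        if first != second or is_audit:
            # Disagreement or audit slot: the premium verifier decides.
            results.append((submission, premium_verify(submission), "premium"))
        else:
            results.append((submission, first, "cheap"))
    return results


# Stub graders so the sketch runs without any model access.
def stub_cheap_grade(submission: str, sample_id: int) -> Verdict:
    # Pretend the cheap model is unstable on anything marked "tricky".
    if "tricky" in submission and sample_id == 1:
        return "incorrect"
    return "correct"


def stub_premium_verify(submission: str) -> Verdict:
    return "incorrect" if "tricky" in submission else "correct"


results = route(
    ["proof A", "proof B", "tricky proof C", "proof D"],
    stub_cheap_grade,
    stub_premium_verify,
    audit_every=3,
)
tiers = [tier for _, _, tier in results]
```

With the stubs above, items 0 and 3 go to the premium tier as audit samples, item 2 goes there because the two cheap passes disagree, and item 1 stays on the cheap tier. Running the cheap model twice is itself a cost, so a real pipeline would need to weigh that against the premium calls it avoids.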
For the broader framework that informs these picks, see our AI model selection guide and our complete guide to prompting reasoning models in 2026.
Sample Prompt for the Recommended Winner
Here is a clean prompt for GPT-5.5 at high reasoning effort on a multi-step optimization problem. Reasoning models reward lean problem statements over decorated ones, so the prompt deliberately avoids heavy formatting, persona setup, or instruction blocks. The bracketed placeholders are where you customize.
Problem.
[State the problem in one or two short paragraphs. Define all variables. State all constraints explicitly. If there are units, include them.]
Find: [exactly what you want the model to compute or prove].
Work the problem step by step. Show your reasoning. After you reach an answer, verify it by substituting back into the original constraints and confirming each one holds. If the verification fails, identify the error and redo the affected steps. Report the final answer only after verification succeeds.
A worked example of the placeholder fill: "Problem. A factory produces two products, A and B. Each unit of A requires 3 hours of labor and 2 kg of material; each unit of B requires 1 hour of labor and 4 kg of material. The factory has 120 labor-hours and 100 kg of material available per day. Profit per unit of A is $40; per unit of B is $30. Find: the production plan that maximizes daily profit, the maximum profit, and whether the optimum is unique."
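Because the worked example is a two-variable linear program, its answer can be checked independently before trusting any model's output. The following is a minimal sketch, not the method any model uses internally. It enumerates the corner points of the feasible region with exact rational arithmetic and compares profit at each one. The names `profit`, `feasible`, and `corner_points` are invented for this illustration.

```python
from fractions import Fraction
from itertools import combinations

# Constraints in the form ca*A + cb*B <= limit, taken from the worked example.
# The last two rows encode A >= 0 and B >= 0.
CONSTRAINTS = [
    (3, 1, 120),   # labor hours per day
    (2, 4, 100),   # material in kg per day
    (-1, 0, 0),    # A >= 0
    (0, -1, 0),    # B >= 0
]


def profit(a, b):
    """Daily profit in dollars for a units of A and b units of B."""
    return 40 * a + 30 * b


def feasible(a, b):
    """Substitute a plan back into every constraint."""
    return all(ca * a + cb * b <= limit for ca, cb, limit in CONSTRAINTS)


def corner_points():
    """Intersect every pair of constraint boundaries; keep the feasible ones."""
    points = set()
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundaries never intersect
        # Cramer's rule for the 2x2 system of boundary equations.
        a = Fraction(c1 * b2 - c2 * b1, det)
        b = Fraction(a1 * c2 - a2 * c1, det)
        if feasible(a, b):
            points.add((a, b))
    return points


corners = corner_points()
best_value = max(profit(a, b) for a, b in corners)
best_points = [(a, b) for a, b in corners if profit(a, b) == best_value]
# best_value is 1700, attained only at (A, B) = (38, 6)
```

The feasible corners are (0, 0), (40, 0), (0, 25), and (38, 6), with profits of $0, $1,600, $750, and $1,700. Only one corner attains the maximum, so the optimum is unique, and both resource constraints are exactly tight at (38, 6). That is the substitution check the prompt asks the model to perform.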
Three things to notice about this prompt structure. First, there is no system message, no role assignment, and no "you are a math expert" framing. GPT-5.5 at high reasoning effort reasons more reliably when the problem statement is the prompt rather than buried inside a wrapper. Second, the verification instruction is explicit. GPT-5.5 at high reasoning effort often self-verifies by default on hard problems, but stating the instruction makes the behavior dependable and gives you a clean stopping condition. Third, the output specification is minimal ("report the final answer only after verification succeeds"), because over-specifying the format pulls compute away from the reasoning.
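If you fill the template programmatically, the same three properties are easy to preserve. The helper below is a small sketch under stated assumptions: `build_prompt` and `VERIFY_INSTRUCTION` are names invented here, the function only assembles a string, and sending it to a model is left to whatever client you already use.

```python
# The verification paragraph from the template, kept as one constant so every
# prompt carries the same explicit stopping condition.
VERIFY_INSTRUCTION = (
    "Work the problem step by step. Show your reasoning. After you reach an "
    "answer, verify it by substituting back into the original constraints and "
    "confirming each one holds. If the verification fails, identify the error "
    "and redo the affected steps. Report the final answer only after "
    "verification succeeds."
)


def build_prompt(problem: str, find: str) -> str:
    """Assemble the lean prompt: problem, target, verification. No wrapper text."""
    return (
        f"Problem.\n{problem.strip()}\n\n"
        f"Find: {find.strip()}\n\n"
        f"{VERIFY_INSTRUCTION}"
    )


prompt = build_prompt(
    problem=(
        "A factory produces two products, A and B. Each unit of A requires "
        "3 hours of labor and 2 kg of material; each unit of B requires "
        "1 hour of labor and 4 kg of material. The factory has 120 labor-hours "
        "and 100 kg of material available per day. Profit per unit of A is "
        "$40; per unit of B is $30."
    ),
    find=(
        "the production plan that maximizes daily profit, the maximum profit, "
        "and whether the optimum is unique."
    ),
)
```

Keeping the helper this small is deliberate. There is no persona, no formatting directive, and no system-message slot to fill, so nothing can creep in around the problem statement.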
For more patterns specific to reasoning models, see our roundup of 50 best reasoning-model prompts for 2026.
Closing
If you only remember one rule from this matrix: the right model depends on whether the math is the hard part or whether the writing around it is. For genuinely hard math and for verification, GPT-5.5 at high reasoning effort is the 2026 default, with Opus 4.8 and Gemini 3.1 Pro neck-and-neck behind it. For cost-aware quantitative reasoning with a long context, Gemini 3.1 Pro. For budget-constrained tutoring and high-volume scratch work, DeepSeek V4-Pro. For statistics, applied analysis, and math-inside-prose, Claude Opus 4.8 with extended thinking.
Pick the right model, then write the right prompt for it — lean for GPT-5.5 at high reasoning effort, structured and XML-tagged for Opus 4.8, explicit for Gemini 3.1 Pro, and let DeepSeek V4-Pro show its work. The combination of model choice and prompt discipline is what separates "the AI got it wrong" from "the AI got it right and showed me why."
Keep reading:
- AI Model Selection Guide — the master framework for picking models across every category.
- AI Reasoning Models: Prompting Complete Guide 2026 — how to prompt GPT-5.5 at high reasoning effort, Gemini 3.1 Pro, DeepSeek V4-Pro, and Opus 4.8 with extended thinking.
- 50 Best Reasoning-Model Prompts 2026 — production-tested prompts for the models at the top of this matrix.
