This is a companion to The State of AI Prompting 2026, where 971 real prompts averaged just 20.5 out of 100. Here we split that same dataset by the AI each prompt was written for — and the model you're talking to turns out to predict how much effort you put in.
We already knew the average prompt is weak. The surprise is how unevenly that weakness is distributed. Split the same 971 prompts by their target model and a clear pattern appears: the audience writing for Claude shows up far better prepared than the audience writing for ChatGPT.
Raw prompt quality by model
Each prompt is scored 0–100 on the same eight dimensions used in the main report. "Raw" is the prompt as the person typed it; "engineered" is SurePrompts' restructured version of it.
| Model | Prompts | Share | Avg raw score | Avg engineered score |
|---|---|---|---|---|
| Perplexity | 33 | 3.4% | 28.7 / 100 | 78.3 / 100 |
| Copilot | 47 | 4.8% | 28.0 / 100 | 81.1 / 100 |
| Claude | 228 | 23.5% | 27.8 / 100 | 82.5 / 100 |
| Gemini | 102 | 10.5% | 19.2 / 100 | 71.4 / 100 |
| Grok | 85 | 8.8% | 18.1 / 100 | 56.1 / 100 |
| DeepSeek | 47 | 4.8% | 17.5 / 100 | 65.0 / 100 |
| ChatGPT | 69 | 7.1% | 17.2 / 100 | 80.5 / 100 |
| No model selected | 353 | 36.4% | 16.2 / 100 | 81.0 / 100 |
| Llama | 7 | 0.7% | 15.1 / 100 | 62.4 / 100 |
Info
Read the raw column, not the engineered one. Raw scores measure how people actually prompt. The engineered column reflects SurePrompts' model-specific formatting scored against the same rubric, so differences there are partly about output formatting, not user behavior. Claude (n=228) is the most robustly measured high scorer; Perplexity and Copilot edge it on much smaller samples.
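The overall 20.5/100 average and the share column can be cross-checked from the table's own counts and raw scores. The sketch below assumes only the figures printed in the table; because each group average is already rounded to one decimal, the recomputed overall figure is approximate. The function names are illustrative, not part of any SurePrompts tooling.

```python
# (model, prompt count, average raw score) copied from the table above.
ROWS = [
    ("Perplexity", 33, 28.7),
    ("Copilot", 47, 28.0),
    ("Claude", 228, 27.8),
    ("Gemini", 102, 19.2),
    ("Grok", 85, 18.1),
    ("DeepSeek", 47, 17.5),
    ("ChatGPT", 69, 17.2),
    ("No model selected", 353, 16.2),
    ("Llama", 7, 15.1),
]


def total_prompts(rows):
    """Total number of prompts across all groups."""
    return sum(count for _, count, _ in rows)


def weighted_average(rows):
    """Average raw score, weighting each group by its prompt count."""
    return sum(count * score for _, count, score in rows) / total_prompts(rows)


def share(rows, model):
    """Percentage of all prompts that targeted `model`."""
    count = next(c for name, c, _ in rows if name == model)
    return 100 * count / total_prompts(rows)


print(total_prompts(ROWS))               # 971
print(round(weighted_average(ROWS), 1))  # 20.5
print(round(share(ROWS, "Claude"), 1))   # 23.5
```

The overall average sits close to the low-scoring groups because the largest group, "No model selected", carries more than a third of the weight.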
The top of the table: Claude, Perplexity, and Copilot
The three highest-scoring audiences have something in common — they skew technical. Perplexity is built around research queries. Copilot lives inside coding and Office workflows. Claude has a heavy developer and analyst following. People arriving from those contexts are used to writing longer, more structured requests, so their raw prompts already carry more of what the rubric rewards: an explicit task, some context, and occasionally a defined output format.
Claude is the headline because its sample is large enough to trust. At 27.8/100 across 228 prompts, it isn't a fluke of a handful of power users — it's a consistent ~45% edge over Gemini (19.2) and a 62% edge over ChatGPT (17.2).
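The percentage edges quoted above are relative differences, not point differences. A minimal sketch of the arithmetic, using the raw averages from the table (the helper name is an illustrative assumption):

```python
def relative_edge(higher, lower):
    """Percent by which `higher` exceeds `lower`."""
    return 100 * (higher - lower) / lower


claude, gemini, chatgpt = 27.8, 19.2, 17.2

print(round(relative_edge(claude, gemini)))   # 45
print(round(relative_edge(claude, chatgpt)))  # 62
print(round(claude - chatgpt, 1))             # 10.6 (absolute gap, in points)
```

In absolute terms the Claude and ChatGPT audiences are about 10.6 points apart on a 100-point scale, which is why the relative figure looks large while both scores remain low.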
The bottom of the table: ChatGPT and the "no model" crowd
Two of the weakest groups are also the two biggest stories.
ChatGPT (17.2/100) scores near the bottom among named models, ahead of only Llama's seven-prompt sample. That isn't a knock on the model; it's a reflection of its reach. ChatGPT is the default front door to AI for the mainstream, and the mainstream writes one-line, role-less prompts. The very popularity that makes it the category leader also fills its prompt stream with "write me a caption" and "fix this."
No model selected (16.2/100) is the largest single group in the entire dataset, 36% of all prompts, and it scores lowest of any group with a meaningful sample; only Llama, at 15.1 on seven prompts, is lower. People who don't even pause to pick which AI they're prompting are, unsurprisingly, the same people who don't pause to structure the request.
But here's the part that matters: everyone is failing
It's tempting to read this as "Claude users are good and ChatGPT users are bad." They aren't. Claude's score, the strongest on a large sample, is still 27.8 out of 100: a failing prompt. The gap between the best and worst model audiences is the gap between very weak and extremely weak. Across all 971 prompts the average is just 20.5/100, and 9 in 10 score below 50 no matter which model they target.
The lesson isn't "switch models." It's that the single biggest lever on your output quality is the prompt — and almost nobody is pulling it, regardless of which AI they prefer.
What to do with this
The fix is the same for every model, because the rubric is the same: give the AI a role, add specificity and context, define the output format, and show an example when the task is complex. That's the structure that separates a 17 from an 85.
- Paste any prompt into the free Prompt Quality Score tool to see exactly which of the eight dimensions you're leaving on the table.
- Use the prompt builder to close those gaps automatically — it's tuned per model, whether you're writing for Claude, ChatGPT, or Gemini.
- New to structuring prompts at all? Start with how to write an AI prompt.
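To make the role, context, and output-format advice concrete, here is a minimal sketch of a prompt assembled from those parts. The function, its field labels, and the bakery scenario are hypothetical illustrations, not the SurePrompts builder or its output.

```python
def build_prompt(role, task, context, output_format, example=None):
    """Assemble a prompt from a role, a task, context, and an output format.

    Hypothetical helper for illustration only.
    """
    parts = [
        f"Role: {role}",
        f"Task: {task}",
        f"Context: {context}",
        f"Output format: {output_format}",
    ]
    if example:  # an example is only suggested when the task is complex
        parts.append(f"Example: {example}")
    return "\n".join(parts)


prompt = build_prompt(
    role="You are a social media copywriter for a small bakery.",
    task="Write an Instagram caption announcing our new sourdough loaf.",
    context="Audience: local customers. Tone: warm and playful.",
    output_format="One caption under 150 characters, plus three hashtags.",
)
print(prompt)
```

Compare the result with the one-line "write me a caption": the request is the same, but the model now knows who is speaking, to whom, and what shape the answer should take.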
Warning
Methodology and limits. Prompts are grouped by the target model selected at generation time; "No model selected" means the user left the default. Sample sizes vary widely — Claude (228) and Gemini (102) are robust, ChatGPT (69) and the sub-50 groups are directional, and Llama (7) is too small to draw conclusions from and is excluded from the headline. Scores are a deterministic 8-dimension heuristic, not human ratings. The sample is 971 prompts from SurePrompts users, March–June 2026, and skews toward people already seeking better prompts — so these are likely upper bounds. The full aggregate dataset is published under CC BY 4.0 — download the JSON (cite SurePrompts, State of AI Prompting 2026, by Model).
Frequently asked questions
Do people write better prompts for Claude or ChatGPT?
In a sample of 971 real prompts submitted to SurePrompts in 2026, prompts written for Claude scored an average of 27.8 out of 100, versus 17.2 for ChatGPT — a 62% gap. Both are still failing scores, but Claude consistently drew more structured prompts.
Which AI model gets the highest-quality prompts?
Among models with a meaningful sample, Claude (27.8/100, n=228) leads. Perplexity (28.7) and Copilot (28.0) score a touch higher on smaller samples (n=33 and n=47). Setting aside Llama (15.1, n=7), the weakest prompts went to ChatGPT (17.2) and to sessions where no model was selected at all (16.2).
Why would prompt quality differ by AI model?
It reflects who is doing the prompting, not the model itself. Claude and Perplexity skew toward developers, analysts, and research-minded users who write longer, more structured prompts. ChatGPT's broad mainstream audience includes far more one-line, role-less prompts.
Is any model's audience writing good prompts?
No. Even the best-prompted model averages under 30 out of 100. The gap between models is the gap between bad and very bad — the entire distribution leaves most of every model's capability untapped.
How was prompt quality measured across models?
Every prompt was scored 0–100 across 8 weighted dimensions (length, role, specificity, structure, output format, constraints, context, examples) by the same deterministic engine behind the free SurePrompts Prompt Quality Score tool, then grouped by the target model selected at generation time. Sample: 971 prompts, March–June 2026, aggregate-only.
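For readers wondering what a deterministic, dimension-based heuristic can look like, here is a toy sketch. It is not the SurePrompts engine: the detection patterns, the 25-word length threshold, and the equal weights are all assumptions made for illustration. Only the eight dimension names come from the methodology above.

```python
import re

# Hypothetical detectors, one per dimension named in the methodology.
# Patterns and thresholds are illustrative assumptions, not the real rubric.
DETECTORS = {
    "length": lambda p: len(p.split()) >= 25,
    "role": lambda p: re.search(r"\b(you are|act as)\b", p, re.I) is not None,
    "specificity": lambda p: re.search(r"\d", p) is not None,
    "structure": lambda p: "\n" in p.strip(),
    "output format": lambda p: re.search(r"\b(format|table|list|json|bullet)", p, re.I) is not None,
    "constraints": lambda p: re.search(r"\b(must|at most|no more than|under|avoid)\b", p, re.I) is not None,
    "context": lambda p: re.search(r"\b(audience|background|context|because)\b", p, re.I) is not None,
    "examples": lambda p: re.search(r"\b(example|for instance)\b", p, re.I) is not None,
}

# Equal weights summing to 100: an assumption, since the real weights are not given here.
WEIGHTS = {name: 100 / len(DETECTORS) for name in DETECTORS}


def score(prompt):
    """Sum the weights of every dimension the prompt satisfies (0 to 100)."""
    return sum(WEIGHTS[name] for name, hit in DETECTORS.items() if hit(prompt))


def missing(prompt):
    """List the dimensions the prompt does not satisfy."""
    return [name for name, hit in DETECTORS.items() if not hit(prompt)]


weak = "fix this"
strong = (
    "You are a copywriter.\n"
    "Audience: local customers.\n"
    "Write a caption under 150 characters as a bullet list, for example 'Fresh today!'"
)
print(score(weak), missing(weak))
print(score(strong), missing(strong))
```

Because every check is a fixed rule, the same prompt always receives the same score, which is the property the word "deterministic" refers to. A rule-based scorer of this kind measures the presence of structure, not whether the resulting answer is good.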
