> **Tip**
>
> TL;DR: The SurePrompts Quality Rubric scores a prompt across 7 dimensions (role clarity, context sufficiency, instruction specificity, format structure, example quality, constraint tightness, and output validation), each 1-5 for a max of 35. 28+ is production-ready; 21-27 needs revision; below 21 needs major revision or a rewrite.
Key takeaways:
- The Rubric scores, it does not judge. 7 dimensions × 1-5 = max 35. A low score on one dimension is a specific fix to make next, not a verdict on the whole prompt.
- Output validation is the most under-built dimension. Prompts that score well everywhere else but 1 on validation cause production incidents disproportionately often.
- 28+ ships. 21-27 needs revision. Below 21 is not yet functional. The thresholds are deliberate, not arbitrary percentages.
- RCAF to draft, Rubric to audit. Pair the Rubric with RCAF Prompt Structure — use both, not one.
- For agent prompts, weight constraint tightness and output validation higher. Agents fail differently from one-shot prompts; a 28 one-shot can be a 22 agent prompt if the weighting is not adjusted.
## Why a rubric at all?
Most prompt improvement today happens by vibe. A prompt "feels off," so the engineer tweaks wording until it "feels better." This works for simple prompts and breaks at scale: two engineers can't agree on what "better" means, a good prompt on Monday stops working on Thursday, and nobody can explain why.
A rubric replaces vibe with dimensions. Instead of "this prompt is bad," you say "this prompt scores 2/5 on output validation." That statement is actionable: you can add output validation and rescore.
The SurePrompts Quality Rubric is designed for one job: fast iteration with a shared vocabulary. It is not a gate, not a scoring system to impress anyone, and not a replacement for actually running the prompt on eval data. It is the thing you use between draft and eval to catch obvious weaknesses.
## The 7 dimensions
### 1. Role clarity (1-5)
Does the prompt assign the AI a specific, coherent role?
- 5: Explicit role with scope, voice, expertise level, and posture. ("You are a senior backend engineer reviewing a pull request for production readiness.")
- 3: Role present but vague. ("You are a helpful assistant.")
- 1: No role. The model is guessing who it's supposed to be.
### 2. Context sufficiency (1-5)
Does the prompt include everything the model needs to do the task well?
- 5: All relevant background (the user's situation, constraints, prior decisions, relevant domain knowledge) is present.
- 3: Some context; the model can mostly proceed but will make assumptions.
- 1: Near-zero context. The model will fabricate or refuse.
### 3. Instruction specificity (1-5)
How precise is the task description?
- 5: The task, its sub-tasks, and the success criteria are named explicitly.
- 3: The task is named; sub-steps and success criteria are implicit.
- 1: Vague verb ("help me with X"), no sub-structure.
### 4. Format structure (1-5)
Is the expected output format specified?
- 5: Exact structure defined (schema, section headers, tone, length). Ideally with an example.
- 3: Format named ("as a list") but not specified in detail.
- 1: No format instructions.
### 5. Example quality (1-5)
Are the few-shot examples (if any) well-chosen?
- 5: 2-4 examples covering diverse input cases and the edge case(s) that matter.
- 3: 1-2 generic examples.
- 1: No examples, or examples that don't match the actual input distribution.
For zero-shot prompts, score this dimension on whether the prompt makes zero-shot viable: some tasks genuinely don't need examples; others silently need them and suffer without them.
### 6. Constraint tightness (1-5)
Are constraints (what the model must NOT do, length limits, banned words, output types) specified?
- 5: Explicit constraints covering the known failure modes for this task.
- 3: Some constraints, but the common failure modes are unaddressed.
- 1: No constraints. The model will do whatever it wants.
### 7. Output validation (1-5)
Is there a plan for validating the output before using it?
- 5: Output is machine-validated (schema check, regex, programmatic test) or explicitly reviewed against criteria.
- 3: Output is human-reviewed but without a checklist.
- 1: Output is used as-is, with no validation path.
This is the dimension most often at 1, and it's frequently the reason a prompt that "works" in testing breaks in production.
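To make a 5 on this dimension concrete, here is a minimal sketch of machine validation in Python. It assumes a hypothetical task whose output should be JSON with a `summary` string and a `tags` list; those field names are illustrative, not part of the Rubric.

```python
import json
import re

def validate_output(raw: str) -> list[str]:
    """Return a list of validation failures (empty list = output passes)."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    # Schema check: required keys with the expected types.
    if not isinstance(data.get("summary"), str):
        errors.append("missing or non-string 'summary'")
    if not isinstance(data.get("tags"), list):
        errors.append("missing or non-list 'tags'")
    # Regex check: reject placeholder text the model sometimes leaves behind.
    if isinstance(data.get("summary"), str) and re.search(r"TODO", data["summary"]):
        errors.append("summary contains placeholder text")
    return errors
```

Wiring even a check this small into the pipeline moves the dimension from "used as-is" to "machine-validated," which is the gap the paragraph above is about.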
## Scoring guidance
| Score | Meaning |
|---|---|
| 28-35 | Production-ready. Ship it. |
| 21-27 | Working draft. Fix the lowest-scoring dimensions. |
| 14-20 | Needs major revision. Pick the 3 lowest scores and address them. |
| 7-13 | Not yet functional. Rewrite from scratch using RCAF + Rubric. |
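The bands above are easy to encode. A minimal Python sketch; the dimension names and function shape are ours for illustration, not an official SurePrompts API:

```python
DIMENSIONS = [
    "role_clarity", "context_sufficiency", "instruction_specificity",
    "format_structure", "example_quality", "constraint_tightness",
    "output_validation",
]

def score_prompt(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the 7 dimension scores and map the total to a Rubric band."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension scores 1-5")
    total = sum(scores[d] for d in DIMENSIONS)
    if total >= 28:
        band = "production-ready"
    elif total >= 21:
        band = "working draft"
    elif total >= 14:
        band = "needs major revision"
    else:
        band = "not yet functional"
    return total, band
```

Note the minimum possible total is 7, not 0, which is why the bottom band starts at 7 rather than at an arbitrary percentage.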
## Worked example
Consider this starting prompt:
> Write me a product description for a new blender.
Scored against the Rubric:
- Role clarity: 1 (no role)
- Context sufficiency: 1 (no product details)
- Instruction specificity: 2 (task named, nothing else)
- Format structure: 1 (no format specified)
- Example quality: 1 (no examples)
- Constraint tightness: 1 (no constraints)
- Output validation: 1 (no validation plan)
Total: 8/35. Not functional.
Revised using RCAF structure and Rubric feedback:
> Role: You are an ecommerce copywriter writing for a mid-market kitchen appliance brand. Voice: confident, practical, no hype.
>
> Context: The product is the Vortex Pro 700W countertop blender. Key specs: 700W motor, 6 speeds, 48oz glass jar, BPA-free lid, stainless steel blades, 7-year warranty. Target buyer: home cook who wants a reliable blender without pro-chef overkill.
>
> Action: Write a product description optimized for an Amazon listing page. Cover: hero statement, 5 bullet-point feature benefits, 1 short paragraph on who it's for, 1 short paragraph on what's in the box.
>
> Format:
> - Hero statement: 1 sentence, <20 words
> - Feature bullets: 5 bullets, each <15 words, benefit-first
> - Who-it's-for paragraph: 2-3 sentences
> - What's-in-the-box paragraph: 1-2 sentences listing items
>
> Constraints: Do not use the words "revolutionary," "game-changing," or "ultimate." Do not make claims about blending ice unless asked (motor is 700W, which is borderline). Do not invent accessories not listed in the spec.
>
> Validation: After writing, list the 5 claims you made that could not be verified from the context above, so I can check them.
Scored:
- Role clarity: 5
- Context sufficiency: 4 (we didn't include competitor positioning or price point)
- Instruction specificity: 5
- Format structure: 5
- Example quality: 2 (no example; for Amazon copy we might want one, but zero-shot is viable here)
- Constraint tightness: 4 (good banned-word list; could add length limit on output)
- Output validation: 5 (the "list unverifiable claims" instruction is an in-prompt validation step)
Total: 30/35. Ship it.
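The revised prompt's constraints and format rules are also machine-checkable, which is part of what earns the validation 5. A sketch of such a checker; the `check_copy` helper, its signature, and the decomposition into hero plus bullets are ours, not part of the prompt:

```python
# Banned words from the prompt's Constraints section.
BANNED = ("revolutionary", "game-changing", "ultimate")

def check_copy(hero: str, bullets: list[str]) -> list[str]:
    """Check a drafted description against the prompt's format and constraint rules."""
    problems = []
    text = " ".join([hero] + bullets).lower()
    for word in BANNED:
        if word in text:
            problems.append(f"banned word used: {word!r}")
    if len(hero.split()) >= 20:
        problems.append("hero statement is 20+ words")
    if len(bullets) != 5:
        problems.append(f"expected 5 feature bullets, got {len(bullets)}")
    for i, b in enumerate(bullets, 1):
        if len(b.split()) >= 15:
            problems.append(f"bullet {i} is 15+ words")
    return problems
```

A checker like this runs in milliseconds per draft, so it can gate every generation rather than relying on a human to remember the banned-word list.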
## Our position
- The Rubric is a diagnostic, not a gate. Don't hold a prompt back over a 26 if the eval-set results are fine.
- Output validation is the single highest-leverage dimension. Prompts that score well elsewhere but 1 on validation cause production incidents disproportionately often.
- The Rubric is deliberately 7 dimensions. Fewer misses failure modes; more becomes theater.
- For agent prompts, double-weight constraint tightness and output validation. Single-shot failure modes differ from multi-step drift.
- Use the Rubric paired with RCAF for drafting. RCAF to draft, Rubric to audit.
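One way to implement the agent-prompt weighting from the list above; the 2x weights are illustrative, not a prescribed scheme:

```python
# Double-weight the two dimensions where agent prompts fail hardest.
AGENT_WEIGHTS = {
    "role_clarity": 1, "context_sufficiency": 1, "instruction_specificity": 1,
    "format_structure": 1, "example_quality": 1,
    "constraint_tightness": 2, "output_validation": 2,
}

def weighted_score(scores: dict[str, int], weights: dict[str, int]) -> tuple[int, int]:
    """Return (weighted total, weighted max) so thresholds can be rescaled."""
    total = sum(scores[d] * w for d, w in weights.items())
    maximum = sum(5 * w for w in weights.values())
    return total, maximum
```

With these weights the max becomes 45, so the 28/35 ship line rescales to 36/45. A prompt that scores 28 unweighted by being strong everywhere except 2s on constraint tightness and output validation lands at 32/45, below the rescaled ship line, which is the agent-weighting point in concrete terms.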
## Related reading
- RCAF Prompt Structure — the drafting skeleton the Rubric pairs with
- Context Engineering Maturity Model — where context sufficiency scales
- Common prompt engineering mistakes
- Why your AI prompts suck