> **Tip**
>
> TL;DR: The SurePrompts Quality Rubric scores a prompt across 7 dimensions (role clarity, context sufficiency, instruction specificity, format structure, example quality, constraint tightness, and output validation), each 1-5 for a max of 35. 28+ is production-ready; 21-27 needs revision; below 21 needs major revision or a rewrite.
Key takeaways:
- The Rubric scores, it does not judge. 7 dimensions × 1-5 = max 35. A low score on one dimension is a specific fix to make next, not a verdict on the whole prompt.
- Output validation is the most under-built dimension. Prompts that score well everywhere else but 1 on validation cause production incidents disproportionately often.
- 28+ ships. 21-27 needs revision. Below 21 is not yet functional. The thresholds are deliberate, not arbitrary percentages.
- RCAF to draft, Rubric to audit. Pair the Rubric with RCAF Prompt Structure — use both, not one.
- For agent prompts, weight constraint tightness and output validation higher. Agents fail differently from one-shot prompts; a 28 one-shot can be a 22 agent prompt if the weighting is not adjusted.
## Why a rubric at all?
Most prompt improvement today happens by vibe. A prompt "feels off," so the engineer tweaks wording until it "feels better." This works for simple prompts and breaks at scale: two engineers can't agree on what "better" means, a good prompt on Monday stops working on Thursday, and nobody can explain why.
A rubric replaces vibe with dimensions. Instead of "this prompt is bad," you say "this prompt scores 2/5 on output validation." That statement is actionable: you can add output validation and rescore.
The SurePrompts Quality Rubric is designed for one job: fast iteration with a shared vocabulary. It is not a gate, not a scoring system to impress anyone, and not a replacement for actually running the prompt on eval data. It is the thing you use between draft and eval to catch obvious weaknesses.
## The 7 dimensions
### 1. Role clarity (1-5)
Does the prompt assign the AI a specific, coherent role?
- 5: Explicit role with scope, voice, expertise level, and posture. ("You are a senior backend engineer reviewing a pull request for production readiness.")
- 3: Role present but vague. ("You are a helpful assistant.")
- 1: No role. The model is guessing who it's supposed to be.
### 2. Context sufficiency (1-5)
Does the prompt include everything the model needs to do the task well?
- 5: All relevant background (the user's situation, constraints, prior decisions, relevant domain knowledge) is present.
- 3: Some context; the model can mostly proceed but will make assumptions.
- 1: Near-zero context. The model will fabricate or refuse.
### 3. Instruction specificity (1-5)
How precise is the task description?
- 5: The task, its sub-tasks, and the success criteria are named explicitly.
- 3: The task is named; sub-steps and success criteria are implicit.
- 1: Vague verb ("help me with X"), no sub-structure.
### 4. Format structure (1-5)
Is the expected output format specified?
- 5: Exact structure defined (schema, section headers, tone, length). Ideally with an example.
- 3: Format named ("as a list") but not specified in detail.
- 1: No format instructions.
### 5. Example quality (1-5)
Are the few-shot examples (if any) well-chosen?
- 5: 2-4 examples covering diverse input cases and the edge case(s) that matter.
- 3: 1-2 generic examples.
- 1: No examples, or examples that don't match the actual input distribution.
For zero-shot prompts, score this dimension on whether the prompt makes zero-shot viable: some tasks genuinely don't need examples; others silently need them and suffer without them.
### 6. Constraint tightness (1-5)
Are constraints (what the model must NOT do, length limits, banned words, output types) specified?
- 5: Explicit constraints covering the known failure modes for this task.
- 3: Some constraints, but the common failure modes are unaddressed.
- 1: No constraints. The model will do whatever it wants.
### 7. Output validation (1-5)
Is there a plan for validating the output before using it?
- 5: Output is machine-validated (schema check, regex, programmatic test) or explicitly reviewed against criteria.
- 3: Output is human-reviewed but without a checklist.
- 1: Output is used as-is, with no validation path.
This is the dimension most often at 1, and it's frequently the reason a prompt that "works" in testing breaks in production.
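To make a 5 on this dimension concrete, here is a minimal sketch of machine validation in Python. It assumes a hypothetical task whose output should be JSON with a `summary` string and a `tags` list; those field names are illustrative, not part of the Rubric.

```python
import json
import re

def validate_output(raw: str) -> list[str]:
    """Return a list of validation failures (empty list = output passes)."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    # Schema check: required keys with the expected types.
    if not isinstance(data.get("summary"), str):
        errors.append("missing or non-string 'summary'")
    if not isinstance(data.get("tags"), list):
        errors.append("missing or non-list 'tags'")
    # Regex check: reject placeholder text the model sometimes leaves behind.
    if isinstance(data.get("summary"), str) and re.search(r"TODO", data["summary"]):
        errors.append("summary contains placeholder text")
    return errors
```

Wiring even a check this small into the pipeline moves the dimension from "used as-is" to "machine-validated," which is the gap the paragraph above is about.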
## Scoring guidance
| Score | Meaning |
|---|---|
| 28-35 | Production-ready. Ship it. |
| 21-27 | Working draft. Fix the lowest-scoring dimensions. |
| 14-20 | Needs major revision. Pick the 3 lowest scores and address them. |
| 7-13 | Not yet functional. Rewrite from scratch using RCAF + Rubric. |
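The bands above are easy to encode. A minimal Python sketch; the dimension names and function shape are ours for illustration, not an official SurePrompts API:

```python
DIMENSIONS = [
    "role_clarity", "context_sufficiency", "instruction_specificity",
    "format_structure", "example_quality", "constraint_tightness",
    "output_validation",
]

def score_prompt(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the 7 dimension scores and map the total to a Rubric band."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    if any(not 1 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension scores 1-5")
    total = sum(scores[d] for d in DIMENSIONS)
    if total >= 28:
        band = "production-ready"
    elif total >= 21:
        band = "working draft"
    elif total >= 14:
        band = "needs major revision"
    else:
        band = "not yet functional"
    return total, band
```

Note the minimum possible total is 7, not 0, which is why the bottom band starts at 7 rather than at an arbitrary percentage.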
## Worked example
Consider this starting prompt:
> Write me a product description for a new blender.
Scored against the Rubric:
- Role clarity: 1 (no role)
- Context sufficiency: 1 (no product details)
- Instruction specificity: 2 (task named, nothing else)
- Format structure: 1 (no format specified)
- Example quality: 1 (no examples)
- Constraint tightness: 1 (no constraints)
- Output validation: 1 (no validation plan)
Total: 8/35. Not functional.
Revised using RCAF structure and Rubric feedback:
> Role: You are an ecommerce copywriter writing for a mid-market kitchen appliance brand. Voice: confident, practical, no hype.
>
> Context: The product is the Vortex Pro 700W countertop blender. Key specs: 700W motor, 6 speeds, 48oz glass jar, BPA-free lid, stainless steel blades, 7-year warranty. Target buyer: home cook who wants a reliable blender without pro-chef overkill.
>
> Action: Write a product description optimized for an Amazon listing page. Cover: hero statement, 5 bullet-point feature benefits, 1 short paragraph on who it's for, 1 short paragraph on what's in the box.
>
> Format:
> - Hero statement: 1 sentence, <20 words
> - Feature bullets: 5 bullets, each <15 words, benefit-first
> - Who-it's-for paragraph: 2-3 sentences
> - What's-in-the-box paragraph: 1-2 sentences listing items
>
> Constraints: Do not use the words "revolutionary," "game-changing," or "ultimate." Do not make claims about blending ice unless asked (motor is 700W, which is borderline). Do not invent accessories not listed in the spec.
>
> Validation: After writing, list the 5 claims you made that could not be verified from the context above, so I can check them.
Scored:
- Role clarity: 5
- Context sufficiency: 4 (we didn't include competitor positioning or price point)
- Instruction specificity: 5
- Format structure: 5
- Example quality: 2 (no example; for Amazon copy we might want one, but zero-shot is viable here)
- Constraint tightness: 4 (good banned-word list; could add length limit on output)
- Output validation: 5 (the "list unverifiable claims" instruction is an in-prompt validation step)
Total: 30/35. Ship it.
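The revised prompt's constraints and format rules are also machine-checkable, which is part of what earns the validation 5. A sketch of such a checker; the `check_copy` helper, its signature, and the decomposition into hero plus bullets are ours, not part of the prompt:

```python
# Banned words from the prompt's Constraints section.
BANNED = ("revolutionary", "game-changing", "ultimate")

def check_copy(hero: str, bullets: list[str]) -> list[str]:
    """Check a drafted description against the prompt's format and constraint rules."""
    problems = []
    text = " ".join([hero] + bullets).lower()
    for word in BANNED:
        if word in text:
            problems.append(f"banned word used: {word!r}")
    if len(hero.split()) >= 20:
        problems.append("hero statement is 20+ words")
    if len(bullets) != 5:
        problems.append(f"expected 5 feature bullets, got {len(bullets)}")
    for i, b in enumerate(bullets, 1):
        if len(b.split()) >= 15:
            problems.append(f"bullet {i} is 15+ words")
    return problems
```

A checker like this runs in milliseconds per draft, so it can gate every generation rather than relying on a human to remember the banned-word list.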
## Our position
- The Rubric is a diagnostic, not a gate. Don't hold a prompt back over a 26 if the eval-set results are fine.
- Output validation is the single highest-leverage dimension. Prompts that score well elsewhere but 1 on validation cause production incidents disproportionately often.
- The Rubric is deliberately 7 dimensions. Fewer misses failure modes; more becomes theater.
- For agent prompts, double-weight constraint tightness and output validation. Single-shot failure modes differ from multi-step drift.
- Use the Rubric paired with RCAF for drafting. RCAF to draft, Rubric to audit.
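One way to implement the agent-prompt weighting from the list above; the 2x weights are illustrative, not a prescribed scheme:

```python
# Double-weight the two dimensions where agent prompts fail hardest.
AGENT_WEIGHTS = {
    "role_clarity": 1, "context_sufficiency": 1, "instruction_specificity": 1,
    "format_structure": 1, "example_quality": 1,
    "constraint_tightness": 2, "output_validation": 2,
}

def weighted_score(scores: dict[str, int], weights: dict[str, int]) -> tuple[int, int]:
    """Return (weighted total, weighted max) so thresholds can be rescaled."""
    total = sum(scores[d] * w for d, w in weights.items())
    maximum = sum(5 * w for w in weights.values())
    return total, maximum
```

With these weights the max becomes 45, so the 28/35 ship line rescales to 36/45. A prompt that scores 28 unweighted by being strong everywhere except 2s on constraint tightness and output validation lands at 32/45, below the rescaled ship line, which is the agent-weighting point in concrete terms.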
## Related reading
- RCAF Prompt Structure — the drafting skeleton the Rubric pairs with
- Context Engineering Maturity Model — where context sufficiency scales
- Common prompt engineering mistakes
- Why your AI prompts suck