Tip
TL;DR: A realistic first-draft customer service prompt scores 9/35 on the SurePrompts Quality Rubric. Working through the seven dimensions lowest-score-first — role, context, instruction, format, examples, constraints, validation — raises it to 31/35 with edits that take ten minutes. This post shows the before, the edits, and the after.
Key takeaways:
- The first-draft customer service prompt is almost always a 9–12/35. Not because the author is bad, but because zero-shot prose leaves six of the seven Rubric dimensions unaddressed by default.
- Fix the lowest-scoring dimension first, not the easiest one. The discipline is boring and it works.
- Constraint tightness and output validation do more for customer service prompts than any other two dimensions. Get those to 4+ and most production incidents disappear.
- One well-chosen few-shot example beats three generic ones. For customer service, the example that matters most is the edge case you are already afraid of.
- Every Rubric dimension has an edit that takes minutes. If you are spending hours per dimension, the prompt architecture is wrong — go back to RCAF first.
- A 31/35 is ship-ready. Do not chase 35; over-constrained customer service prompts become robotic.
The starting prompt
Here is a hypothetical first draft — the kind a support ops manager writes on a Friday afternoon. (Hypothetical example; not a specific tool's template.)
You are a helpful assistant for our customer service team. Read the customer's ticket below and write a reply that solves their problem.
Ticket: {{ticket_text}}
Two sentences, one variable, no structure. It will run. It will also fail in ways the author does not yet see.
Scoring the baseline — 9/35
Walking the seven Quality Rubric dimensions, each scored 1–5:
1. Role clarity: 1/5. "Helpful assistant" is role prompting in name only. No voice, expertise level, or posture. The model is guessing who it is.
2. Context sufficiency: 1/5. The prompt drops the ticket in and nothing else — no plan tier, account history, product scope, or policy. It will fabricate or refuse.
3. Instruction specificity: 2/5. "Write a reply that solves their problem" is a named task with a vague verb. No sub-tasks, no success criteria.
4. Format structure: 1/5. No format specified. A three-line email and a twelve-paragraph essay are both "a reply."
5. Example quality: 1/5. No examples. For customer service, where tone calibration matters, few-shot prompting with one anchoring example is almost always worth the tokens.
6. Constraint tightness: 1/5. Nothing stops the model from promising a refund, naming a feature that does not exist, or inventing a case number.
7. Output validation: 2/5. Implicit expectation of human review, but no enforcement and no checklist. Generous 2.
Total: 9/35. Functional only in the sense that it runs: usable half the time, unpredictably embarrassing the other half.
Fix sequence — lowest score first
The Rubric discipline is: pick the lowest-scoring dimension, make the edit that raises it, move on. Five dimensions are tied at 1, so we break ties by leverage. For customer service, start with Role — it constrains the other six.
Role clarity: 1 → 5
The system prompt is the right home. Replace vague assistant language with scope, voice, expertise, and posture:
You are a Tier-2 support specialist for Acme Billing, a B2B SaaS invoicing product. Three years of hands-on experience; you know the product deeply. Voice: calm, specific, warm but not sugary. When a customer is upset, acknowledge before you explain. When confused, explain before asking follow-ups. Never speak for the product roadmap; never promise anything outside documented policy.
New score: 5/5.
Context sufficiency: 1 → 4
Add the fields that change per ticket. Do not dump the knowledge base inline — that is a retrieval job:
Customer context:
- Plan tier: {{plan_tier}} (Free, Pro, or Business)
- Account age: {{account_age_days}} days
- Prior tickets in last 30 days: {{recent_ticket_count}}
- Current MRR: {{mrr_usd}}
Policy context:
- Refunds: Pro/Business plans eligible for prorated refund within 14 days. Free plan: no refunds.
- Escalation: Business plan with MRR > $500 and prior ticket in last 7 days auto-escalates.
- Feature requests: acknowledge, no timeline commitments, route to product@acmebilling.com.
4 rather than 5 because product-specific FAQs are retrieved separately, not inline.
New score: 4/5.
Instruction specificity: 2 → 5
Split the task into named sub-tasks with success criteria:
Task: Draft a reply that does four things in order:
1. Acknowledge the specific situation (one sentence).
2. Diagnose what is happening, naming any policy or feature that applies.
3. Resolve — give the answer, take the action, or name exactly what you need to proceed.
4. Next step — state what happens next and when.
Success = a human agent can send your draft with zero edits for routine tickets (plan questions, invoice format, documented features).
New score: 5/5.
Format structure: 1 → 5
Customer service replies have a shape. Specify it as structured output:
Output format: Return a single JSON object (no markdown, no code fences):

```
{
  "reply_text": "plain text, 60–180 words",
  "reply_length_words": <int>,
  "acknowledges": "<specific thing acknowledged>",
  "diagnosis": "<1-sentence>",
  "resolution_type": "answered" | "action_taken" | "info_needed" | "escalated",
  "escalation_needed": <bool>,
  "escalation_reason": "<string or null>",
  "confidence": <float 0-1>
}
```

Tone: match the customer's register (formal if formal, casual if casual), but never more emotional than they were.
Machine-checkable, length-bounded, tone rule explicit.
New score: 5/5.
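Because the output is a single JSON object, the structural layer really is machine-checkable. A minimal sketch in Python (function name is illustrative; field names mirror the schema above):

```python
import json

# Required fields and types, mirroring the JSON schema in the prompt.
REQUIRED_FIELDS = {
    "reply_text": str,
    "reply_length_words": int,
    "acknowledges": str,
    "diagnosis": str,
    "resolution_type": str,
    "escalation_needed": bool,
    "confidence": (int, float),
}
RESOLUTION_TYPES = {"answered", "action_taken", "info_needed", "escalated"}


def parse_reply(raw: str) -> dict:
    """Parse model output; raise on structural drift so bad drafts never reach an agent."""
    reply = json.loads(raw)  # raises ValueError if the model wrapped the JSON in prose
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(reply.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    if reply["resolution_type"] not in RESOLUTION_TYPES:
        raise ValueError(f"unknown resolution_type: {reply['resolution_type']}")
    # escalation_reason is the one nullable field.
    if reply.get("escalation_reason") is not None and not isinstance(reply["escalation_reason"], str):
        raise ValueError("escalation_reason must be a string or null")
    return reply
```

Anything that fails here goes back for a retry instead of into the human queue.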
Example quality: 1 → 3
Add one few-shot example covering the hardest pattern — an angry customer on a plan tier where policy does not allow what they are asking for:
Example (Free plan, demands refund):

Ticket: "This product is broken and you owe me my money back. Refund me immediately."

Reply:

```
{
  "reply_text": "I hear you — it's frustrating when something isn't working, and I want to help. Looking at your account, you're on the Free plan, which means there isn't a charge to refund. Tell me what specifically broke and I'll either walk you through a fix or flag it to engineering. I'll follow up in this thread today.",
  "reply_length_words": 68,
  "acknowledges": "frustration with something not working",
  "diagnosis": "Free-plan customer requesting refund on unpaid account",
  "resolution_type": "info_needed",
  "escalation_needed": false,
  "escalation_reason": null,
  "confidence": 0.9
}
```
A 3 rather than higher because we have not added a happy-path or genuine-escalation example yet. One well-chosen example still beats three generic ones.
New score: 3/5.
Constraint tightness: 1 → 5
Customer service is where constraints earn the most. Explicit DO-NOTs:
Constraints:
- Never promise a refund, credit, discount, or escalation not in the policy context.
- Never name a product feature from memory — only from ticket or context block.
- Banned phrases: "unfortunately," "as per our policy," "per your request," "kindly," "please be advised."
- Maximum one apology per reply.
- If the customer is angry, stay one register calmer — do not match the anger.
- If required info is not in the ticket, ask; do not guess.
- If a timeline commitment is needed (ship date, fix ETA), refuse and set escalation_needed: true.
Seven constraints, each covering a known failure mode.
New score: 5/5.
Output validation: 2 → 4
Add a self-check before output; rely on the JSON schema for the structural layer:
Before returning JSON, verify:

1. reply_text contains no banned phrases.
2. reply_length_words is 60–180 and matches actual word count.
3. No out-of-policy refund commitment.
4. escalation_needed is true for Business plan with MRR > $500 and prior ticket in last 7 days.
5. confidence is honest — use < 0.5 if any customer detail was a guess.

If any check fails, fix the reply before returning.
A 5 would run validation outside the prompt in application code, which is the right long-term home.
New score: 4/5.
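The same checks the prompt asks the model to run can be duplicated in application code, which is the layer a 5/5 would require. A sketch (function and parameter names are illustrative; the banned-phrase list and escalation rule are copied from the prompt above):

```python
BANNED_PHRASES = ("unfortunately", "as per our policy", "per your request", "kindly", "please be advised")


def check_reply(reply: dict, plan_tier: str, mrr_usd: float, ticket_in_last_7_days: bool) -> list:
    """Run the self-check items in code. Returns failure strings; empty list = pass."""
    failures = []
    text = reply["reply_text"].lower()
    failures += [f"banned phrase: {p!r}" for p in BANNED_PHRASES if p in text]
    actual_words = len(reply["reply_text"].split())
    if not 60 <= actual_words <= 180:
        failures.append(f"length {actual_words} outside 60-180 words")
    if actual_words != reply["reply_length_words"]:
        failures.append("reply_length_words does not match actual word count")
    # Escalation rule from the policy context: Business plan, MRR > $500, prior ticket in last 7 days.
    if plan_tier == "Business" and mrr_usd > 500 and ticket_in_last_7_days:
        if not reply["escalation_needed"]:
            failures.append("mandatory escalation missed")
    return failures
```

A failing draft gets regenerated or routed to a human, never auto-sent.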
The revised prompt — 31/35
The final prompt, assembled from the edits above, is a prompt template with variables for plan tier, account age, prior tickets, MRR, and ticket text. The role block, policy block, task list, JSON schema, edge-case example, seven constraints, and five-item self-check all carry forward as written.
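In application code, that template reduces to string substitution over the five variables. A sketch (the bracketed line stands in for the full blocks, which carry forward as written above):

```python
PROMPT_TEMPLATE = """\
[role block, policy block, task list, JSON schema, example, constraints, self-check, as written above]

Customer context:
- Plan tier: {plan_tier} (Free, Pro, or Business)
- Account age: {account_age_days} days
- Prior tickets in last 30 days: {recent_ticket_count}
- Current MRR: {mrr_usd}

Ticket: {ticket_text}
"""


def build_prompt(**fields) -> str:
    # str.format raises KeyError on a missing variable: a half-filled
    # template should fail loudly instead of reaching the model.
    return PROMPT_TEMPLATE.format(**fields)
```

Failing loudly on a missing field is deliberate; a prompt rendered with blanks is the silent version of every problem the edits above fixed.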
Final score:
- Role clarity: 5
- Context sufficiency: 4
- Instruction specificity: 5
- Format structure: 5
- Example quality: 3
- Constraint tightness: 5
- Output validation: 4
Total: 31/35. Ship it.
Anti-patterns we left on the floor
Four things we deliberately did not do, and why:
- Chain-of-thought reasoning. For a short customer service reply, "think step by step" before output adds latency and tokens without improving quality — the acknowledge-diagnose-resolve-next-step structure already forces sequential reasoning. Reasoning scaffolds help on multi-step analytical tasks, not on single replies.
- Three or four few-shot examples. More examples would anchor tone more tightly, but they would also push the context window toward the ticket text, where attention is most valuable. One well-chosen edge-case example is enough anchoring for this task. Add more only if the eval set shows specific failure modes.
- A "here's how to be empathetic" paragraph. Tempting, and it reads well to a human reviewer. But models handle "match the customer's register, stay one calmer" better than they handle a paragraph about empathy. Short, behaviorally-named rules beat essays every time.
- An "if you are unsure, ask a human" fallback at the end. We already set
escalation_needed: trueas the fallback, which is checkable in code. A prose fallback invites the model to over-escalate to feel safe, which clogs the human queue — exactly the thing the prompt was built to prevent.
Pair with RCAF and LLM-as-Judge
RCAF and the Quality Rubric are complementary. RCAF is the drafting skeleton — it forces you to fill Role, Context, Action, and Format slots, which naturally raises those four Rubric scores from 1s to 4s on the first draft. The Rubric then catches the three dimensions RCAF does not explicitly cover: Examples, Constraints, and Validation. Write the draft with RCAF, audit with the Rubric, iterate lowest-dimension-first.
For scale, use LLM-as-Judge to automate Rubric scoring across a library of prompts. Role clarity, format structure, and constraint tightness have objective signals a judge prompt can score reliably. Context, Examples, and Validation need either a human eye or an eval set the prompt runs against. Budget accordingly: LLM-as-judge for the 80%, human review for the 20% where the judge is known to be unreliable. See our AI prompts for customer service library for starting-point prompt templates that already clear the Rubric threshold.
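On the judge side, the scoring prompt itself is just another template. A minimal sketch of the prompt construction only (the model call and score parsing are omitted; dimension field names are illustrative):

```python
# Only the dimensions with objective signals a judge can score reliably.
JUDGE_DIMENSIONS = ("role_clarity", "format_structure", "constraint_tightness")


def build_judge_prompt(candidate_prompt: str) -> str:
    """Build a judge prompt that scores each checkable rubric dimension 1-5."""
    header = (
        "You are auditing a customer service prompt against a quality rubric.\n"
        "Score each dimension from 1 to 5. Return one JSON object with integer "
        "fields: " + ", ".join(JUDGE_DIMENSIONS) + ".\n\n"
        "Prompt under review:\n"
    )
    return header + candidate_prompt
```

The remaining dimensions go to the human-review 20%.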
Our position
- Customer service prompts earn their score from Constraints and Validation, not Role and Format. Teams obsess over the opening role paragraph and ship prompts that promise refunds they should not. Fix that order.
- Aim for 30+, not 35. Over-constrained customer service prompts become robotic and lose the register-matching that makes replies feel human. Leave some slack for the model to be competent.
- One edge-case example beats three happy-path examples. The happy path is what the model does well without examples. Use the few-shot slot for the thing you are already afraid of.
- Validation belongs partly in the prompt (self-check) and partly in application code (schema enforcement). Do not try to put it all in one place.
- Score quarterly, not per ticket. A prompt that was 31/35 in January is often 24/35 by April because the product changed, the policy changed, or the model changed. Re-score on a calendar, not on a complaint.
Related reading
- The SurePrompts Quality Rubric — the 7-dimension framework this post applies
- RCAF Prompt Structure — the drafting skeleton the Rubric pairs with
- LLM-as-Judge Prompting Guide — automate Rubric scoring at scale
- 40 AI Prompts for Customer Service — a starter library you can Rubric-score against
- Common prompt engineering mistakes
- Why your AI prompts suck