Tip
TL;DR: A realistic first-draft customer service prompt scores 9/35 on the SurePrompts Quality Rubric. Working through the seven dimensions lowest-score-first — role, context, instruction, format, examples, constraints, validation — raises it to 31/35 with edits that take ten minutes. This post shows the before, the edits, and the after.
Key takeaways:
- The first-draft customer service prompt is almost always a 9–12/35. Not because the author is bad, but because zero-shot prose leaves six of the seven Rubric dimensions unaddressed by default.
- Fix the lowest-scoring dimension first, not the easiest one. The discipline is boring and it works.
- Constraint tightness and output validation do more for customer service prompts than any other two dimensions. Get those to 4+ and most production incidents disappear.
- One well-chosen few-shot example beats three generic ones. For customer service, the example that matters most is the edge case you are already afraid of.
- Every Rubric dimension has an edit that takes minutes. If you are spending hours per dimension, the prompt architecture is wrong — go back to RCAF first.
- A 31/35 is ship-ready. Do not chase 35; over-constrained customer service prompts become robotic.
The starting prompt
Here is a hypothetical first draft — the kind a support ops manager writes on a Friday afternoon. (Hypothetical example; not a specific tool's template.)
You are a helpful assistant for our customer service team. Read the customer's ticket below and write a reply that solves their problem.
Ticket: {{ticket_text}}
Two sentences, one variable, no structure. It will run. It will also fail in ways the author does not yet see.
Scoring the baseline — 9/35
Walking the seven Quality Rubric dimensions, each scored 1–5:
1. Role clarity: 1/5. "Helpful assistant" is role prompting in name only. No voice, expertise level, or posture. The model is guessing who it is.
2. Context sufficiency: 1/5. The prompt drops the ticket in and nothing else — no plan tier, account history, product scope, or policy. It will fabricate or refuse.
3. Instruction specificity: 2/5. "Write a reply that solves their problem" is a named task with a vague verb. No sub-tasks, no success criteria.
4. Format structure: 1/5. No format specified. A three-line email and a twelve-paragraph essay are both "a reply."
5. Example quality: 1/5. No examples. For customer service, where tone calibration matters, few-shot prompting with one anchoring example is almost always worth the tokens.
6. Constraint tightness: 1/5. Nothing stops the model from promising a refund, naming a feature that does not exist, or inventing a case number.
7. Output validation: 2/5. Implicit expectation of human review, but no enforcement and no checklist. Generous 2.
Total: 9/35. Functional only in the sense that it runs: usable half the time, unpredictably embarrassing the other half.
Fix sequence — lowest score first
The Rubric discipline is: pick the lowest-scoring dimension, make the edit that raises it, move on. Five dimensions are tied at 1, so we break ties by leverage. For customer service, start with Role — it constrains the other six.
Role clarity: 1 → 5
The system prompt is the right home. Replace vague assistant language with scope, voice, expertise, and posture:
You are a Tier-2 support specialist for Acme Billing, a B2B SaaS invoicing product. Three years of hands-on experience; you know the product deeply. Voice: calm, specific, warm but not sugary. When a customer is upset, acknowledge before you explain. When confused, explain before asking follow-ups. Never speak for the product roadmap; never promise anything outside documented policy.
New score: 5/5.
Context sufficiency: 1 → 4
Add the fields that change per ticket. Do not dump the knowledge base inline — that is a retrieval job:
Customer context:
- Plan tier: {{plan_tier}} (Free, Pro, or Business)
- Account age: {{account_age_days}} days
- Prior tickets in last 30 days: {{recent_ticket_count}}
- Current MRR: {{mrr_usd}}
Policy context:
- Refunds: Pro/Business plans eligible for prorated refund within 14 days. Free plan: no refunds.
- Escalation: Business plan with MRR > $500 and prior ticket in last 7 days auto-escalates.
- Feature requests: acknowledge, no timeline commitments, route to product@acmebilling.com.
4 rather than 5 because product-specific FAQs are retrieved separately, not inline.
New score: 4/5.
Instruction specificity: 2 → 5
Split the task into named sub-tasks with success criteria:
Task: Draft a reply that does four things in order:
1. Acknowledge the specific situation (one sentence).
2. Diagnose what is happening, naming any policy or feature that applies.
3. Resolve — give the answer, take the action, or name exactly what you need to proceed.
4. Next step — state what happens next and when.
Success = a human agent can send your draft with zero edits for routine tickets (plan questions, invoice format, documented features).
New score: 5/5.
Format structure: 1 → 5
Customer service replies have a shape. Specify it as structured output:
Output format: Return a single JSON object (no markdown, no code fences):

```
{
  "reply_text": "plain text, 60–180 words",
  "reply_length_words": <int>,
  "acknowledges": "<specific thing acknowledged>",
  "diagnosis": "<1-sentence>",
  "resolution_type": "answered" | "action_taken" | "info_needed" | "escalated",
  "escalation_needed": <bool>,
  "escalation_reason": "<string or null>",
  "confidence": <float 0-1>
}
```

Tone: match the customer's register (formal if formal, casual if casual), but never more emotional than they were.
Machine-checkable, length-bounded, tone rule explicit.
New score: 5/5.
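Because the output is a single JSON object, the structural layer really is machine-checkable. A minimal sketch in Python (function name is illustrative; field names mirror the schema above):

```python
import json

# Required fields and types, mirroring the JSON schema in the prompt.
REQUIRED_FIELDS = {
    "reply_text": str,
    "reply_length_words": int,
    "acknowledges": str,
    "diagnosis": str,
    "resolution_type": str,
    "escalation_needed": bool,
    "confidence": (int, float),
}
RESOLUTION_TYPES = {"answered", "action_taken", "info_needed", "escalated"}


def parse_reply(raw: str) -> dict:
    """Parse model output; raise on structural drift so bad drafts never reach an agent."""
    reply = json.loads(raw)  # raises ValueError if the model wrapped the JSON in prose
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(reply.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    if reply["resolution_type"] not in RESOLUTION_TYPES:
        raise ValueError(f"unknown resolution_type: {reply['resolution_type']}")
    # escalation_reason is the one nullable field.
    if reply.get("escalation_reason") is not None and not isinstance(reply["escalation_reason"], str):
        raise ValueError("escalation_reason must be a string or null")
    return reply
```

Anything that fails here goes back for a retry instead of into the human queue.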
Example quality: 1 → 3
Add one few-shot example covering the hardest pattern — an angry customer on a plan tier where policy does not allow what they are asking for:
Example (Free plan, demands refund):

Ticket: "This product is broken and you owe me my money back. Refund me immediately."

Reply:

```
{
  "reply_text": "I hear you — it's frustrating when something isn't working, and I want to help. Looking at your account, you're on the Free plan, which means there isn't a charge to refund. Tell me what specifically broke and I'll either walk you through a fix or flag it to engineering. I'll follow up in this thread today.",
  "reply_length_words": 68,
  "acknowledges": "frustration with something not working",
  "diagnosis": "Free-plan customer requesting refund on unpaid account",
  "resolution_type": "info_needed",
  "escalation_needed": false,
  "escalation_reason": null,
  "confidence": 0.9
}
```
A 3 rather than higher because we have not added a happy-path or genuine-escalation example yet. One well-chosen example still beats three generic ones.
New score: 3/5.
Constraint tightness: 1 → 5
Customer service is where constraints earn the most. Explicit DO-NOTs:
Constraints:
- Never promise a refund, credit, discount, or escalation not in the policy context.
- Never name a product feature from memory — only from ticket or context block.
- Banned phrases: "unfortunately," "as per our policy," "per your request," "kindly," "please be advised."
- Maximum one apology per reply.
- If the customer is angry, stay one register calmer — do not match the anger.
- If required info is not in the ticket, ask; do not guess.
- If a timeline commitment is needed (ship date, fix ETA), refuse and set escalation_needed: true.
Seven constraints, each covering a known failure mode.
New score: 5/5.
Output validation: 2 → 4
Add a self-check before output; rely on the JSON schema for the structural layer:
Before returning JSON, verify:

1. reply_text contains no banned phrases.
2. reply_length_words is 60–180 and matches actual word count.
3. No out-of-policy refund commitment.
4. escalation_needed is true for Business plan with MRR > $500 and prior ticket in last 7 days.
5. confidence is honest — use < 0.5 if any customer detail was a guess.

If any check fails, fix the reply before returning.
A 5 would run validation outside the prompt in application code, which is the right long-term home.
New score: 4/5.
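The same checks the prompt asks the model to run can be duplicated in application code, which is the layer a 5/5 would require. A sketch (function and parameter names are illustrative; the banned-phrase list and escalation rule are copied from the prompt above):

```python
BANNED_PHRASES = ("unfortunately", "as per our policy", "per your request", "kindly", "please be advised")


def check_reply(reply: dict, plan_tier: str, mrr_usd: float, ticket_in_last_7_days: bool) -> list:
    """Run the self-check items in code. Returns failure strings; empty list = pass."""
    failures = []
    text = reply["reply_text"].lower()
    failures += [f"banned phrase: {p!r}" for p in BANNED_PHRASES if p in text]
    actual_words = len(reply["reply_text"].split())
    if not 60 <= actual_words <= 180:
        failures.append(f"length {actual_words} outside 60-180 words")
    if actual_words != reply["reply_length_words"]:
        failures.append("reply_length_words does not match actual word count")
    # Escalation rule from the policy context: Business plan, MRR > $500, prior ticket in last 7 days.
    if plan_tier == "Business" and mrr_usd > 500 and ticket_in_last_7_days:
        if not reply["escalation_needed"]:
            failures.append("mandatory escalation missed")
    return failures
```

A failing draft gets regenerated or routed to a human, never auto-sent.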
The revised prompt — 31/35
The final prompt, assembled from the edits above, is a prompt template with variables for plan tier, account age, prior tickets, MRR, and ticket text. The role block, policy block, task list, JSON schema, edge-case example, seven constraints, and five-item self-check all carry forward as written.
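In application code, that template reduces to string substitution over the five variables. A sketch (the bracketed line stands in for the full blocks, which carry forward as written above):

```python
PROMPT_TEMPLATE = """\
[role block, policy block, task list, JSON schema, example, constraints, self-check, as written above]

Customer context:
- Plan tier: {plan_tier} (Free, Pro, or Business)
- Account age: {account_age_days} days
- Prior tickets in last 30 days: {recent_ticket_count}
- Current MRR: {mrr_usd}

Ticket: {ticket_text}
"""


def build_prompt(**fields) -> str:
    # str.format raises KeyError on a missing variable: a half-filled
    # template should fail loudly instead of reaching the model.
    return PROMPT_TEMPLATE.format(**fields)
```

Failing loudly on a missing field is deliberate; a prompt rendered with blanks is the silent version of every problem the edits above fixed.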
Final score:
- Role clarity: 5
- Context sufficiency: 4
- Instruction specificity: 5
- Format structure: 5
- Example quality: 3
- Constraint tightness: 5
- Output validation: 4
Total: 31/35. Ship it.
Anti-patterns we left on the floor
Four things we deliberately did not do, and why:
- Chain-of-thought reasoning. For a short customer service reply, "think step by step" before output adds latency and tokens without improving quality — the acknowledge-diagnose-resolve-next-step structure already forces sequential reasoning. Reasoning scaffolds help on multi-step analytical tasks, not on single replies.
- Three or four few-shot examples. More examples would anchor tone more tightly, but they would also push the context window toward the ticket text, where attention is most valuable. One well-chosen edge-case example is enough anchoring for this task. Add more only if the eval set shows specific failure modes.
- A "here's how to be empathetic" paragraph. Tempting, and it reads well to a human reviewer. But models handle "match the customer's register, stay one calmer" better than they handle a paragraph about empathy. Short, behaviorally-named rules beat essays every time.
- An "if you are unsure, ask a human" fallback at the end. We already set
escalation_needed: trueas the fallback, which is checkable in code. A prose fallback invites the model to over-escalate to feel safe, which clogs the human queue — exactly the thing the prompt was built to prevent.
Pair with RCAF and LLM-as-Judge
RCAF and the Quality Rubric are complementary. RCAF is the drafting skeleton — it forces you to fill Role, Context, Action, and Format slots, which naturally raises those four Rubric scores from 1s to 4s on the first draft. The Rubric then catches the three dimensions RCAF does not explicitly cover: Examples, Constraints, and Validation. Write the draft with RCAF, audit with the Rubric, iterate lowest-dimension-first.
For scale, use LLM-as-Judge to automate Rubric scoring across a library of prompts. Role clarity, format structure, and constraint tightness have objective signals a judge prompt can score reliably. Context, Examples, and Validation need either a human eye or an eval set the prompt runs against. Budget accordingly: LLM-as-judge for the 80%, human review for the 20% where the judge is known to be unreliable. See our AI prompts for customer service library for starting-point prompt templates that already clear the Rubric threshold.
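On the judge side, the scoring prompt itself is just another template. A minimal sketch of the prompt construction only (the model call and score parsing are omitted; dimension field names are illustrative):

```python
# Only the dimensions with objective signals a judge can score reliably.
JUDGE_DIMENSIONS = ("role_clarity", "format_structure", "constraint_tightness")


def build_judge_prompt(candidate_prompt: str) -> str:
    """Build a judge prompt that scores each checkable rubric dimension 1-5."""
    header = (
        "You are auditing a customer service prompt against a quality rubric.\n"
        "Score each dimension from 1 to 5. Return one JSON object with integer "
        "fields: " + ", ".join(JUDGE_DIMENSIONS) + ".\n\n"
        "Prompt under review:\n"
    )
    return header + candidate_prompt
```

The remaining dimensions go to the human-review 20%.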
Our position
- Customer service prompts earn their score from Constraints and Validation, not Role and Format. Teams obsess over the opening role paragraph and ship prompts that promise refunds they should not. Fix that order.
- Aim for 30+, not 35. Over-constrained customer service prompts become robotic and lose the register-matching that makes replies feel human. Leave some slack for the model to be competent.
- One edge-case example beats three happy-path examples. The happy path is what the model does well without examples. Use the few-shot slot for the thing you are already afraid of.
- Validation belongs partly in the prompt (self-check) and partly in application code (schema enforcement). Do not try to put it all in one place.
- Score quarterly, not per ticket. A prompt that was 31/35 in January is often 24/35 by April because the product changed, the policy changed, or the model changed. Re-score on a calendar, not on a complaint.
Related reading
- The SurePrompts Quality Rubric — the 7-dimension framework this post applies
- RCAF Prompt Structure — the drafting skeleton the Rubric pairs with
- LLM-as-Judge Prompting Guide — automate Rubric scoring at scale
- 40 AI Prompts for Customer Service — a starter library you can Rubric-score against
- Common prompt engineering mistakes
- Why your AI prompts suck