
RAGAS Evaluation: A Walkthrough for Quantifying RAG Quality

RAGAS measures RAG systems across four metrics — faithfulness, answer relevance, context precision, and context recall. This tutorial walks through each metric on a hypothetical customer-support RAG system.

SurePrompts Team
April 22, 2026
11 min read

TL;DR

RAGAS scores RAG systems on four metrics — faithfulness and answer relevance for the generator, context precision and context recall for the retriever. This walkthrough applies each metric to a hypothetical customer-support RAG, then shows how to wire the whole thing into a CI regression check.

Key takeaways:

  • Standard LLM evals miss RAG-specific failure modes. A RAG system can emit fluent, confident answers while being grounded in the wrong documents — or no documents at all.
  • RAGAS splits evaluation into retrieval-side and generation-side metrics. The retriever can be excellent while the generator fabricates, or vice versa, and a single-score eval cannot tell you which.
  • Two of the four metrics — faithfulness and answer relevance — run without ground-truth answers. The other two require a golden set. Start where you are.
  • Faithfulness uses an LLM-as-judge pattern under the hood. Everything said in the LLM-as-Judge Prompting Guide about judge biases applies to RAGAS too.
  • The real value shows up in CI. RAGAS on a golden set turns "did we just regress?" from a vibe into a dashboard.
  • RAGAS is opinionated for RAG. Pair it with the SurePrompts Quality Rubric for prompt quality and the Context Engineering Maturity Model for the broader discipline.

Why RAG needs its own eval framework

A generic LLM eval looks at a question and an answer and asks whether the answer is good. That works for closed-book tasks where the model is supposed to know the answer. It does not work for RAG, where the model is supposed to answer from retrieved context and where "good" has at least three separable failure modes.

Failure one: the retriever pulled the wrong documents, and the generator answered confidently from them anyway. Failure two: the retriever pulled the right documents, but the generator drifted — leaned on training data, hallucinated a citation, or smoothed out a nuance that mattered. The context was fine; the grounding failed. Failure three: the retriever missed a document that existed in the corpus, so the generator answered from partial information.

A single "is this answer good" score cannot distinguish these. You need metrics that look at the retrieval step and the generation step separately. That is the problem RAGAS was built for. The RAG prompt engineering guide covers how to build the system; RAGAS covers how to tell, with numbers, whether what you built is working.

The four metrics, in practice

RAGAS defines four core metrics. Two evaluate the generator; two evaluate the retriever. All four are framework-agnostic — they evaluate the inputs and outputs of the RAG pipeline, not the internals of whichever retriever or LLM you happen to use.

Faithfulness — answer stays within retrieved context

Faithfulness asks: of the factual claims in the generated answer, how many are directly supported by the retrieved context? If the answer makes five factual claims and four are traceable to the retrieved documents and one is invented or imported from training data, faithfulness scores around 0.80.

Under the hood, this is an LLM-as-judge pattern. The framework uses a judge model to extract atomic claims from the answer, then asks the judge, per claim, whether the retrieved context entails it. Claim-level verdicts aggregate into the overall score. Everything covered in the LLM-as-Judge Prompting Guide — position bias, verbosity bias, self-preference bias — applies here. A weak judge produces a weak faithfulness score.

Faithfulness does not require a ground-truth answer. It only needs the question, the retrieved context, and the answer. That makes it the cheapest RAGAS metric to bootstrap — you can run it on live production traffic without waiting for a curated golden set.
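The arithmetic behind the score is simple once the judge has issued its verdicts. A minimal sketch, with claim extraction and entailment stubbed out as a precomputed list of booleans (in RAGAS these come from judge-model calls):

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of the answer's atomic claims the judge found
    supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Five claims extracted from the answer; the judge supports four.
verdicts = [True, True, True, True, False]
print(faithfulness_score(verdicts))  # 0.8
```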

Answer relevance — answer addresses the question

Answer relevance asks: does the generated answer actually address what the user asked, or does it drift, pad, or answer a related-but-different question? A correct-but-off-topic answer fails here, which is a distinct failure from faithfulness.

The implementation is roughly: given the answer, use an LLM to generate questions the answer could plausibly be responding to, then measure semantic similarity between those and the actual user question. If the answer is on-target, the generated questions cluster tightly around the real one. If the answer drifted, they drift too.

Like faithfulness, answer relevance does not require ground truth. It catches a failure programmatic checks usually miss: the answer is technically correct but is answering the wrong question — common when retrieval succeeds on a related topic and the generator follows the documents instead of the query.
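The similarity step can be sketched with toy 3-dimensional vectors standing in for real embeddings; in practice the vectors come from an embedding model and the reverse-generated questions from an LLM:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer_relevance(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the real question and the questions
    reverse-generated from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)

# Two reverse-generated questions cluster near the real one; the third
# drifts toward a related-but-different topic, pulling the score down.
real_q = [1.0, 0.0, 0.0]
generated = [[0.9, 0.1, 0.0], [1.0, 0.05, 0.0], [0.2, 0.9, 0.1]]
score = answer_relevance(real_q, generated)  # high, but below 1.0
```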

Context precision — retrieved documents ranked by relevance

Context precision asks: of the documents your retriever returned, how well are the relevant ones ranked near the top? A retriever that puts the relevant chunk at position 8 scores worse than one that puts it at position 1, even if both return the same information.

This matters because most RAG systems truncate context — only the top N chunks reach the generator. If the relevant chunk sits at position 8 and you truncate at 5, the generator never sees it, and downstream metrics suffer for reasons that look like a generator problem but are really a ranking problem.

Context precision requires ground truth: either a reference answer (from which relevance per chunk is inferred) or explicit relevance labels. Without one of those, you cannot score precision — only confirm that the retriever returned something.
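One common formulation, close in spirit to what ragas computes, averages precision@k over the ranks where a relevant chunk appears, so early relevant chunks are rewarded and buried ones penalized. A sketch, with per-chunk relevance labels assumed given:

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks holding relevant chunks.
    relevance[i] is True if the chunk retrieved at rank i+1 is
    relevant to the reference answer."""
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Same single relevant chunk, ranked first vs. buried at rank 4:
print(context_precision([True, False, False, False]))   # 1.0
print(context_precision([False, False, False, True]))   # 0.25
```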

Context recall — all relevant docs retrieved

Context recall asks: of the information needed to answer the question, how much of it is actually present in the retrieved context? If the correct answer requires three facts and only two of them appear in the retrieved chunks, recall is about 0.67, and the generator is structurally incapable of producing a complete answer.

Context recall is the metric that catches "the answer exists in your corpus but the retriever never found it" — a failure mode that is invisible to faithfulness and answer relevance, because both of those only look at what was retrieved. If the retriever missed, they silently pass.

Recall also requires ground truth — specifically, a reference answer. The framework decomposes the reference into its component claims and checks which claims are present in the retrieved context. Missing claims drag the score down.
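The arithmetic, sketched with naive substring matching as a toy stand-in for the judge-model attribution step (real frameworks ask an LLM whether each reference claim is attributable to the context):

```python
def context_recall(reference_claims: list[str], retrieved_context: str) -> float:
    """Fraction of reference-answer claims recoverable from the
    retrieved context. Substring matching here is a toy stand-in
    for LLM-based attribution."""
    if not reference_claims:
        return 0.0
    supported = sum(claim in retrieved_context for claim in reference_claims)
    return supported / len(reference_claims)

claims = ["Pro includes SSO", "SSO costs $10/seat/month", "setup uses SAML"]
context = "Pro includes SSO. SSO costs $10/seat/month as an add-on."
print(round(context_recall(claims, context), 2))  # 0.67
```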

Worked example — a hypothetical customer support RAG

Hypothetical scenario, not a shipped product. A SaaS company runs a customer-support RAG over its help center. The user asks:

"Does the Pro plan include SSO, and what's the setup process?"

The retriever pulls four chunks:

  • A pricing page excerpt stating Pro includes SAML SSO as an add-on for $10/seat/month.
  • A help-center article titled "Setting up SSO" that walks through SAML configuration step by step.
  • A changelog entry from 2024 about adding SCIM provisioning.
  • A marketing comparison page that mentions SSO in a feature table.

The generator produces:

"Yes, the Pro plan includes SAML SSO as an add-on priced at $10 per seat per month. To set it up, configure your identity provider with the SAML endpoints in your admin settings, upload your IdP metadata, map user attributes, and enable SSO for your workspace. We also support SCIM provisioning for automatic user lifecycle management."

Now walk through each metric. All scores are illustrative, not real measurements.

Faithfulness (illustrative ~0.9). Most claims are supported. Pricing maps to chunk 1, setup steps to chunk 2, SCIM to chunk 3. What would drag the score below 1.0 is "for automatic user lifecycle management" — chunk 3 mentioned SCIM existed but (hypothetically) did not describe what it does. The generator imported that framing from training data.

Answer relevance (illustrative ~0.95). The user asked two things — availability and setup — and the answer addresses both. The SCIM tangent is on-topic enough that a judge would likely score high, though a stricter rubric might penalize it as drift.

Context precision (illustrative ~0.75). Two chunks are highly relevant, one is adjacent-relevant, one is noise. If the marketing page ranked above the setup article, precision drops further. The score depends on exact ranking order — which is exactly the detail this metric surfaces.

Context recall (illustrative ~0.85). Decompose the reference answer into claims: Pro includes SSO, add-on pricing, SAML setup steps. Most are recoverable from the retrieved chunks. The recall hit would come if the reference mentioned, say, that SSO requires an annual contract and no retrieved chunk contained that fact.

Note the pattern: four numbers diagnose four different parts of the pipeline. A single aggregate would blur them into "it's kind of working" — which is exactly what you want to avoid when shipping changes.
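That diagnostic split can be made mechanical. A toy sketch using the illustrative scores from the walkthrough (not real measurements), mapping each metric to the pipeline component it implicates:

```python
# Illustrative scores from the walkthrough above (not real measurements).
scores = {
    "faithfulness": 0.90,        # generator: grounding
    "answer_relevance": 0.95,    # generator: staying on-question
    "context_precision": 0.75,   # retriever: ranking
    "context_recall": 0.85,      # retriever: coverage
}
component = {
    "faithfulness": "generator",
    "answer_relevance": "generator",
    "context_precision": "retriever",
    "context_recall": "retriever",
}

weakest = min(scores, key=scores.get)
print(f"weakest metric: {weakest} -> investigate the {component[weakest]}")
# weakest metric: context_precision -> investigate the retriever
```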

Turning this into a CI check

The shift from occasional manual RAGAS runs to CI-grade regression testing is where the framework earns its keep.

The pattern: assemble a golden set of 50-100 representative questions, each paired with a ground-truth answer and (ideally) the corpus documents that should be retrieved. Version-control it alongside the code. On every merge, run the current RAG pipeline against the golden set, score each question on the four metrics, and persist aggregates as an artifact.

Set thresholds on top of that. If faithfulness drops below some floor — say, 0.85 — the CI job fails. If recall drops by more than some delta from the last merge, the job fails. Exact numbers are tuned per system; the discipline is that they exist and gate merges.
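A minimal sketch of that gate, with hypothetical floors and delta (tune both per system). A real CI job would load `current` and `previous` from the eval artifacts and exit nonzero when any violation is reported:

```python
# Hypothetical per-metric floors and the max allowed drop vs. the last merge.
FLOORS = {"faithfulness": 0.85, "answer_relevance": 0.80,
          "context_precision": 0.70, "context_recall": 0.75}
MAX_DELTA = 0.05

def gate(current: dict, previous: dict) -> list[str]:
    """Return threshold violations; an empty list means the merge passes.
    A CI wrapper would exit nonzero when this is non-empty."""
    failures = []
    for metric, floor in FLOORS.items():
        if current[metric] < floor:
            failures.append(f"{metric} {current[metric]:.2f} below floor {floor:.2f}")
        if previous[metric] - current[metric] > MAX_DELTA:
            failures.append(f"{metric} dropped more than {MAX_DELTA:.2f} since last merge")
    return failures

previous = {"faithfulness": 0.90, "answer_relevance": 0.92,
            "context_precision": 0.78, "context_recall": 0.83}
current = {"faithfulness": 0.88, "answer_relevance": 0.91,
           "context_precision": 0.64, "context_recall": 0.82}
problems = gate(current, previous)  # two violations, both on context_precision
```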

The failures this catches are the quiet ones. A prompt tweak that made the generator more confident but less grounded. An embedding-model swap that hurt recall on long-tail questions. A chunking change that looked fine on the first five examples the author checked. None show up in unit tests; all show up in RAGAS scores.

This is the same pattern covered in the broader retrieval-augmented prompting patterns post, applied specifically to eval. It is also what the Context Engineering Maturity Model points at with Level 5: context quality observed with metrics, not with intuition.

RAGAS vs. LLM-as-Judge vs. custom eval

RAGAS is one choice in a small menu.

LLM-as-judge is the most general pattern. You define a rubric, hand it to a judge, get structured scores. Flexible — any rubric, domain, format — but you own the harness, the judge choice, and the bias mitigations.

RAGAS is the opinionated version for RAG. It fixes the four metrics and the judge prompts underneath them, so you do not have to design a rubric for a well-understood problem. For generic RAG, that is a feature. For highly domain-specific RAG (medical, legal, enterprise), the default prompts sometimes need tuning.

Custom evals cover what RAGAS cannot: product rules, compliance, tone. If your support RAG must never recommend a competitor, that is a programmatic check. If it must cite by internal doc ID, that is a regex. Custom evals run alongside RAGAS, not instead.

Serious RAG teams usually run all three: RAGAS for core quality, LLM-as-judge for product-specific rubrics, programmatic checks for hard rules. RAGAS is the middle tier — more structured than vibes, less bespoke than a full in-house eval harness.

Failure modes

Four anti-patterns worth flagging.

Running precision/recall without ground truth and trusting the numbers. The framework will compute something, but without real ground truth the scores are noise. No golden set yet? Run only faithfulness and answer relevance, and be honest that retrieval quality is uninstrumented.

Judge-model self-preference bias. If generator and judge are the same family, expect optimistic faithfulness scores. The LLM-as-Judge Prompting Guide covers this in detail. The cheapest mitigation is a different family for judge vs. generator.

Golden set drift. A golden set assembled six months ago may no longer represent current traffic, corpus, or product scope. Scores go up while real quality goes sideways. Re-sample from recent production traffic at least quarterly.

Treating RAGAS as a benchmark instead of a regression test. Public benchmark numbers tell you little about your system on your corpus. The useful comparison is you-today vs. you-last-week, not you-vs-the-leaderboard.

Our position

Five opinionated stances.

  • Ship faithfulness first. It runs without ground truth, catches the failure users complain loudest about (fabrication), and forces you to name your judge. Add the other three as your golden set matures.
  • The golden set is the asset, not the framework. Any framework can compute metrics. The thing that makes your eval load-bearing is the 50-100 question golden set with real ground truth, versioned alongside your code. RAGAS is a nice wrapper around a harder problem.
  • Different model families for generator and judge. Self-preference bias is well-documented. Cross-family evaluation is a cheap, high-leverage mitigation.
  • CI thresholds beat dashboards. A dashboard nobody looks at is not an eval. A CI job that blocks merges on a faithfulness regression is. Set thresholds conservatively and adjust when they bite too often.
  • RAGAS is not a substitute for prompt quality. A brilliantly evaluated bad prompt is still a bad prompt. Score prompts against the SurePrompts Quality Rubric and pipelines against RAGAS. You need both.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
