
RAGAS Evaluation: A Walkthrough for Quantifying RAG Quality

RAGAS measures RAG systems across four metrics — faithfulness, answer relevance, context precision, and context recall. This tutorial walks through each metric on a hypothetical customer-support RAG system.

SurePrompts Team
April 22, 2026
11 min read

TL;DR

RAGAS scores RAG systems on four metrics — faithfulness and answer relevance for the generator, context precision and context recall for the retriever. This walkthrough applies each metric to a hypothetical customer-support RAG, then shows how to wire the whole thing into a CI regression check.

Key takeaways:

  • Standard LLM evals miss RAG-specific failure modes. A RAG system can emit fluent, confident answers while being grounded in the wrong documents — or no documents at all.
  • RAGAS splits evaluation into retrieval-side and generation-side metrics. The retriever can be excellent while the generator fabricates, or vice versa, and a single-score eval cannot tell you which.
  • Two of the four metrics — faithfulness and answer relevance — run without ground-truth answers. The other two require a golden set. Start where you are.
  • Faithfulness uses an LLM-as-judge pattern under the hood. Everything said in the LLM-as-Judge Prompting Guide about judge biases applies to RAGAS too.
  • The real value shows up in CI. RAGAS on a golden set turns "did we just regress?" from a vibe into a dashboard.
  • RAGAS is opinionated for RAG. Pair it with the SurePrompts Quality Rubric for prompt quality and the Context Engineering Maturity Model for the broader discipline.

Why RAG needs its own eval framework

A generic LLM eval looks at a question and an answer and asks whether the answer is good. That works for closed-book tasks where the model is supposed to know the answer. It does not work for RAG, where the model is supposed to answer from retrieved context and where "good" has at least three separable failure modes.

Failure one: the retriever pulled the wrong documents, and the generator answered confidently from them anyway. Failure two: the retriever pulled the right documents, but the generator drifted — leaned on training data, hallucinated a citation, or smoothed out a nuance that mattered. The context was fine; the grounding failed. Failure three: the retriever missed a document that existed in the corpus, so the generator answered from partial information.

A single "is this answer good" score cannot distinguish these. You need metrics that look at the retrieval step and the generation step separately. That is the problem RAGAS was built for. The RAG prompt engineering guide covers how to build the system; RAGAS covers how to tell, with numbers, whether what you built is working.

The four metrics, in practice

RAGAS defines four core metrics. Two evaluate the generator; two evaluate the retriever. All four are framework-agnostic — they evaluate the inputs and outputs of the RAG pipeline, not the internals of whichever retriever or LLM you happen to use.

Faithfulness — answer stays within retrieved context

Faithfulness asks: of the factual claims in the generated answer, how many are directly supported by the retrieved context? If the answer makes five factual claims and four are traceable to the retrieved documents and one is invented or imported from training data, faithfulness scores around 0.80.

Under the hood, this is an LLM-as-judge pattern. The framework uses a judge model to extract atomic claims from the answer, then asks the judge, per claim, whether the retrieved context entails it. Claim-level verdicts aggregate into the overall score. Everything covered in the LLM-as-Judge Prompting Guide — position bias, verbosity bias, self-preference bias — applies here. A weak judge produces a weak faithfulness score.

Faithfulness does not require a ground-truth answer. It only needs the question, the retrieved context, and the answer. That makes it the cheapest RAGAS metric to bootstrap — you can run it on live production traffic without waiting for a curated golden set.
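The arithmetic behind the score is simple once the judge has issued its verdicts. A minimal sketch, with claim extraction and entailment stubbed out as a precomputed list of booleans (in RAGAS these come from judge-model calls):

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of the answer's atomic claims the judge found
    supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Five claims extracted from the answer; the judge supports four.
verdicts = [True, True, True, True, False]
print(faithfulness_score(verdicts))  # 0.8
```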

Answer relevance — answer addresses the question

Answer relevance asks: does the generated answer actually address what the user asked, or does it drift, pad, or answer a related-but-different question? A correct-but-off-topic answer fails here, which is a distinct failure from faithfulness.

The implementation is roughly: given the answer, use an LLM to generate questions the answer could plausibly be responding to, then measure semantic similarity between those and the actual user question. If the answer is on-target, the generated questions cluster tightly around the real one. If the answer drifted, they drift too.

Like faithfulness, answer relevance does not require ground truth. It catches a failure programmatic checks usually miss: the answer is technically correct but is answering the wrong question — common when retrieval succeeds on a related topic and the generator follows the documents instead of the query.
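The similarity step can be sketched with toy 3-dimensional vectors standing in for real embeddings; in practice the vectors come from an embedding model and the reverse-generated questions from an LLM:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def answer_relevance(question_vec, generated_question_vecs) -> float:
    """Mean similarity between the real question and the questions
    reverse-generated from the answer."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)

# Two reverse-generated questions cluster near the real one; the third
# drifts toward a related-but-different topic, pulling the score down.
real_q = [1.0, 0.0, 0.0]
generated = [[0.9, 0.1, 0.0], [1.0, 0.05, 0.0], [0.2, 0.9, 0.1]]
score = answer_relevance(real_q, generated)  # high, but below 1.0
```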

Context precision — retrieved documents ranked by relevance

Context precision asks: of the documents your retriever returned, how well are the relevant ones ranked near the top? A retriever that puts the relevant chunk at position 8 scores worse than one that puts it at position 1, even if both return the same information.

This matters because most RAG systems truncate context — only the top N chunks reach the generator. If the relevant chunk sits at position 8 and you truncate at 5, the generator never sees it, and downstream metrics suffer for reasons that look like a generator problem but are really a ranking problem.

Context precision requires ground truth: either a reference answer (from which relevance per chunk is inferred) or explicit relevance labels. Without one of those, you cannot score precision — only confirm that the retriever returned something.
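One common formulation, close in spirit to what ragas computes, averages precision@k over the ranks where a relevant chunk appears, so early relevant chunks are rewarded and buried ones penalized. A sketch, with per-chunk relevance labels assumed given:

```python
def context_precision(relevance: list[bool]) -> float:
    """Average precision@k over the ranks holding relevant chunks.
    relevance[i] is True if the chunk retrieved at rank i+1 is
    relevant to the reference answer."""
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Same single relevant chunk, ranked first vs. buried at rank 4:
print(context_precision([True, False, False, False]))   # 1.0
print(context_precision([False, False, False, True]))   # 0.25
```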

Context recall — all relevant docs retrieved

Context recall asks: of the information needed to answer the question, how much of it is actually present in the retrieved context? If the correct answer requires three facts and only two of them appear in the retrieved chunks, recall is about 0.67, and the generator is structurally incapable of producing a complete answer.

Context recall is the metric that catches "the answer exists in your corpus but the retriever never found it" — a failure mode that is invisible to faithfulness and answer relevance, because both of those only look at what was retrieved. If the retriever missed, they silently pass.

Recall also requires ground truth — specifically, a reference answer. The framework decomposes the reference into its component claims and checks which claims are present in the retrieved context. Missing claims drag the score down.
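The arithmetic, sketched with naive substring matching as a toy stand-in for the judge-model attribution step (real frameworks ask an LLM whether each reference claim is attributable to the context):

```python
def context_recall(reference_claims: list[str], retrieved_context: str) -> float:
    """Fraction of reference-answer claims recoverable from the
    retrieved context. Substring matching here is a toy stand-in
    for LLM-based attribution."""
    if not reference_claims:
        return 0.0
    supported = sum(claim in retrieved_context for claim in reference_claims)
    return supported / len(reference_claims)

claims = ["Pro includes SSO", "SSO costs $10/seat/month", "setup uses SAML"]
context = "Pro includes SSO. SSO costs $10/seat/month as an add-on."
print(round(context_recall(claims, context), 2))  # 0.67
```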

Worked example — a hypothetical customer support RAG

Hypothetical scenario, not a shipped product. A SaaS company runs a customer-support RAG over its help center. The user asks:

"Does the Pro plan include SSO, and what's the setup process?"

The retriever pulls four chunks:

  • A pricing page excerpt stating Pro includes SAML SSO as an add-on for $10/seat/month.
  • A help-center article titled "Setting up SSO" that walks through SAML configuration step by step.
  • A changelog entry from 2024 about adding SCIM provisioning.
  • A marketing comparison page that mentions SSO in a feature table.

The generator produces:

"Yes, the Pro plan includes SAML SSO as an add-on priced at $10 per seat per month. To set it up, configure your identity provider with the SAML endpoints in your admin settings, upload your IdP metadata, map user attributes, and enable SSO for your workspace. We also support SCIM provisioning for automatic user lifecycle management."

Now walk through each metric. All scores are illustrative, not real measurements.

Faithfulness (illustrative ~0.9). Most claims are supported. Pricing maps to chunk 1, setup steps to chunk 2, SCIM to chunk 3. What would drag the score below 1.0 is "for automatic user lifecycle management" — chunk 3 mentioned SCIM existed but (hypothetically) did not describe what it does. The generator imported that framing from training data.

Answer relevance (illustrative ~0.95). The user asked two things — availability and setup — and the answer addresses both. The SCIM tangent is on-topic enough that a judge would likely score high, though a stricter rubric might penalize it as drift.

Context precision (illustrative ~0.75). Two chunks are highly relevant, one is adjacent-relevant, one is noise. If the marketing page ranked above the setup article, precision drops further. The score depends on exact ranking order — which is exactly the detail this metric surfaces.

Context recall (illustrative ~0.85). Decompose the reference answer into claims: Pro includes SSO, add-on pricing, SAML setup steps. Most are recoverable from the retrieved chunks. The recall hit would come if the reference mentioned, say, that SSO requires an annual contract and no retrieved chunk contained that fact.

Note the pattern: four numbers diagnose four different parts of the pipeline. A single aggregate would blur them into "it's kind of working" — which is exactly what you want to avoid when shipping changes.
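That diagnostic split can be made mechanical. A toy sketch using the illustrative scores from the walkthrough (not real measurements), mapping each metric to the pipeline component it implicates:

```python
# Illustrative scores from the walkthrough above (not real measurements).
scores = {
    "faithfulness": 0.90,        # generator: grounding
    "answer_relevance": 0.95,    # generator: staying on-question
    "context_precision": 0.75,   # retriever: ranking
    "context_recall": 0.85,      # retriever: coverage
}
component = {
    "faithfulness": "generator",
    "answer_relevance": "generator",
    "context_precision": "retriever",
    "context_recall": "retriever",
}

weakest = min(scores, key=scores.get)
print(f"weakest metric: {weakest} -> investigate the {component[weakest]}")
# weakest metric: context_precision -> investigate the retriever
```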

Turning this into a CI check

The shift from occasional manual RAGAS runs to CI-grade regression testing is where the framework earns its keep.

The pattern: assemble a golden set of 50-100 representative questions, each paired with a ground-truth answer and (ideally) the corpus documents that should be retrieved. Version-control it alongside the code. On every merge, run the current RAG pipeline against the golden set, score each question on the four metrics, and persist aggregates as an artifact.

Set thresholds on top of that. If faithfulness drops below some floor — say, 0.85 — the CI job fails. If recall drops by more than some delta from the last merge, the job fails. Exact numbers are tuned per system; the discipline is that they exist and gate merges.
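A minimal sketch of that gate, with hypothetical floors and delta (tune both per system). A real CI job would load `current` and `previous` from the eval artifacts and exit nonzero when any violation is reported:

```python
# Hypothetical per-metric floors and the max allowed drop vs. the last merge.
FLOORS = {"faithfulness": 0.85, "answer_relevance": 0.80,
          "context_precision": 0.70, "context_recall": 0.75}
MAX_DELTA = 0.05

def gate(current: dict, previous: dict) -> list[str]:
    """Return threshold violations; an empty list means the merge passes.
    A CI wrapper would exit nonzero when this is non-empty."""
    failures = []
    for metric, floor in FLOORS.items():
        if current[metric] < floor:
            failures.append(f"{metric} {current[metric]:.2f} below floor {floor:.2f}")
        if previous[metric] - current[metric] > MAX_DELTA:
            failures.append(f"{metric} dropped more than {MAX_DELTA:.2f} since last merge")
    return failures

previous = {"faithfulness": 0.90, "answer_relevance": 0.92,
            "context_precision": 0.78, "context_recall": 0.83}
current = {"faithfulness": 0.88, "answer_relevance": 0.91,
           "context_precision": 0.64, "context_recall": 0.82}
problems = gate(current, previous)  # two violations, both on context_precision
```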

The failures this catches are the quiet ones. A prompt tweak that made the generator more confident but less grounded. An embedding-model swap that hurt recall on long-tail questions. A chunking change that looked fine on the first five examples the author checked. None show up in unit tests; all show up in RAGAS scores.

This is the same pattern covered in the broader retrieval-augmented prompting patterns post, applied specifically to eval. It is also what the Context Engineering Maturity Model points at with Level 5: context quality observed with metrics, not with intuition.

RAGAS vs. LLM-as-Judge vs. custom eval

RAGAS is one choice in a small menu.

LLM-as-judge is the most general pattern. You define a rubric, hand it to a judge, get structured scores. Flexible — any rubric, domain, format — but you own the harness, the judge choice, and the bias mitigations.

RAGAS is the opinionated version for RAG. It fixes the four metrics and the judge prompts underneath them, so you do not have to design a rubric for a well-understood problem. For generic RAG, that is a feature. For highly domain-specific RAG (medical, legal, enterprise), the default prompts sometimes need tuning.

Custom evals cover what RAGAS cannot: product rules, compliance, tone. If your support RAG must never recommend a competitor, that is a programmatic check. If it must cite by internal doc ID, that is a regex. Custom evals run alongside RAGAS, not instead.

Serious RAG teams usually run all three: RAGAS for core quality, LLM-as-judge for product-specific rubrics, programmatic checks for hard rules. RAGAS is the middle tier — more structured than vibes, less bespoke than a full in-house eval harness.

Failure modes

Four anti-patterns worth flagging.

Running precision/recall without ground truth and trusting the numbers. The framework will compute something, but without real ground truth the scores are noise. No golden set yet? Run only faithfulness and answer relevance, and be honest that retrieval quality is uninstrumented.

Judge-model self-preference bias. If generator and judge are the same family, expect optimistic faithfulness scores. The LLM-as-Judge Prompting Guide covers this in detail. The cheapest mitigation is a different family for judge vs. generator.

Golden set drift. A golden set assembled six months ago may no longer represent current traffic, corpus, or product scope. Scores go up while real quality goes sideways. Re-sample from recent production traffic at least quarterly.

Treating RAGAS as a benchmark instead of a regression test. Public benchmark numbers tell you little about your system on your corpus. The useful comparison is you-today vs. you-last-week, not you-vs-the-leaderboard.

Our position

Five opinionated stances.

  • Ship faithfulness first. It runs without ground truth, catches the failure users complain loudest about (fabrication), and forces you to name your judge. Add the other three as your golden set matures.
  • The golden set is the asset, not the framework. Any framework can compute metrics. The thing that makes your eval load-bearing is the 50-100 question golden set with real ground truth, versioned alongside your code. RAGAS is a nice wrapper around a harder problem.
  • Different model families for generator and judge. Self-preference bias is well-documented. Cross-family evaluation is a cheap, high-leverage mitigation.
  • CI thresholds beat dashboards. A dashboard nobody looks at is not an eval. A CI job that blocks merges on a faithfulness regression is. Set thresholds conservatively and adjust when they bite too often.
  • RAGAS is not a substitute for prompt quality. A brilliantly evaluated bad prompt is still a bad prompt. Score prompts against the SurePrompts Quality Rubric and pipelines against RAGAS. You need both.

Try it yourself

Build expert-level prompts from plain English with SurePrompts — 350+ templates with real-time preview.

Open Prompt Builder
