Tip
TL;DR: Corrective RAG adds a grader between your retriever and your generator. High confidence goes straight through. Low confidence triggers query rewriting and a web-search fallback instead of hallucinating on weak context. Ambiguous confidence blends both. The extra grader call costs one or two cents per query; the payoff is a sharp drop in confident-sounding wrong answers on out-of-corpus questions.
Key takeaways:
- Plain RAG silently fails on weak retrieval. The retriever returns top-k regardless of whether any of the k are relevant, and the generator produces fluent output either way.
- CRAG inserts a relevance grader and a router. The router can do more than generate — it can rewrite the query, fall back to web search, or refuse.
- The three branches — correct, incorrect, ambiguous — are the core of the pattern. Each branch has a different downstream action.
- The grader is the load-bearing piece. A bad grader is worse than no grader: it routes confidently to the wrong branch.
- CRAG and Self-RAG target the same failure mode but live in different places. CRAG is an external control loop; Self-RAG is learned behavior inside a fine-tuned model.
- Evaluate the full pipeline with RAGAS and the grader in isolation with a labeled relevance set. One without the other hides regressions.
Why plain RAG silently fails on weak retrieval
The standard RAG contract looks simple. The user asks a question. The retriever pulls top-k chunks. The generator answers from them. Shipped.
The hidden assumption is that some of the top-k chunks actually answer the question. When that holds, RAG works. When it fails — and it fails in predictable situations — plain RAG produces confident wrong answers with no internal signal that anything went wrong.
The failure is structural. A vector retriever does not return "relevant" documents; it returns the nearest documents in embedding space. If your corpus contains nothing relevant, it returns the least-irrelevant chunks. BM25 does not say "no match" when term overlap is low; it ranks whatever matched, however weakly. The retriever's interface — top-k by some score — does not distinguish "these five chunks are great" from "these five chunks are the best of a bad lot."
The generator inherits that ambiguity and resolves it the wrong way. Modern LLMs are trained to produce fluent, confident output from whatever context they receive. Hand them five marginal chunks and a question those chunks do not answer, and they will still produce a fluent, confident answer — often fabricating bridging claims to connect the marginal context to the question.
This is the silent-failure problem — covered in the retrieval-augmented prompting patterns post. The system is most dangerous when it is most wrong, because the output looks identical to a correct answer. Users cannot tell the difference; logs cannot tell the difference. Only ground-truth evaluation can, and most teams do not run it on every query.
CRAG attacks this directly. It inserts a step the retrieval pipeline had been missing: an explicit check on whether the retrieved context is good enough to answer from. If the check fails, the pipeline does something other than generate.
How the grading step works
The grader sits between the retriever and the generator. It takes as input the query and each retrieved document, and produces a relevance score — usually a float in [0, 1] or a discrete label from a small set like {correct, ambiguous, incorrect}.
Two implementation choices.
Dedicated classifier. A small cross-encoder or a fine-tuned relevance model trained on query-document pairs. Fast at inference — a few milliseconds per pair on commodity GPUs, sometimes cheaper on CPU. Cheap per call. The downside is training data: you need thousands of labeled query-document pairs to fine-tune well, and the labels have to match your actual relevance criterion rather than a generic proxy. Teams with mature eval pipelines usually land here eventually.
LLM-as-judge with a relevance rubric. A prompt to a smaller model — a Haiku-tier or GPT-4o-mini-tier model typically — that asks, per chunk: "does this document contain information that would help answer this query? Rate 0-1." Slower than a classifier (a network round-trip per chunk, or batched across chunks) and priced per token, but it works immediately with zero training data. The LLM-as-judge prompting guide covers the rubric design in detail; the CRAG-specific part is keeping the prompt narrow — you are grading relevance only, not answer quality, not factuality.
Most teams start with LLM-as-judge to validate the pattern end-to-end, then swap in a distilled classifier once they have enough production data. Either way, one property is non-negotiable: the grader must not be the same model as the generator. Self-preference bias is real, and a same-family grader will over-approve documents the generator will happily accept.
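The LLM-as-judge variant can be sketched as a narrow rubric prompt plus a defensive score parser. This is an illustrative sketch, not a standard rubric: the prompt wording, the `SCORE:` convention, and the `llm_call` interface are all assumptions. The one load-bearing choice is failing closed — an unparseable judge reply counts as irrelevant, never as approval.

```python
import re

# Hypothetical rubric prompt -- wording is illustrative. Per the pattern,
# it grades relevance ONLY: not answer quality, not factuality.
GRADER_PROMPT = """You are grading retrieval relevance, nothing else.
Query: {query}
Document: {document}
Does this document contain information that would help answer the query?
Reply with a single number between 0 and 1, e.g. SCORE: 0.7"""


def parse_relevance_score(raw: str) -> float:
    """Extract a 0-1 score from the judge's reply; fail closed on garbage."""
    match = re.search(r"SCORE:\s*([01](?:\.\d+)?)", raw)
    if match is None:
        return 0.0  # unparseable reply counts as irrelevant, never as approval
    return min(1.0, max(0.0, float(match.group(1))))


def grade_chunk(llm_call, query: str, document: str) -> float:
    """llm_call is any callable prompt -> str. Per the text, it must NOT be
    the same model family as the generator."""
    reply = llm_call(GRADER_PROMPT.format(query=query, document=document))
    return parse_relevance_score(reply)
```

Swapping in a cross-encoder later means replacing only `grade_chunk`; the rest of the pipeline sees the same float.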
Per-chunk scores aggregate into an overall confidence. The simplest aggregation is max(scores) — if any chunk is highly relevant, the retrieval is good enough. A stricter aggregation is mean(scores) or count(score > threshold). The right aggregation depends on whether your generator tolerates a mix of relevant and irrelevant chunks or gets confused by noise. A tolerant generator can use max; a noise-sensitive one should use mean and a stricter threshold.
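The three aggregation strategies fit in one small function. A minimal sketch — the function name, signature, and `count` normalization are illustrative choices, not part of the CRAG spec:

```python
def aggregate_confidence(scores: list[float], strategy: str = "max",
                         threshold: float = 0.5) -> float:
    """Collapse per-chunk relevance scores into one routing confidence.

    Per the tradeoff above: max suits generators that tolerate noisy context,
    mean suits noise-sensitive ones, count gives the fraction of chunks that
    individually clear a per-chunk threshold.
    """
    if not scores:
        return 0.0  # empty retrieval is maximally low confidence
    if strategy == "max":
        return max(scores)
    if strategy == "mean":
        return sum(scores) / len(scores)
    if strategy == "count":
        return sum(s > threshold for s in scores) / len(scores)
    raise ValueError(f"unknown strategy: {strategy}")
```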
The three branches
Based on the aggregate confidence, CRAG routes each query into one of three branches.
Correct (high confidence). The retrieved documents look good. Forward them to the generator as in plain RAG. This is the happy path; on a well-indexed corpus answering in-scope questions, most queries land here. Cost: one extra grader call on top of baseline RAG. Behavior on this branch is identical to plain RAG — the point of the grader is not to improve good queries, it is to catch bad ones.
Incorrect (low confidence). None of the retrieved documents meaningfully answer the query. Do not generate from them. Instead, take a corrective action. Two common ones:
- Query rewriting + re-retrieval. Rephrase the query using an LLM — normalize abbreviations, expand implicit context, split a compound query into sub-queries — and run retrieval again. Grade again. If the rewritten query now retrieves confidently, proceed. If not, escalate.
- Web search fallback. Send the query (rewritten or not) to a web search API, pull the top results, and use those as the context for generation. This is particularly useful when the question is out-of-corpus — your knowledge base does not contain the answer, but the open web does. Always constrain the search to an allowlist for trust-sensitive domains.
When both fail, the pipeline should refuse rather than answer. A refusal is expensive in UX terms, but a confident wrong answer is more expensive in trust terms.
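The incorrect branch is an escalation ladder: rewrite first, web fallback second, refusal last. In the sketch below every collaborator (`rewrite`, `retrieve`, `grade`, `web_search`, `generate`) is a hypothetical callable standing in for a real component; only the ordering and the refusal floor come from the pattern itself.

```python
REFUSAL = "I don't have reliable information to answer that."


def corrective_branch(query, rewrite, retrieve, grade, web_search, generate,
                      ok_threshold=0.7):
    """Escalation ladder for the incorrect branch.

    Assumed interfaces (all hypothetical): rewrite(q) -> q', retrieve(q) -> docs,
    grade(q, docs) -> float, web_search(q) -> docs, generate(q, docs) -> str.
    """
    # Attempt 1: rewrite the query and re-retrieve from the internal corpus.
    rewritten = rewrite(query)
    docs = retrieve(rewritten)
    if grade(rewritten, docs) >= ok_threshold:
        return generate(rewritten, docs)

    # Attempt 2: web-search fallback. The search API should be constrained
    # to an allowlist of trustworthy domains.
    web_docs = web_search(rewritten)
    if web_docs and grade(rewritten, web_docs) >= ok_threshold:
        return generate(rewritten, web_docs)

    # Both failed: refuse rather than generate from weak context.
    return REFUSAL
```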
Ambiguous (middle confidence). Some documents look decent, none look great. Mix internal documents with web results, pass the combined context to the generator, and instruct it to prefer the internal documents when they conflict with the web. This branch exists because the binary correct/incorrect partition is too crude in practice — lots of queries land in the middle, and throwing away the decent internal chunks to start fresh with web search would be wasteful.
The branch boundaries are thresholds on the aggregate confidence. Tune them per system. A reasonable starting point: correct above 0.7, incorrect below 0.4, ambiguous in between. The exact numbers matter less than the discipline of having three explicit branches with different downstream actions.
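The router itself reduces to two threshold comparisons. A sketch, using the 0.7/0.4 starting points suggested above as defaults (tune per system):

```python
def route(confidence: float, correct_t: float = 0.7,
          incorrect_t: float = 0.4) -> str:
    """Three-way router on aggregate grader confidence."""
    if confidence >= correct_t:
        return "correct"    # forward retrieved docs to the generator as-is
    if confidence < incorrect_t:
        return "incorrect"  # rewrite / web fallback / refuse
    return "ambiguous"      # blend internal docs with web results
```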
Worked example — a hypothetical out-of-corpus query
Hypothetical scenario, not a shipped product. A SaaS company runs a support assistant over its help-center corpus. A user types:
"Does your integration work with Salesforce Industries Cloud's latest omnichannel routing?"
The company has never built that integration. The corpus contains generic Salesforce docs and a few pages about standard Salesforce Sales Cloud integration. Nothing about Industries Cloud. Nothing about omnichannel routing.
Plain RAG path. Retriever pulls the three nearest chunks: the generic Salesforce connector doc, a 2023 product update mentioning "expanded Salesforce support," and a marketing page comparing integrations. The generator writes: "Yes, our integration supports Salesforce Industries Cloud's omnichannel routing via our Salesforce connector. Configure it in Settings → Integrations → Salesforce, select Industries Cloud, and enable omnichannel sync." Confident. Specific. Wrong.
CRAG path. Same retrieval. Grader scores each chunk against the query.
- Generic connector doc: 0.35 ("mentions Salesforce, does not mention Industries Cloud or omnichannel routing").
- 2023 product update: 0.25 ("mentions expanded Salesforce support, does not specify Industries Cloud").
- Marketing comparison page: 0.15 ("generic feature table, no specific claim").
Max score 0.35. Mean 0.25. Both below the incorrect threshold. Router takes the corrective branch.
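Plugging those scores into the suggested default thresholds confirms the decision (the dict keys are shorthand for the three chunks above):

```python
scores = {"connector_doc": 0.35, "product_update": 0.25, "marketing_page": 0.15}

agg_max = max(scores.values())                 # 0.35
agg_mean = sum(scores.values()) / len(scores)  # ~0.25

INCORRECT_THRESHOLD = 0.4  # the suggested default boundary

# Both aggregations fall below the incorrect threshold: corrective branch.
assert agg_max < INCORRECT_THRESHOLD
assert agg_mean < INCORRECT_THRESHOLD
```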
First attempt: query rewriting. The rewriter produces: "Salesforce Industries Cloud integration support in [Product]." Re-retrieval. Same generic chunks. Grader scores drop further on the rewritten query because it is more specific. Still incorrect.
Second attempt: web-search fallback. Query goes to a search API constrained to salesforce.com, the company's own blog, and major trade publications. Results: Salesforce's own docs describe Industries Cloud omnichannel routing; the company's blog has no relevant post.
Generator runs on the web results plus a brief note that the internal corpus has no matching integration. Output: "Salesforce Industries Cloud's omnichannel routing is a Salesforce-native capability. Our integration supports Salesforce Sales Cloud and Service Cloud via the standard connector, but does not currently support Industries Cloud's omnichannel routing as a first-class integration. If you need this, please contact support to discuss your use case."
The second answer is less impressive than the first. It is also the correct answer. Users who get the second answer ask a follow-up question or file a feature request. Users who get the first answer spend two hours trying to find a setting that does not exist and then churn.
Cost and latency tradeoffs
On the happy path, CRAG adds one grader call to baseline RAG. With a small classifier, that is milliseconds and a fraction of a cent. With an LLM-as-judge call batched across k chunks, it is 100-500ms and ~1-2x the retrieval cost. Either is tolerable for most interactive workloads.
The cost lives in the fallback branch. When confidence is low, you may run:
- A query-rewriting LLM call.
- A second retrieval against the internal corpus.
- A second grader call.
- A web-search API call.
- Potentially a third retrieval against web results.
- The generator call.
That is 4-6 LLM/API calls where plain RAG would have run 2. Latency on p95 goes up correspondingly — often from ~2s to ~5-8s for fallback queries. Dollar cost per fallback query is roughly 3-5x baseline.
The bet you are making is that fallback queries are a minority of traffic, and that the fallback queries are exactly the ones where plain RAG would have silently hallucinated. If both conditions hold, CRAG is a strong trade: you pay more on a small slice of queries to avoid the queries where plain RAG was most dangerous. If the fallback fires on the majority of queries, you have a corpus-coverage problem, not a retrieval problem, and CRAG is treating a symptom.
A useful discipline: track fallback_rate as a first-class metric. If it climbs above some threshold — 15-20% of traffic, say — escalate to corpus review. Pair this with the measurement discipline described in the RAGAS evaluation walkthrough: CRAG reduces silent hallucination, but only measurement can tell you by how much.
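Tracking the fallback rate as a first-class metric needs very little code. A minimal sketch, assuming the router reports one of the three branch labels per query; the class name and the 15% alert threshold are illustrative, and in production this would feed a real dashboard:

```python
from collections import Counter


class RouterMetrics:
    """Counts router branch decisions and flags a creeping fallback rate."""

    def __init__(self, fallback_alert_threshold: float = 0.15):
        self.branches = Counter()
        self.fallback_alert_threshold = fallback_alert_threshold

    def record(self, branch: str) -> None:
        self.branches[branch] += 1

    @property
    def fallback_rate(self) -> float:
        total = sum(self.branches.values())
        return self.branches["incorrect"] / total if total else 0.0

    def needs_corpus_review(self) -> bool:
        # Above the 15-20% band, the problem is corpus coverage,
        # not retrieval -- CRAG is treating a symptom.
        return self.fallback_rate > self.fallback_alert_threshold
```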
Failure modes
Four anti-patterns worth flagging.
Grader bias toward approval. LLM judges, especially generous ones, tend to score borderline documents as "sort of relevant" rather than commit to "incorrect." The fallback branch rarely fires and CRAG collapses back to plain RAG with extra cost. Calibrate the grader on a labeled set; if precision on the correct label is below ~0.85, tighten the rubric or lower the threshold.
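The calibration check reduces to precision on the grader's "correct" label: of the documents the grader approved, how many were truly relevant per your labeled set. A sketch, assuming predictions and gold labels are parallel lists of branch labels:

```python
def precision_on_correct(predictions: list[str], labels: list[str]) -> float:
    """Precision of the grader's 'correct' label against a labeled set.

    Per the guidance above: below ~0.85, tighten the rubric or
    lower the threshold.
    """
    approved = [(p, l) for p, l in zip(predictions, labels) if p == "correct"]
    if not approved:
        return 0.0  # grader never approved anything; precision undefined
    return sum(l == "correct" for _, l in approved) / len(approved)
```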
Same-family grader and generator. Using Claude to grade documents for a Claude generator, or GPT-4 for GPT-4, produces self-preference bias — the grader approves documents the generator would accept, regardless of whether they actually answer the question. Use a different model family for grading. This is the same issue covered in the LLM-as-judge prompting guide, applied to retrieval.
Web-search fallback to the open internet. Searching the open web and piping results into your generator imports the web's grounding problems — ad content, AI-generated SEO pages, content farms. Constrain the search API to an allowlist of trustworthy domains. Better to refuse than to cite a fabricated tutorial from a content farm.
Treating CRAG as a substitute for reranking or better retrieval. CRAG improves behavior on queries where retrieval returned weak results. It does not improve the retrieval itself. If your first-stage retriever is weak, fix it first — a rerank layer, a better embedding model, hybrid search. CRAG on top of broken retrieval is expensive and still underperforms good retrieval without CRAG.
Our position
Five opinionated stances.
- CRAG belongs on every RAG pipeline that accepts out-of-scope questions. If your corpus has tight scope control — a closed, curated knowledge base answering a tight class of questions — skip it. If your corpus could receive questions it cannot answer (every production support bot), the grader and router pay for themselves.
- Start with LLM-as-judge, migrate to a distilled classifier. LLM-as-judge validates the pattern in a week; a classifier ships in a month. Running LLM-as-judge in production is fine for moderate traffic. At high traffic, the inference cost justifies distilling.
- Use different model families for grader and generator. Self-preference bias is cheap to avoid and expensive to debug. Haiku grading Sonnet, or GPT-4o-mini grading Claude, are both reasonable pairings.
- Instrument the router, not just the generator. Track fallback_rate and refusal_rate as product metrics. A fallback rate that creeps up over time is a corpus-coverage signal; a refusal rate that creeps up is a trust signal. Neither shows up in standard RAG dashboards.
- Self-refine is complementary, not redundant. CRAG fixes the input to the generator when retrieval is weak. Self-refine fixes the output of the generator when the first draft is weak. Ambitious teams budget for both; teams with limited time should ship CRAG first because its impact on silent hallucination is larger.
CRAG is in the same family as other retrieval-side correctors — agentic RAG with retrieval as a tool call, Self-RAG with reflection tokens inside the model. All three share a premise: retrieval is not a deterministic subroutine to trust, it is a noisy signal to evaluate. The Context Engineering Maturity Model puts this kind of retrieval-aware pipeline at Level 4 and above: context is not just assembled, it is evaluated before it reaches the generator. For the prompt side of the same discipline, the SurePrompts Quality Rubric covers generator-side quality, and the RCAF prompt structure covers the scaffold those prompts sit inside. The agentic prompt stack describes where a CRAG pipeline fits inside a broader agent loop — the router is just one tool call among many.