Tip
TL;DR: Chunking is usually the largest single quality lever in a RAG pipeline. This walkthrough compares fixed-size, semantic, recursive, and parent-document chunking on a hypothetical legal-research assistant — showing what bad chunking looks like, how to diagnose it, and how to fix it. The default answer for most production systems is recursive splitting plus parent-document retrieval.
Key takeaways:
- Chunking caps the ceiling of everything downstream. No embedding model, reranker, or prompt tweak can recover information the chunk boundary destroyed.
- Fixed-size chunking is a fine baseline and a poor ceiling. It ignores document structure and routinely splits sentences and arguments mid-thought.
- Semantic chunking respects natural boundaries but can produce wildly uneven sizes, which hurts retrieval calibration.
- Recursive chunking — split on the biggest separator first, fall back to finer ones only when needed — is the right default for heterogeneous corpora.
- Parent-document retrieval decouples the retrieval unit from the generation unit: index small, return large. On long-form documents, this alone often moves faithfulness several points.
- Treat chunk size as a tunable parameter against an eval set, not a number you pick once and forget. See the RAGAS walkthrough for how to wire this into CI.
Why chunking is usually the biggest RAG quality lever
The instinct for most teams starting a RAG project is to spend their first week agonizing over the embedding model and the vector database. Those choices matter, but they are not where most real quality is won or lost. The piece that runs upstream of both — the decision of where to draw the line between one chunk and the next — is where the ceiling of the whole pipeline gets set.
The reason is structural. An embedding represents the chunk it is handed. If that chunk splits a sentence across the boundary, or packs five topics into 2,000 tokens, or quotes a regulation without the clause that qualifies it, no embedding model can rescue it. The retriever will either miss the chunk entirely or return it along with noise; the generator will either fabricate the missing piece or hedge its way through a partial answer. Both failure modes surface downstream, far from where they originated.
Said another way: chunking decides what the system is physically capable of retrieving. Everything after chunking — semantic search, reranking, HyDE, prompt design — is working with whatever signal the chunks preserved. If the chunks are bad, the rest of the stack is polishing a compromised input.
This puts chunking at the top of the fix-order for any RAG quality work. The Context Engineering Maturity Model treats retrieval quality as a Level 4-5 concern; under the hood, most teams blocked at those levels are blocked on chunking, not on retriever sophistication.
The four strategies
Four strategies cover most of what production RAG systems actually ship. They are not mutually exclusive — the stronger pipelines compose them.
Fixed-size chunking
Fixed-size chunking splits the document into pieces of N tokens (or characters), optionally with an overlap of M tokens between consecutive chunks to preserve context across boundaries. Typical numbers are 500 tokens with 50-token overlap, or 1,000 with 100.
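A minimal sketch of those mechanics, using whitespace words as a stand-in for real tokenizer tokens (an assumption; a production splitter would count tokens with the embedding model's tokenizer):

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-token pieces, with overlap tokens
    shared between consecutive chunks. Assumes overlap < chunk_size.
    'Tokens' here are whitespace-separated words."""
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The loop stops once a window reaches the end of the document, so the final chunk may be shorter than `chunk_size`.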
What it does well: it is trivial to implement, trivial to reason about, and produces chunks of predictable size — which makes prompt-budget accounting easy and keeps embedding costs uniform.
What it does badly: it ignores the structure of the source document entirely. It will cheerfully split a sentence, a numbered list, a code block, a table row, or an argument in half. The overlap mitigates but does not fix this — a 50-token overlap saves you on sentence-level cuts but not on section-level ones.
Fixed-size is a perfectly reasonable baseline for a first-pass proof-of-concept. It is a poor ceiling for any system that needs to survive production traffic.
Semantic chunking
Semantic chunking respects natural boundaries — paragraphs, headings, section breaks, or the places where embedding similarity between adjacent sentences drops sharply. Each chunk is meant to correspond to one coherent idea.
What it does well: chunks read like what a human would have split the document into. A definition stays with its examples. A regulation stays with its exceptions. Retrieval returns passages that make sense on their own.
What it does badly: it produces wildly uneven chunks. One section might be 80 tokens; the next might be 1,400. That unevenness creates two problems. First, the long chunks dilute the relevance signal — the embedding averages across too many topics. Second, retrieval calibration gets weird — similarity scores are no longer comparable across chunks of different sizes.
Semantic chunking works best on well-structured corpora with consistent section sizes. It struggles on mixed corpora where document conventions vary.
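A toy sketch of the similarity-drop variant. The bag-of-words `embed` below is a stand-in for a real sentence-embedding model; only the boundary logic carries over to production:

```python
import math
import re
from collections import Counter

def embed(sentence):
    """Toy bag-of-words 'embedding'. A real pipeline would call a
    sentence-embedding model here."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever similarity between adjacent
    sentences drops below the threshold."""
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(nxt)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks
```

The threshold is the knob that controls the size distribution: lower it and chunks grow; raise it and they fragment.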
Recursive chunking
Recursive chunking applies a hierarchy of separators. A common order is: split on headings first; if any resulting chunk is still above the target size, split it on paragraph breaks; if still too large, split on sentences; if still too large, split on tokens. The splitter falls back to finer granularity only when it has to.
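The fallback logic can be sketched in a few lines. This version measures length in characters and omits the merge-small-pieces step real splitters perform, so it illustrates the hierarchy rather than replacing a library implementation:

```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " "), max_len=400):
    """Split on the coarsest separator first; recurse into any piece
    that is still over max_len using the next-finer separators."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:  # separator absent at this level; try the next one
        return recursive_split(text, rest, max_len)
    chunks = []
    for piece in pieces:
        if len(piece) > max_len:
            chunks.extend(recursive_split(piece, rest, max_len))
        else:
            chunks.append(piece)
    return chunks
```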
What it does well: it combines the structural awareness of semantic chunking with the size predictability of fixed-size. The LangChain RecursiveCharacterTextSplitter and LlamaIndex's default text splitter both implement this pattern, which is why it has become the de facto production default.
What it does badly: the hierarchy of separators has to be chosen for the document type. The separators that work for Markdown files do not work for contract PDFs; the separators that work for prose do not work for code. A single recursive splitter across a heterogeneous corpus will underperform a per-document-type splitter.
Recursive chunking is the right default for most heterogeneous corpora. Pair it with per-document-type separator lists when document conventions are stable enough to warrant the effort.
Parent-document retrieval
Parent-document retrieval is not strictly a chunking strategy — it is a retrieval strategy built on top of chunking — but it belongs in this comparison because it solves a problem the first three cannot.
The idea: decouple the unit of retrieval from the unit of generation. Index small child chunks (say, 300 tokens) so the retriever has sharp, high-precision targets. But when a child chunk is retrieved, return the parent (the enclosing section, or the whole document) to the generator. The retriever gets the precision of small chunks; the generator gets the context of large ones.
This is parent-document retrieval, and for long-form documents where meaning depends on surrounding context, it is often a multi-point quality jump over any single-level chunking scheme.
What it does well: it lets you tune retrieval precision (small child chunks) and generation context (large parents) independently. A legal RAG can retrieve on 200-token passages while the generator sees the entire section. A product-docs chatbot can retrieve on individual steps while the generator sees the full tutorial.
What it does badly: it consumes more context tokens, which raises cost and increases the risk of the model drifting onto irrelevant parts of the parent. It also requires the document to have a sensible parent structure to return — arbitrary mid-document ranges work badly as parents.
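A minimal sketch of the index shape. Keyword overlap stands in for vector search here; in practice the children carry embeddings, and the parent ID lives in the vector store's metadata:

```python
def build_index(sections, child_size=4):
    """Index small child chunks (child_size words here; ~300 tokens in
    practice) that each remember their parent section."""
    index = []
    for parent_id, section in enumerate(sections):
        words = section.split()
        for i in range(0, len(words), child_size):
            index.append({"text": " ".join(words[i:i + child_size]),
                          "parent_id": parent_id})
    return index

def retrieve_parents(query, index, sections, k=2):
    """Score children (toy keyword overlap stands in for vector
    search), then return the deduplicated parent sections."""
    q = set(query.lower().split())
    scored = sorted(index,
                    key=lambda c: len(q & set(c["text"].lower().split())),
                    reverse=True)
    parent_ids = []
    for child in scored[:k]:
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])
    return [sections[pid] for pid in parent_ids]
```

The deduplication step matters: several top children often share one parent, which is exactly the signal that the parent is the right unit to hand the generator.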
Worked example — a hypothetical legal-research assistant
Hypothetical scenario, not a shipped product. All numbers are illustrative.
A mid-size law firm builds an internal research assistant over a corpus of roughly 2,000 case PDFs and 400 statutory documents. Typical queries are things like "What's the standard for tolling the statute of limitations in New York contract disputes, and how have appellate courts treated intentional concealment?"
The first pipeline uses fixed-size chunking — 1,500 tokens, 150-token overlap — with a standard off-the-shelf embedding model. Answers on the eval set score an illustrative ~0.71 on faithfulness and ~0.68 on context recall. Users report that the assistant "finds the right case but answers the wrong question about it."
The diagnosis starts with reading the retrieved chunks for failed queries. Three patterns emerge:
- The retrieved chunk contains the answer, but also a page of procedural history and footnotes. The generator drifts onto the footnotes.
- The retrieved chunk contains the right passage, but it starts mid-argument — "This rule, however, does not apply when..." — with no antecedent for "this rule". The generator hedges.
- The retrieved chunk is on-topic but missing the adjacent paragraph that contains the qualifier that made the ruling narrow. The generator overstates the holding.
All three trace back to chunking. Failure one is chunks that are too large. Failures two and three are chunks that split structure badly.
The team ships three changes in sequence.
First change — switch to recursive chunking with legal-aware separators. Split first on section headings (§, Section, Roman numerals), then on paragraph breaks, then on sentences. Target 400-token chunks with 50-token overlap. Faithfulness moves to an illustrative ~0.78. Recall moves to ~0.74. The "starts mid-argument" failure mode drops sharply because paragraph and sentence boundaries are now respected.
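The legal-aware hierarchy from this step might be configured as regex separators, coarse to fine. The exact patterns are illustrative and would need tuning to the firm's actual document conventions:

```python
import re

# Coarse-to-fine separators for legal text: section headings first,
# then paragraphs, then sentences. Patterns are illustrative.
LEGAL_SEPARATORS = [
    r"\n(?=\u00a7\s*\d+)",       # statutory section symbols (e.g. "\u00a7 214")
    r"\n(?=Section\s+\d+)",      # "Section 12" style headings
    r"\n(?=[IVXLC]+\.\s)",       # Roman-numeral opinion headings
    r"\n\n",                     # paragraph breaks
    r"(?<=[.?!])\s+",            # sentence boundaries
]

def split_once(text, pattern):
    """Split on one separator level, dropping empty pieces. The
    lookahead patterns keep each heading attached to its section."""
    return [p for p in re.split(pattern, text) if p.strip()]
```

The lookahead trick (`(?=...)`) is what keeps "II. Analysis" at the top of its chunk instead of orphaned at the bottom of the previous one.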
Second change — add parent-document retrieval. Index the 400-token child chunks. On retrieval, look up the enclosing section (or the whole case for shorter documents) and pass that to the generator. Faithfulness moves to an illustrative ~0.84. The "missing qualifier" failure mode largely disappears because the generator now sees the surrounding paragraph.
Third change — add a cross-encoder reranker on top. Retrieve 30 child chunks, rerank to 8, return the parents of those 8. This is the smallest delta of the three — an illustrative ~0.86 — but it cleans up the tail of queries where the right child chunk was in the candidate set but not in the top few.
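The shape of this last step, with a caller-supplied `score_fn` standing in for the cross-encoder (a hypothetical interface, not any specific library's API):

```python
def rerank_then_parents(query, children, score_fn, keep=8):
    """Rerank retrieved child chunks with a cross-encoder-style
    score_fn(query, text), keep the top few, and return their
    parent IDs, deduplicated in rank order."""
    top = sorted(children,
                 key=lambda c: score_fn(query, c["text"]),
                 reverse=True)[:keep]
    seen, parents = set(), []
    for child in top:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            parents.append(child["parent_id"])
    return parents
```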
Note what did not happen. The team did not swap the embedding model. They did not change the generator prompt. They did not add HyDE or hybrid search. Those are all valid next moves, but the 15-point gain came from chunking and retrieval-unit decisions alone.
How to choose
A short decision list for picking a chunking strategy on a new corpus:
- Is the corpus made up of short, self-contained documents (FAQs, product descriptions, tweets)? Fixed-size chunking at a size slightly larger than the typical document is fine. Parent-document retrieval adds little because there is no meaningful parent.
- Is the corpus well-structured, consistent-format prose (policy docs, cleaned Markdown, a single technical manual)? Semantic chunking on headings and paragraphs works well. Validate chunk size distribution — if it is too long-tailed, fall back to recursive.
- Is the corpus heterogeneous long-form material (research papers, legal cases, mixed technical docs)? Recursive chunking with document-type-aware separators, plus parent-document retrieval. This is the production default.
- Is the corpus code or structured data (source files, API references, logs)? Fixed-size by tokens is usually wrong; split on function/class boundaries, or on log-event boundaries. Treat each separator set as a per-language decision.
Independent of strategy, tune chunk size against an eval set. The right size is a function of your embedding model, your retrieval top-K, your generator's context budget, and the shape of your queries. It is not a number you pick once.
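In code, that tuning loop is small; the expensive part is the callback, which re-chunks, re-indexes, and re-runs the eval set. Here `build_and_eval` is a hypothetical caller-supplied function, e.g. one wrapping a RAGAS run that returns a faithfulness score:

```python
def sweep_chunk_size(corpus, sizes, build_and_eval):
    """Try several chunk sizes; report the best by eval score.
    build_and_eval(corpus, size) should chunk, rebuild the index,
    run the eval set, and return a single score to maximize."""
    results = {size: build_and_eval(corpus, size) for size in sizes}
    best = max(results, key=results.get)
    return best, results
```

Re-run the sweep whenever the embedding model, retrieval top-K, or generator changes, since each of those shifts the inflection point.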
Failure modes
Four anti-patterns worth flagging.
Treating chunk size as a one-time decision. Chunk size is a hyperparameter. It interacts with the embedding model, the retrieval top-K, and the generator's context window. When any of those changes, the right chunk size probably changes too. Teams that set it once and forget it regress silently when they swap models.
Fixing chunking by throwing a bigger embedding model at the problem. A stronger embedding model on bad chunks produces worse retrieval than a mid-tier model on good chunks, because the embedding can only represent what is in the chunk. Fix chunking first.
Shipping semantic chunking without measuring chunk size distribution. Semantic chunking that produces some 60-token chunks and some 1,800-token chunks will have unpredictable retrieval behavior. Always plot the distribution. If it is long-tailed, switch to recursive, or cap size with a forced split.
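Checking that distribution takes only a few lines; whitespace words approximate tokens here:

```python
def size_distribution(chunks):
    """Summarize chunk lengths (in whitespace words) so a long-tailed
    semantic chunker is caught before it ships."""
    lengths = sorted(len(c.split()) for c in chunks)
    pct = lambda p: lengths[min(len(lengths) - 1, int(p * len(lengths)))]
    return {"min": lengths[0], "p50": pct(0.5),
            "p95": pct(0.95), "max": lengths[-1]}
```

A p95 several times the median is the signal to cap chunk size with a forced split or fall back to recursive.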
Parent-document retrieval with no structural parents. If the "parent" you return is just "the next 2,000 tokens around the child," you are back to fixed-size chunking with extra steps. Parent-document retrieval only pays off when the parent corresponds to a meaningful unit — a section, a case, a tutorial. Without real document structure, do not reach for it.
Our position
Five opinionated stances.
- Recursive + parent-document is the production default. Fixed-size is fine for a weekend prototype. For anything that ships, assume recursive chunking with document-type-aware separators and parent-document retrieval unless you have a specific reason to deviate. Most of the tuning energy should go into the separator hierarchy and the parent unit, not into chunk size per se.
- Fix chunking before embedding models, rerankers, or prompt tweaks. Chunking sets the ceiling. Everything else raises floors under that ceiling. Teams that reorder this — swap embeddings first, then wonder why quality barely moved — are a recurring pattern in RAG postmortems.
- Chunk size is a tuned hyperparameter, not a folklore number. The right number is a function of your corpus, your embedding model, and your eval set. Run RAGAS at three or four chunk sizes, find the inflection point, and re-run when anything upstream changes. Numbers from blog posts (including this one) are starting points, not answers.
- Read the chunks, not just the metrics. Metrics aggregate; they do not diagnose. When a query fails, pull the retrieved chunks and read them. Most chunking problems announce themselves immediately to the eye — a sentence cut in half, a pronoun with no antecedent, a paragraph that lost its topic sentence. The diagnosis is almost never obscure once you actually look.
- Parent-document retrieval is underused. For any corpus of long-form documents — legal, research, policy, technical manuals — parent-document retrieval is typically a larger and cheaper win than any of the retriever-side upgrades teams reach for first. The conceptual move — decouple retrieval unit from generation unit — is worth internalizing even if you end up implementing it yourself rather than using a framework's version.