Question 9
Domain 2: Data Preparation
A legal RAG system must preserve section boundaries and cross-references, but fixed-size chunking is hurting retrieval quality. Which strategy is most appropriate?
Correct answer: B
Explanation
Section-aware hierarchical chunks preserve the document’s section boundaries, and overlap between adjacent chunks keeps text that straddles a chunk boundary retrievable. Metadata such as section numbers supports cross-references by letting the system link each chunk back to its source section, rather than splitting the text arbitrarily as fixed-size chunking does.
Why each option is right or wrong
A. Embed each full document as one vector
One vector per full document is too coarse and weakens precise passage-level retrieval.
B. Use section-aware hierarchical chunks with overlap and metadata
Fixed-size chunking is the wrong fit here because it ignores the document’s structural units, so section headings, subparts, and referenced provisions can be split across unrelated chunks and retrieved out of context. A hierarchical scheme that chunks by section and sub-section, with a small overlap between adjacent chunks and metadata such as section numbers and parent-child links, preserves the legal text’s internal references and improves retrieval precision without losing the document’s organization.
C. Randomly sample sentences from each document
Random sentence sampling destroys legal context, hierarchy, and citation relationships needed for grounded retrieval.
D. Use only keyword search with no chunking
Keyword search alone misses semantic matches and does nothing to preserve section boundaries or cross-references.
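The strategy in option B can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production chunker: it assumes headings look like "1. Definitions" or "2.1 Notice Period" and that citations look like "Section 2.1". The names Chunk, split_with_overlap, chunk_document, and SAMPLE are hypothetical and chosen only for this example. Each section becomes its own unit, long sections are split into overlapping word windows, and every chunk carries its section number, parent section, and the sections it cites.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed heading style: "1. Definitions" or "2.1 Notice Period".
# A body line that starts with a number (e.g. "30 days ...") would be
# misread as a heading; real documents need a stricter pattern.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(.+)$")
# Assumed citation style: "Section 2.1".
XREF = re.compile(r"[Ss]ection\s+(\d+(?:\.\d+)*)")


@dataclass
class Chunk:
    section: str            # e.g. "2.1"
    parent: Optional[str]   # e.g. "2", or None for a top-level section
    title: str
    text: str
    refs: List[str] = field(default_factory=list)  # sections cited by this section


def split_with_overlap(words, size, overlap):
    """Split a word list into windows of `size` words that share `overlap` words."""
    if len(words) <= size:
        return [words]
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break
    return windows


def chunk_document(text, size=120, overlap=20):
    """Chunk by section first, then by overlapping windows inside each section."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")

    # Pass 1: group body lines under the most recent heading.
    # Text before the first heading is ignored in this sketch.
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            sections.append((match.group(1), match.group(2), []))
        elif sections:
            sections[-1][2].append(line)

    # Pass 2: emit chunks that never cross a section boundary.
    chunks = []
    for number, title, body in sections:
        words = " ".join(body).split()
        if not words:
            continue  # heading with no body text of its own
        parent = number.rsplit(".", 1)[0] if "." in number else None
        # Citations are collected per section, so a window boundary
        # falling inside "Section 2.1" cannot lose the reference.
        refs = sorted(set(XREF.findall(" ".join(words))))
        for window in split_with_overlap(words, size, overlap):
            chunks.append(Chunk(number, parent, title, " ".join(window), refs))
    return chunks


SAMPLE = """
1. Definitions
"Notice" means written notice delivered under Section 2.1.
2. Termination
2.1 Notice Period
Either party may terminate on thirty days notice as defined in Section 1.
"""

if __name__ == "__main__":
    for chunk in chunk_document(SAMPLE, size=5, overlap=2):
        print(chunk.section, chunk.parent, chunk.refs, "|", chunk.text)
```

At retrieval time, the parent and refs fields let the system pull in the enclosing section or a cited provision alongside the matched chunk, which fixed-size chunks cannot support because they carry no structural information.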