Tuesday, May 5, 2026

Things you should look for in good RAG test data

1. It should force semantic search, i.e. it should include cases that are not obvious to a dumb machine. For example, given an HR database, suppose you search for "leave policy". A dumb algorithm will simply search for the keyword "leave", collect everything that contains the word "leave" and output it, whereas a semantically searched result set will also include statements containing "holidays", "off days" etc.


2. Overlapping data: The same data should appear multiple times, or appear individually but in different contexts. The RAG system should be able to tell those contexts apart.


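3. Ambiguity: It should contain some kind of ambiguity, e.g. a term such as "leave" that can be read in more than one way depending on context.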

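4. Unclear edge cases: A well-built RAG system should be able to reach a conclusion on edge cases from the available data, and should say so gracefully when it cannot provide accurate information.

5. Multi-hop reasoning: Some questions should be answerable only by combining facts from two or more separate chunks or documents.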
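
One way to check point 1 is to run a naive keyword baseline over a test case and see whether it misses anything a human marked as relevant. The sketch below is only an illustration: the function names, the sample documents and the relevance labels are all made up for this example, and the baseline is a plain substring match rather than any real search engine.

```python
def keyword_hits(query_term, documents):
    """Naive baseline: return the documents that contain the literal term."""
    term = query_term.lower()
    return [doc for doc in documents if term in doc.lower()]


def forces_semantic_search(query_term, documents, relevant):
    """Return (True, missed) if keyword matching misses at least one
    document that was labelled relevant, otherwise (False, [])."""
    found = set(keyword_hits(query_term, documents))
    missed = [doc for doc in relevant if doc not in found]
    return len(missed) > 0, missed


# Hypothetical HR snippets; the first three are labelled relevant to "leave policy".
documents = [
    "Employees may take leave after informing their manager.",
    "The office is closed on public holidays.",
    "Off days can be carried over to the next quarter.",
    "Laptops must be returned to the IT desk.",
]
relevant = documents[:3]

ok, missed = forces_semantic_search("leave", documents, relevant)
print(ok, len(missed))
```

If the keyword baseline already finds every relevant document, the test case is probably too easy to tell a semantic retriever apart from a dumb one.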



Types of duplication in data:

1. Exact duplicates: The data contains the same statement or contextual data repeated exactly in many places.

The models (embedding and LLM) should be intelligent enough to tell noise from required data among exact duplicates. Noise needs to be eliminated. For example, if the string "Employees are entitled to 40 days of leave per year." is encountered multiple times, the repeats can be eliminated. In a legal database, however, the same sentence may appear many times as the punishment for different crimes; those repeats carry meaning and must not be eliminated.


2. Semantic duplicates: The word order or the words themselves differ, but the meaning is the same.

e.g. "You can change your password after first login" and "Password can be changed by the user after maiden login"

3. Boilerplate overlap: Common repetitive text found across many documents, e.g. headers and footers.
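
Boilerplate of this kind can often be spotted by counting how many documents each line appears in. The following is a rough sketch under that assumption: `strip_boilerplate`, the 0.8 cut-off and the sample documents are invented for illustration, and real headers and footers may need fuzzier matching (page numbers, dates and so on).

```python
from collections import Counter


def strip_boilerplate(documents, min_fraction=0.8):
    """Remove lines that occur in at least min_fraction of the documents.

    Returns (cleaned_documents, boilerplate_lines).
    """
    if len(documents) < 2:
        # With a single document every line would look "common", so do nothing.
        return list(documents), set()

    line_doc_counts = Counter()
    for doc in documents:
        # Use a set so each line is counted once per document.
        line_doc_counts.update(
            {line.strip() for line in doc.splitlines() if line.strip()}
        )

    threshold = min_fraction * len(documents)
    boilerplate = {line for line, n in line_doc_counts.items() if n >= threshold}

    cleaned = []
    for doc in documents:
        kept = [line for line in doc.splitlines() if line.strip() not in boilerplate]
        cleaned.append("\n".join(kept))
    return cleaned, boilerplate


# Hypothetical documents sharing a header and a footer.
docs = [
    "ACME Corp - Confidential\nLeave policy: 40 days per year.\nContact: hr@acme.example",
    "ACME Corp - Confidential\nPasswords expire every 90 days.\nContact: hr@acme.example",
    "ACME Corp - Confidential\nLaptops must be returned on exit.\nContact: hr@acme.example",
]
cleaned, boilerplate = strip_boilerplate(docs)
print(cleaned[0])
```

Stripping these lines before chunking keeps the header and footer from dominating retrieval results simply because they appear everywhere.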


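For the exact-duplicate case, hash-based fingerprinting is one simple option. The sketch below assumes each record is a `(context, text)` pair and lets the caller decide whether the context is part of the fingerprint, so that HR-style repeats collapse while legal-style repeats survive. The function names, the record layout and the sample records are assumptions made for this example.

```python
import hashlib


def fingerprint(text, context=""):
    """SHA-256 over lower-cased, whitespace-normalised text plus an optional context."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(f"{context}\x1f{normalised}".encode("utf-8")).hexdigest()


def drop_exact_duplicates(records, context_matters=False):
    """Keep the first record for each fingerprint.

    records: list of (context, text) tuples.
    context_matters: if True, the same text under different contexts is kept.
    """
    seen = set()
    kept = []
    for context, text in records:
        key = fingerprint(text, context if context_matters else "")
        if key not in seen:
            seen.add(key)
            kept.append((context, text))
    return kept


# Hypothetical records: the HR repeat is noise, the legal repeat is not.
hr_records = [
    ("handbook", "Employees are entitled to 40 days of leave per year."),
    ("faq", "Employees are entitled to 40 days  of leave per year."),
]
legal_records = [
    ("theft", "Imprisonment of up to three years."),
    ("fraud", "Imprisonment of up to three years."),
]

print(len(drop_exact_duplicates(hr_records)))
print(len(drop_exact_duplicates(legal_records, context_matters=True)))
```

Whether the context belongs in the fingerprint is a judgment about the data set, not something the hash can decide on its own.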

Strategies to Handle Overlapping Data
  • Semantic Splitting: Using tools to split text at semantic boundaries rather than fixed token counts reduces the need for large overlaps.
  • Deduplication: Implementing hash-based fingerprinting (for exact matches) or using clustering techniques to identify and remove near-duplicates.
  • Summarization: Detecting overlapping chunks during retrieval and summarizing them into a single, comprehensive context for the LLM.
  • Optimal Overlap Setting: A commonly suggested chunk overlap is 10% to 20% of the chunk size (e.g., 100 tokens of overlap for a 512-token chunk, which is just under 20%).
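
To make the overlap setting concrete, here is a minimal sliding-window chunker using the 512/100 figures from the last bullet. It is a sketch only: `chunk_with_overlap` is a made-up helper, and the "tokens" are just list items, whereas a real pipeline would use the tokenizer of its embedding model.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=100):
    """Split a token list into chunks where each chunk repeats the
    last `overlap` tokens of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


# Stand-in for a tokenised document: 1000 integer "tokens".
tokens = list(range(1000))
chunks = chunk_with_overlap(tokens)
print(len(chunks), [len(c) for c in chunks])
```

With a step of 412 tokens, the tail of each chunk reappears at the head of the next, which is exactly the overlapping data the retrieval stage later has to deduplicate or summarize.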

 
