Wednesday, May 6, 2026

Design Considerations for High-Quality RAG Systems

RAG systems that follow these principles religiously generally perform well:

  • Good structure preservation 
  • Smart chunking 
  • Rich metadata 
  • Hybrid retrieval 
  • Reranking 
  • Incremental freshness 
  • Strong evaluation loops 


Pillar-by-pillar overview (what it means, key techniques, why it matters):

1. Good Structure Preservation
   What it means: preserve document hierarchy and layout instead of flattening into raw text.
   Core techniques: hierarchical parsing (H1/H2/H3, sections), layout-aware parsing, parent-child retrieval, DOM/Markdown preservation, table/code/list preservation.
   Common tools: LangChain, LlamaParse, Unstructured, Docling, LayoutParser.
   Why it matters: preserves semantic context, prevents broken tables/sections, improves retrieval relevance, enables context-aware retrieval, grounds answers better.

2. Smart Chunking
   What it means: split documents intelligently so semantic meaning is preserved.
   Chunking types: fixed-size, recursive, semantic, structure-aware, agentic.
   Key parameters: chunk size tuning, overlap tuning.
   Why it matters: strongly impacts embedding quality, prevents context fragmentation, balances precision vs. context richness, reduces retrieval noise.

3. Rich Metadata
   What it means: attach structured information to chunks/documents.
   Examples: source, section, page, author, timestamp, permissions, document type, language, tenant/user ID.
   Capabilities enabled: filtering, routing, access control, freshness ranking.
   Why it matters: enables precise filtering, improves security/access control, supports multi-tenant RAG, enables citations and auditability, improves retrieval quality.

4. Hybrid Retrieval
   What it means: combine semantic retrieval with keyword-based retrieval.
   Retrieval methods: vector search, BM25, lexical search.
   Fusion methods: Reciprocal Rank Fusion (RRF), weighted fusion, query expansion, multi-query retrieval.
   Why it matters: handles exact keywords/IDs/error codes, improves recall significantly, balances semantic and exact matching; standard in production RAG.

5. Reranking
   What it means: re-score retrieved chunks using stronger relevance models.
   Pipeline: retrieve top-N → rerank → keep top-K.
   Popular models: BGE Reranker, Cohere Rerank, Jina Reranker, cross-encoders.
   Why it matters: removes noisy retrievals, improves final context quality, raises answer accuracy; one of the highest-ROI improvements.

6. Incremental Freshness
   What it means: continuously update knowledge without rebuilding the entire index.
   Techniques: delta updates, partial re-embedding, versioning, streaming ingestion, CDC pipelines, event/webhook-based updates, freshness-aware ranking.
   Why it matters: keeps knowledge current, reduces reprocessing cost, supports real-time systems, enables rollback/auditing.

7. Strong Evaluation Loops
   What it means: continuously measure and improve retrieval and generation quality.
   Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG.
   Generation metrics: faithfulness, groundedness, answer relevance, context precision.
   Evaluation methods: human evaluation, LLM-as-a-judge, synthetic QA generation.
   Why it matters: detects hallucinations and retrieval failures, enables systematic improvement, optimizes latency/cost/quality; essential for production reliability.

Typical mature RAG pipeline (all pillars combined):
Documents → Parsing → Chunking → Metadata → Embeddings → Hybrid Index → Retrieval → Reranking → LLM → Evaluation Loop
Result: scalable, reliable, production-grade RAG systems.

Highest practical impact areas:
   Very high impact: better chunking, hybrid retrieval, reranking.
   High / critical: metadata filtering, structure preservation, evaluation loops, incremental freshness.
   Most real-world RAG failures come from weak retrieval pipelines rather than weak LLMs or vector DBs.




1. Good Structure Preservation

What it means

When ingesting documents, preserve the document’s natural structure instead of flattening everything into plain text.

Examples of structure:

  • Titles

  • Headings

  • Subheadings

  • Tables

  • Lists

  • Code blocks

  • Sections

  • Page hierarchy

  • HTML DOM structure

  • Markdown hierarchy

  • Parent-child relationships

Instead of:

random merged text blob

Preserve:

Document
 ├── Chapter
 │    ├── Section
 │    │    ├── Paragraph
 │    │    └── Table

Why it matters

LLMs understand semantically organized information better.

Without structure preservation:

  • chunks lose context

  • tables break

  • headings disappear

  • unrelated paragraphs merge

  • retrieval quality drops


Example

Bad chunk:

Annual leave is 20 days. Kubernetes pods...

Good chunk:

Document: HR Policy
Section: Leave Policy
Subsection: Annual Leave
Content: Annual leave is 20 days...

Now retrieval becomes context-aware.
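
One cheap way to get this effect is to prepend the heading trail to each chunk before embedding. A minimal sketch (the function name and inputs are illustrative, not a library API):

def contextualize(chunk_text, heading_path):
    # Prepend the document/section trail so the chunk stays self-describing.
    header = " > ".join(heading_path)
    return f"[{header}]\n{chunk_text}"

contextualize("Annual leave is 20 days...",
              ["HR Policy", "Leave Policy", "Annual Leave"])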


Important techniques

a) Hierarchical parsing

Preserve:

  • H1

  • H2

  • H3

  • sections

  • subsections
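
For Markdown-style documents, LangChain's MarkdownHeaderTextSplitter does this and keeps the heading trail as metadata. A minimal sketch (markdown_text is assumed to hold your document):

# pip install langchain-text-splitters
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = splitter.split_text(markdown_text)
for s in sections:
    print(s.metadata, s.page_content[:60])  # heading trail + content preview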


b) Layout-aware parsing

Especially for PDFs.

Use parsers that understand:

  • columns

  • tables

  • headers

  • footers

  • reading order

Examples:

  • Unstructured

  • LlamaParse

  • Docling

  • LayoutParser


Advanced idea: Parent-child retrieval

Store:

  • small chunks for embeddings

  • larger parent sections for generation

This improves both:

  • retrieval precision

  • answer completeness
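
A hand-rolled sketch of the idea (embed() is an assumed embedding function; a real system would use a vector store instead of brute-force scoring):

import numpy as np

parents = {"sec-1": "Full text of the 'Leave Policy' section ..."}
children = [
    {"parent_id": "sec-1", "text": "Annual leave is 20 days."},
    {"parent_id": "sec-1", "text": "Carry-over is capped at 5 days."},
]

def retrieve_parents(query, embed, top_k=1):
    q = np.asarray(embed(query))
    # Score the small child chunks for retrieval precision...
    scored = sorted(children,
                    key=lambda c: float(np.dot(q, np.asarray(embed(c["text"])))),
                    reverse=True)
    # ...but return the larger parent sections for generation.
    seen, out = set(), []
    for c in scored:
        if c["parent_id"] not in seen:
            seen.add(c["parent_id"])
            out.append(parents[c["parent_id"]])
        if len(out) == top_k:
            break
    return out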


2. Smart Chunking

Chunking is probably the MOST underestimated part of RAG.


Why chunking matters

Embeddings are created per chunk.

Bad chunking destroys semantic meaning.


Types of chunking


a) Fixed-size chunking (basic)

Example:

  • 500 tokens

  • 50 overlap

Simple but crude.

Problems:

  • breaks sentences

  • breaks tables

  • breaks logical sections
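
The naive version is just a sliding window over tokens. A minimal sketch, assuming the document is already tokenized into a list:

def fixed_size_chunks(tokens, size=500, overlap=50):
    # Slide a window of `size` tokens, re-including `overlap` tokens each step.
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]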


b) Recursive chunking

Popular in LangChain.

Attempts splitting in order:

  1. headings

  2. paragraphs

  3. sentences

  4. words

Much better semantic preservation.
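
With LangChain's RecursiveCharacterTextSplitter this is a few lines (document_text is assumed to hold your raw text):

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # try coarser boundaries first
)
chunks = splitter.split_text(document_text)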


c) Semantic chunking

Uses embeddings or similarity to split where topic changes.

Instead of fixed size:

Chunk ends when semantic meaning changes

Very powerful.
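
A hand-rolled sketch of the idea: compare embeddings of consecutive sentences and cut a chunk wherever similarity drops. embed() is an assumed sentence-embedding function, and the 0.7 threshold is a tunable guess:

import numpy as np

def semantic_chunks(sentences, embed, threshold=0.7):
    vecs = [np.asarray(embed(s)) for s in sentences]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if float(np.dot(prev, cur)) < threshold:  # topic shift -> new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks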


d) Structure-aware chunking

Chunk according to:

  • sections

  • markdown blocks

  • HTML

  • code functions/classes

  • legal clauses

  • transcript speaker turns

This is often superior.


e) Agentic chunking (advanced)

LLM decides chunk boundaries dynamically.

Expensive but powerful.


Important chunking principles


Chunk size tradeoff

Small chunks:

✅ precise retrieval
❌ may lose context

Large chunks:

✅ richer context
❌ noisy retrieval

Typical ranges:

Use Case           Chunk Size (tokens)
FAQ                200–400
Technical docs     400–800
Legal              800–1500
Code               function/class based

Overlap

Overlap helps preserve continuity.

Example:

Chunk 1: sentences A, B, C
Chunk 2: sentences C, D, E

Typical:

10–20% overlap.

Too much overlap:

  • duplicates results

  • wastes tokens

  • hurts retrieval diversity


3. Rich Metadata

Metadata is a SUPERPOWER in RAG.

Most beginners ignore it.


What is metadata?

Extra information attached to chunks.

Example:

{
  "source": "employee_handbook.pdf",
  "department": "HR",
  "section": "Leave Policy",
  "page": 14,
  "date": "2026-01-01",
  "access_level": "internal"
}

Why metadata matters

It enables:

  • filtering

  • routing

  • security

  • freshness

  • hybrid search

  • ranking

  • citations


Metadata examples

Metadata         Usage
source           citations
author           attribution
timestamp        freshness
department       filtering
language         multilingual routing
document_type    retrieval specialization
permissions      security

Powerful use cases


a) Time filtering

Example:

Only retrieve policies from the last year

Critical for enterprise systems.
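
An illustrative sketch using Chroma's metadata filters (most vector DBs offer an equivalent; this assumes chunks were indexed with an updated_at field stored as epoch seconds so numeric comparison works):

import time
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection("policies")

one_year_ago = int(time.time()) - 365 * 24 * 3600
results = collection.query(
    query_texts=["annual leave"],
    n_results=5,
    where={"updated_at": {"$gte": one_year_ago}},  # only recent policies
)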


b) Access control

User should only retrieve authorized documents.


c) Multi-tenant RAG

Separate users/organizations using metadata filters.


d) Source-aware reranking

Prefer official docs over chats.


4. Hybrid Retrieval

One of the BIGGEST upgrades over naive vector search.


Problem with pure vector search

Embeddings are semantic.

They may fail for:

  • exact keywords

  • IDs

  • codes

  • version numbers

  • acronyms

  • error messages

Example:

ERR_CONN_RESET

Embedding search may fail badly.


Hybrid retrieval combines:

Semantic search

(using embeddings)

AND

Lexical/BM25 keyword search

(using exact terms)


Typical architecture

User Query
   ↓
Vector Search
   +
BM25 Search
   ↓
Merged Results

Why hybrid works so well

Semantic search finds:

vacation policy

BM25 finds:

Annual Leave Policy

Combined = better recall.


Common hybrid techniques

Technique                       Description
BM25 + Vector                   most common combination
Reciprocal Rank Fusion (RRF)    merges rankings from each retriever
Weighted fusion                 combines weighted scores
Multi-query retrieval           multiple reformulated queries
Query expansion                 synonyms/related terms

Modern production systems almost always use hybrid retrieval.

Especially enterprise search.
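
Of these, RRF is the simplest to implement yourself. A minimal sketch that fuses any number of ranked ID lists; k=60 is the commonly used constant:

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: e.g. [vector_search_ids, bm25_ids], best hit first.
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)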


5. Reranking

Reranking is one of the HIGHEST ROI improvements in RAG.



Problem

Initial retrieval is approximate.

Top-10 retrieved chunks often contain noise.


Reranking step

Pipeline:

Query
 ↓
Retrieve top 50
 ↓
Reranker scores relevance
 ↓
Keep top 5
 ↓
Send to LLM

Why rerankers are powerful

Embedding search encodes the query and each chunk independently, then compares vectors.

Rerankers compare:

(query, chunk)

jointly.

This gives MUCH better relevance.


Popular rerankers

Model                   Notes
BGE Reranker            strong open-source option
Cohere Rerank           very popular API
Jina Reranker           lightweight
Cross-encoder models    the classic approach

Cross-encoder concept

Instead of:

embed(query)
embed(chunk)
cosine similarity

Model directly evaluates:

"How relevant is this chunk to this query?"

Much more accurate.


Cost tradeoff

Rerankers are slower than embeddings.

So:

retrieve many → rerank few

is the standard pattern.
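
A sketch of that pattern using sentence-transformers' CrossEncoder (the model name is one public reranker among several; swap in BGE, Jina, etc.):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, keep=5):
    # Score each (query, chunk) pair jointly, then keep the best few.
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:keep]]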


6. Incremental Freshness

A huge production concern.


Problem

Documents change continuously.

Examples:

  • policies updated

  • tickets added

  • wikis edited

  • repos changed

Naive systems require:

re-embed EVERYTHING

which is expensive.


Incremental ingestion

Only process changed documents.

Pipeline:

Detect change
 → parse
 → chunk
 → embed
 → update index

Important concepts


a) Delta updates

Update only changed chunks.
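
A minimal sketch: fingerprint each document and re-ingest only when the hash changes (reindex() stands in for your parse → chunk → embed → index pipeline):

import hashlib

seen_hashes = {}  # doc_id -> content hash, persisted between runs

def ingest_if_changed(doc_id, text, reindex):
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if seen_hashes.get(doc_id) == h:
        return False           # unchanged: skip re-embedding entirely
    seen_hashes[doc_id] = h
    reindex(doc_id, text)      # parse -> chunk -> embed -> update index
    return True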


b) Versioning

Track document versions.

Useful for:

  • rollback

  • auditing

  • time-travel queries


c) Streaming ingestion

Real-time updates from:

  • Kafka

  • CDC pipelines

  • webhooks

  • event systems


d) Freshness ranking

Prefer newer documents.

Especially for:

  • news

  • support systems

  • operational knowledge
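
One common implementation is exponential decay on the retrieval score; the 90-day half-life below is a tunable assumption, not a standard value:

def freshness_score(similarity, age_days, half_life_days=90):
    # Halve the score every `half_life_days` of document age.
    return similarity * 0.5 ** (age_days / half_life_days)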


Enterprise challenge

Freshness vs stability.

Too-frequent updates may:

  • create embedding drift

  • destabilize retrieval

  • increase costs


7. Strong Evaluation Loops

This separates real systems from demos.


Core problem

RAG quality is HARD to judge manually.

You need systematic evaluation.


What should be evaluated?

Area                 Example
Retrieval quality    Did we retrieve the correct chunks?
Groundedness         Is the answer supported by the context?
Hallucination        Did the model invent facts?
Latency              Response speed
Cost                 Token + embedding cost
Citation accuracy    Are references correct?

Retrieval metrics


Recall@K

Did a relevant chunk appear in the top K?

Example:

top-5 contains answer?

MRR (Mean Reciprocal Rank)

How early the first correct chunk appears in the ranking.
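
Both metrics take a few lines (results is a ranked list of chunk IDs, relevant a set of IDs that answer the query; average over your full query set):

def recall_at_k(results, relevant, k=5):
    return int(any(r in relevant for r in results[:k]))

def mrr_single(results, relevant):
    for rank, r in enumerate(results, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0  # correct chunk never retrieved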


NDCG

Ranking quality metric.

Very common in search systems.


Generation evaluation


Faithfulness

Is answer grounded in retrieved docs?


Answer relevance

Did the answer actually address the user's query?


Context precision

Were retrieved chunks useful or noisy?


Modern evaluation methods


a) Human evaluation

Best quality.

But expensive.


b) LLM-as-a-judge

Use another LLM to evaluate outputs.

Very popular now.
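
A sketch of a judge prompt (call_llm() is an assumed chat-completion helper; the rubric is illustrative, not a standard one):

JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}

Score 1-5 for faithfulness: is every claim supported by the context?
Reply with only the number."""

def judge_faithfulness(question, context, answer, call_llm):
    prompt = JUDGE_PROMPT.format(
        question=question, context=context, answer=answer)
    return int(call_llm(prompt).strip())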


c) Synthetic test generation

Generate QA pairs automatically from docs.

Useful for benchmarking.


Continuous improvement loop

Modern RAG systems usually evolve like this:

Logs
 → failures
 → evaluation
 → retriever tuning
 → chunk tuning
 → reranker tuning
 → prompt tuning
 → re-evaluate

The BIG Picture

Modern RAG is gradually becoming:

Search Engineering
+
Knowledge Engineering
+
LLM Orchestration
+
Evaluation Science

—not just embeddings.


Typical Mature RAG Architecture

Documents
   ↓
Structure-aware parsing
   ↓
Smart chunking
   ↓
Metadata enrichment
   ↓
Embedding generation
   ↓
Hybrid indexing
   ↓
Retrieval
   ↓
Reranking
   ↓
LLM generation
   ↓
Evaluation + feedback loop

Relative Impact (practical experience)

Approximate real-world impact:

Technique                 Impact
Better chunking           VERY HIGH
Hybrid retrieval          VERY HIGH
Reranking                 VERY HIGH
Metadata filtering        HIGH
Structure preservation    HIGH
Evaluation loops          CRITICAL long-term
Incremental freshness     CRITICAL in production

Important insight

Most RAG failures are NOT because:

  • embedding model is weak

  • vector DB is weak

  • LLM is weak

They are usually because of:

  • poor chunking

  • missing metadata

  • weak retrieval

  • no reranking

  • bad ingestion

  • no evaluation

Those are the real bottlenecks.
