RAG systems that consistently follow these principles tend to perform well:
- Good structure preservation
- Smart chunking
- Rich metadata
- Hybrid retrieval
- Reranking
- Incremental freshness
- Strong evaluation loops
| Pillar | What It Means | Key Techniques / Algorithms / Components | Why It Matters / Benefits |
|---|---|---|---|
| Good Structure Preservation | Preserve document hierarchy and layout instead of flattening into raw text. | Hierarchical parsing; layout-aware parsing (Unstructured, LlamaParse, Docling, LayoutParser); parent-child retrieval | Chunks keep their context; tables, headings, and sections survive ingestion, so retrieval stays context-aware. |
| Smart Chunking | Split documents intelligently so semantic meaning is preserved. | Fixed-size, recursive, semantic, structure-aware, and agentic chunking; chunk size and overlap tuning | Embeddings are created per chunk, so bad chunking destroys semantic meaning. |
| Rich Metadata | Attach structured information to chunks/documents. | source, author, timestamp, department, language, document_type, permissions | Enables filtering, routing, access control, freshness, citations, and source-aware ranking. |
| Hybrid Retrieval | Combine semantic retrieval with keyword-based retrieval. | BM25 + vector search; Reciprocal Rank Fusion (RRF); weighted fusion; multi-query retrieval; query expansion | Catches exact keywords, IDs, and error codes that pure vector search misses; improves recall. |
| Reranking | Re-score retrieved chunks using stronger relevance models. | Retrieve top-N → rerank → keep top-K; BGE Reranker, Cohere Rerank, Jina Reranker, cross-encoders | Joint (query, chunk) scoring is far more accurate than independent embedding similarity. |
| Incremental Freshness | Continuously update knowledge without rebuilding the entire index. | Delta updates; versioning; streaming ingestion (Kafka, CDC, webhooks); freshness ranking | Keeps answers current without re-embedding everything. |
| Strong Evaluation Loops | Continuously measure and improve retrieval and generation quality. | Recall@K, MRR, NDCG; faithfulness, answer relevance, context precision; human eval, LLM-as-a-judge, synthetic test sets | Separates real systems from demos and drives continuous improvement. |
| Typical Mature RAG Pipeline | End-to-end architecture combining all pillars. | Documents → Parsing → Chunking → Metadata → Embeddings → Hybrid Index → Retrieval → Reranking → LLM → Evaluation Loop | Each stage compounds the quality of the next. |
| Highest Practical Impact Areas | Components that usually improve RAG the most. | Very high: chunking, hybrid retrieval, reranking; high/critical: metadata, structure preservation, evaluation loops, freshness | Shows where to invest first. |
1. Good Structure Preservation
What it means
When ingesting documents, preserve the document’s natural structure instead of flattening everything into plain text.
Examples of structure:
- Titles
- Headings
- Subheadings
- Tables
- Lists
- Code blocks
- Sections
- Page hierarchy
- HTML DOM structure
- Markdown hierarchy
- Parent-child relationships
Instead of a random merged text blob, preserve the hierarchy:

```
Document
├── Chapter
│   ├── Section
│   │   ├── Paragraph
│   │   └── Table
```
Why it matters
LLMs understand semantically organized information better.
Without structure preservation:
- chunks lose context
- tables break
- headings disappear
- unrelated paragraphs merge
- retrieval quality drops
Example
Bad chunk:
Annual leave is 20 days. Kubernetes pods...
Good chunk:
Document: HR Policy
Section: Leave Policy
Subsection: Annual Leave
Content: Annual leave is 20 days...
Now retrieval becomes context-aware.
Important techniques
a) Hierarchical parsing
Preserve heading levels (H1, H2, H3) and the section/subsection hierarchy.
b) Layout-aware parsing
Especially for PDFs.
Use parsers that understand columns, tables, headers, footers, and reading order.
Examples: Unstructured, LlamaParse, Docling, LayoutParser.
Advanced idea: Parent-child retrieval
Store:
- small chunks for embeddings
- larger parent sections for generation
This improves both:
- retrieval precision
- answer completeness
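A minimal sketch of the pattern, kept self-contained: the token-overlap scorer below stands in for real embedding similarity, and the document contents are invented for illustration.

```python
# Minimal parent-child retrieval sketch.
# A real system would use embedding vectors; a simple token-overlap score
# stands in for vector similarity so the example runs on its own.

parents = {
    "leave-policy": "Leave Policy section: annual leave, sick leave, carry-over rules ...",
    "security":     "Security section: device encryption, passwords, VPN rules ...",
}

children = [
    {"parent_id": "leave-policy", "text": "Annual leave is 20 days per year."},
    {"parent_id": "leave-policy", "text": "Sick leave requires a doctor's note."},
    {"parent_id": "security",     "text": "All laptops must be encrypted."},
]

def similarity(query: str, text: str) -> float:
    """Stand-in for embedding similarity: fraction of query words found in the chunk."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)

def retrieve_parent(query: str) -> str:
    """Match the query against small child chunks, return the larger parent section."""
    best = max(children, key=lambda c: similarity(query, c["text"]))
    return parents[best["parent_id"]]

print(retrieve_parent("how many days of annual leave"))
# -> the full Leave Policy parent section, even though only a small chunk matched
```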
2. Smart Chunking
Chunking is probably the MOST underestimated part of RAG.
Why chunking matters
Embeddings are created per chunk.
Bad chunking destroys semantic meaning.
Types of chunking
a) Fixed-size chunking (basic)
Example: 500 tokens with 50-token overlap.
Simple but crude. Problems:
- breaks sentences
- breaks tables
- breaks logical sections
b) Recursive chunking
Popular in LangChain.
Attempts splitting in order:
- headings
- paragraphs
- sentences
- words
Much better semantic preservation.
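A sketch using LangChain's RecursiveCharacterTextSplitter; the import path and defaults vary across LangChain versions, so treat the exact package layout here as an assumption.

```python
# Assumes the standalone `langchain-text-splitters` package is installed;
# older LangChain versions expose the same class under langchain.text_splitter.
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = ("Heading\n\nParagraph one about leave policy.\n\n"
                      "Paragraph two about carry-over rules.\n\n") * 20

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # target size in characters (plug in a token counter for tokens)
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # try coarse boundaries before fine ones
)
chunks = splitter.split_text(long_document_text)
print(len(chunks), chunks[0][:60])
```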
c) Semantic chunking
Uses embeddings or similarity to split where topic changes.
Instead of a fixed size, a chunk ends where the semantic meaning changes.
Very powerful.
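A rough sketch of the idea: embed consecutive sentences and start a new chunk when similarity drops. The bag-of-letters `embed()` below is only a toy stand-in for a real sentence-embedding model, and the threshold is illustrative.

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    """Toy stand-in for a real sentence-embedding model (bag of letters)."""
    vec = np.zeros(26)
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.8) -> list[str]:
    """Start a new chunk whenever similarity to the previous sentence drops below threshold."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(" ".join(current))   # topic changed: close the chunk
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "Annual leave is 20 days.",
    "Unused leave can be carried over.",
    "Kubernetes pods restart automatically.",
]
print(semantic_chunks(sentences))
```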
d) Structure-aware chunking
Chunk according to:
- sections
- markdown blocks
- HTML
- code functions/classes
- legal clauses
- transcript speaker turns
This is often superior.
e) Agentic chunking (advanced)
LLM decides chunk boundaries dynamically.
Expensive but powerful.
Important chunking principles
Chunk size tradeoff
Small chunks:
- ✅ precise retrieval
- ❌ may lose context
Large chunks:
- ✅ richer context
- ❌ noisy retrieval
Typical ranges:
| Use Case | Chunk Size (tokens) |
|---|---|
| FAQ | 200–400 |
| Technical docs | 400–800 |
| Legal | 800–1500 |
| Code | function/class based |
Overlap
Overlap helps preserve continuity.
Example:
```
Chunk 1: sentences A, B, C
Chunk 2: sentences C, D, E
```
Typical: 10–20% overlap.
Too much overlap:
- duplicates results
- wastes tokens
- hurts retrieval diversity
3. Rich Metadata
Metadata is a SUPERPOWER in RAG.
Most beginners ignore it.
What is metadata?
Extra information attached to chunks.
Example:
```json
{
  "source": "employee_handbook.pdf",
  "department": "HR",
  "section": "Leave Policy",
  "page": 14,
  "date": "2026-01-01",
  "access_level": "internal"
}
```
Why metadata matters
It enables:
- filtering
- routing
- security
- freshness
- hybrid search
- ranking
- citations
Metadata examples
| Metadata | Usage |
|---|---|
| source | citations |
| author | attribution |
| timestamp | freshness |
| department | filtering |
| language | multilingual routing |
| document_type | retrieval specialization |
| permissions | security |
Powerful use cases
a) Time filtering
Example: only retrieve policies updated within the last year.
Critical for enterprise systems.
b) Access control
Users should only be able to retrieve documents they are authorized to see.
c) Multi-tenant RAG
Separate users/organizations using metadata filters.
d) Source-aware reranking
Prefer official docs over chats.
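A minimal sketch of how such filters look in practice. The field names and the in-memory chunk list are illustrative; real vector DBs expose the same idea as a filter or `where` clause on the query.

```python
from datetime import date

chunks = [
    {"text": "Annual leave is 20 days.",  "department": "HR",
     "date": date(2025, 6, 1), "access_level": "internal"},
    {"text": "Old 2019 leave policy ...", "department": "HR",
     "date": date(2019, 1, 1), "access_level": "internal"},
    {"text": "Quarterly revenue figures", "department": "Finance",
     "date": date(2025, 9, 1), "access_level": "restricted"},
]

def filter_chunks(chunks, department=None, newer_than=None,
                  allowed_levels=("public", "internal")):
    """Apply metadata constraints first; similarity search then runs only on survivors."""
    out = []
    for c in chunks:
        if department and c["department"] != department:
            continue                       # department routing / multi-tenant isolation
        if newer_than and c["date"] < newer_than:
            continue                       # time filtering
        if c["access_level"] not in allowed_levels:
            continue                       # access control
        out.append(c)
    return out

candidates = filter_chunks(chunks, department="HR", newer_than=date(2025, 1, 1))
print([c["text"] for c in candidates])     # only the current HR policy survives
```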
4. Hybrid Retrieval
One of the BIGGEST upgrades over naive vector search.
Problem with pure vector search
Embeddings are semantic.
They may fail for:
- exact keywords
- IDs
- codes
- version numbers
- acronyms
- error messages
Example: a query for ERR_CONN_RESET can fail badly with embedding search alone.
Hybrid retrieval combines:
- semantic search (using embeddings)
- lexical/BM25 keyword search (using exact terms)
Typical architecture
```
User Query
    ↓
Vector Search  +  BM25 Search
    ↓
Merged Results
```
Why hybrid works so well
For a query about "vacation policy", semantic search finds conceptually related passages, while BM25 matches the exact heading "Annual Leave Policy".
Combined = better recall.
Common hybrid techniques
| Technique | Description |
|---|---|
| BM25 + Vector | most common |
| Reciprocal Rank Fusion (RRF) | merge rankings |
| Weighted fusion | weighted scores |
| Multi-query retrieval | multiple reformulated queries |
| Query expansion | synonyms/related terms |
Modern production systems almost always use hybrid retrieval.
Especially enterprise search.
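As a concrete example of the fusion step, here is a small sketch of Reciprocal Rank Fusion. It assumes you already have two ranked lists of doc IDs (one from BM25, one from vector search); k=60 is a commonly used constant, and the doc IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results   = ["doc7", "doc2", "doc9"]   # exact-keyword hits
vector_results = ["doc2", "doc4", "doc7"]   # semantic hits
print(reciprocal_rank_fusion([bm25_results, vector_results]))
# doc2 and doc7 rise to the top because both retrievers agree on them
```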
5. Reranking
Reranking is one of the HIGHEST ROI improvements in RAG.
Problem
Initial retrieval is approximate.
Top-10 retrieved chunks often contain noise.
Reranking step
Pipeline:
```
Query
  ↓
Retrieve top 50
  ↓
Reranker scores relevance
  ↓
Keep top 5
  ↓
Send to LLM
```
Why rerankers are powerful
Embeddings score the query and each chunk independently.
Rerankers score the (query, chunk) pair jointly.
This gives MUCH better relevance.
Popular rerankers
| Model | Notes |
|---|---|
| BGE Reranker | strong open-source |
| Cohere Rerank | very popular API |
| Jina Reranker | lightweight |
| Cross-Encoder models | classic approach |
Cross-encoder concept
Instead of:
- embed(query)
- embed(chunk)
- cosine similarity
the model directly answers: "How relevant is this chunk to this query?"
Much more accurate.
Cost tradeoff
Rerankers are slower than embeddings.
So the standard pattern is: retrieve many → rerank few.
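A sketch of that pattern using the CrossEncoder class from the sentence-transformers library; the BGE model name is one common choice, so treat exact model names and versions as assumptions.

```python
# Retrieve broadly, then rerank with a cross-encoder and keep only the best few.
# Assumes the `sentence-transformers` package is installed.
from sentence_transformers import CrossEncoder

query = "How many days of annual leave do employees get?"
candidates = [
    "Annual leave is 20 days per year for full-time employees.",
    "Kubernetes pods restart automatically on failure.",
    "Sick leave requires a doctor's note after three days.",
]  # in practice: the top-50 chunks from hybrid retrieval

reranker = CrossEncoder("BAAI/bge-reranker-base")
scores = reranker.predict([(query, c) for c in candidates])  # joint (query, chunk) scoring

top_k = 2
reranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)[:top_k]
for score, chunk in reranked:
    print(f"{score:.3f}  {chunk}")
```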
6. Incremental Freshness
A huge production concern.
Problem
Documents change continuously.
Examples:
- policies updated
- tickets added
- wikis edited
- repos changed
Naive systems re-embed EVERYTHING, which is expensive.
Incremental ingestion
Only process changed documents.
Pipeline:
```
Detect change → parse → chunk → embed → update index
```
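A minimal sketch of the "detect change" step using content hashes. The helper names and the in-memory index are illustrative; a real system would persist the hashes and upsert into a vector DB.

```python
import hashlib

known_hashes: dict[str, str] = {}   # doc_id -> content hash (persisted in a real system)
index: dict[str, str] = {}          # stand-in for the vector index

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def ingest(docs: dict[str, str]) -> None:
    """Only re-chunk/re-embed documents whose content actually changed."""
    for doc_id, text in docs.items():
        h = content_hash(text)
        if known_hashes.get(doc_id) == h:
            continue                        # unchanged: skip expensive re-embedding
        # changed or new: parse -> chunk -> embed -> upsert (elided here)
        index[doc_id] = text
        known_hashes[doc_id] = h

ingest({"hr_policy": "Annual leave is 20 days."})
ingest({"hr_policy": "Annual leave is 20 days."})   # no-op: hash unchanged
ingest({"hr_policy": "Annual leave is 25 days."})   # re-ingested: content changed
```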
Important concepts
a) Delta updates
Update only changed chunks.
b) Versioning
Track document versions.
Useful for:
- rollback
- auditing
- time-travel queries
c) Streaming ingestion
Real-time updates from:
- Kafka
- CDC pipelines
- webhooks
- event systems
d) Freshness ranking
Prefer newer documents.
Especially for:
- news
- support systems
- operational knowledge
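One common way to implement this is a recency boost layered on top of the relevance score. The half-life and the 70/30 blend below are illustrative choices, not a standard.

```python
import math
from datetime import date

def freshness_boosted_score(relevance: float, doc_date: date,
                            today: date, half_life_days: float = 180.0) -> float:
    """Decay older documents: a doc loses half its recency boost every half_life_days."""
    age = (today - doc_date).days
    recency = math.exp(-math.log(2) * age / half_life_days)
    return relevance * (0.7 + 0.3 * recency)   # illustrative relevance/recency blend

print(freshness_boosted_score(0.9, date(2025, 1, 1), date(2026, 1, 1)))  # recent doc
print(freshness_boosted_score(0.9, date(2020, 1, 1), date(2026, 1, 1)))  # stale doc
```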
Enterprise challenge
Freshness vs stability.
Too-frequent updates may:
- create embedding drift
- destabilize retrieval
- increase costs
7. Strong Evaluation Loops
This separates real systems from demos.
Core problem
RAG quality is HARD to judge manually.
You need systematic evaluation.
What should be evaluated?
| Area | Example |
|---|---|
| Retrieval quality | Did we retrieve correct chunks? |
| Groundedness | Is answer supported by context? |
| Hallucination | Did model invent facts? |
| Latency | Response speed |
| Cost | Token + embedding cost |
| Citation accuracy | Correct references? |
Retrieval metrics
Recall@K: did a relevant chunk appear in the top K? (e.g., does the top 5 contain the answer?)
MRR (Mean Reciprocal Rank): how early the first correct chunk appears.
NDCG: ranking-quality metric, very common in search systems.
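A small sketch of Recall@K (in its binary "did a relevant chunk appear" form, sometimes called hit rate) and MRR over a labelled evaluation set. The chunk IDs and rankings are made up.

```python
def hit_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """1.0 if any relevant chunk appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant result (0.0 if none was retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Tiny labelled eval set: (retrieved ranking, ground-truth relevant chunk IDs) per query.
eval_set = [
    (["c3", "c7", "c1"], {"c7"}),
    (["c2", "c9", "c4"], {"c5"}),   # miss: relevant chunk never retrieved
]

recall_5 = sum(hit_at_k(r, rel, k=5) for r, rel in eval_set) / len(eval_set)
mrr      = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
print(f"Recall@5 = {recall_5:.2f}, MRR = {mrr:.2f}")   # 0.50 and 0.25 here
```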
Generation evaluation
Faithfulness: is the answer grounded in the retrieved docs?
Answer relevance: did the answer actually address the user's query?
Context precision: were the retrieved chunks useful or noisy?
Modern evaluation methods
a) Human evaluation
Best quality.
But expensive.
b) LLM-as-a-judge
Use another LLM to evaluate outputs.
Very popular now.
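A sketch of what a faithfulness judge can look like. The `call_llm()` stub stands in for whatever model client you use, and the rubric and PASS/FAIL output format are assumptions, not a standard.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM client: replace with your provider's API call."""
    return "PASS"   # stubbed so the sketch runs; a real judge model goes here

JUDGE_PROMPT = """You are grading a RAG answer.
Context:
{context}

Question: {question}
Answer: {answer}

Is every claim in the answer supported by the context?
Reply with a single word: PASS or FAIL."""

def judge_faithfulness(context: str, question: str, answer: str) -> bool:
    verdict = call_llm(JUDGE_PROMPT.format(context=context, question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")

print(judge_faithfulness("Annual leave is 20 days.",
                         "How much annual leave do I get?",
                         "You get 20 days of annual leave."))
```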
c) Synthetic test generation
Generate QA pairs automatically from docs.
Useful for benchmarking.
Continuous improvement loop
Modern RAG systems usually evolve like this:
```
Logs → failures → evaluation → retriever tuning → chunk tuning → reranker tuning → prompt tuning → re-evaluate
```
The BIG Picture
Modern RAG is gradually becoming Search Engineering + Knowledge Engineering + LLM Orchestration + Evaluation Science, not just embeddings.
Typical Mature RAG Architecture
```
Documents
  ↓
Structure-aware parsing
  ↓
Smart chunking
  ↓
Metadata enrichment
  ↓
Embedding generation
  ↓
Hybrid indexing
  ↓
Retrieval
  ↓
Reranking
  ↓
LLM generation
  ↓
Evaluation + feedback loop
```
Relative Impact (practical experience)
Approximate real-world impact:
| Technique | Impact |
|---|---|
| Better chunking | VERY HIGH |
| Hybrid retrieval | VERY HIGH |
| Reranking | VERY HIGH |
| Metadata filtering | HIGH |
| Structure preservation | HIGH |
| Evaluation loops | CRITICAL long-term |
| Incremental freshness | CRITICAL in production |
Important insight
Most RAG failures are NOT because:
- the embedding model is weak
- the vector DB is weak
- the LLM is weak
They are usually caused by:
- poor chunking
- missing metadata
- weak retrieval
- no reranking
- bad ingestion
- no evaluation
Those are the real bottlenecks.