| Category | Description | Pros | Cons | Best For | Algorithms / Techniques | Name of Algorithm | Specific Libraries / Implementations |
|---|---|---|---|---|---|---|---|
| Fixed-Size Chunking | Splits text into chunks of fixed character, word, or token count with optional overlap | • Simple & fast • Predictable size • Easy to implement | • Breaks mid-sentence • Loses boundary context | Uniform text, quick baselines | Fixed Token/Character Splitter | Fixed-Size / CharacterTextSplitter | LangChain (CharacterTextSplitter, TokenTextSplitter) |
| Recursive Chunking | Hierarchically splits using separator priority until size limit | • Respects structure • Good balance • Reliable | • Can create suboptimal splits | General-purpose RAG (recommended default) | RecursiveCharacterTextSplitter | RecursiveCharacterTextSplitter | LangChain (most popular), LlamaIndex |
| Sentence / Paragraph-Based | Splits or groups at natural sentence or paragraph boundaries | • High semantic coherence • Readable chunks | • Variable chunk sizes | Narrative, articles, books | Sentence Tokenizer, Paragraph Splitter | SentenceSplitter / Sentence-Group | LlamaIndex (SentenceSplitter), NLTK, spaCy, LangChain |
| Structure / Document-Based | Uses document hierarchy (headings, pages, tags) | • Logical units • Rich metadata | • Needs structured docs | PDFs, manuals, reports, Markdown | By-Title, Page-Level, HTML/Markdown Splitter | MarkdownHeaderTextSplitter, PageSplitter | LangChain (MarkdownHeaderTextSplitter), LlamaIndex, Unstructured |
| Sliding Window / Overlapping | Moves a window across text with overlap | • Strong context continuity • Better recall | • Higher storage & redundancy | Cross-boundary context needs | Token/Character Sliding Window | Sliding Window Chunking | Custom Python, LangChain, LlamaIndex |
| Overlapping Sentence-Group Chunking | Groups sentences into clusters then slides with 1+ sentence overlap from previous cluster | • Coherence + continuity • Reduces boundary loss | • Moderate complexity & redundancy | Technical & narrative docs with transitions | Sentence-Group Sliding Window | Overlapping Sentence Group Chunking (SGC + Overlap) | Custom implementation (LangChain + spaCy/NLTK) |
| Semantic Chunking | Groups based on embedding similarity; splits on semantic shifts | • Highly meaningful chunks • High relevance | • Expensive & slower | Technical, complex, domain-specific content | Semantic Similarity Chunking | SemanticChunker | LangChain (SemanticChunker), LlamaIndex, Chonkie |
| Hierarchical / Parent-Child | Multi-level chunks (small for retrieval, large for context) | • Precision + context balance | • Complex indexing | Advanced RAG, long docs, reasoning | Parent-Child Chunking | Parent-Child / Small-to-Big Retrieval | LlamaIndex (most mature), LangChain |
| Late Chunking | Embeds full document first, then chunks | • Excellent global context preservation | • High memory during embedding | Long coherent documents, high accuracy | Late / Embed-then-Split Chunking | Late Chunking | Custom (with long-context embedders), emerging in LlamaIndex/LangChain |
| Adaptive / Dynamic | Adjusts chunk size based on content density or complexity | • Optimized for varied content | • Harder to tune | Heterogeneous document collections | Adaptive / Complexity-Aware Chunking | Adaptive Chunking | Custom, some experimental in LangChain |
| Contextual / Enrichment | Adds surrounding or global context to chunks | • Reduces hallucinations | • Higher token usage | High-faithfulness question answering | Contextual Chunking | Contextual Retrieval / Enrichment | LangChain, LlamaIndex (Node parsers) |
| LLM-Driven / Agentic | LLM decides chunk boundaries | • Most intelligent & domain-aware | • Slow & expensive | High-value, smaller datasets | LLM-Based / Proposition Chunking | LLM Chunker / Agentic Chunking | LangChain, custom LLM calls |
| Hybrid / Multi-Method | Combines two or more strategies | • Can outperform any single strategy when tuned | • Complex to maintain | Production RAG systems | Recursive + Semantic, Structure + Overlap etc. | Hybrid Chunking | LangChain + LlamaIndex combinations, Chonkie |
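
To make two of the rows above concrete, here is a minimal sketch of fixed-size chunking with overlap and of overlapping sentence-group chunking, using only the Python standard library. The function names (`fixed_size_chunks`, `split_sentences`, `sentence_group_chunks`) and their default parameters are illustrative assumptions, not the API of LangChain, LlamaIndex, or any other library in the table. The regex sentence splitter is deliberately naive; a real pipeline would more likely use NLTK or spaCy for that step.

```python
import re


def fixed_size_chunks(text, chunk_size=200, overlap=50):
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_group_chunks(text, group_size=3, overlap=1):
    """Group sentences, repeating the last `overlap` sentences in the next group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if not 0 <= overlap < group_size:
        raise ValueError("overlap must satisfy 0 <= overlap < group_size")
    sentences = split_sentences(text)
    step = group_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + group_size]))
        if start + group_size >= len(sentences):
            break  # the final group already includes the last sentence
    return chunks


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2))
    print(sentence_group_chunks("A one. B two. C three. D four. E five."))
```

The fixed-size version can cut a word or sentence in half, which is the "breaks mid-sentence" drawback listed in the table. The sentence-group version avoids that at the cost of variable chunk length and some repeated text.
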
Thursday, May 7, 2026