rschandrastechblog: Chunking Algorithms

Thursday, May 7, 2026

Chunking Algorithms

Category	Description	Pros	Cons	Best For	Technique	Common Name	Libraries / Implementations
Fixed-Size Chunking	Splits text into chunks of fixed character, word, or token count with optional overlap	• Simple & fast • Predictable size • Easy to implement	• Breaks mid-sentence • Loses boundary context	Uniform text, quick baselines	Fixed Token/Character Splitter	Fixed-Size / CharacterTextSplitter	LangChain (`CharacterTextSplitter`, `TokenTextSplitter`)
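Recursive Chunking	Splits hierarchically, trying separators in priority order (paragraph, line, word, character) until each chunk fits the size limit	• Respects structure • Good balance • Reliable	• Can still cut mid-sentence when no coarser separator fits the limit	General-purpose RAG (recommended default)	RecursiveCharacterTextSplitter	RecursiveCharacterTextSplitter	LangChain (`RecursiveCharacterTextSplitter`, most popular), LlamaIndex (via its `LangchainNodeParser` wrapper)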
Sentence / Paragraph-Based	Splits or groups at natural sentence or paragraph boundaries	• High semantic coherence • Readable chunks	• Variable chunk sizes	Narrative, articles, books	Sentence Tokenizer, Paragraph Splitter	SentenceSplitter / Sentence-Group	LlamaIndex (`SentenceSplitter`), NLTK, spaCy, LangChain
Structure / Document-Based	Splits along the document's own hierarchy (headings, pages, tags)	• Logical units • Rich metadata	• Needs structured docs	PDFs, manuals, reports, Markdown	By-Title, Page-Level, HTML/Markdown Splitter	MarkdownHeaderTextSplitter, PageSplitter	LangChain (`MarkdownHeaderTextSplitter`, `HTMLHeaderTextSplitter`), LlamaIndex (`MarkdownNodeParser`), Unstructured (`chunk_by_title`)
Sliding Window / Overlapping	Moves a window across text with overlap	• Strong context continuity • Better recall	• Higher storage & redundancy	Cross-boundary context needs	Token/Character Sliding Window	Sliding Window Chunking	Custom Python, LangChain, LlamaIndex
Overlapping Sentence-Group Chunking	Groups sentences into clusters then slides with 1+ sentence overlap from previous cluster	• Coherence + continuity • Reduces boundary loss	• Moderate complexity & redundancy	Technical & narrative docs with transitions	Sentence-Group Sliding Window	Overlapping Sentence Group Chunking (SGC + Overlap)	Custom implementation (LangChain + spaCy/NLTK)
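Semantic Chunking	Groups adjacent sentences by embedding similarity; splits where similarity drops (a semantic shift)	• Highly meaningful chunks • High relevance	• Expensive & slower (needs an embedding call per sentence or group)	Technical, complex, domain-specific content	Semantic Similarity Chunking	SemanticChunker	LangChain (`SemanticChunker`, in `langchain_experimental`), LlamaIndex (`SemanticSplitterNodeParser`), Chonkie (`SemanticChunker`)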
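Hierarchical / Parent-Child	Multi-level chunks: small child chunks are indexed for retrieval, and the larger parent chunk is returned for context	• Precision + context balance	• Complex indexing	Advanced RAG, long docs, reasoning	Parent-Child Chunking	Parent-Child / Small-to-Big Retrieval	LlamaIndex (`HierarchicalNodeParser` with `AutoMergingRetriever`; most mature), LangChain (`ParentDocumentRetriever`)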
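Late Chunking	Runs the full document through a long-context embedding model first, then pools the token embeddings inside each chunk's span into one vector per chunk	• Excellent global context preservation	• High memory during embedding • Document must fit the embedder's context window	Long coherent documents, high accuracy	Late / Embed-then-Split Chunking	Late Chunking	Custom (with long-context embedders), emerging in LlamaIndex/LangChain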
Adaptive / Dynamic	Adjusts chunk size based on content density or complexity	• Optimized for varied content	• Harder to tune	Heterogeneous document collections	Adaptive / Complexity-Aware Chunking	Adaptive Chunking	Custom, some experimental in LangChain
Contextual / Enrichment	Attaches surrounding or document-level context (e.g. a title or short summary) to each chunk before embedding	• Fewer out-of-context retrievals, which can reduce hallucinations	• Higher token usage	High-faithfulness question answering	Contextual Chunking	Contextual Retrieval / Enrichment	LangChain, LlamaIndex (Node parsers)
LLM-Driven / Agentic	LLM decides chunk boundaries	• Most intelligent & domain-aware	• Slow & expensive	High-value, smaller datasets	LLM-Based / Proposition Chunking	LLM Chunker / Agentic Chunking	LangChain, custom LLM calls
Hybrid / Multi-Method	Combines two or more strategies	• Often the best overall performance once tuned	• Complex to maintain	Production RAG systems	Recursive + Semantic, Structure + Overlap etc.	Hybrid Chunking	LangChain + LlamaIndex combinations, Chonkie
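
To make the first rows of the table concrete, here is a minimal standard-library sketch of fixed-size chunking with overlap (a character sliding window) and of overlapping sentence-group chunking. The function names, the default sizes, and the regex-based sentence splitter are assumptions made for illustration; they are not the API of LangChain, LlamaIndex, or any other library, and a real pipeline would more likely use NLTK or spaCy for sentence detection.

```python
import re


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Character-based fixed-size chunks; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # window already reached the end
            break
    return chunks


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_group_chunks(text: str, group_size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences, repeating the last `overlap` sentences in the next group."""
    if group_size <= 0 or not 0 <= overlap < group_size:
        raise ValueError("need group_size > 0 and 0 <= overlap < group_size")
    sentences = split_sentences(text)
    step = group_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + group_size]))
        if start + group_size >= len(sentences):
            break
    return chunks


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", size=4, overlap=1))
    # ['abcd', 'defg', 'ghij']
    print(sentence_group_chunks("A one. B two. C three. D four. E five."))
    # ['A one. B two. C three.', 'C three. D four. E five.']
```

The fixed-size version shows the "breaks mid-sentence" drawback directly, since it cuts at character offsets regardless of content. The sentence-group version trades predictable chunk size for boundaries that always fall between sentences.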

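The recursive strategy in the table can be sketched in the same spirit. The version below is a simplified illustration of the idea (try the coarsest separator first, fall back to finer ones only for pieces that are still too long), not a reimplementation of LangChain's `RecursiveCharacterTextSplitter`; the function name, the separator list, and the lack of overlap handling are all assumptions of this sketch.

```python
def recursive_chunks(text: str, max_len: int,
                     separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones as needed."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character offsets.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""  # pieces merged so far that still fit within max_len
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_chunks(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(recursive_chunks("aaa bbb ccc\n\nddd eee", max_len=7))
    # ['aaa bbb', 'ccc', 'ddd eee']
```

In the example, the first paragraph is too long for the limit and is split on spaces, while the second paragraph fits and is kept whole. That is the "respects structure" property from the table: finer separators are used only where coarser ones cannot satisfy the size limit.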