Thursday, May 7, 2026

Chunking Algorithms

| Category | Description | Pros | Cons | Best For | Algorithms / Techniques | Name of Algorithm | Specific Libraries / Implementations |
|---|---|---|---|---|---|---|---|
| Fixed-Size Chunking | Splits text into chunks of fixed character, word, or token count with optional overlap | Simple & fast; Predictable size; Easy to implement | Breaks mid-sentence; Loses boundary context | Uniform text, quick baselines | Fixed Token/Character Splitter | Fixed-Size / CharacterTextSplitter | LangChain (CharacterTextSplitter, TokenTextSplitter) |
| Recursive Chunking | Hierarchically splits using separator priority until size limit | Respects structure; Good balance; Reliable | Can create suboptimal splits | General-purpose RAG (recommended default) | RecursiveCharacterTextSplitter | RecursiveCharacterTextSplitter | LangChain (most popular), LlamaIndex |
| Sentence / Paragraph-Based | Splits or groups at natural sentence or paragraph boundaries | High semantic coherence; Readable chunks | Variable chunk sizes | Narrative, articles, books | Sentence Tokenizer, Paragraph Splitter | SentenceSplitter / Sentence-Group | LlamaIndex (SentenceSplitter), NLTK, spaCy, LangChain |
| Structure / Document-Based | Uses document hierarchy (headings, pages, tags) | Logical units; Rich metadata | Needs structured docs | PDFs, manuals, reports, Markdown | By-Title, Page-Level, HTML/Markdown Splitter | MarkdownHeaderTextSplitter, PageSplitter | LangChain (MarkdownHeaderTextSplitter), LlamaIndex, Unstructured |
| Sliding Window / Overlapping | Moves a window across text with overlap | Strong context continuity; Better recall | Higher storage & redundancy | Cross-boundary context needs | Token/Character Sliding Window | Sliding Window Chunking | Custom Python, LangChain, LlamaIndex |
| Overlapping Sentence-Group Chunking | Groups sentences into clusters then slides with 1+ sentence overlap from previous cluster | Coherence + continuity; Reduces boundary loss | Moderate complexity & redundancy | Technical & narrative docs with transitions | Sentence-Group Sliding Window | Overlapping Sentence Group Chunking (SGC + Overlap) | Custom implementation (LangChain + spaCy/NLTK) |
| Semantic Chunking | Groups based on embedding similarity; splits on semantic shifts | Highly meaningful chunks; High relevance | Expensive & slower | Technical, complex, domain-specific content | Semantic Similarity Chunking | SemanticChunker | LangChain (SemanticChunker), LlamaIndex, Chonkie |
| Hierarchical / Parent-Child | Multi-level chunks (small for retrieval, large for context) | Precision + context balance | Complex indexing | Advanced RAG, long docs, reasoning | Parent-Child Chunking | Parent-Child / Small-to-Big Retrieval | LlamaIndex (most mature), LangChain |
| Late Chunking | Embeds full document first, then chunks | Excellent global context preservation | High memory during embedding | Long coherent documents, high accuracy | Late / Embed-then-Split Chunking | Late Chunking | Custom (with long-context embedders), emerging in LlamaIndex/LangChain |
| Adaptive / Dynamic | Adjusts chunk size based on content density or complexity | Optimized for varied content | Harder to tune | Heterogeneous document collections | Adaptive / Complexity-Aware Chunking | Adaptive Chunking | Custom, some experimental in LangChain |
| Contextual / Enrichment | Adds surrounding or global context to chunks | Reduces hallucinations | Higher token usage | High-faithfulness question answering | Contextual Chunking | Contextual Retrieval / Enrichment | LangChain, LlamaIndex (Node parsers) |
| LLM-Driven / Agentic | LLM decides chunk boundaries | Most intelligent & domain-aware | Slow & expensive | High-value, smaller datasets | LLM-Based / Proposition Chunking | LLM Chunker / Agentic Chunking | LangChain, custom LLM calls |
| Hybrid / Multi-Method | Combines two or more strategies | Best overall performance | Complex to maintain | Production RAG systems | Recursive + Semantic, Structure + Overlap, etc. | Hybrid Chunking | LangChain + LlamaIndex combinations, Chonkie |
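
The Fixed-Size, Sliding Window, and Recursive entries above can be sketched in a few lines of plain Python. This is a minimal illustration, not the LangChain or LlamaIndex implementation: the function names `fixed_size_chunks` and `recursive_chunks`, the character-based sizing, and the default separator list are all assumptions made for this example.

```python
from typing import List, Sequence


def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> List[str]:
    """Cut text into character windows of `size`; overlap > 0 makes it a sliding window."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def recursive_chunks(
    text: str,
    max_len: int,
    separators: Sequence[str] = ("\n\n", "\n", ". ", " "),  # assumed priority order
) -> List[str]:
    """Split on the coarsest separator first, then recurse on pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return fixed_size_chunks(text, max_len)
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_chunks(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(fixed_size_chunks("abcdefghij", size=4, overlap=2))
    sample = ("Para one is short.\n\n"
              "Para two is a bit longer than the limit. It has two sentences.")
    print(recursive_chunks(sample, max_len=40))
```

With `overlap=0` the first function is plain fixed-size chunking, and a positive overlap turns it into the sliding window from the table. The recursive sketch drops the separator it splits on and measures size in characters; a production splitter would more likely count tokens and may keep the separators.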

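The Overlapping Sentence-Group and Semantic Chunking entries can be sketched the same way. Everything named here is hypothetical: the regex sentence splitter is deliberately naive (spaCy or NLTK would be the realistic choice), and `bag_of_words` is only a toy stand-in for a real embedding model so that the example runs without dependencies. The `threshold` value is likewise an assumption, not a recommended setting.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_group_chunks(text: str, group_size: int = 3, overlap: int = 1) -> List[str]:
    """Group sentences, repeating the last `overlap` sentences of a group in the next one."""
    if not 0 <= overlap < group_size:
        raise ValueError("overlap must satisfy 0 <= overlap < group_size")
    sentences = split_sentences(text)
    step = group_size - overlap
    chunks: List[str] = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + group_size]))
        if start + group_size >= len(sentences):
            break  # the last group already covers the final sentence
    return chunks


def bag_of_words(sentence: str) -> Dict[str, float]:
    """Toy 'embedding' (word counts); swap in a real embedding model in practice."""
    return dict(Counter(re.findall(r"[a-z0-9]+", sentence.lower())))


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[str], Dict[str, float]] = bag_of_words,
    threshold: float = 0.2,  # assumed cut-off for a "semantic shift"
) -> List[str]:
    """Start a new chunk whenever adjacent sentences fall below `threshold` similarity."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    groups = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev, vec) < threshold:
            groups.append([])  # similarity dropped: open a new chunk
        groups[-1].append(sentence)
        prev = vec
    return [" ".join(group) for group in groups]


if __name__ == "__main__":
    print(sentence_group_chunks("A one. B two. C three. D four. E five."))
    print(semantic_chunks("The cat sat on the mat. The cat chased the mouse. "
                          "Stock prices fell sharply. Stock markets closed lower."))
```

Comparing only adjacent sentences keeps the sketch short. It also shows why the table lists semantic chunking as expensive: every sentence needs its own embedding call before any boundary can be chosen.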