Thursday, May 7, 2026

Normalization Algorithms in Machine Learning





1. Feature Scaling (Traditional ML)

Technique | Description | Formula / Key Point | Best Used When
Min-Max Normalization | Scales data to [0, 1] or [a, b] | X' = (X - min) / (max - min) | Bounded data, Neural Networks
Standardization (Z-score) | Mean = 0, Std = 1 | X' = (X - μ) / σ | Gaussian-like data, Linear models
Robust Scaling | Uses median & IQR (robust to outliers) | X' = (X - median) / IQR | Data with outliers
MaxAbs Scaling | Scales by maximum absolute value | X' = X / max(|X|) | Sparse data
Mean Normalization | Centers around zero | X' = (X - mean) / (max - min) | Less common
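
A minimal NumPy sketch of the first three scalers above (the sample column x and its outlier value are illustrative; scikit-learn's MinMaxScaler, StandardScaler, and RobustScaler provide equivalent, production-ready implementations):

import numpy as np

x = np.array([12.0, 15.0, 14.0, 10.0, 100.0])   # one feature column; 100 is an outlier

# Min-Max: rescale to [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, std 1
x_zscore = (x - x.mean()) / x.std()

# Robust scaling: median and IQR, far less sensitive to the outlier
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)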

2. Normalization for Vectors / Features

Technique | Description | Formula | Use Case
L2 Normalization (Euclidean) | Most common vector normalization | X' = X / ||X||₂ | Distance-based algorithms, Neural Networks
L1 Normalization (Manhattan) | Sum of absolute values = 1 | X' = X / ||X||₁ | Sparse data, Feature importance
Max Normalization | Divide by maximum absolute value in vector | X' = X / max(|X|) | Simple scaling of feature vectors

3. Deep Learning Normalization Layers

Layer | Year | Key Idea | Main Advantage | Common Use Cases
Batch Normalization (BatchNorm) | 2015 | Normalize across batch dimension | Accelerates training | CNNs (ResNet, etc.)
Layer Normalization (LayerNorm) | 2016 | Normalize across features (per sample) | Works with variable batch sizes | Transformers
Instance Normalization | 2016 | Normalize per sample per channel | Style transfer | StyleGAN, artistic tasks
Group Normalization | 2018 | Normalize within groups of channels | Good for small batch sizes | Object detection
RMS Normalization (RMSNorm) | 2019 | Normalize by Root Mean Square | Simpler & faster | Modern LLMs (Llama, etc.)
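
The per-sample arithmetic behind LayerNorm and RMSNorm is small enough to sketch directly in NumPy (eps and the scalar gain gamma are simplifying assumptions; real layers learn a per-feature gain and, for LayerNorm, a bias):

import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize each sample across its feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma=1.0, eps=1e-5):
    # RMSNorm skips mean-centering and divides by the root mean square only
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms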

4. Other Specialized Normalization Techniques

  • Quantile Normalization — Makes distributions identical across samples (popular in bioinformatics)
  • Local Response Normalization (LRN) — Used in early CNNs like AlexNet
  • Weight Normalization — Reparameterizes weights instead of activations
  • Spectral Normalization — Constrains weight matrices for stable GAN training
  • Batch Renormalization — Improved and more stable version of BatchNorm
  • Filter Response Normalization (FRN) — Batch-independent normalization
  • Power Transform (Yeo-Johnson / Box-Cox) — Makes data more Gaussian-like
  • Contrast Normalization — Used in computer vision preprocessing

5. Quick Recommendation Guide

Scenario | Recommended Technique
Classical ML (SVM, KNN, etc.) | Standardization or Robust Scaling
Neural Networks (small batch) | LayerNorm / GroupNorm
Large batch CNNs | BatchNorm
Transformers / Large Language Models | RMSNorm or LayerNorm
Data with outliers | Robust Scaling
Images (style-related) | Instance Normalization

L1, L2 and L-Inf Normalizations

Case 1 : L2 (Euclidean) normalization of (2,3) and (3,2)

Euclidean L2 normalization scales a vector so that its total length (magnitude) equals 1, effectively stripping away the "size" of the data while preserving its direction.

Hence, the normalized values of (2, 3) and (3, 2) are not the same. They point in different directions, and their normalized coordinates reflect that.

1. Calculate vector magnitudes

To normalize a vector, you first find its L2 norm (Euclidean distance from the origin) using the formula:

||v||₂ = √(Σ xi²)

For the example (2,3) and (3,2):

Vector A (2, 3):
√(2² + 3²) = √(4 + 9) = √13 ≈ 3.606

Vector B (3, 2):
√(3² + 2²) = √(9 + 4) = √13 ≈ 3.606

2. Divide by magnitude

You then divide each component of the original vector by this magnitude:

Normalized A:
(2/√13, 3/√13) ≈ (0.555, 0.832)

Normalized B:
(3/√13, 2/√13) ≈ (0.832, 0.555)

As you can see, the values are swapped, not identical. They represent distinct points on a unit circle.

Graph image

What does it represent?

It represents the orientation or relative proportions of the components.

In the vector (2, 3), the second component is 1.5 times the first. Normalization keeps this ratio intact while forcing the vector to sit exactly 1 unit away from the origin.

It effectively says:

“I don't care how much of this stuff we have in total; I only care about the mix or the direction.”

What are the uses?

Cosine Similarity: In Machine Learning (like recommendation systems), we often care about the angle between vectors rather than their magnitude. If you normalize two vectors, their dot product becomes their Cosine Similarity.

Stable Training: In Deep Learning, normalizing inputs or gradients prevents "exploding" values and helps the model converge faster because every feature is on the same scale (between -1 and 1).

Pattern Recognition: In image processing or computer vision, it helps recognize a shape or pattern regardless of the overall brightness or contrast of the image.

✅ Conclusion

The normalized values for (2, 3) and (3, 2) are distinct: approximately (0.55, 0.83) and (0.83, 0.55) respectively.

Euclidean L2 normalization represents the pure direction of a vector on a unit hypersphere, removing the influence of its magnitude.

The sections below compare this with L1 (Manhattan) normalization, which is often used for probability distributions, and with the L-infinity (max) norm.


Case 2 : L2 (Euclidean) normalization of (2,3) and (10,15)

The L2 normalizations of (2, 3) and (10, 15) are identical.

Since (10, 15) is just a scaled-up version of (2, 3), specifically 5 × (2, 3), they both point in the exact same direction. Normalization strips away that "5x" magnitude, leaving you with the same unit vector.

The Calculation

Vector A (2, 3):

Magnitude:
√(2² + 3²) = √13 ≈ 3.606

Normalized:
(2/3.606, 3/3.606) ≈ (0.555, 0.832)

Vector B (10, 15):

Magnitude:
√(10² + 15²) = √(100 + 225) = √325 ≈ 18.028

Normalized:
(10/18.028, 15/18.028) ≈ (0.555, 0.832)

Why this happens

Normalization treats these two vectors as "the same" because their internal ratio is the same (2:3 = 10:15). In many data science contexts, this is exactly what you want.

For example:

Text Analysis: A short document mentions "Apple" 2 times and "Orange" 3 times. A long document mentions them 10 and 15 times. Normalization tells your algorithm that both documents have the same topic balance, regardless of their length.

Image Processing: A dim pixel (2, 3) and a bright pixel (10, 15) have the same chrominance (color), just different intensities. Normalization lets you process the color while ignoring the lighting.



✅ Conclusion

The L2 normalization for both vectors is approximately (0.555, 0.832).

This confirms that normalization captures the proportional relationship between components rather than their absolute scale.



 L1 Normalization

L1 normalization (also known as the Manhattan or Taxicab norm) scales a vector so that the sum of the absolute values of its components equals 1.

Unlike L2 normalization, which focuses on the straight-line "Euclidean" distance, L1 normalization focuses on the "Taxicab" distance—measuring how far you'd travel if you could only move along grid lines.

1. The Formula

To L1 normalize a vector, you divide each component by the L1 norm (the sum of absolute values):

||v||₁ = Σ |xi|

For the example vectors:

Vector A (2, 3):
|2| + |3| = 5
Normalized: (2/5, 3/5) = (0.4, 0.6)

Vector B (10, 15):
|10| + |15| = 25
Normalized: (10/25, 15/25) = (0.4, 0.6)

As with L2, vectors with the same internal proportions (2:3) result in the same normalized vector.

2. What it represents

L1 normalization represents the contribution share of each element.

If a vector represents counts of different items, the L1 normalized version tells you the percentage or probability of each item occurring relative to the total.

In (0.4, 0.6), the first element represents 40% of the total magnitude and the second represents 60%.

3. Key Uses

Creating Probabilities: It is commonly used to transform a vector of raw scores into a Probability Mass Function (PMF) where all parts sum to 1.

Sparse Modeling (Lasso): In machine learning, L1 regularization (Lasso regression) is famous for its ability to zero out less important features, performing automatic feature selection.

Robustness to Outliers: Because it doesn't square the values (unlike L2), L1 is much less sensitive to extreme outliers. An outlier value of 100 is treated as 100 in L1, but as 10,000 in L2.

Graph image

✅ Conclusion

L1 normalization scales a vector so that the sum of its absolute components is 1.

For both (2, 3) and (10, 15), the L1 normalized vector is (0.4, 0.6).

It is primarily used to represent proportions or to create sparse models that ignore irrelevant data.


L-Inf Normalization

L-Infinity Normalization (also called the Max Norm or Supremum Norm) scales a vector so that its largest component has a magnitude of exactly 1.

Instead of looking at the sum (L1) or the square root of squares (L2), it simply looks for the "peak" value in the set.

1. The Formula

To normalize a vector using L-Infinity, you divide every component by the absolute value of the largest component:

||v||∞ = max(|xi|)

For the example vectors:

Vector A (2, 3): The maximum value is 3.

Normalized: (2/3, 3/3) ≈ (0.67, 1.0)

Vector B (10, 15): The maximum value is 15.

Normalized: (10/15, 15/15) ≈ (0.67, 1.0)

Again, because the proportions are the same, the result is identical.

2. What it represents

L-Infinity normalization represents relative importance compared to the maximum.

It forces the most dominant feature to be the "baseline" (1.0) and scales everything else relative to that peak.

In geometry, if you plot all possible L-Infinity normalized vectors, they form a square (or a hypercube in higher dimensions) rather than a circle or a diamond.

3. Key Uses

Image Processing: It is used to normalize pixel intensities. If you have a dark image, L-infinity normalization scales the brightest pixel to 1.0 (pure white) and stretches all other pixels proportionally, effectively "auto-leveling" the brightness.

Adversarial Machine Learning: In cybersecurity, "L-infinity attacks" are used to test models. An attacker might change every pixel in an image by a tiny, equal amount. L-infinity captures the maximum change made to any single pixel.

Control Systems: It's used when there is a strict limit on a system—for example, if a motor can only handle a maximum of 5 volts, you normalize your control signals so no single output ever exceeds that physical "cap."




✅ Summary Table

Norm | Result for (2, 3) | Key Logic | Best For
L1 (Plots a Diamond) | (0.4, 0.6) | Components sum to 1 | Proportions & Probabilities
L2 (Plots a Circle) | (0.55, 0.83) | Distance to origin is 1 | Directions & Angles
L-inf (Plots a Square) | (0.67, 1.0) | Max component is 1 | Peak values & Constraints
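
All three results in the table can be checked with np.linalg.norm, whose ord argument selects the norm:

import numpy as np

v = np.array([2.0, 3.0])
for name, p in [("L1", 1), ("L2", 2), ("L-inf", np.inf)]:
    print(name, v / np.linalg.norm(v, ord=p))
# prints approximately: [0.4 0.6], [0.5547 0.8321], [0.6667 1.]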

NumPy functions for dot product and cosine similarity

 To calculate dot products and cosine similarity in NumPy, you primarily use np.dot() and np.linalg.norm(). While NumPy has a direct function for the dot product, cosine similarity is typically calculated by combining several operations. 

1. Dot Product
The dot product of two vectors is the sum of the products of their corresponding elements. 
  • np.dot(a, b): The standard function for computing the dot product.
  • a @ b: A more modern and readable operator for matrix multiplication and dot products introduced in Python 3.5.
  • np.inner(a, b): Computes the inner product, which for 1D arrays is identical to the dot product. 
2. Cosine Similarity
NumPy does not have a single cosine_similarity function, so you must implement the formula yourself:

cos(θ) = (A · B) / (||A||₂ × ||B||₂)

Example implementation:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))
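
For instance, the vectors from the normalization examples above give a similarity of 1.0, since (10, 15) points in the same direction as (2, 3):

a = np.array([2, 3])
print(cosine_similarity(a, np.array([10, 15])))   # ≈ 1.0 (same direction)
print(cosine_similarity(a, np.array([3, 2])))     # ≈ 0.923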
Quick Comparison
Metric | NumPy Function(s) | Result Range
Dot Product | np.dot(a, b) or a @ b | -infinity to +infinity
Cosine Similarity | np.dot(a, b) / (norm(a) * norm(b)) | -1 to 1
Note: For a direct, single-function implementation, many developers use Scikit-learn's cosine_similarity function or SciPy's spatial.distance.cosine (which returns cosine distance, i.e. 1 - similarity).

RAG Design Articles on this blog

 

1. Design Considerations for high-quality RAG systems

2. Designing an ingestion pipeline for a RAG

3. Chunking Strategies

4. Retrieval Algorithms in RAG systems

5. Similarity Metrics and Search Algorithms

Designing an ingestion pipeline for a RAG

Area | Key Considerations | Important Techniques / Components | Why It Matters
Data Sources | Source diversity, structured vs unstructured data, data quality | PDF parsers, APIs, DB connectors, OCR, deduplication, corruption handling | Poor input quality directly reduces retrieval accuracy
Parsing & Extraction | Accurate text extraction while preserving structure | PyMuPDF, Unstructured, Docling, OCR, layout-aware parsing, table extraction | Loss of structure destroys semantic meaning
Structure Preservation | Maintain headings, tables, lists, code blocks, hierarchies | Hierarchy extraction, markdown preservation, layout-aware chunking | Improves contextual retrieval and answer grounding
Cleaning & Normalization | Noise reduction, normalization, multilingual handling | Unicode normalization, boilerplate removal, OCR cleanup, language detection | Cleaner text improves embeddings and retrieval quality
Security & Compliance | Sensitive data handling and governance | PII masking, redaction, encryption, ACL tagging, audit trails | Prevents unauthorized retrieval and compliance violations
Chunking Strategy | Chunk size, semantic coherence, overlap tuning | Fixed-size, semantic, recursive, structure-aware, adaptive chunking | One of the biggest determinants of retrieval performance
Metadata Strategy | Rich contextual metadata and filtering support | Source tags, timestamps, hierarchy paths, permissions, version metadata | Enables filtering, routing, freshness, and secure retrieval
Embedding Strategy | Embedding quality, speed, multilingual/domain support | General embeddings, domain embeddings, multilingual embeddings, multi-vector embeddings | Strong embeddings improve semantic matching
Indexing Strategy | Efficient scalable retrieval | FAISS, Qdrant, Pinecone, HNSW, IVF, PQ, BM25, hybrid search | Determines retrieval latency, scalability, and recall
Enrichment | Adding higher-level semantic information | Summarization, keyword extraction, entity extraction, graph construction, classification | Improves advanced retrieval and reasoning capabilities
Knowledge Graph / Graph RAG | Relationship-aware retrieval | Entity graphs, semantic edges, citation graphs | Useful for multi-hop reasoning and connected knowledge
Freshness & Incremental Updates | Continuous ingestion and change tracking | CDC, checksums, timestamps, delta indexing, selective re-embedding | Keeps RAG knowledge current without full rebuilds
Versioning | Handling document evolution | Version history, temporal indexing, embedding refresh policies | Prevents stale or conflicting retrievals
Scalability & Throughput | Large-scale ingestion efficiency | Parallel pipelines, queues, Kafka, batch ingestion, streaming ingestion | Supports enterprise-scale workloads
Reliability & Fault Tolerance | Pipeline robustness | Retries, dead-letter queues, idempotency, monitoring, observability | Prevents silent ingestion failures and duplication
Cost Optimization | Reducing embedding/storage costs | Caching, deduplication, quantization, selective ingestion | Controls operational expenses at scale
Evaluation & Monitoring | Measuring retrieval and ingestion quality | Recall@K, MRR, nDCG, chunk evaluation, embedding drift detection | Ensures pipeline changes do not degrade retrieval quality
Advanced Retrieval Architectures | Multi-level and hierarchical retrieval | Parent-child retrieval, hierarchical retrieval, RAPTOR, recursive summaries | Improves long-document understanding and reasoning
Agentic Ingestion | LLM-assisted ingestion decisions | LLM-based chunking, metadata extraction, summarization, classification | Higher quality ingestion at higher compute cost
Common Mistakes | Design flaws that hurt retrieval | Overlapping too much, ignoring metadata, fixed-only chunking, no hybrid search, stale indexes | These issues commonly reduce production RAG quality
Core Design Principle | RAG quality depends heavily on information architecture | Structure preservation, smart chunking, rich metadata, hybrid retrieval, reranking, freshness | Good ingestion pipelines outperform naive "embed everything" approaches

 




Designing the ingestion pipeline is one of the most important parts of a RAG system because retrieval quality is often limited more by ingestion mistakes than by the LLM itself.

A good ingestion pipeline should optimize for:

  • Retrieval accuracy

  • Freshness

  • Scalability

  • Cost

  • Latency

  • Maintainability

  • Explainability

  • Security/compliance

Below is a structured breakdown of the major considerations.


1. Source Data Considerations

A. Data Sources

Your pipeline may ingest from:

  • PDFs

  • Word docs

  • HTML/websites

  • Wikis

  • Databases

  • APIs

  • Emails

  • Slack/Teams chats

  • Code repositories

  • Logs

  • Images/OCR scans

  • Audio/video transcripts

Each source needs different parsers and cleaning logic.


B. Structured vs Unstructured

Type | Examples | Challenges
Structured | SQL tables, CSV | Schema evolution
Semi-structured | JSON, XML | Nested fields
Unstructured | PDFs, text | Chunking, parsing

C. Data Quality

Bad ingestion = bad retrieval.

Need handling for:

  • Duplicates

  • Corrupted docs

  • OCR errors

  • Encoding issues

  • Boilerplate

  • Missing metadata

  • Empty sections

  • Spam/noise


2. Document Parsing & Extraction

A. Parsing Strategy

Different parsers behave differently.

Examples:

  • Simple text extraction

  • Layout-aware parsing

  • OCR

  • Vision-based parsing

  • Table extraction

Popular tools:

  • PyMuPDF

  • pdfplumber

  • Unstructured

  • Apache Tika

  • LlamaParse

  • Docling

  • OCR engines


B. Preserve Document Structure

Critical for retrieval quality.

Need to preserve:

  • Headings

  • Sections

  • Tables

  • Lists

  • Captions

  • Code blocks

  • Hierarchies

Without structure, semantic meaning is lost.


C. Multimodal Extraction

Modern RAG increasingly needs:

  • Tables

  • Charts

  • Images

  • Diagrams

  • Equations

  • Code snippets

Need strategies like:

  • OCR

  • Caption generation

  • Table serialization

  • Vision embeddings


3. Cleaning & Normalization

A. Text Cleaning

Typical steps:

  • Remove boilerplate

  • Remove headers/footers

  • Normalize whitespace

  • Unicode normalization

  • Remove repeated content

  • Fix OCR artifacts


B. Language Handling

Need support for:

  • Multilingual documents

  • Language detection

  • Translation strategy

  • Locale normalization


C. Sensitive Data Handling

May require:

  • PII masking

  • Redaction

  • Compliance filtering

  • Access control tagging

Especially important for enterprise RAG.


4. Chunking Strategy

Chunking is one of the most important design decisions.


A. Chunk Size Tradeoff

Small chunks:

  • Better precision

  • Worse context continuity

Large chunks:

  • Better context

  • More noise

  • Higher embedding cost


B. Chunking Approaches

Fixed-size Chunking

Simple.

Example:

  • 500 tokens

  • 100 token overlap

Good baseline.
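
A minimal sketch of this baseline, using whitespace tokens (real pipelines count model tokens with a tokenizer; the function name is illustrative):

def fixed_size_chunks(text, chunk_size=500, overlap=100):
    # split into fixed-size windows of tokens with a fixed overlap between neighbours
    tokens = text.split()                  # naive whitespace "tokens"
    step = chunk_size - overlap            # assumes overlap < chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks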


C. Semantic Chunking

Split by:

  • Paragraphs

  • Sections

  • Topic boundaries

  • Heading hierarchy

Better retrieval quality.


D. Recursive Chunking

Hierarchical splitting:

  • Section

  • Subsection

  • Paragraph

  • Sentence

Very common.


E. Structure-aware Chunking

Important for enterprise docs.

Examples:

  • Keep tables intact

  • Keep code blocks intact

  • Preserve markdown sections


F. Adaptive Chunking

Dynamic chunk sizes based on:

  • Content density

  • Topic changes

  • Document type

Advanced systems increasingly use this.


5. Metadata Strategy

Metadata is massively underrated.

Good metadata enables:

  • Filtering

  • Routing

  • Security

  • Hybrid retrieval

  • Ranking


A. Common Metadata

Examples:

  • Source

  • Title

  • Author

  • Timestamp

  • Version

  • Department

  • Access level

  • Language

  • Tags

  • URL

  • Section hierarchy


B. Hierarchical Metadata

Very useful:

Document
 └── Chapter
      └── Section
           └── Paragraph

Improves contextual reconstruction.
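
A chunk's metadata record might look like this (field names are illustrative, not a standard schema):

chunk_metadata = {
    "source": "employee_handbook.pdf",
    "section_path": ["Document", "Chapter 2", "Leave policy", "Paragraph 3"],
    "timestamp": "2026-01-15",
    "version": 4,
    "access_level": "internal",
    "language": "en",
}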


C. Temporal Metadata

Needed for:

  • Freshness ranking

  • Time-aware retrieval

  • Versioning


6. Embedding Strategy


A. Embedding Model Selection

Tradeoffs:

Factor | Consideration
Quality | Retrieval accuracy
Speed | Ingestion throughput
Cost | API vs local
Dimensionality | Storage + ANN speed
Domain adaptation | Medical/legal/code

B. Domain-specific Embeddings

Sometimes generic embeddings fail.

Examples:

  • Code embeddings

  • Legal embeddings

  • Biomedical embeddings


C. Multilingual Embeddings

Needed if corpus is multilingual.


D. Multi-vector Embeddings

Advanced approach:

  • Separate title embedding

  • Summary embedding

  • Content embedding

  • Keyword embedding

Used in high-end systems.


7. Indexing Strategy


A. Vector Database Design

Choices include:

  • FAISS

  • Chroma

  • Milvus

  • Qdrant

  • Weaviate

  • Pinecone

  • Elasticsearch/OpenSearch


B. ANN Algorithm Selection

Critical for scale.

Common algorithms:

  • HNSW

  • IVF

  • PQ

  • ScaNN

  • DiskANN

Tradeoff:

  • Recall

  • Latency

  • Memory


C. Hybrid Search

Very important in production.

Combine:

  • Vector search

  • BM25

  • Keyword search

  • Metadata filtering

Hybrid retrieval usually beats pure vector retrieval.
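
One simple fusion recipe is to min-max normalize each retriever's scores and blend them with a weight (the 0.5 weight and the example score lists are assumptions; reciprocal rank fusion is a common alternative):

import numpy as np

def hybrid_scores(vector_scores, bm25_scores, alpha=0.5):
    def norm01(s):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    # weighted blend of the two normalized score lists (same candidates, same order)
    return alpha * norm01(vector_scores) + (1 - alpha) * norm01(bm25_scores)

print(hybrid_scores([0.82, 0.80, 0.40, 0.10], [2.1, 7.5, 6.8, 0.3]))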


8. Enrichment & Preprocessing

Modern pipelines often enrich documents before indexing.


A. Summarization

Generate:

  • Chunk summaries

  • Section summaries

  • Document summaries

Helps hierarchical retrieval.


B. Keyword Extraction

Useful for:

  • Sparse retrieval

  • Hybrid search

  • Filtering


C. Entity Extraction

Extract:

  • Names

  • Products

  • Organizations

  • Dates

Useful for graph RAG.


D. Knowledge Graph Construction

Advanced RAG pipelines may create:

  • Entity graphs

  • Relationship graphs

  • Citation graphs


E. Classification & Tagging

Examples:

  • Topic classification

  • Sensitivity labels

  • Intent labels

Useful for routing.


9. Incremental Updates & Freshness

Production RAG systems need continuous ingestion.


A. Change Detection

Need mechanisms like:

  • CDC (Change Data Capture)

  • Webhooks

  • Checksums

  • File hashes

  • Modified timestamps
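
A checksum-based sketch of the simplest of these: re-embed a document only when its content hash changes (the in-memory seen_hashes dict is an assumption; production systems persist this state):

import hashlib

def needs_reingestion(doc_id, text, seen_hashes):
    # True only when the document is new or its content has changed
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if seen_hashes.get(doc_id) == h:
        return False
    seen_hashes[doc_id] = h
    return True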


B. Incremental Re-indexing

Avoid full rebuilds.

Need:

  • Delta updates

  • Partial embedding refresh

  • Selective chunk invalidation


C. Versioning

Questions:

  • Keep old versions?

  • Replace embeddings?

  • Temporal retrieval?


10. Scalability & Throughput


A. Parallel Processing

Need parallelization for:

  • Parsing

  • Chunking

  • Embedding

  • Uploading


B. Batch vs Streaming

Mode | Use Case
Batch | Nightly ingestion
Streaming | Real-time knowledge

Many enterprise systems use both.


C. Queue-based Architecture

Often use:

  • Kafka

  • RabbitMQ

  • Pub/Sub

  • Celery

For resilience and scaling.


11. Reliability & Fault Tolerance


A. Retry Mechanisms

Failures happen due to:

  • API limits

  • Corrupt files

  • Network issues

Need retries + dead-letter queues.


B. Idempotency

Reprocessing same doc should not create duplicates.

Very important.


C. Observability

Need monitoring for:

  • Failed docs

  • Embedding latency

  • Queue depth

  • Index health

  • Parsing failures


12. Cost Optimization

RAG ingestion can become expensive.


A. Embedding Cost

Strategies:

  • Deduplication

  • Cache embeddings

  • Skip low-value docs

  • Batch embeddings


B. Storage Optimization

Consider:

  • Vector compression

  • Quantization

  • Tiered storage


C. Selective Ingestion

Not all documents deserve indexing.

Need prioritization.


13. Security & Governance

Enterprise-critical.


A. Access Control

Need document-level permissions.

Otherwise:

  • User may retrieve unauthorized data.


B. Encryption

At:

  • Rest

  • Transit


C. Auditability

Need tracking:

  • Who ingested what

  • When

  • Source lineage


14. Advanced RAG-specific Considerations


A. Parent-Child Retrieval

Store:

  • Small child chunks

  • Large parent docs

Retrieve child → return parent.

Very effective.
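
A minimal sketch of the child-to-parent hand-off (retrieve_children and parent_of stand in for a real vector store query and its metadata):

def parent_child_retrieve(query, retrieve_children, parent_of, top_k=5):
    child_ids = retrieve_children(query, top_k)     # search small chunks for precision
    seen, parents = set(), []
    for cid in child_ids:                           # return larger parents for context
        pid = parent_of[cid]
        if pid not in seen:
            seen.add(pid)
            parents.append(pid)
    return parents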


B. Hierarchical Retrieval

Retrieve:

  1. Document

  2. Section

  3. Chunk

Reduces noise.


C. RAPTOR-style Pipelines

Recursive abstraction trees.

The result is a hierarchical summarization tree built over the document's chunks.

Used for:

  • Long documents

  • Multi-hop reasoning


D. Graph RAG

Ingestion may build:

  • Entity graphs

  • Semantic relations

Useful for:

  • Complex reasoning

  • Connected knowledge


E. Agentic Ingestion

Emerging pattern:

  • LLM decides chunking

  • LLM extracts metadata

  • LLM creates summaries

  • LLM classifies content

More expensive but often higher quality.


15. Evaluation Considerations

You must evaluate ingestion quality itself.


A. Retrieval Evaluation

Metrics:

  • Recall@K

  • MRR

  • nDCG

  • Hit rate
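
Recall@K and MRR are simple enough to compute directly (retrieved is a query's ranked result IDs, relevant its ground-truth set; both hypothetical here):

def recall_at_k(retrieved, relevant, k):
    # fraction of the relevant items that appear in the top-k results
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    # reciprocal rank of the first relevant item, 0 if none was retrieved
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

print(recall_at_k(["d3", "d1", "d7"], {"d1", "d9"}, k=3))   # 0.5
print(mrr(["d3", "d1", "d7"], {"d1", "d9"}))                # 0.5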


B. Chunk Quality Evaluation

Questions:

  • Are chunks self-contained?

  • Is context preserved?

  • Are tables broken?


C. Embedding Drift

Embedding model updates may degrade retrieval.

Need regression testing.


16. Common Production Mistakes


A. Overlapping Too Much

Huge duplication:

  • Increased storage

  • Worse retrieval diversity


B. Ignoring Metadata

One of the biggest mistakes.


C. Blind Fixed-size Chunking

Often destroys semantic coherence.


D. No Hybrid Retrieval

Pure vector search often underperforms.


E. No Freshness Strategy

Stale knowledge kills trust.


17. Recommended Modern Production Architecture

A common robust architecture:

Sources
   ↓
Connectors
   ↓
Parsing/OCR
   ↓
Cleaning
   ↓
Structure Extraction
   ↓
Chunking
   ↓
Metadata Enrichment
   ↓
Embedding Generation
   ↓
Hybrid Index Creation
   ↓
Vector DB + BM25 Store
   ↓
Monitoring + Evaluation

18. Key Design Principle (Details)

The biggest insight in RAG ingestion:

Retrieval quality is primarily an information architecture problem, not just an embedding problem.

The best systems usually combine:

  • Good structure preservation

  • Smart chunking

  • Rich metadata

  • Hybrid retrieval

  • Reranking

  • Incremental freshness

  • Strong evaluation loops

