Wednesday, May 6, 2026

Similarity Metrics & Search Algorithms

Similarity Metrics
  • Cosine Similarity: Measures the angle between vectors; most common in RAG; 1 = very similar, 0 = unrelated, -1 = opposite
  • Dot Product: Measures alignment plus magnitude; commonly used in embedding models
  • Euclidean Distance (L2): Straight-line distance between vectors; smaller distance = more similar
  • Manhattan Distance (L1): Grid-based distance (sum of absolute differences); less common for embeddings
  • Jaccard Similarity: Set-based similarity; used for sparse or keyword-style data

Search Algorithms
  • Brute Force (Exact Search): Compares the query with every vector; exact results but slow at scale; what you implemented
  • k-d Tree: Space-partitioning tree; efficient for low-dimensional data but performs poorly on high-dimensional embeddings
  • Ball Tree: Uses hyperspheres instead of axis-aligned splits; slightly better than a k-d tree in some cases but still limited in high dimensions
  • HNSW (Hierarchical Navigable Small World): Graph-based ANN algorithm; very fast and accurate; widely used in FAISS, Weaviate, etc.
  • IVF (Inverted File Index): Clusters vectors first, then searches only the relevant clusters; reduces the search space significantly
  • PQ (Product Quantization): Compresses vectors to reduce memory and speed up search; often combined with IVF
  • Annoy (Approximate Nearest Neighbors Oh Yeah): Tree-based method using random projections; built at Spotify; good balance of speed and simplicity
  • ScaNN: Google's optimized ANN algorithm; combines partitioning and scoring for efficient search
  • LSH (Locality Sensitive Hashing): Hashes similar vectors into the same buckets; very fast but less accurate
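To make the metrics and the brute-force approach above concrete, here is a minimal pure-Python sketch (illustrative only; real systems use optimized libraries such as NumPy or FAISS):

```python
import math

def cosine(a, b):
    # Angle-based: 1 = same direction, 0 = orthogonal, -1 = opposite
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    # Alignment plus magnitude
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # L2: straight-line distance; smaller = more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1: sum of absolute differences
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(s1, s2):
    # Set-based: |intersection| / |union|
    return len(s1 & s2) / len(s1 | s2)

def brute_force_search(query, vectors, k=2):
    # Exact search: score the query against every vector, keep top-k
    scored = sorted(enumerate(vectors), key=lambda iv: cosine(query, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]

docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
print(brute_force_search([1.0, 0.1], docs, k=2))  # prints [0, 1]
```

Brute force scales linearly with the number of vectors, which is exactly why the ANN algorithms in the table above exist.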




FAISS, Pinecone and Weaviate
Search Algorithms
  • FAISS: HNSW, IVF, PQ, Flat (Exact), LSH
  • Pinecone: Proprietary (built on HNSW, IVF, PQ)
  • Weaviate: HNSW (custom, CRUD-optimized), Flat
Similarity Metrics
  • FAISS: L2, Inner Product (IP), Cosine
  • Pinecone: Cosine, L2, Dot Product
  • Weaviate: Cosine, Dot Product, L2, Manhattan, Hamming
Primary Focus
  • FAISS: Low-level library for researchers
  • Pinecone: Managed SaaS for production RAG
  • Weaviate: Open-source DB with hybrid search
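The IVF idea from the tables above (cluster first, then search only the most relevant clusters) can be sketched in a few lines of plain Python. This is a toy illustration with hand-picked centroids, not FAISS's actual implementation, which learns centroids with k-means:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy "trained" centroids (a real IVF index learns these from the data)
centroids = [[0.0, 0.0], [10.0, 10.0]]

# Build the inverted file: each vector is listed under its nearest centroid
vectors = [[0.5, 0.2], [9.8, 10.1], [0.1, 0.9], [10.2, 9.7]]
inverted_lists = {0: [], 1: []}
for i, v in enumerate(vectors):
    nearest = min(range(len(centroids)), key=lambda c: l2(v, centroids[c]))
    inverted_lists[nearest].append(i)

def ivf_search(query, k=1, nprobe=1):
    # Probe only the nprobe closest clusters instead of scanning everything
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in inverted_lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

print(ivf_search([9.9, 9.9], k=1))  # prints [1]
```

With nprobe=1 only half the vectors are compared here; raising nprobe trades speed back for recall, which is the core IVF tuning knob.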

1. FAISS (Facebook AI Similarity Search)
FAISS is a highly flexible library that provides "building blocks" rather than a single fixed algorithm. 
  • Search Algorithms: It offers a wide variety of indexes. Common ones include IndexHNSW (graph-based), IndexIVF (clustering), and IndexFlat (brute-force exact search). It also uses Product Quantization (PQ) to compress vectors. 
  • Similarity Metrics: Primarily optimized for L2 (Euclidean) and Inner Product.  It supports Cosine Similarity by normalizing vectors and then using Inner Product. 
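The normalization trick described above (cosine similarity as the inner product of unit-length vectors, which is what FAISS's normalize_L2 helper enables) is easy to verify. A sketch in plain Python, not FAISS itself:

```python
import math

def normalize(v):
    # Scale to unit length, as FAISS's normalize_L2 does in place
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# Inner product of the normalized vectors equals cosine similarity of the originals
assert abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-12
```

This is why an inner-product index over normalized vectors behaves exactly like a cosine-similarity index.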
2. Pinecone
Pinecone is a fully managed service, so its internal "recipe" is proprietary, but it is built on industry-standard concepts. 
  • Search Algorithms: It uses a combination of HNSW, IVF, and PQ within its architecture to balance speed and accuracy at scale. 
  • Similarity Metrics: You choose the metric when creating an index. It supports Cosine Similarity (the usual default), Euclidean (L2), and Dot Product.
3. Weaviate
Weaviate is designed as a full database and focuses on high-speed retrieval and hybrid search. 
  • Search Algorithms: The default and most common is a custom, high-performance implementation of HNSW.  It is specifically optimized to allow for real-time CRUD (Create, Read, Update, Delete) operations, which is often difficult for standard HNSW. 
  • Similarity Metrics: It provides a broad range: Cosine, Dot Product, L2-Squared, Manhattan, and Hamming.  It defaults to Cosine Distance.
