Wednesday, April 29, 2026

Vector Search Algorithms

Vector search algorithms are designed to find items in a large dataset that are mathematically "closest" to a given query vector. Unlike traditional keyword search that looks for exact character matches, these algorithms operate on vector embeddings—numerical representations that capture the semantic meaning and context of data like text, images, and audio.


Core Search Strategies
Algorithms are broadly categorized into Exact and Approximate methods: 
  • Exhaustive K-Nearest Neighbors (kNN): This is a brute-force approach that calculates the distance between the query and every single vector in the database. While it guarantees 100% accuracy (returning the true nearest neighbors), it is computationally expensive and does not scale well for datasets with millions or billions of items.
  • Approximate Nearest Neighbors (ANN): These algorithms trade a small amount of accuracy for massive gains in speed. They use specialized data structures to skip most of the dataset and focus only on the most promising candidates, enabling millisecond response times even at a massive scale. 
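The exhaustive kNN approach above can be sketched in a few lines of numpy. This is a minimal illustration (the function name and random data are my own, not from any particular library): it computes the L2 distance from the query to every database vector, which is exactly why the method fails to scale.

```python
import numpy as np

def exhaustive_knn(query, vectors, k=5):
    """Brute-force kNN: measure the distance from the query to EVERY
    vector in the database, then keep the k closest."""
    # L2 distance from the query to all database vectors at once
    dists = np.linalg.norm(vectors - query, axis=1)
    # argpartition selects the k smallest distances without a full sort
    idx = np.argpartition(dists, k)[:k]
    # order those k hits from nearest to farthest
    return idx[np.argsort(dists[idx])]

rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 128))  # 10k vectors, 128 dimensions
q = rng.normal(size=128)
print(exhaustive_knn(q, db, k=3))
```

Note that every query touches all 10,000 rows; at a billion vectors this single `np.linalg.norm` line becomes the bottleneck that ANN methods are designed to avoid.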
Popular ANN Algorithms
Most modern vector databases, such as Pinecone and Weaviate, rely on these ANN techniques: 
  • Hierarchical Navigable Small World (HNSW): Currently the state-of-the-art for many applications, HNSW builds a multi-layered graph where the top layers contain a few "long-range" links for fast global navigation and lower layers contain dense, local connections for refinement.
  • Inverted File Index (IVF): This method clusters vectors into "buckets" around centroids using k-means. At search time, the algorithm only looks in the buckets closest to the query vector, significantly reducing the search space.
  • Product Quantization (PQ): A compression technique that breaks long vectors into smaller sub-vectors and quantizes them into a "codebook". This reduces memory usage and speeds up distance calculations by working with compressed codes rather than raw floats.
  • Locality-Sensitive Hashing (LSH): Uses specialized hash functions designed to maximize "collisions" so that similar vectors end up in the same hash bucket with high probability. 
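Of these, IVF is the easiest to sketch from scratch. The toy example below (all names and sizes are illustrative, not from a real library) builds buckets with a few hand-rolled k-means iterations, then at query time probes only the `nprobe` buckets whose centroids are nearest the query—so most of the dataset is never touched.

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(2_000, 32))

# --- Index build: cluster vectors into buckets with a few k-means steps ---
n_clusters, n_iters = 50, 10
centroids = db[rng.choice(len(db), n_clusters, replace=False)].copy()
for _ in range(n_iters):
    # assign each vector to its nearest centroid
    assign = np.argmin(np.linalg.norm(db[:, None] - centroids[None], axis=2), axis=1)
    for c in range(n_clusters):
        members = db[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)
buckets = {c: np.where(assign == c)[0] for c in range(n_clusters)}

# --- Search: probe only the nprobe buckets closest to the query ---
def ivf_search(query, k=5, nprobe=5):
    nearest = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([buckets[c] for c in nearest])  # reduced search space
    dists = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]

q = rng.normal(size=32)
print(ivf_search(q, k=3))
```

Raising `nprobe` trades speed back for recall—probing all 50 buckets degenerates to exhaustive search. Production libraries (e.g., Faiss) expose this same knob.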
Mathematical Similarity Metrics
Regardless of the algorithm, a "distance metric" must be chosen to define what "similarity" means: 
  • Cosine Similarity: Measures the angle between two vectors, focusing on their direction rather than their magnitude. It is the standard for text-based semantic search.
  • Euclidean Distance (L2): Measures the straight-line distance between two points in space. It is often used for visual or spatial data.
  • Dot Product: Calculates the product of magnitudes and the cosine of the angle; for normalized vectors, it is identical to cosine similarity but faster to compute.
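All three metrics are one-liners in numpy. This small demo also checks the claim from the last bullet: for vectors normalized to unit length, the dot product and cosine similarity coincide.

```python
import numpy as np

def cosine_similarity(a, b):
    # angle only: invariant to vector magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # straight-line (L2) distance between the two points
    return np.linalg.norm(a - b)

def dot_product(a, b):
    return np.dot(a, b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(cosine_similarity(a, b))   # ~1.0: identical direction
print(euclidean_distance(a, b))  # nonzero: the points are far apart

# After unit-normalization, dot product equals cosine similarity
an, bn = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(dot_product(an, bn), cosine_similarity(a, b)))  # True
```

This is why many systems normalize embeddings at indexing time and then use the cheaper dot product internally.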

DiskANN, developed by Microsoft Research, is unique because it shifts the heavy lifting from expensive RAM to cost-effective SSDs without a significant loss in performance. 
Why DiskANN is Critical
While the algorithms mentioned earlier (like HNSW) are "in-memory first," DiskANN is "disk-native." 
  • Vamana Graph: It uses a specialized graph structure called Vamana that is more robust than HNSW for navigating from disk.
  • Two-Phase Search: It keeps a compressed "guide" (Product Quantization) in RAM to narrow down candidates, then does a fast jump to the SSD to retrieve the full-precision vector for final re-ranking.
  • Massive Scalability: It allows you to search billions of vectors on a single machine with a fraction of the RAM required by HNSW (e.g., 64GB vs. 1TB+).
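The two-phase idea can be simulated in a few lines. This is a deliberately crude sketch, not DiskANN itself: real DiskANN keeps Product Quantization codes in RAM and walks a Vamana graph on the SSD, whereas here I use simple 8-bit scalar quantization as a stand-in for PQ and a plain array as a stand-in for the disk, just to show the shortlist-then-rerank pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(5_000, 64)).astype(np.float32)  # full precision: "on SSD"

# Index build: keep only a crude compressed copy in "RAM".
# (8-bit scalar quantization stands in for Product Quantization.)
lo, hi = db.min(), db.max()
codes = np.round((db - lo) / (hi - lo) * 255).astype(np.uint8)

def decode(c):
    return c.astype(np.float32) / 255 * (hi - lo) + lo

def two_phase_search(query, k=5, shortlist=100):
    # Phase 1: approximate distances on compressed codes (RAM only)
    approx = np.linalg.norm(decode(codes) - query, axis=1)
    cand = np.argpartition(approx, shortlist)[:shortlist]
    # Phase 2: fetch full-precision vectors (the "SSD read") and re-rank
    exact = np.linalg.norm(db[cand] - query, axis=1)
    return cand[np.argsort(exact)[:k]]

q = rng.normal(size=64).astype(np.float32)
print(two_phase_search(q, k=3))
```

The key economics: the compressed codes are 4x smaller than the float32 vectors (real PQ achieves far higher ratios), and only `shortlist` full-precision rows are ever fetched per query.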

Other Significant Algorithms
The field moves fast, and there are several other specialized algorithms used by major tech players:
  • ScaNN (Scalable Nearest Neighbors): Developed by Google, it uses a technique called Anisotropic Vector Quantization. It prioritizes reducing "quantization error" for the most important vectors (the ones likely to be neighbors), often outperforming IVF-PQ in both speed and recall.
  • FreshDiskANN: An evolution of DiskANN designed for dynamic datasets. Unlike traditional graph indexes that degrade with frequent edits, FreshDiskANN supports real-time insertions and deletions without requiring a full index rebuild.
  • StreamingDiskANN: Optimized for filtered search. It continuously "streams" candidates from the disk until enough results pass your metadata filters (e.g., "find similar items under $50"), preventing the common "zero-result" bug in traditional HNSW filters.
  • Multi-Scale Tree Graph (MSTG): A newer approach that combines tree-based partitioning with graph-based navigation to better handle high-dimensional data distributions.
  • SPANN (Scalable Approximate Nearest Neighbor): Combines graph indexing with posting lists (like IVF) to achieve higher efficiency than DiskANN in specific massive-scale scenarios.


While HNSW is the "default" in almost every vector database (like Qdrant or Elasticsearch), DiskANN is primarily available in Azure Cosmos DB and specialized tools like pgvectorscale for PostgreSQL.


To prevent latency spikes with DiskANN, NVMe SSDs are a strict prerequisite. SATA SSDs are generally unsuitable because their legacy architecture introduces significant bottlenecks that can cause search times to balloon from 10–20ms to over 100ms.

