Wednesday, May 6, 2026

RAG Pipeline

Documents
   ↓
Embeddings
   ↓
Vector DB (e.g., FAISS)
   ↓
Query → Embedding
   ↓
Similarity Search
   ↓
Top-k Chunks
   ↓
LLM
   ↓
Answer



A Retrieval-Augmented Generation (RAG) pipeline is divided into two main phases: the ingestion phase, where the data is prepared offline, and the inference phase, where the user's query is answered against that data.
RAG Workflow Architecture
  • Ingestion Phase (Offline)
    • Documents: Raw data sources like PDFs or text files are loaded.
    • Embeddings: The text is broken into smaller pieces (chunks) and converted into numerical vectors using an embedding model.
    • Vector DB (e.g., FAISS): These vectors are stored in a specialized database that allows for fast mathematical searching.
  • Inference Phase (Online)
    • Query → Embedding: When you ask a question, it is converted into a vector using the same embedding model used for the documents.
    • Similarity Search: The system compares the query vector against the vectors in the database to find the closest matches.
    • Top-k Chunks: The k most relevant chunks of text are retrieved from the database.
    • LLM → Answer: These chunks are sent to a Large Language Model (like GPT-4) as context, allowing it to generate an answer based on your specific documents. 
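The two phases above can be sketched end to end in plain Python. This is a minimal illustration, not a production implementation: the bag-of-words `embed` function is a stand-in for a real embedding model (e.g. sentence-transformers), and a NumPy array plays the role of the vector database in place of FAISS.

```python
import numpy as np

# --- Ingestion phase (offline) ---
documents = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "The Eiffel Tower is located in Paris, France.",
    "Embedding models map text into numerical vectors.",
]

# Toy bag-of-words embedding; a real pipeline would call an embedding model here.
vocab = sorted({w.lower().strip(".,") for d in documents for w in d.split()})

def embed(text):
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        w = w.strip(".,?")
        if w in vocab:
            vec[vocab.index(w)] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Stack the document vectors into one matrix: our stand-in "vector DB".
index = np.stack([embed(d) for d in documents])

# --- Inference phase (online) ---
def retrieve(query, k=2):
    q = embed(query)                 # same embedding model as the documents
    scores = index @ q               # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

chunks = retrieve("Which library does similarity search?")
prompt = ("Answer using only this context:\n"
          + "\n".join(chunks)
          + "\nQuestion: Which library does similarity search?")
# `prompt` would now be sent to an LLM (e.g. a GPT-4 chat completion call)
# to generate the final answer grounded in the retrieved chunks.
```

Note that the query must be embedded with the same model (and the same vocabulary) as the documents, otherwise the similarity scores are meaningless.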
Visual Workflow Diagram
Below is a representation of how these components connect:
[Figure: RAG workflow graph]

