Documents
↓
Embeddings
↓
Vector DB (e.g., FAISS)
↓
Query → embedding
↓
Similarity search
↓
Top-k chunks
↓
LLM
↓
Answer
A Retrieval-Augmented Generation (RAG) pipeline is divided into two main phases: the Ingestion phase (offline, where data is prepared) and the Inference phase (online, where retrieval and generation happen in response to a user query).
RAG Workflow Architecture
- Ingestion Phase (Offline)
  - Documents: Raw data sources such as PDFs or text files are loaded.
  - Embeddings: The text is split into smaller pieces (chunks), and each chunk is converted into a numerical vector by an embedding model.
  - Vector DB (e.g., FAISS): The vectors are stored in a specialized database built for fast similarity search.
- Inference Phase (Online)
  - Query → Embedding: When you ask a question, it is converted into a vector using the same embedding model that was used for the documents.
  - Similarity Search: The system compares the query vector against the stored vectors to find the closest matches.
  - Top-k Chunks: The k most relevant chunks of text are retrieved from the database.
  - LLM → Answer: The retrieved chunks are sent to a Large Language Model (such as GPT-4) as context, allowing it to generate an answer grounded in your specific documents.
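The two phases above can be sketched end to end in plain Python. Everything in this sketch is a hypothetical stand-in: `embed` is a hashed bag-of-words function in place of a real embedding model, the `index` list plays the role of a vector database such as FAISS, and the final LLM call is left as a comment rather than a real API request.

```python
import hashlib
import math
import re
from collections import Counter

def embed(text, dim=512):
    """Toy embedding: a hashed bag-of-words vector, normalized to unit length.
    A real pipeline would use a trained embedding model; this stand-in
    only illustrates the data flow."""
    vec = [0.0] * dim
    for token, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already unit-norm, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# --- Ingestion phase (offline) ---
documents = [
    "FAISS is a library for efficient similarity search over dense vectors.",
    "RAG pipelines retrieve relevant chunks before calling the LLM.",
    "Chunking splits long documents into smaller overlapping pieces.",
]
# Brute-force list standing in for a real vector DB such as FAISS.
index = [(doc, embed(doc)) for doc in documents]

# --- Inference phase (online) ---
def retrieve(query, k=2):
    """Embed the query and return the top-k most similar chunks."""
    q_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, chunks):
    """Assemble the context-stuffed prompt that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

chunks = retrieve("efficient similarity search over dense vectors")
prompt = build_prompt("efficient similarity search over dense vectors", chunks)
# `prompt` would now go to an LLM API call; the model's reply is the answer.
```

Swapping `embed` for a real embedding model and `index` for a FAISS index changes only those two pieces; the retrieve-then-prompt flow stays the same.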
Visual Workflow Diagram
The plain-text flow at the top of this post shows how these components connect.