# ============================================================
# STEP 1 — INSTALL REQUIRED LIBRARIES
# ============================================================
# Run these in Google Colab
!pip install -q google-genai
!pip install -q faiss-cpu
%env RETRIEVAL_MODE=faiss
# ============================================================
# STEP 2 — IMPORT LIBRARIES
# ============================================================
import os
import numpy as np
import faiss
import google.genai as genai
from google.colab import userdata
# ============================================================
# STEP 3 — LOAD ENVIRONMENT SETTINGS
# ============================================================
# RETRIEVAL MODES:
#
# "cosine" -> brute-force cosine similarity
# "faiss" -> FAISS vector search
#
# Change this anytime later.
#
# For Render deployment:
# use environment variables.
# Retrieval backend selector: "cosine" (brute force) or "faiss" (vector index).
# Overridable via the RETRIEVAL_MODE environment variable (e.g. on Render).
RETRIEVAL_MODE = os.getenv("RETRIEVAL_MODE", "cosine")
print(f"Retrieval mode: {RETRIEVAL_MODE}")
# ============================================================
# STEP 4 — CONFIGURE GEMINI API
# ============================================================
# The API key is read from Colab's secret store ("Secrets" panel).
# The commented lines below are earlier key-slot names, kept as a
# rotation history — only the last one is active.
#GEMINI_API_KEY = userdata.get("GEMINI_API_KEY")
#GEMINI_API_KEY = userdata.get("geminiapikey")
#GEMINI_API_KEY = userdata.get("GEMINI_API_KEY-003")
#GEMINI_API_KEY = userdata.get("GEMINI_API_KEY_004")
#GEMINI_API_KEY = userdata.get("GEMINI_API_KEY_005")
GEMINI_API_KEY = userdata.get("GEMINI_API_KEY_006")
# Single google-genai client instance, reused by every embed_content /
# generate_content call in this notebook.
client = genai.Client(api_key=GEMINI_API_KEY)
# Active generation model. The alternatives tried earlier are kept commented
# for quick switching; NOTE(review): one commented line is missing its closing
# quote — harmless while commented, fix before uncommenting.
LLM_MODEL = "models/gemma-4-26b-a4b-it"
#LLM_MODEL = "models/gemini-3.1-flash-lite"
#LLM_MODEL="models/gemini-2.5-flash
#LLM_MODEL="models/gemini-3-flash-preview"
#LLM_MODEL="models/gemini-2.5-flash"
#LLM_MODEL="models/gemini-2.5-flash-lite"
#LLM_MODEL="models/gemini-3.1-pro-preview"
#LLM_MODEL="models/gemini-2.0-flash-lite"
####MEMORY HANDLER
# ============================================================
# CONVERSATIONAL MEMORY (PHASE 1)
# ============================================================
# GOAL
# ----
# Add short-term conversational memory so the system:
# - understands follow-up questions
# - remembers recent discussion
# - supports multi-turn conversations
#
# EXAMPLE
# -------
# User: Why is my pod crashing?
# User: How do I debug it?
#
# "it" should refer to the crashing pod.
#
#
# IMPORTANT
# ---------
# This is NOT semantic/vector memory yet.
#
# This is:
# SHORT-TERM PROMPT MEMORY
#
#
# WHAT WE WILL ADD
# ----------------
# ✅ chat_history
# ✅ memory window
# ✅ history injection into prompt
# ✅ multi-turn continuity
#
#
# ============================================================
# MEMORY CONFIGURATION
# ============================================================
# Number of previous conversation turns to remember.
#
# Example:
# MEMORY_WINDOW = 3
#
# Means:
# last 3 user-assistant exchanges are included.
# Number of most-recent user/assistant exchanges injected back into the prompt.
MEMORY_WINDOW: int = 3
# ============================================================
# CHAT HISTORY STORAGE
# ============================================================
# Conversation history format:
#
# [
#   {
#     "user": "...",
#     "assistant": "..."
#   }
# ]
#
# Appended to by rag_pipeline(); read by build_history_context().
chat_history: list[dict[str, str]] = []
# ============================================================
# BUILD CONVERSATION HISTORY TEXT
# ============================================================
def build_history_context(history=None, window=None):
    """Render recent conversation turns as text for prompt injection.

    Generalized (backward-compatibly) to accept explicit arguments so the
    function is testable and reusable; with no arguments it behaves exactly
    as before, reading the module-level ``chat_history`` / ``MEMORY_WINDOW``.

    Args:
        history: Optional list of {"user": ..., "assistant": ...} dicts.
                 Defaults to the global ``chat_history``.
        window:  Optional number of trailing turns to keep.
                 Defaults to the global ``MEMORY_WINDOW``.

    Returns:
        A stripped string with one "User:/Assistant:" section per turn
        ("" when there is no history).
    """
    if history is None:
        history = chat_history
    if window is None:
        window = MEMORY_WINDOW

    # Keep only the most recent `window` exchanges.
    recent_history = history[-window:]

    parts = []
    for turn in recent_history:
        parts.append(f"""
User:
{turn['user']}
Assistant:
{turn['assistant']}
""")
    return "".join(parts).strip()
####MEMORY HANDLER
# ============================================================
# STEP 5 — CREATE DATASET
# ============================================================
# Tiny in-memory knowledge base: one Kubernetes fact per sentence, grouped by
# topic. These sentences are later windowed into overlapping chunks (STEP 6).
documents: list[str] = [
    # --------------------------------------------------------
    # POD FAILURES / DEBUGGING
    # --------------------------------------------------------
    "CrashLoopBackOff occurs when a container repeatedly crashes after starting.",
    "OOMKilled happens when a container exceeds its memory limit.",
    "A container may crash due to missing environment variables.",
    "Incorrect command or entrypoint can cause container startup failure.",
    "Application errors inside the container often lead to restarts.",
    "kubectl logs retrieves logs from a running container.",
    "kubectl describe pod shows events and state transitions.",
    "Liveness probes determine if a container should be restarted.",
    "Readiness probes determine if a pod can receive traffic.",
    # --------------------------------------------------------
    # SCHEDULING
    # --------------------------------------------------------
    "Pods remain pending if no node satisfies resource requests.",
    "Node affinity restricts pods to specific nodes.",
    "Taints prevent pods from being scheduled on certain nodes.",
    "Tolerations allow pods to be scheduled on tainted nodes.",
    # --------------------------------------------------------
    # SERVICES
    # --------------------------------------------------------
    "ClusterIP services expose applications within the cluster.",
    "NodePort services expose applications on node IPs.",
    "LoadBalancer services expose applications externally.",
    "Ingress routes HTTP and HTTPS traffic to services.",
    # --------------------------------------------------------
    # STORAGE
    # --------------------------------------------------------
    "PersistentVolumes provide storage independent of pods.",
    "PersistentVolumeClaims request storage resources.",
    "StorageClasses define dynamic provisioning behavior.",
    # --------------------------------------------------------
    # DEPLOYMENTS
    # --------------------------------------------------------
    "Deployments manage replica sets and pod updates.",
    "Rolling updates gradually replace old pods with new ones.",
    "ReplicaSets maintain a stable number of pod replicas.",
    # --------------------------------------------------------
    # CONFIGURATION
    # --------------------------------------------------------
    "ConfigMaps store non-sensitive configuration data.",
    "Secrets store sensitive data like passwords and tokens.",
    "Environment variables can be injected from ConfigMaps and Secrets.",
    # --------------------------------------------------------
    # IMAGES / REGISTRY
    # --------------------------------------------------------
    "ImagePullBackOff occurs when Kubernetes cannot pull the container image.",
    "Incorrect image name or tag can cause image pull failures.",
    "Private registries require imagePullSecrets for authentication.",
    # --------------------------------------------------------
    # AUTOSCALING
    # --------------------------------------------------------
    "Horizontal Pod Autoscaler scales based on CPU or metrics.",
    # --------------------------------------------------------
    # SECURITY
    # --------------------------------------------------------
    "RBAC controls access permissions inside Kubernetes.",
    "RBAC misconfiguration can block access to resources.",
    # --------------------------------------------------------
    # NETWORKING
    # --------------------------------------------------------
    "NetworkPolicies control communication between pods.",
    # --------------------------------------------------------
    # CLEANUP
    # --------------------------------------------------------
    "Pods stuck in Terminating state may have finalizers blocking deletion."
]
print(f"Total documents: {len(documents)}")
# ============================================================
# STEP 6 — CREATE SLIDING WINDOW CHUNKS
# ============================================================
# WHY?
# ----
# Preserves neighboring semantic context.
#
# Example:
# sentence1 + sentence2 + sentence3
#
# Then:
# sentence2 + sentence3 + sentence4
WINDOW_SIZE = 3
STRIDE = 1

# Slide a WINDOW_SIZE-sentence window across the corpus, advancing by STRIDE,
# so every chunk keeps its neighboring semantic context.
smart_chunks = [
    "\n".join(documents[start:start + WINDOW_SIZE])
    for start in range(0, len(documents) - WINDOW_SIZE + 1, STRIDE)
]
print(f"Total chunks created: {len(smart_chunks)}")
# ============================================================
# STEP 7 — PREPARE STRUCTURED CHUNK DATA
# ============================================================
# prepared_data = []
# for i, chunk in enumerate(smart_chunks):
# prepared_data.append({
# "id": f"chunk_{i}",
# "text": chunk
# })
# print(f"Prepared chunks: {len(prepared_data)}")
# Wrap each chunk in a dict carrying citation and bookkeeping fields.
prepared_data = [
    {
        # Unique, 1-based id used for source attribution in answers.
        "source_id": f"SOURCE_{chunk_number + 1}",
        # Internal, 0-based chunk id used for score merging.
        "id": f"chunk_{chunk_number}",
        # The actual chunk text.
        "text": chunk_text,
        # Optional metadata.
        "metadata": {
            "topic": "kubernetes",
            "chunk_number": chunk_number,
        },
    }
    for chunk_number, chunk_text in enumerate(smart_chunks)
]
print("Prepared data with source attribution.")
# ============================================================
# STEP 8 — CREATE EMBEDDING FUNCTION
# ============================================================
def get_embedding(text, model="models/gemini-embedding-001"):
    """Return the embedding vector (list of floats) for ``text``.

    Args:
        text: The string to embed.
        model: Embedding model name; the default matches the model used to
               build the index, so callers normally omit it. Parameterized
               (backward-compatibly) so the function can be reused.

    Returns:
        The embedding values of the single input.
    """
    response = client.models.embed_content(
        model=model,
        contents=text,
    )
    # The google-genai SDK returns a list under `embeddings`; a single
    # input yields a single embedding.
    return response.embeddings[0].values
# ============================================================
# STEP 9 — GENERATE CHUNK EMBEDDINGS
# ============================================================
print("Generating embeddings...")
# One embedding API call per chunk; the vector is stored on the chunk record.
for record in prepared_data:
    record["embedding"] = get_embedding(record["text"])
print("Embeddings generated successfully.")
# ============================================================
# STEP 10 — NORMALIZATION FUNCTION
# ============================================================
def normalize(vec):
    """Return ``vec`` scaled to unit L2 length as a numpy array.

    Robustness fix: a zero vector is returned as-is (all zeros) instead of
    triggering a 0/0 division that produced NaNs in the original.
    """
    vec = np.asarray(vec, dtype=np.float64)
    norm = np.linalg.norm(vec)
    if norm == 0.0:
        return vec
    return vec / norm
# ============================================================
# STEP 11 — COSINE SIMILARITY FUNCTION
# ============================================================
def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors.

    Computed inline with numpy (dot product over the product of norms)
    rather than via the sibling ``normalize`` helper, and guarded so a
    zero vector yields 0.0 instead of NaN.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)
# ============================================================
# STEP 12 — COSINE RETRIEVAL FUNCTION
# ============================================================
def retrieve_cosine(query, top_k=3, min_score=0.55):
    """Brute-force cosine retrieval over ``prepared_data`` with keyword re-rank.

    Args:
        query: Natural-language user question.
        top_k: Maximum number of chunks to return.
        min_score: Minimum re-ranked score a chunk must reach to be kept.

    Returns:
        List of (score, chunk_dict) tuples, best first, at most ``top_k`` long.

    Fixes vs. original:
      * Cosine similarity is computed inline with numpy. Later in this file,
        ``from sklearn.metrics.pairwise import cosine_similarity`` shadows the
        module-level helper of the same name; relying on that global would
        silently call sklearn's 2-D API here and crash on 1-D vectors.
      * The keyword bonus now matches whole tokens instead of substrings
        (the old ``word in text`` test rewarded accidental hits such as
        "it" inside "with").
      * The redundant pre-rerank sort was dropped (results are sorted once,
        after re-ranking).
    """
    query_embedding = get_embedding(query)

    q = np.asarray(query_embedding, dtype=np.float64)
    q_norm = np.linalg.norm(q)

    # ---- semantic similarity per chunk ----------------------------------
    scored = []
    for item in prepared_data:
        v = np.asarray(item["embedding"], dtype=np.float64)
        denom = q_norm * np.linalg.norm(v)
        similarity = float(np.dot(q, v) / denom) if denom else 0.0
        scored.append((similarity, item))

    # ---- simple keyword re-ranking (whole-token overlap) -----------------
    punctuation = "?!.,:;\"'()[]{}"
    query_tokens = {w.strip(punctuation) for w in query.lower().split()}
    query_tokens.discard("")

    reranked = []
    for sim, item in scored:
        text_tokens = {
            t.strip(punctuation) for t in item["text"].lower().split()
        }
        keyword_bonus = len(query_tokens & text_tokens)
        reranked.append((sim + 0.03 * keyword_bonus, item))

    reranked.sort(key=lambda x: x[0], reverse=True)

    # ---- drop weak matches, keep the best top_k ---------------------------
    return [pair for pair in reranked if pair[0] >= min_score][:top_k]
# ============================================================
# STEP 13 — CREATE FAISS EMBEDDING MATRIX
# ============================================================
# Stack all chunk embeddings into one float32 matrix (rows = chunks),
# the layout FAISS expects.
embedding_matrix = np.array(
    [record["embedding"] for record in prepared_data],
    dtype=np.float32,
)
print("Embedding matrix shape:")
print(embedding_matrix.shape)
# ============================================================
# STEP 14 — NORMALIZE EMBEDDINGS FOR FAISS
# ============================================================
# IMPORTANT:
#
# IndexFlatIP uses INNER PRODUCT.
#
# If vectors are normalized:
#
# inner product == cosine similarity
#
# normalize_L2 scales each row of the matrix to unit length IN PLACE.
faiss.normalize_L2(embedding_matrix)
# ============================================================
# STEP 15 — CREATE FAISS INDEX
# ============================================================
# Flat (exhaustive) inner-product index sized to the embedding dimension.
dimension = embedding_matrix.shape[1]
index = faiss.IndexFlatIP(dimension)
print("FAISS index created.")
# ============================================================
# STEP 16 — ADD EMBEDDINGS TO FAISS INDEX
# ============================================================
# Row i of the matrix becomes FAISS id i, matching prepared_data[i].
index.add(embedding_matrix)
print(f"Total vectors indexed: {index.ntotal}")
# ============================================================
# STEP 17 — FAISS RETRIEVAL FUNCTION
# ============================================================
def retrieve_faiss(query, top_k=3):
    """FAISS retrieval (inner product == cosine, since vectors are normalized).

    Args:
        query: Natural-language user question.
        top_k: Maximum number of chunks to return.

    Returns:
        List of (score, chunk_dict) tuples, best first.

    Fix vs. original: FAISS pads its result arrays with index -1 when fewer
    than ``top_k`` vectors are available; indexing ``prepared_data[-1]`` would
    silently return the wrong (last) chunk. Such slots are now skipped.
    """
    # Embed and shape the query as a (1, dim) float32 matrix.
    query_embedding = get_embedding(query)
    query_vector = np.array([query_embedding], dtype=np.float32)

    # Unit-normalize so the inner-product search equals cosine similarity.
    faiss.normalize_L2(query_vector)

    scores, indices = index.search(query_vector, top_k)

    results = []
    for score, idx in zip(scores[0], indices[0]):
        if idx < 0:
            # Padding slot: no real match behind it.
            continue
        results.append((float(score), prepared_data[idx]))
    return results
# ============================================================
# STEP 18 — RETRIEVAL ROUTER
# ============================================================
# This decides:
#
# cosine retrieval
# OR
# FAISS retrieval
def retrieve_router(query, top_k=3):
    """Dispatch retrieval to the backend selected by RETRIEVAL_MODE.

    "cosine" -> brute-force cosine retrieval
    "faiss"  -> FAISS vector search

    Raises:
        ValueError: if RETRIEVAL_MODE is neither "cosine" nor "faiss".
    """
    backends = {
        "cosine": retrieve_cosine,
        "faiss": retrieve_faiss,
    }
    if RETRIEVAL_MODE not in backends:
        raise ValueError(
            f"Invalid retrieval mode: {RETRIEVAL_MODE}"
        )
    return backends[RETRIEVAL_MODE](query=query, top_k=top_k)
#============================================================
# STEP 18.1.0 HYBRID SEARCH BEGIN
#============================================================
# ============================================================
# STEP 18.1.1 — HYBRID RETRIEVAL
# ============================================================
# GOAL
# ----
# Combine:
#
# 1. Semantic Retrieval (Embeddings / FAISS)
# 2. Keyword Retrieval (TF-IDF)
#
# WHY?
# ----
# Embeddings are good for semantic meaning.
#
# But embeddings sometimes miss:
# - exact error messages
# - command names
# - Kubernetes terms
# - identifiers
# - rare technical keywords
#
# Example:
# CrashLoopBackOff
# OOMKilled
# kubectl logs
#
# Keyword retrieval helps greatly here.
#
#
# FINAL ARCHITECTURE
# ------------------
#
# Semantic Search
# +
# Keyword Search
# ↓
# Combined Scores
# ↓
# Final Ranked Results
#
#
# ============================================================
# STEP 18.1.2 — IMPORTS
# ============================================================
from sklearn.feature_extraction.text import TfidfVectorizer
# BUG FIX: the original imported `cosine_similarity` un-aliased, which
# shadowed the numpy-based cosine_similarity() helper defined earlier in
# this file and broke the "cosine" retrieval mode (sklearn's version
# requires 2-D arrays). Alias it so both can coexist.
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine_similarity
# ============================================================
# STEP 18.1.3 — BUILD TF-IDF INDEX
# ============================================================
# Build a keyword-searchable TF-IDF index from the chunk texts.
chunk_texts = [item["text"] for item in prepared_data]
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(chunk_texts)
print("TF-IDF index built.")
# ============================================================
# STEP 18.1.4 — KEYWORD RETRIEVAL FUNCTION
# ============================================================
def keyword_search(query, top_k=5):
    """TF-IDF keyword retrieval over the chunk corpus.

    Args:
        query: Natural-language user question.
        top_k: Maximum number of chunks to return.

    Returns:
        List of (score, chunk_dict) tuples, best first.
    """
    # Project the query into the fitted TF-IDF space.
    query_vector = tfidf_vectorizer.transform([query])

    # Cosine similarity of the query against every chunk (sklearn, 2-D API).
    similarities = sk_cosine_similarity(query_vector, tfidf_matrix)[0]

    # Best-first ranking.
    ranked_indices = similarities.argsort()[::-1]
    return [
        (similarities[idx], prepared_data[idx])
        for idx in ranked_indices[:top_k]
    ]
# ============================================================
# STEP 18.1.5 — SEMANTIC SEARCH FUNCTION
# ============================================================
# Your existing semantic retrieval function.
#
# Rename it to semantic_search()
# for clarity.
def semantic_search(query, top_k=5):
    """Standalone FAISS semantic retrieval (hybrid_retrieve currently routes
    through retrieve_router instead — see STEP 18.1.6).

    Returns:
        List of (score, chunk_dict) tuples, best first.

    Fixes vs. original:
      * The query vector is now L2-normalized before searching, matching
        retrieve_faiss and the normalized IndexFlatIP design; without it the
        scores were raw inner products, not cosine similarities.
      * FAISS pads result indices with -1 when fewer than ``top_k`` vectors
        exist; those slots are skipped instead of indexing prepared_data[-1].
    """
    query_embedding = get_embedding(query)
    query_vector = np.array(
        query_embedding,
        dtype=np.float32,
    ).reshape(1, -1)

    # Consistency fix: normalize so inner product == cosine similarity.
    faiss.normalize_L2(query_vector)

    distances, indices = index.search(query_vector, top_k)

    results = []
    for score, idx in zip(distances[0], indices[0]):
        if idx < 0:
            continue
        results.append((float(score), prepared_data[idx]))
    return results
# ============================================================
# STEP 18.1.6 — HYBRID RETRIEVAL
# ============================================================
# COMBINE:
# semantic score
# keyword score
#
# FINAL SCORE:
#
# hybrid_score =
# semantic_weight * semantic_score
# +
# keyword_weight * keyword_score
#
#
# You can tune these later.
# Relative weights of the two retrieval signals; tune later if needed.
SEMANTIC_WEIGHT = 0.7
KEYWORD_WEIGHT = 0.3
def hybrid_retrieve(query, top_k=5):
    """Hybrid retrieval: weighted blend of semantic and keyword scores.

    hybrid_score = SEMANTIC_WEIGHT * semantic_score
                 + KEYWORD_WEIGHT  * keyword_score

    Args:
        query: Natural-language user question.
        top_k: Maximum number of chunks to return.

    Returns:
        List of (hybrid_score, chunk_dict) tuples, best first.

    Refactor vs. original: the two score-accumulation loops duplicated the
    entry-initialization code; they now share one helper using setdefault.
    """
    # Semantic side goes through the router so RETRIEVAL_MODE is honored.
    semantic_results = retrieve_router(query, top_k=top_k)
    keyword_results = keyword_search(query, top_k=top_k)

    combined_scores = {}

    def _record(results, field):
        # Merge one result set into the accumulator under `field`.
        for score, item in results:
            entry = combined_scores.setdefault(item["id"], {
                "item": item,
                "semantic_score": 0,
                "keyword_score": 0,
            })
            entry[field] = score

    _record(semantic_results, "semantic_score")
    _record(keyword_results, "keyword_score")

    final_results = [
        (
            SEMANTIC_WEIGHT * values["semantic_score"]
            + KEYWORD_WEIGHT * values["keyword_score"],
            values["item"],
        )
        for values in combined_scores.values()
    ]
    final_results.sort(key=lambda x: x[0], reverse=True)
    return final_results[:top_k]
#===========================================================
# STEP 18.1.0 HYBRID SEARCH END
#===========================================================
# ============================================================
# STEP 19 — BUILD PROMPT - SOME IMPROVEMENTS
# ============================================================
# WHAT THIS IMPROVES
# -------------------
# ✅ Better grounding
# ✅ Reduced hallucinations
# ✅ Better formatting
# ✅ Better instruction following
# ✅ Cleaner troubleshooting answers
#
# IMPORTANT:
# -----------
# This does NOT improve retrieval itself.
#
# It improves:
# HOW the LLM uses retrieved chunks.
def build_prompt(query, retrieved_chunks):
    """Assemble the grounded-QA prompt sent to the LLM.

    Args:
        query: The user's question.
        retrieved_chunks: list of (score, chunk_dict) tuples; each chunk dict
            must carry "source_id" and "text".

    Returns:
        The full prompt string: rules + conversation history + retrieved
        context + question + required answer format.
    """
    # --------------------------------------------------------
    # BUILD RETRIEVED CONTEXT
    # --------------------------------------------------------
    # NOTE: `i` is currently unused; chunks are identified by source_id.
    context_parts = []
    for i, (score, item) in enumerate(retrieved_chunks, start=1):
        context_parts.append(
            f"""
SOURCE ID: {item["source_id"]}
RELEVANCE SCORE: {score:.4f}
CONTENT:
{item["text"]}
"""
        )
    context_text = "\n".join(context_parts)
    # --------------------------------------------------------
    # BUILD CONVERSATION HISTORY
    # --------------------------------------------------------
    history_text = build_history_context()
    # --------------------------------------------------------
    # FINAL PROMPT
    # --------------------------------------------------------
    prompt = f"""
You are an expert Kubernetes troubleshooting assistant.
Your job is to answer the user's question ONLY using
the retrieved context and conversation history.
IMPORTANT RULES:
----------------
1. Use ONLY the retrieved context and conversation history.
2. Do NOT use outside knowledge.
3. Do NOT invent information.
4. Answer using available context.
If context is incomplete,
explicitly mention limitations.
5. If answer is not present at all,
say:
"I don't know based on the provided context."
6. Keep answers:
- concise
- technically accurate
- well-structured
7. Use bullet points when appropriate.
8. Prefer information from higher relevance scores.
9. At the end of the answer,
cite the source IDs used.
10. Use conversation history to understand
follow-up questions and references.
==================================================
CONVERSATION HISTORY
==================================================
{history_text}
==================================================
RETRIEVED CONTEXT
==================================================
{context_text}
==================================================
USER QUESTION
==================================================
{query}
==================================================
ANSWER FORMAT
==================================================
Answer:
<your answer>
Sources Used:
- SOURCE_X
- SOURCE_Y
"""
    return prompt
# ============================================================
# STEP 20 — GENERATE ANSWER USING GEMINI
# ============================================================
def generate_answer(prompt):
    """Send ``prompt`` to the configured Gemini model and return its text.

    Cleanup vs. original: removed the dead, commented-out legacy-SDK call
    (genai.GenerativeModel) that predated the google-genai client API.
    """
    response = client.models.generate_content(
        model=LLM_MODEL,
        contents=prompt,
    )
    return response.text
# ============================================================
# STEP 20.1 VALIDATION LOGIC
# ============================================================
# ============================================================
# STEP 20.1.1 — SELF-CHECKING RAG
# ============================================================
# GOAL
# ----
# After generating an answer:
#
# 1. Ask the LLM to VERIFY the answer
# 2. Check whether answer is grounded
# 3. Detect hallucinations
# 4. Detect unsupported claims
#
#
# WHY THIS MATTERS
# ----------------
# Retrieval can still fail.
#
# LLM may:
# - overgeneralize
# - infer unsupported facts
# - hallucinate details
#
# Self-checking adds:
#
# GENERATION → VALIDATION
#
#
# IMPORTANT
# ---------
# This is still NOT a full agent.
#
# But this is an EARLY form of:
# reflection / self-evaluation
#
#
# ============================================================
# STEP 20.1.2 — VALIDATION PROMPT
# ============================================================
def build_validation_prompt(answer, retrieved_chunks):
    """Build the strict grounding-check prompt for a generated answer.

    Args:
        answer: The LLM answer to verify.
        retrieved_chunks: list of (score, chunk_dict) tuples; only the
            chunk's "source_id" and "text" are used (scores are ignored).

    Returns:
        The validator prompt string (expects a PASS/FAIL style reply).
    """
    # Render each retrieved chunk with its source id, then join them.
    context_text = "\n".join(
        f"""
SOURCE ID: {chunk["source_id"]}
CONTENT:
{chunk["text"]}
"""
        for _score, chunk in retrieved_chunks
    )
    return f"""
You are a strict RAG answer validator.
Your task is to determine whether the answer
is FULLY supported by the retrieved context.
IMPORTANT RULES:
----------------
1. Check whether the answer contains:
- hallucinations
- unsupported claims
- invented facts
- outside knowledge
2. ONLY use the retrieved context.
3. Be strict and conservative.
4. If the answer is partially supported,
clearly mention unsupported parts.
5. Return your response in this format:
VALIDATION: PASS or FAIL
EXPLANATION:
<short explanation>
==================================================
RETRIEVED CONTEXT
==================================================
{context_text}
==================================================
ANSWER TO VALIDATE
==================================================
{answer}
"""
# ============================================================
# STEP 20.1.3 — VALIDATE ANSWER
# ============================================================
def validate_answer(answer, retrieved_chunks):
    """Self-check step: ask the LLM whether ``answer`` is grounded in the
    retrieved chunks.

    Returns:
        The raw validator reply text (expected to contain PASS or FAIL).
    """
    check_prompt = build_validation_prompt(answer, retrieved_chunks)
    reply = client.models.generate_content(
        model=LLM_MODEL,
        contents=check_prompt,
    )
    return reply.text
# ============================================================
# STEP 20.1 VALIDATION LOGIC
# ============================================================
# ============================================================
# STEP 21 — MAIN RAG PIPELINE
# ============================================================
def rag_pipeline(query):
    """End-to-end RAG flow: retrieve → prompt → generate → self-check → remember.

    Args:
        query: The user's question.

    Returns:
        (answer, validation_result, retrieved_chunks) where retrieved_chunks
        is the list of (score, chunk_dict) tuples used for grounding.

    Consistency fix: answer generation now goes through generate_answer()
    instead of duplicating the client.models.generate_content call inline.
    """
    # Hybrid retrieval (semantic + keyword).
    retrieved_chunks = hybrid_retrieve(query)

    # Ground the question in the retrieved context + conversation history.
    prompt = build_prompt(query, retrieved_chunks)

    # Generate the answer via the shared helper.
    answer = generate_answer(prompt)

    # Self-check the answer against the same retrieved chunks.
    validation_result = validate_answer(
        answer,
        retrieved_chunks,
    )

    # Record the exchange for short-term conversational memory.
    chat_history.append({
        "user": query,
        "assistant": answer,
    })
    return answer, validation_result, retrieved_chunks
# ============================================================
# STEP 22 — TEST QUERIES
# ============================================================
# Smoke-test questions grouped by which retrieval signal they stress.
test_queries: list[str] = [
    # --------------------------------------------------------
    # EXACT KEYWORD TESTS
    # --------------------------------------------------------
    "What is CrashLoopBackOff?",
    "How do I use kubectl logs?",
    "What causes OOMKilled?",
    # --------------------------------------------------------
    # SEMANTIC TESTS
    # --------------------------------------------------------
    "Why is my container restarting repeatedly?",
    "How do services expose applications?",
    # --------------------------------------------------------
    # MIXED TEST
    # --------------------------------------------------------
    "How do I debug pod crashes using logs?"
]
# ============================================================
# STEP 23 — RUN TESTS
# ============================================================
# Run every smoke-test query through the full pipeline and print the answer,
# the self-check verdict, and the sources that grounded it. NOTE: each
# iteration appends to chat_history, so later queries see earlier ones as
# conversational context.
for query in test_queries:
    print("\n" + "=" * 80)
    print(f"QUERY: {query}")
    # --------------------------------------------------------
    # RUN PIPELINE
    # --------------------------------------------------------
    answer, validation, sources = rag_pipeline(query)
    # --------------------------------------------------------
    # PRINT ANSWER
    # --------------------------------------------------------
    print("\nANSWER:\n")
    print(answer)
    # --------------------------------------------------------
    # PRINT VALIDATION
    # --------------------------------------------------------
    print("\nSELF-CHECK RESULT:\n")
    print(validation)
    # --------------------------------------------------------
    # PRINT RETRIEVED SOURCES
    # --------------------------------------------------------
    print("\nRETRIEVED SOURCES:\n")
    for score, item in sources:
        print(f"Source ID: {item['source_id']}")
        print(f"Score: {score:.4f}")
        print("-" * 50)
# ============================================================
# OPTIONAL MEMORY RESET
# ============================================================
# Use this whenever you want to clear conversation history.
# chat_history = []
#=============================================================
#STEP 24 : EVALUATION FRAMEWORK - BEGIN
#=============================================================
# ============================================================
# STEP 24.1 — EVALUATION FRAMEWORK
# ============================================================
# GOAL
# ----
# Measure how well the RAG system performs.
#
# We now evaluate:
#
# 1. Retrieval quality
# 2. Grounding quality
# 3. Validation success rate
# 4. Answer quality
# 5. System reliability
#
#
# IMPORTANT
# ---------
# This is NOT truth verification.
#
# We are evaluating:
#
# "How well does the system work?"
#
# NOT:
#
# "Is the world factually correct?"
#
#
# ============================================================
# WHAT WE WILL MEASURE
# ============================================================
#
# Retrieval Metrics:
# ------------------
# - retrieval_hit
# - retrieval_score
#
#
# Generation Metrics:
# -------------------
# - answer generated?
# - answer length
#
#
# Validation Metrics:
# -------------------
# - PASS / FAIL
#
#
# End-to-End Metrics:
# -------------------
# - latency
# - average score
#
#
# ============================================================
# STEP 24.2 — IMPORTS
# ============================================================
import time
import pandas as pd
# ============================================================
# STEP 24.3 — DEFINE EVALUATION DATASET
# ============================================================
# IMPORTANT
# ---------
# expected_keywords are used ONLY for evaluation.
#
# This is NOT retrieval itself.
#
# We check:
# "Did retrieved chunks contain expected concepts?"
#
#
# You can expand this dataset later.
# Gold test set: each case pairs a query with the keywords its
# retrieved chunks are expected to contain (evaluation only —
# never used by retrieval itself).
evaluation_dataset = [
    {
        "query": "What causes OOMKilled?",
        "expected_keywords": ["OOMKilled", "memory limit"],
    },
    {
        "query": "How do I debug pod crashes?",
        "expected_keywords": ["kubectl logs", "kubectl describe pod"],
    },
    {
        "query": "What is CrashLoopBackOff?",
        "expected_keywords": ["CrashLoopBackOff", "container repeatedly crashes"],
    },
    {
        "query": "How do services expose applications?",
        "expected_keywords": ["ClusterIP", "NodePort", "LoadBalancer"],
    },
    {
        "query": "What are ConfigMaps used for?",
        "expected_keywords": ["ConfigMaps", "configuration data"],
    },
]
# ============================================================
# STEP 24.4 — CHECK RETRIEVAL HIT
# ============================================================
def evaluate_retrieval(retrieved_chunks, expected_keywords):
    """Measure keyword coverage of the retrieved chunks.

    Parameters
    ----------
    retrieved_chunks : list[tuple[float, dict]]
        (score, chunk) pairs; each chunk dict must carry a "text" key.
    expected_keywords : list[str]
        Keywords the retrieved text should contain (case-insensitive).

    Returns
    -------
    dict
        "hit_rate": fraction of expected keywords found (0.0–1.0),
        "matched_keywords": keywords present in the retrieved text,
        "missing_keywords": keywords absent from the retrieved text.
    """
    # --------------------------------------------------------
    # MERGE ALL RETRIEVED TEXT
    # --------------------------------------------------------
    # str.join builds the combined text in one pass instead of
    # quadratic `+=` concatenation.
    combined_text = " ".join(
        item["text"].lower() for _, item in retrieved_chunks
    )

    # --------------------------------------------------------
    # CHECK KEYWORD COVERAGE
    # --------------------------------------------------------
    matched_keywords = []
    missing_keywords = []
    for keyword in expected_keywords:
        if keyword.lower() in combined_text:
            matched_keywords.append(keyword)
        else:
            missing_keywords.append(keyword)

    # --------------------------------------------------------
    # RETRIEVAL HIT SCORE
    # --------------------------------------------------------
    # Guard against division by zero: with no expected keywords the
    # check is vacuously satisfied (hit rate 1.0).
    if expected_keywords:
        hit_rate = len(matched_keywords) / len(expected_keywords)
    else:
        hit_rate = 1.0

    return {
        "hit_rate": hit_rate,
        "matched_keywords": matched_keywords,
        "missing_keywords": missing_keywords,
    }
# ============================================================
# STEP 24.5 — CHECK VALIDATION STATUS
# ============================================================
def extract_validation_status(validation_text):
    """Parse the self-check output into "PASS", "FAIL", or "UNKNOWN"."""
    normalized = validation_text.upper()
    # PASS is tried first, preserving the original precedence if both
    # markers ever appear in one report.
    for status in ("PASS", "FAIL"):
        if f"VALIDATION: {status}" in normalized:
            return status
    return "UNKNOWN"
# ============================================================
# STEP 24.6 — EVALUATE FULL PIPELINE
# ============================================================
# Accumulators for the per-query results and the overall metrics.
evaluation_results = []
total_queries = len(evaluation_dataset)
validation_pass_count = 0
total_hit_rate = 0

for case in evaluation_dataset:
    query = case["query"]
    expected_keywords = case["expected_keywords"]

    print("\n" + "=" * 80)
    print(f"Evaluating Query: {query}")

    # Time the full RAG pipeline run for the latency metric.
    start_time = time.time()
    answer, validation, retrieved_chunks = rag_pipeline(query)
    latency = time.time() - start_time

    # Retrieval quality: did the retrieved chunks contain the
    # expected concepts?
    retrieval_eval = evaluate_retrieval(retrieved_chunks, expected_keywords)
    hit_rate = retrieval_eval["hit_rate"]
    total_hit_rate += hit_rate

    # Grounding quality: did the self-check validate the answer?
    validation_status = extract_validation_status(validation)
    if validation_status == "PASS":
        validation_pass_count += 1

    # Record one row per query for the results dataframe.
    evaluation_results.append({
        "query": query,
        "hit_rate": round(hit_rate, 2),
        "matched_keywords": retrieval_eval["matched_keywords"],
        "missing_keywords": retrieval_eval["missing_keywords"],
        "validation_status": validation_status,
        "latency_seconds": round(latency, 2),
        "answer_length": len(answer),
    })

    print(f"Hit Rate: {hit_rate:.2f}")
    print(f"Validation: {validation_status}")
    print(f"Latency: {latency:.2f}s")
# ============================================================
# STEP 24.7 — BUILD RESULTS DATAFRAME
# ============================================================
results_df = pd.DataFrame(evaluation_results)

banner = "=" * 80
print("\n")
print(banner)
print("EVALUATION RESULTS")
print(banner)
print(results_df)

# ============================================================
# STEP 24.8 — OVERALL METRICS
# ============================================================
average_hit_rate = total_hit_rate / total_queries
validation_pass_rate = validation_pass_count / total_queries
average_latency = results_df["latency_seconds"].mean()
average_answer_length = results_df["answer_length"].mean()

print("\n")
print(banner)
print("OVERALL SYSTEM METRICS")
print(banner)
print(f"Total Queries: {total_queries}")
print(f"Average Retrieval Hit Rate: {average_hit_rate:.2f}")
print(f"Validation PASS Rate: {validation_pass_rate:.2f}")
print(f"Average Latency: {average_latency:.2f} sec")
print(f"Average Answer Length: {average_answer_length:.2f} chars")

# ============================================================
# STEP 24.9 — OPTIONAL CSV EXPORT
# ============================================================
# Save evaluation report for analysis.
results_df.to_csv("rag_evaluation_results.csv", index=False)
print("\nEvaluation report saved:")
print("rag_evaluation_results.csv")
# ============================================================
# STEP 24.10 — INTERPRETATION GUIDE
# ============================================================
# HIGH HIT RATE
# -------------
# Retriever is finding correct chunks.
#
#
# LOW HIT RATE
# ------------
# Retrieval quality problem:
# - chunking issue
# - embedding issue
# - weak keyword retrieval
#
#
# HIGH VALIDATION PASS RATE
# -------------------------
# Answers are grounded well.
#
#
# LOW VALIDATION PASS RATE
# ------------------------
# Hallucination or prompt grounding issue.
#
#
# HIGH LATENCY
# ------------
# Pipeline too slow.
#
#
# LOW ANSWER QUALITY
# ------------------
# Prompt engineering issue.
#
#
# ============================================================
# STEP 24 COMPLETE
# ============================================================
# You now have:
#
# ✅ measurable RAG quality
# ✅ retrieval evaluation
# ✅ grounding evaluation
# ✅ latency metrics
# ✅ benchmarking framework
# ✅ evaluation reports
#
#
# This is now:
#
# "Engineering-grade RAG development"
#
# rather than:
#
# "just experimentation"
#
# ============================================================
# ============================================================
# STEP 24: EVALUATION FRAMEWORK — END
# ============================================================
====================================================================================
OUTPUT
====================================================================================
Retrieval mode: faiss
Total documents: 34
Total chunks created: 32
Prepared data with source attribution.
Generating embeddings...
Embeddings generated successfully.
Embedding matrix shape:
(32, 3072)
FAISS index created.
Total vectors indexed: 32
TF-IDF index built.
================================================================================
QUERY: What is CrashLoopBackOff?
ANSWER:
Answer:
CrashLoopBackOff occurs when a container repeatedly crashes after starting.
Sources Used:
- SOURCE_1
SELF-CHECK RESULT:
VALIDATION: PASS
EXPLANATION:
The answer is directly supported by the content in SOURCE_1.
RETRIEVED SOURCES:
Source ID: SOURCE_1
Score: 0.6287
--------------------------------------------------
Source ID: SOURCE_5
Score: 0.4774
--------------------------------------------------
Source ID: SOURCE_4
Score: 0.4753
--------------------------------------------------
Source ID: SOURCE_27
Score: 0.4688
--------------------------------------------------
Source ID: SOURCE_2
Score: 0.4660
--------------------------------------------------
================================================================================
QUERY: How do I use kubectl logs?
ANSWER:
Answer:
`kubectl logs` is used to retrieve logs from a running container.
Sources Used:
- SOURCE_4
- SOURCE_5
- SOURCE_6
SELF-CHECK RESULT:
VALIDATION: PASS
EXPLANATION:
The answer is directly and explicitly stated in SOURCE_4, SOURCE_5, and SOURCE_6.
RETRIEVED SOURCES:
Source ID: SOURCE_6
Score: 0.7220
--------------------------------------------------
Source ID: SOURCE_5
Score: 0.7051
--------------------------------------------------
Source ID: SOURCE_4
Score: 0.6621
--------------------------------------------------
Source ID: SOURCE_7
Score: 0.5291
--------------------------------------------------
Source ID: SOURCE_13
Score: 0.4240
--------------------------------------------------
================================================================================
QUERY: What causes OOMKilled?
ANSWER:
Answer:
OOMKilled happens when a container exceeds its memory limit.
Sources Used:
- SOURCE_1
- SOURCE_2
SELF-CHECK RESULT:
VALIDATION: PASS
EXPLANATION:
The answer is directly and explicitly stated in both SOURCE_1 and SOURCE_2.
RETRIEVED SOURCES:
Source ID: SOURCE_2
Score: 0.6038
--------------------------------------------------
Source ID: SOURCE_1
Score: 0.5741
--------------------------------------------------
Source ID: SOURCE_3
Score: 0.4389
--------------------------------------------------
Source ID: SOURCE_4
Score: 0.4239
--------------------------------------------------
Source ID: SOURCE_5
Score: 0.4185
--------------------------------------------------
================================================================================
QUERY: Why is my container restarting repeatedly?
ANSWER:
Answer:
A container may restart repeatedly due to several causes:
* **CrashLoopBackOff**: The container repeatedly crashes after starting.
* **OOMKilled**: The container exceeds its memory limit.
* **Missing environment variables**: These can cause a container to crash.
* **Incorrect command or entrypoint**: This can cause container startup failure.
* **Application errors**: Errors occurring inside the container often lead to restarts.
Sources Used:
- SOURCE_1
- SOURCE_2
- SOURCE_3
- SOURCE_4
- SOURCE_5
SELF-CHECK RESULT:
VALIDATION: PASS
EXPLANATION:
All claims made in the answer are directly supported by the provided context:
- CrashLoopBackOff is defined in SOURCE_1.
- OOMKilled is defined in SOURCE_1 and SOURCE_2.
- Missing environment variables is mentioned in SOURCE_1, SOURCE_2, and SOURCE_3.
- Incorrect command or entrypoint is mentioned in SOURCE_2, SOURCE_3, and SOURCE_4.
- Application errors are mentioned in SOURCE_3, SOURCE_4, and SOURCE_5.
RETRIEVED SOURCES:
Source ID: SOURCE_1
Score: 0.6207
--------------------------------------------------
Source ID: SOURCE_3
Score: 0.5957
--------------------------------------------------
Source ID: SOURCE_4
Score: 0.5732
--------------------------------------------------
Source ID: SOURCE_2
Score: 0.5525
--------------------------------------------------
Source ID: SOURCE_5
Score: 0.4963
--------------------------------------------------
================================================================================
QUERY: How do services expose applications?
ANSWER:
Answer:
Services expose applications in the following ways:
* **ClusterIP**: Exposes applications within the cluster.
* **NodePort**: Exposes applications on node IPs.
* **LoadBalancer**: Exposes applications externally.
Additionally, Ingress routes HTTP and HTTPS traffic to services.
Sources Used:
- SOURCE_12
- SOURCE_13
- SOURCE_14
- SOURCE_15
- SOURCE_16
SELF-CHECK RESULT:
VALIDATION: PASS
EXPLANATION:
All statements in the answer are directly supported by the retrieved context. Specifically:
- ClusterIP, NodePort, and LoadBalancer definitions are found in SOURCE_14, SOURCE_15, SOURCE_13, and SOURCE_12.
- The Ingress description is found in SOURCE_15 and SOURCE_16.
RETRIEVED SOURCES:
Source ID: SOURCE_14
Score: 0.7823
--------------------------------------------------
Source ID: SOURCE_15
Score: 0.7405
--------------------------------------------------
Source ID: SOURCE_13
Score: 0.6684
--------------------------------------------------
Source ID: SOURCE_16
Score: 0.6363
--------------------------------------------------
Source ID: SOURCE_12
Score: 0.5554
--------------------------------------------------
================================================================================
QUERY: How do I debug pod crashes using logs?
ANSWER:
Answer:
To retrieve logs from a running container, you can use:
* `kubectl logs`
Additionally, `kubectl describe pod` can be used to show events and state transitions.
Sources Used:
- SOURCE_4
- SOURCE_5
- SOURCE_6
SELF-CHECK RESULT:
VALIDATION: PASS
EXPLANATION:
The answer is fully supported by the retrieved context. SOURCE_4, SOURCE_5, and SOURCE_6 all explicitly state that `kubectl logs` retrieves logs from a running container and that `kubectl describe pod` shows events and state transitions.
RETRIEVED SOURCES:
Source ID: SOURCE_5
Score: 0.6122
--------------------------------------------------
Source ID: SOURCE_6
Score: 0.5722
--------------------------------------------------
Source ID: SOURCE_4
Score: 0.5710
--------------------------------------------------
Source ID: SOURCE_1
Score: 0.5372
--------------------------------------------------
Source ID: SOURCE_2
Score: 0.4618
--------------------------------------------------
================================================================================
Evaluating Query: What causes OOMKilled?
Hit Rate: 1.00
Validation: PASS
Latency: 16.91s
================================================================================
Evaluating Query: How do I debug pod crashes?
Hit Rate: 1.00
Validation: PASS
Latency: 83.98s
================================================================================
Evaluating Query: What is CrashLoopBackOff?
Hit Rate: 1.00
Validation: PASS
Latency: 23.92s
================================================================================
Evaluating Query: How do services expose applications?
Hit Rate: 1.00
Validation: PASS
Latency: 66.64s
================================================================================
Evaluating Query: What are ConfigMaps used for?
Hit Rate: 1.00
Validation: PASS
Latency: 42.03s
================================================================================
EVALUATION RESULTS
================================================================================
query hit_rate \
0 What causes OOMKilled? 1.0
1 How do I debug pod crashes? 1.0
2 What is CrashLoopBackOff? 1.0
3 How do services expose applications? 1.0
4 What are ConfigMaps used for? 1.0
matched_keywords missing_keywords \
0 [OOMKilled, memory limit] []
1 [kubectl logs, kubectl describe pod] []
2 [CrashLoopBackOff, container repeatedly crashes] []
3 [ClusterIP, NodePort, LoadBalancer] []
4 [ConfigMaps, configuration data] []
validation_status latency_seconds answer_length
0 PASS 16.91 105
1 PASS 83.98 828
2 PASS 23.92 109
3 PASS 66.64 325
4 PASS 42.03 208
================================================================================
OVERALL SYSTEM METRICS
================================================================================
Total Queries: 5
Average Retrieval Hit Rate: 1.00
Validation PASS Rate: 1.00
Average Latency: 46.70 sec
Average Answer Length: 315.00 chars
Evaluation report saved:
rag_evaluation_results.csv
No comments:
Post a Comment