Tuesday, May 5, 2026

Python List, NumPy Array and Pandas DataFrame

The strength of NumPy arrays is that they allow vectorized operations (applying a calculation to the whole array at once, without explicit loops). Also, NumPy array elements, unlike Python list elements, are stored in contiguous memory locations, which is what enables these fast, vectorized operations.
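
A minimal sketch of the difference, using made-up numbers:

import numpy as np

prices = [10.0, 12.5, 9.75, 11.25]      # plain Python list
np_prices = np.array(prices)            # contiguous NumPy array

# List: an explicit loop (or comprehension) runs element by element
discounted_list = [p * 0.9 for p in prices]

# NumPy: one vectorized expression applied to the whole array at once
discounted_array = np_prices * 0.9

print(discounted_list)
print(discounted_array)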


C, C++, Rust, Go, FORTRAN, Swift and C# require arrays to be stored in contiguous memory locations.


For Java object arrays and Python lists, the pointers to the elements are stored in contiguous locations, but the elements themselves are not. Java arrays of primitive types, however, are stored in contiguous locations.
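
A small illustration of this storage difference in Python (the exact sizes printed vary by interpreter version):

import sys
import numpy as np

nums_list = [1, 2, 3, 4, 5]
nums_array = np.array([1, 2, 3, 4, 5], dtype=np.int64)

# The list object holds pointers; the int objects themselves live elsewhere on the heap.
print(sys.getsizeof(nums_list))                    # size of the pointer container only
print(sum(sys.getsizeof(n) for n in nums_list))    # the separately stored int objects

# The NumPy array stores all five int64 values in one contiguous buffer.
print(nums_array.flags["C_CONTIGUOUS"])            # True
print(nums_array.nbytes)                           # 40 bytes = 5 * 8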


Feature      Python List             NumPy Array                 Pandas DataFrame
Dimensions   1D (can nest)           N-dimensional               2D (tabular)
Data Types   Heterogeneous (mixed)   Homogeneous (same)          Heterogeneous (per column)
Indexing     Integer-based           Integer-based               Labeled (names) & Integer
Speed        Slow for large data     Fast (C-optimized)          Slower than NumPy for pure math
Best For     General programming     Math/Scientific computing   Data analysis & manipulation
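
A small illustrative comparison of the three structures (the column names and values are made up for the example):

import numpy as np
import pandas as pd

# Python list: heterogeneous, integer-indexed
record = ["alice", 30, 55000.0]
print(record[1])                       # 30

# NumPy array: homogeneous, integer-indexed, N-dimensional
salaries = np.array([55000.0, 61000.0, 47000.0])
print(salaries.mean())

# Pandas DataFrame: 2D, heterogeneous per column, labeled and integer indexing
df = pd.DataFrame({"name": ["alice", "bob"], "salary": [55000.0, 61000.0]})
print(df["salary"].mean())             # label-based column access
print(df.iloc[0])                      # integer-based row access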

JOURNEY 002 - SMART CHUNKING

Step 1: Generate Embeddings
  Key actions: convert each sentence into a vector, use the Google embedding API, store the embeddings in an array.
  Outcome: šŸ‘‰ semantic meaning is captured numerically.

Step 2: Apply Clustering (KMeans)
  Key actions: feed the embeddings into KMeans, choose the number of clusters, assign each sentence to a cluster.
  Outcome: šŸ‘‰ similar sentences get grouped automatically.

Step 3: Build Smart Chunks
  Key actions: combine sentences within each cluster, preserve related context, format into readable blocks.
  Outcome: šŸ‘‰ each chunk now contains meaningful grouped knowledge.

Step 4: Structure the Data
  Key actions: assign unique IDs, store the chunk text, add cluster metadata.
  Outcome: šŸ‘‰ the data becomes RAG-ready and easy to use later.

=================================================================

!pip install google-generativeai scikit-learn numpy

=================================================================

# ================================
# STEP 1: IMPORTS + API SETUP
# ================================

import google.generativeai as genai
#FOR GOOGLE COLAB
from google.colab import userdata
#FOR OTHER ENVIRONS
#import os
import numpy as np
from sklearn.cluster import KMeans
from collections import defaultdict


# šŸ”‘ Replace with your API key
#FOR NON-COLAB
#api_key = os.environ.get("GEMINI_API_KEY")
#FOR COLAB
api_key =  userdata.get("GEMINI_API_KEY")

#if not api_key:
#    raise ValueError("API key not found. Please set GEMINI_API_KEY in Colab Secrets.")

genai.configure(api_key=api_key)

# ================================
# STEP 2: DATA SET
# ================================

documents = [
"CrashLoopBackOff occurs when a container repeatedly crashes after starting.",
"A container may crash due to missing environment variables.",
"Incorrect command or entrypoint can cause container startup failure.",
"Application errors inside the container often lead to restarts.",
"OOMKilled happens when a container exceeds its memory limit.",

"ImagePullBackOff occurs when Kubernetes cannot pull the container image.",
"Incorrect image name or tag can cause image pull failures.",
"Private registries require imagePullSecrets for authentication.",

"kubectl logs retrieves logs from a running container.",
"kubectl describe pod shows events and state transitions.",

"Pods remain pending if no node satisfies resource requests.",
"Node affinity restricts pods to specific nodes.",
"Taints prevent pods from being scheduled on certain nodes.",
"Tolerations allow pods to be scheduled on tainted nodes.",

"Liveness probes determine if a container should be restarted.",
"Readiness probes determine if a pod can receive traffic.",
"A failing readiness probe removes the pod from service endpoints.",

"ClusterIP services expose applications within the cluster.",
"NodePort services expose applications on node IPs.",

"PersistentVolumes provide storage independent of pods.",
"PersistentVolumeClaims request storage resources.",

"ConfigMaps store non-sensitive configuration data.",
"Secrets store sensitive data like passwords and tokens.",

"Deployments manage replica sets and pod updates.",
"Horizontal Pod Autoscaler scales based on CPU or metrics.",

"Pods stuck in Terminating state may have finalizers blocking deletion.",
"RBAC misconfiguration can block access to resources."
]

# ================================
# STEP 3: GENERATE EMBEDDINGS
# ================================
def get_embedding(text):
    response = genai.embed_content(
        # model="models/embedding-001",  # deprecated; using it raises a NotFound error (see below)
        model="models/gemini-embedding-001",
        content=text
    )
    return response["embedding"]

print("Generating embeddings...")

embeddings = []
for doc in documents:
    emb = get_embedding(doc)
    embeddings.append(emb)

embeddings = np.array(embeddings)

print(f"Embeddings shape: {embeddings.shape}")

# ================================
# STEP 4: CLUSTER USING KMEANS
# ================================
NUM_CLUSTERS = 6  # šŸ”§ Tune this later

kmeans = KMeans(n_clusters=NUM_CLUSTERS, random_state=42)
labels = kmeans.fit_predict(embeddings)

print("Clustering complete.")

# ================================
# STEP 5: GROUP DOCUMENTS BY CLUSTER
# ================================
clustered_docs = defaultdict(list)

for doc, label in zip(documents, labels):
    clustered_docs[label].append(doc)

# ================================
# STEP 6: BUILD SMART CHUNKS
# ================================
smart_chunks = []
chunk_cluster_ids = []   # keep the real KMeans cluster id alongside each chunk

for cluster_id, docs in clustered_docs.items():

    chunk_text = f"Cluster {cluster_id}\n\n"

    for d in docs:
        chunk_text += f"- {d}\n"

    smart_chunks.append(chunk_text.strip())
    chunk_cluster_ids.append(cluster_id)

# ================================
# STEP 7: ADD METADATA (FINAL DATA PREPARED)
# ================================
prepared_data = []

for i, (chunk, cluster_id) in enumerate(zip(smart_chunks, chunk_cluster_ids)):
    prepared_data.append({
        "id": f"chunk_{i}",
        "cluster": int(cluster_id),   # the actual cluster id, not the loop index
        "text": chunk
    })
# ================================
# STEP 8: INSPECT RESULTS
# ================================
for item in prepared_data:
    print(item["text"])
    print("=" * 60)

# ================================
# STEP 9: DEBUG CLUSTER QUALITY
# ================================
for cluster_id, docs in clustered_docs.items():
    print(f"\nšŸ”¹ Cluster {cluster_id} ({len(docs)} items)")
    for d in docs:
        print("-", d)


==================================================================================
Output
==================================================================================
Generating embeddings...
Embeddings shape: (27, 3072)
Clustering complete.

Cluster 4
- CrashLoopBackOff occurs when a container repeatedly crashes after starting.
- ImagePullBackOff occurs when Kubernetes cannot pull the container image.
- kubectl logs retrieves logs from a running container.
- kubectl describe pod shows events and state transitions.
============================================================
Cluster 2
- A container may crash due to missing environment variables.
- Incorrect command or entrypoint can cause container startup failure.
- Application errors inside the container often lead to restarts.
- OOMKilled happens when a container exceeds its memory limit.
- Liveness probes determine if a container should be restarted.
============================================================
Cluster 5
- Incorrect image name or tag can cause image pull failures.
- Private registries require imagePullSecrets for authentication.
- ConfigMaps store non-sensitive configuration data.
- Secrets store sensitive data like passwords and tokens.
============================================================
Cluster 0
- Pods remain pending if no node satisfies resource requests.
- Node affinity restricts pods to specific nodes.
- Taints prevent pods from being scheduled on certain nodes.
- Tolerations allow pods to be scheduled on tainted nodes.
- Readiness probes determine if a pod can receive traffic.
- A failing readiness probe removes the pod from service endpoints.
- Horizontal Pod Autoscaler scales based on CPU or metrics.
- Pods stuck in Terminating state may have finalizers blocking deletion.
- RBAC misconfiguration can block access to resources.
============================================================
Cluster 1
- ClusterIP services expose applications within the cluster.
- NodePort services expose applications on node IPs.
============================================================
Cluster 3
- PersistentVolumes provide storage independent of pods.
- PersistentVolumeClaims request storage resources.
- Deployments manage replica sets and pod updates.
============================================================

šŸ”¹ Cluster 4 (4 items)
- CrashLoopBackOff occurs when a container repeatedly crashes after starting.
- ImagePullBackOff occurs when Kubernetes cannot pull the container image.
- kubectl logs retrieves logs from a running container.
- kubectl describe pod shows events and state transitions.

šŸ”¹ Cluster 2 (5 items)
- A container may crash due to missing environment variables.
- Incorrect command or entrypoint can cause container startup failure.
- Application errors inside the container often lead to restarts.
- OOMKilled happens when a container exceeds its memory limit.
- Liveness probes determine if a container should be restarted.

šŸ”¹ Cluster 5 (4 items)
- Incorrect image name or tag can cause image pull failures.
- Private registries require imagePullSecrets for authentication.
- ConfigMaps store non-sensitive configuration data.
- Secrets store sensitive data like passwords and tokens.

šŸ”¹ Cluster 0 (9 items)
- Pods remain pending if no node satisfies resource requests.
- Node affinity restricts pods to specific nodes.
- Taints prevent pods from being scheduled on certain nodes.
- Tolerations allow pods to be scheduled on tainted nodes.
- Readiness probes determine if a pod can receive traffic.
- A failing readiness probe removes the pod from service endpoints.
- Horizontal Pod Autoscaler scales based on CPU or metrics.
- Pods stuck in Terminating state may have finalizers blocking deletion.
- RBAC misconfiguration can block access to resources.

šŸ”¹ Cluster 1 (2 items)
- ClusterIP services expose applications within the cluster.
- NodePort services expose applications on node IPs.

šŸ”¹ Cluster 3 (3 items)
- PersistentVolumes provide storage independent of pods.
- PersistentVolumeClaims request storage resources.
- Deployments manage replica sets and pod updates.















 

Note: with the older model name, the embedding call fails with:

models/embedding-001 is not found for API version v1beta, or is not supported for embedContent.

 

NotFound: 404 POST https://generativelanguage.googleapis.com/v1beta/models/embedding-001:embedContent?%24alt=json%3Benum-encoding%3Dint: models/embedding-001 is not found for API version v1beta, or is not supported for embedContent. Call ListModels to see the list of available models and their supported methods.



This is because embedding-001 is being phased out in favor of gemini-embedding-001.


You can list the available models like this:

for m in genai.list_models():
    if 'embedContent' in m.supported_generation_methods:
        print(m.name)

As of 05/May/2026, on Colab, it listed:

models/gemini-embedding-001
models/gemini-embedding-2-preview
models/gemini-embedding-2


But there are differences between the older embedding-001 and the newer gemini-embedding-001:
  • Dimension Changes: This model produces 3072-dimension vectors by default, unlike older models (768 dimensions), which can potentially break existing vector databases.
  • Dimension Control: You can use output_dimensionality to reduce the vector size to match your existing database requirements (see the sketch below).
  • Batch Size: This model may have strict batch size limits in certain SDK implementations.
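
A minimal sketch of the dimension-control point above, assuming the output_dimensionality parameter of genai.embed_content and a 768-dimension target; adjust the size to whatever your vector database expects, and verify the parameter against your installed SDK version:

def get_embedding_768(text):
    response = genai.embed_content(
        model="models/gemini-embedding-001",
        content=text,
        output_dimensionality=768   # assumption: supported by the installed SDK
    )
    # Depending on the model/SDK, truncated vectors may need re-normalization.
    return response["embedding"]

print(len(get_embedding_768("ConfigMaps store non-sensitive configuration data.")))  # expect 768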






JOURNEY 001 - NAIVE CHUNKING

# ================================
# STEP 1: RAW DATASET
# ================================

documents = [

    # --- Pod Failures ---
    "CrashLoopBackOff occurs when a container repeatedly crashes after starting.",
    "A container may crash due to missing environment variables.",
    "Incorrect command or entrypoint can cause container startup failure.",
    "Application errors inside the container often lead to restarts.",
    "OOMKilled happens when a container exceeds its memory limit.",

    # --- Image Issues ---
    "ImagePullBackOff occurs when Kubernetes cannot pull the container image.",
    "Incorrect image name or tag can cause image pull failures.",
    "Private registries require imagePullSecrets for authentication.",

    # --- Debugging Commands ---
    "kubectl logs retrieves logs from a running container.",
    "kubectl describe pod shows events and state transitions.",

    # --- Scheduling Issues ---
    "Pods remain pending if no node satisfies resource requests.",
    "Node affinity restricts pods to specific nodes.",
    "Taints prevent pods from being scheduled on certain nodes.",
    "Tolerations allow pods to be scheduled on tainted nodes.",

    # --- Probes ---
    "Liveness probes determine if a container should be restarted.",
    "Readiness probes determine if a pod can receive traffic.",
    "A failing readiness probe removes the pod from service endpoints.",

    # --- Networking ---
    "ClusterIP services expose applications within the cluster.",
    "NodePort services expose applications on node IPs.",

    # --- Storage ---
    "PersistentVolumes provide storage independent of pods.",
    "PersistentVolumeClaims request storage resources.",

    # --- Configuration ---
    "ConfigMaps store non-sensitive configuration data.",
    "Secrets store sensitive data like passwords and tokens.",

    # --- Deployment & Scaling ---
    "Deployments manage replica sets and pod updates.",
    "Horizontal Pod Autoscaler scales based on CPU or metrics.",

    # --- Misc ---
    "Pods stuck in Terminating state may have finalizers blocking deletion.",
    "RBAC misconfiguration can block access to resources."
]



# ================================
# STEP 2: CLEANING FUNCTION
# ================================

def clean_text(text):
    return text.strip().replace("\n", " ")


# ================================
# STEP 3: BASIC CHUNKING
# ================================

def chunk_documents(docs, chunk_size=3):
    chunks = []

    for i in range(0, len(docs), chunk_size):
        chunk = " ".join(docs[i:i+chunk_size])
        chunks.append(chunk)

    return chunks


chunks = chunk_documents(documents, chunk_size=3)



# ================================
# STEP 4: ADD METADATA STRUCTURE
# ================================

prepared_data = []

for i, chunk in enumerate(chunks):
    prepared_data.append({
        "id": f"chunk_{i}",
        "text": clean_text(chunk)
    })


# ================================
# STEP 5: INSPECT OUTPUT
# ================================

for item in prepared_data[:5]:
    print(item)
    print("-" * 50)


# ================================
# OPTIONAL: SIZE CHECK
# ================================

print(f"Total raw documents: {len(documents)}")
print(f"Total chunks created: {len(prepared_data)}")




    ==========================================================================

{'id': 'chunk_0', 'text': 'CrashLoopBackOff occurs when a container repeatedly crashes after starting. A container may crash due to missing environment variables. Incorrect command or entrypoint can cause container startup failure.'}
--------------------------------------------------
{'id': 'chunk_1', 'text': 'Application errors inside the container often lead to restarts. OOMKilled happens when a container exceeds its memory limit. ImagePullBackOff occurs when Kubernetes cannot pull the container image.'}
--------------------------------------------------
{'id': 'chunk_2', 'text': 'Incorrect image name or tag can cause image pull failures. Private registries require imagePullSecrets for authentication. kubectl logs retrieves logs from a running container.'}
--------------------------------------------------
{'id': 'chunk_3', 'text': 'kubectl describe pod shows events and state transitions. Pods remain pending if no node satisfies resource requests. Node affinity restricts pods to specific nodes.'}
--------------------------------------------------
{'id': 'chunk_4', 'text': 'Taints prevent pods from being scheduled on certain nodes. Tolerations allow pods to be scheduled on tainted nodes. Liveness probes determine if a container should be restarted.'}
--------------------------------------------------
Total raw documents: 27
Total chunks created: 9

    Chunking Considerations

     

    šŸ”‘ 1: Chunk by meaning, not by size

    šŸ”‘ 2: Keep chunks self-contained

    “If this chunk is retrieved alone, does it still make sense?”

    šŸ”‘ 3: Avoid mixing unrelated topics

    šŸ”‘ 4: Include cause + symptom + fix (if applicable)

    šŸ”‘ 5: Maintain moderate size

    • 50–200 words per chunk (rule of thumb) 

    šŸ”‘ 6: Overlap slightly (see the overlap sketch after this list)

    šŸ”‘ 7: Preserve important keywords

    šŸ‘‰ These are retrieval anchors

    šŸ”‘ 8: Structure beats raw text

    Structured chunks > plain sentences

    šŸ”‘ 9: One chunk = one intent

    šŸ”‘ 10: Think like a query

    “What question would retrieve this?”  => no answer means weak chunk
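
A minimal sketch of "overlap slightly" (key 6), reusing the documents list from above; the window and overlap sizes here are illustrative, not recommendations:

# Illustrative sliding-window chunking with a small overlap between neighbouring chunks.
def chunk_with_overlap(docs, chunk_size=3, overlap=1):
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(docs), step):
        window = docs[i:i + chunk_size]
        if not window:
            break
        chunks.append(" ".join(window))
        if i + chunk_size >= len(docs):
            break   # the last window already reaches the end of the corpus
    return chunks

overlapped = chunk_with_overlap(documents, chunk_size=3, overlap=1)
print(f"Chunks with overlap: {len(overlapped)}")   # more chunks than plain chunking, by design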

    Things you should look for in good RAG test data

    1. It should force semantic search, i.e. it should contain cases that are not obvious to a naive keyword matcher. For example, given an HR database, suppose you search for "leave policy". A dumb algorithm will simply look for the keyword "leave", collect all the data containing the word "leave" and output it, whereas a semantic search result set will also include statements containing "holidays", "off days", etc. (see the similarity sketch after this list).

    2. Overlapping data: the same data should appear multiple times, or appear only once but in different contexts. The RAG system should be able to differentiate context.

    3. It should contain some kind of ambiguity.

    4. Unclear edge cases: a well-built RAG system should be able to reach conclusions on edge cases from the available data, and gracefully say so when it cannot provide accurate information.

    5. Multi-hop reasoning: answering some questions should require combining information from more than one chunk.
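
A minimal sketch of the semantic-vs-keyword point in item 1, reusing the get_embedding helper defined in Journey 002; the query and sentences are made up for illustration:

import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = get_embedding("leave policy")
with_keyword = get_embedding("Employees must apply for leave two days in advance.")
without_keyword = get_embedding("Staff get 12 paid holidays and optional off days each year.")

# A keyword search would only match the first sentence; embedding similarity
# should rate both sentences as related to the query.
print(cosine(query, with_keyword))
print(cosine(query, without_keyword))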



    Types of duplications in data: 

    1. Exact duplicates: the same statement or contextual data is repeated verbatim in many places.

    The models (embedding and LLM) should be intelligent enough to separate noise from required data among exact duplicates. Noise needs to be eliminated: for example, if the string "Employees are entitled to 40 leaves per year." is encountered multiple times, the extra copies can be dropped. But in a legal database, the same sentence may appear multiple times as the sentence for different crimes; those occurrences cannot and should not be eliminated.

    2. Semantic duplicates: the word order or wording changes but the meaning stays the same,
    e.g. "You can change your password after first login" and "Password can be changed by the user after maiden login".

    3. Boilerplate overlap: common repetitive text found across many documents, e.g. headers and footers.



    Strategies to Handle Overlapping Data
    • Semantic Splitting: Using tools to split text at semantic boundaries rather than fixed token counts reduces the need for large overlaps.
    • Deduplication: Implementing hash-based fingerprinting (for exact matches) or using clustering/similarity techniques to identify and remove near-duplicates (a sketch follows this list).
    • Summarization: Detecting overlapping chunks during retrieval and summarizing them into a single, comprehensive context for the LLM.
    • Optimal Overlap Setting: Common best practices suggest a 10% to 20% overlap (e.g., 100 tokens of overlap for a 512-token chunk).
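
A minimal sketch of the deduplication bullet above: hash-based fingerprinting for exact duplicates plus embedding-based detection of near-duplicates. The 0.95 similarity threshold is an illustrative assumption, not a recommended value:

import hashlib
import numpy as np

def dedupe_exact(chunks):
    # Hash-based fingerprinting: drop chunks whose normalized text has already been seen.
    seen, unique = set(), []
    for text in chunks:
        fingerprint = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(text)
    return unique

def find_near_duplicates(embeddings, threshold=0.95):
    # Cosine similarity over normalized embeddings; pairs above the threshold
    # are candidate near-duplicates to review, merge or summarize.
    X = np.array(embeddings)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    pairs = []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs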

     

    Monday, May 4, 2026

    Type hinting and Union Types

    Type hinting and union types provide ways to specify expected data types, improving code readability and enabling early error detection. Their implementation varies significantly between dynamically typed languages (JavaScript, Python) and statically typed ones (Java, C#). 

    1. Python
    • Type Hinting: Introduced in Python 3.5 (PEP 484), type hints allow you to annotate variables and function signatures (e.g., def greet(name: str) -> str:).
    • Union Types: These specify that a value can be one of several types.
      • Old Syntax: Union[int, str] (requires from typing import Union).
      • New Syntax (3.10+): Use the | operator, such as int | str (a short example follows below).
    • Enforcement: By default, Python's interpreter ignores these hints at runtime. They are primarily used by static analysis tools like mypy or IDEs to catch bugs. 
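
A small, self-contained illustration of both union syntaxes (the function and values are made up for the example):

from typing import Union

# Old syntax: works on Python 3.5+
def parse_id_old(value: Union[int, str]) -> int:
    return int(value)

# New syntax: the | operator, Python 3.10+
def parse_id(value: int | str) -> int:
    return int(value)

print(parse_id("42"), parse_id(7))   # 42 7
# The interpreter does not enforce these hints at runtime; a checker like mypy
# or an IDE reports mismatches such as parse_id([1, 2]).
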
    2. JavaScript (and TypeScript)
    • Type Hinting: Standard JavaScript does not have native type hinting. Developers often use JSDoc comments (e.g., /** @type {string} */) to provide hints for editors like VS Code.
    • TypeScript: Most "type hinting" in the JS ecosystem occurs via TypeScript, which adds a full static type system.
    • Union Types: TypeScript uses the | symbol (e.g., string | number) to describe a value that can be one of several types.
    • Enforcement: Like Python, these are removed during compilation and are not enforced by the JavaScript engine at runtime. 
    3. C#
    • Type Hinting: C# is statically typed, meaning types are usually mandatory and enforced by the compiler.
    • Union Types: Traditionally, C# lacked native union types, forcing developers to use inheritance or discriminated union patterns with records.
    • Upcoming Feature: Modern C# (previewed for C# 15) is introducing native union declarations using the union keyword, allowing a closed set of types that the compiler can exhaustively check in switch expressions. 
    4. Java
    • Type Hinting: Java requires explicit type declarations for all variables and method signatures (e.g., int count = 5;), which the compiler strictly enforces.
    • Union Types: Java does not have native union types.
      • Workarounds: Developers simulate them using sealed classes or interfaces (introduced in Java 17) combined with pattern matching to handle different possible subtypes safely.
    • Exception: The only native "union-like" behavior is in multi-catch blocks: catch (IOException | SQLException e). 
    Comparison Summary
    Language     Type Hinting Style          Union Type Support           Runtime Enforcement
    Python       Optional annotations        int | str (3.10+)            No (Static only)
    JavaScript   JSDoc (Native) / TS         string | number (TS)         No (Static only)
    C#           Static & mandatory          union (Upcoming)             Yes (Compiler)
    Java         Static & mandatory          Sealed classes (Simulated)   Yes (Compiler)
