Tuesday, May 5, 2026

JOURNEY 002 - SMART CHUNKING

Step / Key Actions / Outcome

Step: Generate Embeddings
Key Actions:
- Convert each sentence into a vector
- Use Google embedding API
- Store embeddings in array
Outcome: 👉 Semantic meaning is captured numerically

Step: Apply Clustering (KMeans)
Key Actions:
- Feed embeddings into KMeans
- Choose number of clusters
- Assign each sentence to a cluster
Outcome: 👉 Similar sentences get grouped automatically

Step: Build Smart Chunks
Key Actions:
- Combine sentences within each cluster
- Preserve related context
- Format into readable blocks
Outcome: 👉 Each chunk now contains meaningful grouped knowledge

Step: Structure the Data
Key Actions:
- Assign unique IDs
- Store chunk text
- Add cluster metadata
Outcome: 👉 Data becomes RAG-ready and easy to use later

=================================================================

!pip install google-generativeai scikit-learn numpy

=================================================================

# ================================
# STEP 1: IMPORTS + API SETUP
# ================================

import google.generativeai as genai
#FOR GOOGLE COLAB
from google.colab import userdata
#FOR OTHER ENVIRONS
#import os
import numpy as np
from sklearn.cluster import KMeans
from collections import defaultdict


# 🔑 Load your API key from a secret store (do not hard-code it)
#FOR NON-COLAB
#api_key = os.environ.get("GEMINI_API_KEY")
#FOR COLAB
api_key =  userdata.get("GEMINI_API_KEY")

# Fail early with a clear message if the key is missing or empty
if not api_key:
    raise ValueError("API key not found. Please set GEMINI_API_KEY in Colab Secrets.")

genai.configure(api_key=api_key)

# ================================
# STEP 2: DATA SET
# ================================

documents = [
"CrashLoopBackOff occurs when a container repeatedly crashes after starting.",
"A container may crash due to missing environment variables.",
"Incorrect command or entrypoint can cause container startup failure.",
"Application errors inside the container often lead to restarts.",
"OOMKilled happens when a container exceeds its memory limit.",

"ImagePullBackOff occurs when Kubernetes cannot pull the container image.",
"Incorrect image name or tag can cause image pull failures.",
"Private registries require imagePullSecrets for authentication.",

"kubectl logs retrieves logs from a running container.",
"kubectl describe pod shows events and state transitions.",

"Pods remain pending if no node satisfies resource requests.",
"Node affinity restricts pods to specific nodes.",
"Taints prevent pods from being scheduled on certain nodes.",
"Tolerations allow pods to be scheduled on tainted nodes.",

"Liveness probes determine if a container should be restarted.",
"Readiness probes determine if a pod can receive traffic.",
"A failing readiness probe removes the pod from service endpoints.",

"ClusterIP services expose applications within the cluster.",
"NodePort services expose applications on node IPs.",

"PersistentVolumes provide storage independent of pods.",
"PersistentVolumeClaims request storage resources.",

"ConfigMaps store non-sensitive configuration data.",
"Secrets store sensitive data like passwords and tokens.",

"Deployments manage replica sets and pod updates.",
"Horizontal Pod Autoscaler scales based on CPU or metrics.",

"Pods stuck in Terminating state may have finalizers blocking deletion.",
"RBAC misconfiguration can block access to resources."
]

# ================================
# STEP 3: GENERATE EMBEDDINGS
# ================================
def get_embedding(text):
    response = genai.embed_content(
        # model="models/embedding-001",  # deprecated; using it now raises an error
        model="models/gemini-embedding-001",
        content=text
    )
    return response["embedding"]

print("Generating embeddings...")

embeddings = []
for doc in documents:
    emb = get_embedding(doc)
    embeddings.append(emb)

embeddings = np.array(embeddings)

print(f"Embeddings shape: {embeddings.shape}")

# ================================
# STEP 4: CLUSTER USING KMEANS
# ================================
NUM_CLUSTERS = 6  # 🔧 Tune this later

kmeans = KMeans(n_clusters=NUM_CLUSTERS, random_state=42)
labels = kmeans.fit_predict(embeddings)

print("Clustering complete.")

# ================================
# STEP 5: GROUP DOCUMENTS BY CLUSTER
# ================================
clustered_docs = defaultdict(list)

for doc, label in zip(documents, labels):
    clustered_docs[label].append(doc)

# ================================
# STEP 6: BUILD SMART CHUNKS
# ================================
smart_chunks = []

for cluster_id, docs in clustered_docs.items():
   
    chunk_text = f"Cluster {cluster_id}\n\n"
   
    for d in docs:
        chunk_text += f"- {d}\n"
   
    smart_chunks.append(chunk_text.strip())

# ================================
# STEP 7: ADD METADATA (FINAL DATA PREPARED)
# ================================
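prepared_data = []

# smart_chunks was built in the same order as clustered_docs, so zip pairs
# each chunk with its real KMeans label. The enumerate index is only used
# for the ID; it is not the cluster label.
for i, (cluster_id, chunk) in enumerate(zip(clustered_docs, smart_chunks)):
    prepared_data.append({
        "id": f"chunk_{i}",
        "cluster": int(cluster_id),  # plain int so the record is JSON-friendly
        "text": chunk
    })
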
# ================================
# STEP 8: INSPECT RESULTS
# ================================
for item in prepared_data:
    print(item["text"])
    print("=" * 60)

# ================================
# STEP 9: DEBUG CLUSTER QUALITY
# ================================
for cluster_id, docs in clustered_docs.items():
    print(f"\n🔹 Cluster {cluster_id} ({len(docs)} items)")
    for d in docs:
        print("-", d)
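
NUM_CLUSTERS = 6 in Step 4 is a starting guess. One common way to compare candidate values is the silhouette score from scikit-learn (higher means tighter, better separated clusters). The sketch below is not part of the original notebook: `sweep_cluster_counts` is a hypothetical helper, and `demo_vectors` is synthetic data standing in for the real `embeddings` array so the snippet runs without an API key.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep_cluster_counts(vectors, candidates, random_state=42):
    """Return {k: silhouette score} for each candidate number of clusters."""
    scores = {}
    for k in candidates:
        model = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = model.fit_predict(vectors)
        scores[k] = float(silhouette_score(vectors, labels))
    return scores


# Synthetic stand-in for `embeddings`: three well-separated groups of 20 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo_vectors = np.vstack(
    [c + rng.normal(scale=0.5, size=(20, 2)) for c in centers]
)

scores = sweep_cluster_counts(demo_vectors, candidates=range(2, 7))
best_k = max(scores, key=scores.get)

for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k:", best_k)
```

In the notebook you would pass `embeddings` instead of `demo_vectors`. With only 27 sentences the scores can be noisy, so treat the result as a hint and still read the clusters by eye as in Step 9.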


==================================================================================
Output
==================================================================================
Generating embeddings...
Embeddings shape: (27, 3072)
Clustering complete.
Cluster 4

- CrashLoopBackOff occurs when a container repeatedly crashes after starting.
- ImagePullBackOff occurs when Kubernetes cannot pull the container image.
- kubectl logs retrieves logs from a running container.
- kubectl describe pod shows events and state transitions.
============================================================
Cluster 2

- A container may crash due to missing environment variables.
- Incorrect command or entrypoint can cause container startup failure.
- Application errors inside the container often lead to restarts.
- OOMKilled happens when a container exceeds its memory limit.
- Liveness probes determine if a container should be restarted.
============================================================
Cluster 5

- Incorrect image name or tag can cause image pull failures.
- Private registries require imagePullSecrets for authentication.
- ConfigMaps store non-sensitive configuration data.
- Secrets store sensitive data like passwords and tokens.
============================================================
Cluster 0

- Pods remain pending if no node satisfies resource requests.
- Node affinity restricts pods to specific nodes.
- Taints prevent pods from being scheduled on certain nodes.
- Tolerations allow pods to be scheduled on tainted nodes.
- Readiness probes determine if a pod can receive traffic.
- A failing readiness probe removes the pod from service endpoints.
- Horizontal Pod Autoscaler scales based on CPU or metrics.
- Pods stuck in Terminating state may have finalizers blocking deletion.
- RBAC misconfiguration can block access to resources.
============================================================
Cluster 1

- ClusterIP services expose applications within the cluster.
- NodePort services expose applications on node IPs.
============================================================
Cluster 3

- PersistentVolumes provide storage independent of pods.
- PersistentVolumeClaims request storage resources.
- Deployments manage replica sets and pod updates.
============================================================

🔹 Cluster 4 (4 items)
- CrashLoopBackOff occurs when a container repeatedly crashes after starting.
- ImagePullBackOff occurs when Kubernetes cannot pull the container image.
- kubectl logs retrieves logs from a running container.
- kubectl describe pod shows events and state transitions.

🔹 Cluster 2 (5 items)
- A container may crash due to missing environment variables.
- Incorrect command or entrypoint can cause container startup failure.
- Application errors inside the container often lead to restarts.
- OOMKilled happens when a container exceeds its memory limit.
- Liveness probes determine if a container should be restarted.

🔹 Cluster 5 (4 items)
- Incorrect image name or tag can cause image pull failures.
- Private registries require imagePullSecrets for authentication.
- ConfigMaps store non-sensitive configuration data.
- Secrets store sensitive data like passwords and tokens.

🔹 Cluster 0 (9 items)
- Pods remain pending if no node satisfies resource requests.
- Node affinity restricts pods to specific nodes.
- Taints prevent pods from being scheduled on certain nodes.
- Tolerations allow pods to be scheduled on tainted nodes.
- Readiness probes determine if a pod can receive traffic.
- A failing readiness probe removes the pod from service endpoints.
- Horizontal Pod Autoscaler scales based on CPU or metrics.
- Pods stuck in Terminating state may have finalizers blocking deletion.
- RBAC misconfiguration can block access to resources.

🔹 Cluster 1 (2 items)
- ClusterIP services expose applications within the cluster.
- NodePort services expose applications on node IPs.

🔹 Cluster 3 (3 items)
- PersistentVolumes provide storage independent of pods.
- PersistentVolumeClaims request storage resources.
- Deployments manage replica sets and pod updates.
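
To use the chunks later in a RAG pipeline, each chunk needs a vector to search against. Below is a minimal sketch of that lookup using cosine similarity in NumPy. Assumptions: `top_k_chunks` is a hypothetical helper, the records only mimic the shape of `prepared_data`, and the 3-dimensional vectors and texts are made up for illustration. In practice the chunk vectors might come from embedding each chunk's text with `get_embedding`, or from averaging the sentence embeddings inside each cluster.

```python
import numpy as np


def top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Rank chunks by cosine similarity to the query vector."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(chunk_vecs, dtype=float)
    # Normalise to unit length so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(sims)[::-1][:k]  # indices of the k highest scores
    return [(chunks[i]["id"], float(sims[i])) for i in order]


# Toy records shaped like prepared_data (texts are placeholders).
demo_chunks = [
    {"id": "chunk_0", "cluster": 4, "text": "pod troubleshooting notes"},
    {"id": "chunk_1", "cluster": 1, "text": "service exposure notes"},
    {"id": "chunk_2", "cluster": 2, "text": "container restart notes"},
]
# Made-up vectors, one row per chunk, in the same order as demo_chunks.
demo_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]

results = top_k_chunks([1.0, 0.0, 0.0], demo_vecs, demo_chunks, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

The query vector has to come from the same embedding model as the chunk vectors, otherwise the similarity scores are not comparable.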















 
