Saturday, May 30, 2026

Machine Learning and AI Model Taxonomy

The following table compares major categories of Machine Learning, Deep Learning, Generative AI, and Reinforcement Learning models.

| Category | Model Type | Core Purpose / Characteristic | Ideal Input Data Type | Training Paradigm | Popular Examples |
| --- | --- | --- | --- | --- | --- |
| Traditional ML | Linear Models | Assumes linear relationships between features. | Structured/Tabular (Numbers) | Supervised | Linear Regression, Logistic Regression |
| Traditional ML | Tree-Based Models | Splits data like flowchart branches based on values. | Structured/Tabular (Mixed) | Supervised | Decision Trees, Random Forest, XGBoost |
| Traditional ML | Distance-Based | Classifies data points based on geometric proximity. | Structured/Tabular (Normalized) | Supervised | K-Nearest Neighbors, SVM |
| Traditional ML | Probabilistic | Uses probability theory and Bayes' Theorem. | Structured, Text (Word counts) | Supervised | Naive Bayes, Hidden Markov Models |
| Traditional ML | Clustering | Unsupervised grouping of similar unlabeled points. | Structured/Tabular | Unsupervised | K-Means, DBSCAN |
| Traditional ML | Dimensionality | Compresses datasets by reducing redundant features. | High-Dimensional Tabular | Unsupervised | PCA, t-SNE |
| RNNs & Sequence | Vanilla RNN | Processes sequences step-by-step with memory. | Sequential (Text, Time-Series) | Supervised/Self-Sup. | Standard Elman RNN |
| RNNs & Sequence | LSTM | Retains long-term context using gating mechanisms. | Sequential (Text, Audio, Sensors) | Supervised/Self-Sup. | Standard LSTM, BiLSTM |
| RNNs & Sequence | GRU | Streamlined, faster version of LSTM with fewer gates. | Sequential (Text, Audio, Sensors) | Supervised/Self-Sup. | Standard GRU |
| CNNs (Spatial) | Image Class. | Identifies the main subject within a static frame. | Spatial Grids (Images, Videos) | Supervised | ResNet, VGG16, MobileNet |
| CNNs (Spatial) | Object Detection | Locates and labels multiple distinct items in space. | Spatial Grids (Images, Videos) | Supervised | YOLO, Faster R-CNN |
| CNNs (Spatial) | Segmentation | Classifies every individual pixel. | Spatial Grids (Medical scans) | Supervised | U-Net, Mask R-CNN |
| Transformers | Encoder-Only | Extracts context and meaning from sequences. | Sequential (Text, Code) | Self-Supervised | BERT, RoBERTa |
| Transformers | Decoder-Only | Predicts the next sequence element autoregressively. | Sequential (Text, Code) | Self-Supervised | GPT-4, Llama 3, Claude 3.5 |
| Transformers | Encoder-Decoder | Translates/maps one sequence onto another. | Sequential (Source Text) | Self-Supervised | T5, BART |
| Generative AI | Multimodal | Processes and outputs multiple mediums natively. | Mixed (Text, Image, Video, Audio) | Self-Supervised | Google Gemini, GPT-4o |
| Generative AI | Diffusion Models | Generates media by removing noise iteratively. | Text prompts, Random noise | Supervised (Latent) | Stable Diffusion, Midjourney, Sora |
| Generative AI | GANs | Two networks compete to create realistic data. | Random noise vectors, Images | Unsupervised/Adversarial | StyleGAN, CycleGAN |
| Generative AI | VAEs | Compresses data down and decodes new variants. | Images, Structured vectors | Unsupervised | Beta-VAE |
| Reinforcement | Value-Based RL | Finds actions by calculating future rewards. | Environment States, Screen pixels | Trial-and-error Reward | Deep Q-Networks (DQN) |
| Reinforcement | Policy-Based RL | Directly learns behaviors for a given environment. | Environment States, Screen pixels | Trial-and-error Reward | |

CNN vs RNN

The following table compares the key characteristics of CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network).

| Feature | CNN (Convolutional Neural Network) | RNN (Recurrent Neural Network) |
| --- | --- | --- |
| Primary Data Type | Spatial Data (Images, grids, matrices) | Sequential Data (Text, audio, time-series) |
| Feature Extraction | Extracts spatial features hierarchically (edges, shapes, objects) using convolutional filters. | Extracts temporal features by learning patterns and dependencies across time steps. |
| Memory & Context | Stateless and feedforward. Does not remember context or previous steps; processes each input independently. | Stateful with memory loops. Retains a hidden state to pass context from previous steps forward. |
| How It Works | Uses filters/kernels to slide over an image and detect localized patterns. | Uses recurrent feedback loops, allowing past data to influence future predictions. |
| Input/Output Size | Usually requires fixed-size inputs and outputs. | Highly flexible; handles variable-length inputs and outputs. |
| Training Speed | Faster. Convolutions allow for highly parallelized processing. | Slower. Must process data step-by-step, making parallelization difficult. |
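
The contrast in the table can be sketched in a few lines of plain Python. This is an illustrative toy, not how a deep-learning framework implements either layer: `conv1d` and `rnn_scan` are hypothetical helper names, the "image" is a 1D list, and the weights are hand-picked scalars.

```python
import math

def conv1d(signal, kernel):
    """Slide a kernel over the signal; each output depends only on its own window."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def rnn_scan(signal, w_in=0.5, w_rec=0.9):
    """Elman-style recurrence: each hidden state depends on the previous one."""
    h = 0.0
    states = []
    for x in signal:
        h = math.tanh(w_in * x + w_rec * h)  # past state feeds into the new state
        states.append(h)
    return states

signal = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
edges = conv1d(signal, kernel=[-1.0, 1.0])  # simple edge detector
states = rnn_scan(signal)
```

Every entry of `edges` is computed from its own two-element window, so the windows could be evaluated in any order or in parallel. The entries of `states` must be computed in sequence, and the final state is still non-zero after the input has returned to zero, which is the "memory" the table refers to.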

LSTM and Types of Recurrent Neural Network (RNN) Architectures

LSTM (Long Short-Term Memory) is a specialized type of Recurrent Neural Network (RNN) designed to overcome the memory limitations of standard RNNs [1].

The broader family of RNN models can be categorized into several architectural types based on how inputs and outputs are structured:

1. Standard/Vanilla RNNs

  • One-to-One: Used for standard classification where temporal sequence is not a factor.
  • One-to-Many: Takes a single input to output a sequence (e.g., image captioning, where one image generates a descriptive sentence).
  • Many-to-One: Takes a sequence of inputs and produces a single output (e.g., sentiment analysis of a text block).

2. Sequence Models (Many-to-Many)

  • Synchronous: Inputs and outputs are aligned step-by-step (e.g., video frame classification).
  • Asynchronous (Encoder-Decoder): The input sequence is read entirely before the output sequence begins (e.g., machine translation).

3. Advanced/Modified RNN Architectures

| Architecture | Description |
| --- | --- |
| LSTM (Long Short-Term Memory) | Features "gating" mechanisms that regulate information flow, allowing the model to remember long-term dependencies. |
| GRU (Gated Recurrent Unit) | A streamlined variation of LSTM that combines the forget and input gates into a single update gate, often training faster. |
| Bidirectional RNNs | Processes sequences in both forward and backward directions simultaneously, useful when the entire context is needed (e.g., filling in missing words in a sentence). |
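
To make the "single update gate" concrete, here is a minimal scalar sketch of one GRU step. It assumes the common formulation in which the update gate `z` blends the previous state with a candidate state (some references swap the roles of `z` and `1 - z`); the parameter names in `params` are hypothetical and the values are arbitrary.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One scalar GRU step. p maps hypothetical parameter names to floats."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate alone decides how much old state to keep vs. overwrite.
    return (1.0 - z) * h_prev + z * h_cand

params = {"wz": 1.0, "uz": 1.0, "bz": 0.0,
          "wr": 1.0, "ur": 1.0, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [1.0, 0.5, -0.5]:
    h = gru_step(x, h, params)

# With a strongly negative update-gate bias the gate stays near 0,
# so the previous state is carried forward almost unchanged.
frozen = dict(params, bz=-20.0)
h_kept = gru_step(1.0, 0.7, frozen)
```

An LSTM would use separate forget and input gates plus a cell state for the same job; here a single `z` covers both "forget" and "write".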

PyTorch torch.dot() does not broadcast

In PyTorch, torch.dot() does not broadcast because it is strictly designed to compute the dot product of two 1D tensors (vectors) with the same number of elements.

If you pass multi-dimensional tensors (like matrices or batches) to torch.dot(), PyTorch will throw a RuntimeError.
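
A small check of this behaviour, as a sketch: `dot_fails` is a hypothetical helper that only reports whether torch.dot raised a RuntimeError for the given inputs.

```python
import torch

v = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([4.0, 5.0, 6.0])
ok = torch.dot(v, w)  # 1*4 + 2*5 + 3*6

def dot_fails(a, b):
    """Return True if torch.dot rejects the inputs with a RuntimeError."""
    try:
        torch.dot(a, b)
    except RuntimeError:
        return True
    return False

matrix_rejected = dot_fails(torch.randn(2, 3), torch.randn(2, 3))  # 2D inputs
length_rejected = dot_fails(torch.randn(3), torch.randn(4))        # size mismatch
```

Both the 2D inputs and the mismatched 1D lengths are rejected; only equal-length vectors are accepted.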

🛠️ The Solution: What to Use Instead

For matrix products, batched products, or a broadcasted dot product, use one of the following alternatives depending on your use case. Only some of them broadcast:

1. Use torch.matmul or the @ Operator (Recommended)

This is the closest equivalent to NumPy's np.matmul (np.dot follows different rules for inputs with more than two dimensions). It broadcasts across the leading batch dimensions.

Best for: Standard matrix multiplication, vector-matrix products, and batched operations.

python
import torch

# Batch of 10 matrices (10 x 3 x 4) and a matrix (4 x 5)
A = torch.randn(10, 3, 4)
B = torch.randn(4, 5)

# B is automatically broadcasted to match A's batch size
result = torch.matmul(A, B)  # Shape: [10, 3, 5]
# OR using the operator
result = A @ B               # Shape: [10, 3, 5]

2. Use torch.mm

This multiplies exactly two 2D matrices. It does not broadcast.

Best for: Strict 2D matrix multiplication where you want an error if dimensions don't align perfectly.

python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 5)
result = torch.mm(A, B)  # Shape: [3, 5]

3. Use torch.bmm

This performs batch matrix multiplication. Both tensors must be 3D, and their batch sizes must match exactly. It does not broadcast.

Best for: Explicitly controlled batch matrix multiplications.

python
import torch

A = torch.randn(10, 3, 4)
B = torch.randn(10, 4, 5)
result = torch.bmm(A, B)  # Shape: [10, 3, 5]

4. Use Element-wise Multiplication * with .sum()

If you want a traditional dot product behavior (multiply matching elements and sum them up) over a specific dimension of a broadcasted tensor, combine the * operator with .sum().

Best for: Custom element-wise operations before reducing.

python
import torch

A = torch.randn(10, 3)
B = torch.randn(1, 3)  # Broadcasts along the batch dimension (1 -> 10)

# Multiply element-wise (broadcasts) and sum over the last dimension
result = (A * B).sum(dim=-1)  # Shape: [10]

📊 Quick Comparison Summary

| Function / Operator | Input Dimensions Allowed | Supports Broadcasting? | Primary Use Case |
| --- | --- | --- | --- |
| torch.dot | Strictly 1D and 1D | ❌ No | Basic vector-vector dot product |
| torch.mm | Strictly 2D and 2D | ❌ No | Standard 2D matrix multiplication |
| torch.bmm | Strictly 3D and 3D | ❌ No | Strict batch matrix multiplication |
| torch.matmul / @ | 1D and higher | ✅ Yes | Flexible, broadcast-safe multiplication |
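
As a sanity check on the summary, the sketch below computes the same products several ways and stores the results for comparison. The shapes mirror the examples above; the variable names are illustrative, and the seed is fixed only to make the run repeatable.

```python
import torch

torch.manual_seed(0)
A = torch.randn(10, 3, 4)
B = torch.randn(4, 5)

via_matmul = A @ B                                 # B is broadcast over the batch
via_bmm = torch.bmm(A, B.expand(10, 4, 5))         # batch dimension made explicit
via_mm = torch.stack([torch.mm(a, B) for a in A])  # one strict 2D product per item

x = torch.randn(10, 3)
y = torch.randn(3)
rowwise_dot = (x * y).sum(dim=-1)                  # broadcasted dot product per row
via_dot = torch.stack([torch.dot(row, y) for row in x])
```

The strict functions give the same numbers (up to floating-point tolerance) once the batch handling is written out by hand, which is the work torch.matmul does for you.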

Back to Basics (Mathematics!): If an expression contains a square root or a fraction, how do you decide whether to apply the Product Rule or the Chain Rule?

When an expression contains square roots or fractions, the choice between the chain rule and the product rule still depends on whether the functions are nested or multiplied.

To make differentiation easier, always rewrite square roots as fractional exponents ($\sqrt{x} = x^{1/2}$) and fractions using negative exponents ($1/x = x^{-1}$) before applying either rule.

Here is how you handle square roots and fractions with both rules.

1. Identify Rules for Square Roots

Chain Rule (Nested Square Root)

Use the chain rule when an entire multi-term expression sits inside the square root.

Example: $y = \sqrt{5x^3 + 2}$

Rewrite: $y = (5x^3 + 2)^{1/2}$

Step 1: Differentiate Outside Function

Bring down the exponent 1/2 and subtract 1 from the power. Leave the inside unchanged.

$\frac{1}{2}(5x^3 + 2)^{-1/2}$

Step 2: Multiply by Inside Derivative

The derivative of the inside, $5x^3 + 2$, is $15x^2$. Multiply the outside derivative by it.

$\frac{dy}{dx} = \frac{1}{2}(5x^3 + 2)^{-1/2} \cdot 15x^2$

Step 3: Simplify and Rewrite

$\frac{dy}{dx} = \frac{15x^2}{2\sqrt{5x^3 + 2}}$

Product Rule (Multiplied Square Root)

Use the product rule when a square root is an independent term multiplying another distinct function of x.

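Example: $y = \sqrt{x} \cdot \ln(x)$

Rewrite: $y = x^{1/2} \cdot \ln(x)$

Step 1: Set up Parts

First function (f): $f = x^{1/2} \Rightarrow f' = \frac{1}{2}x^{-1/2} = \frac{1}{2\sqrt{x}}$

Second function (g): $g = \ln(x) \Rightarrow g' = \frac{1}{x}$

Step 2: Apply Product Formula

Compute $f' \cdot g + f \cdot g'$:

$\frac{dy}{dx} = \frac{1}{2\sqrt{x}} \cdot \ln(x) + \sqrt{x} \cdot \frac{1}{x}$

Step 3: Simplify and Rewrite

$\frac{dy}{dx} = \frac{\ln(x)}{2\sqrt{x}} + \frac{\sqrt{x}}{x} = \frac{\ln(x) + 2}{2\sqrt{x}}$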

2. Identify Rules for Fractions

Chain Rule (Nested Fraction)

Use the chain rule when a fraction is nested inside another power or function, or when the entire denominator can be raised to a negative exponent.

Example: $y = \frac{1}{x^2 + 4}$

Rewrite: $y = (x^2 + 4)^{-1}$

Step 1: Differentiate Outside Function

Bring down -1 and decrease the power to -2.

$-1 \cdot (x^2 + 4)^{-2}$

Step 2: Multiply by Inside Derivative

The derivative of the inside, $x^2 + 4$, is $2x$.

$\frac{dy}{dx} = -1 \cdot (x^2 + 4)^{-2} \cdot 2x$

Step 3: Simplify and Rewrite

$\frac{dy}{dx} = \frac{-2x}{(x^2 + 4)^2}$

Product Rule (Multiplied Fraction)

Use the product rule instead of the quotient rule when you rewrite a fractional term as a negative power multiplying another function.

Example: $y = \frac{e^x}{x^3}$

Rewrite: $y = e^x \cdot x^{-3}$

Step 1: Set up Parts

First function (f): $f = e^x \Rightarrow f' = e^x$

Second function (g): $g = x^{-3} \Rightarrow g' = -3x^{-4}$

Step 2: Apply Product Formula

Compute $f' \cdot g + f \cdot g'$:

$\frac{dy}{dx} = e^x \cdot x^{-3} + e^x \cdot (-3x^{-4})$

Step 3: Simplify and Rewrite

$\frac{dy}{dx} = \frac{e^x}{x^3} - \frac{3e^x}{x^4} = \frac{e^x(x - 3)}{x^4}$

Side-by-Side Structural Summary

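| Structure Type | Function Appearance | Rule Choice | Rewrite Strategy |
| --- | --- | --- | --- |
| Nested Root | $y = \sqrt{\text{expression}}$ | Chain Rule | $(\text{expression})^{1/2}$ |
| Multiplied Root | $y = \sqrt{x} \cdot f(x)$ | Product Rule | $x^{1/2} \cdot f(x)$ |
| Nested Fraction | $y = 1/\text{expression}$ | Chain Rule | $(\text{expression})^{-1}$ |
| Multiplied Fraction | $y = f(x) \cdot 1/g(x)$ | Product Rule | $f(x) \cdot (g(x))^{-1}$ |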
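
One way to double-check hand-derived results like the four above is to compare them against a numerical derivative. The sketch below uses a central difference; `numeric_derivative` is a hypothetical helper, and the test point `x0 = 1.5` is an arbitrary value at which all four functions are defined.

```python
import math

def numeric_derivative(f, x, h=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# (function, hand-derived derivative) pairs from the four worked examples
examples = {
    "nested root": (
        lambda x: math.sqrt(5 * x**3 + 2),
        lambda x: 15 * x**2 / (2 * math.sqrt(5 * x**3 + 2)),
    ),
    "multiplied root": (
        lambda x: math.sqrt(x) * math.log(x),
        lambda x: (math.log(x) + 2) / (2 * math.sqrt(x)),
    ),
    "nested fraction": (
        lambda x: 1 / (x**2 + 4),
        lambda x: -2 * x / (x**2 + 4) ** 2,
    ),
    "multiplied fraction": (
        lambda x: math.exp(x) / x**3,
        lambda x: math.exp(x) * (x - 3) / x**4,
    ),
}

x0 = 1.5
errors = {
    name: abs(numeric_derivative(f, x0) - df(x0))
    for name, (f, df) in examples.items()
}
```

If a rule was misapplied (for example, forgetting the inside derivative in the chain rule), the corresponding entry in `errors` would be far from zero.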

How a Neural Network Calculates Loss During Supervised Training

Consider training a neural network on a labelled dataset of cat and dog images. The network calculates loss during training by mathematically comparing its predicted output against the explicit ground-truth label provided in the training dataset. The network cannot detect an error by looking at an image alone; it relies entirely on human-provided answers (labels) to measure its mistakes.

Step 1: The Forward Pass

When a network sees an image for the first time, it performs a forward pass:

Input: The raw pixel values of the image are fed into the input layer.
Calculation: The pixels pass through hidden layers where they are multiplied by randomly initialized weights.
Prediction: The output layer generates a guess, usually formatted as decimal probabilities.
For example, if you feed the network a new image of a Cat, it might output:
[Cat: 0.20, Dog: 0.80] (It guessed a dog).

Step 2: The Ground-Truth Comparison

The network "knows" it is wrong because supervised training data pairs every image with an exact answer key called a ground-truth label. This label is converted into a vector using a process called one-hot encoding:

True Label for Cat:
[Cat: 1.0, Dog: 0.0]

Step 3: Calculating the Loss Value

The loss function acts as a mathematical evaluator that compares the prediction vector to the true label vector.

A common loss function for classification is Cross-Entropy Loss. It uses logarithms to aggressively penalize confident, incorrect guesses. A more basic alternative is Mean Squared Error (MSE), which starts from the per-node error:

Error = Prediction − True Label

Cat Node Error: 0.20 − 1.0 = −0.80

Dog Node Error: 0.80 − 0.0 = 0.80

These individual errors are processed by the loss function to produce a single number, the Loss Score. With MSE, for example, the errors are squared and averaged: ((−0.80)² + 0.80²) / 2 = 0.64. A high loss score means a terrible guess; a loss score close to zero means a near-perfect guess.
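
The arithmetic in this step can be reproduced in a few lines of plain Python. This is a sketch using the cat/dog numbers from the example; real frameworks compute the same quantities over whole batches, and the variable names here are illustrative.

```python
import math

prediction = [0.20, 0.80]  # [Cat, Dog] probabilities from the forward pass
true_label = [1.0, 0.0]    # one-hot label for Cat

errors = [p - t for p, t in zip(prediction, true_label)]
mse = sum(e * e for e in errors) / len(errors)

# Cross-entropy only looks at the probability assigned to the true class.
cross_entropy = -sum(t * math.log(p) for p, t in zip(prediction, true_label))
```

Here `mse` comes out at about 0.64 and `cross_entropy` at about 1.61, since the network gave the true class only a 0.20 probability.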

Step 4: Backpropagation and Readjusting Weights

Once the single loss score is determined, the network utilizes calculus to pinpoint exactly which internal weights caused the bad score.

The Chain Rule

The network calculates the gradient of the loss function. It traces backward from the output layer through the hidden layers using the mathematical chain rule.

Attributing Blame

This step determines how much each specific weight contributed to the overall error score.

Gradient Descent

An optimizer algorithm updates the internal weights by nudging them in the opposite direction of the error gradient.
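
A single backpropagation and gradient-descent step can be sketched on a deliberately tiny model: one linear layer followed by softmax. Everything here is an assumption made for illustration (the two input features standing in for pixels, the initial weights, the learning rate); the weights were chosen so that the initial prediction is close to the [Cat: 0.20, Dog: 0.80] guess from Step 1.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(logits)

def cross_entropy(probs, target):
    return -sum(t * math.log(p) for p, t in zip(probs, target))

x = [1.0, 2.0]                 # stand-in for pixel features
target = [1.0, 0.0]            # one-hot label: Cat
W = [[0.1, -0.2], [0.3, 0.4]]  # row 0 -> Cat logit, row 1 -> Dog logit
learning_rate = 0.1

probs = forward(W, x)
loss_before = cross_entropy(probs, target)

# Backward pass via the chain rule: dLoss/dlogit_i = probs_i - target_i,
# and dlogit_i/dW[i][j] = x[j], so dLoss/dW[i][j] = (probs_i - target_i) * x[j].
grad = [[(p - t) * xj for xj in x] for p, t in zip(probs, target)]

# Gradient descent: step each weight against its gradient.
W = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
     for w_row, g_row in zip(W, grad)]

loss_after = cross_entropy(forward(W, x), target)
```

After the update, `loss_after` is lower than `loss_before`: the weights feeding the Cat output were nudged up and those feeding the Dog output were nudged down, in proportion to how much each contributed to the error.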

Training Outcome

Over many iterations across a diverse training dataset, this cycle repeatedly reduces the loss score until the network reliably assigns the highest probability to the correct class, for example Cat rather than Dog for a cat image.

kube-prometheus-stack

Use the kube-prometheus-stack Helm chart to deploy both Prometheus and Grafana automatically. Prometheus scrapes pod metrics generated by cAdvisor and kube-state-metrics, which you then visualize by importing a pre-built Kubernetes dashboard in Grafana.

1. Deploy the Monitoring Stack

Using Helm is the easiest way to deploy everything required to your cluster.

bash
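# Add the prometheus-community repository, which hosts kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
# Optional: the chart already bundles Grafana as a dependency
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update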

# Create a dedicated namespace and install the stack
kubectl create namespace monitoring
helm install prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring

2. Access the Dashboards

To view your data, expose the services locally using port-forwarding:

Grafana

bash
# Forward Grafana to port 3000
kubectl port-forward svc/prometheus-stack-grafana -n monitoring 3000:80


Navigate to http://localhost:3000. The default username is admin. Retrieve the admin password from the Secret created by the release:

bash
kubectl get secret --namespace monitoring prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Prometheus

bash
# Forward Prometheus to port 9090
# The Service name is derived from the release name; confirm it with: kubectl get svc -n monitoring
kubectl port-forward svc/prometheus-stack-kube-prom-prometheus -n monitoring 9090:9090


Navigate to http://localhost:9090 to run direct PromQL queries.
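
The same queries can also be sent from a script through the Prometheus HTTP API. The sketch below is a minimal example using only the standard library; `build_query_url` and `run_query` are hypothetical helper names, the PromQL expression is just one example of a per-pod CPU query, and `run_query` only works while the port-forward to localhost:9090 is running.

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def run_query(base_url, promql, timeout=5):
    """Send the query and return the decoded JSON response."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

# CPU usage per pod in the monitoring namespace, averaged over 5 minutes.
PROMQL = 'sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="monitoring"}[5m]))'
url = build_query_url("http://localhost:9090", PROMQL)
```

Calling `run_query("http://localhost:9090", PROMQL)` returns a dictionary whose `data.result` list holds one entry per pod, each with a `metric` label set and a `[timestamp, value]` pair.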

3. Visualize Pod Metrics in Grafana

Once logged into Grafana:

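  1. Go to Connections > Data Sources. The chart normally provisions a Prometheus data source already, so check for an existing entry first.
  2. If none exists, add Prometheus and use the in-cluster Service URL, of the form http://<prometheus-service>.monitoring.svc:9090 (list the Service names with kubectl get svc -n monitoring).
  3. Go to Dashboards > Import.
  4. Enter the Dashboard ID 6417 (Kubernetes Cluster) or 15760 (Kubernetes / Views / Pods), select your Prometheus data source, and click Import.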

Thursday, May 28, 2026

Kube-proxy in Kubernetes

Kube-proxy is a foundational Kubernetes network agent that runs on every node in a cluster. Its primary job is to translate Kubernetes Service definitions into actual network rules, enabling reliable service discovery and load balancing across Pods.

Because individual Pods are ephemeral and their IP addresses change every time they are restarted or scaled, a consistent way to reach them is needed. Kube-proxy solves this by continuously monitoring the Kubernetes API server for changes to Service and EndpointSlice objects.

How It Works

Virtual IPs

When you create a Service, it gets assigned a stable, virtual IP address (ClusterIP).

Rule Generation

Kube-proxy reads this assignment and configures the node's underlying networking stack to intercept traffic headed for this virtual IP.

Routing & Load Balancing

It rewrites the packet headers so the traffic is transparently routed directly to one of the actual backend Pods backing that Service. If there are multiple Pods, it distributes the load across them.
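
The three steps above can be modelled in a few lines of Python. This is a conceptual sketch only: kube-proxy programs kernel rules rather than handling packets in Python, and the `SERVICES` table, the IP addresses, and the `route` helper are all made up for illustration.

```python
import random

# Hypothetical Service table: (ClusterIP, port) -> backend Pod (IP, port) endpoints.
SERVICES = {
    ("10.96.0.10", 80): [("10.244.1.5", 8080), ("10.244.2.7", 8080), ("10.244.3.9", 8080)],
}

def route(dst_ip, dst_port, rng=random):
    """Rewrite a virtual destination to one real backend (DNAT-style)."""
    backends = SERVICES.get((dst_ip, dst_port))
    if backends is None:
        return dst_ip, dst_port   # not a Service VIP: leave untouched
    return rng.choice(backends)   # pick one backend for this connection

rng = random.Random(42)
hits = {}
for _ in range(3000):
    backend = route("10.96.0.10", 80, rng)
    hits[backend] = hits.get(backend, 0) + 1
```

Each of the three backends ends up with roughly a third of the connections, while traffic to an address that is not a ClusterIP passes through unchanged.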

Modes of Operation

Kube-proxy operates in one of several modes to manipulate network traffic, depending on your cluster's configuration:

iptables (Default)

Evaluates traffic sequentially using Linux iptables. It is highly reliable but can experience performance overhead in very large clusters with thousands of services.

IPVS (IP Virtual Server)

Designed for high performance, IPVS routes traffic in the Linux kernel using hash tables. It offers significantly faster lookup times and supports advanced load balancing algorithms (e.g., round-robin, least connections).

Userspace (Legacy)

The oldest mode, where kube-proxy itself intercepts traffic in user space and proxies it to the Pods. It is the slowest option and was removed in Kubernetes v1.26.
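
The performance difference between sequential rule evaluation and hash-based lookup can be illustrated with a simplified model. This is an assumption-laden sketch, not a description of the kernel data structures: real iptables rules are organised into per-Service chains, and the 5000 Services and their addresses below are invented.

```python
def match_sequential(rules, dst):
    """iptables-style model: walk the rule list until one matches."""
    for checked, (vip, backends) in enumerate(rules, start=1):
        if vip == dst:
            return backends, checked
    return None, len(rules)

def match_hashed(table, dst):
    """IPVS-style model: a single hash-table lookup, independent of table size."""
    return table.get(dst)

# 5000 hypothetical Services, each with one backend.
rules = [
    (f"10.96.{i // 250}.{i % 250 + 1}", [f"10.244.0.{i % 250 + 1}"])
    for i in range(5000)
]
table = dict(rules)

last_vip = rules[-1][0]
backends_seq, rules_checked = match_sequential(rules, last_vip)
backends_hash = match_hashed(table, last_vip)
```

Both lookups return the same backend, but the sequential model had to inspect every rule to reach the last Service, which is why the cost of rule matching grows with the number of Services in this model.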

A Modern Shift: eBPF

While kube-proxy has historically been a mandatory component, modern Kubernetes environments are increasingly replacing or supplementing it with eBPF-based networking plugins (like Cilium). eBPF bypasses the need to program traditional iptables or IPVS rules, operating directly within the kernel for faster, more secure network management and load balancing.

Machine Learning and AI Model Taxonomy

The following table compares major categories of Machine Learning, Deep Learning, Generative AI, and Reinforcem...