Wednesday, July 1, 2026

Linear Regression

Linear Regression is a mathematical method used to predict the value of a continuous target variable based on one or more input features.

Consider two related variables, x and y, where:

  • x is the independent variable (feature/predictor).
  • y is the dependent variable (target/output).

Given a dataset containing values of x and their corresponding values of y, linear regression attempts to predict unknown or future values of y for any given value of x.

Key Idea

Linear Regression assumes that the relationship between the input feature(s) and the output can be approximated using a straight line (or a hyperplane in higher dimensions).

Linear Regression Equation

For a single input feature, Linear Regression tries to find the best-fitting equation:

y = wx + b

where:

  • w = Weight (Slope)
  • b = Bias (Intercept)

This version is known as Simple Linear Regression.

When multiple input features are present, the equation becomes:

y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

This is known as Multiple Linear Regression.

Important

The learning process of Linear Regression consists of finding the values of the weights (w) and bias (b) that best fit the training data.

Hypothesis and Hypothesis Space

Every possible combination of the parameters w and b defines a different straight line.

Each such line (or equation) is called a Hypothesis.

The collection of all possible hypotheses is called the Hypothesis Space.

Think of it this way

Every possible straight line that could be drawn through the data belongs to the hypothesis space. The objective of training is to choose the single line that best represents the observed data.

The entire training problem therefore reduces to finding the best hypothesis.


Prediction and Error

Every hypothesis predicts one output value for every input value.

For each training sample:

Prediction = wx + b

The prediction is then compared with the actual value from the dataset.

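The difference between the actual value and the predicted value is called the Error (or residual). The per-sample Loss is computed from this error, for example as its absolute value or its square, so that positive and negative errors cannot cancel each other out.

Error = Actual Value − Predicted Value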

Each training sample produces one loss value.

Since a dataset usually contains hundreds or thousands of observations, we need a single number that summarizes the prediction error across the entire dataset.

Definition

The overall error across the complete training dataset is called the Cost Function.

Common Cost Functions

Two cost functions are most commonly used in Linear Regression.

Loss Function Description
L1 Loss (Mean Absolute Error) Average of the absolute prediction errors.
L2 Loss (Mean Squared Error) Average of the squared prediction errors.

1. Mean Absolute Error (L1 Loss)

The absolute value of every prediction error is computed before averaging.

MAE = Average(|Actual − Predicted|)

Since every error contributes proportionally, large errors are not excessively penalized.


2. Mean Squared Error (L2 Loss)

Each prediction error is squared before computing the average.

MSE = Average((Actual − Predicted)²)

Because the errors are squared, larger errors contribute much more heavily to the total loss than smaller ones.
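The two cost functions can be compared on a small example. The sketch below is illustrative only: the helper names and the four data points are made up, and the last point is a deliberate outlier so that the difference between MAE and MSE is visible.

```python
def predict(x, w, b):
    """Prediction of one hypothesis: a line with slope w and intercept b."""
    return w * x + b


def mean_absolute_error(xs, ys, w, b):
    """L1 cost: average of |actual - predicted| over the dataset."""
    return sum(abs(y - predict(x, w, b)) for x, y in zip(xs, ys)) / len(xs)


def mean_squared_error(xs, ys, w, b):
    """L2 cost: average of (actual - predicted)^2 over the dataset."""
    return sum((y - predict(x, w, b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the first three points lie on y = 2x, the last is an outlier.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 18.0]

mae = mean_absolute_error(xs, ys, w=2.0, b=0.0)
mse = mean_squared_error(xs, ys, w=2.0, b=0.0)
print(mae, mse)  # 2.5 25.0
```

The single error of 10 contributes 10 to the MAE sum but 100 to the MSE sum, which is the heavier penalty on large errors described above.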


Why L2 Loss Is Usually Preferred

Although both L1 and L2 losses are widely used, the L2 loss function has two mathematical properties that make it especially convenient to optimize.

1. Differentiability

The L2 loss function is smooth everywhere. Its derivative exists at every point, including where the loss equals zero.

Why This Matters

Since the derivative exists everywhere, optimization algorithms such as Gradient Descent can easily determine the direction in which the parameters should be updated.

2. Convexity

The L2 loss function is also convex.

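A convex function has no local minima other than its global minimum. For Linear Regression with L2 loss, that minimum is also unique as long as the input features are not perfectly collinear.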

Important Clarification

L1 loss is also convex. The primary mathematical advantage of L2 over L1 is its smooth differentiability, not convexity itself.
Coming Next

In the next section, we'll see how Linear Regression uses these properties to find the optimal values of w and b using the Normal Equation and Gradient Descent.

Cost Functions

Each hypothesis produces a predicted value of y for every input value x.

The difference between the predicted value and the actual value is called the loss (or error).

The total loss over the complete training dataset is called the Cost Function.

Goal: Find the values of w and b that minimize the overall cost.

Common Loss Functions

1. Mean Absolute Error (L1 Loss)

The absolute value of each prediction error is calculated and then averaged.

MAE = Average(|Actual − Predicted|)

Characteristics:

  • Easy to understand
  • Robust against outliers
  • Continuous everywhere
  • Not differentiable at zero
Important:
L1 Loss is convex but contains a sharp corner at zero. Because of this kink, its derivative is undefined exactly at zero.

2. Mean Squared Error (L2 Loss)

Instead of taking the absolute value of the error, the error is squared. The squared errors are then averaged.

MSE = Average((Actual − Predicted)²)

Why L2 Loss is Preferred

L2 loss has two extremely important mathematical properties.

1. Differentiability

L2 loss is smooth everywhere. Its derivative exists for every possible value, including zero.

Because the derivative exists everywhere, calculus-based optimization methods such as Gradient Descent work extremely well.

2. Convexity

The L2 cost function is convex.

A convex function has no local minima that are not also global minima. Therefore there is no danger of getting trapped in a poor local minimum.

Result:
There is a single global minimum that represents the best possible regression line.

Finding the Best Hypothesis

Since L2 loss is both differentiable and convex, the optimal values of w and b can be obtained in two ways.

Method 1 : Normal Equation

The derivative of the cost function with respect to w and b is computed.

The derivatives are set equal to zero and solved directly.

This analytical solution is called the Normal Equation.
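For a single input feature, setting the two derivatives of the Mean Squared Error to zero gives a well-known closed form: w is the covariance of x and y divided by the variance of x, and b is chosen so the line passes through the point of means. The sketch below assumes that single-feature case; the function name and data are illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least-squares fit of y = w*x + b for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products: these come from setting dCost/dw = dCost/db = 0.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_simple_linear_regression(xs, ys)
print(w, b)  # 2.0 1.0
```

With several features the same idea is written in matrix form and solved as a linear system, which this sketch does not cover.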

Method 2 : Gradient Descent

Instead of solving the equations directly, Gradient Descent starts with random values of w and b and repeatedly improves them.

Each iteration moves the parameters in the direction that reduces the cost.

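Gradient Descent is often preferred when the number of features is very large, because solving the Normal Equation means solving a linear system whose cost grows roughly with the cube of the number of features. Mini-batch variants of Gradient Descent can also process datasets that are too large to fit in memory.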
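A minimal sketch of Gradient Descent on the Mean Squared Error for one feature is shown below. It is an illustration under stated assumptions: the learning rate and step count are hand-picked for this tiny dataset, and the parameters start at zero rather than at random values so the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w*x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Signed errors of the current hypothesis (predicted - actual).
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2).
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move each parameter in the direction that reduces the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1 without noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

A learning rate that is too large makes the updates diverge, so in practice it is tuned or the features are rescaled first.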

Why L1 Loss Requires Different Optimization

Since L1 loss is not differentiable at zero, setting the derivative to zero does not produce a closed-form solution, so there is no counterpart to the Normal Equation.

Instead, optimization techniques such as:

  • Subgradient Descent
  • Linear Programming
  • Simplex-based Optimization

are commonly used.

Notice that L1 loss is still continuous and convex. Its limitation is only the lack of differentiability at one point.

Key Assumptions of Linear Regression

Linear Regression performs well only when certain mathematical assumptions are reasonably satisfied.

Assumption Description
Linearity The relationship between the input variables and the target variable should be approximately linear.
Independence Observations and their prediction errors should be independent of one another.
Homoscedasticity The variance of the residual errors should remain approximately constant across all values of the independent variables.
Normality Residual errors should be approximately normally distributed.
Note

Linear Regression is surprisingly robust. Minor violations of these assumptions often do not significantly affect prediction accuracy, especially when working with large datasets.

Evaluating a Linear Regression Model

After training a regression model, we need to evaluate how well it fits the training or test data. Several evaluation metrics are commonly used.

Metric Purpose
R² (Coefficient of Determination) Measures how much of the variation in the target variable is explained by the model.
Mean Absolute Error (MAE) Average absolute prediction error.
Root Mean Squared Error (RMSE) Square root of the Mean Squared Error. Larger errors receive greater penalty.
Rule of Thumb
  • Higher R² is generally better.
  • Lower MAE is better.
  • Lower RMSE is better.
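The three metrics can be computed directly from their definitions. The sketch below uses made-up actual and predicted values, and the function name is illustrative.

```python
import math


def regression_metrics(actual, predicted):
    """Return R^2, MAE and RMSE for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    # Residual sum of squares: variation the model failed to explain.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: variation of the target around its mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Hypothetical values: two predictions are off by 1, two are exact.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

Here the residual sum of squares is 2 and the total sum of squares is 20, so R² is 0.9, MAE is 0.5, and RMSE is the square root of 0.5.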

Overfitting and Regularization

When the number of input features becomes very large relative to the number of training samples, a regression model may begin to memorize the training data instead of learning the underlying relationship.

This phenomenon is called Overfitting.

An overfitted model usually performs extremely well on the training data but performs poorly on previously unseen test data.

To reduce overfitting, additional penalty terms can be added to the cost function. This technique is known as Regularization.


Common Regularization Techniques

Algorithm Penalty Added Characteristics
Ridge Regression L2 Penalty Shrinks coefficient values but rarely makes them exactly zero.
Lasso Regression L1 Penalty Can reduce some coefficients completely to zero, effectively performing feature selection.
ElasticNet L1 + L2 Combines the advantages of both Ridge and Lasso.
Key Point

Regularization intentionally allows a small increase in training error in exchange for much better performance on unseen data.
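For a single feature, the effect of both penalties can be seen in closed form. The sketch below assumes the costs are written as sum of squared errors + λ·w² (Ridge) and sum of squared errors + λ·|w| (Lasso), with the intercept left unpenalized. The name lam and the data are illustrative, and libraries may scale the penalty differently.

```python
def centered_sums(xs, ys):
    """Means of x and y plus the centered sums S_xy and S_xx."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return mean_x, mean_y, s_xy, s_xx


def fit_ridge(xs, ys, lam):
    """Minimize sum((y - w*x - b)^2) + lam * w^2."""
    mean_x, mean_y, s_xy, s_xx = centered_sums(xs, ys)
    w = s_xy / (s_xx + lam)  # the penalty enlarges the denominator
    return w, mean_y - w * mean_x


def fit_lasso(xs, ys, lam):
    """Minimize sum((y - w*x - b)^2) + lam * |w| via soft-thresholding."""
    mean_x, mean_y, s_xy, s_xx = centered_sums(xs, ys)
    shrunk = max(abs(s_xy) - lam / 2.0, 0.0)  # clipped at exactly zero
    w = (shrunk if s_xy >= 0 else -shrunk) / s_xx
    return w, mean_y - w * mean_x


# Hypothetical data from y = 2x + 1, so the unregularized slope is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

ridge_w, _ = fit_ridge(xs, ys, lam=5.0)
lasso_small_w, _ = fit_lasso(xs, ys, lam=4.0)
lasso_large_w, _ = fit_lasso(xs, ys, lam=30.0)
print(ridge_w, lasso_small_w, lasso_large_w)  # 1.0 1.6 0.0
```

Ridge shrinks the slope smoothly toward zero as lam grows, while Lasso sets it to exactly zero once the penalty is large enough, which matches the feature-selection behavior described in the table.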

scikit-learn and PyTorch Implementation

This final section explains the default behavior of Linear Regression in two popular Machine Learning frameworks:

  • scikit-learn
  • PyTorch

It also discusses how regularization is enabled in each framework and the default settings used by Ridge, Lasso, and ElasticNet.


Default Behavior in scikit-learn

The LinearRegression class in scikit-learn implements Ordinary Least Squares (OLS).

By default, it applies no regularization whatsoever.

Default Behavior
  • LinearRegression() → Ordinary Least Squares (No Regularization)

If regularization is required, scikit-learn provides separate algorithms.

Model Regularization Default Parameters
LinearRegression() None Ordinary Least Squares
Ridge() L2 (Ridge) alpha = 1.0
Lasso() L1 (Lasso) alpha = 1.0
ElasticNet() L1 + L2 alpha = 1.0, l1_ratio = 0.5
Important

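LinearRegression never applies regularization automatically. You must explicitly choose Ridge, Lasso, or ElasticNet if regularization is desired. (This statement is specific to linear regression: some other scikit-learn estimators, such as LogisticRegression, do regularize by default.)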

Default Behavior in PyTorch

PyTorch follows a fundamentally different philosophy.

Unlike scikit-learn, PyTorch does not provide a dedicated "Linear Regression" algorithm. Instead, users construct the model themselves using neural network building blocks.

Typical components include:
  • nn.Linear for the model
  • nn.MSELoss() for the loss function
  • An optimizer such as SGD or Adam
Because every component is built explicitly by the developer, PyTorch applies no regularization by default.

Adding Regularization in PyTorch

Regularization is typically introduced in one of two ways.

Method 1 : Add the Penalty to the Loss Function

The loss function can be manually extended by adding either an L1 or L2 penalty.

loss = mse_loss + λ × Σ(weights²)

The above expression corresponds to Ridge (L2) regularization.

Similarly, Lasso (L1) regularization can be implemented by summing the absolute values of the weights.


Method 2 : Use weight_decay

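Most PyTorch optimizers expose a parameter called weight_decay.

In optimizers such as SGD and Adam, this parameter adds weight_decay × w to each gradient, which is equivalent to adding an L2 (Ridge) penalty of (weight_decay / 2) × Σ(w²) to the loss. It applies to every parameter passed to the optimizer, including the bias. AdamW instead applies decoupled weight decay, which is not identical to an L2 penalty in the loss.

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    weight_decay=0.0001
)

For SGD and Adam the default value is weight_decay = 0, which means regularization is disabled by default. AdamW is an exception and uses a non-zero default.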

Does PyTorch Support L1 Regularization?

Unlike L2 regularization, PyTorch optimizers do not provide a built-in parameter for L1 regularization.

If L1 (Lasso-style) regularization is required, it must be added manually to the loss function.

Summary
  • L2 → Built into optimizers via weight_decay.
  • L1 → Must always be added manually.

Framework Comparison

Framework Default Regularization How to Enable
scikit-learn None Use Ridge(), Lasso(), or ElasticNet().
PyTorch None Use weight_decay (L2) or manually modify the loss function (L1).

Key Takeaways

  • Linear Regression finds the best-fitting line through the training data.
  • The quality of a hypothesis is measured using a Cost Function.
  • L2 (Mean Squared Error) is the most commonly used loss because it is both smooth (differentiable) and convex.
  • The optimal parameters can be obtained analytically using the Normal Equation or iteratively using Gradient Descent.
  • Regularization reduces overfitting by penalizing large model weights.
  • Ridge uses an L2 penalty, while Lasso uses an L1 penalty.
  • For linear regression, neither scikit-learn's LinearRegression nor a hand-built PyTorch model applies regularization by default; it must be explicitly enabled by the developer.

Thursday, June 18, 2026

Data Types in C# and other languages

The following reference table compares commonly used C# numeric, scientific computing, and AI-oriented data types with their equivalents in Java, JavaScript, and Python. It also highlights memory consumption, binary representation, implicit conversion behavior, precision, and support for special values such as positive and negative infinity.

Quick Observation:
Traditional integer types (short, int, long) provide exact arithmetic but cannot represent infinity. Floating-point types (float and double) follow IEEE-754 standards and support ±Infinity. For financial calculations requiring exact decimal precision, C# provides the decimal type.
C# Data Type Bytes Required Internally Represented as Binary? Implicitly Converted? Supports ±Infinity? Range (Approx.) Precision Java Equivalent JavaScript Equivalent Python Equivalent
short 2 bytes Yes (Two's Complement) Yes No ~10⁴ (-32,768 to 32,767) Exact Integer short Number int
ushort 2 bytes Yes (Unsigned Binary) Yes No ~10⁴ (0 to 65,535) Exact Integer No Direct Equivalent Uint16Array int
int 4 bytes Yes (Two's Complement) Yes No ~10⁹ Exact Integer int Number int
long 8 bytes Yes (Two's Complement) Yes No ~10¹⁸ Exact Integer long BigInt int
BigInteger Dynamic Yes No No Limited only by RAM Exact Integer java.math.BigInteger BigInt int
float 4 bytes IEEE-754 Yes → double Yes 10⁻⁴⁵ to 10³⁸ 6–9 digits float Float32Array numpy.float32
double 8 bytes IEEE-754 No Yes 10⁻³²⁴ to 10³⁰⁸ 15–17 digits double Number float
decimal 16 bytes Base-10 Representation No No 10⁻²⁸ to 10²⁸ 28–29 digits BigDecimal Big.js / Decimal.js decimal.Decimal
Vector3 12 bytes Yes No Depends on float components Float Range 6–9 digits Vector3f Object / Array NumPy float32 vector
Vector<T> 16–64 bytes Yes (SIMD Registers) No Depends on T Depends on T Depends on T Java Vector API Engine SIMD Optimizations NumPy Arrays
Tensor<T> Dynamic Yes No Depends on T Depends on T Depends on T DJL NDArray / TensorFlow Tensor tf.Tensor PyTorch Tensor / TensorFlow Tensor
AI & Machine Learning Note:
For traditional business applications, int, long, double, and decimal dominate usage. For scientific computing, graphics, machine learning, and neural networks, higher-level structures such as Vector3, Vector<T>, and Tensor<T> become significantly more important because they map efficiently to SIMD instructions, GPUs, and tensor-processing hardware.
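The Python column of the table can be checked with the standard library alone. The short sketch below illustrates three behaviors described above: floating-point infinity, exact arbitrary-size integers, and exact base-10 arithmetic. The variable names are illustrative.

```python
import math
from decimal import Decimal

# Python's float is an IEEE-754 double (the counterpart of C# double),
# so overflowing multiplication produces infinity instead of an error.
overflowed = 1e308 * 10
print(overflowed, math.isinf(overflowed))

# Python's int is arbitrary precision (closest to BigInteger): arithmetic
# stays exact at any size, and there is no infinity value.
big = 2 ** 100
print(big + 1 - big)

# Binary floating point cannot store 0.1 exactly, so this comparison is False ...
float_sum_is_exact = (0.1 + 0.2 == 0.3)
# ... while decimal.Decimal works in base 10, in the spirit of C# decimal.
decimal_sum_is_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
print(float_sum_is_exact, decimal_sum_is_exact)
```

This is the practical reason the table recommends a decimal type for financial calculations.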

Monday, June 15, 2026

The Basics of Tuples

In Python, a method or function that returns multiple values is commonly described as returning a tuple. When those returned values are assigned directly to multiple variables, the process is called tuple unpacking.

Important:
Technically, a Python function can return only a single object. When multiple values are returned using commas, Python automatically packs them into a single tuple object behind the scenes.

Python Example

def get_user_data():
    name = "Alice"
    age = 30
    return name, age

# Unpacking the tuple
user_name, user_age = get_user_data()

Key Concepts

Concept Description
Tuple Packing Multiple values are grouped into a single tuple object.
Tuple Unpacking Returned tuple elements are assigned to individual variables.
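The claim that the function returns a single tuple object can be checked directly. The sketch below reuses the get_user_data example and also shows starred unpacking, which collects any remaining elements into a list.

```python
def get_user_data():
    # The comma-separated values are packed into one tuple object.
    return "Alice", 30


result = get_user_data()
print(type(result).__name__, len(result))  # tuple 2

# Positional unpacking: the number of names must match the tuple length.
name, age = result

# Starred unpacking: the first element is named, the rest go into a list.
first, *rest = (1, 2, 3, 4)
print(name, age, first, rest)
```

If the number of names on the left does not match the tuple length and no starred name is used, Python raises a ValueError.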

Equivalent Concepts in Other Languages

Java, JavaScript, and C# all provide mechanisms similar to Python's tuple packing and unpacking, although their syntax and implementation differ significantly.

1. JavaScript — Object and Array Destructuring

JavaScript is arguably the closest language to Python in this regard. It supports both array-based and object-based destructuring.

Using Arrays (Positional)

function getCoordinates() {
    return [10, 20];
}

const [x, y] = getCoordinates();

Using Objects (Named)

function getUser() {
    return { name: "Alice", age: 30 };
}

const { name, age } = getUser();
Advantage: JavaScript can unpack by position (arrays) or by name (objects).

2. C# — Tuples and Deconstruction

C# provides first-class support for tuples and deconstruction, making it extremely similar to Python.

(string name, int age) GetUserData()
{
    return ("Alice", 30);
}

var (name, age) = GetUserData();
Advantage: Strong typing combined with concise unpacking syntax.

3. Java — Records and Custom Objects

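Java has no tuple type and no general tuple-unpacking assignment like Python or C#. Historically, developers returned custom wrapper classes.

Modern Java (Java 16+) introduced Records, which provide a concise solution for data containers. Since Java 21, record patterns can also deconstruct a record inside instanceof and switch, although this is not a general destructuring assignment.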

public record UserData(String name, int age) {}

public UserData getUserData() {
    return new UserData("Alice", 30);
}

// Usage
UserData data = getUserData();

System.out.println(data.name());
Advantage: Type-safe immutable data carriers with very little boilerplate code.

Summary Comparison

Language Mechanism Best Feature
Python Tuples Built-in and implicit syntax
JavaScript Destructuring Supports arrays and objects
C# ValueTuple Strong typing with elegant syntax
Java Records Type-safe data containers

What is ValueTuple in C#?

ValueTuple is a native structure introduced in C# 7.0 that provides a lightweight, high-performance way to group multiple values together.

There is also a reference-type equivalent called Tuple (not "ReferenceTuple").

ValueTuple vs Tuple

Feature System.ValueTuple System.Tuple
Memory Allocation Stored inline as a value type (often on the stack, no separate heap object) Heap Allocation
Syntax (int, string) Tuple<int, string>
Named Elements Supported Not Supported
Mutability Mutable Immutable
Deconstruction Native Support Manual Extraction Required

Quick Code Comparison

// Modern approach (ValueTuple)
(int Id, string Name) person = (1, "Alice");

Console.WriteLine(person.Name);

// Legacy approach (Tuple)
Tuple<int, string> oldPerson =
    new Tuple<int, string>(1, "Alice");

Console.WriteLine(oldPerson.Item2);
Observation: ValueTuple is cleaner, usually faster because it avoids a separate heap allocation, supports naming, and works naturally with deconstruction syntax.

Java Records vs C# Records

Conceptually, Java Records and C# Records were introduced to solve the same problem: reducing boilerplate code for classes whose primary purpose is holding data.

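Both automatically generate equality, hashing, and string-formatting methods:

  • equals() in Java, Equals() in C#
  • hashCode() in Java, GetHashCode() in C#
  • toString() in Java, ToString() in C#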

However, they differ significantly in mutability, memory model, and language flexibility.


1. Immutability

Java Records

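Java Records are shallowly immutable.

Every component defined in a Java Record is backed by a private final field, meaning the field cannot be reassigned after object creation. If a component refers to a mutable object such as a List, the contents of that object can still change.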

public record User(String name, int age) {}

Once created, the fields cannot be modified.

C# Records

C# Records are more flexible.

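By default, positional record classes use init-only properties, which behave similarly to immutable objects. Positional record structs get ordinary read-write properties unless they are declared as readonly record struct.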

public record User(string Name, int Age);

However, developers can explicitly create mutable record properties if required.

Key Difference:
Java Records are always (shallowly) immutable.
C# Records can be immutable or mutable depending on design choices.

2. Underlying Types (Reference vs Value)

The memory model differs substantially between the two languages.

Feature Java Record C# Record
Type Category Reference Type Only Reference or Value Type
Heap Allocation Always Heap Depends on Declaration
Developer Choice No Yes

Java Record Example

public record User(String name, int age) {}

This is always a reference type.

C# Record Class Example

public record class User(
    string Name,
    int Age
);

Behaves as a reference type.

C# Record Struct Example

public record struct User(
    string Name,
    int Age
);

Behaves as a value type.

C# Advantage: Developers can choose between value semantics and reference semantics based on application requirements.

3. Non-Destructive Mutation

One of the most popular features of C# Records is the with expression.

It allows you to create a modified copy of an existing immutable object without changing the original object.

C# Example

var originalUser =
    new User("Alice", 30);

var updatedUser =
    originalUser with { Age = 31 };

The original object remains unchanged.

A new object is created with only the specified changes applied.

Benefit: Safe updates without accidental mutation.

Java Equivalent

Java currently does not provide a built-in equivalent of the with keyword.

To achieve the same behavior, developers typically:

  • Create a new Record instance manually.
  • Implement custom copy methods.
  • Use builder patterns.
User updatedUser =
    new User(
        originalUser.name(),
        31
    );
Key Difference:
C# has native support for non-destructive mutation.
Java requires manual object creation.
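For comparison with Python, where this post started, a similar copy-and-modify pattern can be sketched with a frozen dataclass and dataclasses.replace. This is only an analogy to the C# with expression, not an equivalent language feature, and the User class here is an illustrative stand-in for the records above.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class User:
    """Immutable data carrier, loosely comparable to a record."""
    name: str
    age: int


original_user = User("Alice", 30)

# replace() builds a new object, copying every field except the ones overridden.
updated_user = replace(original_user, age=31)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    original_user.age = 99
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

print(original_user, updated_user, mutation_blocked)
```

As with the C# example, the original object is left unchanged and only the named field differs in the copy.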

4. Property Access vs Method Access

Another major difference between Java Records and C# Records is how their data is accessed.

Java Record Access

Java Records expose their components using automatically generated methods.

public record User(
    String name,
    int age
) {}

User user = new User("Alice", 30);

System.out.println(user.name());
System.out.println(user.age());
Important: The values are accessed through methods (name(), age()) rather than properties.

C# Record Access

C# Records expose values using properties.

public record User(
    string Name,
    int Age
);

User user = new User("Alice", 30);

Console.WriteLine(user.Name);
Console.WriteLine(user.Age);
Important: Values are accessed using standard property syntax, which feels natural to most C# developers.

Complete Comparison: Java Records vs C# Records

Feature Java Record C# Record
Purpose Reduce data-class boilerplate Reduce data-class boilerplate
Immutability Always (Shallowly) Immutable Configurable
Reference Type Always Optional
Value Type Option No Yes (record struct)
Property Access Method Syntax Property Syntax
Auto-generated Equals() Yes Yes
Auto-generated HashCode Yes Yes
Auto-generated ToString() Yes Yes
with Expression No Yes
Destructuring Record patterns in instanceof and switch only (Java 21+) Native Support
Language Version Java 16+ C# 9+ (record struct: C# 10+)

When Should You Use Records?

Records are ideal whenever the primary purpose of an object is to carry data rather than implement complex business behavior.

Typical Use Cases

  • REST API Request Models
  • REST API Response Models
  • DTOs (Data Transfer Objects)
  • Configuration Objects
  • Event Messages
  • Message Queue Payloads
  • Immutable Domain Objects
  • Value Objects in Domain Driven Design (DDD)
Rule of Thumb:

If your class primarily stores data and requires generated methods like equals(), hashCode(), and toString(), a Record is usually a better choice than a traditional class.

Example: Traditional Class vs Record

Traditional Java Class

public class User {

    private final String name;
    private final int age;

    public User(String name, int age) {
        this.name = name;
        this.age = age;
    }

    public String getName() {
        return name;
    }

    public int getAge() {
        return age;
    }

    // equals()
    // hashCode()
    // toString()
}

Java Record

public record User(
    String name,
    int age
) {}
A large amount of boilerplate code disappears while retaining type safety, immutability, and automatically generated utility methods.

Conceptual Summary

Concept Think Of It As
Python Tuple Quick grouping of values
JavaScript Destructuring Flexible unpacking mechanism
C# ValueTuple Strongly typed tuple
System.Tuple Older reference-based tuple
Java Record Immutable data container
C# Record Flexible modern data container
with Expression Clone and modify safely

In Short

Python Tuple = Quick Multiple Return Values
JavaScript Destructuring = Array/Object Unpacking
C# ValueTuple = Strongly Typed Tuple
System.Tuple = Legacy Reference Tuple
Java Record = Immutable Data Class
C# Record = Flexible Data Class
Java Record = Reference Type Only
C# Record = Reference Type OR Value Type
C# "with" = Clone + Modify
Records = Less Boilerplate, More Readability

Sunday, June 14, 2026

Forward Diffusion Process in DDPM

The forward process (also called the diffusion process) systematically adds Gaussian noise to clean data until it eventually becomes nearly pure random noise.

Core Idea:
Start with a clean image (or data point), add a tiny amount of noise repeatedly over thousands of steps, and eventually obtain pure Gaussian noise.

Equation 1: Step-by-Step Noise Addition

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)

This equation describes how the noisy sample at timestep t is generated from the sample at timestep t−1.

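Term Meaning
q(x_t | x_{t-1}) Probability density of the sample at step t given the sample at step t−1
\mathcal{N}(\cdot) Gaussian (Normal) distribution
x_t New noisy sample generated at timestep t
\sqrt{1 - \beta_t}\, x_{t-1} Mean of the Gaussian distribution
\beta_t I Covariance of the Gaussian distribution (variance \beta_t in every dimension)
Why scale the previous image?
Without the factor \sqrt{1 - \beta_t}, the variance would grow with every step and eventually explode. With it, a sample of unit variance stays at unit variance, because (1 − \beta_t) · 1 + \beta_t = 1, which keeps the process mathematically stable.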

Equation 2: Full Diffusion Trajectory

q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})

This equation represents the probability of the entire diffusion trajectory from the original clean sample x_0 to the final noisy sample x_T.

Term Meaning
q(x_{1:T} | x_0) Joint probability of the complete noisy trajectory
\prod Product operator multiplying the probabilities of every step
Markov Property Each state depends only on its immediate predecessor
Important Observation:
The diffusion process forms a Markov Chain. The current state remembers only the previous state and ignores everything earlier.

Deriving the Closed-Form Sampling Formula

Instead of repeatedly executing thousands of diffusion steps, DDPM derives a direct mathematical shortcut that allows sampling x_t directly from x_0.

Step 1: Define New Variables

\alpha_t = 1 - \beta_t

\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i

Here:

  • \alpha_t = amount of original signal retained during one step
  • \bar{\alpha}_t = cumulative signal retained after many diffusion steps

Reparameterization Form

x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t-1}

where \epsilon_{t-1} \sim \mathcal{N}(0, I)

This formulation explicitly separates:

  • The preserved signal component
  • The newly injected Gaussian noise
Interpretation:
Every diffusion step keeps part of the original image while injecting a small amount of fresh random noise.
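Applying the single-step rule repeatedly and merging the independent Gaussian noise terms leads to the standard DDPM closed form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of the α values. The sketch below checks this numerically for a scalar "image". The linear beta schedule, the step count, and all names are assumptions made for illustration.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta       # alpha_t = 1 - beta_t
        alpha_bars.append(running)  # alpha_bar_t = alpha_1 * ... * alpha_t
    return betas, alpha_bars


def diffuse_step_by_step(x0, betas, rng):
    """Apply the single-step noising rule once per timestep."""
    x = x0
    for beta in betas:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x


def diffuse_closed_form(x0, alpha_bar_t, rng):
    """Jump straight from x_0 to x_t with a single noise draw."""
    noise = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * noise


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
betas, alpha_bars = make_schedule(num_steps=50)
x0 = 2.0
iterative = [diffuse_step_by_step(x0, betas, rng) for _ in range(5000)]
direct = [diffuse_closed_form(x0, alpha_bars[-1], rng) for _ in range(5000)]

# Both sample means should be close to sqrt(alpha_bar_T) * x0.
print(mean(iterative), mean(direct), math.sqrt(alpha_bars[-1]) * x0)
```

The two sample means agree up to sampling noise, which is why training can draw x_t for a random timestep in one step instead of simulating the whole chain.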

Markov Chains, Hidden States, and Hidden Markov Models (HMMs)

A Markov Chain is a mathematical system that models how things move from one state to another, based on the rule that the next state depends only on the current state. A Hidden State refers to an underlying, unobservable true state of a system that can only be guessed by looking at visible outputs.

These concepts are foundational to probability, statistics, and machine learning, and they often work together in what is known as a Hidden Markov Model (HMM).

Key Idea: A Markov Chain models transitions between states, while a Hidden Markov Model extends this concept by introducing hidden states that cannot be observed directly.

1. Markov Chain: The Basics

A Markov Chain describes a series of events where the probability of the next event happening depends entirely on the present event, completely ignoring the past. This is known as the Markov Property or "memorylessness".

Example: Weather Forecasting

Imagine the weather. If today is Sunny, tomorrow might have:

  • 70% chance of being Sunny
  • 20% chance of being Cloudy
  • 10% chance of being Rainy

Because you can directly see and measure the weather, this can be modeled as an observable Markov Chain.
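The weather example can be simulated with a transition table. In the sketch below, only the Sunny row comes from the example above; the Cloudy and Rainy rows are made-up values added so the chain is complete.

```python
import random

# Transition probabilities: TRANSITIONS[today][tomorrow].
# Only the "Sunny" row is from the text; the other rows are hypothetical.
TRANSITIONS = {
    "Sunny":  {"Sunny": 0.7, "Cloudy": 0.2, "Rainy": 0.1},
    "Cloudy": {"Sunny": 0.3, "Cloudy": 0.4, "Rainy": 0.3},
    "Rainy":  {"Sunny": 0.2, "Cloudy": 0.4, "Rainy": 0.4},
}


def next_state(current, rng):
    """Sample tomorrow's weather using only today's weather (Markov Property)."""
    states = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][s] for s in states]
    return rng.choices(states, weights=weights, k=1)[0]


def simulate(start, days, rng):
    """Return the start state followed by `days` sampled states."""
    path = [start]
    for _ in range(days):
        path.append(next_state(path[-1], rng))
    return path


rng = random.Random(42)
forecast = simulate("Sunny", days=7, rng=rng)
print(forecast)
```

Note that next_state never looks at anything except the current state, which is exactly the memorylessness described above.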

2. Hidden State: The Invisible Driver

In many real-world scenarios, you cannot directly observe the state of a system. Instead, you have:

  • Hidden States → The actual, unobservable conditions.
  • Observations → The visible results influenced by those hidden states.

Example: Inferring a Person's Mood

Imagine you want to track a person's mood (Happy or Sad), but they are locked in a room.

  • Mood → Hidden State
  • Shirt Color (Red, Green, Blue) → Observation

Although you cannot directly observe the person's mood, you can observe the shirt color they wear each day. Using a Hidden Markov Model, you can infer the most likely mood sequence based on the observed shirt colors.

3. How They Work Together: Hidden Markov Models (HMMs)

In a Hidden Markov Model, the hidden states themselves form a Markov Chain. For example, a person's mood today influences their mood tomorrow.

To make this work, the model relies on three core probability components:

Component | Purpose | Example
Transition Probabilities | Probability of moving from one hidden state to another. | Chance a Sad mood follows a Happy mood.
Emission Probabilities | Probability of seeing an observation given a hidden state. | Chance of wearing a Red shirt while Happy.
Initial State Probabilities | Probability of starting in a specific hidden state. | Probability that Day 1 starts Happy.
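
To see how the three components fit together, here is a minimal sketch that generates data from the mood and shirt example. Every probability value below is an illustrative assumption (the text gives none), and the helper names draw and generate are hypothetical.

```python
import random

STATES = ["Happy", "Sad"]          # hidden states
SHIRTS = ["Red", "Green", "Blue"]  # observations

# All numbers are illustrative assumptions, not values from the text.
INITIAL = {"Happy": 0.6, "Sad": 0.4}
TRANSITION = {
    "Happy": {"Happy": 0.8, "Sad": 0.2},
    "Sad":   {"Happy": 0.3, "Sad": 0.7},
}
EMISSION = {
    "Happy": {"Red": 0.6, "Green": 0.3, "Blue": 0.1},
    "Sad":   {"Red": 0.1, "Green": 0.3, "Blue": 0.6},
}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    return rng.choices(list(dist), weights=list(dist.values()))[0]


def generate(days, rng):
    """Return (hidden moods, observed shirts) for `days` consecutive days."""
    moods, shirts = [], []
    mood = draw(INITIAL, rng)                     # initial state probabilities
    for _ in range(days):
        moods.append(mood)
        shirts.append(draw(EMISSION[mood], rng))  # emission probabilities
        mood = draw(TRANSITION[mood], rng)        # transition probabilities
    return moods, shirts


rng = random.Random(1)  # fixed seed so repeated runs match
moods, shirts = generate(5, rng)
print("hidden moods:   ", moods)
print("observed shirts:", shirts)
```

In a real application only the shirts list would be available; the moods list is exactly what the model has to infer.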

4. Real-World Applications

Basic Markov Chains work well when the state can be observed directly. Hidden Markov Models are better suited to noisy or partially observable data, where the state must be inferred from indirect evidence.

  • Speech Recognition
    Translating audio waveforms (observations) into spoken words or phonemes (hidden states).
  • Natural Language Processing (NLP)
    Assigning parts of speech such as nouns, verbs, or adjectives (hidden states) to observed words in a sentence.
  • Finance
    Predicting hidden market regimes such as Bull Markets or Bear Markets from observed trading patterns and volatility.

The Three Classic HMM Machine Learning Tasks

Hidden Markov Models are traditionally used to solve three major classes of machine learning problems.

1. The Evaluation Task (Likelihood)

Objective: Compute the total probability of observing a specific sequence.

Problem: Given a trained model and a sequence of visible events, determine how likely it is that the sequence was generated by the model.

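Algorithm: Forward Algorithm (the forward pass of the Forward-Backward procedure; the backward pass is not needed to compute the likelihood).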

Example: Determining whether a sequence of network traffic logs resembles normal behavior or a cyber attack.
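
The likelihood computation can be sketched with a forward pass over the mood and shirt example. This is a minimal illustration under assumed parameters: all probability values are made up, and forward_likelihood is a hypothetical name.

```python
STATES = ["Happy", "Sad"]
SHIRTS = ["Red", "Green", "Blue"]

# Illustrative parameters (assumptions, not taken from the text).
INITIAL = {"Happy": 0.6, "Sad": 0.4}
TRANSITION = {
    "Happy": {"Happy": 0.8, "Sad": 0.2},
    "Sad":   {"Happy": 0.3, "Sad": 0.7},
}
EMISSION = {
    "Happy": {"Red": 0.6, "Green": 0.3, "Blue": 0.1},
    "Sad":   {"Red": 0.1, "Green": 0.3, "Blue": 0.6},
}


def forward_likelihood(observations):
    """Return P(observations | model), summed over every hidden path."""
    # alpha[s] = P(observations seen so far, current hidden state is s)
    alpha = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMISSION[s][obs] * sum(alpha[p] * TRANSITION[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


likelihood = forward_likelihood(["Red", "Green", "Blue"])
print(f"P(Red, Green, Blue) = {likelihood:.6f}")
```

The running alpha values avoid enumerating every possible hidden path, so the cost grows linearly with the sequence length. For long sequences the raw probabilities become very small, so practical implementations usually rescale alpha at each step or work with logarithms.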

2. The Decoding Task (Inference)

Objective: Find the most likely sequence of hidden states.

Problem: You can see the outputs, but want to uncover the hidden state sequence that generated them.

Algorithm: Viterbi Algorithm.

Example: Part-of-Speech Tagging in NLP, where words are visible observations and grammatical categories are hidden states.
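
A minimal sketch of Viterbi decoding on the mood and shirt example follows. The probability values are illustrative assumptions and viterbi is a hypothetical function name; the structure (keep the best path into each state, then trace back) is the standard dynamic-programming idea.

```python
STATES = ["Happy", "Sad"]

# Illustrative parameters (assumptions, not taken from the text).
INITIAL = {"Happy": 0.6, "Sad": 0.4}
TRANSITION = {
    "Happy": {"Happy": 0.8, "Sad": 0.2},
    "Sad":   {"Happy": 0.3, "Sad": 0.7},
}
EMISSION = {
    "Happy": {"Red": 0.6, "Green": 0.3, "Blue": 0.1},
    "Sad":   {"Red": 0.1, "Green": 0.3, "Blue": 0.6},
}


def viterbi(observations):
    """Return (most likely hidden path, joint probability of path and observations)."""
    # best[s] = probability of the most likely path that ends in state s
    best = {s: INITIAL[s] * EMISSION[s][observations[0]] for s in STATES}
    backpointers = []  # one {state: best previous state} dict per later step
    for obs in observations[1:]:
        step_best, step_back = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] * TRANSITION[p][s])
            step_best[s] = best[prev] * TRANSITION[prev][s] * EMISSION[s][obs]
            step_back[s] = prev
        best = step_best
        backpointers.append(step_back)

    # Trace back from the best final state to recover the whole path.
    last = max(STATES, key=lambda s: best[s])
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, best[last]


path, probability = viterbi(["Red", "Red", "Blue", "Blue"])
print(path)  # ['Happy', 'Happy', 'Sad', 'Sad']
```

The only structural difference from the forward pass is that the sum over previous states is replaced by a max, plus the backpointers needed to reconstruct the winning path.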

3. The Learning Task (Training)

Objective: Learn the model parameters from observed data.

Problem: Given only observation sequences, estimate the transition and emission probabilities.

Algorithm: Baum-Welch Algorithm (a special form of Expectation-Maximization).

Example: Training a speech recognition system using large collections of audio recordings.
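
The re-estimation idea behind Baum-Welch can be sketched in plain Python. This is a simplified illustration for one short observation sequence, with no scaling for numerical stability. The starting parameters, the observation sequence, and all function names are assumptions. States and symbols are plain integer indices.

```python
STATES = [0, 1]   # two hidden states, identified by index
N_SYMBOLS = 3     # observations are the integers 0, 1, 2


def forward(obs, init, trans, emit):
    """alpha[t][s] = P(obs[0..t], state at time t is s)."""
    alpha = [[init[s] * emit[s][obs[0]] for s in STATES]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[s][o] * sum(prev[p] * trans[p][s] for p in STATES)
                      for s in STATES])
    return alpha


def backward(obs, trans, emit):
    """beta[t][s] = P(obs[t+1..] | state at time t is s)."""
    beta = [[1.0 for _ in STATES]]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[s][n] * emit[n][o] * nxt[n] for n in STATES)
                        for s in STATES])
    return beta


def baum_welch_step(obs, init, trans, emit):
    """One EM iteration. Returns updated parameters and the likelihood
    of `obs` under the parameters that were passed in."""
    alpha, beta = forward(obs, init, trans, emit), backward(obs, trans, emit)
    T, likelihood = len(obs), sum(alpha[-1])
    # E-step: gamma[t][s] = P(state at t is s | obs)
    gamma = [[alpha[t][s] * beta[t][s] / likelihood for s in STATES]
             for t in range(T)]
    # xi[t][i][j] = P(state at t is i and state at t+1 is j | obs)
    xi = [[[alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in STATES] for i in STATES] for t in range(T - 1)]
    # M-step: turn expected counts into probabilities
    new_init = gamma[0][:]
    new_trans = [[sum(xi[t][i][j] for t in range(T - 1))
                  / sum(gamma[t][i] for t in range(T - 1))
                  for j in STATES] for i in STATES]
    new_emit = [[sum(gamma[t][s] for t in range(T) if obs[t] == k)
                 / sum(gamma[t][s] for t in range(T))
                 for k in range(N_SYMBOLS)] for s in STATES]
    return new_init, new_trans, new_emit, likelihood


observations = [0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 2, 1]  # made-up data
init = [0.5, 0.5]                                    # assumed starting guess
trans = [[0.6, 0.4], [0.5, 0.5]]
emit = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]

history = []
for _ in range(20):
    init, trans, emit, likelihood = baum_welch_step(observations, init, trans, emit)
    history.append(likelihood)
print("likelihood before training:", history[0])
print("likelihood after training: ", history[-1])
```

Each iteration can only keep the likelihood the same or raise it, which is the usual Expectation-Maximization guarantee. The result is a local optimum that depends on the starting guess, so real systems typically try several initializations and train on many sequences.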

Summary of Core HMM Tasks

Task | Question Being Answered | Algorithm | Output
Evaluation | How likely is this observation sequence? | Forward | Probability Score
Decoding | What hidden states generated the observations? | Viterbi | Most Likely State Sequence
Learning | What should the model parameters be? | Baum-Welch | Trained Model Parameters

Common Machine Learning Applications

  • Speech Recognition
    Matching spoken audio signals (observations) to phonemes or words (hidden states).
  • Bioinformatics
    Finding genes within DNA sequences by modeling patterns of nucleotides.
  • Stock Market Analysis
    Predicting hidden market conditions such as Bull Markets and Bear Markets from observable market behavior.

Conclusion

Observable Markov Chain:
State → State → State

Hidden Markov Model:
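Hidden State → Hidden State → Hidden State
     ↓              ↓              ↓
Observation    Observation    Observation

Each observation depends only on the hidden state directly above it; observations are not linked to one another.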

The observations are visible, but the hidden states must be inferred.

Greek Alphabet Reference for Machine Learning, Statistics, and AI

In machine learning algorithms, mathematics, statistics, and AI research papers, Greek letters are used extensively to represent variables, parameters, distributions, loss functions, learning rates, eigenvalues, degrees of freedom, and many other concepts.

Many practitioners get confused because some Greek letters resemble English letters but have different names and sounds. For example, the symbol ν, commonly used to represent degrees of freedom, is the letter Nu, not the English letter "v". Similarly, Epsilon (ε) and Upsilon (υ) are entirely different letters despite their similar names.

Quick Tip: If you regularly read research papers, becoming familiar with Greek letter names can significantly improve your ability to follow mathematical notation and technical discussions.

Complete Greek Alphabet Reference

Uppercase | Lowercase | Name | Modern Greek Pronunciation
Α | α | Alpha | AHL-fah (like 'a' in father)
Β | β | Beta | VEE-tah (like 'v' in vine)
Γ | γ | Gamma | GHAH-mah (soft, breathy 'g')
Δ | δ | Delta | THEL-tah (like 'th' in then)
Ε | ε | Epsilon | EH-psi-lon (like 'e' in pet)
Ζ | ζ | Zeta | ZEE-tah (like 'z' in zebra)
Η | η | Eta | EE-tah (like 'ee' in meet)
Θ | θ | Theta | THEE-tah (like 'th' in thin)
Ι | ι | Iota | ee-OH-tah
Κ | κ | Kappa | KAH-pah
Λ | λ | Lambda | LAHM-thah
Μ | μ | Mu | mee
Ν | ν | Nu | nee (like knee)
Ξ | ξ | Xi | ksee
Ο | ο | Omicron | OH-mee-kron
Π | π | Pi | pee
Ρ | ρ | Rho | roh
Σ | σ / ς | Sigma | SEEGH-mah
Τ | τ | Tau | taf
Υ | υ | Upsilon | EE-psi-lon
Φ | φ | Phi | fee
Χ | χ | Chi | hee (breathy 'h')
Ψ | ψ | Psi | psee
Ω | ω | Omega | oh-MEH-ghah

Note: The lowercase form ς is a special version of Sigma used only when Sigma appears as the final letter of a Greek word (for example: Οδυσσεύς).

Greek Letters Frequently Seen in AI & Machine Learning

  • α (Alpha) → Learning Rate
  • β (Beta) → Momentum, Beta Distribution Parameters
  • γ (Gamma) → Discount Factor in Reinforcement Learning
  • δ (Delta) → Error Terms and Differences
  • ε (Epsilon) → Small Constant, Exploration Rate
  • λ (Lambda) → Regularization Parameters
  • μ (Mu) → Mean of a Distribution
  • ν (Nu) → Degrees of Freedom
  • σ (Sigma) → Standard Deviation
  • θ (Theta) → Model Parameters / Weights
  • π (Pi) → Policy Function in Reinforcement Learning
  • ρ (Rho) → Correlation Coefficient
  • Ω (Omega) → Asymptotic Lower Bound (Big-Omega Notation)
