Wednesday, July 1, 2026

Linear Regression

Linear Regression is a mathematical method used to predict the value of a continuous target variable based on one or more input features.

Consider two related variables, x and y, where:

  • x is the independent variable (feature/predictor).
  • y is the dependent variable (target/output).

Given a dataset containing values of x and their corresponding values of y, linear regression attempts to predict unknown or future values of y for any given value of x.

Key Idea

Linear Regression assumes that the relationship between the input feature(s) and the output can be approximated using a straight line (or a hyperplane in higher dimensions).

Linear Regression Equation

For a single input feature, Linear Regression tries to find the best-fitting equation:

y = wx + b

where:

  • w = Weight (Slope)
  • b = Bias (Intercept)

This version is known as Simple Linear Regression.

When multiple input features are present, the equation becomes:

y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b

This is known as Multiple Linear Regression.

Important

The learning process of Linear Regression consists of finding the values of the weights (w) and bias (b) that best fit the training data.
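To make the two equations concrete, here is a minimal sketch in plain Python. The function names and the sample numbers are illustrative assumptions, not part of any library.

```python
def predict_simple(x, w, b):
    """Simple Linear Regression: one feature, y = w*x + b."""
    return w * x + b


def predict_multiple(features, weights, b):
    """Multiple Linear Regression: y = w1*x1 + ... + wn*xn + b."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(weights, features)) + b


# Hypothetical parameter values, chosen only for illustration.
simple_prediction = predict_simple(3.0, w=2.0, b=1.0)
multiple_prediction = predict_multiple([1.0, 2.0], weights=[0.5, -1.0], b=4.0)
```

Training never changes these functions. It only changes the values of w and b that are passed into them.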

Hypothesis and Hypothesis Space

Every possible combination of the parameters w and b defines a different straight line.

Each such line (or equation) is called a Hypothesis.

The collection of all possible hypotheses is called the Hypothesis Space.

Think of it this way

Every possible straight line that could be drawn through the data belongs to the hypothesis space. The objective of training is to choose the single line that best represents the observed data.

The entire training problem therefore reduces to finding the best hypothesis.


Prediction and Error

Every hypothesis predicts one output value for every input value.

For each training sample:

Prediction = wx + b

The prediction is then compared with the actual value from the dataset.

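The difference between the actual value and the predicted value is called the Error (or residual). A loss function, such as the absolute or squared error, turns this difference into a non-negative penalty called the Loss.

Error = Actual Value − Predicted Value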

Each training sample produces one loss value.

Since a dataset usually contains hundreds or thousands of observations, we need a single number that summarizes the prediction error across the entire dataset.

Definition

The overall error across the complete training dataset is called the Cost Function.

Common Cost Functions

Two cost functions are most commonly used in Linear Regression.

| Loss Function | Description |
|---|---|
| L1 Loss (Mean Absolute Error) | Average of the absolute prediction errors. |
| L2 Loss (Mean Squared Error) | Average of the squared prediction errors. |

1. Mean Absolute Error (L1 Loss)

The absolute value of every prediction error is computed before averaging.

MAE = Average(|Actual − Predicted|)

Since every error contributes proportionally, large errors are not excessively penalized.


2. Mean Squared Error (L2 Loss)

Each prediction error is squared before computing the average.

MSE = Average((Actual − Predicted)²)

Because the errors are squared, larger errors contribute much more heavily to the total loss than smaller ones.
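The difference between the two cost functions is easiest to see on a small example. The sketch below uses made-up numbers: one set of predictions is off by 1 everywhere, the other is perfect except for a single error of 4.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)**2 over all samples."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0, 9.0]
small_errors = [2.0, 6.0, 6.0, 10.0]   # every prediction is off by 1
one_big_error = [3.0, 5.0, 7.0, 13.0]  # three exact, one off by 4

mae_small = mean_absolute_error(actual, small_errors)
mae_big = mean_absolute_error(actual, one_big_error)
mse_small = mean_squared_error(actual, small_errors)
mse_big = mean_squared_error(actual, one_big_error)
```

Both prediction sets have the same MAE (1.0), but the MSE of the second set is four times larger (4.0 versus 1.0) because the single large error is squared.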


Why L2 Loss Is Usually Preferred

Although both L1 and L2 losses are widely used, the L2 loss function has two mathematical properties that make it especially convenient to optimize.

1. Differentiability

The L2 loss function is smooth everywhere. Its derivative exists at every point, including where the loss equals zero.

Why This Matters

Since the derivative exists everywhere, optimization algorithms such as Gradient Descent can easily determine the direction in which the parameters should be updated.

2. Convexity

The L2 loss function is also convex.

In a convex function, every local minimum is also a global minimum, so an optimizer cannot get stuck in a suboptimal local minimum.

Important Clarification

L1 loss is also convex. The primary mathematical advantage of L2 over L1 is its smooth differentiability, not convexity itself.

Coming Next

In the next section, we'll see how Linear Regression uses these properties to find the optimal values of w and b using the Normal Equation and Gradient Descent.

Cost Functions

Each hypothesis produces a predicted value of y for every input value x.

The difference between the predicted value and the actual value is called the error. The penalty computed from it, such as its absolute value or its square, is the loss.

The average loss over the complete training dataset is called the Cost Function.

Goal: Find the values of w and b that minimize the overall cost.
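One way to picture this goal is to score a handful of candidate hypotheses and keep the one with the lowest cost. The sketch below does exactly that with Mean Squared Error. The data and the candidate (w, b) pairs are invented for illustration, and real training searches the whole hypothesis space rather than a short list.

```python
def mse_cost(w, b, xs, ys):
    """Mean Squared Error of the hypothesis y = w*x + b on the dataset."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

# Each (w, b) pair is one hypothesis, i.e. one candidate line.
candidates = [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (3.0, -1.0)]
costs = {(w, b): mse_cost(w, b, xs, ys) for w, b in candidates}

# The best hypothesis among the candidates is the one with the lowest cost.
best_hypothesis = min(costs, key=costs.get)
```

Here the line with w = 2 and b = 1 wins with a cost of zero, because the data was generated from exactly that line.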

Common Loss Functions

1. Mean Absolute Error (L1 Loss)

The absolute value of each prediction error is calculated and then averaged.

MAE = Average(|Actual − Predicted|)

Characteristics:

  • Easy to understand
  • Robust against outliers
  • Continuous everywhere
  • Not differentiable at zero

Important:
L1 Loss is convex but contains a sharp corner at zero. Because of this kink, its derivative is undefined exactly at zero.
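The kink can be checked numerically. The sketch below estimates the slope of a loss just to the left and just to the right of zero using finite differences. The helper name and the step size h are arbitrary choices for illustration.

```python
def one_sided_slopes(loss, e, h=1e-6):
    """Finite-difference slopes of loss just left and just right of e."""
    left = (loss(e) - loss(e - h)) / h
    right = (loss(e + h) - loss(e)) / h
    return left, right


# Absolute error: the two slopes disagree at zero, so no derivative exists there.
abs_left, abs_right = one_sided_slopes(abs, 0.0)

# Squared error: both slopes shrink toward 0, so the derivative at zero is 0.
sq_left, sq_right = one_sided_slopes(lambda e: e * e, 0.0)
```

For the absolute error the left slope is −1 and the right slope is +1. For the squared error both slopes are close to zero, which is why the squared error is smooth at that point.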

2. Mean Squared Error (L2 Loss)

Instead of taking the absolute value of the error, the error is squared. The squared errors are then averaged.

MSE = Average((Actual − Predicted)²)

Why L2 Loss is Preferred

L2 loss has two extremely important mathematical properties.

1. Differentiability

L2 loss is smooth everywhere. Its derivative exists for every possible value, including zero.

Because the derivative exists everywhere, calculus-based optimization methods such as Gradient Descent work extremely well.

2. Convexity

The L2 cost function is convex.

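In a convex function, every local minimum is also a global minimum. Therefore there is no danger of getting trapped inside a suboptimal local minimum.

Result:
The minimum of the L2 cost is global, and it is unique as long as the input features are not perfectly collinear. It represents the best possible regression line.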

Finding the Best Hypothesis

Since L2 loss is both differentiable and convex, the optimal values of w and b can be obtained in two ways.

Method 1 : Normal Equation

The derivative of the cost function with respect to w and b is computed.

The derivatives are set equal to zero and solved directly.

This analytical solution is called the Normal Equation.
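For a single feature, setting both derivatives of the Mean Squared Error to zero gives a well-known closed form: w is the sum of (x − mean(x))(y − mean(y)) divided by the sum of (x − mean(x))², and b = mean(y) − w × mean(x). The sketch below implements that special case in plain Python. The function name is an illustrative assumption, and the multi-feature case is usually solved with a linear algebra library instead.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least-squares fit of y = w*x + b for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b


# Data generated from y = 2x + 1, so the fit should recover w = 2 and b = 1.
w, b = fit_simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```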

Method 2 : Gradient Descent

Instead of solving the equations directly, Gradient Descent starts with random values of w and b and repeatedly improves them.

Each iteration moves the parameters in the direction that reduces the cost.

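Gradient Descent is preferred for very large datasets, especially those with many features, because solving the Normal Equation requires inverting (or factorizing) a matrix whose size grows with the number of features, which becomes computationally expensive.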
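A minimal sketch of Gradient Descent on the Mean Squared Error for one feature is shown below. The learning rate, the number of steps, and the choice to start from zero instead of random values (so the result is reproducible) are all assumptions made for this example.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit y = w*x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0  # fixed starting point instead of random values
    n = len(xs)
    for _ in range(steps):
        # Prediction minus actual for every sample under the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Move the parameters in the direction that reduces the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


gd_w, gd_b = gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

On this small dataset the result approaches w = 2 and b = 1, the same answer the analytical solution gives. A learning rate that is too large would make the updates diverge instead, so the value usually needs tuning for each dataset.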

Why L1 Loss Requires Different Optimization

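Since L1 loss is not differentiable at zero, and setting its derivative to zero does not produce a system of linear equations, there is no closed-form solution analogous to the Normal Equation.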

Instead, optimization techniques such as:

  • Subgradient Descent
  • Linear Programming
  • Simplex-based Optimization

are commonly used.

Notice that L1 loss is still continuous and convex. Its limitation is only the lack of differentiability at one point.
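Of the techniques listed above, Subgradient Descent is the simplest to sketch. The example below minimizes the Mean Absolute Error for one feature. The shrinking step size lr0 / sqrt(t + 1), the use of 0 as the subgradient at the kink, and the dataset with one deliberate outlier are assumptions chosen for illustration.

```python
import math


def mean_absolute_error(w, b, xs, ys):
    return sum(abs(y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)


def sign(value):
    # At value == 0 any number in [-1, 1] is a valid subgradient; 0 is used here.
    return (value > 0) - (value < 0)


def fit_l1_subgradient(xs, ys, lr0=0.5, steps=5000):
    """Minimize MAE for y = w*x + b and return the best iterate seen."""
    n = len(xs)
    w, b = 0.0, 0.0
    best_cost, best_w, best_b = mean_absolute_error(w, b, xs, ys), w, b
    for t in range(steps):
        signs = [sign(y - (w * x + b)) for x, y in zip(xs, ys)]
        grad_w = -sum(s * x for s, x in zip(signs, xs)) / n
        grad_b = -sum(signs) / n
        lr = lr0 / math.sqrt(t + 1)  # shrinking step size
        w -= lr * grad_w
        b -= lr * grad_b
        cost = mean_absolute_error(w, b, xs, ys)
        # A subgradient step can increase the cost, so track the best point.
        if cost < best_cost:
            best_cost, best_w, best_b = cost, w, b
    return best_w, best_b, best_cost


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 30.0]  # y = 2x + 1, except for the last point
w_l1, b_l1, best_mae = fit_l1_subgradient(xs, ys)
```

Unlike Gradient Descent on a smooth cost, individual subgradient steps are not guaranteed to lower the cost, which is why the sketch keeps the best parameters found so far. The fitted slope stays near 2 despite the outlier, in line with the robustness of L1 loss noted earlier.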

Key Assumptions of Linear Regression

Linear Regression performs well only when certain mathematical assumptions are reasonably satisfied.

| Assumption | Description |
|---|---|
| Linearity | The relationship between the input variables and the target variable should be approximately linear. |
| Independence | Observations and their prediction errors should be independent of one another. |
| Homoscedasticity | The variance of the residual errors should remain approximately constant across all values of the independent variables. |
| Normality | Residual errors should be approximately normally distributed. |

Note

Linear Regression is surprisingly robust. Minor violations of these assumptions often do not significantly affect prediction accuracy, especially when working with large datasets.

Evaluating a Linear Regression Model

After training a regression model, we need to evaluate how well it fits the training or test data. Several evaluation metrics are commonly used.

| Metric | Purpose |
|---|---|
| R² (Coefficient of Determination) | Measures how much of the variation in the target variable is explained by the model. |
| Mean Absolute Error (MAE) | Average absolute prediction error. |
| Root Mean Squared Error (RMSE) | Square root of the Mean Squared Error. Larger errors receive greater penalty. |

Rule of Thumb
  • Higher R² is generally better.
  • Lower MAE is better.
  • Lower RMSE is better.
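The three metrics can be computed by hand in a few lines. The sketch below uses plain Python and invented numbers. R² is computed here as 1 − (sum of squared residuals) / (total sum of squares around the mean), which is the usual definition.

```python
import math


def r2_score(actual, predicted):
    """1 - SS_res / SS_tot; undefined when all actual values are equal."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 6.0, 6.0, 10.0]  # every prediction is off by 1

r2 = r2_score(actual, predicted)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

For these numbers R² is 0.8, while MAE and RMSE are both 1.0. MAE and RMSE are expressed in the same units as the target variable, whereas R² is unitless.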

Overfitting and Regularization

When the number of input features becomes very large relative to the number of training samples, a regression model may begin to memorize the training data instead of learning the underlying relationship.

This phenomenon is called Overfitting.

An overfitted model usually performs extremely well on the training data but performs poorly on previously unseen test data.

To reduce overfitting, additional penalty terms can be added to the cost function. This technique is known as Regularization.


Common Regularization Techniques

| Algorithm | Penalty Added | Characteristics |
|---|---|---|
| Ridge Regression | L2 Penalty | Shrinks coefficient values but rarely makes them exactly zero. |
| Lasso Regression | L1 Penalty | Can reduce some coefficients completely to zero, effectively performing feature selection. |
| ElasticNet | L1 + L2 | Combines the advantages of both Ridge and Lasso. |

Key Point

Regularization intentionally allows a small increase in training error in exchange for much better performance on unseen data.
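The shrinking effect of an L2 penalty can be seen in the single-feature case, where the penalized problem still has a closed form. The sketch below assumes the cost is the sum of squared errors plus alpha × w², with the bias left unpenalized. Under that assumption the slope becomes S_xy / (S_xx + alpha), where S_xy and S_xx are the centered sums of products and squares. The function name and data are illustrative.

```python
def fit_simple_ridge(xs, ys, alpha):
    """Minimize sum((y - (w*x + b))**2) + alpha * w**2 for one feature."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    if s_xx + alpha == 0:
        raise ValueError("the slope is undefined for constant x with alpha == 0")
    w = s_xy / (s_xx + alpha)  # a larger alpha pulls w toward zero
    b = y_mean - w * x_mean    # the bias is not penalized
    return w, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w_plain, b_plain = fit_simple_ridge(xs, ys, alpha=0.0)  # ordinary least squares
w_ridge, b_ridge = fit_simple_ridge(xs, ys, alpha=5.0)  # penalized fit
```

With alpha = 0 the fit recovers the slope 2. With alpha = 5 the slope shrinks to 1, so the training error goes up, which is the trade-off described above.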

scikit-learn and PyTorch Implementation

This final section explains the default behavior of Linear Regression in two popular Machine Learning frameworks:

  • scikit-learn
  • PyTorch

It also discusses how regularization is enabled in each framework and the default settings used by Ridge, Lasso, and ElasticNet.


Default Behavior in scikit-learn

The LinearRegression class in scikit-learn implements Ordinary Least Squares (OLS).

By default, it applies no regularization whatsoever.

Default Behavior
  • LinearRegression() → Ordinary Least Squares (No Regularization)

If regularization is required, scikit-learn provides separate algorithms.

| Model | Regularization | Default Parameters |
|---|---|---|
| LinearRegression() | None | Ordinary Least Squares |
| Ridge() | L2 (Ridge) | alpha = 1.0 |
| Lasso() | L1 (Lasso) | alpha = 1.0 |
| ElasticNet() | L1 + L2 | alpha = 1.0, l1_ratio = 0.5 |

Important

scikit-learn's LinearRegression never applies regularization automatically. You must explicitly choose Ridge, Lasso, or ElasticNet if regularization is desired.
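The four estimators share the same fit/predict interface, so switching between them is a one-line change. The sketch below fits each of them with default settings on a tiny invented dataset. It assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Tiny illustrative dataset generated from y = 2x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

ols = LinearRegression().fit(X, y)  # no regularization
ridge = Ridge().fit(X, y)           # L2 penalty, default alpha
lasso = Lasso().fit(X, y)           # L1 penalty, default alpha
enet = ElasticNet().fit(X, y)       # L1 + L2, default alpha and l1_ratio

slopes = {
    "ols": ols.coef_[0],
    "ridge": ridge.coef_[0],
    "lasso": lasso.coef_[0],
    "elasticnet": enet.coef_[0],
}
```

On this data the unregularized model recovers the slope 2, while the regularized models return smaller slopes. In practice alpha is usually tuned, for example with cross-validation, rather than left at its default.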

Default Behavior in PyTorch

PyTorch follows a fundamentally different philosophy.

Unlike scikit-learn, PyTorch does not provide a dedicated "Linear Regression" algorithm. Instead, users construct the model themselves using neural network building blocks.

Typical components include:

  • nn.Linear for the model
  • nn.MSELoss() for the loss function
  • An optimizer such as SGD or Adam

Because every component is built explicitly by the developer, PyTorch applies no regularization by default.
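Put together, the components above give a training loop along the following lines. This is a minimal sketch that assumes PyTorch is installed. The dataset, learning rate, number of steps, and random seed are illustrative choices.

```python
import torch
from torch import nn

torch.manual_seed(0)  # makes the random initial weights reproducible

# Tiny illustrative dataset generated from y = 2x + 1.
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])

model = nn.Linear(in_features=1, out_features=1)  # holds w and b
loss_fn = nn.MSELoss()                            # L2 loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for _ in range(2000):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass and cost
    loss.backward()              # compute gradients of the cost
    optimizer.step()             # gradient descent update

learned_w = model.weight.item()
learned_b = model.bias.item()
```

Because the whole dataset is passed in on every step, this loop is plain full-batch Gradient Descent, and the learned parameters end up close to w = 2 and b = 1.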

Adding Regularization in PyTorch

Regularization is typically introduced in one of two ways.

Method 1 : Add the Penalty to the Loss Function

The loss function can be manually extended by adding either an L1 or L2 penalty.

loss = mse_loss + λ × Σ(weights²)

The above expression corresponds to Ridge (L2) regularization.

Similarly, Lasso (L1) regularization can be implemented by summing the absolute values of the weights.
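Both penalties can be sketched as a small helper that wraps the MSE. The function name penalized_loss, the argument lam (standing in for λ), and the decision to penalize only model.weight and not the bias are assumptions made for this example.

```python
import torch
from torch import nn


def penalized_loss(model, predictions, targets, lam, kind="l2"):
    """MSE plus lam times an L2 or L1 penalty on the weights of nn.Linear."""
    mse = nn.functional.mse_loss(predictions, targets)
    weights = model.weight  # the bias is deliberately left unpenalized
    if kind == "l2":
        penalty = (weights ** 2).sum()  # Ridge-style penalty
    elif kind == "l1":
        penalty = weights.abs().sum()   # Lasso-style penalty
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return mse + lam * penalty


# Fixed weights so the penalty values are easy to check by hand.
model = nn.Linear(2, 1)
with torch.no_grad():
    model.weight.copy_(torch.tensor([[3.0, -4.0]]))
    model.bias.zero_()

X = torch.zeros(1, 2)
y = torch.zeros(1, 1)  # predictions equal targets, so the MSE term is 0

l2_value = penalized_loss(model, model(X), y, lam=0.1, kind="l2").item()
l1_value = penalized_loss(model, model(X), y, lam=0.1, kind="l1").item()
```

With weights 3 and −4 the L2 penalty is 0.1 × 25 = 2.5 and the L1 penalty is 0.1 × 7 = 0.7. In a training loop, calling backward() on the returned value propagates gradients through both the MSE and the penalty term.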


Method 2 : Use weight_decay

Most PyTorch optimizers expose a parameter called weight_decay.

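With optimizers such as SGD and Adam, this parameter applies an L2 (Ridge-style) penalty during optimization to every parameter passed to the optimizer, which includes the bias when model.parameters() is used.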

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0001)

For SGD and Adam the default value is weight_decay = 0, which means regularization is disabled by default.

Does PyTorch Support L1 Regularization?

Unlike L2 regularization, PyTorch optimizers do not provide a built-in parameter for L1 regularization.

If L1 (Lasso-style) regularization is required, it must be added manually to the loss function.

Summary
  • L2 → Built into optimizers via weight_decay.
  • L1 → Must always be added manually.

Framework Comparison

| Framework | Default Regularization | How to Enable |
|---|---|---|
| scikit-learn | None | Use Ridge(), Lasso(), or ElasticNet(). |
| PyTorch | None | Use weight_decay (L2) or manually modify the loss function (L1). |

Key Takeaways

  • Linear Regression finds the best-fitting line through the training data.
  • The quality of a hypothesis is measured using a Cost Function.
  • L2 (Mean Squared Error) is the most commonly used loss because it is both smooth (differentiable) and convex.
  • The optimal parameters can be obtained analytically using the Normal Equation or iteratively using Gradient Descent.
  • Regularization reduces overfitting by penalizing large model weights.
  • Ridge uses an L2 penalty, while Lasso uses an L1 penalty.
  • Neither scikit-learn's LinearRegression nor a hand-built PyTorch model applies regularization by default; it must be explicitly enabled by the developer.
