Linear Regression is a mathematical method used to predict the value of a continuous target variable based on one or more input features.
Consider two related variables, x and y, where:
- x is the independent variable (feature/predictor).
- y is the dependent variable (target/output).
Given a dataset containing values of x and their corresponding values of y, linear regression attempts to predict unknown or future values of y for any given value of x.
Linear Regression assumes that the relationship between the input feature(s) and the output can be approximated using a straight line (or a hyperplane in higher dimensions).
Linear Regression Equation
For a single input feature, Linear Regression tries to find the best-fitting equation:
$$y = wx + b$$
where:
- w = Weight (Slope)
- b = Bias (Intercept)
This version is known as Simple Linear Regression.
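When multiple input features $x_1, x_2, \dots, x_n$ are present, each feature gets its own weight and the equation becomes:
$$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$$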
This is known as Multiple Linear Regression.
The learning process of Linear Regression consists of finding the values of the weights (w) and bias (b) that best fit the training data.
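As a minimal sketch of the two equations described above, the functions below compute a prediction from given parameters. The function names and the parameter values are illustrative assumptions; nothing is learned from data here.

```python
def predict_simple(x, w, b):
    """Simple Linear Regression: one feature, one weight, one bias."""
    return w * x + b


def predict_multiple(features, weights, b):
    """Multiple Linear Regression: weighted sum of the features plus the bias."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w_j * x_j for w_j, x_j in zip(weights, features)) + b


# Hand-picked parameters, purely for illustration.
print(predict_simple(3.0, w=2.0, b=1.0))                 # 2*3 + 1
print(predict_multiple([1.0, 2.0], [0.5, -1.0], b=4.0))  # 0.5*1 - 1*2 + 4
```

Training, covered in the rest of the post, is the process of choosing `w` (or `weights`) and `b` instead of picking them by hand.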
Hypothesis and Hypothesis Space
Every possible combination of the parameters w and b defines a different straight line.
Each such line (or equation) is called a Hypothesis.
The collection of all possible hypotheses is called the Hypothesis Space.
Every possible straight line that could be drawn through the data belongs to the hypothesis space. The objective of training is to choose the single line that best represents the observed data.
The entire training problem therefore reduces to finding the best hypothesis.
Prediction and Error
Every hypothesis predicts one output value for every input value.
For each training sample $(x_i, y_i)$, the hypothesis produces a prediction $\hat{y}_i$:
$$\hat{y}_i = w x_i + b$$
The prediction is then compared with the actual value from the dataset.
The difference between the actual value and the predicted value is called the Error (or residual). A loss function turns that error into a non-negative penalty, the Loss.
Each training sample produces one loss value.
Since a dataset usually contains hundreds or thousands of observations, we need a single number that summarizes the prediction error across the entire dataset.
The overall error across the complete training dataset is called the Cost Function.
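The idea of scoring each hypothesis with a single number can be sketched as follows. The toy dataset, the two candidate (w, b) pairs, and the choice of the mean squared error as the summary are assumptions made for illustration.

```python
# Toy dataset (made up): every point lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def errors(w, b):
    """Per-sample error: actual value minus predicted value."""
    return [y - (w * x + b) for x, y in zip(xs, ys)]


def cost(w, b):
    """One number for the whole dataset: the mean of the squared errors."""
    errs = errors(w, b)
    return sum(e * e for e in errs) / len(errs)


# Two candidate hypotheses, i.e. two points in the hypothesis space.
candidates = [(2.0, 1.0), (1.0, 2.0)]
best = min(candidates, key=lambda wb: cost(*wb))
print("best hypothesis:", best, "cost:", cost(*best))
```

Real training searches over all possible (w, b) values rather than a short list, but the comparison rule is the same: lower cost means a better hypothesis.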
Common Cost Functions
Two cost functions are most commonly used in Linear Regression.
| Loss Function | Description |
|---|---|
| L1 Loss (Mean Absolute Error) | Average of the absolute prediction errors. |
| L2 Loss (Mean Squared Error) | Average of the squared prediction errors. |
1. Mean Absolute Error (L1 Loss)
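The absolute value of every prediction error $y_i - \hat{y}_i$ (actual minus predicted) is computed before averaging over the n samples:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
Every error contributes in proportion to its size, so a single large error does not dominate the total.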
2. Mean Squared Error (L2 Loss)
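Each prediction error $y_i - \hat{y}_i$ (actual minus predicted) is squared before computing the average over the n samples:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Because the errors are squared, larger errors contribute much more heavily to the total loss than smaller ones.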
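To make the difference between the two losses concrete, here is a small sketch that evaluates both on hand-made predictions. The numbers are invented; one prediction set is slightly off everywhere, the other is perfect except for a single large miss.

```python
def mae(actual, predicted):
    """Mean Absolute Error (L1 loss)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error (L2 loss)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [10.0, 12.0, 14.0, 16.0]
slightly_off = [11.0, 11.0, 15.0, 15.0]  # every error has magnitude 1
one_big_miss = [10.0, 12.0, 14.0, 6.0]   # a single error of magnitude 10

for name, predicted in [("slightly_off", slightly_off), ("one_big_miss", one_big_miss)]:
    print(name, "MAE:", mae(actual, predicted), "MSE:", mse(actual, predicted))
```

In this example the single large miss raises the MAE by a factor of 2.5 but the MSE by a factor of 25, which is the heavier penalty on large errors described above.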
Why L2 Loss Is Usually Preferred
Although both L1 and L2 losses are widely used, the L2 loss function has two mathematical properties that make it convenient to optimize.
1. Differentiability
The L2 loss function is smooth everywhere. Its derivative exists at every point, including where the prediction error equals zero.
Since the derivative exists everywhere, optimization algorithms such as Gradient Descent can easily determine the direction in which the parameters should be updated.
2. Convexity
The L2 loss function is also convex.
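For a convex function, every local minimum is also a global minimum, so an optimizer cannot get stuck in a worse local minimum. When the input features are linearly independent, the L2 cost is strictly convex and its minimum is unique.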
L1 loss is also convex. The primary mathematical advantage of L2 over L1 is its smooth differentiability, not convexity itself.
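The smoothness difference can be checked numerically. The sketch below is only an illustration using finite differences (the helper `slope` and the step size `h` are assumptions): it compares the slope just to the right and just to the left of a zero error for both losses.

```python
def slope(f, e, h):
    """Finite-difference slope of f between e and e + h."""
    return (f(e + h) - f(e)) / h


def squared(e):
    return e * e


def absolute(e):
    return abs(e)


h = 1e-6
# Squared error: both one-sided slopes shrink toward 0, so the derivative is 0.
sq_right, sq_left = slope(squared, 0.0, h), slope(squared, 0.0, -h)
# Absolute error: the one-sided slopes are +1 and -1, so no single derivative exists.
abs_right, abs_left = slope(absolute, 0.0, h), slope(absolute, 0.0, -h)

print("squared :", sq_right, sq_left)
print("absolute:", abs_right, abs_left)
```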
In the next section, we'll see how Linear Regression uses these properties to find the optimal values of w and b using the Normal Equation and Gradient Descent.
Cost Functions
Each hypothesis produces a predicted value of y for every input value x.
The difference between the predicted value and the actual value is called the loss (or error).
The total loss over the complete training dataset is called the Cost Function.
Common Loss Functions
1. Mean Absolute Error (L1 Loss)
The absolute value of each prediction error is calculated and then averaged.
Characteristics:
- Easy to understand
- Robust against outliers
- Continuous everywhere
- Not differentiable at zero
L1 Loss is convex but contains a sharp corner at zero. Because of this kink, its derivative is undefined exactly at zero.
2. Mean Squared Error (L2 Loss)
Instead of taking the absolute value of the error, the error is squared. The squared errors are then averaged.
Why L2 Loss is Preferred
L2 loss has two extremely important mathematical properties.
1. Differentiability
L2 loss is smooth everywhere. Its derivative exists for every possible value, including zero.
2. Convexity
The L2 cost function is convex.
For a convex function, every local minimum is also a global minimum. Therefore there is no danger of getting trapped in a worse local minimum.
When the input features are linearly independent, that global minimum is unique and represents the best possible regression line.
Finding the Best Hypothesis
Since L2 loss is both differentiable and convex, the optimal values of w and b can be obtained in two ways.
Method 1 : Normal Equation
The derivative of the cost function with respect to w and b is computed.
The derivatives are set equal to zero and solved directly.
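For a single feature, setting both derivatives of the mean squared error to zero gives the well-known closed form w = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − w·x̄. The sketch below implements that special case in plain Python; the function name and the toy data are assumptions, and the multi-feature case would need a matrix solve instead.

```python
def fit_closed_form(xs, ys):
    """Solve for w and b directly (single-feature least squares)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    w = sxy / sxx
    b = y_mean - w * x_mean
    return w, b


# Toy data lying exactly on y = 2x + 1.
w, b = fit_closed_form([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(w, b)
```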
Method 2 : Gradient Descent
Instead of solving the equations directly, Gradient Descent starts with initial values of w and b (often random or zero) and repeatedly improves them.
Each iteration moves the parameters in the direction that reduces the cost.
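A minimal sketch of this loop for a single feature and the mean squared error is shown below. The learning rate, the number of steps, the zero starting point (used instead of random values so the run is repeatable), and the toy data are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, lr=0.05, steps=2000):
    """Fit w and b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [y - (w * x + b) for x, y in zip(xs, ys)]
        # Partial derivatives of (1/n) * sum(err^2) with respect to w and b.
        grad_w = -2.0 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = -2.0 / n * sum(errs)
        # Move each parameter in the direction that reduces the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data lying exactly on y = 2x + 1.
w, b = gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(w, b)
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small needs many more steps, so `lr` usually has to be tuned per dataset.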
Why L1 Loss Requires Different Optimization
Since L1 loss is not differentiable at zero, setting the derivative to zero does not produce a closed-form solution, so the Normal Equation cannot be used.
Instead, optimization techniques such as:
- Subgradient Descent
- Linear Programming
- Simplex-based Optimization
are commonly used.
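As a rough sketch of the first option, the code below runs subgradient descent on the mean absolute error for a single feature. The decaying step size, the step count, the choice of 0 as the subgradient at the kink, and the toy data with one outlier are assumptions; because subgradient steps do not always decrease the cost, the best parameters seen so far are tracked.

```python
def sign(v):
    """Return -1, 0, or 1; 0 is one valid subgradient choice of |v| at v = 0."""
    return (v > 0) - (v < 0)


def mean_abs_error(xs, ys, w, b):
    return sum(abs(y - (w * x + b)) for x, y in zip(xs, ys)) / len(xs)


def subgradient_descent(xs, ys, lr=0.1, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    best = (mean_abs_error(xs, ys, w, b), w, b)
    for t in range(1, steps + 1):
        errs = [y - (w * x + b) for x, y in zip(xs, ys)]
        # A subgradient of the MAE with respect to w and b.
        g_w = -sum(sign(e) * x for e, x in zip(errs, xs)) / n
        g_b = -sum(sign(e) for e in errs) / n
        step = lr / t ** 0.5  # shrinking steps help the iterates settle
        w -= step * g_w
        b -= step * g_b
        current = mean_abs_error(xs, ys, w, b)
        if current < best[0]:
            best = (current, w, b)
    return best


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 30.0]  # last point is an outlier
initial_cost = mean_abs_error(xs, ys, 0.0, 0.0)
best_cost, w, b = subgradient_descent(xs, ys)
print(initial_cost, best_cost, w, b)
```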
Key Assumptions of Linear Regression
Linear Regression performs well only when certain mathematical assumptions are reasonably satisfied.
| Assumption | Description |
|---|---|
| Linearity | The relationship between the input variables and the target variable should be approximately linear. |
| Independence | Observations and their prediction errors should be independent of one another. |
| Homoscedasticity | The variance of the residual errors should remain approximately constant across all values of the independent variables. |
| Normality | Residual errors should be approximately normally distributed. |
Linear Regression is fairly robust in practice. Minor violations of these assumptions often do not significantly affect prediction accuracy, especially when working with large datasets, although they can still distort confidence intervals and significance tests.
Evaluating a Linear Regression Model
After training a regression model, we need to evaluate how well it fits the training or test data. Several evaluation metrics are commonly used.
| Metric | Purpose |
|---|---|
| R² (Coefficient of Determination) | Measures how much of the variation in the target variable is explained by the model. |
| Mean Absolute Error (MAE) | Average absolute prediction error. |
| Root Mean Squared Error (RMSE) | Square root of the Mean Squared Error. Larger errors receive greater penalty. |
- Higher R² is generally better.
- Lower MAE is better.
- Lower RMSE is better.
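The three metrics above can be computed by hand as in the following sketch. The function name and the sample numbers are assumptions; R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math


def regression_metrics(actual, predicted):
    """Return R², MAE, and RMSE for paired actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(a - p) for a, p in zip(actual, predicted)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(metrics)
```

MAE and RMSE are in the same units as the target, while R² is unitless, which makes R² easier to compare across datasets.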
Overfitting and Regularization
When the number of input features becomes very large relative to the number of training samples, a regression model may begin to memorize the training data instead of learning the underlying relationship.
This phenomenon is called Overfitting.
To reduce overfitting, additional penalty terms can be added to the cost function. This technique is known as Regularization.
Common Regularization Techniques
| Algorithm | Penalty Added | Characteristics |
|---|---|---|
| Ridge Regression | L2 Penalty | Shrinks coefficient values but rarely makes them exactly zero. |
| Lasso Regression | L1 Penalty | Can reduce some coefficients completely to zero, effectively performing feature selection. |
| ElasticNet | L1 + L2 | Combines the advantages of both Ridge and Lasso. |
Regularization intentionally allows a small increase in training error in exchange for what is often better performance on unseen data.
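To illustrate how a penalty shrinks the weights, the sketch below uses a deliberately simplified model: one feature, no bias, and the cost (1/n)·Σ(y − wx)² + α·w². Setting its derivative to zero gives w = Σxy / (Σx² + n·α). The model, the symbol α (`alpha`), and this particular scaling of the penalty are assumptions of the example; libraries may scale the data term differently.

```python
def ridge_weight(xs, ys, alpha):
    """Minimizer of (1/n) * sum((y - w*x)^2) + alpha * w^2 for a bias-free model."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

weights = {alpha: ridge_weight(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0)}
for alpha, w in weights.items():
    print("alpha =", alpha, "-> w =", w)
```

With α = 0 the unpenalized weight is recovered; as α grows, the weight moves toward zero, trading a worse fit on the training data for a smaller coefficient.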
scikit-learn and PyTorch Implementation
This final section explains the default behavior of Linear Regression in two popular Machine Learning frameworks:
- scikit-learn
- PyTorch
It also discusses how regularization is enabled in each framework and the default settings used by Ridge, Lasso, and ElasticNet.
Default Behavior in scikit-learn
The LinearRegression class in scikit-learn implements Ordinary Least Squares (OLS).
By default, it applies no regularization whatsoever.
- LinearRegression() → Ordinary Least Squares (No Regularization)
If regularization is required, scikit-learn provides separate algorithms.
| Model | Regularization | Default Parameters |
|---|---|---|
| LinearRegression() | None | Ordinary Least Squares |
| Ridge() | L2 (Ridge) | alpha = 1.0 |
| Lasso() | L1 (Lasso) | alpha = 1.0 |
| ElasticNet() | L1 + L2 | alpha = 1.0, l1_ratio = 0.5 |
scikit-learn never applies regularization automatically. You must explicitly choose Ridge, Lasso, or ElasticNet if regularization is desired.
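A short sketch of the four estimators with their default settings is shown below. It assumes scikit-learn is installed; the toy data is invented, and the printed coefficients are only meant to show that the regularized models shrink the slope relative to OLS.

```python
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Toy data lying exactly on y = 2x + 1.
X = [[1.0], [2.0], [3.0], [4.0]]
y = [3.0, 5.0, 7.0, 9.0]

models = {
    "ols": LinearRegression(),     # no regularization
    "ridge": Ridge(),              # L2 penalty, default alpha
    "lasso": Lasso(),              # L1 penalty, default alpha
    "elastic_net": ElasticNet(),   # L1 + L2, default alpha and l1_ratio
}

for name, model in models.items():
    model.fit(X, y)
    print(name, "slope:", model.coef_[0], "intercept:", model.intercept_)
```

In practice `alpha` is rarely left at its default; it is typically tuned with cross-validation.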
Default Behavior in PyTorch
PyTorch follows a fundamentally different philosophy.
Unlike scikit-learn, PyTorch does not provide a dedicated "Linear Regression" algorithm. Instead, users construct the model themselves using neural network building blocks.
Typical components include:
- nn.Linear for the model
- nn.MSELoss() for the loss function
- An optimizer such as SGD or Adam
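Put together, those components might look like the following minimal sketch. It assumes PyTorch is installed; the toy data, learning rate, and number of iterations are illustrative choices rather than recommended settings.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random initial weights repeatable

# Toy data lying exactly on y = 2x + 1, shaped (samples, features).
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])

model = nn.Linear(in_features=1, out_features=1)  # holds w and b
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for _ in range(2000):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass and cost
    loss.backward()              # compute gradients
    optimizer.step()             # update w and b

w = model.weight.item()
b = model.bias.item()
print(w, b)
```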
Adding Regularization in PyTorch
Regularization is typically introduced in one of two ways.
Method 1 : Add the Penalty to the Loss Function
The loss function can be manually extended by adding either an L1 or L2 penalty.
Adding the sum of the squared weights, scaled by a regularization coefficient, corresponds to Ridge (L2) regularization.
Similarly, Lasso (L1) regularization can be implemented by summing the absolute values of the weights.
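One possible way to write both penalties is sketched below. The helper `penalized_loss` and the coefficient names `l1_coef` and `l2_coef` are hypothetical, not part of PyTorch; the sketch also assumes the penalty is applied to the weights only, leaving the bias unpenalized.

```python
import torch
from torch import nn

mse_loss = nn.MSELoss()


def penalized_loss(model, inputs, targets, l1_coef=0.0, l2_coef=0.0):
    """MSE plus optional L1 and L2 penalties on the weights of an nn.Linear."""
    data_loss = mse_loss(model(inputs), targets)
    l1_penalty = model.weight.abs().sum()   # Lasso-style term
    l2_penalty = model.weight.pow(2).sum()  # Ridge-style term
    return data_loss + l1_coef * l1_penalty + l2_coef * l2_penalty


X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])

model = nn.Linear(1, 1)
with torch.no_grad():  # set w = 2, b = 1 so the data term is exactly zero
    model.weight.fill_(2.0)
    model.bias.fill_(1.0)

ridge_value = penalized_loss(model, X, y, l2_coef=0.1).item()  # 0.1 * 2^2
lasso_value = penalized_loss(model, X, y, l1_coef=0.1).item()  # 0.1 * |2|
print(ridge_value, lasso_value)
```

In a training loop, the returned tensor would replace the plain loss before calling `backward()`.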
Method 2 : Use weight_decay
Most PyTorch optimizers expose a parameter called weight_decay.
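For optimizers such as SGD and Adam, this parameter applies an L2 (Ridge-style) penalty during optimization to every parameter handed to the optimizer, including the bias. AdamW instead applies a decoupled form of weight decay.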
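The effect of weight_decay can be sketched by training the same model twice, once without and once with decay. The helper `train`, the decay value of 0.5, and the toy data are assumptions for illustration; the decay value is much larger than what is typical in practice so that the effect is visible.

```python
import torch
from torch import nn

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[3.0], [5.0], [7.0], [9.0]])


def train(weight_decay):
    """Train a one-feature linear model with SGD and return (w, b)."""
    torch.manual_seed(0)  # same initial parameters for both runs
    model = nn.Linear(1, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.05, weight_decay=weight_decay
    )
    for _ in range(2000):
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()
    return model.weight.item(), model.bias.item()


w_plain, b_plain = train(weight_decay=0.0)
w_decay, b_decay = train(weight_decay=0.5)

norm_plain = w_plain ** 2 + b_plain ** 2
norm_decay = w_decay ** 2 + b_decay ** 2
print("no decay  :", w_plain, b_plain, norm_plain)
print("with decay:", w_decay, b_decay, norm_decay)
```

Because the optimizer here receives both the weight and the bias, both are decayed; the overall squared size of the parameters shrinks even if an individual parameter barely moves. To exempt the bias, parameter groups with different `weight_decay` values can be passed to the optimizer.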
Does PyTorch Support L1 Regularization?
Unlike L2 regularization, PyTorch optimizers do not provide a built-in parameter for L1 regularization.
If L1 (Lasso-style) regularization is required, it must be added manually to the loss function.
- L2 → Built into optimizers via weight_decay.
- L1 → Must always be added manually.
Framework Comparison
| Framework | Default Regularization | How to Enable |
|---|---|---|
| scikit-learn | None | Use Ridge(), Lasso(), or ElasticNet(). |
| PyTorch | None | Use weight_decay (L2) or manually modify the loss function (L1). |
Key Takeaways
- Linear Regression finds the best-fitting line through the training data.
- The quality of a hypothesis is measured using a Cost Function.
- L2 (Mean Squared Error) is the most commonly used loss because it is both smooth (differentiable) and convex.
- The optimal parameters can be obtained analytically using the Normal Equation or iteratively using Gradient Descent.
- Regularization reduces overfitting by penalizing large model weights.
- Ridge uses an L2 penalty, while Lasso uses an L1 penalty.
- scikit-learn and PyTorch both disable regularization by default; it must be explicitly enabled by the developer.