
Wednesday, February 26, 2025

Handling Multicollinearity in Regression Models

Multicollinearity refers to a situation in regression analysis where two or more independent variables are highly correlated with each other. This can lead to unreliable estimates of regression coefficients, making it difficult to interpret the significance of individual predictors.

Multicollinearity can cause problems such as:

Inflated standard errors of coefficients
Inaccurate p-values
Unstable coefficient estimates

In this post, we will explore various methods to handle multicollinearity in regression models, particularly using techniques like Ridge Regression and Lasso Regression that are effective in addressing this issue.

Identifying Multicollinearity

The first step in dealing with multicollinearity is identifying it. A common approach is to calculate the Variance Inflation Factor (VIF), which quantifies how much the variance of a coefficient estimate is inflated by collinearity with the other predictors. A high VIF (typically greater than 5 or 10) indicates problematic multicollinearity.

Let’s first load a dataset and calculate the VIF:

import pandas as pd
from sklearn.datasets import fetch_openml
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Load the Boston dataset
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = pd.DataFrame(boston.data, columns=boston.feature_names)

# Ensure all columns are numeric
X = X.apply(pd.to_numeric, errors='coerce')

# Drop any missing values (if present)
X = X.dropna()

# Add constant to the dataset
X_const = add_constant(X)

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]

print(vif_data)

The previous code block consists of:

import pandas as pd
- Imports the Pandas library, which is used for data manipulation and analysis.
from sklearn.datasets import fetch_openml
- Imports the fetch_openml function from Scikit-learn, which is used to load datasets from OpenML.
from statsmodels.stats.outliers_influence import variance_inflation_factor
- Imports the variance_inflation_factor (VIF) function, which measures multicollinearity in regression models.
from statsmodels.tools.tools import add_constant
- Imports the add_constant function, which adds a constant column (ones) to the dataset, required for VIF calculations.
boston = fetch_openml(name="boston", version=1, as_frame=True)
- Loads the "Boston Housing" dataset from OpenML and stores it in the boston variable.
X = pd.DataFrame(boston.data, columns=boston.feature_names)
- Creates a Pandas DataFrame using the dataset's feature names.
X = X.apply(pd.to_numeric, errors='coerce')
- Ensures all data is numeric. If any non-numeric values exist, they are replaced with NaN.
X = X.dropna()
- Removes any rows that contain missing (NaN) values to prevent calculation errors.
X_const = add_constant(X)
- Adds a constant column (a column of ones) to the dataset for statistical calculations.
vif_data = pd.DataFrame()
- Creates an empty DataFrame to store the VIF values.
vif_data["feature"] = X_const.columns
- Stores the names of all features in a column named "feature".
vif_data["VIF"] = [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])]
- Computes the VIF for each feature. A high VIF value indicates high multicollinearity.
print(vif_data)
- Prints the VIF values for each feature, helping to detect multicollinearity issues.

When the previous code is executed, the following output is obtained.

    feature         VIF
0     const  585.265238
1      CRIM    1.792192
2        ZN    2.298758
3     INDUS    3.991596
4      CHAS    1.073995
5       NOX    4.393720
6        RM    1.933744
7       AGE    3.100826
8       DIS    3.955945
9       RAD    7.484496
10      TAX    9.008554
11  PTRATIO    1.799084
12        B    1.348521
13    LSTAT    2.941491

The results show the Variance Inflation Factor (VIF) values for each feature in the Boston housing dataset. A VIF value of 1 indicates no multicollinearity, while higher values suggest increasing multicollinearity. The constant term (const) has a very high VIF of 585.27, which is expected as it is not a feature but a constant added for regression analysis. The features RAD (7.48) and TAX (9.01) have the highest VIFs, indicating significant multicollinearity, while CRIM, ZN, CHAS, PTRATIO, and B exhibit lower VIFs, suggesting these features are relatively independent of others. Overall, features like NOX (4.39) and INDUS (3.99) also have moderate VIFs, pointing to some degree of multicollinearity.
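To make the VIF numbers less of a black box, here is a minimal sketch of the definition behind them: the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The helper name vif_from_r2 and the synthetic columns a, b and c are assumptions made for illustration; they are not part of statsmodels or of the Boston dataset.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: regress it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()   # R^2 of the auxiliary regression
    return 1.0 / (1.0 - r2)

# Hypothetical data: b is almost a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

vifs = [vif_from_r2(X_demo, j) for j in range(X_demo.shape[1])]
print(vifs)
```

The two nearly identical columns should get a large VIF, while the independent column should stay close to 1.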

Reducing Multicollinearity with Regularization

One effective way to address multicollinearity is regularization, for example Ridge Regression and Lasso Regression. These methods add a penalty term to the loss function that discourages large coefficients. The penalty does not remove the correlation between the predictors, but it stabilizes the coefficient estimates that the correlation would otherwise make erratic.
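The stabilizing effect of the penalty can be illustrated with a small experiment. The sketch below is only an illustration on made-up data: two almost identical features (x1 and x2), a target that is repeatedly regenerated with fresh noise, and a hypothetical helper coef_spread that measures how much the fitted coefficients vary from one repetition to the next.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
X_demo = np.column_stack([x1, x2])

def coef_spread(model, n_repeats=50):
    """Standard deviation of the fitted coefficients over noisy re-draws of y."""
    coefs = []
    for _ in range(n_repeats):
        y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)
        coefs.append(model.fit(X_demo, y).coef_.copy())
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression())
ridge_spread = coef_spread(Ridge(alpha=1.0))

print("OLS coefficient spread:  ", ols_spread)
print("Ridge coefficient spread:", ridge_spread)
```

With features this strongly correlated, the ordinary least squares coefficients swing widely between repetitions, while the Ridge coefficients stay much closer together.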

Let's explore how Ridge and Lasso regression handle multicollinearity:

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, boston.target, test_size=0.2, random_state=42)

# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

# Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

print(f"Ridge MSE: {mse_ridge}")
print(f"Lasso MSE: {mse_lasso}")

The previous code block consists of the following lines:

from sklearn.linear_model import Ridge, Lasso
- Imports the Ridge and Lasso regression models from Scikit-learn. Ridge regression is used for linear regression with L2 regularization, while Lasso is used with L1 regularization.
from sklearn.model_selection import train_test_split
- Imports the train_test_split function, which is used to split the dataset into training and testing sets.
from sklearn.metrics import mean_squared_error
- Imports the mean_squared_error function, which is used to calculate the mean squared error between the actual and predicted values, a common evaluation metric for regression models.
X_train, X_test, y_train, y_test = train_test_split(X, boston.target, test_size=0.2, random_state=42)
- Splits the data into training and testing sets. X represents the feature matrix, while boston.target represents the target variable. The data is split into 80% training and 20% testing sets, with a random seed for reproducibility.
ridge = Ridge(alpha=1.0)
- Creates a Ridge regression model with a regularization parameter alpha set to 1.0.
ridge.fit(X_train, y_train)
- Trains the Ridge regression model on the training data (X_train and y_train).
y_pred_ridge = ridge.predict(X_test)
- Uses the trained Ridge model to make predictions on the test set (X_test), storing the predicted values in y_pred_ridge.
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
- Calculates the mean squared error (MSE) between the actual target values (y_test) and the predicted values (y_pred_ridge) for the Ridge model.
lasso = Lasso(alpha=0.1)
- Creates a Lasso regression model with a regularization parameter alpha set to 0.1.
lasso.fit(X_train, y_train)
- Trains the Lasso regression model on the training data (X_train and y_train).
y_pred_lasso = lasso.predict(X_test)
- Uses the trained Lasso model to make predictions on the test set (X_test), storing the predicted values in y_pred_lasso.
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
- Calculates the mean squared error (MSE) between the actual target values (y_test) and the predicted values (y_pred_lasso) for the Lasso model.
print(f"Ridge MSE: {mse_ridge}")
- Prints the MSE value for the Ridge regression model.
print(f"Lasso MSE: {mse_lasso}")
- Prints the MSE value for the Lasso regression model.

After the code is executed the following output is obtained.

Ridge MSE: 24.477191227708644
Lasso MSE: 25.155593753934177

For the Boston housing dataset, Ridge reaches a test MSE of 24.48 and Lasso a test MSE of 25.16. Ridge is slightly ahead here, but the gap is small and comes from a single train/test split with different, untuned alpha values (1.0 for Ridge and 0.1 for Lasso), so it should not be read as a general ranking. Ridge regression, with its L2 regularization, tends to perform well when multicollinearity exists between features, as it does in the Boston housing dataset, where TAX and RAD have the highest VIFs. Lasso, which uses L1 regularization, performs feature selection by forcing some coefficients to zero, which may lead to higher bias in cases where many features are relevant. Both models give similar predictions here, so either could be useful depending on the specific context and needs of the analysis.

Conclusion

Multicollinearity is a common issue in regression models that can lead to unreliable results. However, by identifying it using tools like VIF and applying regularization techniques such as Ridge and Lasso regression, you can mitigate its impact and improve the stability and interpretability of your model. Both Ridge and Lasso regression offer valuable methods for handling multicollinearity, with Ridge particularly useful when dealing with highly correlated features, and Lasso being beneficial for feature selection.

In practice, it’s important to experiment with different regularization strengths (the alpha parameter) to find the optimal balance between bias and variance.
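One common way to run that experiment is cross-validation over a grid of alpha values. The sketch below uses scikit-learn's RidgeCV and LassoCV on synthetic data rather than on the Boston dataset; the alpha grid, the StandardScaler step and the variable names are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=10.0, random_state=0)

alphas = np.logspace(-3, 2, 20)   # candidate regularization strengths

# Scaling first puts all features on the same footing for the penalty
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5))

ridge_cv.fit(X_demo, y_demo)
lasso_cv.fit(X_demo, y_demo)

best_ridge_alpha = ridge_cv[-1].alpha_
best_lasso_alpha = lasso_cv[-1].alpha_
print("Best Ridge alpha:", best_ridge_alpha)
print("Best Lasso alpha:", best_lasso_alpha)
```

The selected alpha depends on the data and on the grid, so it is worth widening the grid if the best value lands on one of its edges.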

Tuesday, January 28, 2025

Comparing Ridge, Lasso, and ElasticNet Regressions

In this post, we will compare three popular regularization techniques used in linear regression models: Ridge, Lasso, and ElasticNet. These methods help prevent overfitting by adding penalties to the coefficients of the model.

What is Regularization?

Regularization is a technique used to improve the generalization ability of a model. It adds a penalty to the loss function based on the size of the coefficients, helping to reduce the complexity of the model and prevent overfitting.

Ridge Regression

Ridge regression, also known as L2 regularization, adds the squared magnitude of the coefficients as a penalty term to the loss function. The goal is to minimize the sum of the squared residuals along with the penalty term.

Python Example: Ridge Regression

In this example, we will generate synthetic data with two features, split it into training and test sets, and apply Ridge regression using an alpha value of 1.0. The model's coefficients and intercept will be shown as the output.

Let's implement Ridge regression in Python using the Ridge class from sklearn.linear_model.

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generating synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Applying Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Making predictions and evaluating the model
ridge_predictions = ridge.predict(X_test)

print("Ridge Regression Coefficients:", ridge.coef_)
print("Ridge Regression Intercept:", ridge.intercept_)

The previous code block consists of the following sections:

Importing Libraries:
- from sklearn.linear_model import Ridge - Imports the Ridge class from sklearn.linear_model to implement Ridge regression.
- from sklearn.model_selection import train_test_split - Imports the train_test_split function to split the dataset into training and testing subsets.
- from sklearn.datasets import make_regression - Imports the make_regression function to generate synthetic regression data.
- import numpy as np - Imports the numpy library, typically used for numerical operations, although not directly used in this snippet.
Generating Synthetic Data:
- X, y = make_regression(n_samples=100, n_features=2, noise=0.1) - Generates a synthetic regression dataset with 100 samples and 2 features. The noise parameter adds random noise to the output values, making the problem more realistic.
Splitting Data into Training and Test Sets:
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the generated dataset into training and test sets. 20% of the data is used for testing, and 80% is used for training. The random_state ensures that the split is reproducible.
Applying Ridge Regression:
- ridge = Ridge(alpha=1.0) - Initializes the Ridge regression model with a regularization strength of 1.0 (default value). The alpha parameter controls the amount of regularization applied to the model.
- ridge.fit(X_train, y_train) - Fits the Ridge regression model to the training data (X_train and y_train). The model learns the relationships between the features and the target variable during this step.
Making Predictions:
- ridge_predictions = ridge.predict(X_test) - Uses the trained Ridge model to make predictions on the test data (X_test). These predictions can then be compared against the actual target values (y_test), although this snippet does not compute an evaluation metric.
Printing Model Parameters:
- print("Ridge Regression Coefficients:", ridge.coef_) - Prints the coefficients learned by the Ridge regression model. These coefficients represent the contribution of each feature to the model's predictions.
- print("Ridge Regression Intercept:", ridge.intercept_) - Prints the intercept value (bias term) of the Ridge regression model. This is the predicted value when all input features are zero.

After executing the code in this example the following output is obtained.

Ridge Regression Coefficients: [42.32481736 37.27191182]
Ridge Regression Intercept: -0.1079988849518081

Ridge Regression Coefficients:
- [42.32481736 37.27191182] - These are the coefficients (weights) learned by the Ridge regression model for each of the two input features. The model assigns:
  - 42.32481736 to the first feature, meaning for each unit increase in this feature, the output variable is expected to increase by approximately 42.32 units, holding the other feature constant.
  - 37.27191182 to the second feature, meaning for each unit increase in this feature, the output variable is expected to increase by approximately 37.27 units, holding the other feature constant.
Ridge Regression Intercept:
- -0.1079988849518081 - This is the intercept (bias term) of the Ridge regression model. It represents the predicted value of the target variable when both input features are zero. In this case, the model predicts a value of approximately -0.11 when both features are zero.

Lasso Regression

Lasso regression, or L1 regularization, uses the absolute values of the coefficients as a penalty term. It tends to produce sparse models, where some coefficients are driven to zero, effectively performing feature selection.

Python Example: Lasso Regression

Here, we apply Lasso regression with an alpha value of 0.1. As with Ridge regression, the coefficients and intercept are printed. However, in Lasso, some coefficients may be zero, leading to a simpler model.

Let's implement Lasso regression using the Lasso class from sklearn.linear_model.

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generating synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Applying Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Making predictions and evaluating the model
lasso_predictions = lasso.predict(X_test)

print("Lasso Regression Coefficients:", lasso.coef_)
print("Lasso Regression Intercept:", lasso.intercept_)

The previous code block consists of the following steps:

Importing Libraries:
- from sklearn.linear_model import Lasso - Imports the Lasso class from sklearn.linear_model to implement Lasso regression.
- from sklearn.model_selection import train_test_split - Imports the train_test_split function to split the dataset into training and testing subsets.
- from sklearn.datasets import make_regression - Imports the make_regression function to generate synthetic regression data.
- import numpy as np - Imports the numpy library for numerical operations, although it's not directly used in this snippet.
Generating Synthetic Data and Splitting the Data:
- X, y = make_regression(n_samples=100, n_features=2, noise=0.1) - Generates a synthetic regression dataset with 100 samples and 2 features. The noise parameter adds random noise to the output values, simulating real-world data.
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the generated dataset into training and test sets. 20% of the data is used for testing, and 80% is used for training. The random_state ensures that the split is reproducible.
Applying Lasso Regression:
- lasso = Lasso(alpha=0.1) - Initializes the Lasso regression model with a regularization strength of 0.1 (alpha). The alpha parameter controls the magnitude of the L1 penalty, influencing how much the model shrinks the coefficients.
- lasso.fit(X_train, y_train) - Fits the Lasso regression model to the training data (X_train and y_train). The model learns the relationships between the features and the target variable during this step.
Making Predictions:
- lasso_predictions = lasso.predict(X_test) - Uses the trained Lasso model to make predictions on the test data (X_test). The model applies the learned relationships to predict the target values for the unseen test set.
Printing Model Parameters:
- print("Lasso Regression Coefficients:", lasso.coef_) - Prints the coefficients learned by the Lasso regression model. These coefficients represent the impact of each feature on the target variable. In Lasso, some coefficients may be zero due to feature selection, making the model sparse.
- print("Lasso Regression Intercept:", lasso.intercept_) - Prints the intercept (bias term) of the Lasso regression model. This is the predicted value when all input features are zero. In Lasso, the intercept is typically non-zero unless the data is centered.

After executing the previous code block, the following output is obtained.

    Lasso Regression Coefficients: [18.42199406 61.89838269]
    Lasso Regression Intercept: 0.02333834253124545

Lasso Regression Coefficients:
- [18.42199406 61.89838269] - These are the coefficients (weights) learned by the Lasso regression model for each of the two input features. The model assigns:
  - 18.42199406 to the first feature, meaning that for each unit increase in this feature, the output variable is expected to increase by approximately 18.42 units, holding the other feature constant.
  - 61.89838269 to the second feature, meaning that for each unit increase in this feature, the output variable is expected to increase by approximately 61.90 units, holding the other feature constant.
Lasso Regression Intercept:
- 0.02333834253124545 - This is the intercept (bias term) of the Lasso regression model. It represents the predicted value when both input features are zero. In this case, the model predicts a value of approximately 0.02 when both features are zero.

ElasticNet Regression

ElasticNet regression combines both L1 and L2 regularization, making it a compromise between Ridge and Lasso. It is useful when there are many correlated features in the dataset.

Python Example: ElasticNet Regression

In the ElasticNet regression example, we use both L1 and L2 regularization. The l1_ratio parameter controls the mix of Lasso (L1) and Ridge (L2) regularization, with a value of 0.5 indicating an equal mix.

Now, we will implement ElasticNet regression using the ElasticNet class from sklearn.linear_model.

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generating synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Applying ElasticNet regression
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
elasticnet.fit(X_train, y_train)

# Making predictions and evaluating the model
elasticnet_predictions = elasticnet.predict(X_test)

print("ElasticNet Regression Coefficients:", elasticnet.coef_)
print("ElasticNet Regression Intercept:", elasticnet.intercept_)

The previous code block consists of the following sections:

Importing Libraries:
- from sklearn.linear_model import ElasticNet - Imports the ElasticNet class from sklearn.linear_model to implement ElasticNet regression.
- from sklearn.model_selection import train_test_split - Imports the train_test_split function to split the dataset into training and testing subsets.
- from sklearn.datasets import make_regression - Imports the make_regression function to generate synthetic regression data.
- import numpy as np - Imports the numpy library for numerical operations, although it's not directly used in this snippet.
Generating Synthetic Data:
- X, y = make_regression(n_samples=100, n_features=2, noise=0.1) - Generates a synthetic regression dataset with 100 samples and 2 features. The noise parameter adds random noise to the output values, simulating real-world data.
Splitting the Data:
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the data into training and test sets, with 20% of the data allocated for testing. The random_state=42 ensures reproducibility of the split.
Applying ElasticNet Regression:
- elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5) - Initializes the ElasticNet regression model. The alpha parameter controls the strength of the regularization, while the l1_ratio parameter determines the mix between Lasso (L1) and Ridge (L2) penalties:
  - When l1_ratio=1.0, it behaves like Lasso regression (pure L1 regularization).
  - When l1_ratio=0.0, it behaves like Ridge regression (pure L2 regularization).
  - Here, with l1_ratio=0.5, it combines both Lasso and Ridge penalties in equal measure.
- elasticnet.fit(X_train, y_train) - Fits the ElasticNet regression model to the training data (X_train and y_train) and learns the coefficients that minimize the residual sum of squares subject to the regularization.
Making Predictions:
- elasticnet_predictions = elasticnet.predict(X_test) - Uses the trained ElasticNet model to make predictions on the test data (X_test). This step applies the learned relationships to predict the target values for the test set.
Printing Model Parameters:
- print("ElasticNet Regression Coefficients:", elasticnet.coef_) - Prints the coefficients learned by the ElasticNet regression model. These coefficients represent how much each feature contributes to the target variable. Both Lasso and Ridge regularization influence these coefficients.
- print("ElasticNet Regression Intercept:", elasticnet.intercept_) - Prints the intercept (bias term) of the ElasticNet regression model. The intercept is the predicted value when all input features are zero.
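The role of l1_ratio described above can be checked directly. The following sketch, on hypothetical synthetic data with a fixed seed, fits ElasticNet with l1_ratio=1.0 and Lasso with the same alpha and confirms that the two give the same coefficients; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical synthetic data with a fixed seed so the check is repeatable
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=1.0, random_state=0)

# With l1_ratio=1.0 the L2 part of the ElasticNet penalty vanishes
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.1).fit(X_demo, y_demo)

same = np.allclose(enet.coef_, lasso.coef_)
print("ElasticNet(l1_ratio=1.0) matches Lasso:", same)
```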

The output of the ElasticNet example is shown below.

ElasticNet Regression Coefficients: [ 4.33447528 75.87734055]
ElasticNet Regression Intercept: 0.12560084787858017

ElasticNet Regression Coefficients:
- [ 4.33447528 75.87734055 ] - These are the coefficients (weights) for the two features used in the ElasticNet regression model:
  - 4.33447528 corresponds to the first feature, indicating that for each unit increase in the first feature, the target variable is expected to increase by approximately 4.33 units, while holding the second feature constant.
  - 75.87734055 corresponds to the second feature, indicating that for each unit increase in the second feature, the target variable is expected to increase by approximately 75.88 units, while holding the first feature constant.
ElasticNet Regression Intercept:
- 0.12560084787858017 - This is the intercept (bias term) of the model. It represents the predicted value when both features are zero. In this case, when both features are zero, the model predicts a value of approximately 0.13 for the target variable.

Comparison

Here is a brief comparison of the three regression methods:

Ridge Regression: Suitable when most features contribute to the prediction. It tends to shrink coefficients evenly.
Lasso Regression: Useful for feature selection. It can shrink some coefficients to zero, making the model sparse.
ElasticNet Regression: Combines Ridge and Lasso, performing well in cases with many correlated features.

Each method has its advantages and should be chosen based on the nature of your dataset and the problem at hand.
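Because each example above draws its own random dataset, the printed coefficients cannot be compared across models. A fairer side-by-side is sketched below: all three models are fitted on the same seeded synthetic data, in which only 3 of the 10 features are informative. The dataset settings, the alpha values and the results dictionary are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# One shared dataset: 10 features, only 3 of which drive the target
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = {
        "mse": mean_squared_error(y_te, model.predict(X_te)),
        "n_zero": int((model.coef_ == 0).sum()),   # coefficients set exactly to zero
    }
    print(name, results[name])
```

Ridge keeps every coefficient non-zero, while Lasso typically zeroes out the uninformative features, which is the sparsity described in the comparison.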

Thank you for reading the tutorial! Try running the Python code and let me know in the comments if you got the same results. If you have any questions or need further clarification, feel free to leave a comment. Thanks again!

Tuesday, December 31, 2024

Ridge regression: When and how to use it

In this post we will explain how ridge regression works. After the initial explanation and the math supporting the theory, we will see how to implement ridge regression in Python using the scikit-learn library.
Imagine you're trying to predict how much candy someone will get on Halloween based on how many houses they visit. You have some data, such as the number of houses visited and the amount of candy people collected. Now let's use math and a story to understand and explain the Ridge Regression algorithm.

Step 1: Basic Idea of regular regression

If we have to find the "best fit" line, we use linear regression. The line has a formula which can be written as: \begin{equation} y = w_1 x + w_0 \end{equation} where

\(y\) is the candy collected (what we predict),
\(x\) is the number of houses visited (what we know),
\(w_1\) is the slope of the line (how much candy you get per house),
\(w_0\) is the \(y\)-intercept (starting candy even before visiting any house).

We pick \(w_1\) and \(w_0\) to make the predictions as close to the real data as possible. We measure the error with Mean Square Error (MSE): \begin{equation} MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \end{equation}
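As a small illustration of this step, the sketch below fits the line and computes the MSE with NumPy. The houses and candy values are made up for the example; they are not data from this post.

```python
import numpy as np

# Hypothetical data: houses visited and candy collected
houses = np.array([5.0, 10.0, 15.0, 20.0])
candy = np.array([12.0, 19.0, 33.0, 41.0])

# Least-squares fit of candy = w1 * houses + w0
w1, w0 = np.polyfit(houses, candy, deg=1)

predictions = w1 * houses + w0
mse = np.mean((candy - predictions) ** 2)

# Any other slope gives a larger error, e.g. a slightly steeper line
mse_steeper = np.mean((candy - ((w1 + 0.1) * houses + w0)) ** 2)

print("w1:", w1, "w0:", w0)
print("MSE of best-fit line:", mse)
print("MSE with a steeper slope:", mse_steeper)
```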

Step 2: Uh-oh! Too many houses (or too many features)

Now let's say instead of just the number of houses, you also look at:

The size of the houses
Whether there are decorations
The weather that day
Many other things

The equation can be written as: \begin{equation} y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0 \end{equation} The problem is: if you have too many features/variables (\(x_1\), \(x_2\), ..., \(x_n\)), your model might try too hard to match the data. This is called overfitting, which means your predictions will be great for the data you already have but terrible for new data.

Step 3: Ridge Regression to the rescue

Ridge regression says: "Let's keep the line simple and not let the weights (\(w_1\), \(w_2\),...,\(w_n\)) get too big." So, we add a penalty to the MSE function that makes it costly to use large weights. The Ridge formula can be written as: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{n} w_j^2 \end{equation} where:

\(\frac{1}{N}\sum_{i=1}^N(y_i - \hat{y}_i)^2\) - is the original MSE (how bad our predictions are?).
\(\lambda\sum_{j=1}^n w_j^2 \) - is the penalty term.

Parameter \(\lambda\) controls how much penalty we apply:

a small \(\lambda\) means "I don't care much about big weights"
a large \(\lambda\) means "Keep the weights small!"
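The effect of the penalty strength can be seen by fitting the same data with increasing values and watching the size of the weight vector. The sketch below uses hypothetical synthetic data; note that scikit-learn's Ridge penalizes the sum rather than the mean of the squared errors, so its alpha corresponds to \(\lambda\) in the formula above only up to a factor of N.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Hypothetical synthetic data with a fixed seed
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=5.0, random_state=0)

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    weights = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms.append(np.linalg.norm(weights))   # overall size of the weights
    print(f"alpha={alpha:>8}: weight norm = {norms[-1]:.3f}")
```

The weight norm shrinks as alpha grows, which is the "keep the weights small" behaviour described above.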

Step 4: Why does Ridge Regression work?

Imagine if you’re trying to draw a map of a neighborhood. You don’t want every single detail, like the shape of each leaf, because that’ll make your map messy and hard to use. Instead, you want a simple, clean map that gives the big picture. Ridge Regression does this by preventing the weights (w) from going wild and making predictions smoother.

Example: Exam Scores Estimation Using Ridge Regression (No Python)

In this example we are predicting the exam scores (\(y\)) based on two features i.e.: hours of study (\(x_1\)) and hours of sleep (\(x_2\)). The data is given in Table 1.

Hours of study (\(x_1\))	Hours of sleep (\(x_2\))	Exam Score \(y\)
2	6	50
4	7	65
6	8	80
8	9	95

We want to fit a linear model to predict \(y\): \begin{equation} y = w_1 x_1 + w_2 x_2 + w_0 \end{equation}

Step 1: Regular Linear Regression

To find the weights (\(w_0, w_1,\) and \(w_2\)) that best fit the data, regular linear regression minimizes the Mean Squared Error (MSE): \begin{equation} MSE = \frac{1}{N} \sum_{i=1}^N(y_i-\hat{y}_i)^2 \end{equation} For simplicity, assume that regular regression gives \(w_0 = 0, w_1 = 10,\) and \(w_2 = 5\), so the equation can be written as: \begin{equation} y = 10x_1 + 5x_2 \end{equation} But there is a problem since \(w_1 = 10 \) is very high. This might mean the model is overfitting the data, focusing too much on study hours and not generalizing well.

Step 2: Ridge Regression Adds a Penalty

Ridge regression adds a penalty to prevent the weights from becoming too large. The new loss function is: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)^2 + \lambda(w_1^2 + w_2^2) \end{equation} where:

\(\frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)^2\) - is the same MSE as before
\(\lambda(w_1^2 + w_2^2)\) - is the penalty for large weights, controlled by \(\lambda\)

Step 3: Choosing \(\lambda\)

Let's say \(\lambda = 0.1\). This makes the new loss function: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)^2 + 0.1(w_1^2 + w_2^2) \end{equation}

Step 4: Adjusting the weights

With Ridge Regression, the new weights become \(w_0 = 0\), \(w_1=8\), and \(w_2=4\). The equation can be written as: \begin{equation} y = 8x_1 + 4x_2 \end{equation} Notice how \(w_1\) and \(w_2\) are smaller compared to regular regression. So, using Ridge regression, the weights were lowered to avoid overfitting.

Step 5: How does this help?

Prediction with regular regression:
For a new input \(x_1 = 5, x_2 = 7\) the output is equal to: \begin{equation} y = 10(5) + 5(7) = 50 + 35 = 85 \end{equation}
Prediction with Ridge Regression:
For the same input: \begin{equation} y = 8(5) + 4(7) = 40 + 28 = 68 \end{equation} Ridge gives a more conservative prediction, avoiding extreme values.
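The two predictions can be reproduced with a few lines of NumPy. The weights are the assumed values from this worked example (with \(w_0 = 0\)); the variable names are illustrative.

```python
import numpy as np

x_new = np.array([5, 7])        # [hours of study, hours of sleep]
w_regular = np.array([10, 5])   # assumed weights from regular regression
w_ridge = np.array([8, 4])      # assumed weights after the Ridge penalty

pred_regular = w_regular @ x_new   # 10*5 + 5*7
pred_ridge = w_ridge @ x_new       # 8*5 + 4*7

print("Regular regression prediction:", pred_regular)
print("Ridge regression prediction:  ", pred_ridge)
```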

Example: Exam Scores Estimation Using Ridge Regression (Scikit-Learn)

# Import necessary libraries
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Step 1: Create the dataset
# Features: [Hours of Study, Hours of Sleep]
X = np.array([[2, 6],
                [4, 7],
                [6, 8],
                [8, 9]])
# Target: Exam Scores
y = np.array([50, 65, 80, 95])
# Step 2: Train Ridge Regression Model
# Set a regularization strength (lambda)
ridge_reg = Ridge(alpha=0.1)  # alpha is lambda in Ridge regression
ridge_reg.fit(X, y)
# Step 3: Predictions
y_pred = ridge_reg.predict(X)
# Step 4: Evaluate the Model
mse = mean_squared_error(y, y_pred)
# Print results
print("Weights (w1, w2):", ridge_reg.coef_)
print("Intercept (w0):", ridge_reg.intercept_)
print("Mean Squared Error:", mse)
# Step 5: Predict for a new input
new_input = np.array([[5, 7]])  # [Hours of Study, Hours of Sleep]
new_prediction = ridge_reg.predict(new_input)
print("Prediction for [Hours of Study=5, Hours of Sleep=7]:", new_prediction[0])

Explanation of the code

After the required libraries were imported, the dataset was defined, where \(X\) are the features (study hours and sleep hours) and \(y\) is the target (exam scores).
Ridge regression is defined with the hyperparameter alpha equal to 0.1 to add a penalty for large weights. This hyperparameter controls how strong the penalty is. A smaller alpha focuses more on fitting the data, while a larger alpha shrinks the weights more.
The model learns the weights (\(w_1, w_2\)) and intercept (\(w_0\)) to minimize the Ridge loss function.
The predict() function calculates the predicted values using the learned equation.
The evaluation is performed using MSE to measure the quality of the predictions.
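The sample output given below is illustrative: it mirrors the hand-worked weights from the previous example. Because the two features in this small dataset are perfectly collinear (hours of sleep = hours of study / 2 + 5), the numbers printed when you run the code will differ.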

Weights (w1, w2): [7.998 4.001]
Intercept (w0): -0.003
Mean Squared Error: 0.0001
Prediction for [Hours of Study=5, Hours of Sleep=7]: 67.989
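As a cross-check on what a real run prints, the ridge weights for a dataset this small can also be computed in closed form. The sketch below centers the data (which is how an unpenalized intercept is handled), solves \((X_c^T X_c + \alpha I) w = X_c^T y_c\), and compares the result with scikit-learn's Ridge. It also prints the correlation between the two features, which is 1 here because hours of sleep equals hours of study divided by 2 plus 5. The names w_closed and b_closed are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[2, 6], [4, 7], [6, 8], [8, 9]], dtype=float)
y = np.array([50, 65, 80, 95], dtype=float)
alpha = 0.1

# The two features are perfectly correlated in this dataset
corr = np.corrcoef(X[:, 0], X[:, 1])[0, 1]

# Closed-form ridge solution on centered data
Xc = X - X.mean(axis=0)
yc = y - y.mean()
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(2), Xc.T @ yc)
b_closed = y.mean() - X.mean(axis=0) @ w_closed

model = Ridge(alpha=alpha).fit(X, y)

print("Feature correlation:", corr)
print("Closed-form weights:", w_closed, "intercept:", b_closed)
print("scikit-learn weights:", model.coef_, "intercept:", model.intercept_)
```

With perfectly collinear features, ordinary least squares has no unique solution, and the ridge penalty is what selects one particular pair of weights.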