Tuesday, December 31, 2024

Ridge regression: When and how to use it

In this post we explain how ridge regression works. After the initial explanation and the math supporting the theory, we will see how to implement ridge regression in Python using the scikit-learn library.
Imagine you're trying to predict how much candy someone will get on Halloween based on how many houses they visit. You have some data, such as the number of houses visited and the amount of candy people collected. Now let's use math and a story to understand and explain the ridge regression algorithm.

Step 1: Basic Idea of regular regression

To find the "best fit" line, we use linear regression. The line has a formula which can be written as: \begin{equation} y = w_1 x + w_0 \end{equation} where
  • \(y\) is the candy collected (what we predict),
  • \(x\) is the number of houses visited (what we know),
  • \(w_1\) is the slope of the line (how much candy you get per house),
  • \(w_0\) is the \(y\)-intercept (starting candy even before visiting any house).
We pick \(w_1\) and \(w_0\) to make the predictions as close to the real data as possible. We measure the error with the Mean Squared Error (MSE): \begin{equation} MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \end{equation}
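For readers who want to see the MSE in code, here is a minimal Python sketch (the candy counts below are made up purely for illustration):
import numpy as np
y_true = np.array([5, 12, 20, 33])   # candy actually collected
y_pred = np.array([6, 11, 22, 30])   # candy predicted by the line
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # (1 + 1 + 4 + 9) / 4 = 3.75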

Step 2: Uh-oh! Too many houses (or too many features)

Now let's say instead of just the number of houses, you also look at:
  • The size of the houses
  • Whether there are decorations
  • The weather that day
  • Many other things
The equation can be written as: \begin{equation} y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0 \end{equation} The problem is: if you have too many features/variables (\(x_1\), \(x_2\), ..., \(x_n\)), your model may bend too much to match the data. This is called overfitting, which means your predictions will be great for the data you already have but terrible for new data.
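As a quick illustration, here is a minimal sketch of overfitting on synthetic data (the sample size, feature count, and noise level are made up): with more features than observations, an ordinary least-squares fit matches the training data almost perfectly but does much worse on new data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
rng = np.random.default_rng(0)
# 15 trick-or-treaters described by 20 features, but only the first one matters
X_train = rng.normal(size=(15, 20))
y_train = 3 * X_train[:, 0] + rng.normal(scale=0.5, size=15)
X_test = rng.normal(size=(15, 20))
y_test = 3 * X_test[:, 0] + rng.normal(scale=0.5, size=15)
lin = LinearRegression().fit(X_train, y_train)
print("Train MSE:", mean_squared_error(y_train, lin.predict(X_train)))  # essentially zero: the model memorizes the training data
print("Test MSE:", mean_squared_error(y_test, lin.predict(X_test)))     # typically much larger on new data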

Step 3: Ridge Regression to the rescue

Ridge regression says: "Let's keep the line simple and not let the weights (\(w_1\), \(w_2\), ..., \(w_n\)) get too big." So, we add a penalty to the MSE function that makes it costly to use large weights. The ridge loss function can be written as: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 + \lambda\sum_{j=1}^{n} w_j^2 \end{equation} where:
  • \(\frac{1}{N}\sum_{i=1}^N(y_i - \hat{y}_i)^2\) - is the original MSE (how far off our predictions are).
  • \(\lambda\sum_{j=1}^n w_j^2 \) - is the penalty term.
Parameter \(\lambda\) controls how much penalty we apply:
  • a small \(\lambda\) means "I don't care much about big weights"
  • a large \(\lambda\) means "Keep the weights small!"
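To see this effect in practice, here is a minimal sketch on synthetic data (scikit-learn calls \(\lambda\) alpha; the data and the alpha values below are made up): as alpha grows, the sum of squared weights shrinks.
import numpy as np
from sklearn.linear_model import Ridge
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([4.0, -2.0, 0.0, 1.0, 3.0]) + rng.normal(scale=0.5, size=30)
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    w = Ridge(alpha=alpha).fit(X, y).coef_
    # larger alpha -> smaller sum of squared weights
    print(f"alpha={alpha:>6}: sum of squared weights = {np.sum(w ** 2):.2f}")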

Step 4: Why does Ridge Regression work?

Imagine you're trying to draw a map of a neighborhood. You don't want every single detail, like the shape of each leaf, because that will make your map messy and hard to use. Instead, you want a simple, clean map that gives the big picture. Ridge regression does this by preventing the weights (\(w\)) from going wild, which makes the predictions smoother.

Example: Exam Scores Estimation Using Ridge Regression (No Python)

In this example we are predicting exam scores (\(y\)) based on two features: hours of study (\(x_1\)) and hours of sleep (\(x_2\)). The data is given in Table 1.
Table 1: Exam score data.
Hours of study (\(x_1\)) | Hours of sleep (\(x_2\)) | Exam score (\(y\))
2 | 6 | 50
4 | 7 | 65
6 | 8 | 80
8 | 9 | 95
We want to fit a linear model to predict \(y\): \begin{equation} y = w_1 x_1 + w_2 x_2 + w_0 \end{equation}

Step 1: Regular Linear Regression

To find the weights (\(w_0, w_1,\) and \(w_2\)) that best fit the data, regular linear regression minimizes the Mean Squared Error (MSE): \begin{equation} MSE = \frac{1}{N} \sum_{i=1}^N(y_i-\hat{y}_i)^2 \end{equation} For simplicity, assume regular regression gives \(w_0 = 0, w_1 = 10,\) and \(w_2 = 5\), so the equation can be written as: \begin{equation} y = 10x_1 + 5x_2 \end{equation} But there is a problem, since \(w_1 = 10\) is very high. This might mean the model is overfitting the data, focusing too much on study hours and not generalizing well.

Step 2: Ridge Regression Adds a Penalty

Ridge regression adds a penalty to prevent the weights from becoming too large. The new loss function is: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)^2 + \lambda(w_1^2 + w_2^2) \end{equation} where:
  • \(\frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)^2\) - is the same MSE as before
  • \(\lambda(w_1^2 + w_2^2)\) - is the penalty for large weights, controlled by \(\lambda\)

Step 3: Choosing \(\lambda\)

Let's say \(\lambda = 0.1\). This makes the new loss function: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)^2 + 0.1(w_1^2 + w_2^2) \end{equation}

Step 4: Adjusting the weights

With ridge regression, the new weights become \(w_0 = 0\), \(w_1=8\), and \(w_2=4\). The equation can be written as: \begin{equation} y = 8x_1 + 4x_2 \end{equation} Notice how \(w_1\) and \(w_2\) are smaller compared to regular regression. So using ridge regression the weights were lowered to avoid overfitting.
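To see why the penalty pushes the model toward the smaller weights, compare the penalty terms of the two solutions with \(\lambda = 0.1\): \begin{equation} 0.1(10^2 + 5^2) = 12.5 \qquad \text{vs.} \qquad 0.1(8^2 + 4^2) = 8.0 \end{equation} so the smaller weights are cheaper under the new loss function.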

Step 5: How does this help?

Prediction with regular regression:
For a new input \(x_1 = 5, x_2 = 7\) the output is equal to: \begin{equation} y = 10(5) + 5(7) = 50 + 35 = 85 \end{equation} Predictions with Ridge Regression:
For the same input: \begin{equation} y = 8(5) + 4(7) = 40 + 28 = 68 \end{equation} Ridge gives a more conservative prediction, avoiding extreme values. (For comparison, the trend in Table 1 suggests a score in the low 70s for about 5 hours of study, so the ridge prediction also sits closer to the data.)

Example: Exam Scores Estimation Using Ridge Regression (Scikit-Learn)

# Import necessary libraries
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Step 1: Create the dataset
# Features: [Hours of Study, Hours of Sleep]
X = np.array([[2, 6], [4, 7], [6, 8], [8, 9]])
# Target: Exam Scores
y = np.array([50, 65, 80, 95])
# Step 2: Train Ridge Regression Model
# Set a regularization strength (lambda)
ridge_reg = Ridge(alpha=0.1) # alpha is lambda in Ridge regression
ridge_reg.fit(X, y)
# Step 3: Predictions
y_pred = ridge_reg.predict(X)
# Step 4: Evaluate the Model
mse = mean_squared_error(y, y_pred)
# Print results
print("Weights (w1, w2):", ridge_reg.coef_)
print("Intercept (w0):", ridge_reg.intercept_)
print("Mean Squared Error:", mse)
# Step 5: Predict for a new input
new_input = np.array([[5, 7]]) # [Hours of Study, Hours of Sleep]
new_prediction = ridge_reg.predict(new_input)
print("Prediction for [Hours of Study=5, Hours of Sleep=7]:", new_prediction[0])

Explanation of the code

After the required libraries were imported, the dataset was defined, where \(X\) holds the features (study hours and sleep hours) and \(y\) is the target (exam scores).
Ridge regression is defined with the hyperparameter alpha equal to 0.1 to add a penalty for large weights. This hyperparameter controls how strong the penalty is. A smaller alpha focuses more on fitting the data, while a larger alpha shrinks the weights more (see the RidgeCV sketch after the sample output for one way to choose alpha automatically).
The model learns the weights (\(w_1, w_2\)) and intercept (\(w_0\)) to minimize the Ridge loss function.
The predict() function calculates the predicted values using the learned equation.
The evaluation is performed using MSE to measure the quality of the predictions.
The sample output is given below.
Weights (w1, w2): [7.998 4.001]
Intercept (w0): -0.003
Mean Squared Error: 0.0001
Prediction for [Hours of Study=5, Hours of Sleep=7]: 67.989
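If you are unsure which alpha to use, scikit-learn's RidgeCV can pick one by cross-validation. Below is a minimal sketch on the same exam data; the candidate alpha values are arbitrary.
import numpy as np
from sklearn.linear_model import RidgeCV
# Same features and scores as above
X = np.array([[2, 6], [4, 7], [6, 8], [8, 9]])
y = np.array([50, 65, 80, 95])
# Try a few candidate penalties and keep the one with the best
# (leave-one-out) cross-validation score
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0])
ridge_cv.fit(X, y)
print("Selected alpha:", ridge_cv.alpha_)
print("Weights (w1, w2):", ridge_cv.coef_)
print("Intercept (w0):", ridge_cv.intercept_)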
