Tuesday, December 31, 2024

Ridge regression: When and how to use it

In this post we will explain how ridge regression works. After the initial explanation and the math supporting the theory we will see how to implement the ridge regression in Python using scikit-learn library.
Imagine you're trying to predict how many candy someone will get on Halloween based on how many houses they visit. You have some data such ase the number of houses and the amount of candy people collected. Now let's use math and a story to understand and explain the Ridge Regression algorithm.

Step 1: Basic Idea of regular regression

If we have to find the "best fit" line, we use linear regression. The line has a formule which can be written as: \begin{equation} y = w_1 x + w_0 \end{equation} where

\(y\) is the candy collected (what we predict),
\(x\) is the number of houses visited (what we know),
\(w_1\) is the slope of the line (how much cand you get per house),
\(w_0\) is the \(y\) - intercept (starting candy even before visiting any house).

We pick \(w_1\) and \(w_0\) to make the predictions as close to the real data as possible. We measure the error with Mean Square Error (MSE): \begin{equation} MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \end{equation}

Step 2: Uh-oh! Too many houses (or too many featureS)

Now let's say instead of just the number of houses, you also look at:

The size of the houses
Whether there are decorations
The weather that day
Many other things

The equation can be written as: \begin{equation} y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + w_0 \end{equation} The problem is: If you have too many features/variables (\(x_1\), \(x_2\), ..., \(x_n\)) your line might be too hard to match the data. This is called overfitting which means your predictions will be great for the data you already have but terrible for new data.

Step 3: Ridge Regression to the rescue

Ridge regression says: "Let's keep the line simpel and not let the weights (\(w_1\), \(w_2\),...,\(w_n\)) get too big." So, we add penalty to the MSE function that makes it costly to use large weights. The Rige formula can be written as: \begin{equation} Loss = \frac{1}{N} \sum (y_i - \hat{y}_i)^2 + \lambda\sum w_j^2 \end{equation} where:

\(\frac{1}{N}\sum_{i=1}^N(y_i - \hat{y}_i)^2\) - is the original MSE (how bad our predictions are?).
\(\lambda\sum_{j=1}^n w_j^2 \) - is the penalty term.

Parameter \(\lambda\) controls how much penalty we apply:

a small \(\lambda\) means "I don't care much about big weights"
a large \(\lambda\) means "Keep the weights small!"

Step 4: Why does Ridge Regression Work ?

Imagine if you’re trying to draw a map of a neighborhood. You don’t want every single detail, like the shape of each leaf, because that’ll make your map messy and hard to use. Instead, you want a simple, clean map that gives the big picture. Ridge Regression does this by preventing the weights (w) from going wild and making predictions smoother.

Example: Exam Scores Estimation Using Ridge Regression (No Python)

In this example we are predicting the exam scores (\(y\)) based on two features i.e.: hours of study (\(x_1\)) and hours of sleep (\(x_2\)). The data is given in Table 1.

Hours of study (\(x_1\))	Hours of sleep (\(x_2\))	Exam Score \(y\)
2	6	50
4	7	65
6	8	80
8	9	95

We want to fit a linear model to predict \(y\): \begin{equation} y = w_1 x_1 + w_2 x_2 + w_0 \end{equation}

Step 1: Regular Linear Regression

To find the weights (\(w_0, w_1,\) and \(w_2\)) that best fit the data, regular linear regression minimizes the Mean Squared Error (MSE): \begin{equation} MSE = \frac{1}{N} \sum_{i=1}^N(y_i-\hat{y}_i)^2 \end{equation} For simplicity, assume: Regular regression gives \(w_0 = 0, w_1 = 10,\) and \(w_2 = 5\), so the equation can be written as: \begin{equation} y = 10x_1 + 5x_2 \end{equation} But there is a problem since \(w_1 = 10 \) is very high. This mihgt mean the model is overfitting the data, focusing too much on study hours and not generalizing well.

Step 2: Ridge Regression Adds a Penalty

Ridge regression adds a pnealty to prevent the weights from becoming too large. The new loss function is: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i) + \lambda(w_1^2 + w_2^2) \end{equation} where:

\(\frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i)\) - is the same MSE as before
\(\lambda(w_1^2 + w_2^2)\) - is the penalty for large weights, controlled by \(\lambda\)

Step 3: Choosing \(\lambda\)

Let's say \(\lambda = 0.1\). This makes the new loss function: \begin{equation} Loss = \frac{1}{N} \sum_{i=1}^N(y_i - \hat{y}_i) + 0.1(w_1^2 + w_2^2) \end{equation}

Step 4: Adjusting the weights

With Ridge Regression, the new weights become \(w_0 = 0\), \(w_1=8\), and \(w_2=4\). The equation can be written as: \begin{equation} y = 8x_1 + 4x_2 \end{equation} Notice how \(w_1\) and \(w_2\) are smaller compared to regular regression. So using Rige regression the weights were lowered to avoid overfitting.

Step 5: How Does this help?

Prediction with regular regression:
For a new input \(x_1 = 5, x_2 = 7\) the output is equal to: \begin{equation} y = 10(5) + 5(7) = 50 + 35 = 85 \end{equation} Predictions with Ridge Regression:
For the same input: \begin{equation} y = 8(5) + 4(7) = 40 + 28 = 68 \end{equation} Ridge gives a more conservative prediction, avoiding extreme values.

Example: Exam Scores Estimation Using Ridge Regression (Scikit-Learn)

# Import necessary libraries
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
# Step 1: Create the dataset
# Features: [Hours of Study, Hours of Sleep]
X = np.array([[2, 6],
                [4, 7],
                [6, 8],
                [8, 9]])
# Target: Exam Scores
y = np.array([50, 65, 80, 95])
# Step 2: Train Ridge Regression Model
# Set a regularization strength (lambda)
ridge_reg = Ridge(alpha=0.1)  # alpha is lambda in Ridge regression
ridge_reg.fit(X, y)
# Step 3: Predictions
y_pred = ridge_reg.predict(X)
# Step 4: Evaluate the Model
mse = mean_squared_error(y, y_pred)
# Print results
print("Weights (w1, w2):", ridge_reg.coef_)
print("Intercept (w0):", ridge_reg.intercept_)
print("Mean Squared Error:", mse)
# Step 5: Predict for a new input
new_input = np.array([[5, 7]])  # [Hours of Study, Hours of Sleep]
new_prediction = ridge_reg.predict(new_input)
print("Prediction for [Hours of Study=5, Hours of Sleep=7]:", new_prediction[0])

Explanation of the code

After required librarieswere imported the dataset was defined where \(X\) are the features (study hours and sleep hours) and \(y\) is the target (exam scores).
Rigde regression is defined with hyperparameter alpha equal to 0.1 to add a penalty for large weights. This hyperparameter controls how strong the penalty is. A smaller alpha focuses more on fitting the data, while a larger alpha shrinks the weights more.
The model learns the weights (\(w_1, w_2\)) and intercept (\(w_0\)) to minimize the Ridge loss function.
The predict() function calculates the predicted values using the learned equation.
The evaluation is performed using MSE to measure the quality of the predictions.
The sample output is given below.

Weights (w1, w2): [7.998 4.001]
Intercept (w0): -0.003
Mean Squared Error: 0.0001
Prediction for [Hours of Study=5, Hours of Sleep=7]: 67.989

Monday, December 23, 2024

How to customize scatter plot markers (Basic)?

When creating visualizations in Matplotlib, customizing markers and colors is necessary to make data more engaging and easier to interpret. In this tutorial we will explore how to change marker, styles, colors, and add custom combinations for a professional-looking plot. Markers help indicate specific data points on a line or scatter plot. Matplotlib enables you to change marker type style, marker size, and colors.... The Matplotlib provides several markers, including circles, triangles, squares and more. The commonly used marker styles are:

o - circle
s - square
"^" - triangle up
"v" - triangle down
"D" - diamond
"X" - x shape

Example 1 - Changing the marker type For this example we will need two libraries i.e. numpy as np and matplotlib.pyplot as plt. The numpy is required to generate the data while matplotlib.pyplot module is required for plotting the data.

import numpy as np
import matplotlib.pyplot as plt

The x coordinates will be generated using the np.linspacE() function while y coordinates will be generated using the np.sin() function where x coordinates will be input.

x = np.linspace(0,10,10)
y = np.sin(x)
print("x = {}".format(x))
print("y = {}".format(y))

The output of the script up to this point is given below.

x = array ([ 0. , 1.11111111 , 2.22222222, 3.33333333 , 4.44444444 , 5.55555556 , 6.66666667 , 7.77777778 , 8.88888889 , 10.])
y = array ([ 0. , 0.8961922 , 0.79522006 ,-0.19056796 , -0.96431712 ,4 -0.66510151 , 0.37415123 , 0.99709789 , 0.51060568 ,-0.54402111])

The basic code for creating the scatter plot is given below.

plt.figure(figsize=(12,8))
plt.scatter(x,y,marker='o', label='Circle Marker')
plt.grid(True)
plt.show()

Using plt.fgiure(figsize=(12,8)) we have created an empty figure with size 12 by 8 inches. The plt.scatter(x,y, marker='o', label='Circle Marker') will create the scatter plot with points having x and y coordinates, marker = 'o' is the Circle marker that is used by default in the scatter plot. The label is the name that will be display in legend if the plt.legend is defined. The plt.grid(True) defines the grid which is added to the plot and the plt.show() will show the plot.
However in this example we want to show different types of the markers so we will show the original scatter plot with x,y coordinates. Then we will increase the values of y by adding the value of 0.5 to the coordinates multiple times.
So in one plot we will graphically represent 6 diffrent sine functions and these are: \begin{eqnarray} y_1 &=& \sin(x)\\ \nonumber y_2 &=& \sin(x+0.5)\\ \nonumber y_3 &=& \sin(x+1)\\ \nonumber y_4 &=& \sin(x+1.5)\\ \nonumber y_5 &=& \sin(x+2)\\ \nonumber y_6 &=& \sin(x+2.5)\\ \nonumber \end{eqnarray} It should be noted that first \(y_1\) points will be visualized using circle type marker ('o'), the \(y_2\) will be visualized using square type marker ('s'), the \(y_3\) will be visualized using traingle type marker ('^'), the \(y_4\) will be visualized using diamond marker ('D'),and the \(y_5\) marker will be visualized using x shape marker ('X').

plt.scatter(x,y,marker='o', label='Circle Marker')
plt.scatter(x,y+0.5,marker='s', label='Square Marker')
plt.scatter(x,y+1,marker='^', label='Traingle Up Marker')
plt.scatter(x,y+1.5,marker='v', label='Traingle Down Marker')
plt.scatter(x,y+2,marker='D', label='Diamond Marker')
plt.scatter(x,y+2.5,marker='X', label='X Shape Marker')

Usin the previous code in one figure we will visualize 6 different sine function using different markers. The entire code created in this example is shown below.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,10)
y = np.sin(x)
print("x = {}".format(x))
print("y = {}".format(y))
plt.figure(figsize=(12,8))
plt.scatter(x,y,marker='o', label='Circle Marker')
plt.scatter(x,y+0.5,marker='s', label='Square Marker')
plt.scatter(x,y+1,marker='^', label='Traingle Up Marker')
plt.scatter(x,y+1.5,marker='v', label='Traingle Down Marker')
plt.scatter(x,y+2,marker='D', label='Diamond Marker')
plt.scatter(x,y+2.5,marker='X', label='X Shape Marker')
plt.grid(True)
plt.legend()
plt.show()

In the previous code the plt.legend() function was used to show the labels of each scatte plot in the plot legend. After executing the previous code the result is shwon in Figure 1.

Figure 1 - The scatter plot with customized markers Example 2 - Customizing marker size and edge color Besides the markerd type the matplotlib library offers you to customize the marker size and the edge color. In this example we will use the same libraries as in the previous example as well as the same data. The code for defining libraries and dataset is given below.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,10)
y = np.sin(x)

Then we will create the scatter plot with marker type 'o' (circle), we will set the markersize to 200 (s = 200), marker edge width to 2 (linewidths = 2),and marker edge color to black (edgecolor = 'black'). The second set of points will be created with the offset of y-coordiantes by 2. The same marker type will be used, the markersize will be set to 300, the line width to 5, and the edge color to black. The label of the first set of points (x,y) will be Customized Marker_1 while the label of the second set of points (x,y+2) will be Customized Marker_2

plt.figure(figsize=(12,8))
plt.scatter(x,y, marker = 'o', s = 200, linewidth = 2, edgecolor = 'black', label="Customized Marker")
plt.grid(True)
plt.legend()
plt.show()

The entire code used in this example is shown below.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,10)
y = np.sin(x)
print("x = {}".format(x))
print("y = {}".format(y))
plt.figure(figsize=(12,8))
plt.scatter(x,y, marker = 'o', s = 200, linewidth = 2, edgecolor = 'black', label="Customized Marker_1")
plt.scatter(x,y+2, marker = 'o', s = 300, linewidth = 5, edgecolor = 'black', label="Customized Marker_2")
plt.grid(True)
plt.legend()
plt.show()

After executing the previous code the scatter plot with two set of points where each set has its own customized marker size and edge color as shown in Figure 2.

Figure 2 - Scatter plot with customized markers (marker size and edge color). Example 3 - Changing marker colors There are few methods how colors can be specified in the matplotlib i.e. using color names, hex codes, and RGB tuples. In all the cases we will use the same type of dataset as before. Here is the entire code how to generate the dataset with libraries.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,10)
y = np.sin(x)

In this example we will define the color in three different ways:

using color names ('red', 'blue', 'green',...)
using hex codes (#000000, #2ca02c, #c3b530,...), and
using red, green, and blue (RGB) normalized values in a tuple format. These tuples can define any color by setting the values from 0 to 1 (normalzed) for red, green, and blue.

The first two functions \(y_1 = \sin(x)\), and \(y_2 = \sin(x)+0.5\) will be shown with red and green colors, respectively.

plt.scatter(x,y, color = 'red')
plt.scatter(x,y+0.5, color = 'green')

The next two functions \(y_3 = \sin(x)+ 1\) and the \(y_4 = \sin(x) + 1.5\) will be shown with #0000FF and #FF9A00 colors respectively.

plt.scatter(x,y+1, color = #ebebeb)
plt.scatter(x,y+1.5, color = #FF9A00)

The final two functions \(y_5 = \sin(x)+2\) and \(y_6 = \sin(x) + 2.5\) will be shown using colors (0.2, 0.6, 0.1) and (0.4, 0.23, 0.48).

plt.scatter(x,y+2, color = (0.2, 0.6, 0.1))
plt.scatter(x,y+2.5, color = (0.4, 0.23, 0.48))

The entire code created in this example is shown below.

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,10,10)
            y = np.sin(x)
print("x = {}".format(x))
print("y = {}".format(y))plt.figure(figsize=(12,8))
plt.scatter(x,y, color = 'red')
plt.scatter(x,y+0.5, color = 'green')
plt.scatter(x,y+1, color = "#0000FF")
plt.scatter(x,y+1.5, color = "#FF9A00")
plt.scatter(x,y+2, color = (0.2, 0.6, 0.1))
plt.scatter(x,y+2.5, color = (0.4, 0.23, 0.48))
plt.grid(True)
plt.show()

The result is shown in Figure 3.

Figure 3 - The scatter plot with customized colored markers.