Friday, January 3, 2025

Regularization in Logistic Regression

Imagine you're trying to guess whether someone likes ice cream or not based on how many ice creams they've eaten in the past week. You draw a straight line on a chart to help make your guesses. But if you try to make your line fit too perfectly to all the points, it might go all wiggly and weird, and it won't work well for new guesses. That's what happens when we overthink, or overfit, the data.
Regularization is like telling the line:
"Hey, don't go too crazy trying to fit everything exactly. Keep it simple!"
It's a way to keep things balanced so the line works well for both the points we already know and new ones we don't.

The math behind regularization

We have explained Logistic Regression multiple times, and if you are interested please visit Logistic Regression.
Logistic regression predicts probabilities for binary outcomes (e.g. yes/no, 0/1). The logistic regression model predicts: \begin{equation} \hat{y} = \sigma(z) = \frac{1}{1+e^{-z}} \end{equation} where:
  • \(\hat{y}\) - is the predicted probability (between 0 and 1)
  • \(z = wx + b\) - is the linear combination of weights and features
  • \(w\) are the weights (coefficients)
  • \(x\) are the features (input variables)
  • \(b\) is the bias term
  • \(\sigma(z)\) - is the sigmoid function, which maps any real number into the range [0, 1]
The goal is to minimize the log-loss function (also called cross-entropy loss): \begin{equation} LogLoss = -\frac{1}{N}\sum_{i=1}^N\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1-\hat{y}_i)\right] \end{equation} where:
  • \(N\) - is the number of samples
  • \(y_i\) - is the true label (0 or 1)
  • \(\hat{y}_i\) - is the predicted probability for sample \(i\)
This loss function penalizes incorrect predictions more heavily when the model is confident but wrong.
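As a quick illustration of these two formulas, below is a minimal NumPy sketch of the sigmoid and the log-loss for a handful of made-up \(z\) values and labels (the numbers are arbitrary and only serve to show the formulas in code).
import numpy as np

def sigmoid(z):
    # Maps any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_pred, eps=1e-15):
    # Clip the probabilities so that log(0) never occurs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Arbitrary linear combinations z and true labels y
z = np.array([2.0, -1.0, 0.5, -3.0])
y_true = np.array([1, 0, 1, 0])

y_hat = sigmoid(z)
print("Predicted probabilities:", np.round(y_hat, 3))
print("Log-loss:", round(log_loss(y_true, y_hat), 3))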

Why Regularization?

When the weights become too large, the model overfits the training data. Regularization adds a penalty to the loss function to discourage large weights, i.e. regularization prevents overfitting.
There are three common types of regularization: L1 regularization, L2 regularization, and a combination of L1 and L2 (Elastic Net). A NumPy sketch of all three penalty terms is given after this list.
  • L1 Regularization (Lasso) - Adds the sum of the absolute values of the weights to the loss function: \begin{equation} Loss = OriginalLoss + \lambda \sum_{j=1}^p|w_j| \end{equation} where:
    • \(\lambda\) - controls the strength of the penalty.
    • \(w_j\) - are the weights (coefficients).
    • \(p\) - is the number of features.
    Effect: L1 regularization encourages sparsity, meaning some weights may become exactly 0, effectively removing less important features.
  • L2 Regularization (Ridge): L2 regularization adds the sum of the squared values of the weights to the loss function: \begin{equation} Loss = OriginalLoss + \lambda \sum_{j = 1}^p w_j^2 \end{equation} where:
    • \(\lambda\) - controls the strength of the penalty.
    • \(w_j\) - are the weights (coefficients).
    • \(p\) is the number of features.
    Effect: L2 regularization discourages large weights but does not drive them to exactly 0. It works well when all features contribute to the output.
  • Elastic Net: Combines L1 and L2 penalties in the loss function. \begin{equation} Loss = Original Loss + \lambda_1 \sum_{j = 1}^p |w_j| + \lambda_2 \sum_{j = 1}^p w_j^2 \end{equation} where:
    • \(\lambda_1\) - controls the L1 penalty (sparsity).
    • \(\lambda_2\) - controls the L2 penalty.
    Effect: Elastic Net strikes a balance between L1 and L2 regularization, making it suitable for situations where some features are irrelevant (L1) and others are correlated (L2).
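To make the three penalty terms concrete, here is a minimal NumPy sketch that evaluates each of them for an illustrative weight vector; the \(\lambda\) values are arbitrary and only chosen for demonstration.
import numpy as np

def l1_penalty(w, lam):
    # Lasso: lambda * sum of absolute weights
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    # Ridge: lambda * sum of squared weights
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam1, lam2):
    # Elastic Net: weighted combination of the L1 and L2 terms
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

w = np.array([2.0, 1.5])  # illustrative weights
print("L1 penalty:", l1_penalty(w, lam=0.1))              # 0.1 * (2.0 + 1.5)  = 0.35
print("L2 penalty:", l2_penalty(w, lam=0.1))              # 0.1 * (4.0 + 2.25) = 0.625
print("Elastic Net penalty:", elastic_net_penalty(w, 0.05, 0.05))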

Example (without Python)

We are solving for the regularized logistic regression loss using L2 regularization for the given dataset.
Fruits (\(x_1\)) | Vegetables (\(x_2\)) | Healthy (\(y\))
2 | 3 | 1
1 | 0 | 0
3 | 2 | 1
0 | 1 | 0
The predicted probability \(\hat{y}_i\) for each data point is given by the sigmoid function: \begin{equation} \hat{y}_i = \frac{1}{1+e^{-z_i}} \end{equation} where: \begin{equation} z_i = w_1x_{1i} + w_2 x_{2i} + b \end{equation}
  • \(w_1\) and \(w_2\) are weights for fruits and vegetables.
  • \(b\) is the bias term.
The L2 regularized loss function is: \begin{equation} Loss = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right) + \lambda(w_1^2 + w_2^2) \end{equation} The number of samples \(n\) is 4 since the entire dataset has 4 samples, and \(\lambda\) (the regularization parameter) is equal to 0.1.
Let's assume that initial weights \(w_1 = 2.0\), \(w_2 = 1.5\), and \(b = 0.5\).
Compute the \(z_i\) and \(\hat{y}_i\).
  1. For the first sample the input values are \(x_1 = 2\) and \(x_2 = 3\), so \(z_1\) and \(\hat{y}_1\) are equal to: \begin{equation} z_1 = 2(2) + 1.5(3) + 0.5 = 9.0 \end{equation} \begin{equation} \hat{y}_1 = \frac{1}{1+e^{-9.0}} = 0.999 \end{equation}
  2. For the second sample the input values are \(x_1 = 1\) and \(x_2 = 0\), so \(z_2\) and \(\hat{y}_2\) are equal to: \begin{equation} z_2 = 2(1) + 1.5(0) + 0.5 = 2.5 \end{equation} \begin{equation} \hat{y}_2 = \frac{1}{1+e^{-2.5}} = 0.924 \end{equation}
  3. For the third sample the input values are \(x_1 = 3\) and \(x_2 = 2\), so \(z_3\) and \(\hat{y}_3\) are equal to: \begin{equation} z_3 = 2(3) + 1.5(2) + 0.5 = 10.5 \end{equation} \begin{equation} \hat{y}_3 = \frac{1}{1+e^{-10.5}} = 0.999 \end{equation}
  4. For the fourth sample the input values are \(x_1 = 0\) and \(x_2 = 1\), so \(z_4\) and \(\hat{y}_4\) are equal to: \begin{equation} z_4 = 2(0) + 1.5(1) + 0.5 = 2.0 \end{equation} \begin{equation} \hat{y}_4 = \frac{1}{1+e^{-2.0}} = 0.881 \end{equation}
The log-loss value can be obtained from the log-loss function: \begin{eqnarray} LogLoss &=& -\frac{1}{4}\left(1\cdot\log(0.999)+(1-1)\cdot\log(1-0.999)+0\cdot\log(0.924)\right.\\ \nonumber &+&(1-0)\cdot\log(1-0.924)+1\cdot\log(0.999)\\ \nonumber &+&(1-1)\cdot\log(1-0.999)+0\cdot\log(0.881)\\ \nonumber &+&\left.(1-0)\cdot\log(1-0.881)\right) \end{eqnarray} The terms with true label \(y_i = 1\) contribute \(\log(0.999) \approx -0.001\) each, while the terms with \(y_i = 0\) contribute \(\log(1-0.924) \approx -2.577\) and \(\log(1-0.881) \approx -2.129\), so the sum inside the parentheses is about \(-4.708\): \begin{equation} LogLoss = -\frac{1}{4}(-4.708) = 1.177 \end{equation} The L2 penalty is equal to: \begin{equation} L2Penalty = \lambda(w_1^2 + w_2^2) = 0.1(4+2.25) = 0.625 \end{equation} The total regularized loss is equal to: \begin{equation} TotalLoss = LogLoss + L2Penalty = 1.177 + 0.625 = 1.802 \end{equation} The total regularized loss with \(\lambda = 0.1\), \(w_1 = 2.0\), \(w_2 = 1.5\), and \(b = 0.5\) is therefore \(1.802\). This shows how L2 regularization adds a penalty on large weights to discourage overfitting.
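The hand calculation above can be cross-checked with a short NumPy sketch. The variable names are my own, and the small numerical differences (about 1.176 for the log-loss and 1.801 for the total loss) come from rounding the predicted probabilities to three decimals in the manual computation.
import numpy as np

# Dataset from the table above
X = np.array([[2, 3], [1, 0], [3, 2], [0, 1]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)

# Assumed initial parameters
w = np.array([2.0, 1.5])
b = 0.5
lam = 0.1

# Linear combination and sigmoid
z = X @ w + b
y_hat = 1.0 / (1.0 + np.exp(-z))

# Log-loss, L2 penalty, and total regularized loss
log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
l2_penalty = lam * np.sum(w ** 2)
print("Log-loss:", round(log_loss, 3))
print("L2 penalty:", l2_penalty)
print("Total loss:", round(log_loss + l2_penalty, 3))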

Example with scikit-learn Python

Two libraries will be required: numpy and pandas.
import numpy as np
import pandas as pd
Next we will need LogisticRegression from sklearn.linear_model and accuracy_score from sklearn.metrics.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Now we will create the dataset that was shown in the previous example without Python, in the form of a table.
# Dataset
data = { 'Fruits': [2, 1, 3, 0], 'Vegetables': [3, 0, 2, 1], 'Healthy': [1, 0, 1, 0] }
Since the dataset is very small, the train_test_split function will not be used. Instead, the LogisticRegression algorithm will be trained on the entire dataset. However, we have to divide the dataset into input variables and the output target variable. The input variables (stored under the capital X variable) will contain Fruits and Vegetables, while the output target variable y will be the Healthy column. But before that we will transform the dataset into a pandas DataFrame.
# Convert to DataFrame
df = pd.DataFrame(data)
X = df[['Fruits', 'Vegetables']]
y = df['Healthy']
Now we can define the LogisticRegression algorithm and train it on the entire dataset using the fit() function. The hyperparameters of the LogisticRegression algorithm/model that will be used in this example are penalty='l2' and C=1.0. The penalty that will be used is l2 regularization, which is the default value when the algorithm is called. C is the inverse of the regularization strength and must be a positive float. As in support vector machines, smaller values specify stronger regularization. The default value is 1.0.
# Logistic Regression with L2 Regularization
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y)
Finally, we will make predictions using the predict() function, providing X to the trained model. Then we will show the weights using the built-in coef_ attribute and calculate and show the classification accuracy using the accuracy_score function.
# Predictions
predictions = model.predict(X)
print("Weights:", model.coef_)
print("Accuracy:", accuracy_score(y, predictions))
The output obtained in this example is given below.
Weights: [[0.74078792 0.74078792]]
Accuracy: 1.0
Since the LogisticRegression model is trained and tested on the same tiny dataset, the obtained accuracy of 1.0, i.e. perfect classification accuracy, is expected and says nothing about how the model would generalize to unseen data. The weights for both input variables are the same, which indicates an equal level of contribution to the output.
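As a small extension of the example above (not part of the original output), the strength of the L2 regularization can be explored by retraining the same model with different values of C. Since C is the inverse of the regularization strength, smaller values should shrink the learned weights toward zero; the exact numbers will depend on the solver.
# Retrain the same model with progressively stronger regularization
for C in [10.0, 1.0, 0.1, 0.01]:
    m = LogisticRegression(penalty='l2', C=C)
    m.fit(X, y)
    print(f"C={C}: weights={m.coef_.ravel()}")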
