Regularization is like telling the line:
"Hey, don't go too crazy trying to fit everything exactly. Keep it simple!"
It's a way to keep things balanced so the line works well for both the points we already know and new ones we don't.
The math behind regularization
We have explained logistic regression multiple times; if you are interested, please visit Logistic Regression. Logistic regression predicts probabilities for binary outcomes (e.g. yes/no, 0/1). The logistic regression model predicts: \begin{equation} \hat{y} = \sigma(z) = \frac{1}{1+e^{-z}} \end{equation} where:
- \(\hat{y}\) - is the predicted probability (between 0 and 1)
- \(z = wx + b\) - is the linear combination of weights and features
- \(w\) are the weights (coefficients)
- \(x\) are the features (input variables)
- \(b\) is the bias term
- \(\sigma(z)\) - is the sigmoid function, which maps any real number into the range [0, 1]
The model is trained by minimizing the binary cross-entropy (log) loss, which plays the role of the original loss in the regularized formulas below: \begin{equation} Loss = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right] \end{equation} where:
- \(N\) - is the number of samples
- \(y_i\) - is the true label (0 or 1)
- \(\hat{y}_i\) - is the predicted probability for sample \(i\)
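As a minimal sketch of these formulas (assuming NumPy; the function names here are just illustrative), the sigmoid and the binary cross-entropy loss can be written as:

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    # average negative log-likelihood over the N samples
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))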
Why Regularization?
When the weights become too large, the model overfits the training data. Regularization adds a penalty to the loss function to discourage large weights, i.e. it prevents overfitting. There are three types of regularization: L1 regularization, L2 regularization, and a combination of L1 and L2 (Elastic Net); a small code sketch of the three penalty terms is given after the list below.
- L1 Regularization (Lasso) - Adds the sum of the absolute values of the weights to the loss function:
\begin{equation}
Loss = OriginalLoss + \lambda \sum_{j=1}^p|w_j|
\end{equation}
where:
- \(\lambda\) - controls the strength of the penalty.
- \(w_j\) - are the weights (coefficients).
- \(p\) - is the number of features.
- L2 Regularization (Ridge) - Adds the sum of the squared values of the weights to the loss function:
\begin{equation}
Loss = Original Loss + \lambda \sum_{j = 1}^p w_j^2
\end{equation}
where:
- \(\lambda\) - controls the strength of the penalty.
- \(w_j\) - are the weights (coefficients).
- \(p\) - is the number of features.
- Elastic Net - Combines the L1 and L2 penalties in the loss function:
\begin{equation}
Loss = Original Loss + \lambda_1 \sum_{j = 1}^p |w_j| + \lambda_2 \sum_{j = 1}^p w_j^2
\end{equation}
where:
- \(\lambda_1\) - controls the L1 penalty (sparsity).
- \(\lambda_2\) - controls the L2 penalty.
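As a small sketch of the three penalties (assuming NumPy, with illustrative weights and penalty strengths), the extra terms added to the original loss look like this:

import numpy as np

w = np.array([2.0, 1.5])   # illustrative weights
lam1, lam2 = 0.1, 0.1      # illustrative penalty strengths

l1_penalty = lam1 * np.sum(np.abs(w))          # L1 (Lasso) term
l2_penalty = lam2 * np.sum(w ** 2)             # L2 (Ridge) term
elastic_net_penalty = l1_penalty + l2_penalty  # Elastic Net combines both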
Example (without Python)
We are solving for the regularized logistic regression loss, using L2 regularization, for the dataset given below; the total regularized loss is computed after the per-sample predictions at the end of this example.
Fruits (\(x_1\)) | Vegetables (\(x_2\)) | Healthy (\(y\)) |
---|---|---|
2 | 3 | 1 |
1 | 0 | 0 |
3 | 2 | 1 |
0 | 1 | 0 |
- \(w_1\) and \(w_2\) are weights for fruits and vegetables.
- \(b\) is the bias term.
Let's assume the initial weights \(w_1 = 2.0\), \(w_2 = 1.5\), and the bias \(b = 0.5\).
Compute \(z_i\) and \(\hat{y}_i\) for each sample.
- For the first sample, with input values \(x_1 = 2\) and \(x_2 = 3\), \(z_1\) and \(\hat{y}_1\) are: \begin{equation} z_1 = 2(2) + 1.5(3) + 0.5 = 9.0 \end{equation} \begin{equation} \hat{y}_1 = \frac{1}{1+e^{-9.0}} \approx 0.999 \end{equation}
- For the second sample, with input values \(x_1 = 1\) and \(x_2 = 0\), \(z_2\) and \(\hat{y}_2\) are: \begin{equation} z_2 = 2(1) + 1.5(0) + 0.5 = 2.5 \end{equation} \begin{equation} \hat{y}_2 = \frac{1}{1+e^{-2.5}} \approx 0.924 \end{equation}
- For the third sample, with input values \(x_1 = 3\) and \(x_2 = 2\), \(z_3\) and \(\hat{y}_3\) are: \begin{equation} z_3 = 2(3) + 1.5(2) + 0.5 = 9.5 \end{equation} \begin{equation} \hat{y}_3 = \frac{1}{1+e^{-9.5}} \approx 0.999 \end{equation}
- For the fourth sample, with input values \(x_1 = 0\) and \(x_2 = 1\), \(z_4\) and \(\hat{y}_4\) are: \begin{equation} z_4 = 2(0) + 1.5(1) + 0.5 = 2.0 \end{equation} \begin{equation} \hat{y}_4 = \frac{1}{1+e^{-2.0}} \approx 0.881 \end{equation}
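To finish the worked example, the binary cross-entropy loss over the four samples is combined with the L2 penalty. Assuming an illustrative \(\lambda = 0.1\) (not specified above) and rounding the numbers:
\begin{equation}
Original Loss = -\frac{1}{4}\left[\log(0.999) + \log(1 - 0.924) + \log(0.999) + \log(1 - 0.881)\right] \approx 1.18
\end{equation}
\begin{equation}
\lambda \sum_{j=1}^{2} w_j^2 = 0.1\,(2.0^2 + 1.5^2) = 0.625
\end{equation}
\begin{equation}
Loss \approx 1.18 + 0.625 \approx 1.80
\end{equation}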
Example with Python (scikit-learn)
Two libraries will be required: numpy and pandas. We will also need LogisticRegression from sklearn.linear_model and accuracy_score from sklearn.metrics.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Now we will create the dataset that was shown in the previous example (without Python) as a table. Since the dataset is very small, the train_test_split function will not be used; instead, the LogisticRegression model will be trained on the entire dataset. However, we still have to split the dataset into input variables and the output target variable. The input variables (stored in X) will contain Fruits and Vegetables, while the output target variable will be Healthy. Before that, the dataset is converted into a pandas DataFrame.

# Dataset
data = {'Fruits': [2, 1, 3, 0], 'Vegetables': [3, 0, 2, 1], 'Healthy': [1, 0, 1, 0]}

# Convert to DataFrame
df = pd.DataFrame(data)
X = df[['Fruits', 'Vegetables']]
y = df['Healthy']

Now we can define the LogisticRegression model and train it on the entire dataset using the fit() function. The hyperparameters used in this example are penalty='l2' and C=1.0. The penalty is l2 regularization, which is the default value when the algorithm is called. C is the inverse of the regularization strength and must be a positive float; as in support vector machines, smaller values specify stronger regularization. The default value is 1.0.

# Logistic Regression with L2 Regularization
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y)

Finally, we make predictions by passing X to the trained model's predict() function, show the weights using the built-in coef_ attribute, and compute and print the classification accuracy using the accuracy_score function.

# Predictions
predictions = model.predict(X)
print("Weights:", model.coef_)
print("Accuracy:", accuracy_score(y, predictions))

The output obtained in this example is given below.

Weights: [[0.74078792 0.74078792]]
Accuracy: 1.0

Since the LogisticRegression model is trained and tested on the same dataset, the obtained accuracy is 1.0, i.e. perfect classification accuracy. The weights for both input variables are the same, which indicates that they contribute equally to the output.
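In scikit-learn, C acts as the inverse of the regularization strength (roughly \(1/\lambda\)), so smaller C values shrink the weights more. A minimal sketch of this effect, reusing X and y from above with a few illustrative C values:

# Sketch: effect of the L2 regularization strength (illustrative C values)
for C in [0.01, 1.0, 100.0]:
    m = LogisticRegression(penalty='l2', C=C)
    m.fit(X, y)
    print("C =", C, "weights:", m.coef_)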