Tuesday, January 28, 2025

Comparing Ridge, Lasso, and ElasticNet Regressions

In this post, we will compare three popular regularization techniques used in linear regression models: Ridge, Lasso, and ElasticNet. These methods help prevent overfitting by adding penalties to the coefficients of the model.

What is Regularization?

Regularization is a technique used to improve the generalization ability of a model. It adds a penalty to the loss function based on the size of the coefficients, helping to reduce the complexity of the model and prevent overfitting.

Ridge Regression

Ridge regression, also known as L2 regularization, adds the squared magnitude of the coefficients as a penalty term to the loss function. The goal is to minimize the sum of the squared residuals along with the penalty term.
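In equation form, Ridge minimizes the sum of squared residuals plus an L2 penalty on the coefficients (written here in the common textbook form; \(\alpha\) is the regularization strength passed to scikit-learn's Ridge): \begin{equation} \min_{w} \sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} w_j^2 \end{equation}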

Python Example: Ridge Regression

In this example, we will generate synthetic data with two features, split it into training and test sets, and apply Ridge regression using an alpha value of 1.0. The model's coefficients and intercept will be shown as the output.

Let's implement Ridge regression in Python using the Ridge class from sklearn.linear_model.

from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generating synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Applying Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Making predictions and evaluating the model
ridge_predictions = ridge.predict(X_test)

print("Ridge Regression Coefficients:", ridge.coef_)
print("Ridge Regression Intercept:", ridge.intercept_)
The previous code block consists of the following sections:
  • Importing Libraries:
    • from sklearn.linear_model import Ridge - Imports the Ridge class from sklearn.linear_model to implement Ridge regression.
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function to split the dataset into training and testing subsets.
    • from sklearn.datasets import make_regression - Imports the make_regression function to generate synthetic regression data.
    • import numpy as np - Imports the numpy library, typically used for numerical operations, although not directly used in this snippet.
  • Generating Synthetic Data:
    • X, y = make_regression(n_samples=100, n_features=2, noise=0.1) - Generates a synthetic regression dataset with 100 samples and 2 features. The noise parameter adds random noise to the output values, making the problem more realistic.
  • Splitting Data into Training and Test Sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the generated dataset into training and test sets. 20% of the data is used for testing, and 80% is used for training. The random_state ensures that the split is reproducible.
  • Applying Ridge Regression:
    • ridge = Ridge(alpha=1.0) - Initializes the Ridge regression model with a regularization strength of 1.0 (default value). The alpha parameter controls the amount of regularization applied to the model.
    • ridge.fit(X_train, y_train) - Fits the Ridge regression model to the training data (X_train and y_train). The model learns the relationships between the features and the target variable during this step.
  • Making Predictions:
    • ridge_predictions = ridge.predict(X_test) - Uses the trained Ridge model to make predictions on the test data (X_test), which will be evaluated against the actual target values (y_test).
  • Printing Model Parameters:
    • print("Ridge Regression Coefficients:", ridge.coef_) - Prints the coefficients learned by the Ridge regression model. These coefficients represent the contribution of each feature to the model's predictions.
    • print("Ridge Regression Intercept:", ridge.intercept_) - Prints the intercept value (bias term) of the Ridge regression model. This is the predicted value when all input features are zero.
After executing the code in this example, output similar to the following is obtained (because make_regression is called without a fixed random_state, the exact values will differ from run to run).
Ridge Regression Coefficients: [42.32481736 37.27191182]
Ridge Regression Intercept: -0.1079988849518081
  • Ridge Regression Coefficients:
    • [42.32481736 37.27191182] - These are the coefficients (weights) learned by the Ridge regression model for each of the two input features. The model assigns:
      • 42.32481736 to the first feature, meaning for each unit increase in this feature, the output variable is expected to increase by approximately 42.32 units, holding the other feature constant.
      • 37.27191182 to the second feature, meaning for each unit increase in this feature, the output variable is expected to increase by approximately 37.27 units, holding the other feature constant.
  • Ridge Regression Intercept:
    • -0.1079988849518081 - This is the intercept (bias term) of the Ridge regression model. It represents the predicted value of the target variable when both input features are zero. In this case, the model predicts a value of approximately -0.11 when both features are zero.
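The comment in the code above mentions evaluating the model, but no metric is actually computed. As a minimal sketch, assuming the ridge model, ridge_predictions, X_test, and y_test from the snippet above are still in scope (run it in the same session), the test-set error could be reported like this:

from sklearn.metrics import mean_squared_error, r2_score

# Continues the Ridge example above (ridge_predictions and y_test must exist)
mse = mean_squared_error(y_test, ridge_predictions)
r2 = r2_score(y_test, ridge_predictions)
print("Ridge test MSE:", mse)
print("Ridge test R^2:", r2)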

Lasso Regression

Lasso regression, or L1 regularization, uses the absolute values of the coefficients as a penalty term. It tends to produce sparse models, where some coefficients are driven to zero, effectively performing feature selection.
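Written out, the Lasso objective adds an L1 penalty to the residual sum of squares (scikit-learn's Lasso additionally divides the residual term by \(2N\), as shown here): \begin{equation} \min_{w} \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \alpha \sum_{j=1}^{p} |w_j| \end{equation}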

Python Example: Lasso Regression

Here, we apply Lasso regression with an alpha value of 0.1. As with Ridge regression, the coefficients and intercept are printed. However, in Lasso, some coefficients may be zero, leading to a simpler model.

Let's implement Lasso regression using the Lasso class from sklearn.linear_model.

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generating synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Applying Lasso regression
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# Making predictions and evaluating the model
lasso_predictions = lasso.predict(X_test)

print("Lasso Regression Coefficients:", lasso.coef_)
print("Lasso Regression Intercept:", lasso.intercept_)
The previous code block consists of the following steps:
  • Importing Libraries:
    • from sklearn.linear_model import Lasso - Imports the Lasso class from sklearn.linear_model to implement Lasso regression.
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function to split the dataset into training and testing subsets.
    • from sklearn.datasets import make_regression - Imports the make_regression function to generate synthetic regression data.
    • import numpy as np - Imports the numpy library for numerical operations, although it's not directly used in this snippet.
  • Generating Synthetic Data and Splitting the Data:
    • X, y = make_regression(n_samples=100, n_features=2, noise=0.1) - Generates a synthetic regression dataset with 100 samples and 2 features. The noise parameter adds random noise to the output values, simulating real-world data.
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the generated dataset into training and test sets. 20% of the data is used for testing, and 80% is used for training. The random_state ensures that the split is reproducible.
  • Applying Lasso Regression:
    • lasso = Lasso(alpha=0.1) - Initializes the Lasso regression model with a regularization strength of 0.1 (alpha). The alpha parameter controls the magnitude of the L1 penalty, influencing how much the model shrinks the coefficients.
    • lasso.fit(X_train, y_train) - Fits the Lasso regression model to the training data (X_train and y_train). The model learns the relationships between the features and the target variable during this step.
  • Making Predictions:
    • lasso_predictions = lasso.predict(X_test) - Uses the trained Lasso model to make predictions on the test data (X_test). The model applies the learned relationships to predict the target values for the unseen test set.
  • Printing Model Parameters:
    • print("Lasso Regression Coefficients:", lasso.coef_) - Prints the coefficients learned by the Lasso regression model. These coefficients represent the impact of each feature on the target variable. In Lasso, some coefficients may be zero due to feature selection, making the model sparse.
    • print("Lasso Regression Intercept:", lasso.intercept_) - Prints the intercept (bias term) of the Lasso regression model. This is the predicted value when all input features are zero. In Lasso, the intercept is typically non-zero unless the data is centered.
Executing the previous code block produces output similar to the following (again, the exact values vary between runs because the synthetic data is regenerated).
    Lasso Regression Coefficients: [18.42199406 61.89838269]
    Lasso Regression Intercept: 0.02333834253124545
  • Lasso Regression Coefficients:
    • [18.42199406 61.89838269] - These are the coefficients (weights) learned by the Lasso regression model for each of the two input features. The model assigns:
      • 18.42199406 to the first feature, meaning that for each unit increase in this feature, the output variable is expected to increase by approximately 18.42 units, holding the other feature constant.
      • 61.89838269 to the second feature, meaning that for each unit increase in this feature, the output variable is expected to increase by approximately 61.90 units, holding the other feature constant.
  • Lasso Regression Intercept:
    • 0.02333834253124545 - This is the intercept (bias term) of the Lasso regression model. It represents the predicted value when both input features are zero. In this case, the model predicts a value of approximately 0.02 when both features are zero.
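To make the feature-selection effect of Lasso visible, here is a separate, self-contained sketch; the dataset (10 features, only 3 of them informative) and the alpha values are assumptions chosen purely for illustration. As alpha grows, more coefficients are typically driven exactly to zero:

from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
import numpy as np

# Synthetic data where most features carry no signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=0.1, random_state=42)

for alpha in [0.01, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(lasso.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of 10 coefficients are exactly zero")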

ElasticNet Regression

ElasticNet regression combines both L1 and L2 regularization, making it a compromise between Ridge and Lasso. It is useful when there are many correlated features in the dataset.
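Writing \(\rho\) for the l1_ratio parameter used in the example below, scikit-learn's ElasticNet objective combines both penalties: \begin{equation} \min_{w} \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 + \alpha\rho \sum_{j=1}^{p}|w_j| + \frac{\alpha(1-\rho)}{2} \sum_{j=1}^{p} w_j^2 \end{equation}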

Python Example: ElasticNet Regression

In the ElasticNet regression example, we use both L1 and L2 regularization. The l1_ratio parameter controls the mix of Lasso (L1) and Ridge (L2) regularization, with a value of 0.5 indicating an equal mix.

Now, we will implement ElasticNet regression using the ElasticNet class from sklearn.linear_model.

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generating synthetic data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Applying ElasticNet regression
elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5)
elasticnet.fit(X_train, y_train)

# Making predictions and evaluating the model
elasticnet_predictions = elasticnet.predict(X_test)

print("ElasticNet Regression Coefficients:", elasticnet.coef_)
print("ElasticNet Regression Intercept:", elasticnet.intercept_)
The previous code block consists of the following sections:
  • Importing Libraries:
    • from sklearn.linear_model import ElasticNet - Imports the ElasticNet class from sklearn.linear_model to implement ElasticNet regression.
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function to split the dataset into training and testing subsets.
    • from sklearn.datasets import make_regression - Imports the make_regression function to generate synthetic regression data.
    • import numpy as np - Imports the numpy library for numerical operations, although it's not directly used in this snippet.
  • Generating Synthetic Data:
    • X, y = make_regression(n_samples=100, n_features=2, noise=0.1) - Generates a synthetic regression dataset with 100 samples and 2 features. The noise parameter adds random noise to the output values, simulating real-world data.
  • Splitting the Data:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the data into training and test sets, with 20% of the data allocated for testing. The random_state=42 ensures reproducibility of the split.
  • Applying ElasticNet Regression:
    • elasticnet = ElasticNet(alpha=0.1, l1_ratio=0.5) - Initializes the ElasticNet regression model. The alpha parameter controls the strength of the regularization, while the l1_ratio parameter determines the mix between Lasso (L1) and Ridge (L2) penalties:
      • When l1_ratio=1.0, it behaves like Lasso regression (pure L1 regularization).
      • When l1_ratio=0.0, it behaves like Ridge regression (pure L2 regularization).
      • Here, with l1_ratio=0.5, it combines both Lasso and Ridge penalties in equal measure.
    • elasticnet.fit(X_train, y_train) - Fits the ElasticNet regression model to the training data (X_train and y_train) and learns the coefficients that minimize the residual sum of squares subject to the regularization.
  • Making Predictions:
    • elasticnet_predictions = elasticnet.predict(X_test) - Uses the trained ElasticNet model to make predictions on the test data (X_test). This step applies the learned relationships to predict the target values for the test set.
  • Printing Model Parameters:
    • print("ElasticNet Regression Coefficients:", elasticnet.coef_) - Prints the coefficients learned by the ElasticNet regression model. These coefficients represent how much each feature contributes to the target variable. Both Lasso and Ridge regularization influence these coefficients.
    • print("ElasticNet Regression Intercept:", elasticnet.intercept_) - Prints the intercept (bias term) of the ElasticNet regression model. The intercept is the predicted value when all input features are zero.
The output of the ElasticNet example is shown below.
ElasticNet Regression Coefficients: [ 4.33447528 75.87734055]
ElasticNet Regression Intercept: 0.12560084787858017    
  • ElasticNet Regression Coefficients:
    • [ 4.33447528 75.87734055 ] - These are the coefficients (weights) for the two features used in the ElasticNet regression model:
      • 4.33447528 corresponds to the first feature, indicating that for each unit increase in the first feature, the target variable is expected to increase by approximately 4.33 units, while holding the second feature constant.
      • 75.87734055 corresponds to the second feature, indicating that for each unit increase in the second feature, the target variable is expected to increase by approximately 75.88 units, while holding the first feature constant.
  • ElasticNet Regression Intercept:
    • 0.12560084787858017 - This is the intercept (bias term) of the model. It represents the predicted value when both features are zero. In this case, when both features are zero, the model predicts a value of approximately 0.13 for the target variable.

Comparison

Here is a brief comparison of the three regression methods:

  • Ridge Regression: Suitable when most features contribute to the prediction. It tends to shrink coefficients evenly.
  • Lasso Regression: Useful for feature selection. It can shrink some coefficients to zero, making the model sparse.
  • ElasticNet Regression: Combines Ridge and Lasso, performing well in cases with many correlated features.

Each method has its advantages and should be chosen based on the nature of your dataset and the problem at hand.
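Because each example above regenerates its own random data, the printed coefficients of the three sections are not directly comparable. The sketch below (with a fixed random_state added here purely for reproducibility) fits all three models on one shared dataset so their coefficients can be compared side by side:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.datasets import make_regression

# One shared dataset so the three models can be compared directly
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)

models = [("Ridge", Ridge(alpha=1.0)),
          ("Lasso", Lasso(alpha=0.1)),
          ("ElasticNet", ElasticNet(alpha=0.1, l1_ratio=0.5))]

for name, model in models:
    model.fit(X, y)
    print(name, "coefficients:", model.coef_)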

Thank you for reading the tutorial! Try running the Python code and let me know in the comments if you got the same results. If you have any questions or need further clarification, feel free to leave a comment. Thanks again!

Feature Scaling Techniques in Machine Learning

Feature scaling is essential for many machine learning algorithms to perform well. In this section, we will describe several feature scaling techniques, provide a simple example dataset, and showcase the results of applying each technique.

1. MaxAbsScaler

The MaxAbsScaler scales each feature by its maximum absolute value. It scales the data to a range between -1 and 1 while preserving the sparsity of the dataset (if any). This method is useful when the data is already centered around zero and you want to maintain its sparsity.

Example:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np
    
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])
    
scaler = MaxAbsScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
   
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import MaxAbsScaler: Imports the MaxAbsScaler from the sklearn.preprocessing module, which is used to scale each feature by its maximum absolute value.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values include both positive and negative numbers.
  • Creating the scaler object:
    • scaler = MaxAbsScaler(): Creates an instance of the MaxAbsScaler, which scales the data by its maximum absolute value.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): Applies the MaxAbsScaler to the dataset by fitting the scaler to the data and then transforming it. The result is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the transformed data to the console. The data is scaled so that each feature is divided by its maximum absolute value, resulting in values between -1 and 1.
After the code is executed, the following output is obtained.

    [[ 0.25  0.4   0.5 ]
     [-0.25 -0.4  -0.5 ]
     [ 1.    1.    1.  ]]
            

2. MinMaxScaler

The MinMaxScaler transforms the data into a fixed range, usually between 0 and 1. The formula is:

\begin{equation} X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}} \end{equation}

This scaler is useful when features have different units or scales and you need to bring them into a common range, which can help many algorithms converge faster.

Example:

from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]) 
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import MinMaxScaler: Imports the MinMaxScaler from the sklearn.preprocessing module, which is used to scale features to a specified range, typically between 0 and 1.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in the array are a mix of positive and negative numbers.
  • Creating the scaler object:
    • scaler = MinMaxScaler(): Creates an instance of the MinMaxScaler. This scaler transforms the data by scaling each feature to a given range (default is between 0 and 1), based on the minimum and maximum values of each feature.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the MinMaxScaler to the data (i.e., computes the minimum and maximum values for each feature) and then transforms the data by scaling each feature to the range [0, 1]. The resulting transformed data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the scaled data to the console. The values in each column are now transformed to lie between 0 and 1, according to the minimum and maximum values of each feature in the original data.
When the previous code is executed, the following output is obtained.
[[0.4        0.57142857 0.66666667]
 [0.         0.         0.        ]
 [1.         1.         1.        ]]

3. Normalizer

The Normalizer scales each sample (row) to have a unit norm (magnitude of 1). This is useful when you want to scale each observation independently of the others.

Example:

from sklearn.preprocessing import Normalizer
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]) 
scaler = Normalizer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import Normalizer: Imports the Normalizer from the sklearn.preprocessing module, which is used to normalize the dataset. Normalization scales each sample (row) to have a unit norm.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in this array include both positive and negative numbers.
  • Creating the scaler object:
    • scaler = Normalizer(): Creates an instance of the Normalizer class. The Normalizer scales each sample (row) in the dataset to have a unit norm (i.e., the Euclidean norm of the row is 1).
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the Normalizer to the dataset (calculates the necessary values for normalization) and then transforms the data, scaling each row to have a unit norm. The resulting scaled data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the normalized data to the console. Each row in the output will have a Euclidean norm of 1, meaning that the sum of squares of the elements in each row will be equal to 1.
When the code block is executed, the following output is obtained.
[[ 0.26726124  0.53452248  0.80178373]
 [-0.26726124 -0.53452248 -0.80178373]
 [ 0.45584231  0.56980288  0.68376346]]

4. PowerTransformer

The PowerTransformer applies a power transformation to make data more Gaussian-like. It includes two methods: the Box-Cox and Yeo-Johnson transformations, which are useful for correcting skewed data.

Example:

from sklearn.preprocessing import PowerTransformer
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = PowerTransformer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import PowerTransformer: Imports the PowerTransformer from the sklearn.preprocessing module, which is used to apply power transformations to make data more Gaussian (normal) by applying a nonlinear transformation to the features.
    • import numpy as np: Imports the numpy library, which is used for creating and manipulating arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The dataset contains both positive and negative numbers.
  • Creating the scaler object:
    • scaler = PowerTransformer(): Creates an instance of the PowerTransformer. This scaler transforms the data using a power transformation to make the data distribution closer to a normal (Gaussian) distribution. It applies a Box-Cox transformation or a Yeo-Johnson transformation, depending on the data's characteristics (positive vs. both positive and negative values).
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the PowerTransformer to the dataset (calculates the necessary transformation parameters) and then transforms the data. The transformed data is stored in scaled_data. The result is a transformed dataset that aims to have a more Gaussian distribution for each feature.
When the previous code is executed the following output is obtained.
[[-0.75592499 -0.75592499 -0.75592499]
[ 0.75592499  0.75592499  0.75592499]
[-1.60169291 -1.60169291 -1.60169291]]

5. RobustScaler

The RobustScaler uses the median and interquartile range for scaling, making it robust to outliers. It scales the data by subtracting the median and dividing by the interquartile range.

Example:

from sklearn.preprocessing import RobustScaler
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import RobustScaler: Imports the RobustScaler from the sklearn.preprocessing module, which is used for scaling the features of the dataset using the median and interquartile range (IQR) instead of mean and standard deviation.
    • import numpy as np: Imports the numpy library, which is used for creating and manipulating arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The dataset contains both positive and negative numbers.
  • Creating the scaler object:
    • scaler = RobustScaler(): Creates an instance of the RobustScaler. This scaler transforms the data by using the median and the interquartile range (IQR) for scaling, which makes it robust to outliers. It is particularly useful when the dataset has extreme outliers that could affect scaling using standard techniques.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code fits the RobustScaler to the dataset (calculates the necessary values like median and IQR for scaling) and then transforms the data. The resulting scaled data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the scaled data to the console. The values are scaled by subtracting the median of each feature and then dividing by the interquartile range (IQR), making them less sensitive to outliers.
When the previous code block is executed, the following output is obtained.
[[ 0.          0.          0.        ]
 [-0.8        -1.14285714 -1.33333333]
 [ 1.2         0.85714286  0.66666667]]

6. StandardScaler

The StandardScaler standardizes features by removing the mean and scaling to unit variance. The formula is:

\begin{equation} X_{scaled} = \frac{X - mean}{std_{dev}} \end{equation}

This method is useful when the data follows a Gaussian distribution or when features have different variances.

Example:

from sklearn.preprocessing import StandardScaler
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import StandardScaler: Imports the StandardScaler from the sklearn.preprocessing module, which is used to scale the dataset by transforming it into a distribution with a mean of 0 and a standard deviation of 1.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in the array include both positive and negative numbers.
  • Creating the scaler object:
    • scaler = StandardScaler(): Creates an instance of the StandardScaler. This scaler transforms the data to have a mean of 0 and a standard deviation of 1. It is commonly used when the features in the dataset are on different scales.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the StandardScaler to the dataset (calculates the mean and standard deviation for each feature) and then transforms the data, scaling it so that each feature has a mean of 0 and a standard deviation of 1. The resulting scaled data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the scaled data to the console. Each feature will have been transformed to have a mean of 0 and a standard deviation of 1. This ensures that all features are on a comparable scale, which can improve the performance of certain machine learning algorithms.
When the code is executed the following output is obtained.
[[-0.16222142  0.11624764  0.26726124]
 [-1.13554995 -1.27872403 -1.33630621]
 [ 1.29777137  1.16247639  1.06904497]]

As you can see, each scaling technique transforms the data differently based on the chosen method. Understanding the impact of each method on your data can help improve model performance and convergence.
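To see all six transformations side by side on the same array used throughout this post, a compact loop such as the following can be used (a convenience sketch, not part of the original examples):

from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   PowerTransformer, RobustScaler, StandardScaler)
import numpy as np

# The same small array used in every example above
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

for scaler in [MaxAbsScaler(), MinMaxScaler(), Normalizer(),
               PowerTransformer(), RobustScaler(), StandardScaler()]:
    print(scaler.__class__.__name__)
    print(scaler.fit_transform(data))
    print()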

This is the end of the tutorial on feature scaling techniques. Please try the code described in the post, and if you have any questions regarding this tutorial, please leave a comment below. Thank you.

Interpreting Coefficients in Linear and Logistic Regression

Interpreting coefficients in linear and logistic regression is essential for understanding the relationship between variables in statistical and machine learning models. In linear regression, coefficients quantify how much the dependent variable changes for a one-unit increase in an independent variable, assuming all other variables remain constant. Logistic regression, used for binary classification, provides coefficients that explain the impact of predictors on the log-odds of an event occurring, which can be further converted into odds ratios for easier interpretation. By understanding these coefficients, practitioners can gain insights into the significance, magnitude, and direction of predictors, enabling informed decision-making and better model explanations.

Linear Regression

Linear regression models the relationship between a dependent variable and one or more independent variables using a straight line. The coefficients represent the change in the dependent variable for a one-unit increase in the independent variable.

Example: Predicting house prices based on square footage.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data
data = {'Square_Feet': [1500, 1800, 2400, 3000, 3500],
        'Price': [300000, 350000, 400000, 500000, 600000]}
df = pd.DataFrame(data)

# Model
X = df[['Square_Feet']]
y = df['Price']
model = LinearRegression()
model.fit(X, y)

# Coefficients
print("Coefficient (Slope):", model.coef_[0])
print("Intercept:", model.intercept_)

# Interpretation
# For every additional square foot, the house price increases by model.coef_[0] units.
        
The previous code block consists of the following code lines:
  • The code imports necessary libraries: numpy, pandas, and LinearRegression from sklearn.linear_model.
  • A dictionary named data is created with two keys: Square_Feet (independent variable) and Price (dependent variable), representing house sizes and their corresponding prices.
  • The dictionary is converted into a pandas DataFrame called df for easier manipulation.
  • The independent variable (Square_Feet) is assigned to X, and the dependent variable (Price) is assigned to y.
  • An instance of LinearRegression is created and stored in the variable model.
  • The model is trained on the data using model.fit(X, y), where the algorithm learns the relationship between square footage and price.
  • The slope (coefficient) of the regression line is retrieved using model.coef_[0], which indicates how much the price increases for each additional square foot.
  • The y-intercept of the regression line is retrieved using model.intercept_, representing the price of a house when the square footage is 0.
  • The code prints the slope and intercept values to interpret the linear relationship between the variables.
  • Interpretation: The coefficient (model.coef_[0]) indicates that for every additional square foot of house size, the price increases by the given amount (in the same units as Price).
When the code is executed the following result is obtained.
Coefficient (Slope): 144.21669106881407
Intercept: 78111.27379209368
The interpretation of the output is as follows:
  • Coefficient (Slope): 144.21669106881407 For every additional square foot of house size, the house price increases by approximately 144.22 units. In this context, if the price is in dollars, then for every extra square foot, the price increases by $144.22.
  • Intercept: 78111.27379209368 When the house size is 0 square feet (which is theoretical and may not have practical meaning), the predicted house price is approximately $78,111.27. The intercept represents the baseline value of the dependent variable (price) when all predictors (square footage) are zero.
  • Practical Interpretation: The model suggests that larger houses cost more, with an increase of $144.22 for each additional square foot.

Key Point: The coefficient for Square_Feet shows how much the price changes per square foot.
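To make the interpretation concrete, the printed slope and intercept can be plugged in for a hypothetical 2,000 square-foot house (this house is not part of the original data; the numbers below simply reuse the output shown above):

# Manual prediction using the printed slope and intercept
slope = 144.21669106881407
intercept = 78111.27379209368

predicted_price = intercept + slope * 2000
print("Predicted price for a 2000 sq ft house:", predicted_price)

This is the same value model.predict([[2000]]) would return, roughly $366,545.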

Logistic Regression

Logistic regression is used for classification problems, predicting the probability of a binary outcome. The coefficients represent the change in the log-odds of the outcome for a one-unit increase in the predictor variable.

Example: Predicting whether a customer will buy a product based on income.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Sample data
data = {'Income': [30000, 45000, 60000, 80000, 100000],
        'Purchased': [0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Model
X = df[['Income']]
y = df['Purchased']
model = LogisticRegression()
model.fit(X, y)

# Coefficients
print("Coefficient (Log-Odds):", model.coef_[0][0])
print("Intercept:", model.intercept_[0])

# Probability Interpretation
import math
odds_ratio = math.exp(model.coef_[0][0])
print("Odds Ratio:", odds_ratio)

# For every additional dollar in income, the odds of purchase increase by odds_ratio times.
        
The previous code block example consists of the following code lines:
  • Imports: The code imports necessary libraries for the task:
    • NumPy - a library for numerical operations in Python, although it is not directly used in the code.
    • Pandas - a library used for data manipulation and analysis. It is used to create the DataFrame `df` containing the sample data.
    • LogisticRegression from sklearn.linear_model - a machine learning model used for binary classification tasks, in this case, predicting whether a purchase will be made based on income.
  • Sample Data: The dictionary `data` contains two key-value pairs:
    • 'Income': The income values of 5 individuals, used as the independent variable for prediction.
    • 'Purchased': A binary target variable (0 or 1) representing whether the individual made a purchase (1) or not (0).

    The dictionary is converted into a DataFrame `df` using pd.DataFrame(data).

  • Model Training: The logistic regression model is trained using the data:
    • X: The independent variable, which is the 'Income' column from the DataFrame, selected using df[['Income']].
    • y: The target variable, which is the 'Purchased' column from the DataFrame, selected using df['Purchased'].
    • Logistic Regression Model: An instance of LogisticRegression() is created and trained using the fit method with the input data X and the target variable y.
  • Model Coefficients: After the model is trained, the coefficients are displayed:
    • Coefficient (Log-Odds): The model’s coefficient is extracted using model.coef_[0][0], which represents the log-odds for a one-unit increase in income. This is printed out.
    • Intercept: The model’s intercept is extracted using model.intercept_[0], which represents the log-odds of the baseline (when income = 0). This is printed out as well.
  • Probability Interpretation: The odds ratio is calculated to interpret the model’s prediction:
    • Odds Ratio: The odds ratio is calculated using the formula math.exp(model.coef_[0][0]), which converts the log-odds to the actual odds ratio. This shows how much the odds of purchasing increase for every additional dollar of income.
  • Conclusion: The print statement "For every additional dollar in income, the odds of purchase increase by odds_ratio times." concludes the interpretation of the odds ratio, giving insight into the model’s behavior.
When the code is executed the following output is obtained.
Coefficient (Log-Odds): 1.652730135568006e-05
Intercept: -6.136333210253191e-10
Odds Ratio: 1.0000165274379322
        
Here is the explanation of the obtained results:
  • Coefficient (Log-Odds): 1.652730135568006e-05
    • This is the coefficient (log-odds) obtained for the "Income" variable in the logistic regression model. It represents the change in the log-odds of purchasing a product for a one-unit increase in income.
    • The value of 1.652730135568006e-05 (which is a very small number) suggests that for every 1-dollar increase in income, the log-odds of purchasing the product increase by approximately 0.0000165. This is a very small effect.
  • Intercept: -6.136333210253191e-10
    • The intercept (log-odds) represents the baseline log-odds when income is 0 (i.e., no income). The value -6.136333210253191e-10 is essentially zero, meaning that for an income of 0 the predicted log-odds are approximately 0, which corresponds to a probability of about 0.5. In other words, the intercept contributes almost nothing here, and the prediction is driven almost entirely by the income term.
  • Odds Ratio: 1.0000165274379322
    • The odds ratio is calculated by exponentiating the coefficient (log-odds). In this case, exp(1.652730135568006e-05) gives an odds ratio of 1.0000165274379322.
    • An odds ratio of approximately 1 means that the increase in income has a very small effect on the odds of making a purchase. Specifically, for every additional dollar in income, the odds of making a purchase increase by a factor of 1.0000165, which is a very slight increase. The odds ratio close to 1 indicates that income has only a minimal effect on the probability of purchasing in this model.

Key Point: Convert the coefficient to an odds ratio using the exponential function to interpret it in terms of probability.
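As a small illustration of that conversion, the printed coefficient and intercept can be turned into a purchase probability for a hypothetical income of 60,000 (an assumed value, not part of the original post) by applying the sigmoid function:

import math

# Manual probability from the printed coefficient and intercept
coef = 1.652730135568006e-05
intercept = -6.136333210253191e-10

log_odds = intercept + coef * 60000
probability = 1 / (1 + math.exp(-log_odds))
print("Predicted purchase probability at income 60000:", probability)

This mirrors what model.predict_proba([[60000]]) computes and comes out to roughly 0.73.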

Conclusion

In both linear and logistic regression, the coefficients are essential for understanding the relationship between the independent variables (predictors) and the dependent variable (outcome). In linear regression, the coefficient represents the change in the dependent variable for each one-unit change in the independent variable. A positive coefficient indicates a direct relationship, while a negative coefficient suggests an inverse relationship between the two variables. On the other hand, in logistic regression, the coefficient represents the change in the log-odds of the outcome occurring for a one-unit change in the independent variable. Although interpreting log-odds is not as straightforward as interpreting linear regression coefficients, the results can be converted into an odds ratio by exponentiating the coefficient, which is easier to interpret.

The odds ratio in logistic regression helps to understand how the odds of the event change with each one-unit increase in the independent variable. An odds ratio of 1 means no effect on the odds, while values greater than 1 or less than 1 indicate an increase or decrease in the odds, respectively. In our example, the odds ratio of approximately 1 suggests that income has a minimal effect on the likelihood of making a purchase. This indicates that other factors beyond income may have a greater influence on purchasing behavior.

In summary, understanding how to interpret coefficients in both linear and logistic regression models is crucial for making informed decisions based on model predictions. The coefficients provide insights into how each independent variable contributes to the outcome, and the odds ratio in logistic regression offers a more intuitive way to interpret the relationship between the predictors and the event being studied.

Thank you for reading the tutorial! Try running the Python code and let me know in the comments if you got the same results. If you have any questions or need further clarification, feel free to leave a comment. Thanks again!

Friday, January 3, 2025

Regularization in Logistic Regression

Imagine you're trying to guess whether someone likes ice cream based on how many ice creams they've eaten in the past week. You draw a straight line on a chart to help make your guesses. But if you try to make your line fit too perfectly to all the points, it might go all wiggly and weird, and it won't work well for new guesses. That's what happens when we overthink, or overfit, the data.
Regularization is like telling the line:
"Hey, don't go too crazy trying to fit everything exactly. Keep it simple!"
It's a way to keep things balanced so the line works well for both the points we already know and new ones we don't.

The math behind regularization

We have explained logistic regression multiple times; if you're interested, please visit Logistic Regression.
Logistic regression predicts probabilities for binary outcomes (e.g. yes/no, 0/1). The logistic regression model predicts: \begin{equation} \hat{y} = \sigma(z) = \frac{1}{1+e^{-z}} \end{equation} where:
  • \(\hat{y}\) - is the predicted probability (between 0 and 1)
  • \(z = wx + b\) - is the linear combination of weights and features
  • \(w\) are the weights (coefficients)
  • \(x\) are the features (input variables)
  • \(b\) is the bias term
  • \(\sigma(z)\) - is the sigmoid function, which maps any real number into the range [0, 1]
The goal is to minimize the log-loss function (also called cross-entropy loss): \begin{equation} LogLoss = -\frac{1}{N}\sum_{i=1}^N\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1-\hat{y}_i)\right] \end{equation} where:
  • \(N\) - is the number of samples
  • \(y_i\) - is the true label (0 or 1)
  • \(\hat{y}_i\) - is the predicted probability for sample \(i\)
This loss function penalizes incorrect predictions more heavily when the model is confident but wrong.

Why Regularization?

When the weights become too large, the model overfits the training data. Regularization adds a penalty to the loss function to discourage large weights, i.e. it helps prevent overfitting.
There are three common types of regularization: L1 regularization, L2 regularization, and a combination of the two (Elastic Net). A short scikit-learn sketch after this list shows how each penalty is selected.
  • L1 Regularization (Lasso) - Adds the sum of the absolute values of the weights to the loss function: \begin{equation} Loss = OriginalLoss + \lambda \sum_{j=1}^p|w_j| \end{equation} where:
    • \(\lambda\) - controls the strength of the penalty.
    • \(w_j\) - are the weights (coefficients).
    • \(p\) - is the number of features.
    Effect: L1 regularization encourages sparsity, meaning some weights may become exactly 0, effectively removing less important features.
  • L2 Regularization (Ridge) - Adds the sum of the squared values of the weights to the loss function: \begin{equation} Loss = OriginalLoss + \lambda \sum_{j = 1}^p w_j^2 \end{equation} where:
    • \(\lambda\) - controls the strength of the penalty.
    • \(w_j\) - are the weights (coefficients).
    • \(p\) is the number of features.
    Effect: L2 regularization discourages large weights but does not drive them to exactly 0. It works well when all features contribute to the output.
  • Elastic Net: Combines L1 and L2 penalties in the loss function. \begin{equation} Loss = Original Loss + \lambda_1 \sum_{j = 1}^p |w_j| + \lambda_2 \sum_{j = 1}^p w_j^2 \end{equation} where:
    • \(\lambda_1\) - controls the L1 penalty (sparsity).
    • \(\lambda_2\) - controls the L2 penalty.
    Effect: Elastic Net strikes a balance between L1 and L2 regularization, making it suitable for situations where some features are irrelevant (L1) and others are correlated (L2).
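In scikit-learn, these penalties are selected through the penalty parameter of LogisticRegression; a minimal sketch is shown below (note that the regularization strength is controlled by C, which is roughly the inverse of the \(\lambda\) used in the equations above):

from sklearn.linear_model import LogisticRegression

# The saga solver supports all three penalty types
l2_model = LogisticRegression(penalty='l2', C=1.0)                 # L2 (Ridge-style), the default
l1_model = LogisticRegression(penalty='l1', solver='saga', C=1.0)  # L1 (Lasso-style)
en_model = LogisticRegression(penalty='elasticnet', solver='saga',
                              l1_ratio=0.5, C=1.0)                 # Elastic Net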

Example (without Python)

We are solving for the regularized logistic regression loss using L2 regularization for the given dataset.
Fruits (\(x_1\)) | Vegetables (\(x_2\)) | Healthy (\(y\))
2 | 3 | 1
1 | 0 | 0
3 | 2 | 1
0 | 1 | 0
The predicted probability \(\hat{y}_i\) for each data point is given by the sigmoid function: \begin{equation} \hat{y}_i = \frac{1}{1+e^{-z_i}} \end{equation} where: \begin{equation} z_i = w_1x_{1i} + w_2 x_{2i} + b \end{equation}
  • \(w_1\) and \(w_2\) are weights for fruits and vegetables.
  • \(b\) is the bias term.
The L2 regularized loss function is: \begin{equation} Loss = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right) + \lambda(w_1^2 + w_2^2) \end{equation} The number of samples \(n\) is 4, since the entire dataset has 4 samples, and the regularization parameter \(\lambda\) is equal to 0.1.
Let's assume that initial weights \(w_1 = 2.0\), \(w_2 = 1.5\), and \(b = 0.5\).
Compute the \(z_i\) and \(\hat{y}_i\).
  1. For the first sample, the input values are \(x_1 = 2\) and \(x_2 = 3\), so \(z_1\) and \(\hat{y}_1\) are: \begin{equation} z_1 = 2(2) + 1.5(3) + 0.5 = 9.0 \end{equation} \begin{equation} \hat{y}_1 = \frac{1}{1+e^{-9.0}} = 0.999 \end{equation}
  2. For the second sample, the input values are \(x_1 = 1\) and \(x_2 = 0\), so \(z_2\) and \(\hat{y}_2\) are: \begin{equation} z_2 = 2(1) + 1.5(0) + 0.5 = 2.5 \end{equation} \begin{equation} \hat{y}_2 = \frac{1}{1+e^{-2.5}} = 0.924 \end{equation}
  3. For the third sample, the input values are \(x_1 = 3\) and \(x_2 = 2\), so \(z_3\) and \(\hat{y}_3\) are: \begin{equation} z_3 = 2(3) + 1.5(2) + 0.5 = 9.5 \end{equation} \begin{equation} \hat{y}_3 = \frac{1}{1+e^{-9.5}} = 0.999 \end{equation}
  4. For the fourth sample, the input values are \(x_1 = 0\) and \(x_2 = 1\), so \(z_4\) and \(\hat{y}_4\) are: \begin{equation} z_4 = 2(0) + 1.5(1) + 0.5 = 2.0 \end{equation} \begin{equation} \hat{y}_4 = \frac{1}{1+e^{-2.0}} = 0.881 \end{equation}
The log-loss value can now be obtained from the log-loss function: \begin{eqnarray} LogLoss &=& -\frac{1}{4}\big[1\cdot\log(0.999)+(1-1)\cdot\log(1-0.999)+0\cdot\log(0.924)\\ \nonumber &+&(1-0)\cdot\log(1-0.924)+1\cdot\log(0.999)+(1-1)\cdot\log(1-0.999)\\ \nonumber &+&0\cdot\log(0.881)+(1-0)\cdot\log(1-0.881)\big] \end{eqnarray} \begin{equation} LogLoss = -\frac{1}{4}(-4.708) \approx 1.18 \end{equation} The L2 penalty is equal to: \begin{equation} L2Penalty = \lambda(w_1^2 + w_2^2) = 0.1(4+2.25) = 0.625 \end{equation} The total regularized loss is equal to: \begin{equation} TotalLoss = LogLoss + L2Penalty = 1.18 + 0.625 \approx 1.80 \end{equation} The total regularized loss with \(\lambda = 0.1\), \(w_1 = 2.0\), \(w_2 = 1.5\), and \(b = 0.5\) is approximately \(1.80\). This shows how L2 regularization adds a penalty for large weights on top of the data-fit term.
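The same arithmetic can be verified with a few lines of NumPy (a quick check, not part of the original post; the result differs slightly from the hand calculation because the predicted probabilities are not rounded here):

import numpy as np

# Data and parameters from the worked example above
X = np.array([[2, 3], [1, 0], [3, 2], [0, 1]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)
w = np.array([2.0, 1.5])
b = 0.5
lam = 0.1

z = X @ w + b                    # [9.0, 2.5, 9.5, 2.0]
y_hat = 1 / (1 + np.exp(-z))     # sigmoid
log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
l2_penalty = lam * np.sum(w ** 2)

print("Log-loss:", log_loss)                 # ~1.18
print("L2 penalty:", l2_penalty)             # 0.625
print("Total loss:", log_loss + l2_penalty)  # ~1.80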

Example with scikit-learn Python

Two libraries will be required: numpy and pandas.
import numpy as np
import pandas as pd
Next we will need LogisticRegression from sklearn.linear_model and accuracy_score from sklearn.metrics.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Now we will create the dataset that was shown as a table in the previous example without Python.
# Dataset
data = { 'Fruits': [2, 1, 3, 0], 'Vegetables': [3, 0, 2, 1], 'Healthy': [1, 0, 1, 0] }
Since the dataset is very small, the train_test_split function will not be used. Instead, the LogisticRegression algorithm will be trained on the entire dataset. However, we still have to divide the dataset into input variables and the output target variable. The input variables (stored under the variable X) will contain Fruits and Vegetables, while the output target variable y will be the Healthy column. Before that, we will transform the dataset into a pandas DataFrame.
# Convert to DataFrame
df = pd.DataFrame(data)
X = df[['Fruits', 'Vegetables']]
y = df['Healthy']
Now we can define the LogisticRegression algorithm and train it on the entire dataset using the fit() function. The hyperparameters of the LogisticRegression model used in this example are penalty='l2' and C=1.0. The l2 penalty is the default value when the algorithm is called. C is the inverse of the regularization strength and must be a positive float; as in support vector machines, smaller values specify stronger regularization. The default value is 1.0.
# Logistic Regression with L2 Regularization
model = LogisticRegression(penalty='l2', C=1.0)
model.fit(X, y)
Finally, we will make predictions using the predict() function, providing X to the trained model. Then we will show the weights using the built-in coef_ attribute, and calculate and display the classification accuracy using the accuracy_score function.
# Predictions
predictions = model.predict(X)
print("Weights:", model.coef_)
print("Accuracy:", accuracy_score(y, predictions))
The output obtained in this example is given below.
Weights: [[0.74078792 0.74078792]]
Accuracy: 1.0
Since LogisticRegression is trained and evaluated on the same small dataset, the reported accuracy of 1.0 (perfect classification) describes only the training data and says nothing about generalization. The weights for both input variables are the same, which indicates that they contribute equally to the output.
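To see the effect of the regularization strength on this tiny dataset, the model can be refit with several values of C (the specific values below are arbitrary and used only for illustration). Smaller C means a stronger L2 penalty, so the weights typically shrink:

import pandas as pd
from sklearn.linear_model import LogisticRegression

data = {'Fruits': [2, 1, 3, 0], 'Vegetables': [3, 0, 2, 1], 'Healthy': [1, 0, 1, 0]}
df = pd.DataFrame(data)
X = df[['Fruits', 'Vegetables']]
y = df['Healthy']

# Smaller C = stronger L2 regularization = smaller weights
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(penalty='l2', C=C).fit(X, y)
    print(f"C={C}: weights={model.coef_[0]}")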

Thursday, January 2, 2025

Multiclass Classification with Logistic Regression

Imagine you are in a candy store. There are three types of candies: chocolate, gummy bears, and lollipops. You want to teach a robot how to figure out what kind of candy it is looking at just from its shape and color. This is a problem called multiclass classification.
Now, let's explain how we can use something called logistic regression to help the robot decide. Don't worry about the fancy name - it's just a way of making choices based on some numbers!

What is Logistic Regression?

Think of logistic regression as a way of answering yes-or-no questions. For example, if the robot asks:
  • Is this candy chocolate? - it gets an answer that's a number between 0 and 1, like 0.8 (which means it's 80% sure it's chocolate).
  • If the robot asks about gummy bears, it might get 0.1, which means it's only 10% sure.
But wait - we're dealing with three types of candy. So how can we handle more than one question at a time? That's where multiclass logistic regression comes in. For more information about logistic regression and how it works please check Logistic Regression for Binary Classification.

How Does Multiclass Logistic Regression Work?

Instead of asking just one question, the robot asks three:
  • Is this candy chocolate?
  • Is this candy a gummy bear?
  • Is this candy a lollipop?
The robot looks at the answers (let's call them probabilities) and picks the candy type with the highest probability. For example:
  • Chocolate - 0.7 (70%)
  • Gummy Bears - 0.2 (20%)
  • Lollipops - 0.1 (10%)
Since 0.7 is the biggest number, the robot decides it is chocolate.

The math behind the Multiclass Logistic Regression

Step 1: Features and weights
The robot uses the following parameters to calculate the probabilities:
  • \(x\) - Features of the candy (shape, color)
  • \(w\) - Weights corresponding to each feature, which determine their importance
  • \(b\) - a bias term to adjust the results.
  • \(e\) - Euler's number, a mathematical constant often used in probability and exponential calculations
  • \(K\) - the total number of candy types (e.g. 3)
The formula to calculate the score for each candy type \(j\) is: \begin{equation} z_j = \sum_i w_{j,i}x_i + b_j \end{equation} where:
  • \(z_j\) - the score for candy \(j\)
  • \(w_{j,i}\) - weight for feature \(i\) of candy type \(j\).
  • \(x_i\) - value of feature \(i\)
  • \(b_j\) - bias for candy type \(j\)
When the score is calculated, it is used in the softmax function to calculate the probability (a small numerical sketch of softmax is given after this list). The softmax function can be written as: \begin{equation} P(y = j) = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}} \end{equation} where:
  • \(P(y=j)\) - the probability of the candy being type \(j\)
  • \(z_j\) - the score for candy \(j\).
  • \(\sum_{k=1}^K e^{z_k}\) - the sum of exponential scores for all \(K\) candy types, ensuring the probabilities sum to 1.
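Here is the promised numerical sketch of the softmax function, using made-up scores for the three candy types (the scores are illustrative only):

import numpy as np

# Hypothetical scores for chocolate, gummy bears, and lollipops
z = np.array([2.0, 1.0, 0.1])

probabilities = np.exp(z) / np.sum(np.exp(z))
print("Probabilities:", probabilities)  # roughly [0.66, 0.24, 0.10]
print("Sum:", probabilities.sum())      # 1.0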

Example of multiclass logistic regression in Python

In this example we will train logistic regression on a multiclass dataset. The dataset, a small Candy dataset, will also be created within the example.
The first step is to import the required libraries. We will need NumPy (to create the dataset), the LogisticRegression method from the sklearn.linear_model module, and the train_test_split method from the sklearn.model_selection module. Finally, we will use the classification_report method from the sklearn.metrics module.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
The second step is to create the candy data. The X variable will contain the candy features (input variables), which are shape and color. The shape value 0 indicates a round candy, while the value 1 indicates a square-shaped candy. The color has three values, where 0 indicates red candy, 1 brown candy, and 2 yellow candy. So the first column in the dataset is shape and the second is color.
# Features: shape and color
X = np.array([
    [0, 0],  # Red round candy
    [1, 1],  # Brown square candy
    [0, 2],  # Yellow round candy
    [1, 0],  # Red square candy
    [0, 1],  # Brown round candy
    [1, 2]   # Yellow square candy
])
The label vector y (target variable) contains three classes, where 0 stands for chocolate, 1 for gummy bears, and 2 for lollipops.
y = np.array([0, 0, 2, 1, 0, 2])
Now that the dataset is defined, you can split it into train and test sets using the train_test_split method. The dataset (X, y) will be divided into train and test sets in a 70:30 ratio; to do that, we set the test_size parameter of train_test_split to 0.3. We also set random_state=42 so that the shuffle performed before splitting is reproducible.
After splitting, the train data (X_train, y_train) will be used to train the LogisticRegression algorithm. The test dataset (X_test, y_test) will be used to test the trained model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
model.fit(X_train, y_train)
If you execute the code written so far, nothing will be displayed. To show some results, we need to test the model. To do that, we use the built-in predict() function, with which the trained model predicts the output for the provided input. The output is stored under the variable name y_pred. This variable is used in the classification_report function, alongside the y_test values, to measure the performance of the trained model on unseen data.
y_pred = model.predict(X_test)
# Print results
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Chocolate", "Gummy Bears", "Lollipops"]))
The classification output is given below.
Classification Report:
              precision    recall  f1-score   support

   Chocolate       0.00      0.00      0.00       2.0
 Gummy Bears       0.00      0.00      0.00       0.0
   Lollipops       0.00      0.00      0.00       0.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0
The results are all 0 because the test dataset contains only two samples, both belonging to class 0 (Chocolate), and the trained model misclassified both of them.
Finally, we will give the robot a new candy to classify. In this case, we define a new candy sample with round shape and brown color (0, 1).
new_candy = np.array([[0, 1]])  # Brown round candy
prediction = model.predict(new_candy)
print("Predicted Candy:", prediction)
The output of the previous code block is given below.
Predicted Candy: [2]
So the logistic regression predicted that the brown round candy belongs to class 2. However, the true value should be 0, since in the initial dataset the same sample has the label 0, i.e. it belongs to class 0. With only four training samples, such misclassifications are to be expected.
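Since the post describes the robot as picking the class with the highest probability, it can be instructive to look at the probabilities themselves rather than only the predicted label. A minimal follow-up, assuming the trained model and new_candy from the code above are still in memory:

# Class probabilities for the new candy (columns follow model.classes_)
probabilities = model.predict_proba(new_candy)
print("Classes:", model.classes_)
print("Class probabilities:", probabilities)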

Wednesday, January 1, 2025

Polynomial regression in Scikit-learn

Imagine you're trying to draw a line to connect a bunch of dots on a piece of paper. If all the dots are roughly in a straight line, you just draw a straight line, right? If a straight line goes through all the dots, or at least most of them, then that's linear regression.
But what do you do when the dots form a curve, like the shape of a hill or a rollercoaster? A straight line won't fit very well. Instead, we need a bendy line that can go up and down to match the curve. That's where polynomial regression comes in!

What is Polynomial Regression?

Polynomial regression is like upgrading from a straight ruler to a flexible ruler that can bend. Instead of just fitting a straight line (\(y = mx + c\)), you use a formula that can be written as: \begin{equation} y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_nx^n \end{equation} where:
  • \(x\) - this is the input (like dots on the paper)
  • \(y\) - This is the output (the line you're drawing)
  • \(a_0, a_1, a_2, \ldots, a_n\) - These are numbers (coefficients) that the math figures out to make the line fit the dots.
  • \(x^2,x^3,...,x^n\) - These make the line bend. The higher the power (n), the more bendy the line can be.
The natural question is: when should polynomial regression be used? Polynomial regression is appropriate when:
  1. The data doesn't fit a straight line but follows a curve
  2. You notice patterns like ups and downs (e.g. growth trends, hills, valleys)
  3. You want a model that's simple but flexible enough to capture curves.

How to use polynomial regression?

The application of polynomial regression will be illustrated with the following example:
  1. Look at the data - Suppose you're measuring how fast a toy car rolls down a hill over time. The speed might increase slowly at first, then zoom up fast. The graph of this data could look like a curve.
  2. Pick a polynomial degree (\(n\)) - The idea is to start from the lowest degree (\(n=2\)), a simple bendy line (a parabola). If that's not curvy enough, try \(n=3\), \(n=4\), etc. But don't make it too bendy, or it might wiggle too much and fit random noise instead of the real pattern.
  3. Fit the equation - Use a computer to calculate the coefficients (\(a_0\), \(a_1\), \(a_2\),...) that make the line match your data as closely as possible.
  4. Check the fit - Does the line match the dots? If not, adjust the degree of the polynomial.

Key Things to Remember

  1. Don't overdo it: If you make the polynomial too bendy (\(n\) too high), it will try to fit every single dot perfectly, even the random little bumps (noise). That's bad because it won't work very well on the new data due to overfitting.
  2. Balance simplicity and accuracy - find the lowest degree \(n\) that fits the curve well.
It’s like building a toy car track. Sometimes a straight ramp is enough, but other times you need to add curves to make it exciting! That’s the magic of polynomial regression.

Example 1 - Estimating plant growth based on exposure to sunlight.

You’re trying to figure out the relationship between the number of hours a plant gets sunlight (x) and how tall it grows (y). Your measurements are:
\(x\) (hours of sunlight) | \(y\) (height in cm)
1 | 2
2 | 6
3 | 10
4 | 18
5 | 26
The data from the table is graphically shown in Figure 1.
Figure 1 - Height in cm versus hours of sunlight
From Figure 1 it can be seen that the points cannot be fitted well with a straight line. So, we will try polynomial regression of degree 2 (\(y = a_0 + a_1 x + a_2 x^2\)).

Step 1: Set up the equation

For degree 2, the equation in general form can be written as: \begin{equation} y = a_0 + a_1 x + a_2 x^2 \end{equation} In the previous equation we have to find \(a_0\), \(a_1\), and \(a_2\), which are called the intercept, the linear term, and the quadratic term.

Step 2: Organize the data

\(x\) | \(y\) | \(x^2\)
1 | 2 | 1
2 | 6 | 4
3 | 10 | 9
4 | 18 | 16
5 | 26 | 25

Step 3: Write the system of equations

To solve for \(a_0\), \(a_1\), and \(a_2\), we use the normal equations derived from least squares:
  1. Sum of \(y\): \begin{equation} \sum y = na_0 + a_1\sum x + a_2\sum x^2 \end{equation}
  2. Sum of \(xy\): \begin{equation} \sum xy = a_0\sum x + a_1\sum x^2 + a_2\sum x^3 \end{equation}
  3. Sum of \(x^2y\): \begin{equation} \sum x^2y = a_0\sum x^2 + a_1\sum x^3 + a_2\sum x^4 \end{equation}

Step 4: Plug in the data

Now we have to calculate all the sums. \begin{equation} \sum x = 1+2+3+4+5 = 15 \end{equation} \begin{equation} \sum x^2 = 1+4+9+16+25 = 55 \end{equation} \begin{equation} \sum x^3 = 1+8+27+64+125 = 225 \end{equation} \begin{equation} \sum x^4 = 1 + 16 + 81 + 256 + 625 = 979 \end{equation} \begin{equation} \sum y = 2 + 6 + 10 + 18 + 26 = 62 \end{equation} \begin{equation} \sum xy = 1\cdot 2 + 2 \cdot 6 + 3 \cdot 10 + 4 \cdot 18 + 5 \cdot 26 = 246 \end{equation} \begin{equation} \sum x^2 y = 1 \cdot 2 + 4 \cdot 6 + 9 \cdot 10 + 16 \cdot 18 + 25 \cdot 26 = 1054 \end{equation} Substituting the obtained sums into the equations for \(\sum y\), \(\sum xy\), and \(\sum x^2 y\) gives the following linear equations: \begin{eqnarray} 62 &=& 5a_0 + 15a_1 + 55a_2\\ \nonumber 246 &=& 15a_0 + 55a_1 + 225 a_2 \\ \nonumber 1054 &=& 55a_0 + 225a_1 + 979 a_2 \end{eqnarray} These three equations can be solved manually or with a calculator. Subtracting 3 times the first equation from the second gives \(10a_1 + 60a_2 = 60\), i.e. \(a_1 + 6a_2 = 6\); subtracting 11 times the first equation from the third gives \(60a_1 + 374a_2 = 372\). Solving these two equations by substitution yields \(a_2 = \frac{6}{7} \approx 0.857\) and \(a_1 = \frac{6}{7} \approx 0.857\), and back-substitution into the first equation gives \(a_0 = 0.4\): \begin{equation} a_0 = 0.4, \quad a_1 \approx 0.857, \quad a_2 \approx 0.857 \end{equation}

Step 5: Write the final equation

The polynomial regression equation can therefore be written as: \begin{equation} y = 0.4 + 0.857x + 0.857x^2 \end{equation} The fit is graphically shown in Figure 2.
Figure 2 - approximation of data using polynomial regression.
As seen from Figure 2, using polynomial regression we have obtained a function that successfully approximates the points shown in blue. The model is not overfitted, since the curve does not pass exactly through every single data sample.

Step 6: Use the equation

Now you can predict the plant height for any number of sunlight hours. For example, for \(x = 6\) (6 hours of sunlight), the predicted plant height is: \begin{equation} y = 0.4 + 0.857\cdot 6 + 0.857\cdot 6^2 \approx 36.4 \,[\mathrm{cm}] \end{equation}
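Since this post is about polynomial regression in scikit-learn, the same fit can be reproduced with PolynomialFeatures and LinearRegression. A minimal sketch using the data from the table above is given below; the printed values should match the hand-derived coefficients up to rounding:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Data from the table above
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 6, 10, 18, 26])

# Build the [x, x^2] feature matrix and fit ordinary least squares
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)

print("Intercept:", model.intercept_)   # approximately 0.4
print("Coefficients:", model.coef_)     # approximately [0.857, 0.857]
print("Prediction for 6 hours of sunlight:", model.predict(poly.transform([[6]])))  # approximately 36.4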