Wednesday, February 26, 2025

Adding Interaction Terms to Linear Regression

Interaction terms in linear regression models capture the joint effect of two or more predictor variables. They are essential when the effect of one predictor on the response depends on the value of another predictor, so that the predictors do not contribute purely additively. In this post, we will explore how to add interaction terms to a linear regression model using Python and scikit-learn.

Understanding Interaction Terms

In a basic linear regression model, each predictor is assumed to contribute to the target independently of the other predictors. In many real-world situations this assumption does not hold: the effect of one predictor variable on the target may depend on the level of another predictor variable. In such cases, adding interaction terms helps the model capture these relationships more accurately.

Consider the following general linear regression equation:

\begin{equation} y=\beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + \varepsilon \end{equation}

Where:

  • y is the target variable
  • x1 and x2 are predictor variables
  • β0, β1, and β2 are coefficients
  • ε is the error term

To include an interaction between x1 and x2, the model becomes:

\begin{equation} y=\beta_0 + \beta_1\cdot x_1 + \beta_2\cdot x_2 + \beta_3 \cdot (x_1\cdot x_2)+\varepsilon \end{equation}

Here, β3 represents the coefficient of the interaction term (x1 * x2). This term allows the model to account for the combined effect of x1 and x2 on the target variable y.
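To make this concrete, here is a minimal sketch (separate from the housing example below) that simulates data with a known interaction effect and recovers the coefficients with scikit-learn's LinearRegression. Note that once the interaction term is present, the marginal effect of x1 on y is β1 + β3 · x2, so it changes with the level of x2.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# True model: y = 1 + 2*x1 - 1*x2 + 0.5*(x1*x2) + noise
y = 1 + 2 * x1 - 1 * x2 + 0.5 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Build the design matrix with the interaction column added by hand
X_sim = np.column_stack([x1, x2, x1 * x2])

model = LinearRegression().fit(X_sim, y)
print(model.intercept_)  # approximately 1.0 (beta_0)
print(model.coef_)       # approximately [2.0, -1.0, 0.5] (beta_1, beta_2, beta_3)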

Adding Interaction Terms in Python

Now let's look at an example where we add interaction terms to a linear regression model using Python and the scikit-learn library. We will use the popular Boston housing dataset to predict the price of houses based on features like crime rate, average number of rooms, and distance to employment centers, among others.

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# from sklearn.datasets import load_boston  # only available in scikit-learn < 1.2.0
from sklearn.datasets import fetch_openml  # use to fetch the Boston dataset with scikit-learn >= 1.2.0
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Load the Boston dataset
# boston = load_boston()  # only available in scikit-learn < 1.2.0
boston = fetch_openml(name="boston", version=1, as_frame=True)  # scikit-learn >= 1.2.0
X = boston.data  # already a pandas DataFrame because of as_frame=True
y = boston.target
# Depending on your environment, CHAS and RAD may load as categorical; cast everything to float
X = X.astype(float)

# Add interaction terms using PolynomialFeatures
# degree=2 by default: the output keeps the original features and adds all pairwise
# products; interaction_only=True drops the squared terms, include_bias=False the constant column
poly = PolynomialFeatures(interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.3, random_state=42)

# Create and train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
        
Note that the Boston housing dataset is no longer bundled with scikit-learn as of version 1.2.0. However, you can still download it from the openml.org repository using the fetch_openml function from the sklearn.datasets module. After importing the necessary libraries, fetch the dataset like this:

boston = fetch_openml(name="boston", version=1, as_frame=True)
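If you need code that runs under both old and new scikit-learn versions, one possible pattern (a sketch, not from the original example) is an import fallback:

# Sketch: load the Boston dataset regardless of scikit-learn version
try:
    from sklearn.datasets import load_boston  # removed in scikit-learn 1.2.0
    boston = load_boston()  # .data is a NumPy array here
except ImportError:
    from sklearn.datasets import fetch_openml
    boston = fetch_openml(name="boston", version=1, as_frame=True)  # .data is a DataFrame here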

Explanation of the Code

In this code, we:

  • Import the necessary libraries, including scikit-learn's LinearRegression, PolynomialFeatures, and fetch_openml for loading the Boston dataset.
  • Load the Boston housing dataset and separate the features (X) from the target variable (y).
  • Use PolynomialFeatures with interaction_only=True to generate the pairwise interaction terms alongside the original features (squared terms are excluded, and include_bias=False drops the constant column).
  • Split the data into training and testing sets, train a linear regression model, and evaluate its performance with the mean squared error (MSE) metric.
Running the previous code produces the following result:
Mean Squared Error: 16.320305223697712
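If you want to check exactly which interaction columns were generated, PolynomialFeatures can report its output feature names (get_feature_names_out is available in scikit-learn 1.0 and later):

# The output is the 13 original features plus all pairwise products
feature_names = poly.get_feature_names_out(X.columns)
print(len(feature_names))   # 13 originals + 13*12/2 = 78 interactions = 91 columns
print(feature_names[:5])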

Conclusion

Interaction terms can significantly improve the performance of a linear regression model when the relationship between predictors is not purely additive. By using techniques such as PolynomialFeatures in scikit-learn, you can easily add interaction terms and enhance your model’s predictive power. However, it's essential to avoid overfitting by carefully selecting interaction terms and evaluating the model on a separate test set.

When to Use Interaction Terms

Interaction terms should be used when you believe that the effect of one predictor variable depends on the value of another predictor. Be cautious about adding too many interaction terms, however, as this can lead to overfitting. Always evaluate your model using cross-validation and held-out test data to ensure it generalizes, as in the sketch below.
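As a quick illustration (a sketch that reuses the X and y loaded above), you can compare the cross-validated error of models with and without interaction terms using a Pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Baseline: plain linear regression on the original features
baseline = LinearRegression()

# Same model with pairwise interaction terms added first
with_interactions = make_pipeline(
    PolynomialFeatures(interaction_only=True, include_bias=False),
    LinearRegression(),
)

# 5-fold cross-validated MSE for both models (scikit-learn negates the score by convention)
for name, estimator in [("baseline", baseline), ("with interactions", with_interactions)]:
    scores = cross_val_score(estimator, X, y, cv=5, scoring="neg_mean_squared_error")
    print(name, -scores.mean())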
