Decision Trees are not only useful for classification tasks but also for regression problems. In this post, we will focus on Decision Trees for regression, where the goal is to predict a continuous target variable based on input features. Let's dive into how Decision Trees work for regression tasks and implement them in Python using Scikit-learn.
What is a Decision Tree for Regression?
A Decision Tree for regression works similarly to a classification tree, except that instead of classifying data into categories, it predicts a continuous value. The tree splits the data at each internal node based on feature values and predicts the average value of the target variable within the leaf nodes. These splits are chosen to minimize the variance of the target variable in each resulting node.
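To make the split criterion concrete, here is a minimal sketch (not Scikit-learn's actual implementation) of how a regression tree scores one candidate split: it computes the sample-weighted variance of the target on each side and keeps the threshold that minimizes it. The toy feature values and targets below are made up for illustration.

```python
import numpy as np

def split_score(x, y, threshold):
    """Weighted variance of the target after splitting feature x at threshold.

    Lower is better: the tree keeps the split that minimizes this quantity.
    """
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)

# Toy data: one feature and a continuous target (values are made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 3.2, 2.9, 3.1])

# Candidate thresholds: midpoints between consecutive feature values
thresholds = (x[:-1] + x[1:]) / 2
best = min(thresholds, key=lambda t: split_score(x, y, t))
print(f"Best threshold: {best}")  # 3.5, separating the low and high target groups
```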
Key Features of Decision Trees for Regression:
- Non-linear relationships: Decision Trees can model complex, non-linear relationships between the input features and the target variable.
- No need for feature scaling: Decision Trees do not require feature normalization or scaling, which simplifies preprocessing compared to scale-sensitive models such as regularized linear regression.
- Overfitting risk: Like in classification, Decision Trees for regression can overfit the data if the tree is too deep.
Advantages and Disadvantages of Decision Trees for Regression
Advantages:
- Simple to understand and interpret.
- Can handle both numerical and categorical features.
- Non-linear models that can capture complex patterns.
- No need for feature scaling or normalization.
Disadvantages:
- Prone to overfitting if not properly pruned.
- Instability: small changes in the data can lead to large changes in the tree structure.
- Often less accurate than ensemble methods such as Random Forests or Gradient Boosting Machines.
Decision Tree Regressor in Python
In this section, we will implement a Decision Tree regressor using Scikit-learn and demonstrate how to use it to predict continuous values.
1. Load the Dataset
For this example, we will use the California Housing dataset, which contains features describing California housing districts; the target variable is the median house value.

```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.head()
```

The previous code consists of the following steps:
- `data = fetch_california_housing()`: loads the California Housing dataset using the `fetch_california_housing` function from `sklearn.datasets`. The dataset contains features related to housing prices in California.
- `X = data.data`: assigns the input features of the dataset to `X`.
- `y = data.target`: assigns the target values (median house values) to `y`.
- `df = pd.DataFrame(X, columns=data.feature_names)`: creates a pandas DataFrame from the input features `X`, assigning the feature names from `data.feature_names` to the columns for better readability.
- `df['target'] = y`: adds the target column (house values) to the DataFrame as the last column, labeled "target".
- `df.head()`: displays the first five rows of the DataFrame for a preview of the data, including both the features and the target column. A quick sanity check on the loaded data is sketched below.
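As an optional addition (not part of the original walkthrough), you can confirm the data loaded as expected before modeling. The shape and the absence of missing values are properties of the California Housing dataset itself.

```python
# Quick sanity checks on the loaded data
print(df.shape)               # (20640, 9): 20640 districts, 8 features plus the target
print(data.feature_names)     # names of the 8 input features
print(df.isna().sum().sum())  # 0: this dataset ships without missing values
df.describe()                 # summary statistics for each column
```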
2. Train the Decision Tree Regressor
Now, let's train a Decision Tree model using Scikit-learn's `DecisionTreeRegressor`. We will first split the dataset into training and testing sets and then train the regressor.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Decision Tree regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)
```

The previous code block consists of the following steps:
- `from sklearn.model_selection import train_test_split`: imports the `train_test_split` function, which splits a dataset into training and testing sets in a user-specified ratio.
- `from sklearn.tree import DecisionTreeRegressor`: imports the Decision Tree Regressor algorithm from the `sklearn.tree` module.
- `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`: splits the data into training and testing sets. 30% of the data is reserved for testing (`test_size=0.3`), and `random_state=42` ensures reproducibility of the split.
- `dt_regressor = DecisionTreeRegressor(random_state=42)`: initializes a `DecisionTreeRegressor` model, setting `random_state=42` to ensure reproducible results.
- `dt_regressor.fit(X_train, y_train)`: trains the Decision Tree regressor on the training data (`X_train` and `y_train`) so it learns the relationship between the features and the target values. A quick way to inspect the size of the resulting tree is sketched below.
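With no depth limit, the regressor keeps splitting until its leaves are nearly pure, so the fitted tree can be very large. As an optional check, the two accessor methods below are standard on Scikit-learn tree estimators.

```python
# Inspect the size of the unpruned tree; a large depth and leaf count
# foreshadows the overfitting discussed later in this post
print("Tree depth:", dt_regressor.get_depth())
print("Number of leaves:", dt_regressor.get_n_leaves())
```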
3. Evaluate the Model
After training the model, let's evaluate its performance by predicting values on the test set and calculating the Mean Squared Error (MSE).

```python
from sklearn.metrics import mean_squared_error

# Make predictions on the test set
y_pred = dt_regressor.predict(X_test)

# Calculate the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
```

The previous code consists of the following steps:
- `y_pred = dt_regressor.predict(X_test)`: uses the trained `dt_regressor` to make predictions on the test set (`X_test`).
- `mse = mean_squared_error(y_test, y_pred)`: computes the Mean Squared Error (MSE) by comparing the true values (`y_test`) with the predicted values (`y_pred`). The MSE is the average squared difference between predicted and actual values, so lower values indicate better performance.
- `print(f"Mean Squared Error: {mse:.2f}")`: displays the calculated MSE, rounded to two decimal places.
Mean Squared Error: 0.53
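Because the MSE is expressed in squared target units, it can be hard to interpret on its own. As an optional extension of the evaluation step, you can also report the root mean squared error, which is back in the target's own units, and the R² score; both rely only on standard NumPy and Scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import r2_score

# RMSE: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)

# R^2: fraction of the target's variance explained by the model
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.2f}")
print(f"R^2: {r2:.2f}")
```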
4. Visualize the Decision Tree
One of the advantages of Decision Trees is their interpretability. We can visualize the trained regression tree using Scikit-learn's `plot_tree` function.

```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Visualize the decision tree
plt.figure(figsize=(12,8))
plot_tree(dt_regressor, filled=True, feature_names=data.feature_names, rounded=True)
plt.show()
```

The previous code block consists of the following steps:
- `plt.figure(figsize=(12,8))`: creates a new figure with a specified size (12 by 8 inches) so the decision tree is displayed clearly.
- `plot_tree(dt_regressor, filled=True, feature_names=data.feature_names, rounded=True)`: visualizes the trained decision tree model (`dt_regressor`) with the following options:
  - `filled=True`: fills the nodes with colors that reflect the predicted values (this is a regression tree, so nodes carry values rather than class probabilities).
  - `feature_names=data.feature_names`: labels the features used at each split node.
  - `rounded=True`: rounds the corners of the nodes for a cleaner, more readable tree.
- `plt.show()`: displays the generated decision tree plot.
The plot above shows the decision tree and how it splits the data at each node. Unlike classification trees, the leaf nodes show predicted values (rather than class labels), computed as the average target value of the training points that fall in each leaf. You can verify this directly, as sketched below.
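This check is an optional addition: it uses the public `apply` method to find the leaf each training sample lands in, and the lower-level `tree_` attribute (an implementation detail of Scikit-learn trees) to read the value stored in one leaf, which should match the mean target of that leaf's training samples.

```python
# Leaf index that each training sample is routed to
leaf_ids = dt_regressor.apply(X_train)

# Pick one leaf and compare its stored prediction with the mean target
# of the training samples that fall into it
leaf = leaf_ids[0]
stored_value = dt_regressor.tree_.value[leaf][0][0]
mean_target = y_train[leaf_ids == leaf].mean()
print(f"Stored leaf value: {stored_value:.3f}")
print(f"Mean target of leaf samples: {mean_target:.3f}")
```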
Pruning the Decision Tree
Just like in classification, Decision Trees for regression can overfit the data if the tree is too deep. One way to reduce overfitting is to prune the tree, either by setting a maximum depth or by requiring a minimum number of samples in a leaf node.
Prune the Tree Using Maximum Depth
Let's prune the tree by setting a maximum depth to prevent overfitting.

```python
# Train the decision tree with a maximum depth of 5
dt_regressor_pruned = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_regressor_pruned.fit(X_train, y_train)

# Visualize the pruned decision tree
plt.figure(figsize=(12,8))
plot_tree(dt_regressor_pruned, filled=True, feature_names=data.feature_names, rounded=True)
plt.show()
```

The previous code block consists of the following steps:
- `dt_regressor_pruned = DecisionTreeRegressor(max_depth=5, random_state=42)`: initializes a decision tree regressor with a maximum depth of 5 to limit the size of the tree and prevent overfitting. `random_state=42` ensures reproducible results.
- `dt_regressor_pruned.fit(X_train, y_train)`: fits the pruned decision tree model to the training data (`X_train`, `y_train`).
- `plt.figure(figsize=(12,8))`: creates a new figure with a specified size (12 by 8 inches) so the pruned tree is displayed clearly.
- `plot_tree(dt_regressor_pruned, filled=True, feature_names=data.feature_names, rounded=True)`: visualizes the pruned decision tree with the same options as before: `filled=True` colors the nodes by predicted value, `feature_names=data.feature_names` labels the split features, and `rounded=True` rounds the node corners.
- `plt.show()`: displays the pruned decision tree plot. A comparison of the pruned and unpruned models on the test set is sketched below.
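To confirm that pruning actually helps on held-out data, here is an optional comparison sketch. It evaluates the unpruned and depth-pruned trees on the test set, and also tries the other pruning knob mentioned above, `min_samples_leaf`; the value 20 is an arbitrary choice for illustration, not a recommendation.

```python
from sklearn.metrics import mean_squared_error

# Alternative pruning: require at least 20 training samples per leaf
dt_regressor_leaf = DecisionTreeRegressor(min_samples_leaf=20, random_state=42)
dt_regressor_leaf.fit(X_train, y_train)

# Compare test-set MSE across the three models
models = [("unpruned", dt_regressor),
          ("max_depth=5", dt_regressor_pruned),
          ("min_samples_leaf=20", dt_regressor_leaf)]
for name, model in models:
    mse_model = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: MSE = {mse_model:.2f}")
```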
Conclusion
Decision Trees for regression are a powerful tool for predicting continuous target variables. They are easy to interpret and can model non-linear relationships in the data. However, they are prone to overfitting, which can be mitigated through pruning or using ensemble methods like Random Forests.
In this post, we have covered the basics of Decision Trees for regression, including how to implement and evaluate a Decision Tree regressor using Scikit-learn, as well as how to visualize and prune the tree to avoid overfitting. Try experimenting with different datasets and pruning strategies to see how you can improve the performance of your regression models!