Monday, March 3, 2025

Random Forests for Classification

Random Forest is a powerful ensemble learning algorithm that improves classification performance by combining multiple decision trees. It reduces overfitting and increases accuracy by leveraging the power of randomness in data selection and tree construction.

1. What is a Random Forest?

A Random Forest is a machine learning algorithm that belongs to the ensemble learning family, meaning it combines multiple models to improve predictive accuracy and reduce overfitting. Specifically, it is an extension of decision trees, where a large number of decision trees are trained on different subsets of the data, and their outputs are aggregated to produce the final prediction. Each tree in the Random Forest is built using a random selection of features and a random subset of training data, often sampled with replacement (a technique called bootstrapping).

For classification tasks, the final output is determined by majority voting among the trees, while for regression tasks, it is the average of the individual tree predictions. The main advantages of Random Forest include its ability to handle large datasets with high dimensionality, its robustness to noise and overfitting, and its capability to capture complex patterns in the data. It is widely used in various applications such as finance, healthcare, image recognition, and fraud detection due to its strong performance and ease of implementation.
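To make the bootstrapping and majority-voting ideas concrete, here is a minimal, purely illustrative sketch that builds a small forest by hand; scikit-learn's RandomForestClassifier does all of this internally and more efficiently.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" adds the per-split feature randomness used by Random Forests
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across the 10 trees for the first sample
votes = np.array([int(t.predict(X[:1])[0]) for t in trees])
print("Votes:", votes, "-> predicted class:", np.bincount(votes).argmax())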

2. Loading and Preparing the Dataset

The Iris dataset is a well-known dataset in machine learning, commonly used for classification tasks. It contains 150 samples of iris flowers, categorized into three species: Setosa, Versicolor, and Virginica. Each sample has four features—sepal length, sepal width, petal length, and petal width—which help distinguish between the species.

To demonstrate how to train a Random Forest classifier using this dataset, we first need to load the data and preprocess it, ensuring it is formatted correctly for training. We then split the dataset into training and testing sets to evaluate the model’s performance. Next, we create a Random Forest classifier by specifying parameters such as the number of trees in the forest, the maximum depth of each tree, and the criteria for splitting nodes. The classifier is then trained on the training data using an ensemble of decision trees, each built from a random subset of the dataset and features. Once trained, the model is tested on the unseen test data to assess its accuracy and generalization ability. By aggregating predictions from multiple trees, the Random Forest classifier reduces variance and prevents overfitting, resulting in a robust and reliable model. This approach makes it an excellent choice for real-world classification problems, where data may be noisy or complex.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
  • from sklearn.datasets import load_iris
    • Imports the load_iris function from the sklearn.datasets module.
    • This function is used to load the famous Iris dataset, which is commonly used for classification tasks.
  • from sklearn.model_selection import train_test_split
    • Imports the train_test_split function from the sklearn.model_selection module.
    • This function is used to split the dataset into training and testing sets.
  • from sklearn.ensemble import RandomForestClassifier
    • Imports the RandomForestClassifier from the sklearn.ensemble module.
    • This is the machine learning model that will be trained to classify iris species based on their features.
  • from sklearn.metrics import accuracy_score
    • Imports the accuracy_score function from the sklearn.metrics module.
    • This function will be used to evaluate the model's performance by comparing predicted and actual values.
  • import numpy as np
    • Imports the NumPy library, a fundamental package for numerical computing in Python.
    • It provides support for large, multi-dimensional arrays and various mathematical functions.
  • iris = load_iris()
    • Loads the Iris dataset and stores it in the variable iris.
    • The dataset contains flower measurements and their corresponding species labels.
  • X, y = iris.data, iris.target
    • Extracts the feature data (X) and target labels (y) from the iris dataset.
    • X contains numerical measurements (sepal length, sepal width, petal length, and petal width).
    • y contains the class labels (0 for Setosa, 1 for Versicolor, and 2 for Virginica).
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Splits the dataset into training and testing sets using the train_test_split function.
    • X_train and y_train contain 80% of the data, used for training.
    • X_test and y_test contain 20% of the data, used for testing.
    • The test_size=0.2 argument specifies that 20% of the data should be reserved for testing.
    • The random_state=42 ensures that the split is reproducible by setting a fixed random seed.
Executing the previous code produces no output, because we have not used any print function. So far we have imported the necessary libraries, loaded the Iris dataset, and split the data for training and testing the Random Forest classifier. The next step is to define the Random Forest classifier model and train it on the training data.
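As a quick optional check before training (an addition, not part of the original code), you can print the array shapes to confirm the 80/20 split:

# Optional sanity check: verify the shapes produced by train_test_split
print("Training set:", X_train.shape, y_train.shape)  # expected: (120, 4) (120,)
print("Test set:", X_test.shape, y_test.shape)        # expected: (30, 4) (30,)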

3. Training a Random Forest Classifier

Now, let's train a Random Forest classifier with Scikit-learn.

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
    
  • # Train a Random Forest Classifier
    • This is a comment indicating that the following lines of code will train a Random Forest classifier.
  • clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    • Creates an instance of the RandomForestClassifier from Scikit-Learn.
    • n_estimators=100: Specifies that the Random Forest will consist of 100 decision trees.
    • max_depth=3: Limits the depth of each decision tree to 3 levels to prevent overfitting.
    • random_state=42: Ensures reproducibility by setting a fixed random seed.
  • clf.fit(X_train, y_train)
    • Trains (fits) the Random Forest model using the training data.
    • The model learns patterns in X_train (features) to map them to y_train (labels).
  • # Make predictions
    • This is a comment indicating that the following lines of code will make predictions using the trained model.
  • y_pred = clf.predict(X_test)
    • Uses the trained model to predict the class labels for the test dataset X_test.
    • The predicted labels are stored in the variable y_pred.
  • # Evaluate accuracy
    • This is a comment indicating that the following lines of code will evaluate the model's accuracy.
  • accuracy = accuracy_score(y_test, y_pred)
    • Calculates the accuracy of the model by comparing predicted labels (y_pred) with actual labels (y_test).
    • The accuracy score represents the proportion of correct predictions made by the model.
  • print(f"Model Accuracy: {accuracy:.4f}")
    • Prints the accuracy of the model formatted to four decimal places.
    • The f-string is used for string formatting, making the output more readable.
After the code written so far is executed, the only output obtained is
Model Accuracy: 1.0000
The result shows that the trained RFC has perfect classification performance on the test dataset. The next step in this investigation is to determine feature importance, i.e., which features contribute most to the label/output variable.

4. Feature Importance in Random Forest

Random Forests provide a built-in way to determine feature importance. This helps in understanding which features are most influential in classification.

import matplotlib.pyplot as plt

# Extract feature importances
importances = clf.feature_importances_
feature_names = iris.feature_names

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Random Forest")
plt.show()
    
  • import matplotlib.pyplot as plt
    • Imports the pyplot module from the Matplotlib library, which is used for data visualization.
    • This module provides functions to create various types of plots, such as bar charts, line graphs, and histograms.
  • # Extract feature importances
    • This is a comment indicating that the following lines of code will extract the feature importance values from the trained model.
  • importances = clf.feature_importances_
    • Retrieves the feature importance values from the trained Random Forest model.
    • Each value represents how much a particular feature contributes to the model's decision-making process.
  • feature_names = iris.feature_names
    • Extracts the names of the features from the Iris dataset.
    • The feature names include sepal length, sepal width, petal length, and petal width.
  • # Plot feature importance
    • This is a comment indicating that the following lines of code will generate a bar chart to visualize feature importance.
  • plt.figure(figsize=(8, 5))
    • Creates a new figure for the plot with a specified size of 8 inches by 5 inches.
    • This ensures that the plot is clear and well-sized for visualization.
  • plt.barh(feature_names, importances, color="skyblue")
    • Creates a horizontal bar chart where:
    • feature_names are placed on the y-axis.
    • importances (feature importance values) are represented on the x-axis.
    • The bars are colored skyblue for better visualization.
  • plt.xlabel("Feature Importance")
    • Labels the x-axis as "Feature Importance" to indicate what the values represent.
  • plt.ylabel("Feature")
    • Labels the y-axis as "Feature" to indicate that it represents the different features of the dataset.
  • plt.title("Feature Importance in Random Forest")
    • Sets the title of the plot to "Feature Importance in Random Forest" to describe the visualization.
  • plt.show()
    • Displays the plot, making the feature importance visualization visible.
After the previous code is executed, the plot shown in Figure 1 is obtained.
Figure 1 - Feature importance of the Iris dataset variables obtained using the trained RFC model.
The feature importance results indicate the relative contribution of each feature to the model's predictions. Among the four features, "petal length (cm)" and "petal width (cm)" hold the most significant importance, with values of 0.4522 and 0.4317, respectively. These two features dominate the decision-making process, suggesting that they provide the most information about the target variable. In contrast, "sepal length (cm)" and "sepal width (cm)" have much lower importance scores, with values of 0.1062 and 0.0099. This implies that these features contribute far less to the model's predictive ability compared to the petal dimensions. Overall, petal length and petal width appear to be the key drivers in distinguishing between the classes in this model.
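If you prefer the exact numbers to the chart, a short continuation of the code above pairs each feature name with its score and sorts them in descending order:

# Print the importances sorted from most to least important
for name, score in sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.4f}")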

5. Hyperparameter Tuning for Better Performance

To improve performance, we can tune hyperparameters using GridSearchCV. Grid search finds the optimal combination of selected RFC hyperparameters, in this case n_estimators, max_depth, and min_samples_split. The n_estimators parameter will take the values 50, 100, and 200; max_depth will take 3, 5, and 10; and min_samples_split will take 2, 5, and 10. The entire code for performing the grid search is shown below.

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Score: " , grid_search.best_score_)
    
  • Importing the GridSearchCV module: The code begins by importing GridSearchCV from the sklearn.model_selection module. This is a method used to search for the best combination of hyperparameters for a model.
  • Defining the parameter grid: The param_grid dictionary is created to define a range of values for each hyperparameter. In this case:
    • 'n_estimators': Number of trees in the forest, with possible values 50, 100, and 200.
    • 'max_depth': Maximum depth of each tree, with possible values 3, 5, and 10.
    • 'min_samples_split': Minimum number of samples required to split an internal node, with possible values 2, 5, and 10.
  • Performing grid search: GridSearchCV is initialized with the RandomForestClassifier, the param_grid, and other parameters:
    • cv=5: The number of cross-validation folds to use (5 in this case).
    • scoring='accuracy': The metric used to evaluate the model performance (accuracy in this case).
  • Fitting the model: The fit method is called on the grid search, using X_train and y_train as input. This will train the model using each combination of parameters defined in the param_grid.
  • Displaying best parameters: The best_params_ attribute of the grid_search object is printed to show the combination of hyperparameters that provided the best performance based on the grid search results.
  • Displaying best score: The best_score_ attribute of the grid_search object is printed to show the best accuracy achieved using RFC in GridSearchCV.
After the grid search is executed, the print functions display the best parameters and the highest cross-validated classification accuracy score.
  Best Parameters: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50}
Best Score: 0.95
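Keep in mind that best_score_ is the mean cross-validation accuracy on the training data, not a test-set result. As a follow-up (not shown in the original listing), the refitted best model can be pulled out via best_estimator_ and evaluated on the held-out test set:

# Evaluate the best grid-search model on the held-out test set
best_rf = grid_search.best_estimator_  # GridSearchCV refits this on the full training set by default
print(f"Test Accuracy of Best Model: {accuracy_score(y_test, best_rf.predict(X_test)):.4f}")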
  

6. Key Takeaways

  • Random Forests improve classification by reducing overfitting compared to single Decision Trees.
  • They provide feature importance values, aiding in feature selection.
  • Hyperparameter tuning helps in optimizing model performance.

By leveraging Random Forests, you can build robust classification models with improved accuracy and generalization!

Saturday, March 1, 2025

Feature Importance in Decision Trees


Decision Trees are widely used in machine learning because they provide not only high accuracy but also interpretability. One of the most valuable aspects of Decision Trees is their ability to rank feature importance, which helps in understanding which features contribute the most to predictions.

1. What is Feature Importance?

Feature importance measures how much each feature contributes to reducing impurity in a Decision Tree model. Scikit-learn provides an easy way to extract these values using the feature_importances_ attribute.

2. Loading and Preparing the Dataset

We will use the Iris dataset for this demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following code lines.
  • from sklearn.datasets import load_iris: This imports the load_iris function from the sklearn.datasets module, which loads the Iris dataset.
  • from sklearn.model_selection import train_test_split: This imports the train_test_split function from the sklearn.model_selection module, used to split data into training and testing sets.
  • from sklearn.tree import DecisionTreeClassifier: This imports the DecisionTreeClassifier from the sklearn.tree module, which is used to create a decision tree model for classification.
  • import numpy as np: This imports the numpy library with the alias np, which is used for numerical operations in Python.
  • import matplotlib.pyplot as plt: This imports the matplotlib.pyplot module with the alias plt, which is used for creating plots and visualizations.
  • # Load the dataset: A comment indicating that the next lines of code will load the dataset.
  • iris = load_iris(): This loads the Iris dataset into the iris variable. The dataset contains features (data) and target labels (target).
  • X, y = iris.data, iris.target: This splits the iris dataset into two variables: X (feature data) and y (target labels).
  • feature_names = iris.feature_names: This stores the feature names of the Iris dataset into the variable feature_names.
  • # Split into training and testing sets: A comment indicating that the data will be split into training and testing sets.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This splits the dataset into training and testing sets using the train_test_split function. test_size=0.2 means 20% of the data is used for testing, and random_state=42 ensures the split is reproducible.
The code written so far imports the libraries, loads the dataset, and prepares it for training and testing the Decision Tree classifier. The next step is to define the classification model and train it using the X_train and y_train values.

3. Training a Decision Tree Model

Let's train a Decision Tree classifier and extract feature importance values.

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Extract feature importances
importances = clf.feature_importances_

# Display feature importance values
for feature, importance in zip(feature_names, importances):
    print(f"{feature}: {importance:.4f}")
    
The previous code block consists of the following lines of code.
  • # Train a Decision Tree Classifier: A comment indicating that the next lines of code will train a decision tree classifier.
  • clf = DecisionTreeClassifier(max_depth=3, random_state=42): This creates an instance of the DecisionTreeClassifier with a maximum depth of 3 (to prevent overfitting) and a random_state=42 for reproducibility of results.
  • clf.fit(X_train, y_train): This trains the decision tree classifier on the training data X_train (features) and y_train (target labels).
  • # Extract feature importances: A comment indicating that the following line of code will extract the importance of each feature in the trained model.
  • importances = clf.feature_importances_: This retrieves the feature importance values from the trained decision tree model and stores them in the importances variable.
  • # Display feature importance values: A comment indicating that the next lines of code will display the feature importance values.
  • for feature, importance in zip(feature_names, importances):: This iterates through the feature_names (the names of the features) and importances (the importance values) simultaneously using the zip function.
  • print(f"{feature}: {importance:.4f}"): This prints each feature's name and its corresponding importance value, formatted to four decimal places.
After executing the code written so far, we obtain the following feature importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0000
petal length (cm): 0.9346
petal width (cm): 0.0654
The feature importances indicate how much each feature contributes to the decision tree's ability to classify the Iris dataset. The "sepal length (cm)" and "sepal width (cm)" features have a feature importance of 0.0000, meaning that these features have no significant contribution to the classification task in this specific decision tree model. On the other hand, "petal length (cm)" has a high feature importance of 0.9346, suggesting that it plays a dominant role in the model's decision-making process. "Petal width (cm)" also contributes to some extent with a feature importance of 0.0654, though it is less significant compared to "petal length." These values demonstrate that the decision tree heavily relies on the petal-related features for classification.

4. Visualizing Feature Importance

We can plot the feature importance values for better understanding.

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Decision Tree")
plt.show()
    
The previous code block consists of the following lines of code.
  • # Plot feature importance: A comment indicating that the following lines of code will plot the feature importance.
  • plt.figure(figsize=(8, 5)): This creates a new figure for the plot with a specified size of 8 inches by 5 inches using matplotlib.pyplot.
  • plt.barh(feature_names, importances, color="skyblue"): This creates a horizontal bar plot (barh) where the feature_names are plotted on the y-axis and the importances are plotted on the x-axis. The bars are colored "skyblue".
  • plt.xlabel("Feature Importance"): This sets the label for the x-axis as "Feature Importance".
  • plt.ylabel("Feature"): This sets the label for the y-axis as "Feature".
  • plt.title("Feature Importance in Decision Tree"): This sets the title of the plot as "Feature Importance in Decision Tree".
  • plt.show(): This displays the plot on the screen.

5. Interpreting Feature Importance

Higher feature importance values indicate that the feature has a greater influence on the model’s decisions. Features with very low importance can often be removed to simplify the model without significant loss in performance.

6. Using Feature Importance for Feature Selection

If some features have very low importance, we can remove them and retrain the model:

# Select important features (threshold of 0.1)
important_features = [feature for feature, importance in zip(feature_names, importances) if importance > 0.1]
print("Selected Features:", important_features)
    
The previous block of code consists of the following lines of code.
  • # Select important features (threshold of 0.1): A comment indicating that the following code will select features that have an importance greater than 0.1.
  • important_features = [feature for feature, importance in zip(feature_names, importances) if importance > 0.1]: This is a list comprehension that iterates over the feature_names and importances simultaneously using zip. It selects only those features where the importance is greater than 0.1 and stores them in the important_features list.
  • print("Selected Features:", important_features): This prints the list of selected important features to the console.
After executing the previous code the following output is obtained.
Selected Features: ['petal length (cm)']
The selected feature based on the importance threshold of 0.1 is "petal length (cm)." This means that, according to the decision tree model, "petal length" is the most important feature for classification, contributing significantly to the model's decision-making. Features like "sepal length (cm)" and "sepal width (cm)" were excluded because their feature importance was 0.0000, indicating that they do not provide valuable information for the classification task in this case. Thus, "petal length" is the key feature used by the model for making predictions.
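To close the loop, we can retrain the tree on just the selected columns and compare it against the full-feature model. The sketch below is a continuation of the code above (not in the original post) and derives the column indices from feature_names:

# Retrain on the selected features only and compare test accuracies
selected_idx = [i for i, name in enumerate(feature_names) if name in important_features]
clf_reduced = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_reduced.fit(X_train[:, selected_idx], y_train)
print(f"Full-feature accuracy:    {clf.score(X_test, y_test):.4f}")
print(f"Reduced-feature accuracy: {clf_reduced.score(X_test[:, selected_idx], y_test):.4f}")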

7. Key Takeaways

  • Feature importance helps in understanding which features contribute the most to model predictions.
  • We can use feature importance to remove irrelevant features and improve model efficiency.
  • Visualizing feature importance can aid in better interpretability.

By leveraging feature importance, we can make our models more interpretable and efficient!

Thursday, February 27, 2025

Avoiding Overfitting in Decision Trees Using max_depth and min_samples


Decision Trees are powerful models, but they tend to overfit when left unrestricted. Overfitting occurs when a model memorizes the training data instead of generalizing to unseen data. In this post, we will explore how to prevent overfitting using the max_depth, min_samples_split, and min_samples_leaf parameters in Scikit-learn.

1. What Causes Overfitting in Decision Trees?

Overfitting happens when a Decision Tree grows too deep, capturing noise instead of meaningful patterns. This results in:

  • High accuracy on training data but poor performance on test data.
  • Complex models with too many nodes and splits.
  • Reduced generalization to new data.

2. Loading and Preparing the Dataset

We will use the Iris dataset for this demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following code lines.
  • from sklearn.datasets import load_iris: This line imports the 'load_iris' function from the 'sklearn.datasets' module, which allows us to load the well-known Iris dataset.
  • from sklearn.model_selection import train_test_split: This imports the 'train_test_split' function from 'sklearn.model_selection'. It is used to split the dataset into training and testing sets.
  • from sklearn.tree import DecisionTreeClassifier: This imports the 'DecisionTreeClassifier' from 'sklearn.tree', which will be used to train a decision tree model for classification.
  • import numpy as np: This imports the 'numpy' library, which is useful for handling numerical operations, although it is not directly used in the code block shown.
  • # Load the dataset: This is a comment that explains the following line of code, where the Iris dataset is loaded using 'load_iris()'.
  • iris = load_iris(): This line loads the Iris dataset and stores it in the variable 'iris'. The dataset contains both the features (X) and the target labels (y).
  • X, y = iris.data, iris.target: Here, the feature data (X) and target labels (y) are extracted from the 'iris' object. 'X' contains the feature data, and 'y' contains the target labels (species of Iris).
  • # Split into training and testing sets: This is a comment indicating that the following code will split the data into training and testing subsets.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This line splits the dataset into training and testing sets. 80% of the data is used for training, and 20% is used for testing. The 'random_state' ensures that the split is reproducible.

3. Overfitting Example: Unrestricted Decision Tree

Let's train a Decision Tree without any restrictions.

# Train an unrestricted Decision Tree
clf_overfit = DecisionTreeClassifier(random_state=42)
clf_overfit.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_overfit.score(X_train, y_train)
test_accuracy = clf_overfit.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous code block consists of the following code lines.
  • # Train an unrestricted Decision Tree: This is a comment indicating that the following code will train a decision tree model without any restrictions (e.g., max depth).
  • clf_overfit = DecisionTreeClassifier(random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with the 'random_state' set to 42. This ensures reproducibility of the model's results.
  • clf_overfit.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The model is trained to learn the relationships between the features and target labels in the training dataset.
  • # Evaluate performance: This comment indicates that the following lines will evaluate the model's performance on both the training and testing datasets.
  • train_accuracy = clf_overfit.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset. The accuracy is the proportion of correct predictions made by the model.
  • test_accuracy = clf_overfit.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset. It evaluates how well the model generalizes to new, unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, also formatted to four decimal places.

Expected Outcome: High training accuracy but lower test accuracy, indicating overfitting.

After executing the code, the accuracies on both the training and test datasets turn out to be equal to 1.00, so the expected train/test gap does not actually appear here: the Iris dataset is small and easily separable, and even the unrestricted tree generalizes perfectly on this particular split.
Training Accuracy: 1.0000
Test Accuracy: 1.0000

4. Controlling Overfitting with max_depth

max_depth limits how deep the tree can grow. A lower depth prevents the tree from memorizing noise.

# Train a Decision Tree with limited depth
clf_depth = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_depth.score(X_train, y_train)
test_accuracy = clf_depth.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous code block consists of the following code lines.
  • # Train a Decision Tree with limited depth: This comment indicates that the following code will train a decision tree model with a restricted maximum depth to avoid overfitting.
  • clf_depth = DecisionTreeClassifier(max_depth=3, random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with a specified 'max_depth' of 3. This limits the depth of the decision tree to prevent it from growing too complex and overfitting the training data. The 'random_state' is set to 42 for reproducibility.
  • clf_depth.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The model learns the relationships between the features and target labels, but with a restricted tree depth.
  • # Evaluate performance: This comment indicates that the following lines will evaluate the performance of the model on both the training and testing datasets.
  • train_accuracy = clf_depth.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset, representing how well the model fits the data it was trained on.
  • test_accuracy = clf_depth.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset, evaluating the model's ability to generalize to new, unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, formatted to four decimal places.

Expected Outcome: Slightly lower training accuracy but improved test accuracy, reducing overfitting.

The expected outcome was confirmed after the Python code was executed: the accuracy on the training dataset is slightly lower than on the test dataset.
Training Accuracy: 0.9583
Test Accuracy: 1.0000

5. Controlling Overfitting with min_samples_split

min_samples_split controls the minimum number of samples needed to split a node. Increasing this value forces the tree to consider only significant splits.

# Train a Decision Tree with min_samples_split restriction
clf_split = DecisionTreeClassifier(min_samples_split=10, random_state=42)
clf_split.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_split.score(X_train, y_train)
test_accuracy = clf_split.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous code block consists of the following code lines.
  • # Train a Decision Tree with min_samples_split restriction: This comment indicates that the following code will train a decision tree model with a restriction on the minimum number of samples required to split an internal node. This restriction helps in controlling the model's complexity.
  • clf_split = DecisionTreeClassifier(min_samples_split=10, random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with a specified 'min_samples_split' of 10. This means that a node in the decision tree will only be split if it has at least 10 samples, thus preventing the model from making overly fine distinctions on small subsets of data. The 'random_state' is set to 42 for reproducibility.
  • clf_split.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The decision tree will build using the provided training data with the specified restriction on the minimum number of samples required to split a node.
  • # Evaluate performance: This comment indicates that the following lines will evaluate the model's performance on both the training and testing datasets.
  • train_accuracy = clf_split.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset, representing how well the model fits the data it was trained on.
  • test_accuracy = clf_split.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset, providing an evaluation of how well the model generalizes to unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, formatted to four decimal places.
When the code is executed, the Decision Tree classifier with min_samples_split=10 again yields a slightly lower accuracy on the training dataset and perfect accuracy on the test dataset.
Training Accuracy: 0.9583
Test Accuracy: 1.0000

6. Controlling Overfitting with min_samples_leaf

min_samples_leaf sets the minimum number of samples required to be at a leaf node. Larger values help in reducing overfitting.

# Train a Decision Tree with min_samples_leaf restriction
clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
clf_leaf.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_leaf.score(X_train, y_train)
test_accuracy = clf_leaf.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous block of code consists of the following lines of code.
  • clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42): This line initializes a DecisionTreeClassifier with the hyperparameter 'min_samples_leaf' set to 5, which means that each leaf node in the decision tree must have at least 5 samples. This restriction helps control overfitting by preventing the tree from creating nodes with very few samples. The random_state parameter is set to 42 to ensure reproducibility of the model's results.
  • clf_leaf.fit(X_train, y_train): This line trains the decision tree model ('clf_leaf') using the training dataset ('X_train' and 'y_train'). The decision tree is built based on the features (X_train) and target values (y_train) of the training data.
  • train_accuracy = clf_leaf.score(X_train, y_train): After training the model, this line calculates the accuracy of the model on the training data (X_train and y_train). It evaluates how well the model fits the data it was trained on.
  • test_accuracy = clf_leaf.score(X_test, y_test): This line calculates the accuracy of the model on the testing dataset (X_test and y_test). It evaluates how well the model generalizes to new, unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This line prints the training accuracy, rounded to four decimal places, to show how well the model performed on the training set.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): Similarly, this line prints the testing accuracy, rounded to four decimal places, to show the model's performance on the testing set.
When executed, the Decision Tree classifier with min_samples_leaf=5 yields a lower accuracy on the training dataset than on the test dataset.
Training Accuracy: 0.9500
Test Accuracy: 1.0000

7. Comparing the Models

Let's summarize how these hyperparameters affect overfitting.

models = {
    "Overfitted": clf_overfit,
    "Max Depth (3)": clf_depth,
    "Min Samples Split (10)": clf_split,
    "Min Samples Leaf (5)": clf_leaf
}

for name, model in models.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: Train Accuracy = {train_acc:.4f}, Test Accuracy = {test_acc:.4f}")
    
The previous code block consists of the following code lines.
  • models = { ... }: This line defines a dictionary called 'models', where each key is a description of a model, and the value is the corresponding trained model object. The models included are 'Overfitted', 'Max Depth (3)', 'Min Samples Split (10)', and 'Min Samples Leaf (5)'. Each model has different hyperparameter configurations applied.
  • for name, model in models.items():: This line starts a loop that iterates over each model in the 'models' dictionary. For each iteration, the variable 'name' will hold the description of the model, and 'model' will hold the actual trained model.
  • train_acc = model.score(X_train, y_train): Inside the loop, this line calculates the accuracy of the current model on the training dataset, which shows how well the model fits the training data.
  • test_acc = model.score(X_test, y_test): This line calculates the accuracy of the current model on the testing dataset, indicating how well the model generalizes to unseen data.
  • print(f"{name}: Train Accuracy = {train_acc:.4f}, Test Accuracy = {test_acc:.4f}"): This line prints the name of the model (e.g., 'Overfitted', 'Max Depth (3)', etc.) along with its corresponding training and testing accuracy, formatted to four decimal places.

Expected Outcome: The restricted models will have slightly lower training accuracy but significantly better test accuracy compared to the overfitted model.

8. Key Takeaways

To avoid overfitting in Decision Trees:

  • Use max_depth to limit tree growth and prevent memorization of noise.
  • Increase min_samples_split to ensure meaningful splits.
  • Set min_samples_leaf to avoid creating deep branches with few samples.

By fine-tuning these parameters, we can build a more generalizable model that performs well on unseen data.

Finally, when the code is executed, the following output is obtained.
Overfitted: Train Accuracy = 1.0000, Test Accuracy = 1.0000
Max Depth (3): Train Accuracy = 0.9583, Test Accuracy = 1.0000
Min Samples Split (10): Train Accuracy = 0.9583, Test Accuracy = 1.0000
Min Samples Leaf (5): Train Accuracy = 0.9500, Test Accuracy = 1.0000
The results of the Decision Tree models show varying levels of performance based on different hyperparameter restrictions.
The Overfitted model, which has no restrictions, achieved perfect accuracy on both the training and test sets, with a training accuracy of 1.0000 and a test accuracy of 1.0000. This suggests that the model has overfitted the training data, as it performs perfectly on both the training and test data, potentially failing to generalize well to new unseen data.
The Max Depth (3) model, which restricts the tree's depth to 3, performed slightly less well on the training set with a training accuracy of 0.9583 but still achieved perfect accuracy on the test set (1.0000). This indicates that limiting the depth of the tree helped prevent overfitting, allowing the model to generalize well to unseen data while still maintaining good performance on the training data.
Similarly, the Min Samples Split (10) model, which restricts the minimum number of samples required to split an internal node to 10, achieved the same performance as the Max Depth (3) model with a training accuracy of 0.9583 and a test accuracy of 1.0000. This suggests that increasing the minimum number of samples required to make a split also helped prevent overfitting, leading to similar generalization performance.
The Min Samples Leaf (5) model, which ensures that each leaf node contains at least 5 samples, showed the lowest training accuracy at 0.9500, but still achieved perfect accuracy on the test set (1.0000). This further confirms that restricting the number of samples in each leaf can slightly reduce the model’s ability to fit the training data perfectly but still does not hinder its ability to generalize well.
In summary, while all models achieved perfect test accuracy, the Overfitted model performed too well on the training data, indicating overfitting. The other models, which include restrictions like depth or minimum sample size, maintained a balance between good training accuracy and perfect test accuracy, reflecting improved generalization.
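Because the test split here contains only 30 samples, it is easy for all four models to tie at 1.0000. A more robust comparison, sketched below with cross_val_score (an addition, not part of the original post), averages accuracy over five folds of the full dataset:

from sklearn.model_selection import cross_val_score

# Compare the four configurations with 5-fold cross-validation
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: CV Accuracy = {scores.mean():.4f} (+/- {scores.std():.4f})")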

Hyperparameter Tuning for Decision Trees

Decision Trees are powerful machine learning models, but their performance heavily depends on the choice of hyperparameters. In this guide, we will explore how to optimize Decision Tree hyperparameters using Scikit-learn's GridSearchCV and RandomizedSearchCV.

1. Understanding Hyperparameters in Decision Trees

Key hyperparameters that affect Decision Tree performance include:

  • max_depth: Limits the depth of the tree to prevent overfitting.
  • min_samples_split: The minimum number of samples required to split an internal node.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node.
  • max_features: The number of features to consider when looking for the best split.
  • criterion: The function to measure the quality of a split (gini or entropy for classification, squared_error for regression).

2. Loading and Preparing the Dataset

We will use the famous Iris dataset for classification.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following lines of code.
  • Import the necessary libraries:
    • from sklearn.datasets import load_iris - Imports the load_iris function from sklearn.datasets to load the Iris dataset.
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function from sklearn.model_selection to split the dataset into training and testing sets.
    • from sklearn.tree import DecisionTreeClassifier - Imports the DecisionTreeClassifier from sklearn.tree to create a decision tree classifier model.
  • Load the Iris dataset:
    • iris = load_iris() - Loads the Iris dataset, which includes features (sepal length, sepal width, petal length, petal width) and target values (species of the iris).
    • X, y = iris.data, iris.target - Separates the dataset into the feature matrix X (input features) and the target vector y (species labels).
  • Split the data into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the dataset into training and testing sets:
      • X_train - Training feature matrix.
      • X_test - Testing feature matrix.
      • y_train - Training target vector.
      • y_test - Testing target vector.
      • test_size=0.2 - Specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.
      • random_state=42 - Ensures reproducibility by setting a seed for the random number generator.

3. Baseline Model without Tuning

Let's train a basic Decision Tree without tuning.

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
accuracy = clf.score(X_test, y_test)
print(f"Baseline Accuracy: {accuracy:.4f}")
    
The previous code block consists of the following lines of code.
  • Train a Decision Tree Classifier:
    • clf = DecisionTreeClassifier(random_state=42) - Initializes a Decision Tree Classifier with a random seed set to 42 for reproducibility.
    • clf.fit(X_train, y_train) - Fits the classifier to the training data (X_train for features and y_train for target labels).
  • Evaluate the model:
    • accuracy = clf.score(X_test, y_test) - Evaluates the model by calculating the accuracy on the test data. The score method returns the mean accuracy of the classifier on the given test data.
    • print(f"Baseline Accuracy: {accuracy:.4f}") - Prints the accuracy of the classifier on the test set, rounded to 4 decimal places.
After executing the code written so far the following output is obtained.
Baseline Accuracy: 1.0000

4. Hyperparameter Tuning using GridSearchCV

Grid Search performs an exhaustive search over a specified parameter grid.

from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Grid Search
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)
    
The previous code block consists of the following lines of code.
  • Define hyperparameter grid:
    • param_grid - A dictionary containing the hyperparameters to be tuned and their possible values for the Decision Tree Classifier. This includes:
      • 'max_depth': [3, 5, 10, None] - Specifies the maximum depth of the tree.
      • 'min_samples_split': [2, 5, 10] - Defines the minimum number of samples required to split an internal node.
      • 'min_samples_leaf': [1, 2, 4] - Defines the minimum number of samples required to be at a leaf node.
      • 'criterion': ['gini', 'entropy'] - Specifies the function to measure the quality of a split (Gini impurity or Entropy).
  • Grid Search:
    • grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1) - Performs grid search with 5-fold cross-validation (cv=5) to find the best combination of hyperparameters using accuracy as the scoring metric (scoring='accuracy'). The parameter n_jobs=-1 enables parallel computation.
    • grid_search.fit(X_train, y_train) - Fits the grid search model to the training data (X_train and y_train) to explore the hyperparameter space.
  • Best parameters and score:
    • print("Best Hyperparameters:", grid_search.best_params_) - Prints the hyperparameters that produced the best performance during grid search.
    • print("Best Accuracy:", grid_search.best_score_) - Prints the best accuracy achieved during the grid search.

Explanation:

  • The search runs on multiple combinations of hyperparameters.
  • The cv=5 argument performs 5-fold cross-validation.
  • The best parameters and accuracy are displayed after tuning.
After executing the code the following output is obtained.
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best Accuracy: 0.9583333333333334
The results obtained from the GridSearchCV on the DecisionTreeClassifier indicate that the optimal hyperparameters for the model are 'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, and 'min_samples_split': 2. The use of the entropy criterion suggests that the model makes splits based on information gain, which is often useful for creating more balanced decision boundaries. A maximum depth of 5 indicates that the tree has a moderate level of complexity, preventing overfitting while still capturing the necessary patterns in the data. The parameter 'min_samples_leaf': 4 means that each leaf node must have at least 4 samples, which helps in reducing model complexity and overfitting by ensuring that leaves contain a minimum number of data points. Similarly, 'min_samples_split': 2 allows the model to split nodes as long as there are at least 2 samples, giving the tree more flexibility in learning from the data. The best accuracy achieved during the search is 0.9583; note that this is the mean accuracy over the five cross-validation folds on the training data, not a test-set score (the held-out test set is evaluated in section 6 below). This high cross-validated accuracy suggests that the DecisionTreeClassifier with the selected hyperparameters captures the underlying patterns in the data well without overfitting.
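To see how close the runner-up configurations were, the cv_results_ attribute of the fitted GridSearchCV object can be inspected; the sketch below assumes pandas is available:

import pandas as pd

# cv_results_ is a dict of arrays; sort by rank to see the top configurations
results = pd.DataFrame(grid_search.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results.sort_values("rank_test_score")[cols].head())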

5. Hyperparameter Tuning using RandomizedSearchCV

Random Search is an efficient alternative that samples a fixed number of hyperparameter combinations.

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define hyperparameter distributions
param_dist = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 10, 2),
    'criterion': ['gini', 'entropy']
}

# Randomized Search
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Best parameters and score
print("Best Hyperparameters:", random_search.best_params_)
print("Best Accuracy:", random_search.best_score_)
    
The previous code block consists of the following lines of code.
  • import numpy as np: Imports the NumPy library, which is used to generate arrays for hyperparameter distributions.
  • Define hyperparameter distributions:
    • 'max_depth': [3, 5, 10, None]: Specifies possible values for the maximum depth of the decision tree. This determines how deep the tree can grow.
    • 'min_samples_split': np.arange(2, 20, 2): Specifies possible values for the minimum number of samples required to split an internal node. The values range from 2 to 20, incremented by 2.
    • 'min_samples_leaf': np.arange(1, 10, 2): Specifies possible values for the minimum number of samples required to be at a leaf node. The values range from 1 to 10, incremented by 2.
    • 'criterion': ['gini', 'entropy']: Specifies the splitting criteria, either "gini" (Gini impurity) or "entropy" (Information gain).
  • Randomized Search Configuration:
    • random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42): Defines the RandomizedSearchCV, which will randomly sample 20 hyperparameter combinations from param_dist and evaluate each using 5-fold cross-validation.
    • n_iter=20: Specifies the number of random combinations to try during the search.
    • cv=5: Sets the number of cross-validation folds to 5, ensuring the model is trained and validated on different subsets of the data.
    • scoring='accuracy': Uses accuracy as the evaluation metric to guide the hyperparameter search.
    • n_jobs=-1: Utilizes all available cores for parallel processing during the search.
    • random_state=42: Ensures that the random search can be reproduced in future runs by setting a fixed seed.
  • random_search.fit(X_train, y_train): Trains the RandomizedSearchCV model using the training data (X_train, y_train) and searches for the best combination of hyperparameters.
  • Display Results:
    • print("Best Hyperparameters:", random_search.best_params_): Prints the best combination of hyperparameters found during the search.
    • print("Best Accuracy:", random_search.best_score_): Prints the accuracy score corresponding to the best hyperparameters.

Explanation:

  • Random search selects random hyperparameter combinations.
  • The n_iter=20 argument limits the number of sampled combinations.
  • Randomized search is faster than grid search while still providing good results.
After executing the code, the following results were obtained for RandomizedSearchCV.
Best Hyperparameters: {'min_samples_split': 14, 'min_samples_leaf': 3, 'max_depth': 10, 'criterion': 'entropy'}
Best Accuracy: 0.95
The results obtained from applying RandomizedSearchCV on the DecisionTreeClassifier indicate that the model's hyperparameters were optimized for better performance. The best hyperparameters identified were a min_samples_split of 14, a min_samples_leaf of 3, a max_depth of 10, and the use of the entropy criterion. The min_samples_split of 14 ensures that the tree only splits when there is sufficient data, preventing the model from creating overly specific, less generalizable splits. The min_samples_leaf of 3 further reduces complexity by ensuring that each leaf node contains at least 3 samples, promoting better generalization. The max_depth of 10 limits the depth of the tree, striking a balance between capturing important data patterns and avoiding overfitting. Lastly, the entropy criterion was chosen to guide the model in making splits that maximize information gain, leading to more meaningful and useful divisions in the data. With these optimized parameters, the model achieved a cross-validated accuracy of 95% on the training data. This high score suggests that the model is well-tuned and performs robustly, effectively distinguishing between classes without overfitting. Overall, the use of RandomizedSearchCV has proven effective in selecting hyperparameters that lead to a highly performant DecisionTreeClassifier. The results demonstrate that the model is well-optimized for the task at hand, providing an efficient and accurate classification model.
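As with grid search, best_score_ here is a cross-validation average; to measure generalization, the best estimator found by the randomized search can be evaluated on the held-out test set (a small addition to the code above):

# Evaluate the best model from randomized search on the held-out test set
best_random = random_search.best_estimator_
print(f"Randomized Search Test Accuracy: {best_random.score(X_test, y_test):.4f}")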

6. Evaluating the Best Model

We now train a Decision Tree using the best hyperparameters found.

# Train the best model
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)

# Evaluate the model
best_accuracy = best_model.score(X_test, y_test)
print(f"Tuned Model Accuracy: {best_accuracy:.4f}")
    
The previous block of code consists of the following lines of code.
  • Retrieve the best model:
    • best_model = grid_search.best_estimator_: Retrieves the best model from the grid search based on the highest accuracy score. The best_estimator_ is the model with the optimal hyperparameters found during the grid search.
  • Train the best model:
    • best_model.fit(X_train, y_train): Trains the best model on the training data (X_train, y_train) using the optimal hyperparameters.
  • Evaluate the tuned model:
    • best_accuracy = best_model.score(X_test, y_test): Evaluates the tuned model on the test set (X_test, y_test) and calculates the accuracy score.
    • print(f"Tuned Model Accuracy: {best_accuracy:.4f}"): Prints the accuracy of the tuned model, rounded to four decimal places.
After executing the code for retrieving the best model, the following accuracy is obtained.
Tuned Model Accuracy: 1.0000
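Accuracy alone can hide class-level behavior; for a fuller picture, a classification report (a small addition using sklearn.metrics) shows per-class precision, recall, and F1:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the tuned model
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=iris.target_names))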

7. Conclusion

Hyperparameter tuning significantly improves Decision Tree performance. In this tutorial, we explored:

  • GridSearchCV: Exhaustive search for optimal hyperparameters.
  • RandomizedSearchCV: Faster alternative by sampling a subset of hyperparameters.
  • How to evaluate the best model after tuning.

Try these methods on your own datasets to achieve better Decision Tree performance!