Thursday, February 27, 2025

Avoiding Overfitting in Decision Trees Using max_depth and min_samples

Avoiding Overfitting in Decision Trees

Decision Trees are powerful models, but they tend to overfit when left unrestricted. Overfitting occurs when a model memorizes the training data instead of generalizing to unseen data. In this post, we will explore how to prevent overfitting using the max_depth, min_samples_split, and min_samples_leaf parameters in Scikit-learn.

1. What Causes Overfitting in Decision Trees?

Overfitting happens when a Decision Tree grows too deep, capturing noise instead of meaningful patterns. This results in:

  • High accuracy on training data but poor performance on test data.
  • Complex models with too many nodes and splits.
  • Reduced generalization to new data.

2. Loading and Preparing the Dataset

We will use the Iris dataset for this demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following code lines.
  • from sklearn.datasets import load_iris: This line imports the 'load_iris' function from the 'sklearn.datasets' module, which allows us to load the well-known Iris dataset.
  • from sklearn.model_selection import train_test_split: This imports the 'train_test_split' function from 'sklearn.model_selection'. It is used to split the dataset into training and testing sets.
  • from sklearn.tree import DecisionTreeClassifier: This imports the 'DecisionTreeClassifier' from 'sklearn.tree', which will be used to train a decision tree model for classification.
  • import numpy as np: This imports the 'numpy' library, which is useful for handling numerical operations, although it is not directly used in the code block shown.
  • # Load the dataset: This is a comment that explains the following line of code, where the Iris dataset is loaded using 'load_iris()'.
  • iris = load_iris(): This line loads the Iris dataset and stores it in the variable 'iris'. The dataset contains both the features (X) and the target labels (y).
  • X, y = iris.data, iris.target: Here, the feature data (X) and target labels (y) are extracted from the 'iris' object. 'X' contains the feature data, and 'y' contains the target labels (species of Iris).
  • # Split into training and testing sets: This is a comment indicating that the following code will split the data into training and testing subsets.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This line splits the dataset into training and testing sets. 80% of the data is used for training, and 20% is used for testing. The 'random_state' ensures that the split is reproducible.

3. Overfitting Example: Unrestricted Decision Tree

Let's train a Decision Tree without any restrictions.

# Train an unrestricted Decision Tree
clf_overfit = DecisionTreeClassifier(random_state=42)
clf_overfit.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_overfit.score(X_train, y_train)
test_accuracy = clf_overfit.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous code block consists of the following code lines.
  • # Train an unrestricted Decision Tree: This is a comment indicating that the following code will train a decision tree model without any restrictions (e.g., max depth).
  • clf_overfit = DecisionTreeClassifier(random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with the 'random_state' set to 42. This ensures reproducibility of the model's results.
  • clf_overfit.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The model is trained to learn the relationships between the features and target labels in the training dataset.
  • # Evaluate performance: This comment indicates that the following lines will evaluate the model's performance on both the training and testing datasets.
  • train_accuracy = clf_overfit.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset. The accuracy is the proportion of correct predictions made by the model.
  • test_accuracy = clf_overfit.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset. It evaluates how well the model generalizes to new, unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, also formatted to four decimal places.

Expected Outcome: High training accuracy but lower test accuracy, indicating overfitting.

After executing the code, both the training and the test accuracy are equal to 1.0000. The perfect training accuracy reflects the memorization we expected; because the Iris dataset is small and easily separable, the test accuracy does not drop here, but on noisier data an unrestricted tree would typically score noticeably worse on unseen data.
Training Accuracy: 1.0000
Test Accuracy: 1.0000

4. Controlling Overfitting with max_depth

max_depth limits how deep the tree can grow. A lower depth prevents the tree from memorizing noise.

# Train a Decision Tree with limited depth
clf_depth = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_depth.score(X_train, y_train)
test_accuracy = clf_depth.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous code block consists of the following code lines.
  • # Train a Decision Tree with limited depth: This comment indicates that the following code will train a decision tree model with a restricted maximum depth to avoid overfitting.
  • clf_depth = DecisionTreeClassifier(max_depth=3, random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with a specified 'max_depth' of 3. This limits the depth of the decision tree to prevent it from growing too complex and overfitting the training data. The 'random_state' is set to 42 for reproducibility.
  • clf_depth.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The model learns the relationships between the features and target labels, but with a restricted tree depth.
  • # Evaluate performance: This comment indicates that the following lines will evaluate the performance of the model on both the training and testing datasets.
  • train_accuracy = clf_depth.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset, representing how well the model fits the data it was trained on.
  • test_accuracy = clf_depth.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset, evaluating the model's ability to generalize to new, unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, formatted to four decimal places.

Expected Outcome: Slightly lower training accuracy but improved test accuracy, reducing overfitting.

After executing the code, the training accuracy is indeed slightly lower than with the unrestricted tree, while the test accuracy remains perfect.
Training Accuracy: 0.9583
Test Accuracy: 1.0000

5. Controlling Overfitting with min_samples_split

min_samples_split controls the minimum number of samples needed to split a node. Increasing this value forces the tree to consider only significant splits.

# Train a Decision Tree with min_samples_split restriction
clf_split = DecisionTreeClassifier(min_samples_split=10, random_state=42)
clf_split.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_split.score(X_train, y_train)
test_accuracy = clf_split.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous code block consists of the following code lines.
  • # Train a Decision Tree with min_samples_split restriction: This comment indicates that the following code will train a decision tree model with a restriction on the minimum number of samples required to split an internal node. This restriction helps in controlling the model's complexity.
  • clf_split = DecisionTreeClassifier(min_samples_split=10, random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with a specified 'min_samples_split' of 10. This means that a node in the decision tree will only be split if it has at least 10 samples, thus preventing the model from making overly fine distinctions on small subsets of data. The 'random_state' is set to 42 for reproducibility.
  • clf_split.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The decision tree will build using the provided training data with the specified restriction on the minimum number of samples required to split a node.
  • # Evaluate performance: This comment indicates that the following lines will evaluate the model's performance on both the training and testing datasets.
  • train_accuracy = clf_split.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset, representing how well the model fits the data it was trained on.
  • test_accuracy = clf_split.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset, providing an evaluation of how well the model generalizes to unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, formatted to four decimal places.
When the code is executed, the Decision Tree classifier with min_samples_split=10 again gives a slightly lower accuracy on the training set while remaining perfect on the test set.
Training Accuracy: 0.9583
Test Accuracy: 1.0000

6. Controlling Overfitting with min_samples_leaf

min_samples_leaf sets the minimum number of samples required to be at a leaf node. Larger values help in reducing overfitting.

# Train a Decision Tree with min_samples_leaf restriction
clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
clf_leaf.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_leaf.score(X_train, y_train)
test_accuracy = clf_leaf.score(X_test, y_test)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
    
The previous block of code consists of the following lines of code.
  • clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42): This line initializes a DecisionTreeClassifier with the hyperparameter 'min_samples_leaf' set to 5, which means that each leaf node in the decision tree must have at least 5 samples. This restriction helps control overfitting by preventing the tree from creating nodes with very few samples. The random_state parameter is set to 42 to ensure reproducibility of the model's results.
  • clf_leaf.fit(X_train, y_train): This line trains the decision tree model ('clf_leaf') using the training dataset ('X_train' and 'y_train'). The decision tree is built based on the features (X_train) and target values (y_train) of the training data.
  • train_accuracy = clf_leaf.score(X_train, y_train): After training the model, this line calculates the accuracy of the model on the training data (X_train and y_train). It evaluates how well the model fits the data it was trained on.
  • test_accuracy = clf_leaf.score(X_test, y_test): This line calculates the accuracy of the model on the testing dataset (X_test and y_test). It evaluates how well the model generalizes to new, unseen data.
  • print(f"Training Accuracy: {train_accuracy:.4f}"): This line prints the training accuracy, rounded to four decimal places, to show how well the model performed on the training set.
  • print(f"Test Accuracy: {test_accuracy:.4f}"): Similarly, this line prints the testing accuracy, rounded to four decimal places, to show the model's performance on the testing set.
When executed, the Decision Tree classifier with min_samples_leaf=5 gives a lower accuracy on the training set than on the test set.
Training Accuracy: 0.9500
Test Accuracy: 1.0000

7. Comparing the Models

Let's summarize how these hyperparameters affect overfitting.

models = {
    "Overfitted": clf_overfit,
    "Max Depth (3)": clf_depth,
    "Min Samples Split (10)": clf_split,
    "Min Samples Leaf (5)": clf_leaf
}

for name, model in models.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: Train Accuracy = {train_acc:.4f}, Test Accuracy = {test_acc:.4f}")
    
The previous code block consists of the following code lines.
  • models = { ... }: This line defines a dictionary called 'models', where each key is a description of a model, and the value is the corresponding trained model object. The models included are 'Overfitted', 'Max Depth (3)', 'Min Samples Split (10)', and 'Min Samples Leaf (5)'. Each model has different hyperparameter configurations applied.
  • for name, model in models.items():: This line starts a loop that iterates over each model in the 'models' dictionary. For each iteration, the variable 'name' will hold the description of the model, and 'model' will hold the actual trained model.
  • train_acc = model.score(X_train, y_train): Inside the loop, this line calculates the accuracy of the current model on the training dataset, which shows how well the model fits the training data.
  • test_acc = model.score(X_test, y_test): This line calculates the accuracy of the current model on the testing dataset, indicating how well the model generalizes to unseen data.
  • print(f"{name}: Train Accuracy = {train_acc:.4f}, Test Accuracy = {test_acc:.4f}"): This line prints the name of the model (e.g., 'Overfitted', 'Max Depth (3)', etc.) along with its corresponding training and testing accuracy, formatted to four decimal places.

Expected Outcome: The restricted models will have slightly lower training accuracy but equal or better test accuracy compared to the overfitted model.

8. Key Takeaways

To avoid overfitting in Decision Trees:

  • Use max_depth to limit tree growth and prevent memorization of noise.
  • Increase min_samples_split to ensure meaningful splits.
  • Set min_samples_leaf to avoid creating deep branches with few samples.

By fine-tuning these parameters, we can build a more generalizable model that performs well on unseen data.

Finally, when the code is executed, the following output is obtained.
Overfitted: Train Accuracy = 1.0000, Test Accuracy = 1.0000
Max Depth (3): Train Accuracy = 0.9583, Test Accuracy = 1.0000
Min Samples Split (10): Train Accuracy = 0.9583, Test Accuracy = 1.0000
Min Samples Leaf (5): Train Accuracy = 0.9500, Test Accuracy = 1.0000
The results of the Decision Tree models show varying levels of performance based on different hyperparameter restrictions.
The Overfitted model, which has no restrictions, achieved perfect accuracy on both the training and test sets, with a training accuracy of 1.0000 and a test accuracy of 1.0000. The perfect training accuracy suggests that the model has memorized the training data; the equally perfect test accuracy reflects how easily separable the Iris dataset is rather than guaranteeing that such an unrestricted tree would generalize well on harder, noisier data.
The Max Depth (3) model, which restricts the tree's depth to 3, performed slightly less well on the training set with a training accuracy of 0.9583 but still achieved perfect accuracy on the test set (1.0000). This indicates that limiting the depth of the tree helped prevent overfitting, allowing the model to generalize well to unseen data while still maintaining good performance on the training data.
Similarly, the Min Samples Split (10) model, which restricts the minimum number of samples required to split an internal node to 10, achieved the same performance as the Max Depth (3) model with a training accuracy of 0.9583 and a test accuracy of 1.0000. This suggests that increasing the minimum number of samples required to make a split also helped prevent overfitting, leading to similar generalization performance.
The Min Samples Leaf (5) model, which ensures that each leaf node contains at least 5 samples, showed the lowest training accuracy at 0.9500, but still achieved perfect accuracy on the test set (1.0000). This further confirms that restricting the number of samples in each leaf can slightly reduce the model’s ability to fit the training data perfectly but still does not hinder its ability to generalize well.
In summary, while all models achieved perfect test accuracy, the Overfitted model performed too well on the training data, indicating overfitting. The other models, which include restrictions like depth or minimum sample size, maintained a balance between good training accuracy and perfect test accuracy, reflecting improved generalization.
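The three hyperparameters can also be combined in a single estimator. The short sketch below is an illustrative addition (the combined model and its variable name are not part of the comparison above): one tree is trained with all three constraints at once.

# Train a Decision Tree with all three constraints combined (illustrative sketch)
clf_combined = DecisionTreeClassifier(max_depth=3, min_samples_split=10, min_samples_leaf=5, random_state=42)
clf_combined.fit(X_train, y_train)

print(f"Combined: Train Accuracy = {clf_combined.score(X_train, y_train):.4f}, Test Accuracy = {clf_combined.score(X_test, y_test):.4f}")

On a dataset as small and clean as Iris, stacking the restrictions changes little, but on larger and noisier datasets it usually trades a bit of training accuracy for better generalization.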

Hyperparameter Tuning for Decision Trees

Hyperparameter Tuning for Decision Trees

Decision Trees are powerful machine learning models, but their performance heavily depends on the choice of hyperparameters. In this guide, we will explore how to optimize Decision Tree hyperparameters using Scikit-learn's GridSearchCV and RandomizedSearchCV.

1. Understanding Hyperparameters in Decision Trees

Key hyperparameters that affect Decision Tree performance include:

  • max_depth: Limits the depth of the tree to prevent overfitting.
  • min_samples_split: The minimum number of samples required to split an internal node.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node.
  • max_features: The number of features to consider when looking for the best split.
  • criterion: The function to measure the quality of a split (gini or entropy for classification, squared_error for regression).

2. Loading and Preparing the Dataset

We will use the famous Iris dataset for classification.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following lines of code.
  • Import the necessary libraries:
    • from sklearn.datasets import load_iris - Imports the load_iris function from sklearn.datasets to load the Iris dataset.
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function from sklearn.model_selection to split the dataset into training and testing sets.
    • from sklearn.tree import DecisionTreeClassifier - Imports the DecisionTreeClassifier from sklearn.tree to create a decision tree classifier model.
  • Load the Iris dataset:
    • iris = load_iris() - Loads the Iris dataset, which includes features (sepal length, sepal width, petal length, petal width) and target values (species of the iris).
    • X, y = iris.data, iris.target - Separates the dataset into the feature matrix X (input features) and the target vector y (species labels).
  • Split the data into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the dataset into training and testing sets:
      • X_train - Training feature matrix.
      • X_test - Testing feature matrix.
      • y_train - Training target vector.
      • y_test - Testing target vector.
      • test_size=0.2 - Specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.
      • random_state=42 - Ensures reproducibility by setting a seed for the random number generator.

3. Baseline Model without Tuning

Let's train a basic Decision Tree without tuning.

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
accuracy = clf.score(X_test, y_test)
print(f"Baseline Accuracy: {accuracy:.4f}")
    
The previous code block consists of the following lines of code.
  • Train a Decision Tree Classifier:
    • clf = DecisionTreeClassifier(random_state=42) - Initializes a Decision Tree Classifier with a random seed set to 42 for reproducibility.
    • clf.fit(X_train, y_train) - Fits the classifier to the training data (X_train for features and y_train for target labels).
  • Evaluate the model:
    • accuracy = clf.score(X_test, y_test) - Evaluates the model by calculating the accuracy on the test data. The score method returns the mean accuracy of the classifier on the given test data.
    • print(f"Baseline Accuracy: {accuracy:.4f}") - Prints the accuracy of the classifier on the test set, rounded to 4 decimal places.
After executing the code written so far, the following output is obtained.
Baseline Accuracy: 1.0000
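Because the Iris test set here contains only 30 samples, a single train/test split can give an optimistic estimate. As a hedged aside, cross-validation gives a more stable baseline before tuning; the sketch below uses Scikit-learn's cross_val_score on the training data.

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the untuned baseline tree
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X_train, y_train, cv=5, scoring='accuracy')
print(f"Baseline CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")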

4. Hyperparameter Tuning using GridSearchCV

Grid Search performs an exhaustive search over a specified parameter grid.

from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Grid Search
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)
    
The previous code block consists of the following lines of code.
  • Define hyperparameter grid:
    • param_grid - A dictionary containing the hyperparameters to be tuned and their possible values for the Decision Tree Classifier. This includes:
      • 'max_depth': [3, 5, 10, None] - Specifies the maximum depth of the tree.
      • 'min_samples_split': [2, 5, 10] - Defines the minimum number of samples required to split an internal node.
      • 'min_samples_leaf': [1, 2, 4] - Defines the minimum number of samples required to be at a leaf node.
      • 'criterion': ['gini', 'entropy'] - Specifies the function to measure the quality of a split (Gini impurity or Entropy).
  • Grid Search:
    • grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1) - Performs grid search with 5-fold cross-validation (cv=5) to find the best combination of hyperparameters using accuracy as the scoring metric (scoring='accuracy'). The parameter n_jobs=-1 enables parallel computation.
    • grid_search.fit(X_train, y_train) - Fits the grid search model to the training data (X_train and y_train) to explore the hyperparameter space.
  • Best parameters and score:
    • print("Best Hyperparameters:", grid_search.best_params_) - Prints the hyperparameters that produced the best performance during grid search.
    • print("Best Accuracy:", grid_search.best_score_) - Prints the best accuracy achieved during the grid search.

Explanation:

  • The search runs on multiple combinations of hyperparameters.
  • The cv=5 argument performs 5-fold cross-validation.
  • The best parameters and accuracy are displayed after tuning.
After executing the code, the following output is obtained.
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best Accuracy: 0.9583333333333334
The results obtained from GridSearchCV on the DecisionTreeClassifier indicate that the optimal hyperparameters for the model are 'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, and 'min_samples_split': 2. The entropy criterion means that splits are chosen to maximize information gain. A maximum depth of 5 gives the tree a moderate level of complexity, limiting overfitting while still capturing the necessary patterns in the data. The setting 'min_samples_leaf': 4 requires each leaf to contain at least 4 samples, which reduces model complexity and overfitting, while 'min_samples_split': 2 lets the model split any node with at least 2 samples, keeping flexibility where the data supports it. The best accuracy of 0.9583 is the mean 5-fold cross-validation accuracy on the training data, not a test-set score; the tuned model is evaluated on the held-out test set in Section 6.
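GridSearchCV also keeps the score of every evaluated combination in its cv_results_ attribute. The sketch below, assuming pandas is available, shows one way to inspect the top-ranked combinations rather than only the single best one.

import pandas as pd

# Inspect the best-ranked hyperparameter combinations from the grid search
results = pd.DataFrame(grid_search.cv_results_)
top = results.sort_values("rank_test_score")[["params", "mean_test_score", "std_test_score"]]
print(top.head())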

5. Hyperparameter Tuning using RandomizedSearchCV

Random Search is an efficient alternative that samples a fixed number of hyperparameter combinations.

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define hyperparameter distributions
param_dist = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 10, 2),
    'criterion': ['gini', 'entropy']
}

# Randomized Search
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Best parameters and score
print("Best Hyperparameters:", random_search.best_params_)
print("Best Accuracy:", random_search.best_score_)
    
The previous code block consists of the following lines of code.
  • import numpy as np: Imports the NumPy library, which is used to generate arrays for hyperparameter distributions.
  • Define hyperparameter distributions:
    • 'max_depth': [3, 5, 10, None]: Specifies possible values for the maximum depth of the decision tree. This determines how deep the tree can grow.
    • 'min_samples_split': np.arange(2, 20, 2): Specifies possible values for the minimum number of samples required to split an internal node. The values range from 2 to 20, incremented by 2.
    • 'min_samples_leaf': np.arange(1, 10, 2): Specifies possible values for the minimum number of samples required to be at a leaf node. The values range from 1 to 10, incremented by 2.
    • 'criterion': ['gini', 'entropy']: Specifies the splitting criteria, either "gini" (Gini impurity) or "entropy" (Information gain).
  • Randomized Search Configuration:
    • random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42): Defines the RandomizedSearchCV, which will randomly sample 20 hyperparameter combinations from param_dist and evaluate each using 5-fold cross-validation.
    • n_iter=20: Specifies the number of random combinations to try during the search.
    • cv=5: Sets the number of cross-validation folds to 5, ensuring the model is trained and validated on different subsets of the data.
    • scoring='accuracy': Uses accuracy as the evaluation metric to guide the hyperparameter search.
    • n_jobs=-1: Utilizes all available cores for parallel processing during the search.
    • random_state=42: Ensures that the random search can be reproduced in future runs by setting a fixed seed.
  • random_search.fit(X_train, y_train): Trains the RandomizedSearchCV model using the training data (X_train, y_train) and searches for the best combination of hyperparameters.
  • Display Results:
    • print("Best Hyperparameters:", random_search.best_params_): Prints the best combination of hyperparameters found during the search.
    • print("Best Accuracy:", random_search.best_score_): Prints the accuracy score corresponding to the best hyperparameters.

Explanation:

  • Random search selects random hyperparameter combinations.
  • The n_iter=20 argument limits the number of sampled combinations.
  • Randomized search is faster than grid search while still providing good results.
After executing the code, the following results were obtained for RandomizedSearchCV.
Best Hyperparameters: {'min_samples_split': 14, 'min_samples_leaf': 3, 'max_depth': 10, 'criterion': 'entropy'}
Best Accuracy: 0.95
The results obtained from applying RandomizedSearchCV to the DecisionTreeClassifier show that the best hyperparameters found were a min_samples_split of 14, a min_samples_leaf of 3, a max_depth of 10, and the entropy criterion. The min_samples_split of 14 ensures that a node is split only when it holds enough data, preventing overly specific, poorly generalizing splits, and the min_samples_leaf of 3 further reduces complexity by requiring each leaf to contain at least 3 samples. The max_depth of 10 caps the depth of the tree, balancing the ability to capture important patterns against the risk of overfitting, while the entropy criterion guides splits toward maximal information gain. With these parameters the search reached a best cross-validation accuracy of 0.95 on the training data (not a test-set score), slightly below the grid-search result. Overall, RandomizedSearchCV found a well-performing configuration while evaluating far fewer combinations than an exhaustive grid search would require.
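RandomizedSearchCV can also sample from statistical distributions instead of fixed lists, which is useful when a parameter has many plausible integer values. The sketch below is a variant of the search above; it assumes SciPy is installed and uses scipy.stats.randint for the integer-valued parameters, and the variable names are illustrative.

from scipy.stats import randint

# Sample integer hyperparameters from distributions rather than fixed grids
param_dist_scipy = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),   # integers in [2, 20)
    'min_samples_leaf': randint(1, 10),    # integers in [1, 10)
    'criterion': ['gini', 'entropy']
}

random_search_scipy = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist_scipy, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
random_search_scipy.fit(X_train, y_train)
print("Best Hyperparameters:", random_search_scipy.best_params_)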

6. Evaluating the Best Model

We now train a Decision Tree using the best hyperparameters found.

# Train the best model
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)

# Evaluate the model
best_accuracy = best_model.score(X_test, y_test)
print(f"Tuned Model Accuracy: {best_accuracy:.4f}")
    
The previous block of code consists of the following lines of code.
  • Retrieve the best model:
    • best_model = grid_search.best_estimator_: Retrieves the best model from the grid search based on the highest accuracy score. The best_estimator_ is the model with the optimal hyperparameters found during the grid search.
  • Train the best model:
    • best_model.fit(X_train, y_train): Trains the best model on the training data (X_train, y_train) using the optimal hyperparameters.
  • Evaluate the tuned model:
    • best_accuracy = best_model.score(X_test, y_test): Evaluates the tuned model on the test set (X_test, y_test) and calculates the accuracy score.
    • print(f"Tuned Model Accuracy: {best_accuracy:.4f}"): Prints the accuracy of the tuned model, rounded to four decimal places.
After executing the code for retrieving the best model, the following accuracy is obtained.
Tuned Model Accuracy: 1.0000
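The best estimator found by RandomizedSearchCV can be evaluated in exactly the same way; the short sketch below mirrors the evaluation of the grid-search model above.

# Evaluate the best model found by the randomized search on the test set
best_random_model = random_search.best_estimator_
random_accuracy = best_random_model.score(X_test, y_test)
print(f"Randomized Search Model Accuracy: {random_accuracy:.4f}")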

7. Conclusion

Hyperparameter tuning significantly improves Decision Tree performance. In this tutorial, we explored:

  • GridSearchCV: Exhaustive search for optimal hyperparameters.
  • RandomizedSearchCV: Faster alternative by sampling a subset of hyperparameters.
  • How to evaluate the best model after tuning.

Try these methods on your own datasets to achieve better Decision Tree performance!

Visualizing Decision Trees in Scikit-learn

Visualizing Decision Trees in Scikit-learn

Decision Trees are one of the most intuitive machine learning models, and a great advantage is that they can be visualized to understand how decisions are made at each step. In this post, we will explore different ways to visualize Decision Trees using Python’s Scikit-learn library.

Why Visualize a Decision Tree?

Understanding the structure of a Decision Tree helps with:

  • Interpreting the model's decision-making process.
  • Identifying important features used for classification or regression.
  • Detecting overfitting when the tree is too deep.

1. Training a Decision Tree in Scikit-learn

First, let's train a Decision Tree using the Iris dataset for classification.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X, y)
    
The previous code block consists of the following code lines:
  • Load the dataset:
    • iris = load_iris() - Loads the Iris dataset from sklearn.datasets. The dataset contains features of iris flowers and their corresponding species labels.
    • X, y = iris.data, iris.target - X contains the feature data (iris flower measurements), and y contains the target data (iris species labels).
  • Train a Decision Tree Classifier:
    • clf = DecisionTreeClassifier(max_depth=3, random_state=42) - Initializes a Decision Tree Classifier with a maximum depth of 3 to control the complexity of the tree and prevent overfitting. The random_state=42 ensures that the results are reproducible.
    • clf.fit(X, y) - Fits the Decision Tree Classifier (clf) to the Iris dataset. This step trains the classifier using the features (X) and target labels (y).

2. Visualizing with plot_tree

Scikit-learn provides the plot_tree function to directly visualize a trained Decision Tree.

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Plot the Decision Tree
plt.figure(figsize=(12,8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True)
plt.show()
    
The previous code block consists of the following code lines:
  • Import necessary libraries:
    • import matplotlib.pyplot as plt - Imports the matplotlib.pyplot module, which is used for plotting graphs and figures.
    • from sklearn.tree import plot_tree - Imports the plot_tree function from sklearn.tree to visualize decision trees.
  • Plot the Decision Tree:
    • plt.figure(figsize=(12,8)) - Creates a figure with a specified size of 12x8 inches to plot the decision tree.
    • plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True) - Plots the trained decision tree clf. The filled=True argument colors the nodes according to the predicted class. The feature_names=iris.feature_names and class_names=iris.target_names add feature and class names to the plot, respectively. The rounded=True argument makes the nodes have rounded corners for a cleaner appearance.
    • plt.show() - Displays the decision tree plot on the screen.

Explanation:

  • The filled=True argument colors the nodes based on the predicted class.
  • The feature_names and class_names arguments add labels for better understanding.
  • The rounded=True makes the boxes have rounded corners for better readability.

3. Exporting Tree as a Graph using export_graphviz

Another approach is using Graphviz to create a graphical representation of the tree.

from sklearn.tree import export_graphviz
import graphviz

# Export the Decision Tree
dot_data = export_graphviz(clf, out_file=None, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True)

# Visualize using Graphviz
graph = graphviz.Source(dot_data)
graph.render("decision_tree")  # Saves the tree as a .pdf file
graph
    
The previous code block consists of the following code lines:
  • Import necessary libraries:
    • from sklearn.tree import export_graphviz - Imports the export_graphviz function, which is used to export a decision tree in the Graphviz DOT format.
    • import graphviz - Imports the graphviz module, which is used to render and visualize Graphviz DOT format files.
  • Export the Decision Tree:
    • dot_data = export_graphviz(clf, out_file=None, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True) - Exports the trained decision tree clf to the DOT format. The filled=True argument colors the nodes according to the predicted class. The feature_names=iris.feature_names and class_names=iris.target_names add feature and class names to the tree. The rounded=True argument ensures the nodes have rounded corners.
  • Visualize using Graphviz:
    • graph = graphviz.Source(dot_data) - Creates a Graphviz source object from the DOT data, which represents the decision tree.
    • graph.render("decision_tree") - Renders the decision tree and saves it as a .pdf file with the name decision_tree.pdf.
    • graph - Displays the decision tree visually.

Explanation:

  • The tree is exported as a DOT file format using export_graphviz.
  • Graphviz is used to render the visualization.
  • To view the output, run the script in a Jupyter Notebook or save it as an image/PDF.
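If Graphviz is not available, Scikit-learn's export_text function prints a plain-text outline of the same tree. The sketch below is a lightweight alternative to the Graphviz export shown above.

from sklearn.tree import export_text

# Print a plain-text representation of the trained tree
tree_rules = export_text(clf, feature_names=list(iris.feature_names))
print(tree_rules)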

4. Feature Importance in Decision Trees

Decision Trees can help us understand which features are most important in making predictions.

import numpy as np

# Get feature importances
feature_importances = clf.feature_importances_

# Print feature importance
for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"{feature}: {importance:.4f}")
    
The previous code block consists of the following code lines:
  • Import the necessary library:
    • import numpy as np - Imports the NumPy library, which is used for numerical operations.
  • Get feature importances:
    • feature_importances = clf.feature_importances_ - Retrieves the feature importances from the trained decision tree classifier clf. The feature_importances_ attribute provides the relative importance of each feature in making predictions.
  • Print feature importances:
    • for feature, importance in zip(iris.feature_names, feature_importances): - Iterates over the feature names and their corresponding importances using zip to pair each feature with its importance value.
    • print(f"{feature}: {importance:.4f}") - Prints the name of each feature along with its importance value, formatted to four decimal places.

Explanation:

  • Higher values indicate that a feature plays a more significant role in decision-making.
  • This helps in feature selection by identifying which features contribute the most to the prediction.
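To make the ranking easier to read, the importances can be sorted and plotted. The sketch below, assuming matplotlib (already imported earlier in this post), draws a horizontal bar chart of the values printed above.

import numpy as np
import matplotlib.pyplot as plt

# Sort features by importance and plot them as a horizontal bar chart
order = np.argsort(feature_importances)
plt.barh(np.array(iris.feature_names)[order], feature_importances[order])
plt.xlabel("Feature importance")
plt.title("Decision Tree feature importances (Iris)")
plt.tight_layout()
plt.show()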

5. Interactive Decision Tree Visualization

We can create an interactive tree visualization using dtreeviz, an external library.

Installation:

pip install dtreeviz
    

Usage:

from dtreeviz.trees import dtreeviz

# Generate the visualization
viz = dtreeviz(clf, X, y, target_name="species", feature_names=iris.feature_names, class_names=iris.target_names)

# Display the tree
viz.show()
    
The previous code block consists of the following code lines:
  • Import the necessary library:
    • from dtreeviz.trees import dtreeviz - Imports the dtreeviz function from the dtreeviz library, which is used for visualizing decision trees in a more interactive and detailed manner.
  • Generate the visualization:
    • viz = dtreeviz(clf, X, y, target_name="species", feature_names=iris.feature_names, class_names=iris.target_names) - Calls the dtreeviz function to generate a detailed visualization of the decision tree. It takes the following parameters:
      • clf - The trained decision tree classifier.
      • X - The feature matrix containing the input data.
      • y - The target values corresponding to the input data.
      • target_name="species" - The name of the target variable (in this case, "species").
      • feature_names=iris.feature_names - The list of feature names.
      • class_names=iris.target_names - The list of class names for the target variable (the types of species).
  • Display the tree:
    • viz.show() - Displays the generated decision tree visualization in an interactive format.

Conclusion

Visualizing Decision Trees helps in understanding model decisions and identifying overfitting. In this tutorial, we explored:

  • plot_tree: A quick way to visualize the tree.
  • export_graphviz: Exporting the tree as a DOT file for use with Graphviz.
  • Feature Importance: Understanding the significance of features in the decision process.
  • dtreeviz: Creating an interactive visualization for deeper insights.

Try these methods with your own datasets to enhance your machine learning models!

Decision Trees for Regression

Decision Trees for Regression

Decision Trees are not only useful for classification tasks but also for regression problems. In this post, we will focus on Decision Trees for regression, where the goal is to predict a continuous target variable based on input features. Let's dive into how Decision Trees work for regression tasks and implement them in Python using Scikit-learn.

What is a Decision Tree for Regression?

A Decision Tree for regression works similarly to a classification tree, except that instead of classifying data into categories, it predicts a continuous value. The tree splits the data at each internal node based on feature values and predicts the average value of the target variable within the leaf nodes. These splits are chosen to minimize the variance of the target variable in each resulting node.
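This leaf-wise averaging can be seen with a tiny synthetic example: a shallow regression tree fitted to a noisy one-dimensional curve produces a piecewise-constant prediction, where each constant is the mean target value of the training samples falling into that leaf. The sketch below uses made-up data purely for illustration.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One-dimensional toy data: a noisy sine curve
rng = np.random.RandomState(0)
X_toy = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=80)

# A depth-2 tree has at most four leaves, so its predictions take at most four distinct values
toy_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)
print(toy_tree.predict([[0.5], [3.0], [5.5]]))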

Key Features of Decision Trees for Regression:

  • Non-linear relationships: Decision Trees can model complex, non-linear relationships between the input features and the target variable.
  • No need for feature scaling: Decision Trees do not require feature normalization or scaling, making them simpler to implement in comparison to models like linear regression.
  • Overfitting risk: Like in classification, Decision Trees for regression can overfit the data if the tree is too deep.

Advantages and Disadvantages of Decision Trees for Regression

Advantages:

  • Simple to understand and interpret.
  • Can handle both numerical and categorical features.
  • Non-linear models that can capture complex patterns.
  • No need for feature scaling or normalization.

Disadvantages:

  • Prone to overfitting if not properly pruned.
  • Instability: small changes in the data can lead to large changes in the tree structure.
  • Less accurate when compared to more advanced algorithms like Random Forests or Gradient Boosting Machines.

Decision Tree Regressor in Python

In this section, we will implement a Decision Tree regressor using Scikit-learn and demonstrate how to use it for predicting continuous values.

1. Load the Dataset

For this example, we will use the California Housing dataset, which contains information about various features of houses in California and the target variable, which is the house value.
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the California Housing dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=data.feature_names)
df['target'] = y
df.head()
        
The previous code consists of the following code lines:
  • Load the California Housing dataset:
    • data = fetch_california_housing() - Loads the California Housing dataset using the fetch_california_housing function from sklearn.datasets. The dataset contains features related to housing prices in California.
    • X = data.data - Assigns the input features (data) of the dataset to X.
    • y = data.target - Assigns the target values (housing prices) of the dataset to y.
  • Create a DataFrame for better visualization:
    • df = pd.DataFrame(X, columns=data.feature_names) - Creates a pandas DataFrame from the input features X, and assigns the appropriate feature names from data.feature_names to the DataFrame's columns for better readability.
    • df['target'] = y - Adds the target column (housing prices) to the DataFrame as the last column, labeled as "target".
  • Display the first few rows of the DataFrame:
    • df.head() - Displays the first 5 rows of the DataFrame for a preview of the data, including both the features and the target column.

2. Train the Decision Tree Regressor

Now, let’s train a Decision Tree model using the Scikit-learn `DecisionTreeRegressor`. We will first split the dataset into training and testing sets and then train the regressor.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Decision Tree regressor
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_regressor.fit(X_train, y_train)
        
The previous code block consists of the following code lines:
  • Importing libraries
    • from sklearn.model_selection import train_test_split - train_test_split function imported from sklearn.model_selection module which is used to split the dataset to train and test dataset in user-specified ratio.
    • from sklearn.tree import DecisionTreeRegressor - Decision Tree Regressor algorithm imported from sklearn.tree module
  • Split the data into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) - Splits the data into training and testing sets using the train_test_split function. 30% of the data is reserved for testing (test_size=0.3), and the random_state=42 ensures reproducibility of the split.
  • Initialize and train the Decision Tree regressor:
    • dt_regressor = DecisionTreeRegressor(random_state=42) - Initializes a DecisionTreeRegressor model, setting random_state=42 to ensure reproducibility of the results.
    • dt_regressor.fit(X_train, y_train) - Trains the Decision Tree regressor model using the training data (X_train and y_train) to learn the relationship between the features and the target values.

3. Evaluate the Model

After training the model, let’s evaluate its performance by predicting values on the test set and calculating the Mean Squared Error (MSE).
from sklearn.metrics import mean_squared_error

# Make predictions on the test set
y_pred = dt_regressor.predict(X_test)

# Calculate the Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")
        
The previous code consists of the following code lines:
  • Make predictions on the test set:
    • y_pred = dt_regressor.predict(X_test) - Uses the trained dt_regressor to make predictions on the test set (X_test) based on the learned model.
  • Calculate the Mean Squared Error:
    • mse = mean_squared_error(y_test, y_pred) - Computes the Mean Squared Error (MSE) by comparing the true values (y_test) with the predicted values (y_pred). The MSE measures the average squared difference between predicted and actual values, indicating the model's performance.
  • Print the Mean Squared Error:
    • print(f"Mean Squared Error: {mse:.2f}") - Displays the calculated MSE value, rounded to two decimal places, to evaluate the performance of the regression model.
After executing the code, the following output is obtained.
Mean Squared Error: 0.53
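MSE is not the only useful metric for a regression tree. As an optional extension, the sketch below also reports the coefficient of determination (R²) and the mean absolute error on the same predictions.

from sklearn.metrics import r2_score, mean_absolute_error

# Additional regression metrics on the same test-set predictions
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2: {r2:.2f}, MAE: {mae:.2f}")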
      

4. Visualize the Decision Tree

One of the advantages of Decision Trees is their interpretability. We can visualize the trained regression tree using Scikit-learn’s `plot_tree` function.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Visualize the decision tree
plt.figure(figsize=(12,8))
plot_tree(dt_regressor, filled=True, feature_names=data.feature_names, rounded=True)
plt.show()
        
The previous code block consists of the following code lines:
  • Visualize the decision tree:
    • plt.figure(figsize=(12,8)) - Creates a new figure for plotting with a specified size (12 inches by 8 inches) to ensure the decision tree is displayed clearly.
  • Plot the decision tree:
    • plot_tree(dt_regressor, filled=True, feature_names=data.feature_names, rounded=True) - Uses the plot_tree function to visualize the trained decision tree model (dt_regressor). It includes the following options:
      • filled=True - Fills the nodes of the tree with colors to represent the predicted values or class probabilities.
      • feature_names=data.feature_names - Labels the features (input variables) used in the decision tree at each node.
      • rounded=True - Rounds the corners of the nodes for a cleaner and more visually appealing tree structure.
  • Show the plot:
    • plt.show() - Displays the generated decision tree plot to the user.

The plot above shows the decision tree and how it splits the data at each node. Unlike classification trees, the leaf nodes will show the predicted values (rather than class labels) based on the average target value for the data points in that leaf.

Pruning the Decision Tree

Just like in classification, Decision Trees for regression can overfit the data if the tree is too deep. One way to reduce overfitting is by pruning the tree, either by setting a maximum depth or by requiring a minimum number of samples in a leaf node.

Prune the Tree Using Maximum Depth

Let’s prune the tree by setting a maximum depth to prevent overfitting.
# Train the decision tree with a maximum depth of 5
dt_regressor_pruned = DecisionTreeRegressor(max_depth=5, random_state=42)
dt_regressor_pruned.fit(X_train, y_train)

# Visualize the pruned decision tree
plt.figure(figsize=(12,8))
plot_tree(dt_regressor_pruned, filled=True, feature_names=data.feature_names, rounded=True)
plt.show()
        
The previous code block consists of the following code lines:
  • Train the decision tree with a maximum depth of 5:
    • dt_regressor_pruned = DecisionTreeRegressor(max_depth=5, random_state=42) - Initializes a decision tree regressor with a maximum depth of 5 to limit the depth of the tree and prevent overfitting. The random_state=42 ensures reproducibility of results.
    • dt_regressor_pruned.fit(X_train, y_train) - Fits the decision tree model (dt_regressor_pruned) to the training data (X_train, y_train). This step trains the model based on the features and targets from the training set.
  • Visualize the pruned decision tree:
    • plt.figure(figsize=(12,8)) - Creates a new figure for plotting with a specified size (12 inches by 8 inches) to ensure the pruned decision tree is displayed clearly.
    • plot_tree(dt_regressor_pruned, filled=True, feature_names=data.feature_names, rounded=True) - Visualizes the pruned decision tree model (dt_regressor_pruned) with the following options:
      • filled=True - Fills the nodes of the tree with colors to represent the predicted values or class probabilities.
      • feature_names=data.feature_names - Labels the features (input variables) used in the decision tree at each node.
      • rounded=True - Rounds the corners of the nodes for a cleaner and more visually appealing tree structure.
  • Show the plot:
    • plt.show() - Displays the pruned decision tree plot to the user.
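The visual comparison can be backed up with numbers. The short sketch below reuses the evaluation pattern from step 3 above to compare the test-set MSE of the unrestricted and the pruned regressors.

# Compare test-set MSE of the unrestricted and the pruned regressors
y_pred_pruned = dt_regressor_pruned.predict(X_test)
mse_pruned = mean_squared_error(y_test, y_pred_pruned)
print(f"MSE (unrestricted): {mse:.2f}")
print(f"MSE (max_depth=5):  {mse_pruned:.2f}")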

Conclusion

Decision Trees for regression are a powerful tool for predicting continuous target variables. They are easy to interpret and can model non-linear relationships in the data. However, they are prone to overfitting, which can be mitigated through pruning or using ensemble methods like Random Forests.

In this post, we have covered the basics of Decision Trees for regression, including how to implement and evaluate a Decision Tree regressor using Scikit-learn, as well as how to visualize and prune the tree to avoid overfitting. Try experimenting with different datasets and pruning strategies to see how you can improve the performance of your regression models!

Decision Trees for Classification

Decision Trees for Classification

Decision trees are one of the most popular machine learning algorithms for both classification and regression tasks. In this post, we will focus on Decision Trees for classification tasks, where the goal is to predict the class label of an object based on its features.

What is a Decision Tree?

A Decision Tree is a flowchart-like structure where each internal node represents a "test" or "decision" on an attribute (e.g., whether a feature is greater than a threshold value), each branch represents the outcome of that decision, and each leaf node represents a class label (the outcome of the classification). The tree is constructed by splitting the data at each internal node based on the most important features, with the goal of classifying the data into distinct classes.

Key Features of Decision Trees:

  • Interpretability: One of the biggest advantages of decision trees is that they are easy to interpret and visualize.
  • Non-linear decision boundaries: Unlike linear models, decision trees can handle non-linear decision boundaries.
  • Handles both numerical and categorical data: Decision trees can handle both types of data without needing feature scaling.
  • Overfitting risk: Decision trees are prone to overfitting, especially when the tree is deep, meaning it has too many branches.

Advantages and Disadvantages of Decision Trees

Advantages:

  • Simple to understand and interpret.
  • Can handle both categorical and numerical data.
  • Requires little data preprocessing, such as normalization or scaling.
  • Some implementations can handle missing values and still make a classification based on the remaining features.

Disadvantages:

  • Prone to overfitting, especially with deep trees.
  • Not robust to small changes in the data.
  • Can be biased towards features with more levels (many categories in categorical features).

Decision Tree Classifier in Python

In this section, we will implement a Decision Tree classifier using Scikit-learn and demonstrate how to classify data.

1. Load the Dataset

For this example, we will use the famous Iris dataset, which contains measurements of three types of iris flowers (Setosa, Versicolor, and Virginica) and the goal is to classify the flowers based on these features.
from sklearn.datasets import load_iris
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
df.head()
    
The previous code block consists of the following code lines:
  • Import necessary libraries:
    • from sklearn.datasets import load_iris - Imports the load_iris function from the sklearn.datasets module to load the Iris dataset.
    • import pandas as pd - Imports the pandas library, which is useful for data manipulation and visualization.
  • Load the Iris dataset:
    • iris = load_iris() - Loads the Iris dataset into the variable iris. This dataset contains data about the features and species of Iris flowers.
    • X = iris.data - Extracts the feature matrix (sepal length, sepal width, petal length, and petal width) from the dataset and stores it in X.
    • y = iris.target - Extracts the target labels (the species of the Iris flowers) from the dataset and stores it in y.
  • Create a DataFrame for better visualization:
    • df = pd.DataFrame(X, columns=iris.feature_names) - Creates a pandas DataFrame from the feature matrix X and labels the columns using the feature names from the Iris dataset.
    • df['target'] = y - Adds a new column named 'target' to the DataFrame df, containing the target labels (species) from y.
    • df.head() - Displays the first five rows of the DataFrame df to provide a preview of the data, including the features and target labels.

2. Train the Decision Tree Classifier

Now, let’s train a Decision Tree model using the Scikit-learn DecisionTreeClassifier. We will first split the dataset into training and testing sets, then train the classifier on the training set.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Decision Tree classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
    
The previous code block consists of the following code lines:
  • Import necessary libraries:
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function from the sklearn.model_selection module to split the dataset into training and testing sets.
    • from sklearn.tree import DecisionTreeClassifier - Imports the DecisionTreeClassifier from the sklearn.tree module to create and train a decision tree model for classification tasks.
  • Split the data into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) - Splits the feature matrix X and target labels y into training and testing sets, with 30% of the data allocated for testing. The random_state=42 ensures reproducibility of the data split.
  • Initialize and train the Decision Tree classifier:
    • dt_classifier = DecisionTreeClassifier(random_state=42) - Initializes the DecisionTreeClassifier with random_state=42 so that the trained tree is reproducible across runs.
    • dt_classifier.fit(X_train, y_train) - Trains the decision tree on the training features X_train and the corresponding labels y_train.
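
As a quick sanity check after fitting, the size of the grown tree can be inspected; a minimal sketch using the trained dt_classifier from above:
# Inspect the complexity of the unrestricted tree
print(f"Tree depth: {dt_classifier.get_depth()}")
print(f"Number of leaves: {dt_classifier.get_n_leaves()}")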

3. Evaluate the Model

After training the model, let’s evaluate its performance by making predictions on the test set and calculating the accuracy.
from sklearn.metrics import accuracy_score

# Make predictions on the test set
y_pred = dt_classifier.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
    
The previous code block consists of the following code lines:
  • Import the necessary metric:
    • from sklearn.metrics import accuracy_score - Imports the accuracy_score function from the sklearn.metrics module to evaluate the performance of the model based on its accuracy.
  • Make predictions on the test set:
    • y_pred = dt_classifier.predict(X_test) - Uses the trained dt_classifier to predict the target values for the test set X_test, storing the predicted labels in y_pred.
  • Calculate and print the accuracy:
    • accuracy = accuracy_score(y_test, y_pred) - Compares the predicted labels y_pred with the true labels y_test to calculate the accuracy of the model. The accuracy is the proportion of correct predictions.
    • print(f"Accuracy: {accuracy:.2f}") - Prints the calculated accuracy, formatting the value to two decimal places using .2f.
When the code is executed, the following accuracy value is obtained:
Accuracy: 1.00
The achieved accuracy of 1.00 indicates perfect classification of the test set. Such a perfect score is not unusual on the small and easily separable Iris dataset, but it should not be expected on more complex data. The next step is to visualize the decision tree.
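
Accuracy alone does not show which classes are being confused with one another. As an optional, more detailed check, a per-class report and a confusion matrix can be printed; this sketch assumes the y_pred and y_test arrays from above.
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall and F1-score
print(classification_report(y_test, y_pred, target_names=list(iris.target_names)))

# Rows correspond to true classes, columns to predicted classes
print(confusion_matrix(y_test, y_pred))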

4. Visualize the Decision Tree

One of the advantages of Decision Trees is their interpretability. We can visualize the trained decision tree using Scikit-learn’s `plot_tree` function.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Visualize the decision tree
plt.figure(figsize=(12,8))
plot_tree(dt_classifier, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True)
plt.show()
    
The previous code block consists of the following code lines:
  • Import necessary libraries for visualization:
    • from sklearn.tree import plot_tree - Imports the plot_tree function from sklearn.tree to plot the structure of the decision tree.
    • import matplotlib.pyplot as plt - Imports matplotlib.pyplot to handle the visualization and plotting of the decision tree.
  • Create a figure for the plot:
    • plt.figure(figsize=(12,8)) - Initializes a new figure with a specified size of 12 by 8 inches for a clear and readable visualization.
  • Plot the decision tree:
    • plot_tree(dt_classifier, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True) - Uses the plot_tree function to create a visual representation of the trained decision tree (dt_classifier). The arguments:
      • filled=True - Fills the nodes with colors to represent the class distribution.
      • feature_names=iris.feature_names - Specifies the feature names from the Iris dataset to label the features in the tree.
      • class_names=iris.target_names - Specifies the class names (species) for the target labels.
      • rounded=True - Rounds the corners of the nodes for aesthetic purposes.
  • Display the plot:
    • plt.show() - Displays the decision tree plot.
After executing the code, the following figure is obtained.
Figure 1 - Graphical representation of decision tree classifier

The plot above shows the decision tree and how it splits the data at each node. Each node represents a feature and a threshold that is used to split the data. The leaf nodes show the predicted class for each partition.
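
If a graphical plot is not convenient, for example when working in a terminal, Scikit-learn also provides a plain-text view of the same tree via export_text. A minimal sketch using the trained dt_classifier:
from sklearn.tree import export_text

# Print the decision rules of the trained tree as indented text
tree_rules = export_text(dt_classifier, feature_names=list(iris.feature_names))
print(tree_rules)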

Pruning the Decision Tree

Decision Trees can easily become too complex and overfit the data if they grow too deep. One way to mitigate overfitting is by pruning the tree. Pruning involves setting a maximum depth for the tree or requiring a minimum number of samples at a node.
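
As an aside, a minimal sketch of the second option, restricting the minimum number of samples required to split a node and to form a leaf, could look as follows; the values of min_samples_split and min_samples_leaf are illustrative and would normally be tuned, for example with cross-validation.
# Require at least 10 samples to split a node and at least 5 samples in each leaf
dt_classifier_min_samples = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=5, random_state=42)
dt_classifier_min_samples.fit(X_train, y_train)
print(f"Test accuracy: {dt_classifier_min_samples.score(X_test, y_test):.2f}")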

Prune the Tree Using Maximum Depth

We can limit the depth of the tree to prevent overfitting by setting the `max_depth` parameter.
# Train the decision tree with a maximum depth of 3
dt_classifier_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_classifier_pruned.fit(X_train, y_train)

# Visualize the pruned decision tree
plt.figure(figsize=(12,8))
plot_tree(dt_classifier_pruned, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True)
plt.show()
    
The previous code block consists of the following code lines:
  • Train a pruned decision tree with a maximum depth of 3:
    • dt_classifier_pruned = DecisionTreeClassifier(max_depth=3, random_state=42) - Initializes a new DecisionTreeClassifier with a maximum depth of 3, which limits the tree's complexity. The random_state=42 ensures reproducibility.
    • dt_classifier_pruned.fit(X_train, y_train) - Trains the decision tree classifier on the training data (X_train) and corresponding labels (y_train).
  • Create a figure for the pruned tree plot:
    • plt.figure(figsize=(12,8)) - Initializes a new figure for plotting the decision tree, setting the figure size to 12 by 8 inches for clear visualization.
  • Plot the pruned decision tree:
    • plot_tree(dt_classifier_pruned, filled=True, feature_names=iris.feature_names, class_names=iris.target_names, rounded=True) - Uses the plot_tree function to create a visual representation of the pruned decision tree (dt_classifier_pruned). The arguments:
      • filled=True - Fills the nodes with colors to represent the class distribution.
      • feature_names=iris.feature_names - Specifies the feature names from the Iris dataset to label the features in the tree.
      • class_names=iris.target_names - Specifies the class names (species) for the target labels.
      • rounded=True - Rounds the corners of the nodes for aesthetic purposes.
  • Display the pruned tree plot:
    • plt.show() - Displays the pruned decision tree plot.
After the code is executed, the graphical representation of the decision tree classifier with its depth limited to 3 is shown in Figure 2.
Figure 2 - Decision Tree Classifier with limited depth to 3
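
To check that pruning actually helps with overfitting, the training and test accuracy of the unrestricted tree and the pruned tree can be compared side by side. A minimal sketch using the two classifiers trained above (the exact numbers will depend on the train/test split):
# Compare training and test accuracy before and after pruning
for name, model in [("unrestricted", dt_classifier), ("max_depth=3", dt_classifier_pruned)]:
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: train accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")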

Conclusion

Decision Trees are a powerful tool for classification tasks, offering advantages like interpretability and the ability to handle both numerical and categorical data. However, they can easily overfit, which can be mitigated by pruning the tree or using ensemble methods like Random Forests.

In this post, we have covered the basics of Decision Trees, including how to implement and evaluate a Decision Tree classifier using Scikit-learn, as well as how to visualize and prune the tree to improve performance. Experiment with different datasets to explore the full potential of Decision Trees!