Thursday, February 27, 2025

Hyperparameter Tuning for Decision Trees

Decision Trees are powerful machine learning models, but their performance heavily depends on the choice of hyperparameters. In this guide, we will explore how to optimize Decision Tree hyperparameters using Scikit-learn's GridSearchCV and RandomizedSearchCV.

1. Understanding Hyperparameters in Decision Trees

Key hyperparameters that affect Decision Tree performance include:

  • max_depth: Limits the depth of the tree to prevent overfitting.
  • min_samples_split: The minimum number of samples required to split an internal node.
  • min_samples_leaf: The minimum number of samples required to be at a leaf node.
  • max_features: The number of features to consider when looking for the best split.
  • criterion: The function to measure the quality of a split (gini or entropy for classification, squared_error for regression).
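
To make the effect of these hyperparameters concrete, the following minimal sketch fits trees with different max_depth values on the full Iris data and reports how large each tree becomes. The loop values and the tree_sizes dictionary are illustrative choices, not part of the tutorial's own pipeline.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit one tree per depth limit and record (actual depth, number of leaves).
tree_sizes = {}
for depth in [1, 2, 3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X, y)
    tree_sizes[depth] = (tree.get_depth(), tree.get_n_leaves())
    print(f"max_depth={depth}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

A smaller max_depth yields a shallower tree with fewer leaves, while max_depth=None lets the tree grow until the leaves are pure or another constraint stops it.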

2. Loading and Preparing the Dataset

We will use the famous Iris dataset for classification.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following steps.
  • Import the necessary libraries:
    • from sklearn.datasets import load_iris - Imports the load_iris function from sklearn.datasets to load the Iris dataset.
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function from sklearn.model_selection to split the dataset into training and testing sets.
    • from sklearn.tree import DecisionTreeClassifier - Imports the DecisionTreeClassifier from sklearn.tree to create a decision tree classifier model.
  • Load the Iris dataset:
    • iris = load_iris() - Loads the Iris dataset, which includes features (sepal length, sepal width, petal length, petal width) and target values (species of the iris).
    • X, y = iris.data, iris.target - Separates the dataset into the feature matrix X (input features) and the target vector y (species labels).
  • Split the data into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Splits the dataset into training and testing sets:
      • X_train - Training feature matrix.
      • X_test - Testing feature matrix.
      • y_train - Training target vector.
      • y_test - Testing target vector.
      • test_size=0.2 - Specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.
      • random_state=42 - Ensures reproducibility by setting a seed for the random number generator.
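
As a quick sanity check, the sketch below prints the shapes produced by the split and the number of test samples per class. The second call adds stratify=y, which is an optional variation that the tutorial itself does not use; it keeps the class proportions equal in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the tutorial
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # test samples per class (not stratified)

# Optional variation (an assumption, not in the tutorial): stratified split
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_test_strat))    # [10 10 10]
```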

3. Baseline Model without Tuning

Let's train a basic Decision Tree without tuning.

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
accuracy = clf.score(X_test, y_test)
print(f"Baseline Accuracy: {accuracy:.4f}")
    
The previous code block consists of the following steps.
  • Train a Decision Tree Classifier:
    • clf = DecisionTreeClassifier(random_state=42) - Initializes a Decision Tree Classifier with a random seed set to 42 for reproducibility.
    • clf.fit(X_train, y_train) - Fits the classifier to the training data (X_train for features and y_train for target labels).
  • Evaluate the model:
    • accuracy = clf.score(X_test, y_test) - Evaluates the model by calculating the accuracy on the test data. The score method returns the mean accuracy of the classifier on the given test data.
    • print(f"Baseline Accuracy: {accuracy:.4f}") - Prints the accuracy of the classifier on the test set, rounded to 4 decimal places.
After executing the code written so far, the following output is obtained.
Baseline Accuracy: 1.0000
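
A single score on a 30-sample test set is a coarse estimate. As a complementary check, the untuned tree can also be scored with 5-fold cross-validation on the training set. This is a sketch added for illustration; the variable names baseline and cv_scores are not part of the tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Score the untuned tree on 5 train/validation folds of the training set
baseline = DecisionTreeClassifier(random_state=42)
cv_scores = cross_val_score(baseline, X_train, y_train, cv=5, scoring="accuracy")
print(f"Baseline CV Accuracy: {cv_scores.mean():.4f} +/- {cv_scores.std():.4f}")
```

This number is directly comparable to the best_score_ values reported by the searches in the next sections, because all of them are mean cross-validated accuracies on the training set.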

4. Hyperparameter Tuning using GridSearchCV

Grid Search performs an exhaustive search over a specified parameter grid.

from sklearn.model_selection import GridSearchCV

# Define hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Grid Search
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best Hyperparameters:", grid_search.best_params_)
print("Best Accuracy:", grid_search.best_score_)
    
The previous code block consists of the following steps.
  • Define hyperparameter grid:
    • param_grid - A dictionary containing the hyperparameters to be tuned and their possible values for the Decision Tree Classifier. This includes:
      • 'max_depth': [3, 5, 10, None] - Specifies the maximum depth of the tree.
      • 'min_samples_split': [2, 5, 10] - Defines the minimum number of samples required to split an internal node.
      • 'min_samples_leaf': [1, 2, 4] - Defines the minimum number of samples required to be at a leaf node.
      • 'criterion': ['gini', 'entropy'] - Specifies the function to measure the quality of a split (Gini impurity or Entropy).
  • Grid Search:
    • grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1) - Performs grid search with 5-fold cross-validation (cv=5) to find the best combination of hyperparameters using accuracy as the scoring metric (scoring='accuracy'). The parameter n_jobs=-1 enables parallel computation.
    • grid_search.fit(X_train, y_train) - Fits the grid search model to the training data (X_train and y_train) to explore the hyperparameter space.
  • Best parameters and score:
    • print("Best Hyperparameters:", grid_search.best_params_) - Prints the hyperparameters that produced the best performance during grid search.
    • print("Best Accuracy:", grid_search.best_score_) - Prints the mean cross-validated accuracy of the best hyperparameter combination.
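
The cost of an exhaustive search grows with the product of the list lengths in the grid. The sketch below counts the candidates for the grid defined above using ParameterGrid; the names n_candidates and n_folds are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Number of hyperparameter combinations: 4 * 3 * 3 * 2
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5

print(n_candidates)            # 72
print(n_candidates * n_folds)  # 360 model fits during cross-validation
```

Adding one more value to any list multiplies the number of fits, which is why grid search becomes expensive for larger grids.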

Explanation:

  • The search evaluates every combination of the values in param_grid.
  • The cv=5 argument performs 5-fold cross-validation for each combination.
  • The best parameters and their mean cross-validated accuracy are displayed after tuning.
After executing the code, the following output is obtained.
Best Hyperparameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best Accuracy: 0.9583333333333334
The grid search selects 'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 4, and 'min_samples_split': 2. With the entropy criterion, splits are chosen to maximize information gain. A maximum depth of 5 caps the complexity of the tree, which limits overfitting while leaving room to capture the patterns in the data. Setting 'min_samples_leaf': 4 requires every leaf to hold at least 4 samples, which removes leaves that would fit individual points. 'min_samples_split': 2 is the default and least restrictive value; because each child must keep at least 4 samples, a node needs at least 8 samples before it can be split, so this setting has no practical effect here. The best accuracy of 0.9583 is the mean accuracy over the 5 cross-validation folds of the training set, not the accuracy on the test set. It indicates that the selected hyperparameters generalize well across the folds; the test set is evaluated separately in Section 6.
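
Beyond the single best combination, the fitted search object stores the scores of every candidate in cv_results_. The following sketch reruns the same grid search and lists the five top-ranked candidates with the mean and standard deviation of their fold scores. The variables results and order are illustrative, and n_jobs is left at its default for simplicity.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# Sort candidates by rank (1 = best) and show the top five
results = grid_search.cv_results_
order = np.argsort(results['rank_test_score'])
for i in order[:5]:
    mean = results['mean_test_score'][i]
    std = results['std_test_score'][i]
    print(f"{mean:.4f} +/- {std:.4f}  {results['params'][i]}")
```

If several candidates score within one standard deviation of each other, the choice among them is largely arbitrary, and the simpler tree is usually a reasonable pick.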

5. Hyperparameter Tuning using RandomizedSearchCV

Random Search is an efficient alternative that samples a fixed number of hyperparameter combinations.

from sklearn.model_selection import RandomizedSearchCV
import numpy as np

# Define hyperparameter distributions
param_dist = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 10, 2),
    'criterion': ['gini', 'entropy']
}

# Randomized Search
random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

# Best parameters and score
print("Best Hyperparameters:", random_search.best_params_)
print("Best Accuracy:", random_search.best_score_)
    
The previous code block consists of the following steps.
  • import numpy as np: Imports the NumPy library, which is used to generate arrays for hyperparameter distributions.
  • Define hyperparameter distributions:
    • 'max_depth': [3, 5, 10, None]: Specifies possible values for the maximum depth of the decision tree. This determines how deep the tree can grow.
    • 'min_samples_split': np.arange(2, 20, 2): Specifies possible values for the minimum number of samples required to split an internal node. The values run from 2 to 18 in steps of 2 (the stop value 20 is excluded).
    • 'min_samples_leaf': np.arange(1, 10, 2): Specifies possible values for the minimum number of samples required to be at a leaf node. The values are 1, 3, 5, 7, and 9 (the stop value 10 is excluded).
    • 'criterion': ['gini', 'entropy']: Specifies the splitting criteria, either "gini" (Gini impurity) or "entropy" (Information gain).
  • Randomized Search Configuration:
    • random_search = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), param_distributions=param_dist, n_iter=20, cv=5, scoring='accuracy', n_jobs=-1, random_state=42): Defines the RandomizedSearchCV, which will randomly sample 20 hyperparameter combinations from param_dist and evaluate each using 5-fold cross-validation.
    • n_iter=20: Specifies the number of random combinations to try during the search.
    • cv=5: Sets the number of cross-validation folds to 5, ensuring the model is trained and validated on different subsets of the data.
    • scoring='accuracy': Uses accuracy as the evaluation metric to guide the hyperparameter search.
    • n_jobs=-1: Utilizes all available cores for parallel processing during the search.
    • random_state=42: Ensures that the random search can be reproduced in future runs by setting a fixed seed.
  • random_search.fit(X_train, y_train): Trains the RandomizedSearchCV model using the training data (X_train, y_train) and searches for the best combination of hyperparameters.
  • Display Results:
    • print("Best Hyperparameters:", random_search.best_params_): Prints the best combination of hyperparameters found during the search.
    • print("Best Accuracy:", random_search.best_score_): Prints the mean cross-validated accuracy corresponding to the best hyperparameters.
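
To see what the random search actually draws, the sketch below uses ParameterSampler, the utility behind RandomizedSearchCV, on the same param_dist. It also counts the size of the full search space for comparison. The variables total and samples are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_dist = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': np.arange(2, 20, 2),
    'min_samples_leaf': np.arange(1, 10, 2),
    'criterion': ['gini', 'entropy']
}

# Size of the full search space: 4 * 9 * 5 * 2
total = len(ParameterGrid(param_dist))
print(total)  # 360

# Draw 20 combinations; with list-like values only, sampling is without replacement
samples = list(ParameterSampler(param_dist, n_iter=20, random_state=42))
print(len(samples))  # 20
for params in samples[:3]:
    print(params)
```

With n_iter=20, the random search therefore evaluates only a small fraction of the 360 possible combinations.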

Explanation:

  • Random search selects random hyperparameter combinations.
  • The n_iter=20 argument limits the number of sampled combinations.
  • Randomized search is faster than grid search whenever n_iter is smaller than the number of grid combinations, and it often still finds good results.
After executing the code, the following results were obtained for RandomizedSearchCV.
Best Hyperparameters: {'min_samples_split': 14, 'min_samples_leaf': 3, 'max_depth': 10, 'criterion': 'entropy'}
Best Accuracy: 0.95
RandomizedSearchCV selects a min_samples_split of 14, a min_samples_leaf of 3, a max_depth of 10, and the entropy criterion. A min_samples_split of 14 means that a node is split only when it holds at least 14 samples, which prevents overly specific splits. A min_samples_leaf of 3 requires each leaf to contain at least 3 samples. These two constraints likely do most of the regularizing: with only 120 training samples, the tree is unlikely to come near the max_depth limit of 10. The entropy criterion again selects splits that maximize information gain. The reported accuracy of 0.95 is the mean 5-fold cross-validated accuracy on the training set, not a test-set score. It is slightly below the 0.9583 found by the grid search, which is plausible because the random search evaluated only 20 of the 360 combinations in its search space, and its candidate values differ from those of the grid.
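
RandomizedSearchCV also accepts distributions instead of fixed lists, which lets it sample any integer in a range. The sketch below swaps the np.arange arrays for scipy.stats.randint; this is a variation on the tutorial's setup rather than the configuration used above, and the chosen bounds are assumptions.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# randint(low, high) samples integers from low to high - 1
param_dist = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 20),  # 2..19
    'min_samples_leaf': randint(1, 10),   # 1..9
    'criterion': ['gini', 'entropy']
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20, cv=5, scoring='accuracy', random_state=42
)
search.fit(X_train, y_train)

print("Best Hyperparameters:", search.best_params_)
print("Best CV Accuracy:", search.best_score_)
```

When at least one parameter is given as a distribution, candidates are sampled with replacement, so duplicates are possible for small search spaces.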

6. Evaluating the Best Model

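We now evaluate the best model found by the grid search on the test set.

# Retrieve the best model (GridSearchCV has already refit it on the full
# training set, because refit=True by default)
best_model = grid_search.best_estimator_
# Optional: fitting again is redundant but harmless
best_model.fit(X_train, y_train)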

# Evaluate the model
best_accuracy = best_model.score(X_test, y_test)
print(f"Tuned Model Accuracy: {best_accuracy:.4f}")
    
The previous block of code consists of the following steps.
  • Retrieve the best model:
    • best_model = grid_search.best_estimator_: Retrieves the best model from the grid search, i.e. the estimator with the hyperparameters that achieved the highest mean cross-validated accuracy. Because GridSearchCV uses refit=True by default, this estimator has already been retrained on the full training set.
  • Train the best model:
    • best_model.fit(X_train, y_train): Trains the best model on the training data (X_train, y_train) again. Given the automatic refit, this call is redundant but harmless.
  • Evaluate the tuned model:
    • best_accuracy = best_model.score(X_test, y_test): Evaluates the tuned model on the test set (X_test, y_test) and calculates the accuracy score.
    • print(f"Tuned Model Accuracy: {best_accuracy:.4f}"): Prints the accuracy of the tuned model, rounded to four decimal places.
After executing the code for retrieving and evaluating the best model, the following accuracy is obtained.
Tuned Model Accuracy: 1.0000
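
Accuracy alone hides which classes are confused with each other. As a hedged follow-up, the sketch below reruns the grid search and prints a confusion matrix and a per-class report for the test set. It calls predict on the search object directly, which delegates to the refit best estimator; the variables y_pred and cm are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy'
)
grid_search.fit(X_train, y_train)

# With refit=True (the default), predict uses the best estimator
y_pred = grid_search.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```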

7. Conclusion

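Hyperparameter tuning can improve Decision Tree performance, although on a small and easily separable dataset such as Iris the baseline and the tuned model both reach a test accuracy of 1.0000. In this tutorial, we explored: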

  • GridSearchCV: Exhaustive search for optimal hyperparameters.
  • RandomizedSearchCV: Faster alternative that samples a subset of hyperparameter combinations.
  • How to evaluate the best model after tuning.

Try these methods on your own datasets to achieve better Decision Tree performance!
