Decision Trees are powerful models, but they tend to overfit when left unrestricted. Overfitting occurs when a model memorizes the training data instead of generalizing to unseen data. In this post, we will explore how to prevent overfitting using the max_depth, min_samples_split, and min_samples_leaf parameters in Scikit-learn.
1. What Causes Overfitting in Decision Trees?
Overfitting happens when a Decision Tree grows too deep, capturing noise instead of meaningful patterns. This results in:
- High accuracy on training data but poor performance on test data.
- Complex models with too many nodes and splits.
- Reduced generalization to new data.
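Because the Iris dataset used later in this post is small and cleanly separable, this train/test gap can be hard to see there. The minimal sketch below (an addition to the original walkthrough, using a synthetic dataset with label noise from make_classification) makes the symptom concrete:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with overlapping classes, so a perfect fit must memorize noise
X_noisy, y_noisy = make_classification(n_samples=500, n_features=20,
                                       n_informative=5, flip_y=0.1,
                                       random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_noisy, y_noisy, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
print(f"Train accuracy: {tree.score(Xtr, ytr):.3f}")  # typically close to 1.0
print(f"Test accuracy: {tree.score(Xte, yte):.3f}")   # usually noticeably lower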
2. Loading and Preparing the Dataset
We will use the Iris dataset for this demonstration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The code block above consists of the following lines:
- from sklearn.datasets import load_iris: This line imports the 'load_iris' function from the 'sklearn.datasets' module, which allows us to load the well-known Iris dataset.
- from sklearn.model_selection import train_test_split: This imports the 'train_test_split' function from 'sklearn.model_selection'. It is used to split the dataset into training and testing sets.
- from sklearn.tree import DecisionTreeClassifier: This imports the 'DecisionTreeClassifier' from 'sklearn.tree', which will be used to train a decision tree model for classification.
- import numpy as np: This imports the 'numpy' library, which is useful for handling numerical operations, although it is not directly used in the code block shown.
- # Load the dataset: This is a comment that explains the following line of code, where the Iris dataset is loaded using 'load_iris()'.
- iris = load_iris(): This line loads the Iris dataset and stores it in the variable 'iris'. The dataset contains both the features (X) and the target labels (y).
- X, y = iris.data, iris.target: Here, the feature data (X) and target labels (y) are extracted from the 'iris' object. 'X' contains the feature data, and 'y' contains the target labels (species of Iris).
- # Split into training and testing sets: This is a comment indicating that the following code will split the data into training and testing subsets.
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This line splits the dataset into training and testing sets. 80% of the data is used for training, and 20% is used for testing. The 'random_state' ensures that the split is reproducible.
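As a quick sanity check (a small addition, not part of the original listing), we can confirm the 80/20 split sizes; passing stratify=y to train_test_split would additionally keep the class proportions equal in both subsets:

# Confirm the split sizes (Iris has 150 samples and 4 features)
print(X_train.shape, X_test.shape)  # expected: (120, 4) and (30, 4)

# Optional: a stratified split preserves the class balance in train and test sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)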
3. Overfitting Example: Unrestricted Decision Tree
Let's train a Decision Tree without any restrictions.
# Train an unrestricted Decision Tree
clf_overfit = DecisionTreeClassifier(random_state=42)
clf_overfit.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_overfit.score(X_train, y_train)
test_accuracy = clf_overfit.score(X_test, y_test)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

The code block above consists of the following lines:
- # Train an unrestricted Decision Tree: This is a comment indicating that the following code will train a decision tree model without any restrictions (e.g., max depth).
- clf_overfit = DecisionTreeClassifier(random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with the 'random_state' set to 42. This ensures reproducibility of the model's results.
- clf_overfit.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The model is trained to learn the relationships between the features and target labels in the training dataset.
- # Evaluate performance: This comment indicates that the following lines will evaluate the model's performance on both the training and testing datasets.
- train_accuracy = clf_overfit.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset. The accuracy is the proportion of correct predictions made by the model.
- test_accuracy = clf_overfit.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset. It evaluates how well the model generalizes to new, unseen data.
- print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
- print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, also formatted to four decimal places.
Expected Outcome: High training accuracy but lower test accuracy, indicating overfitting.
After executing the code, we obtain:

Training Accuracy: 1.0000
Test Accuracy: 1.0000

Both accuracies are equal to 1.0000, so the expected drop in test accuracy does not actually appear on this split: the Iris dataset is small and easily separable, so even an unrestricted tree happens to classify the test set perfectly. The warning sign remains the perfect training accuracy, which shows the tree has fit the training data completely.
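To see how complex the unrestricted tree actually is, we can inspect its depth and number of leaves (a short addition to the original listing; get_depth() and get_n_leaves() are standard methods of a fitted DecisionTreeClassifier):

# Inspect the structure of the unrestricted tree
print(f"Tree depth: {clf_overfit.get_depth()}")
print(f"Number of leaves: {clf_overfit.get_n_leaves()}")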
4. Controlling Overfitting with max_depth
max_depth limits how deep the tree can grow. A lower depth prevents the tree from memorizing noise.
# Train a Decision Tree with limited depth
clf_depth = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_depth.score(X_train, y_train)
test_accuracy = clf_depth.score(X_test, y_test)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

The code block above consists of the following lines:
- # Train a Decision Tree with limited depth: This comment indicates that the following code will train a decision tree model with a restricted maximum depth to avoid overfitting.
- clf_depth = DecisionTreeClassifier(max_depth=3, random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with a specified 'max_depth' of 3. This limits the depth of the decision tree to prevent it from growing too complex and overfitting the training data. The 'random_state' is set to 42 for reproducibility.
- clf_depth.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The model learns the relationships between the features and target labels, but with a restricted tree depth.
- # Evaluate performance: This comment indicates that the following lines will evaluate the performance of the model on both the training and testing datasets.
- train_accuracy = clf_depth.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset, representing how well the model fits the data it was trained on.
- test_accuracy = clf_depth.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset, evaluating the model's ability to generalize to new, unseen data.
- print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
- print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, formatted to four decimal places.
Expected Outcome: Slightly lower training accuracy but improved test accuracy, reducing overfitting.
After executing the code, we obtain:

Training Accuracy: 0.9583
Test Accuracy: 1.0000

Training accuracy drops slightly while test accuracy stays at 1.0000. On this easy dataset the restriction cannot improve an already-perfect test score, but it produces a much simpler tree that is less likely to have memorized noise.
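Rather than guessing a depth, one option (a sketch added here, not part of the original post) is to compare a few candidate depths with cross-validation on the training set:

from sklearn.model_selection import cross_val_score

# Compare several candidate depths using 5-fold cross-validation
for depth in [1, 2, 3, 4, 5, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.4f}")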
5. Controlling Overfitting with min_samples_split
min_samples_split controls the minimum number of samples needed to split a node. Increasing this value forces the tree to consider only significant splits.
# Train a Decision Tree with min_samples_split restriction
clf_split = DecisionTreeClassifier(min_samples_split=10, random_state=42)
clf_split.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_split.score(X_train, y_train)
test_accuracy = clf_split.score(X_test, y_test)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

The code block above consists of the following lines:
- # Train a Decision Tree with min_samples_split restriction: This comment indicates that the following code will train a decision tree model with a restriction on the minimum number of samples required to split an internal node. This restriction helps in controlling the model's complexity.
- clf_split = DecisionTreeClassifier(min_samples_split=10, random_state=42): This line creates an instance of the 'DecisionTreeClassifier' class with a specified 'min_samples_split' of 10. This means that a node in the decision tree will only be split if it has at least 10 samples, thus preventing the model from making overly fine distinctions on small subsets of data. The 'random_state' is set to 42 for reproducibility.
- clf_split.fit(X_train, y_train): This line fits the decision tree model to the training data (X_train and y_train). The decision tree will build using the provided training data with the specified restriction on the minimum number of samples required to split a node.
- # Evaluate performance: This comment indicates that the following lines will evaluate the model's performance on both the training and testing datasets.
- train_accuracy = clf_split.score(X_train, y_train): This line calculates the accuracy of the trained model on the training dataset, representing how well the model fits the data it was trained on.
- test_accuracy = clf_split.score(X_test, y_test): This line calculates the accuracy of the trained model on the testing dataset, providing an evaluation of how well the model generalizes to unseen data.
- print(f"Training Accuracy: {train_accuracy:.4f}"): This prints the accuracy of the model on the training dataset, formatted to four decimal places.
- print(f"Test Accuracy: {test_accuracy:.4f}"): This prints the accuracy of the model on the testing dataset, formatted to four decimal places.
Executing the code produces:

Training Accuracy: 0.9583
Test Accuracy: 1.0000
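To confirm that the restriction really simplifies the model, we can compare the size of this tree with the unrestricted one (an added check; get_depth() and get_n_leaves() are standard methods of a fitted tree):

# Compare tree complexity: unrestricted vs. min_samples_split=10
for name, model in [("Unrestricted", clf_overfit), ("min_samples_split=10", clf_split)]:
    print(f"{name}: depth = {model.get_depth()}, leaves = {model.get_n_leaves()}")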
6. Controlling Overfitting with min_samples_leaf
min_samples_leaf sets the minimum number of samples required to be at a leaf node. Larger values help in reducing overfitting.
# Train a Decision Tree with min_samples_leaf restriction
clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42)
clf_leaf.fit(X_train, y_train)

# Evaluate performance
train_accuracy = clf_leaf.score(X_train, y_train)
test_accuracy = clf_leaf.score(X_test, y_test)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

The code block above consists of the following lines:
- clf_leaf = DecisionTreeClassifier(min_samples_leaf=5, random_state=42): This line initializes a DecisionTreeClassifier with the hyperparameter 'min_samples_leaf' set to 5, which means that each leaf node in the decision tree must have at least 5 samples. This restriction helps control overfitting by preventing the tree from creating nodes with very few samples. The random_state parameter is set to 42 to ensure reproducibility of the model's results.
- clf_leaf.fit(X_train, y_train): This line trains the decision tree model ('clf_leaf') using the training dataset ('X_train' and 'y_train'). The decision tree is built based on the features (X_train) and target values (y_train) of the training data.
- train_accuracy = clf_leaf.score(X_train, y_train): After training the model, this line calculates the accuracy of the model on the training data (X_train and y_train). It evaluates how well the model fits the data it was trained on.
- test_accuracy = clf_leaf.score(X_test, y_test): This line calculates the accuracy of the model on the testing dataset (X_test and y_test). It evaluates how well the model generalizes to new, unseen data.
- print(f"Training Accuracy: {train_accuracy:.4f}"): This line prints the training accuracy, rounded to four decimal places, to show how well the model performed on the training set.
- print(f"Test Accuracy: {test_accuracy:.4f}"): Similarly, this line prints the testing accuracy, rounded to four decimal places, to show the model's performance on the testing set.
Executing the code produces:

Training Accuracy: 0.9500
Test Accuracy: 1.0000
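One way to verify the effect of min_samples_leaf is to count how many training samples fall into each leaf (a small added sketch; apply() returns the leaf index that each training sample ends up in):

import numpy as np

# Count training samples per leaf; every count should be at least 5
leaf_ids = clf_leaf.apply(X_train)
leaf_counts = np.bincount(leaf_ids)
print("Samples per leaf:", leaf_counts[leaf_counts > 0])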
7. Comparing the Models
Let's summarize how these hyperparameters affect overfitting.
models = {
    "Overfitted": clf_overfit,
    "Max Depth (3)": clf_depth,
    "Min Samples Split (10)": clf_split,
    "Min Samples Leaf (5)": clf_leaf
}

for name, model in models.items():
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"{name}: Train Accuracy = {train_acc:.4f}, Test Accuracy = {test_acc:.4f}")

The code block above consists of the following lines:
- models = { ... }: This line defines a dictionary called 'models', where each key is a description of a model, and the value is the corresponding trained model object. The models included are 'Overfitted', 'Max Depth (3)', 'Min Samples Split (10)', and 'Min Samples Leaf (5)'. Each model has different hyperparameter configurations applied.
- for name, model in models.items():: This line starts a loop that iterates over each model in the 'models' dictionary. For each iteration, the variable 'name' will hold the description of the model, and 'model' will hold the actual trained model.
- train_acc = model.score(X_train, y_train): Inside the loop, this line calculates the accuracy of the current model on the training dataset, which shows how well the model fits the training data.
- test_acc = model.score(X_test, y_test): This line calculates the accuracy of the current model on the testing dataset, indicating how well the model generalizes to unseen data.
- print(f"{name}: Train Accuracy = {train_acc:.4f}, Test Accuracy = {test_acc:.4f}"): This line prints the name of the model (e.g., 'Overfitted', 'Max Depth (3)', etc.) along with its corresponding training and testing accuracy, formatted to four decimal places.
Expected Outcome: The restricted models will have slightly lower training accuracy; on datasets where the unrestricted tree genuinely overfits, they will also show better test accuracy.
8. Key Takeaways
To avoid overfitting in Decision Trees:
- Use max_depth to limit tree growth and prevent memorization of noise.
- Increase min_samples_split to ensure meaningful splits.
- Set min_samples_leaf to avoid creating deep branches with few samples.
By fine-tuning these parameters, we can build a more generalizable model that performs well on unseen data.
Finally, when the comparison code is executed, the following output is obtained:

Overfitted: Train Accuracy = 1.0000, Test Accuracy = 1.0000
Max Depth (3): Train Accuracy = 0.9583, Test Accuracy = 1.0000
Min Samples Split (10): Train Accuracy = 0.9583, Test Accuracy = 1.0000
Min Samples Leaf (5): Train Accuracy = 0.9500, Test Accuracy = 1.0000

The results of the Decision Tree models show varying levels of performance based on the different hyperparameter restrictions.
The Overfitted model, which has no restrictions, achieved perfect accuracy on both sets, with a training accuracy of 1.0000 and a test accuracy of 1.0000. The perfect training accuracy is the hallmark of an unrestricted tree that has memorized the training set; on this small, cleanly separable dataset the test split also happens to be classified perfectly, but on noisier data such a model would typically generalize worse.
The Max Depth (3) model, which restricts the tree's depth to 3, performed slightly less well on the training set with a training accuracy of 0.9583 but still achieved perfect accuracy on the test set (1.0000). This indicates that limiting the depth of the tree helped prevent overfitting, allowing the model to generalize well to unseen data while still maintaining good performance on the training data.
Similarly, the Min Samples Split (10) model, which restricts the minimum number of samples required to split an internal node to 10, achieved the same performance as the Max Depth (3) model with a training accuracy of 0.9583 and a test accuracy of 1.0000. This suggests that increasing the minimum number of samples required to make a split also helped prevent overfitting, leading to similar generalization performance.
The Min Samples Leaf (5) model, which ensures that each leaf node contains at least 5 samples, showed the lowest training accuracy at 0.9500, but still achieved perfect accuracy on the test set (1.0000). This further confirms that restricting the number of samples in each leaf can slightly reduce the model’s ability to fit the training data perfectly but still does not hinder its ability to generalize well.
In summary, while all models achieved perfect test accuracy, the Overfitted model performed too well on the training data, indicating overfitting. The other models, which include restrictions like depth or minimum sample size, maintained a balance between good training accuracy and perfect test accuracy, reflecting improved generalization.
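As a closing note, the three parameters can also be tuned jointly. The sketch below (an addition, not part of the original workflow) uses GridSearchCV to search a small grid with cross-validation on the training set:

from sklearn.model_selection import GridSearchCV

# Jointly tune all three parameters with a cross-validated grid search
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 3, 5],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")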