Saturday, March 1, 2025

Feature Importance in Decision Trees


Decision Trees are widely used in machine learning because they provide not only high accuracy but also interpretability. One of the most valuable aspects of Decision Trees is their ability to rank feature importance, which helps in understanding which features contribute the most to predictions.

1. What is Feature Importance?

Feature importance measures how much each feature contributes to reducing impurity in a Decision Tree model. Scikit-learn provides an easy way to extract these values using the feature_importances_ attribute.
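
Concretely, scikit-learn's default (impurity-based) importance of a feature is the normalized sum of the impurity decreases produced by every split on that feature, sometimes called Gini importance. As a rough sketch of the quantity, writing I(t) for the impurity of node t (Gini by default), N_t for the number of training samples reaching t, t_L and t_R for its children, and N for the total number of training samples:

\Delta I(t) = \frac{N_t}{N}\Big(I(t) - \frac{N_{t_L}}{N_t} I(t_L) - \frac{N_{t_R}}{N_t} I(t_R)\Big),
\qquad
\text{importance}(f) = \frac{\sum_{t\,\text{splits on}\,f} \Delta I(t)}{\sum_{t} \Delta I(t)}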

2. Loading and Preparing the Dataset

We will use the Iris dataset for this demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following lines of code.
  • from sklearn.datasets import load_iris: This imports the load_iris function from the sklearn.datasets module, which loads the Iris dataset.
  • from sklearn.model_selection import train_test_split: This imports the train_test_split function from the sklearn.model_selection module, used to split data into training and testing sets.
  • from sklearn.tree import DecisionTreeClassifier: This imports the DecisionTreeClassifier from the sklearn.tree module, which is used to create a decision tree model for classification.
  • import numpy as np: This imports the numpy library with the alias np, which is used for numerical operations in Python.
  • import matplotlib.pyplot as plt: This imports the matplotlib.pyplot module with the alias plt, which is used for creating plots and visualizations.
  • # Load the dataset: A comment indicating that the next lines of code will load the dataset.
  • iris = load_iris(): This loads the Iris dataset into the iris variable. The dataset contains features (data) and target labels (target).
  • X, y = iris.data, iris.target: This splits the iris dataset into two variables: X (feature data) and y (target labels).
  • feature_names = iris.feature_names: This stores the feature names of the Iris dataset into the variable feature_names.
  • # Split into training and testing sets: A comment indicating that the data will be split into training and testing sets.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This splits the dataset into training and testing sets using the train_test_split function. test_size=0.2 means 20% of the data is used for testing, and random_state=42 ensures the split is reproducible.
The code so far imports the required libraries, loads the Iris dataset, and prepares it for training and testing the decision tree classifier. The next step is to define the classification model and train it on the X_train and y_train values.
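
As a quick, optional sanity check (not part of the original post), you can confirm the shapes of the split and how the three classes are distributed before training:

# Optional sanity check on the split: array shapes and per-class sample counts
print("Train shape:", X_train.shape, " Test shape:", X_test.shape)
print("Train class counts:", np.bincount(y_train))
print("Test class counts: ", np.bincount(y_test))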

3. Training a Decision Tree Model

Let's train a Decision Tree classifier and extract feature importance values.

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Extract feature importances
importances = clf.feature_importances_

# Display feature importance values
for feature, importance in zip(feature_names, importances):
    print(f"{feature}: {importance:.4f}")
    
The previous code block consists of the following lines of code.
  • # Train a Decision Tree Classifier: A comment indicating that the next lines of code will train a decision tree classifier.
  • clf = DecisionTreeClassifier(max_depth=3, random_state=42): This creates an instance of the DecisionTreeClassifier with a maximum depth of 3 (to prevent overfitting) and a random_state=42 for reproducibility of results.
  • clf.fit(X_train, y_train): This trains the decision tree classifier on the training data X_train (features) and y_train (target labels).
  • # Extract feature importances: A comment indicating that the following line of code will extract the importance of each feature in the trained model.
  • importances = clf.feature_importances_: This retrieves the feature importance values from the trained decision tree model and stores them in the importances variable.
  • # Display feature importance values: A comment indicating that the next lines of code will display the feature importance values.
  • for feature, importance in zip(feature_names, importances):: This iterates through the feature_names (the names of the features) and importances (the importance values) simultaneously using the zip function.
  • print(f"{feature}: {importance:.4f}"): This prints each feature's name and its corresponding importance value, formatted to four decimal places.
Running this code prints the importance of each feature:
sepal length (cm): 0.0000
sepal width (cm): 0.0000
petal length (cm): 0.9346
petal width (cm): 0.0654
The feature importances indicate how much each feature contributes to the decision tree's ability to classify the Iris dataset. "Sepal length (cm)" and "sepal width (cm)" have an importance of 0.0000, meaning they are not used by any split in this particular tree and therefore contribute nothing to its decisions. "Petal length (cm)", with an importance of 0.9346, dominates the model's decision-making, while "petal width (cm)" contributes a smaller share of 0.0654. These values show that the tree relies almost entirely on the petal measurements for classification.
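
To connect these numbers back to the impurity-reduction idea from section 1, the following optional sketch (not part of the original post) recomputes the importances by hand from the fitted tree's internals, mirroring scikit-learn's impurity-based calculation; it reuses the clf, feature_names, and numpy (np) objects defined above.

# Recompute impurity-based importances manually from the fitted tree structure
tree = clf.tree_
manual = np.zeros(tree.n_features)

for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaf node: no split, so no impurity decrease
        continue
    # Weighted impurity decrease contributed by the split at this node
    manual[tree.feature[node]] += (
        tree.weighted_n_node_samples[node] * tree.impurity[node]
        - tree.weighted_n_node_samples[left] * tree.impurity[left]
        - tree.weighted_n_node_samples[right] * tree.impurity[right]
    )

manual /= manual.sum()  # normalize so the importances sum to 1

for feature, value in zip(feature_names, manual):
    print(f"{feature}: {value:.4f}")

The printed values should match clf.feature_importances_ up to floating-point rounding.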

4. Visualizing Feature Importance

We can plot the feature importance values for better understanding.

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Decision Tree")
plt.show()
    
The previous code block consists of the following lines of code.
  • # Plot feature importance: A comment indicating that the following lines of code will plot the feature importance.
  • plt.figure(figsize=(8, 5)): This creates a new figure for the plot with a specified size of 8 inches by 5 inches using matplotlib.pyplot.
  • plt.barh(feature_names, importances, color="skyblue"): This creates a horizontal bar plot (barh) where the feature_names are plotted on the y-axis and the importances are plotted on the x-axis. The bars are colored "skyblue".
  • plt.xlabel("Feature Importance"): This sets the label for the x-axis as "Feature Importance".
  • plt.ylabel("Feature"): This sets the label for the y-axis as "Feature".
  • plt.title("Feature Importance in Decision Tree"): This sets the title of the plot as "Feature Importance in Decision Tree".
  • plt.show(): This displays the plot on the screen.
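
If you prefer the bars ordered by magnitude, a small optional variation (not in the original post) is to sort the importances before plotting:

# Optional variation: sort features so the most important bar appears at the top
order = np.argsort(importances)
plt.figure(figsize=(8, 5))
plt.barh(np.array(feature_names)[order], importances[order], color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Decision Tree (sorted)")
plt.tight_layout()
plt.show()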

5. Interpreting Feature Importance

Higher feature importance values indicate that the feature has a greater influence on the model’s decisions. Features with very low importance can often be removed to simplify the model without significant loss in performance.
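
Impurity-based importances are computed from the training data alone and can be biased toward features with many distinct values. As an optional cross-check (not part of the original post), scikit-learn's permutation_importance can be evaluated on the held-out test set; the sketch below reuses clf, X_test, y_test, and feature_names from above.

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the resulting drop in accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=30, random_state=42)

for feature, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{feature}: {mean:.4f} +/- {std:.4f}")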

6. Using Feature Importance for Feature Selection

If some features have very low importance, we can remove them and retrain the model:

# Select important features (threshold of 0.1)
important_features = [feature for feature, importance in zip(feature_names, importances) if importance > 0.1]
print("Selected Features:", important_features)
    
The previous block of code consists of the following lines of code.
  • # Select important features (threshold of 0.1): A comment indicating that the following code will select features that have an importance greater than 0.1.
  • important_features = [feature for feature, importance in zip(feature_names, importances) if importance > 0.1]: This is a list comprehension that iterates over the feature_names and importances simultaneously using zip. It selects only those features where the importance is greater than 0.1 and stores them in the important_features list.
  • print("Selected Features:", important_features): This prints the list of selected important features to the console.
Executing the previous code produces the following output.
Selected Features: ['petal length (cm)']
With an importance threshold of 0.1, the only selected feature is "petal length (cm)". "Sepal length (cm)" and "sepal width (cm)" are excluded because their importance is 0.0000, meaning they provide no information to this particular tree, and "petal width (cm)" is also dropped because its importance of 0.0654 falls below the threshold. According to this decision tree, "petal length" is therefore the key feature driving its predictions.
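
Since the goal is to remove the low-importance features and retrain, here is a short sketch of that follow-up step (not shown in the original code); it reuses the objects defined above and selects column indices rather than feature names.

# Keep only the columns whose importance exceeds the 0.1 threshold, retrain, and compare
selected_idx = [i for i, importance in enumerate(importances) if importance > 0.1]

clf_selected = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_selected.fit(X_train[:, selected_idx], y_train)

print("Accuracy with all features:     ", clf.score(X_test, y_test))
print("Accuracy with selected features:", clf_selected.score(X_test[:, selected_idx], y_test))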

7. Key Takeaways

  • Feature importance helps in understanding which features contribute the most to model predictions.
  • We can use feature importance to remove irrelevant features and improve model efficiency.
  • Visualizing feature importance can aid in better interpretability.

By leveraging feature importance, we can make our models more interpretable and efficient!
