Monday, March 3, 2025

Random Forests for Classification

Random Forest is a powerful ensemble learning algorithm that improves classification performance by combining multiple decision trees. It reduces overfitting and increases accuracy by leveraging the power of randomness in data selection and tree construction.

1. What is a Random Forest?

A Random Forest is a machine learning algorithm that belongs to the ensemble learning family, meaning it combines multiple models to improve predictive accuracy and reduce overfitting. Specifically, it is an extension of decision trees, where a large number of decision trees are trained on different subsets of the data, and their outputs are aggregated to produce the final prediction. Each tree in the Random Forest is built using a random selection of features and a random subset of training data, often sampled with replacement (a technique called bootstrapping). For classification tasks, the final output is determined by majority voting among the trees, while for regression tasks, it is the average of the individual tree predictions.

The main advantages of Random Forest include its ability to handle large datasets with high dimensionality, its robustness to noise and overfitting, and its capability to capture complex patterns in the data. It is widely used in various applications such as finance, healthcare, image recognition, and fraud detection due to its strong performance and ease of implementation.
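To make the bootstrapping and majority-voting ideas concrete, the short sketch below builds a handful of decision trees on bootstrap samples of the Iris data and combines their votes by hand. This is only an illustration of the principle, not how scikit-learn implements RandomForestClassifier internally, and the variable names are chosen just for this example.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Build a small "forest" of trees, each trained on a bootstrap sample
trees = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Classify one sample by majority vote across the individual trees
sample = X[:1]
votes = np.array([int(t.predict(sample)[0]) for t in trees])
print("Votes:", votes, "-> majority class:", np.bincount(votes).argmax())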

2. Loading and Preparing the Dataset

The Iris dataset is a well-known dataset in machine learning, commonly used for classification tasks. It contains 150 samples of iris flowers, categorized into three species: Setosa, Versicolor, and Virginica. Each sample has four features—sepal length, sepal width, petal length, and petal width—which help distinguish between the species.

To demonstrate how to train a Random Forest classifier using this dataset, we first need to load the data and preprocess it, ensuring it is formatted correctly for training. We then split the dataset into training and testing sets to evaluate the model’s performance. Next, we create a Random Forest classifier by specifying parameters such as the number of trees in the forest, the maximum depth of each tree, and the criteria for splitting nodes. The classifier is then trained on the training data using an ensemble of decision trees, each built from a random subset of the dataset and features. Once trained, the model is tested on the unseen test data to assess its accuracy and generalization ability. By aggregating predictions from multiple trees, the Random Forest classifier reduces variance and prevents overfitting, resulting in a robust and reliable model. This approach makes it an excellent choice for real-world classification problems, where data may be noisy or complex.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
  • from sklearn.datasets import load_iris
    • Imports the load_iris function from the sklearn.datasets module.
    • This function is used to load the famous Iris dataset, which is commonly used for classification tasks.
  • from sklearn.model_selection import train_test_split
    • Imports the train_test_split function from the sklearn.model_selection module.
    • This function is used to split the dataset into training and testing sets.
  • from sklearn.ensemble import RandomForestClassifier
    • Imports the RandomForestClassifier from the sklearn.ensemble module.
    • This is the machine learning model that will be trained to classify iris species based on their features.
  • from sklearn.metrics import accuracy_score
    • Imports the accuracy_score function from the sklearn.metrics module.
    • This function will be used to evaluate the model's performance by comparing predicted and actual values.
  • import numpy as np
    • Imports the NumPy library, a fundamental package for numerical computing in Python.
    • It provides support for large, multi-dimensional arrays and various mathematical functions.
  • iris = load_iris()
    • Loads the Iris dataset and stores it in the variable iris.
    • The dataset contains flower measurements and their corresponding species labels.
  • X, y = iris.data, iris.target
    • Extracts the feature data (X) and target labels (y) from the iris dataset.
    • X contains numerical measurements (sepal length, sepal width, petal length, and petal width).
    • y contains the class labels (0 for Setosa, 1 for Versicolor, and 2 for Virginica).
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Splits the dataset into training and testing sets using the train_test_split function.
    • X_train and y_train contain 80% of the data, used for training.
    • X_test and y_test contain 20% of the data, used for testing.
    • The test_size=0.2 argument specifies that 20% of the data should be reserved for testing.
    • The random_state=42 ensures that the split is reproducible by setting a fixed random seed.
When the previous code is executed, nothing is printed because we have not called any print function to show the output. This code imports the necessary libraries for this example, loads the Iris dataset, and splits the data for training and testing the Random Forest classifier. The next step is to define the Random Forest classifier model and train it using the training data.
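Before defining the model, an optional sanity check (not part of the original listing) is to print the shapes of the splits; since the Iris dataset contains 150 samples, an 80/20 split should give 120 training rows and 30 test rows.

# Optional check of the split sizes
print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)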

3. Training a Random Forest Classifier

Now, let's train a Random Forest classifier with Scikit-learn.

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
    
  • # Train a Random Forest Classifier
    • This is a comment indicating that the following lines of code will train a Random Forest classifier.
  • clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    • Creates an instance of the RandomForestClassifier from Scikit-Learn.
    • n_estimators=100: Specifies that the Random Forest will consist of 100 decision trees.
    • max_depth=3: Limits the depth of each decision tree to 3 levels to prevent overfitting.
    • random_state=42: Ensures reproducibility by setting a fixed random seed.
  • clf.fit(X_train, y_train)
    • Trains (fits) the Random Forest model using the training data.
    • The model learns patterns in X_train (features) to map them to y_train (labels).
  • # Make predictions
    • This is a comment indicating that the following lines of code will make predictions using the trained model.
  • y_pred = clf.predict(X_test)
    • Uses the trained model to predict the class labels for the test dataset X_test.
    • The predicted labels are stored in the variable y_pred.
  • # Evaluate accuracy
    • This is a comment indicating that the following lines of code will evaluate the model's accuracy.
  • accuracy = accuracy_score(y_test, y_pred)
    • Calculates the accuracy of the model by comparing predicted labels (y_pred) with actual labels (y_test).
    • The accuracy score represents the proportion of correct predictions made by the model.
  • print(f"Model Accuracy: {accuracy:.4f}")
    • Prints the accuracy of the model formatted to four decimal places.
    • The f-string is used for string formatting, making the output more readable.
After the code written so far is executed, the only output obtained is:
Model Accuracy: 1.0000
The result shows that the trained Random Forest classifier achieves perfect classification performance on the test dataset. The next step in this investigation is to determine feature importance, i.e. to determine which features contribute most to the label/output variable.
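Before moving on to feature importance, it can be worth verifying the perfect score per class. The following optional sketch (an addition, not part of the original post) prints the confusion matrix and a per-class report using functions from sklearn.metrics.

from sklearn.metrics import classification_report, confusion_matrix

# Per-class breakdown of the predictions on the test set
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))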

4. Feature Importance in Random Forest

Random Forests provide a built-in way to determine feature importance. This helps in understanding which features are most influential in classification.

import matplotlib.pyplot as plt

# Extract feature importances
importances = clf.feature_importances_
feature_names = iris.feature_names

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Random Forest")
plt.show()
    
  • import matplotlib.pyplot as plt
    • Imports the pyplot module from the Matplotlib library, which is used for data visualization.
    • This module provides functions to create various types of plots, such as bar charts, line graphs, and histograms.
  • # Extract feature importances
    • This is a comment indicating that the following lines of code will extract the feature importance values from the trained model.
  • importances = clf.feature_importances_
    • Retrieves the feature importance values from the trained Random Forest model.
    • Each value represents how much a particular feature contributes to the model's decision-making process.
  • feature_names = iris.feature_names
    • Extracts the names of the features from the Iris dataset.
    • The feature names include sepal length, sepal width, petal length, and petal width.
  • # Plot feature importance
    • This is a comment indicating that the following lines of code will generate a bar chart to visualize feature importance.
  • plt.figure(figsize=(8, 5))
    • Creates a new figure for the plot with a specified size of 8 inches by 5 inches.
    • This ensures that the plot is clear and well-sized for visualization.
  • plt.barh(feature_names, importances, color="skyblue")
    • Creates a horizontal bar chart where:
    • feature_names are placed on the y-axis.
    • importances (feature importance values) are represented on the x-axis.
    • The bars are colored skyblue for better visualization.
  • plt.xlabel("Feature Importance")
    • Labels the x-axis as "Feature Importance" to indicate what the values represent.
  • plt.ylabel("Feature")
    • Labels the y-axis as "Feature" to indicate that it represents the different features of the dataset.
  • plt.title("Feature Importance in Random Forest")
    • Sets the title of the plot to "Feature Importance in Random Forest" to describe the visualization.
  • plt.show()
    • Displays the plot, making the feature importance visualization visible.
After the previous code is executed, the plot shown in Figure 1 is obtained.
Figure 1 - Feature importance of the Iris dataset variables obtained from the trained Random Forest classifier model.
The feature importance results indicate the relative contribution of each feature to the model's predictions. Among the four features, "petal length (cm)" and "petal width (cm)" hold the most significant importance, with values of 0.4522 and 0.4317, respectively. These two features dominate the decision-making process, suggesting that they provide the most information about the target variable. In contrast, "sepal length (cm)" and "sepal width (cm)" have much lower importance scores, with values of 0.1062 and 0.0099. This implies that these features contribute far less to the model's predictive ability compared to the petal dimensions. Overall, petal length and petal width appear to be the key drivers in distinguishing between the classes in this model.
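The numbers quoted above come from the model's feature_importances_ array; they can be printed directly with a short snippet like the one below (an addition to the original listing), sorted from most to least important.

# Print each feature with its importance, highest first
for name, score in sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.4f}")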

5. Hyperparameter Tuning for Better Performance

To improve performance, we can tune hyperparameters using GridSearchCV. The grid search looks for the optimal combination of some of the Random Forest classifier's hyperparameters, such as n_estimators, max_depth, and min_samples_split. In this example, the n_estimators parameter will take the values 50, 100, and 200; max_depth will take the values 3, 5, and 10; and min_samples_split will take the values 2, 5, and 10. The entire code for performing the grid search cross-validation is shown below.

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Score: " , grid_search.best_score_)
    
  • Importing the GridSearchCV module: The code begins by importing GridSearchCV from the sklearn.model_selection module. This is a method used to search for the best combination of hyperparameters for a model.
  • Defining the parameter grid: The param_grid dictionary is created to define a range of values for each hyperparameter. In this case:
    • 'n_estimators': Number of trees in the forest, with possible values 50, 100, and 200.
    • 'max_depth': Maximum depth of each tree, with possible values 3, 5, and 10.
    • 'min_samples_split': Minimum number of samples required to split an internal node, with possible values 2, 5, and 10.
  • Performing grid search: GridSearchCV is initialized with the RandomForestClassifier, the param_grid, and other parameters:
    • cv=5: The number of cross-validation folds to use (5 in this case).
    • scoring='accuracy': The metric used to evaluate the model performance (accuracy in this case).
  • Fitting the model: The fit method is called on the grid search, using X_train and y_train as input. This will train the model using each combination of parameters defined in the param_grid.
  • Displaying best parameters: The best_params_ attribute of the grid_search object is printed to show the combination of hyperparameters that provided the best performance based on the grid search results.
  • Displaying best score: The best_score_ attribute of the grid_search object is printed to show the best accuracy achieved using RFC in GridSearchCV.
After the grid search cross-validation is executed, the print functions display the best parameter combination and the best mean cross-validation accuracy score.
Best Parameters: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50}
Best Score:  0.95
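As a follow-up (not part of the original listing), the refitted best estimator found by the grid search can also be evaluated on the held-out test set; GridSearchCV refits the best parameter combination on the full training set by default.

# Evaluate the best model found by the grid search on the test set
best_clf = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_clf.predict(X_test))
print(f"Test Accuracy of Best Model: {test_accuracy:.4f}")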
  

6. Key Takeaways

  • Random Forests improve classification by reducing overfitting compared to single Decision Trees.
  • They provide feature importance values, aiding in feature selection.
  • Hyperparameter tuning helps in optimizing model performance.

By leveraging Random Forests, you can build robust classification models with improved accuracy and generalization!

Saturday, March 1, 2025

Feature Importance in Decision Trees

Decision Trees are widely used in machine learning because they provide not only high accuracy but also interpretability. One of the most valuable aspects of Decision Trees is their ability to rank feature importance, which helps in understanding which features contribute the most to predictions.

1. What is Feature Importance?

Feature importance measures how much each feature contributes to reducing impurity in a Decision Tree model. Scikit-learn provides an easy way to extract these values using the feature_importances_ attribute.
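To make the notion of impurity concrete, the toy sketch below (an addition for illustration, not part of the original post) computes the Gini impurity of a set of class labels; this is the default splitting criterion used by DecisionTreeClassifier.

import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity([0, 0, 0, 0]))        # 0.0    -> pure node
print(gini_impurity([0, 0, 1, 1]))        # 0.5    -> maximally mixed (two classes)
print(gini_impurity([0, 0, 0, 1, 1, 2]))  # ~0.611 -> mixed node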

2. Loading and Preparing the Dataset

We will use the Iris dataset for this demonstration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
The previous code block consists of the following lines of code.
  • from sklearn.datasets import load_iris: This imports the load_iris function from the sklearn.datasets module, which loads the Iris dataset.
  • from sklearn.model_selection import train_test_split: This imports the train_test_split function from the sklearn.model_selection module, used to split data into training and testing sets.
  • from sklearn.tree import DecisionTreeClassifier: This imports the DecisionTreeClassifier from the sklearn.tree module, which is used to create a decision tree model for classification.
  • import numpy as np: This imports the numpy library with the alias np, which is used for numerical operations in Python.
  • import matplotlib.pyplot as plt: This imports the matplotlib.pyplot module with the alias plt, which is used for creating plots and visualizations.
  • # Load the dataset: A comment indicating that the next lines of code will load the dataset.
  • iris = load_iris(): This loads the Iris dataset into the iris variable. The dataset contains features (data) and target labels (target).
  • X, y = iris.data, iris.target: This splits the iris dataset into two variables: X (feature data) and y (target labels).
  • feature_names = iris.feature_names: This stores the feature names of the Iris dataset into the variable feature_names.
  • # Split into training and testing sets: A comment indicating that the data will be split into training and testing sets.
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42): This splits the dataset into training and testing sets using the train_test_split function. test_size=0.2 means 20% of the data is used for testing, and random_state=42 ensures the split is reproducible.
The code written so far imports the libraries, loads the dataset, and prepares it for training and testing the Decision Tree classifier. The next step is to define the classification model and train it using the X_train and y_train values.

3. Training a Decision Tree Model

Let's train a Decision Tree classifier and extract feature importance values.

# Train a Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Extract feature importances
importances = clf.feature_importances_

# Display feature importance values
for feature, importance in zip(feature_names, importances):
    print(f"{feature}: {importance:.4f}")
    
The previous code block consists of the following lines of code.
  • # Train a Decision Tree Classifier: A comment indicating that the next lines of code will train a decision tree classifier.
  • clf = DecisionTreeClassifier(max_depth=3, random_state=42): This creates an instance of the DecisionTreeClassifier with a maximum depth of 3 (to prevent overfitting) and a random_state=42 for reproducibility of results.
  • clf.fit(X_train, y_train): This trains the decision tree classifier on the training data X_train (features) and y_train (target labels).
  • # Extract feature importances: A comment indicating that the following line of code will extract the importance of each feature in the trained model.
  • importances = clf.feature_importances_: This retrieves the feature importance values from the trained decision tree model and stores them in the importances variable.
  • # Display feature importance values: A comment indicating that the next lines of code will display the feature importance values.
  • for feature, importance in zip(feature_names, importances):: This iterates through the feature_names (the names of the features) and importances (the importance values) simultaneously using the zip function.
  • print(f"{feature}: {importance:.4f}"): This prints each feature's name and its corresponding importance value, formatted to four decimal places.
After executing the code written so far, we obtain the feature importance of each feature in the dataset:
sepal length (cm): 0.0000
sepal width (cm): 0.0000
petal length (cm): 0.9346
petal width (cm): 0.0654
The feature importances indicate how much each feature contributes to the decision tree's ability to classify the Iris dataset. The "sepal length (cm)" and "sepal width (cm)" features have a feature importance of 0.0000, meaning that these features have no significant contribution to the classification task in this specific decision tree model. On the other hand, "petal length (cm)" has a high feature importance of 0.9346, suggesting that it plays a dominant role in the model's decision-making process. "Petal width (cm)" also contributes to some extent with a feature importance of 0.0654, though it is less significant compared to "petal length." These values demonstrate that the decision tree heavily relies on the petal-related features for classification.

4. Visualizing Feature Importance

We can plot the feature importance values for better understanding.

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Decision Tree")
plt.show()
    
The previous code block consists of the following lines of code.
  • # Plot feature importance: A comment indicating that the following lines of code will plot the feature importance.
  • plt.figure(figsize=(8, 5)): This creates a new figure for the plot with a specified size of 8 inches by 5 inches using matplotlib.pyplot.
  • plt.barh(feature_names, importances, color="skyblue"): This creates a horizontal bar plot (barh) where the feature_names are plotted on the y-axis and the importances are plotted on the x-axis. The bars are colored "skyblue".
  • plt.xlabel("Feature Importance"): This sets the label for the x-axis as "Feature Importance".
  • plt.ylabel("Feature"): This sets the label for the y-axis as "Feature".
  • plt.title("Feature Importance in Decision Tree"): This sets the title of the plot as "Feature Importance in Decision Tree".
  • plt.show(): This displays the plot on the screen.

5. Interpreting Feature Importance

Higher feature importance values indicate that the feature has a greater influence on the model’s decisions. Features with very low importance can often be removed to simplify the model without significant loss in performance.

6. Using Feature Importance for Feature Selection

If some features have very low importance, we can remove them and retrain the model:

# Select important features (threshold of 0.1)
important_features = [feature for feature, importance in zip(feature_names, importances) if importance > 0.1]
print("Selected Features:", important_features)
    
The previous block of code consists of the following lines of code.
  • # Select important features (threshold of 0.1): A comment indicating that the following code will select features that have an importance greater than 0.1.
  • important_features = [feature for feature, importance in zip(feature_names, importances) if importance > 0.1]: This is a list comprehension that iterates over the feature_names and importances simultaneously using zip. It selects only those features where the importance is greater than 0.1 and stores them in the important_features list.
  • print("Selected Features:", important_features): This prints the list of selected important features to the console.
After executing the previous code, the following output is obtained:
Selected Features: ['petal length (cm)']
The selected feature based on the importance threshold of 0.1 is "petal length (cm)." This means that, according to the decision tree model, "petal length" is the most important feature for classification, contributing significantly to the model's decision-making. Features like "sepal length (cm)" and "sepal width (cm)" were excluded because their feature importance was 0.0000, indicating that they do not provide valuable information for the classification task in this case. Thus, "petal length" is the key feature used by the model for making predictions.
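To complete the feature-selection idea, the sketch below (an addition, not part of the original post) retrains the Decision Tree using only the selected column and compares its test accuracy with the full-feature model; the variable names reuse those defined earlier in this post.

from sklearn.metrics import accuracy_score

# Look up the column indices of the selected feature names
selected_idx = [feature_names.index(f) for f in important_features]

# Retrain on the reduced feature set
clf_reduced = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_reduced.fit(X_train[:, selected_idx], y_train)

# Compare test accuracy of the full and reduced models
acc_full = accuracy_score(y_test, clf.predict(X_test))
acc_reduced = accuracy_score(y_test, clf_reduced.predict(X_test[:, selected_idx]))
print(f"Full-feature accuracy:    {acc_full:.4f}")
print(f"Reduced-feature accuracy: {acc_reduced:.4f}")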

7. Key Takeaways

  • Feature importance helps in understanding which features contribute the most to model predictions.
  • We can use feature importance to remove irrelevant features and improve model efficiency.
  • Visualizing feature importance can aid in better interpretability.

By leveraging feature importance, we can make our models more interpretable and efficient!