Monday, March 3, 2025

Random Forests for Classification

Random Forest is a powerful ensemble learning algorithm that improves classification performance by combining multiple decision trees. It reduces overfitting and increases accuracy by leveraging the power of randomness in data selection and tree construction.

1. What is a Random Forest?

A Random Forest is a machine learning algorithm that belongs to the ensemble learning family, meaning it combines multiple models to improve predictive accuracy and reduce overfitting. Specifically, it is an extension of decision trees, where a large number of decision trees are trained on different subsets of the data, and their outputs are aggregated to produce the final prediction. Each tree in the Random Forest is built using a random selection of features and a random subset of training data, often sampled with replacement (a technique called bootstrapping). For classification tasks, the final output is determined by majority voting among the trees, while for regression tasks, it is the average of the individual tree predictions. The main advantages of Random Forest include its ability to handle large datasets with high dimensionality, its robustness to noise and overfitting, and its capability to capture complex patterns in the data. It is widely used in various applications such as finance, healthcare, image recognition, and fraud detection due to its strong performance and ease of implementation.
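To make the bootstrapping and majority-voting idea concrete, the short sketch below builds a small "forest" by hand from individual decision trees using scikit-learn. This is only an illustrative, hedged example (the loop structure and variable names are not part of any library API); in practice you would simply use RandomForestClassifier, as shown later in this post.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# Each tree is trained on a bootstrap sample (drawn with replacement)
# and considers a random subset of features at each split (max_features="sqrt")
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority voting: every tree votes and the most frequent class wins
votes = np.array([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("Training accuracy of the hand-built ensemble:", (majority == y).mean())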

2. Loading and Preparing the Dataset

The Iris dataset is a well-known dataset in machine learning, commonly used for classification tasks. It contains 150 samples of iris flowers, categorized into three species: Setosa, Versicolor, and Virginica. Each sample has four features—sepal length, sepal width, petal length, and petal width—which help distinguish between the species. To demonstrate how to train a Random Forest classifier using this dataset, we first need to load the data and preprocess it, ensuring it is formatted correctly for training. We then split the dataset into training and testing sets to evaluate the model’s performance. Next, we create a Random Forest classifier by specifying parameters such as the number of trees in the forest, the maximum depth of each tree, and the criteria for splitting nodes. The classifier is then trained on the training data using an ensemble of decision trees, each built from a random subset of the dataset and features. Once trained, the model is tested on the unseen test data to assess its accuracy and generalization ability. By aggregating predictions from multiple trees, the Random Forest classifier reduces variance and prevents overfitting, resulting in a robust and reliable model. This approach makes it an excellent choice for real-world classification problems, where data may be noisy or complex.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
  • from sklearn.datasets import load_iris
    • Imports the load_iris function from the sklearn.datasets module.
    • This function is used to load the famous Iris dataset, which is commonly used for classification tasks.
  • from sklearn.model_selection import train_test_split
    • Imports the train_test_split function from the sklearn.model_selection module.
    • This function is used to split the dataset into training and testing sets.
  • from sklearn.ensemble import RandomForestClassifier
    • Imports the RandomForestClassifier from the sklearn.ensemble module.
    • This is the machine learning model that will be trained to classify iris species based on their features.
  • from sklearn.metrics import accuracy_score
    • Imports the accuracy_score function from the sklearn.metrics module.
    • This function will be used to evaluate the model's performance by comparing predicted and actual values.
  • import numpy as np
    • Imports the NumPy library, a fundamental package for numerical computing in Python.
    • It provides support for large, multi-dimensional arrays and various mathematical functions.
  • iris = load_iris()
    • Loads the Iris dataset and stores it in the variable iris.
    • The dataset contains flower measurements and their corresponding species labels.
  • X, y = iris.data, iris.target
    • Extracts the feature data (X) and target labels (y) from the iris dataset.
    • X contains numerical measurements (sepal length, sepal width, petal length, and petal width).
    • y contains the class labels (0 for Setosa, 1 for Versicolor, and 2 for Virginica).
  • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    • Splits the dataset into training and testing sets using the train_test_split function.
    • X_train and y_train contain 80% of the data, used for training.
    • X_test and y_test contain 20% of the data, used for testing.
    • The test_size=0.2 argument specifies that 20% of the data should be reserved for testing.
    • The random_state=42 ensures that the split is reproducible by setting a fixed random seed.
When the previous code is executed, nothing is printed, because we have not called any print function yet. With this code we have imported the necessary libraries, loaded the Iris dataset, and split the data into training and testing sets for the Random Forest classifier. If you want to confirm that the split worked as expected, you can print the shapes of the resulting arrays, as in the short optional sketch below; the next step after that is to define the Random Forest classifier model and train it on the training data.
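This optional check is not part of the original example; it simply assumes the variables created by the code above:

# Optional sanity check: confirm the 80/20 split produced by train_test_split
print("X_train:", X_train.shape, "X_test:", X_test.shape)   # expected: (120, 4) and (30, 4)
print("y_train:", y_train.shape, "y_test:", y_test.shape)   # expected: (120,) and (30,)
print("Classes in the training set:", np.unique(y_train))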

3. Training a Random Forest Classifier

Now, let's train a Random Forest classifier with Scikit-learn.

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
    
  • # Train a Random Forest Classifier
    • This is a comment indicating that the following lines of code will train a Random Forest classifier.
  • clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
    • Creates an instance of the RandomForestClassifier from Scikit-Learn.
    • n_estimators=100: Specifies that the Random Forest will consist of 100 decision trees.
    • max_depth=3: Limits the depth of each decision tree to 3 levels to prevent overfitting.
    • random_state=42: Ensures reproducibility by setting a fixed random seed.
  • clf.fit(X_train, y_train)
    • Trains (fits) the Random Forest model using the training data.
    • The model learns patterns in X_train (features) to map them to y_train (labels).
  • # Make predictions
    • This is a comment indicating that the following lines of code will make predictions using the trained model.
  • y_pred = clf.predict(X_test)
    • Uses the trained model to predict the class labels for the test dataset X_test.
    • The predicted labels are stored in the variable y_pred.
  • # Evaluate accuracy
    • This is a comment indicating that the following lines of code will evaluate the model's accuracy.
  • accuracy = accuracy_score(y_test, y_pred)
    • Calculates the accuracy of the model by comparing predicted labels (y_pred) with actual labels (y_test).
    • The accuracy score represents the proportion of correct predictions made by the model.
  • print(f"Model Accuracy: {accuracy:.4f}")
    • Prints the accuracy of the model formatted to four decimal places.
    • The f-string is used for string formatting, making the output more readable.
After the code written so far is executed, the only output obtained is:
Model Accuracy: 1.0000
The result shows that the trained Random Forest classifier (RFC) has perfect classification performance on the test dataset. The next step in this investigation is to determine feature importance, i.e. to find out which features contribute most to the label/output variable.
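Before moving on to feature importance, it is worth remembering that a perfect score on only 30 test samples is quite plausible for the Iris dataset, but a single number does not tell the whole story. As an optional, hedged addition that is not part of the original listing, a confusion matrix and classification report give a per-class view of the same predictions:

from sklearn.metrics import confusion_matrix, classification_report

# Optional extra evaluation (assumes clf, y_pred, y_test, and iris from the code above)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))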

4. Feature Importance in Random Forest

Random Forests provide a built-in way to determine feature importance. This helps in understanding which features are most influential in classification.

import matplotlib.pyplot as plt

# Extract feature importances
importances = clf.feature_importances_
feature_names = iris.feature_names

# Plot feature importance
plt.figure(figsize=(8, 5))
plt.barh(feature_names, importances, color="skyblue")
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Random Forest")
plt.show()
    
  • import matplotlib.pyplot as plt
    • Imports the pyplot module from the Matplotlib library, which is used for data visualization.
    • This module provides functions to create various types of plots, such as bar charts, line graphs, and histograms.
  • # Extract feature importances
    • This is a comment indicating that the following lines of code will extract the feature importance values from the trained model.
  • importances = clf.feature_importances_
    • Retrieves the feature importance values from the trained Random Forest model.
    • Each value represents how much a particular feature contributes to the model's decision-making process.
  • feature_names = iris.feature_names
    • Extracts the names of the features from the Iris dataset.
    • The feature names include sepal length, sepal width, petal length, and petal width.
  • # Plot feature importance
    • This is a comment indicating that the following lines of code will generate a bar chart to visualize feature importance.
  • plt.figure(figsize=(8, 5))
    • Creates a new figure for the plot with a specified size of 8 inches by 5 inches.
    • This ensures that the plot is clear and well-sized for visualization.
  • plt.barh(feature_names, importances, color="skyblue")
    • Creates a horizontal bar chart where:
    • feature_names are placed on the y-axis.
    • importances (feature importance values) are represented on the x-axis.
    • The bars are colored skyblue for better visualization.
  • plt.xlabel("Feature Importance")
    • Labels the x-axis as "Feature Importance" to indicate what the values represent.
  • plt.ylabel("Feature")
    • Labels the y-axis as "Feature" to indicate that it represents the different features of the dataset.
  • plt.title("Feature Importance in Random Forest")
    • Sets the title of the plot to "Feature Importance in Random Forest" to describe the visualization.
  • plt.show()
    • Displays the plot, making the feature importance visualization visible.
After the previous code is executed, the plot shown in Figure 1 is obtained.
Figure 1 - Feature importance of the Iris dataset variables obtained using the trained RFC model.
The feature importance results indicate the relative contribution of each feature to the model's predictions. Among the four features, "petal length (cm)" and "petal width (cm)" hold the most significant importance, with values of 0.4522 and 0.4317, respectively. These two features dominate the decision-making process, suggesting that they provide the most information about the target variable. In contrast, "sepal length (cm)" and "sepal width (cm)" have much lower importance scores, with values of 0.1062 and 0.0099. This implies that these features contribute far less to the model's predictive ability compared to the petal dimensions. Overall, petal length and petal width appear to be the key drivers in distinguishing between the classes in this model.
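The exact values quoted above come from the model's feature_importances_ attribute; a short optional snippet such as the one below (not part of the original listing) prints them next to the feature names, sorted from most to least important:

# Print the numerical importances behind the bar chart
# (assumes clf and iris from the code above)
for name, score in sorted(zip(iris.feature_names, clf.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.4f}")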

5. Hyperparameter Tuning for Better Performance

To improve performance, we can tune hyperparameters using GridSearchCV. The grid search will look for the optimal combination of several RFC hyperparameters: n_estimators, max_depth, and min_samples_split. Specifically, n_estimators will be set to 50, 100, and 200; max_depth will be set to 3, 5, and 10; and min_samples_split will be set to 2, 5, and 10. The entire code for performing the grid search with cross-validation is shown below.

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)
print("Best Score: " , grid_search.best_score_)
    
  • Importing the GridSearchCV module: The code begins by importing GridSearchCV from the sklearn.model_selection module. This is a method used to search for the best combination of hyperparameters for a model.
  • Defining the parameter grid: The param_grid dictionary is created to define a range of values for each hyperparameter. In this case:
    • 'n_estimators': Number of trees in the forest, with possible values 50, 100, and 200.
    • 'max_depth': Maximum depth of each tree, with possible values 3, 5, and 10.
    • 'min_samples_split': Minimum number of samples required to split an internal node, with possible values 2, 5, and 10.
  • Performing grid search: GridSearchCV is initialized with the RandomForestClassifier, the param_grid, and other parameters:
    • cv=5: The number of cross-validation folds to use (5 in this case).
    • scoring='accuracy': The metric used to evaluate the model performance (accuracy in this case).
  • Fitting the model: The fit method is called on the grid search, using X_train and y_train as input. This will train the model using each combination of parameters defined in the param_grid.
  • Displaying best parameters: The best_params_ attribute of the grid_search object is printed to show the combination of hyperparameters that provided the best performance based on the grid search results.
  • Displaying best score: The best_score_ attribute of the grid_search object is printed to show the best mean cross-validation accuracy achieved by the RFC during the grid search.
After the grid search is executed, the print functions display the best parameters and the highest classification accuracy score obtained during cross-validation.
  Best Parameters: {'max_depth': 3, 'min_samples_split': 2, 'n_estimators': 50}
Best Score:  0.95
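Because GridSearchCV refits the model with the best parameters on the full training set by default, a natural follow-up, sketched below as an optional addition that is not part of the original post, is to evaluate that refitted model on the held-out test set:

# Evaluate the refitted best model on the held-out test set
# (assumes grid_search, X_test, y_test, and accuracy_score from the code above)
best_clf = grid_search.best_estimator_
test_accuracy = accuracy_score(y_test, best_clf.predict(X_test))
print(f"Test accuracy of the tuned model: {test_accuracy:.4f}")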
  

6. Key Takeaways

  • Random Forests improve classification by reducing overfitting compared to single Decision Trees.
  • They provide feature importance values, aiding in feature selection.
  • Hyperparameter tuning helps in optimizing model performance.

By leveraging Random Forests, you can build robust classification models with improved accuracy and generalization!
