PYTHONHOLICS

Wednesday, July 29, 2026

Feature Importance in Random Forests: How to Interpret Your Model

Pythonholics Machine Learning Tutorial

A detailed practical guide to impurity-based feature importance, permutation importance, correlated predictors, importance stability, feature selection, and responsible interpretation using Python and scikit-learn.

Two core methods: Mean decrease in impurity and permutation importance are explained and compared.

Complete Python workflow: The script generates all figures, result tables, metrics, and CSV files.

Random Forests are among the most useful machine-learning algorithms for structured and tabular data. They can model nonlinear relationships, capture complex interactions, handle classification and regression problems, and often provide strong performance without requiring extensive data transformation. Nevertheless, a high-performing model is not automatically an understandable model.

After a Random Forest has been trained, an important practical question remains: which input features influence the model most strongly? Feature importance attempts to answer this question by assigning a numerical score to every predictor. These scores can help us understand the fitted model, identify potentially useful or redundant variables, communicate results, and design additional experiments.

Feature importance must still be interpreted cautiously. A high importance value does not prove that a feature causes the predicted outcome. It only indicates that the fitted model used the information associated with that feature in a particular way. Rankings can also be affected by correlated predictors, data leakage, high-cardinality variables, the selected scoring metric, sampling variation, and model configuration.

Main lesson: Random Forest feature importance is a model-inspection tool, not a causal analysis. The most reliable interpretation combines several importance methods with predictive evaluation, correlation analysis, stability checks, and domain knowledge.

1. What Does Feature Importance Mean?

A machine-learning dataset contains input variables, commonly called features, and an output that the model attempts to predict. For example, a medical classification dataset may contain measurements of a tumor, while the target indicates whether the tumor is benign or malignant. A Random Forest may use some measurements repeatedly and depend only weakly on others.

Feature importance provides a model-based estimate of this dependence. In broad terms, an important feature is one that helps the model reduce uncertainty, separate classes, lower prediction error, or maintain predictive performance. However, different feature-importance methods operationalize this idea differently.

Table 1. Practical questions feature importance can help answer
Question	How feature importance helps	Important limitation
Which variables does the model use most?	Ranks predictors according to a defined importance measure.	The ranking explains the fitted model, not necessarily the real-world mechanism.
Can the feature set be reduced?	Identifies candidates for feature-selection experiments.	Low individual importance does not prove that a feature is useless in interactions.
Is the model relying on suspicious information?	Can expose identifiers, proxies, leakage variables, or unexpected predictors.	A leakage audit still requires understanding how every variable was created.
How can the model be explained to stakeholders?	Provides a compact global summary of model reliance.	Global importance does not explain every individual prediction.
Which measurements deserve further study?	Suggests variables or feature groups for additional analysis.	Predictive association must not be presented as causation.

Global and local interpretation

Standard Random Forest feature importance is a global interpretation method. It summarizes model behavior across many observations. It does not directly explain why one specific observation received one specific prediction. Local explanations require methods such as local permutation, SHAP values, local surrogate models, or careful examination of decision paths.

Importance is method-dependent

There is no single universal definition of feature importance. A feature can be important because it frequently creates strong decision-tree splits, because shuffling it damages test performance, because removing it reduces cross-validation performance, or because it contributes strongly to individual predictions. Consequently, two valid methods can produce different rankings.

2. Why Can a Random Forest Rank Features?

A Random Forest combines many decision trees. For classification, each tree predicts a class and the forest aggregates those predictions, usually through majority voting or averaged class probabilities. For regression, the outputs of individual trees are averaged.
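The averaging step can be checked directly. The following is a minimal sketch, assuming scikit-learn's `RandomForestClassifier`, whose fitted trees are exposed through `estimators_`. The small forest size and the variable names are illustrative choices and are not taken from the tutorial script.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Plain NumPy arrays avoid feature-name warnings when individual trees are called.
X, y = load_breast_cancer(return_X_y=True)

# A small forest is enough for this check (illustrative size).
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class-probability estimates of every tree by hand.
tree_probabilities = np.stack(
    [tree.predict_proba(X) for tree in forest.estimators_]
)
manual_average = tree_probabilities.mean(axis=0)

# Compare the manual average with the forest's own probability estimates.
forest_probabilities = forest.predict_proba(X)
matches = np.allclose(manual_average, forest_probabilities)
print("Manual average equals forest.predict_proba:", matches)
```

For scikit-learn forests, the predicted class is the one with the highest averaged probability, which is why the comparison is made on `predict_proba` rather than on hard votes.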

Randomness is introduced in two important places. First, each tree is trained on a bootstrap sample drawn from the training data. Second, only a random subset of features is considered at each candidate split. This creates diversity among trees and reduces the tendency of a single tree to overfit the training observations.

Every internal node of a classification tree selects a feature and threshold that reduce impurity. Since the forest records which features were used and how much impurity they reduced, the reductions can be aggregated into an impurity-based importance score. Alternatively, the fitted forest can be treated as a black box and evaluated by measuring how much its predictions deteriorate when individual features are randomly permuted.

**Figure 1.** A practical workflow for interpreting a Random Forest. Predictive performance is evaluated before impurity-based importance, permutation importance, stability, and correlation are examined.

Figure 1 emphasizes that feature importance should not be calculated in isolation. The model must first be trained using a valid experimental design and evaluated on observations that were not used to fit it. Only then should importance rankings be interpreted and validated.

3. Impurity-Based Feature Importance

The built-in importance returned by RandomForestClassifier.feature_importances_ is commonly described as impurity-based feature importance, mean decrease in impurity, or MDI. It measures the accumulated reduction in node impurity attributed to each feature across all trees.

3.1 Impurity reduction at one node

Consider a node containing a subset of training observations. A candidate split divides those observations into a left child and a right child. For a classification tree using Gini impurity, the impurity of node t can be written as:

Gini(t) = 1 − Σ_{k=1}^{K} p(k|t)²

Here, K is the number of classes and p(k|t) is the proportion of observations belonging to class k in node t. A pure node has a Gini value of zero.

The weighted impurity decrease produced by a split can be represented conceptually as:

ΔI(t) = w(t)I(t) − w(t_L)I(t_L) − w(t_R)I(t_R)

The symbols t, t_L, and t_R denote the parent, left-child, and right-child nodes. The function I denotes impurity, and w represents the proportion or weighted proportion of observations reaching a node.
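The two formulas can be illustrated with a few lines of NumPy. This is a conceptual sketch, not the routine used inside scikit-learn: the function names `gini_impurity` and `weighted_impurity_decrease` are hypothetical, and the weight w is assumed to be the node size divided by the total number of training observations.

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p(k|t)^2 for the labels that reach a node."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))


def weighted_impurity_decrease(parent, left, right, n_total):
    """w(t)I(t) - w(t_L)I(t_L) - w(t_R)I(t_R), with w = node size / n_total."""
    w_parent = len(parent) / n_total
    w_left = len(left) / n_total
    w_right = len(right) / n_total
    return (
        w_parent * gini_impurity(parent)
        - w_left * gini_impurity(left)
        - w_right * gini_impurity(right)
    )


# Toy root node with eight observations split into two children of four.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print("Parent Gini:", gini_impurity(parent))
print("Weighted impurity decrease:", decrease)
```

In this toy split the parent impurity is 0.5 and each child has impurity 0.375, so the weighted decrease is 0.5 − 0.5·0.375 − 0.5·0.375 = 0.125.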

3.2 Aggregation across the forest

Whenever feature j is used for a split, its weighted impurity reduction is added to its total. The totals are averaged across the trees and normalized so that the final importance values sum to one:

MDI(j) = normalized sum of weighted impurity decreases produced by feature j

A feature obtains a high MDI value when it is used frequently, produces substantial impurity reductions, affects many training observations, or combines these properties.
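The aggregation can be verified on a fitted forest. The sketch below assumes scikit-learn, where each tree reports an importance vector that is already normalized and the forest averages those vectors before normalizing once more. It also assumes that every tree contains at least one split, because scikit-learn ignores single-node trees. The names `per_tree` and `manual_mdi` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance vector per tree; each row is normalized by scikit-learn.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then normalize so that the values sum to one.
mean_importance = per_tree.mean(axis=0)
manual_mdi = mean_importance / mean_importance.sum()

print("Sum of MDI values:", manual_mdi.sum())
print(
    "Matches forest.feature_importances_:",
    np.allclose(manual_mdi, forest.feature_importances_),
)
```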

3.3 Advantages of MDI

It is available immediately after the forest has been fitted.
It is inexpensive because no additional model evaluation is required.
It provides a simple ranking whose values sum to one.
It can be calculated separately for every tree to study stability.
It is useful as a fast initial overview of the forest structure.

3.4 Limitations and bias

MDI is not a neutral measure. It is computed from the same training process that built the forest, so it describes internal split behavior rather than an independently measured loss of predictive performance. It can favor continuous variables and variables with many possible split points. Correlated predictors can share importance unevenly, and a variable can look important even when its contribution is redundant.

Do not interpret the largest MDI value as proof of causality. The score only indicates that the trained forest assigned substantial split-based importance to the feature.

The clearest first visualization is a horizontal bar chart in which features are sorted by MDI. Error bars can display the standard deviation of importance across individual trees. This adds information that the single forest-level vector does not provide.

**Figure 2.** Top ten features according to mean decrease in impurity. Longer bars indicate a greater aggregated contribution to weighted impurity reduction. Error bars summarize variation across individual trees.

Figure 2 should be interpreted as a description of the forest's internal structure. A long bar indicates that the feature contributed strongly to splits across the ensemble. The error bar indicates whether that contribution was consistent among trees or concentrated in only part of the forest.

4. Permutation Feature Importance

Permutation importance measures the extent to which predictive performance depends on a feature. It can be applied to Random Forests and many other fitted estimators, which makes it a model-agnostic inspection method.

4.1 Core idea

First, the fitted model is evaluated on a validation or test dataset to obtain a baseline score. Next, the values of one feature are randomly shuffled across rows. This breaks the association between that feature and the target, and also between that feature and the other predictors, while leaving the feature's marginal distribution unchanged. The model is evaluated again, and the decrease in performance becomes the importance estimate.

PI(j) = baseline score − score after permuting feature j

The procedure is repeated several times because one random permutation can produce an unusually large or small change. The mean score decrease and standard deviation are then reported.

4.2 Step-by-step algorithm

Fit the Random Forest using the training data.
Evaluate the fitted forest on an independent validation or test set.
Select one feature and randomly permute its values across observations.
Evaluate the model using the modified dataset.
Record the decrease in the selected performance score.
Repeat the permutation several times and calculate the mean and standard deviation.
Restore the feature and repeat the process for every remaining predictor.
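The steps above can be sketched as a short loop. This is a simplified, single-threaded illustration and not scikit-learn's own implementation; the function name `manual_permutation_importance`, the small forest, and the number of repeats are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Return the mean and std of the score drop for every column of X."""
    rng = np.random.default_rng(seed)
    baseline = balanced_accuracy_score(y, model.predict(X))
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            # Work on a copy so the original feature is restored automatically.
            X_permuted = X.copy()
            X_permuted[:, j] = rng.permutation(X_permuted[:, j])
            score = balanced_accuracy_score(y, model.predict(X_permuted))
            drops[j, r] = baseline - score
    return drops.mean(axis=1), drops.std(axis=1)


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)

# Importance is measured on the held-out rows, never on the training rows.
mean_drop, std_drop = manual_permutation_importance(forest, X_test, y_test)
print("Largest mean decrease in balanced accuracy:", mean_drop.max())
```

In practice, `sklearn.inspection.permutation_importance` performs the same procedure with parallel execution and support for arbitrary scorers.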

4.3 Choosing the scoring metric

Permutation importance depends on the chosen score. Accuracy may be suitable for a balanced classification problem, but balanced accuracy, F1-score, ROC-AUC, average precision, or a cost-sensitive score may be more meaningful in other settings. Regression problems may use R², negative mean squared error, or negative mean absolute error.

This tutorial uses balanced accuracy. Consequently, the permutation importance of a feature is the mean decrease in balanced accuracy after that feature is shuffled.

4.4 Interpreting positive, zero, and negative values

Table 2. Interpreting permutation-importance values
Observed value	General interpretation	Recommended response
Large positive value	Shuffling the feature substantially damages model performance.	Treat the feature as important, then verify stability and correlation.
Small positive value	The feature has a limited measurable contribution under the selected score.	Compare uncertainty and test feature removal experimentally.
Value near zero	The model can largely maintain performance when the feature is shuffled.	Check whether correlated features preserve the same information.
Negative value	The model performed slightly better after shuffling the feature.	Consider sampling variation, noise, instability, or harmful reliance.

The mean permutation importance should be shown with an uncertainty estimate. A feature whose mean is positive but whose error bar crosses zero may not have a reliably positive contribution under the current test sample and scoring metric.
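One simple screening rule is to flag features whose mean importance stays above zero after subtracting two standard deviations. The following sketch uses hypothetical values for illustration only; they are not results from this article, and the two-standard-deviation margin is a rule of thumb rather than a formal test.

```python
import pandas as pd

# Hypothetical permutation results (illustrative numbers, not article output).
table = pd.DataFrame(
    {
        "feature": ["feature_a", "feature_b", "feature_c", "feature_d"],
        "permutation_mean": [0.040, 0.006, 0.000, -0.003],
        "permutation_std": [0.010, 0.005, 0.002, 0.004],
    }
)

# A feature is flagged only if its lower bound remains strictly positive.
table["lower_bound"] = table["permutation_mean"] - 2 * table["permutation_std"]
table["reliably_positive"] = table["lower_bound"] > 0

print(table)
```

Here only `feature_a` is flagged. `feature_b` has a positive mean, but its lower bound falls below zero, so its contribution is not clearly distinguishable from noise in this hypothetical sample.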

**Figure 3.** Permutation importance calculated on the held-out test subset. Each bar represents the mean decrease in balanced accuracy, while each error bar represents the standard deviation across repeated permutations.

Figure 3 is more directly connected to predictive performance than Figure 2. Nevertheless, it is not automatically free from interpretation problems. In particular, correlated predictors can cause permutation importance to underestimate the importance of information that is duplicated across multiple features.

5. MDI vs Permutation Importance

MDI and permutation importance answer related but different questions. MDI asks how strongly the feature contributed to impurity reduction inside the trained forest. Permutation importance asks how much test performance deteriorates when the feature's information is disrupted.

Table 3. Comparison of impurity-based and permutation importance
Property	Impurity-based importance	Permutation importance
Main quantity	Aggregated weighted impurity reduction	Decrease in a selected predictive score
Data normally used	Training process and fitted tree structure	Preferably validation or test data
Computational cost	Low	Higher because predictions are repeated
Model dependence	Specific to tree-based models	Applicable to many fitted models
High-cardinality bias	Can be substantial	Less pronounced, especially on held-out data, but not immune to dataset problems
Correlated predictors	Importance may be unevenly shared or concentrated	Importance may be masked by substitute predictors
Best use	Fast structural overview	Performance-based validation of model reliance

5.1 Why the numerical scales cannot be compared directly

MDI values are normalized to sum to one. Permutation values represent changes in a performance metric and do not generally sum to one. Therefore, a value of 0.10 in MDI is not numerically equivalent to a 0.10 decrease in balanced accuracy.

The Python script creates a normalized version of positive permutation importance only for visual comparison. This normalization makes broad patterns easier to see, but it does not transform the two measures into the same mathematical quantity.

**Figure 4.** Visual comparison of MDI and positive permutation importance after separate normalization. The normalization is used only to compare broad ranking patterns; it does not make the methods equivalent.

5.2 Understanding disagreement

A feature can have high MDI and low permutation importance when it is used frequently by the trees but is redundant with other predictors. When shuffled, the remaining variables preserve enough information for the model to maintain performance.

A feature can also have moderate MDI and comparatively high permutation importance. This suggests that it may not dominate the split structure, but disrupting it still damages predictive performance. Such disagreement is not necessarily an error. It is a signal that the data structure should be examined more carefully.

6. Practical Experiment with the Breast Cancer Dataset

The complete Python example uses the Breast Cancer Wisconsin diagnostic dataset distributed with scikit-learn. The problem is binary classification. Each observation contains measurements derived from a digitized image of a fine-needle aspirate of a breast mass. The target classes are malignant and benign.

Table 4. Dataset summary
Property	Value
Observations	569
Input features	30 numerical variables
Target classes	2
Malignant observations	212
Benign observations	357
Missing values	0
Train/test split	70% / 30%, stratified

6.1 Random Forest configuration

Table 5. Random Forest configuration used in the example
Hyperparameter	Value	Purpose
`n_estimators`	500	Builds a stable ensemble of 500 trees.
`criterion`	gini	Uses Gini impurity for classification splits.
`max_depth`	None	Allows trees to expand until other stopping conditions apply.
`min_samples_split`	2	Minimum observations needed to split an internal node.
`min_samples_leaf`	1	Minimum observations required in a leaf.
`max_features`	sqrt	Considers a random square-root-sized feature subset at each split.
`bootstrap`	True	Trains each tree using a bootstrap sample.
`oob_score`	True	Calculates an out-of-bag performance estimate.
`random_state`	42	Makes the experiment reproducible.
`n_jobs`	-1	Uses available processor cores where supported.

6.2 Evaluate the model before interpreting it

Feature-importance rankings are meaningful only when the underlying model has learned a useful predictive relationship. A poor model can still produce rankings, but those rankings explain a model whose predictions are unreliable.

Table 6. Representative performance from the reproducible example. Precision, recall, and F1-score use scikit-learn's default positive class, label 1 (benign).
Metric	Value
Accuracy	0.9474
Balanced accuracy	0.9391
Precision	0.9455
Recall	0.9720
F1-score	0.9585
Matthews correlation coefficient	0.8872
ROC-AUC	0.9917
Out-of-bag score	0.9673

The confusion matrix provides more detail than accuracy alone. It distinguishes correctly identified malignant cases, malignant cases incorrectly classified as benign, benign cases incorrectly classified as malignant, and correctly identified benign cases.

**Figure 5.** Confusion matrix for the held-out test set. In the representative run, 58 malignant and 104 benign observations were classified correctly, while six malignant and three benign observations were misclassified.

In a medical context, the two error types may have different consequences. For this reason, model evaluation should not rely on one aggregate score. The example reports several complementary metrics and retains the confusion matrix as part of the interpretation workflow.
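The aggregate metrics can be recomputed from the four counts reported for Figure 5, which also makes the per-class error rates explicit. This is a small arithmetic sketch; the matrix layout (rows are actual classes, columns are predicted classes, malignant first) follows the label order used by scikit-learn for this dataset, and the variable names are illustrative.

```python
import numpy as np

# Counts from the representative run: rows = actual, columns = predicted.
# Order: malignant (label 0), benign (label 1).
cm = np.array(
    [
        [58, 6],    # malignant: 58 correct, 6 predicted as benign
        [3, 104],   # benign: 3 predicted as malignant, 104 correct
    ]
)

accuracy = np.trace(cm) / cm.sum()

# Recall of each class: correct predictions divided by actual class size.
recall_malignant = cm[0, 0] / cm[0].sum()
recall_benign = cm[1, 1] / cm[1].sum()
balanced_accuracy = (recall_malignant + recall_benign) / 2

# Precision with benign treated as the positive class.
precision_benign = cm[1, 1] / cm[:, 1].sum()
f1_benign = 2 * precision_benign * recall_benign / (precision_benign + recall_benign)

print(f"Accuracy:          {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
print(f"Recall (malignant): {recall_malignant:.4f}")
print(f"Recall (benign):    {recall_benign:.4f}")
print(f"Precision (benign): {precision_benign:.4f}")
print(f"F1 (benign):        {f1_benign:.4f}")
```

The malignant recall, 58 of 64 cases, is lower than the benign recall of 104 of 107. This is the kind of asymmetry that a single accuracy value hides.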

7. How Correlated Features Affect Importance

Correlation is one of the main reasons why feature-importance rankings can be misleading. Suppose that two predictors contain nearly the same information. A tree may use the first feature in one branch and the second in another branch. Across the forest, the importance may be split between them or concentrated unevenly.

7.1 Effect on MDI

MDI may assign most of the importance to one feature from a correlated group because that feature happened to produce slightly better splits. Another nearly equivalent feature can then appear unimportant even though it contains similar predictive information.

7.2 Effect on permutation importance

Permuting one correlated feature may produce only a small performance decrease because another feature still provides similar information. As a result, the importance of both features can be underestimated when each is permuted separately.

A correlation matrix should therefore be examined alongside the importance rankings. The Breast Cancer dataset contains several radius, perimeter, area, concavity, and concave-point measurements that are strongly related.

**Figure 6.** Correlation among the ten highest-ranked MDI features. Strongly correlated predictors can share, redistribute, or mask importance and should often be interpreted as related feature groups.

7.3 Better approaches for correlated predictors

Grouped permutation: shuffle a related group of predictors together.
Feature clustering: cluster correlated variables and retain representative features.
Conditional permutation: permute a feature conditionally on related variables.
Domain grouping: interpret related measurements as one conceptual factor.
Ablation experiments: remove groups of predictors and retrain the entire pipeline.
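Grouped permutation, the first option in the list, can be sketched as follows. The helper `grouped_permutation_importance` is a hypothetical function written for this illustration, and the group of three size-related columns is an assumed example of related predictors; an actual grouping should come from the correlation analysis and domain knowledge.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def grouped_permutation_importance(model, X, y, columns, n_repeats=10, seed=0):
    """Mean and std of the score drop when `columns` are shuffled together."""
    rng = np.random.default_rng(seed)
    baseline = balanced_accuracy_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_permuted = X.copy()
        # One shared row order keeps relationships inside the group intact
        # while breaking the link between the group and the target.
        row_order = rng.permutation(len(X_permuted))
        X_permuted[columns] = X_permuted[columns].to_numpy()[row_order]
        score = balanced_accuracy_score(y, model.predict(X_permuted))
        drops.append(baseline - score)
    return float(np.mean(drops)), float(np.std(drops))


X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Assumed example group: three strongly related size measurements.
size_group = ["mean radius", "mean perimeter", "mean area"]
group_mean, group_std = grouped_permutation_importance(
    forest, X_test, y_test, size_group
)
print(f"Grouped importance: {group_mean:.4f} +/- {group_std:.4f}")
```

Comparing the grouped value with the individual permutation importances of the same columns indicates how much information the group carries jointly that no single member reveals on its own.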

8. Stability and Uncertainty of Feature Importance

A single importance vector can create a false impression of certainty. Random Forests are stochastic models, and individual trees use different bootstrap samples and feature subsets. Consequently, importance can vary from one tree to another and from one fitted forest to another.

8.1 Tree-to-tree variability

The importance values from all 500 individual trees can be collected and displayed as boxplots. A narrow distribution suggests that many trees assign a similar level of importance to the feature. A wide distribution indicates that the feature is highly important in some trees but less important in others.

**Figure 7.** Distribution of feature importance across individual trees. Wide distributions indicate that a feature's role varies substantially across bootstrap samples and random feature subsets.

8.2 Stability across repeated model fits

Tree-to-tree variability is informative, but it does not replace repeated model fitting. A stronger analysis repeats the complete training procedure using different cross-validation folds or random seeds. The resulting feature ranks can then be summarized using means, standard deviations, confidence intervals, or rank correlations.

Recommended publication practice: report importance variability across repeated cross-validation runs rather than presenting one importance ranking from one fitted model as definitive.
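A minimal version of such a repeated analysis is sketched below. It refits a forest on several random stratified training subsets and summarizes the resulting MDI values and ranks. The number of repetitions, the forest size, and the column names of the summary table are illustrative assumptions; a publication-quality analysis would typically use repeated cross-validation and more repetitions.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Refit the forest on a different training subset and seed each time.
importance_rows = []
for seed in range(5):
    X_train, _, y_train, _ = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    importance_rows.append(forest.feature_importances_)

importances = pd.DataFrame(importance_rows, columns=X.columns)

# Rank 1 is the most important feature within each repetition.
ranks = importances.rank(axis=1, ascending=False)

summary = pd.DataFrame(
    {
        "mean_importance": importances.mean(),
        "std_importance": importances.std(),
        "mean_rank": ranks.mean(),
    }
).sort_values("mean_importance", ascending=False)

print(summary.head(10))
```

Features whose mean rank is high but whose standard deviation is large relative to the mean deserve a cautious description in any report.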

9. Cumulative Importance and Feature Selection

Feature importance is often used to create a reduced feature subset. A common approach sorts features from highest to lowest MDI and calculates cumulative importance. This shows how quickly the total reaches thresholds such as 80% or 90%.

**Figure 8.** Cumulative impurity-based feature importance. The curve indicates how many top-ranked features account for a selected proportion of total MDI, but it should not be used as an automatic deletion rule.
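The threshold calculation itself is a cumulative sum over the sorted importances. The sketch below uses a hypothetical importance vector for illustration; the function name `features_needed` and the values are not taken from the article's results.

```python
import numpy as np


def features_needed(importances, threshold):
    """Number of top-ranked features whose cumulative importance reaches threshold."""
    sorted_importances = np.sort(np.asarray(importances, dtype=float))[::-1]
    cumulative = np.cumsum(sorted_importances)
    # Index of the first cumulative value that is at least the threshold.
    position = int(np.searchsorted(cumulative, threshold))
    return min(position + 1, len(sorted_importances))


# Hypothetical normalized importances (illustrative values that sum to one).
example_importances = [0.10, 0.35, 0.07, 0.25, 0.08, 0.15]

needed_80 = features_needed(example_importances, 0.80)
needed_90 = features_needed(example_importances, 0.90)
print("Features needed for 80%:", needed_80)
print("Features needed for 90%:", needed_90)
```

With these values the sorted cumulative totals are 0.35, 0.60, 0.75, 0.85, 0.93, and 1.00, so four features are needed to reach 80% and five to reach 90%.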

9.1 Why threshold selection is not enough

Keeping enough features to reach 90% cumulative MDI does not guarantee that the reduced model will retain 90% of predictive performance. Importance values are not percentages of accuracy. A low-ranked feature may provide unique information, contribute only through interactions, or become important after another feature is removed.

9.2 Correct feature-selection procedure

Calculate importance using only the training process or training folds.
Choose one or more candidate subsets, such as the top 5, 10, 15, and 20 features.
Retrain the complete preprocessing and modeling pipeline for each subset.
Evaluate every candidate using cross-validation or a nested validation design.
Select the smallest subset whose performance is acceptably close to the full model.
Evaluate the selected pipeline once on the untouched test set.

Avoid leakage: do not calculate importance on the complete dataset and then report cross-validation performance using the selected variables. Feature selection must be performed inside the training folds.
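One way to keep selection inside the training folds is to place the selector in a pipeline, so that it is refitted on every fold. The sketch below assumes scikit-learn's `SelectFromModel` with `threshold=-np.inf` and `max_features=k`, which keeps the k features with the highest MDI. The helper name `make_top_k_pipeline`, the candidate subset sizes, and the forest sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline


def make_top_k_pipeline(k):
    """Select the k highest-MDI features, then fit a forest on them."""
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf,   # rank-based selection only
        max_features=k,
    )
    classifier = RandomForestClassifier(n_estimators=50, random_state=0)
    return Pipeline([("select", selector), ("forest", classifier)])


X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# The test set is set aside and is not used during subset comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Importance is recomputed inside every training fold by the pipeline.
cv_scores = {}
for k in (5, 10, 20):
    scores = cross_val_score(
        make_top_k_pipeline(k),
        X_train,
        y_train,
        cv=cv,
        scoring="balanced_accuracy",
    )
    cv_scores[k] = float(scores.mean())
    print(f"Top {k:2d} features: mean balanced accuracy = {cv_scores[k]:.4f}")
```

After a subset size has been chosen from the cross-validation results, the corresponding pipeline is refitted on the full training set and evaluated once on `X_test`.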

10. Complete Python Code

The following script generates every figure and CSV table referenced in this article. Save it as random_forest_feature_importance_tutorial.py and run it from Spyder, a terminal, or another Python environment.

10.1 Required packages

pip install numpy pandas matplotlib scikit-learn

10.2 Full reproducible script

"""
Random Forest Feature Importance Tutorial
Generates all tables and figures used in the Pythonholics article:
"Feature Importance in Random Forests: How to Interpret Your Model"

Tested workflow:
- Breast Cancer Wisconsin diagnostic dataset
- RandomForestClassifier
- Impurity-based feature importance
- Permutation feature importance
- Tree-to-tree importance stability
- Correlation analysis
- Cumulative feature importance
"""

from pathlib import Path
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


# =============================================================================
# 1. Configuration
# =============================================================================

RANDOM_STATE = 42
TEST_SIZE = 0.30
N_ESTIMATORS = 500
N_PERMUTATION_REPEATS = 30
TOP_N = 10

OUTPUT_DIR = Path("rf_feature_importance_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Times New Roman is used when it is installed on the computer.
# Matplotlib will use a fallback font when it is unavailable.
plt.rcParams["font.family"] = "Times New Roman"
plt.rcParams["font.size"] = 12
plt.rcParams["axes.titlesize"] = 15
plt.rcParams["axes.labelsize"] = 13
plt.rcParams["xtick.labelsize"] = 11
plt.rcParams["ytick.labelsize"] = 11
plt.rcParams["legend.fontsize"] = 11
plt.rcParams["figure.titlesize"] = 15

warnings.filterwarnings("ignore", category=UserWarning)


def save_figure(filename: str) -> None:
    """Save the active Matplotlib figure and close it."""
    plt.tight_layout()
    plt.savefig(
        OUTPUT_DIR / filename,
        dpi=300,
        bbox_inches="tight",
    )
    plt.close()


# =============================================================================
# 2. Load and inspect the dataset
# =============================================================================

dataset = load_breast_cancer()

X = pd.DataFrame(
    dataset.data,
    columns=dataset.feature_names,
)
y = pd.Series(
    dataset.target,
    name="target",
)

# In this dataset, label 0 is malignant and label 1 is benign.
dataset_summary = pd.DataFrame(
    {
        "property": [
            "Number of observations",
            "Number of input features",
            "Number of target classes",
            "Class 0 (malignant) observations",
            "Class 1 (benign) observations",
            "Missing values",
        ],
        "value": [
            X.shape[0],
            X.shape[1],
            y.nunique(),
            int((y == 0).sum()),
            int((y == 1).sum()),
            int(X.isna().sum().sum()),
        ],
    }
)

dataset_summary.to_csv(
    OUTPUT_DIR / "table_01_dataset_summary.csv",
    index=False,
)


# =============================================================================
# 3. Create a stratified training/test split
# =============================================================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y,
)


# =============================================================================
# 4. Train the Random Forest classifier
# =============================================================================

model = RandomForestClassifier(
    n_estimators=N_ESTIMATORS,
    criterion="gini",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",
    bootstrap=True,
    oob_score=True,
    class_weight=None,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

model.fit(X_train, y_train)


# =============================================================================
# 5. Evaluate predictive performance
# =============================================================================

y_pred = model.predict(X_test)
y_probability = model.predict_proba(X_test)[:, 1]

performance = pd.DataFrame(
    {
        "metric": [
            "Accuracy",
            "Balanced accuracy",
            "Precision",
            "Recall",
            "F1-score",
            "Matthews correlation coefficient",
            "ROC-AUC",
            "Out-of-bag score",
        ],
        "value": [
            accuracy_score(y_test, y_pred),
            balanced_accuracy_score(y_test, y_pred),
            # Precision, recall, and F1 use the default pos_label=1 (benign).
            precision_score(y_test, y_pred),
            recall_score(y_test, y_pred),
            f1_score(y_test, y_pred),
            matthews_corrcoef(y_test, y_pred),
            roc_auc_score(y_test, y_probability),
            model.oob_score_,
        ],
    }
)

performance.to_csv(
    OUTPUT_DIR / "table_02_model_performance.csv",
    index=False,
)

classification_report_table = pd.DataFrame(
    classification_report(
        y_test,
        y_pred,
        target_names=dataset.target_names,
        output_dict=True,
        zero_division=0,
    )
).transpose()

classification_report_table.to_csv(
    OUTPUT_DIR / "table_03_classification_report.csv"
)

# Rows and columns follow the sorted label order: 0 = malignant, 1 = benign.
cm = confusion_matrix(y_test, y_pred)

confusion_matrix_table = pd.DataFrame(
    cm,
    index=["Actual malignant", "Actual benign"],
    columns=["Predicted malignant", "Predicted benign"],
)

confusion_matrix_table.to_csv(
    OUTPUT_DIR / "table_04_confusion_matrix.csv"
)


# =============================================================================
# 6. Calculate impurity-based feature importance
# =============================================================================

# Importance produced by the complete forest.
mdi_table = pd.DataFrame(
    {
        "feature": X.columns,
        "mdi_importance": model.feature_importances_,
    }
)

# Importance values from individual trees allow us to estimate stability.
tree_importances = np.array(
    [tree.feature_importances_ for tree in model.estimators_]
)

mdi_table["tree_importance_std"] = tree_importances.std(axis=0)

mdi_table = (
    mdi_table
    .sort_values("mdi_importance", ascending=False)
    .reset_index(drop=True)
)

mdi_table["mdi_rank"] = np.arange(1, len(mdi_table) + 1)
mdi_table["cumulative_mdi"] = mdi_table["mdi_importance"].cumsum()

mdi_table.to_csv(
    OUTPUT_DIR / "table_05_mdi_feature_importance.csv",
    index=False,
)


# =============================================================================
# 7. Calculate permutation feature importance
# =============================================================================

permutation_result = permutation_importance(
    model,
    X_test,
    y_test,
    scoring="balanced_accuracy",
    n_repeats=N_PERMUTATION_REPEATS,
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

permutation_table = pd.DataFrame(
    {
        "feature": X.columns,
        "permutation_mean": permutation_result.importances_mean,
        "permutation_std": permutation_result.importances_std,
    }
)

permutation_table = (
    permutation_table
    .sort_values("permutation_mean", ascending=False)
    .reset_index(drop=True)
)

permutation_table["permutation_rank"] = np.arange(
    1,
    len(permutation_table) + 1,
)

permutation_table.to_csv(
    OUTPUT_DIR / "table_06_permutation_feature_importance.csv",
    index=False,
)


# =============================================================================
# 8. Create a combined comparison table
# =============================================================================

comparison_table = pd.merge(
    mdi_table,
    permutation_table,
    on="feature",
    how="inner",
)

comparison_table["absolute_rank_difference"] = (
    comparison_table["mdi_rank"]
    - comparison_table["permutation_rank"]
).abs()

# Normalization below is used only to make the two methods easier to compare
# visually. It does not make the methods mathematically equivalent.
positive_permutation = comparison_table["permutation_mean"].clip(lower=0)

if positive_permutation.sum() > 0:
    comparison_table["normalized_positive_permutation"] = (
        positive_permutation / positive_permutation.sum()
    )
else:
    comparison_table["normalized_positive_permutation"] = 0.0

comparison_table.to_csv(
    OUTPUT_DIR / "table_07_importance_comparison.csv",
    index=False,
)


# =============================================================================
# 9. Figure 1: Conceptual Random Forest interpretation workflow
# =============================================================================

fig, ax = plt.subplots(figsize=(12, 4.8))
ax.axis("off")

workflow_nodes = [
    (0.10, "Training data"),
    (0.32, "Random Forest\nclassifier"),
    (0.55, "Model\nperformance"),
    (0.76, "Feature-importance\nmethods"),
    (0.94, "Interpretation and\nvalidation"),
]

for x_position, label in workflow_nodes:
    ax.text(
        x_position,
        0.52,
        label,
        ha="center",
        va="center",
        transform=ax.transAxes,
        bbox={
            "boxstyle": "round,pad=0.7",
            "facecolor": "white",
            "edgecolor": "black",
        },
    )

for index in range(len(workflow_nodes) - 1):
    x_start = workflow_nodes[index][0] + 0.07
    x_end = workflow_nodes[index + 1][0] - 0.08

    ax.annotate(
        "",
        xy=(x_end, 0.52),
        xytext=(x_start, 0.52),
        xycoords=ax.transAxes,
        textcoords=ax.transAxes,
        arrowprops={"arrowstyle": "->", "linewidth": 1.5},
    )

ax.text(
    0.76,
    0.23,
    "MDI  |  permutation  |  stability  |  correlation",
    ha="center",
    va="center",
    transform=ax.transAxes,
)

ax.set_title("A Practical Workflow for Interpreting Random Forest Feature Importance")
save_figure("figure_01_random_forest_interpretation_workflow.png")


# =============================================================================
# 10. Figure 2: Top impurity-based feature importances
# =============================================================================

mdi_top = mdi_table.head(TOP_N).sort_values(
    "mdi_importance",
    ascending=True,
)

plt.figure(figsize=(11, 7))
plt.barh(
    mdi_top["feature"],
    mdi_top["mdi_importance"],
    xerr=mdi_top["tree_importance_std"],
    capsize=3,
)
plt.xlabel("Mean decrease in impurity")
plt.ylabel("Feature")
plt.title(f"Top {TOP_N} Features by Impurity-Based Importance")
save_figure("figure_02_mdi_feature_importance.png")


# =============================================================================
# 11. Figure 3: Top permutation importances
# =============================================================================

permutation_top = permutation_table.head(TOP_N).sort_values(
    "permutation_mean",
    ascending=True,
)

plt.figure(figsize=(11, 7))
plt.barh(
    permutation_top["feature"],
    permutation_top["permutation_mean"],
    xerr=permutation_top["permutation_std"],
    capsize=3,
)
plt.axvline(0, linewidth=1)
plt.xlabel("Mean decrease in balanced accuracy")
plt.ylabel("Feature")
plt.title(f"Top {TOP_N} Features by Permutation Importance")
save_figure("figure_03_permutation_feature_importance.png")


# =============================================================================
# 12. Figure 4: Normalized MDI/permutation comparison
# =============================================================================

comparison_top_features = (
    comparison_table
    .sort_values("mdi_importance", ascending=False)
    .head(TOP_N)
    .copy()
)

plot_positions = np.arange(len(comparison_top_features))
bar_width = 0.38

plt.figure(figsize=(13, 7))
plt.bar(
    plot_positions - bar_width / 2,
    comparison_top_features["mdi_importance"],
    width=bar_width,
    label="Normalized MDI",
)
plt.bar(
    plot_positions + bar_width / 2,
    comparison_top_features["normalized_positive_permutation"],
    width=bar_width,
    label="Normalized positive permutation importance",
)
plt.xticks(
    plot_positions,
    comparison_top_features["feature"],
    rotation=75,
    ha="right",
)
plt.ylabel("Normalized importance used for visual comparison")
plt.xlabel("Feature")
plt.title("Comparison of MDI and Permutation Feature Importance")
plt.legend()
save_figure("figure_04_mdi_vs_permutation_comparison.png")


# =============================================================================
# 13. Figure 5: Confusion matrix
# =============================================================================

plt.figure(figsize=(7, 6))
plt.imshow(cm, interpolation="nearest", aspect="auto")
plt.colorbar()

class_labels = ["Malignant", "Benign"]
plt.xticks([0, 1], class_labels)
plt.yticks([0, 1], class_labels)
plt.xlabel("Predicted class")
plt.ylabel("True class")
plt.title("Random Forest Confusion Matrix")

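# With the default colormap, large counts are drawn in bright colors and small
# counts in dark colors, so the annotation color is switched at the midpoint.
threshold = cm.max() / 2

for row in range(cm.shape[0]):
    for column in range(cm.shape[1]):
        plt.text(
            column,
            row,
            str(cm[row, column]),
            ha="center",
            va="center",
            color="black" if cm[row, column] > threshold else "white",
        )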

save_figure("figure_05_confusion_matrix.png")


# =============================================================================
# 14. Figure 6: Correlation heatmap for the top MDI features
# =============================================================================

top_feature_names = mdi_table.head(TOP_N)["feature"].tolist()
correlation_matrix = X[top_feature_names].corr(method="pearson")

plt.figure(figsize=(11, 9))
image = plt.imshow(
    correlation_matrix,
    interpolation="nearest",
    aspect="auto",
    vmin=-1,
    vmax=1,
)
plt.colorbar(image, label="Pearson correlation coefficient")
plt.xticks(
    np.arange(len(top_feature_names)),
    top_feature_names,
    rotation=75,
    ha="right",
)
plt.yticks(
    np.arange(len(top_feature_names)),
    top_feature_names,
)
plt.title(f"Correlation Among the Top {TOP_N} MDI Features")
save_figure("figure_06_top_feature_correlation_heatmap.png")


# =============================================================================
# 15. Figure 7: Tree-to-tree importance stability
# =============================================================================

top_indices = [
    X.columns.get_loc(feature)
    for feature in mdi_table.head(TOP_N)["feature"]
]

tree_importance_top = tree_importances[:, top_indices]
tree_importance_labels = mdi_table.head(TOP_N)["feature"].tolist()

plt.figure(figsize=(13, 7))
# Note: tick_labels requires Matplotlib 3.9 or newer; older releases use labels=.
plt.boxplot(
    tree_importance_top,
    tick_labels=tree_importance_labels,
    showfliers=False,
)
plt.xticks(rotation=75, ha="right")
plt.xlabel("Feature")
plt.ylabel("Importance across individual trees")
plt.title("Tree-to-Tree Variability of the Top Feature Importances")
save_figure("figure_07_tree_importance_stability.png")


# =============================================================================
# 16. Figure 8: Cumulative MDI importance
# =============================================================================

feature_count = np.arange(1, len(mdi_table) + 1)

plt.figure(figsize=(10, 6))
plt.plot(
    feature_count,
    mdi_table["cumulative_mdi"],
    marker="o",
)
plt.axhline(0.80, linestyle="--", label="80% importance")
plt.axhline(0.90, linestyle=":", label="90% importance")
plt.xlabel("Number of features included")
plt.ylabel("Cumulative MDI importance")
plt.title("Cumulative Impurity-Based Feature Importance")
plt.grid(True, alpha=0.3)
plt.legend()
save_figure("figure_08_cumulative_feature_importance.png")


# =============================================================================
# 17. Determine the numbers of features required for thresholds
# =============================================================================

features_for_80 = int(
    np.argmax(mdi_table["cumulative_mdi"].to_numpy() >= 0.80) + 1
)

features_for_90 = int(
    np.argmax(mdi_table["cumulative_mdi"].to_numpy() >= 0.90) + 1
)

threshold_summary = pd.DataFrame(
    {
        "cumulative_importance_threshold": [0.80, 0.90],
        "number_of_features_required": [features_for_80, features_for_90],
    }
)

threshold_summary.to_csv(
    OUTPUT_DIR / "table_08_cumulative_importance_thresholds.csv",
    index=False,
)


# =============================================================================
# 18. Print the main results
# =============================================================================

print("\nDATASET SUMMARY")
print(dataset_summary.to_string(index=False))

print("\nMODEL PERFORMANCE")
print(performance.to_string(index=False))

print("\nCONFUSION MATRIX")
print(confusion_matrix_table)

print("\nTOP 10 MDI FEATURES")
print(mdi_table.head(10).to_string(index=False))

print("\nTOP 10 PERMUTATION FEATURES")
print(permutation_table.head(10).to_string(index=False))

print("\nCUMULATIVE IMPORTANCE THRESHOLDS")
print(threshold_summary.to_string(index=False))

print(f"\nAll tables and figures were saved to:\n{OUTPUT_DIR.resolve()}")

10.3 Generated output files

Table 7. Files generated by the Python script
File	Content
`table_01_dataset_summary.csv`	Dataset dimensions, class counts, and missing-value count.
`table_02_model_performance.csv`	Accuracy, balanced accuracy, precision, recall, F1, MCC, ROC-AUC, and OOB score.
`table_03_classification_report.csv`	Class-specific precision, recall, F1-score, and support.
`table_04_confusion_matrix.csv`	Numerical confusion matrix.
`table_05_mdi_feature_importance.csv`	MDI values, tree-level standard deviations, ranks, and cumulative MDI.
`table_06_permutation_feature_importance.csv`	Mean and standard deviation of repeated permutation importance.
`table_07_importance_comparison.csv`	Merged rankings and normalized values used for comparison.
`table_08_cumulative_importance_thresholds.csv`	Numbers of features required to reach 80% and 90% cumulative MDI.
`figure_01_random_forest_interpretation_workflow.png`	Interpretation workflow diagram.
`figure_02_mdi_feature_importance.png`	Top MDI importance chart.
`figure_03_permutation_feature_importance.png`	Top permutation importance chart.
`figure_04_mdi_vs_permutation_comparison.png`	Normalized comparison chart.
`figure_05_confusion_matrix.png`	Confusion matrix visualization.
`figure_06_top_feature_correlation_heatmap.png`	Top-feature correlation heatmap.
`figure_07_tree_importance_stability.png`	Tree-to-tree importance distributions.
`figure_08_cumulative_feature_importance.png`	Cumulative MDI curve.

11. How to Interpret the Example Results

11.1 Top impurity-based features

In the representative run, the highest MDI values were associated with measurements such as worst concave points, worst perimeter, worst area, worst radius, and mean concave points. These variables describe characteristics of the cell nuclei and provide strong splitting opportunities for the fitted forest.

Table 8. Top ten MDI features from the representative run
Rank	Feature	MDI importance
1	worst concave points	0.136424
2	worst perimeter	0.132857
3	worst area	0.131652
4	worst radius	0.088794
5	mean concave points	0.080803
6	mean radius	0.063168
7	mean perimeter	0.053367
8	mean area	0.043312
9	mean concavity	0.042598
10	worst concavity	0.036355

11.2 Top permutation features

The permutation ranking differs considerably. Worst texture and mean texture appear at the top, while several highly ranked MDI variables have smaller permutation values. This is an excellent example of why one method should not be interpreted alone. Several geometric measurements are strongly correlated, so shuffling one of them may not damage performance substantially because related features remain available.

Table 9. Top ten permutation features from the representative run
Rank	Feature	Mean decrease in balanced accuracy	Standard deviation
1	worst texture	0.009813	0.003271
2	mean texture	0.005763	0.003119
3	mean concavity	0.004829	0.003301
4	worst radius	0.004461	0.005277
5	mean concave points	0.004206	0.003486
6	worst concavity	0.003271	0.004849
7	area error	0.003115	0.003267
8	radius error	0.002336	0.002893
9	worst smoothness	0.001558	0.002203
10	worst perimeter	0.001341	0.005475

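Interpretation: the disagreement does not mean that one method is wrong. MDI emphasizes the features used to build strong splits, whereas permutation importance measures the additional performance loss after one feature is disrupted while its correlated substitutes remain available. Several standard deviations in Table 9 are also as large as or larger than the corresponding means (for example, worst radius and worst perimeter), so the ordering of the lower-ranked features is uncertain.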

11.3 Why permutation values may be small

Small permutation values do not necessarily mean that the model contains no useful features. The forest can distribute predictive information across many related variables. If several predictors are substitutes, shuffling one variable at a time may cause only a modest score reduction. This is precisely why correlation analysis, grouped permutation, and feature-group ablation can be valuable.

12. Common Interpretation Mistakes

12.1 Treating importance as causality

Predictive importance is not a causal effect. A feature may be a proxy for another variable, reflect a selection process, or become important because of leakage.

12.2 Interpreting only the highest bar

The difference between the first and second feature may be unstable. Features should be interpreted in groups, with uncertainty estimates and repeated fitting.
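
The instability described above can be checked with a small experiment. The following is a minimal sketch, assuming the same Breast Cancer Wisconsin data; the number of seeds, the forest size, and the variable names are illustrative choices rather than part of the original script.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, _, y_train, _ = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Refit the same forest configuration with different random seeds
# (10 seeds and 100 trees are arbitrary, illustrative values).
seeds = range(10)
importances = np.array([
    RandomForestClassifier(n_estimators=100, random_state=seed)
    .fit(X_train, y_train)
    .feature_importances_
    for seed in seeds
])

# Mean and spread of the MDI value of every feature across the refits.
stability = pd.DataFrame(
    {
        "mdi_mean": importances.mean(axis=0),
        "mdi_std": importances.std(axis=0),
    },
    index=X.columns,
).sort_values("mdi_mean", ascending=False)

# Which feature ranks first in each refit?
top_feature_per_seed = X.columns[importances.argmax(axis=1)]

print(stability.head(5))
print(pd.Series(top_feature_per_seed).value_counts())
```

If the first-ranked feature changes between seeds, the difference between the top bars should not be given much weight.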

12.3 Ignoring correlated predictors

Correlation can redistribute MDI and mask permutation importance. Always inspect relationships among the highest-ranked predictors.
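
One simple way to inspect these relationships is to list the strongly correlated feature pairs. This is a minimal sketch; the 0.9 cutoff and the column names of the result table are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

# Absolute Pearson correlation between every pair of features.
corr = X.corr(method="pearson").abs()
corr_values = corr.to_numpy()

# Keep each pair once (upper triangle, diagonal excluded) above the cutoff.
cutoff = 0.9  # illustrative threshold, not a universal rule
rows, cols = np.where(np.triu(corr_values > cutoff, k=1))

strong_pairs = (
    pd.DataFrame(
        {
            "feature_a": corr.index[rows],
            "feature_b": corr.columns[cols],
            "abs_correlation": corr_values[rows, cols],
        }
    )
    .sort_values("abs_correlation", ascending=False)
    .reset_index(drop=True)
)

print(strong_pairs.head(10).to_string(index=False))
```

Features that appear together in this table are candidates for grouped interpretation rather than individual ranking.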

12.4 Calculating permutation importance on training data

Training-set permutation importance can reflect overfitting. Validation or test data provide a more meaningful estimate of model reliance on unseen observations.

12.5 Selecting features before splitting the data

Using the complete dataset to select features leaks information into model evaluation. Importance-based selection belongs inside cross-validation or the training portion of the experiment.
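
A common way to keep selection inside cross-validation is to place it in a pipeline, so that it is refitted on the training portion of every fold. The sketch below assumes SelectFromModel with a median threshold as the selection rule; that rule, the fold count, and the forest sizes are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# The selector is part of the pipeline, so each fold selects features
# using only its own training portion.
pipeline = Pipeline(
    [
        (
            "select",
            SelectFromModel(
                RandomForestClassifier(n_estimators=100, random_state=0),
                threshold="median",  # keep features at or above median MDI
            ),
        ),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ]
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="balanced_accuracy")

print("Balanced accuracy per fold:", scores.round(4))
print("Mean balanced accuracy:", scores.mean().round(4))
```

Because the selected subset can differ from fold to fold, the cross-validated score describes the whole procedure (selection plus fitting), not one fixed feature list.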

12.6 Assuming importance reveals direction

A positive importance score does not tell us whether higher feature values increase or decrease the predicted probability. Partial dependence, accumulated local effects, SHAP dependence plots, or targeted conditional analysis are needed to study direction.

12.7 Ignoring class imbalance and the scoring metric

Permutation rankings can change when accuracy is replaced by balanced accuracy, recall, F1-score, ROC-AUC, or a domain-specific cost function. The score should match the actual objective of the model.

12.8 Removing all low-ranked features automatically

A feature can have low marginal importance while contributing through interactions. Removal decisions must be validated by retraining and evaluating the complete model.

13. Advanced Methods for Deeper Interpretation

13.1 Partial dependence plots

Partial dependence plots estimate the average predicted response as one or two features vary. They help show whether the model response is increasing, decreasing, nonlinear, or threshold-like. They can be misleading when the plotted feature is strongly correlated with other predictors.

13.2 Individual conditional expectation

Individual conditional expectation curves show how predictions change for individual observations rather than only displaying an average. They can reveal heterogeneous effects that a partial dependence curve hides.

13.3 Accumulated local effects

Accumulated local effects plots are designed to reduce some problems caused by correlated features. They estimate local changes in predictions within regions supported by the observed data.

13.4 SHAP values

SHAP-based methods assign contributions to features for individual predictions and can be aggregated into global summaries. They provide richer local explanations but require additional assumptions, computation, and careful treatment of dependent predictors.

13.5 Drop-column importance

Drop-column importance removes one feature, retrains the model, and measures the resulting performance change. It can be more faithful to the retrained modeling process than simple permutation, but it is computationally expensive because the model must be refitted for every feature or feature group.
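
A minimal sketch of the procedure follows. The helper fit_and_score, the forest size, and the use of balanced accuracy are assumptions made for illustration; in a real selection workflow the score would be computed on a validation subset rather than on the final test set.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

base_model = RandomForestClassifier(n_estimators=50, random_state=42)


def fit_and_score(columns):
    """Refit a fresh copy of the model on the given columns and score it."""
    model = clone(base_model).fit(X_train[columns], y_train)
    return balanced_accuracy_score(y_test, model.predict(X_test[columns]))


all_columns = list(X.columns)
baseline_score = fit_and_score(all_columns)

# Importance = baseline score minus the score obtained without the feature.
drop_column_importance = pd.Series(
    {
        feature: baseline_score
        - fit_and_score([c for c in all_columns if c != feature])
        for feature in all_columns
    }
).sort_values(ascending=False)

print(f"Baseline balanced accuracy: {baseline_score:.4f}")
print(drop_column_importance.head(10))
```

Values close to zero or negative are expected for redundant predictors, because the retrained forest can replace the removed column with a correlated one.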

13.6 Grouped feature importance

When several variables describe the same concept, grouped permutation or grouped removal may be more meaningful than ranking each column independently. For example, radius, perimeter, and area measurements might be analyzed as a geometric feature group.
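
Grouped permutation can be sketched by shuffling all columns of a group with one shared row order. The group definitions below are hypothetical and are built from substrings of the column names; the repeat count and the scoring metric are also illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Hypothetical feature groups derived from the column names.
feature_groups = {
    "geometry": [
        c for c in X.columns
        if any(key in c for key in ("radius", "perimeter", "area"))
    ],
    "texture": [c for c in X.columns if "texture" in c],
}

rng = np.random.default_rng(42)
n_repeats = 10
baseline_score = balanced_accuracy_score(y_test, model.predict(X_test))

rows = []
for group_name, columns in feature_groups.items():
    score_drops = []
    for _ in range(n_repeats):
        X_shuffled = X_test.copy()
        # One shared row order keeps the relationships inside the group
        # intact while breaking the link between the group and the target.
        row_order = rng.permutation(len(X_shuffled))
        X_shuffled[columns] = X_test[columns].to_numpy()[row_order]
        shuffled_score = balanced_accuracy_score(
            y_test, model.predict(X_shuffled)
        )
        score_drops.append(baseline_score - shuffled_score)
    rows.append(
        {
            "group": group_name,
            "n_features": len(columns),
            "mean_drop": np.mean(score_drops),
            "std_drop": np.std(score_drops),
        }
    )

grouped_importance = pd.DataFrame(rows)
print(grouped_importance.to_string(index=False))
```

Because the whole group is disrupted at once, correlated substitutes inside the group can no longer compensate, which is the effect that single-column permutation hides.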

Table 10. Choosing an interpretation method
Goal	Suitable method	Main caution
Fast global overview	MDI	Training-based bias and correlated predictors
Performance-based global reliance	Permutation importance	Depends on scoring metric and correlation
Importance after retraining	Drop-column importance	High computational cost
Average response shape	Partial dependence	Can extrapolate into unrealistic combinations
Observation-specific contribution	SHAP or local explanation	Requires careful background and dependence assumptions
Related predictors	Grouped permutation or grouped ablation	Feature groups require domain justification

14. Random Forest Regression

The same workflow applies to RandomForestRegressor. The primary differences are the impurity criterion and the evaluation metric. Instead of Gini impurity, regression trees typically reduce squared error, absolute error, or another regression criterion.

Permutation importance should then be calculated using a regression score such as R², negative mean squared error, or negative mean absolute error. The interpretation remains the same: a feature is important when disrupting its information causes the selected predictive score to deteriorate.
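
The regression variant can be sketched as follows. The diabetes dataset bundled with scikit-learn, the forest size, and the R² scorer are illustrative assumptions and are not part of the original classification script.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The default regression criterion reduces squared error at each split.
regressor = RandomForestRegressor(n_estimators=200, random_state=42)
regressor.fit(X_train, y_train)

# Permutation importance measured as the mean drop in R² on held-out data.
result = permutation_importance(
    regressor,
    X_test,
    y_test,
    scoring="r2",
    n_repeats=10,
    random_state=42,
)

regression_table = pd.DataFrame(
    {
        "feature": X.columns,
        "mdi_importance": regressor.feature_importances_,
        "permutation_mean_r2_drop": result.importances_mean,
        "permutation_std": result.importances_std,
    }
).sort_values("permutation_mean_r2_drop", ascending=False)

print(f"Test R²: {regressor.score(X_test, y_test):.4f}")
print(regression_table.to_string(index=False))
```

For error-based scorers such as neg_mean_squared_error, the reported drop is expressed in the units of that metric, so values from different scorers should not be compared directly.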

15. Best-Practice Checklist

Evaluate predictive performance before interpreting feature importance.
Use a held-out validation or test subset for permutation importance.
Select a permutation scoring metric that matches the real modeling objective.
Compare MDI with at least one performance-based method.
Inspect correlations among highly ranked predictors.
Report standard deviations or repeated-fit variability.
Interpret groups of related predictors rather than overemphasizing tiny rank differences.
Perform feature selection inside cross-validation to prevent leakage.
Retrain and reevaluate the model after removing features.
Use domain knowledge to judge whether the ranking is plausible.
Do not claim that predictive importance proves causality.
Use local explanation methods when individual predictions must be explained.

16. Conclusion

Feature importance makes Random Forests easier to inspect, but it does not provide one final and unquestionable explanation. Impurity-based importance is fast and reveals how features contribute to the structure of the forest. Permutation importance measures how strongly predictive performance depends on preserving the information in each feature.

The two methods can disagree because they measure different properties. Correlated predictors, redundant information, model instability, scoring choices, and sampling variation all affect the resulting rankings. A responsible analysis therefore combines importance measures with performance evaluation, correlation analysis, uncertainty estimates, repeated model fitting, and domain knowledge.

Used carefully, Random Forest feature importance can help identify meaningful predictors, detect suspicious dependencies, guide feature-selection experiments, and communicate model behavior. Used carelessly, it can produce confident but misleading conclusions. The objective is not simply to generate a ranked bar chart; it is to understand what the ranking measures, why it may change, and how it should influence the next modeling decision.

17. References

Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding Variable Importances in Forests of Randomized Trees. Advances in Neural Information Processing Systems, 26.
Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. https://doi.org/10.1186/1471-2105-8-25
Altmann, A., Toloşi, L., Sander, O., and Lengauer, T. (2010). Permutation Importance: A Corrected Feature Importance Measure. Bioinformatics, 26(10), 1340–1347. https://doi.org/10.1093/bioinformatics/btq134
Fisher, A., Rudin, C., and Dominici, F. (2019). All Models Are Wrong, but Many Are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research, 20(177), 1–81.
Hooker, G., Mentch, L., and Zhou, S. (2021). Unrestricted Permutation Forces Extrapolation: Variable Importance Requires at Least One More Model, or There Is No Free Variable Importance. Statistics and Computing, 31, 82. https://doi.org/10.1007/s11222-021-10057-z
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Lundberg, S. M., and Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems, 30.
Molnar, C. (2022). Interpretable Machine Learning, second edition.

Monday, July 27, 2026

Out-of-Bag Error in Random Forests: How OOB Validation Works

Random Forests contain a useful validation mechanism that is created automatically during bootstrap training. It is called out-of-bag validation, or simply OOB validation. In this AFAP tutorial, we will explain the complete idea, calculate the OOB error, inspect sample-level OOB probabilities, compare OOB performance with an untouched test set, and generate two figures in Python.

Out-of-bag validation in a Random Forest. Each tree is trained on a bootstrap sample, while the observations excluded from that sample are used for an internal validation of the tree.

AFAP summary: each tree is trained on a bootstrap sample of the training data. The observations that were not selected for that tree form its OOB subset. Every training observation is predicted only by trees that did not use it for training. The aggregated correctness of those predictions gives the OOB score, while 1 - OOB score gives the OOB error.

What Is Out-of-Bag Error?

A Random Forest is an ensemble of decision trees. In the standard bootstrap version of the algorithm, no tree is trained directly on the original training dataset. Instead, each tree receives a new training sample constructed by randomly drawing observations from the original training set with replacement. This procedure is called bootstrap sampling.

Because sampling is performed with replacement, the same observation can be selected more than once, while some observations are not selected at all. For a particular tree, the observations that were not selected are called its out-of-bag observations.

These excluded observations can be passed through that tree after it has been trained. Since the tree did not see them during fitting, its predictions for those observations behave like internal validation predictions. The procedure is repeated over all trees, and every training observation is evaluated by the subset of trees for which it was out of bag.

OOB error = 1 − OOB accuracy

For classification, the default OOB score in scikit-learn is classification accuracy. Therefore, an OOB score of 0.9624 corresponds to an OOB error of approximately 0.0376, or 3.76%.

Why Does a Bootstrap Sample Leave About 36.8% of the Data Out?

Suppose that the training dataset contains n observations and that one bootstrap sample also contains n draws. During one draw, the probability that a particular observation is not selected is:

P(not selected in one draw) = 1 − 1/n

The bootstrap process performs n independent draws, so the probability that the observation is never selected is the product of n such terms, which approaches e⁻¹ as n grows:

P(OOB) = (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.368

Therefore, approximately 36.8% of the observations are out of bag for an individual tree, while approximately 63.2% appear at least once in its bootstrap sample. The exact percentage varies from tree to tree because the sampling process is random.

A bootstrap sample has the same number of draws as the training dataset, but duplicate selections cause some original observations to remain outside the sample.
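
The 36.8% figure can be verified with a short simulation. This is a minimal sketch; the dataset size of 1,000 and the 500 simulated bootstrap samples are arbitrary, illustrative values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1_000   # size of the (hypothetical) training set
n_bootstraps = 500  # number of simulated bootstrap samples

oob_fractions = []
for _ in range(n_bootstraps):
    # Draw n_samples indices with replacement, as one tree would.
    drawn = rng.integers(0, n_samples, size=n_samples)
    n_in_bag = np.unique(drawn).size
    oob_fractions.append(1.0 - n_in_bag / n_samples)

oob_fractions = np.array(oob_fractions)
theoretical = (1.0 - 1.0 / n_samples) ** n_samples

print(f"Simulated mean OOB fraction: {oob_fractions.mean():.4f}")
print(f"Smallest and largest:        {oob_fractions.min():.4f}, {oob_fractions.max():.4f}")
print(f"(1 - 1/n)^n for n = {n_samples}:   {theoretical:.4f}")
print(f"exp(-1):                     {np.exp(-1):.4f}")
```

The simulated mean stays close to the theoretical value, while the minimum and maximum show the tree-to-tree variation mentioned above.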

How OOB Validation Works Step by Step

Step 1: Construct a bootstrap sample for each tree

Assume that the training dataset contains 1,000 observations. To build the first tree, the algorithm performs 1,000 random draws with replacement. Some observations appear several times, while roughly 368 observations are not selected.

Step 2: Train the tree only on its bootstrap sample

The decision tree learns its splitting rules from the selected bootstrap observations. The corresponding OOB observations do not participate in the construction of that tree.

Step 3: Predict the OOB observations

After training, the excluded observations are passed through the tree. These are valid internal validation predictions for that particular tree because those observations were not used to fit it.

Step 4: Repeat the procedure for all trees

Every tree receives a different bootstrap sample and therefore has a different OOB subset. An observation excluded from one tree may be included in another tree. With a sufficiently large number of trees, every observation normally receives many OOB predictions.

Step 5: Aggregate only the valid OOB votes

For a given training observation, the forest ignores trees that used that observation during training. Classification probabilities and the predicted class are calculated only from trees for which the observation was out of bag.

Step 6: Compare aggregated predictions with true targets

The aggregated OOB predictions are compared with the original target values. Their average correctness produces the OOB score. The corresponding error is calculated by subtracting the score from one.

An OOB prediction for one observation is produced only from trees that excluded that observation from their bootstrap samples.

Important distinction: there is no single universal OOB dataset. Every tree has its own OOB subset, and every observation is validated by a different subset of trees.
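
The six steps can be reproduced by hand with individual decision trees. The sketch below is a simplified illustration and not scikit-learn's internal implementation; the number of trees and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

n_samples = X.shape[0]
n_classes = np.unique(y).size
n_trees = 50  # illustrative forest size

# Accumulated class probabilities from OOB trees only.
oob_votes = np.zeros((n_samples, n_classes))

for tree_index in range(n_trees):
    # Step 1: bootstrap sample (n draws with replacement).
    bootstrap_indices = rng.integers(0, n_samples, size=n_samples)
    oob_mask = np.ones(n_samples, dtype=bool)
    oob_mask[bootstrap_indices] = False
    oob_indices = np.flatnonzero(oob_mask)

    # Step 2: train the tree only on its bootstrap sample.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=tree_index)
    tree.fit(X[bootstrap_indices], y[bootstrap_indices])

    # Steps 3 to 5: predict only the OOB observations and add their votes.
    # The class labels 0 and 1 double as column indices here.
    oob_votes[np.ix_(oob_indices, tree.classes_)] += tree.predict_proba(
        X[oob_indices]
    )

# Step 6: compare the aggregated OOB predictions with the true targets.
has_votes = oob_votes.sum(axis=1) > 0
oob_prediction = np.argmax(oob_votes[has_votes], axis=1)
oob_accuracy = np.mean(oob_prediction == y[has_votes])
oob_error = 1.0 - oob_accuracy

print(f"Observations with OOB votes: {has_votes.sum()} of {n_samples}")
print(f"Manual OOB accuracy: {oob_accuracy:.4f}")
print(f"Manual OOB error:    {oob_error:.4f}")
```

The result will not match RandomForestClassifier exactly because the random draws differ, but it follows the same aggregation logic.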

OOB Validation in scikit-learn

The RandomForestClassifier performs OOB estimation when the following two conditions are satisfied:

bootstrap=True
oob_score=True

After fitting, the main OOB attributes are:

Attribute	Meaning
`forest.oob_score_`	The overall OOB score. For a classifier, the default score is accuracy.
`forest.oob_decision_function_`	OOB class probabilities for each training observation.
`forest.estimators_samples_`	The training-sample indices drawn for each individual tree.

The matrix oob_decision_function_ has one row for every training observation and one column for every class. In a binary classification problem, a row such as [0.15, 0.85] means that the OOB trees assigned an estimated probability of 0.15 to class 0 and 0.85 to class 1.
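
The attributes can be inspected with a few lines of code. This is a minimal sketch that fits on the complete dataset purely to show the shapes; the forest size is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,   # required for OOB estimation
    oob_score=True,   # request the OOB score after fitting
    random_state=42,
)
forest.fit(X, y)

oob_probabilities = forest.oob_decision_function_
oob_error = 1.0 - forest.oob_score_

print("Shape (observations, classes):", oob_probabilities.shape)
print("First three rows:")
print(oob_probabilities[:3].round(3))
print(f"OOB score: {forest.oob_score_:.4f}")
print(f"OOB error: {oob_error:.4f}")

# The OOB class is the column with the largest OOB probability.
oob_class = forest.classes_[np.argmax(oob_probabilities, axis=1)]
print(f"Accuracy of OOB classes: {np.mean(oob_class == y):.4f}")
```

Each row sums to one because the OOB votes are normalized per observation.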

Complete Python Example

The example uses the Breast Cancer Wisconsin dataset included with scikit-learn. No external CSV file is required. The script creates an independent 25% test set, calculates the OOB error trajectory for different forest sizes, trains a final forest with 300 trees, calculates additional OOB metrics, compares them with test-set metrics, and saves all results in the same folder as the Python script.

Install the required packages

Open Anaconda Prompt or a terminal and run:

pip install numpy pandas matplotlib scikit-learn

How to run the example in Spyder

Create a new Python file in Spyder.
Save it as oob_random_forest_afap.py.
Copy the complete script below into the file.
Press F5 or select Run → Run.
Read the printed metrics in the IPython console.
Open the generated PNG and CSV files from the same folder as the script.

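The script uses Path(__file__).resolve().parent, so the output files are saved beside the Python script even when Spyder uses a different working directory. If the code is executed in a context where __file__ is not defined, for example cell by cell, the script falls back to the current working directory.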

"""
Out-of-Bag Error in Random Forests — AFAP example

The script:
1. loads the Breast Cancer Wisconsin dataset;
2. creates an external train/test split;
3. tracks OOB error for different forest sizes;
4. trains a final RandomForestClassifier with OOB estimation;
5. computes OOB and test metrics;
6. saves two publication-ready figures.
"""

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


# -----------------------------------------------------------------------------
# 1. Reproducibility and figure settings
# -----------------------------------------------------------------------------
RANDOM_STATE = 42
OUTPUT_DIRECTORY = (
    Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
)

# If Times New Roman is not installed, Matplotlib warns and uses its default font.
plt.rcParams["font.family"] = "Times New Roman"
plt.rcParams["font.size"] = 12
plt.rcParams["axes.titlesize"] = 15
plt.rcParams["axes.labelsize"] = 13
plt.rcParams["legend.fontsize"] = 10


# -----------------------------------------------------------------------------
# 2. Load the dataset
# -----------------------------------------------------------------------------
dataset = load_breast_cancer(as_frame=True)
X = dataset.data
Y = dataset.target

print("=" * 79)
print("DATASET")
print("=" * 79)
print(f"Samples:  {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print("Classes:")
print(Y.value_counts().sort_index())


# -----------------------------------------------------------------------------
# 3. Create an external test set
# -----------------------------------------------------------------------------
# OOB estimation is computed only from X_train and y_train. The test set remains
# completely untouched and is used later for a final independent comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.25,
    stratify=Y,
    random_state=RANDOM_STATE,
)

print("\n" + "=" * 79)
print("DATA SPLIT")
print("=" * 79)
print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples:     {X_test.shape[0]}")


# -----------------------------------------------------------------------------
# 4. Measure how OOB error changes with the number of trees
# -----------------------------------------------------------------------------
number_of_trees = [25, 50, 75, 100, 150, 200, 300]
oob_errors = []

for n_estimators in number_of_trees:
    temporary_forest = RandomForestClassifier(
        n_estimators=n_estimators,
        bootstrap=True,
        oob_score=True,
        max_features="sqrt",
        random_state=RANDOM_STATE,
        n_jobs=-1,
    )

    temporary_forest.fit(X_train, y_train)
    current_oob_error = 1.0 - temporary_forest.oob_score_
    oob_errors.append(current_oob_error)

    print(
        f"Trees: {n_estimators:3d} | "
        f"OOB accuracy: {temporary_forest.oob_score_:.6f} | "
        f"OOB error: {current_oob_error:.6f}"
    )

# Store the trajectory in a CSV file for later reuse.
oob_results = pd.DataFrame(
    {
        "n_estimators": number_of_trees,
        "oob_accuracy": [1.0 - value for value in oob_errors],
        "oob_error": oob_errors,
    }
)
oob_results.to_csv(OUTPUT_DIRECTORY / "oob_error_results.csv", index=False)

# Plot and save the OOB error curve.
plt.figure(figsize=(10, 6))
plt.plot(number_of_trees, oob_errors, marker="o", linewidth=1.5)
plt.xlabel("Number of trees")
plt.ylabel("OOB error")
plt.title("Out-of-Bag Error as the Random Forest Grows")
plt.grid(True, linestyle="--", linewidth=0.5, alpha=0.7)
plt.tight_layout()
plt.savefig(
    OUTPUT_DIRECTORY / "oob_error_curve.png",
    dpi=300,
    bbox_inches="tight",
)
plt.show()


# -----------------------------------------------------------------------------
# 5. Train the final random forest
# -----------------------------------------------------------------------------
forest = RandomForestClassifier(
    n_estimators=300,
    bootstrap=True,
    oob_score=True,
    max_features="sqrt",
    random_state=RANDOM_STATE,
    n_jobs=-1,
)
forest.fit(X_train, y_train)


# -----------------------------------------------------------------------------
# 6. Obtain sample-level OOB probabilities and OOB predictions
# -----------------------------------------------------------------------------
oob_probabilities = forest.oob_decision_function_

# With a sufficiently large forest, every training sample normally receives OOB
# votes. The validity mask also makes the script safe for unusually small forests.
valid_oob_mask = (
    np.isfinite(oob_probabilities).all(axis=1)
    & (oob_probabilities.sum(axis=1) > 0.0)
)

if not valid_oob_mask.any():
    raise RuntimeError(
        "No valid OOB predictions were produced. Increase n_estimators."
    )

y_train_array = y_train.to_numpy()
y_oob_true = y_train_array[valid_oob_mask]
oob_probabilities_valid = oob_probabilities[valid_oob_mask]
y_oob_pred = forest.classes_[
    np.argmax(oob_probabilities_valid, axis=1)
]

# In the Breast Cancer dataset, class 1 (benign) is treated as the positive
# class, which matches the default pos_label=1 used by f1_score.
positive_class_positions = np.where(forest.classes_ == 1)[0]
if positive_class_positions.size != 1:
    raise RuntimeError("The positive class 1 was not found in forest.classes_.")
positive_class_index = int(positive_class_positions[0])
y_oob_positive_probability = oob_probabilities_valid[:, positive_class_index]


# -----------------------------------------------------------------------------
# 7. Calculate OOB metrics
# -----------------------------------------------------------------------------
oob_accuracy = accuracy_score(y_oob_true, y_oob_pred)
oob_balanced_accuracy = balanced_accuracy_score(y_oob_true, y_oob_pred)
oob_f1 = f1_score(y_oob_true, y_oob_pred)
oob_roc_auc = roc_auc_score(y_oob_true, y_oob_positive_probability)
oob_error = 1.0 - oob_accuracy

print("\n" + "=" * 79)
print("OUT-OF-BAG RESULTS")
print("=" * 79)
print(f"Valid OOB samples:     {valid_oob_mask.sum()} / {len(valid_oob_mask)}")
print(f"forest.oob_score_:     {forest.oob_score_:.6f}")
print(f"OOB accuracy:          {oob_accuracy:.6f}")
print(f"OOB error:             {oob_error:.6f}")
print(f"OOB balanced accuracy: {oob_balanced_accuracy:.6f}")
print(f"OOB F1-score:          {oob_f1:.6f}")
print(f"OOB ROC-AUC:           {oob_roc_auc:.6f}")
print("\nOOB classification report:")
print(classification_report(y_oob_true, y_oob_pred, digits=4))


# -----------------------------------------------------------------------------
# 8. Plot the OOB confusion matrix
# -----------------------------------------------------------------------------
display = ConfusionMatrixDisplay.from_predictions(
    y_oob_true,
    y_oob_pred,
    display_labels=dataset.target_names,
    cmap="Greys",
    colorbar=False,
    values_format="d",
)
display.ax_.set_title("Confusion Matrix Based on OOB Predictions")
plt.tight_layout()
plt.savefig(
    OUTPUT_DIRECTORY / "oob_confusion_matrix.png",
    dpi=300,
    bbox_inches="tight",
)
plt.show()


# -----------------------------------------------------------------------------
# 9. Evaluate the same final model on the untouched test set
# -----------------------------------------------------------------------------
y_test_pred = forest.predict(X_test)
y_test_probabilities = forest.predict_proba(X_test)[:, positive_class_index]

test_accuracy = accuracy_score(y_test, y_test_pred)
test_balanced_accuracy = balanced_accuracy_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
test_roc_auc = roc_auc_score(y_test, y_test_probabilities)

print("\n" + "=" * 79)
print("INDEPENDENT TEST RESULTS")
print("=" * 79)
print(f"Test accuracy:          {test_accuracy:.6f}")
print(f"Test error:             {1.0 - test_accuracy:.6f}")
print(f"Test balanced accuracy: {test_balanced_accuracy:.6f}")
print(f"Test F1-score:          {test_f1:.6f}")
print(f"Test ROC-AUC:           {test_roc_auc:.6f}")


# -----------------------------------------------------------------------------
# 10. Save the main results
# -----------------------------------------------------------------------------
summary = pd.DataFrame(
    [
        {
            "evaluation": "OOB",
            "accuracy": oob_accuracy,
            "error": oob_error,
            "balanced_accuracy": oob_balanced_accuracy,
            "f1_score": oob_f1,
            "roc_auc": oob_roc_auc,
        },
        {
            "evaluation": "Independent test",
            "accuracy": test_accuracy,
            "error": 1.0 - test_accuracy,
            "balanced_accuracy": test_balanced_accuracy,
            "f1_score": test_f1,
            "roc_auc": test_roc_auc,
        },
    ]
)
summary.to_csv(OUTPUT_DIRECTORY / "oob_and_test_metrics.csv", index=False)

print("\n" + "=" * 79)
print("FILES CREATED")
print("=" * 79)
for filename in (
    "oob_error_curve.png",
    "oob_confusion_matrix.png",
    "oob_error_results.csv",
    "oob_and_test_metrics.csv",
):
    print(OUTPUT_DIRECTORY / filename)

Step-by-Step Explanation of the Python Code

1. Define reproducibility and output settings

RANDOM_STATE = 42 makes the train/test split, the bootstrap samples, and the random feature subsets drawn at each split reproducible. The output directory is derived from the location of the script so that the figures and CSV files are easy to find.

2. Load the dataset

dataset = load_breast_cancer(as_frame=True)
X = dataset.data
Y = dataset.target

The dataset contains 569 samples and 30 numerical input features. The target has two classes. The as_frame=True argument returns pandas objects, which makes the data easier to inspect.
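A quick inspection confirms these numbers before any model is trained. The following is a minimal sketch; the variable names n_samples, n_features, and class_counts are introduced here for illustration and do not appear in the main script.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)
X = dataset.data
Y = dataset.target

# Number of observations (rows) and input features (columns).
n_samples, n_features = X.shape

# Observations per class, relabelled with the human-readable class names.
class_counts = Y.value_counts().sort_index()
class_counts.index = [str(dataset.target_names[i]) for i in class_counts.index]

print(f"Samples: {n_samples}, features: {n_features}")
print(class_counts)
```

The class counts show that the two classes are not perfectly balanced, which is one reason the script reports more than plain accuracy.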

3. Create an external test set

X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.25,
    stratify=Y,
    random_state=RANDOM_STATE,
)

OOB validation can estimate model performance without creating a validation subset from the training data. Nevertheless, a separate test set remains useful for a final independent evaluation. The stratify=Y argument preserves the class proportions in both subsets.
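The effect of stratify=Y can be checked directly by comparing the share of class 1 in the full data and in both subsets. This is a small sketch under the same split settings as the script; the share_* names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.25,
    stratify=Y,
    random_state=42,
)

# The target is coded 0/1, so the mean equals the share of class 1.
share_full = Y.mean()
share_train = y_train.mean()
share_test = y_test.mean()

print(f"Training rows: {len(X_train)}, test rows: {len(X_test)}")
print(f"Share of class 1 (full / train / test): "
      f"{share_full:.4f} / {share_train:.4f} / {share_test:.4f}")
```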

4. Track OOB error as the forest grows

current_oob_error = 1.0 - temporary_forest.oob_score_

Separate forests are trained with 25, 50, 75, 100, 150, 200, and 300 trees. The resulting trajectory shows whether the OOB error is still changing or has reached a stable region. This is one practical way to determine whether adding more trees is likely to produce a meaningful improvement.

OOB error as a function of the number of trees. A plateau indicates that additional trees are no longer changing the OOB estimate substantially.
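Training a separate forest for every size is simple but repeats work. One possible alternative, sketched below, is scikit-learn's warm_start=True option, which keeps the trees that are already fitted and adds only the missing ones on the next call to fit(). This is not the approach used in the main script; for brevity the sketch fits on the complete dataset, and the name growing_forest is hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# warm_start=True reuses the existing trees when n_estimators is increased.
growing_forest = RandomForestClassifier(
    warm_start=True,
    bootstrap=True,
    oob_score=True,
    max_features="sqrt",
    random_state=42,
)

oob_curve = {}
for n_estimators in [25, 50, 75, 100, 150]:
    growing_forest.set_params(n_estimators=n_estimators)
    growing_forest.fit(X, y)  # only the additional trees are trained
    oob_curve[n_estimators] = 1.0 - growing_forest.oob_score_

for n_estimators, error in oob_curve.items():
    print(f"Trees: {n_estimators:3d} | OOB error: {error:.6f}")
```

Because each step extends the same forest, the curve describes one growing ensemble rather than several independently trained ones.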

5. Train the final forest

forest = RandomForestClassifier(
    n_estimators=300,
    bootstrap=True,
    oob_score=True,
    max_features="sqrt",
    random_state=RANDOM_STATE,
    n_jobs=-1,
)

bootstrap=True activates bootstrap sampling, while oob_score=True requests OOB estimation. The max_features="sqrt" setting restricts every split to a random subset of features whose size is the square root of the total number of features, rounded down (5 of the 30 features in this dataset). Finally, n_jobs=-1 allows scikit-learn to use all available processor cores during fitting and prediction.
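The resolved number of candidate features per split can be read from any fitted tree through its max_features_ attribute. A minimal check, using a deliberately small forest (small_forest is an illustrative name):

```python
import math

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

small_forest = RandomForestClassifier(
    n_estimators=10,
    max_features="sqrt",
    random_state=42,
).fit(X, y)

# Each fitted tree stores the integer that "sqrt" was resolved to.
features_per_split = small_forest.estimators_[0].max_features_
expected = int(math.sqrt(X.shape[1]))

print(f"Features considered per split: {features_per_split} of {X.shape[1]}")
```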

6. Extract OOB probabilities

oob_probabilities = forest.oob_decision_function_

For every training observation, this matrix contains the class probabilities aggregated only over the trees for which that observation was out of bag. It has one row per training observation and one column per class. It is more useful than the single oob_score_ value because it allows us to calculate balanced accuracy, F1-score, ROC-AUC, a confusion matrix, and other metrics.
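The relationship between the matrix and oob_score_ can be made explicit: taking the most probable class in each row and scoring it against the training labels should reproduce the built-in OOB accuracy. The sketch below assumes a forest large enough that every row has at least one OOB vote, and it fits on the full dataset for brevity.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
).fit(X, y)

oob_probabilities = forest.oob_decision_function_

# One row per training observation, one column per class.
print("Shape:", oob_probabilities.shape)

# Valid rows are proper probability vectors.
row_sums = oob_probabilities.sum(axis=1)

# Recompute the OOB accuracy from the sample-level probabilities.
y_oob_pred = forest.classes_[np.argmax(oob_probabilities, axis=1)]
recomputed_accuracy = accuracy_score(y, y_oob_pred)

print(f"forest.oob_score_:   {forest.oob_score_:.6f}")
print(f"Recomputed accuracy: {recomputed_accuracy:.6f}")
```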

7. Filter invalid OOB rows

With very few trees, it is possible that an observation is never left out and therefore receives no OOB prediction. The script uses a validity mask to remove non-finite or empty probability rows. With 300 trees in this example, all 426 training observations receive valid OOB probabilities.
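A rough calculation shows why missing OOB rows are a small-forest problem only. A bootstrap sample of size n contains a given observation with probability 1 - (1 - 1/n)^n, which is close to 0.632. The sketch below assumes that the bootstrap samples of different trees are independent and that each sample has the same size as the training set (the scikit-learn default); the function name probability_never_oob is introduced here for illustration.

```python
n_samples = 426  # training observations in this example

# Probability that one bootstrap sample of size n contains a given observation.
p_in_bag = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples


def probability_never_oob(n_trees: int) -> float:
    """Chance that an observation is in the bag for every one of n_trees trees."""
    return p_in_bag ** n_trees


for n_trees in (5, 25, 300):
    p = probability_never_oob(n_trees)
    print(
        f"{n_trees:3d} trees | P(no OOB vote) = {p:.3e} | "
        f"expected rows without OOB votes = {n_samples * p:.3e}"
    )
```

With a handful of trees a noticeable share of rows can lack OOB votes, whereas with hundreds of trees the probability is negligible.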

8. Convert OOB probabilities into classes

y_oob_pred = forest.classes_[
    np.argmax(oob_probabilities_valid, axis=1)
]

np.argmax selects the class with the largest OOB probability for each observation. The corresponding class labels are recovered from forest.classes_.
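The indexing step matters because np.argmax returns column positions, not class labels. The toy example below uses hypothetical labels 2 and 4 to make the difference visible; with the 0/1 labels of the Breast Cancer dataset, positions and labels happen to coincide.

```python
import numpy as np

# Hypothetical class labels in the order used by the probability columns.
classes = np.array([2, 4])

# Toy probability matrix: one row per observation, one column per class.
toy_probabilities = np.array(
    [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.4, 0.6],
    ]
)

# Column position of the largest probability in each row.
column_index = np.argmax(toy_probabilities, axis=1)

# Map the positions back to the actual class labels.
predicted_labels = classes[column_index]

print("Column positions:", column_index)
print("Predicted labels:", predicted_labels)
```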

9. Calculate additional OOB metrics

The script calculates OOB accuracy, error, balanced accuracy, F1-score, and ROC-AUC. This is important because accuracy alone can be misleading when the target classes are strongly imbalanced.
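A small constructed example illustrates the problem. The labels below are synthetic, not taken from the tutorial dataset: 95 negatives, 5 positives, and a trivial classifier that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Synthetic, strongly imbalanced labels: 95 negatives and 5 positives.
y_true = np.array([0] * 95 + [1] * 5)

# A trivial classifier that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
balanced = balanced_accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning because no positive is ever predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:          {accuracy:.2f}")
print(f"Balanced accuracy: {balanced:.2f}")
print(f"F1-score:          {f1:.2f}")
```

The accuracy looks excellent although not a single positive case is detected, which the balanced accuracy and the F1-score reveal immediately.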

10. Plot the OOB confusion matrix

Confusion matrix calculated from sample-level OOB predictions, not from predictions made on the training observations by the complete fitted forest.

11. Compare OOB and independent test performance

The same final forest is evaluated on the untouched test set. Similar OOB and test results indicate that the OOB estimate is behaving reasonably for this example. A large gap would suggest that the OOB estimate may not represent the external test conditions sufficiently well.

Expected Results and Their Interpretation

Using scikit-learn 1.8.0 and the specified random state, the example produced the following values. Minor numerical differences can occur with another library version or execution environment.

Metric	OOB estimate	Independent test
Accuracy	0.9624	0.9580
Error	0.0376	0.0420
Balanced accuracy	0.9586	0.9512
F1-score	0.9701	0.9670
ROC-AUC	0.9849	0.9949

The final OOB accuracy is approximately 96.24%, which corresponds to an OOB error of approximately 3.76%. The independent test accuracy is approximately 95.80%. The small difference between these two accuracy values suggests that OOB validation provides a useful internal estimate for this particular dataset and model configuration.
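The size of this difference can be put into perspective with a short calculation based on the reported values. The sketch below converts the accuracies into misclassification counts and uses the binomial standard error sqrt(p(1 - p)/n) as a rough yardstick. Treating each prediction as an independent trial is a simplifying assumption, especially for OOB predictions.

```python
import math

n_train, n_test = 426, 143
oob_accuracy, test_accuracy = 0.9624, 0.9580

# Misclassified observations implied by the reported accuracies.
oob_misclassified = round(n_train * (1.0 - oob_accuracy))
test_misclassified = round(n_test * (1.0 - test_accuracy))


def binomial_standard_error(p: float, n: int) -> float:
    """Approximate standard error of a proportion p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


test_se = binomial_standard_error(test_accuracy, n_test)
gap = oob_accuracy - test_accuracy

print(f"OOB misclassifications:  {oob_misclassified} of {n_train}")
print(f"Test misclassifications: {test_misclassified} of {n_test}")
print(f"Accuracy gap:            {gap:.4f}")
print(f"Test standard error:     {test_se:.4f}")
```

The gap is smaller than one standard error of the test accuracy, so with only 143 test observations the two estimates cannot be distinguished reliably.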

The OOB error decreases from approximately 4.93% with 25 trees to approximately 3.76% with 100 trees. It then remains stable through 300 trees. This indicates that, for this example, increasing the forest beyond approximately 100–150 trees does not substantially change the OOB error estimate.

These values should not be interpreted as universal Random Forest performance. They belong only to this dataset, split, random state, and hyperparameter configuration.

OOB Validation Versus a Test Split and Cross-Validation

Method	Main advantage	Main limitation
OOB validation	Uses training observations efficiently and is obtained during bootstrap forest fitting.	Available only for bootstrap-based estimators and may be noisy with too few trees.
Hold-out validation	Simple and computationally inexpensive.	Performance depends on one particular split, and part of the data is removed from training.
k-fold cross-validation	Provides performance estimates over several data partitions.	Requires fitting the complete forest several times.
Independent test set	Provides the final evaluation on completely untouched data.	Must not be repeatedly used for hyperparameter selection.

OOB validation is generated from bootstrap exclusions, whereas hold-out and k-fold validation explicitly partition the available observations.

OOB validation is particularly attractive when the dataset is not large enough to sacrifice a substantial validation subset. However, it should not automatically replace an independent test set. A strong experimental design can use OOB estimates for efficient model development and preserve a completely untouched test set for the final report.

Limitations and Common Mistakes

OOB estimation requires bootstrap sampling

The OOB mechanism depends on observations being excluded from bootstrap samples. Therefore, oob_score=True must be used together with bootstrap=True.
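scikit-learn enforces this requirement when the model is fitted. The following sketch deliberately uses the invalid combination and captures the resulting exception; the name invalid_forest is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Invalid combination: OOB estimation requested without bootstrap sampling.
invalid_forest = RandomForestClassifier(
    n_estimators=10,
    bootstrap=False,
    oob_score=True,
    random_state=42,
)

try:
    invalid_forest.fit(X, y)
    error_message = None
except ValueError as error:
    error_message = str(error)

print("Raised ValueError:", error_message)
```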

Too few trees can produce unstable OOB estimates

A small forest may provide only a few OOB votes per observation. In extreme cases, some observations may receive no OOB predictions. Increasing n_estimators usually stabilizes the estimate.

OOB accuracy is not enough for imbalanced data

When one class dominates, a high OOB accuracy may hide poor minority-class detection. Use oob_decision_function_ to calculate balanced accuracy, sensitivity, specificity, F1-score, MCC, PR-AUC, or other task-appropriate metrics.
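Several of these metrics can be derived from the OOB probabilities with a few lines. The sketch below is one possible way to do it; it fits on the complete dataset for brevity and assumes that every row has valid OOB probabilities, which is practically certain with 100 trees.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    matthews_corrcoef,
)

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
).fit(X, y)

oob_probabilities = forest.oob_decision_function_
y_oob_pred = forest.classes_[np.argmax(oob_probabilities, axis=1)]

# Confusion-matrix cells with class 1 as the positive class.
tn, fp, fn, tp = confusion_matrix(y, y_oob_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # recall of the positive class
specificity = tn / (tn + fp)  # recall of the negative class
mcc = matthews_corrcoef(y, y_oob_pred)

# PR-AUC (average precision) uses the OOB probability of class 1.
positive_column = int(np.where(forest.classes_ == 1)[0][0])
pr_auc = average_precision_score(y, oob_probabilities[:, positive_column])

print(f"OOB sensitivity: {sensitivity:.4f}")
print(f"OOB specificity: {specificity:.4f}")
print(f"OOB MCC:         {mcc:.4f}")
print(f"OOB PR-AUC:      {pr_auc:.4f}")
```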

OOB validation does not automatically prevent every form of leakage

OOB sampling is performed at the observation level. If several rows belong to the same patient, device, customer, machine, or time period, related rows may appear in both a tree's bootstrap sample and its OOB subset. Grouped or temporal evaluation may therefore be more appropriate for structured data.
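The effect can be demonstrated with synthetic data. In the sketch below, 40 hypothetical groups (for example, patients) contribute five nearly identical rows each, and the label is assigned at random per group, so nothing should generalize to unseen groups. The data, the group structure, and all variable names are assumptions made for illustration. GroupKFold is used as one possible group-aware alternative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(42)

# Hypothetical grouped data: 40 groups with 5 nearly identical rows each.
n_groups, rows_per_group, n_features = 40, 5, 6
groups = np.repeat(np.arange(n_groups), rows_per_group)
group_centers = rng.normal(size=(n_groups, n_features))
X = group_centers[groups] + rng.normal(scale=0.05, size=(groups.size, n_features))

# One random label per group: there is no signal that transfers to new groups.
y = rng.integers(0, 2, size=n_groups)[groups]

# Row-level OOB estimate: siblings of an OOB row are usually in the bag.
leaky_forest = RandomForestClassifier(
    n_estimators=100, oob_score=True, random_state=42
).fit(X, y)

# Group-aware estimate: all rows of one group stay in the same fold.
cv = GroupKFold(n_splits=5)
grouped_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X,
    y,
    groups=groups,
    cv=cv,
)

# Number of groups shared between the training and test part of each fold.
group_overlaps = [
    len(set(groups[train_idx]) & set(groups[test_idx]))
    for train_idx, test_idx in cv.split(X, y, groups)
]

print(f"Row-level OOB accuracy:    {leaky_forest.oob_score_:.4f}")
print(f"Grouped 5-fold accuracy:   {grouped_scores.mean():.4f}")
print(f"Group overlaps per fold:   {group_overlaps}")
```

In this construction the OOB accuracy is expected to be far more optimistic than the grouped estimate, because the forest can recognise a group from its in-bag siblings.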

Do not repeatedly tune against one untouched test set

The test set in this tutorial is used only to demonstrate the relationship between OOB and external performance. In a real project, preserve the test data until model development and hyperparameter selection are complete.

OOB validation is not identical to k-fold cross-validation

OOB subsets overlap and have variable sizes. k-fold cross-validation uses explicit, non-overlapping validation folds within each repetition. The two methods may therefore produce different estimates, particularly on small or heterogeneous datasets.
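Both estimates can be computed side by side for the same configuration. The sketch below uses the complete Breast Cancer dataset and stratified 5-fold cross-validation; the fold count and the forest size are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB estimate: a single fit on all observations.
oob_forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=42
).fit(X, y)
oob_accuracy = oob_forest.oob_score_

# k-fold estimate: five fits on explicit, non-overlapping validation folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X,
    y,
    cv=cv,
)

print(f"OOB accuracy:           {oob_accuracy:.4f}")
print(f"5-fold CV accuracy:     {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")
```

The two numbers are usually close on a dataset like this one, but they are not expected to be identical.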

OOB Estimation for Random Forest Regression

The same principle applies to RandomForestRegressor. Every numerical target is predicted by trees that did not use the corresponding observation during training. In scikit-learn, the default OOB score for the regressor is the coefficient of determination, R², rather than accuracy.

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(
    n_estimators=300,
    bootstrap=True,
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)

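# X_train and y_train must describe a regression task (continuous target),
# not the classification data used earlier in this tutorial.
regressor.fit(X_train, y_train)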
print(regressor.oob_score_)
print(regressor.oob_prediction_)

The oob_prediction_ attribute contains one OOB regression prediction for each training observation. These values can be used to calculate MAE, MSE, RMSE, R², or another regression metric.
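As a self-contained illustration, the sketch below uses scikit-learn's Diabetes regression dataset, which is not part of this tutorial and serves only as an example of a continuous target. It also shows that R² calculated from oob_prediction_ matches oob_score_.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Example regression data (an assumption for this sketch).
X, y = load_diabetes(return_X_y=True)

regressor = RandomForestRegressor(
    n_estimators=300,
    bootstrap=True,
    oob_score=True,
    random_state=42,
).fit(X, y)

# One OOB prediction per training observation.
oob_predictions = regressor.oob_prediction_

oob_mae = mean_absolute_error(y, oob_predictions)
oob_rmse = np.sqrt(mean_squared_error(y, oob_predictions))
oob_r2 = r2_score(y, oob_predictions)

print(f"OOB MAE:             {oob_mae:.4f}")
print(f"OOB RMSE:            {oob_rmse:.4f}")
print(f"OOB R² (recomputed): {oob_r2:.6f}")
print(f"regressor.oob_score_: {regressor.oob_score_:.6f}")
```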

Frequently Asked Questions

Is OOB error the same as training error?

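No. Training error is calculated by predicting each training observation with the complete forest, which includes trees that were fitted on that same observation. OOB error excludes those trees and uses only trees that did not train on a particular observation. For this reason, the training error of a Random Forest is usually far more optimistic than its OOB error.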
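The difference is easy to observe. The following sketch compares the accuracy of the complete forest on its own training data with the OOB accuracy of the same forest, using the full Breast Cancer dataset for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    oob_score=True,
    random_state=42,
).fit(X, y)

# All trees vote, including those that saw the observation during training.
training_accuracy = forest.score(X, y)

# Only trees that did not see the observation vote.
oob_accuracy = forest.oob_score_

print(f"Training accuracy: {training_accuracy:.4f}")
print(f"OOB accuracy:      {oob_accuracy:.4f}")
```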

Does every tree have the same OOB observations?

No. Every tree receives a different random bootstrap sample and therefore has a different OOB subset.
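This can be illustrated with a plain NumPy simulation of bootstrap sampling. The sketch mimics the sampling principle and does not read scikit-learn's internal state; the helper draw_oob_indices is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 426  # size of the training set in this tutorial
all_indices = np.arange(n_samples)


def draw_oob_indices(generator: np.random.Generator) -> np.ndarray:
    """Draw one bootstrap sample and return the indices that were left out."""
    in_bag = generator.integers(0, n_samples, size=n_samples)
    return np.setdiff1d(all_indices, in_bag)


oob_tree_1 = draw_oob_indices(rng)
oob_tree_2 = draw_oob_indices(rng)
shared = np.intersect1d(oob_tree_1, oob_tree_2)

print(f"Tree 1 OOB fraction: {len(oob_tree_1) / n_samples:.3f}")
print(f"Tree 2 OOB fraction: {len(oob_tree_2) / n_samples:.3f}")
print(f"Observations that are OOB for both trees: {len(shared)}")
```

Each simulated tree leaves out roughly 37% of the observations, but the two OOB subsets overlap only partially.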

Can I use OOB validation for hyperparameter tuning?

Yes, it can provide an efficient internal criterion for comparing Random Forest configurations. However, repeatedly optimizing many configurations against the same OOB mechanism can still introduce selection bias. Final performance should be confirmed on untouched data or through a carefully designed external validation procedure.
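A minimal sketch of this workflow is shown below. The candidate values for max_features form a hypothetical grid chosen only for illustration; the test set is used once, after the configuration has been selected.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical grid of candidate settings.
candidate_max_features = [2, 5, 10]

# Compare the candidates using only the OOB score from the training data.
oob_by_setting = {}
for max_features in candidate_max_features:
    candidate = RandomForestClassifier(
        n_estimators=100,
        max_features=max_features,
        bootstrap=True,
        oob_score=True,
        random_state=42,
    ).fit(X_train, y_train)
    oob_by_setting[max_features] = candidate.oob_score_

best_max_features = max(oob_by_setting, key=oob_by_setting.get)

# Confirm the selected configuration once on the untouched test set.
final_forest = RandomForestClassifier(
    n_estimators=100,
    max_features=best_max_features,
    random_state=42,
).fit(X_train, y_train)
final_test_accuracy = final_forest.score(X_test, y_test)

print("OOB accuracy per setting:", oob_by_setting)
print(f"Selected max_features: {best_max_features}")
print(f"Test accuracy of the selected model: {final_test_accuracy:.4f}")
```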

How many trees are enough?

There is no universal number. Plot the OOB error against n_estimators and identify the region where the curve stabilizes. The required number depends on dataset size, noise, class structure, tree settings, and random variation.

Should I scale the features?

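Decision-tree splits depend on the ordering of feature values, not on their magnitude, so they are generally insensitive to monotonic transformations of individual features such as standardization or min-max scaling. Therefore, scaling is usually not required for Random Forests, although preprocessing may still be necessary for missing values, categorical variables, leakage prevention, and pipeline consistency.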
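A simple experiment supports this. The sketch below trains two forests with the same random state, one on the raw features and one on standardized features, and measures how often their test predictions agree. Small discrepancies caused by floating-point effects are possible, so near-complete agreement rather than exact identity is the expectation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training data only to avoid leakage.
scaler = StandardScaler().fit(X_train)

raw_forest = RandomForestClassifier(n_estimators=100, random_state=42)
raw_forest.fit(X_train, y_train)

scaled_forest = RandomForestClassifier(n_estimators=100, random_state=42)
scaled_forest.fit(scaler.transform(X_train), y_train)

raw_predictions = raw_forest.predict(X_test)
scaled_predictions = scaled_forest.predict(scaler.transform(X_test))

# Share of test observations on which both forests predict the same class.
agreement = (raw_predictions == scaled_predictions).mean()

print(f"Prediction agreement between raw and scaled features: {agreement:.4f}")
```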

Conclusion

Out-of-bag validation is one of the most useful properties of bootstrap-based Random Forests. Each tree creates its own internal validation subset simply by excluding observations during bootstrap sampling. The forest then aggregates predictions only from trees that did not use a particular observation during training.

In scikit-learn, the procedure is activated by setting bootstrap=True and oob_score=True. The overall result is available through oob_score_, while oob_decision_function_ provides sample-level class probabilities for more detailed analysis. These probabilities make it possible to calculate balanced accuracy, F1-score, ROC-AUC, confusion matrices, and other metrics without creating an additional validation split.

OOB validation is efficient, practical, and easy to calculate, but it is not a universal replacement for a well-designed test set, grouped validation, temporal validation, or cross-validation. It should be treated as an internal performance estimate whose usefulness depends on the structure of the data and the purpose of the experiment.

Files Created by the Example

oob_error_curve.png — OOB error versus the number of trees.
oob_confusion_matrix.png — confusion matrix from OOB predictions.
oob_error_results.csv — OOB trajectory for all tested forest sizes.
oob_and_test_metrics.csv — OOB and independent test metrics.

References

Breiman, L. Random Forests. Machine Learning, 45, 5–32, 2001.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning, 2nd edition. Springer, 2009.
scikit-learn developers. RandomForestClassifier API documentation and OOB Errors for Random Forests example.