Wednesday, February 26, 2025

Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a statistical method used for dimensionality reduction while preserving as much of the class discriminatory information as possible. LDA is commonly used in supervised learning for classification tasks, particularly when dealing with datasets that have more than two classes.

What is Linear Discriminant Analysis (LDA)?

Linear Discriminant Analysis is a technique that looks for the directions (linear combinations of the features) that will maximize the separation between multiple classes. While Principal Component Analysis (PCA) is an unsupervised method that focuses on maximizing the variance in the data, LDA, being supervised, focuses on maximizing the separability between the classes.

Key Concepts of LDA:

  • Between-class variance: The variance that exists between different class means.
  • Within-class variance: The variance within each class, i.e., how much the individual data points of a class deviate from the class mean.
  • Objective: Maximize the ratio of between-class variance to within-class variance, so that the class means end up far apart relative to the spread within each class.
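
The three concepts above can be made concrete with a few lines of NumPy. The following is a minimal sketch, not the exact routine Scikit-learn uses internally: it builds the within-class scatter matrix (named `S_W` here) and the between-class scatter matrix (`S_B`) for the Iris data, then solves the generalized eigenproblem `S_B v = lambda * S_W v`. The variable names and the use of `scipy.linalg.eigh` are choices made for this illustration.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
overall_mean = X.mean(axis=0)
n_features = X.shape[1]

S_W = np.zeros((n_features, n_features))  # within-class scatter
S_B = np.zeros((n_features, n_features))  # between-class scatter

for label in np.unique(y):
    X_c = X[y == label]
    mean_c = X_c.mean(axis=0)

    # Spread of the samples around their own class mean
    centered = X_c - mean_c
    S_W += centered.T @ centered

    # Spread of the class mean around the overall mean, weighted by class size
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_B += len(X_c) * (diff @ diff.T)

# Generalized symmetric eigenproblem: S_B v = lambda * S_W v.
# eigh returns eigenvalues in ascending order, so reverse them.
eigvals, eigvecs = eigh(S_B, S_W)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print("Eigenvalues (largest first):", np.round(eigvals, 4))
```

Each eigenvalue is the between-class to within-class ratio along its eigenvector. With three classes, only the first two eigenvalues are meaningfully greater than zero, which is why LDA can produce at most two discriminant directions for this dataset.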

Applications of LDA

LDA is widely used in various fields, including:

  • Pattern recognition: For classifying data based on features such as images or audio signals.
  • Face recognition: Reducing the dimensionality of facial features while maintaining class separability.
  • Medical diagnostics: Used in the analysis of medical images to identify potential diseases.

Implementing LDA in Python

In this section, we will implement Linear Discriminant Analysis (LDA) using the popular machine learning library Scikit-learn. We'll demonstrate how to apply LDA for classification and visualize the data after dimensionality reduction.

1. Load the Dataset

We will use the Iris dataset, a classic dataset that is often used for classification tasks. It contains 150 samples of iris flowers, each with four features, and three classes (setosa, versicolor, and virginica).
from sklearn.datasets import load_iris
import pandas as pd

# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create a DataFrame for better visualization
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
df.head()
    
The previous code block consists of the following lines:
  • Import necessary libraries:
    • from sklearn.datasets import load_iris - Imports the load_iris function to load the Iris dataset from sklearn.datasets.
    • import pandas as pd - Imports the pandas library as pd for data manipulation and visualization.
  • Load the Iris dataset:
    • iris = load_iris() - Loads the Iris dataset and stores it in the variable iris.
    • X = iris.data - Extracts the feature data (X) from the Iris dataset and stores it in the variable X.
    • y = iris.target - Extracts the target labels (y) from the Iris dataset and stores it in the variable y.
  • Create a DataFrame for better visualization:
    • df = pd.DataFrame(X, columns=iris.feature_names) - Converts the feature data X into a pandas DataFrame with column names taken from the Iris dataset's feature names.
    • df['target'] = y - Adds the target labels y as a new column named target to the DataFrame df.
  • Display the first few rows of the DataFrame:
    • df.head() - Displays the first 5 rows of the DataFrame df for a quick preview of the data.
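
The target column holds the integer codes 0, 1, and 2 rather than species names. As an optional extra, the sketch below adds a readable label column using the names stored in `iris.target_names`. The column name `species` is an assumption made for this example and is not used elsewhere in the post.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Map the integer codes 0, 1, 2 to the names in iris.target_names
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Count the samples per species
print(df["species"].value_counts())
```

A named column like this can also be passed as `hue` when plotting, so the legend shows species names instead of numbers.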

2. Apply LDA for Dimensionality Reduction

LDA can be used to reduce the dimensionality of the data while keeping the class separability. In this example, we will reduce the dataset to two dimensions, which we can later plot to visualize the class separation.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Initialize LDA model to reduce dimensions to 2
lda = LinearDiscriminantAnalysis(n_components=2)

# Fit and transform the data
X_lda = lda.fit_transform(X, y)

# Convert to a DataFrame for easier plotting
lda_df = pd.DataFrame(X_lda, columns=['LD1', 'LD2'])
lda_df['target'] = y
lda_df.head()
    
The previous code block consists of the following lines:
  • Import necessary library:
    • from sklearn.discriminant_analysis import LinearDiscriminantAnalysis - Imports the LinearDiscriminantAnalysis class from sklearn.discriminant_analysis to perform dimensionality reduction using LDA.
  • Initialize the LDA model:
    • lda = LinearDiscriminantAnalysis(n_components=2) - Initializes the LDA model and sets the number of components for dimensionality reduction to 2.
  • Fit and transform the data:
    • X_lda = lda.fit_transform(X, y) - Fits the LDA model to the data (X features and y target labels) and transforms the data into 2 dimensions, storing the result in X_lda.
  • Convert to a DataFrame for easier plotting:
    • lda_df = pd.DataFrame(X_lda, columns=['LD1', 'LD2']) - Converts the 2-dimensional transformed data X_lda into a pandas DataFrame with column names 'LD1' and 'LD2' for the two linear discriminants.
    • lda_df['target'] = y - Adds the original target labels y as a new column 'target' to the DataFrame lda_df.
  • Display the first few rows of the DataFrame:
    • lda_df.head() - Displays the first 5 rows of the DataFrame lda_df for a preview of the data after dimensionality reduction.
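
Two details are worth checking after the fit. First, LDA can return at most min(number of features, number of classes - 1) components, which for Iris is 2. Second, with the default solver the fitted model exposes `explained_variance_ratio_`, which indicates how much of the between-class variance each discriminant captures. The snippet below is a small sketch of both checks; the variable `max_components` is introduced here only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Upper bound on the number of discriminants LDA can produce
n_classes = len(set(y))
max_components = min(X.shape[1], n_classes - 1)

lda = LinearDiscriminantAnalysis(n_components=max_components)
X_lda = lda.fit_transform(X, y)

print("Maximum number of components:", max_components)
print("Shape after transform:", X_lda.shape)
print("Explained variance ratio:", lda.explained_variance_ratio_)
```

On Iris, almost all of the between-class variance falls on the first discriminant, so LD1 carries most of the separation and LD2 adds comparatively little.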

3. Visualize the Results

After applying LDA, we can plot the reduced data to visualize how well the classes are separated.
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the LDA results
plt.figure(figsize=(8,6))
sns.scatterplot(data=lda_df, x='LD1', y='LD2', hue='target', palette='viridis', s=100)
plt.title('LDA of Iris Dataset')
plt.xlabel('Linear Discriminant 1 (LD1)')
plt.ylabel('Linear Discriminant 2 (LD2)')
plt.grid(True)
plt.show()
    
The previous code block consists of the following lines:
  • Import necessary libraries:
    • import matplotlib.pyplot as plt - Imports the pyplot module from the matplotlib library for creating plots.
    • import seaborn as sns - Imports the seaborn library for statistical data visualization, which provides a high-level interface for creating attractive and informative plots.
  • Create the plot:
    • plt.figure(figsize=(8,6)) - Initializes a new figure for the plot with a specified size of 8 inches by 6 inches.
  • Create a scatter plot of the LDA results:
    • sns.scatterplot(data=lda_df, x='LD1', y='LD2', hue='target', palette='viridis', s=100) - Creates a scatter plot using the seaborn scatterplot function, where:
      • data=lda_df - Specifies the data to plot (the DataFrame with LDA results).
      • x='LD1' - Specifies the x-axis data (Linear Discriminant 1).
      • y='LD2' - Specifies the y-axis data (Linear Discriminant 2).
      • hue='target' - Color-codes the points by the integer target label (0, 1, or 2, corresponding to setosa, versicolor, and virginica).
      • palette='viridis' - Specifies the color palette for the plot (a perceptually uniform color scheme).
      • s=100 - Sets the size of the scatter plot markers to 100 for better visibility.
  • Set the plot title and axis labels:
    • plt.title('LDA of Iris Dataset') - Sets the title of the plot to 'LDA of Iris Dataset'.
    • plt.xlabel('Linear Discriminant 1 (LD1)') - Sets the label for the x-axis to 'Linear Discriminant 1 (LD1)'.
    • plt.ylabel('Linear Discriminant 2 (LD2)') - Sets the label for the y-axis to 'Linear Discriminant 2 (LD2)'.
  • Display the grid and show the plot:
    • plt.grid(True) - Enables the grid on the plot for better readability.
    • plt.show() - Displays the plot on the screen.
Running this code produces the following plot.
Figure 1 - LDA of Iris Dataset

The plot above shows how LDA has reduced the data from four dimensions to two while maintaining the class separability. Most of the separation happens along LD1: Setosa forms a clearly isolated cluster, while Versicolor and Virginica lie next to each other with only a small region of overlap.
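
To put a number on the difference between LDA and PCA mentioned earlier, the sketch below projects the same data onto the first PCA component and the first linear discriminant, then compares the ratio of between-class to within-class sum of squares along each axis. The helper `separation_ratio` is a hypothetical function written for this example and is not part of Scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def separation_ratio(z, labels):
    """Between-class over within-class sum of squares for a 1-D projection."""
    overall_mean = z.mean()
    between = 0.0
    within = 0.0
    for label in np.unique(labels):
        z_c = z[labels == label]
        between += len(z_c) * (z_c.mean() - overall_mean) ** 2
        within += ((z_c - z_c.mean()) ** 2).sum()
    return between / within


X, y = load_iris(return_X_y=True)

# First axis of each projection: PCA ignores y, LDA uses it
pc1 = PCA(n_components=2).fit_transform(X)[:, 0]
ld1 = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)[:, 0]

pca_ratio = separation_ratio(pc1, y)
lda_ratio = separation_ratio(ld1, y)

print(f"Separation ratio along PC1: {pca_ratio:.2f}")
print(f"Separation ratio along LD1: {lda_ratio:.2f}")
```

Because LD1 is chosen specifically to maximize this ratio, its value should be at least as large as the one obtained along PC1, which is picked only for overall variance.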

Using LDA for Classification

LDA can also be used directly as a classifier: the model is fitted on labeled training data and then predicts the class of new samples. In this example, we will train an LDA classifier and evaluate its performance on the Iris dataset.
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the LDA classifier
lda_classifier = LinearDiscriminantAnalysis()
lda_classifier.fit(X_train, y_train)

# Make predictions and evaluate the model
y_pred = lda_classifier.predict(X_test)
print(classification_report(y_test, y_pred))
    
The previous code block consists of the following lines:
  • Import necessary libraries:
    • from sklearn.model_selection import train_test_split - Imports the train_test_split function from the sklearn.model_selection module to split the data into training and testing sets.
    • from sklearn.metrics import classification_report - Imports the classification_report function from the sklearn.metrics module to evaluate the performance of the model.
  • Split the data into training and testing sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) - Splits the feature matrix X and target vector y into training and testing sets. The test size is set to 30% (0.3), and the random seed is fixed at 42 for reproducibility.
  • Initialize and train the LDA classifier:
    • lda_classifier = LinearDiscriminantAnalysis() - Initializes the LinearDiscriminantAnalysis (LDA) classifier.
    • lda_classifier.fit(X_train, y_train) - Trains the LDA classifier using the training data X_train and corresponding labels y_train.
  • Make predictions and evaluate the model:
    • y_pred = lda_classifier.predict(X_test) - Uses the trained LDA model to make predictions on the test data X_test.
    • print(classification_report(y_test, y_pred)) - Prints a detailed classification report, comparing the true labels y_test with the predicted labels y_pred, including metrics like precision, recall, and F1-score.
Running the code produces the following classification report.
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45 
The report produced by `classification_report` lists precision, recall, and F1-score for each class, together with the overall accuracy. On this test set the LDA classifier is perfect: all three metrics are 1.00 for class 0 (19 samples), class 1 (13 samples), and class 2 (13 samples), so all 45 test samples were classified correctly and the accuracy is 1.00. The macro average, which weights every class equally, and the weighted average, which weights each class by its support, are therefore both 1.00 as well.

Keep in mind that this result comes from a single train/test split of a small dataset, so it shows that LDA fits Iris very well rather than that it will be flawless on other data.
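
A single 70/30 split of 150 samples leaves only 45 test points, so the score can shift depending on which samples land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below assumes 5 stratified folds and reuses the seed 42; both are choices made for this illustration rather than settings from the example above.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Stratified folds keep the class proportions the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fit and score a fresh LDA classifier on each fold
scores = cross_val_score(
    LinearDiscriminantAnalysis(), X, y, cv=cv, scoring="accuracy"
)

print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean and spread across folds give a better sense of how the classifier is likely to behave on unseen data than a single split does.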

Conclusion

Linear Discriminant Analysis (LDA) is a powerful tool for both dimensionality reduction and classification, especially when dealing with multiple classes. By maximizing class separability, LDA helps in improving the classification performance and also provides a way to visualize high-dimensional data in lower dimensions. In this post, we’ve demonstrated how to implement and apply LDA in Python using the Scikit-learn library.

Experiment with your own datasets and observe how LDA can improve both your data visualization and classification tasks!
