Wednesday, February 26, 2025

Using SGDClassifier for Classification Tasks

In the world of machine learning, classification tasks are a common use case where we need to assign a category label to input data. Scikit-learn's SGDClassifier is an excellent tool for performing classification tasks using stochastic gradient descent (SGD). This model is particularly well-suited for large datasets and real-time learning scenarios, where the data arrives sequentially or the dataset is too large to fit in memory all at once.

What is SGDClassifier?

The SGDClassifier is a linear classifier that uses stochastic gradient descent (SGD) to minimize the loss function. This method is especially effective when dealing with large datasets or when you want to perform online learning, where the model is updated as new data comes in.

SGDClassifier can be used for a variety of classification tasks, such as binary classification, multiclass classification, and multilabel classification. It supports a wide range of loss functions, including logistic regression and hinge loss for linear Support Vector Machines (SVMs), among others.
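Switching between these behaviors is just a matter of the loss argument; for example (a minimal sketch):

from sklearn.linear_model import SGDClassifier

svm_like = SGDClassifier(loss='hinge')        # behaves like a linear SVM
logreg_like = SGDClassifier(loss='log_loss')  # behaves like logistic regression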

How to Use SGDClassifier for Classification

Let's go through a step-by-step example of using SGDClassifier for a classification task. We'll use the popular Iris dataset, which is often used for classification examples.

First, let's import the necessary libraries and load the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
The previous code block consists of the following lines:
  • Import necessary libraries:
    • from sklearn.datasets import load_iris - Imports the Iris dataset from scikit-learn.
    • from sklearn.model_selection import train_test_split - Imports the function to split data into training and test sets.
    • from sklearn.linear_model import SGDClassifier - Imports the Stochastic Gradient Descent (SGD) classifier.
    • from sklearn.metrics import classification_report - Imports a function to evaluate the model’s performance.
  • Load the Iris dataset:
    • iris = load_iris() - Loads the Iris dataset into memory.
    • X, y = iris.data, iris.target - Extracts the feature matrix X and the target variable y.
  • Split the dataset into training and test sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) - Splits the data into:
      • X_train, y_train - 70% of the data for training.
      • X_test, y_test - 30% of the data for testing.
      • test_size=0.3 - Specifies that 30% of the data is for testing.
      • random_state=42 - Ensures reproducibility of the split.
Executing the previous code block produces no visible output; it simply loads and splits the data. If you want to confirm that the split worked, you can inspect the shapes of the resulting arrays, as shown below.
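A quick, optional sanity check (the expected shapes follow directly from the 150-sample Iris dataset and the 70/30 split):

# Optional check of the split sizes
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print(y_train.shape, y_test.shape)  # (105,) (45,)

With the data prepared, the next step is to train the model.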

Training the Model

Now, let's initialize an SGDClassifier with a logistic loss function and fit it to the training data:
# Initialize the SGDClassifier with logistic loss (logistic regression)
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)

# Train the model
sgd_clf.fit(X_train, y_train)
    
The previous code block consists of the following lines:
  • Initialize the SGDClassifier:
    • sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42) - Creates an instance of the SGD classifier.
    • loss='log_loss' - Specifies that the classifier should use logistic regression (log loss) for classification; recent scikit-learn releases use 'log_loss' where older versions accepted 'log'.
    • max_iter=1000 - Sets the maximum number of iterations for training.
    • tol=1e-3 - Defines the stopping criteria; training stops if the improvement in loss is less than this threshold.
    • random_state=42 - Ensures reproducibility by setting a fixed random seed.
  • Train the model:
    • sgd_clf.fit(X_train, y_train) - Trains the classifier using the training data.
    • The model learns the relationship between X_train (features) and y_train (target labels).
Again, executing the previous code block produces no visible output, although the fitted model now carries its learned parameters, as shown below. To see results, we need to make predictions and print some form of classification report.
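If you are curious about what was learned, the coefficients and intercepts of the fitted model are available as attributes. A quick sketch (in the multiclass case SGDClassifier fits one weight vector per class using a one-vs-rest scheme):

# Inspect the learned parameters
print(sgd_clf.coef_.shape)  # (3, 4): one weight vector per class, one weight per feature
print(sgd_clf.intercept_)   # one bias term per class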

Making Predictions

Once the model is trained, we can use it to make predictions on the test set:
# Make predictions on the test set
y_pred = sgd_clf.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))
    
The previous code block consists of the following lines:
  • Make predictions on the test set:
    • y_pred = sgd_clf.predict(X_test) - Uses the trained model to predict class labels for the test data.
    • The output y_pred contains predicted labels for each sample in X_test.
  • Print the classification report:
    • print(classification_report(y_test, y_pred)) - Generates a summary of the model’s performance.
    • The classification report includes:
      • Precision - The fraction of correctly predicted positive instances out of all predicted positives.
      • Recall - The fraction of correctly predicted positive instances out of all actual positives.
      • F1-score - The harmonic mean of precision and recall.
      • Support - The number of actual occurrences of each class in y_test.
When the code written so far is executed, we obtain the following output:
                precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.23      0.38        13
           2       0.59      1.00      0.74        13

    accuracy                           0.78        45
   macro avg       0.85      0.74      0.70        45
weighted avg       0.86      0.78      0.73        45
  
The classification report for the SGDClassifier on the Iris dataset shows varying performance across the three classes. For class 0, the model achieved a high precision of 0.95, perfect recall of 1.00, and an F1-score of 0.97, indicating excellent performance. For class 1, however, the classifier struggled, achieving perfect precision of 1.00 but a low recall of 0.23, resulting in a low F1-score of 0.38. This suggests that the model had difficulty correctly identifying instances of class 1. For class 2, the model achieved a decent precision of 0.59 and perfect recall of 1.00, leading to a relatively high F1-score of 0.74, reflecting good recall but lower precision. Overall, the classifier achieved an accuracy of 0.78 across all classes. The macro average, which gives equal weight to each class, shows a precision of 0.85, recall of 0.74, and an F1-score of 0.70. The weighted average, which takes into account the class distribution, resulted in a precision of 0.86, recall of 0.78, and F1-score of 0.73, suggesting a balanced overall performance with a stronger emphasis on precision.
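Part of this instability stems from the fact that SGD is sensitive to feature scaling, and the Iris features are measured on different scales. Although this walkthrough skips that step, standardizing the features before training usually helps; here is a minimal sketch using scikit-learn's StandardScaler and make_pipeline (exact scores will vary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features, then train the same classifier on the scaled data
scaled_clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)
)
scaled_clf.fit(X_train, y_train)
print(classification_report(y_test, scaled_clf.predict(X_test)))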

Hyperparameter Tuning

One of the key advantages of SGDClassifier is its flexibility in tuning hyperparameters. For example, you can experiment with different loss functions to improve model performance. The available loss functions include:
  • 'hinge': Standard SVM loss function.
  • 'log_loss': Logistic regression loss function (called 'log' in older scikit-learn releases).
  • 'modified_huber': A smoother version of the hinge loss.
  • 'perceptron': Perceptron loss function.
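The choice of loss also determines which prediction methods are available: only 'log_loss' and 'modified_huber' provide class probability estimates via predict_proba, while margin-based losses such as 'hinge' offer decision_function only. A short sketch:

# Probabilistic loss: predict_proba is available
prob_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42).fit(X_train, y_train)
print(prob_clf.predict_proba(X_test[:3]))       # per-class probabilities for the first three test samples

# Margin-based loss: only raw decision scores are available
margin_clf = SGDClassifier(loss='hinge', max_iter=1000, tol=1e-3, random_state=42).fit(X_train, y_train)
print(margin_clf.decision_function(X_test[:3])) # one margin per class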
You can also adjust other hyperparameters such as the learning rate, number of iterations, and regularization strength. Here is how you can experiment with different hyperparameters:
# Experiment with different hyperparameters
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, alpha=0.0001, random_state=42)

# Retrain the model with updated hyperparameters
sgd_clf.fit(X_train, y_train)
    
The previous code block consists of the following lines:
  • Experiment with different hyperparameters:
    • sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, alpha=0.0001, random_state=42) - Creates an instance of the SGD classifier with updated hyperparameters.
    • loss='log_loss' - Specifies the use of logarithmic loss for classification, which is appropriate for logistic regression.
    • max_iter=1000 - Sets the maximum number of iterations for the training process to 1000.
    • tol=1e-3 - Defines the tolerance for the stopping criterion; training stops when the improvement is less than this threshold.
    • alpha=0.0001 - Sets the regularization strength; a smaller value indicates less regularization, allowing the model to fit the data more closely.
    • random_state=42 - Ensures reproducibility by setting a fixed random seed, which controls the random shuffling of the data.
  • Retrain the model with updated hyperparameters:
    • sgd_clf.fit(X_train, y_train) - Trains the classifier with the training data using the updated hyperparameters.
    • The model will learn from the training data (X_train and y_train) using the specified hyperparameters.
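Rather than adjusting one hyperparameter at a time, you can also search over combinations with scikit-learn's GridSearchCV; the grid below is purely illustrative, not tuned:

from sklearn.model_selection import GridSearchCV

# Illustrative grid over loss functions and regularization strengths
param_grid = {
    'loss': ['hinge', 'log_loss', 'modified_huber'],
    'alpha': [1e-5, 1e-4, 1e-3],
}
search = GridSearchCV(SGDClassifier(max_iter=1000, tol=1e-3, random_state=42),
                      param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)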

Online Learning with SGDClassifier

One of the key features of SGDClassifier is its ability to perform online learning using the partial_fit() method. This method allows the model to be updated incrementally as new data arrives, making it an excellent choice for real-time applications or when working with large datasets. Here’s how you can use the partial_fit() method for online learning:
# Initialize the SGDClassifier again
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)

# Train the model incrementally in mini-batches
batch_size = 50
for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i + batch_size]
    y_batch = y_train[i:i + batch_size]
    sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1, 2])

# Make predictions on the test set
y_pred = sgd_clf.predict(X_test)
print(classification_report(y_test, y_pred))
    
The previous code block consists of the following lines:
  • Initialize the SGDClassifier again:
    • sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42) - Creates a new instance of the SGD classifier with the specified hyperparameters.
    • loss='log_loss' - Specifies that the model should use logistic loss for classification (logistic regression).
    • max_iter=1000 - Sets the maximum number of iterations for training to 1000.
    • tol=1e-3 - Defines the stopping criterion; training will stop when the improvement in the loss is smaller than this threshold.
    • random_state=42 - Ensures that the results are reproducible by fixing the random seed for data shuffling and randomization.
  • Train the model incrementally in mini-batches:
    • batch_size = 50 - Defines the batch size for the mini-batch gradient descent as 50 samples per batch.
    • for i in range(0, len(X_train), batch_size): - Iterates through the training data in steps of 50 (mini-batches).
    • X_batch = X_train[i:i + batch_size] - Selects a mini-batch of input features from the training data.
    • y_batch = y_train[i:i + batch_size] - Selects the corresponding target labels for the mini-batch.
    • sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1, 2]) - Performs incremental training using the mini-batch, updating the model with each batch. The classes=[0, 1, 2] argument defines the classes to be used in the classification task (3 classes for the Iris dataset).
  • Make predictions on the test set:
    • y_pred = sgd_clf.predict(X_test) - Makes predictions using the trained model on the test data (X_test).
  • Print the classification report:
    • print(classification_report(y_test, y_pred)) - Prints a detailed classification report that includes metrics such as precision, recall, F1-score, and support for each class in the test set.
When the code written so far is executed, two classification reports are shown. The last report belongs to the SGDClassifier trained incrementally with partial_fit():
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.50      1.00      0.67        13
           2       0.00      0.00      0.00        13

    accuracy                           0.71        45
   macro avg       0.50      0.67      0.56        45
weighted avg       0.57      0.71      0.61        45
The classification report for the model shows a mixed performance across the three classes. For class 0, the model achieved perfect performance with a precision, recall, and F1-score all equal to 1.00, indicating it correctly identified all instances of this class. However, for class 1, the model's precision dropped to 0.50, while recall remained perfect at 1.00, leading to an F1-score of 0.67, which suggests the model correctly identified all true instances of class 1, but also made a significant number of false positives. For class 2, the model struggled significantly, achieving zero precision, recall, and F1-score, indicating that it failed to correctly classify any instances of this class. Overall, the model's accuracy was 0.71, meaning it correctly predicted 71% of the samples. The macro average, which treats all classes equally, shows a precision of 0.50, recall of 0.67, and F1-score of 0.56, reflecting the imbalance in performance across classes. The weighted average, which takes into account the class distribution, results in a precision of 0.57, recall of 0.71, and F1-score of 0.61, indicating that while the model performed reasonably well overall, it had trouble with certain classes, particularly class 2.
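One caveat: the loop above passes over the training data only once, which partly explains the weak result for class 2. In a genuine streaming scenario you would simply keep calling partial_fit() as new data arrives; with a fixed dataset you can emulate several passes, reshuffling between epochs. A sketch, assuming numpy is available as np:

import numpy as np

rng = np.random.default_rng(42)
for epoch in range(5):                      # several passes over the training data
    order = rng.permutation(len(X_train))   # reshuffle before each pass
    for i in range(0, len(X_train), batch_size):
        idx = order[i:i + batch_size]
        sgd_clf.partial_fit(X_train[idx], y_train[idx], classes=[0, 1, 2])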

Evaluation Metrics

After training the model, it's important to evaluate its performance. One common evaluation metric for classification tasks is the classification report, which provides metrics such as precision, recall, F1 score, and accuracy. You can calculate these metrics using the classification_report function from Scikit-learn, as shown in the previous code examples.
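If you need individual numbers rather than the full report, the same module provides dedicated functions; a short sketch using accuracy_score and confusion_matrix:

from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_pred))    # overall fraction of correct predictions
print(confusion_matrix(y_test, y_pred))  # rows are true classes, columns are predicted classes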

Conclusion

In this post, we learned how to use Scikit-learn's SGDClassifier for classification tasks. We covered how to:
  • Train the model on the Iris dataset using logistic regression loss.
  • Evaluate the model performance using a classification report.
  • Tune hyperparameters to improve performance.
  • Implement online learning with the partial_fit() method.
The SGDClassifier is a powerful tool for classification tasks, especially when working with large datasets or real-time learning scenarios. Its flexibility in choosing different loss functions and performing online learning makes it a great choice for various classification problems.
