PYTHONHOLICS: online learning

Wednesday, February 26, 2025

Using SGDClassifier for Classification Tasks

In the world of machine learning, classification tasks are a common use case where we need to assign a category label to input data. Scikit-learn's SGDClassifier is an excellent tool for performing classification tasks using stochastic gradient descent (SGD). This model is particularly well-suited for large datasets and real-time learning scenarios, where the data arrives sequentially or the dataset is too large to fit in memory all at once.

What is SGDClassifier?

The SGDClassifier is a linear classifier that uses stochastic gradient descent (SGD) to minimize the loss function. This method is especially effective when dealing with large datasets or when you want to perform online learning, where the model is updated as new data comes in.

SGDClassifier can be used for binary and multiclass classification; multiclass problems are handled by combining binary classifiers in a one-versus-all scheme. It does not accept multilabel targets on its own, so those need a wrapper such as OneVsRestClassifier. It supports several loss functions, including logistic loss (logistic regression) and hinge loss (linear Support Vector Machines, SVMs), among others.
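For multiclass data, scikit-learn's SGDClassifier trains one binary classifier per class (one-versus-all). The short sketch below is an illustration added here, not part of the original walkthrough. It fits the default model on Iris and inspects the learned parameters: one weight row per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

# Iris has 150 samples, 4 features, and 3 classes
X, y = load_iris(return_X_y=True)

# Default loss is 'hinge', i.e. a linear SVM
clf = SGDClassifier(random_state=0)
clf.fit(X, y)

print(clf.classes_)          # the class labels seen during fit
print(clf.coef_.shape)       # one row of feature weights per class
print(clf.intercept_.shape)  # one intercept per class
```

Each row of coef_ belongs to the binary "this class versus the rest" problem, and predict() picks the class whose decision function value is largest.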

How to Use SGDClassifier for Classification

Let's go through a step-by-step example of using SGDClassifier for a classification task. We'll use the popular Iris dataset, which is often used for classification examples.

First, let's import the necessary libraries and load the Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

The previous code block consists of the following lines:

Import necessary libraries:
- from sklearn.datasets import load_iris - Imports the Iris dataset from scikit-learn.
- from sklearn.model_selection import train_test_split - Imports the function to split data into training and test sets.
- from sklearn.linear_model import SGDClassifier - Imports the Stochastic Gradient Descent (SGD) classifier.
- from sklearn.metrics import classification_report - Imports a function to evaluate the model’s performance.
Load the Iris dataset:
- iris = load_iris() - Loads the Iris dataset into memory.
- X, y = iris.data, iris.target - Extracts the feature matrix X and the target variable y.
Split the dataset into training and test sets:
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) - Splits the data into:
  - X_train, y_train - 70% of the data for training.
  - X_test, y_test - 30% of the data for testing.
  - test_size=0.3 - Specifies that 30% of the data is for testing.
  - random_state=42 - Ensures reproducibility of the split.
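To check what the split actually produced, the sketch below prints the resulting shapes and the number of test samples per class. The second split with stratify=y is an optional variation added here for illustration; it is not used in the rest of the post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the post: 30% of the 150 samples go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
print(np.bincount(y_test))  # test samples per class (not necessarily equal)

# Optional variation: stratify=y keeps the class proportions in both parts
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(np.bincount(ys_test))  # equal counts, since Iris has 50 samples per class
```

Without stratification the per-class support in the test set can be uneven, which is worth keeping in mind when reading per-class metrics.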

Running this block produces no visible output. The next step is to train the model.

Training the Model

Now, let's initialize an SGDClassifier with a logistic loss function and fit it to the training data:

# Initialize the SGDClassifier with logistic loss (logistic regression)
# Note: this loss was called 'log' before scikit-learn 1.1; the old name was removed in 1.3
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)

# Train the model
sgd_clf.fit(X_train, y_train)

The previous code block consists of the following lines:

Initialize the SGDClassifier:
- sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42) - Creates an instance of the SGD classifier.
- loss='log_loss' - Specifies that the classifier should use logistic regression (log loss) for classification. Older scikit-learn releases used the name 'log' for the same loss.
- max_iter=1000 - Sets the maximum number of passes over the training data (epochs).
- tol=1e-3 - Defines the stopping criterion; training stops early when the loss fails to improve by at least this amount for several consecutive epochs (n_iter_no_change, 5 by default).
- random_state=42 - Ensures reproducibility by setting a fixed random seed.
Train the model:
- sgd_clf.fit(X_train, y_train) - Trains the classifier using the training data.
- The model learns the relationship between X_train (features) and y_train (target labels).

Again, running this block produces no visible output. We still need to make predictions and print a classification report.

Making Predictions

Once the model is trained, we can use it to make predictions on the test set:

# Make predictions on the test set
y_pred = sgd_clf.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

The previous code block consists of the following lines:

Make predictions on the test set:
- y_pred = sgd_clf.predict(X_test) - Uses the trained model to predict class labels for the test data.
- The output y_pred contains predicted labels for each sample in X_test.
Print the classification report:
- print(classification_report(y_test, y_pred)) - Generates a summary of the model’s performance.
- The classification report includes:
  - Precision - The fraction of correctly predicted positive instances out of all predicted positives.
  - Recall - The fraction of correctly predicted positive instances out of all actual positives.
  - F1-score - The harmonic mean of precision and recall.
  - Support - The number of actual occurrences of each class in y_test.
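The definitions above can be checked by hand on a tiny example. The labels below are hypothetical and chosen only to keep the arithmetic easy; the sketch computes precision, recall, F1-score, and support for the positive class and cross-checks them against scikit-learn's precision_recall_fscore_support.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for a two-class problem (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

precision = tp / (tp + fp)   # correct positives / all predicted positives
recall = tp / (tp + fn)      # correct positives / all actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support = sum(t == 1 for t in y_true)               # actual positives

print(precision, recall, f1, support)

# Cross-check with scikit-learn, restricted to the positive class
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, labels=[1])
print(p[0], r[0], f[0], s[0])
```

In a multiclass report such as the one for Iris, the same calculation is repeated once per class, treating that class as "positive" and all others as "negative".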

When the entire code written so far is executed we will obtain the following output.

                precision    recall  f1-score   support

           0       0.95      1.00      0.97        19
           1       1.00      0.23      0.38        13
           2       0.59      1.00      0.74        13

    accuracy                           0.78        45
   macro avg       0.85      0.74      0.70        45
weighted avg       0.86      0.78      0.73        45

The classification report for the SGDClassifier on the Iris dataset shows uneven performance across the three classes. Class 0 is handled almost perfectly, with a precision of 0.95, a recall of 1.00, and an F1-score of 0.97. Class 1 is the problem: precision is 1.00 but recall is only 0.23, so every sample predicted as class 1 is correct, yet most true class 1 samples are missed, which gives an F1-score of 0.38. Class 2 shows the opposite pattern, with a recall of 1.00 but a precision of 0.59, meaning that many of the samples predicted as class 2 actually belong to another class. Overall, the classifier achieved an accuracy of 0.78. The macro average, which gives equal weight to each class, shows a precision of 0.85, recall of 0.74, and an F1-score of 0.70. The weighted average, which weights each class by its support, shows a precision of 0.86, recall of 0.78, and an F1-score of 0.73. Both averages are pulled down by the poor recall on class 1, so the overall performance is not balanced across classes.

Hyperparameter Tuning

One of the key advantages of SGDClassifier is its flexibility in tuning hyperparameters. For example, you can experiment with different loss functions to improve model performance. The available loss functions include:

'hinge': Standard SVM loss function.
'log_loss': Logistic regression loss function (named 'log' before scikit-learn 1.1).
'modified_huber': A smoother version of the hinge loss.
'perceptron': Perceptron loss function.

You can also adjust other hyperparameters such as the learning rate, number of iterations, and regularization strength. Here is how you can experiment with different hyperparameters:

# Experiment with different hyperparameters
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, alpha=0.0001, random_state=42)

# Retrain the model with updated hyperparameters
sgd_clf.fit(X_train, y_train)

The previous code block consists of the following lines:

Experiment with different hyperparameters:
- sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, alpha=0.0001, random_state=42) - Creates an instance of the SGD classifier with updated hyperparameters.
- loss='log_loss' - Specifies the use of logarithmic loss for classification, which is appropriate for logistic regression.
- max_iter=1000 - Sets the maximum number of iterations for the training process to 1000.
- tol=1e-3 - Defines the tolerance for the stopping criterion; training stops when the improvement is less than this threshold.
- alpha=0.0001 - Sets the regularization strength (0.0001 is also the default value); a smaller value indicates less regularization, allowing the model to fit the data more closely.
- random_state=42 - Ensures reproducibility by setting a fixed random seed, which controls the random shuffling of the data.
Retrain the model with updated hyperparameters:
- sgd_clf.fit(X_train, y_train) - Trains the classifier with the training data using the updated hyperparameters.
- The model will learn from the training data (X_train and y_train) using the specified hyperparameters.

Online Learning with SGDClassifier

One of the key features of SGDClassifier is its ability to perform online learning using the partial_fit() method. This method allows the model to be updated incrementally as new data arrives, making it an excellent choice for real-time applications or when working with large datasets. Here’s how you can use the partial_fit() method for online learning:

# Initialize the SGDClassifier again
sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42)

# Train the model incrementally in mini-batches
batch_size = 50
for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i + batch_size]
    y_batch = y_train[i:i + batch_size]
    sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1, 2])

# Make predictions on the test set
y_pred = sgd_clf.predict(X_test)
print(classification_report(y_test, y_pred))

The previous code block consists of the following lines:

Initialize the SGDClassifier again:
- sgd_clf = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3, random_state=42) - Creates a new instance of the SGD classifier with the specified hyperparameters.
- loss='log_loss' - Specifies that the model should use logistic loss for classification (logistic regression).
- max_iter=1000 - Has no effect in this example; max_iter only applies to fit(), while each partial_fit() call makes a single pass over the batch it receives.
- tol=1e-3 - Defines the stopping criterion used by fit(). With a single pass per partial_fit() call there is no early stopping to trigger, so it has no practical effect here either.
- random_state=42 - Ensures that the results are reproducible by fixing the random seed for data shuffling and randomization.
Train the model incrementally in mini-batches:
- batch_size = 50 - Sets the number of samples passed to each partial_fit() call to 50.
- for i in range(0, len(X_train), batch_size): - Iterates through the training data in steps of 50 (mini-batches).
- X_batch = X_train[i:i + batch_size] - Selects a mini-batch of input features from the training data.
- y_batch = y_train[i:i + batch_size] - Selects the corresponding target labels for the mini-batch.
- sgd_clf.partial_fit(X_batch, y_batch, classes=[0, 1, 2]) - Performs incremental training using the mini-batch, updating the model with each batch. The classes=[0, 1, 2] argument defines the classes to be used in the classification task (3 classes for the Iris dataset).
Make predictions on the test set:
- y_pred = sgd_clf.predict(X_test) - Makes predictions using the trained model on the test data (X_test).
Print the classification report:
- print(classification_report(y_test, y_pred)) - Prints a detailed classification report that includes metrics such as precision, recall, F1-score, and support for each class in the test set.

When the code written so far is executed, two classification reports are printed. The second report, shown below, is for the SGDClassifier trained incrementally with partial_fit().

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.50      1.00      0.67        13
           2       0.00      0.00      0.00        13

    accuracy                           0.71        45
   macro avg       0.50      0.67      0.56        45
weighted avg       0.57      0.71      0.61        45

The classification report for the incrementally trained model is worse than the first one. Class 0 is still classified perfectly, with precision, recall, and F1-score all equal to 1.00. For class 1, recall is 1.00 but precision is only 0.50, which gives an F1-score of 0.67: every true class 1 sample is found, but half of the samples predicted as class 1 belong to another class. For class 2, precision, recall, and F1-score are all 0.00, so the model failed to classify any class 2 sample correctly. Overall, the model's accuracy was 0.71, meaning it correctly predicted 71% of the samples. The macro average, which treats all classes equally, shows a precision of 0.50, recall of 0.67, and F1-score of 0.56. The weighted average, which weights each class by its support, shows a precision of 0.57, recall of 0.71, and F1-score of 0.61. A likely reason for the weak result is that this loop passes over the 105 training samples only once, in three partial_fit() calls, whereas fit() can make up to max_iter passes over the data.

Evaluation Metrics

After training the model, it's important to evaluate its performance. One common tool for classification tasks is the classification report, which lists precision, recall, F1-score, and support for each class, along with the overall accuracy. You can generate it using the classification_report function from Scikit-learn, as shown in the previous code examples.

Conclusion

In this post, we learned how to use Scikit-learn's SGDClassifier for classification tasks. We covered how to:

Train the model on the Iris dataset using logistic regression loss.
Evaluate the model performance using a classification report.
Tune hyperparameters to improve performance.
Implement online learning with the partial_fit() method.

The SGDClassifier is a powerful tool for classification tasks, especially when working with large datasets or real-time learning scenarios. Its flexibility in choosing different loss functions and performing online learning makes it a great choice for various classification problems.

Using Scikit-learn's SGDRegressor for Online Learning

Online learning is an important technique for training models incrementally with new data, particularly when dealing with large datasets or streaming data. One powerful tool for implementing online learning in regression tasks is Scikit-learn's SGDRegressor.

In this post, we will explore how to use the SGDRegressor from Scikit-learn for online learning. SGDRegressor uses stochastic gradient descent (SGD) to fit a regression model and can be updated continuously as new data becomes available. This is especially useful in situations where you can't load the entire dataset into memory at once, or when the data arrives in a sequential manner.

What is SGDRegressor?

The SGDRegressor is a linear regression model that is trained using stochastic gradient descent, making it suitable for online learning. It updates the model with each new batch of data, instead of requiring the whole dataset to be available at once.

One of the key benefits of SGDRegressor is that it can handle very large datasets because it only needs to load a small batch of data at a time. This is different from closed-form solvers for Ordinary Least Squares (OLS), such as Scikit-learn's LinearRegression, which need the entire dataset to be available at once.
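To make the idea of stochastic gradient descent concrete, here is a minimal NumPy sketch of the update rule for a linear model with squared loss. It is a simplified illustration on synthetic, noise-free data and is not SGDRegressor's actual implementation, which also applies regularization and a decaying learning rate by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from y = 2*x1 - 3*x2 + 0.5
X = rng.normal(size=(1000, 2))
true_w = np.array([2.0, -3.0])
true_b = 0.5
y = X @ true_w + true_b

w = np.zeros(2)   # weights start at zero
b = 0.0           # intercept starts at zero
eta = 0.01        # constant learning rate (an assumption for this sketch)

for epoch in range(5):
    for i in rng.permutation(len(X)):     # visit samples in random order
        error = X[i] @ w + b - y[i]       # prediction error for one sample
        w -= eta * error * X[i]           # gradient of 0.5 * error**2 w.r.t. w
        b -= eta * error                  # gradient of 0.5 * error**2 w.r.t. b

print("Learned weights:", w)
print("Learned intercept:", b)
```

Each update touches only one sample, which is why the same procedure works when data arrives as a stream or is read from disk in chunks.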

How to Use SGDRegressor for Online Learning

To demonstrate how SGDRegressor works, we will use a simple example with the California housing dataset, a commonly used dataset for regression tasks. We will perform online learning by training the model incrementally on mini-batches of data.

Let's first load the dataset and initialize the SGDRegressor:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california_housing = fetch_california_housing()
X, y = california_housing.data, california_housing.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the SGDRegressor
sgd_regressor = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)

The previous code block consists of the following code lines:

Import necessary libraries:
- from sklearn.datasets import fetch_california_housing - Imports the function to load the California housing dataset.
- from sklearn.linear_model import SGDRegressor - Imports the Stochastic Gradient Descent (SGD) regressor for linear regression.
- from sklearn.model_selection import train_test_split - Imports the function to split data into training and testing sets.
- from sklearn.metrics import mean_squared_error - Imports the function to evaluate model performance using mean squared error (MSE).
Load the California housing dataset:
- california_housing = fetch_california_housing() - Fetches the dataset, which contains housing data for California.
- X, y = california_housing.data, california_housing.target - Extracts features (X) and target values (y) from the dataset.
Split the data into training and test sets:
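- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) - Holds out 20% of the data for testing and fixes the random seed so the split is reproducible.
Initialize the SGDRegressor model:
- sgd_regressor = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42) - Creates the regressor. Note that max_iter only affects fit(), not the partial_fit() calls used below.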

Training the Model with Mini-Batches

Now, we will train the model in an online fashion using mini-batches. We will use the partial_fit() method of SGDRegressor to update the model after each mini-batch of data.

In online learning, we process each mini-batch of data and update the model's parameters. This allows the model to adapt to new data as it arrives:

# Train the model incrementally using mini-batches
batch_size = 100
for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i+batch_size]
    y_batch = y_train[i:i+batch_size]
    sgd_regressor.partial_fit(X_batch, y_batch)

# Make predictions on the test set
y_pred = sgd_regressor.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

The previous code block consists of the following code lines:

Train the model incrementally using mini-batches:
- batch_size = 100 - Defines the size of each mini-batch for training.
- for i in range(0, len(X_train), batch_size): - Iterates through the training data in increments of batch_size.
- X_batch = X_train[i:i+batch_size] and y_batch = y_train[i:i+batch_size] - Select the features and target values for the current mini-batch.
- sgd_regressor.partial_fit(X_batch, y_batch) - Updates the model's parameters using only the current mini-batch.
Make predictions on the test set:
- y_pred = sgd_regressor.predict(X_test) - Uses the trained model to predict target values for the test set.
Evaluate the model performance:
- mse = mean_squared_error(y_test, y_pred) - Calculates the Mean Squared Error (MSE) between the actual and predicted values.
- print(f"Mean Squared Error: {mse}") - Displays the computed MSE value.

After executing the entire code the following result (MSE) is obtained.

Mean Squared Error: 1.624362503950084e+30

The extremely high Mean Squared Error (MSE) value (≈1.62 × 10³⁰) shows that the model has diverged rather than learned. Possible causes:

Feature Scaling Issue: The features in the California housing dataset have very different scales, and SGDRegressor is sensitive to feature magnitudes. Without scaling, large values dominate the learning process, causing numerical instability.
Diverging Weights (Exploding Gradients): Since SGDRegressor updates weights incrementally, large feature values can lead to large weight updates, making predictions explode to unrealistic values.
Inappropriate Learning Rate: The default learning rate might be too high for unscaled data, leading to overshooting optimal weight values instead of converging.
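A usual way to address divergence of this kind is to standardize the features, and StandardScaler supports partial_fit() so the scaling statistics can be updated batch by batch as well. The sketch below illustrates the pattern on synthetic data with deliberately mismatched feature scales; the data is a hypothetical stand-in for the California housing arrays, chosen so the example runs without a download.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: three features on very different scales
rng = np.random.default_rng(42)
n_samples = 5000
scales = np.array([1.0, 100.0, 1000.0])
offsets = np.array([0.0, 50.0, 2000.0])
X = rng.normal(size=(n_samples, 3)) * scales + offsets
y = X @ np.array([1.0, 0.05, 0.002]) + rng.normal(scale=0.1, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
sgd_regressor = SGDRegressor(random_state=42)

batch_size = 100
for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i + batch_size]
    y_batch = y_train[i:i + batch_size]
    scaler.partial_fit(X_batch)   # update the running mean and variance
    sgd_regressor.partial_fit(scaler.transform(X_batch), y_batch)

# Apply the same scaling to the test set before predicting
y_pred = sgd_regressor.predict(scaler.transform(X_test))
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error with scaling: {mse:.4f}")
```

To apply the same pattern to the California housing data, replace the synthetic X and y with california_housing.data and california_housing.target. If the data fits in memory, fitting the scaler once on the whole training set is a simpler alternative.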

Benefits of Using SGDRegressor for Online Learning

Some of the key advantages of using SGDRegressor for online learning include:

Memory Efficiency: It doesn't require the entire dataset to fit into memory, making it suitable for large datasets or streaming data.
Incremental Updates: The model can be updated with new data as it arrives, making it ideal for real-time applications.
Flexibility: You can fine-tune the learning rate and number of iterations to control how quickly the model adapts to new data.

Conclusion

The SGDRegressor from Scikit-learn is a powerful tool for implementing online learning in regression tasks. It allows you to fit models incrementally, making it well-suited for large datasets or data that arrives sequentially. By using the partial_fit() method, you can update the model with new data and continuously improve its performance. Whether you're working with streaming data, large datasets, or want to perform real-time learning, SGDRegressor is an excellent choice for efficient online learning in Python.