Wednesday, February 26, 2025

Using Scikit-learn's SGDRegressor for Online Learning

Online learning is an important technique for training models incrementally with new data, particularly when dealing with large datasets or streaming data. One powerful tool for implementing online learning in regression tasks is Scikit-learn's SGDRegressor.

In this post, we will explore how to use the SGDRegressor from Scikit-learn for online learning. SGDRegressor uses stochastic gradient descent (SGD) to fit a regression model and can be updated continuously as new data becomes available. This is especially useful in situations where you can't load the entire dataset into memory at once, or when the data arrives in a sequential manner.

What is SGDRegressor?

The SGDRegressor is a linear regression model that is trained using stochastic gradient descent, making it suitable for online learning. It updates the model with each new batch of data, instead of requiring the whole dataset to be available at once.

One of the key benefits of SGDRegressor is that it can handle very large datasets because it only needs to load a small batch of data at a time. This is different from traditional regression algorithms like Ordinary Least Squares (OLS), which require the entire dataset to be loaded into memory.
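
To make the idea concrete before the full example, here is a minimal sketch of the core partial_fit() API on a simulated stream (the synthetic data, coefficients, and chunk sizes here are invented purely for illustration):

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(random_state=42)
rng = np.random.default_rng(0)

# Simulate a stream: each chunk could arrive from disk, a socket, a queue...
true_coefs = np.array([1.0, -2.0, 0.5])
for _ in range(10):
    X_chunk = rng.normal(size=(100, 3))    # 100 rows, 3 features
    y_chunk = X_chunk @ true_coefs + rng.normal(scale=0.1, size=100)
    model.partial_fit(X_chunk, y_chunk)    # one incremental update per chunk

print(model.coef_)  # should move toward [1.0, -2.0, 0.5] as chunks arrive

Note that the model never holds more than one chunk in memory at a time, which is exactly what makes this approach suitable for large or streaming datasets.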

How to Use SGDRegressor for Online Learning

To demonstrate how SGDRegressor works, we will use a simple example with the California housing dataset, a commonly used dataset for regression tasks. We will perform online learning by training the model incrementally on mini-batches of data.

Let's first load the dataset and initialize the SGDRegressor:

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California housing dataset
california_housing = fetch_california_housing()
X, y = california_housing.data, california_housing.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the SGDRegressor
sgd_regressor = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
    
The code block above does the following:
  • Import necessary libraries:
    • from sklearn.datasets import fetch_california_housing - Imports the function to load the California housing dataset.
    • from sklearn.linear_model import SGDRegressor - Imports the Stochastic Gradient Descent (SGD) regressor for linear regression.
    • from sklearn.model_selection import train_test_split - Imports the function to split data into training and testing sets.
    • from sklearn.metrics import mean_squared_error - Imports the function to evaluate model performance using mean squared error (MSE).
  • Load the California housing dataset:
    • california_housing = fetch_california_housing() - Fetches the dataset, which contains housing data for California.
    • X, y = california_housing.data, california_housing.target - Extracts features (X) and target values (y) from the dataset.
  • Split the data into training and test sets:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      • Divides the dataset into 80% training and 20% testing data.
      • random_state=42 ensures reproducibility of the split.
  • Initialize the SGDRegressor model:
    • sgd_regressor = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
      • Creates an instance of SGDRegressor, a linear model trained using stochastic gradient descent.
      • max_iter=1000 - Sets the maximum number of passes over the data when fit() is called. Note that max_iter and tol are ignored by partial_fit(), which always performs a single pass over the batch it is given.
      • tol=1e-3 - Defines the stopping criterion for convergence when fit() is used.
      • random_state=42 - Ensures consistent results across runs.

Training the Model with Mini-Batches

Now, we will train the model in an online fashion using mini-batches. We will use the partial_fit() method of SGDRegressor to update the model after each mini-batch of data.

In online learning, we process each mini-batch of data and update the model's parameters. This allows the model to adapt to new data as it arrives:

# Train the model incrementally using mini-batches
batch_size = 100
for i in range(0, len(X_train), batch_size):
    X_batch = X_train[i:i+batch_size]
    y_batch = y_train[i:i+batch_size]
    sgd_regressor.partial_fit(X_batch, y_batch)

# Make predictions on the test set
y_pred = sgd_regressor.predict(X_test)

# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
    
The code block above does the following:
  • Train the model incrementally using mini-batches:
    • batch_size = 100 - Defines the size of each mini-batch for training.
    • for i in range(0, len(X_train), batch_size): - Iterates through the training data in increments of batch_size.
      • X_batch = X_train[i:i+batch_size] - Selects a batch of input features from the training data.
      • y_batch = y_train[i:i+batch_size] - Selects the corresponding batch of target values.
      • sgd_regressor.partial_fit(X_batch, y_batch) - Updates the model incrementally using the mini-batch.
  • Make predictions on the test set:
    • y_pred = sgd_regressor.predict(X_test) - Uses the trained model to predict target values for the test set.
  • Evaluate the model performance:
    • mse = mean_squared_error(y_test, y_pred) - Calculates the Mean Squared Error (MSE) between the actual and predicted values.
    • print(f"Mean Squared Error: {mse}") - Displays the computed MSE value.
Running the complete code produces the following result:
Mean Squared Error: 1.624362503950084e+30
This extremely high Mean Squared Error (≈1.62 × 10³⁰) indicates that the model is not learning properly. Here are some possible causes:
  • Unscaled features: The features in the California housing dataset have very different scales, and SGDRegressor is sensitive to feature magnitudes. Without scaling, large feature values dominate the gradient updates and cause numerical instability.
  • Diverging weights (exploding gradients): Because SGDRegressor updates its weights incrementally, large feature values produce large weight updates, and the predictions can blow up to unrealistic values.
  • Inappropriate learning rate: The default learning rate may be too high for unscaled data, causing the optimizer to overshoot the optimal weights instead of converging.
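
The most important remedy is feature scaling. Here is a minimal sketch of one possible fix (not part of the original run): standardizing the features with Scikit-learn's StandardScaler before repeating the same mini-batch loop. In a genuinely streaming setting, StandardScaler also supports partial_fit(), so the scaler itself can be updated incrementally; for simplicity it is fit on the training set up front here:

from sklearn.preprocessing import StandardScaler

# (continues from the code above: X_train, X_test, y_train, y_test,
# SGDRegressor, and mean_squared_error are already defined/imported)

# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Re-initialize the model and repeat the mini-batch training loop
sgd_regressor = SGDRegressor(max_iter=1000, tol=1e-3, random_state=42)
batch_size = 100
for i in range(0, len(X_train_scaled), batch_size):
    sgd_regressor.partial_fit(X_train_scaled[i:i+batch_size],
                              y_train[i:i+batch_size])

y_pred = sgd_regressor.predict(X_test_scaled)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")

With standardized inputs the weight updates stay bounded, and the MSE should drop by many orders of magnitude, to a value in line with what a linear model typically achieves on this dataset.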

Benefits of Using SGDRegressor for Online Learning

Some of the key advantages of using SGDRegressor for online learning include:
  • Memory Efficiency: It doesn't require the entire dataset to fit into memory, making it suitable for large datasets or streaming data.
  • Incremental Updates: The model can be updated with new data as it arrives, making it ideal for real-time applications.
  • Flexibility: You can fine-tune the learning rate and number of iterations to control how quickly the model adapts to new data (see the sketch after this list).
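
As an illustration of that last point, the sketch below shows the learning-rate schedules that SGDRegressor exposes (the specific eta0 values are arbitrary choices for illustration):

from sklearn.linear_model import SGDRegressor

# Default: 'invscaling' decays the step size as eta0 / t^power_t
decaying_model = SGDRegressor(learning_rate='invscaling', eta0=0.01, power_t=0.25)

# 'constant' keeps the step size fixed, which can help a long-running
# online model keep adapting when the data distribution drifts
constant_model = SGDRegressor(learning_rate='constant', eta0=0.001)

# 'adaptive' holds eta0 steady and divides it by 5 whenever training
# stops improving for n_iter_no_change consecutive epochs
adaptive_model = SGDRegressor(learning_rate='adaptive', eta0=0.01)

A decaying schedule favors convergence on a fixed dataset, while a constant or adaptive schedule is often preferred when the model must keep tracking new data indefinitely.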

Conclusion

The SGDRegressor from Scikit-learn is a powerful tool for implementing online learning in regression tasks. It lets you fit models incrementally, making it well suited to large datasets and data that arrives sequentially. By using the partial_fit() method, you can update the model as new data comes in. As the experiment above shows, feature scaling is essential: without it, the weight updates can diverge and the error explodes. Whether you're working with streaming data, handling datasets too large for memory, or building real-time learning systems, SGDRegressor is an excellent choice for efficient online learning in Python.
