Saturday, December 7, 2024

Splitting Data into Training and Testing Sets with scikit-learn

In machine learning (ML), one of the most important steps in building a robust model is splitting your dataset into training and testing sets. This ensures that the model is trained on one part of the dataset and evaluated on another, helping you gauge its real-world performance. It is also good practice to evaluate the model on both the training and testing sets and compare the evaluation metric values. If the values on the testing set are similar to those on the training set, this suggests that the model is properly trained and has not overfitted. In this post we will cover:
  • Why splitting your data is important.
  • How to perform a train-test split with scikit-learn.
  • Best practices and tips for effective splitting.

Why split your data?

ML models aim to generalize to unseen data. If you train and test your model on the same data:
  • The model might simply memorize the dataset instead of learning patterns, leading to overfitting. Overfitting occurs when the ML model has excellent estimation/classification performance on the training dataset but poor estimation/classification performance on the testing (unseen) dataset.
  • You would not get an accurate measure of how the model performs on new data.
Splitting the dataset ensures:
  • The training set is used to teach the ML model.
  • The testing set evaluates the model's ability to generalize to unseen data.
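The gap between training and testing performance can be illustrated with a minimal sketch. The synthetic dataset from make_classification and the choice of an unconstrained DecisionTreeClassifier are assumptions made only for this illustration; the exact testing accuracy depends on the data and the installed library version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, used here only for illustration
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A decision tree with no depth limit can memorize the training samples
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Compare accuracy on seen (training) and unseen (testing) data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.2f}")
print(f"Testing accuracy:  {test_acc:.2f}")
```

A training accuracy that is clearly higher than the testing accuracy is a sign of overfitting; similar values suggest the model generalizes.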

Train-test split with scikit-learn

To split the dataset before training your ML model, the train_test_split function from the scikit-learn library is commonly used. This function splits the dataset into training and testing sets.
In the following example we will create a dataset (features and labels) and then split it with train_test_split in an 80:20 ratio, with random shuffling before the split.
The first step is to import the libraries: numpy and the train_test_split function from the model_selection module of the sklearn library.
import numpy as np
from sklearn.model_selection import train_test_split
The next step is to create the dataset. The features will be stored in the X variable and the labels in the y variable.
# Sample dataset
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]) # Features
y = np.array([0, 1, 0, 1, 0]) # Labels
Now that the dataset is created, it can be split using the train_test_split function.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Finally, we will print the results using the print function.
# Print the results
print("Training Features:\n", X_train)
print("Testing Features:\n", X_test)
print("Training Labels:\n", y_train)
print("Testing Labels:\n", y_test)
When the code is executed, the following output is obtained (displayed in Spyder's console window, or in Command Prompt or Anaconda Prompt):

Training Features:
 [[ 9 10]
 [ 5  6]
 [ 1  2]
 [ 7  8]]
Testing Features:
 [[3 4]]
Training Labels:
 [0 0 0 1]
Testing Labels:
 [1]
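To confirm the split ratio, the shapes of the returned arrays can be inspected. The short sketch below repeats the split on the same five-sample dataset and also shows what happens when test_size is omitted; the variable names ending in _d are introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])  # Features
y = np.array([0, 1, 0, 1, 0])  # Labels

# test_size=0.2 on 5 samples: 1 sample for testing, 4 for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (4, 2) (1, 2)

# With no sizes given, test_size defaults to 0.25; the number of
# testing samples is rounded up, so 5 samples give 2 for testing
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X, y, random_state=42
)
print(X_train_d.shape, X_test_d.shape)  # (3, 2) (2, 2)
```

On very small datasets the rounding means the actual ratio can differ noticeably from the requested one, so checking the shapes is a cheap sanity test.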
The key parameters of the train_test_split function are:
  • test_size - defines the proportion of the dataset to include in the testing set. In this example test_size is set to 0.2, which means 20% of the data will be used for testing. If neither test_size nor train_size is specified, test_size defaults to 0.25.
  • train_size - an optional parameter that specifies the size of the training set. If train_size is not specified, it is set to the complement of test_size, i.e. whatever remains after the testing set is taken is used for training.
  • random_state - ensures reproducibility by controlling the randomness of the split. Use the same value to get the same split every time.
  • shuffle - if True (the default), the data is shuffled before splitting. If set to False, the original dataset order is maintained. IMPORTANT: it is generally good practice to shuffle the samples before splitting, so leave shuffle at its default value of True unless the sample order must be preserved (for example, in time-ordered data).
  • stratify - when set to the array of labels (e.g. stratify=y), this parameter ensures that the proportion of labels in the training and testing sets matches the original dataset. It is useful for imbalanced datasets.
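The effect of the stratify and shuffle parameters can be seen in a small sketch. The imbalanced dataset below (90 samples of class 0 followed by 10 samples of class 1) is hypothetical and was chosen only so that the class counts are easy to check by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 samples of class 0, then 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90:10 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_train))  # [72  8]
print(np.bincount(y_test))   # [18  2]

# shuffle=False keeps the original order: the first 80 rows are used
# for training and the last 20 rows for testing
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print(X_test_o[0, 0], X_test_o[-1, 0])  # 80 99
print(np.bincount(y_test_o))            # [10 10]
```

Because this dataset is sorted by label, the unshuffled testing set is half class 1 even though class 1 makes up only 10% of the data, which is why shuffling (or stratifying) matters for ordered datasets.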

Best practices for splitting data

  1. Use a Representative Testing Set: Ensure your testing set covers all types of data in your dataset, especially when working with imbalanced classes.
  2. Stratify for Classification Tasks: When dealing with classification problems, use the stratify parameter to maintain class balance in the training and testing sets.
  3. Keep Testing Data Separate: Never use your testing set for hyperparameter tuning or model training. Reserve it solely for evaluating the final model.
  4. Set a Random State for Reproducibility: Always set random_state when splitting data to ensure your results are consistent across experiments.
  5. Adjust the Split Ratio Based on Data Size: For small datasets, a 70/30 split is common. For large datasets, an 80/20 or even 90/10 split might be sufficient.
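The reproducibility and stratification practices above can be checked with a short sketch. The toy dataset (20 samples with alternating labels) and the names split_a and split_b are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 20 samples, two balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Two calls with the same random_state return the same split
split_a = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
split_b = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

identical = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(identical)  # True

# The returned order is X_train, X_test, y_train, y_test
y_test = split_a[3]
print(np.bincount(y_test))  # [3 3]
```

Fixing random_state makes experiments repeatable, and the stratified testing set keeps the 50:50 class balance of the original labels.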

Splitting Data into Training, Validation, and Testing Sets

In some cases, you may also want a validation set to fine-tune your model or perform hyperparameter optimization. Here’s how you can achieve this:
import numpy as np
from sklearn.model_selection import train_test_split
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([0, 1, 0, 1, 0])
# First split off the testing set (20% of the data)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Further split training + validation into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42) # 25% of 80% = 20%
# Print results
print("Training Features:\n", X_train)
print("Validation Features:\n", X_val)
print("Testing Features:\n", X_test)
The output of the previous code is given below.
Training Features:
 [[ 7  8]
 [ 9 10]
 [ 1  2]]
Validation Features:
 [[5 6]]
Testing Features:
 [[3 4]]
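The two-step procedure can also be wrapped in a reusable function. The helper below, train_val_test_split, is hypothetical (it is not part of scikit-learn); it is a minimal sketch that assumes a classification task, so both splits are stratified, and it is tried on a made-up dataset of 100 samples.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.2, test_size=0.2, random_state=None):
    """Hypothetical helper: stratified train/validation/test split."""
    # Number of validation samples, as a fraction of the full dataset
    n_val = int(round(val_size * len(X)))

    # Step 1: split off the testing set
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Step 2: take the validation samples from the remainder
    # (an integer test_size is interpreted as a number of samples)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=n_val,
        random_state=random_state, stratify=y_train_val,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Made-up dataset: 100 samples, two balanced classes
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Passing the validation size as a sample count avoids having to rescale the fraction by hand (the "25% of 80%" arithmetic in the example above).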
        

Conclusions

Splitting your data into training and testing sets is a vital step in machine learning to prevent overfitting and evaluate model performance. The train_test_split function in scikit-learn makes this process simple and efficient. By following best practices—such as stratifying your splits, keeping testing data separate, and ensuring reproducibility—you can build models that generalize better to real-world data. Got questions or examples of how you split your data? Share them in the comments below!
