Machine learning (ML) models require clean and well-structured datasets. One of the most widely used Python libraries for preprocessing ML datasets is Pandas, alongside NumPy. The Pandas library allows you to handle missing data, transform categorical variables, normalize numerical features, and perform other essential preprocessing tasks with ease.
In this post we will cover the basics of preprocessing ML datasets using the Pandas library.
How to load your dataset using Pandas?
The Pandas library makes it easy to load data from various sources. The most commonly used formats are CSV, Excel, and SQL databases. The majority of publicly available datasets on websites such as the UCI Machine Learning Repository or Kaggle are in CSV format. Before loading the dataset, the Pandas library must be imported. In most online examples of Python code, the Pandas library is imported in the following way.
import pandas as pd
Using the previous code block, the Pandas library is imported as pd. The pd abbreviation gives us access to all Pandas methods and functions.
To import a dataset in CSV format we will use the Pandas function read_csv.
data = pd.read_csv("dataset.csv")
Using the previous code line with pd.read_csv we have accessed the read_csv function from the Pandas library through the pd abbreviation. Inside the brackets of read_csv() you have to put the name of the dataset. If the Python script is located in the same folder as "dataset.csv", then you only need to type "dataset.csv", i.e. the full name of the dataset including the extension. If, however, the dataset is located in another folder, then you have to type the full path to the file together with the dataset name. To look at the first five rows of the dataset we have to type the following code.
print(data.head())
data.head() is used to show the first 5 rows of dataset.csv. Since the dataset is loaded and stored in the data variable, the Pandas function to show the first five rows is the head() function. The print is used to display these five rows. The full code is shown below.
import pandas as pd
data = pd.read_csv("dataset.csv") # Only if the dataset is located inside the same folder as the python script
print(data.head())
How to handle missing data?
Missing data is a common problem in ML datasets, and this issue must be handled before training ML algorithms. The Pandas library offers several ways to handle missing values, i.e. the missing values (rows with missing values) can be dropped from the dataset or can be imputed. However, before dropping or imputing, the dataset must be checked for missing values.
How to check the dataset for missing values?
After the required libraries are defined in the Python script and the dataset is imported with the read_csv function, you can check whether the dataset contains any missing values by using the isnull() Pandas function, followed by the sum() function.
print(data.isnull().sum())
Using only the isnull() function will return a table of the same size as the dataset filled with True and False values. A False value indicates that the cell contains some value, while a True value indicates a cell with a missing value. This approach is not readable, since the print() function does not show the entire dataset as the output.
On the other hand, if we use the sum() function after the isnull() function, it will list all dataset variables in one column with the number of missing values per dataset variable. However, if the dataset contains a large number of variables, it will not show all of them.
To show the total sum (number of cells) with missing values you have to type the following code.
print(data.isnull().sum().sum())
The additional sum() function is added to the data.isnull().sum() code line. The output of this code line will be an integer that represents the total number of missing values (empty cells) in the dataset. The entire code used in this example is given below.
import pandas as pd
data = pd.read_csv("dataset.csv") # Enter the proper name of the dataset
print(data.isnull()) # The output will have the same size as the dataset; the "cells" will contain True and False values: True for an empty cell and False for a non-empty cell
print(data.isnull().sum()) # This line will list all dataset variables in one column with the number of missing values (empty cells) in a second column. This is not practical if you have a large number of input variables
print(data.isnull().sum().sum()) # Will show one number as the output, which is the total number of empty cells in the dataset
How to drop the missing values from the dataset?
When the dataset is loaded and checked for missing values, the dropna function is used to remove rows or columns with missing values from the dataset.
To remove the dataset rows with missing values type in the following code.
data = data.dropna()
To drop columns with missing values you have to define the axis parameter of the dropna function. By default axis=0, which removes the dataset rows with missing values. To remove columns with missing values, axis=0 must be changed to axis=1.
data = data.dropna(axis = 1)
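As a quick, self-contained illustration (using a small made-up DataFrame in place of dataset.csv), the default dropna() removes rows with missing values, while axis=1 removes whole columns:

```python
import pandas as pd
import numpy as np

# A small example frame with one missing value in column "b"
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, np.nan, 6.0]})

rows_dropped = df.dropna()        # axis=0 (default): drops the row containing the NaN
cols_dropped = df.dropna(axis=1)  # axis=1: drops the whole "b" column

print(rows_dropped.shape)  # (2, 2) -- one row removed
print(cols_dropped.shape)  # (3, 1) -- one column removed
```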
How to impute missing values?
Missing values are empty cells in the Pandas dataset and, as previously stated, they can be excluded from the analysis by dropping rows or columns. However, there is a quicker and easier alternative, which is to fill empty cells in numerical columns with the column mean, or to fill missing categorical values with the mode.
# Fill missing numerical values with the column mean
data['numerical_column'] = data['numerical_column'].fillna(data['numerical_column'].mean())
# Fill missing categorical values with the mode
data['categorical_column'] = data['categorical_column'].fillna(data['categorical_column'].mode()[0])
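A minimal sketch of the same imputation on a tiny, hypothetical frame (the column names mirror the snippet above):

```python
import pandas as pd
import numpy as np

# Hypothetical frame with one missing value in each column
data = pd.DataFrame({
    "numerical_column": [10.0, np.nan, 30.0],
    "categorical_column": ["red", "red", None],
})

# Numerical gap -> column mean; categorical gap -> most frequent value (mode)
data["numerical_column"] = data["numerical_column"].fillna(data["numerical_column"].mean())
data["categorical_column"] = data["categorical_column"].fillna(data["categorical_column"].mode()[0])

print(data["numerical_column"].tolist())  # [10.0, 20.0, 30.0]
```

The mean of the two observed values (10 and 30) is 20, so the empty numerical cell becomes 20, and the empty categorical cell becomes "red", the most frequent label.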
Encoding categorical variables
Often Pandas dataset variables have categorical values that have to be transformed into numeric values so an ML algorithm can be trained on the dataset. To transform categorical to numerical values, the categorical variables have to be encoded. There are several ways you can encode categorical variables; one way is using the Pandas library with one-hot encoding and label encoding.
One-hot encoding
One-hot encoding creates binary columns for each category in a categorical variable. This method is used when you want to transform a multiclass target variable (a variable that contains labels 1, 2, 3, 4) into multiple binary columns. In this case 4 columns are created, where each column corresponds to one label in the original column. Each column contains 0 and 1 values: 0 for samples that do not have the specific label and 1 for samples that originally contained the specific label. One-hot encoding is performed with the get_dummies() function. To demonstrate one-hot encoding we will define a simple Pandas dataframe that will be stored in the variable y.
y = pd.DataFrame([1,2,3,4,2,3,4,1,2,3,4,1,2,3,4], columns = ['Class'])
So this variable contains one column of a Pandas dataframe with labels 1, 2, 3, and 4. To transform this column into four binary columns (where, for example, samples labeled 4 get the value 1 in the corresponding column and all other samples get 0), we have to use the get_dummies() function.
y_raw = pd.get_dummies(y['Class'], prefix="Class", dtype=float)
The y['Class'] column will be transformed into a Pandas dataframe with four columns, where each column name is "Class_" followed by the corresponding label (1, 2, 3, and 4). The dtype is float since we want numeric values 0 and 1 in each column, not True and False values. If you want the True and False values, just don't define dtype. The output is shown below.
    Class_1  Class_2  Class_3  Class_4
0       1.0      0.0      0.0      0.0
1       0.0      1.0      0.0      0.0
2       0.0      0.0      1.0      0.0
3       0.0      0.0      0.0      1.0
4       0.0      1.0      0.0      0.0
5       0.0      0.0      1.0      0.0
6       0.0      0.0      0.0      1.0
7       1.0      0.0      0.0      0.0
8       0.0      1.0      0.0      0.0
9       0.0      0.0      1.0      0.0
10      0.0      0.0      0.0      1.0
11      1.0      0.0      0.0      0.0
12      0.0      1.0      0.0      0.0
13      0.0      0.0      1.0      0.0
14      0.0      0.0      0.0      1.0
As seen from the result, the first column is the dataframe index, running from 0 to 14. The pd.get_dummies() function created four columns from the single Class column, where each column corresponds to a specific class: Class_1 for label 1 in the original Class column, Class_2 for label 2, Class_3 for label 3, and Class_4 for label 4.
In the Class_1 column, all dataset samples that were equal to 1 in the original Class column are also equal to 1, while all samples that did not contain label 1 in the original Class column are equal to 0. For example, the second sample has label 2 in the original Class column, so in Class_1 the second sample is equal to 0. However, the second sample has a value of 1 in the Class_2 column, since the second sample contained label 2 in the original Class column.
Label encoding
For ordinal data, you can replace categorical values with numerical values.
data['ordinal_column'] = data['ordinal_column'].map({'Low': 1, 'Medium': 2, 'High': 3})
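A short, self-contained sketch of this mapping on made-up ordinal data. Note that any category missing from the dictionary becomes NaN, so the mapping should cover every label that appears in the column:

```python
import pandas as pd

# Hypothetical ordinal column
data = pd.DataFrame({"ordinal_column": ["Low", "High", "Medium", "Low"]})

# map() replaces each label by its numeric rank; unmapped labels would become NaN
mapping = {"Low": 1, "Medium": 2, "High": 3}
data["ordinal_column"] = data["ordinal_column"].map(mapping)

print(data["ordinal_column"].tolist())  # [1, 3, 2, 1]
```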
Feature Scaling
Scaling ensures that numerical features contribute equally to the model. Unfortunately, the Pandas library does not have scaling/normalization techniques; however, the scikit-learn library has the following scaling/normalization techniques: MaxAbsScaler, MinMaxScaler, Normalizer, PowerTransformer, RobustScaler, and StandardScaler.
Example: How to use StandardScaler?
This is just a simple example of how to use the StandardScaler. The first step is to import the StandardScaler from the scikit-learn library's preprocessing module.
import pandas as pd
Then the Pandas dataset is loaded and the standard scaler is applied to it. However, before the application, the StandardScaler is defined and the output target variable must be popped out of the dataset, since we do not want to scale the target variable, i.e. it is generally not recommended.
from sklearn.preprocessing import StandardScaler
data = pd.read_csv("dataset_name.csv")
y = data.pop('Output')
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
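For a runnable version of the same steps, here is a sketch that substitutes a small made-up frame for dataset_name.csv (the column names are illustrative). StandardScaler subtracts the column mean and divides by the column standard deviation, so each scaled column ends up centred on zero:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data standing in for dataset_name.csv
data = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [10.0, 20.0, 30.0, 40.0],
    "Output":   [0, 1, 0, 1],
})

y = data.pop("Output")                    # remove the target before scaling
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)  # NumPy array: (value - mean) / std per column

print(data_scaled.mean(axis=0))  # each column is centred on 0
```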
Normalization
The dataset normalization, or normalization of input variables, can be performed using the Normalizer function available in the scikit-learn library's preprocessing module. The first step is to import the required libraries, in this case Pandas and the sklearn.preprocessing module.
import pandas as pd
Then the dataset must be loaded using the pd.read_csv function and the target variable must be popped out.
from sklearn.preprocessing import Normalizer
data = pd.read_csv("dataset_name.csv")
The next and final step is to define the Normalizer function, save it under an arbitrary variable name, and later apply the normalizer to the dataset input variables.
y = data.pop('output')
model = Normalizer()
data_scaled = model.fit_transform(data)
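A self-contained sketch with made-up numbers (standing in for dataset_name.csv) shows what Normalizer does: with the default L2 norm it rescales each row, not each column, to unit length:

```python
import pandas as pd
from sklearn.preprocessing import Normalizer

# Illustrative data standing in for dataset_name.csv
data = pd.DataFrame({"f1": [3.0, 1.0], "f2": [4.0, 0.0], "output": [0, 1]})

y = data.pop("output")
model = Normalizer()                 # default norm="l2": scales each ROW to unit length
data_scaled = model.fit_transform(data)

print(data_scaled[0])  # [0.6 0.8] -- 3/5 and 4/5, since the first row's norm is 5
```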
Remove Outliers
Outliers can skew your ML model so they must be handled properly. Pandas makes it easy to detect and handle outliers.
Using the IQR method
Q1 = data['numerical_column'].quantile(0.25)
Q3 = data['numerical_column'].quantile(0.75)
IQR = Q3 - Q1
# Filter out rows with outliers
data = data[(data['numerical_column'] >= Q1 - 1.5 * IQR) & (data['numerical_column'] <= Q3 + 1.5 * IQR)]
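The same IQR filter can be tried end-to-end on made-up numbers containing one obvious outlier:

```python
import pandas as pd

# Nine ordinary values plus one obvious outlier (1000)
data = pd.DataFrame({"numerical_column": [10, 12, 11, 13, 12, 11, 10, 13, 12, 1000]})

Q1 = data["numerical_column"].quantile(0.25)
Q3 = data["numerical_column"].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
mask = (data["numerical_column"] >= Q1 - 1.5 * IQR) & (data["numerical_column"] <= Q3 + 1.5 * IQR)
data = data[mask]

print(len(data))  # 9 -- the outlier row is gone
```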
Feature engineering
You can also create new features or modify existing ones using Pandas. In this example we will create a new feature by adding two existing features from the dataset.
data['new_feature'] = data['feature1'] + data['feature2']
Binning
Binning numerical variables can convert them into categorical variables.
# Bin numerical values into categories
data['binned_feature'] = pd.cut(data['numerical_column'], bins=[0, 50, 100], labels=['Low', 'High'])
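A quick, self-contained illustration of pd.cut with made-up values: with bins=[0, 50, 100], values in (0, 50] fall into "Low" and values in (50, 100] into "High":

```python
import pandas as pd

# Hypothetical numerical column to bin
data = pd.DataFrame({"numerical_column": [10, 45, 60, 95]})

# Two bins: (0, 50] -> "Low", (50, 100] -> "High" (right edge inclusive by default)
data["binned_feature"] = pd.cut(data["numerical_column"],
                                bins=[0, 50, 100],
                                labels=["Low", "High"])

print(data["binned_feature"].tolist())  # ['Low', 'Low', 'High', 'High']
```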
Saving and Loading Preprocessed Data
After preprocessing, it's often a good idea to save your data for later use.
# Save the preprocessed dataset
data.to_csv('preprocessed_data.csv', index=False)
# Load it again when needed
data = pd.read_csv('preprocessed_data.csv')
Conclusion
Pandas is an incredibly powerful tool for preprocessing machine learning datasets. It allows you to handle missing data, encode categorical variables, scale numerical features, remove outliers, and even engineer new features, all with minimal code. By mastering these preprocessing techniques, you can prepare clean, structured datasets for your machine learning models, ensuring better performance and more accurate predictions.
If you have any questions or need further clarification, feel free to drop a comment below!