PYTHONHOLICS: data preprocessing


Tuesday, January 28, 2025

Feature Scaling Techniques in Machine Learning

Feature scaling is essential for many machine learning algorithms to perform well. In this section, we will describe several feature scaling techniques, provide a simple example dataset, and showcase the results of applying each technique.

1. MaxAbsScaler

The MaxAbsScaler scales each feature by its maximum absolute value. It scales the data to a range between -1 and 1 while preserving the sparsity of the dataset (if any). This method is useful when the data is already centered around zero and you want to maintain its sparsity.

Example:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np
    
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])
    
scaler = MaxAbsScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

The previous code block consists of the following steps:

Importing necessary libraries:
- from sklearn.preprocessing import MaxAbsScaler: Imports the MaxAbsScaler from the sklearn.preprocessing module, which is used to scale each feature by its maximum absolute value.
- import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
Defining the example dataset:
- data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values include both positive and negative numbers.
Creating the scaler object:
- scaler = MaxAbsScaler(): Creates an instance of the MaxAbsScaler, which scales the data by its maximum absolute value.
Applying the scaler to the dataset:
- scaled_data = scaler.fit_transform(data): Applies the MaxAbsScaler to the dataset by fitting the scaler to the data and then transforming it. The result is stored in scaled_data.
Displaying the scaled data:
- print(scaled_data): Prints the transformed data to the console. The data is scaled so that each feature is divided by its maximum absolute value, resulting in values between -1 and 1.

After the code is executed, the following output is obtained.


    [[ 0.25  0.4   0.5 ]
     [-0.25 -0.4  -0.5 ]
     [ 1.    1.    1.  ]]
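To see where these numbers come from, the same result can be reproduced with plain NumPy. This is only a minimal sketch of the idea (divide every column by its largest absolute value), not how scikit-learn implements it internally; the names max_abs and manual_scaled are chosen here for illustration.

```python
import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

# Largest absolute value in each column (feature): [4, 5, 6]
max_abs = np.abs(data).max(axis=0)

# Dividing each column by its maximum absolute value gives values in [-1, 1]
manual_scaled = data / max_abs
print(manual_scaled)
```

Because the data is only divided and never shifted, zeros stay zeros, which is why this scaler preserves sparsity.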

2. MinMaxScaler

The MinMaxScaler transforms the data into a fixed range, usually between 0 and 1. The formula is:

\begin{equation} X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}} \end{equation}

This scaler is useful when features have different units or scales and you need to bring them into the same range, which often helps gradient-based algorithms converge.

Example:

from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]) 
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

The previous code block consists of the following steps:

Importing necessary libraries:
- from sklearn.preprocessing import MinMaxScaler: Imports the MinMaxScaler from the sklearn.preprocessing module, which is used to scale features to a specified range, typically between 0 and 1.
- import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
Defining the example dataset:
- data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in the array are a mix of positive and negative numbers.
Creating the scaler object:
- scaler = MinMaxScaler(): Creates an instance of the MinMaxScaler. This scaler transforms the data by scaling each feature to a given range (default is between 0 and 1), based on the minimum and maximum values of each feature.
Applying the scaler to the dataset:
- scaled_data = scaler.fit_transform(data): This line of code first fits the MinMaxScaler to the data (i.e., computes the minimum and maximum values for each feature) and then transforms the data by scaling each feature to the range [0, 1]. The resulting transformed data is stored in scaled_data.
Displaying the scaled data:
- print(scaled_data): Prints the scaled data to the console. The values in each column are now transformed to lie between 0 and 1, according to the minimum and maximum values of each feature in the original data.

When the previous code is executed the following output is obtained.

[[0.4        0.57142857 0.66666667]
 [0.         0.         0.        ]
 [1.         1.         1.        ]]
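The formula above can be checked by hand with NumPy. The following is a minimal sketch that applies it column by column to the example dataset; the names col_min, col_max and manual_scaled are illustrative and not part of scikit-learn.

```python
import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

col_min = data.min(axis=0)  # per-feature minimum: [-1, -2, -3]
col_max = data.max(axis=0)  # per-feature maximum: [4, 5, 6]

# (X - X_min) / (X_max - X_min), applied column by column
manual_scaled = (data - col_min) / (col_max - col_min)
print(manual_scaled)
```

For the first row this gives 2/5, 4/7 and 6/9. If a range other than [0, 1] is needed, MinMaxScaler accepts a feature_range parameter, for example MinMaxScaler(feature_range=(-1, 1)).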

3. Normalizer

The Normalizer scales each sample (row) to have a unit norm (magnitude of 1). This is useful when you want to scale each observation independently of the others.

Example:

from sklearn.preprocessing import Normalizer
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]) 
scaler = Normalizer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

The previous code block consists of the following steps:

Importing necessary libraries:
- from sklearn.preprocessing import Normalizer: Imports the Normalizer from the sklearn.preprocessing module, which is used to normalize the dataset. Normalization scales each sample (row) to have a unit norm.
- import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
Defining the example dataset:
- data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in this array include both positive and negative numbers.
Creating the scaler object:
- scaler = Normalizer(): Creates an instance of the Normalizer class. The Normalizer scales each sample (row) in the dataset to have a unit norm (i.e., the Euclidean norm of the row is 1).
Applying the scaler to the dataset:
- scaled_data = scaler.fit_transform(data): This line of code fits the Normalizer and transforms the data in one call. The Normalizer is stateless, so fitting does not learn anything from the data; each row is simply divided by its own norm. The resulting scaled data is stored in scaled_data.
Displaying the scaled data:
- print(scaled_data): Prints the normalized data to the console. Each row in the output will have a Euclidean norm of 1, meaning that the sum of squares of the elements in each row will be equal to 1.

When the code block is executed the following output is obtained.

[[ 0.26726124  0.53452248  0.80178373]
 [-0.26726124 -0.53452248 -0.80178373]
 [ 0.45584231  0.56980288  0.68376346]]
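The row-wise nature of this scaler is easy to verify with NumPy. The sketch below divides each row by its own Euclidean norm; row_norms and manual_scaled are illustrative names, and the code is a hand-rolled equivalent rather than scikit-learn's implementation.

```python
import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

# Euclidean (L2) norm of each row, kept as a column so it broadcasts row-wise
row_norms = np.linalg.norm(data, axis=1, keepdims=True)

manual_scaled = data / row_norms
print(manual_scaled)

# Every row of the result has length 1
print(np.linalg.norm(manual_scaled, axis=1))
```

The last row is [4, 5, 6] divided by the square root of 77. The Normalizer uses the L2 norm by default; its norm parameter also accepts 'l1' and 'max'.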

4. PowerTransformer

The PowerTransformer applies a power transformation to make data more Gaussian-like. It includes two methods: the Box-Cox and Yeo-Johnson transformations, which are useful for correcting skewed data.

Example:

from sklearn.preprocessing import PowerTransformer
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = PowerTransformer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

The previous code block consists of the following steps:

Importing necessary libraries:
- from sklearn.preprocessing import PowerTransformer: Imports the PowerTransformer from the sklearn.preprocessing module, which is used to apply power transformations to make data more Gaussian (normal) by applying a nonlinear transformation to the features.
- import numpy as np: Imports the numpy library, which is used for creating and manipulating arrays.
Defining the example dataset:
- data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The dataset contains both positive and negative numbers.
Creating the scaler object:
- scaler = PowerTransformer(): Creates an instance of the PowerTransformer. This scaler transforms the data using a power transformation to make the data distribution closer to a normal (Gaussian) distribution. The transformation is selected with the method parameter: the default, 'yeo-johnson', works with both positive and negative values, while 'box-cox' requires strictly positive data. By default the transformed features are also standardized to zero mean and unit variance.
Applying the scaler to the dataset:
- scaled_data = scaler.fit_transform(data): This line of code first fits the PowerTransformer to the dataset (calculates the necessary transformation parameters) and then transforms the data. The transformed data is stored in scaled_data. The result is a transformed dataset that aims to have a more Gaussian distribution for each feature.

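When the previous code is executed, fit_transform returns a 3x3 array. Its exact values depend on the lambda parameter fitted for each feature. Because the Yeo-Johnson transformation is monotonic and the output is standardized by default, each column has a mean of 0 and a standard deviation of 1, with the row [-1, -2, -3] mapped to the smallest values and the row [4, 5, 6] to the largest.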
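The properties of the transformed data can be inspected directly. The following is a small sketch that fits the default Yeo-Johnson transformation on the example dataset and prints the fitted lambdas_ attribute together with the column means and standard deviations; the variable names pt and transformed are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]], dtype=float)

# method="yeo-johnson" is the default and accepts negative values;
# method="box-cox" requires strictly positive data.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(data)

print(pt.lambdas_)               # one fitted lambda per feature
print(transformed.mean(axis=0))  # approximately 0 for every column
print(transformed.std(axis=0))   # approximately 1 for every column
```

With only three samples per feature the fitted lambdas carry little statistical meaning; the example is meant to show the mechanics, not a realistic use case.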

5. RobustScaler

The RobustScaler uses the median and interquartile range for scaling, making it robust to outliers. It scales the data by subtracting the median and dividing by the interquartile range.

Example:

from sklearn.preprocessing import RobustScaler
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

The previous code block consists of the following steps:

Importing necessary libraries:
- from sklearn.preprocessing import RobustScaler: Imports the RobustScaler from the sklearn.preprocessing module, which is used for scaling the features of the dataset using the median and interquartile range (IQR) instead of mean and standard deviation.
- import numpy as np: Imports the numpy library, which is used for creating and manipulating arrays.
Defining the example dataset:
- data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The dataset contains both positive and negative numbers.
Creating the scaler object:
- scaler = RobustScaler(): Creates an instance of the RobustScaler. This scaler transforms the data by using the median and the interquartile range (IQR) for scaling, which makes it robust to outliers. It is particularly useful when the dataset has extreme outliers that could affect scaling using standard techniques.
Applying the scaler to the dataset:
- scaled_data = scaler.fit_transform(data): This line of code fits the RobustScaler to the dataset (calculates the necessary values like median and IQR for scaling) and then transforms the data. The resulting scaled data is stored in scaled_data.
Displaying the scaled data:
- print(scaled_data): Prints the scaled data to the console. The values are scaled by subtracting the median of each feature and then dividing by the interquartile range (IQR), making them less sensitive to outliers.

When the previous code block is executed the following output is obtained.

[[ 0.          0.          0.        ]
 [-0.8        -1.14285714 -1.33333333]
 [ 1.2         0.85714286  0.66666667]]
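The median and interquartile range used by this scaler can be computed by hand. The sketch below reproduces the scaling with NumPy, assuming the default 25th and 75th percentiles; median, q1, q3 and iqr are illustrative names.

```python
import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

median = np.median(data, axis=0)                # [1, 2, 3]
q1, q3 = np.percentile(data, [25, 75], axis=0)  # 25th and 75th percentiles per column
iqr = q3 - q1                                   # [2.5, 3.5, 4.5]

manual_scaled = (data - median) / iqr
print(manual_scaled)
```

The first row contains the median of every column, so it maps to zeros; the other rows are expressed in units of the interquartile range.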

6. StandardScaler

The StandardScaler standardizes features by removing the mean and scaling to unit variance. The formula is:

\begin{equation} X_{scaled} = \frac{X - mean}{std_{dev}} \end{equation}

This method is useful when the data follows a Gaussian distribution or when features have different variances.

Example:

from sklearn.preprocessing import StandardScaler
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

The previous code block consists of the following steps:

Importing necessary libraries:
- from sklearn.preprocessing import StandardScaler: Imports the StandardScaler from the sklearn.preprocessing module, which is used to scale the dataset by transforming it into a distribution with a mean of 0 and a standard deviation of 1.
- import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
Defining the example dataset:
- data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in the array include both positive and negative numbers.
Creating the scaler object:
- scaler = StandardScaler(): Creates an instance of the StandardScaler. This scaler transforms the data to have a mean of 0 and a standard deviation of 1. It is commonly used when the features in the dataset are on different scales.
Applying the scaler to the dataset:
- scaled_data = scaler.fit_transform(data): This line of code first fits the StandardScaler to the dataset (calculates the mean and standard deviation for each feature) and then transforms the data, scaling it so that each feature has a mean of 0 and a standard deviation of 1. The resulting scaled data is stored in scaled_data.
Displaying the scaled data:
- print(scaled_data): Prints the scaled data to the console. Each feature will have been transformed to have a mean of 0 and a standard deviation of 1. This ensures that all features are on a comparable scale, which can improve the performance of certain machine learning algorithms.

When the code is executed the following output is obtained.

[[-0.16222142  0.11624764  0.26726124]
 [-1.13554995 -1.27872403 -1.33630621]
 [ 1.29777137  1.16247639  1.06904497]]
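The standardization formula can be applied by hand to confirm the result. This is a minimal NumPy sketch; note that StandardScaler uses the population standard deviation (ddof=0), which is also NumPy's default. The names mean, std and manual_scaled are illustrative.

```python
import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)

manual_scaled = (data - mean) / std
print(manual_scaled)
print(manual_scaled.mean(axis=0))  # approximately 0 per column
print(manual_scaled.std(axis=0))   # approximately 1 per column
```

For the third column the mean is 2 and the variance is 14, so the scaled values are 1, -5 and 4 divided by the square root of 14.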

This is the end of the tutorial on feature scaling techniques. Please try the code described in the post, and if you have any questions regarding this tutorial, please leave a comment below. Thank you.

Sunday, December 8, 2024

How to use Pandas for Preprocessing Machine learning datasets?

Machine learning (ML) models require clean and well-structured datasets. One of the most used Python libraries for preprocessing ML datasets is Pandas, alongside NumPy. The pandas library allows you to handle missing data, transform categorical variables, normalize numerical features, and perform other essential preprocessing tasks with ease.
In this post we will cover the basics of preprocessing ML datasets using the Pandas library.

How to load your dataset using Pandas?

The Pandas library makes it easy to load data from various sources. The most commonly used sources/formats are CSV, Excel, and SQL databases. The majority of publicly available datasets on websites such as the UCI Machine Learning Repository or Kaggle are in CSV format.
Before loading the dataset, the pandas library must be imported. In most online examples of Python code the pandas library is imported in the following way.

import pandas as pd

Using the previous code block, the pandas library is imported as pd. The name pd is an alias through which we can access all pandas methods and functions.
To import a dataset in CSV format we will use the pandas function read_csv.

data = pd.read_csv("dataset.csv")

With pd.read_csv in the previous line we accessed the read_csv function of the pandas library through the pd alias. Inside the parentheses of read_csv() you put the name of the dataset file. If the Python script is located in the same folder as "dataset.csv", you only need to type "dataset.csv", i.e. the full file name including the extension. If, however, the dataset is located in another folder, then you have to type the path to that folder followed by the file name. To look at the first five rows of the dataset, type the following code.

print(data.head())

data.head() shows the first 5 rows of dataset.csv. Since the dataset is loaded and stored in the data variable, the pandas head() function is called on it to return the first five rows, and print displays them. The full code is shown below.

import pandas as pd
data = pd.read_csv("dataset.csv") # Only if the dataset is located inside the same folder as the python script
print(data.head())
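When the file lives in a different folder, the full path is passed to read_csv instead of the bare file name. The sketch below is self-contained: it first writes a tiny CSV into a temporary folder and then loads it by path. The folder, the column names a and b, and the values are all made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small CSV in a temporary folder so the example is self-contained.
folder = Path(tempfile.mkdtemp())
csv_path = folder / "dataset.csv"
pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}).to_csv(csv_path, index=False)

# Pass the full path (folder + file name) when the file is not next to the script.
data = pd.read_csv(csv_path)
print(data.head())
```

In your own script, csv_path would simply point to wherever the dataset is stored on disk.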

How to handle missing data?

Missing data is a common problem in ML datasets, and it must be handled before training ML algorithms. The Pandas library offers several ways to handle missing values: the rows (or columns) with missing values can be dropped from the dataset, or the missing values can be imputed. However, before dropping or imputing, the dataset must be checked for missing values.

Checking the dataset for missing values?

After the required libraries are imported in the Python script and the dataset is loaded with the read_csv function, use the pandas isnull() function followed by the sum() function to check whether the dataset contains any missing values.

print(data.isnull().sum())

Using only the isnull() function returns a table of the same size as the dataset, filled with True and False values. A False value indicates that the cell contains a value, while a True value indicates that the cell is missing a value. This output is hard to read because, for a large dataset, print() truncates the table instead of showing all of it.
On the other hand, if we call sum() after isnull(), the result lists every dataset variable together with the number of missing values in that variable. However, if the dataset contains a large number of variables, the printed list is truncated and not all of them are shown.
To show the total number of cells with missing values, type the following code.

print(data.isnull().sum().sum())

The additional sum() function is added to the data.isnull().sum() code line. The output of this code line will be an integer that represents the total number of missing values (empty cells) in the dataset. The entire code used in this example is given below.

import pandas as pd
data = pd.read_csv("dataset.csv") # Enter the proper name of the dataset
print(data.isnull())  # Same size as the dataset; each cell is True if empty and False otherwise
print(data.isnull().sum())  # Number of missing values (empty cells) per variable; hard to read with many input variables
print(data.isnull().sum().sum())  # A single number: the total count of empty cells in the dataset
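For datasets with many variables, one way around the truncated listing is to keep only the variables that actually have missing values. The following sketch uses a small made-up DataFrame (the columns age, income and city are hypothetical) so that it runs without an external file.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with three missing cells
data = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "income": [50000, 62000, np.nan, np.nan],
    "city": ["A", "B", "C", "D"],
})

missing_per_column = data.isnull().sum()
total_missing = data.isnull().sum().sum()

# Show only the columns that actually contain missing values
print(missing_per_column[missing_per_column > 0])
print(total_missing)
```

Here age has one missing value, income has two, and city has none, so the total is three.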

How to drop the missing values from the dataset?

When the dataset is loaded and checked for missing values, the dropna function is used to remove rows or columns with missing values from the dataset. To remove the dataset rows with missing values, type in the following code.

data = data.dropna()

To drop columns with missing values you have to set the axis parameter of the dropna function. By default axis=0, which removes the dataset rows with missing values. To remove columns with missing values, axis=0 must be changed to axis=1.

data = data.dropna(axis = 1)
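The difference between the two axis settings is easiest to see on a tiny example. The sketch below builds a made-up 3x3 DataFrame with two missing cells and compares the shapes after dropping rows and after dropping columns; the column names a, b and c are arbitrary.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [7.0, 8.0, np.nan],
})

rows_dropped = data.dropna()        # axis=0: keeps only the first row
cols_dropped = data.dropna(axis=1)  # axis=1: keeps only column "b"

print(rows_dropped.shape)  # (1, 3)
print(cols_dropped.shape)  # (3, 1)
```

Note that dropna returns a new DataFrame and leaves the original untouched, which is why the result is assigned back to a variable.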

How to impute missing values?

Missing values are empty cells in the pandas dataset and, as previously stated, they can be excluded from the analysis by dropping rows or columns. However, dropping discards data. A quick alternative is to fill empty numerical cells with the column mean and missing categorical values with the mode (the most frequent value).

# Fill missing numerical values with the column mean
data['numerical_column'] = data['numerical_column'].fillna(data['numerical_column'].mean())
# Fill missing categorical values with the mode
data['categorical_column'] = data['categorical_column'].fillna(data['categorical_column'].mode()[0])
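As a worked example, the same two lines can be run on a small made-up DataFrame. The values below are invented for illustration; the column names match the placeholders used above.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "numerical_column": [10.0, np.nan, 30.0],
    "categorical_column": ["red", "red", None],
})

# mean() skips missing values, so the mean of 10 and 30 is 20
data["numerical_column"] = data["numerical_column"].fillna(data["numerical_column"].mean())

# mode() returns a Series (there can be ties), so [0] picks the first most frequent value
data["categorical_column"] = data["categorical_column"].fillna(data["categorical_column"].mode()[0])
print(data)
```

After imputation the numerical column is 10, 20, 30 and every categorical value is "red".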

Encoding categorical variables?

Often the variables of a pandas dataset have categorical values that have to be transformed into numeric values so that an ML algorithm can be trained on the dataset. To transform categorical values into numerical ones, the categorical variables have to be encoded. There are several ways to encode categorical variables; two that can be done with the pandas library are one-hot encoding and label encoding.

One-hot encoding

One-hot encoding creates a binary column for each category in a categorical variable. This method is used, for example, when you want to transform a multiclass target variable (a variable that contains labels 1, 2, 3, 4) into multiple binary columns. Four columns are created, one per label in the original column. Each column contains 0 and 1 values: 0 for the samples that do not have the specific label and 1 for the samples that originally had it.
One-hot encoding is performed with the get_dummies() function. To demonstrate it, we will define a simple pandas dataframe stored in the variable y.

y = pd.DataFrame([1,2,3,4,2,3,4,1,2,3,4,1,2,3,4], columns = ['Class'])

This variable is a pandas dataframe with one column containing the labels 1, 2, 3, and 4. To transform this column into four columns, one per class, where for instance the Class_4 column has 1 for the samples labeled 4 and 0 for all other samples, we use the get_dummies() function.

y_raw = pd.get_dummies(y['Class'], prefix="Class", dtype =float)

The y['Class'] will be transformed into pandas dataframe with four columns where each column will have "Class_" name followed by the corresponding label (1,2,3, and 4). The dtype is float since we want numeric values 0 and 1 in each column not True and False values. If you want the True and False values just don't define dtype. The output is shown below.

     Class_1  Class_2  Class_3  Class_4
0       1.0      0.0      0.0      0.0
1       0.0      1.0      0.0      0.0
2       0.0      0.0      1.0      0.0
3       0.0      0.0      0.0      1.0
4       0.0      1.0      0.0      0.0
5       0.0      0.0      1.0      0.0
6       0.0      0.0      0.0      1.0
7       1.0      0.0      0.0      0.0
8       0.0      1.0      0.0      0.0
9       0.0      0.0      1.0      0.0
10      0.0      0.0      0.0      1.0
11      1.0      0.0      0.0      0.0
12      0.0      1.0      0.0      0.0
13      0.0      0.0      1.0      0.0
14      0.0      0.0      0.0      1.0

As seen from the result, the first (unnamed) column is the dataframe index, which numbers the samples from 0 to 14. The pd.get_dummies() function turned one column into four columns, each corresponding to a specific class: Class_1 for label 1 in the original Class column, Class_2 for label 2, Class_3 for label 3, and Class_4 for label 4.
In the Class_1 column, all the samples that were equal to 1 in the original Class column are also equal to 1. All samples that did not have label 1 in the original Class column are equal to 0. For example, the second sample has label 2 in the original Class column, so in Class_1 the second sample is equal to 0. In the Class_2 column, however, the second sample has a value of 1.

Label encoding

For ordinal data, you can replace the categorical values with numerical values that preserve their order.

data['ordinal_column'] = data['ordinal_column'].map({'Low': 1, 'Medium': 2, 'High': 3})
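One detail worth knowing about map is that any category missing from the dictionary becomes NaN. The sketch below shows this on a made-up column that contains an unexpected "Very High" value; the new column name ordinal_encoded is chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"ordinal_column": ["Low", "High", "Medium", "Very High"]})

order = {"Low": 1, "Medium": 2, "High": 3}
data["ordinal_encoded"] = data["ordinal_column"].map(order)

# "Very High" is not in the mapping, so it becomes NaN
print(data)
print(data["ordinal_encoded"].isnull().sum())
```

Checking for NaN right after the mapping is a cheap way to catch categories that were forgotten in the dictionary.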

Feature Scaling

Scaling ensures that numerical features contribute on a comparable scale to the model. Unfortunately, the pandas library does not provide dedicated scaling/normalization classes; however, the scikit-learn library offers the following scaling/normalization techniques: MaxAbsScaler, MinMaxScaler, Normalizer, PowerTransformer, RobustScaler, and StandardScaler.

Example: How to use StandardScaler?

This is just a simple example of how to use the StandardScaler. The first step is to import the StandardScaler from the preprocessing module of the scikit-learn library.

import pandas as pd
from sklearn.preprocessing import StandardScaler

Then the pandas dataset is loaded and the StandardScaler is applied to it. Before the scaler is applied, the output (target) variable must be popped out of the dataset, since scaling the target variable is generally not recommended.

data = pd.read_csv("dataset_name.csv")
y = data.pop('Output')  # remove the target column so it is not scaled
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
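Since fit_transform returns a NumPy array, the column names are lost after scaling. The sketch below runs the same steps on a small made-up DataFrame (the columns feature1, feature2 and Output are hypothetical) and then wraps the result back into a DataFrame.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset; "Output" plays the role of the target variable
data = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [10.0, 20.0, 30.0, 40.0],
    "Output": [0, 1, 0, 1],
})

y = data.pop("Output")                      # remove the target so it is not scaled
scaler = StandardScaler()
scaled_values = scaler.fit_transform(data)  # returns a NumPy array

# Wrap the array back into a DataFrame to keep column names and index
data_scaled = pd.DataFrame(scaled_values, columns=data.columns, index=data.index)
print(data_scaled)
```

Keeping the result as a DataFrame makes it easier to inspect individual features later on.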

Normalization

Dataset normalization, i.e. normalization of the input variables, can be performed using the Normalizer class available in the preprocessing module of the scikit-learn library. Note that the Normalizer rescales each sample (row) to unit norm rather than each column. The first step is to import the required libraries, in this case pandas and the sklearn.preprocessing module.

import pandas as pd
from sklearn.preprocessing import Normalizer

Then the dataset must be loaded using the pd.read_csv function and the target variable must be popped out.

data = pd.read_csv("dataset_name.csv")
y = data.pop('output')

The next and final step is to create the Normalizer object, store it under an arbitrary variable name, and then apply it to the dataset input variables.

model = Normalizer()
data_scaled = model.fit_transform(data)

Remove Outliers

Outliers can skew your ML model so they must be handled properly. Pandas makes it easy to detect and handle outliers.

Using the IQR method

Q1 = data['numerical_column'].quantile(0.25)
Q3 = data['numerical_column'].quantile(0.75)
IQR = Q3 - Q1
# Filter out rows with outliers
data = data[(data['numerical_column'] >= Q1 - 1.5 * IQR) & (data['numerical_column'] <= Q3 + 1.5 * IQR)]
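To make the thresholds concrete, here is the same method applied to a small made-up column with one obvious outlier. The values are invented, and the names lower, upper and filtered are illustrative; between() is used as a shorter way of writing the two comparisons.

```python
import pandas as pd

data = pd.DataFrame({"numerical_column": [10, 12, 11, 13, 12, 100]})

Q1 = data["numerical_column"].quantile(0.25)  # 11.25
Q3 = data["numerical_column"].quantile(0.75)  # 12.75
IQR = Q3 - Q1                                 # 1.5

lower = Q1 - 1.5 * IQR  # 9.0
upper = Q3 + 1.5 * IQR  # 15.0

# Keep only the rows that fall inside [lower, upper]
filtered = data[data["numerical_column"].between(lower, upper)]
print(filtered)
```

The value 100 lies above the upper bound of 15 and is removed, while the other five rows are kept.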

Feature engineering

You can also create new features or modify existing ones using Pandas. In this example we will create a new feature by adding two existing features from the dataset.

data['new_feature'] = data['feature1'] + data['feature2']

Binning

Binning numerical variables can convert them into categorical variables.

# Bin numerical values into categories
data['binned_feature'] = pd.cut(data['numerical_column'], bins=[0, 50, 100], labels=['Low', 'High'])
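One subtlety of pd.cut is that its intervals are closed on the right by default, so a value equal to the lowest bin edge is not assigned to any bin. The sketch below demonstrates this on made-up values and shows the include_lowest parameter as one way to handle it.

```python
import pandas as pd

data = pd.DataFrame({"numerical_column": [0, 10, 50, 75, 100]})

# Default intervals are right-closed: (0, 50] and (50, 100], so 0 falls outside
default_bins = pd.cut(data["numerical_column"], bins=[0, 50, 100], labels=["Low", "High"])

# include_lowest=True also places the left edge (0) in the first bin
inclusive_bins = pd.cut(
    data["numerical_column"], bins=[0, 50, 100], labels=["Low", "High"], include_lowest=True
)

print(default_bins.tolist())
print(inclusive_bins.tolist())
```

With the default settings the value 0 becomes NaN, while 50 lands in "Low" and 100 in "High".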

Saving and Loading Preprocessed Data

After preprocessing, it's often a good idea to save your data for later use.

# Save the preprocessed dataset
data.to_csv('preprocessed_data.csv', index=False)
# Load it again when needed
data = pd.read_csv('preprocessed_data.csv')

Conclusion

Pandas is an incredibly powerful tool for preprocessing machine learning datasets. Together with scikit-learn, it allows you to handle missing data, encode categorical variables, scale numerical features, remove outliers, and even engineer new features, all with minimal code. By mastering these preprocessing techniques, you can prepare clean, structured datasets for your machine learning models, which typically leads to better performance and more accurate predictions.
If you have any questions or need further clarification, feel free to drop a comment below!