Tuesday, January 28, 2025

Feature Scaling Techniques in Machine Learning

Feature scaling is essential for many machine learning algorithms to perform well. In this section, we will describe several feature scaling techniques, provide a simple example dataset, and showcase the results of applying each technique.

1. MaxAbsScaler

The MaxAbsScaler scales each feature by its maximum absolute value. It scales the data to a range between -1 and 1 while preserving the sparsity of the dataset (if any). This method is useful when the data is already centered around zero and you want to maintain its sparsity.

Example:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np
    
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])
    
scaler = MaxAbsScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
   
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import MaxAbsScaler: Imports the MaxAbsScaler from the sklearn.preprocessing module, which is used to scale each feature by its maximum absolute value.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values include both positive and negative numbers.
  • Creating the scaler object:
    • scaler = MaxAbsScaler(): Creates an instance of the MaxAbsScaler, which scales the data by its maximum absolute value.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): Applies the MaxAbsScaler to the dataset by fitting the scaler to the data and then transforming it. The result is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the transformed data to the console. The data is scaled so that each feature is divided by its maximum absolute value, resulting in values between -1 and 1.
After the code is executed, the following output is obtained.

    [[ 0.25  0.4   0.5 ]
     [-0.25 -0.4  -0.5 ]
     [ 1.    1.    1.  ]]
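
The description above notes that MaxAbsScaler preserves sparsity. As a hedged illustration of that point (an addition to the original example, assuming SciPy is installed), the following sketch applies the same scaler to a SciPy sparse matrix:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A sparse matrix with many zero entries (hypothetical data)
sparse_data = csr_matrix([[0, 2, 0], [0, 0, -3], [4, 0, 6]])

scaler = MaxAbsScaler()
scaled_sparse = scaler.fit_transform(sparse_data)

# The result is still a sparse matrix; the zero entries remain zero
print(type(scaled_sparse))
print(scaled_sparse.toarray())

Because only the nonzero entries are divided by each column's maximum absolute value, the zero pattern of the matrix is unchanged.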
            

2. MinMaxScaler

The MinMaxScaler transforms the data into a fixed range, usually between 0 and 1. The formula is:

\begin{equation} X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}} \end{equation}

This scaler is useful when features have different units or scales and you need to bring them into a common range, for example to help gradient-based optimizers converge.

Example:

from sklearn.preprocessing import MinMaxScaler
import numpy as np
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]) 
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import MinMaxScaler: Imports the MinMaxScaler from the sklearn.preprocessing module, which is used to scale features to a specified range, typically between 0 and 1.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in the array are a mix of positive and negative numbers.
  • Creating the scaler object:
    • scaler = MinMaxScaler(): Creates an instance of the MinMaxScaler. This scaler transforms the data by scaling each feature to a given range (default is between 0 and 1), based on the minimum and maximum values of each feature.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the MinMaxScaler to the data (i.e., computes the minimum and maximum values for each feature) and then transforms the data by scaling each feature to the range [0, 1]. The resulting transformed data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the scaled data to the console. The values in each column are now transformed to lie between 0 and 1, according to the minimum and maximum values of each feature in the original data.
When the previous code is executed, the following output is obtained.
[[0.4        0.57142857 0.66666667]
 [0.         0.         0.        ]
 [1.         1.         1.        ]]
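
To connect this output back to the formula above, here is a minimal sketch (an addition to the original post) that reproduces the same scaling by hand with NumPy, using the same data array:

import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

# Column-wise minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Apply X_scaled = (X - X_min) / (X_max - X_min) to each column
manual_scaled = (data - col_min) / (col_max - col_min)
print(manual_scaled)  # should match the MinMaxScaler output above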

3. Normalizer

The Normalizer scales each sample (row) to have a unit norm (magnitude of 1). This is useful when you want to scale each observation independently of the others.

Example:

from sklearn.preprocessing import Normalizer
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]) 
scaler = Normalizer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import Normalizer: Imports the Normalizer from the sklearn.preprocessing module, which is used to normalize the dataset. Normalization scales each sample (row) to have a unit norm.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in this array include both positive and negative numbers.
  • Creating the scaler object:
    • scaler = Normalizer(): Creates an instance of the Normalizer class. The Normalizer scales each sample (row) in the dataset to have a unit norm (i.e., the Euclidean norm of the row is 1).
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the Normalizer to the dataset (calculates the necessary values for normalization) and then transforms the data, scaling each row to have a unit norm. The resulting scaled data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the normalized data to the console. Each row in the output will have a Euclidean norm of 1, meaning that the sum of squares of the elements in each row will be equal to 1.
When the code block is executed, the following output is obtained.
[[ 0.26726124  0.53452248  0.80178373]
 [-0.26726124 -0.53452248 -0.80178373]
 [ 0.45584231  0.56980288  0.68376346]]
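
As a quick sanity check on the claim that every row now has unit norm, the following short sketch (an addition to the original example) recomputes the Euclidean norm of each row:

import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])
scaled_data = Normalizer().fit_transform(data)

# The L2 norm of each row should be 1.0 (up to floating-point error)
print(np.linalg.norm(scaled_data, axis=1))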

4. PowerTransformer

The PowerTransformer applies a power transformation to make data more Gaussian-like. It includes two methods: the Box-Cox and Yeo-Johnson transformations, which are useful for correcting skewed data.

Example:

from sklearn.preprocessing import PowerTransformer
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = PowerTransformer()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import PowerTransformer: Imports the PowerTransformer from the sklearn.preprocessing module, which is used to apply power transformations to make data more Gaussian (normal) by applying a nonlinear transformation to the features.
    • import numpy as np: Imports the numpy library, which is used for creating and manipulating arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The dataset contains both positive and negative numbers.
  • Creating the scaler object:
    • scaler = PowerTransformer(): Creates an instance of the PowerTransformer. This scaler applies a power transformation to each feature to make its distribution closer to a normal (Gaussian) distribution. By default it uses the Yeo-Johnson transformation, which handles positive, zero, and negative values; the Box-Cox transformation can be requested instead via the method parameter, but it requires strictly positive data (a sketch of selecting the method explicitly follows the output below).
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the PowerTransformer to the dataset (calculates the necessary transformation parameters) and then transforms the data. The transformed data is stored in scaled_data. The result is a transformed dataset that aims to have a more Gaussian distribution for each feature.
  • Displaying the scaled data:
    • print(scaled_data): Prints the transformed data to the console.
When the previous code is executed, the following output is obtained.
[[-0.75592499 -0.75592499 -0.75592499]
[ 0.75592499  0.75592499  0.75592499]
[-1.60169291 -1.60169291 -1.60169291]]
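
As referenced above, the transformation method can be chosen explicitly. Here is a hedged sketch (an addition to the original post): Yeo-Johnson, the default, accepts negative values, while Box-Cox requires strictly positive data and is therefore shown on a separate, purely positive array.

from sklearn.preprocessing import PowerTransformer
import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

# Yeo-Johnson (the default) works with positive, zero, and negative values
yeo = PowerTransformer(method='yeo-johnson')
print(yeo.fit_transform(data))

# Box-Cox requires strictly positive values, so a different (hypothetical) array is used
positive_data = np.array([[1, 2, 3], [2, 4, 6], [4, 5, 6]])
box_cox = PowerTransformer(method='box-cox')
print(box_cox.fit_transform(positive_data))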

5. RobustScaler

The RobustScaler uses the median and interquartile range for scaling, making it robust to outliers. It scales the data by subtracting the median and dividing by the interquartile range.

Example:

from sklearn.preprocessing import RobustScaler
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import RobustScaler: Imports the RobustScaler from the sklearn.preprocessing module, which is used for scaling the features of the dataset using the median and interquartile range (IQR) instead of mean and standard deviation.
    • import numpy as np: Imports the numpy library, which is used for creating and manipulating arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The dataset contains both positive and negative numbers.
  • Creating the scaler object:
    • scaler = RobustScaler(): Creates an instance of the RobustScaler. This scaler transforms the data by using the median and the interquartile range (IQR) for scaling, which makes it robust to outliers. It is particularly useful when the dataset has extreme outliers that could affect scaling using standard techniques.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code fits the RobustScaler to the dataset (calculates the necessary values like median and IQR for scaling) and then transforms the data. The resulting scaled data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the scaled data to the console. The values are scaled by subtracting the median of each feature and then dividing by the interquartile range (IQR), making them less sensitive to outliers.
When the previous code block is executed, the following output is obtained.
[[ 0.          0.          0.        ]
 [-0.8        -1.14285714 -1.33333333]
 [ 1.2         0.85714286  0.66666667]]
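
To make the median and IQR arithmetic concrete, here is a minimal sketch (an addition to the original post) that reproduces the scaling manually with NumPy, assuming the scaler's default quantile range of the 25th to 75th percentiles:

import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

# Column-wise median and interquartile range (75th minus 25th percentile)
median = np.median(data, axis=0)
q1 = np.percentile(data, 25, axis=0)
q3 = np.percentile(data, 75, axis=0)
iqr = q3 - q1

# Subtract the median and divide by the IQR, as RobustScaler does
manual_scaled = (data - median) / iqr
print(manual_scaled)  # should match the RobustScaler output above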

6. StandardScaler

The StandardScaler standardizes features by removing the mean and scaling to unit variance. The formula is:

\begin{equation} X_{scaled} = \frac{X - \text{mean}}{\text{std}} \end{equation}

This method is useful when the data follows a Gaussian distribution or when features have different variances.

Example:

from sklearn.preprocessing import StandardScaler
import numpy as np 
# Example dataset
data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])     
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
The previous code block consists of the following steps:
  • Importing necessary libraries:
    • from sklearn.preprocessing import StandardScaler: Imports the StandardScaler from the sklearn.preprocessing module, which is used to scale the dataset by transforming it into a distribution with a mean of 0 and a standard deviation of 1.
    • import numpy as np: Imports the numpy library, which is used to create and manipulate arrays.
  • Defining the example dataset:
    • data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]]): Creates a 3x3 NumPy array called data with 3 rows and 3 columns. The values in the array include both positive and negative numbers.
  • Creating the scaler object:
    • scaler = StandardScaler(): Creates an instance of the StandardScaler. This scaler transforms the data to have a mean of 0 and a standard deviation of 1. It is commonly used when the features in the dataset are on different scales.
  • Applying the scaler to the dataset:
    • scaled_data = scaler.fit_transform(data): This line of code first fits the StandardScaler to the dataset (calculates the mean and standard deviation for each feature) and then transforms the data, scaling it so that each feature has a mean of 0 and a standard deviation of 1. The resulting scaled data is stored in scaled_data.
  • Displaying the scaled data:
    • print(scaled_data): Prints the scaled data to the console. Each feature will have been transformed to have a mean of 0 and a standard deviation of 1. This ensures that all features are on a comparable scale, which can improve the performance of certain machine learning algorithms.
When the code is executed, the following output is obtained.
[[-0.16222142  0.11624764  0.26726124]
 [-1.13554995 -1.27872403 -1.33630621]
 [ 1.29777137  1.16247639  1.06904497]]
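
As with the other scalers, the formula can be verified by hand. The sketch below (an addition to the original post) uses NumPy's population standard deviation (ddof=0), which is what StandardScaler uses internally:

import numpy as np

data = np.array([[1, 2, 3], [-1, -2, -3], [4, 5, 6]])

# Column-wise mean and (population) standard deviation
mean = data.mean(axis=0)
std = data.std(axis=0)

# Apply X_scaled = (X - mean) / std to each column
manual_scaled = (data - mean) / std
print(manual_scaled)  # should match the StandardScaler output above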

As you can see, each scaling technique transforms the data differently based on the chosen method. Understanding the impact of each method on your data can help improve model performance and convergence.
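
In practice, a scaler is usually combined with a model so that the scaling parameters are learned only from the training data. The following hedged sketch (an addition to the original post, using a synthetic dataset) places StandardScaler inside a scikit-learn Pipeline and evaluates it with cross-validation:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (hypothetical example)
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# The pipeline fits the scaler on each training fold only, avoiding data leakage
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())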

This is the end of the tutorial on feature scaling techniques. Please try the code described in the post, and if you have any questions regarding this tutorial, please leave a comment below. Thank you.
