PYTHONHOLICS: What is supervised learning?

In machine learning, algorithms learn in several ways: supervised, unsupervised, and reinforcement learning. Supervised learning is one of the most fundamental and widely used approaches in the field of ML. As the name suggests, the ML algorithm learns from a supervisor - in this case, a dataset that contains input features (variables) and their corresponding labeled outputs (one or more target variables). This labeled data serves as a guide that helps the machine learning model make predictions on new data, that is, data the trained ML model has not seen before. In this blog post we will break down the basics of supervised learning, explain its key components, and discuss some popular algorithms that are commonly used in this type of ML.

What is the concept of supervised learning?

The concept of supervised learning can be explained with the following example. Imagine you are teaching a child to identify types of fruit. You show the child a collection of pictures of fruits along with their names. For example, you show the child a picture of an apple and say "This is an apple", or a picture of a banana and say "This is a banana." Over time, the child learns to recognize these fruits even in images they have not seen before.
This is how supervised learning works. The ML model is trained on a dataset where each input is paired with a corresponding output label (target variable value). The model learns the relationship between inputs and outputs so that it can predict the correct output for new (unseen) inputs.
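
To make the fruit analogy concrete, here is a minimal sketch of one very simple supervised learner, a 1-nearest-neighbor classifier, written with only the Python standard library. The features (weight and length), the numbers, and the function name `predict` are assumptions made up for illustration; real projects would typically use a library such as scikit-learn.

```python
import math

# Hypothetical labeled dataset: each input is (weight in grams, length in cm),
# paired with the fruit name as its label. All numbers are made up.
training_data = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((160.0, 8.0), "apple"),
    ((120.0, 18.0), "banana"),
    ((130.0, 20.0), "banana"),
    ((110.0, 17.0), "banana"),
]


def predict(features, data):
    """Return the label of the training example closest to `features`."""
    _, nearest_label = min(
        data, key=lambda pair: math.dist(pair[0], features)
    )
    return nearest_label


# Inputs the model has not seen during training.
print(predict((155.0, 7.2), training_data))   # apple
print(predict((125.0, 19.0), training_data))  # banana
```

The "training" here is just storing the labeled examples; the prediction for an unseen input is the label of the most similar stored example.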

What are the key components of supervised learning?

The key components of supervised learning are input features (input variables), labels (target/output variable), training data, and test data.

Input features (input variables) (\(X\)) - these are the data points or independent variables used to make predictions. For example, in a dataset used to predict house prices, the features (input variables) might include the number of bedrooms, square footage, location, parking space, and so on.
Labels (target/output variable) (\(y\)) - these are the output values or dependent variables that the ML model is trying to predict. For example, in a house price prediction task, the label would be the price of the house.
Training data - This is the dataset used to train the ML model. The usual procedure is to split the initial dataset into training and test parts in a ratio such as 60:40, 70:30, 80:20 or 90:10. With the 60:40 ratio, 60% of the dataset is used for training and the remaining 40% is used for testing. In other words, the ML model is trained on 60% of the dataset and then tested on the other 40%. The same holds for the other ratios. Note that both train and test data contain both input features (input variables) and labels (target/output variable). The ML model learns the patterns and relationships from the training data. A good practice is to randomly shuffle the dataset (randomly change the order of its samples (rows)) before splitting it into train and test data.
Test data - Once the model is trained, its performance is evaluated on the held-out part of the dataset, called the test data. It is usually good practice to evaluate the performance on both the train and test parts of the dataset.
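
The shuffle-then-split procedure described above can be sketched in a few lines of plain Python. The helper name `shuffle_and_split`, its parameters, and the house data are hypothetical and used only for illustration; in practice scikit-learn's `train_test_split` is commonly used for this.

```python
import random


def shuffle_and_split(X, y, test_ratio=0.2, seed=42):
    """Shuffle the samples, then split them into train and test parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility

    n_test = round(len(X) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Features and labels are indexed together so each pair stays aligned.
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Made-up house data: features are (bedrooms, square footage), labels are prices.
X = [(2, 850), (3, 1200), (3, 1400), (4, 1800), (2, 900),
     (5, 2400), (4, 2000), (3, 1300), (1, 600), (4, 1700)]
y = [150_000, 210_000, 240_000, 310_000, 160_000,
     420_000, 340_000, 225_000, 105_000, 295_000]

# An 80:20 split of the 10 samples.
X_train, X_test, y_train, y_test = shuffle_and_split(X, y, test_ratio=0.2)
print(len(X_train), len(X_test))  # 8 2
```

Shuffling indices rather than the rows themselves keeps every feature row paired with its own label after the split.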

What are the types of supervised learning?

Supervised learning tasks are broadly classified into two categories: regression and classification.

Regression - Regression is used when the output label (target variable) is continuous, which means it can take any numeric value. An example of regression is predicting the price of a house based on its features (input variables) such as size, location and number of bedrooms. Common regression algorithms are Linear Regression, Decision Tree Regressor, Multi-Layer Perceptron Regressor, Support Vector Regressor (SVR), and others.
Classification - Classification is used when the output label is categorical, which means it belongs to one of a predefined set of classes. Good examples of classification are deciding whether an email is spam or not, or identifying the type of fruit in an image (apple, banana, etc.). Commonly used classification algorithms are Logistic Regression, k-Nearest Neighbors (k-NN), Decision Tree Classifier, Random Forest Classifier, Support Vector Classifier (SVC), Multi-Layer Perceptron Classifier, and others.
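
As a small illustration of regression, the sketch below fits a straight line to one feature using the least-squares formulas for simple linear regression. The function name `fit_line` and the data are assumptions for illustration; the prices are deliberately constructed to follow an exact linear rule so the result is easy to check by hand.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: house size in square feet and price in thousands of dollars.
sizes = [800, 1000, 1200, 1500, 1800, 2000]
prices = [160, 200, 240, 300, 360, 400]  # constructed as 0.2 * size

slope, intercept = fit_line(sizes, prices)

# The prediction is a continuous number, not a class label.
predicted = slope * 1600 + intercept
print(round(slope, 3), round(predicted, 1))
```

A classifier would instead return one of a fixed set of classes (for example "spam" or "not spam") for each input.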

How does supervised learning work?

The supervised learning process typically involves the following steps:

Collect and Preprocess the Data - The first step in supervised learning is to gather a dataset with labeled examples. After the data is collected, it must be cleaned and preprocessed: handle missing values, convert string-typed features/variables to numeric values, perform statistical analysis and correlation analysis (to investigate the correlation between features (input variables) and labels (output/target variable)), scale/normalize the features (input variables), and split the data into train and test datasets.
Choose a Model - Select an appropriate supervised learning algorithm based on the problem (classification or regression) and the dataset size. A good practice is to select several ML models and measure their classification or regression performance. Once their performance has been compared, the best-performing models can be selected for further analysis.
Train the Model - Fit each ML model on the training data, which allows it to learn the relationships between inputs and outputs.
Evaluate the Model - Test the model's performance on unseen data (the test set) using metrics that depend on the task: for classification, accuracy, precision, recall, and area under the receiver operating characteristic curve; for regression, coefficient of determination, mean absolute error, mean squared error, root mean squared error, and mean absolute percentage error.
Fine-Tune the Model - The idea in this step is to tune the hyperparameters of the ML model to improve its classification/regression performance. Most ML models have hyperparameters, and tuning their values can increase performance (for example, accuracy or coefficient of determination). In other words, the goal is to find the combination of hyperparameter values that gives the best performance. This combination can be searched for in several ways, for example with grid search, random search, Bayesian optimization, or evolutionary algorithms.
Deploy the Model - Use the trained model to make predictions on real-world data.
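
To give a feel for the evaluation step above, here is a hedged sketch that computes several of the listed metrics from their standard definitions using only the standard library. The function names and the true/predicted values are made up for illustration; libraries such as scikit-learn ship ready-made versions of these metrics.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE and coefficient of determination (R^2)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse),
            "r2": 1 - ss_res / ss_tot}


# Made-up test-set results for a binary classifier (1 = spam, 0 = not spam).
labels_true = [1, 0, 1, 1, 0, 0, 1, 0]
labels_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(labels_true, labels_pred))          # 0.75
print(precision_recall(labels_true, labels_pred))  # (0.75, 0.75)

# Made-up test-set results for a house price regressor (thousands of dollars).
prices_true = [200, 250, 300, 350]
prices_pred = [210, 240, 310, 340]
print(regression_metrics(prices_true, prices_pred))
```

Which metric matters most depends on the task; for example, recall is often emphasized when missing a positive case is costly.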

Advantages and disadvantages of supervised learning

Supervised learning has its advantages and disadvantages. The advantages of supervised learning are:

Accuracy - Supervised learning models can achieve high accuracy when trained on quality labeled data.
Versatility - It is applicable to a wide range of problems, from predicting numerical values to classifying objects.
Interpretability - Many supervised learning algorithms (for example, linear regression) are interpretable, allowing insights into the data.

The disadvantages or challenges of supervised learning are:

Data Dependency - It requires a large and diverse labeled dataset, which can be costly and time-consuming to collect.
Overfitting - Models can memorize the training data instead of generalizing, which can lead to poor performance on unseen data.
Computational Resources - Training complex models (for example, neural networks) can be computationally expensive.
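
The overfitting challenge listed above can be illustrated with a deliberately contrived sketch. It uses a small k-nearest-neighbors classifier on one-dimensional data in which two training labels are intentionally wrong (noise). The dataset, the helper names, and the chosen values of the hyperparameter `k` are all assumptions for illustration, and real datasets rarely show the effect this cleanly.

```python
from collections import Counter


def knn_predict(x, train, k):
    """Majority vote among the k training points closest to x (1-D inputs)."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(data, train, k):
    """Fraction of samples in `data` that the k-NN model labels correctly."""
    hits = sum(knn_predict(x, train, k) == label for x, label in data)
    return hits / len(data)


# Underlying rule: class 0 for small x, class 1 for large x.
# The samples at x=3 and x=8 are deliberately mislabeled (noise).
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0),
         (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
# Unseen samples labeled by the underlying rule, without noise.
test = [(1.4, 0), (2.9, 0), (3.1, 0), (7.9, 1), (8.1, 1), (9.6, 1)]

for k in (1, 3):
    print(k, accuracy(train, train, k), accuracy(test, train, k))
```

With `k=1` the model reproduces every training label, including the two noisy ones, so its training accuracy is perfect while its accuracy on the unseen samples is lower than with `k=3`. This is also why performance is compared on both the train and test parts of the dataset.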

Popular applications of supervised learning

Over the years, supervised learning has been applied in healthcare, finance, e-commerce, natural language processing, and computer vision. In healthcare it is used for diagnosing diseases from medical images (X-rays, MRIs). In finance it is used for detecting fraudulent transactions or predicting stock prices. In e-commerce it is used for recommending products based on user behavior. In natural language processing it is used for sentiment analysis or spam email detection. In computer vision it is used for recognizing objects in images or videos.

Conclusion

Supervised learning is a powerful machine learning technique that enables models to learn from labeled data. Its applications span across various domains, making it an essential tool for solving both regression and classification problems. However, the success of supervised learning depends heavily on the quality and quantity of labeled data, as well as the model's ability to generalize. Whether you're building a spam filter or predicting housing prices, supervised learning provides the framework to tackle a wide variety of challenges. If you're new to machine learning, it's the perfect place to start!


Saturday, December 7, 2024
