Data is a crucial element of ML, and understanding how to structure and interpret it is essential for building effective ML models. Two components are critical in any dataset used in ML: features (input variables) and labels (output/target variable). These two terms are foundational to the success of ML projects. In this post, we will dive into what features (input variables) and labels (output/target variable) are, how they work, and why they are important for building powerful ML models.
What are features (input variables) in ML?
As the title above states, features are the input variables, or independent variables, in your dataset that are used to make predictions. Each feature represents a specific measurable property or characteristic of the data you are analyzing. Features can be numerical, categorical, or even derived from raw data such as text, images, or audio. Examples of features will be shown for four different publicly available datasets: the housing dataset, the weather dataset, the e-commerce dataset, and the combined cycle power plant dataset.
The housing dataset (Boston housing dataset) contains 14 columns in total. The target variable in this dataset is the median value of owner-occupied homes in $1000's, so the remaining 13 columns are features, or input variables. These features are:
- crim - per capita crime rate by town
- zn - proportion of large residential lots (over 25,000 sq. ft.)
- indus - proportion of non-retail business acres per town
- chas - binary variable indicating if the property is near the Charles River (1 for yes, 0 for no)
- nox - concentration of nitrogen oxides in the air
- rm - average number of rooms per dwelling
- age - proportion of owner-occupied units built before 1940
- dis - weighted distances to Boston employment centers
- rad - index of accessibility to radial highways
- tax - property tax rate per $10,000
- ptratio - pupil-teacher ratio by town
- b - a measure related to the proportion of Black residents by town
- lstat - percentage of lower-status population
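Splitting such a dataset into features and a label is typically a one-liner. The sketch below uses a toy pandas DataFrame with a subset of the Boston columns (the values are made up for illustration):

```python
import pandas as pd

# Toy rows with a subset of the Boston housing columns (values are made up)
df = pd.DataFrame({
    "crim": [0.03, 0.09, 0.21],
    "rm":   [6.5, 5.9, 6.1],
    "tax":  [296, 242, 311],
    "medv": [24.0, 21.6, 19.9],  # target: median home value in $1000's
})

X = df.drop(columns=["medv"])  # features (input variables)
y = df["medv"]                 # label (output/target variable)

print(X.columns.tolist())  # ['crim', 'rm', 'tax']
print(y.tolist())          # [24.0, 21.6, 19.9]
```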
The key characteristics of features
The key characteristics of features are type, importance, and scaling. Features can be numerical or categorical: numerical features include age, height, etc., while categorical features include color, region, etc. Generally, not all features contribute equally to the model's predictions. Some features have higher importance while others have lower importance; those with lower importance are usually irrelevant or redundant. Many algorithms are sensitive to the scale of features, requiring normalization or standardization (i.e., feature scaling) before the features are used in an ML algorithm.
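Feature scaling can be sketched in a few lines of NumPy. Here standardization rescales two made-up features on very different scales (age in years, income in dollars) so that each ends up with mean 0 and unit variance:

```python
import numpy as np

# Two features on very different scales: age (years) and income (dollars)
X = np.array([[25, 40_000],
              [35, 85_000],
              [45, 60_000]], dtype=float)

# Standardization: subtract each column's mean, divide by its standard
# deviation, so every feature has mean 0 and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # [1. 1.]
```

Without this step, distance-based algorithms such as k-nearest neighbors would be dominated by the income column simply because its numbers are larger.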
What are labels in ML?
Labels (output/target variables) are the dependent variables that represent the answers or target values your model is trying to predict. They are the outcomes you use to evaluate how well your model is performing.
Examples of labels in the datasets are:
- In a housing dataset - label is the price of the house
- In a weather dataset - the label could be whether it will rain tomorrow (yes/no)
- In an e-commerce dataset - the label is whether the customer will purchase an item (yes/no)
Labels can be continuous values or discrete categories. Continuous-valued labels are used for regression tasks, e.g., predicting house prices in the housing dataset. In classification tasks, the label is a discrete category, e.g., predicting whether an email is spam or not.
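The distinction can be made concrete with a rough, illustrative heuristic (this is only a sketch, not a standard API): labels drawn from a small set of distinct values, or from strings/booleans, suggest classification, while many distinct numeric values suggest regression:

```python
import numpy as np

def task_type(labels, max_classes=10):
    """Rough heuristic (illustrative only): few distinct values -> classification."""
    labels = np.asarray(labels)
    if labels.dtype.kind in "OUSb":  # strings or booleans -> categories
        return "classification"
    return "classification" if len(np.unique(labels)) <= max_classes else "regression"

print(task_type(["spam", "ham", "spam"]))         # classification
print(task_type(np.linspace(100, 500, 50)))       # regression (50 distinct prices)
```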
Features Vs. Labels: Quick Comparison
| Aspect | Features | Labels |
|---|---|---|
| Definition | Inputs to the model | Outputs the model is trained to predict |
| Role | Independent variables that explain the data | Dependent variable being explained |
| Example in Housing | Square footage, number of bedrooms | House price |
| Example in Weather | Temperature, humidity | Rain (yes/no) |
| Example in E-commerce | Product rating, time spent on page | Purchase decision (yes/no) |
How Do Features and Labels Work Together?
The relationship between features and labels is at the heart of supervised learning. The process of supervised learning consists of the following steps:
- Data collection - Before training an ML model, a dataset consisting of features and labels must be gathered. For example, in the housing dataset, the features are house attributes while the labels are house prices. In the case of the combined cycle power plant, the features are ambient pressure, ambient temperature, and vacuum in the condenser, while the label is the generated power output of the CCPP.
- Model training - When the data is collected and prepared, the dataset is provided to the ML model. During training, the ML algorithm learns a function or mapping:
$$f(\text{features}) = \text{label}$$
In a regression model, this might be learning to predict house prices based on the square footage and location.
- Prediction - Once the ML algorithm is trained, the model can predict labels for new, unseen features. For example, given the features of a house, the model can predict its price.
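The steps above can be sketched end to end with plain NumPy, fitting the mapping f(features) = label by ordinary least squares on synthetic housing data (all numbers below are made up for illustration):

```python
import numpy as np

# Synthetic housing data: label = 50 * sqft_in_100s + 10 * bedrooms + 5
X = np.array([[10, 2], [15, 3], [20, 3], [25, 4]], dtype=float)  # features
y = 50 * X[:, 0] + 10 * X[:, 1] + 5                              # labels

# Learn f(features) = label via least squares (an intercept column is appended)
Xb = np.hstack([X, np.ones((len(X), 1))])
w = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Predict the label for a new, unseen house (1800 sq ft, 3 bedrooms)
new_house = np.array([18, 3, 1.0])
print(round(float(new_house @ w), 1))  # ≈ 935.0
```

Because the synthetic labels are exactly linear in the features, the fitted weights recover the generating coefficients and the prediction matches 50·18 + 10·3 + 5 = 935.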
Why are features and labels important?
Labels determine what the model is trying to predict. Without well-defined and accurate labels, the model cannot learn effectively.
Feature engineering, the practice of crafting and selecting the right features, is very important and can make or break an ML project. Derived features often provide additional predictive power.
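A minimal sketch of a derived feature, using made-up housing rows: combining price and size into price per square foot can carry more signal than either raw column alone.

```python
import pandas as pd

# Toy housing rows (values are made up) illustrating a derived feature
df = pd.DataFrame({"price": [300_000, 450_000], "sqft": [1500, 2250]})

# Derived feature: price per square foot, computed from two raw columns
df["price_per_sqft"] = df["price"] / df["sqft"]

print(df["price_per_sqft"].tolist())  # [200.0, 200.0]
```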