The quality of a machine learning model is largely determined by the quality of the data. Data preparation is a broad topic and can be complex depending on the nature of the data, especially if it is unstructured and drawn from different sources. In this post we are going to learn about data preprocessing in machine learning. To improve the quality of the model for some machine learning algorithms, we have to carry out data preprocessing to ensure that we have the right data and avoid the “garbage in, garbage out” curse. For example, most machine learning algorithms don’t work well with text/string data, so we need to convert such data into a numerical format. There is no silver bullet when it comes to data preprocessing; the entire life cycle depends on the nature of the problem, the type of data available and the machine learning algorithms to be used.

Data Preprocessing

The fundamental steps in machine learning development include data preparation, model development, and model deployment. Data preparation comprises a series of steps that ensure the data is ready to be used for model development. Such steps include cleaning, normalization, transformation, feature engineering and so on. Data preprocessing is a key step in data preparation which ensures that data is encoded in the right format. There are many techniques for performing data preprocessing, including one-hot encoding, standardization, binarization and many more. We perform data preprocessing in situations where the data we have is inconsistent, has missing values, is dirty or has outliers, among others. In this post we are going to look at the most commonly used data preprocessing techniques. In our previous series on data analysis with Pandas we covered some data preprocessing approaches such as working with missing values. Let’s begin preprocessing our data.

Rescaling Techniques

1. Data Rescaling With MinMaxScaler

Rescaling is the process of transforming variables with varying scales into a uniform scale. Rescaling improves the performance of machine learning algorithms such as k-NN and SVM, among others. There are different techniques for rescaling such as MinMaxScaler and MaxAbsScaler. In this post we are going to leverage the rescaling classes that come with scikit-learn and focus on the MinMaxScaler technique. MinMaxScaler transforms the features by scaling them to a given range, the default being 0 to 1: MinMaxScaler(feature_range=(0, 1), copy=True). Let’s look at the following code that implements MinMaxScaler and compare the results from the classification report. Download the Pima Indians diabetes dataset here (pima-indians-diabetes) and drop it in your working directory.

MinMaxScaler Class
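Here is a minimal sketch of the class on a small made-up array, just to show what the transformation does (the values are purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[100.0, 0.001],
                 [8.0, 0.05],
                 [50.0, 0.005],
                 [88.0, 0.07]])

# Each column is rescaled independently to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
print(scaler.fit_transform(data))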

Let’s see how we can apply MinMaxScaler to real data and a machine learning prediction task. First, let’s run our model without rescaling the data and see the result:
Machine learning model before rescaling
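Below is a minimal sketch of the baseline, assuming the file is saved as pima-indians-diabetes.csv with no header row; the post does not pin down the classifier, so a k-NN classifier (which is sensitive to feature scale) is used here for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Column names for the Pima Indians diabetes CSV (the file has no header row)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('pima-indians-diabetes.csv', names=names)

X = df.drop('class', axis=1).values  # the 8 medical features
y = df['class'].values               # 1 = onset of diabetes, 0 = no onset

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train k-NN on the raw, unscaled features
model = KNeighborsClassifier()
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))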

Machine learning model after rescaling
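Continuing the same sketch, we now rescale the features with MinMaxScaler before training. Note that the scaler is fitted on the training split only and then applied to both splits:

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = KNeighborsClassifier()
model.fit(X_train_scaled, y_train)
print(classification_report(y_test, model.predict(X_test_scaled)))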

We can see that our model performs better when we rescale our data. To push performance even higher we can continue tweaking the feature_range to get the optimal results for our model. To learn more about rescaling visit the scikit-learn documentation here.

2. Standardizing Data With Standardizer

Standardization is a preprocessing technique that transforms features that follow Gaussian distributions with differing means and standard deviations into the same Gaussian distribution with a mean of 0 and a standard deviation of 1. Scikit-learn has a StandardScaler class that performs this standardization of Gaussian-distributed features. Standardization is useful for regression tasks.

Data Standardization With StandardScaler
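A minimal sketch of the StandardScaler class on a small made-up array (the values are purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[170.0, 60.0],
                 [180.0, 90.0],
                 [160.0, 50.0],
                 [175.0, 80.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(data)

# Each column now has mean 0 and standard deviation 1
print(standardized.mean(axis=0).round(3))  # [0. 0.]
print(standardized.std(axis=0).round(3))   # [1. 1.]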

For more details on the StandardScaler visit the scikit-learn documentation here.

3. Data Normalization

Normalization is a data rescaling technique that transforms each sample (row) so that it has unit length, i.e. a norm of 1. This preprocessing technique is useful for K-Nearest Neighbors, SVM and neural network tasks. Normalization is often applied to sparse data, and the most common use case is in text classification tasks. Scikit-learn has a Normalizer class that rescales each sample to unit norm.

Normalizer
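A minimal sketch of the Normalizer class; note that, unlike the scalers above, it operates on rows rather than columns:

import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[4.0, 3.0],
                 [1.0, 1.0],
                 [0.0, 5.0]])

# Each ROW is rescaled to unit length (L2 norm of 1)
normalizer = Normalizer(norm='l2')
normalized = normalizer.fit_transform(data)

print(normalized)                          # e.g. [4, 3] -> [0.8, 0.6]
print(np.linalg.norm(normalized, axis=1))  # [1. 1. 1.]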

For more details on normalization visit the scikit-learn documentation here.

4. Data Binarization

This is a data preprocessing technique that converts feature values into binary values (0 or 1) based on a given threshold. The resulting features follow the Bernoulli distribution. Scikit-learn has a Binarizer class that transforms a given data range into binary form. All values greater than the threshold are set to 1 while those less than or equal to the threshold are set to 0. Data binarization is commonly used in text analytics.

Binarizer
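A minimal sketch of the Binarizer class with an illustrative threshold of 0.5:

import numpy as np
from sklearn.preprocessing import Binarizer

data = np.array([[0.2, 5.1, -1.3],
                 [3.0, 0.0, 2.7]])

# Values > 0.5 become 1, values <= 0.5 become 0
binarizer = Binarizer(threshold=0.5)
print(binarizer.fit_transform(data))
# [[0. 1. 0.]
#  [1. 0. 1.]]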

For more details on binarizer class visit the scikit-learn documentation here.

5. LabelEncoder

LabelEncoder is a preprocessing technique that transforms text labels into integers ranging from 0 to n-1, where n is the number of distinct labels. It is simply a numeric representation of categorical features in a data set and is useful when we have ordinal categorical values, e.g. hot, warm, cold. To perform label encoding on a given sentence such as “Hi, my name is Tom”, we first create a set of the individual words, then define our universe of discourse to capture both lower case and upper case letters together with the other characters that make up each word. Finally, we assign a numerical value to each word starting from 0 up to n-1. Scikit-learn provides a class called LabelEncoder that transforms words into this numerical representation.

LabelEncoder
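A minimal sketch of the LabelEncoder class on ordinal-style labels; note that the classes are assigned integers in sorted order:

from sklearn.preprocessing import LabelEncoder

weather = ['hot', 'cold', 'warm', 'cold', 'hot']

encoder = LabelEncoder()
encoded = encoder.fit_transform(weather)

print(list(encoder.classes_))  # ['cold', 'hot', 'warm'] -> 0, 1, 2
print(encoded)                 # [1 0 2 0 1]

# The mapping is reversible
print(list(encoder.inverse_transform(encoded)))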

LabelEncoding on the Titanic Data set
Download the Titanic data set here (titanic train data set) and let’s explore and use it to predict the survival of the ship’s passengers using the KNN algorithm. To predict who will survive, let’s consider only 4 features from this data set: passenger class (Pclass), gender (Sex), age (Age), and number of siblings (SibSp). The target is survival (Survived), which is our label. Below is how we will carry out this task, followed by a sketch of the code;
– Drop all unnecessary features (columns)
– Remove all rows with missing values that are mandatory to us
– Transform the gender into 0 for female and 1 for male using LabelEncoder
– Split our data set into train and test sets
– Train our model
– Predict the survival of passengers
– Measure our model’s performance with a classification report
– Save our prediction results to a csv file
– Display the prediction and Classification report
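
Here is a sketch of those steps end to end, assuming the file is saved as titanic_train.csv (the standard Kaggle Titanic training file); the output filename is illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

df = pd.read_csv('titanic_train.csv')

# Keep only the features we need plus the label, drop rows with missing values
df = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Survived']].dropna()

# Transform gender into 0 (female) and 1 (male)
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])

X = df[['Pclass', 'Sex', 'Age', 'SibSp']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Save the predictions to a csv file, then display them with the report
results = X_test.copy()
results['Predicted'] = predictions
results.to_csv('titanic_predictions.csv', index=False)

print(results.head())
print(classification_report(y_test, predictions))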

You can add more features to this model to improve the prediction. For more details on the LabelEncoder class visit the scikit-learn documentation here.

6. One Hot Encoding

One Hot Encoding is a preprocessing technique for transforming categorical data into binary vectors. The intuition behind one hot encoding is to first convert the values into integers and then convert the integers into binary vectors. This is because most, if not all, machine learning algorithms do not work well with string/text data, so we need to convert it to integers and, more usefully, to binary vectors. The downside of one hot encoding is that it increases the feature space, contributing to what is referred to as the curse of dimensionality, meaning we need a lot of space to represent features using one hot encoding. Scikit-learn has a OneHotEncoder class that transforms categories into binary vectors. Let’s look at how to use one hot encoding with scikit-learn’s OneHotEncoder class.

OneHotEncoder
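A minimal sketch of the OneHotEncoder class. Note that in older scikit-learn releases (before 0.20) OneHotEncoder only accepted integer input, so a LabelEncoder pass was needed first; recent versions accept strings directly:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([['red'], ['green'], ['blue'], ['green']])

# fit_transform returns a sparse matrix; toarray() makes it dense for printing
encoder = OneHotEncoder()
onehot = encoder.fit_transform(colors).toarray()

print(encoder.categories_)  # [array(['blue', 'green', 'red'], ...)]
print(onehot)
# blue -> [1, 0, 0], green -> [0, 1, 0], red -> [0, 0, 1]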

Let’s look at how to perform one hot encoding using the Titanic data set.

One Hot Encoding with Titanic Data set
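A sketch of one hot encoding the passenger class column, again assuming the titanic_train.csv filename used above:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('titanic_train.csv')

# One hot encode the passenger class (1, 2 or 3) into binary vectors
encoder = OneHotEncoder()
pclass_onehot = encoder.fit_transform(df[['Pclass']]).toarray()

print(encoder.categories_)  # [array([1, 2, 3])]
print(pclass_onehot[:5])    # one row per passenger, e.g. class 3 -> [0. 0. 1.]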

For more details on OneHotEncoder class visit the scikit-learn documentation here.

7. Dummy Coding

This is a technique for converting categorical features into numerical (dummy) variables. Dummy coding is similar to One Hot Encoding with the exception that it can return m-1 binary features for every m labels (by setting drop_first=True), which avoids redundant columns. Pandas has a function that encodes categorical variables into dummy variables: pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None). Let’s look at the dummy coding example below.

Dummy Coding with Titanic Data set
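A sketch of dummy coding the Sex and Pclass columns with pandas, again assuming the titanic_train.csv filename used above:

import pandas as pd

df = pd.read_csv('titanic_train.csv')

# drop_first=True keeps m-1 dummy columns per feature to avoid redundancy
dummies = pd.get_dummies(df[['Sex', 'Pclass']],
                         columns=['Sex', 'Pclass'], drop_first=True)

print(dummies.head())  # columns: Sex_male, Pclass_2, Pclass_3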

For more details on dummy coding visit the pandas documentation here.

8. Handling Missing Data

More often than not, data is never complete: some records might be missing values, others might have incorrect values, among many other unique situations. Most machine learning algorithms don’t work well with missing values; some algorithms have been designed to cope with missing data, but you never know when you will need to handle it yourself. We covered most of the techniques for handling missing data in our Pandas series; you can check that post here. Such techniques included dropping records with missing values, filling missing values with arbitrary values, and extrapolating the missing values from existing values. In this post we are going to look at imputing missing values. Data imputation refers to the replacement of missing values with certain values. There are several techniques for data imputation such as hot-deck, cold-deck, mean substitution and regression. Scikit-learn has an imputer class, Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True), that handles the missing data menace. Let’s see how to handle missing data with the imputer.

Data Imputation
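A minimal sketch of mean imputation. Note that the Imputer class named above comes from older scikit-learn releases; from version 0.20 it was replaced by sklearn.impute.SimpleImputer, which this sketch uses:

import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0, 2.0],
                 [np.nan, 3.0],
                 [7.0, np.nan]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imputer.fit_transform(data))
# NaNs are replaced by the column means: 4.0 and 2.5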

For more details on Data Imputation visit the scikit-learn documentation here.

Conclusion

Data preprocessing is a fundamental step in the machine learning and data analysis process, since it greatly improves the performance and accuracy of the analysis and of the machine learning model. There are many data preprocessing techniques that can be applied to different problems. In this post we have seen a few techniques that are commonly used for preparing data. Some data preprocessing techniques are covered in other posts, such as the data analysis with Pandas series.

What’s Next

In this post we have looked at data preprocessing techniques and why they are important. In the next post we are going to look at machine learning algorithms in detail, starting with supervised learning algorithms.
