In machine learning we usually divide the data set into a training set and a testing set, and sometimes a validation set as well. This is standard practice because evaluating a model on data it has not seen tells us how well it generalizes. We fit our model using the training set and evaluate our final model using the test set. However, we can run into two problems: overfitting and underfitting. To detect and reduce them, we use the validation set to adjust our model before the final evaluation. Splitting the data into train, validation and test sets is one of several approaches to model evaluation.

In this post we are going to learn how to practically apply the concept of train, validation and test sets in our machine learning model. A common split is 60:20:20 (train:validation:test); other ratios work too, as long as the training set is larger than the validation and test sets. There are other model evaluation techniques that we will not cover in this post but will cover in future posts. This post assumes that you have gone through the previous post on loading machine learning data. However, you can still follow along if you haven’t gone through the previous post.

Train, Validation And Test Sets

The train, validation and test sets are very important in the machine learning process because they lead to a model that performs well on real data. The training set is used to train the model in a process called learning or fitting. It is often the largest of the three sets. The validation set is used for tuning the hyperparameters to get better performance. Not all models require a validation set. One example of using the validation set is a K-Nearest Neighbors model, where we need to find the value of k that gives the best results. The testing set is used for measuring the performance of the final model. The two sets differ in purpose: the validation set is used to select hyperparameters so that the model neither overfits nor underfits, while the testing set is used only to evaluate the model once those choices are made.
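As a quick preview of how the three sets work together, here is a minimal sketch of the K-Nearest Neighbors example described above. It assumes the Iris data is loaded with scikit-learn's load_iris, and the 60:20:20 split, the range of k values and random_state=42 are illustrative choices, not something fixed by this post. The score method reports accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Iris is loaded from scikit-learn's bundled copy.
X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set, then take 25% of the remaining 80%
# as the validation set, which gives 60:20:20 overall.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

# Try several values of k and keep the one that scores best
# on the validation set (the test set is not touched here).
best_k, best_score = None, -1.0
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Evaluate the chosen model once on the test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("best k:", best_k, "test accuracy:", test_accuracy)
```

The test set is used exactly once, after k has been chosen, so the reported accuracy is not influenced by the tuning step.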

Train Test Split in Scikit-learn

Scikit-learn comes with a function called train_test_split(*arrays, **options) that splits a data set into a training set and a testing set. In this post we are going to focus on dividing our data set into train and test sets. We will see how to use the validation set in the coming posts. We pass the features and labels to train_test_split as positional arguments. The test size is given through the optional test_size keyword (it defaults to 0.25 when neither test_size nor train_size is set), and other optional arguments such as random_state and shuffle control how the split is made.
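To make the call concrete, here is a small sketch on a made-up toy array. The data, the test_size of 0.4 and the random_state value are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Features and labels are passed positionally; test_size and random_state
# are optional keyword arguments. random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

print(X_train.shape, X_test.shape)  # (6, 2) (4, 2)
```

The function returns the pieces in the order train features, test features, train labels, test labels, so the unpacking order matters.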

Our Project

We are going to use the Iris data set, which consists of 150 records, and split it into training and testing sets in the ratio 60:40. We will then train a Support Vector Machine on the training set and use it to make predictions. In the next post we will learn how to measure the performance of our model using the accuracy score.

Train and Test Sets
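A minimal sketch of the 60:40 split on Iris could look like this. It assumes the data comes from scikit-learn's load_iris (the previous post may have loaded it differently), and random_state=42 is an arbitrary choice made so the split is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: load Iris from scikit-learn's bundled copy.
iris = load_iris()
X, y = iris.data, iris.target  # 150 records, 4 features each

# test_size=0.4 keeps 60% for training and 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
```

With 150 records this gives 90 for training and 60 for testing.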

Model Training
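The following sketch fits a Support Vector Machine on the training set and makes predictions on the test set. It repeats the 60:40 split so that it runs on its own. Using SVC with its default settings is an assumption, since the post does not name a kernel or any other parameters.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same 60:40 split as described in the project outline.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Assumption: default SVC hyperparameters.
model = SVC()
model.fit(X_train, y_train)  # learn from the training set only

# Predict the class of each record in the test set.
predictions = model.predict(X_test)
print(predictions[:10])
```

The predictions are class labels (0, 1 or 2 for the three Iris species), and they can be compared with y_test to measure performance.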

Conclusion

We have seen that the training, validation and testing sets are very important in machine learning. You need to split your data set because you cannot use the training set to measure the performance of your model: the model is optimized on the training set, so its score there may not reflect real-world cases. The validation set is very important in hyperparameter selection. However, not all models need a validation set. The test set is used to evaluate the performance of your final model. A typical split ratio is 60:20:20 (train:validation:test). There are other techniques for model evaluation that we have not covered here but we’ll cover in other posts.

What’s Next

In this post we have learned about train, validation and test sets. In the next post we are going to learn how to measure the performance of our model.
