Measuring the performance of your model is an integral step in evaluating how it will behave on real data. Different metrics measure different things, and each metric is best suited to particular cases. Only by measuring our model can we have confidence in the predictions it gives. In this post we are going to learn about different metrics for measuring the performance of machine learning models. This post builds on our previous post on train, validation and test sets, but you can still follow along without having read it.

Performance Measure

A poor choice of evaluation metric can lead to the development of the wrong model. There are many metrics for measuring the performance of machine learning models, for both supervised and unsupervised learning. In this post we are going to see the commonly used model evaluation metrics for supervised learning. We are going to focus on classification metrics, namely the confusion matrix, accuracy, precision, recall, F1 score and the classification report, and on two regression metrics, namely mean absolute error (MAE) and mean squared error (MSE). Let’s start.

Data Sets

In this post we are going to use various data sets for different metrics. We will use the Iris data set, the Pima Indians Diabetes data set (pima-indians-diabetes) and, for the regression metrics, the Boston housing prices data set (boston-house-price-dataset). For each example, download the data set and place it in your working directory.

Classification Metrics

Confusion Matrix

A confusion matrix, also referred to as a contingency table, is a table that compares the actual classes with the predicted classes. For a binary classification problem it is a 2×2 table, and for a multi-class problem with N classes it is an N×N table. It is not an evaluation metric in itself, but it provides the counts from which other metrics such as precision, recall, accuracy and F1 score are computed. Before we look into these individual metrics, let’s take a look at the confusion matrix below and define some of the key terms from it.

[Figure: confusion matrix]

True Positives (TP). These are the positive observations that are correctly predicted as positive.
True Negatives (TN). These are the negative observations that are correctly predicted as negative.
False Positives (FP). These are the negative observations that are wrongly predicted as positive.
False Negatives (FN). These are the positive observations that are wrongly predicted as negative.

Deciding whether to minimize False Negatives or False Positives is a trade-off that depends on the problem at hand. For example, in medical diagnosis we often prefer to minimize False Negatives: people who are wrongly flagged as having a disease can still undergo further tests, whereas failing to flag people who are actually sick leaves them undiagnosed. scikit-learn has a confusion_matrix function in sklearn.metrics that computes the confusion matrix for a model’s predictions.
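As a rough illustration of the four terms above, the sketch below calls scikit-learn's confusion_matrix on a small set of made-up binary labels. The y_true and y_pred values are hypothetical and are not taken from any of the data sets in this post.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a binary problem: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes and columns are predicted classes.
# With labels ordered [0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

Note that scikit-learn puts the negative class in the first row and column when the labels are ordered [0, 1], which may differ from the layout used in some textbooks.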

1. Accuracy

Accuracy is the number of correct predictions divided by the total number of predictions. It is given by the following formula: Accuracy = (TP+TN)/(TP+FP+FN+TN). This metric is used for classification problems. Accuracy is suited to balanced data sets, i.e. data with a roughly equal number of observations in each class, and to problems where all predictions and prediction errors are equally important. In the real world these conditions are rare: on a data set where 95% of the observations belong to one class, a model that always predicts that class reaches 95% accuracy without learning anything. Accuracy is therefore one of the most misused metrics. scikit-learn’s metrics module has an accuracy_score function that returns the accuracy of a model’s predictions.
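A minimal sketch of measuring accuracy with accuracy_score on the Iris data follows. It makes a few assumptions that are not from the original example: the data is loaded with scikit-learn's bundled load_iris instead of a downloaded file, the model is a LogisticRegression, and the 70/30 split and random_state are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: use the copy of Iris bundled with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; stratify keeps the classes balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7, stratify=y
)

# Assumption: any classifier would do here; logistic regression is one option.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fraction of test observations whose predicted class matches the actual class.
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.3f" % accuracy)
```

Iris has the same number of observations in each class, so accuracy is a reasonable metric for it.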

2. Precision

Precision is the number of True Positives divided by the total number of predicted positives (True Positives plus False Positives). It is given by the following formula: Precision = TP/(TP+FP). Precision measures the exactness of the model: of all the observations the model predicted as positive, it is the fraction that are actually positive.
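The formula can be worked through in plain Python, as in the sketch below. The counts are hypothetical, and returning 0.0 when the model predicts no positives at all is a convention assumed here, not something the formula itself defines.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    # Assumed convention: report 0.0 when nothing was predicted positive.
    if predicted_positive == 0:
        return 0.0
    return tp / predicted_positive


# Hypothetical counts: 30 true positives and 10 false positives.
print(precision(tp=30, fp=10))  # 0.75
```

scikit-learn offers the same calculation on label arrays through sklearn.metrics.precision_score.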

3. Recall

Recall is the number of True Positives divided by the total number of actual positives (True Positives plus False Negatives). It is given by the following formula: Recall = TP/(TP+FN). Recall measures the completeness of the model: of all the observations that are actually positive, it is the fraction that the model correctly predicted as positive.
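As a sketch of how recall is obtained from raw labels, the function below counts True Positives and False Negatives itself. The label lists are made up for illustration, and the function name and its positive parameter are hypothetical helpers, not scikit-learn's API.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = TP / (TP + FN), counted from paired actual/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for actual, pred in pairs if actual == positive and pred == positive)
    fn = sum(1 for actual, pred in pairs if actual == positive and pred != positive)
    actual_positive = tp + fn
    # Assumed convention: report 0.0 when there are no actual positives.
    if actual_positive == 0:
        return 0.0
    return tp / actual_positive


# Hypothetical labels: four actual positives, three of which are found.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(recall(y_true, y_pred))  # 0.75
```

The false positive at the seventh position does not affect recall; it would lower precision instead.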

4. F1 Score

Also referred to as the F Score or F Measure, the F1 Score is the harmonic mean of Precision and Recall. It is given by the following formula: F1 Score = 2*((Precision*Recall)/(Precision+Recall)). It tells us how precise and robust our model is. It can be viewed as a weighted average of precision and recall, and it ranges from 0 to 1. The closer the F1 Score is to 1, the better our model performs. scikit-learn provides the F1 Score in its metrics module as sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None).
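The sketch below checks the formula against scikit-learn's f1_score on a small set of hypothetical binary labels. The labels are invented for illustration and do not come from the data sets used in this post.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3 / 4

# The library value and the harmonic-mean formula should agree.
f1 = f1_score(y_true, y_pred)
manual_f1 = 2 * (p * r) / (p + r)

print("F1 (scikit-learn): %.3f" % f1)
print("F1 (formula):      %.3f" % manual_f1)
```

For a multi-class problem such as Iris, the default average='binary' does not apply, so an averaging strategy such as average='macro' or average='weighted' has to be passed instead.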

Classification Report

The classification report is a text summary of the main classification metrics. It shows the precision, recall, F1 Score and support (the number of actual observations) for each class.
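A minimal sketch of producing the report with scikit-learn's classification_report is shown below. As assumptions made for illustration, it loads the bundled copy of Iris with load_iris, fits a DecisionTreeClassifier, and uses an arbitrary 70/30 split.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: use the copy of Iris bundled with scikit-learn.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=7, stratify=iris.target
)

# Assumption: a decision tree stands in for whatever model is being evaluated.
model = DecisionTreeClassifier(random_state=7)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# One row per class, with precision, recall, f1-score and support columns.
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```

Passing target_names replaces the numeric class labels with the species names in the printed table.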

Regression Metrics

1. Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is a regression metric that measures the difference between two continuous variables. In machine learning, MAE is the average of the absolute differences between the predictions and the actual values. The MAE tells us how wrong our model’s predictions are on average, in the same units as the target, but not whether the model over-predicts or under-predicts.
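The calculation can be sketched in plain Python as below. The actual and predicted values are hypothetical house prices (in thousands of dollars) invented for illustration, and the mae function is a hand-written helper, not scikit-learn's implementation.

```python
def mae(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(abs(actual - pred) for actual, pred in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices in thousands of dollars.
actual = [24.0, 21.5, 34.5, 33.0]
predicted = [25.0, 20.5, 32.5, 35.0]

# Absolute errors are 1, 1, 2 and 2, so the average is 1.5.
print(mae(actual, predicted))  # 1.5
```

scikit-learn provides the same metric as sklearn.metrics.mean_absolute_error.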

2. Mean Squared Error (MSE)

Also referred to as the Mean Squared Deviation (MSD), the Mean Squared Error is a regression metric that measures the average of the squared differences between the actual observations and the model’s predictions. It is always non-negative, since squaring removes the sign of each error, and large errors are penalized more heavily than small ones. It is a risk function corresponding to the expected value of the squared error loss.
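A plain-Python sketch of the calculation follows. The values are hypothetical house prices (in thousands of dollars) invented for illustration, and the mse function is a hand-written helper, not scikit-learn's implementation.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum((actual - pred) ** 2 for actual, pred in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices in thousands of dollars.
actual = [24.0, 21.5, 34.5, 33.0]
predicted = [25.0, 20.5, 32.5, 35.0]

# Squared errors are 1, 1, 4 and 4, so the average is 2.5.
print(mse(actual, predicted))  # 2.5
```

The two predictions that are off by 2 contribute four times as much as the two that are off by 1, which is how squaring weights large errors more heavily. scikit-learn provides the same metric as sklearn.metrics.mean_squared_error.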


We have learned about one of the most important tasks in the machine learning process: performance evaluation. Different metrics work well for different problems; for example, accuracy can be used for balanced data but is misleading for imbalanced data. We have looked at the most commonly used metrics, but there are other useful metrics that we have not covered.

What’s Next

In this post we have learned about model evaluation metrics. In the next post we will learn about visualizing our machine learning data so that we can better understand what it looks like.
