Measuring the performance of your model is an integral step in evaluating how it will behave on real data. Different metrics measure different things, and each metric is best suited to a particular case. It is only by measuring our model that we can have confidence in the predictions it gives. In this post we are going to learn about different metrics for measuring the performance of machine learning models. This post builds on our previous post on train, validation and test sets, but you can still follow along even without reading the previous post.

**Performance Measure**

A poor choice of evaluation metric can lead to the development of the wrong model. There are many metrics for measuring the performance of machine learning models, for both supervised and unsupervised learning. In this post we are going to look at commonly used model evaluation metrics for supervised learning. We will focus on classification metrics (the confusion matrix, accuracy, precision, recall, F1 score and the classification report) and two regression metrics: mean absolute error (MAE) and mean squared error (MSE). Let's start.

**Data Sets**

In this post we are going to use several data sets for the different metrics: the Iris data set, the Pima Indian Diabetes data set, and the Boston housing prices data set for the regression metrics. For each example, download the data set and place it in your working directory.

**Classification Metrics**

**Confusion Matrix**

A confusion matrix, also referred to as a contingency table, summarizes a model's predictions against the actual classes. For a binary classification problem it is a 2×2 table; for a multi-class problem it generalizes to an N×N table. It is not an evaluation metric itself, but it provides the quantities from which other metrics such as *precision, recall, accuracy and F1 score* are computed. Before looking at these individual metrics, let's define some key terminology.

**True Positives (TP).** These are the positive observations that are correctly predicted as positive.

**True Negatives (TN).** These are the negative observations that are correctly predicted as negative.

**False Positives (FP).** These are negative observations that were wrongly predicted as positive.

**False Negatives (FN).** These are positive observations that were wrongly predicted as negative.
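These four quantities can be counted directly from a pair of label lists. A minimal sketch with hypothetical labels, where 1 is the positive class and 0 the negative class:

```python
# Hypothetical ground-truth labels and model predictions (1 = positive class).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly predicted positives
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correctly predicted negatives
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # negatives predicted as positive
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # positives predicted as negative

print(TP, TN, FP, FN)  # 3 3 1 1
```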

Deciding whether to minimize False Negatives or False Positives is a trade-off that depends on the problem at hand. For example, in medical diagnosis we usually prefer to minimize False Negatives: we can always run further tests on people who were wrongly flagged as having a disease, but failing to identify people who are actually sick is far more costly. scikit-learn has a function that computes the confusion matrix for a model, as shown below.

**Confusion Matrix**

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC

data = pd.read_csv("pima-indians-diabetes.csv")
X = data.values[:, 0:8]  # features
y = data.values[:, 8]    # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_prediction = svm_model.predict(X_test)

results = confusion_matrix(y_test, svm_prediction)
print("PREDICTION : \n", svm_prediction[0:10])
print("\nConfusion Matrix : \n", results)
```

```
Output

PREDICTION :
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Confusion Matrix :
 [[196   0]
 [111   0]]
```
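Several metrics fall out of the matrix directly. For instance, taking the counts printed above (rows are actual classes, columns are predicted classes), accuracy is the sum of the diagonal over the total:

```python
import numpy as np

# The confusion matrix from the run above (rows = actual, columns = predicted)
cm = np.array([[196, 0],
               [111, 0]])

# Accuracy = correct predictions (diagonal) / all predictions
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # 0.638
```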

**1. Accuracy**

Accuracy is the number of correct predictions divided by the total number of predictions: *Accuracy = (TP + TN) / (TP + FP + FN + TN)*. It is a metric mostly used for classification problems. Accuracy is appropriate for balanced data sets, i.e. data with a roughly equal number of observations in each class, where all predictions and prediction errors matter equally. In the real world these conditions are rare, which makes accuracy one of the most misused metrics. scikit-learn's *metrics* module has an *accuracy_score* function that returns the accuracy of a model. The example below demonstrates how to measure performance with the accuracy metric, using the Pima Indian Diabetes data set.
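Before the full example, a toy sketch (with made-up labels) of why accuracy misleads on imbalanced data: a model that always predicts the majority class still scores highly.

```python
from sklearn.metrics import accuracy_score

# Hypothetical data: 95 negative observations, 5 positive ones,
# and a model that always predicts the majority class 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.95, despite never detecting a single positive case
```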

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

data = pd.read_csv("pima-indians-diabetes.csv")
X = data.values[:, 0:8]  # features
y = data.values[:, 8]    # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_prediction = svm_model.predict(X_test)

results = accuracy_score(y_test, svm_prediction)
print("PREDICTION : \n", svm_prediction[0:10])
print("\nAccuracy Score : ", results)
```

```
Output

PREDICTION :
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Accuracy Score :  0.622149837134
```

**2. Precision**

This is the number of True Positives divided by the total number of positive predictions (True Positives plus False Positives): *Precision = TP / (TP + FP)*. Precision measures the exactness of the model: of all the observations predicted as positive, it is the fraction that are actually positive.
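The formula can be checked by hand on a small made-up example before applying it to a real model:

```python
from sklearn.metrics import precision_score

# Hypothetical labels: TP = 2 and FP = 1, so Precision = 2 / (2 + 1)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]

prec = precision_score(y_true, y_pred)
print(round(prec, 3))  # 0.667
```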

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from sklearn.svm import SVC

data = pd.read_csv("Iris.csv")
X = data.values[:, 0:4]  # features
y = data['Species']      # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_prediction = svm_model.predict(X_test)

# average can be 'weighted', 'micro', 'macro' or None
results = precision_score(y_test, svm_prediction, average='weighted')
print("PREDICTION : \n", svm_prediction[0:10])
print("\nPrecision Score : \n", results)
```

```
Output

PREDICTION :
 ['Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica']

Precision Score :
 0.970175438596
```

**3. Recall**

This is the number of True Positives divided by the total number of actual positives (True Positives plus False Negatives): *Recall = TP / (TP + FN)*. Recall measures the completeness of the model: of all the actual positive observations, it is the fraction that the model correctly identifies.
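Again, a hand calculation on made-up labels confirms the formula:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: TP = 2 and FN = 2, so Recall = 2 / (2 + 2)
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 0, 0, 0]

rec = recall_score(y_true, y_pred)
print(rec)  # 0.5
```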

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.svm import SVC

data = pd.read_csv("Iris.csv")
X = data.values[:, 0:4]  # features
y = data['Species']      # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_prediction = svm_model.predict(X_test)

# average can be 'weighted', 'micro', 'macro' or None
results = recall_score(y_test, svm_prediction, average='weighted')
print("PREDICTION : \n", svm_prediction[0:10])
print("\nRecall Score : \n", results)
```

```
Output

PREDICTION :
 ['Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-setosa'
 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica'
 'Iris-virginica' 'Iris-setosa']

Recall Score :
 1.0
```

**4. F1 Score**

Also referred to as the F-Score or F-Measure, it is the harmonic mean of Precision and Recall: *F1 Score = 2 * ((Precision * Recall) / (Precision + Recall))*. It tells us how precise and robust our model is. It can be viewed as a weighted average of precision and recall, ranging from 0 to 1; the closer the F1 Score is to 1, the better the model performs. scikit-learn provides an F1 Score function in its metrics module:
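A quick hand calculation of the harmonic mean, using made-up precision and recall values:

```python
# With precision = 0.5 and recall = 1.0 the arithmetic mean would be 0.75,
# but the harmonic mean pulls the score toward the weaker of the two.
precision, recall = 0.5, 1.0
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.667
```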

**sklearn.metrics.f1_score(y_true, y_pred, labels=None, pos_label=1, average='binary', sample_weight=None)**

**F1 Score**

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 2, 0, 1, 0, 2]

macro_average_result = f1_score(y_true, y_pred, average='macro')
micro_average_result = f1_score(y_true, y_pred, average='micro')
weighted_average_result = f1_score(y_true, y_pred, average='weighted')

print("Macro Average Score : ", macro_average_result)
print("\nMicro Average Score : ", micro_average_result)
print("\nWeighted Average Score : ", weighted_average_result)
```

```
Output

Macro Average Score :  0.222222222222

Micro Average Score :  0.3

Weighted Average Score :  0.288888888889
```

**F1 Score With Iris Prediction**

```python
from sklearn.metrics import f1_score
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = pd.read_csv("Iris.csv")
X = data.values[:, 0:4]  # features
y = data['Species']      # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_prediction = svm_model.predict(X_test)
print("PREDICTION : \n", svm_prediction[0:10])

# Note the argument order: true labels first, then predictions
macro_average_result = f1_score(y_test, svm_prediction, average='macro')
micro_average_result = f1_score(y_test, svm_prediction, average='micro')
weighted_average_result = f1_score(y_test, svm_prediction, average='weighted')

print("\n\nMacro Average Score : ", macro_average_result)
print("\nMicro Average Score : ", micro_average_result)
print("\nWeighted Average Score : ", weighted_average_result)
```

```
Output

PREDICTION :
 ['Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica']

Macro Average Score :  0.982443982444

Micro Average Score :  0.983333333333

Weighted Average Score :  0.983344883345
```

**Classification Report**

This is a summary of the classification metrics derived from the confusion matrix. It reports the precision, recall, F1-score and support for each class, where support is the number of actual occurrences of that class in the test data.

```python
from sklearn.metrics import classification_report
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = pd.read_csv("pima-indians-diabetes.csv")
X = data.values[:, 0:8]  # features
y = data.values[:, 8]    # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_prediction = svm_model.predict(X_test)
print("PREDICTION : \n", svm_prediction[0:10])

report = classification_report(y_test, svm_prediction)
print("\nClassification Report : \n", report)
```

```
Output

PREDICTION :
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Classification Report :
              precision    recall  f1-score   support

        0.0       0.64      1.00      0.78       197
        1.0       0.00      0.00      0.00       110

avg / total       0.41      0.64      0.50       307
```

**Regression Metrics**

**1. Mean Absolute Error (MAE)**

The Mean Absolute Error (MAE) is a regression metric that measures the difference between two continuous variables. In machine learning, MAE is the mean of the absolute differences between the predictions and the actual values. From the MAE we get to know how wrong, on average, our model's predictions are.
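A minimal hand calculation with made-up values shows what the metric computes:

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MAE: mean of the absolute differences, here (0.5 + 0.5 + 0.0 + 1.0) / 4
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 0.5
```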

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

data = pd.read_csv("boston-house-price-dataset.csv")
X = data.values[:, 0:13]  # features
y = data.values[:, 13]    # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# Linear Regression Model
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train, y_train)
prediction = linear_regression_model.predict(X_test)

results = mean_absolute_error(y_test, prediction)
print("PREDICTION : \n", prediction[0:10])
print("\nMean Absolute Error : ", results)
```

```
Output

PREDICTION :
 [ 24.23412125  41.7997779   16.64323609  20.92898529   5.41589009
  36.46273631  21.41250446  18.65721503  14.42999393  21.34227577]

Mean Absolute Error :  3.44686166895
```

**2. Mean Squared Error (MSE)**

Also referred to as the Mean Squared Deviation (MSD), this is a regression metric that measures the average squared difference between the actual observations and the model's predictions. It is always non-negative, since squaring removes the sign of the errors. It is a risk function corresponding to the expected value of the squared error loss, and because of the squaring it penalizes large errors more heavily than MAE does.
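A quick hand calculation with made-up values; note how the single error of 1.0 dominates the squared terms:

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared differences, here (0.25 + 0.25 + 0.0 + 1.0) / 4
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```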

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

data = pd.read_csv("boston-house-price-dataset.csv")
X = data.values[:, 0:13]  # features
y = data.values[:, 13]    # label/target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

# Linear Regression Model
linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train, y_train)
prediction = linear_regression_model.predict(X_test)

results = mean_squared_error(y_test, prediction)
print("PREDICTION : \n", prediction[0:10])
print("\nMean Squared Error : ", results)
```

```
Output

PREDICTION :
 [ 30.57180013  27.40000976  23.4732742   32.82164261  23.70134802
  19.97317733  23.58751036  18.47115434  26.37462553  34.94713324]

Mean Squared Error :  20.4851915125
```

**Conclusion**

We have learned about one of the most important tasks in the machine learning process: performance evaluation. Different metrics work well with different problems; for example, accuracy can be used on balanced data but is misleading on imbalanced data. We have covered the most commonly used metrics, though there are other useful metrics we have not discussed here.

**What’s Next**

In this post we have learned about model evaluation metrics. In the next post we will learn about visualizing our machine learning data so that we can better understand what it looks like.