Whenever machine learning is used for classification or prediction problems, it’s equally important to focus on its evaluation metrics, which tell you how well your model is performing. In this article, we will discuss how to evaluate the performance of classification algorithms using a confusion matrix.

Table of contents

  • What is classification?
  • What is a confusion matrix?
  • Let’s understand it with an example
  • How to select appropriate evaluation metrics

What is classification?

First, it’s important to understand what classification is.

In layman’s terms, classification is nothing but labeling or dividing the given items into pre-set classes. Say we have 100 fruits, each one an apple, a banana, or a cherry, and we divide the whole set into classes, i.e. we label each fruit with its name. We can do this task of classification with various algorithms, such as logistic regression or SVM.

But when it comes to scoring each model, there are specific evaluation metrics that belong to classification problems. It’s important to understand model evaluation techniques thoroughly and apply them accordingly; otherwise you may feel the work is done while, in practice, your model’s results disappoint you.

What is a confusion matrix?

Let’s try to understand it with a binary classification problem.

Suppose we have two class labels: one is positive and the other is negative. When we predict any instance as ‘No’ or ‘Negative’, there are two options for its actual status.

If that instance is actually ‘Negative’, then we say we correctly predicted it and it is truly negative. So we name it ‘True Negative’ or TN. But this predicted negative could also have been actually positive. If that is the case, then we say we falsely predicted as negative an instance that was actually positive. Hence we name it ‘False Negative’ or FN.

Similarly, when we predict any instance as ‘Yes’ or ‘Positive’, there are two options for its actual status. If that instance is actually ‘Positive’, then we say we correctly predicted it and it is truly positive. So we name it ‘True Positive’ or TP.

But this predicted positive could have been actually negative. If that is the case, then we say we falsely predicted as positive an instance that was actually negative. Hence we name it ‘False Positive’ or FP.

A confusion matrix is composed of these 4 elements: TP, TN, FN and FP.
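The four outcomes can be counted directly from paired lists of actual and predicted labels. The sketch below is a minimal illustration in plain Python; the function name confusion_counts and the toy labels are assumptions made up for this example and do not come from any library.

```python
def confusion_counts(actual, predicted, positive="Positive"):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy labels, invented purely for illustration.
actual = ["Positive", "Negative", "Negative", "Positive", "Negative"]
predicted = ["Positive", "Negative", "Positive", "Negative", "Negative"]

counts = confusion_counts(actual, predicted)
print(counts)
```

Each pair of labels lands in exactly one of the four buckets, so the four counts always add up to the number of instances.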

Confusion matrix tree


No worries! Let’s understand it with an example.

Let’s assume that we have installed a hidden camera at an airport and we are capturing every person passing through that camera frame. Our task is to identify terrorists. We used very sophisticated computer vision techniques and deep learning algorithms. Now we want to evaluate the performance in real time. Let’s assume that 100 people have passed through the frame on the evaluation day. We also know that out of these only 2 people are terrorists and the others are normal people.

We represent our results in the following format.

confusion matrix

From the left table above, we can see that, out of the 98 normal (not terrorist) people, our algorithm predicted 90 as “Not Terrorist” and 8 as “Terrorist”. Similarly, out of the 2 terrorists, the algorithm predicted 1 as “Terrorist” and the other as “Not Terrorist”.
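The counts described above can be laid out as a small table in code. This is only a sketch: it rebuilds the 100 (actual, predicted) pairs from the numbers in the text and tallies them with Python’s standard Counter; the variable names are assumptions chosen for this example.

```python
from collections import Counter

# Rebuild the 100 (actual, predicted) pairs from the counts given in the text.
pairs = (
    [("Not Terrorist", "Not Terrorist")] * 90
    + [("Not Terrorist", "Terrorist")] * 8
    + [("Terrorist", "Not Terrorist")] * 1
    + [("Terrorist", "Terrorist")] * 1
)
matrix = Counter(pairs)

# Print the matrix with actual classes as rows and predicted classes as columns.
labels = ["Terrorist", "Not Terrorist"]
print(f"{'actual / predicted':<20}" + "".join(f"{name:>15}" for name in labels))
for actual in labels:
    row = "".join(f"{matrix[(actual, predicted)]:>15}" for predicted in labels)
    print(f"{actual:<20}{row}")

# 'Terrorist' is the positive class in this example.
tp = matrix[("Terrorist", "Terrorist")]
fn = matrix[("Terrorist", "Not Terrorist")]
fp = matrix[("Not Terrorist", "Terrorist")]
tn = matrix[("Not Terrorist", "Not Terrorist")]
```

With ‘Terrorist’ as the positive class, this gives TP = 1, FN = 1, FP = 8 and TN = 90.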

Here, our task was to predict the terrorists correctly. So for us, ‘detection of a terrorist’ was the positive class.

So what are correct predictions?

  1. Actual Terrorist predicted as Terrorist (count = 1)
  2. Actual Non-Terrorist predicted as Non-Terrorist (count = 90)

Now if you check accuracy, which is the ratio of the number of correct predictions to the total number of predictions, you will notice that in the above case accuracy will be

(90+1)/100 = 0.91 or 91%.

Now let’s check what recall says.

Recall is the ratio of the number of correctly predicted positives to the number of actual positives.

In the above-mentioned case, there were 2 terrorists and our model predicted only 1 of them to be a terrorist.

Hence recall will be 1/2 = 0.5 or 50%.

Now let’s check precision!

Precision is the ratio of correctly predicted positives to total predicted positives. The model predicted a total of 9 people as terrorists (1 + 8), but only 1 of these was an actual terrorist.

So precision will be 1/9 = 0.1111 or 11.11%.
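The three calculations above can be written as small functions of the four counts. This is a minimal sketch; the function names are my own, and returning 0.0 when a denominator is zero is an assumption made here to avoid a division error, not a universal convention.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    """Correctly predicted positives divided by actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Correctly predicted positives divided by predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Counts from the airport example: 1 TP, 90 TN, 8 FP, 1 FN.
tp, tn, fp, fn = 1, 90, 8, 1

acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
prec = precision(tp, fp)
print(f"accuracy={acc:.2%}  recall={rec:.2%}  precision={prec:.2%}")
```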

As you can observe, we have used three different metrics to evaluate the model. But the question is, which metric should we choose to evaluate the model? Which one will tell us the correct score?

How to select appropriate evaluation metrics

Now, let’s check the intuition behind the use case.

Let’s say that we are going to examine the people who are predicted as ‘Terrorist’ by our model. In the above example the model predicted a total of 9 people as ‘Terrorist’. But out of those, only 1 was actually a terrorist. It means that we have troubled the other 8 people for no reason. This happened because the model had very low precision.

But at the same time our model correctly predicted only 1 of the 2 terrorists. It means that we could catch only one terrorist. This happened because the recall value was low.

But if you had looked at accuracy, it is 91%.

If you go with accuracy as the evaluation metric, you will definitely be fooled by this attractive number. But in the real scenario we were troubling 8 people for no reason and at the same time letting one terrorist go.
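To see how misleading accuracy can be on data this imbalanced, here is a rough comparison with a hypothetical baseline that is not part of the example above: a ‘model’ that labels every one of the 100 people as “Not Terrorist”. The baseline and all names in the sketch are assumptions introduced only for illustration.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


# Our model from the airport example.
model = {"tp": 1, "tn": 90, "fp": 8, "fn": 1}

# Hypothetical baseline: predict "Not Terrorist" for all 100 people.
# It gets the 98 normal people right and misses both terrorists.
baseline = {"tp": 0, "tn": 98, "fp": 0, "fn": 2}

for name, c in (("model", model), ("baseline", baseline)):
    acc = accuracy(c["tp"], c["tn"], c["fp"], c["fn"])
    rec = recall(c["tp"], c["fn"])
    print(f"{name:<9} accuracy={acc:.0%}  recall={rec:.0%}")
```

The do-nothing baseline reaches 98% accuracy while catching no terrorist at all, which is why accuracy alone says very little here.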

Let’s say that we have changed our model parameters to correctly predict both terrorists. We made the model stricter so that it starts predicting more people as terrorists. This way, we would be able to detect both terrorists, but we would also trouble more people who are not terrorists.

It means that when we try to increase recall, precision tends to decrease, and vice versa.
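One common way to make a model stricter or more lenient is to move the score threshold above which a person is flagged. The sketch below assumes the model outputs a score between 0 and 1 for each person; the scores, the thresholds and the function name precision_recall are all invented for illustration.

```python
# Hypothetical (score, is_terrorist) pairs; the scores are invented.
people = [
    (0.95, True), (0.80, False), (0.70, False), (0.60, True),
    (0.40, False), (0.30, False), (0.20, False), (0.10, False),
]


def precision_recall(scored, threshold):
    """Flag everyone whose score is at or above the threshold."""
    tp = sum(1 for score, positive in scored if score >= threshold and positive)
    fp = sum(1 for score, positive in scored if score >= threshold and not positive)
    fn = sum(1 for score, positive in scored if score < threshold and positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Lowering the threshold flags more people.
for threshold in (0.9, 0.5, 0.25):
    p, r = precision_recall(people, threshold)
    print(f"threshold={threshold:<5} precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.9 to 0.5 raises recall from 0.5 to 1.0 while precision drops from 1.0 to 0.5. This is a tendency and not a strict rule, so it is worth checking on your own data.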

So what should we do?

We need to decide based on the problem at hand. In the above problem, it’s okay to trouble some people for examination. But it’s very important that no terrorist goes undetected.

It means that we want maximum recall at best possible precision.

Here “best possible precision” means it’s not okay to get maximum recall at very low precision. The model is not useful if it predicts so many people as ‘terrorists’ that we end up checking 30%-40% of all people.
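One possible way to turn “maximum recall at best possible precision” into a selection rule is sketched below. It assumes we have measured a few candidate settings of the same model and that we can afford to check at most a fixed share of people; the candidates, the 10% budget and the function name are all hypothetical.

```python
# Hypothetical operating points for the same model at different settings.
# All numbers are invented for illustration.
candidates = [
    {"name": "lenient", "recall": 0.50, "precision": 0.50, "flagged": 0.001},
    {"name": "balanced", "recall": 1.00, "precision": 0.17, "flagged": 0.010},
    {"name": "strict", "recall": 1.00, "precision": 0.01, "flagged": 0.350},
]

MAX_FLAGGED = 0.10  # assumed review budget: check at most 10% of people


def pick_operating_point(points, max_flagged):
    """Highest recall within the budget; ties are broken by precision."""
    affordable = [p for p in points if p["flagged"] <= max_flagged]
    if not affordable:
        return None  # no setting fits the budget
    return max(affordable, key=lambda p: (p["recall"], p["precision"]))


best = pick_operating_point(candidates, MAX_FLAGGED)
print(best["name"])
```

The ‘strict’ setting also reaches full recall, but it flags 35% of people, so the rule discards it and settles on the ‘balanced’ one.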

So let’s get back to the original question: what should an ideal model do?

Let’s say that over a 1-year timeframe, we observe 10 million people. During the whole year there were actually 2 terrorists. On average our model predicts 1 person as ‘Terrorist’ per month, so in total it flagged 12 people over the full year. We investigated all of them, and 10 were not terrorists. We let them go after complete checks. But both the terrorists were caught. It means we are troubling very few people, only 10 in the whole year, and we are catching all the terrorists.

Whoa! That’s what we wanted. In this case, the accuracy, recall and precision are 99.9999%, 100% and 16.67% respectively.

The confusion matrix is as below.
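The counts for this one-year scenario follow directly from the numbers in the text: 12 people flagged, 2 of them real terrorists, nobody missed, out of 10 million observed. A short sketch that derives the matrix and the three metrics (the variable names are my own):

```python
total = 10_000_000  # people observed during the year

tp = 2   # terrorists flagged and caught
fp = 10  # flagged people who turned out not to be terrorists
fn = 0   # terrorists the model missed
tn = total - tp - fp - fn  # everyone else, correctly left alone

# Rows are actual classes, columns are predicted classes.
print(f"{'':<22}{'pred. Terrorist':>18}{'pred. Not Terrorist':>22}")
print(f"{'actual Terrorist':<22}{tp:>18,}{fn:>22,}")
print(f"{'actual Not Terrorist':<22}{fp:>18,}{tn:>22,}")

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(f"accuracy={accuracy:.4%}  recall={recall:.2%}  precision={precision:.2%}")
```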

If the problem statement is different and precision and recall are equally important, then we take their harmonic mean, which we call the ‘F1 Score’.

F1 Score = 2*(Precision*Recall)/(Precision + Recall)
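As a quick check of the formula, here it is applied to the airport example from earlier, where precision was 1/9 and recall was 1/2. The function name f1_score is just a label for this sketch, and returning 0.0 when both inputs are zero is an assumption made to avoid a division error.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


precision = 1 / 9  # airport example: 1 true positive out of 9 predicted
recall = 1 / 2     # airport example: 1 of the 2 terrorists caught

f1 = f1_score(precision, recall)
simple_mean = (precision + recall) / 2
print(f"F1={f1:.4f}  arithmetic mean={simple_mean:.4f}")
```

The F1 score works out to 2/11, about 0.18, which is well below the arithmetic mean of about 0.31. The harmonic mean stays close to the weaker of the two values, so a model cannot score well on F1 by doing well on only one of them.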

I hope that I was able to explain the classification metrics in a very simple and fun way. If you liked the article and it helped you understand the confusion matrix better, do like, comment and share.


Ujwal Pawar

A highly motivated and passionate ML expert who specializes in translating real-world business challenges into analytics frameworks. He has 1.5 years of experience delivering end-to-end data science projects, from ideation to evaluation.
