Whenever machine learning is used for classification or prediction problems, it is just as important to pay attention to the evaluation metrics that tell you how well your model is performing. In this article, we will discuss how to evaluate the performance of classification algorithms using a confusion matrix.

Table of contents

  • What is classification
  • What is confusion matrix
  • Let’s understand with an example
  • How to select appropriate metrics

What is classification?

First, it’s important to understand what classification is.

In layman’s terms, classification means labeling the given items or dividing them into pre-set classes. Suppose we have 100 fruits, a mix of apples, bananas, and cherries, and we divide the whole set into classes by labeling each fruit with its name. We can perform this task using various algorithms such as logistic regression or SVM.

But when it comes to scoring each model, there is a specific set of evaluation metrics for classification problems. It’s important to understand these model evaluation techniques thoroughly and apply them correctly; otherwise you may feel the work is done when in fact your model’s results will disappoint you.

What is confusion matrix?

Let’s try to understand it with binary classification problems.

Suppose we have two class labels. One is positive and the other is negative. When we predict any instance as ‘No’ or ‘Negative’, there are two options for its actual status.

If that instance is actually ‘Negative’, then we say we correctly predicted it and it is truly negative. So we name it ‘True Negative’ or TN. But this predicted negative could have been actually positive also. If that is the case, then we say we falsely predicted that instance as Negative which was actually positive. Hence we name it ‘False Negative’ or FN.

Similarly, when we predict any instance as ‘Yes’ or ‘Positive’, there are two options for its actual status. If that instance is actually ‘Positive’, then we say we correctly predicted it and it is truly positive. So we name it ‘True Positive’ or TP.

But this predicted positive could have been actually negative. If that is the case, then we say we falsely predicted that instance as positive which was actually negative. Hence we name it ‘False Positive’ or FP.
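As a minimal sketch of the four outcomes described above, the following Python snippet counts them from two lists of labels. The function name `confusion_counts` and the sample labels are made up for illustration; 1 stands for positive and 0 for negative.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted positive, actually positive
        elif p == 0 and a == 0:
            tn += 1  # predicted negative, actually negative
        elif p == 1 and a == 0:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels, for illustration only
actual = [1, 0, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```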

A confusion matrix is made up of these 4 elements: TP, TN, FN, and FP.

[Image Source: Towards Data Science]

No worries! Let’s understand it with an example.

Let’s assume that we have installed a hidden camera at an airport and we are capturing every person passing through that camera frame. Our task is to identify terrorists. We used very sophisticated computer vision techniques and Deep Learning algorithms. Now we want to evaluate the performance in real-time. Let’s assume that 100 people have passed through the frame on the evaluation day. We also know that out of these only 2 people are terrorists and others are normal people.

We represent our results in the following format.

[Figure: confusion matrix for the airport example]

From the confusion matrix above, we can see that out of the 98 normal (not terrorist) people, our algorithm predicted 90 as “Not Terrorist” and 8 as “Terrorist”. Similarly, out of the 2 terrorists, the algorithm predicted 1 as “Terrorist” and the other as “Not Terrorist”.

Here, our task was to predict the terrorists correctly. So for us, ‘detection of terrorists’ was positive.

So what are correct predictions?

  1. Actual Terrorist predicted as Terrorist (count = 1)
  2. Actual Non-Terrorist predicted as Non-Terrorist (count = 90)

Now, if you check accuracy, which is the ratio of the number of correct predictions to the total number of predictions, you will see that in the above case it is

(90+1)/100 = 0.91 or 91%.
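The same arithmetic can be sketched in Python. The counts come from the airport example, with ‘Terrorist’ as the positive class; the helper name `accuracy` is just an illustrative choice.

```python
# Counts from the airport example (positive class = Terrorist)
tp = 1   # terrorists predicted as Terrorist
tn = 90  # non-terrorists predicted as Not Terrorist
fp = 8   # non-terrorists predicted as Terrorist
fn = 1   # terrorists predicted as Not Terrorist


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


acc = accuracy(tp, tn, fp, fn)
print(f"Accuracy: {acc:.2%}")  # Accuracy: 91.00%
```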

Now let’s see what recall says.

Recall is the ratio of the number of correctly predicted positives to the number of actual positives.

In the above-mentioned case, there were 2 terrorists and our model predicted only 1 to be the terrorist.

Hence recall will be 1/2 = 0.5 or 50%.
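Here is a small Python sketch of that calculation. The helper name `recall` is hypothetical, and the zero check is an added safeguard for the case where there are no actual positives at all.

```python
def recall(tp, fn):
    """Correctly predicted positives divided by all actual positives."""
    actual_positives = tp + fn
    # Avoid dividing by zero when the data contains no actual positives
    return tp / actual_positives if actual_positives else 0.0


# Airport example: 2 actual terrorists, 1 caught and 1 missed
rec = recall(tp=1, fn=1)
print(f"Recall: {rec:.2%}")  # Recall: 50.00%
```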

Now let’s check precision!

Precision is the ratio of correctly predicted positives to total predicted positives. The model predicted a total of 9 people as terrorists, but only 1 of them was an actual terrorist.

So precision will be 1/9 = 0.1111 or 11.11%.
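A matching Python sketch for precision follows. Again, the helper name `precision` is only illustrative, and the zero check covers a model that predicts no positives.

```python
def precision(tp, fp):
    """Correctly predicted positives divided by all predicted positives."""
    predicted_positives = tp + fp
    # Avoid dividing by zero when the model predicts no positives
    return tp / predicted_positives if predicted_positives else 0.0


# Airport example: 9 people flagged, only 1 of them an actual terrorist
prec = precision(tp=1, fp=8)
print(f"Precision: {prec:.2%}")  # Precision: 11.11%
```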

As you can see, we have used three different metrics to evaluate the model. But the question is, which metric should we choose to evaluate the model? Which one will tell us the correct score?

How to select appropriate evaluation metrics

Now, let’s check the intuition behind the use case.

Let’s say that we are going to examine the people whom our model predicts as ‘Terrorists’. In the above example, the model predicted a total of 9 people as ‘Terrorists’, but only 1 of them was actually a terrorist. It means that we troubled the other 8 people for no reason. This happened because the model had very low precision.

But at the same time, our model has correctly predicted only 1 terrorist out of 2 terrorists. It means that we could only catch one terrorist. This happened because the recall value was low.

But if you had looked only at accuracy, you would have seen 91%.

If you choose accuracy as your evaluation metric, you will definitely be fooled by this attractive number. But in the real scenario, we were troubling 8 people for no reason and, at the same time, letting one terrorist go.
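To see how misleading accuracy can be on such imbalanced data, consider a hypothetical model (not the one from the example) that simply labels all 100 people as ‘Not Terrorist’. A quick sketch in Python:

```python
# Hypothetical "do nothing" model: predicts Not Terrorist for all 100 people.
# With 98 normal people and 2 terrorists, the counts become:
tp, fp = 0, 0   # nobody is ever flagged
tn, fn = 98, 2  # all normal people are correct, both terrorists are missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.0%}, Recall: {recall:.0%}")  # Accuracy: 98%, Recall: 0%
```

This model catches no one, yet its accuracy is higher than the 91% of the model in the example.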

Let’s say that we have changed our model parameters to correctly predict both terrorists. We made the model stricter, so that it starts predicting more people as terrorists. This way we would be able to detect both terrorists, but we would also trouble more people who are not terrorists.

It means that when we try to increase recall, precision tends to decrease, and vice versa.
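One common way this trade-off shows up is through the decision threshold applied to a model’s scores. The sketch below uses made-up scores and labels (they are not from the airport example) and a hypothetical helper `precision_recall` to show that, on this toy data, lowering the threshold raises recall while precision drops.

```python
# Hypothetical model scores (higher = more likely positive) and true labels
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 0, 1, 0, 0]


def precision_recall(threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Lowering the threshold makes the model flag more instances as positive
for threshold in (0.9, 0.6, 0.25):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.9: precision=1.00, recall=0.33
# threshold=0.6: precision=0.67, recall=0.67
# threshold=0.25: precision=0.50, recall=1.00
```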

We need to decide based on the problem at hand. In the above problem, it’s okay to inconvenience some people with an examination. But it’s very important that no terrorist goes undetected.

It means that we want maximum recall at the best possible precision.

Here “best possible precision” means it’s not okay to get maximum recall at very low precision. The model is not useful if it predicts so many people as ‘terrorists’ that we end up checking 30%-40% of all people.

So let’s get back to the original question: what should an ideal model do?

Let’s say that over a 1-year timeframe we observe 10 million people, and that there were actually 2 terrorists among them during the whole year. On average, our model predicts 1 person per month as a ‘Terrorist’, so in total it flagged 12 people over the full year. We investigated them, and 10 were not terrorists; we let them go after completing the checks. But both the terrorists were caught. It means we troubled very few people, only 10 in the whole year, and we caught all the terrorists.

Whoa! That’s what we wanted. In this case, the accuracy, recall, and precision are 99.9999%, 100%, and 16.67% respectively.
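The three metrics for this one-year scenario can be recomputed from its counts. In the sketch below, the variable names are illustrative and the true-negative count is derived by subtraction from the 10 million people observed.

```python
total = 10_000_000  # people observed during the year
tp, fn = 2, 0       # both terrorists were flagged, none missed
fp = 10             # flagged, checked, and released
tn = total - tp - fn - fp  # everyone else was correctly left alone

accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"accuracy={accuracy:.4%}, recall={recall:.2%}, precision={precision:.2%}")
```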

If the problem statement is different and precision and recall are equally important, then we take their harmonic mean, which we call the ‘F1 Score’.

F1 Score = 2*(Precision*Recall)/(Precision + Recall)
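As a quick illustration, the formula can be applied to the precision (1/9) and recall (1/2) of the 100-person airport example. The helper name `f1_score` is our own choice here, and the zero check is an added safeguard.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both are zero, so define F1 as zero
    return 2 * (precision * recall) / (precision + recall)


# Airport example: precision = 1/9, recall = 1/2
f1 = f1_score(precision=1 / 9, recall=1 / 2)
print(f"F1 Score: {f1:.4f}")  # F1 Score: 0.1818
```

Because the harmonic mean is pulled toward the smaller of the two values, the low precision keeps the F1 Score low even though recall is 50%.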

We hope that we were able to explain the classification metrics in a very simple and fun way. If you like the article and if it helped you understand the confusion matrix better, do like, comment and share.

Ujwal Pawar

Highly motivated and passionate ML Expert who specializes in translating real-world business challenges into Analytics Frameworks. He has 1.5 years of experience delivering end-to-end Data Science projects from Ideation to Evaluation.