In this article, we take an in-depth look at outliers from both the statistical and the programming side: how to detect and treat them so that they don't hurt model performance. By the end, you will see how important outlier handling is during data preprocessing.
Table of contents
- Why study outliers?
- What is an outlier?
- Should we remove outliers?
- Techniques to detect outliers
  - Using a scatter plot
  - Using a box plot
  - Using the Z-score method (normally distributed data)
  - Using the interquartile range (IQR)
One may ask: why study outliers?
Suppose in one school, a new coach has been working with the Long Jump team this month, and the athletes’ performance has changed. Sam can now jump 0.15m further, June and Carol can jump 0.06m further.
The following are the results:
Vishu: -0.50m. Oh no, Vishu got worse.
The mean value is (0.20 + 0.12 + 0.06 - 0.50)/4 = -0.03m. So on average, performance got slightly worse.
The coach is obviously useless... right?
Here, Vishu's result is an outlier. What if we remove Vishu's result? (0.20 + 0.12 + 0.06)/3 ≈ 0.13m
We need to investigate why that value is there.
Maybe Vishu was feeling sick during practice, which is not the coach's fault at all. So in this case it is a good idea to remove Vishu's result.
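The arithmetic above can be checked with a few lines of Python. This is only a small sketch: the four values are the ones used in the mean calculation above, and the variable names are made up for illustration.

```python
import statistics

# Changes in jump distance (metres) used in the mean calculation above.
# The last value is Vishu's result.
changes = [0.20, 0.12, 0.06, -0.50]

mean_with_outlier = statistics.mean(changes)
mean_without_outlier = statistics.mean(changes[:-1])  # drop Vishu's result

print(round(mean_with_outlier, 2))     # -0.03
print(round(mean_without_outlier, 2))  # 0.13
```

A single extreme value is enough to flip the sign of the mean, which is why it pays to look at outliers before trusting a summary statistic.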
What is meant by an outlier?
In statistics, an outlier is a data point which is far away from the normal range of observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set.
Should we remove outliers or not?
It is not always necessary to remove outliers; it depends on the business problem and its requirements. As machine learning engineers, we decide whether to remove them or not.
Suppose we have a dataset of people's weights in which the normal range is 30-70, but 15-20 people weigh more than 100. In this case we can't simply discard these values as outliers, since they are real values. We can drop outliers in a dataset of people's favourite TV shows, but we can't remove outliers from a dataset about credit card fraud, where the unusual cases are exactly what we are looking for. It is up to your common sense and observation whether to remove them or not.
If a large share of the data points (say, at least 30%) are outliers, they probably carry some interesting and meaningful insight, and you should not remove them.
Why should we remove outliers?
Outliers increase the variability in a dataset, which reduces statistical power and can make our model less accurate. As we saw in the school example above, an outlier strongly affects the mean.
How to detect and treat outliers
There are various methods to detect them. Based on the distribution of the data, they can be grouped into two categories:
- Using the interquartile range(IQR)
- Using standard deviation
If the data is skewed, IQR works better for finding outliers. If the data is normally distributed, both methods perform similarly.
IQR (for skewed distribution)
First we need to understand what quartiles are.
Quartiles are the values that divide the sorted dataset into four equal parts (the name comes from "quarter").
Steps to calculate Quartiles
- First sort the data in ascending order
- Q1 (First Quartile/Lower Quartile) = (n+1)/4 th item, where n = number of observations
- Q2 (Second Quartile) = (n+1)/2 th item = Median
- Q3 (Third Quartile/Upper Quartile) = 3(n+1)/4 th item
- Q4 = Highest observation
Let's see an example.
Ex) Find Q1 and Q3 of the following dataset: [3,8,5,2,6,9,4,10,7]
First sort the dataset: 2,3,4,5,6,7,8,9,10
Q1 = (n+1)/4 th element = 10/4 = (2.5)th element
Whenever the position is a fraction, we need a slightly different calculation: we interpolate between the two neighbouring elements.
Q1 = 2nd element + 0.5(3rd element - 2nd element) OR Q1 = avg(2nd element, 3rd element)
Q1 = 3 + 0.5(4 - 3) = 3.5
What does the value of Q1 tell us about the data?
Q1 = 3.5 indicates that about 25% of the values are less than or equal to 3.5.
Similarly, let's calculate Q2.
Q2 = (n+1)/2 th element = (9+1)/2 = 10/2 = 5th element = 6
Q2 = 6. As mentioned above, the 2nd quartile is the median value.
So Q2 indicates that about 50% of the values are less than or equal to 6.
Q3 = 3(n+1)/4 th element = 3(10)/4 = 7.5th element. Here, too, the position is a fraction.
Q3 = 7th element + 0.5(8th element - 7th element) OR Q3 = avg(7th element, 8th element)
Q3 = 8 + 0.5(9 - 8) = 8.5, which implies that about 75% of the values are less than or equal to 8.5.
The averaging shortcut agrees with interpolation only when the fractional part is 0.5, and different quartile conventions can give slightly different values. The differences are usually very small and will not matter for data analysis.
The interquartile range is IQR = Q3 - Q1. So in the above example, IQR = 8.5 - 3.5 = 5.
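The position rule used in this worked example can be sketched in Python as follows. The helper name `quartile` is hypothetical, and the sketch assumes the data has at least three observations so that the neighbouring elements always exist.

```python
def quartile(sorted_data, k):
    """Return the k-th quartile (k = 1, 2 or 3) using the k(n+1)/4 position rule.

    Assumes sorted_data is in ascending order and has at least 3 items.
    """
    n = len(sorted_data)
    position = k * (n + 1) / 4      # 1-based position, may be fractional
    lower = int(position)           # 1-based index of the lower neighbour
    fraction = position - lower
    if fraction == 0:
        return float(sorted_data[lower - 1])
    # Interpolate between the two neighbouring elements.
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + fraction * (above - below)


data = sorted([3, 8, 5, 2, 6, 9, 4, 10, 7])
q1 = quartile(data, 1)
q2 = quartile(data, 2)
q3 = quartile(data, 3)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 3.5 6.0 8.5 5.0
```

Note that libraries often use a different default convention (for example, NumPy's `np.percentile` interpolates on positions based on n - 1), so their quartiles for the same data may differ slightly from the hand calculation.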
Now we will see how IQR helps us detect outliers.
Values outside the following interval are treated as outliers:
[Q1 - 1.5(IQR), Q3 + 1.5(IQR)]
So in the above example, the values outside the interval
[3.5-1.5(5), 8.5+1.5(5)] i.e. [-4, 16] are outliers.
So here we can conclude that in the dataset [2,3,4,5,6,7,8,9,10] there are no outliers.
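As a quick check of this conclusion, here is a minimal sketch using the standard library. It relies on `statistics.quantiles` with `method="exclusive"`, which uses the same (n+1)-based positions as the hand calculation; the variable names are made up for illustration.

```python
import statistics

data = [3, 8, 5, 2, 6, 9, 4, 10, 7]

# n=4 returns the three cut points Q1, Q2, Q3.
# The "exclusive" method uses (n+1)-based positions, like the steps above.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)  # -4.0 16.0
print(outliers)                  # []
```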
Using Standard Deviation Method(in case of normal distribution)
First, we need to understand what standard deviation is.
Numerically, it is the square root of the average of the squared distances of the observations from the mean.
Not clear? No worries!
The Standard Deviation is a measure of how spread out numbers are.
It is denoted by σ
Following is a plot of normal distribution (or bell-shaped curve) where each band has a width of 1 standard deviation
Suppose σ=1.5. For normally distributed data, this tells you that the bulk of the observations (about 68%) lie within 1.5 units, i.e. one standard deviation, on either side of the mean. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
SD is expressed in the same units as the data.
For a normal distribution, SD tells us what share of the values lies within a given distance of the mean: about 68% within 1σ, about 95% within 2σ and about 99.7% within 3σ.
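Formula of SD for a population:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
where μ: mean of the population, N: population size
Formula of SD for a sample:
$$S = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{X})^2}$$
where S: standard deviation of the sample, X̄: mean of the sample, N: sample size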
The important change is “N-1” instead of “N” (which is called “Bessel’s correction”).
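To make the difference between the two formulas concrete, here is a minimal sketch that computes both by hand for the small dataset from the quartile example. The variable names are illustrative only.

```python
import math

data = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(data)

mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

# Population SD divides by N; sample SD divides by N - 1 (Bessel's correction).
population_sd = math.sqrt(squared_deviations / n)
sample_sd = math.sqrt(squared_deviations / (n - 1))

print(round(population_sd, 3))  # 2.582
print(round(sample_sd, 3))      # 2.739
```

The same values are returned by `statistics.pstdev(data)` and `statistics.stdev(data)` from the standard library.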
Now, let's see how the standard deviation helps us determine outliers.
For normally distributed data, data points that lie outside the range
[μ - 3σ, μ + 3σ]
are considered outliers, where μ is the mean value.
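A minimal sketch of this rule, applied to the small dataset from the quartile example, might look like the following. That dataset is too small to really be called normally distributed, so treat it purely as an illustration of the mechanics.

```python
import statistics

data = [2, 3, 4, 5, 6, 7, 8, 9, 10]

mu = statistics.mean(data)       # 6
sigma = statistics.pstdev(data)  # population SD, about 2.58

lower_limit = mu - 3 * sigma
upper_limit = mu + 3 * sigma
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(round(lower_limit, 2), round(upper_limit, 2))  # -1.75 13.75
print(outliers)                                      # []
```

As with the IQR interval, no value of this dataset falls outside the range.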
Now, one can ask: if I apply both methods (IQR and SD) to the same normally distributed dataset, which will flag more outliers?
We will get more outliers with the IQR method. But it doesn't much matter which method you use to count outliers, since the difference in counts is very small relative to the size of the dataset when the dataset is large enough. We will come back to this in the programming section that follows.
So far we have seen the concepts; now let's see some practical examples with code!
Detect and treat outliers using python
- Using a Scatter plot graph
- Using Box plot graph
- Using Z_score method (Normally distributed Data)
- Using the IQR interquartile range
Using Scatter Plot
In the scatter plot, we can see that three dots are far away from the normal data range. These three dots are outliers.
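A scatter plot like the one described can be sketched with matplotlib as follows. The `dataset` below is a made-up sample (an assumption, not necessarily the data behind the original figure), chosen so that it contains three extreme values: 102, 107 and 108.

```python
import matplotlib.pyplot as plt

# Hypothetical sample: mostly values between 10 and 19, plus three extremes.
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

fig, ax = plt.subplots()
# Plot each observation against its position in the list.
points = ax.scatter(range(len(dataset)), dataset)
ax.set_xlabel("Observation index")
ax.set_ylabel("Value")
ax.set_title("Scatter plot of the sample")

# plt.show()  # uncomment to display the figure in an interactive session
```

The three points near 100 sit far above the band where the rest of the data lies, which is the visual cue for outliers.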
Using Box Plot
The middle line in the box plot indicates the median value. We can also judge the skewness of our data from the shape of the box plot. If the box plot is symmetric, it suggests that our data is symmetrically distributed, as normally distributed data would be (although symmetry alone does not prove normality). If the box plot is not symmetric, our data is skewed. You can get a better understanding by looking at the diagrams below:
So, from the above box plot, we can see that there are three outliers.
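A box plot of this kind can be sketched with matplotlib as shown below. The `dataset` is a made-up sample (an assumption, not necessarily the original data) that contains three extreme values. By default, matplotlib draws points beyond 1.5 times the IQR from the box as individual "flier" markers.

```python
import matplotlib.pyplot as plt

# Hypothetical sample: mostly values between 10 and 19, plus three extremes.
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

fig, ax = plt.subplots()
box = ax.boxplot(dataset)  # default whiskers reach up to 1.5 * IQR from the box
ax.set_ylabel("Value")
ax.set_title("Box plot of the sample")

# The points drawn beyond the whiskers are the flagged outliers.
flagged = sorted(float(v) for v in box["fliers"][0].get_ydata())
print(flagged)  # [102.0, 107.0, 108.0]

# plt.show()  # uncomment to display the figure in an interactive session
```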
Using Z score
Z-score = (Observation - Mean)/Standard Deviation
A data point that falls more than 3 standard deviations from the mean, i.e. whose z-score has an absolute value greater than 3, is treated as an outlier.
So the output [102, 107, 108] will be displayed as the outliers.
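The z-score check can be sketched with NumPy as follows. The `dataset` is a made-up sample (an assumption, not necessarily the original data) that contains the three values mentioned above, and `detect_outliers_zscore` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical sample: mostly values between 10 and 19, plus three extremes.
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]


def detect_outliers_zscore(data, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    values = np.asarray(data, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0)
    z_scores = (values - mean) / std
    return [x for x, z in zip(data, z_scores) if abs(z) > threshold]


outliers = detect_outliers_zscore(dataset)
print(outliers)  # [102, 107, 108]
```

Keep in mind that extreme values inflate the mean and the standard deviation themselves, so on small samples the z-score method can miss outliers that the IQR method would flag.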
Using Interquartile Range
So, outliers are [102, 107, 108]
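A minimal sketch of the IQR method with NumPy is shown below, including one simple way of treating the outliers (dropping them). The `dataset` is a made-up sample (an assumption, not necessarily the original data), and whether dropping is appropriate depends on the problem, as discussed earlier.

```python
import numpy as np

# Hypothetical sample: mostly values between 10 and 19, plus three extremes.
dataset = [11, 10, 12, 14, 12, 15, 14, 13, 15, 102, 12, 14, 17, 19, 107,
           10, 13, 12, 14, 12, 108, 12, 11, 14, 13, 15, 10, 15, 12, 10,
           14, 13, 15, 10]

# np.percentile uses linear interpolation by default.
q1, q3 = np.percentile(dataset, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in dataset if x < lower_fence or x > upper_fence]
cleaned = [x for x in dataset if lower_fence <= x <= upper_fence]

print(q1, q3, iqr)               # 12.0 15.0 3.0
print(lower_fence, upper_fence)  # 7.5 19.5
print(outliers)                  # [102, 107, 108]
```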
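For normally distributed data, the IQR fences Q1 - 1.5(IQR) and Q3 + 1.5(IQR) lie at about 2.7 standard deviations from the mean, which is inside the 3σ limits. Hence, in general, we get more outliers with the IQR method.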
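The claim that the IQR method flags more points on normally distributed data can be checked with a small simulation. This is a sketch under stated assumptions: the data is synthetic (drawn with `random.gauss`), and the sample size and seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable
data = [random.gauss(0, 1) for _ in range(50_000)]

# 3-sigma rule
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)
sd_outliers = [x for x in data if abs(x - mu) > 3 * sigma]

# IQR rule
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print("3-sigma rule:", len(sd_outliers))
print("IQR rule:    ", len(iqr_outliers))
```

In theory, for a normal distribution the 3σ rule flags roughly 0.3% of the points and the IQR rule roughly 0.7%, so both counts are small compared with the size of the dataset.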
So this is about outliers. I hope you have liked it. I would like to hear your valuable feedback.