Today, we’ll be discussing 7 must-know statistical concepts for data science beginners.
Data Science is one of the fastest-growing fields today. Do you know why? It brings together domain expertise, programming, mathematics, and statistics to make sense of data and create insights. You may be wondering why statistics in particular is essential for Data Science. To put it simply, statistical knowledge helps you collect data with the proper methods, apply the correct analyses, and present the results effectively.
Now you might be thinking that statistics is too vast a subject, so how will you ever learn it? And if you can't learn it, perhaps data science is not your cup of tea?
Do not worry!!
We are going to discuss all the important concepts of statistics you need to start with.
“Statistics is the science about how, not being able to think and understand, to make the figures do it for yourself.” – Vasily Klyuchevsky
Table of Contents
- Descriptive Statistics
- Mean, Median, and Mode
- Standard Deviation
- Correlation and Covariance
- Normal Distribution
- Central Limit Theorem
- Outlier Detection
To start learning statistical concepts for data science, you first need to understand descriptive statistics. Descriptive statistics is a way of analyzing and summarizing the basic features of a dataset.
What will you do with just the raw data available? Nothing.
Presenting the data meaningfully is one of the key aspects of any business process. You should be able to understand and convey how your data is distributed and other key characteristics.
So what can we use to describe the data well?
Mean, Median and Mode
Mean: The average value of the data
Median: The middle value of an ordered dataset
Mode: The most frequent value in the data
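Using Python's built-in statistics module, all three measures can be computed directly (the scores below are made-up illustration data):

```python
import statistics

scores = [48, 52, 52, 55, 60, 61, 70]  # hypothetical exam scores

mean = statistics.mean(scores)      # arithmetic average of all values
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)  # 56.857... 55 52
```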
How are the Mean, Median, and Mode useful in Data Science?
Let us consider a scenario where you have a dataset but it contains quite a lot of missing values. How would you deal with missing values?
Drop them?
Well, simply dropping the missing values is not a good idea. Dropping them leads to information loss, which may have repercussions later on.
It is better to replace the missing values with the mean, median, or mode. For a numerical variable, replace the missing value with the mean or median.
The rule of thumb is:
- Replace missing value with mean if the dataset is normally distributed.
- Replace missing value with median if the variable is skewed.
In the case of a categorical variable, imputing missing values with the mode is the better choice. Suppose the gender variable has 500 males and 200 females; replace all missing values with 'male'.
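A minimal sketch of both imputation rules, using plain Python with `None` standing in for missing values (the columns are hypothetical):

```python
import statistics

# Hypothetical columns with missing entries marked as None
ages = [23, 25, None, 31, 29, None, 40]
genders = ["male", "female", "male", None, "male", "male", None]

# Numerical variable: impute with the median (robust if the data is skewed)
age_median = statistics.median(v for v in ages if v is not None)
ages = [age_median if v is None else v for v in ages]

# Categorical variable: impute with the mode (most frequent category)
gender_mode = statistics.mode(v for v in genders if v is not None)
genders = [gender_mode if v is None else v for v in genders]

print(ages)     # missing ages replaced by 29
print(genders)  # missing genders replaced by "male"
```

In practice you would use `fillna` from pandas for this, but the logic is the same.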
Standard Deviation is a measure of the spread of the distribution. It is a measure of uncertainty.
- A low standard deviation means that most of the values are close to the mean.
- A high standard deviation means that the values are spread farther away from the mean.
Why is Standard Deviation important in Data Science?
If the standard deviation is high, the model is unable to predict the output precisely. That is why, in the case of widely spread values, we apply techniques like standardization and normalization, which scale the values into a particular range. We can then apply an ML model to the scaled dataset.
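Here is a quick sketch of both scaling techniques on a made-up feature. Standardization produces z-scores (mean 0, standard deviation 1), while min-max normalization squeezes the values into [0, 1]:

```python
import statistics

values = [40, 50, 60, 70, 80]  # a widely spread hypothetical feature

mu = statistics.mean(values)       # 60
sigma = statistics.pstdev(values)  # population standard deviation

# Standardization (z-scores): mean becomes 0, standard deviation becomes 1
standardized = [(v - mu) / sigma for v in values]

# Min-max normalization: values scaled into the range [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```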
Variance is the expected value of the squared deviation of a random variable from its mean. In other words, it is the square of the standard deviation.
Why is variance important in data science?
High variance can cause an algorithm to model the random noise in the training data. This leads to overfitting. A model having low bias and low variance is an ideal model.
Variance can also be used to detect multicollinearity. If you don't know what multicollinearity is, no worries: it is a scenario where the independent variables are correlated amongst themselves. This can lead to overfitting; ideally, the input variables should be independent of each other.
The Variance Inflation Factor (VIF) is one technique that can detect multicollinearity. It is the ratio of the variance of the overall model to the variance of a model that includes only that single independent variable. As a common rule of thumb, a VIF greater than 5 indicates multicollinearity.
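For intuition, here is a sketch of the VIF for the simplest case of two predictors, where regressing one predictor on the other gives R² = r², so VIF = 1 / (1 − r²). The data is made up; in practice you would use `variance_inflation_factor` from statsmodels:

```python
import statistics

# Two hypothetical, strongly related predictors (e.g. age and years of experience)
x1 = [25, 30, 35, 40, 45, 50]
x2 = [2, 6, 11, 15, 21, 26]

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (statistics.pstdev(a) * statistics.pstdev(b))

# With just two predictors, R^2 of one regressed on the other is r^2,
# so VIF = 1 / (1 - r^2)
r = pearson_r(x1, x2)
vif = 1 / (1 - r ** 2)
print(vif)  # far above 5 here, flagging multicollinearity
```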
Skewness is a measure of the asymmetry of the probability distribution of a random variable, i.e., how far it deviates from the symmetric shape of the normal distribution. If that sounds too complex, do not worry; we will clarify every technical term mentioned.
Let us understand skewness with a real-life example. Suppose 1,000,000 students took the JEE Main exam, and most of them scored less than 150 due to the increased difficulty level; only a few students scored more than that. Most of the data points would therefore lie towards the left, with a long tail to the right, and the distribution would be positively skewed.
From this, we can see a relationship between the three measures of central tendency:
- In case of Positive skewness, Mode < Median < Mean
- In case of Negative skewness, Mean < Median < Mode
In data science, skewness is not desirable in a dataset, so we often try to transform the data toward a normal distribution. You can check for skewness simply by plotting the distribution.
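The relation above is easy to verify numerically. Below is a sketch using the Fisher-Pearson moment coefficient of skewness on made-up, right-tailed scores (scipy's `skew` computes the same quantity):

```python
import statistics

# Hypothetical positively skewed exam scores: most students score low,
# a few score very high (the long tail is on the right)
scores = [40, 45, 50, 50, 55, 60, 65, 150, 200]

mean = statistics.mean(scores)
median = statistics.median(scores)
sigma = statistics.pstdev(scores)

# Fisher-Pearson moment coefficient of skewness (third standardized moment)
n = len(scores)
skew = sum((x - mean) ** 3 for x in scores) / (n * sigma ** 3)

print(mean > median)  # True for positive skew: Mode < Median < Mean
print(skew > 0)       # True: positive skewness
```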
Kurtosis determines the heaviness of the distribution tails. It refers to the degree of presence of outliers in the distribution. A large kurtosis indicates that there may be extremely large and small values that behave as outliers. On the other hand, low kurtosis indicates fewer outliers.
- If the kurtosis is > 3, the distribution is leptokurtic: it has heavy tails, with more values located far from the mean.
- If the kurtosis is < 3, the distribution is platykurtic: most of the data points lie in proximity to the mean.
- If the kurtosis is = 3, the distribution is mesokurtic, matching a normal distribution, where extreme outliers are relatively rare.
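A small sketch of the kurtosis calculation (the fourth standardized moment, which equals 3 for a normal distribution) on two made-up datasets:

```python
import statistics

def kurtosis(data):
    """Fourth standardized moment; equals 3 for a normal distribution."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    n = len(data)
    return sum((x - mu) ** 4 for x in data) / (n * sigma ** 4)

heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # extreme values in the tails
light_tailed = [-2, -1, -1, 0, 0, 0, 1, 1, 2]     # values close to the mean

print(kurtosis(heavy_tailed))  # 5.0 -> leptokurtic (> 3)
print(kurtosis(light_tailed))  # 2.25 -> platykurtic (< 3)
```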
You can now appreciate the effects of skewness and kurtosis in a dataset, and why various techniques exist to transform data toward a normal distribution. This is a very important step in the data processing pipeline: if you don't clean your data and make it suitable for the model, you will only get garbage results out of it.
Correlation and Covariance
This is probably one of the most important statistical concepts you should know. Both describe the relationship of two random variables to each other.
Covariance is a statistical measure of how two random variables move together. It signifies the direction of the linear relationship between the two variables, i.e., whether they are directly or inversely proportional to each other. It tells you only the direction of the relationship, not its magnitude.
Correlation determines how a change in one variable relates to a change in the other, capturing both direction and strength.
- If one variable increases as the other increases, the variables are positively correlated.
- If one variable decreases as the other increases, the variables are negatively correlated.
- If the correlation is zero, there is no linear relationship between the two variables.
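Both quantities can be computed by hand in a few lines. The sketch below uses made-up hours-studied vs. exam-score data; note how the covariance only gives a sign and an unscaled magnitude, while the correlation is normalized to [-1, 1]:

```python
import statistics

# Hypothetical data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 79]

n = len(hours)
mx, my = statistics.mean(hours), statistics.mean(scores)

# Covariance: direction of the linear relationship (sign is what matters)
cov = sum((x - mx) * (y - my) for x, y in zip(hours, scores)) / n

# Pearson correlation: direction AND strength, always between -1 and 1
corr = cov / (statistics.pstdev(hours) * statistics.pstdev(scores))

print(cov, round(corr, 2))  # 13.2 0.98 -> strong positive relationship
```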
Why is it important for Data Science?
Correlation is essential in the feature selection process. Generally, there should be a correlation between your input variables and the output variable. If an input variable has no correlation with the output, you can either combine it with another feature or simply drop the variable.
Another scenario where correlation is used is multicollinearity, which we discussed above and which should be avoided in Data Science. For example, age and years of experience might be correlated with each other when predicting a person's salary.
Normal Distribution is also known as Gaussian Distribution. It is perfectly symmetrical and has a bell-shaped curve. In a normal distribution, most of the values fall around the mean.
According to the Empirical Rule for Normal Distribution:
- 68.27% of data lies within 1 standard deviation of the mean
- 95.45% of data lies within 2 standard deviations of the mean
- 99.73% of data lies within 3 standard deviations of the mean
Thus, almost all the data lies within 3 standard deviations. This rule enables us to check for Outliers and is very helpful when determining the normality of any distribution.
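The empirical rule translates directly into a simple outlier check: flag any point more than 3 standard deviations from the mean. A sketch on made-up measurements:

```python
import statistics

# Hypothetical measurements with one suspicious value
data = [10, 11, 12, 11, 10, 12, 11, 10, 11, 12, 50]

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

# Empirical rule: ~99.73% of normal data lies within 3 standard deviations,
# so anything beyond that is a candidate outlier
outliers = [x for x in data if abs(x - mu) > 3 * sigma]
print(outliers)  # [50]
```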
Standard Normal Distribution is a special case of Normal Distribution. In this case, the mean is 0 and the standard deviation is 1.
Converting a normal distribution into standard normal distribution is called Standardization. It is important in the scenario when the variables lie in different ranges. Suppose one variable follows a normal distribution with a mean of 60 and a standard deviation of 4. The other variable follows a normal distribution with a mean of 80 and a standard deviation of 10. This variability affects the performance of the model and hence we bring them to a range of 0 to 1.
Central Limit Theorem
In almost every use case where the distribution of the data is unknown, the normal distribution is used.
Consider a scenario where you have to perform an analysis on all 20-year-old males in a country. It is almost impossible to collect this data, so we take samples of 20-year-old males across the country and calculate the average height from each sample. When you repeat this process many times, i.e., take many samples and calculate the mean of each one, the distribution of those sample means approaches a normal distribution. This happens because of the Central Limit Theorem, also known as the CLT.
Why is the CLT needed?
When the distribution of the population is unknown, it is difficult to perform further mathematical or statistical treatment on it, and complete enumeration of the population is usually impossible. With the help of the CLT, we can approximate the distribution of the sample mean as normal and use it for further analysis.
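The theorem is easy to see in simulation. Below is a sketch that draws many samples from a deliberately non-normal (uniform) population and shows that their means cluster tightly around the population mean:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Population with a clearly non-normal (uniform) distribution on [0, 100]
population = [random.uniform(0, 100) for _ in range(10_000)]

# Draw many samples and record each sample's mean
sample_means = [
    statistics.mean(random.sample(population, 50))
    for _ in range(1_000)
]

# By the CLT, the sample means cluster around the population mean (~50)
print(statistics.mean(sample_means))   # close to 50
print(statistics.pstdev(sample_means)) # much smaller than the population spread
```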
Assumptions behind CLT
- The data must be a random and unbiased sample from the population.
- Samples should be independent of each other. One sample should not influence the other samples.
- The sample size should be sufficiently large; the larger the sample size, the lower the standard error.
- Samples should be representative of the entire population.
Oftentimes in practical machine learning problems, there will be significant differences in the number of different classes of data being predicted.
Just imagine you have a credit card default prediction dataset in which 8,000 rows are labeled ‘yes’ and 2,000 are labeled ‘no’.
Don’t you think that the model will dominantly understand the ‘yes’ category?
Yes, it will, because it has been exposed to the ‘yes’ category more. The model will perform poorly in predicting ‘no’ since it has not seen enough training examples. This scenario is called class imbalance, and it often happens simply because not enough data is available.
So what will you do in such a case? You cannot simply satisfy yourself with a model that predicts ‘yes’ almost every time. This situation can be taken care of using resampling techniques.
Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution.
Oversampling: this technique randomly duplicates examples from the minority class.
Undersampling: this technique randomly removes examples from the majority class.
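A minimal sketch of both resampling strategies, using the hypothetical 8,000 ‘yes’ / 2,000 ‘no’ split from the credit card example:

```python
import random

random.seed(0)  # fixed seed for reproducibility

# Hypothetical imbalanced labels: 8000 'yes' vs. 2000 'no'
yes_rows = ["yes"] * 8000
no_rows = ["no"] * 2000

# Random oversampling: duplicate minority-class examples until balanced
oversampled_no = no_rows + random.choices(no_rows, k=len(yes_rows) - len(no_rows))

# Random undersampling: drop majority-class examples until balanced
undersampled_yes = random.sample(yes_rows, len(no_rows))

print(len(yes_rows), len(oversampled_no))   # 8000 8000
print(len(undersampled_yes), len(no_rows))  # 2000 2000
```

In a real pipeline you would resample the feature rows along with the labels; libraries such as imbalanced-learn provide ready-made implementations.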
You can read more about sampling from this research paper.
Do you know what outliers are? Outliers are extreme values that deviate from the other observations in the data. For example, suppose that in a middle-class town of 30,000 people, the income of 95% of the population is below $30,000, but a few people have incomes greater than $115,000. These people can be considered outliers.
Can you detect an issue with this?
Well, if you look closely, the model will misinterpret the information and its accuracy will go down. A few unexpectedly high or low observations pull the mean away from the typical values, which feeds the wrong information to the model. That is why it is very important to detect and treat outliers appropriately.
Statistical techniques can help you detect outliers.
There is a visual diagram called a “boxplot”: a standardized way of displaying the distribution of data based on a five-number summary, namely the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.
Minimum: the smallest value in the dataset
Q1: the middle value between the smallest value and the median of the dataset
Q2: the middle value of the dataset, also called the median
Q3: the middle value between the median and the largest value of the dataset
Maximum: the largest value in the dataset
Any point that lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where the interquartile range IQR = Q3 - Q1, is typically considered an outlier.
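The standard boxplot rule (points beyond 1.5 × IQR from the quartiles) can be sketched with Python's built-in statistics module; the income figures below are made up:

```python
import statistics

data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]  # hypothetical incomes ($1000s)

# Quartiles of the data (statistics.quantiles with n=4 returns Q1, Q2, Q3)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Boxplot rule: points beyond 1.5 * IQR from the quartiles are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)  # [45]
```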
After detecting the outliers, you can treat them appropriately: either drop them if there are only a few (dropping too many values leads to information loss), or replace them with a value such as the mean or median.
I hope you can now appreciate the importance of statistics in real life and in Data Science. These were a few must-know statistical concepts for data science beginners; we will discuss the remaining ones in Part 2.
Having a good understanding of statistics will make you fall in love with data. Understand statistics and let the data confess. A successful data scientist should be able to explain their results statistically.
If you want to understand these topics in more detail, you can refer to books as well.
But there are a lot of books available, and it is sometimes difficult to know which one to choose.
That’s why we have curated the 10 best statistics books for data science enthusiasts.
You can check them out here.
Let us know your feedback as it helps us to improve and provide you with better guidance in your data science journey!