Hey! Did you read the 7 must-know Statistical Concepts for Data Science Beginners (Part 1)? If you haven’t read it yet, visit here.

7 must-know Statistical Concepts for Data Science Beginners (Part 1)

Well, if you have read it, welcome to Part 2. In this second part, we’ll be discussing the remaining must-know statistical concepts for Data Science Beginners. These concepts will boost your confidence in understanding Data Science.

If at first you don’t succeed, try two more times so that your failure is statistically significant

So, without wasting any time, let us hop onto it.

Table of Contents

  1. Inferential Statistics
    1. Hypothesis Testing
      1. Null and Alternate Hypothesis
    2. F-test
    3. T-test
    4. Chi-Square Test
    5. ANOVA
  2. Bayes’ Theorem
  3. Conclusion

Inferential Statistics

Inferential statistics is a way of making inferences about a population based on samples drawn from it. It is often used to compare differences between treatment groups.

Hypothesis Testing

There are two possible outcomes: if the result confirms the hypothesis, then you’ve made a measurement. If the result is contrary to the hypothesis, then you’ve made a discovery

Enrico Fermi

A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.

Null Hypothesis: The null hypothesis states that there is no statistically significant difference between the populations (or between a population parameter and a claimed value).

Alternative Hypothesis: The alternative hypothesis states that there is a significant difference between the population parameters.

Level of significance: Denoted by alpha or α. It is the probability, fixed in advance, of wrongly rejecting a true Null Hypothesis.

Example

Do not feel overwhelmed! Let us break it down piece by piece through a real-life example. Suppose we take a sugar packet with a label of 500 grams. Based on the label, we assume it is true that the sugar packet weighs 500 gm. However, we want to do hypothesis testing to check that the 500 gm label is accurate, because there is a claim that the sugar packets contain only 480 gm.

Thus, the Null Hypothesis is the statement that the weight of the sugar packet is equal to 500 gm. The Alternative Hypothesis is the statement that the weight of the sugar packet is less than 500 gm.

Because the Alternative Hypothesis only looks in one direction (less than 500 gm), we then perform a test such as a one-tailed (left-tailed) Z-test at the chosen level of significance and find the critical value.

Now we need to decide which of the statements the data supports. We use the p-value for that. It is the probability, assuming the Null Hypothesis is true, of obtaining a result at least as extreme as the observed test statistic. It is denoted by the letter p.

If the p-value is greater than the level of significance, we fail to reject the Null Hypothesis.

If the p-value is less than or equal to the level of significance, we reject the Null Hypothesis.
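To make the sugar packet example concrete, here is a minimal Python sketch of a one-sample Z-test. Since the alternative hypothesis is "less than 500 gm", the sketch uses a left-tailed test. The sample size, sample mean, and known standard deviation are hypothetical numbers chosen only for illustration, and the function name is our own.

```python
from math import sqrt
from statistics import NormalDist


def left_tailed_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """One-sample, left-tailed Z-test of H0: mean = mu0 vs H1: mean < mu0.

    Assumes the population standard deviation `sigma` is known.
    """
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Left-tailed p-value: probability of a z at least this small under H0.
    p_value = NormalDist().cdf(z)
    reject_null = p_value <= alpha
    return z, p_value, reject_null


# Hypothetical data: 36 packets weighed, average 495 gm, known sd of 12 gm.
z, p_value, reject_null = left_tailed_z_test(
    sample_mean=495, mu0=500, sigma=12, n=36
)
print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

With these made-up numbers the p-value falls below 0.05, so we would reject the claim that the packets weigh 500 gm on average. With a different sample the decision could easily go the other way.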

[Source: Data Science Central]

You can refer to this research article on Hypothesis Testing.

F-Test

An F-test is used to compare two variances. It is used to determine whether or not two independent estimates of the population variances are homogeneous, that is, whether they could come from populations with equal variance. The F-statistic is a ratio of two variances, so it is never negative.

For example, let’s suppose that there are two sets of watermelons under experimental conditions. The researcher takes a random sample from each set, of sizes 9 and 11. The standard deviations of their weights are 0.6 and 0.8 respectively. After making an assumption that the distribution of their weights is normal, the researcher conducts an F-test to test the hypothesis that the true variances are equal.
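The F-statistic for the watermelon example can be sketched in a few lines of Python. This is a minimal illustration that follows the common convention of putting the larger sample variance in the numerator. The function name is our own, and the sketch stops at the statistic and its degrees of freedom.

```python
def variance_ratio_f(sd_a, n_a, sd_b, n_b):
    """Return (F, numerator df, denominator df) for two sample standard deviations.

    The larger sample variance goes in the numerator, so F >= 1.
    """
    var_a, var_b = sd_a ** 2, sd_b ** 2
    if var_a >= var_b:
        return var_a / var_b, n_a - 1, n_b - 1
    return var_b / var_a, n_b - 1, n_a - 1


# Values from the watermelon example: n = 9 with sd 0.6, n = 11 with sd 0.8.
f_stat, df_num, df_den = variance_ratio_f(0.6, 9, 0.8, 11)
print(f"F = {f_stat:.3f} with ({df_num}, {df_den}) degrees of freedom")

# To get a p-value, compare F against an F distribution with these degrees
# of freedom, for example with scipy.stats.f.sf(f_stat, df_num, df_den)
# if SciPy is installed (it gives the upper-tail probability).
```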

[Source: WallStreetMojo]

T-Test

A t-test is used to determine if there is a significant difference between the means of two groups. It helps us determine if the two sets came from the same population. It can be used when the data are approximately normally distributed and the population variances are unknown.

Let us take an example. Suppose a drug company wants to test a new cancer drug to find out if it improves life expectancy. The control group may show an average life expectancy of +5 years, while the group taking the new drug might have a life expectancy of +6 years. It would seem that the drug might work. But it could be due to a fluke. To test this, researchers would use a Student’s t-test to find out if the results are repeatable for an entire population.

  • A large t-score (in absolute value) suggests that the groups are different.
  • A small t-score (close to zero) suggests that the groups are similar.
[Source: Datatab]
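Here is a minimal Python sketch of how a t-score for two independent groups can be computed. It uses Welch's version of the t-test, which does not assume equal variances; that choice is ours and not the only option. The life-expectancy gains below are hypothetical values picked so that the group averages match the +5 and +6 years in the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical gains in life expectancy (years) for each patient.
control = [4.1, 5.3, 5.0, 4.8, 5.6, 5.2]
treated = [5.9, 6.4, 5.7, 6.2, 6.1, 5.7]

t_score, dof = welch_t(treated, control)
print(f"t = {t_score:.2f}, degrees of freedom = {dof:.1f}")
```

The t-score is then compared against a t distribution with the computed degrees of freedom to obtain a p-value.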

Chi-Square Test

A chi-square statistic measures how the counts expected under a model compare to the actual observed data. It is often used in Hypothesis Testing.

Chi-square can be used to test whether or not two variables are independent of each other. It is also used as a ‘Goodness of fit’ test. ‘Goodness of fit’ is a way to test how well a sample of data matches the characteristics of the larger population that the sample is intended to represent.

It is mostly applied to categorical variables. For example, a scientist wants to know if education level and marital status are related for all people in a country.
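As a rough sketch of the independence test, the Python snippet below computes the chi-square statistic for a table of counts. The counts for education level versus marital status are invented purely for illustration, and the function name is our own.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical survey counts.
# Rows: high school, bachelor's, graduate. Columns: married, single.
table = [
    [40, 60],
    [50, 50],
    [60, 40],
]

chi2, dof = chi_square_independence(table)
print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
```

The larger the statistic relative to a chi-square distribution with those degrees of freedom, the stronger the evidence that the two variables are related.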

[Source: Support | Minitab]

ANOVA

Before its formal definition, let us take an example. A group of psychiatric patients are trying three different therapies: counseling, medication, and biofeedback. If you want to see whether one therapy is better than the others, you can use the ANOVA test. It checks the impact of one or more factors by comparing the means of different samples.

It is a statistical technique that is used to check if the means of two or more groups are significantly different from each other.

A one-way ANOVA has one independent variable (factor), for example, brand of cereal. A significant one-way ANOVA tells you that at least two groups differ from each other, but it won’t tell you which groups are different. If your test returns a significant F-statistic, you may need to run a post hoc test to tell you exactly which groups had a difference in means.
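To show where the F-statistic of a one-way ANOVA comes from, here is a minimal Python sketch that compares the variation between group means with the variation inside the groups. The improvement scores for the three therapies are hypothetical, and the function name is our own.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return (F-statistic, between-group df, within-group df)."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical improvement scores for each therapy.
counseling = [4, 5, 6]
medication = [6, 7, 8]
biofeedback = [8, 9, 10]

f_stat, df_between, df_within = one_way_anova_f(
    counseling, medication, biofeedback
)
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F means the group means are spread out more than the within-group noise would explain. The sketch does not say which therapies differ; that is the job of the follow-up test.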

Two-way ANOVA is appropriate when the experiment has a quantitative outcome and you have two categorical explanatory variables.

[Source: Juran]

Bayes’ Theorem

Bayes’ Theorem is at the heart of probability and one of the most fundamental concepts for Data Science. Can you tell the risk of lending money to any potential borrower? Bayes’ Theorem can help you with that.

According to Bayes’ theorem, the probability of event A given that event B has already occurred can be calculated using the probabilities of event A and event B and the probability of event B given that A has already occurred: P(A|B) = P(B|A) × P(A) / P(B).

The Naive Bayes algorithm is one of the most widely used algorithms in Machine Learning. It relies on Bayes’ Theorem to predict the output.
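Here is a minimal Python sketch of that calculation, applied to the lending question. All the probabilities are hypothetical: A stands for "the borrower defaults" and B for "the borrower has a missed payment on record".

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B).

    P(B) is obtained from the law of total probability:
    P(B) = P(B|A) * P(A) + P(B|not A) * P(not A).
    """
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical inputs:
#   5% of borrowers default overall,
#   60% of defaulters have a missed payment on record,
#   10% of non-defaulters have a missed payment on record.
p_default_given_missed = bayes_posterior(
    prior_a=0.05, p_b_given_a=0.60, p_b_given_not_a=0.10
)
print(f"P(default | missed payment) = {p_default_given_missed:.2f}")
```

Under these made-up numbers, seeing a missed payment raises the estimated risk of default well above the 5% baseline.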

[Source: Towards Data Science]

Conclusion

This blog summarized the remaining must-know statistics topics for data science and why they matter.

If you want to understand these topics in detail, you can refer to some books. But there are a lot of books available. That’s why we have curated the 10 best statistics books for data science enthusiasts. You can check them out here.

10 statistics books for data science beginners