The Chi-Square test provides an important statistical measure for the relationship between categorical variables.

Table of Contents

  1. Important terms to understand the chi-square test
  2. Need for the chi-square test
  3. Assumptions made before applying the chi-square test
  4. Statistical functions used to implement the test in Python
  5. Analysis of the result obtained from the chi-square test using a case study
  6. Advantages and limitations

Let’s start with a basic example:

You might be familiar with Kaggle’s competition “Titanic: Machine Learning from Disaster”, where we have to predict the survival of a passenger using features like “Sex”, “Age”, “Pclass”, etc.

In this tutorial, we will deal with categorical variables only, e.g. ‘Pclass’ and ‘Sex’.

In this case, it is possible that female passengers survived at a higher rate than male passengers, so we need to check the relationship between ‘Sex’ and ‘Survived’. Also, data analysis shows that the dataset contains three passenger classes, and passengers in a higher class may have had a higher survival rate. To check this, we need to find the relationship between the ‘Pclass’ and ‘Survived’ columns.

Important terms we need to understand for the chi-square test:

Contingency table: A table that cross-tabulates two categorical variables. For the chi-square test, the two features whose relationship we want to examine must be arranged in this form.

For the ‘Sex’ and ‘Survived’ columns, the cross-tabulated form looks like this:

Gender    Not Survived    Survived    Total
Female    81              233         314
Male      468             109         577
Total     549             342         891

Observed Frequency: Actual count of values per category for each group.

The table above holds the observed frequencies, counted directly from the given data.
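As a rough sketch of how such a table can be produced, the snippet below uses pandas `crosstab`. The DataFrame here is a hypothetical stand-in rebuilt from the counts shown above (with ‘Survived’ coded as 0 = no, 1 = yes); with the real Kaggle file you would load the training CSV instead.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic training data: one row per passenger,
# rebuilt from the counts in the table above (Survived: 0 = no, 1 = yes).
records = (
    [("female", 0)] * 81
    + [("female", 1)] * 233
    + [("male", 0)] * 468
    + [("male", 1)] * 109
)
df = pd.DataFrame(records, columns=["Sex", "Survived"])

# Cross-tabulate the two categorical columns; margins=True adds the totals.
contingency = pd.crosstab(
    df["Sex"], df["Survived"], margins=True, margins_name="Total"
)
print(contingency)
```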

Expected Frequency: Expected count of values per category for each group, if the null hypothesis holds true.

To calculate the expected frequency for row i and column j:

E_ij = (total of row i * total of column j) / grand total

Here the total count of female passengers is 314, and the total count of ‘not survived’ passengers is 549.

So the expected value for the 1st row and 1st column will be (314*549)/891 = 193.47

So the table will be as follows:

Gender    Not Survived    Survived
Female    193.47          120.53
Male      355.53          221.47
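The same expected counts can be reproduced in a few lines. This is a minimal sketch that assumes the observed counts are typed in by hand as a NumPy array; the variable names are illustrative only.

```python
import numpy as np

# Observed counts: rows = Female, Male; columns = Not Survived, Survived.
observed = np.array([[81, 233],
                     [468, 109]])

row_totals = observed.sum(axis=1)   # totals per gender
col_totals = observed.sum(axis=0)   # totals per survival outcome
total = observed.sum()              # grand total

# E_ij = (row i total * column j total) / grand total, for every cell at once.
expected = np.outer(row_totals, col_totals) / total
print(np.round(expected, 2))
```

Note that the expected counts add up to the same grand total as the observed counts.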

Residuals: They show how much our data deviates from the null hypothesis, i.e. the difference between the observed count and the expected count. As the residuals grow in magnitude, the chi-square statistic increases, so we are more likely to reject the null hypothesis.

The residual for each cell is:

Residual = Observed Frequency - Expected Frequency
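To make the link between residuals and the test statistic concrete, here is a small sketch. It assumes the standard Pearson form of the statistic, the sum of (observed - expected)^2 / expected over all cells, without any continuity correction; that formula is not spelled out in this article, so treat it as an added assumption.

```python
import numpy as np

# Observed counts: rows = Female, Male; columns = Not Survived, Survived.
observed = np.array([[81, 233],
                     [468, 109]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Residual per cell: how far the observed count is from the expected count.
residuals = observed - expected

# Pearson chi-square statistic (no continuity correction): larger residuals
# relative to the expected counts give a larger statistic.
chi_square = ((residuals ** 2) / expected).sum()

print(np.round(residuals, 2))
print(chi_square)
```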

Degrees of freedom: The number of cell values that are free to vary once the row and column totals are fixed. It determines which chi-square distribution the test statistic is compared against. It is calculated as:

Degrees_of_freedom = (Number of rows - 1) * (Number of columns - 1)

In our use case, the degrees of freedom will be: (2-1)*(2-1) = 1

If we have a table of n rows and m columns with known row and column totals, the values in (n-1) rows and (m-1) columns are enough to work out all the remaining cells. That’s why only (n-1) rows and (m-1) columns count towards the degrees of freedom.
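A quick way to see this for our 2x2 table: once the totals are known, a single cell fixes the other three. The sketch below is purely illustrative and uses plain Python with the totals from the contingency table above.

```python
# Row and column totals from the contingency table.
row_totals = [314, 577]   # Female, Male
col_totals = [549, 342]   # Not Survived, Survived

# The only "free" cell in a 2x2 table: Female / Not Survived.
free_cell = 81

# Every other cell follows from the totals.
female_survived = row_totals[0] - free_cell
male_not_survived = col_totals[0] - free_cell
male_survived = row_totals[1] - male_not_survived

table = [[free_cell, female_survived],
         [male_not_survived, male_survived]]

dof = (len(row_totals) - 1) * (len(col_totals) - 1)
print(table, dof)
```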

About Hypothesis

Null hypothesis: The default assumption that there is no relationship between the variables. In layman’s terms, we treat the usual state of affairs as true until statistical evidence gives us reason to reject it in favour of an alternative hypothesis.

The null hypothesis is a general statement that there is nothing significantly different happening.

For the above use case, we have set up the hypotheses as follows:

Null hypothesis (H0): There is no relationship between the variables.

Alternative hypothesis (H1): There is some relationship between the variables.

Significance Level: The threshold for the p-value, i.e. the probability of observing data at least as extreme as ours if the null hypothesis were true. If the p-value is below this threshold, we reject the null hypothesis. It is usually taken as alpha = 0.05, meaning we accept a 5% risk of rejecting a null hypothesis that is actually true.

Critical Value: The cut-off point that the test statistic (the chi-square statistic in our case) is compared with. If the chi-square statistic is greater than the critical value, we can reject the null hypothesis.

Here is the distribution table for the chi-square test:

[Figure: chi-square distribution table. Image Source: ResearchGate]

For the above use case, with degrees of freedom = 1 and significance level = 0.05, the table gives a critical value of 3.84.
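The table lookup can also be done numerically. As a sketch, `scipy.stats.chi2.ppf` (the inverse of the cumulative distribution function) gives the critical value for a chosen significance level, and `chi2.sf` (the survival function) confirms that the area to the right of that value equals alpha.

```python
from scipy.stats import chi2

alpha = 0.05   # significance level
dof = 1        # degrees of freedom for a 2x2 table

# Critical value: the point with 95% of the distribution to its left.
critical_value = chi2.ppf(1 - alpha, dof)

# Area to the right of the critical value, which should equal alpha.
tail_probability = chi2.sf(critical_value, dof)

print(round(critical_value, 2), round(tail_probability, 2))
```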

Why do we need chi square test?

The chi-square test is used to find relationships between categorical variables. It helps determine whether a categorical variable carries any significance for prediction/classification. It can also reveal whether two categorical features that are assumed to be independent are in fact related.

Assumptions made before applying the chi-square test:

  • Data in the cells should be counts or frequencies, not percentages or any other transformation. In our use case, the data consists of counts only.
  • The categories of each variable should be mutually exclusive, meaning each observation falls into one and only one cell. For example, in this case a female passenger either survived or did not survive; she cannot be counted in both categories. If one subject could be counted in more than one cell (for example, a fruit recorded under both ‘large’ and ‘small’ because it was measured twice), the chi-square test may give inaccurate results.
  • The expected count should be at least 5 in at least 80% of the cells, and no cell should have an expected count below 1. Here we have 4 cells, and every expected count is well above five.
  • The groups being tested must be independent.
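The count and expected-frequency conditions above can be checked programmatically before running the test. The following is only a sketch with illustrative variable names; it reports the share of cells whose expected count is at least 5 and the smallest expected count, and leaves the judgement to the reader. Mutual exclusivity and independence of groups depend on how the data was collected and cannot be verified from the table alone.

```python
import numpy as np

# Observed counts: rows = Female, Male; columns = Not Survived, Survived.
observed = np.array([[81, 233],
                     [468, 109]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Cells must hold non-negative integer counts, not percentages.
is_count_data = bool(
    np.issubdtype(observed.dtype, np.integer) and (observed >= 0).all()
)

# Share of cells with an expected count of at least 5, and the smallest one.
share_at_least_5 = (expected >= 5).mean()
min_expected = expected.min()

print(is_count_data, share_at_least_5, round(min_expected, 2))
```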

How to apply the chi-square test in Python

It can be applied using the chi2_contingency function from the scipy.stats module.

Chi-square Test of Independence using scipy.stats.chi2_contingency function

Code:
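A minimal sketch of the call, assuming the observed counts for ‘Sex’ and ‘Survived’ from the contingency table above are entered by hand (the variable names are illustrative). Note that for a 2x2 table `chi2_contingency` applies Yates’ continuity correction by default (`correction=True`), so the statistic is slightly smaller than the uncorrected Pearson value.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = Female, Male; columns = Not Survived, Survived.
observed = np.array([[81, 233],
                     [468, 109]])

# Unpack the statistic, p-value, degrees of freedom and expected counts.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("chi-square statistic:", chi2_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected frequencies:")
print(np.round(expected, 2))
```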

The function returns four values: the chi-square statistic, the p-value, the degrees of freedom, and the array of expected frequencies.

The chi2.ppf function from scipy.stats is used to find the critical value.
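One way to put the pieces together is sketched below, assuming alpha = 0.05 and the same hand-entered observed counts; the message wording in the `else` branch is an illustrative addition.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts: rows = Female, Male; columns = Not Survived, Survived.
observed = np.array([[81, 233],
                     [468, 109]])
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
# Critical value for the chosen significance level and degrees of freedom.
critical_value = chi2.ppf(1 - alpha, dof)
print("critical_value:", critical_value)

# Reject the null hypothesis when the statistic exceeds the critical value.
if chi2_stat > critical_value:
    print("we can reject null hypothesis. Relationship exists between these 2 categories")
else:
    print("we fail to reject null hypothesis. No evidence of a relationship")
```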

critical_value: 3.841458820694124
we can reject null hypothesis. Relationship exists between these 2 categories


Advantages:

  • The chi-square test is non-parametric, so it is robust with respect to the distribution of the data.
  • It is easy to compute, and the test yields detailed information such as the p-value, degrees of freedom and chi-square statistic.

Limitations:

  • It is highly sensitive to sample size: with a very large sample, even a weak relationship can come out as statistically significant.
  • The chi-square test is sensitive to small frequencies/counts in the cells. If the expected frequency in a cell is less than 5, the result may be inaccurate.

Another important limitation of the chi-square test is that it checks the relationship between only two variables at a time. So for a large dataset with many categorical features (say 10 or 12), testing every pair with chi-square statistics is not practical. In that case, it is better to go through a feature engineering process first and narrow the data down to a smaller set of the most useful features.

Here is the GitHub link where I have also explained the relationship between the ‘Pclass’ and ‘Survived’ categories: https://github.com/letthedataconfess/explanation-of-chi-square-test

Let us know if you have any doubts or would like to discuss more.