When developing a machine learning model, you may encounter numerous problems. One common problem in feature selection is determining how relevant each input feature is to the predicted output. You can use statistical tests to understand whether the output variable depends on an input variable. These tests are especially helpful when the input variables are categorical. If the result indicates that the output is independent of an input variable, you can consider removing that variable, as it is likely irrelevant to the problem. Pearson’s chi-squared test identifies whether two categorical variables are independent.
What is a Chi-Square Test?
A Chi-Square test is a statistical technique to determine whether two categorical variables in the same dataset are related. We can understand the concept from the following example:
Let’s assume that a researcher wants to figure out whether the placement of students in a department is related to their CGPA. He extracts random records of the department for the last five years and counts the placed students in each CGPA category, i.e., below 6, 6-7, 7-8, 8-9, and 9-10.
If there is no relationship between the placement of students and their CGPA, the placed students should be spread roughly evenly across the categories. However, if nearly all the placed students have a CGPA of more than 8, the categories below this score will be almost empty, which suggests that placement depends on CGPA.
Assumptions of the Test
As the Chi-Square test is a statistical test, it relies on a few assumptions:
- You will obtain the data using a random selection from the data set.
- Each subject fits in exactly one category. For instance, if you count an employee as absent on Monday only, you cannot also count that employee under Tuesday.
- You need to collect the data as counts or frequencies. Do not use percentages.
- The observations should be independent of each other. Grouped or paired data will affect the observations.
- You cannot use Chi-Square if more than 20% of the expected frequencies are below 5.
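The last assumption is commonly read as "no more than 20% of the expected counts may be below 5". A minimal sketch of that check follows. The function name `expected_counts_ok`, its parameters, and the sample numbers are hypothetical and only illustrate the rule.

```python
def expected_counts_ok(expected, min_count=5, max_share=0.20):
    """Return True when at most `max_share` of the expected counts fall below `min_count`."""
    # Flatten the table of expected counts into a single list of cells.
    cells = [value for row in expected for value in row]
    small = sum(1 for value in cells if value < min_count)
    return small / len(cells) <= max_share


# Hypothetical expected counts, not taken from the article's data.
print(expected_counts_ok([[12.0, 8.5, 4.5], [18.0, 12.5, 6.5]]))  # 1 of 6 cells is below 5
print(expected_counts_ok([[3.0, 4.0], [9.0, 20.0]]))              # 2 of 4 cells are below 5
```

When the check fails, a common response is to merge sparse categories or collect more data before running the test.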
How to Perform the Chi-Square Test?
Follow these steps to perform the test and find out whether the variables are dependent:
- Identifying the hypothesis
- Creating a contingency table
- Determining the expected values
- Computing the Chi-Square statistic
- Accepting and rejecting the Null Hypothesis
Identifying the Hypothesis
The Null Hypothesis, or H0, states that the two variables are independent. You will also state an alternate hypothesis, or H1, which says that the two variables are not independent.
Creating a Contingency Table
In this step, you will create a contingency table showing the joint distribution of both variables. Place the categories of the first variable in the rows and the categories of the other variable in the columns. Each cell holds the number of observations that fall into that combination of categories. This table will help you understand the relationship between both variables.
The shape of the contingency table also determines the degrees of freedom, which you compute as (r-1) x (c-1). In this equation, r is the number of rows and c is the number of columns. Here, with two rows and two columns:
Df = (2-1) x (2-1) = 1
From the table above, we have all the observed values. Next, we will find the expected values, which we need in order to compute the Chi-Square value and identify the relationship.
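As a rough sketch of this step, the snippet below builds a contingency table from raw records and derives the degrees of freedom from its shape. The records, the category labels, and the helper names `contingency_table` and `degrees_of_freedom` are hypothetical and do not reproduce the article's data.

```python
from collections import Counter

# Hypothetical (gender, exited) records, for illustration only.
records = [
    ("Male", "Yes"), ("Male", "No"), ("Male", "No"),
    ("Female", "Yes"), ("Female", "Yes"), ("Female", "No"),
]


def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


def degrees_of_freedom(table):
    """(r - 1) * (c - 1) for a table with r rows and c columns."""
    return (len(table) - 1) * (len(table[0]) - 1)


rows, cols, observed = contingency_table(records)
print(rows, cols)                    # row labels and column labels, sorted
print(observed)                      # observed counts, one inner list per row
print(degrees_of_freedom(observed))  # 1 for a 2 x 2 table
```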
Determining the Expected Values
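According to the null hypothesis, the two variables are independent. Therefore, treating A and B as two independent events, we can use the following equation:
P(A and B) = P(A) x P(B)
Multiplying this probability by the total number of observations gives the expected value of a cell: E = (row total x column total) / grand total.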
Now we can calculate the expected value for the first cell. The first cell contains the males who exited the bank.
Similarly, using the same equation, we can determine the expected values for the other cells as well. Here is the result:
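The calculation for every cell can be sketched in a few lines. Under independence, the expected count of a cell is its row total times its column total, divided by the grand total. The function name `expected_counts` and the observed counts below are hypothetical and are not the article's bank data.

```python
def expected_counts(observed):
    """Expected count per cell under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical 2 x 2 observed counts (rows: gender, columns: exited / stayed).
observed = [[30, 70], [20, 80]]
for row in expected_counts(observed):
    print(row)
```

Note that the expected counts keep the same row and column totals as the observed table; only the distribution inside the table changes.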
Computing the Chi-Square Statistic
We can now determine the Chi-Square value by putting the calculated expected values and observed values in the table below:
In the above table, O denotes the observed values and E denotes the expected values. Applying the Chi-Square statistic formula, which sums (O - E)^2 / E over all cells, we found a Chi-Square value of 2.22.
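A minimal sketch of the statistic, assuming the observed and expected counts are given as equally shaped lists of rows. The function name `chi_square_statistic` and the numbers are hypothetical, so the result does not reproduce the 2.22 obtained from the article's data.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical observed counts and their expected counts under independence.
observed = [[30, 70], [20, 80]]
expected = [[25.0, 75.0], [25.0, 75.0]]
print(round(chi_square_statistic(observed, expected), 2))
```

The statistic is zero only when every observed count equals its expected count, and it grows as the two tables diverge.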
Accepting or Rejecting the Null Hypothesis
Now, we can check whether to accept or reject the null hypothesis with 95% confidence. The significance level, alpha, equals 1 - 0.95 = 0.05. Using the degrees of freedom and alpha, we find the critical Chi-Square value and compare it with the calculated Chi-Square value.
- Degree of freedom = 1 (according to contingency table)
- Alpha = 0.05
- Critical Chi-Square value = 3.84
You can look up the critical Chi-Square value for a given alpha and degrees of freedom in a Chi-Square distribution table.
When there is a large difference between the observed values and the expected values, the Chi-Square value is large and falls toward the right side of the distribution.
From the above figure, we can understand that the Chi-Square value ranges between 0 and infinity, while alpha is a probability between 0 and 1 that corresponds to an area under the right tail of the distribution. If the calculated Chi-Square value falls in the rejection region, that is, it exceeds the critical value that cuts off a right-tail area of alpha = 0.05, you have to reject the Null hypothesis. However, in the above example, the calculated Chi-Square value (2.22) is lower than the critical Chi-Square value (3.84), so you will accept (more precisely, fail to reject) the null hypothesis.
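The decision can also be expressed through a p-value instead of a table lookup. The sketch below only covers the case of 1 degree of freedom, where the right-tail probability of the Chi-Square distribution can be written with the complementary error function from Python's standard library. The function names `p_value_df1` and `reject_null` are hypothetical.

```python
import math


def p_value_df1(chi_square):
    """Right-tail probability of a Chi-Square variable with 1 degree of freedom."""
    # With 1 degree of freedom the statistic is a squared standard normal value,
    # so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2))


def reject_null(chi_square, alpha=0.05):
    """Reject the null hypothesis when the p-value is below alpha (1 degree of freedom only)."""
    return p_value_df1(chi_square) < alpha


print(reject_null(2.22))            # the article's statistic: False, do not reject
print(round(p_value_df1(3.84), 2))  # the critical value corresponds to alpha: 0.05
```

Comparing the p-value with alpha is equivalent to comparing the calculated Chi-Square value with the critical value. For tables with more degrees of freedom, a statistics library would be the usual choice.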
The steps above should give you a clear picture of the Chi-Square test. Keep in mind that the test compares observed and expected values to indicate whether the variables are dependent or independent. However, it cannot tell you why the variables are dependent or what the nature of the relationship between them is.