When developing a machine learning model, you may encounter numerous problems. One common problem, related to feature selection, is determining how relevant the input features are to the predicted output. You can use statistical tests to understand whether the output variable depends on an input variable. These tests are helpful when the input variables are categorical. If the test indicates that the output is independent of an input variable, you should remove that variable, as it is irrelevant to the problem. Pearson's chi-squared test identifies whether two categorical variables are independent.
What Is a Chi-Square Test?
A Chi-Square test is a statistical technique for determining whether two categorical variables in a dataset are related. We can understand the concept from the following example:
Let's assume a researcher wants to find out whether students' placement in a department is related to their CGPA. He extracts random records from the department for the last five years and records the number of placed students in each CGPA category: below 6, 6–7, 7–8, 8–9, and 9–10.
If there were no relationship between placement and CGPA, the placed students would be spread roughly evenly across the categories. However, if, say, all the placed students had a CGPA above 8, students below that score would fall outside every placement category, which would suggest that placement depends on CGPA.
Assumptions of the Test
As the Chi-Square test is a statistical test, it relies on a few assumptions:
- The data must be obtained by random sampling from the population.
- Each subject must fit into only one category. For instance, if you count the employees who were absent on Monday, you cannot also count them under Tuesday.
- The data must be collected as counts or frequencies, not as percentages.
- The groups being compared must be independent; paired or matched observations violate the test's assumptions.
- You cannot use the Chi-Square test if more than 20% of the expected frequencies are below 5.
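The last assumption can be checked programmatically. Here is a minimal sketch using scipy; the contingency-table counts are hypothetical, chosen only for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table of observed counts
observed = np.array([[30, 20],
                     [25, 25]])

# chi2_contingency returns the expected frequencies under independence
_, _, _, expected = chi2_contingency(observed)

# Rule of thumb: no more than 20% of expected frequencies may be below 5
fraction_below_5 = np.mean(expected < 5)
print("Chi-Square test is appropriate:", fraction_below_5 <= 0.20)
```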
How to Perform the Chi-Square Test?
Follow these steps to perform the test and determine whether the variables are dependent:
1. Identifying the hypothesis
2. Creating a contingency table
3. Determining the expected values
4. Computing the Chi-Square statistic
5. Accepting or rejecting the null hypothesis
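The whole procedure can be sketched in a few lines with scipy's `chi2_contingency`, which carries out steps 2 through 5 for you. The observed counts below are hypothetical:

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = gender, columns = exited / stayed
observed = [[70, 30],
            [30, 70]]

# Returns the statistic, the p-value, the degrees of freedom,
# and the expected frequencies under independence
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the variables appear dependent")
else:
    print("Fail to reject the null hypothesis: no evidence of dependence")
```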
1. Identifying the Hypothesis
The null hypothesis, or H0, states that the two variables are independent. You will also define an alternative hypothesis, or H1, which states that the two variables are not independent.

2. Creating a Contingency Table
In this step, you will create a contingency table showing the joint distribution of the two variables: place one variable's categories in the rows and the other's in the columns. This table helps you see how the observations are distributed across the combinations of categories.
The contingency table also determines the degrees of freedom, calculated as (r − 1) × (c − 1), where r is the number of rows and c is the number of columns. For a 2 × 2 table:
df = (2 − 1) × (2 − 1) = 1
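The degrees-of-freedom calculation is straightforward to express in code:

```python
# Degrees of freedom for an r x c contingency table: (r - 1) * (c - 1)
rows, cols = 2, 2  # the 2 x 2 table from the example
dof = (rows - 1) * (cols - 1)
print(dof)  # 1
```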
The table above gives all the observed values. Next, we will find the expected values so that we can compute the Chi-Square statistic and assess the relationship.

3. Determining the Expected Values
According to the null hypothesis, the two variables are independent. For two independent events A and B, the probability of both occurring is P(A ∩ B) = P(A) × P(B). Applying this to the contingency table, the expected count for each cell is (row total × column total) / grand total.
Now we can calculate the expected value for the first cell, which contains the males who exited the bank.
Similarly, using the same equation, we can determine the expected values for the other cells. Here is the result:
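The expected-value calculation can be sketched with numpy; the observed counts here are hypothetical placeholders, not the article's actual table:

```python
import numpy as np

# Hypothetical observed counts: rows = gender, columns = exited / stayed
observed = np.array([[40, 60],
                     [50, 50]])

row_totals = observed.sum(axis=1, keepdims=True)   # per-row sums
col_totals = observed.sum(axis=0, keepdims=True)   # per-column sums
grand_total = observed.sum()

# Under independence: E = (row total * column total) / grand total
expected = row_totals * col_totals / grand_total
print(expected)
```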

4. Computing the Chi-Square Statistic
We can now determine the Chi-Square value by putting the calculated expected values and the observed values into the table below:
In the table, O denotes the observed values and E the expected values. Applying the Chi-Square statistic formula, χ² = Σ (O − E)² / E, to these values gives a Chi-Square statistic of 2.22.
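The statistic itself is a one-line computation once the observed and expected counts are known. The counts below are hypothetical placeholders, so the resulting value will differ from the article's 2.22:

```python
import numpy as np

# Hypothetical observed and expected counts for a 2x2 table
observed = np.array([[40, 60], [50, 50]])
expected = np.array([[45, 55], [45, 55]])

# Chi-Square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(round(chi2_stat, 2))
```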

5. Accepting and Rejecting the Null Hypothesis
Now we can check whether to accept or reject the null hypothesis at a 95% confidence level, i.e., with a significance level alpha = 0.05. Using the values computed above:
- Degrees of freedom = 1 (from the contingency table)
- Alpha = 0.05
- Critical Chi-Square value = 3.84
You can find the critical Chi-Square value using a Chi-Square distribution table.
The larger the difference between the observed and expected values, the farther the statistic falls in the right tail of the distribution.
From the above figure, we can see that the Chi-Square statistic ranges from 0 to infinity, while alpha is a probability between 0 and 1. The rejection (error) region is the right tail of the distribution with area alpha, here 0.05. If the calculated Chi-Square value falls into this region, i.e., exceeds the critical value, you reject the null hypothesis. In our example, the calculated Chi-Square value (2.22) is lower than the critical value (3.84), so we accept the null hypothesis.
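You can verify both the critical value and the decision with scipy's chi2 distribution:

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1

# Critical value: the point with probability mass alpha to its right
critical_value = chi2.ppf(1 - alpha, dof)
print(round(critical_value, 2))  # 3.84

# Decision for the statistic calculated in the example
chi2_stat = 2.22
print("Reject H0" if chi2_stat > critical_value else "Fail to reject H0")
```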
Conclusion
The discussion above should give you a clear picture of the Chi-Square test. Keep in mind that the test compares observed and expected frequencies to indicate whether two categorical variables are dependent or independent. However, it cannot tell you why the variables are dependent, nor does it measure the strength of the relationship between them.