With increasingly advanced machine learning and deep learning algorithms, you can solve almost any problem given a proper dataset. However, as models grow more complex, they become harder to interpret. When you talk about the interpretability of machine learning models, the first thing that comes to mind is Linear Regression. Linear Regression is quite simple and easy to interpret. Yet even this straightforward model can run into interpretability problems, especially when its assumptions are violated. One of the most common of these violations is multicollinearity.
What is Multicollinearity?
A regression model suffers from multicollinearity when two or more of its independent variables correlate with each other. In such a model, one independent variable can be predicted from another independent variable. Examples of such variable pairs include weight and height, water consumption and household income, car price and mileage, and leisure time and study time.
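As a quick illustration, a pairwise correlation matrix is often the first check for such relationships. The sketch below uses made-up height, weight, and shoe-size values purely for illustration:

```python
import pandas as pd

# Hypothetical data: height (cm), weight (kg), and shoe size tend to move together
df = pd.DataFrame({
    "height": [150, 160, 165, 172, 180, 188],
    "weight": [52, 58, 63, 70, 78, 85],
    "shoe_size": [36, 38, 39, 41, 43, 45],
})

# Pairwise correlations close to +1 or -1 between independent variables
# are a first hint of multicollinearity
print(df.corr().round(2))
```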
You can draw an example from everyday life. Say you love munching on chips while watching television, and your satisfaction rises as you watch more television and eat more chips. Now, if you think about which activities keep you busy and make you happy, which one has the greater impact on your happiness level? Do you feel happier when you eat chips or when you watch television?
This is hard to measure, because when you watch more television you eat more chips, and when you eat more chips you tend to watch more television. The two activities correlate with each other, so it is challenging to separate the impact of each one on your happiness. This is exactly the multicollinearity problem. Now let's look at how to measure multicollinearity in a machine learning setting.
Types of Multicollinearity
Multicollinearity comes in two different types. The first type is structural multicollinearity. It is a byproduct of feature engineering: because you create the new feature yourself from an existing independent variable, it is easy to track. For example, squaring an independent variable x, or taking its log to normalize or scale a feature, produces a new column that is strongly correlated with the original. The second type is data multicollinearity, which is more dangerous than the structural kind. It is harder to identify and interpret because it is already embedded in the data you collect; it is present in your pandas DataFrame before you engineer any features.
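To make the structural case concrete, here is a minimal sketch with synthetic data showing how features engineered from an existing variable x, such as its square or its log, end up strongly correlated with x itself:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)

df = pd.DataFrame({
    "x": x,
    "x_squared": x ** 2,   # engineered feature: structural multicollinearity
    "log_x": np.log(x),    # another engineered, highly correlated feature
})

# Both engineered columns correlate strongly with the original x
print(df.corr().round(2))
```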
Detecting and Removing Multicollinearity
A standard way to detect multicollinearity is to calculate the Variance Inflation Factor (VIF) for each individual variable in the data. The VIF tells you how well that variable can be predicted using the other independent variables. The following example illustrates the idea:
Suppose the dataset contains nine independent variables, V1 through V9. To calculate the VIF for the first variable, V1, treat it as the target variable and isolate it from the rest; all of the other variables act as predictor variables.
Train a regression model that predicts V1 from these predictor variables and record its R² value. The VIF is then computed from that R² value as:

VIF = 1 / (1 − R²)
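Here is a minimal sketch of that calculation using scikit-learn. The data is synthetic and the variable names (V1, V2, V3) are just stand-ins for the nine-variable example above; V2 is deliberately built from V1 so that both end up with a high VIF:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_for_column(df, target_col):
    """Compute the VIF for one column by regressing it on all other columns."""
    y = df[target_col]
    X = df.drop(columns=[target_col])
    r2 = LinearRegression().fit(X, y).score(X, y)  # R² of the auxiliary regression
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(42)
v1 = rng.normal(size=300)
data = pd.DataFrame({
    "V1": v1,
    "V2": 2 * v1 + rng.normal(scale=0.1, size=300),  # nearly a copy of V1
    "V3": rng.normal(size=300),                      # independent noise
})

for col in data.columns:
    print(col, round(vif_for_column(data, col), 2))
```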
From this formula, it is clear that the VIF rises together with R². A high R² means the other independent variables explain the target variable well, so that variable is largely redundant. For example, an R² of 0.8 gives a VIF of 5, while an R² of 0.9 gives a VIF of 10. To decide whether to keep or remove a variable, compare its VIF against a threshold value.
Ideally, every VIF should be small. However, if the threshold is set too low, too many independent variables may be removed from the dataset. In practice, a common choice is a VIF threshold of five: any independent variable with a VIF above five is removed. The ideal threshold still depends on the problem at hand.
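One way to apply that rule of thumb in code is an iterative drop loop. The sketch below relies on statsmodels' variance_inflation_factor and assumes a DataFrame of numeric predictor columns; the threshold of five mirrors the convention above:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

def drop_high_vif(df, threshold=5.0):
    """Iteratively drop the predictor with the highest VIF until all VIFs <= threshold."""
    X = df.copy()
    while True:
        # Add an intercept column so each auxiliary regression is well specified
        exog = add_constant(X)
        vifs = pd.Series(
            [variance_inflation_factor(exog.values, i + 1) for i in range(X.shape[1])],
            index=X.columns,
        )
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X, vifs
        # Drop one variable at a time and recompute, because removing a column
        # changes the VIF of every remaining column
        X = X.drop(columns=[worst])

# Usage, reusing the synthetic `data` frame from the previous snippet:
# reduced, final_vifs = drop_high_vif(data, threshold=5.0)
# print(final_vifs)
```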
Conclusion
Much of Linear Regression's value comes from how simple the model is to interpret, and that value is lost if you overlook multicollinearity. From the discussion above, you now know what multicollinearity means, how it affects Linear Regression, and how to detect and remove it in any problem you encounter in the future.