The goal of this post is to help you use ridge regression with more understanding, rather than simply applying what libraries provide. So, what is ridge regression? The short answer is "a variation of linear regression". The worst way to answer it is to start with mathematical equations like the following, which few people can understand at first glance.

https://miro.medium.com/max/360/1*gd9Tzg8lmKLY0ZXWaerU8w.png

The bad news is that we still need to deal with it; the good news is that we won't start with equations like that, just not yet. What I would like to begin with is 'Ordinary Least Squares (OLS)'. If you happen to have little background in linear regression, this video will help you get a sense of how it works using the least squares method. Now you know that OLS is just what we generally call 'linear regression', and I will use the terms interchangeably.

Before Moving On

In the next sections, I will take different approaches with various terms and figures. There are two things you should keep in mind. One is that we don't like overfitting; in other words, we always prefer a model that captures general patterns. The other is that our goal is prediction on new data, not on the specific data we already have. Therefore, model evaluation should be based on new data (the testing set), not the given data (the training set). Also, I will use the following terms interchangeably.

Independent Variable = Feature = Attribute = Predictor = X

Coefficient = Beta = β

Residual Sum of Squares = RSS

https://miro.medium.com/max/688/1*3cEysrHZokqla0tXnZ-5GQ.png

The Least Squares Method Finds the Best and Unbiased Coefficients

You may know that the least squares method finds the coefficients that best fit the data. One more condition to add is that it also finds the unbiased coefficients. Here, unbiased means that OLS doesn't consider which independent variable is more important than the others; it simply finds the coefficients for a given data set. In short, there is only one set of betas to be found, resulting in the lowest 'Residual Sum of Squares (RSS)'. The question then becomes, "Is the model with the lowest RSS truly the best model?"
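To see this concretely, here is a minimal sketch (using NumPy and scikit-learn with made-up data, since the post doesn't prescribe a particular library) of OLS producing its single set of betas and the RSS it minimizes:

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: 100 samples, 2 predictors (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

# OLS finds exactly one set of coefficients for this data set
ols = LinearRegression().fit(X, y)
print("OLS betas:", ols.coef_, "intercept:", ols.intercept_)

# Residual Sum of Squares (RSS) -- the quantity OLS minimizes
rss = np.sum((y - ols.predict(X)) ** 2)
print("RSS:", rss)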

Bias vs. Variance

The answer to the question above is "Not really". As hinted at by the word 'unbiased', we also need to think about 'bias'. Bias here means how equally a model cares about its predictors. Let's say there are two models that predict an apple's price from two predictors, 'sweetness' and 'shine'; one model is unbiased and the other is biased.

https://miro.medium.com/max/593/1*OkRTcykIzOlmfe4OCJN1hA.png

First, the unbiased model tries to find the relationship between the two features and the prices, just as the OLS method does. This model fits the observations as closely as possible to minimize the RSS. However, this can easily lead to overfitting. In other words, the model won't perform as well on new data, because it is built so specifically for the given data that it may not fit new data.

https://miro.medium.com/max/443/1*wqDhhG2BjkBCl5WuHojddw.png

The biased model, on the other hand, weights its variables unequally, treating each predictor differently. Going back to the example, we might want to care only about 'sweetness' when building a model, and this could perform better on new data. The reason will be explained once we understand bias vs. variance. If you're not familiar with the bias vs. variance topic, I strongly recommend watching this video, which will give you the insight. It can be said that bias relates to a model failing to fit the training set, and variance relates to a model failing to fit the testing set. Bias and variance are in a trade-off relationship over model complexity, which means that a simple model has high bias and low variance, and vice versa. In our apple example, a model considering only 'sweetness' won't fit the training data as well as the model considering both 'sweetness' and 'shine', but the simpler model will be better at predicting new data.

This is because 'sweetness' is a real determinant of the price while 'shine', by common sense, shouldn't be. We know this as humans, but mathematical models don't think like us; they just calculate on whatever they are given until they find some relationship between all the predictors and the dependent variable that fits the training data. A short simulation sketch below illustrates this point.

*Note: We assume that ‘sweetness’ and ‘shine’ aren’t correlated
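To make the apple example concrete, here is a small simulation sketch (synthetic data; the variable names and numbers are invented for illustration) in which the price truly depends only on sweetness. The model that also uses shine always fits the training set at least as well, but it typically predicts the held-out apples slightly worse:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
n = 60
sweetness = rng.uniform(0, 10, n)
shine = rng.uniform(0, 10, n)  # irrelevant and uncorrelated with sweetness
price = 2.0 * sweetness + rng.normal(scale=3.0, size=n)  # price depends on sweetness only

X_sweet = sweetness.reshape(-1, 1)
X_both = np.column_stack([sweetness, shine])

# Train on the first 30 apples, evaluate on the other 30 (new data)
train, test = slice(0, 30), slice(30, None)

for name, X in [("sweetness only", X_sweet), ("sweetness + shine", X_both)]:
    model = LinearRegression().fit(X[train], price[train])
    train_mse = mean_squared_error(price[train], model.predict(X[train]))
    test_mse = mean_squared_error(price[test], model.predict(X[test]))
    print(f"{name}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")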

Where Ridge Regression Comes Into Play

https://miro.medium.com/max/433/1*cB0ESE9z3rB3-rpXPhwgWw.png

Looking at the bias vs. variance figure, the Y-axis is 'Error', which is the sum of bias and variance. Since both of them are essentially about failing to fit, we want to minimize them. Now, looking at the figure more closely, you'll find that the spot where the total error is lowest is somewhere in the middle. This is often called the 'sweet spot'.

Let's recall that OLS treats all the variables equally (unbiased). Therefore, an OLS model becomes more complex as new variables are added. It can be said that an OLS model always sits at the far right of the figure, having the lowest bias and the highest variance. It is fixed there and never moves, but we want to move it to the sweet spot. This is where ridge regression, also referred to as regularization, shines. In ridge regression, you can tune the lambda parameter so that the model coefficients change. This is best understood with a programming demo, which will be introduced at the end.
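As a small preview of that demo, here is a sketch (scikit-learn with synthetic data; note that scikit-learn names the lambda parameter alpha) showing how increasing the penalty shrinks the coefficients:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=100)

# Larger lambda (alpha) pulls the betas closer to zero, trading variance for bias
for alpha in [0, 1, 10, 100, 1000]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"lambda={alpha:>5}: betas={np.round(ridge.coef_, 3)}")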

Geometric Understanding of Ridge Regression

Many times, a picture helps to build a feel for how a model works, and ridge regression is no exception. The following figure is the geometric interpretation comparing OLS and ridge regression.

https://miro.medium.com/max/655/1*1pHwPfuhgTDFH8elIh_B2g.png

Contours and OLS Estimate

Each contour connects the points where the RSS is the same, and the contours are centered on the OLS estimate, where the RSS is lowest. The OLS estimate is also the point that fits the training set most closely (low bias).

Circle and Ridge Estimate

https://miro.medium.com/max/695/1*YGn5C4Qe2OIKkODiE6Cprw.png

Unlike the OLS estimate, the ridge estimate changes as the size of the blue circle changes. It is simply the point where the circle meets the outermost contour. How ridge regression works comes down to how we tune the size of the circle. The key point is that the β's change at different rates.

Let's say β1 is 'shine' and β2 is 'sweetness'. As you can see, the ridge β1 drops toward zero relatively more quickly than the ridge β2 does as the circle size changes (compare the two figures). The reason this happens is that the β's affect the RSS differently; more intuitively, the contours are not circles but tilted ellipses.

Ridge β's can never be exactly zero, only converge toward it, and this will be explained in the next section with the mathematical formula. Although a geometric picture like this explains the main idea quite well, it has a limitation: we can't draw it beyond three dimensions. So, it all comes down to mathematical expressions.

Mathematical Formula

https://miro.medium.com/max/666/1*pMssBrKdIDKGdZBOvNJRvQ.png

We've seen the equation of multiple linear regression both in general terms and in its matrix version. It can be written in yet another version as follows.

Here, argmin means 'argument of the minimum', the value that makes the function attain its minimum. In this context, it finds the β's that minimize the RSS. And we already know how to get those β's from the matrix formula. Now the question becomes, "What does this have to do with ridge regression?"
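In standard notation (a LaTeX reconstruction of the usual formulas, not a transcription of the image above), that reads:

\hat{\beta}^{\mathrm{OLS}}
  = \underset{\beta}{\arg\min}\, \mathrm{RSS}(\beta)
  = \underset{\beta}{\arg\min}\, \lVert y - X\beta \rVert_2^2,
\qquad
\hat{\beta}^{\mathrm{OLS}} = (X^{\top}X)^{-1} X^{\top} y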

https://miro.medium.com/max/247/1*8R8-IckBY6Rw239ruufShg.png

Again, ridge regression is a variant of linear regression. The term above is the ridge constraint added to the OLS equation. We are still looking for the β's, but they now have to satisfy the above constraint too. Going back to the geometric figure, C acts like the radius of the circle; thus, the β's must fall within the circle's area, probably somewhere on its edge.
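Written out (again in standard notation, reconstructed rather than copied from the image), the constrained problem is:

\hat{\beta}^{\mathrm{ridge}}
  = \underset{\beta}{\arg\min}\, \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le C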

Vector Norm

https://miro.medium.com/max/300/1*FSvb8xU_eqvjXyXiXg7jrA.png

We still want to understand the very first equation. To do so, we need to brush up on the vector norm, which is nothing but the following definition.

The subscript 2 stands for the 'L2 norm', and you can learn more about vector norms here. We only care about the L2 norm at this moment, so we can construct the equation we've already seen. The following is the simplest form, but it still says the same thing we've been discussing. Notice that the first term in the following equation is basically OLS, and the second term with lambda is what makes it ridge regression.

https://miro.medium.com/max/360/1*LsI3XqHSjNCiteUoFo2zKA.png
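In LaTeX form (standard notation, reconstructed here), the L2 norm and the resulting penalized objective are:

\lVert \beta \rVert_2 = \sqrt{\textstyle\sum_{j=1}^{p} \beta_j^2},
\qquad
\hat{\beta}^{\mathrm{ridge}}
  = \underset{\beta}{\arg\min}\, \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2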

What We Actually Want to Find

The term with lambda is usually called the 'penalty', since it increases the RSS. We iterate over candidate values of lambda and evaluate the model with a measurement such as 'Mean Squared Error (MSE)'. The lambda value that minimizes MSE is then selected for the final model. This ridge regression model is usually better than the OLS model at prediction. As seen in the formula below, the ridge β's change with lambda and become the same as the OLS β's when lambda is equal to zero (no penalty).

https://miro.medium.com/max/286/1*Rnl4jgKCG8oKuH7MgQ_Vxw.png
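A minimal sketch of that selection loop (scikit-learn with synthetic data; RidgeCV would do the same thing in a single call) could look like this:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=100)

# Try a range of lambda (alpha) values and keep the one with the lowest CV MSE
best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1, 10, 100]:
    mse = -cross_val_score(Ridge(alpha=alpha), X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
print(f"selected lambda: {best_alpha}, CV MSE: {best_mse:.3f}")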

Why It Converges to Zero But Never Becomes Zero

Expanding the matrix formula we saw previously, lambda ends up in the denominator. That means if we increase the lambda value, the ridge β's decrease. But the ridge β's can never be exactly zero, no matter how big the lambda value is set. That is, ridge regression gives different importance weights to the features, but it doesn't drop unimportant features.

https://miro.medium.com/max/207/1*524ctaHK1BIN9tqhHIOY8Q.png
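Spelled out (a standard-notation reconstruction of the formula shown in the image, plus its single-predictor special case for intuition), the closed form makes the point: lambda sits in the denominator, so increasing it shrinks the betas toward zero without ever making them exactly zero.

\hat{\beta}^{\mathrm{ridge}} = (X^{\top}X + \lambda I)^{-1} X^{\top} y,
\qquad
\hat{\beta}^{\mathrm{ridge}} = \frac{x^{\top}y}{x^{\top}x + \lambda}
\quad \text{(single predictor, no intercept)}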