

In this post I'll explain what the maximum likelihood method for parameter estimation is and go through a simple example to demonstrate the method. Some of the content requires knowledge of fundamental probability concepts such as the definition of probability and independence of events. I've written a blog post with these prerequisites, so feel free to read that if you think you need a refresher.

What are parameters?

Often in machine learning we use a model to describe the process that results in the data that are observed. For example, we may use a random forest model to classify whether customers may cancel a subscription from a service (known as churn modeling), or we may use a linear model to predict the revenue that will be generated for a company depending on how much it spends on advertising (this would be an example of linear regression). Each model contains its own set of parameters that ultimately defines what the model looks like.

For a linear model we can write this as y = mx + c. In this example x could represent the advertising spend and y could be the revenue generated. m and c are parameters for this model. Different values for these parameters will give different lines (see figure below).

So parameters define a blueprint for the model. It is only when specific values are chosen for the parameters that we get an instantiation of the model that describes a given phenomenon.

[Figure: lines given by different values of the parameters m and c]
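To make the "blueprint versus instantiation" idea concrete, here is a minimal sketch. The function name, the parameter values and the spend figures are all made up for illustration; they are not taken from any real data set.

```python
def linear_model(x, m, c):
    """Blueprint: y = m*x + c. Nothing is predicted until m and c are chosen."""
    return m * x + c

# Hypothetical parameter settings (slope m, intercept c); each pair is a different line.
parameter_settings = [(3.0, 1.0), (0.5, 10.0), (-1.0, 20.0)]

# Hypothetical advertising spend values (x).
advertising_spend = [0.0, 5.0, 10.0]

predictions = {}
for m, c in parameter_settings:
    # Choosing m and c turns the blueprint into one specific model.
    predictions[(m, c)] = [linear_model(x, m, c) for x in advertising_spend]
    print(f"m={m}, c={c}: {predictions[(m, c)]}")
```

The same three inputs give three different sets of outputs, one per choice of m and c.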

Intuitive explanation of maximum likelihood estimation

Maximum likelihood estimation is a method that determines values for the parameters of a model. The parameter values are found such that they maximize the likelihood that the process described by the model produced the data that were actually observed.

The above definition may still sound a little cryptic, so let's go through an example to help understand it.

Let's suppose we have observed 10 data points from some process. For example, each data point could represent the length of time in seconds that it takes a student to answer a specific exam question. These 10 data points are shown in the figure below.

[Figure: the 10 observed data points]

We first have to decide which model we think best describes the process of generating the data. This part is very important. At the very least, we should have a good idea about which model to use. This usually comes from having some domain expertise, but we won't discuss this here.

For these data we'll assume that the data generation process can be adequately described by a Gaussian (normal) distribution. Visual inspection of the figure above suggests that a normal distribution is plausible because most of the 10 points are clustered in the middle with a few points scattered to the left and the right. (Making this sort of decision on the fly with only 10 data points is ill-advised, but given that I generated these data points we'll go with it.)

Recall that the normal distribution has 2 parameters: the mean, μ, and the standard deviation, σ. Different values of these parameters result in different curves (just like with the straight lines above). We want to know which curve was most likely responsible for creating the data points that we observed (see figure below). Maximum likelihood estimation is a method that will find the values of μ and σ that result in the curve that best fits the data.

[Figure: normal curves with different values of μ and σ]

The true distribution from which the data were generated was f1 ~ N(10, 2.25), which is the blue curve in the figure above.
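The idea of asking "which curve most plausibly produced these points?" can be sketched in a few lines. Everything here is an assumption made for illustration: the 10 data values are invented (the original points are not listed in this post), N(10, 2.25) is read as mean 10 and variance 2.25 (so σ = 1.5), and the three competing curves are hypothetical.

```python
import math

def log_density(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma at x."""
    return -math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_likelihood(data, mu, sigma):
    """Sum of log densities, i.e. the log of the product of the individual densities."""
    return sum(log_density(x, mu, sigma) for x in data)

# Invented stand-in for the 10 observed points.
data = [7.8, 8.9, 9.3, 9.6, 9.9, 10.1, 10.4, 10.8, 11.3, 12.1]

# Candidate curves as (mu, sigma). Only f1 comes from the text; the rest are hypothetical.
candidates = {
    "f1": (10.0, 1.5),
    "wider": (10.0, 3.0),
    "shifted_left": (8.0, 1.0),
    "shifted_right": (12.0, 2.0),
}

scores = {name: log_likelihood(data, mu, sigma) for name, (mu, sigma) in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: log-likelihood = {score:.2f}")
print("most plausible candidate:", best)
```

Comparing a handful of hand-picked curves like this is only a warm-up. Maximum likelihood estimation searches over all values of μ and σ rather than a short list.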

Calculating the Maximum Likelihood Estimates

Now that we have an intuitive understanding of what maximum likelihood estimation is, we can move on to learning how to calculate the parameter values. The values that we find are called the maximum likelihood estimates (MLE).

Again we'll demonstrate this with an example. Suppose we have three data points this time and we assume that they have been generated from a process that is adequately described by a normal distribution. These points are 9, 9.5 and 11. How do we calculate the maximum likelihood estimates of the parameter values of the normal distribution, μ and σ?

What we want to calculate is the total probability of observing all of the data, i.e. the joint probability distribution of all observed data points. To do this we would need to calculate some conditional probabilities, which can get very difficult. So it is here that we'll make our first assumption. The assumption is that each data point is generated independently of the others. This assumption makes the maths much easier. If the events (i.e. the process that generates the data) are independent, then the total probability of observing all of the data is the product of observing each data point individually (i.e. the product of the marginal probabilities).

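The probability density of observing a single data point x that is generated from a normal distribution is given by:

$$P(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$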

The semicolon used in the notation P(x; μ, σ) is there to emphasize that the symbols that appear after it are parameters of the probability distribution. So it shouldn't be confused with a conditional probability (which is usually represented with a vertical line, e.g. P(A | B)).
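In code, the split between "data before the semicolon, parameters after it" maps naturally onto function arguments. This is a small sketch of the normal density; the function name and the example values are my own choices for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """P(x; mu, sigma): x is the data point, mu and sigma are the parameters."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Same (hypothetical) parameters, two different data points.
density_at_mean = normal_pdf(10.0, mu=10.0, sigma=1.5)
density_off_mean = normal_pdf(9.0, mu=10.0, sigma=1.5)

print(density_at_mean, density_off_mean)
```

The density is highest at the mean and falls away on either side, which is what the bell-shaped curve shows.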

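In our example the total (joint) probability density of observing the three data points is given by:

$$P(9, 9.5, 11; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9-\mu)^2}{2\sigma^2}\right) \times \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(9.5-\mu)^2}{2\sigma^2}\right) \times \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(11-\mu)^2}{2\sigma^2}\right)$$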

We just have to figure out the values of μ and σ that result in the maximum value of the above expression.
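Before doing any calculus, it can help to see this maximization done by brute force. The sketch below multiplies the three individual densities (the independence assumption) and tries every (μ, σ) pair on a grid. The grid ranges and step size are arbitrary choices of mine, so the answer is only accurate to the grid spacing.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def joint_density(data, mu, sigma):
    """Independence assumption: the joint density is the product of the individual densities."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

data = [9.0, 9.5, 11.0]

# Arbitrary search grids with a step of 0.01.
mu_grid = [8.0 + 0.01 * i for i in range(401)]      # 8.00 to 12.00
sigma_grid = [0.1 + 0.01 * i for i in range(291)]   # 0.10 to 3.00

# Pick the parameter pair that gives the largest joint density.
best_mu, best_sigma = max(
    ((mu, sigma) for mu in mu_grid for sigma in sigma_grid),
    key=lambda params: joint_density(data, *params),
)

print(f"grid maximum near mu = {best_mu:.2f}, sigma = {best_sigma:.2f}")
```

A grid search like this gets expensive quickly as the number of parameters grows, which is one reason to look for the maximum analytically.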

If you've covered calculus in your maths classes then you'll probably be aware that there is a technique that can help us find maxima (and minima) of functions. It's called differentiation. All we have to do is find the derivative of the function, set the derivative to zero and then rearrange the equation to make the parameter of interest the subject of the equation. And voilà, we'll have our MLE values for our parameters. I'll go through these steps now, but I'll assume that the reader knows how to perform differentiation on common functions. If you'd like a more detailed explanation then just let me know in the comments.

The log likelihood

The above expression for the total probability is actually quite a pain to differentiate, so it is almost always simplified by taking the natural logarithm of the expression. This is absolutely fine because the natural logarithm is a monotonically increasing function. This means that if the value on the x-axis increases, the value on the y-axis also increases (see figure below). This is important because it ensures that the maximum value of the log of the probability occurs at the same point as the maximum of the original probability function. Therefore we can work with the simpler log-likelihood instead of the original likelihood.

[Figure: the natural logarithm, a monotonically increasing function]

Taking logs of the original expression gives us:

$$\ln P(9, 9.5, 11; \mu, \sigma) = \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9-\mu)^2}{2\sigma^2} + \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(9.5-\mu)^2}{2\sigma^2} + \ln\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{(11-\mu)^2}{2\sigma^2}$$

This expression can be simplified again using the laws of logarithms to obtain:

$$\ln P(9, 9.5, 11; \mu, \sigma) = -3\ln\sigma - \frac{3}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\left[(9-\mu)^2 + (9.5-\mu)^2 + (11-\mu)^2\right]$$
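Two claims from this section are easy to check numerically: that the simplified log-likelihood, -3 ln σ - (3/2) ln(2π) - (1/(2σ²)) Σ(x - μ)², really is the log of the product of the three densities, and that taking logs does not move the maximum. The sketch below holds σ fixed at an arbitrary value of 1 and scans a coarse grid of μ values; both choices are mine, for illustration only.

```python
import math

data = [9.0, 9.5, 11.0]

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(mu, sigma):
    """Product of the three individual densities."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

def simplified_log_likelihood(mu, sigma):
    """Log-likelihood after applying the laws of logarithms."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi) - squared_error / (2.0 * sigma ** 2)

# Check 1: the simplified form equals the log of the product (up to rounding).
difference = abs(math.log(likelihood(10.0, 1.5)) - simplified_log_likelihood(10.0, 1.5))

# Check 2: both functions peak at the same mu on a grid (sigma fixed at 1.0).
sigma = 1.0
mu_grid = [8.0 + 0.05 * i for i in range(81)]  # 8.0 to 12.0
likelihoods = [likelihood(mu, sigma) for mu in mu_grid]
log_likelihoods = [simplified_log_likelihood(mu, sigma) for mu in mu_grid]
peak_likelihood = mu_grid[likelihoods.index(max(likelihoods))]
peak_log_likelihood = mu_grid[log_likelihoods.index(max(log_likelihoods))]

print(difference, peak_likelihood, peak_log_likelihood)
```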

This expression can be differentiated to find the maximum. In this example we'll find the MLE of the mean, μ. To do this we take the partial derivative of the function with respect to μ, giving:

$$\frac{\partial \ln P(9, 9.5, 11; \mu, \sigma)}{\partial \mu} = \frac{1}{\sigma^2}\left[9 + 9.5 + 11 - 3\mu\right]$$

Finally, setting the left-hand side of the equation to zero and then rearranging for μ gives:

$$\mu = \frac{9 + 9.5 + 11}{3} \approx 9.833$$

And there we have our maximum likelihood estimate for μ. We can do the same thing with σ too, but I'll leave that as an exercise for the keen reader.
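Here is a short check of the result, assuming the three data points from the example. The derivative function is written out from the log-likelihood of a normal distribution, and the σ line uses the standard closed-form result (the square root of the mean squared deviation from the estimated mean), so skip it if you want to do the exercise yourself first.

```python
import math

data = [9.0, 9.5, 11.0]
n = len(data)

# MLE of the mean: the derivative is zero when mu is the average of the data.
mu_mle = sum(data) / n  # (9 + 9.5 + 11) / 3

def d_log_likelihood_d_mu(mu, sigma):
    """Partial derivative of the normal log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma ** 2

# The derivative vanishes at mu_mle whatever value sigma takes (1.0 is arbitrary).
slope_at_mle = d_log_likelihood_d_mu(mu_mle, sigma=1.0)

# Spoiler for the exercise: standard closed-form MLE of sigma (divides by n, not n - 1).
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in data) / n)

print(f"mu_mle = {mu_mle:.4f}, slope at mu_mle = {slope_at_mle:.2e}, sigma_mle = {sigma_mle:.4f}")
```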

Concluding remarks

Can maximum likelihood estimation always be solved in an exact manner?

No is the short answer. It's more likely that in a real-world scenario the derivative of the log-likelihood function is still analytically intractable (i.e. it's way too hard or impossible to differentiate the function by hand). Therefore, iterative methods like Expectation-Maximization algorithms are used to find numerical solutions for the parameter estimates. The overall idea is still the same though.
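To give a flavour of what "iterative" means, here is a sketch that uses plain gradient ascent on the log-likelihood of the three-point example. Note that this is not Expectation-Maximization; it is just the simplest iterative scheme I could use to illustrate the idea. The starting values, learning rate and number of steps are arbitrary choices, and the normal model does not actually need a numerical method since it has a closed-form answer to compare against.

```python
import math

data = [9.0, 9.5, 11.0]
n = len(data)

def gradient(mu, sigma):
    """Partial derivatives of the normal log-likelihood with respect to mu and sigma."""
    residual_sum = sum(x - mu for x in data)
    squared_sum = sum((x - mu) ** 2 for x in data)
    d_mu = residual_sum / sigma ** 2
    d_sigma = -n / sigma + squared_sum / sigma ** 3
    return d_mu, d_sigma

# Arbitrary starting guess and step size.
mu, sigma = 8.0, 2.0
learning_rate = 0.01

for _ in range(5000):
    d_mu, d_sigma = gradient(mu, sigma)
    # Step uphill on the log-likelihood surface.
    mu += learning_rate * d_mu
    sigma += learning_rate * d_sigma

# Closed-form values, for comparison only.
closed_form_mu = sum(data) / n
closed_form_sigma = math.sqrt(sum((x - closed_form_mu) ** 2 for x in data) / n)

print(f"iterative:   mu = {mu:.4f}, sigma = {sigma:.4f}")
print(f"closed form: mu = {closed_form_mu:.4f}, sigma = {closed_form_sigma:.4f}")
```

In practice you would reach for a tested optimizer from a numerical library rather than a hand-written loop, but the principle of repeatedly nudging the parameters towards higher likelihood is the same.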

So why maximum likelihood and not maximum probability?

Well, this is just statisticians being pedantic (but for good reason). Most people tend to use probability and likelihood interchangeably, but statisticians and probability theorists distinguish between the two. The reason for the confusion is best highlighted by looking at the equation:

$$L(\mu, \sigma; \text{data}) = P(\text{data}; \mu, \sigma)$$

These expressions are equal! So what does this mean? Let's first define P(data; μ, σ). It means "the probability density of observing the data with model parameters μ and σ". It's worth noting that we can generalize this to any number of parameters and any distribution.

On the other hand, L(μ, σ; data) means "the likelihood of the parameters μ and σ taking certain values given that we've observed a bunch of data."

The equation above says that the probability density of the data given the parameters is equal to the likelihood of the parameters given the data. But despite these two things being equal, the likelihood and the probability density are fundamentally asking different questions: one is asking about the data and the other is asking about the parameter values. This is why the method is called maximum likelihood and not maximum probability.
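One way to see the distinction is that the two quantities are the same calculation with different things held fixed. In the sketch below, the function names, the second data set and the parameter values are all hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def probability_density(data, mu, sigma):
    """P(data; mu, sigma): parameters held fixed, viewed as a function of the data."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

def likelihood(mu, sigma, data):
    """L(mu, sigma; data): data held fixed, viewed as a function of the parameters."""
    return probability_density(data, mu, sigma)

observed = [9.0, 9.5, 11.0]

# The two expressions give the same number for the same inputs.
same_value = math.isclose(probability_density(observed, 10.0, 1.0), likelihood(10.0, 1.0, observed))

# Question about the data: with parameters fixed, which data set is more probable?
density_of_observed = probability_density(observed, 10.0, 1.0)
density_of_other = probability_density([12.0, 12.5, 14.0], 10.0, 1.0)

# Question about the parameters: with data fixed, which parameter values are more likely?
likelihood_near = likelihood(10.0, 1.0, observed)
likelihood_far = likelihood(13.0, 1.0, observed)

print(same_value, density_of_observed > density_of_other, likelihood_near > likelihood_far)
```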

When is least squares minimization the same as maximum likelihood estimation?

Least squares minimization is another common method for estimating parameter values for a model in machine learning. It turns out that when the model is assumed to be Gaussian, as in the examples above, the MLE estimates are equivalent to the least squares estimates. For a more in-depth mathematical derivation check out these slides.

Intuitively we can interpret the connection between the two methods by understanding their objectives. For least squares parameter estimation we want to find the line that minimizes the total squared distance between the data points and the regression line (see the figure below). In maximum likelihood estimation we want to maximize the total probability of the data. When a normal distribution is assumed, the maximum probability is found when the data points get closer to the mean value. Since the normal distribution is symmetric, this is equivalent to minimizing the distance between the data points and the mean value.
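The connection can be checked on a toy regression problem. This sketch assumes the model y = mx + c with Gaussian noise of fixed standard deviation σ, in which case the log-likelihood is a constant minus the sum of squared errors divided by 2σ². The data values and the alternative candidate lines are invented for illustration.

```python
import math

# Invented (x, y) observations.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x_values)

def sum_squared_errors(m, c):
    """Least squares objective: total squared distance between the data and the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(x_values, y_values))

def gaussian_log_likelihood(m, c, sigma=1.0):
    """Log-likelihood of the data if y = m*x + c plus normal noise with std sigma."""
    return -n * math.log(sigma) - 0.5 * n * math.log(2.0 * math.pi) - sum_squared_errors(m, c) / (2.0 * sigma ** 2)

# Closed-form least squares estimates for a straight line.
x_mean = sum(x_values) / n
y_mean = sum(y_values) / n
m_ls = sum((x - x_mean) * (y - y_mean) for x, y in zip(x_values, y_values)) / sum((x - x_mean) ** 2 for x in x_values)
c_ls = y_mean - m_ls * x_mean

# The least squares line plus a few hypothetical alternatives.
candidates = [(m_ls, c_ls), (m_ls + 0.1, c_ls), (m_ls, c_ls - 0.2), (1.5, 1.0)]

# Rank by smallest squared error and by largest log-likelihood.
by_squared_error = sorted(candidates, key=lambda p: sum_squared_errors(*p))
by_log_likelihood = sorted(candidates, key=lambda p: gaussian_log_likelihood(*p), reverse=True)

print(by_squared_error == by_log_likelihood, by_log_likelihood[0])
```

Both rankings agree, and the least squares line comes out on top of each, because under this noise assumption maximizing the log-likelihood and minimizing the squared error are the same optimization.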