# Maximum Likelihood Estimation (MLE) in Julia: The OLS Example

* The script to reproduce the results of this tutorial in Julia is located here.

We continue working with OLS, using the model and data generating process presented in the previous post. Recall that,

$\epsilon|X \sim \mathcal{N}\left(0,\sigma^2\right) \implies Y|X \sim \mathcal{N}\left( X\beta, \sigma^2 \right)$,

or, equivalently,

$\left(Y-X\beta\right)|X \sim \mathcal{N}\left( 0, \sigma^2 \right)$,

which is a more convenient expression because the distribution does not depend on $X$, conditional on $X$. Denote the parameter vector by $\rho \equiv [\beta, \sigma^2]$. We will now see how to obtain the MLE estimate $\hat\rho$ of $\rho$. By Bayes’ Rule and independence across individuals (i), the likelihood of $\rho$ satisfies,

$\mathcal{L}\left(\rho|Y,X\right) \propto \prod_{i=1}^N \phi\left( Y_i - X_i\beta, \sigma^2 \right|\rho)$;

where $\phi$ is the normal probability distribution function (PDF). $\hat\rho$ is the $\arg\max$ of this expression, and we will show how to find it using a numerical search algorithm in Julia.

Computing the Log-Likelihood of OLS

First, we define the log-likelihood in Julia as follows (we are using the data X and Y generated in the previous post):

using Distributions
function loglike(rho)
beta = rho[1:4]
sigma2 = exp(rho[5])
residual = Y-X*beta
dist = Normal(0, sigma2)
contributions = logpdf(dist,residual)
loglikelihood = sum(contributions)
return -loglikelihood
end


This code first collects $beta$ (beta) and $\sigma^2$ (sigma2) from $\rho$ (rho), uses $\sigma^2$ to initialize the appropriate normal distribution (dist), then evaluates the normal distribution at each of the residuals, $Y_i-X_i\beta$ (residuals), returning the negative of the sum of the individual contributions to the log-likelihood (contributions).

Tip: Always remember to use the negative of the likelihood (or log-likelihood) that you wish to maximize, because the optimize command is a minimizer by default. Since the $\arg\max$ is the same when maximizing some function and minimizing the negative of the function, we need the negative sign in order to maximize the likelihood with a minimizer.

The only confusing part of this function is that sigma2 is read from rho as an exponential. This is strictly a unit conversion — it means that $\sigma^2$ was stored in $\rho$ in log-units, so we must exponentiate to return it to levels, that is, $\rho$ is defined as $\rho \equiv [\beta, \log\left(\sigma^2\right)]$. This is a common approach in numerical optimization to solve a practical problem: the numerical optimizer tries out many possible values for $\rho$ in its effort to search for the MLE, and it may naively try out a negative value of $\sigma^2$, which would be a logical contradiction that would crash the code. By storing $\sigma^2$ in $\log$ units, we make it perfectly acceptable for the optimizer to try out a negative value of the parameter, because a parameter that is negative in log-units is non-negative in level-units.

Tip: Always restrict variances so that the optimizer cannot try negative values. We have used log-units to achieve this goal.

Maximizing the Likelihood of OLS

Now, we are ready to find the MLE, $\hat\rho$. To do this, we will use the Optim package, which I previously showed how to install.

using Optim
params0 = [.1,.2,.3,.4,.5]
optimum = optimize(loglike,params0,method=:cg)
MLE = optimum.minimum
MLE[5] = exp(MLE[5])
println(MLE)


This says to optimize the function loglike, starting from the point params0, which is chosen somewhat arbitrarily. Numerical search algorithms have to start somewhere, and params0 serves as an initial guess of the optimum. Of course, the best possible guess is trueParams, because the optimizer would have to do much less work, but in practice, we do not know the true parameters so the optimizer will have to do the work. Notice that, at the end, we have to exponentiate the sigma2 parameter because the optimizer will return it in log-units due to our exponentiation above.

Results

Using the same random seed as before, the algorithm returns the estimates,

julia> MLE
5-element Array{Float64,1}:
0.112163
0.476432
-0.290571
0.010831
1.01085


which are very close to trueParams and the true variance of $\epsilon$, 1. The Optim package has various optimizers from which to choose; we were using the Newton Conjugate-Gradient (cg) algorithm above. If we replace this with the Nelder-Mead algorithm (nelder_mead), we obtain the almost-identical estimates,

julia> MLE
5-element Array{Float64,1}:
0.112161
0.476435
-0.290572
0.0108343
1.01085


In future posts, we will see cases where the choice of optimizer makes a major difference in the quality of the numerical MLE as well as the speed of computation, but for this simple example, the choice of optimizer does not much matter.