Tag Archives: OLS

Selection Bias Corrections in Julia, Part 1

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/09/02/selection-bias-corrections-in-julia-part-1/

Selection bias arises when a data sample is not a random draw from the population that it is intended to represent. This is especially problematic when the probability that a particular individual appears in the sample depends on variables that also affect the relationships we wish to study. Selection bias corrections based on models of economic behavior were pioneered by the economist James J. Heckman in his seminal 1979 paper.

For an example of selection bias, suppose we wish to study the effectiveness of a treatment (a new medicine for patients with a particular disease, a preschool curriculum for children facing particular disadvantages, etc.). A random sample is drawn from the population of interest, and the treatment is randomly assigned to a subset of this sample, with the remaining subset serving as the untreated (“control”) group. If the subsets followed instructions, then the resulting data would serve as a random draw from the data generating process that we wish to study.

However, suppose the treatment and control groups do not comply with their assignments. In particular, if only some of the treated benefit from treatment while the others are in fact harmed by treatment, then we might expect the harmed individuals to leave the study. If we accepted the resulting data as a random draw from the data generating process, it would appear that the treatment was much more successful than it actually was; an individual who benefits is more likely to be present in the data than one who does not benefit.

Conversely, if treatment were very beneficial, then some individuals in the control group may find a way to obtain treatment, possibly without our knowledge. The benefits received by the control group would make it appear that the treatment was less beneficial than it actually was; the receipt of treatment is no longer random.

In this tutorial, I present some parameterized examples of selection bias. Then, I present examples of parametric selection bias corrections, evaluating their effectiveness in recovering the data generating processes. Along the way, I demonstrate the use of the GLM package in Julia. A future tutorial demonstrates non-parametric selection bias corrections.


Example 1: Selection on a Normally-Distributed Unobservable

Suppose that we wish to study of the effect of the observable variable X_{i} on Y_i. The data generating process is given by,

Y_i=\beta_0+X_i\beta_1+\epsilon_i,

where X_i and \epsilon_i are independent in the population. Because of this independence condition, the ordinary least squares estimator would be unbiased if the data (Y,X) were drawn randomly from the population. However, suppose that the probability that individual i were in the data set \mathcal{S} were a function of X and \epsilon. For example, suppose that,

\Pr\left( i \in \mathcal{S}\Big| X_i,\epsilon_i\right)=1 if X_i > \epsilon_i,

and the probability is zero otherwise.  This selection rule ensures that, among the individuals in the data (the i in \mathcal{S}), the covariance between X and \epsilon will be positive, even though the covariance is zero in the population. When X covaries positively with \epsilon, then the OLS estimate of \beta_1 is biased upward, i.e., the OLS estimator converges to a value that is greater than \beta_1.

To see the problem, consider the following simulation of the above process in which X and \epsilon are drawn as independent standard normal random variables:

srand(2)
N = 1000
X = randn(N)
epsilon = randn(N)
beta_0 = 0.
beta_1 = 1.
Y = beta_0 + beta_1.*X + epsilon
populationData = DataFrame(Y=Y,X=X,epsilon=epsilon)
selected = X.>epsilon
sampleData = DataFrame(Y=Y[selected],X=X[selected],epsilon=epsilon[selected])

There are 1,000 individuals in the population data, but 492 of them are selected to be included in the sample data. The covariance between X and epsilon in the population data is,

cov(populationData[:X],populationData[:epsilon])
0.00861456704877879

which is approximately zero, but in the sample data, it is,

cov(sampleData[:X],sampleData[:epsilon])
0.32121357192108513

which is approximately 0.32.

Now, we regress Y on X with the two data sets to obtain,

linreg(array(populationData[:X]),array(populationData[:Y]))
2-element Array{Float64,1}:
 -0.027204
  1.00882

linreg(array(sampleData[:X]),array(sampleData[:Y]))
2-element Array{Float64,1}:
 -0.858874
  1.4517

where, in Julia 0.3.0, the command array() replaces the old command matrix() in converting a DataFrame into a numerical Array. This simulation demonstrates severe selection bias associated with using the sample data to estimate the data generating process instead of the population data, as the true parameters, \beta_0=0,\beta_1=1, are not recovered by the sample estimator.


Correction 1: Heckman (1979)

The key insight of Heckman (1979) is that the correlation between X and \epsilon can be represented as an omitted variable in the OLS moment condition,

\mathbb{E}\left[Y_i\Big|X_i,i\in\mathcal{S}\right]=\mathbb{E}\left[\beta_0 + \beta_1X_i+\epsilon_i\Big|X_i,i\in\mathcal{S}\right]=\beta_0 + \beta_1X_i+\mathbb{E}\left[\epsilon_i\Big|X_i,i\in\mathcal{S}\right],

Furthermore, using the conditional density of the standard Normally distributed \epsilon,

\mathbb{E}\left[\epsilon_i\Big|X_i,i\in\mathcal{S}\right]=\mathbb{E}\left[\epsilon_i\Big|X_i,X_i>\epsilon_i\right]=\frac{-\phi\left(X_i\right)}{\Phi\left(X_i\right)}.

which is called the inverse Mills ratio, where \phi and \Phi are the probability and cumulative density functions of the standard normal distribution. As a result, the moment condition that holds in the sample is,

\mathbb{E}\left[Y_i\Big|X_i,i\in\mathcal{S}\right]=\beta_0 + \beta_1X_i+\frac{-\phi\left(X_i\right)}{\Phi\left(X_i\right)}

Returning to our simulation, the inverse Mills ratio is added to the sample data as,

sampleData[:invMills] = -pdf(Normal(),sampleData[:X])./cdf(Normal(),sampleData[:X])

Then, we run the regression corresponding to the sample moment condition,

linreg(array(sampleData[[:X,:invMills]]),array(sampleData[:Y]))
3-element Array{Float64,1}:
 -0.166541
  1.05648
  0.827454

We see that the estimate for \beta_1 is now 1.056, which is close to the true value of 1, compared to the non-corrected estimate of 1.452 above. Similarly, the estimate for \beta_0 has improved from -0.859 to -0.167, when the true value is 0. To see that the Heckman (1979) correction is consistent, we can increase the sample size to N=100,000, which yields the estimates,

linreg(array(sampleData[[:X,:invMills]]),array(sampleData[:Y]))
3-element Array{Float64,1}:
 -0.00417033
  1.00697
  0.991503

which are very close to the true parameter values.

Note that this analysis generalizes to the case in which X contains K variables and the selection rule is,

\delta_0 + \delta_1 X_1 + \delta_2 X_2 + \ldots + \delta_K X_k > \epsilon,

which is the case considered by Heckman (1979). The only difference is that the coefficients \delta_k must first be estimated by regressing an indicator for i \in \mathcal{S} on X, then using the fitted equation within the inverse Mills ratio. This requires that we observe X for i \notin \mathcal{S}. Probit regression is covered in a slightly different context below.

As a matter of terminology, the process of estimating \delta is called the “first stage”, and estimating \beta conditional on the estimates of \delta is called the “second stage”. When the coefficient on the inverse Mills ratio is positive, it is said that “positive selection” has occurred, with “negative selection” otherwise. Positive selection means that, without the correction, the estimate of \beta_1 would have been upward-biased, while negative selection results in a downward-biased estimate. Finally, because the selection rule is driven by an unobservable variables \epsilon, this is a case of “selection on unobservables”. In the next section we consider a case of “selection on observables”.


Example 2: Probit Selection on Observables

Suppose that we wish to know the mean and variance of Y in the population. However, our sample of Y suffers from selection bias. In particular, there is some X such that the probability of observing Y depends on X according to,

\Pr\left(i\in\mathcal{S}\Big| X_i\right)=F\left(X_i\right),

where F is some function with range [0,1]. Notice that, if Y and X were independent, then the resulting sample distribution of Y would be a random draw from the population (marginal distribution) of Y. Instead, we suppose \mathrm{Cov}\left(X,Y\right)\neq 0. For example,

srand(2)
N = 10000
populationData = DataFrame(rand(MvNormal([0,0.],[1 .5;.5 1]),N)')
names!(populationData,[:X,:Y])

mean(array(populationData),1)
1x2 Array{Float64,2}:
 -0.0281916  -0.022319

cov(array(populationData))
2x2 Array{Float64,2}:
 0.98665   0.500912
 0.500912  1.00195 

In this simulated population, the estimated mean and variance of Y are -0.022 and 1.002, and the covariance between X and Y is 0.501. Now, suppose the probability that Y_i is observed is a probit regression of X_i,

\Pr\left( i\in\mathcal{S}\Big| X_i \right) = \Phi\left( \beta_0 + \beta_1 X_i \right),

where \Phi is the CDF of the standard normal distribution. Letting D_i=1 indicate that i \in \mathcal{S}, we can generate the sample selection rule D as,

beta_0 = 0
beta_1 = 1
index = (beta_0 + beta_1*data[:X])
probability = cdf(Normal(0,1),index)
D = zeros(N)
for i=1:N
    D[i] = rand(Bernoulli(probability[i]))
end
populationData[:D] = D
sampleData = populationData
sampleData[D.==0,:Y] = NA

The sample data has missing values in place of Y_i if D_i=0. The estimated mean and variance of Y in the sample data are 0.275 (which is too large) and 0.862 (which is too small).


Correction 2: Inverse Probability Weighting

The reason for the biased estimates of the mean and variance of Y in Example 2 is sample selection on the observable X. In particular, certain values of Y are over-represented due to their relationship with X. Inverse probability weighting is a way to correct for the over-representation of certain types of individuals, where the “type” is captured by the probability of being included in the sample.

In the above simulation, conditional on appearing in the population, the probability that an individual of type X_i=1 is included in the sample is 0.841. By contrast, the probability that an individual of type X_i=0 is included in the sample is 0.5, so type X_i is over-represented by a factor of 0.841/0.5 = 1.682. If we could reduce the impact that type X_i=1 has in the computation of the mean and variance of Y by a factor of 1.682, we would alter the balance of types in the sample to match the balance of types in the population. Inverse probability weighting generalizes this logic by weighting each individual’s impact by the inverse of the probability that this individual appears in the sample.

Before we can make the correct, we must first estimate the probability of sample inclusion. This can be done by fitting the probit regression above by least-squares. For this, we use the GLM package in Julia, which can be installed the usual way with the command Pkg.add(“GLM”).

using GLM
Probit = glm(D ~ X, sampleData, Binomial(), ProbitLink())
DataFrameRegressionModel{GeneralizedLinearModel,Float64}:

Coefficients:
             Estimate Std.Error  z value Pr(>|z|)
(Intercept)  0.114665  0.148809 0.770554   0.4410
X             1.14826   0.21813  5.26414    1e-6

estProb = predict(Probit)
weights = 1./estProb[D.==1]/sum(1./estProb[D.==1])

which are the inverse probability weights needed to match the sample distribution to the population distribution.

Now, we use the inverse probability weights to correct the mean and variance estimates of Y,

correctedMean = sum(sampleData[D.==1,:Y].*weight)
-0.024566923132025013

correctedVariance = (N/(N-1))*sum((sampleData[D.==1,:Y]-correctedMean).^2.*weight)
1.0094029613131092

which are very close to the population values of -0.022319 and 1.00195. The logic here extends to the case of multivariate X, as more coefficients are added to the Probit regression. The logic also extends to other functional forms of F, for example, switching from Probit to Logit is achieved by replacing the ProbitLink() with LogitLink() in the glm() estimation above.


Example 3: Generalized Roy Model

For the final example of this tutorial, we consider a model which allows for rich, realistic economic behavior. In words, the Generalized Roy Model is the economic representation of a world in which each individual must choose between two options, where each option has its own benefits, and one of the options costs more than the other. In math notation, the first alternative, denoted D_i=1, relates the outcomes Y_i to the individual’s observable characteristics, X_i, by,

Y_{1,i} = \mu_1\left(X_i\right)+U_{1,i}.

Similarly, the second alternative, denoted D_i=0, relates Y_i to X_i, by,

Y_{0,i} = \mu_0\left(X_i\right)+U_{0,i}.

The value of Y_i that appears in our sample is thus given by,

Y_i = D_iY_{1,i} + (1-D_i)Y_{0,i}.

Finally, the value of D_i is chosen by individual i according to,

D_i=1 if Y_{1,i}-Y_{0,i}-C_i>0,

where C_i is the cost of choosing the alternative D_i=1 and is given by,

C_i = \mu_C\left(X_i,Z_i\right)+U_{C,i},

where Z contains additional characteristics of i that are not included in X.

We assume that the data only contains Y_i,D_i,X_i,Z_i; it does not contain the variables Y_{i,1},Y_{i,0},U_{i,1},U_{i,0},U_{i,C} or the functions \mu_1,\mu_0,\mu_C. Assuming that the three \mu functions follow the linear form and that the unobservables U are independent and Normally distributed, we can simulate the data generating process as,

srand(2)
N = 1000
sampleData = DataFrame(rand(MvNormal([0,0.],[1 .5; .5 1]),N)')
names!(sampleData,[:X,:Z])
U1 = rand(Normal(0,.5),N)
U0 = rand(Normal(0,.7),N)
UC = rand(Normal(0,.9),N)
betas1 = [0,1]
betas0 = [.3,.2]
betasC = [.1,.1,.1]
Y1 = betas1[1] + betas1[2].*sampleData[:X] + U1
Y0 = betas0[1] + betas0[2].*sampleData[:X] + U0
C = betasC[1] + betasC[2].*sampleData[:X] + betasC[3].*sampleData[:Z] + UC
D = Y1-Y0-C.>0
Y = D.*Y1 + (1-D).*Y0
sampleData[:D] = D
sampleData[:Y] = Y

In this simulation, about 38% of individuals choose the alternative D_i=1. About 10% of individuals choose D_i=0 even though they receive greater benefits under D_i=1 due to the high cost C_i associated with D_i=1.


Solution 3: Heckman Correction for Generalized Roy Model

The identification of this model is attributable to Heckman and Honore (1990). Estimation proceeds in steps. The first step is to notice that the left- and right-hand terms in the following moment equation motivate a Probit regression:

\mathbb{E}\left[D_i\Big|X_i,Z_i\right]=\Pr\left(\mu_D\left(X_i,Z_i\right)>U_{D,i}\Big|X_i,Z_i\right)=\Phi\left(\mu_D\left(X_i,Z_i\right)\right),

where U_{D,i}\equiv -\left(U_{1,i}-U_{0,i}-U_{C,i}\right)/\sigma_D\sim\mathcal{N}\left(0,1\right) is the negative of the total error term arising in the equation that determines D_i above, \sigma_D \equiv \sqrt{\mathrm{Var}\left(U_{1,i}-U_{0,i}-U_{C,i}\right)}, and,

\mu_D\left(X,Z\right) \equiv \left([1,X]\beta_1 -[1,X]\beta_0 -[1,X,Z]\beta_C\right)/\sigma_D \equiv [1,X,Z]\beta_D,

In the simulation above, \beta_D = [-.4,.7,-.1]/\sqrt{.5+.7+.9}\approx[-0.276,0.483,-0.069]. We can estimate \beta_D from the Probit regression of D on X and Z.

betasD = coef(glm(D~X+Z,sampleData,Binomial(),ProbitLink()))
3-element Array{Float64,1}:
 -0.299096
  0.59392 
 -0.103155

Next, notice that,

\mathbb{E}\left[Y_i\Big|D_i=1,X_i,Z_i\right] =[1,X_i]\beta_1+\mathbb{E}\left[U_{1,i}\Big|D_i=1,X_i,Z_i\right],

where,

\mathbb{E}\left[U_{1,i}\Big|D_i=1,X_i,Z_i\right] =\mathbb{E}\left[U_{1,i}\Big|\mu_D\left(X_i,Z_i\right)>U_{i,D},X_i,Z_i\right]=\rho_1 \frac{-\phi\left(\mu_D\left(X_i,Z_i\right)\right)}{\Phi\left(\mu_D\left(X_i,Z_i\right)\right)},

which is the inverse Mills ratio again, where \rho_1 \equiv \frac{\mathrm{Cov}\left(U_1,U_D\right)}{\mathrm{Var}\left(U_D\right)}. Substituting in the estimate for \beta_D, we consistently estimate \beta_1:

fittedVals = hcat(ones(N),array(sampleData[[:X,:Z]]))*betasD

sampleData[:invMills1] = -pdf(Normal(0,1),fittedVals)./cdf(Normal(0,1),fittedVals)

correctedBetas1 = linreg(array(sampleData[D,[:X,:invMills1]]),vec(array(sampleData[D,[:Y]])))
3-element Array{Float64,1}:
  0.0653299
  0.973568 
 -0.135445 

To see how well the correction has performed, compare these estimates to the uncorrected estimates of \beta_1,

biasedBetas1 = linreg(array(sampleData[D,[:X]]),vec(array(sampleData[D,[:Y]])))
2-element Array{Float64,1}:
 0.202169
 0.927317

Similar logic allows us to estimate \beta_0:

sampleData[:invMills0] = pdf(Normal(0,1),fittedVals)./(1-cdf(Normal(0,1),fittedVals))

correctedBetas0 = linreg(array(sampleData[D.==0,[:X,:invMills0]]),vec(array(sampleData[D.==0,[:Y]])))
3-element Array{Float64,1}:
 0.340621
 0.207068
 0.323793

biasedBetas0 = linreg(array(sampleData[D.==0,[:X]]),vec(array(sampleData[D.==0,[:Y]])))
2-element Array{Float64,1}:
 0.548451
 0.295698

In summary, we can consistently estimate the benefits associated with each of two alternative choices, even though we only observe each individual in one of the alternatives, subject to heavy selection bias, by extending the logic introduced by Heckman (1979).


Bradley J. Setzler

Parallel Processing in Julia: Bootstrapping the MLE

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/18/parallel-processing-in-julia-bootstrapping-the-mle/

* Scripts that reproduce the results of this tutorial in Julia are located here and here.

To review our extended example, we first simulated the OLS DGP, solved for the MLE of OLS, bootstrapped the MLE, and computed various non-parametric p-values.

The only part of our previous work that may have slowed down your computer was forming the 10,000 bootstrap samples (samples). In this post, we will see how to speed up the bootstrapping by using more than one processor to collect the bootstrap samples.


The Estimator that We Wish to Bootstrap

We are interested in the following MLE estimator, adapted from our previous work. First, we import the two needed packages and generate the data:

# Prepare
srand(2)
using Distributions
using Optim
using DataFrames
N=1000
K=3

# Generate variables
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]
genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.01,0.05,0.05,0.07]
Y = X*trueParams + epsilon

# Export data
data = DataFrame(hcat(Y,X))
names!(data,[:Y,:one,:X1,:X2,:X3])
writetable("data.csv",data)

Next, we define the likelihood of OLS:

function loglike(rho,y,x)
    beta = rho[1:4]
    sigma2 = exp(rho[5])+eps(Float64)
    residual = y-x*beta
    dist = Normal(0, sqrt(sigma2))
    contributions = logpdf(dist,residual)
    loglikelihood = sum(contributions)
    return -loglikelihood
end

Finally, we define the bootstrap function, which includes within it a wrapper for the likelihood:

function bootstrapSamples(B)
    println("hi")
    samples = zeros(B,5)
    for b=1:B
		theIndex = sample(1:N,N)
		x = X[theIndex,:]
		y = Y[theIndex,:]
		function wrapLoglike(rho)
			return loglike(rho,y,x)
		end
		samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
	end
	samples[:,5] = exp(samples[:,5])
    println("bye")
    return samples
end

The reason for the cute println statements will become apparent later. Save all of this in a file called “bootstrapFunction.jl”. Finally, we import the data created above:

data = readtable("data.csv")
N = size(data,1)
Y = array(data[:Y])
X = array(data[[:one,:X1,:X2,:X3]])

Parallel Bootstrapping

We now have the function, bootstrapSamples, saved in the file “bootstrapFunction.jl”, that we wish to run in parallel. The idea is that, instead of running the MLE B times on one processor, we run it B/4 times on each of four processors, then collect the results back to the first processor. Unfortunately, Julia makes it very difficult to parallelize functions within the same file that the functions are defined; the user must explicitly tell Julia every parameter and function that is needed on each processor, which is a frustrating process of copy-pasting the command @everywhere all around your code.

Instead, we separate our work into two files, the first one you have already created. Create a second file in the same location on your computer as “bootstrapFunction.jl”. In the second file, write the following commands:

addprocs(4)
require("tutorial_5_bootstrapFunction.jl")
B=10000
b=2500
samples_pmap = pmap(bootstrapSamples,[b,b,b,b])
samples = vcat(samples_pmap[1],samples_pmap[2],samples_pmap[3],samples_pmap[4])

The command addprocs(4) means to add four processors into your Julia session. The command require is the key to getting this right: it ensures that all of the functions and parameters contained in “bootstrapFunction.jl” are available to each processor. pmap is the function that actually applies bootstrapSamples to each value in the list [b,b,b,b]. I set the list next to it to be [b,b,b,b], where b=2500, so that bootstrapEstimates would collect 2,500 bootstrap samples on each processor. The resulting matrices from all four processors are stored in the length-4 vector called samples_pmap. Finally, we vertically stack (vcat) the four matrices into one matrix, samples, which is exactly the matrix of bootstrap samples we have seen previously.

Tip: For parallel processing in Julia, separate your code into two files, one file containing the functions and parameters that need to be run in parallel, and the other file managing the processing and collecting results. Use the require command in the managing file to import the functions and parameters to all processors.


Results

Once you have run the outer file, you can do speed tests to compare the two ways of getting 10,000 bootstrap samples. We use the @elapsed command to time each code. Here are the results:

@elapsed pmap(bootstrapSamples,[b,b,b,b])
	From worker 2:	hi
	From worker 3:	hi
	From worker 4:	hi
	From worker 5:	hi
	From worker 2:	bye
	From worker 5:	bye
	From worker 4:	bye
	From worker 3:	bye
66.528345161

Each of the processors tells us “hi” and “bye” when it begins and finishes its job, respectively. The total time was a bit more than a minute (66.5 seconds). By comparison, the following code will compute 10,000 bootstrap samples on only one processor, as in our previous work:

@elapsed bootstrapSamples(10000)
hi
bye
145.265103729

So, in the absence of parallel processing, the same result would require just over 145 seconds, or almost two and a half minutes. Thus, parallel processing on a four-core personal computer allowed us to reduce the time required to bootstrap our estimator by a factor of around 2.2. As a general rule, parallel processing of the bootstrap estimator saves you more time when the function being bootstrapped is more complicated.


Bradley J. Setzler

Maximum Likelihood Estimation (MLE) in Julia: The OLS Example

By: Bradley Setzler

Re-posted from: http://juliaeconomics.com/2014/06/16/numerical-maximum-likelihood-the-ols-example/

* The script to reproduce the results of this tutorial in Julia is located here.

We continue working with OLS, using the model and data generating process presented in the previous post. Recall that,

\epsilon|X \sim \mathcal{N}\left(0,\sigma^2\right) \implies Y|X \sim \mathcal{N}\left( X\beta, \sigma^2 \right),

or, equivalently,

\left(Y-X\beta\right)|X \sim \mathcal{N}\left( 0, \sigma^2 \right),

which is a more convenient expression because the distribution does not depend on X, conditional on X. Denote the parameter vector by \rho \equiv [\beta, \sigma^2]. We will now see how to obtain the MLE estimate \hat\rho of \rho. By Bayes’ Rule and independence across individuals (i), the likelihood of \rho satisfies,

\mathcal{L}\left(\rho|Y,X\right) \propto \prod_{i=1}^N \phi\left( Y_i - X_i\beta, \sigma^2 \right|\rho);

where \phi is the normal probability distribution function (PDF). \hat\rho is the \arg\max of this expression, and we will show how to find it using a numerical search algorithm in Julia.


Computing the Log-Likelihood of OLS

First, we define the log-likelihood in Julia as follows (we are using the data X and Y generated in the previous post):

using Distributions
function loglike(rho)
   beta = rho[1:4]
   sigma2 = exp(rho[5])
   residual = Y-X*beta
   dist = Normal(0, sigma2)
   contributions = logpdf(dist,residual)
   loglikelihood = sum(contributions)
   return -loglikelihood
end

This code first collects beta (beta) and \sigma^2 (sigma2) from \rho (rho), uses \sigma^2 to initialize the appropriate normal distribution (dist), then evaluates the normal distribution at each of the residuals, Y_i-X_i\beta (residuals), returning the negative of the sum of the individual contributions to the log-likelihood (contributions).

Tip: Always remember to use the negative of the likelihood (or log-likelihood) that you wish to maximize, because the optimize command is a minimizer by default. Since the \arg\max is the same when maximizing some function and minimizing the negative of the function, we need the negative sign in order to maximize the likelihood with a minimizer.

The only confusing part of this function is that sigma2 is read from rho as an exponential. This is strictly a unit conversion — it means that \sigma^2 was stored in \rho in log-units, so we must exponentiate to return it to levels, that is, \rho is defined as \rho \equiv [\beta, \log\left(\sigma^2\right)]. This is a common approach in numerical optimization to solve a practical problem: the numerical optimizer tries out many possible values for \rho in its effort to search for the MLE, and it may naively try out a negative value of \sigma^2, which would be a logical contradiction that would crash the code. By storing \sigma^2 in \log units, we make it perfectly acceptable for the optimizer to try out a negative value of the parameter, because a parameter that is negative in log-units is non-negative in level-units.

Tip: Always restrict variances so that the optimizer cannot try negative values. We have used log-units to achieve this goal.


Maximizing the Likelihood of OLS

Now, we are ready to find the MLE, \hat\rho. To do this, we will use the Optim package, which I previously showed how to install.

using Optim
params0 = [.1,.2,.3,.4,.5]
optimum = optimize(loglike,params0,method=:cg)
MLE = optimum.minimum
MLE[5] = exp(MLE[5])
println(MLE)

This says to optimize the function loglike, starting from the point params0, which is chosen somewhat arbitrarily. Numerical search algorithms have to start somewhere, and params0 serves as an initial guess of the optimum. Of course, the best possible guess is trueParams, because the optimizer would have to do much less work, but in practice, we do not know the true parameters so the optimizer will have to do the work. Notice that, at the end, we have to exponentiate the sigma2 parameter because the optimizer will return it in log-units due to our exponentiation above.


Results

Using the same random seed as before, the algorithm returns the estimates,

julia> MLE
5-element Array{Float64,1}:
0.112163
0.476432
-0.290571
0.010831
1.01085

which are very close to trueParams and the true variance of \epsilon, 1. The Optim package has various optimizers from which to choose; we were using the Newton Conjugate-Gradient (cg) algorithm above. If we replace this with the Nelder-Mead algorithm (nelder_mead), we obtain the almost-identical estimates,

julia> MLE
5-element Array{Float64,1}:
0.112161
0.476435
-0.290572
0.0108343
1.01085

In future posts, we will see cases where the choice of optimizer makes a major difference in the quality of the numerical MLE as well as the speed of computation, but for this simple example, the choice of optimizer does not much matter.


Bradley J. Setzler