
Upgrade to Julia Version 0.3.0

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/09/01/upgrade-to-julia-version-3/

UPDATE: Julia is now on Version 0.4 and the Julia Developers have now released an IDE called Juno that is bundled with Julia. See the new installation guide here.


The Julia developers have officially released version 0.3.0, and it is a good idea to upgrade. The key packages in Julia that we use for economics — especially DataFrames — are being rapidly developed for 0.3.0, so you will need to upgrade to take advantage of new and improved features.

It’s relatively easy to upgrade Julia. However, it appears that Julia Studio does not and will not support 0.3.0, so the bigger problem users will face is abandoning Julia Studio in favor of another environment for writing and testing code.

There are a few candidates from which you may choose to replace Julia Studio. My recommendation is IJulia. IJulia is somewhat easy to install, easy to use, and excellent for working with graphics. A drawback of using IJulia is that it loads in your browser and, at least in my experience, it often crashes the browser (I am using Safari on a Mac), so you will need to save your work frequently.

The greater drawback is that IJulia is much more difficult to install than Julia Studio, as it requires the installation of IPython. It used to be true that a complete novice could install and use both Julia and a coding environment with a few clicks. Now, it is more involved.


How to Install Julia 0.3.0 and IJulia

Here is a (somewhat) quick guide to upgrading Julia and replacing Julia Studio with IJulia:

  1. Delete your existing Julia and Julia Studio. On a Mac, these are located in the Applications folder. Leaving them on your computer can only cause problems. You may need to check for hidden files relating to Julia and delete those as well (this is definitely true on a Mac, where the hidden ~/.julia directory must be found and deleted). You can find instructions specific to your operating system here under Platform Specific Instructions.
  2. Go here and download Julia version 0.3.0. Once it finishes downloading, open the file and allow it to install. On a Mac, you can find Julia-0.3.0 in Applications once it has finished installing. Open Julia by double-clicking the application. It will open a command prompt/terminal that indicates you are using Julia version 0.3.0. You have successfully upgraded Julia.
  3. To install IJulia, first you will need to install IPython here. If you are lucky, pip install works for you and you can install IPython with one command. If not, you should consider installing a Python distribution that includes IPython; the easiest option may be to install Anaconda, although this gives you much more than you need.
  4. If you have made it this far, the rest is easy. Simply open Julia and run the Julia commands Pkg.add("IJulia"), then Pkg.build("IJulia"), then Pkg.update().
  5. You are finished. To open IJulia, you can either run the commands using IJulia and then notebook() within Julia, or use the command ipython notebook --profile julia at the command prompt/terminal. (Steps 4 and 5 are collected in the short sketch below.)
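For convenience, here are the commands from steps 4 and 5 exactly as they would be entered; nothing here goes beyond the steps above.

# Step 4: install and build IJulia, then update packages (run at the Julia prompt)
Pkg.add("IJulia")
Pkg.build("IJulia")
Pkg.update()

# Step 5: launch the IJulia notebook from within Julia ...
using IJulia
notebook()

# ... or, equivalently, from the command prompt/terminal:
#   ipython notebook --profile julia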

Going Forward

Compare these moderately complicated instructions to my previous promise that anyone could have Julia and a Julia coding environment ready to use in under five minutes. The five-minute installation was one of my favorite aspects of Julia, and it is an unfortunate casualty of the upgrade to 0.3.0.

At the moment, it is easier to install R and RStudio than to install Julia and IJulia, unless you already have IPython on your computer. Hopefully, the Julia developer community, which is actively engaged and dedicated to the performance of this language, will create a new one-click approach to installing both Julia and a Julia coding environment, as Julia Studio used to provide. In the meantime, IJulia appears to be the best option for new users.

I hope this guide makes the transition a bit easier for you.


Bradley J. Setzler

Parallel Processing in Julia: Bootstrapping the MLE

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/18/parallel-processing-in-julia-bootstrapping-the-mle/

* Scripts that reproduce the results of this tutorial in Julia are located here and here.

To review our extended example, we first simulated the OLS DGP, solved for the MLE of OLS, bootstrapped the MLE, and computed various non-parametric p-values.

The only part of our previous work that may have slowed down your computer was forming the 10,000 bootstrap samples (samples). In this post, we will see how to speed up the bootstrapping by using more than one processor to collect the bootstrap samples.


The Estimator that We Wish to Bootstrap

We are interested in the following MLE estimator, adapted from our previous work. First, we import the two needed packages and generate the data:

# Prepare
srand(2)
using Distributions
using Optim
using DataFrames
N=1000
K=3

# Generate variables
genX = MvNormal(eye(K))
X = rand(genX,N)
X = X'
X_noconstant = X
constant = ones(N)
X = [constant X]
genEpsilon = Normal(0, 1)
epsilon = rand(genEpsilon,N)
trueParams = [0.01,0.05,0.05,0.07]
Y = X*trueParams + epsilon

# Export data
data = DataFrame(hcat(Y,X))
names!(data,[:Y,:one,:X1,:X2,:X3])
writetable("data.csv",data)

Next, we define the log-likelihood of OLS (returned with a minus sign, because the optimizer below minimizes):

function loglike(rho,y,x)
    beta = rho[1:4]
    sigma2 = exp(rho[5])+eps(Float64)
    residual = y-x*beta
    dist = Normal(0, sqrt(sigma2))
    contributions = logpdf(dist,residual)
    loglikelihood = sum(contributions)
    return -loglikelihood
end
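As a quick sanity check (an illustrative call of my own, assuming the variables generated above are still in memory), the function can be evaluated at the true coefficients with a trial value of zero for the log-variance:

# Evaluate the negative log-likelihood at the true coefficients, using
# log(sigma2) = 0 (i.e., sigma2 = 1) as a trial value for the fifth parameter.
rho_test = vcat(trueParams, 0.0)
println(loglike(rho_test, Y, X))   # a single number; smaller values are better for the optimizer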

Finally, we choose starting values for the optimizer (any reasonable guess will do) and define the bootstrap function, which includes within it a wrapper for the likelihood:

params0 = zeros(5)   # starting values passed to the optimizer

function bootstrapSamples(B)
    println("hi")
    samples = zeros(B,5)
    for b=1:B
        theIndex = sample(1:N,N)
        x = X[theIndex,:]
        y = Y[theIndex,:]
        function wrapLoglike(rho)
            return loglike(rho,y,x)
        end
        samples[b,:] = optimize(wrapLoglike,params0,method=:cg).minimum
    end
    samples[:,5] = exp(samples[:,5])
    println("bye")
    return samples
end

The reason for the cute println statements will become apparent later. Save all of this in a file called “bootstrapFunction.jl”. Finally, we import the data created above:

data = readtable("data.csv")
N = size(data,1)
Y = array(data[:Y])
X = array(data[[:one,:X1,:X2,:X3]])

Parallel Bootstrapping

We now have the function, bootstrapSamples, saved in the file “bootstrapFunction.jl”, that we wish to run in parallel. The idea is that, instead of running the MLE B times on one processor, we run it B/4 times on each of four processors, then collect the results back on the first processor. Unfortunately, Julia makes it very difficult to parallelize functions within the same file in which the functions are defined; the user must explicitly tell Julia about every parameter and function that is needed on each processor, which is a frustrating process of copy-pasting the command @everywhere all around your code.
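To see what is being avoided, here is a minimal single-file sketch of the @everywhere pattern (a toy example of my own; slowTask is a hypothetical stand-in for a function like bootstrapSamples): every constant and every function the workers use must be marked by hand.

addprocs(4)
@everywhere M = 10^6                                  # every constant the workers need...
@everywhere slowTask(n) = sum([sin(i) for i = 1:n])   # ...and every function, must be marked
results = pmap(slowTask, [M, M, M, M])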

Instead, we separate our work into two files, the first one you have already created. Create a second file in the same location on your computer as “bootstrapFunction.jl”. In the second file, write the following commands:

addprocs(4)
require("bootstrapFunction.jl")
B=10000
b=2500
samples_pmap = pmap(bootstrapSamples,[b,b,b,b])
samples = vcat(samples_pmap[1],samples_pmap[2],samples_pmap[3],samples_pmap[4])

The command addprocs(4) adds four processors to your Julia session. The command require is the key to getting this right: it ensures that all of the functions and parameters contained in “bootstrapFunction.jl” are available to each processor. pmap is the function that actually applies bootstrapSamples to each value in the list [b,b,b,b]. I set the list to [b,b,b,b], where b=2500, so that bootstrapSamples collects 2,500 bootstrap samples on each of the four processors. The resulting matrices from all four processors are stored in the length-4 vector called samples_pmap. Finally, we vertically stack (vcat) the four matrices into one matrix, samples, which is exactly the matrix of bootstrap samples we have seen previously.

Tip: For parallel processing in Julia, separate your code into two files, one file containing the functions and parameters that need to be run in parallel, and the other file managing the processing and collecting results. Use the require command in the managing file to import the functions and parameters to all processors.


Results

Once you have run the managing file, you can do speed tests to compare the two ways of getting 10,000 bootstrap samples. We use the @elapsed macro to time each approach. Here are the results:

@elapsed pmap(bootstrapSamples,[b,b,b,b])
	From worker 2:	hi
	From worker 3:	hi
	From worker 4:	hi
	From worker 5:	hi
	From worker 2:	bye
	From worker 5:	bye
	From worker 4:	bye
	From worker 3:	bye
66.528345161

Each of the processors tells us “hi” and “bye” when it begins and finishes its job, respectively. The total time was a bit more than a minute (66.5 seconds). By comparison, the following code will compute 10,000 bootstrap samples on only one processor, as in our previous work:

@elapsed bootstrapSamples(10000)
hi
bye
145.265103729

So, in the absence of parallel processing, the same result would require just over 145 seconds, or almost two and a half minutes. Thus, parallel processing on a four-core personal computer allowed us to reduce the time required to bootstrap our estimator by a factor of around 2.2. As a general rule, parallel processing of the bootstrap estimator saves you more time when the function being bootstrapped is more complicated.
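As a quick check of the arithmetic, using the two timings reported above:

speedup = 145.265103729 / 66.528345161   # approximately 2.18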


Bradley J. Setzler

Stepdown p-values for Multiple Hypothesis Testing in Julia

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/17/stepdown-p-values-for-multiple-hypothesis-testing-in-julia/

* The script to reproduce the results of this tutorial in Julia is located here.

To finish yesterday’s tutorial on hypothesis testing with non-parametric p-values in Julia, I show how to perform the bootstrap stepdown correction to p-values for multiple testing, which many economists find intimidating (including me) but is actually not so difficult.

First, I will show the simple case of Holm’s (1979) stepdown procedure. Then, I will build on the bootstrap example to apply one of the most exciting advances in econometric theory from the past decade, the bootstrap step-down procedure, developed by Romano and Wolf (2004) and extended by Chicago professor Azeem Shaikh. These methods allow one to bound the family-wise error rate, which is the probability of making at least one false discovery (i.e., rejecting a null hypothesis when the null hypothesis was true).


Motivation

The standard example to motivate this discussion is to suppose you compute p-values for, say, 10 parameters with independent estimators. If each of the 10 has a p-value equal to 0.05 for its null hypothesis, then we reject any one of them individually with 95% confidence. However, if all 10 null hypotheses are true, the probability of falsely rejecting at least one of them is 1-\left(1-0.05\right)^{10} \approx 0.40 (by independence, the probability that no test falsely rejects is the product of the marginal probabilities), so we have only about 60% confidence in rejecting all of them jointly. If we wish to test all of the hypotheses jointly, we need some way to adjust the p-values to give us greater confidence in jointly rejecting them. Holm’s solution to this problem is to iteratively inflate the individual p-values, so that the smallest p-value is inflated the most and the largest p-value is not inflated at all (unless one of the smaller p-values is inflated above the largest p-value, in which case the largest p-value is inflated as well to preserve the ordering of the p-values). The specifics of the procedure are below.
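As a quick check of that arithmetic (in Julia):

# Family-wise error rate for 10 independent tests, each conducted at the 0.05 level
fwer = 1 - (1 - 0.05)^10   # approximately 0.401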

The Romano-Wolf-Shaikh approach is to account for the dependence among the estimators, and penalize estimators more as they become more independent. Here’s how I motivate their approach: First, consider the extreme case of two estimators that are identical. Then, if we reject the null hypothesis for the first, there is no reason to penalize the second; it would be very strange to reject the null hypothesis for the first but not the second when they are identical. Second, suppose they are almost perfectly dependent. Then, rejecting the null for one strongly suggests that we should also reject the null for the second, so we do not want to penalize one very strongly for the rejection of the other. But as they become increasingly independent, the rejection of one provides less and less affirmative information about the other, and we approach the case of perfect independence shown above, which receives the greatest penalty. The specifics of the procedure are below.


Holm’s Correction to the p-values

Suppose you have a vector of p-values, p, of length K, and a desired maximum family-wise error rate, \alpha (the most common choice is \alpha=0.05). These are all of the needed ingredients. Holm’s stepdown procedure is as follows:

  1. If the smallest element of p is greater than \frac{\alpha}{K}, reject no null hypotheses and exit (do not continue to step 2). Else, reject the null hypothesis corresponding to the smallest element of p (continue to step 2).
  2. If the second-smallest element of p is greater than \frac{\alpha}{K-1}, reject no remaining null hypotheses and exit (do not continue to step 3). Else, reject the null hypothesis corresponding to the second-smallest element of p (continue to step 3).
  3. If the third-smallest element of p is greater than \frac{\alpha}{K-2}, reject no remaining null hypotheses and exit (do not continue to step 4). Else, reject the null hypothesis corresponding to the third-smallest element of p (continue to step 4).
  4. And so on: at step k, the k-th smallest element of p is compared to \frac{\alpha}{K+1-k}.
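Here is a minimal sketch of that rejection loop, written as a hypothetical helper (not code from the original post) that takes p and \alpha as inputs:

# Holm's stepdown rule: walk through the p-values from smallest to largest,
# stopping at the first one that exceeds its threshold alpha/(K+1-k).
function holmReject(p, alpha)
    K = length(p)
    order = sortperm(p)            # indices of p, from smallest to largest
    reject = falses(K)
    for k = 1:K
        if p[order[k]] > alpha/(K + 1 - k)
            break                  # stop at the first non-rejection
        end
        reject[order[k]] = true
    end
    return reject                  # reject[i] is true if null hypothesis i is rejected
end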

A loop-based function that takes p,\alpha as parameters, like the sketch above, implements this directly, but a simpler approach is to compute the p-values equivalent to this procedure for any \alpha, which are,

\tilde{p}_{(k)} =\min\left\{\max_{j:p_{(j)}\leq p_{(k)}} \left\{ (K+1-j)p_{(j)}\right\},1\right\},

where (k) means the k^{\mathit{th}} smallest element. This expression is equivalent to the algorithm above in the sense that, for any family-wise error rate \alpha\in (0,1), \tilde{p}_{(k)}\leq\alpha if and only if the corresponding hypothesis is rejected by the algorithm above.

The following code computes these p-values. Julia (apparently) lacks a command that directly returns the rank of each p-value, so my loop below constructs these ranks, including the handling of ties (when some of the p-values are the same):

function Holm(p)
    K = length(p)
    sort_index = -ones(K)
    sorted_p = sort(p)
    sorted_p_adj = sorted_p.*[K+1-i for i=1:K]
    for j=1:K
        num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
        sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
    end
    sorted_holm_p = [minimum([maximum(sorted_p_adj[1:k]),1]) for k=1:K]
    holm_p = [sorted_holm_p[sort_index[k]] for k=1:K]
    return holm_p
end

This is straightforward except for sort_index, which I constructed such that, e.g., if the first element of sort_index is 3, then the first element of p is the third smallest. Unfortunately, it arbitrarily breaks ties in favor of the parameter that appears first in the MLE array, so the second entry of sort_index is 1 and the third entry is 2, even though the two corresponding p-values are equal.

Please email me or leave a comment if you know of a better way to handle ties.
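As an illustration of that tie-breaking (an example call of my own, not from the original post), two tied p-values keep their order of appearance and end up with the same adjusted value:

# The tied p-values of 0.01 receive ranks 1 and 2 in the order they appear,
# so both are adjusted to 3*0.01, while the largest p-value is left at 0.04.
Holm([0.04, 0.01, 0.01])   # returns, up to floating point, [0.04, 0.03, 0.03]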


Bootstrap Stepdown p-values

Given our work yesterday, it is relatively easy to replace bootstrap marginal p-values with bootstrap stepdown p-values. We begin with samples, as created yesterday. The following code creates tNullDistribution, which is the same as nullDistribution from yesterday except as t-statistics (i.e., divided by standard error).

function stepdown(MLE,bootSamples)
    K = length(MLE)
    tMLE = copy(MLE)                        # copy so the caller's estimates are not overwritten
    bootstrapSE = std(bootSamples,1)
    tNullDistribution = copy(bootSamples)   # copy so the caller's bootstrap draws are not overwritten
    p = -ones(K)
    for i=1:K
        tMLE[i] = abs(tMLE[i]/bootstrapSE[i])
        tNullDistribution[:,i] = abs((tNullDistribution[:,i]-mean(tNullDistribution[:,i]))/bootstrapSE[i])
        p[i] = mean(tMLE[i].<tNullDistribution[:,i])
    end
    sort_index = -ones(K)
    stepdown_p = -ones(K)
    sorted_p = sort(p)
    for j=1:K
        num_ties = length(sort_index[(p.==sorted_p[j]) & (sort_index.<0)])
        sort_index[(p.==sorted_p[j]) & (sort_index.<0)] = j:(j-1+num_ties)
    end
    for k=1:K
        # compare parameter k's t-statistic to the maximum null t-statistic over all
        # parameters whose single p-value is at least as large as parameter k's
        stepdown_p[k] = mean(maximum(tNullDistribution[:,sort_index.>=sort_index[k]],2).>tMLE[k])
    end
    return ["single_pvalues"=>p,"stepdown_pvalues"=>stepdown_p,"Holm_pvalues"=>Holm(p)]
end

The only difference between the single p-values and the stepdown p-values is the use of the maximum t-statistic in the comparison to the null distribution, where the maximum is taken only over the parameters whose single p-values are at least as large as that of the parameter under consideration. Notice that I used a dictionary in the return statement so that I could output the single, stepdown, and Holm p-values from the stepdown function.
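In symbols (my notation, not the original post's): writing t_k for the absolute t-statistic of parameter k, t^{*b}_{j} for the centered absolute bootstrap t-statistics, and B for the number of bootstrap samples, the loop computes (up to the tie-breaking described earlier)

\tilde{p}^{\,sd}_{k} = \frac{1}{B}\sum_{b=1}^{B}\mathbf{1}\left\{\max_{j:\,p_{j}\geq p_{k}} t^{*b}_{j} > t_{k}\right\},

where p_j are the single p-values.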


Results

To test out the above corrected p-values, we return to MLE and samples from yesterday. Suppose we want to perform the two-sided tests simultaneously for the null hypotheses \beta_0=0,\beta_1=0,\beta_2=0,\beta_3=0. The results from the above code are,

julia> stepdown(MLE[1:4],samples[:,1:4])
Dict{ASCIIString,Any} with 3 entries:
  "Holm_pvalues"     => {0.766,0.766,0.18,0.036}
  "stepdown_pvalues" => [0.486,0.649,0.169,0.034]
  "single_pvalues"   => [0.486,0.383,0.06,0.009]

We see that the p-value corresponding to the null hypothesis \beta_2=0 is inflated by both the bootstrap stepdown and Holm procedures, and is no longer significant at the 0.10 level.


Bradley J. Setzler