Author Archives: Gray Calhoun

Belated course announcement: Heike Hofmann’s Julia seminar

Something I probably should have mentioned two weeks ago: Heike Hofmann is teaching a 1 credit Julia Seminar in the Iowa State Statistics Department this semester. It meets Wednesdays from 12-1, and so far has gone through something close to the contents of Leah Hanson’s “Learn Julia in Y minutes.” You can see the schedule on the course’s GitHub page, https://github.com/heike/stat590f, and it should be interesting and fun.

Filed under: Blog Tagged: julia, programming, statistics

Cobbling together parallel random number generation in Julia

I’m starting to work on some computationally demanding projects (Monte Carlo simulations of bootstraps of out-of-sample forecast comparisons), so I thought I should look at Julia some more. Unfortunately, since Julia’s so young (it’s almost at version 0.3.0 as I write this), a lot of code still needs to be written. Like Random Number Generators (RNGs) that work in parallel. So this post describes an approach that parallelizes computation using a standard RNG; for convenience, I’ve put the code (a single function) in a grotesquely ambitiously named package on GitHub: ParallelRNGs.jl. (Also see this thread on the Julia Users mailing list.)

A few quick points about RNGs and simulations. Most econometrics papers have a section that examines the performance of a few estimators in a known environment (usually the estimators proposed by the paper and a few of the best preexisting estimators). We do this by simulating data on a computer, using that data to produce estimates, and then comparing those estimates to the parameters they’re estimating. Since we’ve generated the data ourselves, we actually know the true values of those parameters, so we can make a real comparison. Do that for 5000 simulated data sets and you can get a reasonably accurate view of how the statistics might perform in real life.

For many reasons, it’s useful to be able to reproduce the exact same simulations again in the future. (Two obvious reasons: it allows other researchers to be able to reproduce your results, and it can make debugging much faster when you discover errors.) So we almost always use pseudo Random Number Generators that use a deterministic algorithm to produce a stream of numbers that behaves in important ways like a stream of independent random values. You initialize these RNGs by setting a starting value (the “pseudo” aspect of the RNGs is implicit from now on) and anyone who has that starting value can reproduce the identical sequence of numbers that you generated. A popular RNG is the “Mersenne Twister,” and “popular” is probably an understatement: it’s the default RNG in R, Matlab, and Julia. And (from what I’ve read; this isn’t my field at all) it’s well regarded for producing a sequence of random numbers for statistical simulations.
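Resetting the seed really does reproduce the stream exactly. A minimal illustration, using `srand` as in the rest of this post (Julia 0.3 era; later versions renamed it `Random.seed!`):

```julia
srand(1234)   # set the seed
x = rand(5)   # draw five uniforms
srand(1234)   # reset to the same seed
y = rand(5)   # the identical five draws
x == y        # true
```

Anyone who knows the seed (1234 here, an arbitrary choice) can regenerate exactly the same sequence.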

But it’s not necessarily appropriate for producing several independent sequences of random numbers. Which is vitally important because I have an 8 core workstation that needs to run lots of simulations, and I’d like to execute 1/8th of the total simulations on each of its cores.

There’s a common misconception that you can get independent random sequences just by choosing different initial values for each sequence, but that’s not guaranteed to be true. There are algorithms for choosing different starting values that are guaranteed to produce independent streams for the Mersenne Twister (see this research by one of the MT’s inventors), but they aren’t implemented in Julia yet. (Or in R, as far as I can tell; they use a different RNG for parallel applications.) And it turns out that Mersenne Twister is the only RNG that’s included in Julia so far.

So, this would be a perfect opportunity for me to step up and implement some of these advanced algorithms for the Mersenne Twister. Or to implement some of the algorithms developed by L’Ecuyer and his coauthors, which are what R uses. And there’s already C code for both options.

But I haven’t done that yet. I’m lazy busy.

Instead, I’ve written an extremely small function that wraps Julia’s default RNG, calls it from the main process alone to generate random numbers, and then sends those random numbers to each of the other processes/cores where the rest of the simulation code runs. The function’s really simple:

function replicate(sim::Function, dgp::Function, n::Integer)
    function rvproducer()
        for i=1:n
            produce(dgp())
        end
    end
    return(pmap(sim, Task(rvproducer)))
end

That’s all. If you’re not used to Julia, you can ignore the “::Function” and the “::Integer” parts of the arguments. Those just identify the datatype of the argument and you can read it as “dgp_function” if you want (and explicitly providing the types like this is optional anyway). So, you give “replicate” two functions: “dgp” generates the random numbers and “sim” does the remaining calculations; “n” is the number of simulations to do. All of the work is done in “pmap” which parcels out the random numbers and sends them to different processors. (There’s a simplified version of the source code for pmap at that link.)

And that’s it. Each time a processor finishes one iteration, pmap calls dgp() again to generate more random numbers and passes them along. It automatically waits for dgp() to finish, so there are no race conditions and it produces the exact same sequence of random numbers every time. The code is shockingly concise. (It shocked me! I wrote it up assuming it would fail so I could understand pmap better and I was pretty surprised when it worked.)

A quick example might help clear up its usage. We’ll write a DGP for the bootstrap:

const n = 200     #% Number of observations for each simulation
const nboot = 299 #% Number of bootstrap replications
addprocs(7)       #% Start the other 7 cores
dgp() = (randn(n), rand(1:n, (n, nboot)))

The data are iid Normal, (the “randn(n)” component) and it’s an iid nonparametric bootstrap (the “rand(1:n, (n, nboot))”, which draws independent values from 1 to n and fills them into an n by nboot matrix). Oh, and there’s a good reason for those weird “#%” comments; “#” is Julia’s comment character, but WordPress doesn’t support syntax highlighting for Julia, so we’re pretending this is Matlab code. And “%” is Matlab’s comment character, which turns the comment green.
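A quick sanity check on the shapes of what `dgp` returns (restating the definition so the snippet stands alone; the values themselves depend on the RNG state):

```julia
n, nboot = 200, 299                        # same constants as above
dgp() = (randn(n), rand(1:n, (n, nboot)))  # same definition as above
data, indices = dgp()
size(data)     # (200,)     -- one simulated sample
size(indices)  # (200, 299) -- one column of resampling indices per bootstrap replication
```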

We’ll use a proxy for some complicated processing step:

@everywhere function sim(x)
    nboot = size(x[2], 2)
    bootvals = Array(Float64, nboot)
    for i=1:nboot
        bootvals[i] = mean(x[1][x[2][:,i]])
    end
    confint = quantile(bootvals, [0.05, 0.95])
    sleep(3) #% not usually recommended!
    return(confint[1] < 0 < confint[2])
end

So “sim” calculates the mean of each bootstrap sample and then takes the 5th and 95th percentiles of those simulated means, giving a two-sided 90% confidence interval for the true mean. Then it checks whether the interval contains the true mean (0). And it also wastes 3 seconds sleeping, which is a proxy for more complicated calculations but usually shouldn’t be in your code. The initial “@everywhere” is a Julia macro that loads this function into each of the separate processes so that it’s available for parallelization. (This is probably as good a place as any to link to Julia’s “Parallel Computing” documentation.)

Running a short Monte Carlo is simple:

julia> srand(84537423); #% Initialize the default RNG!!!
julia> @time mc1 = mean(replicate(sim, dgp, 500))

elapsed time: 217.705639 seconds (508892580 bytes allocated, 0.13% gc time)
0.896 #% = 448/500

So, about 3.6 minutes and the confidence intervals have coverage almost exactly 90%.

It’s also useful to compare the execution time to a purely sequential approach. We can do that by using a simple for loop:

function dosequential(nsims)
    boots = Array(Float64, nsims)
    for i=1:nsims
        boots[i] = sim(dgp())
    end
    return boots
end

And, to time it:

julia> dosequential(1); #% Force compilation before timing
julia> srand(84537423); #% Reinitialize the default RNG!!!
julia> @time mc2 = mean(dosequential(500))

elapsed time: 1502.038961 seconds (877739616 bytes allocated, 0.03% gc time)
0.896 #% = 448/500

This takes a lot longer: over 25 minutes, 7 times longer than the parallel approach (exactly what we’d hope for, since the parallel approach runs the simulations on 7 cores). And it gives exactly the same results since we started the RNG at the same initial value.

So this approach to parallelization is great… sometimes.

This approach should work pretty well when there aren’t that many random numbers being passed to each processor, and when there aren’t that many simulations being run; i.e. when “sim” is an inherently complex calculation. Otherwise, the overhead of passing the random numbers to each process can start to matter a lot. In extreme cases, “dosequential” can be faster than “replicate” because the overhead of managing the simulations and passing around random variables dominates the other calculations. In those applications, a real parallel RNG becomes a lot more important.
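One hypothetical way to reduce that overhead, sketched here in the same 0.3-era `produce`/`Task` style as `replicate` (the name `replicate_batched` and the batching scheme are my own invention, not part of the package):

```julia
# Sketch: send m simulations' worth of random numbers per message, to
# amortize the communication overhead. Assumes n is a multiple of m.
function replicate_batched(sim::Function, dgp::Function, n::Integer, m::Integer)
    function batchproducer()
        for i = 1:div(n, m)
            produce([dgp() for j = 1:m])  # one batch of m draws per message
        end
    end
    results = pmap(batch -> [sim(x) for x in batch], Task(batchproducer))
    return vcat(results...)  # flatten the batches back into one vector
end
```

The tradeoff is the usual one: larger batches mean fewer, bigger messages, but also coarser load balancing across the workers.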

If you want to play with this code yourself, I made a small package for the replicate function: ParallelRNGs.jl on GitHub. The name is misleadingly ambitious (ambitiously misleading?), but if I do add real parallel RNGs to Julia, I’ll put them there too. The code is still buggy, so use it at your own risk and let me know if you run into problems. (Filing an issue on GitHub is the best way to report bugs.)

P.S. I should mention again that Julia is an absolute joy of a language. Package development isn’t quite as nice as in Clojure, where it’s straightforward to load and unload variables from the package namespace (again, there’s lots of code that still needs to be written). But the actual language is just spectacular and I’d probably want to use it for simulations even if it were slow. Seriously: seven lines of new code to get an acceptable parallel RNG.

Filed under: Blog Tagged: econometrics, julia, programming

Slowly moving to the Julia language from R

Julia is a new computing language that’s gotten a lot of attention lately (e.g., this Wired piece) and that I’ve ignored until recently. But I checked it out a few days ago and, holy crap, it’s a nice language. I’m rewriting the code in my book to use Julia instead of R and I’m almost certainly going to use it instead of R in my PhD class next fall.

So, why Julia and why not R? (And, I suppose why not Python/Matlab/other languages?)

  • Multiple dispatch. So you can define a function
    function TrickyAlgorithm(aThing, bThing)

    differently to depend on whether aThing is a matrix of real numbers and bThing is a vector, or aThing is a vector and bThing is a matrix, or any other combination of data types. And you can do this without lots of tedious, potentially slow, and confusing (to people reading and maintaining the code) argument checks and conversion within the function.

    Note that this is kind of similar to Object Oriented Programming, but in OOP TrickyAlgorithm would need to be a method of aThing or bThing. Also note that this is present in R as well.

  • Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of, because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to appreciate them (I’ve started playing around with Clojure and like it a lot too…). And because stuff like this is amazing:
    @devec r = exp(-abs(x-y))

    which devectorizes x and y (both vectors) and evaluates as

    for i = 1:length(x)
        r[i] = exp(-abs(x[i]-y[i]))
    end
    

    (this example and code is from Dahua Lin’s blog post, Fast Numeric Computation in Julia). Note that “evaluates as” does not mean “gives the same answer as,” it means that the code r = exp(-abs(x-y)) is replaced with the loop by @devec and then the loop is what’s run.

  • Decent speed. Definitely faster than well written R; I don’t have a great feel for how well it compares to highly optimized R (using inline C++, for example), but I write one-off simulation programs and don’t write highly optimized R. And the language encourages loops, which is a relief. R discourages loops and encourages “vectorized” operations that operate on entire objects at once (which are then converted to fast loops in C…). But I use loops all the time anyway, because avoiding loops in time series applications is impossible. R’s poor support for recursion doesn’t help either. And, more to the point, I teach econometrics to graduate students. Many of them haven’t programmed before. Most of them are not going to write parts of their analysis in C++.
  • The syntax is fine and unthreatening, which will help for teaching. It basically looks like Matlab done right. Matlab’s not a bad language because its programs look like they’re built out of Legos, it’s a bad language because of its horrendous implementation of functions, anonymous functions, objects, etc. Compared to R, Matlab and Julia look downright friendly. Compared to Clojure… I can’t even imagine asking first year PhD students (some with no programming experience at all) to work with a Lisp.
  • The last point that’s always mentioned in these language comparisons. What about all of the R packages? There are thousands and thousands of statistical packages coded up for R, and you’re giving that up by moving to a different language. This is apparently a big concern for a lot of people, but… have you looked at the source code for these packages? Most of them are terrible! But some are good, and it might take some time to port them to Julia. Not that much time, I think, because most high-performance popular R packages are a thin layer of interoperability over a fast implementation in C or C++, so the port is just a matter of wrapping it up for Julia. And most of the well designed packages are tools for other package developers. That’s not quite true of R’s statistical graphics, though. They’re really great and could be hard to port. And that’s more or less the only thing that I’m sure that I’ll miss in Julia. (But hopefully not for too long.)
  • Lastly, and this is important: the same massive quantity of packages for R is a big constraint on its future development. Breaking backwards compatibility is a big deal but avoiding it too much imposes costs.
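To make the multiple dispatch point above concrete, here’s a toy sketch (TrickyAlgorithm is of course a made-up name):

```julia
# The method is chosen by the runtime types of *both* arguments,
# not just the first one as in single-dispatch OOP.
TrickyAlgorithm(a::Matrix, b::Vector) = a * b   # matrix-vector product
TrickyAlgorithm(a::Vector, b::Matrix) = b * a   # arguments in the other order
A = [1.0 0.0; 0.0 1.0]
TrickyAlgorithm(A, [1.0, 2.0])   # calls the first method
TrickyAlgorithm([1.0, 2.0], A)   # calls the second method
```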

Anyway, since I converted some R code to Julia I thought it would be fun to compare speeds. The first example is used to show the sampling distribution of an average of uniform(0,1) random variables. In R, we have

rstats <- function(rerror, nobs, nsims = 500) {
  replicate(nsims, mean(rerror(nobs)))}

which is (I think) pretty idiomatic R (and is vectorized, so it’s supposed to be fast). Calling it gives

R> system.time(rstats(runif, 500))

[out]:  user  system elapsed 
       0.341   0.002   0.377

For comparison to the Julia results, we’re going to care about the “elapsed” result of 0.377 seconds; the “system” column isn’t relevant here. Calling it for more observations and more simulations (50,000 of each) gives

R> system.time(rstats(runif, 50000, 50000))

[out]:   user  system elapsed 
      204.184   0.217 215.526

so 216 seconds overall. And, just to preempt criticism, I ran these simulations a few times each and these results are representative; and I ran a byte-compiled version that got (unexpectedly) slightly worse performance.

Equivalent Julia code is

function rmeans(dist, nobs; nsims = 500)
    means = Array(Float64,nsims)
    for i in 1:nsims
        means[i] = mean(rand(dist, nobs))
    end
    return means
end

which is pretty easy to read, but I have no idea if it’s idiomatic. This is my first code in Julia. If you like to minimize lines of code and preallocation of arrays, Julia has list comprehensions and you can write the stylish one line definition (that gave similar times)

rmeans_pretty(dist, nobs; nsims = 500) =
    [ mean(rand(dist, nobs)) for i = 1:nsims ]

We can time it (after loading the Distributions package):

julia> @elapsed rmeans(Uniform(), 500)

[out]: 0.093662961

so 0.09 seconds, or about a quarter the time as R. But (I forgot to mention earlier), Julia uses a Just In Time compiler, so the 0.09 seconds includes compilation and execution. Running it a second time gives

julia>  @elapsed rmeans(Uniform(), 500)

[out]: 0.004334132

which is half the time again. (Update on 3/17: as Jules pointed out in the comments, 0.004 is 1/20th of 0.09, so this is substantially faster than I’d initially thought. So we are getting into the ~100 times faster range. That’s actually a pretty exciting speed increase, but I’ll need to look into it some more. Well, that was embarrassing.)

Running the larger simulation, we have

julia> @elapsed rmeans(Uniform(), 50000, nsims = 50000)

[out]: 77.318591953

so the R code is a little less than three times slower here. (The compilation step doesn’t make a meaningful difference.) So, Julia isn’t hundreds of times faster, but it is noticeably faster than R, which is nice.

But speed in this sort of test isn’t the main factor. I’m really excited about multiple dispatch — it’s one of the few things in R that I really, really liked from a language standpoint. I really like what I’ve read about Julia’s support for parallelism (but need to learn more). And I like metaprogramming, even if I can’t do it myself yet. So Julia’s trying to be a fast, easy to learn, and elegantly designed language. That’s awesome. I want it to work.

ps: and it’s open source! Can’t forget that.

Filed under: Blog Tagged: julia, programming, R