Tag Archives: parallel

Interfacing with a Xeon Phi via Julia

Re-posted from: http://www.stochasticlifestyle.com/interfacing-xeon-phi-via-julia/

(Disclaimer: This is not a full-Julia solution for using the Phi, and instead is a tutorial on how to link OpenMP/C code for the Xeon Phi to Julia. There may be a future update where some of these functions are specified in Julia, and Intel’s compilertools.jl looks like a viable solution, but for now it’s not possible.)

Intel’s Xeon Phi has a lot of appeal. It’s an instant cluster in your computer, right? It turns out it’s not quite that easy. For one, the installation process itself is quite tricky, and the device has stringent requirements for motherboard choices. Also, making out at over a taraflop is good, but not quite as high as NVIDIA’s GPU acceleration cards.

However, there are a few big reasons why I think our interest in the Xeon Phi should be renewed. For one, Intel will be releasing its next version Knights Landing in Q3 which promises up to 8 teraflops and 16 GB of RAM. Intel has also been saying that this next platform will be much more user friendly and have improved bandwidth to allow for quicker offloading of data. Lastly, since the Xeon Phi uses X86 cores which one interfaces with via standard tools such as OpenMP and MPI, high performance parallel codes naturally transfer over to the Xeon Phi with little work (if you’ve already parallelized your code). For this reason many major HPCs such as Stampede and SuperMIC have been incorporating a Xeon Phi into every compute node. These details tell us that for high-performance computing using Xeon Phi’s to their full potential is the way forward. I am going to detail some of my advances in interfacing with the Xeon Phi via Julia.

First, let’s talk about automatic offloading

Automatic offloading allows you to offload all of your MKL-calls to the Xeon Phi automatically. This means that if you are doing lots of linear algebra on large matrices, standard operations from BLAS and Linpack like matrix multiplication * will automatically be done on the acceleration card. Details for setting up automatic offload are given by MATLAB. However, automatic offloading is a mixed blessing. First of all, there is no data persistence. If you are repeatedly using the same matrices, like in solving an evolution equation (i.e. parabolic PDE), this adds a large overhead since you’ll be sending that data back and forth every multiplication. Also, one major downside is that it does not apply to vectorized arithmetic such as .*. Sure you could hack it to be matrix multiplication by a sparse diagonal matrix, but these types of hacks really only tend to give you speedups when your vectors are large since you still incur the costs of transferring the arrays every time.

Still, it’s stupid easy to setup. You compile Julia with MKL and and setup a few environment variables and it will do it automatically. Thus you should give this a try first.

Native Execution

You can also compile code to natively execute on the Xeon Phi. However, you need to copy the files (and libraries) over the Phi via ssh and run the job from there. Thus while this is really good for C code, it’s not as easy to use when you wish to control the Phi from the computer itself as a “side job”.

Pragma-assisted Offloading

This is the route we are going to take. Pragmas are a type of syntax from OpenMP where one specifies segments of the code to be parallelized. If you’re familiar with using parallel constructs from MATLAB/Julia like parallel loops, OpenMP’s pragmas are pretty much the C version of that. For the Xeon Phi, there exists extra pragmas telling the Phi to offload. This also allows for data persistence. Lastly, for many C codes parallelized with OpenMP they are just one pragma away from working on the Phi.

Our workflow will be as follows. We will use a driver script from Julia which will set the environment and use Julia’s ccall to call the C-code with the OpenMP pragmas which will perform parallelized function calls. Notice that in this case Julia is just performing the role of glue code. The advantage is that we can prepare the data and plot the results from within Julia. The disadvantage is that we will have to write a lot of C-code. However, I am currently talking with Intel’s Developer Lab on using their CompilerTools.jl to compile Julia functions to be used on the Xeon Phi. When that’s available, I will write a tutorial on how to then replace the core functions from this script with the Julia functions. Then, the only C-code would be the code which starts the parallel loop. Let’s get started.

The Problem

We wish to solve some simple stochastic differential equations via the Euler-Maruyama method. We will specify the stochastic differential equation of the form

$dU_{t} = f(U,t)dt + g(U,t)dW_{t}$

via functions $f$ and $g$ . In our code we will also allow the ability to have a function for the true solution in order to perform error calculations.

The Julia Code

Again, at this point the Julia code is quite simple because it is simply performing the glue. Let $M$ be the number of simulations of we perform. Set up empty vectors for the values of $U$ and the true solution $Utrue$ at the endpoints. Note that we only will keep the endpoints from each simulation due to memory issues. If we were to keep the full array for thousands (or millions) of runs this would easily be more memory than the Phi could handle (or even a workstation!). We then need to specify our environmental variables. Set OMP_NUM_THREADS to be the number of compute cores on your system. We setup MIC_PREFIX, LD_IBRARY_PATH, and MIC_LD_LIBRARY_PATH so that we can dynamically link to our library. Note that this assumes that you have already sourced the compiler variables via compilervars.sh with the argument intel64. If not, you can use Julia’s run function to source the script. Lastly, we set a constant MIC_OMP_NUM_THREADS to be the number of threads the Xeon Phi will use. Since when offloading you can use all but 1 core (one manages the jobs) and each core has 4 threads, we set 240 threads (for the 5110p). Like in the case of GPUs, using more threads than cores is beneficial since the cores can utilize large vectors to do multiple calculations at once (SIMD) and other magic. We set the environment variable OFFLOAD_REPORT to 3 which will make the Phi give us details about everything it’s offloading (good for debugging). Lastly, we end by calling the library. The total code is as follows:

M = 240000
ENV["OMP_NUM_THREADS"]=12
Us = Vector{Float64}(M)
ts = Vector{Float64}(M)
Ws = Vector{Float64}(M)
Utrues = Vector{Float64}(M)
MIC_OMP_NUM_THREADS = 240
ENV["MIC_PREFIX"]="MIC"
ENV["OFFLOAD_REPORT"]=3
ENV["LD_LIBRARY_PATH"]=string(ENV["LD_LIBRARY_PATH"],":.")
ENV["MIC_LD_LIBRARY_PATH"]=string(ENV["MIC_LD_LIBRARY_PATH"],":.")
alg = 2
#ccall((:monte_carlo,"/home/crackauc/XeonPhiTests/EMtest/sde_solvers_noffload.so"),Void,(Cint,Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble}),M,Us,Utrues,ts)
@time ccall((:monte_carlo,"/home/crackauc/XeonPhiTests/EMtest/sde_solvers.so"),Void,(Cint,Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble},Cint,Cint),M,Us,Utrues,ts,Ws,MIC_OMP_NUM_THREADS,alg)

For more of an explanation on using the ccall function to interface with C-code, see my previous blog post. Note that the arrays Us, Utrues, ts, and Ws will be updated in place as the value of U, Utrue, t, and W at the end of the path. Thus after the job is done one can use Julia to plot the results.

Xeon Phi Driver Function

The ccall function looks for a function of the following type in a shared library named sde_solvers.so:

void monte_carlo(int M,double* Us,double* Utrues,double* ts,double* Ws,const int MIC_OMP_NUM_THREADS,int alg)

In this function we will just do a parallel for loop where each iteration calls the Euler-Maruyama solver on a different random seed. However, instead of doing a straight parallel for loop, we will put a little separation between “the parallel” and “the for” so that we can keep some persistent data to be a little more efficient.

We start by defining some constants:

double Uzero = .5;
  double dt = 0.00001;
  double T = 2.0;
  int N = ceil(T/dt)+1;

Now we send the job over to the Xeon Phi via the following pragma:

#pragma offload target(mic:MIC_DEV) default(none) in(Uzero,dt,T,N,MIC_OMP_NUM_THREADS,alg) out(Us:length(M)) 
  out(Utrues:length(M)) out(ts:length(M)) out(Ws:length(M))

Note that at the top of the script we have

#ifndef MIC_DEV
#define MIC_DEV 0
#endif
 
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <mathimf.h>
#include "mkl.h"
#include "mkl_vsl.h"

and so MIC_DEV singles out the Xeon Phi labeled 0. Using in we send over the variables, and with out we specify the variables we want the Phi to send back. By adding default(none) we get informed if there are any variables which weren’t specified.

After that pragma, we are on the MIC. The first thing we will do is set the number of threads. I don’t know why but setting the environment variable MIC_OMP_NUM_THREADS in Julia does not set the number of MIC threads, so instead we do it manually on via the command

omp_set_num_threads(MIC_OMP_NUM_THREADS);

Next we start our parallel environment by

#pragma omp parallel default(none) shared(Uzero,alg,dt,T,N,M,ts,Us,Utrues,Ws)

Once again, default(none) will make sure no variables are accidentally set to shared, and we specify all of the inputs as shared. With this, we are now coding with that list of variables on the individual threads of the Phi. Thus we will now will setup an individual run of the SDE solver. We make arrays for time $t$ , the Brownian path $W$ , the solution $U$ , and the true solution $Utrue$ . We also grab the id of the thread to setup random seeds later. This gives:

      int i;
      int tid = omp_get_thread_num();
      double* t; double* W; double* U; double* Utrue;
      int steps;
      t = (double*) malloc(N*sizeof(double));
      W = (double*) malloc(N*sizeof(double));
      U = (double*) malloc(N*sizeof(double));
      Utrue = (double*) malloc(N*sizeof(double));

Now we start our parallel for loop. Notice that by allocating these variables before the loop we have increased our efficiency since each run we will simply write over these values, saving us the time of re-allocating. In our for loop we set the initial values (since we are re-using the same arrays), call the solver algorithm, save the results at the end, and re-run. After we are done with the whole loop, then we free that arrays we made. The code is then as follows:

      #pragma omp for
      for(i=0;i<M;i++){
        t[0]=0;
        U[0]=Uzero;
        Utrue[0]=Uzero;
        W[0] = 0;
        euler_maruyama(&f,&g,&trueSol,Uzero,dt,T,t,&W,U,Utrue,tid*i*M+i); /*unique identifier tid*i*M+i since tid spacing */
        Us[i] = U[N-1];
        Utrues[i] = Utrue[N-1];
        ts[i] = t[N-1];
        Ws[i] = W[N-1];
      }
    free(t); free(Utrue); free(U); free(W); free(Z);

Notice that tid*i*M+i has spacings larger than $M$ and $tid$ and so each value will be unique. This is then the value we can use as a random seed. The full code for the driver function is then:

void monte_carlo(int M,double* Us,double* Utrues,double* ts,double* Ws,const int MIC_OMP_NUM_THREADS,int alg){
  double Uzero = .5;
  double dt = 0.00001;
  double T = 2.0;
  int N = ceil(T/dt)+1;
  #pragma offload target(mic:MIC_DEV) default(none) in(Uzero,dt,T,N,MIC_OMP_NUM_THREADS,alg) out(Us:length(M)) 
  out(Utrues:length(M)) out(ts:length(M)) out(Ws:length(M))
  {
    omp_set_num_threads(MIC_OMP_NUM_THREADS);
    #pragma omp parallel default(none) shared(Uzero,alg,dt,T,N,M,ts,Us,Utrues,Ws)
     {
      int i;
      int tid = omp_get_thread_num();
      double* t; double* W; double* U; double* Utrue;
      int steps;
      t = (double*) malloc(N*sizeof(double));
      W = (double*) malloc(N*sizeof(double));
      U = (double*) malloc(N*sizeof(double));
      Utrue = (double*) malloc(N*sizeof(double));
      #pragma omp for
      for(i=0;i<M;i++){
        t[0]=0;
        U[0]=Uzero;
        Utrue[0]=Uzero;
        W[0] = 0;
        euler_maruyama(&f,&g,&trueSol,Uzero,dt,T,t,&W,U,Utrue,tid*i*M+i); /*unique identifier tid*i*M+i since tid spacing */
        Us[i] = U[N-1];
        Utrues[i] = Utrue[N-1];
        ts[i] = t[N-1];
        Ws[i] = W[N-1];
      }
      free(t); free(Utrue); free(U); free(W); free(Z);
    }
  }
}

Notice I left out the extra algorithms. When I put this in a package (and in my soon to be submitted code for a publication) I have different choices for the solver, but here we will just have Euler-Maruyama.

The Inner Functions

Before we get to the solver, notice that euler_maruyama takes in three functions by handle. However, since these will be executed on the Xeon Phi we decorate them with __attribute__((target(mic))). However, I will leave off these declarations since we can instead have them be put on automatically by a compiler command (and this makes it easier to re-compile to be a Xeon Phi free code). Thus the SDE functions are simply

double f(double t,double x){
  return (1.0/20.0)*x;
}
 
double g(double t,double x){
  return (1.0/10.0)*x;
}
 
double trueSol(double t, double Uzero,double W){
  return Uzero*exp(((1.0/20.0)-((1.0/10.0)*(1.0/10.0))/2.0)*t + (1.0/10.0)*W);
}

Thus the SDE is

$dU_{t} = frac{1}{20} U_{t}dt + frac{1}{10} W_{t}$

which a mathematician would call Geometric Brownian Motion or what someone in finance would know of as the Black-Scholes equation. Our inner function euler_maruyama is then the standard loop for solving via Euler-Maruyama where we replace any instance of $dt$ with a small real number and we replace $dW_{t}$ with normal random variables with zero mean and variance $dt$ . The only tricky part is getting normal random variables, but I used Intel’s VSL library for generating these. The code for solving the Euler-Maruyama equations are then

void euler_maruyama(double (*f)(double,double),double (*g)(double,double),double (*trueSol)(double,double,double),double Uzero,double dt, double T,double* t,double** W,double* U,double* Utrue,int id){
  int N = ceil(T/dt)+1;
  *W = (double*) malloc(N*sizeof(double));
  VSLStreamStatePtr stream;
  vslNewStream(&stream,VSL_BRNG_MT19937,20+id);
  vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,stream,N,*W,0.0f,1.0f);
  (*W)[0] = 0.0f;
 
  int i;
  double dW;
  double sqdt = sqrt(dt);
  for(i=1;i<N;i++){
    /* dW = 0; */
    dW = sqdt* (*W)[i];
    t[i] = t[i-1] + dt;
    (*W)[i] = (*W)[i-1] + dW;
    U[i] = U[i-1] + dt*f(t[i-1],U[i-1]) + g(t[i-1],U[i-1])*dW;
    Utrue[i] = trueSol(t[i],Uzero,(*W)[i]);
  }
  vslDeleteStream(&stream);
}

Notice that this part is nothing special and quite close to what you’d write in $C$ . However, we do note that since we want the value of $W$ at the end of the run outside of this function, and we allocate $W$ within the function, we have to pass $W$ by reference via &W and thus every time it is used we have to deference it via *W. Other than that there’s nothing fancy here.

Compilation

This is always the hardest part. However, notice that if we just take away the offload pragma this is perfectly good OpenMP code! You can do this from the compiler to first check your code. The compilation command is as follows:

icc -mkl -O3 -openmp -fpic -diag-disable 10397 -no-offload -Wno-unknown-pragmas -std=c99 -qopt-report -qopt-report-phase=vec -shared sde_solvers.c -o sde_solvers.so

Most of it is setting up offload reports and libraries, but the important part to notice is that -no-offload is the part that turns off the offload pragma. Give this a try and it should parallelize on the CPU. Now, to compile for the Phi, we use the command

icc -mkl -O3 -openmp -fpic -diag-disable 10397 -qoffload -Wno-unknown-pragmas -std=c99 -qopt-report -qopt-report-phase=vec -shared sde_solvers.c -offload-attribute-target=mic -o sde_solvers.so

Notice that the command -offload-attribute-target=mic is required if you do not put __attribute__((target(mic))) in front of each function that is called when offloaded. I prefer to not put the extra tags because icc required that I delete them to re-compile for the CPU. In this case, we simply get rid of that compiler directive and change to -no-offload and we have working CPU code. Thus you can see how to transfer back and forth between the two via compilation.

After doing this you should be able to call the code from Julia, have it solve the code on the Phi, and then return the result to Julia.

Future Steps

Notice that the functions $f$ , $g$ , and $trueSol$ are simple functions which we pass by pointer into the solver. Julia already has ways to pass function pointers which I go over in my previous tutorial, though since they are not compiled with the __attribute__((target(mic))) flag they will not work on the Phi. Hopefully Intel’s compilertools.jl will support this in the near future. When that’s the case, these functions could be specified from within Julia to allow us to create libraries where we can use Julia-specified functions as the input.

However, this gives a nice template for performing any kind of Monte Carlo simulation or anything else that uses a parallel for loop. This wrapper will form the basis of a library I am creating for stochastic (partial) differential equations. More on that later. In the meantime, have fun experimenting with the Phi!

The post Interfacing with a Xeon Phi via Julia appeared first on Stochastic Lifestyle.

An Introduction to Structural Econometrics in Julia

By: Bradley Setzler

Re-posted from: http://juliaeconomics.com/2016/02/09/an-introduction-to-structural-econometrics-in-julia/

This tutorial is adapted from my Julia introductory lecture taught in the graduate course Practical Computing for Economists, Department of Economics, University of Chicago.

The tutorial is in 5 parts:

Installing Julia + Juno IDE, as well as useful packages
Defining a structural econometric challenge
Data generation, management, and regression visualization
Numerical simulation of optimal agent behavior under constraints
Parallelized estimation by the Method of Simulated Moments

1. Installing Julia + Juno IDE, as well as useful packages

Perhaps the greatest obstacle to using Julia in the past has been the absence of an easy-to-install IDE. There used to be an IDE called Julia Studio which was as easy to use as the popular RStudio for R. Back then, you could install and run Julia + Julia Studio in 5mins, compared to the hours it could take to install Python and its basic packages and IDE. When Julia version 0.3.X was released, Julia Studio no longer worked, and I recommended the IJulia Notebook, which requires the installation of Python and IPython just to use Julia, so any argument that Julia is more convenient to install than Python was lost.

Now, with Julia version 0.4.X, Juno has provided an excellent IDE that comes pre-bundled with Julia for convenience, and you can install Julia + Juno IDE in 5mins. Here are some instructions to help you through the installation process:

Go to http://julialang.org/downloads/ and look for “Julia + Juno IDE bundles“. Click to download the bundle for your system (Windows, Mac, or Linux).
After the brief download (the entire Julia language + Juno IDE is less than 1GB), open the file and click through the installation instructions.
Open Juno, and try to run a very simple line of code. For example, type 2+2, highlight this text, right-click, and choose the option Evaluate. A bubble should display 4 next to the line of code.
- Trouble-shooting: On my Mac running OS X Mavericks, 2+2 failed and an unhelpful error was produced. After some searching, I found that the solution was to install the Jewel package. To install Jewel from within Juno, just type Pkg.add(“Jewel”), highlight this text, and Evaluate. After this, 2+2 was successful.
You have successfully installed Julia + Juno IDE, which includes a number of important packages already, such as DataFrames, Gadfly, and Optim. Now, you will want to run the following codes to install some other packages used in the econometric exercises below:

Pkg.update()
Pkg.add("Ipopt")
Pkg.build("Ipopt")
Pkg.add("JuMP")
Pkg.build("JuMP")
Pkg.add("GLM")
Pkg.add("KernelEstimator")

2. Defining a structural econometric challenge

To motivate our application, we consider a very simple economic model, which I have taught previously in the mathematical economics course for undergraduates at the University of Chicago. Although the model is analytically simple, the econometrics become sufficiently complicated to warrant the Method of Simulated Moments, so this serves us well as a teachable case.

Let $c_i \geq 0$ denote consumption and $0 \leq l_i \leq 1$ denote leisure. Consider an agent who wishes to maximize Cobb-Douglas utility over consumption and leisure, that is,

$U(c_i,l_i) =$ $c^\gamma_i l^{1-\gamma}_i$ .

where $0 \leq \gamma \leq 1$ is the relative preference for consumption. The budget constraint is given by,

$c_i \leq (1-\tau)w_i(1-l_i)+\epsilon_i$ ,

where $w_i$ is the wage observed in the data, $\epsilon_i$ is other income that is not observed in the data, and $\tau$ is the tax rate.

The agent’s problem is to maximize $U(c_i,l_i)$ subject to the budget constraint. We assume that non-labor income is uncorrelated with the wage offer, so that $\mathbb{E}[\epsilon_i | w_i]=0$ . Although this assumption is a bit unrealistic, as we expect high-wage agents to also tend to have higher non-labor income, it helps keep the example simple. The model is also a bit contrived in that we treat the tax rate as unobservable, but this only makes our job more difficult.

The goal of the econometrician is to identify the model parameters $\gamma$ and $\tau$ from the data $(c_i,l_i,w_i)^N_{i=1}$ and the assumed structure. In particular, the econometrician is interested in the policy-relevant parameter $\bar\psi \equiv \frac{1}{N}\sum^N_{i=1}\psi(w_i)$ , where,

$\psi(w_i) \equiv \mathbb{E}_{\epsilon} \frac{\partial}{\partial \tau} C(w_i,\epsilon; \gamma, \tau)$ ,

and $C(\cdot)$ denotes the demand for consumption. $\psi(w_i)$ is the marginal propensity for an agent with wage $w_i$ to consume in response to the tax rate. $\bar{\psi}$ is the population average marginal propensity to consume in response to the tax rate. Of course, we can solve the model analytically to find that $\psi(w_i) = -\gamma w_i$ and $\bar\psi = -\gamma \bar{w}$ , where $\bar{w}$ is the average wage, but we will show that the numerical methods achieve the correct answer even when we cannot solve the model.

3. Data generation, management, and regression visualization

The replication code for this section is available here.

To generate data that follows the above model, we first solve analytically for the demand functions for consumption and leisure. In particular, they are,

$C(w_i,\epsilon_i; \gamma, \tau) = \gamma (1-\tau) w_i + \gamma \epsilon_i$

$L(w_i,\epsilon_i; \gamma, \tau) = (1-\gamma) + \frac{(1-\gamma) \epsilon_i}{ (1-\tau) w_i}$

Thus, we need only draw values of $w_i$ and $\epsilon_i$ , as well as choose parameter values for $\gamma$ and $\tau$ , in order to generate the values of $c_i$ and $l_i$ that agents in this model would choose. We implement this in Julia as follows:

####### Set Simulation Parameters #########
srand(123)           # set the seed to ensure reproducibility
N = 1000             # set number of agents in economy
gamma = .5           # set Cobb-Douglas relative preference for consumption
tau = .2             # set tax rate

####### Draw Income Data and Optimal Consumption and Leisure #########
epsilon = randn(N)                                               # draw unobserved non-labor income
wage = 10+randn(N)                                               # draw observed wage
consump = gamma*(1-tau)*wage + gamma*epsilon                     # Cobb-Douglas demand for c
leisure = (1.0-gamma) + ((1.0-gamma)*epsilon)./((1.0-tau)*wage)  # Cobb-Douglas demand for l

This code is relatively self-explanatory. Our parameter choices are $N=1000$ , $\epsilon_i \sim \mathcal{N}(0,1)$ , $\gamma=1/2$ , and $\tau=1/5$ . We draw the wage to have distribution $w_i \sim \mathcal{N}(10,1)$ , but this is arbitrary.

We combine the variables into a DataFrame, and export the data as a CSV file. In order to better understand the data, we also non-parametrically regress $c_i$ on $w_i$ , and plot the result with Gadfly. The Julia code is as follows:

####### Organize, Describe, and Export Data #########
using DataFrames
using Gadfly
df = DataFrame(consump=consump,leisure=leisure,wage=wage,epsilon=epsilon)  # create data frame
plot_c = plot(df,x=:wage,y=:consump,Geom.smooth(method=:loess))            # plot E[consump|wage] using Gadfly
draw(SVG("plot_c.svg", 4inch, 4inch), plot_c)                              # export plot as SVG
writetable("consump_leisure.csv",df)                                       # export data as CSV

Again, the code is self-explanatory. The regression graph produced by the plot function is:

plot_c

4. Numerical simulation of optimal agent behavior under constraints

The replication code for this section is available here.

We now use constrained numerical optimization to generate optimal consumption and leisure data without analytically solving for the demand function. We begin by importing the data and the necessary packages:

####### Prepare for Numerical Optimization #########

using DataFrames
using JuMP
using Ipopt
df = readtable("consump_leisure.csv")
N = size(df)[1]

Using the JuMP syntax for non-linear modeling, first we define an empty model associated with the Ipopt solver, and then add $N$ values of $c_i$ and $N$ values of $l_i$ to the model:

m = Model(solver=IpoptSolver())    # define empty model solved by Ipopt algorithm
@defVar(m, c[i=1:N] >= 0)       # define positive consumption for each agent
@defVar(m, 0 <= l[i=1:N] <= 1)  # define leisure in [0,1] for each agent

This syntax is especially convenient, as it allows us to define vectors of parameters, each satisfying the natural inequality constraints. Next, we define the budget constraint, which also follows this convenient syntax:

 @addConstraint(m, c[i=1:N] .== (1.0-t)*(1.0-l[i]).*w[i] + e[i] )        # each agent must satisfy the budget constraint

Finally, we define a scalar-valued objective function, which is the sum of each individual’s utility:

 @setNLObjective(m, Max, sum{ g*log(c[i]) + (1-g)*log(l[i]) , i=1:N } )  # maximize the sum of utility across all agents

Notice that we can optimize one objective function instead of optimizing $N$ objective functions because the individual constrained maximization problems are independent across individuals, so the maximum of the sum is the sum of the maxima. Finally, we can apply the solver to this model and extract optimal consumption and leisure as follows:

 status = solve(m)                                                       # run numerical optimization c_opt = getValue(c)                                                     # extract demand for c l_opt = getValue(l)                                                     # extract demand for l

To make sure it worked, we compare the consumption extracted from this numerical approach to the consumption we generated previously using the true demand functions:

cor(c_opt,array(df[:consump]))
0.9999999998435865

Thus, we consumption values produced by the numerically optimizer’s approximation to the demand for consumption are almost identical to those produced by the true demand for consumption. Putting it all together, we create a function that can solve for optimal consumption and leisure given any particular values of $\gamma$ , $\tau$ , and $\epsilon$ :

 function hh_constrained_opt(g,t,w,e)      m = Model(solver=IpoptSolver())                                         # define empty model solved by Ipopt algorithm
  @defVar(m, c[i=1:N] >= 0)                                               # define positive consumption for each agent
  @defVar(m, 0 <= l[i=1:N] <= 1)                                          # define leisure in [0,1] for each agent
  @addConstraint(m, c[i=1:N] .== (1.0-t)*(1.0-l[i]).*w[i] + e[i] )        # each agent must satisfy the budget constraint
  @setNLObjective(m, Max, sum{ g*log(c[i]) + (1-g)*log(l[i]) , i=1:N } )  # maximize the sum of utility across all agents
  status = solve(m)                                                       # run numerical optimization
  c_opt = getValue(c)                                                     # extract demand for c
  l_opt = getValue(l)                                                     # extract demand for l
  demand = DataFrame(c_opt=c_opt,l_opt=l_opt)                             # return demand as DataFrame
end

hh_constrained_opt(gamma,tau,array(df[:wage]),array(df[:epsilon]))          # verify that it works at the true values of gamma, tau, and epsilon

5. Parallelized estimation by the Method of Simulated Moments

The replication codes for this section are available here.

We saw in the previous section that, for a given set of model parameters $\gamma$ and $\tau$ and a given draw of $\epsilon_{i}$ for each $i$ , we have enough information to simulation $c_{i}$ and $l_{i}$ , for each $i$ . Denote these simulated values by $\hat{c}_{i}\left(\epsilon;\gamma,\tau\right)$ and $\hat{l}_{i}\left(\epsilon;\gamma,\tau\right)$ . With these, we can define the moments,

$\hat{m}\left(\gamma,\tau\right)=\mathbb{E}_{\epsilon}\left[\begin{array}{c} \frac{1}{N}\sum_{i}\left[\hat{c}_{i}\left(\epsilon\right)-c_{i}\right]\\ \frac{1}{N}\sum_{i}\left[\hat{l}_{i}\left(\epsilon\right)-l_{i}\right] \end{array}\right]$

which is equal to zero under the model assumptions. A method of simulated moments (MSM) approach to estimate $\gamma$ and $\tau$ is then,

$\left(\hat{\gamma},\hat{\tau}\right)=\arg\min_{\gamma\in\left[0,1\right],\tau\in\left[0,1\right]}\hat{m}\left(\gamma,\tau\right)'W\hat{m}\left(\gamma,\tau\right)$

where $W$ is a $2\times2$ weighting matrix, which is only relevant when the number of moments is greater than the number of parameters, which is not true in our case, so $W$ can be ignored and the method of simulated moments simplifies to,

$\left(\hat{\gamma},\hat{\tau}\right)=\arg\min_{\gamma\in\left[0,1\right],\tau\in\left[0,1\right]}\left\{ \mathbb{E}_{\epsilon}\left[\frac{1}{N}\sum_{i}\left[\hat{c}_{i}\left(\epsilon\right)-c_{i}\right]\right]\right\} ^{2}+\left\{ \mathbb{E}_{\epsilon}\left[\frac{1}{N}\sum_{i}\left[\hat{l}_{i}\left(\epsilon\right)-l_{i}\right]\right]\right\} ^{2}$

Assuming we know the distribution of $\epsilon_i$ , we can simply draw many values of $\epsilon_i$ for each $i$ , and average the moments together across all of the draws of $\epsilon_i$ . This is Monte Carlo numerical integration. In Julia, we can create this objective function with a random draw of $\epsilon$ as follows:

function sim_moments(params)
  this_epsilon = randn(N)                                                     # draw random epsilon
  ggamma,ttau = params                                                        # extract gamma and tau from vector
  this_demand = hh_constrained_opt(ggamma,ttau,array(df[:wage]),this_epsilon) # obtain demand for c and l
  c_moment = mean( this_demand[:c_opt] ) - mean( df[:consump] )               # compute empirical moment for c
  l_moment = mean( this_demand[:l_opt] ) - mean( df[:leisure] )               # compute empirical moment for l
  [c_moment,l_moment]                                                         # return vector of moments
end

In order to estimate $\hat{m}\left(\gamma,\tau\right)$ , we need to run sim_moments(params) many times and take the unweighted average across them to achieve the expectation across $\epsilon_i$ . Because each calculation is computer-intensive, it makes sense to compute the contribution of $\hat{m}\left(\gamma,\tau\right)$ for each draw of $\epsilon_i$ on a different processor and then average across them.

Previously, I presented a convenient approach for parallelization in Julia. The idea is to initialize processors with the addprocs() function in an “outer” script, then import all of the needed data and functions to all of the different processors with the require() function applied to an “inner” script, where the needed data and functions are already managed by the inner script. This is incredibly easy and much simpler than the manual spawn-and-fetch approaches suggested by Julia’s official documentation.

In order to implement the parallelized method of simulated moments, the function hh_constrained_opt() and sim_moments() are stored in a file called est_msm_inner.jl. The following code defines the parallelized MSM and then minimizes the MSM objective using the optimize command set to use the Nelder-Mead algorithm from the Optim package:

####### Prepare for Parallelization #########

addprocs(3)                   # Adds 3 processors in parallel (the first is added by default)
print(nprocs())               # Now there are 4 active processors
require("est_msm_inner.jl")   # This distributes functions and data to all active processors

####### Define Sum of Squared Residuals in Parallel #########

function parallel_moments(params)
  params = exp(params)./(1.0+exp(params))   # rescale parameters to be in [0,1]
  results = @parallel (hcat) for i=1:numReps
    sim_moments(params)
  end
  avg_c_moment = mean(results[1,:])
  avg_l_moment = mean(results[2,:])
  SSR = avg_c_moment^2 + avg_l_moment^2
end

####### Minimize Sum of Squared Residuals in Parallel #########

using Optim
function MSM()
  out = optimize(parallel_moments,[0.,0.],method=:nelder_mead,ftol=1e-8)
  println(out)                                       # verify convergence
  exp(out.minimum)./(1.0+exp(out.minimum))           # return results in rescaled units
end

Parallelization is performed by the @parallel macro, and the results are horizontally concatenated from the various processors by the hcat command. The key tuning parameter here is numReps, which is the number of draws of $\epsilon$ to use in the Monte Carlo numerical integration. Because this example is so simple, a small number of repetitions is sufficient, while a larger number would be needed if $\epsilon$ entered the model in a more complicated manner. The process is run as follows and requires 268 seconds to run on my Macbook Air:

numReps = 12                                         # set number of times to simulate epsilon
gamma_MSM, tau_MSM = MSM()                           # Perform MSM
gamma_MSM
0.49994494921381816
tau_MSM
0.19992279518894465

Finally, given the MSM estimates of $\gamma$ and $\tau$ , we define the numerical derivative, $\frac{df(x)}{dx} \approx \frac{f(x+h)-f(x-h)}{2h}$ , for some small $h$ , as follows:

function Dconsump_Dtau(g,t,h)
  opt_plus_h = hh_constrained_opt(g,t+h,array(df[:wage]),array(df[:epsilon]))
  opt_minus_h = hh_constrained_opt(g,t-h,array(df[:wage]),array(df[:epsilon]))
  (mean(opt_plus_h[:c_opt]) - mean(opt_minus_h[:c_opt]))/(2*h)
end

barpsi_MSM = Dconsump_Dtau(gamma_MSM,tau_MSM,.1)
-5.016610457903023

Thus, we estimate the policy parameter $\bar\psi$ to be approximately $-5.017$ on average, while the true value is $\bar\psi = -\gamma \bar{w} = -(1/2)\times 10=-5$ , so the econometrician’s problem is successfully solved.

Bradley Setzler

#MonthOfJulia Day 12: Parallel Processing

As opposed to many other languages, where parallel computing is bolted on as an afterthought, Julia was designed from the start with parallel computing in mind. It has a number of native features which lend themselves to efficient implementation of parallel algorithms. It also has packages which facilitate cluster computing (using MPI, for example). We won’t be looking at those, but focusing instead on coroutines, generic parallel processing and parallel loops.

Coroutines

Coroutines are not strictly parallel processing (in the sense of “many tasks running at the same time”) but they provide a lightweight mechanism for having multiple tasks defined (if not active) at once. According to Donald Knuth, coroutines are generalised subroutines (with which we are probably all familiar).

Under these conditions each module may be made into a coroutine; that is, it may be coded as an autonomous program which communicates with adjacent modules as if they were input or output subroutines. Thus, coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program. There is no bound placed by this definition on the number of inputs and outputs a coroutine may have.
Conway, Design of a Separable Transition-Diagram Compiler, 1963.

Coroutines are implemented using produce() and consume(). In a moment you’ll see why those names are appropriate. To illustrate we’ll define a function which generates elements from the Lucas sequence. For reference, the first few terms in the sequence are 2, 1, 3, 4, 7, … If you know about Python’s generators then you’ll find the code below rather familiar.

julia> function lucas_producer(n)
       	a, b = (2, 1)
       	for i = 1:n
       		produce(a)
       		a, b = (b, a + b)
       	end
       end
lucas_producer (generic function with 1 method)

This function is then wrapped in a Task, which has state :runnable.

julia> lucas_task = Task(() -> lucas_producer(10))
Task (runnable) @0x0000000005b5ee60
julia> lucas_task.state
:runnable

Now we’re ready to start consuming data from the Task. Data elements can be retrieved individually or via a loop (in which case the Task acts like an iterable object and no consume() is required).

julia> consume(lucas_task)
2
julia> consume(lucas_task)
1
julia> consume(lucas_task)
3
julia> for n in lucas_task
       	println(n)
       end
4
7
11
18
29
47
76

Between invocations the Task is effectively asleep. The task temporarily springs to life every time data is requested, before becoming dormant once more.

It’s possible to simultaneously set up an arbitrary number of coroutine tasks.

Parallel Processing

Coroutines don’t really feel like “parallel” processing because they are not working simultaneously. However it’s rather straightforward to get Julia to metaphorically juggle many balls at once. The first thing that you’ll need to do is launch the interpreter with multiple worker processes.

$ julia -p 4

There’s always one more process than specified on the command line (we specified the number of worker processes; add one for the master process).

julia> nprocs()
5
julia> workers()                           # Identifiers for the worker processes.
4-element Array{Int64,1}:
 2
 3
 4
 5

We can launch a job on one of the workers using remotecall().

julia> W1 = workers()[1];
julia> P1 = remotecall(W1, x -> factorial(x), 20)
RemoteRef(2,1,6)
julia> fetch(P1)
2432902008176640000

@spawn and @spawnat are macros which launch jobs on individual workers. The @everywhere macro executes code across all processes (including the master).

julia> @everywhere p = 5
julia> @everywhere println(@sprintf("ID %d: %f %d", myid(), rand(), p))
ID 1: 0.686332 5
	From worker 4:	ID 4: 0.107924 5
	From worker 5:	ID 5: 0.136019 5
	From worker 2:	ID 2: 0.145561 5
	From worker 3:	ID 3: 0.670885 5

Parallel Loop and Map

To illustrate how easy it is to set up parallel loops, let’s first consider a simple serial implementation of a Monte Carlo technique to estimate π.

julia> function findpi(n)
       	inside = 0
       	for i = 1:n
       		x, y = rand(2)
       		if (x^2 + y^2 <= 1)
       			inside +=1
       		end
       	end
       	4 * inside / n
       end
findpi (generic function with 1 method)

The quality of the result as well as the execution time (and memory consumption!) depend directly on the number of samples.

julia> @time findpi(10000)
elapsed time: 0.051982841 seconds (1690648 bytes allocated, 81.54% gc time)
3.14
julia> @time findpi(100000000)
elapsed time: 9.533291187 seconds (8800000096 bytes allocated, 42.97% gc time)
3.1416662
julia> @time findpi(1000000000)
elapsed time: 95.436185105 seconds (88000002112 bytes allocated, 43.14% gc time)
3.141605352

The parallel version is implemented using the @parallel macro, which takes a reduction operator (in this case +) as its first argument.

julia> function parallel_findpi(n)
       	inside =  @parallel (+) for i = 1:n
       		x, y = rand(2)
       		x^2 + y^2 <= 1 ? 1 : 0
       	end
       	4 * inside / n
       end
parallel_findpi (generic function with 1 method)

There is some significant overhead associated with setting up the parallel jobs, so that the parallel version actually performs worse for a small number of samples. But when you run sufficient samples the speedup becomes readily apparent.

julia> @time parallel_findpi(10000)
elapsed time: 0.45212316 seconds (9731736 bytes allocated)
3.1724
julia> @time parallel_findpi(100000000)
elapsed time: 3.870065625 seconds (154696 bytes allocated)
3.14154744
julia> @time parallel_findpi(1000000000)
elapsed time: 39.029650365 seconds (151080 bytes allocated)
3.141653704

For reference, these results were achieved with 4 worker processes on a DELL laptop with the following CPU:

root@propane: #lshw | grep product | head -n 1
          product: Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz

More information on parallel computing facilities in Julia can be found in the documentation. As usual the code for today’s Julia journey can be found on github.

The post #MonthOfJulia Day 12: Parallel Processing appeared first on Exegetic Analytics.

juliabloggers.com

A Julia Language Blog Aggregator