Tag Archives: Programming

Julia with MKL on OSX

Introduction

One of the great things about Julia for those in scientific computing is the ease of accessing highly optimized libraries. For matrix operations, Julia ships with OpenBLAS, an open source implementation of BLAS, the Basic Linear Algebra Subprograms.

For the majority of people, that’s wonderful. OpenBLAS is quite fast and optimized.

But when you want to diagonalize matrices as large as the ones I work with, there’s something better: Intel’s Math Kernel Library, MKL.

Since Intel designed the chips and the hardware drivers for just about everyone, it can tune its implementation of BLAS to the specifics of the hardware and get a speed boost. More to my purposes, MKL also doesn’t start aborting on larger matrices, which OpenBLAS did even though I had plenty of RAM left. The downside: the boost comes from trade secrets, so the software is proprietary and behind closed doors. That means moral objections for some and monetary objections for others.

If you want to get MKL for yourself, you have two possible routes:

A free community license through Intel Aviary. I got this for my workstation.
Convince your company/university/institute to get the fully supported and expensive version. For example, my institute’s cluster has all of Intel’s tools.

Do you need this?

Before you start trying to implement this on your system, take a second and decide whether or not it is worth your while. What kind of systems are you trying to diagonalize? Are you going to be diagonalizing systems at all? Or multiplying large matrices… that would count too…

I generated matrices through A=randn(n,n); and then diagonalized them through @time eigfact(A);.

All of these specs are for my Mac Pro, Late 2013 model, running OSX El Capitan. Processor: 3.7 GHz Quad-Core Intel Xeon E5. Memory: 64 GB 1866 MHz DDR3 ECC. I would be interested in seeing data for other processors.

Time Scaling

Time scaling for MKL and OpenBLAS. Performed for a matrix with randomly generated values according to a normal distribution of unit standard deviation.

Factor Scaling

The ratio between OpenBLAS and MKL. While comparable at small system sizes, MKL shows a significant improvement for larger matrices.

Memory Scaling

Both MKL and OpenBLAS showed the same memory usage for a given calculation. The scaling appears quadratic, except for a deviation at small system sizes.

What you need to do

For Intel

So in my .zshrc, or .bashrc for those who haven’t discovered the wonders of zsh, I now have

export TBBROOT=fdsljkfds
source /opt/intel/mkl/bin/mklvars.sh intel64 ilp64

The value for TBBROOT is arbitrary gobbledygook; all that matters is that it is non-empty.
Once you have added that, either restart your terminal, or type

	source ~/.bashrc

to refresh your terminal.

Now, check that these environment variables are set up correctly:

MKLROOT
- /opt/intel//compilers_and_libraries_2016.2.146/mac/mkl
DYLD_LIBRARY_PATH
- /opt/intel//compilers_and_libraries_2016.2.146/mac/compiler/lib:/opt/intel//compilers_and_libraries_2016.2.146/mac/mkl/lib
LIBRARY_PATH
- /opt/intel//compilers_and_libraries_2016.2.146/mac/compiler/lib:/opt/intel//compilers_and_libraries_2016.2.146/mac/mkl/lib
NLSPATH
- /opt/intel//compilers_and_libraries_2016.2.146/mac/mkl/lib/locale/%l_%t/%N
MANPATH
- /opt/intel//compilers_and_libraries_2016.2.146/mac/man/en_US
CPATH
- /opt/intel//compilers_and_libraries_2016.2.146/mac/mkl/include

by typing

echo $NAME_OF_VARIABLE

Don’t just check your variables against mine! Find your installation on the system, and check out where the folders are.

For the Julia Installation

In the Julia source folder, edit Make.inc in this specific place

## Settings for various Intel tools
# Set to 1 to use MKL
USE_INTEL_MKL = 1
# Set to 1 to use MKL FFT
USE_INTEL_MKL_FFT = 1
# Set to 1 to use Intel LIBM
USE_INTEL_LIBM ?= 0
# Set to 1 to enable profiling with Intel VTune Amplifier
USE_INTEL_JITEVENTS ?= 0
# Set to 1 to use Intel C, C++, and FORTRAN compilers
USEICC  ?= 0
USEIFC  ?= 0

Now in the Julia folder try

make
make install

How I eventually figured this out

I was getting complaints when running make on Julia that

-L/opt/intel//compilers_and_libraries_2016.2.146/mac/tbb/lib

wasn’t found. There was good reason it wasn’t found. It didn’t exist. TBB stands for Threading Building Blocks, another one of Intel’s programs, but this one is meant for multicore C++ programs. Sounds fairly useful, but off topic to what I need right now.

So I wanted to figure out why it was trying to link to that directory. Looking in /opt/intel/mkl/bin/mklvars.sh, the program that sets environment variables for MKL, I discovered:

                if [ -z "${TBBROOT}" ]; then
                    mkl_ld_arch="${CPRO_PATH}/tbb/lib:${mkl_ld_arch}"
                fi

When the variable TBBROOT is empty or unset (that is what the -z test checks), the script adds this tbb folder to the path. Since I can’t change that file, proprietary stuff, my workaround is to give TBBROOT a non-empty value. Then DYLD_LIBRARY_PATH, which gets linked in the Julia make process, only contains good locations.

Also, you can’t just run source ... on the command line once and have the variables set for all eternity. When I restarted my terminal the next day, the variables had cleared. So I figured out the lines need to be put in ~/.bashrc (or ~/.zshrc) instead of just run once.

Conclusions

I still got warnings when making Julia, but obviously none that broke the installation. Hopefully this work-around holds me over till this project is done, and hopefully it helps someone else too 🙂

Benchmarks of Multidimensional Stack Implementations in Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/233-2/

DataStructures.jl claims it’s fast. How does it do? I wrote some quick code to check it out. What I wanted to find out is which approach does best for implementing a stack where each element is three integers. I tried filling a pre-allocated array, pushing into three separate vectors, and different implementations of the stack from the DataStructures.jl package.

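using Benchmark
using DataStructures

# Pre-allocated 1000000x3 array, filled element by element with the row index outermost
function baseline()
  stack = Array{Int64,2}(1000000,3)
  for i=1:1000000,j=1:3
    stack[i,j]=i
  end
end
# Same fill, but looping down each column, which matches Julia's column-major storage
function baseline2()
  stack = Array{Int64,2}(1000000,3)
  for j=1:3,i=1:1000000
    stack[i,j]=i
  end
end
# Pre-allocated array, filled one row at a time from a temporary vector
function f0()
  stack = Array{Int64}(1000000,3)
  for i = 1:1000000
    stack[i,:] = [i,i,i]
  end
end
# Transposed layout: filled one column at a time from a temporary vector
function f02()
  stack = Array{Int64}(3,1000000)
  for i = 1:1000000
    stack[:,i] = [i;i;i]
  end
end
# Three plain vectors used as stacks, each starting empty
function f1()
  stack1 = Vector{Int64}(0)
  stack2 = Vector{Int64}(0)
  stack3 = Vector{Int64}(0)
  for i = 1:1000000
    push!(stack1,i)
    push!(stack2,i)
    push!(stack3,i)
  end
end
# Three DataStructures.jl stacks of integers
function f2()
  stack1 = Stack(Int)
  stack2 = Stack(Int)
  stack3 = Stack(Int)
  for i = 1:1000000
    push!(stack1,i)
    push!(stack2,i)
    push!(stack3,i)
  end
end
# One DataStructures.jl stack whose elements are tuples of three integers
function f3()
  stack = Stack{}(Tuple{Int64,Int64,Int64})
  for i = 1:1000000
    push!(stack,(i,i,i))
  end
end
# One DataStructures.jl stack whose elements are freshly allocated vectors
function f4()
  stack = Stack{}(Vector{Int64})
  for i = 1:1000000
    push!(stack,[i,i,i])
  end
end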
base = benchmark(baseline,"baseline",1000)
println(base)
base2 = benchmark(baseline2,"baseline2",1000)
println(base2)
df0 = benchmark(f0,"array",1000)
println(df0)
df02 = benchmark(f02,"arrayTranspose",1000)
println(df02)
df1 = benchmark(f1,"vectorStack",1000)
println(df1)
df2 = benchmark(f2,"dsStacks",1000)
println(df2)
df3 = benchmark(f3,"dsStackTuple",1000)
println(df3)
df4 = benchmark(f4,"dsStackVector",1000)
println(df4)

The results were as follows:

| Row | Benchmark        | Iterations | TotalWall | AverageWall | MaxWall   | MinWall    | Timestamp             |
|-----|------------------|------------|-----------|-------------|-----------|------------|-----------------------|
| 1   | "baseline"       | 1000       | 11.7169   | 0.0117169   | 0.0158455 | 0.00978837 | "2016-02-29 23:23:51" |
| 2   | "baseline2"      | 1000       | 9.84362   | 0.00984362  | 0.0126953 | 0.00694176 | "2016-02-29 23:24:01" |
| 3   | "array"          | 1000       | 114.288   | 0.114288    | 0.172499  | 0.0775942  | "2016-02-29 22:45:42" |
| 4   | "arrayTranspose" | 1000       | 110.981   | 0.110981    | 0.183495  | 0.0741138  | "2016-02-29 22:47:34" |
| 5   | "vectorStack"    | 1000       | 34.4623   | 0.0344623   | 0.0455326 | 0.0285367  | "2016-02-29 22:48:09" |
| 6   | "dsStacks"       | 1000       | 38.0762   | 0.0380762   | 0.0508213 | 0.0303853  | "2016-02-29 22:48:47" |
| 7   | "dsStackTuple"   | 1000       | 19.3516   | 0.0193516   | 0.0296347 | 0.0140451  | "2016-02-29 22:49:06" |
| 8   | "dsStackVector"  | 1000       | 184.126   | 0.184126    | 0.227575  | 0.16454    | "2016-02-29 22:52:11" |
| 9   | "vectorTuple"    | 1000       | 23.65     | 0.02365     | 0.0375346 | 0.0200302  | "2016-02-29 23:29:45" |

Each benchmark printed as its own 1x12 DataFrames.DataFrame; the rows are gathered into one table here. In every run the Category equals the Benchmark name, JuliaHash is "a2f713dea5ac6320d8dcf2835ac4a37ea751af05", CodeHash is NA, OS is "Windows", and CPUCores is 8. Wall times are in seconds.

Things to learn from this are:

Using a tuple is by far the fastest way to store the three integers in a dynamic structure: dsStackTuple averages 0.019 s per run, against 0.184 s for dsStackVector.
DataStructures.jl’s tuple stack beats everything except the element-by-element fill of a pre-allocated array.
The standard vector is pretty close to the DataStructures.jl result.

The end result is: use arrays when you can pre-allocate and need mutability, but if you want to throw and retrieve things from a dynamic data structure, using tuples is key. DataStructures.jl has some nice features and (obviously) implementations of data structures, and although its tuple stack is slightly faster than a plain vector of tuples, don’t expect a massive speedup. Still, it’s a well-made package you should try out.
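One plausible reading of why tuples win is that a tuple of three integers can be stored inline in the stack's buffer, while every [i,i,i] vector is a separate heap allocation. The C sketch below illustrates that idea only; it is an analogy, not how DataStructures.jl implements its Stack, and the names Triple, TripleStack, ts_init, ts_push, ts_pop and ts_free are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* One stack element: three integers stored inline, playing the role of
   Tuple{Int64,Int64,Int64} (hypothetical type for this sketch). */
typedef struct { long a, b, c; } Triple;

/* A growable stack of Triple values backed by one contiguous buffer. */
typedef struct {
  Triple *data;
  size_t len;
  size_t cap;
} TripleStack;

void ts_init(TripleStack *s) { s->data = NULL; s->len = 0; s->cap = 0; }

/* Push copies the element into the buffer. The buffer doubles when it is
   full, so there is no heap allocation per element. Returns 0 on success. */
int ts_push(TripleStack *s, Triple x) {
  if (s->len == s->cap) {
    size_t cap = s->cap ? 2 * s->cap : 16;
    Triple *p = realloc(s->data, cap * sizeof *p);
    if (p == NULL) return -1;
    s->data = p;
    s->cap = cap;
  }
  s->data[s->len++] = x;
  return 0;
}

/* Pop removes and returns the most recently pushed element. */
Triple ts_pop(TripleStack *s) {
  assert(s->len > 0);
  return s->data[--s->len];
}

void ts_free(TripleStack *s) { free(s->data); ts_init(s); }
```

Pushing a million elements here triggers only a handful of buffer reallocations, whereas a stack of pointers to separately allocated three-element arrays would need a million extra allocations.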

The post Benchmarks of Multidimensional Stack Implementations in Julia appeared first on Stochastic Lifestyle.

Interfacing with a Xeon Phi via Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/interfacing-xeon-phi-via-julia/

(Disclaimer: This is not a full-Julia solution for using the Phi, and instead is a tutorial on how to link OpenMP/C code for the Xeon Phi to Julia. There may be a future update where some of these functions are specified in Julia, and Intel’s CompilerTools.jl looks like a viable solution, but for now it’s not possible.)

Intel’s Xeon Phi has a lot of appeal. It’s an instant cluster in your computer, right? It turns out it’s not quite that easy. For one, the installation process itself is quite tricky, and the device has stringent requirements for motherboard choices. Also, maxing out at just over a teraflop is good, but not quite as high as NVIDIA’s GPU acceleration cards.

However, there are a few big reasons why I think our interest in the Xeon Phi should be renewed. For one, Intel will be releasing its next version, Knights Landing, in Q3, which promises up to 8 teraflops and 16 GB of RAM. Intel has also been saying that this next platform will be much more user friendly and have improved bandwidth to allow for quicker offloading of data. Lastly, since the Xeon Phi uses x86 cores which one interfaces with via standard tools such as OpenMP and MPI, high-performance parallel codes transfer over to the Xeon Phi with little work (if you’ve already parallelized your code). For this reason many major HPC systems such as Stampede and SuperMIC have been incorporating a Xeon Phi into every compute node. These details tell me that, for high-performance computing, using Xeon Phis to their full potential is the way forward. I am going to detail some of my advances in interfacing with the Xeon Phi via Julia.

First, let’s talk about automatic offloading

Automatic offloading allows you to offload all of your MKL calls to the Xeon Phi automatically. This means that if you are doing lots of linear algebra on large matrices, standard operations from BLAS and LAPACK like matrix multiplication * will automatically be done on the acceleration card. Details for setting up automatic offload are given by MATLAB. However, automatic offloading is a mixed blessing. First of all, there is no data persistence. If you are repeatedly using the same matrices, as when solving an evolution equation (e.g. a parabolic PDE), this adds a large overhead since you’ll be sending that data back and forth on every multiplication. Also, one major downside is that it does not apply to vectorized arithmetic such as .*. Sure, you could hack it to be matrix multiplication by a sparse diagonal matrix, but these types of hacks really only tend to give you speedups when your vectors are large, since you still incur the cost of transferring the arrays every time.
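To make the diagonal-matrix hack concrete, here is a small plain-C sketch showing that an elementwise product x .* y gives the same numbers as multiplying y by the diagonal matrix built from x. It is only an illustration of the identity: the function names elementwise and diag_matvec are invented for this example, the matrix is stored densely rather than sparsely to keep the code short, and nothing here calls MKL or touches the Phi.

```c
#include <assert.h>
#include <stddef.h>

/* Elementwise product z[i] = x[i] * y[i], the analogue of x .* y. */
void elementwise(const double *x, const double *y, double *z, size_t n) {
  assert(x != NULL && y != NULL && z != NULL);
  for (size_t i = 0; i < n; i++) z[i] = x[i] * y[i];
}

/* Matrix-vector product z = D*y for an n-by-n matrix D stored in row-major
   order. When D is diagonal with x on its diagonal, this reproduces the
   elementwise product above while doing n*n multiplications instead of n. */
void diag_matvec(const double *D, const double *y, double *z, size_t n) {
  assert(D != NULL && y != NULL && z != NULL);
  for (size_t i = 0; i < n; i++) {
    double acc = 0.0;
    for (size_t j = 0; j < n; j++) acc += D[i * n + j] * y[j];
    z[i] = acc;
  }
}
```

The dense version makes the cost of the hack visible: the result is identical, but the work and the data to transfer grow with the matrix rather than with the vector, which is why a sparse diagonal representation is the only sensible form of this trick.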

Still, it’s stupid easy to set up. You compile Julia with MKL and set a few environment variables, and it will do it automatically. Thus you should give this a try first.

Native Execution

You can also compile code to natively execute on the Xeon Phi. However, you need to copy the files (and libraries) over to the Phi via ssh and run the job from there. Thus while this is really good for C code, it’s not as easy to use when you wish to control the Phi from the computer itself as a “side job”.

Pragma-assisted Offloading

This is the route we are going to take. Pragmas are compiler directives, and OpenMP uses them to mark the segments of the code to be parallelized. If you’re familiar with using parallel constructs from MATLAB/Julia like parallel loops, OpenMP’s pragmas are pretty much the C version of that. For the Xeon Phi, there exist extra pragmas telling the compiler to offload a block to the Phi. This also allows for data persistence. Lastly, many C codes parallelized with OpenMP are just one pragma away from working on the Phi.

Our workflow will be as follows. We will use a driver script from Julia which will set the environment and use Julia’s ccall to call the C-code with the OpenMP pragmas which will perform parallelized function calls. Notice that in this case Julia is just performing the role of glue code. The advantage is that we can prepare the data and plot the results from within Julia. The disadvantage is that we will have to write a lot of C-code. However, I am currently talking with Intel’s Developer Lab on using their CompilerTools.jl to compile Julia functions to be used on the Xeon Phi. When that’s available, I will write a tutorial on how to then replace the core functions from this script with the Julia functions. Then, the only C-code would be the code which starts the parallel loop. Let’s get started.

The Problem

We wish to solve some simple stochastic differential equations via the Euler-Maruyama method. We will specify the stochastic differential equation of the form

$dU_{t} = f(U,t)dt + g(U,t)dW_{t}$

via functions $f$ and $g$ . In our code we will also allow the ability to have a function for the true solution in order to perform error calculations.

The Julia Code

Again, at this point the Julia code is quite simple because it is simply performing the glue. Let $M$ be the number of simulations we perform. Set up empty vectors for the values of $U$ and the true solution $Utrue$ at the endpoints. Note that we will only keep the endpoints from each simulation due to memory issues. If we were to keep the full array for thousands (or millions) of runs, this would easily be more memory than the Phi could handle (or even a workstation!). We then need to specify our environment variables. Set OMP_NUM_THREADS to be the number of compute cores on your system. We set up MIC_PREFIX, LD_LIBRARY_PATH, and MIC_LD_LIBRARY_PATH so that we can dynamically link to our library. Note that this assumes that you have already sourced the compiler variables via compilervars.sh with the argument intel64. If not, you can use Julia’s run function to source the script. Next, we set a constant MIC_OMP_NUM_THREADS to be the number of threads the Xeon Phi will use. Since when offloading you can use all but 1 core (one manages the jobs) and each core has 4 threads, we set 240 threads (for the 5110p). Like in the case of GPUs, using more threads than cores is beneficial since the cores can utilize large vectors to do multiple calculations at once (SIMD) and other magic. We set the environment variable OFFLOAD_REPORT to 3, which will make the Phi give us details about everything it’s offloading (good for debugging). Lastly, we end by calling the library. The total code is as follows:

M = 240000
ENV["OMP_NUM_THREADS"]=12
Us = Vector{Float64}(M)
ts = Vector{Float64}(M)
Ws = Vector{Float64}(M)
Utrues = Vector{Float64}(M)
MIC_OMP_NUM_THREADS = 240
ENV["MIC_PREFIX"]="MIC"
ENV["OFFLOAD_REPORT"]=3
ENV["LD_LIBRARY_PATH"]=string(ENV["LD_LIBRARY_PATH"],":.")
ENV["MIC_LD_LIBRARY_PATH"]=string(ENV["MIC_LD_LIBRARY_PATH"],":.")
alg = 2
#ccall((:monte_carlo,"/home/crackauc/XeonPhiTests/EMtest/sde_solvers_noffload.so"),Void,(Cint,Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble}),M,Us,Utrues,ts)
@time ccall((:monte_carlo,"/home/crackauc/XeonPhiTests/EMtest/sde_solvers.so"),Void,(Cint,Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble},Cint,Cint),M,Us,Utrues,ts,Ws,MIC_OMP_NUM_THREADS,alg)

For more of an explanation on using the ccall function to interface with C-code, see my previous blog post. Note that the arrays Us, Utrues, ts, and Ws will be updated in place as the value of U, Utrue, t, and W at the end of the path. Thus after the job is done one can use Julia to plot the results.

Xeon Phi Driver Function

The ccall function looks for a function of the following type in a shared library named sde_solvers.so:

void monte_carlo(int M,double* Us,double* Utrues,double* ts,double* Ws,const int MIC_OMP_NUM_THREADS,int alg)

In this function we will just do a parallel for loop where each iteration calls the Euler-Maruyama solver on a different random seed. However, instead of doing a straight parallel for loop, we will put a little separation between “the parallel” and “the for” so that we can keep some persistent data to be a little more efficient.
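The idea of separating “the parallel” from “the for” can be sketched on its own, without the offload pragma or the SDE solver. In the minimal example below, which is only an illustration under my own naming (run_tasks and scratch are hypothetical), each thread allocates one scratch buffer when the parallel region starts, reuses it for every loop iteration it is handed, and frees it once at the end. Compiled without OpenMP support the pragmas are ignored and the same code runs serially.

```c
#include <assert.h>
#include <stdlib.h>

/* Runs m independent tasks. Task i fills a scratch buffer of length n with
   the values i, i+1, ..., i+n-1 and stores their sum in out[i].
   Returns 0 on success, -1 if any thread failed to allocate its buffer. */
int run_tasks(int m, int n, double *out) {
  assert(m >= 0 && n > 0 && out != NULL);
  int failed = 0;
  #pragma omp parallel default(none) shared(m, n, out, failed)
  {
    /* Executed once per thread, before any loop iteration. */
    double *scratch = malloc((size_t)n * sizeof *scratch);
    if (scratch == NULL) {
      #pragma omp atomic write
      failed = 1;
    }
    /* The iterations are divided among the threads of the region above. */
    #pragma omp for
    for (int i = 0; i < m; i++) {
      if (scratch == NULL) continue;
      /* Overwrite the per-thread buffer instead of reallocating it. */
      for (int k = 0; k < n; k++) scratch[k] = (double)(i + k);
      double sum = 0.0;
      for (int k = 0; k < n; k++) sum += scratch[k];
      out[i] = sum;
    }
    /* Executed once per thread, after its share of the loop is done. */
    free(scratch);
  }
  return failed ? -1 : 0;
}
```

With a combined parallel-for pragma the buffer would have to be allocated inside the loop body, once per iteration; splitting the two directives is what gives each thread a place to keep data across iterations.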

We start by defining some constants:

  double Uzero = .5;
  double dt = 0.00001;
  double T = 2.0;
  int N = ceil(T/dt)+1;

Now we send the job over to the Xeon Phi via the following pragma:

#pragma offload target(mic:MIC_DEV) default(none) in(Uzero,dt,T,N,MIC_OMP_NUM_THREADS,alg) out(Us:length(M)) \
  out(Utrues:length(M)) out(ts:length(M)) out(Ws:length(M))

Note that at the top of the script we have

#ifndef MIC_DEV
#define MIC_DEV 0
#endif
 
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <mathimf.h>
#include "mkl.h"
#include "mkl_vsl.h"

and so MIC_DEV singles out the Xeon Phi labeled 0. Using in we send over the variables, and with out we specify the variables we want the Phi to send back. By adding default(none) we get informed if there are any variables which weren’t specified.

After that pragma, we are on the MIC. The first thing we will do is set the number of threads. I don’t know why, but setting the environment variable MIC_OMP_NUM_THREADS in Julia does not set the number of MIC threads, so instead we do it manually via the call

omp_set_num_threads(MIC_OMP_NUM_THREADS);

Next we start our parallel environment by

#pragma omp parallel default(none) shared(Uzero,alg,dt,T,N,M,ts,Us,Utrues,Ws)

Once again, default(none) will make sure no variables are accidentally set to shared, and we specify all of the inputs as shared. With this, we are now coding with that list of variables on the individual threads of the Phi. Thus we will now set up an individual run of the SDE solver. We make arrays for time $t$ , the Brownian path $W$ , the solution $U$ , and the true solution $Utrue$ . We also grab the id of the thread to set up random seeds later. This gives:

      int i;
      int tid = omp_get_thread_num();
      double* t; double* W; double* U; double* Utrue;
      int steps;
      t = (double*) malloc(N*sizeof(double));
      W = (double*) malloc(N*sizeof(double));
      U = (double*) malloc(N*sizeof(double));
      Utrue = (double*) malloc(N*sizeof(double));

Now we start our parallel for loop. Notice that by allocating these variables before the loop we have increased our efficiency, since each run simply writes over these values and saves us the time of re-allocating. In our for loop we set the initial values (since we are re-using the same arrays), call the solver algorithm, save the results at the end, and re-run. After we are done with the whole loop, we free the arrays we made. The code is then as follows:

      #pragma omp for
      for(i=0;i<M;i++){
        t[0]=0;
        U[0]=Uzero;
        Utrue[0]=Uzero;
        W[0] = 0;
        euler_maruyama(&f,&g,&trueSol,Uzero,dt,T,t,&W,U,Utrue,tid*i*M+i); /*unique identifier tid*i*M+i since tid spacing */
        Us[i] = U[N-1];
        Utrues[i] = Utrue[N-1];
        ts[i] = t[N-1];
        Ws[i] = W[N-1];
      }
    free(t); free(Utrue); free(U); free(W);

Notice that tid*i*M+i reduces to i modulo $M$ , and every i in the loop is distinct, so each path gets a unique value as long as the product does not overflow an int. This is then the value we can use as a random seed. The full code for the driver function is then:

void monte_carlo(int M,double* Us,double* Utrues,double* ts,double* Ws,const int MIC_OMP_NUM_THREADS,int alg){
  double Uzero = .5;
  double dt = 0.00001;
  double T = 2.0;
  int N = ceil(T/dt)+1;
  #pragma offload target(mic:MIC_DEV) default(none) in(Uzero,dt,T,N,MIC_OMP_NUM_THREADS,alg) out(Us:length(M)) \
  out(Utrues:length(M)) out(ts:length(M)) out(Ws:length(M))
  {
    omp_set_num_threads(MIC_OMP_NUM_THREADS);
    #pragma omp parallel default(none) shared(Uzero,alg,dt,T,N,M,ts,Us,Utrues,Ws)
     {
      int i;
      int tid = omp_get_thread_num();
      double* t; double* W; double* U; double* Utrue;
      int steps;
      t = (double*) malloc(N*sizeof(double));
      W = (double*) malloc(N*sizeof(double));
      U = (double*) malloc(N*sizeof(double));
      Utrue = (double*) malloc(N*sizeof(double));
      #pragma omp for
      for(i=0;i<M;i++){
        t[0]=0;
        U[0]=Uzero;
        Utrue[0]=Uzero;
        W[0] = 0;
        euler_maruyama(&f,&g,&trueSol,Uzero,dt,T,t,&W,U,Utrue,tid*i*M+i); /*unique identifier tid*i*M+i since tid spacing */
        Us[i] = U[N-1];
        Utrues[i] = Utrue[N-1];
        ts[i] = t[N-1];
        Ws[i] = W[N-1];
      }
      free(t); free(Utrue); free(U); free(W);
    }
  }
}

Notice I left out the extra algorithms. When I put this in a package (and in my soon to be submitted code for a publication) I have different choices for the solver, but here we will just have Euler-Maruyama.
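One thing worth checking for the seed expression tid*i*M+i is its size: with M = 240000 the product grows quickly. The sketch below is a hedged illustration rather than part of the original library; the helper names path_seed and seed_fits_int are hypothetical. It evaluates the same expression in 64-bit arithmetic and reports whether the result still fits in an int.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* The seed expression tid*i*M + i for path i of M run by thread tid,
   evaluated in 64-bit arithmetic so the intermediate product cannot wrap
   for the sizes used here (M = 240000, up to 240 threads). */
int64_t path_seed(int tid, int i, int M) {
  assert(tid >= 0 && i >= 0 && i < M);
  return (int64_t)tid * i * M + i;
}

/* Returns 1 if that seed can be stored in an int without overflow. */
int seed_fits_int(int tid, int i, int M) {
  return path_seed(tid, i, M) <= (int64_t)INT_MAX;
}
```

On a platform with 32-bit int, path 9000 on thread 1 already gives 2160009000, which is past INT_MAX. Since every i is distinct on its own, passing a seed derived from i alone would be one way to stay in range; that is a suggestion, not what the code above does.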

The Inner Functions

Before we get to the solver, notice that euler_maruyama takes in three functions by pointer. Since these will be executed on the Xeon Phi, they would normally be decorated with __attribute__((target(mic))). I will leave off these declarations, since we can instead have them put on automatically by a compiler flag (and this makes it easier to re-compile to be a Xeon Phi free code). Thus the SDE functions are simply

double f(double t,double x){
  return (1.0/20.0)*x;
}
 
double g(double t,double x){
  return (1.0/10.0)*x;
}
 
double trueSol(double t, double Uzero,double W){
  return Uzero*exp(((1.0/20.0)-((1.0/10.0)*(1.0/10.0))/2.0)*t + (1.0/10.0)*W);
}

Thus the SDE is

$dU_{t} = \frac{1}{20} U_{t}dt + \frac{1}{10} U_{t}dW_{t}$

which a mathematician would call Geometric Brownian Motion and which someone in finance would know as the model behind the Black-Scholes equation. Our inner function euler_maruyama is then the standard loop for solving via Euler-Maruyama, where we replace any instance of $dt$ with a small real number and we replace $dW_{t}$ with normal random variables with zero mean and variance $dt$ . The only tricky part is getting normal random variables, but I used Intel’s VSL library for generating these. The code for the Euler-Maruyama solver is then

void euler_maruyama(double (*f)(double,double),double (*g)(double,double),double (*trueSol)(double,double,double),double Uzero,double dt, double T,double* t,double** W,double* U,double* Utrue,int id){
  int N = ceil(T/dt)+1;
  /* *W is the length-N buffer the driver allocated once per thread; we reuse it
     here, since calling malloc on every run would leak the previous buffer */
  VSLStreamStatePtr stream;
  vslNewStream(&stream,VSL_BRNG_MT19937,20+id);
  /* fill *W with N standard normal draws (mean 0, standard deviation 1) */
  vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,stream,N,*W,0.0,1.0);
  (*W)[0] = 0.0;
 
  int i;
  double dW;
  double sqdt = sqrt(dt);
  for(i=1;i<N;i++){
    dW = sqdt* (*W)[i];           /* scale the draw so dW has variance dt */
    t[i] = t[i-1] + dt;
    (*W)[i] = (*W)[i-1] + dW;     /* overwrite the draw with the Brownian path */
    U[i] = U[i-1] + dt*f(t[i-1],U[i-1]) + g(t[i-1],U[i-1])*dW;
    Utrue[i] = trueSol(t[i],Uzero,(*W)[i]);
  }
  vslDeleteStream(&stream);
}

Notice that this part is nothing special: it is the same loop you would write in plain C for the CPU. One detail is that the driver passes $W$ by reference via &W, and thus every time it is used inside the solver we have to dereference it via *W. The solver writes into the buffer the driver already allocated, which is how the driver can read the value of $W$ at the end of the run. Other than that there’s nothing fancy here.
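For readers without MKL at hand, the update rule can be tried in isolation. The following is a minimal sketch under my own assumptions, not the library code: the names drift, diffusion and em_path are hypothetical, and the Brownian increments are passed in as an array (already scaled to variance $dt$) instead of being drawn from VSL, so the function is deterministic and easy to check by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Drift and diffusion of the geometric Brownian motion used in the post. */
static double drift(double t, double x) { (void)t; return (1.0 / 20.0) * x; }
static double diffusion(double t, double x) { (void)t; return (1.0 / 10.0) * x; }

/* One Euler-Maruyama path on N grid points with step dt, starting at U0.
   dW[i] for i = 1..N-1 must already hold the Brownian increments, that is
   sqrt(dt) times a standard normal draw; dW[0] is ignored.
   t, W and U are caller-allocated arrays of length N. */
void em_path(double U0, double dt, size_t N, const double *dW,
             double *t, double *W, double *U) {
  assert(N >= 1 && dW != NULL && t != NULL && W != NULL && U != NULL);
  t[0] = 0.0;
  W[0] = 0.0;
  U[0] = U0;
  for (size_t i = 1; i < N; i++) {
    t[i] = t[i - 1] + dt;
    W[i] = W[i - 1] + dW[i];  /* running Brownian path */
    U[i] = U[i - 1] + dt * drift(t[i - 1], U[i - 1])
                    + diffusion(t[i - 1], U[i - 1]) * dW[i];
  }
}
```

For example, with U0 = 0.5, dt = 0.5 and a first increment of 0.1, the first step gives 0.5 + 0.5*(0.5/20) + (0.5/10)*0.1 = 0.5175.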

Compilation

This is always the hardest part. However, notice that if we just take away the offload pragma this is perfectly good OpenMP code! You can do this from the compiler to first check your code. The compilation command is as follows:

icc -mkl -O3 -openmp -fpic -diag-disable 10397 -no-offload -Wno-unknown-pragmas -std=c99 -qopt-report -qopt-report-phase=vec -shared sde_solvers.c -o sde_solvers.so

Most of it is setting up offload reports and libraries, but the important part to notice is that -no-offload is the part that turns off the offload pragma. Give this a try and it should parallelize on the CPU. Now, to compile for the Phi, we use the command

icc -mkl -O3 -openmp -fpic -diag-disable 10397 -qoffload -Wno-unknown-pragmas -std=c99 -qopt-report -qopt-report-phase=vec -shared sde_solvers.c -offload-attribute-target=mic -o sde_solvers.so

Notice that the flag -offload-attribute-target=mic is required if you do not put __attribute__((target(mic))) in front of each function that is called when offloaded. I prefer to not put the extra tags because icc required that I delete them to re-compile for the CPU. In this case, we simply get rid of that compiler flag and change to -no-offload and we have working CPU code. Thus you can see how to transfer back and forth between the two via compilation.

After doing this you should be able to call the code from Julia, have it solve the problem on the Phi, and then return the result to Julia.

Future Steps

Notice that the functions $f$ , $g$ , and $trueSol$ are simple functions which we pass by pointer into the solver. Julia already has ways to pass function pointers, which I go over in my previous tutorial, though since they are not compiled with the __attribute__((target(mic))) flag they will not work on the Phi. Hopefully Intel’s CompilerTools.jl will support this in the near future. When that’s the case, these functions could be specified from within Julia to allow us to create libraries where we can use Julia-specified functions as the input.

However, this gives a nice template for performing any kind of Monte Carlo simulation or anything else that uses a parallel for loop. This wrapper will form the basis of a library I am creating for stochastic (partial) differential equations. More on that later. In the meantime, have fun experimenting with the Phi!

The post Interfacing with a Xeon Phi via Julia appeared first on Stochastic Lifestyle.

juliabloggers.com

A Julia Language Blog Aggregator