Author Archives: Christopher Rackauckas

Tutorial for Julia on the HPC with GPUs

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/julia-on-the-hpc-with-gpus/

This is a continuous of my previous post on using Julia on the XSEDE Comet HPC. Check that out first for an explanation of the problem. In that problem, we wished to solve for the area of a region where a polynomial was less than 1, which was calculated by code like:

@time res = @sync @parallel (+) for i = imin:dx:imax
  tmp = 0
  isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
  isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
  @simd for j=jmin:dx:jmax
    jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
    jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
    @inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
    + coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
    coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
    coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
    coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
  end
  tmp
end
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

Notice that this code is massively parallel: at every point we do something and we just sum the values. This means it’s a perfect problem for the GPU. I am going to show you how to do this.

Getting CUDArt Setup

We will be using the CUDArt.jl package. I first had this working with the CUDA.jl package but was informed that going forward we should be using the CUDArt.jl package. There’s really not much of a difference in my implementation.

To get started, you have to go here and install an appropriate version of CUDA. Note that you have to have an NVIDIA GPU in order to do this. I will be using a GTX970 on one computer (Windows), and a GTX980Ti on another. Comet has a queue with 2x Tesla K80s, but that’s probably overkill and you should try to develop locally first. One you have all of that setup, CUDArt.jl should compile some .ptx files upon first use. If it does, you’re good to go.

CUDArt.jl’s page shows you how to get started. However, I am going to add a little caveat. In reality, this is the objective function in a machine learning problem, and so we wish to be able to recompute this as quickly as possible just by swapping out the coefficient array coefs. Therefore all of the other arrays must persist on the GPU.

Here’s what we’re going to do. To maximize performance on the GPU we will use Float32’s. On most GPUs (all except the Teslas and the Titan Black), the double precision floating point capabilities are SEVERELY gimped in comparison (1/32 the performance…). To do this, we setup our calculations as follows:

## Setup CPU-side parameters
imin = -8
imax = 1
jmin = -3
jmax = 3
@time coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1234)
iarr = convert(Vector{Float32},collect(imin:dx:imax))
jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
sizei = length(imin:dx:imax)
sizej = length(jmin:dx:jmax)
cudaCores = 1664;
equalDiv = sizei*sizej÷cudaCores + 1
 
##GPU Code
CUDArt.init(devices(dev->true)) #Initiate the devices
dev = device(0) #Set the GPU to the first GPU
#Move the arrays over
g_iarr = CudaArray(iarr)
g_jarr = CudaArray(jarr)
g_coefs = CudaArray(coefs)
g_tmp = CudaArray(Int32,cudaCores)

Note that if you have multiple GPUs and don’t know which one is which device, you can use the command CUDArt.name(CUDArt.device_properties(number)). Here I have CUDArt.name(CUDArt.device_properties(0)) return “GeForce GTX 970” and CUDArt.name(CUDArt.device_properties(1)) returns “GeForce GTX 750 Ti”, so I use device(0) to use the 970. Note that at any time you can change the device with this command, but be careful since the CudaArrays are stored on the device that you call them from. If you want to use multiple GPUs, you will also need to split up your arrays.

The CUDA Kernal

Now the last thing I need to do is make my function. To do this, we have to go to C. Here is the Cuda kernal for the calculation:

// filename: integration.cu
// Performs the inner integration loop
extern "C"
{
    __global__ void integration(const float *coefs, const float *iArr, const float *jArr, const int sizei, const int sizej, const int equalDiv, int *tmp)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        int loopInd;
        float i, j, isq2, isq3, isq4, isq5, isq6, isq7, isq8, jsq2, jsq3, jsq4, jsq5, jsq6, jsq7, jsq8,
        int ans = 0;
        for(loopInd=0;loopInd<equalDiv;loopInd=loopInd+1){
          i = iArr[(index*equalDiv+loopInd)/sizej];
          j = jArr[(index*equalDiv+loopInd)%sizej];
          if(index*equalDiv+loopInd >= sizei*sizej){
            break;
          }
          if((index*equalDiv+loopInd)%sizej==0 || loopInd==0){
            isq2 = i*i;
            isq3 = i*isq2;
            isq4 = isq2*isq2;
            isq5 = i*isq4;
            isq6 = isq4*isq2;
            isq7 = i*isq6;
            isq8 = isq4*isq4;
          }
          jsq2 = j*j;
          jsq3 = j*jsq2;
          jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4;
          jsq6 = jsq2*jsq4;
          jsq7 = j*jsq6;
          jsq8 = jsq4*jsq4;
          /*tmp[index*1878+loopInd]= index*1878+loopInd;*/
		      /* changed to zero indexing */
          ans = ans + (abs(coefs[0]*(jsq2) + coefs[1]*(jsq3) + coefs[2]*(jsq4) + coefs[3]*(jsq5) + coefs[4]*jsq6 + coefs[5]*jsq7 + coefs[6]*jsq8 + coefs[7]*(i) + coefs[8]*(i)*(jsq2) + coefs[9]*i*jsq3 + coefs[10]*(i)*(jsq4) + coefs[11]*i*jsq5 + coefs[12]*(i)*(jsq6) + coefs[13]*i*jsq7 + coefs[14]*(isq2) + coefs[15]*(isq2)*(jsq2) + coefs[16]*isq2*jsq3 + coefs[17]*(isq2)*(jsq4) + coefs[18]*isq2*jsq5 + coefs[19]*(isq2)*(jsq6) + coefs[20]*(isq3) + coefs[21]*(isq3)*(jsq2) + coefs[22]*isq3*jsq3 + coefs[23]*(isq3)*(jsq4) + coefs[24]*isq3*jsq5 + coefs[25]*(isq4) + coefs[26]*(isq4)*(jsq2) + coefs[27]*isq4*jsq3 + coefs[28]*(isq4)*(jsq4) + coefs[29]*(isq5) + coefs[30]*(isq5)*(jsq2) + coefs[31]*isq5*jsq3+ coefs[32]*(isq6) + coefs[33]*(isq6)*(jsq2) + coefs[34]*(isq7) + coefs[35]*(isq8))<1);
        }
        tmp[index] = ans;
    }
}

Okay, that’s not nice looking at all, but it’s the same idea as the Julia script. Let me explain this code a little bit. Our function is integration. It takes in all of the variables and an array tmp which we will save the output to. The first call calculates a unique index for each CUDA core. We then setup our variables. Notice that before that we setup equalDiv to calculate how many calculations each core should do in order to evenly divide the work. Thus we loop over equalDiv times (and check if we go over since this will not necessarily come out perfectly). This means that each CUDA core will calculate the same number of (i,j)‘s. We then use the calculated index and the loop index to get our $i$ and $j$. We index down $j$ first, then go by $i$’s. This means it’s ordered as (0,0),(0,1),ldots,(0,J),(1,0),ldots. Notice that we achieve this by doing integer division by the size of the j array (truncation will make it an integer), and we get the $j$ index by moding by the size of the $j$ array. Using these $i$’s and $j$’s, we calculate the inner part of the loop. Notice that the calls to coefs had to be changed because C using 0-based indexing (as opposed to Julia’s 1-based indexing). Note that we can use the abs function from the CUDA math library. All of the CUDA math library functions are available.

[Note that when doing CUDA programming you want to allocate as little memory as possible. This is still on the small side so it worked out to be the optimal code. However, in some cases the non-unraveled version may be better due to the overhead of memory allocation.]

So there’s a template for you to play around with. When all of that is in order, we have to compile the kernal. On Windows with visual studio, I had to find out where cl.exe was (for me it was in Visual Studio 12’s directory) and tell the compiler the location of it. The total command was:

nvcc -ptx integration.cu -ccbin "C:Program Files (x86)Microsoft Visual Studio 12.0VCbin"

On Linux it was simpler:

nvcc -ptx integration.cu

From this you get .ptx files. To use the function in Julia, we then use the following code:

md = CuModule("integrationWin.ptx",false)
integrationFunc = CuFunction(md,"integration")
@time launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,)); res = sum(to_host(g_tmp));
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

What this does is take the “integration” function out of the .ptx and save it as integrationFunc. I can then call it at any time with the launch command. The second and third options are the grid and block sizes. In CUDA you always want to “overload”/overthread the cores so that there is no downtime. What I have found in my tests is that the best setting for this is to set the gridsize to the number of CUDA cores (since in our case these aren’t communicating) and overloading it by setting the block size to 64. Why 64? That worked out best in testing. Change the numbers around and see what works best for you.

The last argument to launch is the tuple of arguments for our integration function. Notice that there is a trailing comma, this is required. After the computation is finished, the results have been saved directly into g_tmp. Thus we send g_tmp to the host and sum up the results as before. We re-scale by the area of integration and that is the result.

Now in the actual objective function, what I want to do is update the coefficients to a new coefficients array, perform this calculation, and then move on. To do this I simply use the code:

    if gpuEnabled
      g_coefs = CudaArray(coefs)
      launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,))
      res = sum(to_host(g_tmp))
      free(g_coefs)
    else ...

Notice I send coefs over to the GPU, run the kernal to get the answer, and then free up the GPU memory. So in my code I loop over this many times, changing coefs every time. When the entire computation is done, I finish with the command

CUDArt.close(devices(dev->true))

This will clear the GPU.

Comet?

These same steps will work on Comet. You will need to go into an interactive job to compile, but it should work fine. The job script to run Julia on the GPU node is:

#!/bin/bash
#SBATCH -A <account>
#SBATCH --job-name="jOpt"
#SBATCH --output="output/jOpt.%j.%N.out"
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=6
#SBATCH -t 48:00:00
export SLURM_NODEFILE=`generate_pbs_nodefile`
/home/crackauc/julia-a2f713dea5/bin/julia --machinefile $SLURM_NODEFILE /home/crackauc/projectCode/ImprovedSRK/Optimization/cometDriver.jl

Here I am only taking 6 nodes for the other computations, and I tell it to use the gpu node with -p and tell it I want only one GPU resource with –gres=gpu:1. If you want to use multiple GPUs, just change that number. However, you will want to split up the arrays amongst the multiple GPUs, change equalDiv to account for each CUDA core doing less calculations, send an integer to the GPU to tell it which GPU it is, and use that integer to change the indexing so that no calculations are repeated. That sounds like a good exercise.

The post Tutorial for Julia on the HPC with GPUs appeared first on Stochastic Lifestyle.

Multi-node Parallelism in Julia on an HPC (XSEDE Comet)

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/multi-node-parallelism-in-julia-on-an-hpc/

Today I am going to show you how to parallelize your Julia code over some standard HPC interfaces. First I will go through the steps of parallelizing a simple code, and then running it with single-node parallelism and multi-node parallelism. The compute resources I will be using are the XSEDE (SDSC) Comet computer (using Slurm) and UC Irvine’s HPC (using SGE) to show how to run the code in two of the main job schedulers. You can follow along with the Comet portion by applying and getting a trial allocation.

First we need code

The code I am going to use to demonstrate this is a simple parallel for loop. The problem can be explained as follows: given a 2-dimensional polynomial p(i,j) with coefficients c_{k}, find the area where |p(i,j)|<1. The function for generating these coefficients in my actual code is quite wild, so for your purposes you can replace my coefficients by generating a random array of 36 numbers. Also it's a semi-sparse polynomial of at most degree 8 for each variable, so create arrays of powers powz and powk. We solve for the area by making a fine grid of (i,j), evaluating the polynomial at each location, and summing up the places where its absolute value is less than 1.

I developed this "simple" code for calculating this on my development computer:

julia -p 4

Opens Julia with 4 processes (or you can use addprocs(4) in your code), and then

dx = 1/400
imin = -8
imax = 1
jmin = -3
jmax = 3
coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1234)
@time res = @sync @parallel (+) for i = imin:dx:imax
  tmp = 0
   for j=jmin:dx:jmax
    ans = 0
    @simd for k=1:36
      @inbounds ans += coefs[k]*(i^(powz[k]))*(j^(poww[k]))
    end
    tmp += abs(ans)<1
  end
  tmp
end
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

All that we did was loop over the grid $i$ and the grid of $j$, sum the polynomial, check if its absolute value is less than 1, check how many times it is less than 1, and scale by the area we integrated. Some things to note is that I first wrote the code out the "obvious way" and then added the macros to see if they would help. @inbounds turns off array bounds checking. @simd adds vectorization at the "processor" level (i.e. AVX2 to make multiple computations happen at once). @parallel runs the array in parallel and @sync makes the code wait for all processes to finish their part of the computation before moving on from the loop.

Note that this is not the most efficient code. The problem is that we re-use a lot of floating point operations in order to calculate the powers of i and j, so it actually works out much better to unravel the loop:

@time res = @sync @parallel (+) for i = imin:dx:imax
  tmp = 0
  isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
  isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
  @simd for j=jmin:dx:jmax
    jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
    jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
    @inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
    + coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
    coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
    coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
    coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
  end
  tmp
end
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

That's the fast version (for the powers I am using), and you can see how there simply are a lot less computations required by doing this. This gives almost a 10x speedup and is the fastest code I came up with. So now we're ready to put this on the HPC.

Setting up Comet

Log into Comet. I use the XSEDE single sign-on hub and gsissh into Comet from there. On Comet, you can download the generic 64-bit binary from here. You can also give compiling it a try yourself, it didn't work out to well for me and the generic binary worked fine. In order to do anything, you need to enter a compute node. To do this we open an interactive job with

srun -pty -p compute -t 01:00:00 -n24 /bin/bash -l

This gives you an interactive job with 24 cores (all on one node). Now we can actually compute things (if you tried before, you would have gotten an error). Now open up Julia by going to the appropriate directory and using

./julia -p 24

This will put you into an interactive session with 24 workers. Now install all your packages, etc., test your code, make sure everything is working.

Running Jobs

Now that Julia is all set up and you tested your code, you need to set up a job script. It's best explained via an example:

#!/bin/bash
#SBATCH -A <account>
#SBATCH --job-name="juliaTest"
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --partition=compute
#SBATCH --nodes=8
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=24
#SBATCH -t 01:00:00
export SLURM_NODEFILE=`generate_pbs_nodefile`
./julia --machinefile $SLURM_NODEFILE /home/crackauc/test.jl

In the first line you put the account name you were given when you signed up for Comet. This is the account whose time will be billed. Then you give your job a name and tell it where to save the output. Next is the partition that you wish to run on. Here I specify for it to be the compute node. Now I tell it the number of nodes. I wanted 8 nodes with 24 tasks per node. The time limit is 1 hour.

8 nodes? So we need to add some MPI code, right? NOPE! Julia does it for you! This is the amazing part. What we do is we export the Slurm node file and give it to Julia as the machinefile. This nodefile is simply a list of the compute nodes that are allocated to our job that is generated by the job manager. This will automatically open up one worker process for each thing in the node file (which will be 24 tasks per node, so there are 8*24= 192 lines in the node file and Julia will open up 192 processes) and run test.jl.

Did it work? A quick check is to have the following code in test.jl:

hosts = @parallel for i=1:192
       println(run(`hostname`))
       end

You should see it printout the names of 8 different hosts, each 24 times. This means our parallel loop automatically is on 192 cores!

Now add the code we made before to test.jl. When we run this in different settings, I get the following results: (SIMD is the first version, Full is the second "full unraveled" version)

8 Nodes:
SIMD - .74 seconds
Full - .324 seconds

4 Nodes:
SIMD - .77 seconds
Full - .21 seconds

2 Nodes:
SIMD - .81 seconds
Full - .21 seconds

1 Node
SIMD - .996 seconds
Full - .273 seconds

In this case we see that the optimal is to use 2 nodes. As you increase the domain you are integrating over / decrease dx, you have to perform more computations which makes using more nodes more profitable. You can check yourself that if you use some extreme values like 1/1600 it may be better to use 4-8 nodes. But this ends our quick tutorial into multi-node parallelism on Comet. You're done. If you have ever programmed with MPI before, this is just beautiful.

What about SGE?

Another popular job scheduler is SGE. They have this on my UC Irvine cluster. To use Julia with multi-nodes on this, you do pretty much the same thing. Here you have to tell it you want to do an MPI job. This will create the machine file and you stick that machine file into Julia. The job script is then as follows:

#!/bin/bash
 
#$ -N jbtest
#$ -q math
#$ -pe mpich 128
#$ -cwd            		# run the job out of the current directory
#$ -m beas
#$ -ckpt blcr
#$ -o output/
#$ -e output/
module load julia/0.4.3
julia --machinefile jbtest-pe_hostfile_mpich.$JOB_ID test.jl

That's it. SGE makes the machine file called jbtest-pe_hostfile_mpich.$JOB_ID which has the 128 processes to run (this is in the math queue, so for me it's split across 5 computers with the nodes repeated for the number of processes on that node), in your current directory, so I just load the Julia module and stick it into Julia as the machine file, test it out, and it prints out the name of compute nodes. On this computer I can even ssh into the nodes while a job is running to run htop. In htop I see that it uses as many cores on each computer that it says its using (sometimes a little bit more when it's using threading. You may need to set the number of BLAS threads to 0 if you are doing any linear algebra and don't want it to bleed). Success!

What about GPUs?

Using GPUs on this same problem is discussed in my next blog post!

The post Multi-node Parallelism in Julia on an HPC (XSEDE Comet) appeared first on Stochastic Lifestyle.

Using Julia’s C Interface to Utilize C Libraries

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/using-julias-c-interface-utilize-c-libraries/

I recently ran into the problem that Julias’s (GNU Scientific Library) GSL.jl library is too new, i.e. some portions don’t work as of early 2016. However, to solve my problem I needed access to adaptive Monte Carlo integration methods. This means it was time to go in depth in Julia’s C interface. I will go step by step on how I created and compiled the C code, and called it from Julia.

What I wished to use was the GSL Vegas functions. Luckily, there is a good example on this page for how to use the library. I started with this code. I then modified it a bit. To start, I changed the main function header to:

int monte_carlo_integrate (double (*integrand)(double *,size_t,void *),double* retRes,double* retErr,int mode,int dim,double* xl,double* xu,size_t calls){

Before it took in no values, but I will want to be able to do most things from Julia. First of all, I changed the function name from main to avoid namespace issues. This is just good practice when working with shared libraries. then I added a bunch of arguments. The first argument is a function pointer, where the function is defined just as g is in the example. Then I pass in pointers which we become the return values. I add mode (1,2,3 are the different integration types), dim, xl, xu, and calls as passed in values to easily extend the functionality. The rest of the C-code stays mostly the same: I change the parts which reference g to integrand (I like the function name better) and comment out the print functions (they tend to no print correctly). So I end up with a C-function as follows:

#include <stdlib.h>
#include <gsl/gsl_math.h>
#include <gsl/gsl_monte.h>
#include <gsl/gsl_monte_plain.h>
#include <gsl/gsl_monte_miser.h>
#include <gsl/gsl_monte_vegas.h>
int monte_carlo_integrate (double (*integrand)(double *,size_t,void *),double* retRes,double* retErr,int mode,int dim,double* xl,double* xu,size_t calls){
  double res, err;
  const gsl_rng_type *T;
  gsl_rng *r;
  gsl_monte_function G = { *integrand, dim, 0 };
  gsl_rng_env_setup ();
  T = gsl_rng_default;
  r = gsl_rng_alloc (T);
  if (mode==1){
    gsl_monte_plain_state *s = gsl_monte_plain_alloc (dim);
    gsl_monte_plain_integrate (&G, xl, xu, dim, calls, r, s,
                               &res, &err);
    gsl_monte_plain_free (s);
    /* display_results ("plain", res, err); */
  }
  if (mode==2){
    gsl_monte_miser_state *s = gsl_monte_miser_alloc (dim);
    gsl_monte_miser_integrate (&G, xl, xu, dim, calls, r, s,
                               &res, &err);
    gsl_monte_miser_free (s);
    /* display_results ("miser", res, err); */
  }
  if (mode==3){
    gsl_monte_vegas_state *s = gsl_monte_vegas_alloc (dim);
    gsl_monte_vegas_integrate (&G, xl, xu, dim, 10000, r, s,
                               &res, &err);
    /* display_results ("vegas warm-up", res, err); */
    /* printf ("converging...n"); */
    do
      {
        gsl_monte_vegas_integrate (&G, xl, xu, dim, calls/5, r, s,
                                   &res, &err);
        /* printf ("result = % .6f sigma = % .6f "
                "chisq/dof = %.1fn", res, err, gsl_monte_vegas_chisq (s)); */
      }
    while (fabs (gsl_monte_vegas_chisq (s) - 1.0) > 0.5);
    /* display_results ("vegas final", res, err); */
    gsl_monte_vegas_free (s);
  }
  gsl_rng_free (r);
  *retRes = res;
  *retErr = err;
  return 0;
}

Notice how nothing was specifically changed to be “Julia-style”, this is all normal C-code. So how do I use it? First I need to compile it to a shared library. This means I use the -shared tag and save it to a .so. I will need the GSL libraries loaded when compiling, so I add the -lgsl, -lgslcblas, and -lm (math) tags to link those libraries. Then I add the -fPIC tag since Julia’s documentation says so, and tell it the file to compile. The complete code is:

gcc -Wall -shared -o libMonte.so -lgsl -lgslcblas -lm -fPIC monteCarloExample.c

This will give you a .so file. Now we need to use it. I will describe passing in the function last. Here’s the setup. The first of the other variables are the two pointers retRes and retErr where the results will be saved. In Julia, to get a pointer, I make two 1-element arrays as follows:

x = Array{Float64}(1)
y = Array{Float64}(1)

The only other peculiar thing I had to do was to change pi from Julia’s irrational type to a Float64 (Cdouble) and use that to make the array. So for the other variables I used:

fpi = convert(Float64,pi)
xl = [0.; 0.; 0.]
xu = [fpi; fpi; fpi]
calls = 500000
mode = 3
dim = 3

Now I pass this all to C as follows:

ccall((:monte_carlo_integrate,"/path/to/library/libMonte.so"),Int32,(Ptr{Void},Ptr{Cdouble},Ptr{Cdouble},Int32,Int32,Ptr{Cdouble},Ptr{Cdouble},Csize_t),integrand_c,x,y,mode,dim,xl,xu,calls)

The first argument a tuple where the first place is the symbol which says which function in the library to use and the second place is the library name or the path to the library. The second argument is the return type (here it’s Int32 because our function returns int 0 when complete). The next portion is a tuple which defined the types of the arguments we are passing. The first is Ptr{Void} which is used for things like function pointers. Next there are two Cdouble pointers for retRes and retErr. Then two integers, two more pointers, and a Csize_t. Lastly we put in all of the variables we want to pass in.

Recall that the result was stored into the pointers for the second and third variable passed in. These are the pointers x and y. So to get the values of res and err, I de-reference them:

res=x[1]
err=y[1]

Now res and err hold the values for the solution and the error.

I am not done yet since I didn’t talk about the function! To do this, we define the function in Julia. We have to use parametric types or Julia will yet at us, and we have to ensure that the returned value is a Cdouble (in our case). So we re-write the g function in Julia as:

function integrand{T,T2,T3}(x::T,dim::T2,params::T3)
  A = 1.0 / (pi * pi * pi)
  return A / (1.0 - cos(unsafe_load(x,1))*cos(unsafe_load(x,2))*cos(unsafe_load(x,3)))::Cdouble
end

Notice that instead of x[1], we have to use the Julia function unsafe_load(x,1) to de-reference a pointer. However, since this part is in Julia, other things are much safer, like how we can use pi directly without having to convert it to a float. Also notice that we can add print statements in this function, and they will print directly to Julia. You can use this to modify the display_results function to be a Julia function which prints. However, this itself is still not able to be passed into C. To do that, you have to translate it to a C function:

integrand_c = cfunction(integrand,Cdouble,(Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble}))

Here we used the Julia cfunction function where the first argument is the function, the second argument is what the function returns, and the third argument is a tuple of C-types that the function will take on. If you look back at the ccall, this integrand_c is what we passed as the first argument to the actual C-function.

Check to see that this all works. It worked for me. For completeness I will put the full Julia code below. Happy programming!

x = Array{Float64}(1)
y = Array{Float64}(1)
fpi = convert(Float64,pi)
xl = [0.; 0.; 0.]
xu = [fpi; fpi; fpi]
 
function integrand{T,T2,T3}(x::T,dim::T2,params::T3)
  A = 1.0 / (pi * pi * pi)
  return 3*A / (1.0 - cos(unsafe_load(x,1))*cos(unsafe_load(x,2))*cos(unsafe_load(x,3)))::Cdouble
end
 
integrand_c = cfunction(integrand,Cdouble,(Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble}))
 
calls = 500000
mode = 3
dim = 3
ccall((:monte_carlo_integrate,"/home/crackauc/Public/libMonte.so"),Int32,(Ptr{Void},Ptr{Cdouble},Ptr{Cdouble},Int32,Int32,Ptr{Cdouble},Ptr{Cdouble},Csize_t),integrand_c,x,y,mode,dim,xl,xu,calls)
res=x[1]
err=y[1]

The post Using Julia’s C Interface to Utilize C Libraries appeared first on Stochastic Lifestyle.