
Interfacing with a Xeon Phi via Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/interfacing-xeon-phi-via-julia/

(Disclaimer: This is not a full-Julia solution for using the Phi; instead it is a tutorial on how to link OpenMP/C code for the Xeon Phi to Julia. There may be a future update where some of these functions are specified in Julia, and Intel’s CompilerTools.jl looks like a viable solution, but for now it’s not possible.)

Intel’s Xeon Phi has a lot of appeal. It’s an instant cluster in your computer, right? It turns out it’s not quite that easy. For one, the installation process itself is quite tricky, and the device has stringent requirements for motherboard choices. Also, maxing out at over a teraflop is good, but not quite as high as NVIDIA’s GPU acceleration cards.

However, there are a few big reasons why I think our interest in the Xeon Phi should be renewed. For one, Intel will be releasing its next version, Knights Landing, in Q3, which promises up to 8 teraflops and 16 GB of RAM. Intel has also been saying that this next platform will be much more user friendly and have improved bandwidth to allow for quicker offloading of data. Lastly, since the Xeon Phi uses x86 cores which one interfaces with via standard tools such as OpenMP and MPI, high performance parallel codes naturally transfer over to the Xeon Phi with little work (if you’ve already parallelized your code). For this reason many major HPCs such as Stampede and SuperMIC have been incorporating a Xeon Phi into every compute node. These details suggest that, for high-performance computing, using Xeon Phis to their full potential is the way forward. I am going to detail some of my advances in interfacing with the Xeon Phi via Julia.

First, let’s talk about automatic offloading

Automatic offloading allows you to offload all of your MKL calls to the Xeon Phi automatically. This means that if you are doing lots of linear algebra on large matrices, standard operations from BLAS and LAPACK like matrix multiplication * will automatically be done on the acceleration card. Details for setting up automatic offload are given in MATLAB’s documentation. However, automatic offloading is a mixed blessing. First of all, there is no data persistence. If you are repeatedly using the same matrices, like in solving an evolution equation (i.e. a parabolic PDE), this adds a large overhead since you’ll be sending that data back and forth every multiplication. Also, one major downside is that it does not apply to vectorized arithmetic such as .*. Sure, you could hack it into matrix multiplication by a sparse diagonal matrix, but these types of hacks really only tend to give you speedups when your vectors are large, since you still incur the cost of transferring the arrays every time.

Still, it’s stupid easy to set up. You compile Julia with MKL, set a few environment variables, and it will do it automatically. Thus you should give this a try first.
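
For reference, here is a minimal sketch of what that might look like from within Julia. The environment variable names (MKL_MIC_ENABLE, MKL_MIC_WORKDIVISION) are from Intel’s MKL Automatic Offload documentation as I recall it, so double-check them against your MKL version; they are usually exported in the shell before launching Julia, and this snippet is only an illustration:

# Hypothetical sketch: MKL Automatic Offload settings (assumes a Julia built against MKL
# and sourced MPSS/compiler variables; normally these are exported before starting Julia).
ENV["MKL_MIC_ENABLE"] = "1"          # turn automatic offload on
ENV["MKL_MIC_WORKDIVISION"] = "0.8"  # hint: push ~80% of the MKL work to the Phi
A = rand(8000,8000); B = rand(8000,8000)
@time A*B   # a large GEMM, the kind of call MKL may offload automatically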

Native Execution

You can also compile code to natively execute on the Xeon Phi. However, you need to copy the files (and libraries) over to the Phi via ssh and run the job from there. Thus while this is really good for C code, it’s not as easy to use when you wish to control the Phi from the host computer itself as a “side job”.

Pragma-assisted Offloading

This is the route we are going to take. Pragmas are a type of syntax from OpenMP where one specifies segments of the code to be parallelized. If you’re familiar with using parallel constructs from MATLAB/Julia like parallel loops, OpenMP’s pragmas are pretty much the C version of that. For the Xeon Phi, there exist extra pragmas telling the compiler to offload to the Phi. This also allows for data persistence. Lastly, many C codes parallelized with OpenMP are just one pragma away from working on the Phi.

Our workflow will be as follows. We will use a driver script from Julia which sets the environment and uses Julia’s ccall to call the C code containing the OpenMP pragmas, which performs the parallelized function calls. Notice that in this case Julia is just playing the role of glue code. The advantage is that we can prepare the data and plot the results from within Julia. The disadvantage is that we will have to write a lot of C code. However, I am currently talking with Intel’s Developer Lab about using their CompilerTools.jl to compile Julia functions to be used on the Xeon Phi. When that’s available, I will write a tutorial on how to replace the core functions from this script with Julia functions. Then the only C code would be the code which starts the parallel loop. Let’s get started.

The Problem

We wish to solve some simple stochastic differential equations via the Euler-Maruyama method. We will specify the stochastic differential equation of the form

$dU_{t} = f(U,t)dt + g(U,t)dW_{t}$

via functions f and g. In our code we will also allow the ability to have a function for the true solution in order to perform error calculations.
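
To fix the algorithm in mind before diving into the offload machinery, here is a minimal single-path Euler-Maruyama sketch written in plain Julia. This is my own illustration (the function name em_path and the use of randn are not from the original code; the actual solver below is written in C and uses MKL’s VSL random number generators):

# Illustrative single-path Euler-Maruyama in Julia (not the offloaded solver)
function em_path(f,g,Uzero,dt,T)
  N = ceil(Int,T/dt)+1
  t = zeros(N); W = zeros(N); U = zeros(N)
  U[1] = Uzero
  sqdt = sqrt(dt)
  for i = 2:N
    dW = sqdt*randn()                 # dW ~ Normal(0,dt)
    t[i] = t[i-1] + dt
    W[i] = W[i-1] + dW
    U[i] = U[i-1] + dt*f(t[i-1],U[i-1]) + g(t[i-1],U[i-1])*dW
  end
  t,W,U
end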

The Julia Code

Again, at this point the Julia code is quite simple because it is simply performing the glue. Let M be the number of simulations we perform. Set up empty vectors for the values of U and the true solution Utrue at the endpoints. Note that we will only keep the endpoints from each simulation due to memory issues: if we were to keep the full array for thousands (or millions) of runs, this would easily be more memory than the Phi could handle (or even a workstation!). We then need to specify our environment variables. Set OMP_NUM_THREADS to be the number of compute cores on your system. We set up MIC_PREFIX, LD_LIBRARY_PATH, and MIC_LD_LIBRARY_PATH so that we can dynamically link to our library. Note that this assumes that you have already sourced the compiler variables via compilervars.sh with the argument intel64. If not, you can use Julia’s run function to source the script. Next, we set a constant MIC_OMP_NUM_THREADS to be the number of threads the Xeon Phi will use. Since when offloading you can use all but 1 core (one manages the jobs) and each core has 4 threads, we set 240 threads (for the 5110P). As in the case of GPUs, using more threads than cores is beneficial since the cores can utilize large vectors to do multiple calculations at once (SIMD) and other magic. We also set the environment variable OFFLOAD_REPORT to 3, which will make the Phi give us details about everything it’s offloading (good for debugging). Lastly, we end by calling the library. The total code is as follows:

M = 240000
ENV["OMP_NUM_THREADS"]=12
Us = Vector{Float64}(M)
ts = Vector{Float64}(M)
Ws = Vector{Float64}(M)
Utrues = Vector{Float64}(M)
MIC_OMP_NUM_THREADS = 240
ENV["MIC_PREFIX"]="MIC"
ENV["OFFLOAD_REPORT"]=3
ENV["LD_LIBRARY_PATH"]=string(ENV["LD_LIBRARY_PATH"],":.")
ENV["MIC_LD_LIBRARY_PATH"]=string(ENV["MIC_LD_LIBRARY_PATH"],":.")
alg = 2
#ccall((:monte_carlo,"/home/crackauc/XeonPhiTests/EMtest/sde_solvers_noffload.so"),Void,(Cint,Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble}),M,Us,Utrues,ts)
@time ccall((:monte_carlo,"/home/crackauc/XeonPhiTests/EMtest/sde_solvers.so"),Void,(Cint,Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble},Ptr{Cdouble},Cint,Cint),M,Us,Utrues,ts,Ws,MIC_OMP_NUM_THREADS,alg)

For more of an explanation on using the ccall function to interface with C code, see my previous blog post. Note that the arrays Us, Utrues, ts, and Ws will be updated in place with the values of U, Utrue, t, and W at the end of each path. Thus after the job is done one can use Julia to plot or post-process the results.
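
For example, a quick post-processing step once the ccall returns might compute the average endpoint error over the Monte Carlo paths (my own addition, not part of the original driver):

# Average endpoint error over the M paths (illustrative post-processing)
err = sum([abs(Us[i] - Utrues[i]) for i = 1:M])/M
println("Mean endpoint error: ", err)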

Xeon Phi Driver Function

The ccall function looks for a function of the following type in a shared library named sde_solvers.so:

void monte_carlo(int M,double* Us,double* Utrues,double* ts,double* Ws,const int MIC_OMP_NUM_THREADS,int alg)

In this function we will just do a parallel for loop where each iteration calls the Euler-Maruyama solver on a different random seed. However, instead of doing a straight parallel for loop, we will put a little separation between “the parallel” and “the for” so that we can keep some persistent data to be a little more efficient.

We start by defining some constants:

  double Uzero = .5;
  double dt = 0.00001;
  double T = 2.0;
  int N = ceil(T/dt)+1;

Now we send the job over to the Xeon Phi via the following pragma:

#pragma offload target(mic:MIC_DEV) default(none) in(Uzero,dt,T,N,MIC_OMP_NUM_THREADS,alg) out(Us:length(M)) \
  out(Utrues:length(M)) out(ts:length(M)) out(Ws:length(M))

Note that at the top of the script we have

#ifndef MIC_DEV
#define MIC_DEV 0
#endif
 
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <mathimf.h>
#include "mkl.h"
#include "mkl_vsl.h"

and so MIC_DEV singles out the Xeon Phi labeled 0. Using in we send over the variables, and with out we specify the variables we want the Phi to send back. By adding default(none) we get informed if there are any variables which weren’t specified.

After that pragma, we are on the MIC. The first thing we will do is set the number of threads. I don’t know why, but setting the environment variable MIC_OMP_NUM_THREADS in Julia does not set the number of MIC threads, so instead we do it manually via the command

omp_set_num_threads(MIC_OMP_NUM_THREADS);

Next we start our parallel environment by

#pragma omp parallel default(none) shared(Uzero,alg,dt,T,N,M,ts,Us,Utrues,Ws)

Once again, default(none) will make sure no variables are accidentally shared by default, and we explicitly specify all of the inputs as shared. With this, we are now coding with that list of variables on the individual threads of the Phi. Thus we will now set up an individual run of the SDE solver. We make arrays for the time t, the Brownian path W, the solution U, and the true solution Utrue. We also grab the id of the thread to set up random seeds later. This gives:

      int i;
      int tid = omp_get_thread_num();
      double* t; double* W; double* U; double* Utrue;
      int steps;
      t = (double*) malloc(N*sizeof(double));
      W = (double*) malloc(N*sizeof(double));
      U = (double*) malloc(N*sizeof(double));
      Utrue = (double*) malloc(N*sizeof(double));

Now we start our parallel for loop. Notice that by allocating these variables before the loop we have increased our efficiency, since on each run we simply write over these values, saving us the time of re-allocating. In our for loop we set the initial values (since we are re-using the same arrays), call the solver algorithm, save the results at the end, and re-run. After we are done with the whole loop, we free the arrays we made. The code is then as follows:

      #pragma omp for
      for(i=0;i<M;i++){
        t[0]=0;
        U[0]=Uzero;
        Utrue[0]=Uzero;
        W[0] = 0;
        euler_maruyama(&f,&g,&trueSol,Uzero,dt,T,t,&W,U,Utrue,tid*i*M+i); /*unique identifier tid*i*M+i since tid spacing */
        Us[i] = U[N-1];
        Utrues[i] = Utrue[N-1];
        ts[i] = t[N-1];
        Ws[i] = W[N-1];
      }
    free(t); free(Utrue); free(U); free(W);

Notice that tid*i*M+i has spacings larger than M and than tid, and so each value will be unique. This is then the value we can use as a random seed. The full code for the driver function is then:

void monte_carlo(int M,double* Us,double* Utrues,double* ts,double* Ws,const int MIC_OMP_NUM_THREADS,int alg){
  double Uzero = .5;
  double dt = 0.00001;
  double T = 2.0;
  int N = ceil(T/dt)+1;
  #pragma offload target(mic:MIC_DEV) default(none) in(Uzero,dt,T,N,MIC_OMP_NUM_THREADS,alg) out(Us:length(M)) \
  out(Utrues:length(M)) out(ts:length(M)) out(Ws:length(M))
  {
    omp_set_num_threads(MIC_OMP_NUM_THREADS);
    #pragma omp parallel default(none) shared(Uzero,alg,dt,T,N,M,ts,Us,Utrues,Ws)
     {
      int i;
      int tid = omp_get_thread_num();
      double* t; double* W; double* U; double* Utrue;
      int steps;
      t = (double*) malloc(N*sizeof(double));
      W = (double*) malloc(N*sizeof(double));
      U = (double*) malloc(N*sizeof(double));
      Utrue = (double*) malloc(N*sizeof(double));
      #pragma omp for
      for(i=0;i<M;i++){
        t[0]=0;
        U[0]=Uzero;
        Utrue[0]=Uzero;
        W[0] = 0;
        euler_maruyama(&f,&g,&trueSol,Uzero,dt,T,t,&W,U,Utrue,tid*i*M+i); /*unique identifier tid*i*M+i since tid spacing */
        Us[i] = U[N-1];
        Utrues[i] = Utrue[N-1];
        ts[i] = t[N-1];
        Ws[i] = W[N-1];
      }
      free(t); free(Utrue); free(U); free(W);
    }
  }
}

Notice I left out the extra algorithms. When I put this in a package (and in my soon-to-be-submitted code for a publication) I have different choices for the solver, but here we will just have Euler-Maruyama.

The Inner Functions

Before we get to the solver, notice that euler_maruyama takes in three functions by pointer. Since these will be executed on the Xeon Phi, they would normally be decorated with __attribute__((target(mic))). However, I will leave off these declarations since we can instead have them put on automatically by a compiler flag (and this makes it easier to re-compile the code without the Xeon Phi). Thus the SDE functions are simply

double f(double t,double x){
  return (1.0/20.0)*x;
}
 
double g(double t,double x){
  return (1.0/10.0)*x;
}
 
double trueSol(double t, double Uzero,double W){
  return Uzero*exp(((1.0/20.0)-((1.0/10.0)*(1.0/10.0))/2.0)*t + (1.0/10.0)*W);
}

Thus the SDE is

$dU_{t} = \frac{1}{20} U_{t}dt + \frac{1}{10} U_{t}dW_{t}$

which a mathematician would call geometric Brownian motion and which someone in finance would recognize from the Black-Scholes model. Our inner function euler_maruyama is then the standard loop for solving via Euler-Maruyama, where we replace any instance of dt with a small real number and we replace $dW_{t}$ with normal random variables with zero mean and variance dt. The only tricky part is getting normal random variables, but I used Intel’s VSL library for generating these. The code for the Euler-Maruyama solver is then

void euler_maruyama(double (*f)(double,double),double (*g)(double,double),double (*trueSol)(double,double,double),double Uzero,double dt, double T,double* t,double** W,double* U,double* Utrue,int id){
  int N = ceil(T/dt)+1;
  *W = (double*) malloc(N*sizeof(double));
  VSLStreamStatePtr stream;
  vslNewStream(&stream,VSL_BRNG_MT19937,20+id);
  vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_BOXMULLER,stream,N,*W,0.0f,1.0f);
  (*W)[0] = 0.0f;
 
  int i;
  double dW;
  double sqdt = sqrt(dt);
  for(i=1;i<N;i++){
    /* dW = 0; */
    dW = sqdt* (*W)[i];
    t[i] = t[i-1] + dt;
    (*W)[i] = (*W)[i-1] + dW;
    U[i] = U[i-1] + dt*f(t[i-1],U[i-1]) + g(t[i-1],U[i-1])*dW;
    Utrue[i] = trueSol(t[i],Uzero,(*W)[i]);
  }
  vslDeleteStream(&stream);
}

Notice that this part is nothing special and quite close to what you’d write in plain C. However, we do note that since we want the value of W at the end of the run outside of this function, and we allocate W within the function, we have to pass W by reference via &W, and thus every time it is used we have to dereference it via *W. Other than that there’s nothing fancy here.

Compilation

This is always the hardest part. However, notice that if we just take away the offload pragma this is perfectly good OpenMP code! You can do this via a compiler flag to first check your code on the CPU. The compilation command is as follows:

icc -mkl -O3 -openmp -fpic -diag-disable 10397 -no-offload -Wno-unknown-pragmas -std=c99 -qopt-report -qopt-report-phase=vec -shared sde_solvers.c -o sde_solvers.so

Most of it is setting up compiler reports and libraries, but the important part to notice is that -no-offload is the flag that turns off the offload pragmas. Give this a try and it should parallelize on the CPU. Now, to compile for the Phi, we use the command

icc -mkl -O3 -openmp -fpic -diag-disable 10397 -qoffload -Wno-unknown-pragmas -std=c99 -qopt-report -qopt-report-phase=vec -shared sde_solvers.c -offload-attribute-target=mic -o sde_solvers.so

Notice that the flag -offload-attribute-target=mic is required if you do not put __attribute__((target(mic))) in front of each function that is called when offloaded. I prefer not to put the extra tags because icc required that I delete them to re-compile for the CPU. In this case, we simply get rid of that flag, change to -no-offload, and we have working CPU code. Thus you can see how to switch back and forth between the two via compilation.
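
If you would rather drive the compilation from Julia than from the shell, the same command can be invoked with Julia’s run (a sketch reusing the CPU-only flags from above):

# Compile the CPU-only (no-offload) version of the shared library from within Julia
run(`icc -mkl -O3 -openmp -fpic -diag-disable 10397 -no-offload -Wno-unknown-pragmas -std=c99 -shared sde_solvers.c -o sde_solvers_noffload.so`)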

After doing this you should be able to call the code from Julia, have it run the solvers on the Phi, and then return the result to Julia.

Future Steps

Notice that the functions f, g, and trueSol are simple functions which we pass by pointer into the solver. Julia already has ways to pass function pointers, which I go over in my previous tutorial, though since they are not compiled with the __attribute__((target(mic))) attribute they will not work on the Phi. Hopefully Intel’s CompilerTools.jl will support this in the near future. When that’s the case, these functions could be specified from within Julia, allowing us to create libraries where we can use Julia-specified functions as the input.

However, this gives a nice template for performing any kind of Monte Carlo simulation or anything else that uses a parallel for loop. This wrapper will form the basis of a library I am creating for stochastic (partial) differential equations. More on that later. In the meantime, have fun experimenting with the Phi!


Multiple-GPU Parallelism on the HPC with Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/multiple-gpu-on-the-hpc-with-julia/

This is the exciting Part 3 of using Julia on an HPC. First I got you started with using Julia on multiple nodes. Second, I showed you how to get the code running on the GPU. That gets you pretty far. However, if you got a trial allocation on Comet and started running jobs, you may have noticed when looking at the architecture that you’re not getting to use the full GPU. In the job script I showed you, I asked for 2 GPUs. Why? Well, that’s because the flagship NVIDIA GPU, the Tesla K80, is actually a dual GPU and you have to control the two parts separately. You may have been following along on your own computer and wondering how to use the multiple GPUs in your setup as well. This tutorial will cover how to extend our function to multiple-GPU setups. You will notice that while this code was made for 2 GPUs, by simply extending the array cudaCores with the values of other GPUs this code automatically extends to however many you need.

Recall

I want to recall the problem a bit, to rephrase it and make it easier to understand why what we are doing will work. Recall that we are simply looping over a mesh of (i,j) (you can think of it as (x,y)), solving some inner function at each point, and adding together the results. The CUDA kernel is shown in the previous part. Thus the way we will extend our problem to multiple GPUs is to simply split the mesh and evaluate the inner function on different GPUs. It’s critical to realize that this solution method requires independence, i.e. that different parts of the grid do not affect each other. Thus you can think of our function as being a “vectorized” evaluation.

Note that we are not doing the most memory-efficient implementation. Each GPU will have the full array of i’s and j’s even though it will only evaluate them at specific parts. One could split the arrays up, though that would take more work for little performance gain, so you should really only look into this if you are running low on memory (or want to do it as an exercise). If you would like to do this, then you can calculate how far in the i’s the first GPU calculates, and then only send the remaining portion of the i’s to the other GPU (since we loop down the j’s, both will need the full j array). Since this problem is not memory limited, I will not show this solution.

Upgrading the Kernel

Upgrading the kernel will be quite simple. Recall that we wrote the problem with linear indexing before. We will keep the linear indexing, except each GPU will be passed an integer, startIdx, which tells the GPU where in the linear indexing it will start. Thus as before we have the value equalDiv which tells the GPU how many indices each core will evaluate, and we have

int index = threadIdx.x + blockIdx.x * blockDim.x;

which tells the GPU which core it is; now we also have a new global index, which is

int globalIndex = index*equalDiv+startIdx;

Notice that successive globalIndex values are equalDiv apart, and so that is the section each core will operate on. The CUDA kernel then just sums up the solution of the inner function over the next equalDiv indices:

    __global__ void integration(const float *coefs, const float *iArr, const float *jArr, const int sizei, const int sizej, const int equalDiv,const int startIdx, int *tmp)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        int globalIndex = index*equalDiv+startIdx;
        int loopInd;
        float i;
        float j;
        float isq2;
        float isq3;
        float isq4;
        float isq5;
        float isq6;
        float isq7;
        float isq8;
        float jsq2;
        float jsq3;
        float jsq4;
        float jsq5;
        float jsq6;
        float jsq7;
        float jsq8;
        int ans = 0;
        for(loopInd=0;loopInd<equalDiv;loopInd=loopInd+1){
          if(globalIndex+loopInd >= sizei*sizej){
            break;
          }
          i = iArr[(globalIndex+loopInd)/sizej];
          j = jArr[(globalIndex+loopInd)%sizej];
          if((globalIndex+loopInd)%sizej==0 || loopInd==0){
            isq2 = i*i;
            isq3 = i*isq2;
            isq4 = isq2*isq2;
            isq5 = i*isq4;
            isq6 = isq4*isq2;
            isq7 = i*isq6;
            isq8 = isq4*isq4;
          }
          jsq2 = j*j;
          jsq3 = j*jsq2;
          jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4;
          jsq6 = jsq2*jsq4;
          jsq7 = j*jsq6;
          jsq8 = jsq4*jsq4;
          ans = ans + innerFunc(coefs,i,isq2,isq3,isq4,isq5,isq6,isq7,isq8,j,jsq2,jsq3,jsq4,jsq5,jsq6,jsq7,jsq8);
        }
        tmp[index] = ans;
    }

Notice that the inner function did not have to be changed at all:

  __device__ int innerFunc(const float *coefs,const float i,const float isq2,const float isq3,const float isq4,const float isq5,const float isq6,const float isq7,const float isq8,const float j,const float jsq2,const float jsq3,const float jsq4,const float jsq5,const float jsq6,const float jsq7,const float jsq8)
  {
    return abs(coefs[0]*(jsq2) + coefs[1]*(jsq3) + coefs[2]*(jsq4) + coefs[3]*(jsq5) + coefs[4]*jsq6 + coefs[5]*jsq7 + coefs[6]*jsq8 + coefs[7]*(i) + coefs[8]*(i)*(jsq2) + coefs[9]*i*jsq3 + coefs[10]*(i)*(jsq4) + coefs[11]*i*jsq5 + coefs[12]*(i)*(jsq6) + coefs[13]*i*jsq7 + coefs[14]*(isq2) + coefs[15]*(isq2)*(jsq2) + coefs[16]*isq2*jsq3 + coefs[17]*(isq2)*(jsq4) + coefs[18]*isq2*jsq5 + coefs[19]*(isq2)*(jsq6) + coefs[20]*(isq3) + coefs[21]*(isq3)*(jsq2) + coefs[22]*isq3*jsq3 + coefs[23]*(isq3)*(jsq4) + coefs[24]*isq3*jsq5 + coefs[25]*(isq4) + coefs[26]*(isq4)*(jsq2) + coefs[27]*isq4*jsq3 + coefs[28]*(isq4)*(jsq4) + coefs[29]*(isq5) + coefs[30]*(isq5)*(jsq2) + coefs[31]*isq5*jsq3+ coefs[32]*(isq6) + coefs[33]*(isq6)*(jsq2) + coefs[34]*(isq7) + coefs[35]*(isq8))<1;
  }

As before, we wrap this in extern “C”{} so that the function name is not mangled when compiled into a shared library.

Setup Multi-GPU Julia Driver

Recall that we are starting with the following parameters:

  imin = -12
  imax = 1
  jmin = -4
  jmax = 4
  iarr = convert(Vector{Float32},collect(imin:dx:imax))
  jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
  sizei = length(imin:dx:imax)
  sizej = length(jmin:dx:jmax)
  totArea = ((imax-imin)*(jmax-jmin)/(sizei*sizej));
 
  coefs = getCoefficients(A0,A1,B0,B1,α,β1,β2,β3,β4)

where getCoefficients is some function which returns an array of 36 coefficients (Float32’s). You can make it a call for random numbers or something of the like. Now we need to call our kernel from Julia. Start by compiling the kernel as in part two into a .ptx. Our simple implementation will be to split the job by the number of cores on each GPU. On Comet, each half of the K80 has 2496 CUDA cores, so we would set

  cudaCores = [2496,2496]

I will instead use the more interesting case of my development computer. It has a GTX 970 with 1664 cores and a GTX 750Ti with 640 cores. Therefore I start by setting

  cudaCores = [1664,640]

The implementation is then as follows: I first get equalDiv as the number of calculations to be done per core. Since I am looping over a mesh with sizei and sizej, if I let totCores=sum(cudaCores) be the total number of cores, then

equalDiv = sizei*sizej÷totCores + 1

tells me how many calculations each core performs (we add 1, i.e. take the ceiling, in order to make sure we cover all of it; otherwise the part truncated by the integer division would not be calculated). Next we need to calculate where to start the second GPU. Let portions=equalDiv*cudaCores be the total number of calculations on each GPU (it is a vector with one entry per GPU). I want my code to work when there’s only 1 card (i.e. cudaCores=[2496]), so to calculate the starting indices I run

  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end

What this does is give an array where the first value is zero, the second value is portions[1], the third is portions[1]+portions[2], etc., and if there is only one card then it’s simply the array with zero. Notice that this is exactly the linear index at which each GPU’s section starts, which is what the CUDA kernel needs.
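
As a quick sanity check of this bookkeeping, here is what it works out to for my desktop’s core counts and the mesh from the full code at the end of this post (the sizes are just length(imin:dx:imax) and length(jmin:dx:jmax); treat this as an illustration):

cudaCores = [1664,640]
sizei = length(-12:1/400:1)               # 5201
sizej = length(-4:1/400:4)                # 3201
totCores = sum(cudaCores)
equalDiv = sizei*sizej÷totCores + 1       # calculations per core
portions = equalDiv*cudaCores             # calculations per GPU
startIdx = [0 cumsum(portions)[1:end-1]]  # [0  equalDiv*1664]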

Now we need to move our arrays over to the GPUs. Note that each array is stored on only one GPU, so we need to loop through the devices and put a copy of each array on each device. We change devices with the command device(i), where i=0 is the first device, i=1 is the second, etc. Since the values are actually C pointers, we will just store them in arrays of type Any (of size numCards, the number of GPU cards as calculated by length(cudaCores)). The setup is then:

  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end

Lastly, we need to call the kernel. Recall that in my problem the inner loop is to change coefs, run the kernel, and use the result. Thus my test function is where I send coefs to the GPU. This will call the kernel as before, except we need to do it asynchronously via Julia’s tasks and CUDA streams. Notice we have a new variable (or array of variables) around: streams. The code above initialized one stream per device. Thus what we will do is tell the kernel to run on its stream and move on before it finishes (but not sum the result until after the GPU stream has finished). This is done by the following code:

        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end

@async is the macro which makes the block asynchronous, meaning Julia keeps going forward even if the block isn’t done. Thus what we need to do is launch on all of the GPUs and then wait for each to finish via the @sync macro. The full loop is then as follows:

  function f4()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end

I put it in the function f4 for benchmarking. Notice we have @sync around the inner loop, telling Julia not to go past the loop until each portion is done, but inside the loop we wrap the body with @async so that Julia can start the GPU codes all at the same time without having to wait for the first to finish, and then as they finish it comes back to finish up. (This does not need to be done with multiple processes since the overhead of these tasks is really small; the overhead of @parallel would probably increase the runtime!) After we are done, we scale the sum by the area element to get the answer.
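
If the @sync/@async pattern is new to you, here is a tiny toy illustration with no GPUs involved (my own example; sleep stands in for waiting on a stream):

results = Vector{Float64}(2)
@sync for i = 1:2
  @async begin
    sleep(1)              # stand-in for wait(streams[i])
    results[i] = i*1.0
  end
end
# the loop takes about 1 second total, not 2, because the two tasks overlap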

So how well did it do? On my last post someone told me to check out the Benchmark package, so to finish off the code I run:

  df4 = benchmark(f4,"MultiGPU",1000)
  println(df4)
  CUDArt.close(devices(dev->true))

benchmark will run the function 1000 times and report the average runtime when we print the DataFrame df4. Then I close the GPU devices to clean up the memory.

Heterogeneous GPU Setup

Recall that our last implementation simply split up the problem by how many cores each GPU has. We will get back to whether that is the right choice later, but first let’s make something a little more complicated to test it out. What about making equalDiv higher on faster GPUs and lower on slower GPUs? What we will do is declare an array trueBias, like

trueBias = [.75,.25]

which says how to bias the equalDivs amongst the GPUs. How do we calculate the division? It’s easiest to see by example. With the bias above we want

$equalDiv_{1} = 3\,equalDiv_{2},$

that is, equalDiv is 3 times as large on the first GPU as on the second. Since the number of values we want to calculate is sizei*sizej, and GPU i calculates equalDiv values on each of its cudaCores[i] cores, the total amount that we calculate is

$equalDiv_{1} \cdot cudaCores_{1} + equalDiv_{2} \cdot cudaCores_{2} = sizei \cdot sizej$

and by substitution we then get

$equalDiv_{2} = \frac{sizei \cdot sizej}{3 \cdot cudaCores_{1}+cudaCores_{2}}$

Do a bigger example with trueBias = [.75, 1/16, 1/8, 1/16] and you will see the pattern. For GPU i, the vector trueBias/trueBias[i] holds the ratios $equalDiv_{k}/equalDiv_{i}$ for every GPU $k$, so substituting $equalDiv_{k} = (trueBias[k]/trueBias[i])\,equalDiv_{i}$ into the total-work equation makes the left-hand side $equalDiv_{i}$ times the dot product of trueBias/trueBias[i] with cudaCores. Dividing sizei*sizej by that dot product therefore gives $equalDiv_{i}$, which is exactly what the following calculates:

  equalDiv = Vector{Int32}(numCards)
  for i = 1:numCards
    equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
  end
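
For example, with my desktop’s core counts and the same mesh sizes as before, this gives a per-core workload on the first GPU that is roughly 3 times the per-core workload on the second, as intended (an illustrative check, not part of the original code):

cudaCores = [1664,640]; trueBias = [.75,.25]
sizei = 5201; sizej = 3201                # mesh sizes from the full code below
equalDiv = Vector{Int32}(2)
for i = 1:2
  equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
end
# equalDiv[1] comes out to roughly 3x equalDiv[2], matching trueBias[1]/trueBias[2] = 3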

We then need to make sure portions is computed with each GPU’s own equalDiv:

  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end

When calling the kernel, all we have to do is specify which equalDiv we want to use, and thus we get the test function:

  function f5()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv[i],startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end
  df5 = benchmark(f5,"MultiGPU2",1000)
  println(df5)
  CUDArt.close(devices(dev->true))

This is the same as before, except we now pass equalDiv[i].

Does Changing the Divisions Help?

I ran this on my desktop and got the following results:

1x12 DataFrames.DataFrame
| Row | Category | Benchmark | Iterations | TotalWall | AverageWall |
|-----|----------|-----------|------------|-----------|-------------|
| 1   | "GPU"    | "GPU"     | 1000       | 23.1013   | 0.0231013   |
 
| Row | MaxWall   | MinWall  | Timestamp             |
|-----|-----------|----------|-----------------------|
| 1   | 0.0264843 | 0.021724 | "2016-02-27 17:16:32" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |
1x12 DataFrames.DataFrame
| Row | Category   | Benchmark  | Iterations | TotalWall | AverageWall |
|-----|------------|------------|------------|-----------|-------------|
| 1   | "MultiGPU" | "MultiGPU" | 1000       | 22.3445   | 0.0223445   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0593403 | 0.0211532 | "2016-02-27 17:16:55" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |
1x12 DataFrames.DataFrame
| Row | Category    | Benchmark   | Iterations | TotalWall | AverageWall |
|-----|-------------|-------------|------------|-----------|-------------|
| 1   | "MultiGPU2" | "MultiGPU2" | 1000       | 22.5787   | 0.0225787   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0557531 | 0.0213999 | "2016-02-27 17:17:18" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |
1x12 DataFrames.DataFrame
| Row | Category             | Benchmark            | Iterations | TotalWall |
|-----|----------------------|----------------------|------------|-----------|
| 1   | "MultiGPU2-Balanced" | "MultiGPU2-Balanced" | 1000       | 22.3453   |
 
| Row | AverageWall | MaxWall   | MinWall   | Timestamp             |
|-----|-------------|-----------|-----------|-----------------------|
| 1   | 0.0223453   | 0.0593927 | 0.0211599 | "2016-02-27 17:17:41" |
 
| Row | JuliaHash                                  | CodeHash | OS        |
|-----|--------------------------------------------|----------|-----------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Windows" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 8        |

The first run used 1 GPU and averaged .0231 seconds. The second run was the equal splitting, which got .0223 seconds. Next was the split trueBias = [.51,.49], which got .0225 seconds. Lastly I set trueBias = [.5,.5] and got .0223 seconds. Thus notice two things: allowing for a heterogeneous split has no overhead (we still only pass an integer), but more interestingly, the best split was still 50-50. This is reproducible and only gets worse when we split more unevenly ([.75,.25] is terrible!).

So what’s going on? Well, first of all, notice that this is not a memory-bound problem: in the loop we only send an array of 36 values. The largest difference between the GPUs is their memory bandwidth. From the GTX 970 specs we see that its memory bandwidth is 224 GB/sec, vs the GTX 750 Ti’s 86.4 GB/sec. That’s a whopping 2.5x difference, but it doesn’t matter in this case. All that matters in this test is compute power. Here the GTX 970 has a base clock of 1050 MHz and a boost clock of 1178 MHz, while the GTX 750 Ti has a base clock of 1020 MHz and a boost clock of 1085 MHz. 1178/1085 = 1.08: they are almost the same. Note that this is the difference between an entry-level graphics card from a few years ago and one of the high-end cards of today (only beaten in the gaming sphere by the GTX 980(Ti) and the Titan X). GPUs tend to have about the same strength per core, while the difference is mostly in how many cores there are. In that sense, our result is not surprising. However, knowing how to deal with the heterogeneity is probably good for more memory-taxing problems.

Running it on Comet

I compiled this code on Comet and ran it with the following job script:

#!/bin/bash
#SBATCH -A uci128
#SBATCH --job-name="jOpt"
#SBATCH --output="output/jOpt.%j.%N.out"
#SBATCH --partition=gpu-shared
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --mail-type=ALL
#SBATCH --mail-user=crackauc@uci.edu
#SBATCH --ntasks-per-node=12
#SBATCH -t 0:10:00
module load cuda/7.0
module load cmake
export SLURM_NODEFILE=`generate_pbs_nodefile`
/home/crackauc/julia-a2f713dea5/bin/julia --machinefile $SLURM_NODEFILE /home/crackauc/projectCode/ImprovedSRK/Optimization/integrationTestComet.jl

The result was the following:

| Row | Category    | Benchmark   | Iterations | TotalWall | AverageWall |
|-----|-------------|-------------|------------|-----------|-------------|
| 1   | "SingleGPU" | "SingleGPU" | 1000       | 26.5858   | 0.0265858   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0440866 | 0.0260225 | "2016-02-27 17:28:41" |
 
| Row | JuliaHash                                  | CodeHash | OS      |
|-----|--------------------------------------------|----------|---------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Linux" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 24       |
 
1x12 DataFrames.DataFrame
| Row | Category   | Benchmark  | Iterations | TotalWall | AverageWall |
|-----|------------|------------|------------|-----------|-------------|
| 1   | "MultiGPU" | "MultiGPU" | 1000       | 22.3377   | 0.0223377   |
 
| Row | MaxWall   | MinWall   | Timestamp             |
|-----|-----------|-----------|-----------------------|
| 1   | 0.0458176 | 0.0219705 | "2016-02-27 17:29:06" |
 
| Row | JuliaHash                                  | CodeHash | OS      |
|-----|--------------------------------------------|----------|---------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Linux" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 24       |
1x12 DataFrames.DataFrame
| Row | Category             | Benchmark            | Iterations | TotalWall |
|-----|----------------------|----------------------|------------|-----------|
| 1   | "MultiGPU2-Balanced" | "MultiGPU2-Balanced" | 1000       | 22.4148   |
 
| Row | AverageWall | MaxWall   | MinWall   | Timestamp             |
|-----|-------------|-----------|-----------|-----------------------|
| 1   | 0.0224148   | 0.0549695 | 0.0220449 | "2016-02-27 17:29:29" |
 
| Row | JuliaHash                                  | CodeHash | OS      |
|-----|--------------------------------------------|----------|---------|
| 1   | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" | NA       | "Linux" |
 
| Row | CPUCores |
|-----|----------|
| 1   | 24       |
1x12 DataFrames.DataFrame
| Row | Category               | Benchmark              | Iterations |
|-----|------------------------|------------------------|------------|
| 1   | "MultiGPU2-Unbalanced" | "MultiGPU2-Unbalanced" | 1000       |
 
| Row | TotalWall | AverageWall | MaxWall   | MinWall   |
|-----|-----------|-------------|-----------|-----------|
| 1   | 22.6683   | 0.0226683   | 0.0568948 | 0.0224583 |
 
| Row | Timestamp             | JuliaHash                                  |
|-----|-----------------------|--------------------------------------------|
| 1   | "2016-02-27 17:29:53" | "a2f713dea5ac6320d8dcf2835ac4a37ea751af05" |
 
| Row | CodeHash | OS      | CPUCores |
|-----|----------|---------|----------|
| 1   | NA       | "Linux" | 24       |

Notice that there is a performance gain from going to multiple GPUs, but again we lose performance by unbalancing (and again, running this a few times shows that the two balanced versions give the same result within random error). Still, the speedup from going to multiple GPUs is not large here; the difference is bigger for bigger meshes. This shows you that multiple GPUs are really good for tackling larger problems, though in general if your problem is “not too big”, then going to multiple GPUs will only be a mild gain (remember, you want each processor to run enough calculations for it to be worthwhile!).

Lastly, notice that there is not much of a performance difference between our consumer GPUs and the Tesla! This is for two reasons. One, we are using single precision. The Tesla and GTX lines are almost the same when it comes to single-precision calculations; we would see a huge (~32x) difference for double-precision calculations. Two, the Tesla has ECC memory, which slows it down a little but makes sure it does not have errors in its memory. Beyond that, the big advantage of a Tesla K80 is that it has a “huge” amount of memory compared to other GPUs. The GTX 970 has ~3.6 GB (due to NVIDIA’s lies…) whereas the Tesla K80 has 24 GB. This means that if the problem is single precision and fits on your GTX, you’re fine. Otherwise, the Tesla is required.

Where do we go from here?

We started with an HPC allocation. We got Julia running on the HPC, and we set up a parallel function to use multiple nodes. Then, realizing how parallel our problem is, we took it to GPUs, first one GPU and then multiple, with a setup that works for as many GPUs as you need. Our implementation is centered on a kernel which simply evaluates a function along a vector, i.e. a vectorized function call. Thus very similar code to this works for all vectorized functions. What else do we need?

One obvious generalization is higher-dimensional arrays. If you need to do vectorized calls on matrices etc., you can just flatten them in Julia and send them to the GPU in that form. If you want to keep the matrix form in the CUDA kernel, you just have to play with indices, but the same approach we used here works.
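
For instance, a minimal sketch of the flatten-and-restore bookkeeping on the Julia side (the names are illustrative and the GPU call itself is elided):

A = rand(Float32, 100, 200)
v = vec(A)                # column-major flatten; this vector is what goes to the GPU
r = similar(v)            # pretend the kernel filled this with per-point results
R = reshape(r, size(A))   # restore the matrix shape on the host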

One place to go from here is to start using Float64’s (doubles). Remember, on most GPUs they are really slow, but on Tesla GPUs like those in Comet they are still quite fast. All you have to do is change the array declarations to Float64’s (and the float’s in the kernel to double’s) and you have double precision.

This means that you can basically code whatever vectorized calls you need to do. What about harder problems? For those, check out the libraries. cuBLAS is a BLAS implementation for doing linear algebra on the GPU. It contains cuBLAS-XT, which automatically scales cuBLAS to multiple GPUs. Another linear algebra tool is cuSolver, with GPU methods for solving the linear equation Ax=b. You can use these alongside your CUDA kernels.

There are also Julia packages for making this easier. There’s cuBLAS.jl, ArrayFire.jl, etc., all for making GPU calculations easier. These do not require you to write/compile CUDA kernels. I may write a tutorial on this with some experiments, but it should be pretty easy given what you know now.

At this point you should be pretty capable with GPUs. My next focus will be on Xeon Phis. I will have to write a multigrid solver soon, so I may write one on the GPU and also on the Xeon Phi (called from Julia) and see the performance difference. Since I also have an allocation on LSU’s SuperMIC, I will also be looking for good ways to use the Xeon Phi at the same time as the GPU (they have hybrid nodes!) and to understand which problems are better on which device. Stay tuned.

Full Julia Code

Again, I am going to put my full Julia code here for reference. I use a bunch of values to calculate the coefficients as they pertain to my real problem, but again you can just swap out the getCoefficients call with any function that returns an array of 36 Float32’s (e.g. rand(Float32,36)).

function integrationTest()
  A0 = [0 0 0 0
      1 0 0 0
      1/4 1/4 0 0
      0 0 0 0];
  A1 = [0 0 0 0
      1/4 0 0 0
      1 0 0 0
      0 0 1/4 0];
  B0 = [0 0 0 0
       0 0 0 0
      1 1/2 0 0
      0 0 0 0];
  B1 = [0 0 0 0
      -1/2 0 0 0
      1 0 0 0
      2 -1 1/2 0];
 
  α = [1/6 1/6 2/3 0];
 
  β1 = [-1 4/3 2/3 0];
  β2 = [1 -4/3 1/3 0];
  β3 = [2 -4/3 -2/3 0];
  β4 = [-2 5/3 -2/3 1];
 
  dx = 1/400
 
 
 
  imin = -12
  imax = 1
  jmin = -4
  jmax = 4
  coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1,β2,β3,β4)
  iarr = convert(Vector{Float32},collect(imin:dx:imax))
  jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
  sizei = length(imin:dx:imax)
  sizej = length(jmin:dx:jmax)
  totArea = ((imax-imin)*(jmax-jmin)/(sizei*sizej));
 
  #=
  function coefFunc()
    getCoefficients(A0,A1,B0,B1,α,β1,β2,β3,β4)
  end
  dfcoef = benchmark(coefFunc,"coef",10)
  println(dfcoef)
  =#
 
  #=
  #Some SIMD
  function f1()
    res = @sync @parallel (+) for i = imin:dx:imax
      tmp = 0
       for j=jmin:dx:jmax
        ans = 0
        @simd for k=1:36
          @inbounds ans += coefs[k]*(i^(powz[k]))*(j^(poww[k]))
        end
        tmp += abs(ans)<1
      end
      tmp
    end
    res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
    println(res)#Mathematica gives 2.98
  end
  df1 = benchmark(f1,"CPUSimp",2)
  println(df1)
  =#
  #=
  ## Full unravel
  function f2()
    res = @sync @parallel (+) for i = imin:dx:imax
        tmp = 0
        isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
        isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
        @simd for j=jmin:dx:jmax
          jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
          @inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
          + coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
          coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
          coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
          coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
        end
        tmp
      end
      res = res*totArea
      println(res)#Mathematica gives 2.98
  end
  df2 = benchmark(f2,"CPU",2)
  println(df2)
  =#
 
  #=
  #GPU
  cudaCores = 2496;
  equalDiv = sizei*sizej÷cudaCores + 1
  CUDArt.init(devices(dev->true))
  dev = device(0)
  md = CuModule("integrationWin.ptx",false)
  integrationFunc = CuFunction(md,"integration")
  g_iarr = CudaArray(iarr)
  g_jarr = CudaArray(jarr)
  g_tmp = CudaArray(Int32,cudaCores)
  function f3()
    g_coefs = CudaArray(coefs)
    launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,)); res = sum(to_host(g_tmp));
    res = res*totArea
    #println(res)
  end
  df3 = benchmark(f3,"GPU",10)
  println(df3)
  CUDArt.close(devices(dev->true))
  =#
 
  #Single-GPU
  cudaCores = [2496];
  #cudaCores = [1664,1]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = sizei*sizej÷totCores + 1
  portions = equalDiv*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f4()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    println(res)
  end
  df4 = benchmark(f4,"SingleGPU",1000)
  println(df4)
  CUDArt.close(devices(dev->true))
 
  #Multi-GPU
  cudaCores = [2496,2496];
  #cudaCores = [1664,1]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = sizei*sizej÷totCores + 1
  portions = equalDiv*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f4()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv,startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    println(res)
  end
  df4 = benchmark(f4,"MultiGPU",1000)
  println(df4)
  CUDArt.close(devices(dev->true))
 
 
  #= Hardware Bias Spec.
  cudaCores = [1664,640];
  trueBias = [.75,.25]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  coreBias = cudaCores ./ totCores
  workloadBias = trueBias./(coreBias)
  equalDiv = ceil(Int32,sizei*sizej/totCores .* workloadBias)
  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  =#
 
  ##Heterogenous Multi-GPU
  # trueBias = coreBias.* workloadBias*numCards
  cudaCores = [2496,2496]
  trueBias = [.5,.5]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = Vector{Int32}(numCards)
  for i = 1:numCards
    equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
  end
 
  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f5()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv[i],startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end
  df5 = benchmark(f5,"MultiGPU2-Balanced",1000)
  println(df5)
  CUDArt.close(devices(dev->true))
 
 
 
  ##Heterogenous Multi-GPU
  # trueBias = coreBias.* workloadBias*numCards
  cudaCores = [2496,2496];
  trueBias = [.55,.45]
  numCards = length(cudaCores)
  totCores = sum(cudaCores)
  equalDiv = Vector{Int32}(numCards)
  for i = 1:numCards
    equalDiv[i] = ceil(Int32,(sizei*sizej)/dot(trueBias/(trueBias[i]),cudaCores))
  end
 
  portions = equalDiv.*cudaCores
  if numCards > 1
    startIdx = [0 cumsum(portions)[1:end-1]]
  else
    startIdx = [0]
  end
  g_iarr = Vector{Any}(numCards)
  g_jarr = Vector{Any}(numCards)
  g_tmp = Vector{Any}(numCards)
  g_coefs = Vector{Any}(numCards)
  ans = Vector{Int32}(numCards)
  integrationFuncs = Vector{Any}(numCards)
  CUDArt.init(devices(dev->true))
  devlist = devices(dev->true)
  streams = [(device(dev); Stream()) for dev in devlist]
  for i = 1:numCards
    device(i-1)
    g_iarr[i]  = CudaArray(iarr)
    g_jarr[i]  = CudaArray(jarr)
    g_tmp[i]   = CudaArray(Int32,cudaCores[i])
    md = CuModule("integration.ptx",false)
    integrationFuncs[i] = CuFunction(md,"integration")
  end
 
  function f5()
    @sync begin
      for i = 1:numCards
        @async begin
          dev = device(i-1)
          g_coefs[i] = CudaArray(coefs)
          launch(integrationFuncs[i], cudaCores[i], 64, (g_coefs[i], g_iarr[i], g_jarr[i], sizei,sizej,equalDiv[i],startIdx[i],g_tmp[i],),stream=streams[i]);
          wait(streams[i])
          ans[i] = sum(to_host(g_tmp[i]))
        end
      end
    end
    res = sum(ans)*totArea
    #println(res)
  end
  df5 = benchmark(f5,"MultiGPU2-Unbalanced",1000)
  println(df5)
  CUDArt.close(devices(dev->true))
end
 
using CUDArt
using Benchmark
include("getCoefs.jl")
integrationTest()


Tutorial for Julia on the HPC with GPUs

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/julia-on-the-hpc-with-gpus/

This is a continuation of my previous post on using Julia on the XSEDE Comet HPC. Check that out first for an explanation of the problem. In that problem, we wished to solve for the area of a region where a polynomial was less than 1, which was calculated by code like:

@time res = @sync @parallel (+) for i = imin:dx:imax
  tmp = 0
  isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
  isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
  @simd for j=jmin:dx:jmax
    jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
    jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
    @inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
    + coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
    coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
    coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
    coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
  end
  tmp
end
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

Notice that this code is massively parallel: at every point we do something and we just sum the values. This means it’s a perfect problem for the GPU. I am going to show you how to do this.

Getting CUDArt Setup

We will be using the CUDArt.jl package. I first had this working with the CUDA.jl package but was informed that going forward we should be using the CUDArt.jl package. There’s really not much of a difference in my implementation.

To get started, you have to install an appropriate version of the CUDA toolkit from NVIDIA. Note that you have to have an NVIDIA GPU in order to do this. I will be using a GTX 970 on one computer (Windows), and a GTX 980Ti on another. Comet has a queue with 2x Tesla K80s, but that’s probably overkill and you should try to develop locally first. Once you have all of that set up, CUDArt.jl should compile some .ptx files upon first use. If it does, you’re good to go.

CUDArt.jl’s page shows you how to get started. However, I am going to add a little caveat. In reality, this is the objective function in a machine learning problem, and so we wish to be able to recompute this as quickly as possible just by swapping out the coefficient array coefs. Therefore all of the other arrays must persist on the GPU.

Here’s what we’re going to do. To maximize performance on the GPU we will use Float32’s. On most GPUs (all except the Teslas and the Titan Black), the double-precision floating point capabilities are SEVERELY gimped in comparison (1/32 the performance…). With that in mind, we set up our calculations as follows:

## Setup CPU-side parameters
imin = -8
imax = 1
jmin = -3
jmax = 3
@time coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1234)
iarr = convert(Vector{Float32},collect(imin:dx:imax))
jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
sizei = length(imin:dx:imax)
sizej = length(jmin:dx:jmax)
cudaCores = 1664;
equalDiv = sizei*sizej÷cudaCores + 1
 
##GPU Code
CUDArt.init(devices(dev->true)) #Initiate the devices
dev = device(0) #Set the GPU to the first GPU
#Move the arrays over
g_iarr = CudaArray(iarr)
g_jarr = CudaArray(jarr)
g_coefs = CudaArray(coefs)
g_tmp = CudaArray(Int32,cudaCores)

Note that if you have multiple GPUs and don’t know which one is which device, you can use the command CUDArt.name(CUDArt.device_properties(number)). Here I have CUDArt.name(CUDArt.device_properties(0)) return “GeForce GTX 970” and CUDArt.name(CUDArt.device_properties(1)) returns “GeForce GTX 750 Ti”, so I use device(0) to use the 970. Note that at any time you can change the device with this command, but be careful since the CudaArrays are stored on the device that you call them from. If you want to use multiple GPUs, you will also need to split up your arrays.
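Here is a little sketch that prints every device number next to its name so you can pick the right one (again assuming devices(dev->true) gives back the matching device numbers):

# Print index => name for every device, then select the one you want.
for d in devices(dev->true)
  println(d, " => ", CUDArt.name(CUDArt.device_properties(d)))
end
dev = device(0)   # e.g. pick the GTX 970 from the listing above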

The CUDA Kernel

Now the last thing I need to do is make my function. To do this, we have to go to C. Here is the CUDA kernel for the calculation:

// filename: integration.cu
// Performs the inner integration loop
extern "C"
{
    __global__ void integration(const float *coefs, const float *iArr, const float *jArr, const int sizei, const int sizej, const int equalDiv, int *tmp)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        int loopInd;
        float i, j, isq2, isq3, isq4, isq5, isq6, isq7, isq8, jsq2, jsq3, jsq4, jsq5, jsq6, jsq7, jsq8;
        int ans = 0;
        for(loopInd=0;loopInd<equalDiv;loopInd=loopInd+1){
          if(index*equalDiv+loopInd >= sizei*sizej){
            break; /* past the end of the grid; stop before indexing the arrays */
          }
          i = iArr[(index*equalDiv+loopInd)/sizej];
          j = jArr[(index*equalDiv+loopInd)%sizej];
          if((index*equalDiv+loopInd)%sizej==0 || loopInd==0){
            isq2 = i*i;
            isq3 = i*isq2;
            isq4 = isq2*isq2;
            isq5 = i*isq4;
            isq6 = isq4*isq2;
            isq7 = i*isq6;
            isq8 = isq4*isq4;
          }
          jsq2 = j*j;
          jsq3 = j*jsq2;
          jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4;
          jsq6 = jsq2*jsq4;
          jsq7 = j*jsq6;
          jsq8 = jsq4*jsq4;
          /* tmp[index*1878+loopInd] = index*1878+loopInd; */
          /* coefs is indexed from zero below since C is 0-based */
          ans = ans + (abs(coefs[0]*(jsq2) + coefs[1]*(jsq3) + coefs[2]*(jsq4) + coefs[3]*(jsq5) + coefs[4]*jsq6 + coefs[5]*jsq7 + coefs[6]*jsq8 + coefs[7]*(i) + coefs[8]*(i)*(jsq2) + coefs[9]*i*jsq3 + coefs[10]*(i)*(jsq4) + coefs[11]*i*jsq5 + coefs[12]*(i)*(jsq6) + coefs[13]*i*jsq7 + coefs[14]*(isq2) + coefs[15]*(isq2)*(jsq2) + coefs[16]*isq2*jsq3 + coefs[17]*(isq2)*(jsq4) + coefs[18]*isq2*jsq5 + coefs[19]*(isq2)*(jsq6) + coefs[20]*(isq3) + coefs[21]*(isq3)*(jsq2) + coefs[22]*isq3*jsq3 + coefs[23]*(isq3)*(jsq4) + coefs[24]*isq3*jsq5 + coefs[25]*(isq4) + coefs[26]*(isq4)*(jsq2) + coefs[27]*isq4*jsq3 + coefs[28]*(isq4)*(jsq4) + coefs[29]*(isq5) + coefs[30]*(isq5)*(jsq2) + coefs[31]*isq5*jsq3+ coefs[32]*(isq6) + coefs[33]*(isq6)*(jsq2) + coefs[34]*(isq7) + coefs[35]*(isq8))<1);
        }
        tmp[index] = ans;
    }
}

Okay, that’s not nice looking at all, but it’s the same idea as the Julia script. Let me explain this code a little bit. Our function is integration. It takes in all of the variables and an array tmp which we will save the output to. The first line calculates a unique index for each CUDA core, and then we set up our variables. Recall that equalDiv was computed so that the work divides evenly: each core loops equalDiv times (with a bounds check, since the division will not necessarily come out perfectly), so every CUDA core calculates the same number of (i,j) pairs. We then use the calculated index and the loop index to get our $i$ and $j$. We index down $j$ first, then go by $i$’s, so the ordering is (0,0), (0,1), …, (0,J), (1,0), …. We achieve this by doing integer division by the size of the $j$ array (truncation makes it an integer), and we get the $j$ index by taking the modulus with respect to the size of the $j$ array. Using these $i$’s and $j$’s, we calculate the inner part of the loop. Notice that the indices into coefs had to be shifted because C uses 0-based indexing (as opposed to Julia’s 1-based indexing). Note that we can use the abs function from the CUDA math library; all of the CUDA math library functions are available.
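If the index arithmetic looks opaque, here is a host-side Julia sketch of the same mapping (0-based, like the C code) that checks that equalDiv covers the whole grid and that every (i,j) is computed exactly once; the sizes passed in at the bottom are just illustrative:

# Host-side check of the kernel's flattened-index arithmetic (0-based, like C).
# Each of cudaCores threads handles equalDiv consecutive flat indices k; the
# kernel then reads iArr[k ÷ sizej] and jArr[k % sizej].
function check_mapping(sizei, sizej, cudaCores)
  equalDiv = sizei*sizej ÷ cudaCores + 1
  hits = zeros(Int, sizei*sizej)
  for index in 0:cudaCores-1, loopInd in 0:equalDiv-1
    k = index*equalDiv + loopInd
    k >= sizei*sizej && continue   # past the end of the grid (the kernel breaks here)
    hits[k+1] += 1                 # back to 1-based for the Julia array
  end
  @assert all(hits .== 1)          # every grid point hit exactly once
  println("covered all ", sizei*sizej, " grid points")
end
check_mapping(901, 601, 1664)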

[Note that when doing CUDA programming you want to allocate as little memory as possible. This is still on the small side so it worked out to be the optimal code. However, in some cases the non-unraveled version may be better due to the overhead of memory allocation.]

So there’s a template for you to play around with. When all of that is in order, we have to compile the kernel. On Windows with Visual Studio, I had to find out where cl.exe was (for me it was in Visual Studio 12’s directory) and tell nvcc its location. The total command was:

nvcc -ptx integration.cu -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin"

On Linux it was simpler:

nvcc -ptx integration.cu

From this you get .ptx files. To use the function in Julia, we then use the following code:

md = CuModule("integrationWin.ptx",false)
integrationFunc = CuFunction(md,"integration")
@time launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,)); res = sum(to_host(g_tmp));
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

What this does is take the “integration” function out of the .ptx and save it as integrationFunc. I can then call it at any time with the launch command. The second and third arguments are the grid and block sizes. In CUDA you always want to “overload”/overthread the cores so that there is no downtime. What I have found in my tests is that the best setting is to make the grid size the number of CUDA cores (since in our case the threads aren’t communicating) and to overload it by setting the block size to 64. Why 64? That worked out best in testing. Change the numbers around and see what works best for you.
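If you want to tune this for your own card, a quick sweep like the following sketch is enough; it just reuses the names defined above, and the candidate block sizes are guesses:

# Time the kernel at a few block sizes and see which is fastest.
for blockSize in (32, 64, 128, 256)
  t = @elapsed begin
    launch(integrationFunc, cudaCores, blockSize,
           (g_coefs, g_iarr, g_jarr, sizei, sizej, equalDiv, g_tmp,))
    sum(to_host(g_tmp))   # copying back to the host waits for the kernel
  end
  println("block size ", blockSize, ": ", t, " seconds")
end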

The last argument to launch is the tuple of arguments for our integration function. Notice that there is a trailing comma; this is required. After the computation is finished, the results have been saved directly into g_tmp. Thus we send g_tmp to the host and sum up the results as before. We re-scale by the area of integration and that is the result.

Now in the actual objective function, what I want to do is update the coefficients to a new coefficients array, perform this calculation, and then move on. To do this I simply use the code:

    if gpuEnabled
      g_coefs = CudaArray(coefs)
      launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,))
      res = sum(to_host(g_tmp))
      free(g_coefs)
    else ...

Notice I send coefs over to the GPU, run the kernel to get the answer, and then free up the GPU memory. So in my code I loop over this many times, changing coefs every time. When the entire computation is done, I finish with the command

CUDArt.close(devices(dev->true))

This will clear the GPU.
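Putting those pieces together, the whole evaluation can be wrapped in a small helper like the sketch below; gpu_area is a hypothetical name, and it assumes g_iarr, g_jarr, g_tmp, integrationFunc, and the sizing constants from earlier are already set up:

# Hypothetical wrapper: evaluate the area for a new coefficient vector.
function gpu_area(coefs)
  g_coefs = CudaArray(convert(Vector{Float32}, coefs))   # send the new coefs over
  launch(integrationFunc, cudaCores, 64,
         (g_coefs, g_iarr, g_jarr, sizei, sizej, equalDiv, g_tmp,))
  res = sum(to_host(g_tmp))
  free(g_coefs)   # release the GPU memory before the next evaluation
  return res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
end

The optimizer then just calls gpu_area(coefs) each time it proposes new coefficients.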

Comet?

These same steps will work on Comet. You will need to go into an interactive job to compile, but it should work fine. The job script to run Julia on the GPU node is:

#!/bin/bash
#SBATCH -A <account>
#SBATCH --job-name="jOpt"
#SBATCH --output="output/jOpt.%j.%N.out"
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=6
#SBATCH -t 48:00:00
export SLURM_NODEFILE=`generate_pbs_nodefile`
/home/crackauc/julia-a2f713dea5/bin/julia --machinefile $SLURM_NODEFILE /home/crackauc/projectCode/ImprovedSRK/Optimization/cometDriver.jl

Here I request a single node with six tasks for the CPU-side computations, tell it to use the GPU queue with -p gpu, and ask for only one GPU resource with --gres=gpu:1. If you want to use multiple GPUs, just change that number. However, you will also want to split up the arrays amongst the multiple GPUs, change equalDiv to account for each CUDA core doing fewer calculations, send an integer to the GPU to tell it which GPU it is, and use that integer to change the indexing so that no calculations are repeated. That sounds like a good exercise.
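Since the kernel pulls $i$ from whatever iArr you hand it, one way to do that exercise without touching the kernel at all is to split the i array between the devices and add the partial sums. Here is a rough (sequential) sketch of that idea for two GPUs, assuming CUDArt.init(devices(dev->true)) has been called and reusing the host arrays from before:

# Split the i-range across two GPUs and add the two partial indicator sums.
half = length(iarr) ÷ 2
chunks = (iarr[1:half], iarr[half+1:end])
partial = zeros(2)
for (n, chunk) in enumerate(chunks)
  device(n-1)                                  # device 0 gets the first half, etc.
  md = CuModule("integration.ptx", false)      # use whichever .ptx you compiled
  integrationFunc = CuFunction(md, "integration")
  sizei_n = length(chunk)
  equalDiv_n = sizei_n*sizej ÷ cudaCores + 1   # fewer points per core now
  g_i = CudaArray(chunk); g_j = CudaArray(jarr)
  g_c = CudaArray(coefs); g_t = CudaArray(Int32, cudaCores)
  launch(integrationFunc, cudaCores, 64,
         (g_c, g_i, g_j, sizei_n, sizej, equalDiv_n, g_t,))
  partial[n] = sum(to_host(g_t))
  free(g_i); free(g_j); free(g_c); free(g_t)
end
res = sum(partial)*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))

For truly simultaneous launches you would wrap this in the worker-based pattern from the previous post.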

The post Tutorial for Julia on the HPC with GPUs appeared first on Stochastic Lifestyle.