
My quest to responsive visualizations with Julia

By: Simon Danisch

Re-posted from: http://randomfantasies.com/2016/09/my-quest-to-responsive-visualizations-with-julia/

Responsive and fun interactivity is notoriously hard to achieve! But it’s also the key to less frustration and more patience when working on a project.

There are countless important problems that humanity needs to solve. To become less of a burden on the environment and build a better society, we constantly need to advance the state of the art.

Still, people give up when the initial supply of patience, motivation and curiosity is depleted without new incentives arriving.

This actually happened to me with my 3D modeling hobby a few years ago. Not only was the 3D software difficult to learn, it also turned unresponsive fairly quickly, even on a powerful machine.

Just as an example, rendering a 30-second clip in high quality took ~2 weeks with my PC running day and night. This slowness drained my patience, and at some point I decided that this was too much hassle for a hobby.

Half a year later, I ran into a similar problem when working on a computer vision pipeline. We simply couldn’t find frameworks which were fast enough for real time processing while maintaining high programming productivity. This time I didn’t give up, but the job could have been much more fun and productive!

At this point I decided that I wanted to work on software that turns really tough problems, like object recognition of 3D models, into weekend projects.

I started working on GLVisualize, a library for fast and interactive graphics written in Julia.

Why graphics?

Visualizing something is the first step to understanding, and it allows us to explore huge problem spaces that were invisible before.

But current tools seem to be split into 2D and 3D libraries, and into high-performance libraries that are a hassle to use versus easy-to-use libraries that quickly turn into an unresponsive mess.

So with my background, this seemed like a worthy start!

Why Julia?

While there are several reasons for choosing Julia, today I want to show you its impressive speed despite being an easy-to-use dynamic language.

Let me show you the benefit with two toy examples (all graphics including the plots were produced with GLVisualize!).

Consider the famous Lorenz attractor step function.

It takes a point t0 and parameters a to d and returns a new point. Can you imagine what kind of line this will draw when it is applied recursively to its own result while the parameters change?
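The original step function isn't reproduced here, but a minimal sketch of what such a function might look like (one Euler step of the Lorenz system; the parameter names and the role of d as the step size are my assumptions, the original example code is linked below) would be:

# Sketch only: parameter names and the step-size convention are assumptions.
function lorenz_step(t0, a, b, c, d)
    x, y, z = t0
    (x + d * a * (y - x),
     y + d * (x * (b - z) - y),
     z + d * (x * y - c * z))
end

# Applying it recursively traces out the attractor:
function trajectory(p0, n, a, b, c, d)
    points = [p0]
    p = p0
    for _ in 1:n
        p = lorenz_step(p, a, b, c, d)
        push!(points, p)
    end
    points
end

points = trajectory((0.1f0, 0f0, 0f0), 100_000, 10f0, 28f0, 8f0/3f0, 0.01f0)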

I guess no one has such a powerful brain, which is why we visualize these kinds of things, together with sliders to explore the function:

[Animation: Lorenz attractor rendered with GLVisualize]

[Link: example code]

These are 100,000 points, and as you can see GLVisualize can deal with that amount easily.

Let’s see how long we could have stayed at pleasant interactive speeds in another language.

Let’s take Python, for example, a language of comparable usability:

 

[Plot: Julia vs. Python timings for the Lorenz benchmark]

 

minimal speedup 70.4680453961563
maximal speedup 148.02061279172827
max seconds Julia 0.193422334
max seconds Python 13.630093812942505

As you can see, computation times jump beyond one second at around 10⁵ points for Python, while Julia stays interactive with up to 10⁷ points.
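The benchmark script itself isn't shown in the post, but a timing loop in that spirit, reusing the trajectory sketch from above (the exact methodology here is my assumption), could look like:

for n in (10^5, 10^6, 10^7)
    @time trajectory((0.1f0, 0f0, 0f0), n, 10f0, 28f0, 8f0/3f0, 0.01f0)
end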

So if you’re interested in higher resolution in Python, you’d better crank up your patience or call out to C, abruptly losing all of Python’s convenience!

The next example is a 3D mandelbulb visualization:

[Animation: mandelbulb computed on the GPU]

[Link: example code]

One step through the function takes around 24 seconds with Julia on the CPU, which makes it fairly painful to explore the parameters.

So why is the animation shown above still smooth? Well, when choosing Julia, I was betting on the fact that it should be fairly simple to run Julia code on the GPU. This bet is now starting to become reality.

I’m using CUDAnative and GPUArrays to speed up the calculation of the mandelbulb.

CUDAnative allows you to run Julia code on the GPU while GPUArrays offers a simpler array interface for this task.
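To give an idea of the programming model, here is a rough, self-contained sketch of the same broadcasting pattern written against today's CUDA.jl package (the successor of CUDAnative); the 2D escape-time toy below stands in for the actual mandelbulb code:

using CUDA

# Plain scalar Julia function; broadcasting it over a CuArray compiles it into a GPU kernel.
function escape_time(c, maxiter)
    z = zero(c)
    for i in 1:maxiter
        z = z * z + c
        abs2(z) > 4f0 && return Int32(i)
    end
    return Int32(maxiter)
end

xs = range(-2f0, 1f0, length = 500)
ys = range(-1.5f0, 1.5f0, length = 500)
grid = CuArray([ComplexF32(x, y) for y in ys, x in xs])  # the grid lives on the GPU

iters = escape_time.(grid, 50)   # one fused GPU kernel over all points
Array(iters)                     # copy the result back to the CPU for plotting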

Here are the benchmark results:

[Plot: CPU vs. CUDA timings for the mandelbulb]

minimal speedup 37.05708590791309
maximal speedup 401.59165062897495
max seconds cpu 24.275534426
max seconds cuda 0.128474155

This means that by running Julia on the GPU we can still enjoy a smooth interaction with this function.

But the problem is actually so hard that I can’t run this interactively at the resolution I would like.

As you can see, the isosurface visualization still looks very coarse. So even with state-of-the-art software, you still run into problems that won’t compute in under a second.

But with Julia you can at least squeeze the last bit of performance out of your hardware, be it CPU or GPU, while enjoying the comfort of a dynamic high-level language!

I’m sure that with Julia and advances in hardware, algorithms and optimizations, we can soon crack even the hardest problem with ease!

 


Benchmarking system:

RAM: 32GB

CPU: Intel® Core™ i7-6700 CPU @ 3.40GHz × 8

GPU: GeForce GTX 950

 

Tutorial for Julia on the HPC with GPUs

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/julia-on-the-hpc-with-gpus/

This is a continuation of my previous post on using Julia on the XSEDE Comet HPC. Check that out first for an explanation of the problem. In that problem, we wished to solve for the area of a region where a polynomial was less than 1, which was calculated by code like:

# imin, imax, jmin, jmax, dx, and coefs are set up earlier (see the previous post)
@time res = @sync @parallel (+) for i = imin:dx:imax
  tmp = 0
  isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
  isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
  @simd for j=jmin:dx:jmax
    jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
    jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
    @inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
    + coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
    coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
    coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
    coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
  end
  tmp
end
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

Notice that this code is massively parallel: at every grid point we evaluate the same test and simply sum the results. This means it’s a perfect problem for the GPU. I am going to show you how to do this.

Getting CUDArt Setup

We will be using the CUDArt.jl package. I first had this working with the CUDA.jl package but was informed that going forward we should be using the CUDArt.jl package. There’s really not much of a difference in my implementation.

To get started, you have to go here and install an appropriate version of CUDA. Note that you have to have an NVIDIA GPU in order to do this. I will be using a GTX970 on one computer (Windows), and a GTX980Ti on another. Comet has a queue with 2x Tesla K80s, but that’s probably overkill and you should try to develop locally first. Once you have all of that set up, CUDArt.jl should compile some .ptx files upon first use. If it does, you’re good to go.

CUDArt.jl’s page shows you how to get started. However, I am going to add a little caveat. In reality, this is the objective function in a machine learning problem, and so we wish to be able to recompute this as quickly as possible just by swapping out the coefficient array coefs. Therefore all of the other arrays must persist on the GPU.

Here’s what we’re going to do. To maximize performance on the GPU, we will use Float32s. On most GPUs (all except the Teslas and the Titan Black), double precision floating point capability is SEVERELY gimped in comparison (1/32 the performance…). With that in mind, we set up our calculations as follows:

## Setup CPU-side parameters
imin = -8
imax = 1
jmin = -3
jmax = 3
@time coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1234)
iarr = convert(Vector{Float32},collect(imin:dx:imax))
jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
sizei = length(imin:dx:imax)
sizej = length(jmin:dx:jmax)
cudaCores = 1664;
equalDiv = sizei*sizej÷cudaCores + 1
 
##GPU Code
using CUDArt #Load the CUDA runtime wrapper package
CUDArt.init(devices(dev->true)) #Initiate the devices
dev = device(0) #Set the GPU to the first GPU
#Move the arrays over
g_iarr = CudaArray(iarr)
g_jarr = CudaArray(jarr)
g_coefs = CudaArray(coefs)
g_tmp = CudaArray(Int32,cudaCores)

Note that if you have multiple GPUs and don’t know which one is which device, you can use the command CUDArt.name(CUDArt.device_properties(number)). Here I have CUDArt.name(CUDArt.device_properties(0)) return “GeForce GTX 970” and CUDArt.name(CUDArt.device_properties(1)) returns “GeForce GTX 750 Ti”, so I use device(0) to use the 970. Note that at any time you can change the device with this command, but be careful since the CudaArrays are stored on the device that you call them from. If you want to use multiple GPUs, you will also need to split up your arrays.
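For example, a quick way to print every visible device and its index is the loop below (a sketch, assuming devices(dev->true) returns the list of device numbers, as it is used in the init call above):

for d in devices(dev->true)
    println(d, " => ", CUDArt.name(CUDArt.device_properties(d)))
end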

The CUDA Kernel

Now the last thing I need to do is make my function. To do this, we have to go to C. Here is the CUDA kernel for the calculation:

// filename: integration.cu
// Performs the inner integration loop
extern "C"
{
    __global__ void integration(const float *coefs, const float *iArr, const float *jArr, const int sizei, const int sizej, const int equalDiv, int *tmp)
    {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        int loopInd;
        float i, j, isq2, isq3, isq4, isq5, isq6, isq7, isq8, jsq2, jsq3, jsq4, jsq5, jsq6, jsq7, jsq8;
        int ans = 0;
        for(loopInd=0;loopInd<equalDiv;loopInd=loopInd+1){
          if(index*equalDiv+loopInd >= sizei*sizej){
            break; /* this chunk runs past the grid, so stop before reading out of bounds */
          }
          i = iArr[(index*equalDiv+loopInd)/sizej];
          j = jArr[(index*equalDiv+loopInd)%sizej];
          if((index*equalDiv+loopInd)%sizej==0 || loopInd==0){
            isq2 = i*i;
            isq3 = i*isq2;
            isq4 = isq2*isq2;
            isq5 = i*isq4;
            isq6 = isq4*isq2;
            isq7 = i*isq6;
            isq8 = isq4*isq4;
          }
          jsq2 = j*j;
          jsq3 = j*jsq2;
          jsq4 = jsq2*jsq2;
          jsq5 = j*jsq4;
          jsq6 = jsq2*jsq4;
          jsq7 = j*jsq6;
          jsq8 = jsq4*jsq4;
          /*tmp[index*1878+loopInd]= index*1878+loopInd;*/
		      /* changed to zero indexing */
          ans = ans + (abs(coefs[0]*(jsq2) + coefs[1]*(jsq3) + coefs[2]*(jsq4) + coefs[3]*(jsq5) + coefs[4]*jsq6 + coefs[5]*jsq7 + coefs[6]*jsq8 + coefs[7]*(i) + coefs[8]*(i)*(jsq2) + coefs[9]*i*jsq3 + coefs[10]*(i)*(jsq4) + coefs[11]*i*jsq5 + coefs[12]*(i)*(jsq6) + coefs[13]*i*jsq7 + coefs[14]*(isq2) + coefs[15]*(isq2)*(jsq2) + coefs[16]*isq2*jsq3 + coefs[17]*(isq2)*(jsq4) + coefs[18]*isq2*jsq5 + coefs[19]*(isq2)*(jsq6) + coefs[20]*(isq3) + coefs[21]*(isq3)*(jsq2) + coefs[22]*isq3*jsq3 + coefs[23]*(isq3)*(jsq4) + coefs[24]*isq3*jsq5 + coefs[25]*(isq4) + coefs[26]*(isq4)*(jsq2) + coefs[27]*isq4*jsq3 + coefs[28]*(isq4)*(jsq4) + coefs[29]*(isq5) + coefs[30]*(isq5)*(jsq2) + coefs[31]*isq5*jsq3+ coefs[32]*(isq6) + coefs[33]*(isq6)*(jsq2) + coefs[34]*(isq7) + coefs[35]*(isq8))<1);
        }
        tmp[index] = ans;
    }
}

Okay, that’s not nice looking at all, but it’s the same idea as the Julia script. Let me explain this code a little bit. Our function is integration. It takes in all of the variables and an array tmp which we will save the output to. The first line calculates a unique index for each CUDA core. We then set up our variables. Recall that we computed equalDiv beforehand to determine how many calculations each core should do in order to evenly divide the work. Thus we loop equalDiv times (and check whether we go past the grid, since the division will not necessarily come out perfectly). This means that each CUDA core calculates the same number of (i,j) pairs. We then use the calculated index and the loop index to get our i and j. We index down j first, then go by i’s, so the ordering is (0,0), (0,1), ..., (0,J), (1,0), .... Notice that we achieve this by integer division by the size of the j array (truncation makes it an integer), and we get the j index by taking the flat index modulo the size of the j array. Using these i’s and j’s, we calculate the inner part of the loop. Notice that the indices into coefs had to be shifted because C uses 0-based indexing (as opposed to Julia’s 1-based indexing). Note that we can use the abs function from the CUDA math library; all of the CUDA math library functions are available.
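To make that index mapping concrete, here is a tiny CPU-side Julia check of the same arithmetic (sizej = 4 is just an illustrative value):

# Each flat index k maps to (i-index, j-index) = (k ÷ sizej, k % sizej),
# so j varies fastest: (0,0), (0,1), (0,2), (0,3), (1,0), ...
sizej = 4
for k in 0:9
    println("k = $k  ->  (", k ÷ sizej, ", ", k % sizej, ")")
end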

[Note that when doing CUDA programming you want to allocate as little memory as possible. This is still on the small side so it worked out to be the optimal code. However, in some cases the non-unraveled version may be better due to the overhead of memory allocation.]

So there’s a template for you to play around with. When all of that is in order, we have to compile the kernel. On Windows with Visual Studio, I had to find out where cl.exe was (for me it was in Visual Studio 12’s directory) and tell the compiler its location. The full command was:

nvcc -ptx integration.cu -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin"

On Linux it was simpler:

nvcc -ptx integration.cu

From this you get .ptx files. To use the function in Julia, we then use the following code:

md = CuModule("integrationWin.ptx",false)
integrationFunc = CuFunction(md,"integration")
@time launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,)); res = sum(to_host(g_tmp));
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

What this does is take the “integration” function out of the .ptx and save it as integrationFunc. I can then call it at any time with the launch command. The second and third arguments are the grid and block sizes. In CUDA you always want to “overload”/overthread the cores so that there is no downtime. What I have found in my tests is that the best setting is to set the grid size to the number of CUDA cores (since in our case these aren’t communicating) and to overload it by setting the block size to 64. Why 64? That worked out best in testing. Change the numbers around and see what works best for you.
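If you want to experiment with the arrays already sitting on the GPU, a quick scan might look like the following sketch; the to_host call forces a synchronization so the timing includes the kernel itself, not just the launch:

for blocksize in (32, 64, 128, 256)
    @time begin
        launch(integrationFunc, cudaCores, blocksize,
               (g_coefs, g_iarr, g_jarr, sizei, sizej, equalDiv, g_tmp,))
        to_host(g_tmp)
    end
end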

The last argument to launch is the tuple of arguments for our integration function. Notice that there is a trailing comma; this is required. After the computation is finished, the results have been saved directly into g_tmp. Thus we copy g_tmp back to the host and sum up the results as before. We re-scale by the area per grid point (the total area divided by the number of points) and that is the result.

Now in the actual objective function, what I want to do is swap in a new coefficient array, perform this calculation, and then move on. To do this I simply use the code:

    if gpuEnabled
      g_coefs = CudaArray(coefs)
      launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,))
      res = sum(to_host(g_tmp))
      free(g_coefs)
    else ...

Notice I send coefs over to the GPU, run the kernel to get the answer, and then free up the GPU memory. So in my code I loop over this many times, changing coefs every time. When the entire computation is done, I finish with the command

CUDArt.close(devices(dev->true))

This will clear the GPU.
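Putting the pieces together, the repeated evaluation might look roughly like the following sketch, where candidates is a hypothetical name for whatever sequence of coefficient vectors the optimizer produces (it is not part of the original code):

# Evaluate the objective for each candidate; only the coefficients move to the GPU per call,
# everything else stays resident between evaluations.
function evaluate_candidates(candidates)
    results = Float64[]
    for coefs in candidates
        g_c = CudaArray(convert(Vector{Float32}, coefs))
        launch(integrationFunc, cudaCores, 64,
               (g_c, g_iarr, g_jarr, sizei, sizej, equalDiv, g_tmp,))
        res = sum(to_host(g_tmp)) *
              ((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
        free(g_c)                  # release the coefficient array before the next candidate
        push!(results, res)
    end
    results
end

results = evaluate_candidates(candidates)
CUDArt.close(devices(dev->true))   # clear the GPU once everything is done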

Comet?

These same steps will work on Comet. You will need to go into an interactive job to compile, but it should work fine. The job script to run Julia on the GPU node is:

#!/bin/bash
#SBATCH -A <account>
#SBATCH --job-name="jOpt"
#SBATCH --output="output/jOpt.%j.%N.out"
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --export=ALL
#SBATCH --ntasks-per-node=6
#SBATCH -t 48:00:00
export SLURM_NODEFILE=`generate_pbs_nodefile`
/home/crackauc/julia-a2f713dea5/bin/julia --machinefile $SLURM_NODEFILE /home/crackauc/projectCode/ImprovedSRK/Optimization/cometDriver.jl

Here I am only taking 6 tasks on the node for the other computations, and I tell it to use the GPU queue with -p gpu and request a single GPU with --gres=gpu:1. If you want to use multiple GPUs, just change that number. However, you will then want to split the arrays amongst the GPUs, change equalDiv to account for each CUDA core doing fewer calculations, send an integer to the GPU to tell it which GPU it is, and use that integer to change the indexing so that no calculations are repeated. That sounds like a good exercise.
