# Deep Learning on the New Ubuntu-Based Data Science Virtual Machine for Linux

Authored by Paul Shealy, Senior Software Engineer, and Gopi Kumar, Principal Program Manager, at Microsoft.

Deep learning has received significant attention recently for its ability to create machine learning models with very high accuracy. It’s especially popular in image and speech recognition tasks, where the availability of massive datasets with rich information make it feasible to train ever-larger neural networks on powerful GPUs and achieve groundbreaking results. Although there are a variety of deep learning frameworks available, getting started with one means taking time to download and install the framework, libraries, and other tools before writing your first line of code.

Microsoft’s Data Science Virtual Machine (DSVM) is a family of popular VM images published on the Azure marketplace with a broad choice of machine learning and data science tools. Microsoft is extending it with the introduction of a brand-new offering in this family – the Data Science Virtual Machine for Linux, based on Ubuntu 16.04LTS – that also includes a comprehensive set of popular deep learning frameworks.

Deep learning frameworks in the new VM include:

• Microsoft Cognitive Toolkit
• Caffe and Caffe2
• TensorFlow
• H2O
• MXNet
• NVIDIA DIGITS
• Theano
• Torch, including PyTorch
• Keras

The image can be deployed on VMs with GPUs or CPU-only VMs. It also includes OpenCV, matplotlib and many other libraries that you will find useful.

Run dsvm-more-info at a command prompt or visit the documentation for more information about these frameworks and how to get started.

Sample Jupyter notebooks are included for most frameworks. Start Jupyter or log in to JupyterHub to browse the samples for an easy way to explore the frameworks and get started with deep learning.

#### GPU Support

Training a deep neural network requires considerable computational resources, so things can be made significantly faster by running on one or more GPUs. Azure now offers NC-class VM sizes with 1-4 NVIDIA K80 GPUs for computational workloads. All deep learning frameworks on the VM are compiled with GPU support, and the NVIDIA driver, CUDA and cuDNN are included. You may also choose to run the VM on a CPU if you prefer, and that is supported without code changes. And because this is running on Azure, you can choose a smaller VM size for setup and exploration, then scale up to one or more GPUs for training.

The VM comes with nvidia-smi to monitor GPU usage during training and help optimize parameters to make full use of the GPU. It also includes NVIDIA Docker if you want to run Docker containers with GPU access.

#### Data Science Virtual Machine

The Data Science Virtual Machine family of VM images on Azure includes the DSVM for Windows, a CentOS-based DSVM for Linux, and an Ubuntu-based DSVM for Linux. These images come with popular data science and machine learning tools, including Microsoft R Server Developer Edition, Microsoft R Open, Anaconda Python, Julia, Jupyter notebooks, Visual Studio Code, RStudio, xgboost, and many more. A full list of tools for all editions of the DSVM is available here. The DSVM has proven popular with data scientists as it helps them focus on their tasks and skip mundane steps around tool installation and configuration.

To try deep learning on Windows with GPUs, the Deep Learning Toolkit for DSVM contains all tools from the Windows DSVM plus GPU drivers, CUDA, cuDNN, and GPU versions of CNTK, MXNet, and TensorFlow.

#### Get Started Today

We invite you to use the new image to explore deep learning frameworks or for your machine learning and data science projects – DSVM for Linux (Ubuntu) is available today through the Marketplace. Free Azure credits are available to help get you started.

Paul & Gopi

# My quest to responsive visualizations with Julia

Responsive and fun interactivity is notoriously hard! But it’s also the key to less frustration and more patience when working on a project.

There are endless important projects that humanity needs to solve. To become less of a burden to the environment and have a better society we constantly need to advance the state of the art.

Still, people give up on doing this when the initial amount of patience, motivation and curiosity is depleted without getting new incentives.

This actually happened to me with my 3D modeling hobby a few years ago. Not only was the 3D software difficult to learn, it also turned unresponsive fairly quickly, even on a powerful machine.

Just as an example, rendering a 30 second clip in high quality took ~2 weeks with my PC running day and night. This slowness drained my patience and at some point I decided that this is too much hassle for a hobby.

Half a year later, I ran into a similar problem when working on a computer vision pipeline. We simply couldn’t find frameworks which were fast enough for real time processing while maintaining high programming productivity. This time I didn’t give up, but the job could have been much more fun and productive!

At this point I decided that I want to work on software that turns really tough problems like object recognition of 3D models into weekend projects.

I started working on GLVisualize, a library for fast and interactive graphics written in Julia.

Why graphics?

Visualizing something is the first step to understanding and it allows us to explore huge problem spaces that were invisible before.

But current tools seem to be divided into 2D and 3D libraries, high performance libraries, which are kind of a hassle to use, and easily usable libraries which turn into an unresponsive mess quickly.

So with my background, this seemed like a worthy start!

Why Julia?

While there are several reasons for choosing Julia, today I want to show you its impressive speed despite being an easy-to-use dynamic language.

Let me show you the benefit with two toy examples (all graphics including the plots were produced with GLVisualize!).

Consider the following function, the famous Lorenz Attractor:

It takes a point t0 and parameters a to d and returns a new point. Could you imagine what kind of line this will draw, when recursively applying it to the new result and changing parameters?

I guess no one has such a powerful brain, which is why we visualize these kind of things, together with sliders to explore the function:

code

These are 100,000 points, and as you can see GLVisualize can deal with that amount easily.

Lets see how long we could have stayed at pleasant interactive speeds in another language.

Lets take for example Python, a language comparable in usability:

minimal speedup 70.4680453961563
maximal speedup 148.02061279172827
max seconds Julia 0.193422334
max seconds Python 13.630093812942505

As you can see, computation times jump beyond one second at around 10⁵ points for Python, while Julia stays interactive with up to 10⁷ points.

So if you’re interested in a higher resolution in Python you better crank up your patience or call out to C, eliminating all convenience of Python abruptly!

Next example is a 3D mandelbulb visualization:

code

One step through the function takes around 24 seconds with Julia on the CPU, which makes it fairly painful to explore the parameters.

So why is the shown animation still smooth? Well, when choosing Julia, I’ve been betting on the fact that it should be fairly simple

to run Julia code on the GPU. This bet is now starting to become reality.

I’m using CUDAnative and GPUArrays to speed up the calculation of the mandelbulb.

CUDAnative allows you to run Julia code on the GPU while GPUArrays offers a simpler array interface for this task.

Here are the benchmark results:

minimal speedup 37.05708590791309
maximal speedup 401.59165062897495
max seconds cpu 24.275534426
max seconds cuda 0.128474155

This means that by running Julia on the GPU we can still enjoy a smooth interaction with this function.

But the problem is actually so hard, that I can’t run this interactively at the resolution I would like to.

As you can see, the iso surface visualization still looks very coarse. So even when using state of the art software, you still run into problems that won’t compute under one second.

But with Julia you can at least squeeze out the last bit of performance of your hardware, be it CPU or GPU, while enjoying the comfort of a dynamic high level language!

I’m sure that with Julia and advances in hardware, algorithms and optimizations, we can soon crack even the hardest problem with ease!

Benchmarking system:

RAM: 32GB

CPU: Intel® Core™ i7-6700 CPU @ 3.40GHz × 8

GPU: GeForce GTX 950

# Tutorial for Julia on the HPC with GPUs

This is a continuous of my previous post on using Julia on the XSEDE Comet HPC. Check that out first for an explanation of the problem. In that problem, we wished to solve for the area of a region where a polynomial was less than 1, which was calculated by code like:

@time res = @sync @parallel (+) for i = imin:dx:imax
tmp = 0
isq2 = i*i; isq3 = i*isq2; isq4 = isq2*isq2; isq5 = i*isq4
isq6 = isq4*isq2; isq7 = i*isq6; isq8 = isq4*isq4
@simd for j=jmin:dx:jmax
jsq2 = j*j; jsq3= j*jsq2; jsq4 = jsq2*jsq2;
jsq5 = j*jsq4; jsq6 = jsq2*jsq4; jsq7 = j*jsq6; jsq8 = jsq4*jsq4
@inbounds tmp += abs(coefs[1]*(jsq2) + coefs[2]*(jsq3) + coefs[3]*(jsq4) + coefs[4]*(jsq5) + coefs[5]*jsq6 + coefs[6]*jsq7 + coefs[7]*jsq8 + coefs[8]*(i) + coefs[9]*(i)*(jsq2) + coefs[10]*i*jsq3
+ coefs[11]*(i)*(jsq4) + coefs[12]*i*jsq5 + coefs[13]*(i)*(jsq6) + coefs[14]*i*jsq7 + coefs[15]*(isq2) + coefs[16]*(isq2)*(jsq2) + coefs[17]*isq2*jsq3 + coefs[18]*(isq2)*(jsq4) +
coefs[19]*isq2*jsq5 + coefs[20]*(isq2)*(jsq6) + coefs[21]*(isq3) + coefs[22]*(isq3)*(jsq2) + coefs[23]*isq3*jsq3 + coefs[24]*(isq3)*(jsq4) + coefs[25]*isq3*jsq5 +
coefs[26]*(isq4) + coefs[27]*(isq4)*(jsq2) + coefs[28]*isq4*jsq3 + coefs[29]*(isq4)*(jsq4) + coefs[30]*(isq5) + coefs[31]*(isq5)*(jsq2) + coefs[32]*isq5*jsq3+ coefs[33]*(isq6) +
coefs[34]*(isq6)*(jsq2) + coefs[35]*(isq7) + coefs[36]*(isq8))<1
end
tmp
end
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

Notice that this code is massively parallel: at every point we do something and we just sum the values. This means it’s a perfect problem for the GPU. I am going to show you how to do this.

## Getting CUDArt Setup

We will be using the CUDArt.jl package. I first had this working with the CUDA.jl package but was informed that going forward we should be using the CUDArt.jl package. There’s really not much of a difference in my implementation.

To get started, you have to go here and install an appropriate version of CUDA. Note that you have to have an NVIDIA GPU in order to do this. I will be using a GTX970 on one computer (Windows), and a GTX980Ti on another. Comet has a queue with 2x Tesla K80s, but that’s probably overkill and you should try to develop locally first. One you have all of that setup, CUDArt.jl should compile some .ptx files upon first use. If it does, you’re good to go.

CUDArt.jl’s page shows you how to get started. However, I am going to add a little caveat. In reality, this is the objective function in a machine learning problem, and so we wish to be able to recompute this as quickly as possible just by swapping out the coefficient array coefs. Therefore all of the other arrays must persist on the GPU.

Here’s what we’re going to do. To maximize performance on the GPU we will use Float32’s. On most GPUs (all except the Teslas and the Titan Black), the double precision floating point capabilities are SEVERELY gimped in comparison (1/32 the performance…). To do this, we setup our calculations as follows:

## Setup CPU-side parameters
imin = -8
imax = 1
jmin = -3
jmax = 3
@time coefs,powz,poww = getCoefficients(A0,A1,B0,B1,α,β1,β2,β3,β4)
iarr = convert(Vector{Float32},collect(imin:dx:imax))
jarr = convert(Vector{Float32},collect(jmin:dx:jmax))
sizei = length(imin:dx:imax)
sizej = length(jmin:dx:jmax)
cudaCores = 1664;
equalDiv = sizei*sizej÷cudaCores + 1

##GPU Code
CUDArt.init(devices(dev->true)) #Initiate the devices
dev = device(0) #Set the GPU to the first GPU
#Move the arrays over
g_iarr = CudaArray(iarr)
g_jarr = CudaArray(jarr)
g_coefs = CudaArray(coefs)
g_tmp = CudaArray(Int32,cudaCores)

Note that if you have multiple GPUs and don’t know which one is which device, you can use the command CUDArt.name(CUDArt.device_properties(number)). Here I have CUDArt.name(CUDArt.device_properties(0)) return “GeForce GTX 970” and CUDArt.name(CUDArt.device_properties(1)) returns “GeForce GTX 750 Ti”, so I use device(0) to use the 970. Note that at any time you can change the device with this command, but be careful since the CudaArrays are stored on the device that you call them from. If you want to use multiple GPUs, you will also need to split up your arrays.

## The CUDA Kernal

Now the last thing I need to do is make my function. To do this, we have to go to C. Here is the Cuda kernal for the calculation:

// filename: integration.cu
// Performs the inner integration loop
extern "C"
{
__global__ void integration(const float *coefs, const float *iArr, const float *jArr, const int sizei, const int sizej, const int equalDiv, int *tmp)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
int loopInd;
float i, j, isq2, isq3, isq4, isq5, isq6, isq7, isq8, jsq2, jsq3, jsq4, jsq5, jsq6, jsq7, jsq8,
int ans = 0;
for(loopInd=0;loopInd<equalDiv;loopInd=loopInd+1){
i = iArr[(index*equalDiv+loopInd)/sizej];
j = jArr[(index*equalDiv+loopInd)%sizej];
if(index*equalDiv+loopInd >= sizei*sizej){
break;
}
if((index*equalDiv+loopInd)%sizej==0 || loopInd==0){
isq2 = i*i;
isq3 = i*isq2;
isq4 = isq2*isq2;
isq5 = i*isq4;
isq6 = isq4*isq2;
isq7 = i*isq6;
isq8 = isq4*isq4;
}
jsq2 = j*j;
jsq3 = j*jsq2;
jsq4 = jsq2*jsq2;
jsq5 = j*jsq4;
jsq6 = jsq2*jsq4;
jsq7 = j*jsq6;
jsq8 = jsq4*jsq4;
/*tmp[index*1878+loopInd]= index*1878+loopInd;*/
/* changed to zero indexing */
ans = ans + (abs(coefs[0]*(jsq2) + coefs[1]*(jsq3) + coefs[2]*(jsq4) + coefs[3]*(jsq5) + coefs[4]*jsq6 + coefs[5]*jsq7 + coefs[6]*jsq8 + coefs[7]*(i) + coefs[8]*(i)*(jsq2) + coefs[9]*i*jsq3 + coefs[10]*(i)*(jsq4) + coefs[11]*i*jsq5 + coefs[12]*(i)*(jsq6) + coefs[13]*i*jsq7 + coefs[14]*(isq2) + coefs[15]*(isq2)*(jsq2) + coefs[16]*isq2*jsq3 + coefs[17]*(isq2)*(jsq4) + coefs[18]*isq2*jsq5 + coefs[19]*(isq2)*(jsq6) + coefs[20]*(isq3) + coefs[21]*(isq3)*(jsq2) + coefs[22]*isq3*jsq3 + coefs[23]*(isq3)*(jsq4) + coefs[24]*isq3*jsq5 + coefs[25]*(isq4) + coefs[26]*(isq4)*(jsq2) + coefs[27]*isq4*jsq3 + coefs[28]*(isq4)*(jsq4) + coefs[29]*(isq5) + coefs[30]*(isq5)*(jsq2) + coefs[31]*isq5*jsq3+ coefs[32]*(isq6) + coefs[33]*(isq6)*(jsq2) + coefs[34]*(isq7) + coefs[35]*(isq8))<1);
}
tmp[index] = ans;
}
}

Okay, that’s not nice looking at all, but it’s the same idea as the Julia script. Let me explain this code a little bit. Our function is integration. It takes in all of the variables and an array tmp which we will save the output to. The first call calculates a unique index for each CUDA core. We then setup our variables. Notice that before that we setup equalDiv to calculate how many calculations each core should do in order to evenly divide the work. Thus we loop over equalDiv times (and check if we go over since this will not necessarily come out perfectly). This means that each CUDA core will calculate the same number of $(i,j)$‘s. We then use the calculated index and the loop index to get our $i$ and $j$. We index down $j$ first, then go by $i$’s. This means it’s ordered as $(0,0),(0,1),ldots,(0,J),(1,0),ldots$. Notice that we achieve this by doing integer division by the size of the j array (truncation will make it an integer), and we get the $j$ index by moding by the size of the $j$ array. Using these $i$’s and $j$’s, we calculate the inner part of the loop. Notice that the calls to coefs had to be changed because C using 0-based indexing (as opposed to Julia’s 1-based indexing). Note that we can use the abs function from the CUDA math library. All of the CUDA math library functions are available.

[Note that when doing CUDA programming you want to allocate as little memory as possible. This is still on the small side so it worked out to be the optimal code. However, in some cases the non-unraveled version may be better due to the overhead of memory allocation.]

So there’s a template for you to play around with. When all of that is in order, we have to compile the kernal. On Windows with visual studio, I had to find out where cl.exe was (for me it was in Visual Studio 12’s directory) and tell the compiler the location of it. The total command was:

nvcc -ptx integration.cu -ccbin "C:Program Files (x86)Microsoft Visual Studio 12.0VCbin"

On Linux it was simpler:

nvcc -ptx integration.cu

From this you get .ptx files. To use the function in Julia, we then use the following code:

md = CuModule("integrationWin.ptx",false)
integrationFunc = CuFunction(md,"integration")
@time launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,)); res = sum(to_host(g_tmp));
res = res*((imax-imin)*(jmax-jmin)/(length(imin:dx:imax)*length(jmin:dx:jmax)))
println(res)

What this does is take the “integration” function out of the .ptx and save it as integrationFunc. I can then call it at any time with the launch command. The second and third options are the grid and block sizes. In CUDA you always want to “overload”/overthread the cores so that there is no downtime. What I have found in my tests is that the best setting for this is to set the gridsize to the number of CUDA cores (since in our case these aren’t communicating) and overloading it by setting the block size to 64. Why 64? That worked out best in testing. Change the numbers around and see what works best for you.

The last argument to launch is the tuple of arguments for our integration function. Notice that there is a trailing comma, this is required. After the computation is finished, the results have been saved directly into g_tmp. Thus we send g_tmp to the host and sum up the results as before. We re-scale by the area of integration and that is the result.

Now in the actual objective function, what I want to do is update the coefficients to a new coefficients array, perform this calculation, and then move on. To do this I simply use the code:

    if gpuEnabled
g_coefs = CudaArray(coefs)
launch(integrationFunc, cudaCores, 64, (g_coefs, g_iarr, g_jarr, sizei,sizej,equalDiv,g_tmp,))
res = sum(to_host(g_tmp))
free(g_coefs)
else ...

Notice I send coefs over to the GPU, run the kernal to get the answer, and then free up the GPU memory. So in my code I loop over this many times, changing coefs every time. When the entire computation is done, I finish with the command

CUDArt.close(devices(dev->true))

This will clear the GPU.

## Comet?

These same steps will work on Comet. You will need to go into an interactive job to compile, but it should work fine. The job script to run Julia on the GPU node is:

#!/bin/bash
#SBATCH -A <account>
#SBATCH --job-name="jOpt"
#SBATCH --output="output/jOpt.%j.%N.out"
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --export=ALL
export SLURM_NODEFILE=generate_pbs_nodefile
/home/crackauc/julia-a2f713dea5/bin/julia --machinefile \$SLURM_NODEFILE /home/crackauc/projectCode/ImprovedSRK/Optimization/cometDriver.jl