Category Archives: Julia

Technical preview: Native GPU programming with CUDAnative.jl

By: Julia Developers

Re-posted from:

After 2 years of slow but steady development, we would like to announce the first preview
release of native GPU programming capabilities for Julia. You can now write your CUDA
kernels in Julia, albeit with some restrictions, making it possible to use Julia’s
high-level language features to write high-performance GPU code.

The programming support we’re demonstrating here today consists of the low-level building
blocks, sitting at the same abstraction level of CUDA C. You should be interested if you
know (or want to learn) how to program a parallel accelerator like a GPU, while dealing with
tricky performance characteristics and communication semantics.

You can easily add GPU support to your Julia installation (see below for detailed
instructions) by installing CUDAnative.jl. This
package is built on top of experimental interfaces to the Julia compiler, and the
purpose-built LLVM.jl and
CUDAdrv.jl packages to compile and execute code.
All this functionality is brand-new and thoroughly untested, so we need your help and
feedback in order to improve and finalize the interfaces before Julia 1.0.

How to get started

CUDAnative.jl is tightly integrated with the Julia compiler and the underlying LLVM
framework, which complicates version and platform compatibility. For this preview we only
support Julia 0.6 built from source, on Linux or macOS. Luckily, installing Julia from
source is well documented in the main repository’s
Most of the time it boils down to the following commands:

$ git clone
$ cd julia
$ git checkout v0.6.0-pre.alpha  # or any later tag
$ make                           # add -jN for N parallel jobs
$ ./julia

From the Julia REPL, installing CUDAnative.jl and its dependencies is just a matter of using
the package manager. Do note that you need to be using the NVIDIA binary driver, and have
the CUDA toolkit installed.

> Pkg.add("CUDAnative")

# Optional: test the package
> Pkg.test("CUDAnative")

At this point, you can start writing kernels and execute them on the GPU using CUDAnative’s
@cuda! Be sure to check out the
examples, or continue
reading for a more textual introduction.

Hello World Vector addition

A typical small demo of GPU programming capabilities (think of it as the GPU Hello World)
is to perform a vector addition. The snippet below does exactly that using Julia and

using CUDAdrv, CUDAnative

function kernel_vadd(a, b, c)
    # from CUDAnative: (implicit) CuDeviceArray type,
    #                  and thread/block intrinsics
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]

    return nothing

dev = CuDevice(0)
ctx = CuContext(dev)

# generate some data
len = 512
a = rand(Int, len)
b = rand(Int, len)

# allocate & upload on the GPU
d_a = CuArray(a)
d_b = CuArray(b)
d_c = similar(d_a)

# execute and fetch results
@cuda (1,len) kernel_vadd(d_a, d_b, d_c)    # from CUDAnative.jl
c = Array(d_c)

using Base.Test
@test c == a + b


How does it work?

Most of this example does not rely on CUDAnative.jl, but uses functionality from CUDAdrv.jl.
This package makes it possible to interact with CUDA hardware through user-friendly wrappers
of CUDA’s driver API. For example, it provides an array type CuArray, takes care of memory
management, integrates with Julia’s garbage collector, implements @elapsed using GPU
events, etc. It is meant to form a strong foundation for all interactions with the CUDA
driver, and does not require a bleeding-edge version of Julia. A slightly higher-level
alternative is available under CUDArt.jl, building
on the CUDA runtime API instead, but hasn’t been integrated with CUDAnative.jl yet.

Meanwhile, CUDAnative.jl takes care of all things related to native GPU programming. The
most significant part of that is generating GPU code, and essentially consists of three

  1. interfacing with Julia: repurpose the compiler to emit GPU-compatible LLVM IR (no
    calls to CPU libraries, simplified exceptions, …)
  2. interfacing with LLVM (using LLVM.jl): optimize the IR, and compile to PTX
  3. interfacing with CUDA (using CUDAdrv.jl): compile PTX to SASS, and upload it to the

All this is hidden behind the call to @cuda, which generates code to compile our kernel
upon first use. Every subsequent invocation will re-use that code, convert and upload
arguments1, and finally launch the kernel. And much like we’re used to on the CPU, you
can introspect this code using runtime reflection:

# CUDAnative.jl provides alternatives to the @code_ macros,
# looking past @cuda and converting argument types
julia> CUDAnative.@code_llvm @cuda (1,len) kernel_vadd(d_a, d_b, d_c)
define void @julia_kernel_vadd_68711 {
    [LLVM IR]

# ... but you can also invoke without @cuda
julia> @code_ptx kernel_vadd(d_a, d_b, d_c)
.visible .func julia_kernel_vadd_68729(...) {
    [PTX CODE]

# or manually specify types (this is error prone!)
julia> code_sass(kernel_vadd, (CuDeviceArray{Float32,2},CuDeviceArray{Float32,2},CuDeviceArray{Float32,2}))
code for sm_20
        Function : julia_kernel_vadd_68481

Another important part of CUDAnative.jl are the intrinsics: special functions and macros
that provide functionality hard or impossible to express using normal functions. For
example, the {thread,block,grid}{Idx,Dim} functions provide access to the size and index
of each level of work. Local shared memory can be created using the @cuStaticSharedMem and
@cuDynamicSharedMem macros, while @cuprintf can be used to display a formatted string
from within a kernel function. Many math
are also available;
these should be used instead of similar functions in the standard library.

What is missing?

As I’ve already hinted, we don’t support all features of the Julia language yet. For
example, it is currently impossible to call any function from the Julia C runtime library
(aka. This makes dynamic allocations impossible, cripples exceptions, etc.
As a result, large parts of the standard library are unavailable for use on the GPU. We will
obviously try to improve this in the future, but for now the compiler will error when it
encounters unsupported language features:

julia> nope() = println(42)
nope (generic function with 1 method)

julia> @cuda (1,1) nope()
ERROR: error compiling nope: emit_builtin_call for REPL[1]:1 requires the runtime language feature, which is disabled

Another big gap is documentation. Most of CUDAnative.jl mimics or copies CUDA
, while CUDAdrv.jl wraps the CUDA
driver API
. But we haven’t documented what
parts of those APIs are covered, or how the abstractions behave, so you’ll need to refer to
the examples and tests in the CUDAnative and CUDAdrv repositories.

Another example: parallel reduction

For a more complex example, let’s have a look at a parallel
for Kepler-generation
. This
is a typical well-optimized GPU implementation, using fast communication primitives at each
level of execution. For example, threads within a warp execute together on a SIMD-like core,
and can share data through each other’s registers. At the block level, threads are allocated
on the same core but don’t necessarily execute together, which means they need to
communicate through core local memory. Another level up, only the GPU’s DRAM memory is a
viable communication medium.

The Julia version of this algorithm
looks pretty similar to the CUDA original: this is as intended, because CUDAnative.jl is a
counterpart to CUDA C. The new version is much more generic though, specializing both on the
reduction operator and value type. And just like we’re used to with regular Julia code, the
@cuda macro will just-in-time compile and dispatch to the correct specialization based on
the argument types.

So how does it perform? Turns out, pretty good! The chart below compares the performance of
both the CUDAnative.jl and CUDA C implementations2, using BenchmarkTools.jl to measure
the execution time
. The small
constant overhead (note the logarithmic scale) is due to a deficiency in argument passing,
and will be fixed.

Performance comparison of parallel reduction

We also aim to be compatible with tools from the CUDA toolkit. For example, you can profile
Julia kernels
using the NVIDIA Visual
Profiler, or use cuda-memcheck to detect out-of-bound accesses3:

$ cuda-memcheck julia examples/oob.jl
========= Invalid __global__ write of size 4
=========     at 0x00000148 in examples/oob.jl:14:julia_memset_66041
=========     by thread (10,0,0) in block (0,0,0)
=========     Address 0x1020b000028 is out of bounds

Full debug information is not
yet, so cuda-gdb and
friends will not work very well.

Try it out!

If you have experience with GPUs or CUDA development, or maintain a package which could
benefit from GPU acceleration, please have a look or try out CUDAnative.jl! We need all the
feedback we can get, in order to prioritize development and finalize the infrastructure
before Julia hits 1.0.

I want to help

Even better! There’s many ways to contribute, for example by looking at the issues trackers
of the individual packages making up this support:

Each of those packages are also in perpetual need of better API coverage, and documentation
to cover and explain what has already been implemented.


This work would not have been possible without Viral Shah and Alan Edelman arranging my stay
at MIT. I’d like to thank everybody at Julia Central and around, it has been a blast! I’m
also grateful to Bjorn De Sutter, and IWT Vlaanderen, for supporting my time at Ghent

  1. See the README for a note on how expensive this currently is. 

  2. The measurements include memory transfer time, which is why a CPU implementation was not included (realistically, data would be kept on the GPU as long as possible, making it an unfair comparison). 

  3. Bounds-checked arrays are not supported yet, due to a bug in the NVIDIA PTX compiler

Super fast pattern search: The FFT trick

By: Júlio Hoffimann

Re-posted from:

In today’s post, I would like to document an old trick in image processing sometimes referred to as “The FFT trick” for exhaustive pattern search.

Although I would like to give credit to the person who introduced the trick in the literature, I don’t have this information with me. Please leave a comment if you have references.

Learning with a sliding window

Filtering an image with a sliding window is a routine operation in various disciplines like image processing, computer vision, geostatistics, and deep learning (e.g. convolutional neural networks), to name a few.

Sliding window on my breakfast.

Depending on the application, this operation serves different purposes. It is used everywhere as illustrated in the table:

Field of study Typical window Purposes
Image processing
  • Sobel kernel: $$\small\begin{bmatrix}1 & 0 & -1 \\ 2 & 0 & -2\\ 1 & 0 & -1\end{bmatrix}$$
  • Gaussian blur: $$\small\frac{1}{16}\begin{bmatrix}1 & 2 & 1 \\ 2 & 4 & 2\\ 1 & 2 & 1\end{bmatrix}$$
  • Edge detection
  • Blurring
  • Denoising
Computer vision Same as in Image processing
  • Image descriptors (e.g. GIST, HoG)
  • Invariance and alignment
Geostatistics A spatial pattern:
  • Pattern search
Deep learning Random weights:
  • Convolutional neural networks

BRAINSTORM: Why sliding windows? Couldn’t we learn more by using a different technique? What does a sliding window misses in terms of learning? :thinking:

Given its importance, a great deal of time was devoted by the scientific community to optimizing software and hardware for filtering images (or arrays) with sliding windows.

This post is about a clever trick that exploits legacy high-performance filtering algorithms to speed up pattern search on very large images.

FFT filtering

Consider a first naive implementation of the filtering operation in Julia that loops over the input image, extracts a subimage, multiplies it with the window entrywise, and adds the results:

function naivefilter(image::AbstractArray, window::AbstractArray)
  nx, ny = size(image)
  wx, wy = size(window)

  [sum(window .* image[i:i+wx-1,j:j+wy-1]) for i=1:nx-wx+1, j=1:ny-wy+1]

It works well, but is relatively slow for industry standards. Compare it to the efficient implementation available in Images.jl:

using Images
using BenchmarkTools

image = zeros(100,100)
window = zeros(3,3)

@benchmark naivefilter(image, window)
@benchmark imfilter(image, window)

# BenchmarkTools.Trial: (naivefilter)
#   memory estimate:  7.99 MiB
#   allocs estimate:  192093
#   --------------
#   minimum time:     6.512 ms (0.00% GC)
#   median time:      6.638 ms (0.00% GC)
#   mean time:        7.209 ms (7.67% GC)
#   maximum time:     10.569 ms (15.77% GC)
#   --------------
#   samples:          693
#   evals/sample:     1
#   time tolerance:   5.00%
#   memory tolerance: 1.00%
# BenchmarkTools.Trial: (imfilter)
#   memory estimate:  204.03 KiB
#   allocs estimate:  268
#   --------------
#   minimum time:     378.637 μs (0.00% GC)
#   median time:      388.839 μs (0.00% GC)
#   mean time:        406.210 μs (2.51% GC)
#   maximum time:     2.035 ms (78.63% GC)
#   --------------
#   samples:          10000
#   evals/sample:     1
#   time tolerance:   5.00%
#   memory tolerance: 1.00%

From the output, we notice turning into s and MiB turning into KiB. Besides the cache-friendly code that Tim Holy and others have written, I would like to comment on another optimization that Images.jl does that is not available in other image processing packages and programming languages.

Image filtering can alternatively be implemented with Fast Fourier Transforms (FFTs). The details of the algorithm and the theory behind it are beyond the scope of this post. The message that I want to convey is that depending on the window size, particularly for large windows, the FFT algorithm is much faster.

In Images.jl you can call the FFT algorithm explicitly by using:

imfilter(image, window, Algorithm.FFT())

but you don’t need to, Images.jl will call the faster algorithm depending on the window size you pass to it! :raised_hands:

The switch point between algorithms was set by means of benchmarks on a few modern processors. Feel free to submit a pull request if you feel that the switch point isn’t quite right for your hardware.

Now that we’ve seen that image filtering is a super efficient operation (specially in Julia), let’s exploit it to find patterns in images.

Given an image and a spatial pattern , an exhaustive pattern search consists of computing a distance (or dissimilarity) between and all subimages of . The subimages for which the distance is smallest are called matches. If the computed distance is zero, the match is called a perfect match.

Now is the trick. Consider the Euclidean distance and relax the notation to mean

Loop over all subimages of and compute

where and have the same size. It turns out that we can implement this operation by rewriting

and replacing the original inefficient loop with two FFT filtering + one reduction:

 = imfilter(A.^2, ones(B))
AB = imfilter(A, B)
 = sumabs2(B)

D = abs( - 2AB + ) # here "abs" takes care of floating point errors

Isn’t that awesome? This trick was crucial in my ImageQuilting.jl package, it allowed me to process large 3D images in minutes instead of hours.

Finally, I would like to add that high-performance GPU implementations of FFT filtering are available in Julia. You can set an option in your package to use CLFFT.jl and have the fastest software in the market. :blush:

Build & Deploy Machine Learning Apps on Big Data Platforms with Microsoft Linux Data Science Virtual Machine

By: Cortana Intelligence and ML Blog Team

Re-posted from:

This post is authored by Gopi Kumar, Principal Program Manager in the Data Group at Microsoft.

This post covers our latest additions to the Microsoft Linux Data Science Virtual Machine (DSVM), a custom VM image on Azure, purpose-built for data science, deep learning and analytics. Offered in both Microsoft Windows and Linux editions, DSVM includes a rich collection of tools, seen in the picture below, and makes you more productive when it comes to building and deploying advanced machine learning and analytics apps.

The central theme of our latest Linux DSVM release is to enable the development and testing of ML apps for deployment to distributed scalable platforms such as Spark, Hadoop and Microsoft R Server, for operating on data at a very large scale. In addition, with this release, DSVM also offers Julia Computing’s JuliaPro on both Linux and Windows editions.

Here’s more on the new DSVM components you can use to build and deploy intelligent apps to big data platforms:

Microsoft R Server 9.0

Version 9.0 of Microsoft R Server (MRS) is a major update to enterprise-scale R from Microsoft, supporting parallel and distributed computation. MRS 9.0 supports analytics execution in the Spark 2.0 context. There’s a new architecture and simplified interface for deploying R models and functions as web services via a new library called mrsdeploy, which makes it easy to consume models from other apps using the open Swagger framework.

Local Spark Standalone Instance

Spark is one of the premier platforms for highly scalable big data analytics and machine learning. Spark 2.0 launched in mid-2016 and brings several improvements such as the revised machine learning library (MLLib), scaling and performance optimization, better ANSI SQL compliance and unified APIs. The Linux DSVM now offers a standalone Spark instance (based on the Apache Spark distribution), PySpark kernel in Jupyter to help you build and test applications on the DSVM and deploy them on large scale clusters like Azure HDInsight Spark or your own on-premises Spark cluster. You can develop your code using either Jupyter notebook or with the included community edition of the Pycharm IDE for Python or RStudio for R.

Single Node Local Hadoop (HDFS and YARN) Instance

To make it easier to develop Hadoop programs and/or use HDFS storage locally for development and testing, a single node Hadoop installation is built into the VM. Also, if you are developing on the Microsoft R Server for execution in Hadoop or Spark remote contexts, you can first test things locally on the Linux DSVM and then deploy the code to a remote scaled out Hadoop or Spark cluster or to Microsoft R Server. These DSVM additions are designed to help you iterate rapidly when developing and testing your apps, before they get deployed into large-scale production big data clusters.

The DSVM is also a great environment for self-learning and running training classes on big data technologies. We provide sample code and notebooks to help you get started quickly on the different data science tools and technologies offered.

DSVM Resources

New to DSVM? Here are resources to get you started:

Linux Edition

Windows Edition

The goal of DSVM is to make data scientists and developers highly productive in their work and provide a broad array of popular tools. We hope you find it useful to have these new big data tools pre-installed with the DSVM.

We always appreciate feedback, so please send in your comments below or share your thoughts with us at the DSVM community forum.