
Parallelizing Distance Calculations Using A GPU With CUDAnative.jl

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/cudanative-jl-julia/

Hacker News discussion: link

Code as Julia Jupyter Notebook

Julia has the reputation as a “fast” language in that it’s possible to write high-performing code. However, what I appreciate most about Julia is not just that the code is fast, but rather that Julia makes high-performance concepts accessible without needing a deep computer science or compiled-language background (neither of which I possess!).

For version 0.6 of Julia, another milestone has been reached in the “accessible” high-performance category: the ability to run Julia code natively on NVIDIA GPUs through the CUDAnative.jl package. While CUDAnative.jl is still very much in its development stages, the package is already far enough along that, within a few hours and as a complete beginner to GPU programming, I was able to see speedups in excess of 20x on my toy example calculating haversine distances.

Getting Started

The CUDAnative.jl introduction blog post and documentation cover the installation process in-depth, so I won’t repeat the details here. I’m already a regular compile-from-source Julia user and I found the installation process pretty easy on my CUDA-enabled Ubuntu workstation. If you can already do TensorFlow, Keras or other GPU tutorials on your computer, getting CUDAnative.jl to work shouldn’t take more than 10-15 minutes.
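As a quick sanity check before writing any kernels, the following minimal sketch (using only calls that appear in the code later in this post) simply creates and destroys a context on the first GPU; if it runs without error, the CUDA toolchain is working:

using CUDAnative, CUDAdrv

#Grab the first (or only) GPU, open a context on it, then clean up
dev = CuDevice(0)
ctx = CuContext(dev)
destroy!(ctx)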

Julia CPU Implementation

To get a feel for what sort of speedup I could expect from using a GPU, I wrote a naive implementation of a distance matrix calculation in Julia:

#https://github.com/quinnj/Rosetta-Julia/blob/master/src/Haversine.jl
haversine(lat1::Float32,lon1::Float32,lat2::Float32,lon2::Float32) = 2 * 6372.8 * asin(sqrt(sind((lat2-lat1)/2)^2 + cosd(lat1) * cosd(lat2) * sind((lon2 - lon1)/2)^2))

function pairwise_dist(lat::Vector{Float32}, lon::Vector{Float32})

    #Pre-allocate, since size is known
    n = length(lat)
    result = Array{Float32}(n, n)

    #Brute force fill in each cell, ignore that distance [i,j] = distance [j,i]
    for i in 1:n
        for j in 1:n
            @inbounds result[i, j] = haversine(lat[i], lon[i], lat[j], lon[j])
        end
    end

    return result

end

#Example benchmark call
lat10000 = rand(Float32, 10000) .* 45
lon10000 = rand(Float32, 10000) .* -120
@time native_julia_cellwise = pairwise_dist(lat10000, lon10000)

The above code takes a pair of lat/lon values and calculates the haversine distance between the two points. This algorithm is naive in that a distance matrix is symmetric (i.e. the distance from A to B is the same as from B to A), so I could’ve done half the work by setting result[i,j] and result[j,i] to the same value (a sketch of that variant appears below), but as a measure of work for a benchmark this toy example is fine. Also note that this implementation runs on a single core; no CPU-core-level parallelization has been implemented.

Or to put all that another way: if someone wanted to tackle this problem without thinking very hard, the implementation might look like the code above.
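For comparison, here is a minimal sketch (not the benchmarked version) of that symmetric shortcut, reusing the haversine function defined above to compute each pair once and mirror it into both cells:

#Symmetric-fill sketch: roughly half the work of the brute-force version
function pairwise_dist_symmetric(lat::Vector{Float32}, lon::Vector{Float32})
    n = length(lat)
    result = Array{Float32}(n, n)
    for i in 1:n
        for j in i:n
            @inbounds d = haversine(lat[i], lon[i], lat[j], lon[j])
            @inbounds result[i, j] = d
            @inbounds result[j, i] = d
        end
    end
    return result
end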

CUDAnative.jl Implementation

There are two parts to the CUDAnative.jl implementation: the kernel (i.e. the actual calculation) and the boilerplate code for coordinating the writing to/from the CPU and GPU.

Kernel Code

The kernel code has similarities to the CPU implementation, with a few key differences:

  • The method signature takes a single lat/lon point plus the full lat/lon vectors, rather than two vectors for a pairwise calculation
  • Boilerplate code computes the thread index on the GPU (adjusting between CUDA’s 0-based indexing and Julia’s normal 1-based indexing)
  • The trigonometric functions need to be prefixed with CUDAnative., since the GPU functions aren’t the same as the functions from Base Julia
  • Rather than returning an array, the kernel writes its results directly into the pre-allocated out array in GPU memory
using CUDAnative, CUDAdrv

#Calculate one point vs. all other points simultaneously
function kernel_haversine(latpoint::Float32, lonpoint::Float32, lat::AbstractVector{Float32}, lon::AbstractVector{Float32}, out::AbstractVector{Float32})

    #Thread index
    #Need to do the n-1 dance, since CUDA expects 0 and Julia does 1-indexing
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x

    out[i] =  2 * 6372.8 * CUDAnative.asin(CUDAnative.sqrt(CUDAnative.sind((latpoint-lat[i])/2)^2 + CUDAnative.cosd(lat[i]) * CUDAnative.cosd(latpoint) * CUDAnative.sind((lonpoint - lon[i])/2)^2))

    #Return nothing, since we're writing directly to the out array allocated on GPU
    return nothing
end

Coordination Code

The coordination code is similar to what you might see in a main() function in C or Java, where the kernel is applied to the input data. I am using the dev keyword with the default value of CuDevice(0) to indicate that the code should be run on the first (in my case, only) GPU device.

The remainder of the code has comments on its purpose, primarily:

  • Transfer Julia CPU arrays to GPU arrays (CuArray)
  • Set number of threads/blocks
  • Calculate distance between a point and all other points in the array, write back to CPU
#validated kernel_haversine/distmat returns same answer as CPU haversine method (not shown)
function distmat(lat::Vector{Float32}, lon::Vector{Float32}; dev::CuDevice=CuDevice(0))

    #Create a context
    ctx = CuContext(dev)

    #Change to objects with CUDA context
    n = length(lat)
    d_lat = CuArray(lat)
    d_lon = CuArray(lon)
    d_out = CuArray(Vector{Float32}(n))

    #Calculate number of calculations, threads, blocks
    len = n
    threads = min(len, 1024)
    blocks = Int(ceil(len/threads))

    #Julia side accumulation of results to relieve GPU memory pressure
    accum = Array{Float32}(n, n)

    # run and time the test
    secs = CUDAdrv.@elapsed begin
        for i in 1:n
            @cuda (blocks, threads) kernel_haversine(lat[i], lon[i], d_lat, d_lon, d_out)
            accum[:, i] = Vector{Float32}(d_out)
        end
    end

    #Clean up context
    destroy!(ctx)

    #Return timing and bring results back to Julia
    return (secs, accum)

end

#Example benchmark call
timing, result = distmat(lat10000, lon10000)
result ≈ native_julia_cellwise #validate results equivalent CPU and GPU

The code is written to process one row of the distance matrix at a time to minimize GPU memory usage. By writing out the results to the CPU after each loop iteration, I have n-1 extra CPU transfers, which is less performant than calculating all the distances first then transferring, but my consumer-grade GPU with 6GB of RAM would run out of GPU memory before completing the calculation otherwise.
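As a rough back-of-the-envelope check on why: holding the full Float32 result matrix on the GPU at the largest benchmark size would require far more memory than the card has.

#Memory needed for a full Float32 distance matrix at n = 100_000
n = 100_000
bytes = n^2 * sizeof(Float32)    #4 * 10^10 bytes
gib = bytes / 1024^3             #≈ 37 GiB, versus 6GB of GPU RAM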

Performance

The performance characteristics of the CPU and GPU calculations for various sizes of distance matrices are charted in the original post. Having not done any GPU calculations before, I was surprised to see how much of a penalty there is for writing back and forth to the GPU: the GPU execution time is essentially constant for matrix sizes from 1 to 1000, representing the fixed cost of moving the data from the CPU to the GPU.

Of course, once we get above 1000×1000 matrices, the GPU really starts to shine. Due to the log scale, it’s a bit hard to see the magnitude differences, but at 100000×100000 there is a 23x reduction in execution time (565.008s CPU vs. 24.32s GPU).

What I Learned

There are myriad things I learned from this project, but most important is that GPGPU processing can be accessible for people like myself without a CS background. Julia isn’t the first high-level language to provide CUDA functionality, but the fact that the code is so similar to native Julia makes GPU computing something I can include in my toolbox today.

Over time, I’m sure I’ll get better results as I learn more about CUDA and as CUDAnative.jl continues to smooth out the rough edges. But the fact that, as a beginner working from still-sparse CUDAnative.jl documentation, I could achieve such large speedups with just an hour or two of coding bodes well for the future of GPU computing in Julia.

Code as Julia Jupyter Notebook

WordPress to Jekyll: A 30x Speedup

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/wordpress-jekyll-30x-speedup/

About a month ago, I switched this blog from WordPress hosted on Bluehost to Jekyll on GitHub Pages. I suspected moving to a static website would be faster than HTML generated via PHP, and it is certainly cheaper (GitHub Pages is “free”). But it wasn’t until I needed a dataset for some data visualization development that I realized how much of an improvement it has been!

Packages, Packages, Packages

With the release of v0.5 of Julia, I’ve been spending less time updating my existing packages and more time making new ones, because making new stuff is more fun than maintaining old stuff! One of the packages I’ve been building is for the ECharts visualization library (v3) from Baidu. While Julia doesn’t necessarily need another visualization library, visualization is something I’m interested in, and learning is easier when you’re solving problems you like. And since the world doesn’t need another Iris example, I decided to share some real-world website performance data 🙂

Line Chart

One of the first features I developed for ECharts.jl was X-Y charts, which I posit is the most common chart type in business. One thing that is great about the underlying ECharts JavaScript library is that interactivity is really easy to achieve:

using ECharts, DataFrames

#Read in data
df = readtable("/assets/data/website_time_data.csv")

#Make data two different series that overlap, so endpoint touches
df[:pre] = [(x[1] <= "2016-09-06" ? x[2] : nothing) for x in zip(df[:date], df[:loadtime_ms])]
df[:post] = [(x[1] >= "2016-09-06" ? x[2] : nothing) for x in zip(df[:date], df[:loadtime_ms])]

#Graph code
l = line(df[:date], hcat(df[:pre], df[:post]))
l.ec_width = 800
seriesnames!(l, ["loadtime_ms", "post"])
colorscheme!(l, palette = ("acw", "FlatUI"))
yAxis!(l, name = "Load time in ms")
title!(l, text = "randyzwitch.com",
          subtext = "Switching from WordPress on Bluehost to Jekyll on GitHub (2016/09/06)")
toolbox!(l, chartTypes = ["bar", "line"])
slider!(l)

Even though I switched from WordPress to Jekyll on 9/6/2016, it appears that the page cache for Google Webmaster Tools didn’t really expire until 9/12/2016 or so. In the average case, the load time went from 1128ms to 38ms! Of course, this isn’t really a fair comparison, as presumably GitHub Pages runs on much better hardware than the cheap Bluehost hosting I have, and I didn’t reimplement most of the garbage I had on the WordPress version of the blog. But from a user-experience standpoint, good lord what an improvement!
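That average-case ratio works out to roughly the 30x in this post’s title:

#Average load time, WordPress vs. Jekyll
1128 / 38    #≈ 29.7, call it a 30x speedup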

Box Plots

To test out further functionality, here are some box plots of the load time variation:

using ECharts, DataFrames

#Read in data
df = readtable("/Users/randyzwitch/Desktop/website_load_time.csv")
df[:pre] = [(x[1] <= "2016-09-06" ? x[2] : nothing) for x in zip(df[:date], df[:loadtime_ms])]
df[:post] = [(x[1] >= "2016-09-12" ? x[2] : nothing) for x in zip(df[:date], df[:loadtime_ms])]

#Remove nulls
pre = [x for x in df[:pre] if x != nothing]
post = [x for x in df[:post] if x != nothing]

#Graph code
b = box([pre, post], names = ["WordPress", "Jekyll"])
b.ec_width = 800
colorscheme!(b, palette = ("acw", "VitaminC"))
yAxis!(b, name = "Load time in ms", nameGap = 50, min = 0)
title!(b, text = "randyzwitch.com",
           subtext = "Switching from WordPress on Bluehost to Jekyll on GitHub (2016/09/06)")
toolbox!(b)

Usually, a box plot comparison as smushed as the Jekyll plot vs. the WordPress one would make for a poor visualization, but in this case I think it actually works. The load time for the Jekyll version of this blog is so quick and so consistent that it would barely register as an outlier if it were part of the WordPress data! It’s crazy to think that the lower whisker (-1.5 * IQR) of the WordPress load times sits at roughly the mean/median/min load time of Jekyll.

Where To Go Next?

This blog post is really just an interesting finding from my experience moving to Jekyll on GitHub. As it stands now, ECharts.jl is still in pre-METADATA mode. Right now, I assume that this would be a useful enough package to submit to METADATA some day, but I guess that depends on how much further I get smoothing out the rough edges. If there are people who are interested in cleaning up this package further, I’d absolutely love to collaborate.

A Million Text Files And A Single Laptop

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/gnu-parallel-medium-data/

[Image: GNU Parallel Cat Unix]

Wait…What? Why?

More often than I would like, I receive datasets where the data has only been partially cleaned, such as the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces, 2) the data in aggregate are too large to hold in RAM, but 3) the data are small enough that using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampling transactions (with replacement), and the Python Faker (fake-factory) package to generate fake customer profiles and to create the 1MM+ text files.

The content of the data itself isn’t important for this blog post, but the data generation code is posted as a GitHub gist should you want to run these commands yourself.

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder, concatenate them together…one big file. But something funny happens once you get enough files…

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, many Unix tools will only accept about 10,000 arguments; the asterisk in the `cat` command gets expanded by the shell before the command runs, so the statement above tries to pass 1,234,567 arguments to `cat` and you get an error message.
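The exact ceiling is system-dependent; on Linux, the kernel’s limit on the total size of the argument list (in bytes) can be inspected with getconf:

$ getconf ARG_MAX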

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The `$f` argument in the code is to highlight that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown in the benchmark graph from the original post (graph code, Julia).

Given that the graph represents a single run at each level of parallelism, it’s a bit difficult to say exactly where the parallelism gets maxed out, but at roughly 10 concurrent jobs there’s no additional benefit. It’s also interesting to point out what the `-m` argument does: by specifying `-m`, you allow multiple arguments (i.e. multiple text files) to be passed to each invocation of `cat`, rather than one file per invocation. This alone leads to an 8x speedup over the naive loop solution.

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1: Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the `tr` (transliterate) utility, we can map our data to one product per row as we stream over the file:
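A minimal sketch of that pipeline (assuming each row is simply a comma-delimited list of products, which may differ from the real file layout):

#Serial: one product per line, de-dup, count unique lines
cat transactions.csv | tr ',' '\n' | sort -u | wc -l

#GNU Parallel: let parallel chunk the stream before the tr step
cat transactions.csv | parallel --pipe "tr ',' '\n'" | sort -u | wc -l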

The trick here is that we swap the comma-delimited transactions with the newline character; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using `sort -u` to de-dup the list and `wc -l` to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL `Group By` on the date and sum the rows:
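A minimal sketch of both approaches (assuming the date is the first comma-delimited field, which may differ from the real file layout):

#Serial: count rows per date
cut -d ',' -f 1 transactions.csv | sort | uniq -c

#GNU Parallel: count per chunk, then "reduce" by summing the per-chunk counts
cat transactions.csv | parallel --pipe "cut -d ',' -f 1 | sort | uniq -c" | \
    awk '{counts[$2] += $1} END {for (d in counts) print d, counts[d]}'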

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then “reducing” again by calculating total rows by date (a trick I picked up at this blog post).

Q3. Total Items Per Store, Per Month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14-minute run time, the real-time benefit of parallelization isn’t that great.

It may be possible that one of you out there knows how to do this correctly, but it’s interesting to note that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup, but saving seven minutes per run isn’t worth spending hours trying to find the optimal settings.

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.