Tag Archives: SIMD

Optimizing Julia for Performance: A Practical Example

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/optimizing-julia-performance-practical-example/

Let’s take a program which plots the standard logistic map:

r = 2.9:.00005:4;
numAttract = 100;
steady = ones(length(r),1)*.25;
for i=1:300 ## Get to steady state
  steady = r.*steady.*(1-steady);
end
x = zeros(length(steady),numAttract);
x[:,1] = steady;
for i=2:numAttract ## Now grab some values at the attractor
  x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
end
using PyPlot;
fig = figure(figsize=(20,10));
plot(collect(r),x,"b.",markersize=.06)
savefig("plot.png",dpi=300);

This plots the logistic map. If you take the same code and change the array dereferencing from A[i,j] to A(i,j) you get MATLAB code. However, I couldn’t get MATLAB to have it looks as good, but that’s beside the point.

MATLAB Plot
Julia Plot

What I want to show are the steps for optimizing this code. In MATLAB, this is as vectorized as it gets and so you’re pretty much done. In Julia, there are a few steps that you can do from here. The first thing to notice that if this code is run directly, then all the variables are global scope which hurts performance, so the first thing we do is change this code to a function call. Next we note that we can improve performance by AVX2 which allows the processor to take in multiple numbers into the same input. Think about it as operating on vectors of 8 numbers at a time, and when AVX3 comes out, you will be able to do 16 numbers at a time. This is something you’ll want to take advantage of! Vectorized operations already have this implemented, so to implement it in our loops we simply add @simd before our for loops. Be careful when using simd: it will allows the loops to re-order the computation. Since each simulation (different r’s) is independent, we are fine. More about this can be found here.

This gives the following code:

function logisticPlot()
  tic()
  r = 2.9:.00005:4;
  numAttract = 100;
  steady = ones(length(r),1)*.25;
  @simd for i=1:400 ## Get to steady state
    @inbounds @fastmath steady = r.*steady.*(1-steady);
  end
  x = zeros(length(steady),numAttract);
  x[:,1] = steady;
  @simd for i=2:numAttract ## Now grab some values at the attractor
    @inbounds @fastmath x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
  end
  toc()
  fig = figure(figsize=(20,10));
  plot(collect(r),x,"b.",markersize=.06)
  savefig("plot.png",dpi=300);
end
using PyPlot
@time logisticPlot()

which does better than before. Here we can’t. Normally we can do even by explicitly defining the types which allows the LLVM to know the data types and use compiler optimizations like it was C code.

This is pretty much it because we cannot do @parallel because we need to ensure we are stepping in time without skipping steps (@parallel will re-order the timesteps). So, all in all, how well did we do? The final code takes .15 seconds on my machine to run (without the plotting). However, MATLAB is at .05. Why did MATLAB win? If you check the timings in more detail you’ll see that it all comes down to the inner loop steps still taking longer than MATLAB, and there is a clear reason for it: Julia does not have multithreading yet. If you check

A = rand(20000,20000);
B = rand(20000,20000);
A.*B

you will see that this uses all of the cores on your machine, whereas the same exact code in Julia does not (if you do A*B, Julia will because it sends that call to the Fortran program BLAS which is multi-threaded). So there it is, MATLAB wins because it’s using all 16 cores on my workstation, but Julia on 1 core holds up to it and makes a prettier plot. With this, I think Julia will win out in this example once it’s multi-threaded, but we’ll see.

What I really wanted to highlight here is that Julia is flexible enough to make both the prototype and the final code. You can code using simple syntax like in MATLAB to get things done, and then when you have a working product, start adding macros, parallelization, and type specifications to get something that both complies and runs like C. In MATLAB, you write a MATLAB code and when you get that working, you re-write all the heavy-lifting parts in C and link it up via MEX. This is quite a pain to do, and this itself is why I am investing time in Julia, knowing it will payout well.

Note: Julia’s threading branch has merged to the main branch already. They are just waiting for it to stabilize a bit to then put it in a standard release (with the major multi-threaded operations such as .* utilizing it). So stay tuned.

The post Optimizing Julia for Performance: A Practical Example appeared first on Stochastic Lifestyle.