
Optimal Number of Workers for Parallel Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/236-2/

How many workers do you choose when running a parallel job in Julia? The answer is easy, right? The number of physical cores. We always default to that number. For my Core i7 4770K, that means it’s 4, not 8, since 8 would include the hyperthreads. On my FX8350, there are 8 cores but only 4 floating-point units (FPUs) which do the math, so for mathematical projects I should use 4, right? I want to demonstrate that it’s not that simple.

Where the Intuition Comes From

Most of the time when doing scientific computing you are doing parallel programming without even knowing it. This is because a lot of vectorized operations are “implicitly parallelized”, meaning that they are multi-threaded behind the scenes to make everything faster. This is also the case in other languages like Python, MATLAB, and R. Fire up MATLAB and run

A = randn(10000,10000)
B = randn(10000,10000)
A.*B

and you will see that all of your cores are used. Threads are a recent introduction to Julia, and so in version 0.5 this will also be the case.
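
Julia’s threading interface (still experimental in 0.5) also lets you write that multi-threaded elementwise loop yourself. A minimal sketch, assuming Julia was started with the JULIA_NUM_THREADS environment variable set:

A = randn(10000,10000)
B = randn(10000,10000)
C = similar(A)

# split the elementwise multiply across the available threads
Threads.@threads for i in 1:length(A)
    C[i] = A[i] * B[i]
end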

Another large place where implicit parallelization comes up is in linear algebra. When you do a matrix multiplication, it is almost surely calling an underlying implementation of BLAS. BLAS (Basic Linear Algebra Subprograms) is, as the name suggests, just a set of routines for basic linear algebra operations. These are written in C or Fortran and are heavily optimized. They are well-studied, and many smart people have meticulously crafted “perfect code” which minimizes cache misses and all of that other low-level stuff, all to make this very common operation run smoothly.

Because BLAS (and LAPACK, the Linear Algebra PACKage, for higher-level linear algebra routines) is so optimized, people say you should always make sure it knows exactly how many “real” processors it has to work with. So in my case, with a Core i7 that has 4 physical cores plus 4 from hyperthreading, forget the hyperthreading: use 4. With the FX8350, there are only 4 processors for doing the math, so 4 threads. For BLAS-dominated code, it’s worth checking that this setting really is best.
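
In Julia you can set the BLAS thread count yourself. A minimal sketch, assuming a 0.4/0.5-era Julia where the BLAS module lives under Base.LinAlg (on later versions it is LinearAlgebra.BLAS):

Base.LinAlg.BLAS.set_num_threads(4)  # pin BLAS to the 4 physical cores
A = randn(5000,5000)
B = randn(5000,5000)
@time A*B                            # a multi-threaded BLAS (gemm) call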

What about for your code?

Most likely this does not apply to your code. You didn’t carefully manage all of your allocations and tell the compiler what needs to be cached etc. You just wrote some beautiful Julia code. So how many workers do you choose?

Let’s take my case. I have 4 real cores; do I choose 4 workers? Or do I use 3 workers so one core is left free to “command” the others? Or do I go to 7 or 8 because of hyperthreading?

I decided to test this out on a non-trivial example. I am not going to share all of the code (I am submitting it as part of a manuscript soon), but the basic idea is that it is a high-order adaptive solver for stochastic differential equations. The code sets up the problem and then calls pmap to run a Monte Carlo simulation, solving the equation 1000 times in parallel. The code is mostly math, but there is a slight twist where some values are stored on stacks (a very lightweight data structure). To make sure I could trust the times, I ran the code 1000 times and took the average, min, and max times.
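
The parallel part looks roughly like the following sketch; solve_sde here is a hypothetical stand-in for the actual adaptive SDE solver:

addprocs(15)   # launch the worker processes (on Julia 1.x this needs `using Distributed`)

@everywhere function solve_sde(i)
    # ... set up and run the high-order adaptive SDE solver here ...
    return rand()   # placeholder for the quantity being estimated
end

@time results = pmap(solve_sde, 1:1000)   # 1000 Monte Carlo runs spread across the workers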

So in this case, what was best? The results speak for themselves.

Average wall times vs. number of worker processes (1000 iterations)

Number of Workers   Average Wall Time   Max Wall Time   Min Wall Time
 1                  62.8732             64.3445         61.4971
 3                  25.749              26.6989         25.1143
 4                  22.4782             23.2046         21.8322
 7                  19.7411             20.2904         19.1305
 8                  19.0709             20.1682         18.5846
 9                  18.3677             18.9592         17.6
10                  18.1857             18.9801         17.6823
11                  18.1267             18.7089         17.5099
12                  17.9848             18.5083         17.5529
13                  17.8873             18.4358         17.3664
14                  17.4543             17.9513         16.9258
15                  16.5952             17.2566         16.1435
16                  17.5426             18.4232         16.2633
17                  16.927              17.5298         16.4492

Note there are two “1000”s here. I ran the Monte Carlo simulation (each run solving the SDE 1000 times itself) 1000 times, and the table above gives the mean, max, and min times it took to run the simulation. From these timings it’s very clear that the minimum lies somewhere around 15 workers. 15!

What’s going on? My guess is that this is because of the time that’s not spent on the actual mathematics. Sometimes there are things performing logic, checking if statements, allocating new memory as the stacks grow bigger, etc. Although it is a math problem, there is more than just the math in this problem! Thus it seems the scheduler is able to effectively let the processes compete and more fully utilize the CPU by pushing the jobs around. This can only go so far: if you have too many workers, then you start to get cache misses and then the computational time starts to increase. Indeed, at 10 workers I could already see signs of problems in the resource manager.

Overload at 10 workers as seen in the Windows Resource Manager

However, at lower worker counts, letting one process go off and re-allocate memory (or whatever it happens to be doing) while the others keep computing seems to be a good tradeoff, even at the cost of the occasional cache miss. Thus for this code the optimal number of workers is far above the number of physical cores.

Moral of the Story

The moral is: test your code. If your code is REALLY efficient, then sure, sticking to the number of physical cores so you don’t mess with your perfect code is optimal. If your code isn’t optimal (i.e. it’s just some Julia code that is pretty good and you want to parallelize it), try some higher numbers of workers. You may be shocked by what happens. In this case, overloading the number of workers cut the average wall time from about 22.5 with the 4-worker default to about 16.6 with 15 workers, a roughly 26% reduction.
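
A sweep over worker counts only takes a few lines. Here is a rough sketch with a dummy workload; swap the body of the anonymous function for your actual per-sample simulation:

for n in (1, 3, 4, 8, 15)
    addprocs(n)
    # dummy per-sample workload standing in for the real simulation
    t = @elapsed pmap(i -> sum(randn(10^6).^2), 1:1000)
    println("$n workers: $t seconds")
    rmprocs(workers())   # drop the workers before trying the next count
end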


Optimizing .*: Details of Vectorization and Metaprogramming

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/an-in-depth-look-at-vectorization/

Many of us mathematicians were taught to use MATLAB, and we were taught to vectorize everything. I mean, obviously, if we have matrices A, B, and C and want to multiply them element-wise (say to solve a reaction equation at each point in space), then the optimal code is

A.*B.*C

No questions to ask, right? Actually, this code isn’t as optimized as you’d think. Let’s dig deeper.

BLAS, LAPACK, and MKL

The reason you are always told by “the lords of numerical math” to vectorize your code is that very smart programmers worked really hard on making basic things work well. Most of the “standardized” vectorized computations are calling subroutines from packages known as BLAS and LAPACK. To see which versions your MATLAB is using, you can call

version -blas
version -lapack

So every time you call svd, A*B, etc., you are actually calling highly-optimized, state-of-the-art C/Fortran/assembly code which does all sorts of things for you: it minimizes cache misses, runs multi-threaded (i.e. these functions run in parallel on your multi-core machine), and much more!

But A.*B doesn’t do a BLAS call! When you use A.*B in MATLAB, it uses its own C code to loop through and perform this operation. This causes a few problems. First of all, if you are using co-processors/accelerators such as the Xeon Phi, the option of automatically offloading highly-vectorized problems to a GPU-like device only offloads BLAS/LAPACK calls. Therefore A*B gets accelerated whereas A.*B does not. Another thing is that, as far as I can tell, A.*B does not have all of the processor-specific optimizations that make this operation super fast. This is huge since processor technologies like AVX2 allow vector instructions to operate on 8 numbers at a time, giving highly-parallel math operations like this close to an 8x speedup if used correctly (and AVX-512 is coming out soon to double that!).

So the simple A.*B is good for prototyping, but from benchmarking many SPDE solvers I realized this was holding me back. How can we do better? Well, we can replace .* with vector operations from MKL by directly calling MKL VML functions. MATLAB has a page on how to use the MEX interface to call BLAS/LAPACK functions, and you can use this with the VML v?mul routines (i.e. vdmul for 64-bit floats). Thus to do this in MATLAB you have to write some C code. Fun!

Number of Operations ~ Number of Calls

However, this STILL isn’t optimal! To see this, let us write out what MATLAB’s interpreter turns this into:

times(times(A,B),C)

Thus, behind the scenes, it looks something like this:

for i = 1:N
  for j = 1:M
     temp(i,j) = A(i,j) * B(i,j);
  end
end
 
% Some function call overhead to return temp and save it to temp2
 
for i = 1:N
  for j = 1:M
     temp3(i,j) = temp2(i,j) * C(i,j);
  end
end
 
% Return temp3

Instead of one pass over the NxM arrays, we did two, doubling the loop overhead and the memory traffic! Now think about solving reactions like:

A.*B.*C + D.*E*gamma + Diffusion*A;

you see that simple vectorization is actually far from optimal. Not only did we have to do extra loops, but we also had to make a bunch of temporary variables and garbage collect. What we really want to do is something like:

for i = 1:N
  for j = 1:M
     temp(i,j) = A(i,j) * B(i,j) * C(i,j);
  end
end

How do you do this in MATLAB? Once again, the answer is to write C-code via the MEX interface. You can do this if you want.

Julia’s Approach

Let’s look at whether Julia does any better. Since Julia is open source, we can actually look directly at the source code, and since it’s written in Julia and hosted on GitHub it’s easy to go through the details. In the Julia REPL, we can find out what function is actually called by looking at:

julia> methods(.*)
# 26 methods for generic function ".*":
.*(x::Real, r::OrdinalRange{T,S}) at range.jl:574
.*(x::Real, r::FloatRange{T<:AbstractFloat}) at range.jl:575
.*(x::Real, r::LinSpace{T<:AbstractFloat}) at range.jl:576
.*(r::FloatRange{T<:AbstractFloat}, x::Real) at range.jl:578
.*(r::LinSpace{T<:AbstractFloat}, x::Real) at range.jl:579
.*(r::Range{T}, x::Real) at range.jl:577
.*(x::Number, r::Range{T}) at range.jl:646
.*(r::Range{T}, y::Number) at range.jl:647
.*(x::Number, y::Number) at operators.jl:115
.*(x::Bool, B::BitArray{N}) at bitarray.jl:1033
.*(x::Number, B::BitArray{N}) at bitarray.jl:1035
.*(A::Number, B::SparseMatrixCSC{Tv,Ti<:Integer}) at sparse/sparsematrix.jl:1028
.*{P<:Base.Dates.Period}(y::Real, X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}) at dates/periods.jl:56
.*{T}(A::Number, B::AbstractArray{T,N}) at arraymath.jl:118
.*(B::BitArray{N}, x::Bool) at bitarray.jl:1034
.*(B::BitArray{N}, x::Number) at bitarray.jl:1036
.*(A::SparseMatrixCSC{Tv,Ti<:Integer}, B::Number) at sparse/sparsematrix.jl:1027
.*{P<:Base.Dates.Period}(X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}, y::Real) at dates/periods.jl:66
.*{T}(A::AbstractArray{T,N}, B::Number) at arraymath.jl:125
.*(x::Number, J::UniformScaling{T<:Number}) at linalg/uniformscaling.jl:143
.*(J::UniformScaling{T<:Number}, x::Number) at linalg/uniformscaling.jl:144
.*(A::BitArray{N}, B::BitArray{N}) at broadcast.jl:475
.*(A::AbstractArray{Bool,N}, B::BitArray{N}) at broadcast.jl:475
.*(A::BitArray{N}, B::AbstractArray{Bool,N}) at broadcast.jl:475
.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026
.*(As::AbstractArray{T,N}...) at broadcast.jl:307

What this is saying is that there are 26 possible function calls, each chosen depending on the types that you put in there. For example, if you do .5.*(1:100), 1:100 is not an array but a special Julia type known as a range, and so .* calls the function .*(x::Number, r::Range{T}), which can be found in range.jl at line 646. For A.*B, we want

.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026

Surprisingly, it’s actually defined in the sparsematrix.jl file. Finding that file on GitHub, we see that it is defined as:

 
(.*)(A::AbstractArray, B::AbstractArray) = broadcast_zpreserving(MulFun(), A, B)
...
broadcast_zpreserving!(args...) = broadcast!(args...)

MulFun() is what Julia translates * into, and so this means that when Julia sees A.*B, it calls broadcast with the * function. In the Julia documentation we see that broadcast simply calls the function on all of the array members. This is what we’d expect it to do, but is it better than if we were to write a loop? We check broadcast.jl and find:

        function broadcast!(f, B::$Bsig, As::$Asig...)
            nd = ndims(B)
            narrays = length(As)
 
            cache_f    = @get! cache      f       Dict{Int,Dict{Int,Any}}()
            cache_f_na = @get! cache_f    narrays Dict{Int,Any}()
            func       = @get! cache_f_na nd      $gbf($gbb, nd, narrays, f)
 
            func(B, As...)
            B
        end

which shows that it speeds things up by caching the generated kernel function for each combination of dimensions and arguments, so repeated calls don’t pay the code-generation cost again. So, moral of the story: A.*B uses broadcast, which generates and caches a specialized loop, and so it will generally do at least as well as a simple hand-written loop.
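
As a quick sanity check of this dispatch chain, you can verify that .* and an explicit call to broadcast with * really do agree:

A = rand(3,3); B = rand(3,3)
A .* B == broadcast(*, A, B)   # true: .* is just broadcast applied to *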

Can We Do Better?

Now, just like with MATLAB’s .*, we have to ask: can we do better? Well, the first thing we can do is use MKL VML bindings in Julia. With that package you can plug it in and call the functions with ease (once vdmul gets added…). We can even overwrite .* so that VML is used by default. However, we still have the problem we noted with MATLAB: the most efficient approach would be to de-vectorize and write a loop which does multiple operations at once.

If you aren’t familiar with metaprogramming, it is simply using programs to write programs. MATLAB’s inability to metaprogram means we needed to go to C to write a loop, but that loop only works for exactly the types of inputs we had. It also causes problems since MATLAB’s anonymous functions are notoriously slow: if you write a function that makes a function and wrap it as an anonymous function (i.e. @(x) …), you will get at least a 2x slowdown over writing the function out yourself. Thus MATLAB’s inability to metaprogram means you have to write a lot of code.

This is where Julia being a newer language comes to the rescue. It is chock-full of metaprogramming abilities. What we are going to use here are macros. In a previous post I used macros such as @simd without explaining what they did. Well, a macro is simply a type of metaprogramming function that re-writes the code associated with it. It’s best explained by an example. There is a de-vectorization package for Julia (Devectorize.jl) which contains the @devec macro. What it does is take a vectorized expression like:

r = a .* b + c .* d + a

and re-writes it as a de-vectorized loop:

n = length(a)
r = zeros(n)
for i = 1 : n
    r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

In MATLAB we can’t use loops for high-performance code because they are not optimized, so we had to take our equations, write C code, compile it, make sure it all works, get rid of segmentation faults… it wasn’t fun. In Julia, loops are optimized, so there’s no reason to go to C to write the loop, and the @devec macro writes that loop for you. Therefore, to make the optimized loop in Julia, we write

@devec r = a .* b + c .* d + a

Thus the difference between your “high-performance code” and your prototype code is @devec. Welcome to the future, folks.

Going all the way: Optimizing the loop

That’s not necessarily the end. Remember that BLAS/MKL/VML use technologies like SIMD/AVX in order to do multiple operations per instruction? Intel has contributed a Julia macro, @simd, which will vectorize a loop in this manner when you place it before the loop. MKL also speeds up computations by relaxing strict floating-point semantics, which in Julia can be done with the @fastmath macro. Lastly, we can turn off array bounds checking with the @inbounds macro to make things faster as well. Thus what we really want is for @devec to write the code:

n = length(a)
r = zeros(n)
@simd for i = 1 : n
    @fastmath @inbounds r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

By adding @fastmath @inbounds before @devec we can get that part changed. What about adding @simd? Macros can take in arguments, and the developer of @devec is looking to add a context argument so you can tell it to expand with @simd, @parallel (for multi-CPU), or even with CUDA (i.e. run this on a GPU in parallel), OpenCL, and others. Multi-threading will probably be added as well once it lands in Julia’s base library. If you want it done faster, just contribute to the GitHub repository!
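
If you’d rather not wait for those features, you can also write the annotated kernel by hand. A minimal sketch (the fused! name and the assumption that a, b, c, and d are equal-length vectors are mine):

function fused!(r, a, b, c, d)
    @simd for i = 1:length(r)
        @fastmath @inbounds r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
    end
    return r
end

r = fused!(zeros(length(a)), a, b, c, d)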

I do have to mention a caveat. There are cases where @devec will not work. For example, sqrt(exp(A*B)) where A and B are matrices. Even though it’s simple to see how we’d want to vectorize it, writing something that will go “okay, make A*B a BLAS call, then apply sqrt(exp()) to that” is hard to do in general. The creator of @devec is looking into using Julia’s generated functions to perform this kind of task. However, you can always handle it by writing the loop out yourself, or another vectorized implementation would be to use the broadcast function. Both are much easier than going to C and still do quite well performance-wise. (On this note, MATLAB has something like broadcast in its cellfun, structfun, and arrayfun methods; however, they are so slow, except on GPUs where they are the fastest option, that they should never be used.)
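
For the sqrt(exp(A*B)) example above, the broadcast version is short; a minimal sketch, applying exp and sqrt elementwise to the product:

C = A*B                                 # the matrix product still goes to BLAS
result = broadcast(x -> sqrt(exp(x)), C)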

Moral of the Story

I hope I showed you that there is much more to making .* achieve its best performance than you originally thought. Of course, in MATLAB, using .* and other vectorized operations is almost always better than writing a loop. Thus in every class you’re told to ALWAYS use vectorized MATLAB functions because it’s best. In reality, that’s not the case. If you want performance, you need specially-written loops in a high-performance language, and you have to handle caching, AVX/SIMD, multi-threading, etc. yourself. In MATLAB, this means writing C code and calling it via the MEX interface, and it needs to be redone for every different setup you encounter. However, in Julia, the @devec macro achieves the same end (almost, and it keeps getting better!). That’s the power of metaprogramming and one of the many reasons why I am trying to use more Julia.
