Optimizing .*: Details of Vectorization and Metaprogramming

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/an-in-depth-look-at-vectorization/

For many of us mathematicians, we were taught to use MATLAB, and we were taught to vectorize everything. Obviously, if we have matrices A, B, and C and want to multiply them element-wise (say, to solve a reaction equation at each point in space), then the optimal code is

A.*B.*C

No questions to ask, right? Actually, this code isn’t as optimized as you’d think. Let’s dig deeper.

BLAS, LAPACK, and MKL

The reason you are always told by “the lords of numerical math” to vectorize your code is that very smart programmers worked really hard on making basic things work well. Most of the “standardized” vectorized computations are calling subroutines from packages known as BLAS and LAPACK. To see which versions your MATLAB is using, you can call

version -blas
version -lapack

So every time you call svd, A*B, etc., you are actually calling a highly-optimized, state-of-the-art C/Fortran/Assembly code mixture which does all sorts of things: it avoids cache misses, it is multi-threaded (i.e. these functions will run in parallel on your multi-core machine), and much more!

But A.*B doesn’t do a BLAS call! When you use A.*B in MATLAB, it uses its own C code to loop through and perform this operation. This causes a few problems. First of all, if you are using co-processors/accelerators such as the Xeon Phi, the option of automatically offloading highly-vectorized problems to a GPU-like device only offloads BLAS/LAPACK calls. Therefore A*B gets accelerated whereas A.*B does not. Another thing is that, as far as I can tell, A.*B does not have all of the processor-specific optimizations to make this operation super fast. This is huge since processor technologies like AVX2 let a single vector instruction operate on 8 numbers at a time, letting highly-parallel math operations like this get close to an 8x speedup if used correctly (and AVX3 is coming out soon to double that!).

So the simple A.*B is good for prototyping, but from benchmarking many SPDE solvers I realized this was holding me back. How can we do better? Well, we can replace .* with vector operations from MKL by directly calling MKL VML functions. MATLAB has a page on how to use the MEX interface to call BLAS/LAPACK functions, and you can use the same approach with the v?Mul routines (i.e. vdMul for 64-bit floats). Thus to do this in MATLAB you have to write some C code. Fun!

Number of Operations ~ Number of Calls

However, this STILL isn’t optimal! To see this, let us write out what MATLAB’s interpreter turns this into:

times(times(A,B),C)

Thus, behind the scenes, it looks something like this:

for i = 1:N
  for j = 1:M
     temp[i,j] = A[i,j] * B[i,j];
  end
end
 
%Some function call overhead to return temp, save it to temp2
 
for i = 1:N
  for j = 1:M
     temp3[i,j] = temp2[i,j] * C[i,j];
  end
end
 
%Return temp3

Instead of one NxM loop, we did two NxM loops, effectively doubling the amount of computation required! Now think about solving reactions like:

A.*B.*C + D.*E*gamma + Diffusion*A;

and you see that simple vectorization is actually far from optimal. Not only did we have to do extra loops, we also had to make a bunch of temporary variables and garbage-collect them. What we really want to do is something like:

for i = 1:N
  for j = 1:M
     temp[i,j] = A[i,j] * B[i,j] * C[i,j];
  end
end

How do you do this in MATLAB? Once again, the answer is to write C-code via the MEX interface. You can do this if you want.

Julia’s Approach

Let’s look at whether Julia does any better. Since Julia is open source, written in Julia itself, and hosted on GitHub, we can look directly at the source code and go through the details. In the Julia REPL, we can find out which function is actually called by looking at:

julia> methods(.*)
# 26 methods for generic function ".*":
.*(x::Real, r::OrdinalRange{T,S}) at range.jl:574
.*(x::Real, r::FloatRange{T<:AbstractFloat}) at range.jl:575
.*(x::Real, r::LinSpace{T<:AbstractFloat}) at range.jl:576
.*(r::FloatRange{T<:AbstractFloat}, x::Real) at range.jl:578
.*(r::LinSpace{T<:AbstractFloat}, x::Real) at range.jl:579
.*(r::Range{T}, x::Real) at range.jl:577
.*(x::Number, r::Range{T}) at range.jl:646
.*(r::Range{T}, y::Number) at range.jl:647
.*(x::Number, y::Number) at operators.jl:115
.*(x::Bool, B::BitArray{N}) at bitarray.jl:1033
.*(x::Number, B::BitArray{N}) at bitarray.jl:1035
.*(A::Number, B::SparseMatrixCSC{Tv,Ti<:Integer}) at sparse/sparsematrix.jl:1028
.*{P<:Base.Dates.Period}(y::Real, X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}) at dates/periods.jl:56
.*{T}(A::Number, B::AbstractArray{T,N}) at arraymath.jl:118
.*(B::BitArray{N}, x::Bool) at bitarray.jl:1034
.*(B::BitArray{N}, x::Number) at bitarray.jl:1036
.*(A::SparseMatrixCSC{Tv,Ti<:Integer}, B::Number) at sparse/sparsematrix.jl:1027
.*{P<:Base.Dates.Period}(X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}, y::Real) at dates/periods.jl:66
.*{T}(A::AbstractArray{T,N}, B::Number) at arraymath.jl:125
.*(x::Number, J::UniformScaling{T<:Number}) at linalg/uniformscaling.jl:143
.*(J::UniformScaling{T<:Number}, x::Number) at linalg/uniformscaling.jl:144
.*(A::BitArray{N}, B::BitArray{N}) at broadcast.jl:475
.*(A::AbstractArray{Bool,N}, B::BitArray{N}) at broadcast.jl:475
.*(A::BitArray{N}, B::AbstractArray{Bool,N}) at broadcast.jl:475
.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026
.*(As::AbstractArray{T,N}...) at broadcast.jl:307

What this is saying is that there are 26 possible function calls, each depending on the types that you put in there. For example, if you do .5.*(1:100), 1:100 is not an array but a special Julia type known as a range, and so .* calls the function .*(x::Number, r::Range{T}) which can be found in range.jl at line 646. For A.*B, we want

.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026

Surprisingly, it’s actually defined in the sparsematrix.jl file. If we find that file on GitHub, we see that it is defined as:

 
(.*)(A::AbstractArray, B::AbstractArray) = broadcast_zpreserving(MulFun(), A, B)
...
broadcast_zpreserving!(args...) = broadcast!(args...)

MulFun() is what Julia translates * into, and so this means that when Julia sees A.*B, it calls broadcast on the * function. In the Julia documentation we see that broadcast simply calls the function on all of the array members. This is what we’d expect it to do, but is it better than if we were to write a loop? We check broadcast.jl and find:

        function broadcast!(f, B::$Bsig, As::$Asig...)
            nd = ndims(B)
            narrays = length(As)
 
            cache_f    = @get! cache      f       Dict{Int,Dict{Int,Any}}()
            cache_f_na = @get! cache_f    narrays Dict{Int,Any}()
            func       = @get! cache_f_na nd      $gbf($gbb, nd, narrays, f)
 
            func(B, As...)
            B
        end

which shows that it speeds up the operation by storing the generated kernel function in a cache so it doesn’t have to be rebuilt on every call. So, moral of the story: A.*B uses broadcast, which will generally beat your simple hand-written loop because of this caching and the specialized code it generates.
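
As a quick sanity check, here is a minimal timing sketch (the sizes and the naive_mul name are just for illustration) comparing the built-in .* against a naively written loop; run each timed line twice so compilation time isn’t counted:

A = rand(1000, 1000)
B = rand(1000, 1000)

# Naive elementwise multiply: allocate the output and loop in column-major order
function naive_mul(A, B)
    C = similar(A)
    for j = 1:size(A, 2), i = 1:size(A, 1)
        C[i, j] = A[i, j] * B[i, j]
    end
    return C
end

@time A .* B
@time naive_mul(A, B)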

Can We Do Better?

Now, just like with MATLAB’s .*, we have to ask: can we do better? Well, the first thing we can do is use MKL VML bindings in Julia. With that package you can plug it in and call the functions with ease (once vdmul gets added…). We could even overwrite .* so that VML is used by default. However, we still have the problem we noted with MATLAB: the most efficient approach would be to de-vectorize and write a loop which does multiple operations at once.
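
For a concrete picture of what such a binding boils down to, here is a rough sketch of calling VML’s vdMul directly through ccall. This is illustrative only, not the bindings package’s actual API: it assumes an MKL installation with libmkl_rt on the library search path, the documented vdMul(n, a, b, y) signature, and 32-bit MKL integers.

# Hypothetical sketch: elementwise multiply through MKL VML's vdMul.
function vml_mul!(y::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    ccall((:vdMul, "libmkl_rt"), Void,   # return type; spelled Cvoid on newer Julia versions
          (Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
          n, a, b, y)
    return y
end

a = rand(10^6); b = rand(10^6); y = similar(a)
vml_mul!(y, a, b)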

If you aren’t familiar with metaprogramming, it is simply using programs to write programs. MATLAB’s inability to metaprogram means we needed to go to C to write a loop, but that loop only works for exactly the type of inputs we had. It also causes problems since MATLAB’s anonymous functions are notoriously slow: if you write a function that makes a function and you make it into an anonymous function (i.e. @(x) …), then you will have at least a 2x slowdown in your code over writing the function yourself. Thus MATLAB’s inability to metaprogram means you have to write a lot of code.

This is where Julia being a newer language comes to the rescue. It is chock-full of metaprogramming abilities. What we are going to use here are macros. In a previous post I used macros such as @simd without explaining what they did. Well, a macro is simply a type of metaprogramming function that re-writes the code associated with it. It’s best explained by an example. There is a de-vectorization module for Julia which contains the @devec macro. What it does is take a vectorized expression like:

r = a .* b + c .* d + a

and re-writes it as a de-vectorized loop:

n = length(a)
r = zeros(n)
for i = 1 : n
    r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

In MATLAB we can’t use loops for high-performance code because they are not optimized, so we had to take our equations, write C code, compile it, make sure it all works, get rid of segmentation faults… it wasn’t fun. In Julia, loops are optimized, so there’s no reason to go to C to write the loop, and the @devec macro writes that de-vectorized loop for you. Therefore, to get the optimized loop in Julia, we write

@devec r = a .* b + c .* d + a

Thus the difference between your “high-performance code” and your prototype code is @devec. Welcome to the future, folks.
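
To make this concrete, here is a small usage sketch, assuming the Devectorize.jl package (which provides @devec) has been installed with Pkg.add("Devectorize"); the function names are made up for illustration:

using Devectorize

# Prototype version: plain vectorized syntax, which allocates temporaries.
react(a, b, c, d) = a .* b + c .* d + a

# "Optimized" version: the identical expression, with @devec rewriting it into a single loop.
function react_devec(a, b, c, d)
    @devec r = a .* b + c .* d + a
    return r
end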

Going All the Way: Optimizing the Loop

That’s not necessarily the end. Remember that BLAS/MKL/VML uses technologies like SIMD/AVX in order to do multiple operations per instruction? Intel has written a Julia macro which will take loops and vectorize them in this manner when you add @simd before the loop. MKL also speeds up computations by relaxing strict floating-point semantics (trading a little accuracy for speed), which can be done with the @fastmath macro. Lastly, we can turn off array bounds checking with @inbounds to make things faster as well. Thus what we really want is @devec to write the code:

n = length(a)
r = zeros(n)
@simd for i = 1 : n
    @fastmath @inbounds r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

By adding @fastmath @inbounds before @devec, we can get that part changed. What about adding @simd? Macros can take in arguments, and the developer of @devec is looking to add a context argument so you can tell it to expand with @simd, @parallel (for multi-CPU), or even with CUDA (i.e. run this on the GPU in parallel), OpenCL, and others. Multi-threading will probably be added as well once it lands in Julia’s base library. If you want it done faster, just contribute to the GitHub repository!

I do have to mention a caveat. There are expressions where @devec will not work. For example, sqrt(exp(A*B)) where A and B are matrices. Even though it’s simple to see how we’d want to vectorize it, writing something that will go “okay, make A*B a BLAS call, then apply sqrt(exp()) to that” is hard to do in general. The creator of @devec is looking into using Julia’s generated functions to perform this kind of task. However, you can always handle such cases by writing the loop out yourself, or another vectorized implementation would be to use the broadcast function. Both are much easier than going to C and still do quite well performance-wise. (On this note, there is something like broadcast in MATLAB, namely the cellfun, structfun, and arrayfun methods. However, they are so slow (except on GPUs, where they are the fastest…) that they should never be used.)
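
For instance, one way to handle the sqrt(exp(A*B)) example by hand is to let the matrix product hit BLAS and then broadcast the elementwise kernel over the result; a minimal sketch with arbitrary sizes:

A = rand(200, 200)
B = rand(200, 200)

# A*B dispatches to a BLAS call; sqrt(exp(x)) is then applied elementwise to the product.
C = broadcast(x -> sqrt(exp(x)), A * B)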

Moral of the Story

I hope I showed you that there is much more to making .* achieve its best performance than you originally thought. Of course, in MATLAB, using .* and other vectorized operations is almost always better than writing a loop. Thus in every class you’re told to ALWAYS use vectorized MATLAB functions because it’s best. In reality, that’s not the case. If you want performance, you need specially written loops in a high-performance language, and you have to handle cache behavior, AVX/SIMD, multi-threading, etc. In MATLAB, this means writing C code and calling it via the MEX interface, and it needs to be done for every different setup you encounter. However, in Julia, the @devec macro achieves the same end (almost, and it keeps getting better!). That’s the power of metaprogramming and one of the many reasons why I am trying to use more Julia.
