
Optimizing .*: Details of Vectorization and Metaprogramming

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/an-in-depth-look-at-vectorization/

Many of us mathematicians were taught to use MATLAB, and we were taught to vectorize everything. I mean, obviously, if we have matrices A, B, and C and want to multiply them element-wise (say, to solve a reaction equation at each point in space), then the optimal code is

A.*B.*C

No questions to ask, right? Actually, this code isn't as optimized as you'd think. Let's dig deeper.

BLAS, LAPACK, and MKL

The reason you are always told by "the lords of numerical math" to vectorize your code is that very smart programmers worked really hard on making basic things work well. Most of the "standardized" vectorized computations call subroutines from packages known as BLAS and LAPACK. To see what versions your MATLAB is using, you can call

version -blas
version -lapack

So every time you call svd, A*B, etc., you are actually calling a highly optimized, state-of-the-art mixture of C/Fortran/assembly code which does all sorts of things: it avoids cache misses, it is multi-threaded (i.e. these functions will run in parallel on your multi-core machine), and much more!

But A.*B doesn't do a BLAS call! When you use A.*B in MATLAB, it uses its own C code to loop through and perform this operation. This causes a few problems. First of all, if you are using co-processors/accelerators such as the Xeon Phi, the option of automatically offloading highly-vectorized problems to a GPU-like device only offloads BLAS/LAPACK calls. Therefore A*B gets accelerated whereas A.*B does not. Another issue is that, as far as I can tell, A.*B does not have all of the processor-specific optimizations that make this operation super fast. This is huge since processor technologies like AVX2 allow a single vector instruction to operate on 8 numbers at a time, letting highly parallel math operations like this get close to an 8x speedup if used correctly (and AVX3 is coming out soon to double that!).

So the simple A.*B is good for prototyping, but from benchmarking many SPDE solvers I realized it was holding me back. How can we do better? Well, we can replace .* with vector operations from MKL by directly calling MKL VML functions. MATLAB has a page on how to use the MEX interface to call BLAS/LAPACK functions, and you can use the same approach with v?Mul (i.e. vdMul for 64-bit floats). Thus to do this in MATLAB you have to write some C code. Fun!

Number of Operations ~ Number of Calls

However, this STILL isn't optimal! To see this, let us write out what MATLAB's interpreter turns this into:

times(times(A,B),C)

Thus, behind the scenes, it looks something like this:

for i = 1:N
  for j = 1:M
     temp[i,j] = A[i,j] * B[i,j];
  end
end
 
%Some function call overhead to return temp, save it to temp2
 
for i = 1:N
  for j = 1:M
     temp3[i,j] = temp2[i,j] * C[i,j];
  end
end
 
%Return temp3

Instead of one NxM loop, we did two NxM loops, effectively doubling the loop overhead and the number of passes over the data, with a temporary array in between! Now think about solving reactions like:

A.*B.*C + D.*E*gamma + Diffusion*A;

you see that simple vectorization is actually far from optimal. Not only do we run extra loops, but we also create a bunch of temporary variables that later have to be garbage collected. What we really want to do is something like:

for i = 1:N
  for j = 1:M
     temp[i,j] = A[i,j] * B[i,j] * C[i,j];
  end
end

How do you do this in MATLAB? Once again, the answer is to write C-code via the MEX interface. You can do this if you want.

Julia’s Approach

Let's look at whether Julia does any better. Since Julia is open source, we can actually look directly at the source code, and since it's written in Julia and hosted on GitHub, it's easy to go through the details. In the Julia REPL, we can find out what function is actually called by looking at:

julia> methods(.*)
# 26 methods for generic function ".*":
.*(x::Real, r::OrdinalRange{T,S}) at range.jl:574
.*(x::Real, r::FloatRange{T<:AbstractFloat}) at range.jl:575
.*(x::Real, r::LinSpace{T<:AbstractFloat}) at range.jl:576
.*(r::FloatRange{T<:AbstractFloat}, x::Real) at range.jl:578
.*(r::LinSpace{T<:AbstractFloat}, x::Real) at range.jl:579
.*(r::Range{T}, x::Real) at range.jl:577
.*(x::Number, r::Range{T}) at range.jl:646
.*(r::Range{T}, y::Number) at range.jl:647
.*(x::Number, y::Number) at operators.jl:115
.*(x::Bool, B::BitArray{N}) at bitarray.jl:1033
.*(x::Number, B::BitArray{N}) at bitarray.jl:1035
.*(A::Number, B::SparseMatrixCSC{Tv,Ti<:Integer}) at sparse/sparsematrix.jl:1028
.*{P<:Base.Dates.Period}(y::Real, X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}) at dates/periods.jl:56
.*{T}(A::Number, B::AbstractArray{T,N}) at arraymath.jl:118
.*(B::BitArray{N}, x::Bool) at bitarray.jl:1034
.*(B::BitArray{N}, x::Number) at bitarray.jl:1036
.*(A::SparseMatrixCSC{Tv,Ti<:Integer}, B::Number) at sparse/sparsematrix.jl:1027
.*{P<:Base.Dates.Period}(X::Union{DenseArray{P<:Base.Dates.Period,N},SubArray{P<:Base.Dates.Period,N,A<:DenseArray{T,N},I<:Tuple{Vararg{Union{Colon,Int64,Range{Int64}}}},LD}}, y::Real) at dates/periods.jl:66
.*{T}(A::AbstractArray{T,N}, B::Number) at arraymath.jl:125
.*(x::Number, J::UniformScaling{T<:Number}) at linalg/uniformscaling.jl:143
.*(J::UniformScaling{T<:Number}, x::Number) at linalg/uniformscaling.jl:144
.*(A::BitArray{N}, B::BitArray{N}) at broadcast.jl:475
.*(A::AbstractArray{Bool,N}, B::BitArray{N}) at broadcast.jl:475
.*(A::BitArray{N}, B::AbstractArray{Bool,N}) at broadcast.jl:475
.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026
.*(As::AbstractArray{T,N}...) at broadcast.jl:307

What this is saying is that there are 26 possible function calls, and which one is used depends on the types you put in. For example, if you do .5.*(1:100), 1:100 is not an array but a special Julia type known as a range, and so .* calls the function .*(x::Number, r::Range{T}) which can be found in range.jl at line 646. For A.*B, we want

.*(A::AbstractArray{T,N}, B::AbstractArray{T,N}) at sparse/sparsematrix.jl:1026

Surprisingly, it's actually defined in the sparsematrix.jl file. Finding that file on GitHub, we see that it is defined as:

 
(.*)(A::AbstractArray, B::AbstractArray) = broadcast_zpreserving(MulFun(), A, B)
...
broadcast_zpreserving!(args...) = broadcast!(args...)

MulFun() is what Julia translates * into, and so this means that when Julia sees A.*B, it calls broadcast on the * function. In the Julia documentation we see that broadcast simply calls the function on all of the array members. This is what we'd expect it to do, but is it better than if we were to write a loop? We check broadcast.jl and find:

        function broadcast!(f, B::$Bsig, As::$Asig...)
            nd = ndims(B)
            narrays = length(As)
 
            cache_f    = @get! cache      f       Dict{Int,Dict{Int,Any}}()
            cache_f_na = @get! cache_f    narrays Dict{Int,Any}()
            func       = @get! cache_f_na nd      $gbf($gbb, nd, narrays, f)
 
            func(B, As...)
            B
        end

which shows that it speeds up repeated operations by caching the generated loop function (keyed on the function, the number of arrays, and the dimensionality) so that it doesn't have to be rebuilt on every call. So, moral of the story: A.*B uses broadcast, which gives you a specialized, cached loop that will match (and often beat) a naively written loop.
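
You can sanity-check this with a quick timing sketch. Here handmul is just a hypothetical name for the naive loop, and the exact numbers will vary by machine:

A = rand(1000,1000); B = rand(1000,1000)

function handmul(A, B)
    C = similar(A)
    for i in eachindex(A)  # visit every element once
        C[i] = A[i] * B[i]
    end
    return C
end

handmul(A, B); A .* B  # run each once to warm up the JIT
@time handmul(A, B)
@time A .* B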

Can We Do Better?

Now just like MATLAB's .* we have to ask: can we do better? Well, the first thing we can do is use the MKL VML bindings in Julia. With that package you can plug it in and call the functions with ease (once vdmul gets added…). We can even overwrite .* so that VML is used by default. However, we still have the problem we noted with MATLAB: the most efficient way would be to de-vectorize and write a loop which does multiple operations at once.
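
To make the VML idea concrete, here is a minimal sketch of what a hand-rolled binding could look like. This is an illustration, not the package's API: it assumes MKL's single dynamic library libmkl_rt is on your library path, assumes the LP64 (32-bit integer) interface, and vml_mul! is just a name I made up:

function vml_mul!(y::Vector{Float64}, a::Vector{Float64}, b::Vector{Float64})
    n = length(a)
    # VML's vdMul computes y[i] = a[i]*b[i] in one SIMD-optimized, threaded call
    ccall((:vdMul, "libmkl_rt"), Void,
          (Cint, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
          n, a, b, y)
    return y
end

a = rand(10^6); b = rand(10^6); y = similar(a)
vml_mul!(y, a, b)  # same result as a .* b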

If you aren't familiar with metaprogramming, it is simply using programs to write programs. MATLAB's inability to metaprogram means we needed to go to C to write a loop, but that loop only works for exactly the type of inputs we had. It also causes problems since MATLAB's anonymous functions are notoriously slow: if you write a function that makes a function and return it as an anonymous function (i.e. @(x) …), you will see at least a 2x slowdown over writing the function yourself. Thus MATLAB's inability to metaprogram means you have to write a lot of code.

This is where Julia, being a newer language, comes to the rescue. It is chock-full of metaprogramming abilities. What we are going to use here are macros. In a previous post I used macros such as @simd without explaining what they did. Well, a macro is simply a type of metaprogramming function that re-writes the code associated with it. It's best explained by an example. There is a de-vectorization module for Julia (Devectorize.jl) which contains the @devec macro. What it does is take a vectorized expression like:

r = a .* b + c .* d + a

and re-writes it as a de-vectorized loop:

n = length(a)
r = zeros(n)
for i = 1 : n
    r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

In MATLAB we can't use loops for high-performance code because they are not optimized, so we had to take our equations, write C code, compile it, make sure it all works, get rid of segmentation faults… it wasn't fun. In Julia, loops are optimized, so there's no reason to go to C to write the loop, and the @devec macro writes that loop for us. Therefore, to make the optimized loop in Julia, we write

@devec r = a .* b + c .* d + a

Thus the difference between your “high-performance code” and your prototype code is @devec. Welcome to the future folks.
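
If you want to see the rewriting with your own eyes, you can ask Julia to expand the macro. A quick sketch, assuming the Devectorize package is installed:

using Devectorize
a = rand(10); b = rand(10); c = rand(10); d = rand(10)
# Prints the loop that @devec generates in place of the vectorized expression
macroexpand(:(@devec r = a .* b + c .* d + a))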

Going all the way: Optimizing the loop

That's not necessarily the end. Remember that BLAS/MKL/VML uses technologies like SIMD/AVX in order to do multiple operations per instruction? Intel has written a Julia macro which will take loops and vectorize them in this manner when you add @simd before the loop. MKL also speeds up computations by lowering the precision of the math, which can be mimicked with the @fastmath macro. Lastly, we can turn off array bounds checking to make things faster as well. Thus what we really want is for @devec to write the code:

n = length(a)
r = zeros(n)
@simd for i = 1 : n
    @fastmath @inbounds r[i] = a[i] * b[i] + c[i] * d[i] + a[i]
end

By adding @fastmath @inbounds before the @devec expression we can get that part changed. What about adding @simd? Macros can take arguments, and the developer of @devec is looking to add a context argument so you can tell it to expand with @simd, @parallel (for multi-CPU), or even with CUDA (i.e. run this on the GPU in parallel), OpenCL, and others. Multi-threading will probably be added as well once it lands in Julia's base library. If you want it done faster, just contribute to the GitHub repository!

I do have to mention a caveat. There are codes where @devec will not work. For example, take sqrt(exp(A*B)) where A and B are matrices. Even though it's simple to see how we'd want to vectorize it, writing something that will go "okay, make A*B a BLAS call, then apply sqrt(exp()) to the result" is hard to do in general. The creator of @devec is looking into using Julia's generated functions to perform this kind of task. However, you can always write the loop out yourself, and another vectorized implementation is to use the broadcast function (a sketch follows below). Both are much easier than going to C and still do quite well performance-wise. (On this note, MATLAB does have something like broadcast: the cellfun, structfun, and arrayfun methods. However, they are so slow (except on GPUs, where they are the fastest…) that they should never be used.)
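
For this example, a minimal sketch of the broadcast route looks like:

A = rand(100,100); B = rand(100,100)
C = A*B                              # the matrix multiply still hits BLAS
r = broadcast(x -> sqrt(exp(x)), C)  # then sqrt(exp(.)) is applied elementwise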

Moral of the Story

I hope I showed you that there is much more to making .* achieve its best performance than you originally thought. Of course, in MATLAB, using .* and other vectorized operations is almost always better than writing a loop. Thus in every class you're told to ALWAYS use vectorized MATLAB functions because it's best. In reality, that's not the case. If you want performance, you need specially written loops in a high-performance language that implement cache handling, AVX/SIMD, multi-threading, etc. In MATLAB, this means writing C code and calling it via the MEX interface, and this needs to be done for every different setup you encounter. However, in Julia, the @devec macro achieves the same end (almost, and it keeps getting better!). That's the power of metaprogramming and one of the many reasons why I am trying to use more Julia.

The post Optimizing .*: Details of Vectorization and Metaprogramming appeared first on Stochastic Lifestyle.

Optimizing Julia for Performance: A Practical Example

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/optimizing-julia-performance-practical-example/

Let’s take a program which plots the standard logistic map:

r = 2.9:.00005:4;
numAttract = 100;
steady = ones(length(r),1)*.25;
for i=1:300 ## Get to steady state
  steady = r.*steady.*(1-steady);
end
x = zeros(length(steady),numAttract);
x[:,1] = steady;
for i=2:numAttract ## Now grab some values at the attractor
  x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
end
using PyPlot;
fig = figure(figsize=(20,10));
plot(collect(r),x,"b.",markersize=.06)
savefig("plot.png",dpi=300);

This plots the logistic map. If you take the same code and change the array dereferencing from A[i,j] to A(i,j), you get MATLAB code. However, I couldn't get MATLAB's version to look as good, but that's beside the point.

MATLAB Plot
Julia Plot

What I want to show are the steps for optimizing this code. In MATLAB, this is as vectorized as it gets, so you're pretty much done. In Julia, there are a few more steps you can take from here. The first thing to notice is that if this code is run directly, all the variables are in global scope, which hurts performance, so the first change is to wrap the code in a function. Next we note that we can improve performance via AVX2, which lets the processor operate on multiple numbers with a single instruction. Think of it as operating on vectors of 8 numbers at a time, and when AVX3 comes out you will be able to do 16 numbers at a time. This is something you'll want to take advantage of! Vectorized operations already have this implemented, so to get it in our loops we simply add @simd before the for loops. Be careful when using @simd: it allows the loop to re-order the computation. Since each simulation (different r's) is independent, we are fine here. More about this can be found here.

This gives the following code:

function logisticPlot()
  tic()
  r = 2.9:.00005:4;
  numAttract = 100;
  steady = ones(length(r),1)*.25;
  @simd for i=1:300 ## Get to steady state
    @inbounds @fastmath steady = r.*steady.*(1-steady);
  end
  x = zeros(length(steady),numAttract);
  x[:,1] = steady;
  @simd for i=2:numAttract ## Now grab some values at the attractor
    @inbounds @fastmath x[:,i] = r.*x[:,i-1].*(1-x[:,i-1]);
  end
  toc()
  fig = figure(figsize=(20,10));
  plot(collect(r),x,"b.",markersize=.06)
  savefig("plot.png",dpi=300);
end
using PyPlot
@time logisticPlot()

which does better than before. Normally we could do even better by explicitly defining the types, which lets LLVM know the data types and apply compiler optimizations as if it were C code, but there isn't an opportunity for that here.

This is pretty much it, because we cannot use @parallel: we need to ensure we are stepping in time without skipping steps (@parallel would re-order the timesteps). So, all in all, how well did we do? The final code takes .15 seconds on my machine to run (without the plotting). However, MATLAB is at .05. Why did MATLAB win? If you check the timings in more detail, you'll see that it all comes down to the inner loop steps still taking longer than MATLAB's, and there is a clear reason for it: Julia does not have multithreading yet. If you run (in MATLAB)

A = rand(20000,20000);
B = rand(20000,20000);
A.*B

you will see that it uses all of the cores on your machine, whereas the exact same code in Julia does not (though if you do A*B, Julia will, because it sends that call to the Fortran library BLAS, which is multi-threaded). So there it is: MATLAB wins because it's using all 16 cores on my workstation, but Julia on 1 core holds up to it and makes a prettier plot. With this, I think Julia will win out in this example once it's multi-threaded, but we'll see.
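
One way to see the asymmetry from the Julia side is to vary the BLAS thread count. This is a sketch using the 0.4-era API (newer versions spell it BLAS.set_num_threads), and it assumes an 8-core machine:

A = rand(5000,5000); B = rand(5000,5000)
blas_set_num_threads(1)
@time A*B   # BLAS matrix multiply pinned to one thread
blas_set_num_threads(8)
@time A*B   # same call, now multithreaded
@time A.*B  # elementwise multiply: single-threaded either way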

What I really wanted to highlight here is that Julia is flexible enough to make both the prototype and the final code. You can code using simple syntax like in MATLAB to get things done, and then, when you have a working product, start adding macros, parallelization, and type specifications to get something that both compiles and runs like C. In MATLAB, you write a MATLAB code, and when you get that working, you re-write all the heavy-lifting parts in C and link them up via MEX. That is quite a pain to do, and this itself is why I am investing time in Julia, knowing it will pay off well.

Note: Julia's threading branch has already been merged into the main branch. They are just waiting for it to stabilize a bit before putting it in a standard release (with the major multi-threaded operations such as .* utilizing it). So stay tuned.

The post Optimizing Julia for Performance: A Practical Example appeared first on Stochastic Lifestyle.

Julia iFEM1: Porting Mesh Generation

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/julia-ifem-1-mesh-generation-and-julias-type-system/

My first project on the quest for a Julia finite element method is a simple homework problem. Just for some background, this is for UC Irvine's graduate Computational PDEs 226B course, where in the first quarter we did all sorts of finite difference methods and now comes our first foray into finite element methods. The purpose of the project is to grasp the data structure well enough to use simple tools (i.e. mesh creation and plotting) to create a finite element solver for Poisson's equation in 2D and check the performance differences. For those who haven't programmed much this is a great learning exercise, but being a pretty standard exercise the MATLAB code took no time to write up, and so I started thinking: how does Julia compare to MATLAB for solving simple PDEs with the finite element method?

Testing this would take a few steps. The code base is all in MATLAB, so there is a step of (partially) porting MATLAB to Julia. Then there is the question of how to plot the data structures, given that the display commands differ (i.e. is there a way to plot in Julia that's almost exactly like MATLAB?). Then we have to optimize the Julia code and see how it fares against the MATLAB code. This post will focus on how I did the first two steps (and the problems I encountered); the last will be its own post later.

Semi-Porting to Julia: MATLAB.jl and number casting

The moment I looked at Julia code I was struck by how similar it looked to MATLAB code. The first part of the problem is to make a uniform square grid, which is easy enough to be just a few MATLAB commands, so my first approach to porting it into Julia was to simply copy-paste it, find out what caused errors, and beat through the errors until it worked. So I downloaded the Julia+Juno IDE bundle and started hacking away. The first thing I noticed was that Julia does not have a command for meshgrid in its base library. However, unsurprisingly, enough people want this command that someone had already implemented it (along with ndgrid). So I took that file, and it also served as a template for the syntax.
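
For reference, the meshgrid from that example file is essentially the following (reproduced from memory, so treat it as a sketch rather than the exact source):

function meshgrid{T}(vx::AbstractVector{T}, vy::AbstractVector{T})
    m, n = length(vy), length(vx)
    vx = reshape(vx, 1, n)  # row vector of x coordinates
    vy = reshape(vy, m, 1)  # column vector of y coordinates
    (repmat(vx, m, 1), repmat(vy, 1, n))
end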

With meshgrid in hand, things went really smoothly. Array dereferencing in Julia uses square brackets, so I had to go through and change X(:,1) to X[:,1] and the like, but that was easy enough. Another issue is that in MATLAB you can drop parts of an array using

t2nidxMap(topNode) = [];

whereas in Julia I used the deleteat function:

t2nidxMap = deleteat!(collect(t2nidxMap),collect(topNode));

You may be asking: what is that peculiar "collect" function? Well, in Julia, arrays defined by ranges (i.e. A = 1:100) return a range object and not an array. In most cases they can be used just like an array (with better performance and less memory usage), but this is a case where they cannot, so the collect function builds the actual array from the range object.
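
A quick illustration:

r = 1:100       # a lightweight range object, not an array
typeof(r)       # UnitRange{Int64}
A = collect(r)  # materializes the full 100-element Vector{Int64}

With that sorted out, this gave me the code to make the square mesh: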

function squaremesh(square,h)
  #square = [x0 x1 y0 y1], h is stepsize
  x0 = square[1]; x1= square[2]
  y0 = square[3]; y1= square[4]
  x,y = meshgrid(x0:h:x1,y0:h:y1)
  node = [x[:] y[:]];
 
  ni = size(x,1); # number of rows
  N = size(node,1);
  t2nidxMap = 1:N-ni;
  topNode = ni:ni:N-ni;
  t2nidxMap = deleteat!(collect(t2nidxMap),collect(topNode))
  k = t2nidxMap;
  elem = [k+ni k+ni+1 k ; k+1 k k+ni+1]
  return node,elem
end

whereas the MATLAB code was

%% Generate nodes
x0 = square(1); x1 = square(2); 
y0 = square(3); y1 = square(4);
[x,y] = meshgrid(x0:h:x1,y0:h:y1);
node = [x(:),y(:)];
 
%% Generate elements
ni = size(x,1); % number of rows
N = size(node,1);
t2nidxMap = 1:N-ni;
topNode = ni:ni:N-ni;
t2nidxMap(topNode) = [];
k = (t2nidxMap)';
elem = [k+ni k+ni+1 k; k+1 k k+ni+1];

As you can see, porting that function over took just a few minutes, and very few things changed. It would have taken less than a minute had I known about the range and deleteat issues. In fact, for the next many things I did in Julia, those (and the square brackets) were really the only differences, other than the plots.

Plotting via matplotlib

The next part of the project is to plot the triangulation. The MATLAB code for showmesh throws this directly into MATLAB's trisurf function. Thus I Googled "Julia trisurf" and the first hit was Julia's PyPlot package. It didn't tell me how to do it, but the package simply calls matplotlib, so I went over to the matplotlib page and found their function, conveniently named plot_trisurf. All I had to do was pick a better color scheme than the default and I was good to go. So while the MATLAB code called

trisurf(elem(:,1:3),node(:,1),node(:,2),zeros(size(node,1),1));

to get the code on the project webpage, the Julia code called

plot_trisurf(node[:,1],node[:,2],zeros(size(node,1)),cmap=get_cmap("ocean"))

to give me this nifty 3D plot which shows the triangulation:

juliaTriangulation

Simple enough.

Getting lazy: MATLAB time?

Okay, things are pretty easy. Now we have to make a circle mesh. So I go into the code, only to find that it calls a much larger MATLAB package for mesh generation. This is the part where most people give up on a new language: this part is done in MATLAB and looks like it would take more than the next 15 minutes to port… could I just call MATLAB? If you've ever dealt with MEX or similar things that "glue together languages", you know they are a mess. But I found the MATLAB.jl package, so I decided to give it a try. The directions are dead simple: just stick the folder from the GitHub repository into the spot they tell you and you're done. So I fire it up and… it didn't work. But recall that in MATLAB numbers always promote to the highest state, i.e.

int8(5)+4.4

returns 9. We will return to this later when talking about efficiency, but the problem I was having boiled down to this. All I had to do was pass in the same array while explicitly telling it the numbers were floats by putting a . after each one (i.e. 1 vs 1.), and all of the MATLAB codes worked. I was able to mix and match, generating meshes in Julia/MATLAB and plotting them in Julia/MATLAB. An example looks like this. Here I use the iFEM package functions and some of the Julia translations I described before, calling them from both Julia and MATLAB.

# Testing Square Mesh Generation
node,elem = squaremesh([0 1 0 1],.1)
showmesh(node,elem);
 
# Testing MATLAB call
node2,elem2 = mxcall(:squaremesh,2,[0. 1. 0. 1.],.1) #Array has periods to turn to Float
showmesh(node2,elem2)
 
# Testing MATLAB showmesh
mxcall(:showmesh,0,node,elem) #suppress the return of the function handle, which cannot be sent back to Julia via STDIO
mxcall(:showmesh,0,node2,elem2)
 
# Testing Quadrilateral showmesh
C = .5*zeros(length(node[:,1]),length(node[:,2]))
D = zeros(length(node[:,1]))
p = pcolormesh(node[:,1],node[:,2],C,edgecolor=D,cmap="ocean")
 
# Doing circle and refine via MATLAB, pyplot plotting
node,elem = mxcall(:circlemesh,2,0.,0.,1.,0.2)
showmesh(node,elem)
node,elem = mxcall(:uniformrefine,2,node,elem)
showmesh(node,elem)
 
## 3D Mesh Generation and Plotting
node3,elem3 = mxcall(:cubemesh,2,[0. 1. 0. 1. 0. 1.],.25)
showmesh(node3,elem3)

You notice that to do MATLAB calls you simply use mxcall, telling it the function name (with a colon in front, which in Julia makes it a Symbol), the number of outputs, and the inputs. Be careful about integer vs. floating point problems, and everything works.

Conclusion: Moving to Julia is dead simple, and you don’t have to move too fast

This really appealed to me in two ways. First of all, Julia's syntax was very natural to pick up, and within minutes I was fine. In fact, the only errors that were hard to diagnose were the ones due to MATLAB! That leads me to the second point: Julia's language interfacing is so good that I didn't even have to stray too far. When I wanted to use old MATLAB code, it took 2 minutes to get it set up and I was calling it from Julia. The same seems to be true for calling Python and calling R as well. Not only that, but using C and Fortran code is built right into the core of Julia with the same flavor of syntax as mxcall (instead you use ccall), meaning it's easier to use than MATLAB's MEX interface. Julia melds together different languages so nicely that you can act like you have the packages of all of them! All of this ease of use comes down to Julia's type system, which I will start looking into in my next post.
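
As a taste of how low that barrier is, here is the classic minimal ccall example from the Julia manual, calling C's clock function from libc:

t = ccall((:clock, "libc"), Int32, ())  # (function, library), return type, argument tuple

No MEX file, no compilation step, no glue code.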

So was it worth it? As of right now I can’t say that, but I can say it didn’t cost me more than 30 minutes to get up and running in Julia. But will it be better? We will look at Julia’s type system and performance next time.

The post Julia iFEM1: Porting Mesh Generation appeared first on Stochastic Lifestyle.