Author Archives: Kristoffer Carlsson on Kristoffer Carlsson

Case study: Improving performance of a code written in Matlab style

By: Kristoffer Carlsson on Kristoffer Carlsson

Re-posted from: http://kristofferc.github.io/post/vectorization_performance_study/

Introduction

A few weeks ago, someone asked on the Julia Discourse forum for assistance how to make their code a bit faster.

The original code posted was (with a few minor modifications)

function myImg(pts::Integer)
    # rotation of ellipse
    aEll = 35.0/180.0*pi
    axes = [1/6,1/25]
    values = collect(linspace(-0.5,0.5,pts))

    # meshes
    gridX = [i for i in values, j in values]
    gridY = [j for i in values, j in values]

    # generate ellipse
    # rotate by alpha
    Xr = cos(aEll).*gridX - sin(aEll).*gridY
    Yr = cos(aEll).*gridY + sin(aEll).*gridX
    img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr);
    return mod(img-pi,2*pi)+pi
end

I did not spend too much time trying to figure out the purpose of the code but it looks like it is creating some sort of rotated ellipse.
Plotting a heatmap of the resulting matrix confirms this:

julia> using Plots

julia> img = myImg(1024)

julia> heatmap(img)

Analysis

To get an estimate of the time and memory allocation it takes to run the function we can use the @time macro.1
As to not measure compilation overhead, the function is timed twice (here using Julia v0.5).

julia> @time myImg(1024);
  0.117500 seconds (845 allocations: 144.179 MB, 50.45% gc time)

The first thing to do when trying to improve performance of Julia code is to check if there are any type instabilities (see here and here for references) in performance sensitive parts of the code.
Julia provides a macro called @code_warntype which gives colored output where type instabilities are shown in red.
Running @code_warntype myImg(1024) shows, however, that this function is perfectly fine from a type stability point of view.

The @time output, shows a significant amount of time (~50%) is spent in the garbage collector (GC).
This indicates that large amounts of memory is being allocated and released and thus needs to be garbage collected.
The initial goal should thus first be to reduce the amount of memory allocations.2

The most significant memory allocations start with the “mesh grid” type of variables gridX and gridY which are created as

gridX = [i for i in values, j in values]
gridY = [j for i in values, j in values]

Often, creating a mesh grid like this is quite wasteful in terms of memory because we are here storing pts amount of data in 2*pts^2 amount of memory.
Later, these matrices are used to do operations in an “vectorized” fashion:

Xr = cos(aEll).*gridX - sin(aEll).*gridY
Yr = cos(aEll).*gridY + sin(aEll).*gridX
img = ((1/axes[1]*Xr.^2 + 1/axes[2]*Yr.^2).<=1).*( 10*pi*Yr);
return mod(img-pi,2*pi)+pi

This is where the problem with memory allocations sits. For example the command cos(aEll).*gridX has to allocate a brand new matrix to store the result in.
And then again allocate a new matrix for the sin part. And then a new matrix for the subtraction between the two matrices etc.
It is evident that there are a lot of new matrices being created here.3
Code like this is quite common from users coming from a MATLAB programming background, since for loops have traditionally been quite slow there.
In addition to the overhead of allocating memory, this also has the effect that the computer is working with “cold” memory (memory not in cache) a lot.
For good performance, it is important to try to do as much operations as possible on the data while it is in cache before loading new data.

The remedy to the memory and cache problem is to attack the problem along a different dimension.
Instead of building up the full result array by doing small operations array by array, it is built up element by element.

My proposed rewrite of the function is:

function myImg2(pts::Integer)
  # rotation of ellipse
  aEll = 35.0/180.0*pi
  axes_inv = [6.0, 25.0]
  values = collect(linspace(-0.5,0.5,pts))

  img = zeros(Float64, pts, pts)
  cosa = cos(aEll)
  sina = sin(aEll)
  @inbounds @fastmath for j in eachindex(values), i in eachindex(values)
      Xr = cosa*values[i] - sina*values[j]
      Yr = cosa*values[j] + sina*values[i]
      v = (axes_inv[1]*Xr^2 + axes_inv[2]*Yr^2)
      k = v <= 1 ? 10*pi*Yr : 0.0
      img[i,j] = mod(k-pi,2*pi)+pi
    end
  return img
end

The “meshgrid” variables gridX and gridY are gone and instead, a nested loop completely computes the result for each element and stores it in img[i,j].
Timing the new function results in

julia> @time myImg2(1024);
  0.011171 seconds (9 allocations: 8.008 MB)

which is a speed up of about 10x and a reduction of memory use by almost 20x without (I would say) making the code much more complicated to read and understand.

Conclusions

Rewriting code that is written in a “vectorized” form can sometimes be beneficial if you see that the code is allocating a lot of memory and the time spent garbage collecting is significant.

Edits:

Removed a section that suggested taking away collect from the values variable. Since we are indexing directly into values it turns out that using an iterator is slightly slower (10%) than using a Vector. Thanks to Simon Danisch


  1. Instead of using the @time macro it is often better to use a dedicated benchmark framework like BenchmarkTools.jl. However, the run time of the function is here quite large so the @time macro is ok to use.
    ^
  2. Typically, it is always good to get in the habit of profiling code before trying to optimize it. “Measuring gives you a leg up on experts who don’t need to measure” – Walter Bright. Julia has a macro @profile that together with the package ProfileView.jl gives a flame graph overview of where time is spent. However, when there are glaring performance bottle necks, I typically fix those first before profiling.
    ^
  3. Julia 0.6 comes with a cool feature where chained calls to dotted operators (like .+) are fused. As an example, in 0.6, the command cos(aEll).*gridX .- sin(aEll).*gridY would only allocate one array instead of three, as in 0.5.
    ^

Deep Learning

By: Kristoffer Carlsson on Kristoffer Carlsson

Re-posted from: http://kristofferc.github.io/projects/deep-learning/

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.

Nullam vel molestie justo. Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.

Cras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.

Suspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.

Aliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.