Re-posted from: https://blog.glcs.io/profiling_allocations
This post was written by Steven Whitaker.
In this Julia for Devs post, we will discuss using Julia's Profile standard library for performance and allocation profiling. We will illustrate these tools with example code and then show how to improve the code.
This post will showcase the techniques we used to significantly accelerate our client's simulations, resulting in a remarkable 25% reduction in the run time of code that was already highly optimized for performance. The impact of these techniques on our client's work is a testament to their effectiveness and should inspire you in your own projects.
Many new Julia developers face poor code performance, but these techniques can net 10x or even 100x faster run times!
While we won't be sharing client-specific code, we will provide similar examples that are practical and applicable to a wide range of simulations. This approach should give you the confidence to apply these techniques in your own work.
Here are some of the key ideas we will focus on:
- Pinpoint areas of improvement with `Profile.@profile`.
- Track down allocations with `Profile.Allocs.@profile`.
- Improve run time and reduce allocations with `@generated` functions.
Let’s dive in!
Profiling in Julia
Profiling is a crucial tool for locating performance bottlenecks. The knowledge it provides is invaluable, as it guides development efforts to the parts of the code that will have the most significant impact on run time. Understanding this process will keep you informed and in control of your code's performance.
Here is some example code we will use to illustrate profiling in Julia:
```julia
using StaticArrays: SVector

function kernel_original(x::SVector{N}) where N
    y = zeros(N + 1)
    y[1] = x[1]
    for i = 2:N
        y[i] = 0.5 * y[i-1] + x[i]
    end
    y[end] = sum(@view(y[1:end-1]))
    return SVector{N+1}(y)
end

function workflow_original()
    total = 0.0
    for i = 1:100
        x = SVector{10, Float64}(rand(10))
        y = kernel_original(x)
        total += y[end]
    end
    return total
end
```
The main idea with this example is that there is a workflow, `workflow_original`, that calls out to a core computation for many different input values.
To profile this code, we will use the Profile standard library. Since Profile implements a statistical profiler, we need to ensure the code we profile runs long enough to reduce the impact of noise in the measurements (see the documentation for more info). So, let's see how long `workflow_original` takes (using BenchmarkTools.jl):
```julia
julia> using BenchmarkTools

julia> @btime workflow_original();
  6.410 μs (300 allocations: 25.00 KiB)
```
6 microseconds is very fast compared to the default profiling sample delay of 1 millisecond. Therefore, we must run the workflow many times to get enough samples. We have found that 1000–5000 samples are usually plenty for identifying performance bottlenecks, though slower code will likely need more samples and/or a larger sample delay.
So, to get approximately 1000 samples, we will need to run the workflow \( 1000 \cdot \frac{1\ \mathrm{ms}}{0.006\ \mathrm{ms}} \approx 166{,}667 \) times. We'll round up to 200,000 for good measure.
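The arithmetic above is easy to sketch in the REPL. The snippet below computes the run count for an assumed 6.41 μs per-call run time, and also shows how the sample delay itself can be changed via `Profile.init` for slower workloads (the `delay` value chosen here is just a hypothetical example):

```julia
using Profile

# Runs needed for ~1000 samples when one run takes ~6.41 μs
# and the sampler fires every 1 ms (the default delay):
runtime = 6.41e-6                # seconds per call to workflow_original
delay = 0.001                    # default sample delay in seconds
target_samples = 1000
runs = ceil(Int, target_samples * delay / runtime)   # ≈ 156,000 runs

# For slower workloads, increasing the delay is often easier
# than running the workflow fewer times:
Profile.init(delay = 0.01)       # hypothetical choice: sample every 10 ms
Profile.init(delay = 0.001)      # restore the default
```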
Let’s see how to profile the code:
```julia
julia> using Profile

julia> Profile.clear()

julia> workflow_original();

julia> Profile.@profile for _ in 1:200_000 workflow_original() end
```
A couple of notes:
- The `Profile.clear()` step is not strictly necessary. However, a habit of always clearing the profile data ensures one doesn't inadvertently spoil new profiles with old data from previous profiling runs.
- We called `workflow_original` once before profiling to avoid profiling JIT compilation.
Now we want to display the profile data. One way is via `Profile.print`, which prints a textual representation of the profile to the console, but this isn't the most efficient method for inspecting the data. Typically, profile data is visualized using a flamegraph.
The Profiling docs list several packages for visualizing profiles. We will use ProfileCanvas.jl, which creates an HTML file we can view in a web browser and interactively inspect the profiling results:
```julia
julia> using ProfileCanvas

julia> ProfileCanvas.view()
```
This code creates and displays a flamegraph:
*(Flamegraph of the `workflow_original` profile.)*
One thing that stands out in this flamegraph is the three yellow rectangles. These indicate occurrences of garbage collection (GC), which implies the code allocated memory. Since these yellow blocks represent a decent portion of the run time (as indicated by the width of the rectangles), let's investigate.
Allocation Profiling
We will use Julia's allocation profiling tools to further inspect the allocations we know the workflow has. The process is similar to performance profiling, with the following differences:
- We will use `Profile.Allocs.@profile` and pass the option `sample_rate = 1.0` to record all allocations. In a larger workflow with many more allocations, a smaller `sample_rate` is advised (the default value is `0.1`).
- We will run the workflow just one time.
- We will visualize the results with `ProfileCanvas.view_allocs`.
Here’s how to profile allocations:
```julia
julia> using Profile, ProfileCanvas

julia> Profile.Allocs.clear()

julia> workflow_original();

julia> Profile.Allocs.@profile sample_rate = 1.0 workflow_original();

julia> ProfileCanvas.view_allocs()
```
*(Allocation flamegraph of `workflow_original`.)*
In this flamegraph, the width of each rectangle corresponds to how many allocations were made. In this example, each yellow block is the same width, meaning they all contributed the same number of allocations.
Now we need to determine whether we can do anything about these allocations.
Moving up from the bottom of the flamegraph lets us trace the call stack to see where these allocations came from. For example, we can see that `GenericMemory` was called from `Array`, which itself was called from `Array`, and so on. Eventually, we get to `workflow_original`.
If we hover our mouse cursor over that block, it will display the file and line number:
*(Flamegraph tooltip showing the file and line number.)*
(In this case, I defined `workflow_original` in the first REPL prompt of a fresh Julia session, so that's why `REPL[1]` shows up for the file name.)
Looking at these profile results shows us that the offending lines are:
- `x = SVector{10, Float64}(rand(10))` (in `workflow_original`)
- `y = zeros(N + 1)` (in `kernel_original`)
This makes sense; in each of these lines, we explicitly create an array (via `rand` and `zeros`), which allocates memory.
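We can double-check this claim with the `@allocated` macro from Base, which reports the bytes allocated by an expression. This is a quick sketch; the measurements are wrapped in functions so global-scope overhead doesn't inflate them, and the functions are called once first to exclude compilation:

```julia
# Wrap the expressions in functions so global-scope effects
# don't inflate the measurement.
alloc_zeros() = @allocated zeros(11)
alloc_rand() = @allocated rand(10)

# Warm up: the first call may include compilation allocations.
alloc_zeros(); alloc_rand()

alloc_zeros()   # > 0 bytes: `zeros` heap-allocates an Array
alloc_rand()    # > 0 bytes: so does `rand(10)`
```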
But it turns out we can eliminate these allocations!
Eliminating the first allocation requires knowledge of the StaticArrays.jl package. In particular, the `@SVector` macro can create an `SVector` directly from standard array expressions (like `rand`) without allocating memory. So, instead of using the `SVector` constructor, we can write:
```julia
x = @SVector rand(10)
```
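As a sanity check (a sketch using `@allocated` again, with warm-up calls to exclude compilation), the macro version should report zero bytes allocated, while the constructor version pays for the temporary `Array` created by `rand(10)`:

```julia
using StaticArrays: SVector, @SVector

make_old() = @allocated SVector{10, Float64}(rand(10))
make_new() = @allocated @SVector rand(10)

make_old(); make_new()   # warm up to exclude compilation

make_old()   # > 0 bytes: the temporary Array from rand(10) is heap-allocated
make_new()   # 0 bytes: @SVector unrolls the rand calls, so no Array is created
```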
The second allocation, however, is a bit trickier to remove.
Reducing Allocations with @generated Functions
What makes it difficult to remove the allocation in `kernel_original` is that the elements of the vector `y` depend on previous elements of the vector. That means we have to compute one element before we can compute the next, which means we have to store that element somehow. If we knew exactly how long `y` would be, we could store the computations in local (scalar) variables. However, the function needs to work with any input size; we don't want to write a separate method for each possible input size.
At least, not by hand.
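For concreteness, here is what one such hand-written specialization might look like for inputs of length 3 (a hypothetical sketch; `kernel_length3` is not part of the original code). Every element lives in a scalar local variable, so nothing is heap-allocated:

```julia
using StaticArrays: SVector

# Hypothetical hand-specialized kernel for inputs of length 3.
function kernel_length3(x::SVector{3})
    y1 = x[1]
    y2 = 0.5 * y1 + x[2]
    y3 = 0.5 * y2 + x[3]
    y4 = y1 + y2 + y3           # the running sum stored in y[end]
    return SVector{4, Float64}(y1, y2, y3, y4)
end
```

Writing one of these per input size would work, but it clearly doesn't scale by hand.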
It turns out Julia can automate the creation of these specialized methods. The trick is to use the `@generated` macro.
A method annotated with `@generated` uses type information to produce a specialized implementation of the method depending on the input types. And since type information is available at compile time, this specialization occurs at compile time, leading to run time improvements.
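As a minimal illustration of the idea (a toy example, not from the post's code), here is a `@generated` function that unrolls a sum over a tuple. The `for` loop runs when the method is compiled for a given `N`; only the unrolled additions survive to run time:

```julia
# Toy example: a compile-time unrolled sum over an NTuple.
# Assumes N >= 1 (the empty-tuple case is not handled).
@generated function unrolled_sum(t::NTuple{N, Float64}) where N
    # This loop executes at compile time, building the
    # expression t[1] + t[2] + ... + t[N].
    ex = :(t[1])
    for i = 2:N
        ex = :($ex + t[$i])
    end
    return ex
end
```

Calling `unrolled_sum((1.0, 2.0, 3.0))` compiles a method whose body is literally `t[1] + t[2] + t[3]`.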
Using a `@generated` function to replace `kernel_original` will allow us to, essentially, move the run time allocation to compile time.
Here’s what the new function looks like:
```julia
@generated function kernel_generated(x::SVector{N}) where N
    assignments = [:(y1 = x[1])]
    for i = 2:N
        yprev = Symbol(:y, i - 1)
        yi = Symbol(:y, i)
        push!(assignments, :($yi = 0.5 * $yprev + x[$i]))
    end
    yend = Symbol(:y, N + 1)
    sum_expr = reduce((a, b) -> :($a + $b), (Symbol(:y, i) for i = 1:N))
    push!(assignments, :($yend = $sum_expr))
    vars = ntuple(i -> Symbol(:y, i), N + 1)
    return quote
        $(assignments...)
        return SVector{$(N + 1), Float64}($(vars...))
    end
end
```
The first thing you might notice, especially if you are unfamiliar with metaprogramming, is that this function looks quite different from `kernel_original`. So, let's unpack this a bit:
- A `@generated` function needs to return an expression. The compiler will then compile the code resulting from the expression. Finally, at run time, the compiled code will be used, not the code used to generate the compiled expression. In other words, this function must return the specialized code itself, not the result of a run time computation.
- The returned expression is created with a `quote` block. This `quote` block interpolates (using `$`) expressions built up earlier in the function. The computations that build up these expressions occur only at compile time.
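If the `quote`/`$` machinery is new to you, this small sketch shows how interpolation splices precomputed expressions into a returned block, just like `kernel_generated` does with its `assignments` vector:

```julia
# Build two assignment expressions ahead of time...
assignments = [:(a = 1), :(b = a + 1)]

# ...then splice them into a quote block with $(...).
ex = quote
    $(assignments...)
    a + b
end

eval(ex)   # evaluates the block at global scope; returns 3
```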
If we want to inspect what the generated function looks like, we need to refactor the code just a bit. Essentially, we'll create a regular Julia function that returns an expression, and the `@generated` function will just call that function:
```julia
function _gen_kernel_generated(::Type{<:SVector{N}}) where N
    # Same code as `kernel_generated` above.
end

@generated kernel_generated(x::SVector) = _gen_kernel_generated(x)
```
Then we can call `_gen_kernel_generated` directly to see what code actually runs:
```julia
julia> _gen_kernel_generated(typeof(@SVector rand(1)))
quote
    #= REPL[2]:18 =#
    y1 = x[1]
    y2 = y1
    #= REPL[2]:19 =#
    return SVector{2, Float64}(y1, y2)
end

julia> _gen_kernel_generated(typeof(@SVector rand(2)))
quote
    #= REPL[2]:18 =#
    y1 = x[1]
    y2 = 0.5y1 + x[2]
    y3 = y1 + y2
    #= REPL[2]:19 =#
    return SVector{3, Float64}(y1, y2, y3)
end

julia> _gen_kernel_generated(typeof(@SVector rand(3)))
quote
    #= REPL[2]:18 =#
    y1 = x[1]
    y2 = 0.5y1 + x[2]
    y3 = 0.5y2 + x[3]
    y4 = (y1 + y2) + y3
    #= REPL[2]:19 =#
    return SVector{4, Float64}(y1, y2, y3, y4)
end
```
Yep, the implementation looks right!
We can also see that the generated code avoids creating an array to store intermediate results, instead storing computations in local variables. But we didn't have to write any of those methods ourselves! Using `@generated` allows us to maintain just one function that generates all of these specialized implementations.
Let’s benchmark the new implementation:
```julia
julia> @btime workflow_generated();
  1.093 μs (0 allocations: 0 bytes)
```
Nice, six times faster and no allocations!
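Speed is only half the story; a rewrite like this should also preserve the results. Here is a sketch of a quick equivalence check (it duplicates the two kernels from above so it runs standalone; `≈` is used because the two summation orders can differ in the last few bits):

```julia
using StaticArrays: SVector, @SVector

function kernel_original(x::SVector{N}) where N
    y = zeros(N + 1)
    y[1] = x[1]
    for i = 2:N
        y[i] = 0.5 * y[i-1] + x[i]
    end
    y[end] = sum(@view(y[1:end-1]))
    return SVector{N+1}(y)
end

@generated function kernel_generated(x::SVector{N}) where N
    assignments = [:(y1 = x[1])]
    for i = 2:N
        yprev = Symbol(:y, i - 1)
        yi = Symbol(:y, i)
        push!(assignments, :($yi = 0.5 * $yprev + x[$i]))
    end
    yend = Symbol(:y, N + 1)
    sum_expr = reduce((a, b) -> :($a + $b), (Symbol(:y, i) for i = 1:N))
    push!(assignments, :($yend = $sum_expr))
    vars = ntuple(i -> Symbol(:y, i), N + 1)
    return quote
        $(assignments...)
        return SVector{$(N + 1), Float64}($(vars...))
    end
end

# The two implementations should agree to floating-point roundoff.
x = @SVector rand(10)
kernel_original(x) ≈ kernel_generated(x)
```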
And here’s the profile:
*(Flamegraph of the `workflow_generated` profile.)*
Here is the code used for profiling:

```julia
using StaticArrays: @SVector, SVector

@generated function kernel_generated(x::SVector{N}) where N
    assignments = [:(y1 = x[1])]
    for i = 2:N
        yprev = Symbol(:y, i - 1)
        yi = Symbol(:y, i)
        push!(assignments, :($yi = 0.5 * $yprev + x[$i]))
    end
    yend = Symbol(:y, N + 1)
    sum_expr = reduce((a, b) -> :($a + $b), (Symbol(:y, i) for i = 1:N))
    push!(assignments, :($yend = $sum_expr))
    vars = ntuple(i -> Symbol(:y, i), N + 1)
    return quote
        $(assignments...)
        return SVector{$(N + 1), Float64}($(vars...))
    end
end

function workflow_generated()
    total = 0.0
    for i = 1:100
        x = @SVector rand(10)
        y = kernel_generated(x)
        total += y[end]
    end
    return total
end

using BenchmarkTools
@btime workflow_generated();

using Profile, ProfileCanvas
Profile.clear()
Profile.@profile for _ in 1:200_000 workflow_generated() end
ProfileCanvas.view()
```

Note, there's no need for allocation profiling because there are no allocations. There aren't obvious places for improvement, so I'd say we did a pretty good job optimizing the code.
Summary
In this post, we saw how to use the Profile standard library to profile the run time and the allocations of a piece of code. We also illustrated how a `@generated` function can eliminate run time allocations and speed up the code. These were key ideas we used to help one of our clients speed up their simulations.
Do you need help pinpointing performance bottlenecks or tracking down allocations in your code? Contact us, and we can help you out!
Additional Links
- Profiling in Julia - Julia manual section on profiling.
- Metaprogramming in Julia - Julia manual section on metaprogramming, including `@generated` functions.
- GLCS Modeling & Simulation - Connect with us for Julia Modeling & Simulation.