Category Archives: Julia

diff all the things! Part 1

By: Domenic Di Francesco

Re-posted from: https://allyourbayes.com/posts/gradients_pt1/


TLDR

Gradients guide us through daunting and unwieldy, high-dimensional models to draw samples from posterior distributions, take steps towards loss minimising parameter values, identify model vulnerabilities using adversarial methods, and more. One (of many) fun features about the Julia programming language is its unique approach to autodiff.

In Part 1, I provide an intro to autodiff workflow in Julia and the emergence of the Enzyme library.I personally found some documentation a little difficult to follow, so this is intended to be a practical guide, with a couple of example use cases.


autodiff (AD)

So much of computational statistics and machine learning is built on automatic (algorithmic) differentiation – AD, “autodiff”, or “autograd”. Dig into literature on Bayesian inference, or deep learning, and you will find parameters being nudged in a direction informed by a gradient. I’m sometimes surprised at the extent of the ‘gradient-based’ monopoly in scientific computing, but I don’t mean to trivialise! …getting gradients of complex functions, very quickly and without error, is a powerful tool and its great that we are able to leverage this.

AD works by “the relentless application of the chain rule”, as I vaguely recall one of the Stan developers saying. Large functions are differentiated piece by piece (for which look-up rules can be applied), and the results are stitched together.

inter-operability and Julia ♥️??

Perhaps my favourite feature of Julia is its inter-operability. Look at a GitHub repo for a Julia package and you will generally find the following:

Julia is so performant, that its libraries for scientific computing will work with normal variables – without needing to package up their own types. So solvers from DifferentialEquations, or neural networks defined in Flux can be immediately combined with the Julia probabilistic programming language, Turing.

If a new framework emerges in Python, an entirely new ecosystem may need to be developed to prop it up. This may involve duplicating existing but now incompatible functionality – think of JAX needing to implement its own NumPy module. …whereas if you write a new Julia library, it could offer a vast range of applications as it is combined with other packages, which feels like potential for a multiplicative, rather than additive, impact.

Here I am, impressed that the interoperability of the Copulas package:


AD in Julia

ML frameworks in Python require you to work in their own syntax, with their specific types, and use their own in-built AD methods. Locking into a framework is not ideal, as you are limited to the methods they support and you need to juggle types of inputs and outputs.

Conversely, Julia works the other way around. You write your code, using whichever libraries, functions and types you want, and then you choose an AD library to get you the gradients you need.

the joy of gradients – image courtesy of Google Gemini

Why does this feel so powerful? its the promise of gradient-based methods for your scientific problem, rather than a walled-off machine learning model. You can point an AD engine at the aspect of your analysis that you are interested in for more bespoke interrogation or optimisation.

I recently completed a project, which had an element of adversarial AI (counterfactual analysis and saliency maps), which I wrote in JAX, mainly as an excuse to learn JAX. I used their NNX module for neural networks, and found this to be unexpectedly restrictive. I wasn’t able to run analysis that required gradients of outputs w.r.t. inputs, as I hit NNX errors/limitations that I wasn’t able to resolve. I ended up re-writing everything in Julia.

Once upon a time there was Zygote: an AD library that powered ML in Julia. A great achievement, but limitations began to emerge. Because of how it operates on your code, it can struggle with certain features. This is similar to the so-called “sharp bits” of JAX …though those limitations emerge for very different reasons.

Zygote is also sometimes described as “too permissive”. This is because it intercepts your code at some intermediate representation level (see note below), performs AD and then reassembles. Julia’s IR levels were designed for performance, not AD and so doesn’t reliably detect and report errors in Zygote’s process.

It is possible for AD solutions to work with the compiler, rather than against it — more on this in Part 2! But for now…

The Julia compiler doesn’t go straight from your source text to machine code. It passes through several intermediate representations (progressively lower-level versions that are easier for the compiler to analyse and optimise).

What’s cool is that we can actually view these with the below macros:

# some nice high-level code
nice_function(x) = 2x + 1
nice_function (generic function with 1 method)

# the "lowered" IR
@code_lowered nice_function(1.0)
CodeInfo(
1 ─ %1 = Main.:+
│   %2 = Main.:*
│   %3 = (%2)(2, x)
│   %4 = (%1)(%3, 1)
└──      return %4
)

# the "typed" IR
@code_typed nice_function(1.0)
CodeInfo(
1 ─ %1 = Base.mul_float(2.0, x)::Float64
│   %2 = Base.add_float(%1, 1.0)::Float64
└──      return %2
) => Float64

# the LLVM level
@code_llvm nice_function(1.0)
; Function Signature: nice_function(Float64)
;  @ none:3 within `nice_function`
define double @julia_nice_function_7614(double %"x::Float64") #0 {
top:
;  @ none:5 within `nice_function`
; ┌ @ promotion.jl:430 within `*` @ float.jl:493
   %0 = fmul double %"x::Float64", 2.000000e+00
; └
; ┌ @ promotion.jl:429 within `+` @ float.jl:491
   %1 = fadd double %0, 1.000000e+00
   ret double %1
; └
}

# finally, actual machine code
@code_native nice_function(1.0)
    .section    __TEXT,__text,regular,pure_instructions
    .build_version macos, 15, 0
    .globl  _julia_nice_function_7864       ; -- Begin function julia_nice_function_7864
    .p2align    2
_julia_nice_function_7864:              ; @julia_nice_function_7864
; Function Signature: nice_function(Float64)
; ┌ @ none:3 within `nice_function`
; %bb.0:                                ; %top
; │ @ none within `nice_function`
    ;DEBUG_VALUE: nice_function:x <- $d0
    ;DEBUG_VALUE: nice_function:x <- $d0
; │ @ none:5 within `nice_function`
; │┌ @ promotion.jl:430 within `*` @ float.jl:493
    fadd    d0, d0, d0
    fmov    d1, #1.00000000
; │└
; │┌ @ promotion.jl:429 within `+` @ float.jl:491
    fadd    d0, d0, d1
    ret
; └└
                                        ; -- End function
    .section    __DATA,__const
    .p2align    3, 0x0                          ; @"+Core.Float64#7866"
"l_+Core.Float64#7866":
    .quad   "l_+Core.Float64#7866.jit"

.set "l_+Core.Float64#7866.jit", 6235317488
.subsections_via_symbols

And then came Enzyme: an AD library that works at the LLVM level (your code is compiled first, and then the gradients are computed). This led to improvements in both performance and flexibility – we can now differentiate through mutation and control flow. This more resilient library has a steeper learning curve (imo), but (also imo) requiring more explicit instructions ends up making things clearer.

I hope the below examples will help get you started.

example 1: a Bayesian linear regression

How about computing the gradients we need for Hamiltonian Monte Carlo sampling. For a linear regression model, with inputs, and outputs, :

with priors:

Let’s put these in a tuple:

using Distributions

priors = (
    α_prior = Normal(0, 3), 
    β_prior = Normal(0, 3), 
    σ_prior = Exponential(1)
)
(α_prior = Normal{Float64}(μ=0.0, σ=3.0), β_prior = Normal{Float64}(μ=0.0, σ=3.0), σ_prior = Exponential{Float64}(θ=1.0))

For numerical stability reasons MCMC typically works in negative log space, so the below function finds the unnormalised negative log posterior for our model. This could of course be sped-up, but I’ve tried to keep it friendly ?

using Random

function neg_log_posterior(params::NamedTuple, priors::NamedTuple,
                          x::Vector, y::Vector)

    α = params.α; β = params.β; σ = params.σ
    
    # mean of likelihood
    μ_pred = α .+ x * β
    
    # increment the negative log likelihood for all observations
    neg_log_lik = 0.0
    for i in 1:length(y)
        neg_log_lik += -logpdf(Normal(μ_pred[i], σ), y[i])
    end
    
    # ...and for the priors
    neg_log_prior_α = -logpdf(priors.α_prior, α)
    neg_log_prior_β = -logpdf(priors.β_prior, β)
    neg_log_prior_σ = -logpdf(priors.σ_prior, σ)
    
    # summing in log space is equivalent to multiplying priors and likelihoods ?
    return neg_log_lik + neg_log_prior_α + neg_log_prior_β + neg_log_prior_σ
end
neg_log_posterior (generic function with 1 method)

To use this function, we need to define some inputs. Here, I’m just simulating some data, using “true” parameter values:

# 20 data points, why not...
n_samples = 20

# define a PRNG for reproducibility
prng = MersenneTwister(231123)

# inputs from a standard Gaussian
x = randn(prng, n_samples)

# outputs by sending inputs through a "true" model
α_true = 1/2; β_true = -1/2; σ_true = 2

y = α_true .+ x * β_true .+ σ_true * randn(prng, n_samples)

We can use Enzyme to get the gradients of the negative log posterior w.r.t. the model parameters – as required by Hamiltonian Monte Carlo.

# where do i want gradients?
params_init = (
    α = rand(prng, priors.α_prior),
    β = rand(prng, priors.β_prior),
    σ = rand(prng, priors.σ_prior)
)
(α = -6.290988779313054, β = 3.9183834605694834, σ = 1.6155587713748865)

Imagine a function with inputs and outputs.

Forward mode answers: “if I nudge one input, how do all outputs change?” — so you need passes to cover every input.

Reverse mode answers: “for one output, how did all inputs contribute?” — so you need passes to cover every output.

Often, ML and Bayesian inference problems have many parameters (large ) but a single scalar output i.e. a loss, or a log probability density (small ). Reverse mode gets us gradients w.r.t. all parameters in one backward pass.

That said, the threshold isn’t always obvious and I’ve had cases where switching modes gave a noticeable speedup, so it’s worth experimenting!

Julia has dedicated packages for each: ForwardDiff.jl and ReverseDiff.jl as part of its AD ecosystem. These are solid and well-established, but aren’t the focus of this post.

I am giving the gradient() function three arguments:

  • the mode/direction to apply AD, Reverse. Each pass of a reverse-mode AD computes gradients of all inputs w.r.t. a single output (as a vector-Jacobian product). In Forward mode, each pass computes gradients of a single input w.r.t. all outputs (as a Jacobian-vector product).

    See above callout note for more on this and why reverse mode is likely not an optimal choice for so few parameters.

    Consequently, there are efficiency trade-offs associated with this selection, depending on the number of inputs and outputs of…

  • …the function we are differentiating, params -> neg_log_posterior(params, priors, x, y). Here, an anonymous function that takes params as input and returns the negative log posterior, using neg_log_posterior(), which we defined above.

  • the point at which we want gradients, params_init. This is the current location of the Markov chain. In the first instance we need an initial guess, for which we have drawn from the priors.

using Enzyme
# computing the gradients
∇θ = Enzyme.gradient(
    Reverse, 
    params -> neg_log_posterior(params, priors, x, y), 
    params_init
)
((α = -54.13440425830732, β = 25.990919631591435, σ = -323.2246379556173),)

We can then use these gradients to update the momentum of our Hamiltonian ‘particles’, generating proposals guided by the geometry of the posterior distribution. Unlike random walks or Gibbs samplers, this generation of samplers remain efficient in high dimensions ?

example 2: an MLP (simple neural network)

I defined a simple, densely connected neural network without anything clever (no layer normalisations, recurrent connections or attention mechanisms), sometimes referred to as a multi-layer perceptron (MLP).

I’ll spare you this set-up code here as we are focussing on autodiff, but you can find the full code on GitHub.

Instead, let’s look at my training function. Notice that I am now using a different function, Enzyme.autodiff() for backpropagation. It has more arguments:

  • the mode/direction to apply AD, set_runtime_activity(Reverse). Similar to gradient(), but here we are specifying that we want to use reverse-mode AD, with runtime activity analysis. As a rule of ?, I start with regular Reverse mode AD. If I get compilation errors about, for instance, type inference or broadcasting, then I add set_runtime_activity().
  • the function we are differentiating, (net, funs, inputs, targets) -> find_loss(net, funs, inputs, targets). Here, an anonymous function that takes the neural network, its functions, inputs and targets as arguments, and returns the loss.
  • the activity of the function Active. we need to make the output to the loss function active, because it is the starting point of the chain rule.
  • the activity of each argument, Active, Const(), or Duplicated(). This is where things get more explicit. We need to tell Enzyme which arguments we want gradients for (Active), and which we don’t (Const) – the derivative of a constant is zero. Finally, we also want gradients for Duplicated variables, but they could be large. So we create a shadow copy of the neural network, nn_shadow, which we use to accumulate gradients in-place (without allocating new memory each time!)
function train(nn::neural_network, nn_funs::neural_network_funs, 
               a::Array{Float64}, y::Array{Float64};
               a_test::Array{Float64} = a, y_test::Array{Float64} = y, 
               n_epochs::Int = 10, η::Float64 = 0.01)
    @assert n_epochs > 0 "n_epochs must be greater than 0"

    training_df = DataFrame(epoch = Int[], loss = Float64[], test_loss = Float64[])

    for i in 1:n_epochs

        # initiate our memory-saving shadow
        ∇nn = Enzyme.make_zero(nn)

        # find ∂ℒ/∂θ
        Enzyme.autodiff(
            set_runtime_activity(Reverse),
            (net, funs, inputs, targets) -> find_loss(net, funs, inputs, targets)[1],
            Active,
            Duplicated(nn, ∇nn),
            Const(nn_funs),
            Const(a),
            Const(y)
        )

        # nudge all weights and biases towards a lower loss, using learning rate, η
        for j = 1:length(nn.Ws)
            nn.Ws[j] -= η * ∇nn.Ws[j]
            nn.bs[j] -= η * ∇nn.bs[j]
        end
        
        # record losses
        append!(training_df, 
                DataFrame(epoch = i, 
                          loss = find_loss(nn, nn_funs, a, y)[1],
                          test_loss = find_loss(nn, nn_funs, a_test, y_test)[1]))

    end
    
    return nn, training_df
end

Enzyme.make_zero(nn) creates a structural copy of nn with the same type (neural_network), field names (Ws, bs), and dimensions …but with all numerical values set to zero. This memory-saving trick is important for large vectors of parameters, as we will generally have in deep learning.

The example applications that I selected are already very well equipped with sophisticated Julia libraries. If you are interested in probabilistic modelling in Julia, use Turing, if you are interested in deep learning, use Flux. Both are Enzyme compatible, but the later has specific guidance on how to set this up, using the Duplicated method that we used above:

some references

The Julia autodiff ecosystem, which is more vast than the examples covered in this blog post, link

A summary of the key trade-offs accross various autodiff methods, link

Professor Simone Scardapone’s book, “Alice’s adventures in a differential wonderland” link.

“As the name differentiable implies, gradients play a pivotal role”

JuliaCon talk on Julia’s unique approach to autodiff:

Citation

BibTeX citation:
@online{di_francesco2025,
  author = {Di Francesco, Domenic},
  title = {Diff All the Things! {Part} 1},
  date = {2025-09-09},
  url = {https://allyourbayes.com/posts/gradients_pt1/},
  langid = {en}
}
For attribution, please cite this work as:
Di Francesco, Domenic. 2025. “Diff All the Things! Part 1.”
September 9, 2025. https://allyourbayes.com/posts/gradients_pt1/.

Accelerate Your Julia Code with Effective Profiling Methods

By: Great Lakes Consulting

Re-posted from: https://blog.glcs.io/profiling_allocations

This post was written by Steven Whitaker.

In this Julia for Devs post,we will discussusing Julia’s Profile standard libraryfor performance and allocation profiling.We will illustrate these toolswith example codeand then show how to improve the code.

This post will showcasethe powerful techniques we usedto significantly accelerateour clients simulations,resulting in a remarkable 25% reduction in run timeof code that was already highly optimized for performance.The impact of these techniqueson our client’s workis a testament to their effectivenessand should inspire youin your own projects.

Many new Julia developersface poor code performance,but these techniques can net10x or even 100x faster run times!

While we won’t be sharing client-specific code,we will provide similar examplesthat are practical and applicableto a wide range of simulations.This approachshould give you the confidenceto apply these techniquesin your work.

Here are some of the key ideas we will focus on:

  • Pinpoint areas of improvement with Profile.@profile.
  • Track down allocations with Profile.Allocs.@profile.
  • Improve run time and reduce allocations with @generated functions.

Let’s dive in!

Profiling in Julia

Profiling is a crucial toolfor locating performance bottlenecks.This knowledge is invaluableas it guides development effortsto the parts of the codethat will have the most significant impact on run time.Understanding this processwill keep you informedand in controlof your code’s performance.

Here is some example codewe will useto illustrate profiling in Julia:

using StaticArrays: SVectorfunction kernel_original(x::SVector{N}) where N    y = zeros(N + 1)    y[1] = x[1]    for i = 2:N        y[i] = 0.5 * y[i-1] + x[i]    end    y[end] = sum(@view(y[1:end-1]))    return SVector{N+1}(y)endfunction workflow_original()    total = 0.0    for i = 1:100        x = SVector{10, Float64}(rand(10))        y = kernel_original(x)        total += y[end]    end    return totalend

The main idea with this exampleis that there is a workflow,workflow_original,that calls out to a core computationfor many different input values.

To profile this code,we will use the Profile standard library.Since Profile implements a statistical profiler,we need to ensure the code we profileruns long enoughto reduce the impact of noise in the measurements(see the documentation for more info).So,let’s see how long workflow_original takes(using BenchmarkTools.jl):

julia> using BenchmarkToolsjulia> @btime workflow_original();  6.410 s (300 allocations: 25.00 KiB)

6 microseconds is very fastcompared to the default profiling sample delayof 1 millisecond.Therefore,we must run the workflow many timesto get enough samples.We have found that 1000–5000 samplesare usually plentyfor identifying performance bottlenecks,though slower code will likely need more samplesand/or a larger sample delay.

So,to get approximately 1000 samples,we will need to run the workflow\( 1000 \cdot \frac{1 \mathrm{ms}}{0.006 \mathrm{ms}} = 166,667 \) times.We’ll round up to 200,000 for good measure.

Let’s see how to profile the code:

julia> using Profilejulia> Profile.clear()julia> workflow_original();julia> Profile.@profile for _ in 1:200_000 workflow_original() end

A couple of notes:

  • The Profile.clear() stepis not strictly necessary.However, a habit of always clearing the profile dataensures one doesn’t inadvertently spoil their profileswith old data from previous profiling results.
  • We called workflow_original once before profilingto avoid profiling JIT compilation.

Now we want to display the profile data.One way is via Profile.print,which prints a textual representationof the profile to the console,but this isn’t the most efficient methodfor inspecting the data.Typically,profile data is visualizedusing a flamegraph.

The Profiling docs list several packagesfor visualizing profiles.We will use ProfileCanvas.jl,which creates an HTML filewe can view in a web browserand interactively inspect the profiling results:

julia> using ProfileCanvasjulia> ProfileCanvas.view()

This code creates and displays a flamegraph:

Profile of original workflow

One thing that stands out in this flamegraphis the three yellow rectangles.These indicate occurrences of garbage collection (GC),which implies the code allocated memory.Since these yellow blocksrepresent a decent portion of the run time(as indicated by the width of the rectangles),let’s investigate.

Allocation Profiling

We will use Julia’s allocation profiling toolsto further inspectthe allocations we know the workflow has.The process is similar to performance profiling,with the following differences:

  • We will use Profile.Allocs.@profileand pass the option sample_rate = 1.0to record all allocations.In a larger workflow with many more allocations,a smaller sample_rate is advised(the default value is 0.1).
  • We will run the workflow just one time.
  • We will visualize the results with ProfileCanvas.view_allocs.

Here’s how to profile allocations:

julia> using Profile, ProfileCanvasjulia> Profile.Allocs.clear()julia> workflow_original();julia> Profile.Allocs.@profile sample_rate = 1.0 workflow_original();julia> ProfileCanvas.view_allocs()

Allocation profile of original workflow

In this flamegraph,the widths of each rectanglecorrespond to how many allocations were made.In this example,each yellow blockis the same width,meaning they all contributedthe same number of allocations.

Now we need to determinewhether we can do anything about these allocations.

Moving up from the bottom of the flamegraphlets us trace the call stackto see where these allocations came from.For example,we can see that GenericMemorywas called from Array,which itself was called from Array,and so on.Eventually,we get to workflow_original.

If we hover our mouse cursor over that block,it will display the file and line number:

Mouse hover displaying file and line info

(In this case,I defined workflow_originalin the first REPL promptof a fresh Julia session,so that’s why REPL[1] shows up for the file name.)

Looking at these profile resultsshows us that the offending lines are:

  • x = SVector{10, Float64}(rand(10)) (in workflow_original)
  • y = zeros(N + 1) (in kernel_original)

This makes sense;in each of these lines,we explicitly create an array(via rand and zeros),which allocates memory.

But it turns outwe can eliminate these allocations!

Eliminating the first allocationrequires knowledgeof the StaticArrays.jl package.In particular,the @SVector macrocan create an SVectordirectly from standard array expressions(like rand)without allocating memory.So,instead of using the SVector constructor,we can write:

x = @SVector rand(10)

The second allocation,however,is a bit trickier to remove.

Reducing Allocations with @generated Functions

What makes it difficultto remove the allocationin kernel_originalis the elements of the vector ydepend on previous elementsof the vector.That means we have to compute one elementto compute the next,which means we have to store that element somehow.If we knew exactly how long y would be,we could store the computationsin local (scalar) variables.However, the function needs to work with any input size;we don’t want to create a separate methodfor each possible input size.

At least, not by hand.

It turns outJulia can automatethe creation of these specialized methods.The trick is to use the @generated macro.

A method annotated with @generateduses type informationto produce specialized implementationsof the methoddepending on the input types.And since type information is available at compile time,this specialization occurs at compile time,leading to run time improvements.

Using a @generated functionto replace kernel_originalwill allow us to, essentially,move the run time allocationto compile time.

Here’s what the new function looks like:

@generated function kernel_generated(x::SVector{N}) where N    assignments = [:(y1 = x[1])]    for i = 2:N        yprev = Symbol(:y, i - 1)        yi = Symbol(:y, i)        push!(assignments, :($yi = 0.5 * $yprev + x[$i]))    end    yend = Symbol(:y, N + 1)    sum_expr = reduce((a, b) -> :($a + $b), (Symbol(:y, i) for i = 1:N))    push!(assignments, :($yend = $sum_expr))    vars = ntuple(i -> Symbol(:y, i), N + 1)    return quote        $(assignments...)        return SVector{$(N + 1), Float64}($(vars...))    endend

The first thing you might notice,especially if you are unfamiliar with metaprogramming,is that this function looks quite differentfrom kernel_original.So, let’s unpack this a bit:

  • A @generated functionneeds to return an expression.The compiler will then compilethe code resulting from the expression.Finally, at run time,the compiled code will be used,not the code usedto generate the compiled expression.In other words,this function must returnthe specialized code itself,not the result of a run time computation.
  • The returned expressionis created with a quote block.This quote blockinterpolates (using $) expressionsbuilt up earlier in the function.The computationsto build up these expressionsoccur only at compile time.

If we want to inspectwhat the generated function looks like,we need to refactor the code just a bit.Essentially,we’ll create a regular Julia functionthat returns an expression,and the @generated functionwill just call that function:

function _gen_kernel_generated(::Type{<:SVector{N}}) where N    # Same code as `kernel_generated` above.end@generated kernel_generated(x::SVector) = _gen_kernel_generated(x)

Then we can call _gen_kernel_generated directlyto see what code actually runs:

julia> _gen_kernel_generated(typeof(@SVector rand(1)))quote    #= REPL[2]:18 =#    y1 = x[1]    y2 = y1    #= REPL[2]:19 =#    return SVector{2, Float64}(y1, y2)endjulia> _gen_kernel_generated(typeof(@SVector rand(2)))quote    #= REPL[2]:18 =#    y1 = x[1]    y2 = 0.5y1 + x[2]    y3 = y1 + y2    #= REPL[2]:19 =#    return SVector{3, Float64}(y1, y2, y3)endjulia> _gen_kernel_generated(typeof(@SVector rand(3)))quote    #= REPL[2]:18 =#    y1 = x[1]    y2 = 0.5y1 + x[2]    y3 = 0.5y2 + x[3]    y4 = (y1 + y2) + y3    #= REPL[2]:19 =#    return SVector{4, Float64}(y1, y2, y3, y4)end

Yep, the implementation looks right!

We can also seethat the generated codeavoids creating an arrayto store intermediate results,instead storing computationsin local variables.But we didn’t have to writeany of those methods ourselves!Using @generated allows usto maintain just one functionto generate all these specialized implementations.

Let’s benchmark the new implementation:

julia> @btime workflow_generated();  1.093 s (0 allocations: 0 bytes)

Nice, six times fasterand no allocations!

And here’s the profile:

Profile of optimized workflow

(Click here to see the code for profiling.) juliausing StaticArrays: @SVector, SVector@generated function kernel_generated(x::SVector{N}) where N assignments = [:(y1 = x[1])] for i = 2:N yprev = Symbol(:y, i - 1) yi = Symbol(:y, i) push!(assignments, :($yi = 0.5 * $yprev + x[$i])) end yend = Symbol(:y, N + 1) sum_expr = reduce((a, b) -> :($a + $b), (Symbol(:y, i) for i = 1:N)) push!(assignments, :($yend = $sum_expr)) vars = ntuple(i -> Symbol(:y, i), N + 1) return quote $(assignments...) return SVector{$(N + 1), Float64}($(vars...)) endendfunction workflow_generated() total = 0.0 for i = 1:100 x = @SVector rand(10) y = kernel_generated(x) total += y[end] end return totalendusing BenchmarkTools@btime workflow_generated();using Profile, ProfileCanvasProfile.clear()Profile.@profile for _ in 1:200_000 workflow_generated() endProfileCanvas.view()Note,there’s no need for allocation profilingbecause there are no allocations.

There aren’t obvious places for improvement,so I’d say we did a pretty good joboptimizing the code.

Summary

In this post,we saw how to use the Profile standard libraryto profile the run timeand the allocationsof a piece of code.We also illustratedhow a @generated functioncan eliminate run time allocationsand speed up the code.These were key ideas we usedto help one of our clientsspeed up their simulations.

Do you need help pinpointing performance bottlenecksor tracking down allocationsin your code?Contact us, and we can help you out!

Additional Links

]]>