Tag Archives: Deep Learning

diff all the things! Part 2

By: Domenic Di Francesco

Re-posted from: https://allyourbayes.com/posts/gradients_pt2/


TLDR

After reading Part 1, we know how great autodiff is, and how Julia lets us use it freely. We introduced the Enzyme library and showed some example applications.

Here in Part 2, we look at an emerging competing library, Mooncake, and why it’s worth keeping an eye on ?


recap: Enzyme is great ?

As we saw in Part 1, Enzyme solves the reliability problems that Zygote can (occasionally) exhibit. It differentiates our code at the LLVM level and is mega performant.

We also discussed in Part 1, how nice it is that Julia projects use only Julia code and the benefits of this for interoperability. This isn’t strictly true for Enzyme. Since it operates at the LLVM level, it can differentiate code in any language that compiles to the LLVM IR (including C++, and Rust!). We call a Julia API, but the differentiation is actually happening in the Enzyme software, outside of Julia.

We’ve mentioned it a few times as if it’s basic knowledge – wasn’t for me! LLVM is a compiler framework used by many languages (including Julia) as a shared backend for generating machine code.

When you write Julia, your code eventually gets lowered to the LLVM level (we even saw how you can display this, using @code_llvm in Part 1).

Enzyme operates at this level. This means your high-level code has already been simplified and optimised a fair bit before any autodiff is attempted. This is a big reason why Enzyme is so performant.

The downside is that at the LLVM level, there is no concept of Julia types, dispatch, or packages. There are challenges (read on for details) associated with needing to cross this boundary.

So how good can autodiff be, if we learn lessons from Zygote and Enzyme, but stay entirely in Julia….

enter Mooncake ?

The pitch is as follows: an AD library, written entirely in Julia and competitive with Enzyme.

Oh, Mooncake

Like Enzyme, Mooncake handles mutation, control flow, and provides reliable correctness – stuff that Zygote can struggle with.

…but unlike Enzyme, it does all of this without leaving Julia. It is a self-described language-level autograd compiler.

Zygote (and ReverseDiff) are tracing AD libraries. They execute the function and record all operations on a tape. The tape can then be replayed in reverse to compute the gradients neatly – especially for simple functions.

The tape is data, not code, so Julia can’t optimise it.

A tape records a path (as your code runs), but Enzyme and Mooncake can apply the chain rule to the code before it runs, and then produce new code that: (a) preserves mutation and control flow, and (b) returns gradients with very little overhead, and (c) can be optimised by the compiler. This is only possible because (as we saw in Part 1), Julia’s compiler produces an IR that retains the loops, branches, types, and all, of your code. Python doesn’t have this and so it has to trace.

autodiff library reads generates consequence
Zygote Julia IR tape (via fragile IR transforms) slow on control flow, can silently mishandle edge cases
ReverseDiff runtime trace tape can’t be compiler-optimised
Enzyme LLVM IR new LLVM code fast, but outside Julia
Mooncake Julia IR new Julia functions fast, and stays in Julia

All of the major Python AD libraries (PyTorch, TensorFlow, JAX) implement some kind of tape or tracing. AFAIK, the only one that doesn’t re-trace every operation is JAX, which uses a method that imposes a fixed control flow – hence the self-described “sharp bits” ?.

Why does staying in Julia matter?

  • debugging: we get Julia errors – not messages from LLVM-land, which I certainly can’t follow.
  • new custom rules: adding new derivatives just requires a Julia function. Mooncake provides helpful macros for this too!
  • stability: we won’t get breaking changes on new Julia releases if something changes with the LLVM. I have read that Enzyme has previously had to make fixes for this.

If you still aren’t sure whether to take notice, then listen to the man himself, Chris Rackauckas, “Mooncake is Zygote, but done good, with mutation support” (see clip below)

lets be honest about current trade-offs

It’s tricky trying to find definitive benchmarks for the Julia AD ecosystem. The discourse pages will provide one-off examples of older libraries occasionally outperforming newer ones in terms of speed and reliability. However, the general consensus seems to be that Enzyme is currently the most performant, with Mooncake not far behind.

Let’s do our own….

# loading our autodiff libraries
using ReverseDiff, Zygote, Enzyme, Mooncake

# and to test
using BenchmarkTools, Random

# a simple function
test_function(x) = sum(sin.(x) .+ cos.(x.^2))

# and an example with some control flow (the *if*'s can be a problem for some AD libraries)
function control_flow_function(x)
    s = 0.0
    for i in eachindex(x)
        if x[i] > 0.0
            s += sin(x[i])
        else
            s += cos(x[i])
        end
    end
    return s
end

# simulate some reproducible inputs
x = rand(MersenneTwister(2311), 1000)

Loops, if/else branches, and recursion are how we naturally write scientific code. In this example, the control flow example has a for loop (with iterations) and a branching if/else statement. A tape-based AD (see callout above, “How does Mooncake work”) needs to trace every iteration and record whichever branch was taken, each with its own allocation!

The tape-free approach taken by Enzyme and Mooncake avoids this overhead entirely, producing derivative code where the loop is still a loop. They are especially powerful when differentiating through code with lots of control flow.

@btime ReverseDiff.gradient(test_function, $x)[begin]
@btime ReverseDiff.gradient(control_flow_function, $x)[begin]

@btime Zygote.gradient(test_function, $x)
@btime Zygote.gradient(control_flow_function, $x)

@btime Enzyme.gradient(Reverse, test_function, $x)
@btime Enzyme.gradient(Reverse, control_flow_function, $x)

Unlike Enzyme, which has both gradient() (convenience) and autodiff() (more explicit specification), Mooncake’s value_and_gradient!! is the main API. There isn’t a separate “advanced” form.

In Julia, the familiar single ! denotes a function that may mutate its arguments (but leaving them in a valid state), like how sort!(x) redefines x as a sorted array.

The double !! is new to me, and is apparently a convention from the AD ecosystem, and not base Julia. It signifies that arguments may be mutated, with no guarantees about how meaningful/useful they are afterwards. Presumably in the below example, it is the aggressive recycling of memory that is being signposted i.e. previous contents of the cache are being overwritten and shouldn’t be referenced? But maybe there is more to it.

# Mooncake needs a prepared cache
cache = Mooncake.prepare_gradient_cache(test_function, x);
@btime Mooncake.value_and_gradient!!(cache, test_function, $x)

cache = Mooncake.prepare_gradient_cache(control_flow_function, x);
@btime Mooncake.value_and_gradient!!(cache, control_flow_function, $x)

The results… ? ? ?

test_function()

library time (μs) allocations memory
Enzyme 15.4 8 24.16 KiB
Zygote 16.5 49 97.14 KiB
ReverseDiff 23.7 105 67.78 KiB
Mooncake 24.8 11 16.48 KiB

control_flow_function()

library time allocations memory
Enzyme 3.2 μs 3 8.06 KiB
Mooncake 14.5 μs 3 352 B
ReverseDiff 257.0 μs 8,023 375.70 KiB
Zygote 3,481 μs 42,118 9.51 MiB

As expected, Enzyme is fast. It sees this already-optimised LLVM code and differentiates that.

Although Mooncake was the slowest on the simple test function, it was only times slower than Enzyme – same order of magnitude, still competitive. The newer libraries really shone on the control flow function, hundreds (Mooncake) to thousands (Enzyme) times faster than Zygote.

And look at the memory! ?

Mooncake allocates so little! After the initial cache preparation, the gradients calculated in training or inference loops will be almost zero-allocation, which is fantastic for performance or memory bottlenecks.

reviewing the Bayesian model from Part 1

Using the same simulated data and priors, let’s run the linear regression using Turing (look how much neater it is) and swap AD backends with a single argument.

using Turing, Distributions

@model function linear_regression(x, y)
    # priors
    α ~ α_prior; β ~ β_prior; σ ~ σ_prior
    
    # there are ofc lots of ways to vector/optim-ise the likelihood, but...
    for i in eachindex(y)
        y[i] ~ Normal+ β * x[i], σ)
    end
end

linear_model = linear_regression(x, y); n_draws = 1_000

just running a single chain for post-warmup samples for the purposes of this example:

mooncake_draws =  sample(linear_model, NUTS(; adtype=AutoMooncake(; config=nothing)), n_draws)

enzyme_draws = sample(linear_model, NUTS(; adtype=AutoEnzyme(; mode=Enzyme.set_runtime_activity(Enzyme.Reverse))), n_draws)

not ideal syntax, but within each sample you can specify an AD backend. Actually in the above example, I wouldn’t expect any benefit from using Mooncake or Enzyme – it’s a tiny model ( parameters, data points) and, as discussed in Part 1, a Forward Mode AD library like ForwardDiff would likely be the best bet. As I mentioned in Part 1, there have been cases where I benefitted enormously from simply switching the AD backend to AutoMooncake().

my current thoughts

Please keep in mind, I don’t feel best placed to comment on direction of travel. I am a user (and fan) of the Julia scientific computing ecosystem, but I am not an open source developer.

Mooncake is still being developed by that community, with the stated goal to: “improve on ForwardDiff.jl, ReverseDiff.jl, and Zygote.jl in several ways.” It encourages us to use it, in a seemingly arduous way. Either by adding an extra step (prepare_gradient()), or by using DifferentiationInterface – a common interface for multiple Julia AD libraries.

As with most statistical software, I expect the future success of these libraries will be tied to how well they integrate with the rest of the ecosystem. If Mooncake can be subbed in for Enzyme in libraries like Turing and Flux, then it will be easy to switch and benefit from its features.

Or will the Julia AD ecosystem fail to converge and I’ll end up writing a Part 3 in this series? ?

Citation

BibTeX citation:
@online{di_francesco2026,
  author = {Di Francesco, Domenic},
  title = {Diff All the Things! {Part} 2},
  date = {2026-03-01},
  url = {https://allyourbayes.com/posts/gradients_pt2/},
  langid = {en}
}
For attribution, please cite this work as:
Di Francesco, Domenic. 2026. “Diff All the Things! Part 2.”
March 1, 2026. https://allyourbayes.com/posts/gradients_pt2/.

A Beginner’s Guide to DataLoaders in Julia with MLUtils.jl

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/deep-learning/dataloaders/2025/11/08/dataloaders.html

Introduction

Data Loaders are critical tools to work efficiently with deep learning.
They are the glue that bring the data to your machine learning model. Doing this efficiently
and without errors is important for successful training of any model.

Datasets and DataLoaders are concepts that encapsulate the logic for preparing the data to be
input in your model. In their basic form, they prepare small batches of data for ingestion by
reshaping individual samples into matrix form. More involved applications may load
images from files, augment them through flipws, crops, rotations.

The good news is that in Julia, MLUtils has you covered for
all your data loading needs. In this blog post we’ll dive into how to create both Datasets and DataLoaders.
We start out with a simple-as-possible example and then walk through a more realistic example.

Data loaders in easy mode

Let’s start by writing a simple as possible data loader to illustrate the core concepts of data loading
using the DataLoader. The task we are looking at is somewhat related to image classification, but simplified.

Our data may be a bunch of images, like 1,000 of them. Each of the images is assumed to be in one of
ten classes. A mock up of that data is something like

X_train = randn(Float32, 10, 10, 1_000)
Y_train = randn(1:10, 1_000)

Here, X_train are 1_000 samples of 10×10 matrices. These are mocked up as normally distributed Float32
data with shape (10, 10, 1_000), corresponding to (width, height, sample). Julia stores arrays in column-major,
so in our case the data for width is consecutive in memory. The last dimension, the sample dimension varies slowest.
The targets are mocked as 10_000 samples of random integers between 1 and 10, indicated by the 1:10 syntax.

A nice thing about Julia is that you often get away without defining custom classes. In our simlpe case,
we can get away with defining our dataset as

data = (X_train, Y_train)

No custom class definition needed. The dataset is just a tuple of two vectors.

Now we move on to the data loader. The job of the data loader is to sample from our dataset. It should
do batching, i.e. loading multiple (x,y) samples in one go, and random shuffling.
For this simple dataset, we just need to create a DataLoader:

loader = DataLoader(data, batch_size = 3)

DataLoaders are iterable, and we can look at the first item like so:

julia> (x_first, y_first) = first(loader);
julia> size(x_first)
(10, 10, 3)
julia> size(y_first)
(3,)

When we iterate over the dataloader, we can use individual samples. Here is the iteration skeleton:

julia> for (x, y) in loader
            @show size(x), size(y)
       end
[...]
(size(x), size(y)) = ((10, 10, 3), (3,))
(size(x), size(y)) = ((10, 10, 1), (1,))

Note that in the last iteration, the DataLoader only returns vectors with a single sample.
This is the loader watching out for us and stopping at the end of the data. We have 1_000 samples,
and using batchsize=3, the last iteration can only feature a single sample. This is because
1000 / 3 = 333.33333 and the dataset is exhausted after the first sample in the last iteration.

We can confirm this by querying the length of the dataloader:

julia> length(loader)
334

That’s it. In conclusion, for a simple in-memory dataset creating a tuple data = (X, Y) is all you need.
The DataLoader does the rest.

A more realistic example

Now let’s look at a more involved case where we want to read text data from a file and tokenize it.
This sample is inspired by Andrej Karpathy’s NanoGPT video,
github collab.

For this application, it’s convenient to encapsulate information on the DataSet into a struct. This struct
collects the length of the text, a block size for data loading, dictionaries that map characters to tokens
and vice-versa, and a vector of all the tokens. The members important for the data loading logic are
block_size and data. While data holds the data vector itself, block_size defines the length of
an individual sample.

In addition, we give the struct a constructors that reads all lines in the text file, creates an array of
unique characters, and dictionaries to map from character to token and a dictionary to map from token to character.
While the dictionaries are not necessary for the data loading logic to work, they are included here for completeness.

struct NanoDataset
    block_size::Int64               # How long an observation is. This is important for the dataloader
    ch_to_int::Dict{Char, Int64}    # Maps chars to tokens
    int_to_ch::Dict{Int64, Char}    # Maps tokens to chars
    data::Vector{Int64}             # The tokens

    function NanoDataset(filename::String, block_size::Int64)
        lines = readlines(filename)             # Read all lines
        _out = [c for l ∈ lines for c ∈ l]      # Create char array
        chars = sort!(unique(_out))             # Sort all chars occuring in the dataset
        push!(chars, '\n')                      # Add the \n character that we stripped when only looking at lines
        ch_to_int = Dict(val => ix for (ix, val) in enumerate(chars))   # Mapping of chars to int
        int_to_ch = Dict(ix => val for (ix, val) in enumerate(chars))


        all_tokens = [ch_to_int[s] for s in join(lines, "\n")]

        new(block_size, ch_to_int, int_to_ch, all_tokens)
    end
end

To create this dataset we run

ds = NanoDataset(FILENAME, 16)

Great, now moving on to the DataLoader. To make the dataloader work, we have to implement the numobs and
getobs interface as described in the
Documentation. This basically means
that we have to make DataLoader know how to get the length of the dataset and how to get a single observation.

Fortunately, the MLUtils documentation tells us how to implement both. numobs just requires to specialize Base.length for our type. This function returns how
many observations (that is the number of samples) there are in the dataset.

We implement this like

Base.length(d::NanoDataset) = length(d.data) - d.block_size - 1

which makes numobs work for our dataset:

julia> numobs(ds)
1115375

Next, we’ll get getobs working.

For the easy case, an individual sample of X_train can be accessed by X_train[:,:,42]. That is, array
indexing is the access. For NanoDataset we need to define how to get a single observation. We do this
by specializing Base.getindex on our dataset:

function Base.getindex(d::NanoDataset, i::Int)
    1 <= i <= length(d) - d.block_size - 1|| throw(ArgumentError("Index is out of bounds"))
    return (d.data[i:i+d.block_size-1], d.data[i+1:i+d.block_size])

So when we access ds through [] braces, we get two vectors: The input X and the target Y. X is
a sequence of tokens of length block_size and Y is this same sequence, shifted by a single index.

I’d like to note here, that Base.getindex is a good place to get fancy. For image loading, this
function may load data from a file. Or apply augmentations through
Augmentor.jl. The sky (or ChatGPT) is the limit!

Now we can access samples in our dataset like

julia> ds[1]
([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14], [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43])

Note that Y is just X shifted by one index, just as implemented in getindex.
Before we define a dataloader, let’s implement one more function that allows us to load minibatchs:

MLUtils.getobs(d::NanoDataset, i::AbstractArray{<:Integer}) = [getobs(d, ii) for ii in i]

This function defines to dispatch calls where requested samples are given by i::AbstractArray{<:Integer}
to multiple calls of getobs. For example, loading the first two samples returns this:

julia> getobs(d, [1,2])
2-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}:
 ([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14], [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43])
 ([47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43], [56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43, 44])

Fantastic, now we have everything in place to create a DataLoader. We can create a dataloader that
loads 2 batches at the same time:

julia> dl = DataLoader(d, batchsize=2)
557688-element DataLoader(::NanoDataset, batchsize=2)
  with first element:
  2-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}

julia> (x,y) = first(dl)
2-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}:
 ([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14], [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43])
 ([47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43], [56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43, 44])

The data loader returns a vector of lenth 2, each element of the vector a tuple of vectors.
Ideally for deep learning, we’d like a matrix of size (16,2) though. Remember the first dimension is the sequence
length, dimensions 2 is the batch dimension. The DataLoader can do this by setting collate=true:

julia> dl = DataLoader(d, batchsize=2, collate=true)
557688-element DataLoader(::NanoDataset, batchsize=2, collate=Val{true}())
  with first element:
  (16×2 Matrix{Int64}, 16×2 Matrix{Int64},)

julia> (x,y) = first(dl)
([18 47; 47 56;  ; 65 14; 14 43], [47 56; 56 57;  ; 14 43; 43 44])

julia> x
16×2 Matrix{Int64}:
 18  47
 47  56
 56  57
 57  58
 58   1
  1  15
 15  47
 47  58
 58  47
 47  64
 64  43
 43  52
 52  10
 10  65
 65  14
 14  43

Now the DataLoader returns the desired matrix. In addition, DataLoader supports shuffling and multithreading,
just add shuffle=true, parallel=true to the parameter list.

Where to go from here

In this tutorial we looked at how to use DataLoaders in Julia through two examples. In the first example,
we had in-memory data that can be passed as a tuple into the DataLoader. That just worked.
In the second example, we had to implement the numobs, getobs interface to make it work. This is a
just a bit more work, but allows to work with arbitrary custom data. The getobs implementation is where
to hook in lazy-loading from disk, augmentations, and so on.

Now that we looked at some examples on data loading, here are some things to try next

  • Try building your own Dataloader and use it in a Lux or Flux training loop.
  • Play around with image augmentation using Augmentor
  • Try using threads and measure the speed-up compared to single-threaded loading.

diff all the things

By: Domenic Di Francesco

Re-posted from: https://allyourbayes.com/posts/gradients/


TLDR

Gradients guide us through high-dimensional parameter space to draw samples from posterior distributions, take steps towards loss minimising parameter values, identify model vulnerabilities using adversarial methods, and more. One (of many) fun features about the Julia programming language is it’s unique approach to autodiff. However, I personally found some documentation a little difficult to follow, so this is intended to be a practical guide, with a couple of example use cases.


autodiff (AD)

Soooo much of computational statistics and machine learning is built on automatic (algorithmic) differentiation (AD). Dig into literature on Bayesian inference, or deep learning, and you will find parameters being nudged in a direction informed by a gradient. I’m sometimes suprised at the extent of the ‘gradient-based’ monopoly in scientific computing, but I don’t mean to trivialise! …getting gradients of complex functions, very quickly and without error, is a powerful tool and it’s great that people are able to leverage this.

AD achieves this by “the relentless application of the chain rule”, as I vaguely recall one of the Stan developers saying. Large functions are differentiated piece by piece (for which look-up rules can be applied), and the results are stitched together.

inter-operability and Julia ♥️??

Perhaps my favourite feature of Julia is it’s inter-operability. Look at a GitHub repo for a Julia package and you will generally find the following:

Julia is so performant, that it’s libraries for scientific computing will work with normal variables – without needing to package up their own types. So solvers from DifferentialEquations, or neural networks defined in Flux can be immediately combined with the Julia probabilistic programming language, Turing.

If a new framework emerges in Python, an entirely new ecosystem may need to be developed to prop it up. This may involve duplicating existing but now incompatible functionality – think of JAX needing to implement it’s own NumPy module. …whereas if you write a new Julia library, it could offer a vast range of applications as it is combined with other packages, which feels like potential for a multiplicative, rather than additive, impact.

Here I am, impressed that the interoperability of the Copulas package:


AD in Julia

ML frameworks in Python require you to work in their own syntax, with theit specific types, and use their own in-built AD methods. Locking into a framework is not ideal, as you are limited to the methods they support and you need to juggle types of inputs and outputs.

Conversely, Julia works the other way around. You write your code, using whichever libraries, functions and types you want, and then you choose an AD library to get you the gradients you need.

the joy of gradients – image courtesy of Google Gemini

Why does this feel so powerful? It’s the promise of gradient-based methods for your scientific problem, rather than a walled-off machine learning model. You can point an AD engine at the aspect of your analysis that you are interested in for more bespoke interrogation or optimisation.

I recently completed a project, which had an element of adversarial AI (counterfactual analysis and saliency maps), which I wrote in JAX, mainly as an excuse to learn JAX. I used their NNX module for neural networks, and found this to be unexpectedly restrictive. I wasn’t able to run analysis that required gradients of outputs w.r.t. inputs, as I hit NNX errors/limitations that I wasn’t able to resolve. I ended up re-writing everything in Julia.

Once upon a time there was Zygote: an AD library that powered ML in Julia. A great achievement, but limitations began to emerge. Because it operates on your high-level Julia code, it can struggle with certain features, similar to the so-called “sharp bits” of JAX …though I believe those limitations emerge for different reasons. I have also heard Zygote being described as “too permissive”“: it’s attempts to differentiate anything you throw at it can open the door to unidentified errors.

And then came Enzyme: an AD library that works at the LLVM level (your code is compiled first, and then the gradients are computed). This led to improvements in both performance and flexibility – we can now differentiate through mutation and control flow. This more resilient library has a steeper learning curve (imo), but (also imo) requiring more explicit instructions ends up making things clearer.

I hope the below examples will help get you started.

example 1: a Bayesian linear regression

How about computing the gradients we need for Hamiltonian Monte Carlo sampling. For a linear regression model, with inputs, and outputs, :

with priors:

Let’s put these in a tuple:

using Distributions

priors = (
    α_prior = Normal(0, 3), 
    β_prior = Normal(0, 3), 
    σ_prior = Exponential(1)
)
(α_prior = Normal{Float64}(μ=0.0, σ=3.0), β_prior = Normal{Float64}(μ=0.0, σ=3.0), σ_prior = Exponential{Float64}(θ=1.0))

For numerical stability reasons MCMC typically works in negative log space, so the below function finds the unnormalised negative log posterior for our model. This could of course be sped-up, but I’ve tried to keep it friendly ?

using Random

function neg_log_posterior(params::NamedTuple, priors::NamedTuple,
                          x::Vector, y::Vector)

    α = params.α; β = params.β; σ = params.σ
    
    # mean of likelihood
    μ_pred = α .+ x * β
    
    # increment the negative log likelihood for all observations
    neg_log_lik = 0.0
    for i in 1:length(y)
        neg_log_lik += -logpdf(Normal(μ_pred[i], σ), y[i])
    end
    
    # ...and for the priors
    neg_log_prior_α = -logpdf(priors.α_prior, α)
    neg_log_prior_β = -logpdf(priors.β_prior, β)
    neg_log_prior_σ = -logpdf(priors.σ_prior, σ)
    
    # summing in log space is equivalent to multiplying priors and likelihoods ?
    return neg_log_lik + neg_log_prior_α + neg_log_prior_β + neg_log_prior_σ
end
neg_log_posterior (generic function with 1 method)

To use this function, we need to define some inputs. Here, I’m just simulating some data, using “true” parameter values:

# 20 data points because, why not?
n_samples = 20
# define a PRNG for reproducibility
prng = MersenneTwister(231123)
# inputs from a standard Gaussian
x = randn(prng, n_samples)
# outputs by sending inputs through a "true" model
y = 1/2 .+ x * (-1/2) .+ 2 * randn(prng, n_samples)

We can use Enzyme to get the gradients of the negative log posterior w.r.t. the model parameters – as required by Hamiltonian Monte Carlo.

# where do i want gradients?
params_init = (
    α = rand(prng, priors.α_prior),
    β = rand(prng, priors.β_prior),
    σ = rand(prng, priors.σ_prior)
)
(α = -6.290988779313054, β = 3.9183834605694834, σ = 1.6155587713748865)

I am giving the gradient() function three arguments:

  • the mode/direction to apply AD, Reverse. Each pass of a reverse-mode AD computes gradients of all inputs w.r.t. a single output (as a vector-Jacobian product). In Forward mode, each pass computes gradients of a single input w.r.t. all outputs (as a Jacobian-vector product). Consequently, there are efficiency trade-offs associated with this selection, depending on the number of inputs and outputs of…

  • …the function we are differentiating, params -> neg_log_posterior(params, priors, x, y). Here, an anonymous function that takes params as input and returns the negative log posterior, using neg_log_posterior(), which we defined above.

  • the point at which we want gradients, params_init. This is the current location of the Markov chain. In the first instance we need an initial guess, for which we have drawn from the priors.

using Enzyme
# computing the gradients
grads = Enzyme.gradient(
    Reverse, 
    params -> neg_log_posterior(params, priors, x, y), 
    params_init
)
((α = -54.13440425830732, β = 25.990919631591435, σ = -323.2246379556173),)

We can then use these gradients to update the momentum of our Hamiltonian ‘particles’, generating proposals guided by the geometry of the posterior distribution. Unlike random walks or Gibbs samplers, this generation of samplers remain efficient in high dimensions ?

example 2: an MLP (simple neural network)

I defined a simple, densely connected neural network without anything clever (no layer normalisations, recurrent connections or attention mechanisms), sometimes referred to as a multi-layer perceptron (MLP).

I’ll spare you this set-up code here as we are focussing on autodiff, but you can find the full code on GitHub.

Instead, let’s look at my training function. Notice that I am now using a different function, Enzyme.autodiff() for backpropagation. It has more arguments:

  • the mode/direction to apply AD, set_runtime_activity(Reverse). Similar to gradient(), but here we are specifying that we want to use reverse-mode AD, with runtime activity analysis. As a rule of ?, I start with regular Reverse mode AD. If I get compilation errors about, for instance, type inference or broadcasting, then I add set_runtime_activity().
  • the function we are differentiating, (net, funs, inputs, targets) -> find_loss(net, funs, inputs, targets). Here, an anonymous function that takes the neural network, its functions, inputs and targets as arguments, and returns the loss.
  • the activity of the function Active. we need to make the output to the loss function active, because it is the starting point of the chain rule.
  • the activity of each argument, Active, Const(), or Duplicated(). This is where things get more explicit. We need to tell Enzyme which arguments we want gradients for (Active), and which we don’t (Const) – the derivative of a constant is zero. Finally, we also want gradients for Duplicated variables, but they could be large. So we create a shadow copy of the neural network, nn_shadow, which we use to accumulate gradients in-place (without allocating new memory each time!)
function train(nn::neural_network, nn_funs::neural_network_funs, 
               a::Array{Float64}, y::Array{Float64};
               a_test::Array{Float64} = a, y_test::Array{Float64} = y, 
               n_epochs::Int = 10, η::Float64 = 0.01)
    @assert n_epochs > 0 "n_epochs must be greater than 0"

    training_df = DataFrame(epoch = Int[], loss = Float64[], test_loss = Float64[])

    for i in 1:n_epochs

        # initiate our memory-saving shadow
        ∇nn = Enzyme.make_zero(nn)

        # find ∂ℒ/∂θ
        Enzyme.autodiff(
            set_runtime_activity(Reverse),
            (net, funs, inputs, targets) -> find_loss(net, funs, inputs, targets)[1],
            Active,
            Duplicated(nn, ∇nn),
            Const(nn_funs),
            Const(a),
            Const(y)
        )

        # nudge all weights and biases towards a lower loss, using learning rate, η
        for j = 1:length(nn.Ws)
            nn.Ws[j] -= η * ∇nn.Ws[j]
            nn.bs[j] -= η * ∇nn.bs[j]
        end
        
        # record losses
        append!(training_df, 
                DataFrame(epoch = i, 
                          loss = find_loss(nn, nn_funs, a, y)[1],
                          test_loss = find_loss(nn, nn_funs, a_test, y_test)[1]))

    end
    
    return nn, training_df
end

Enzyme.make_zero(nn) creates a structural copy of nn with the same type (neural_network), field names (Ws, bs), and dimensions …but with all numerical values set to zero. This memory-saving trick is important for large vectors of parameters, as we will generally have in deep learning.

The example applications that I selected are already very well equipped with sophisticated Julia libraries. If you are interested in probabilistic modelling in Julia, use Turing, if you are interested in deep learning, use Flux. Both are Enzyme compatible, but the later has specific guidance on how to set this up, using the Duplicated method that we used above:

some references

The Julia autodiff ecosystem, which is more vast than the examples covered in this blog post, link

A summary of the key trade-offs accross various autodiff methods, link

Professor Simone Scardapone’s book, “Alice’s adventures in a differential wonderland” link.

“As the name differentiable implies, gradients play a pivotal role”

JuliaCon talk on Julia’s unique approach to autodiff:

Citation

BibTeX citation:
@online{di_francesco2025,
  author = {Di Francesco, Domenic},
  title = {Diff All the Things},
  date = {2025-09-09},
  url = {https://allyourbayes.com/posts/gradients/},
  langid = {en}
}
For attribution, please cite this work as:
Di Francesco, Domenic. 2025. “Diff All the Things.”
September 9, 2025. https://allyourbayes.com/posts/gradients/.