
A Beginner’s Guide to DataLoaders in Julia with MLUtils.jl

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/deep-learning/dataloaders/2025/11/08/dataloaders.html

Introduction

Data loaders are critical tools for working efficiently with deep learning.
They are the glue that brings the data to your machine learning model. Doing this efficiently
and without errors is important for successful training of any model.

Datasets and DataLoaders are concepts that encapsulate the logic for preparing the data to be
input to your model. In their basic form, they prepare small batches of data for ingestion by
reshaping individual samples into matrix form. More involved applications may load
images from files and augment them through flips, crops, and rotations.

The good news is that in Julia, MLUtils has you covered for
all your data loading needs. In this blog post we’ll dive into how to create both Datasets and DataLoaders.
We start out with a simple-as-possible example and then walk through a more realistic example.

Data loaders in easy mode

Let’s start by writing an as-simple-as-possible data loader to illustrate the core concepts of data loading
using the DataLoader. The task we are looking at is somewhat related to image classification, but simplified.

Our data may be a bunch of images, like 1,000 of them. Each of the images is assumed to be in one of
ten classes. A mock up of that data is something like

X_train = randn(Float32, 10, 10, 1_000)
Y_train = rand(1:10, 1_000)

Here, X_train holds 1_000 samples of 10×10 matrices. These are mocked up as normally distributed Float32
data with shape (10, 10, 1_000), corresponding to (width, height, sample). Julia stores arrays in column-major order,
so in our case the data along the width dimension is consecutive in memory. The last dimension, the sample dimension, varies slowest.
The targets are mocked as 1_000 samples of random integers between 1 and 10, drawn with the 1:10 range syntax.

A nice thing about Julia is that you often get away without defining custom classes. In our simple case,
we can simply define our dataset as

data = (X_train, Y_train)

No custom class definition needed. The dataset is just a tuple of two arrays.
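Since MLUtils treats the last array dimension as the observation dimension, this tuple already supports the observation interface out of the box. A quick sketch (assuming MLUtils is loaded):

using MLUtils

numobs(data)      # 1000: the number of observations along the last dimension
getobs(data, 5)   # (X_train[:, :, 5], Y_train[5]): the 5th image together with its label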

Now we move on to the data loader. The job of the data loader is to sample from our dataset. It should
do batching, i.e. loading multiple (x,y) samples in one go, and random shuffling.
For this simple dataset, we just need to create a DataLoader:

loader = DataLoader(data, batchsize=3)

DataLoaders are iterable, and we can look at the first item like so:

julia> (x_first, y_first) = first(loader);
julia> size(x_first)
(10, 10, 3)
julia> size(y_first)
(3,)

When we iterate over the dataloader, we get one batch at a time. Here is the iteration skeleton:

julia> for (x, y) in loader
            @show size(x), size(y)
       end
[...]
(size(x), size(y)) = ((10, 10, 3), (3,))
(size(x), size(y)) = ((10, 10, 1), (1,))

Note that in the last iteration, the DataLoader returns a batch holding only a single sample.
This is the loader watching out for us and stopping at the end of the data. We have 1_000 samples,
and with batchsize=3 we get 333 full batches, so the final batch contains just the one
remaining sample (1000 = 3 × 333 + 1).

We can confirm this by querying the length of the dataloader:

julia> length(loader)
334
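If a trailing incomplete batch is undesirable, it can be dropped. A minimal sketch, assuming the partial keyword of the MLUtils DataLoader:

julia> loader_full = DataLoader(data, batchsize=3, partial=false);

julia> length(loader_full)
333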

That’s it. In conclusion, for a simple in-memory dataset creating a tuple data = (X, Y) is all you need.
The DataLoader does the rest.

A more realistic example

Now let’s look at a more involved case where we want to read text data from a file and tokenize it.
This example is inspired by Andrej Karpathy’s NanoGPT video and the accompanying
Colab notebook on GitHub.

For this application, it’s convenient to encapsulate information about the dataset in a struct. This struct
collects a block size for data loading, dictionaries that map characters to tokens
and vice-versa, and a vector of all the tokens. The members important for the data loading logic are
block_size and data. While data holds the data vector itself, block_size defines the length of
an individual sample.

In addition, we give the struct a constructor that reads all lines in the text file, creates an array of
unique characters, and builds one dictionary to map from character to token and one to map from token to character.
While the dictionaries are not necessary for the data loading logic to work, they are included here for completeness.

struct NanoDataset
    block_size::Int64               # How long an observation is. This is important for the dataloader
    ch_to_int::Dict{Char, Int64}    # Maps chars to tokens
    int_to_ch::Dict{Int64, Char}    # Maps tokens to chars
    data::Vector{Int64}             # The tokens

    function NanoDataset(filename::String, block_size::Int64)
        lines = readlines(filename)             # Read all lines
        _out = [c for l ∈ lines for c ∈ l]      # Create char array
        chars = sort!(unique(_out))             # Sort all chars occurring in the dataset
        push!(chars, '\n')                      # Add the \n character that we stripped when only looking at lines
        ch_to_int = Dict(val => ix for (ix, val) in enumerate(chars))   # Mapping of chars to int
        int_to_ch = Dict(ix => val for (ix, val) in enumerate(chars))


        all_tokens = [ch_to_int[s] for s in join(lines, "\n")]

        new(block_size, ch_to_int, int_to_ch, all_tokens)
    end
end

To create this dataset we run

ds = NanoDataset(FILENAME, 16)

Great, now moving on to the DataLoader. To make the dataloader work, we have to implement the numobs and
getobs interface as described in the
Documentation. This basically means
that we have to tell the DataLoader how to get the length of the dataset and how to get a single observation.

Fortunately, the MLUtils documentation tells us how to implement both. numobs just requires us to specialize Base.length for our type. This function returns how
many observations (that is, the number of samples) there are in the dataset.

We implement this like

Base.length(d::NanoDataset) = length(d.data) - d.block_size - 1

which makes numobs work for our dataset:

julia> numobs(ds)
1115375

Next, we’ll get getobs working.

For the easy case, an individual sample of X_train can be accessed by X_train[:,:,42]. That is, plain array
indexing provides access. For NanoDataset we need to define how to get a single observation. We do this
by specializing Base.getindex on our dataset:

function Base.getindex(d::NanoDataset, i::Int)
    1 <= i <= length(d) || throw(ArgumentError("Index is out of bounds"))
    return (d.data[i:i+d.block_size-1], d.data[i+1:i+d.block_size])
end

So when we access ds through square brackets, we get two vectors: the input X and the target Y. X is
a sequence of tokens of length block_size and Y is the same sequence, shifted by a single index.

I’d like to note here that Base.getindex is a good place to get fancy. For image loading, this
function may load data from a file, or apply augmentations through
Augmentor.jl. The sky (or ChatGPT) is the limit! A sketch of what this could look like follows below.
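As an illustration, here is what a lazily loading image dataset could look like. This is only a sketch: ImageDataset, its fields, and the use of FileIO.load are assumptions made for this example, not code from the post.

using FileIO: load            # assumed here for reading image files from disk

struct ImageDataset
    paths::Vector{String}     # one image file per sample
    labels::Vector{Int}       # class label per sample
end

Base.length(d::ImageDataset) = length(d.paths)

function Base.getindex(d::ImageDataset, i::Int)
    img = load(d.paths[i])        # the file is only read when the sample is requested
    if rand() < 0.5               # toy augmentation: random horizontal flip
        img = reverse(img, dims=2)
    end
    return (img, d.labels[i])
end

Back to the NanoDataset.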

Now we can access samples in our dataset like

julia> ds[1]
([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14], [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43])

Note that Y is just X shifted by one index, just as implemented in getindex.
Before we define a dataloader, let’s implement one more function that allows us to load minibatches:

MLUtils.getobs(d::NanoDataset, i::AbstractArray{<:Integer}) = [getobs(d, ii) for ii in i]

This method dispatches calls where the requested samples are given by i::AbstractArray{<:Integer}
to multiple calls of getobs. For example, loading the first two samples returns this:

julia> getobs(ds, [1,2])
2-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}:
 ([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14], [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43])
 ([47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43], [56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43, 44])

Fantastic, now we have everything in place to create a DataLoader. We can create a dataloader that
loads batches of 2 samples:

julia> dl = DataLoader(ds, batchsize=2)
557688-element DataLoader(::NanoDataset, batchsize=2)
  with first element:
  2-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}

julia> (x,y) = first(dl)
2-element Vector{Tuple{Vector{Int64}, Vector{Int64}}}:
 ([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14], [47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43])
 ([47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43], [56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 65, 14, 43, 44])

The data loader returns a vector of length 2, each element of the vector a tuple of vectors.
Ideally for deep learning, we’d like a matrix of size (16,2) though. Remember: the first dimension is the sequence
length, dimension 2 is the batch dimension. The DataLoader can do this by setting collate=true:

julia> dl = DataLoader(ds, batchsize=2, collate=true)
557688-element DataLoader(::NanoDataset, batchsize=2, collate=Val{true}())
  with first element:
  (16×2 Matrix{Int64}, 16×2 Matrix{Int64},)

julia> (x,y) = first(dl)
([18 47; 47 56; … ; 65 14; 14 43], [47 56; 56 57; … ; 14 43; 43 44])

julia> x
16×2 Matrix{Int64}:
 18  47
 47  56
 56  57
 57  58
 58   1
  1  15
 15  47
 47  58
 58  47
 47  64
 64  43
 43  52
 52  10
 10  65
 65  14
 14  43

Now the DataLoader returns the desired matrix. In addition, DataLoader supports shuffling and multithreading,
just add shuffle=true, parallel=true to the parameter list.
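As an example, a shuffled, multithreaded loader for the NanoDataset could be set up like this (a sketch; parallel=true only helps when Julia is started with multiple threads):

julia> dl = DataLoader(ds, batchsize=64, collate=true, shuffle=true, parallel=true)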

Where to go from here

In this tutorial we looked at how to use DataLoaders in Julia through two examples. In the first example,
we had in-memory data that can be passed as a tuple into the DataLoader. That just worked.
In the second example, we had to implement the numobs, getobs interface to make it work. This is a
just a bit more work, but allows to work with arbitrary custom data. The getobs implementation is where
to hook in lazy-loading from disk, augmentations, and so on.

Now that we looked at some examples on data loading, here are some things to try next

  • Try building your own DataLoader and use it in a Lux or Flux training loop (see the sketch after this list).
  • Play around with image augmentation using Augmentor
  • Try using threads and measure the speed-up compared to single-threaded loading.
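To make the first suggestion concrete, here is a minimal sketch of a Flux training loop driven by the simple in-memory loader from the first example. The model, optimizer, and loss are illustrative choices, assuming a recent Flux version with the explicit setup/update! API:

using Flux, MLUtils

# Toy data and classifier for the 10×10 "images" from the first example
X_train = randn(Float32, 10, 10, 1_000)
Y_train = rand(1:10, 1_000)
model = Chain(Flux.flatten, Dense(100 => 32, relu), Dense(32 => 10))
opt_state = Flux.setup(Adam(1f-3), model)

loader = DataLoader((X_train, Y_train), batchsize=32, shuffle=true)

for epoch in 1:5
    for (x, y) in loader
        grads = Flux.gradient(model) do m
            Flux.logitcrossentropy(m(x), Flux.onehotbatch(y, 1:10))
        end
        Flux.update!(opt_state, model, grads[1])
    end
end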

Backpropagation through Poisson solvers

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/zygote/2022/02/19/backprop-poisson.html

Continuing from the previous example this tutorial shows how to use automatic
differentiation with physics-inspired calculations. Let’s say we have
a profile that can be parameterized by a free parameter. From this profile,
we calculate another expression. And then we want to find the maximum
of this expression as a function of the free parameter.

To be more precise, let’s say we have a quantity \(\rho\) with a profile
that has a peak whose amplitude is parameterized by the parameter \(t\):

\[\begin{align}
\rho_t(x) = -\sin^2(t) \exp \left( -\frac{\left(x-x_0\right)^2}{2\sigma^2} \right)
\times \frac{\sigma^2 - (x - x_0)^2}{\sigma^4}
\end{align}\]

The plot below shows how the peak of the amplitude varies with \(t\):
Profile variation with t

We see that for \(t = 0.2\ldots2.0\) the peak decreases in amplitude and increases
as \(t\) increases from 2.0 to 4.0.

To complicate things a bit, we would like to know how a quantity that is derived from this
profile depends on \(t\). In particular we want to

  1. Solve the Poisson equation \(\rho_t = \frac{\partial^2 \phi_t}{\partial x^2}\) for \(\phi_t\)
  2. Then maximize \(\int\limits_{0}^{L_x} \phi_t(x)^2 \, \mathrm{d}x\)

The ρ profile used here is after all a manufactured solution to the Poisson equation for

\[\begin{align}
\phi_t(x) = \sin^2(t) \exp \left( -\frac{\left(x-x_0\right)^2}{2\sigma^2} \right).
\end{align}\]

Then we only have to ask Wolfram Alpha to
give us

\[\begin{align}
I(t) = \int \limits_{0}^{L_x} \phi_t(x)^2 \, \mathrm{d}x = \frac{\sqrt{\pi}\, \sigma \sin(t)^4}{2}
\left[ \mathrm{erf}\left( \frac{L_x-x_0}{\sigma} \right) +
\mathrm{erf}\left( \frac{x_0}{\sigma}\right)\right].
\end{align}\]

But in this tutorial the focus is on using automatic differentiation for such tasks.
In particular, we can use the approach presented here when we lack an analytic
solution for ϕ. We can also use the approach presented here when we lack an
analytic expression for ρ, as long as we can generate ρ given a value for t.

So let’s draft a game-plan. First we choose a discretization of the domain [0:Lx] using Nx
grid points, equidistantly spaced with \(\triangle x\). Then we pick a value of
t and

  1. Calculate \(\rho_t(x)\).
  2. Invert the Poisson equation to find \(\phi_t(x)\).
  3. Calculate the integral \(I(t) = \int \phi_t(x)^2 \mathrm{d}x\).

Starting it off, we import relevant libraries, in particular Zygote, and define parameters
for the grid and the profile:

using Zygote
using FFTW: fft, ifft
using LinearAlgebra
using Statistics: mean
using BenchmarkTools

Lx = 2π
Nx = 128
Δx = Lx / Nx
n0 = 0.0
μ = π
σ = 0.25
t = 1.0

Since we will later ask for \(\partial_t I(t)\), we write a function that calculates \(I(t)\):

function int_prof(t)
    # This calculates I(t). ρ_prof and invert_poisson (a stand-in for either
    # Poisson solver defined below) are introduced in the next sections.
    xrg = Δx * (0:1:(Nx-1))
    ρ = ρ_prof(xrg, 1.0, t, μ, σ)
    ϕ = invert_poisson(ρ, Nx, Lx)
    ϕ = ϕ .- mean(ϕ)
    sum(ϕ.^2)
end

That’s it. Now we can call Zygote’s gradient function on int_prof to calculate
\(\partial_t I(t)\). There are some more details to explore: in the next section
we digress into Poisson solvers. Feel free to skip this section and head straight to
the results to see how well AD performs on this relatively simple forward model.
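As a teaser, once ρ_prof and a Poisson solver are defined (see below), the whole forward model can be differentiated with a single call; a minimal sketch:

dIdt = gradient(int_prof, 1.0)[1]   # ∂I/∂t evaluated at t = 1.0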

Poisson Solvers

Given \(\rho_t\), we want to solve the Poisson equation

\[\begin{align}
\rho_t(x) & = \frac{\partial^2 \phi_t(x)}{\partial x^2} \\
\phi_t(0) & = \phi_t(L)
\end{align}\]

for \(\phi_t\) on the domain \(x \in [0:L]\). We don’t even pretend that we
have any class and use periodic boundary conditions. We want the solution on
the grid points \(x_i = i \triangle x\), where \(\triangle x = L_x / N_x\)
and \(N_x\) denotes the number of grid points.

For this task we can either follow this
excellent tutorial
or use a spectral solver, described for example
here.

A simple spectral solver can be implemented like this:

# Simple spectral solver for the Poisson equation with periodic boundary conditions
function invert_laplace_spectral(y::Array{Float64}, Nz, Lz)
    k = [1e100; 1:(Nz ÷ 2); -(Nz ÷ 2 - 1):-1] .* 2π / Lz   # wave numbers, zero mode replaced by a huge value
    y_ft = -fft(y) ./ k ./ k                # transform and divide by k²
    y_ft = y_ft .* [0; ones(Nz - 1)]        # explicitly zero out the zero mode without mutating
    real(ifft(y_ft))                        # return the real part of the inverse transform
end

The first line assembles a vector of wave numbers. The solution to Poisson’s equation allows an arbitrary offset.
Setting the element of the wave number vector that corresponds to the zero frequency to a large number,
k[1] = 1e100, fixes the mean of the solution to zero, since we later divide by k.
We then take the Fourier transform, divide by \(k^{2}\) and explicitly
zero out the zero mode. The syntax is a bit cumbersome: we perform a point-wise
multiplication with an array, since Zygote does not support array mutation.
Finally we take the real part of the inverse transform.

A particular issue with Zygote is that it can’t handle code that mutates arrays.
The following code snippet shows a way of assembling an array that works and a way that will fail:

[0; ones(Nz - 1)]   # This will work with Zygote

tt = ones(Nz)       
tt[1] = 0           # This will not work with Zygote

Alternatively, we can implement a finite difference solver for Poisson’s equation.
Below is an implementation for periodic boundary conditions, adapted from
here.

function invert_laplace(y,  Nz, Lz)
    Δz = Lz / Nz
    invΔz² = 1.0 / Δz / Δz

    A0 = Zygote.Buffer(zeros(Nz, Nz))
    for n ∈ 1:Nz
        for m ∈ 1:Nz
            A0[n,m] = 0.0
        end
    end
    for n ∈ 2:Nz
        A0[n, n] = -2.0 * invΔz²
        A0[n, n-1] = invΔz²
        A0[n-1, n] = invΔz²
    end
    A0[1, 1] = 1.0
    A0[1, 2] = 0.0
    A0[Nz,1] = invΔz²

    A = copy(A0)
    yvec = vcat([0.0], y[2:Nz])
    ϕ_num = A \ yvec
end

After calculating the sample spacing, the code constructs a matrix for the
finite difference Laplace operator. Here we use
Zygote.Buffer
which allows us to use array mutation syntax. Again, code that is to be
differentiated with Zygote can’t mutate arrays. After assembling the matrix
we copy it into an ordinary matrix, augment the input vector and solve the linear system.

Differentiation through Poisson solvers

The code below shows how I set up the problem. I choose \(\phi\) and
\(\rho\) as a manufactured solution. This way I can compare the
numerical solution for \(\phi_t\) to ϕ_prof. A Gaussian solution is
also a craving test case for a spectral solver. Since the Fourier
transformation of a Gaussian is a Gaussian, all Fourier modes will be
non-negative. This way one can pick up errors where a mode is at a wrong
place in an array.

The functions int_prof_fd(t) and int_prof_sp(t) implement the three-step
workflow described above. In particular, their only argument is t.
This way we can pass the whole workflow to Zygote.gradient and
obtain the derivative of I with respect to t, \(\partial_t I\).

Now let’s set up the code:


ϕ_prof(xrg, n0, t, μ, σ) = n0 .+ sin(t).^2 .* exp.(-(xrg .- μ).^2 ./ 2.0 ./ σ ./ σ)
ρ_prof(xrg, n0, t, μ, σ) = -sin(t)^2 * exp.(-(xrg .- μ).^2 / 2 / σ / σ) .* (σ^2 .- (xrg .- μ).^2) / (σ^4)

Lx = 2π
Nx = 128
Δx = Lx / Nx
n0 = 0.0
μ = π
σ = 0.25
t = 1.0

function int_prof_fd(t)
    xrg = Δx * (0:1:(Nx-1))
    ρ = ρ_prof(xrg, 1.0, t, μ, σ)
    ϕ = invert_laplace(ρ, Nx, Lx)
    ϕ = ϕ .- mean(ϕ)
    sum(ϕ.^2)
end

function int_prof_sp(t)
    xrg = Δx * (0:1:(Nx-1))
    ρ = ρ_prof(xrg, 1.0, t, μ, σ)
    ϕ = invert_laplace_spectral(ρ, Nx, Lx)
    sum(ϕ.^2)
end

trg = 0.0:0.01:2π
phi2_fd = zeros(length(trg))
phi2_sp = zeros(length(trg))

phi2_grad_fd = zeros(length(trg))
phi2_grad_sp = zeros(length(trg))

for idx ∈ 1:length(trg)
    t = trg[idx]
    phi2_fd[idx] = int_prof_fd(t)
    phi2_grad_fd[idx] = gradient(int_prof_fd, t)[1]

    phi2_sp[idx] = int_prof_sp(t)
    phi2_grad_sp[idx] = real(gradient(int_prof_sp, t)[1])
end

As a sanity check we first compare the output of the forward model when
using the finite difference and spectral solver.

Comparison of forward model using finite difference and spectral solvers
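The agreement can also be checked numerically rather than by eye; a small sketch using the vectors computed above:

# Maximum disagreement between the two forward models over the scanned t values
maximum(abs.(phi2_fd .- phi2_sp))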

For the resolution I chose, \(\triangle x \approx 0.05\) in this
example, both solvers perform well.

Comparison of dI/dt obtained through AD and finite differences
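Such a finite-difference reference for the gradient itself is easy to generate. A hedged sketch of a central-difference check at a single point:

# Central finite difference in t as an independent check of the AD gradient
h = 1e-5
t0 = 1.0
fd_estimate = (int_prof_sp(t0 + h) - int_prof_sp(t0 - h)) / (2h)
ad_value = gradient(int_prof_sp, t0)[1]
abs(fd_estimate - ad_value)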

A final thing to look at is speed. Using
BenchmarkTools
we can evaluate the performance of backpropagating through either the finite difference or
the spectral solver:

using BenchmarkTools
julia> @benchmark gradient(int_prof_sp, t)[1]
BenchmarkTools.Trial: 
  memory estimate:  141.64 KiB
  allocs estimate:  752
  --------------
  minimum time:     110.582 μs (0.00% GC)
  median time:      118.785 μs (0.00% GC)
  mean time:        139.315 μs (9.20% GC)
  maximum time:     12.583 ms (68.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gradient(int_prof_fd, t)[1]
BenchmarkTools.Trial: 
  memory estimate:  6.07 MiB
  allocs estimate:  170758
  --------------
  minimum time:     28.093 ms (0.00% GC)
  median time:      32.338 ms (0.00% GC)
  mean time:        35.562 ms (3.07% GC)
  maximum time:     104.050 ms (0.00% GC)
  --------------
  samples:          141
  evals/sample:     1

The mean time it takes to backprop through the finite difference solver is about 250 times
larger than for the spectral solver. Given that the spectral solver is cheaper to evaluate
(it doesn’t require solving a linear system), this is not completely unexpected.

Backpropagation through numerical calculations

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/zygote/2022/02/16/backprop-sim.html

Differentiable programming makes all code amenable to gradient based
optimization. This harmless statement has broad implications. Think for
example about a physical simulation where a differential equation is
integrated in time. Given some initial conditions, boundary conditions,
and parameters, the simulator approximates how the system evolves in time.
Now let’s say you simulate a wave in a swimming pool that moves towards
a wall. Your goal is to find out how fast the wave has to be launched
so that it spills over the edge it is travelling towards. With differentiable programming,
you could define a target metric and optimize it with respect to the
initial condition, here the speed with which the wave is launched.

To get started with differentiable programming we are looking at a very
simple example. We are given a density profile n on a finite domain:

\[\begin{align}
n(z) = n_0 + \sin^2(t) \exp\left( -\frac{(z-z_0)^2}{2\sigma^2}\right).
\end{align}\]

The background density is \(n_0\), on which a Gaussian density peak,
centered around \(z_0\), is modulated. The amplitude of the peak is given by
\(\sin^2(t)\). And we are interested in how the integral of the profile over
the domain varies with t. Of course this can be calculated analytically:

\[\begin{align}
\int \limits_{-L_z}^{L_z} n(z')\, \mathrm{d}z' = 2 L_z n_0 + \sin^2(t) \sqrt{2\pi}\, \sigma\, \mathrm{erf} \left( \frac{L_z}{\sqrt{2}\sigma} \right)
\end{align}\]

To study the dependence of this integral on the amplitude parameter t we
evaluate the derivative:

\[\begin{align}
\frac{\partial N}{\partial t} = 2 \sqrt{2\pi}\, \sigma\, \mathrm{erf}\left( \frac{L_z}{\sqrt{2}\sigma} \right) \sin(t) \cos(t).
\end{align}\]

To find the amplitude t that maximizes N we can now just set \(\partial N/\partial t = 0\) and solve for t. But in a situation where we are not so
lucky and lack an analytic expression, we may want to use automatic
differentiation. Let’s do this in Julia.

Automatic differentiation in Julia

First things first – import Zygote to do automatic differentiation, the
Plots package, as well as the error function erf:

using Zygote
using Plots
using SpecialFunctions: erf

Now we define a function that gives our density profile:

gen_prof(zrg, n0, t, z0, σ) = n0 .+ sin(t)^2 .* exp.(-(zrg .- z0).^2 ./ 2.0 ./ σ ./ σ)

For automatic differentiation to do its magic we need to numerically integrate
the profile.

function int_prof(t)
    my_prof = gen_prof(z0:Δz:z1, n0, t, μ, σ)
    my_sum = sum(my_prof) * Δz
end

This function generates a vector which holds values of the profile at points
from \(z_0\) to \(z_1\), spaced regularly with \(\triangle z\). The
parameters \(\mu\) and \(\sigma\) are captured from the surrounding scope.
After generating the profile vector we numerically integrate the
profile using the rectangle rule.
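In formula form, the function approximates

\[\begin{align}
N(t) = \int\limits_{z_0}^{z_1} n(z)\, \mathrm{d}z \approx \sum_{i} n(z_i)\, \triangle z.
\end{align}\]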

Now we have all the ingredients to calculate \(\partial N/\partial t\) using
automatic differentiation. In code, we first define the necessary parameters
and allocate vectors for the t’s where we want the derivative:

n0 = 1.0
z0 = -1.0
z1 = 1.0
Δz = 0.01
μ = z0 + (z1 - z0) / 2.0
σ = (z1 - z0) / 10.0

# Perform numerical integration of the profile for a range of t's:
t_vals = 0.0:0.05:6.5
sum_vals = similar(t_vals)
sum_vals_grad = similar(t_vals)

To find the gradient, we first evaluate int_prof for a given value of t.
Then we call gradient(int_prof, t). The return value is just the
gradient \(\partial N / \partial t\). This functionality is just Zygote’s
gradient doing its
work, but on a more complicated example than in the documentation. Note that
gradient returns a tuple where the actual gradient is the first element.
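A minimal illustration of this return convention:

julia> gradient(x -> 3x^2, 2.0)
(12.0,)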

for tidx ∈ 1:length(t_vals)
    t = t_vals[tidx]
    sum_vals[tidx] = int_prof(t)
    sum_vals_grad[tidx] = gradient(int_prof, t)[1]
end

The gradient calculated using automatic differentiation can be compared to
the analytic formula:

dNdt_anl = 2. * sqrt(2π) * σ .* sin.(t_vals) .* cos.(t_vals) .* erf(1 / sqrt(2) / σ)

p = plot(t_vals, sum_vals_grad, label="∂ ∫n(z,t) dz / ∂t (autodiff).")
plot!(p, t_vals, dNdt_anl, label="∂ ∫n(z,t) dz / ∂t(analytical).", xlabel="t")

Comparing analytical derivative with derivative found by automatic
differentiation

As shown in the plot, the analytical derivative and the one calculated by
automatic differentiation are almost identical.


And plotting the error, we find differences of the order of \(10^{-8}\) to
\(10^{-9}\).
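The same error can be quantified directly; a small sketch using the vectors computed above:

# Maximum absolute difference between the AD and the analytic derivative
maximum(abs.(sum_vals_grad .- dNdt_anl))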

Summary

In this blog post we looked at a very simple use-case for automatic differentiation.
We have a mathematical expression that we know depends on a parameter, and we wish
to find the optimal value of this expression. Using automatic differentiation we
can calculate the gradient of that expression with respect to the parameter.
This approach allows one to optimize the expression with respect to the
parameter.

A common procedure for this optimization is gradient descent, as used in deep learning.
But instead of optimizing a neural network, automatic differentiation can be applied
to arbitrary code. So we can optimize a much broader class of objectives than just
neural networks.