
Backpropagation through Poisson solvers

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/zygote/2022/02/19/backprop-poisson.html

Continuing from the previous example, this tutorial shows how to use automatic
differentiation with physics-inspired calculations. Let’s say we have
a profile that is parameterized by a free parameter. From this profile,
we calculate another expression. And then we want to find the maximum
of this expression as a function of the free parameter.

To be more precise, let’s say we have a quantity \(\rho\) with a profile
whose peak amplitude is parameterized by the parameter \(t\):

\[\begin{align}
\rho_t(x) = -\sin^2(t) \exp \left( -\frac{\left(x-x_0\right)^2}{2\sigma^2} \right)
\times \frac{\sigma^2 - (x - x_0)^2}{\sigma^4}
\end{align}\]

The plot below shows how the peak of the amplitude varies with \(t\):
Profile variation with t

We see that the peak decreases in amplitude for \(t = 0.2\ldots2.0\) and increases
again as \(t\) increases from 2.0 to 4.0.
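A plot like this can be produced with the ρ_prof helper and the grid parameters (Δx, Nx, μ, σ) that are defined further below in this post; a minimal sketch, not the original plotting code:

using Plots

# Sketch: ρ_t(x) for a few values of t, using ρ_prof and the grid
# parameters defined later in this post.
xrg = Δx .* (0:Nx-1)
p = plot(xlabel = "x", ylabel = "ρ_t(x)")
for tval ∈ 0.2:0.6:4.0
    plot!(p, xrg, ρ_prof(xrg, 1.0, tval, μ, σ), label = "t = $(round(tval, digits=1))")
end
p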

To complicate things a bit, we would like to know how a quantity derived from this
profile depends on \(t\). In particular we want to

  1. Solve \(\rho_t = \frac{\partial^2 \phi_t}{\partial x^2}\) for \(\phi_t\)
  2. Then maximize \(\int\limits_{0}^{L_x} \phi_t(x)^2 \, \mathrm{d}x\) with respect to \(t\)

The ρ profile used here is after all a manufactured solution to the Poisson equation for

\[\begin{align}
\phi_t(x) = \sin^2(t) \exp \left( -\frac{\left(x-x_0\right)^2}{2\sigma^2} \right).
\end{align}\]

Then we only have to ask Wolfram Alpha to
give us

\[\begin{align}
I(t) = \int \limits_{0}^{L_x} \phi_t(x)^2 \, \mathrm{d}x = \frac{\sqrt{\pi}\, \sigma \sin(t)^4}{2}
\left[ \mathrm{erf}\left( \frac{L_x-x_0}{\sigma} \right) +
\mathrm{erf}\left( \frac{x_0}{\sigma}\right)\right].
\end{align}\]
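As a quick cross-check of this closed form, here is a small sketch (not part of the original post; x0 corresponds to the peak position μ in the code below) that compares the expression against a crude Riemann sum:

using SpecialFunctions: erf

# Closed-form I(t) for the Gaussian ϕ_t above (sketch)
I_analytic(t; Lx = 2π, x0 = π, σ = 0.25) =
    0.5 * sqrt(π) * σ * sin(t)^4 * (erf((Lx - x0) / σ) + erf(x0 / σ))

# Crude Riemann-sum cross-check on a fine grid
let Nx = 1024, Lx = 2π, x0 = π, σ = 0.25, t = 1.0
    Δx = Lx / Nx
    x = Δx .* (0:Nx-1)
    ϕ = sin(t)^2 .* exp.(-(x .- x0).^2 ./ (2σ^2))
    sum(ϕ.^2) * Δx, I_analytic(t)   # the two values should agree closely
end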

But in this tutorial the focus is on using automatic differentiation for such tasks.
In particular, the approach presented here also works when we lack an analytic
solution for ϕ, or even an analytic expression for ρ, as long as we can
generate ρ for a given value of t.

So let’s draft a game plan. First we choose a discretization of the domain [0:Lx] using Nx
grid points, equidistantly spaced with \(\triangle x\). Then we pick a value of
t and

  1. Calculate \(\rho_t(x)\).
  2. Invert the Poisson equation to find \(\phi_t(x)\).
  3. Calculate the integral \(I(t) = \int \phi_t(x)^2 \mathrm{d}x\).

Starting it off, we import relevant libraries, in particular Zygote, and define parameters
for the grid and the profile:

using Zygote
using FFTW: fft, ifft
using LinearAlgebra
using Statistics: mean
using BenchmarkTools

Lx = 2π
Nx = 128
Δx = Lx / Nx
n0 = 0.0
μ = π
σ = 0.25
t = 1.0

Since we later ask for \(\partial_t I(t)\), we write a function that calculates \(I(t)\):

function int_prof(t)
    # This calculates I(t); invert_poisson stands for either of the two
    # Poisson solvers implemented below.
    xrg = Δx * (0:1:(Nx-1))
    ρ = ρ_prof(xrg, 1.0, t, μ, σ)
    ϕ = invert_poisson(ρ, Nx, Lx)
    ϕ = ϕ .- mean(ϕ)
    sum(ϕ.^2)
end

That’s it. Now we can call Zygote’s gradient function on int_prof(t) to calculate
\(\partial_t I(t)\). There are some more details to explore, though. In the next section
we digress into Poisson solvers. Feel free to skip this section and head straight to
the results to see how well AD performs on this relatively simple forward model.
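For example, once one of the solvers defined below is plugged in for invert_poisson, evaluating the derivative at a single point is a one-liner (a minimal sketch):

# ∂I/∂t at t = 1.0; gradient returns a tuple, the first element is the derivative
dIdt = gradient(int_prof, 1.0)[1]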

Poisson Solvers

Given \(\rho_t\), we want to solve the Poisson equation

\[\begin{align}
\rho_t(x) & = \frac{\partial^2 \phi_t(x)}{\partial x^2} \\
\phi_t(0) & = \phi_t(L)
\end{align}\]

for \(\phi_t\) on the domain \(x \in [0:L]\). We don’t even pretend that we
have any class and use periodic boundary conditions. We want the solution on
the grid points \(x_i = i \triangle x\), where \(\triangle x = L_x / N_x\)
and \(N_x\) denotes the number of grid points.

For this task we can either follow this
excellent tutorial
or use a spectral solver, described for example
here.

A simple spectral solver can be implemented like this:

# Simple spectral solver for the Poisson equation with periodic boundaries
function invert_laplace_spectral(y::Array{Float64}, Nz, Lz)
    # Wave numbers; the zero frequency is set to a huge value so that the
    # corresponding mode vanishes after the division below.
    k = [1e100; 1:(Nz ÷ 2); -(Nz ÷ 2 - 1):-1] .* 2π/Lz
    y_ft = -fft(y) ./ k ./ k
    # Zero out the zero mode by point-wise multiplication (no array mutation)
    y_ft = y_ft .* [0; ones(Nz - 1)]
    d2y = real(ifft(y_ft))
end

The first line assembles a vector of wave numbers. The solution of Poisson’s
equation allows an arbitrary offset. By setting the element of the
wave number vector that corresponds to the zero frequency to a large number,
k[1] = 1e100, we fix the mean of the solution to zero, since we later divide by k.
We then take the Fourier transform, divide by \(k^{2}\), and explicitly
zero out the zero mode. The syntax is a bit cumbersome: we perform a point-wise
multiplication with an array, since Zygote does not support array mutation.
Finally we take the real part of the inverse transform.
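As a quick sanity check of the solver itself (a sketch, not part of the original post), a single Fourier mode can be inverted exactly:

# ∂²/∂x² sin(2x) = -4 sin(2x), so inverting the right-hand side -4 sin(2x)
# should recover sin(2x) on the periodic grid.
x = Δx .* (0:Nx-1)
ϕ_test = invert_laplace_spectral(-4.0 .* sin.(2.0 .* x), Nx, Lx)
maximum(abs.(ϕ_test .- sin.(2.0 .* x)))   # expected to be at machine precision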

A particular issue with Zygote is that it can’t handle code that mutates arrays.
The following code snippet shows a way of assembling an array that works and a way that will fail:

[0; ones(Nz - 1)]   # This will work with Zygote

tt = ones(Nz)       
tt[1] = 0           # This will not work with Zygote

Alternatively, we can implement a finite difference solver for Poisson’s equation.
Below is the implementation for periodic boundary conditions adapted from
here:

function invert_laplace(y, Nz, Lz)
    Δz = Lz / Nz
    invΔz² = 1.0 / Δz / Δz

    A0 = Zygote.Buffer(zeros(Nz, Nz))
    for n ∈ 1:Nz
        for m ∈ 1:Nz
            A0[n, m] = 0.0
        end
    end
    for n ∈ 2:Nz
        A0[n, n] = -2.0 * invΔz²
        A0[n, n-1] = invΔz²
        A0[n-1, n] = invΔz²
    end
    A0[1, 1] = 1.0
    A0[1, 2] = 0.0
    A0[Nz, 1] = invΔz²

    A = copy(A0)
    yvec = vcat([0.0], y[2:Nz])
    ϕ_num = A \ yvec
end

After calculating the sample spacing, the code constructs a matrix for the
finite difference Laplace operator. Here we use
Zygote.Buffer
which allows us to use array mutation syntax. Again, code that is to be
differentiated with Zygote can’t mutate arrays. After assembling the matrix
we copy it into a regular array, augment the input vector and solve the linear system.
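To make the Buffer pattern more concrete, here is a minimal sketch (not from the original post): write into the buffer inside the differentiated function, then copy it back to an ordinary array.

# Minimal Zygote.Buffer pattern (sketch): mutate the buffer, then copy() it
# so the rest of the differentiated code works on an ordinary array.
function buffer_demo(a)
    buf = Zygote.Buffer(zeros(3))
    for i ∈ 1:3
        buf[i] = a * i      # mutation is allowed on the Buffer
    end
    sum(copy(buf))          # buffer_demo(a) = a + 2a + 3a = 6a
end

gradient(buffer_demo, 2.0)  # expected: (6.0,)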

Differentiation through Poisson solvers

The code below shows how I set up the problem. I choose \(\phi\) and
\(\rho\) as a manufactured solution. This way I can compare the
numerical solution for \(\phi_t\) to ϕ_prof. A Gaussian is
also a convenient test case for a spectral solver. Since the Fourier
transform of a Gaussian is a Gaussian, all Fourier modes will be
non-negative. This way one can pick up errors where a mode ends up at the
wrong place in an array.

The functions int_prof_fd(t) and int_prof_sp(t) implement the three-step
workflow described above. In particular, their only argument is t.
This way we can pass the whole workflow to Zygote.gradient and
obtain the derivative of I with respect to t, \(\partial_t I\).

Now let’s set up the code:


ϕ_prof(xrg, n0, t, μ, σ) = n0 .+ sin(t).^2 .* exp.(-(xrg .- μ).^2 ./ 2.0 ./ σ ./ σ)
ρ_prof(xrg, n0, t, μ, σ) = -sin(t)^2 * exp.(-(xrg .- μ).^2 / 2 / σ / σ) .* (σ^2 .- (xrg .- μ).^2) / (σ^4)

Lx = 2π
Nx = 128
Δx = Lx / Nx
n0 = 0.0
μ = π
σ = 0.25
t = 1.0

function int_prof_fd(t)
    xrg = Δx * (0:1:(Nx-1))
    ρ = ρ_prof(xrg, 1.0, t, μ, σ)
    ϕ = invert_laplace(ρ, Nx, Lx)
    ϕ = ϕ .- mean(ϕ)
    sum(ϕ.^2)
end

function int_prof_sp(t)
    xrg = Δx * (0:1:(Nx-1))
    ρ = ρ_prof(xrg, 1.0, t, μ, σ)
    ϕ = invert_laplace_spectral(ρ, Nx, Lx)
    sum(ϕ.^2)
end

trg = 0.0:0.01:2π
phi2_fd = zeros(length(trg))
phi2_sp = zeros(length(trg))

phi2_grad_fd = zeros(length(trg))
phi2_grad_sp = zeros(length(trg))

for idx ∈ 1:length(trg)
    t = trg[idx]
    phi2_fd[idx] = int_prof_fd(t)
    phi2_grad_fd[idx] = gradient(int_prof_fd, t)[1]

    phi2_sp[idx] = int_prof_sp(t)
    phi2_grad_sp[idx] = real(gradient(int_prof_sp, t)[1])
end

As a sanity check we first compare the output of the forward model when
using the finite difference and spectral solver.

Comparison of forward model using finite difference and spectral solvers

For the resolution chosen here, \(\triangle x\) is approximately 0.05 in this
example, and both solvers perform well.

Comparison of dI/dt obtained through AD and finite differences
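The finite-difference curve in this comparison can be obtained independently of Zygote, for example with a simple central difference in t (a sketch, not the original code):

# Central finite difference of I(t) in t as an independent check of the AD gradient
ε = 1e-5
dIdt_fd_in_t = [(int_prof_sp(t + ε) - int_prof_sp(t - ε)) / (2ε) for t ∈ trg]
maximum(abs.(dIdt_fd_in_t .- phi2_grad_sp))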

A final thing to look at is speed. Using
BenchmarkTools
we can evaluate the performance of backpropagating through either the finite difference or
the spectral solver:

using BenchmarkTools
julia> @benchmark gradient(int_prof_sp, t)[1]
BenchmarkTools.Trial: 
  memory estimate:  141.64 KiB
  allocs estimate:  752
  --------------
  minimum time:     110.582 μs (0.00% GC)
  median time:      118.785 μs (0.00% GC)
  mean time:        139.315 μs (9.20% GC)
  maximum time:     12.583 ms (68.76% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark gradient(int_prof_fd, t)[1]
BenchmarkTools.Trial: 
  memory estimate:  6.07 MiB
  allocs estimate:  170758
  --------------
  minimum time:     28.093 ms (0.00% GC)
  median time:      32.338 ms (0.00% GC)
  mean time:        35.562 ms (3.07% GC)
  maximum time:     104.050 ms (0.00% GC)
  --------------
  samples:          141
  evals/sample:     1

The mean time it takes to backprop through the finite difference solver is about 250 times
larger than for the spectral solver. Given that the spectral solver is cheaper to evaluate,
since it does not require solving a linear system, this is not completely unexpected.
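To see how much of that gap comes from the primal solve itself, one can also benchmark the forward evaluations alone (a sketch; these numbers are not from the original post):

# Benchmark just the forward evaluations, without backpropagation
@benchmark int_prof_sp($t)
@benchmark int_prof_fd($t)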

Backpropagation through numerical calculations

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/zygote/2022/02/16/backprop-sim.html

Differentiable programming makes all code amenable to gradient based
optimization. This harmless statement has broad implications. Think for
example about a physical simulation where a differential equation is
integrated in time. Given some initial conditions, boundary conditions,
and parameters, the simulator approximates how the system evolves in time.
Now let’s say you simulate a wave in a swimming pool that moves towards
a wall. Your goal is to find out how fast the wave has to be launched so
that it spills over the edge it is travelling towards. With differentiable programming,
you could define a target metric and optimize it with respect to the
initial condition, here the speed with which the wave is launched.

To get started with differentiable programming we are looking at a very
simple example. We are given a density profile n on a finite domain:

\[\begin{align}
n(z) = n_0 + \sin^2(t) \exp\left( -\frac{(z-z_0)^2}{2\sigma^2}\right).
\end{align}\]

The background density is \(n_0\), on which a Gaussian density peak
centered around \(z_0\) is superimposed. The amplitude of the peak is given by
\(\sin^2(t)\). And we are interested in how the integral of the profile over
the domain varies with t. Of course this can be calculated analytically:

\[\begin{align}
N(t) = \int \limits_{-L_z}^{L_z} n(z')\, \mathrm{d}z' = 2 L_z n_0 + \sin(t)^2 \sqrt{2\pi}\, \sigma\, \mathrm{erf} \left( \frac{L_z}{\sqrt{2}\sigma} \right)
\end{align}\]
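As a quick numeric cross-check of this expression, here is a small sketch (not part of the original post; it assumes \(L_z = 1\), \(z_0 = 0\) and \(\sigma = 0.2\), matching the parameters used below):

using SpecialFunctions: erf

# Crude rectangle-rule integration vs. the closed form (sketch)
let n0 = 1.0, σ = 0.2, t = 1.0, Δz = 1e-4
    zrg = -1.0:Δz:1.0
    num = sum(n0 .+ sin(t)^2 .* exp.(-zrg.^2 ./ (2σ^2))) * Δz
    anl = 2 * n0 + sin(t)^2 * sqrt(2π) * σ * erf(1 / (sqrt(2) * σ))
    num, anl    # the two values agree to a few decimal places
end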

To study the dependence of this integral on the amplitude parameter t we
evaluate the derivative:

\[\begin{align}
\frac{\partial N}{\partial t} = 2 \sqrt{2\pi}\, \sigma\, \mathrm{erf}\left( \frac{L_z}{\sqrt{2}\sigma} \right) \sin(t) \cos(t).
\end{align}\]

To find the amplitude t that maximizes N we can now just set \(\partial N/\partial t = 0\) and solve for t. But in a situation where we are not so
lucky and do not have an analytic expression, we may want to use automatic
differentiation. Let’s do this in Julia.

Automatic differentiation in Julia

First things first: import Zygote to do automatic differentiation, the
Plots package, as well as the error function erf:

using Zygote
using Plots
using SpecialFunctions: erf

Now we define a function that gives our density profile:

gen_prof(zrg, n0, t, z0, σ) = n0 .+ sin(t).^2 .* exp.(-(zrg .- z0).^2 ./ 2.0 ./ σ ./ σ)

For automatic differentiation to do its magic we need to numerically integrate
the profile.

function int_prof(t)
    my_prof = gen_prof(z0:Δz:z1, n0, t, μ, σ)
    my_sum = sum(my_prof) * Δz
end

This function generates a vector which holds the values of the profile at points
from \(z_0\) to \(z_1\), spaced regularly with \(\triangle z\). The
parameters \(\mu\) and \(\sigma\) are captured from the surrounding scope where the
function is later called. After generating the profile vector we numerically integrate the
profile using the rectangle rule.

Now we have all the ingredients to calculate \(\partial N/\partial t\) using
automatic differentiation. In code, we first define the necessary parameters
and allocate vectors for the t’s where we want the derivative:

n0 = 1.0
z0 = -1.0
z1 = 1.0
Δz = 0.01
μ = z0 + (z1 - z0) / 2.0
σ = (z1 - z0) / 10.0

# Perform numerical integration of the profile for a range of t's:
t_vals = 0.0:0.05:6.5
sum_vals = similar(t_vals)
sum_vals_grad = similar(t_vals)

To find the gradient, we first evaluate int_prof for a given value of t.
Then we call gradient(int_prof, t). The return value is just the
gradient \(\partial N / \partial t\). This functionality is just Zygote’s
gradient doing its work, but on a more complicated example than in the documentation.
Note that gradient returns a tuple where the actual gradient is the first element.

for tidx ∈ 1:length(t_vals)
    t = t_vals[tidx]
    sum_vals[tidx] = int_prof(t)
    sum_vals_grad[tidx] = gradient(int_prof, t)[1]
end

The gradient calculated using automatic differentiation can be compared to
the analytic formula:

dNdt_anl = 2.0 * sqrt(2π) * σ .* sin.(t_vals) .* cos.(t_vals) .* erf(1 / sqrt(2) / σ)

p = plot(t_vals, sum_vals_grad, label="∂ ∫n(z,t) dz / ∂t (autodiff).")
plot!(p, t_vals, dNdt_anl, label="∂ ∫n(z,t) dz / ∂t(analytical).", xlabel="t")

Comparing analytical derivative with derivative found by automatic
differentiation

As shown in the plot, the analytical derivative and the one calculated by
automatic differentiation are almost identical.


And plotting the error, we find differences of the order of \(10^{-8}\) to
\(10^{-9}\).
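The error curve itself can be plotted directly from the vectors computed above (a sketch):

# Difference between the AD gradient and the analytic derivative
plot(t_vals, sum_vals_grad .- dNdt_anl, label="AD - analytic", xlabel="t")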

Summary

In this blog post we looked at a very simple use-case for automatic differentiation.
We have a mathematical expression that we know depends on a parameter and we wish
to find the optimum value of this expression. Using automatic differentiation we
can calculate the gradient of that expression with respect to the parameter.
This allows one to optimize the expression with respect to the parameter.

A common procedure for this optimization is gradient descent, as used in deep learning.
But instead of optimizing a neural network, automatic differentiation can be applied
to arbitrary code. So we can optimize a much broader class of programs than just
neural networks.