
Training GANs in Julia’s Flux

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/gan/2021/11/08/training-gans.html

In order to effectively run machine learning experiments we need a fast
turn-around time for model training. So simply implementing the model is not
the only thing we need to worry about. We also want to be able to change the
hyperparameters in a convenient way. This could either be through a configuration
file or through command line arguments. This post demonstrates how I train
a vanilla GAN on the
MNIST dataset. It is not about GAN theory, for this the original paper by
Goodfellow et al. [[1]] is a good starting point. Instead I focus on how to
structure the code and subtle implementation issues I came across when writing
the code. You can find the current version of the code on github.

Project structure

My starting point is the vanilla GAN implementation on the
FluxML website. This
implementation works and the trained generator indeed generates images that
look indistinguishable from images belonging to the MNIST dataset.
But how do we arrive there? Why is the learning rate chosen as \(\eta = 2 \times 10^{-4}\)? Is leakyrelu the optimal activation function, or does it perform
on par with relu in some regime? To answer these questions we need code that
allows us to change these parameters quickly.

And while we are at it, let's bundle the code together with its dependencies in a
Julia package. This allows us to attach well-defined package dependencies to the code.
Taken together, the code and well-defined dependencies make the behaviour reproducible. The Julia documentation gives a comprehensive introduction to
packages here.

To run the code in the project, I first clone the repository from github,
enter it, and execute the runme script:

$ git clone https://github.com/rkube/mnist_gan.git
$ cd mnist_gan
$ julia --project=. src/runme.jl --optimizer=ADAM --train_k=8 ...

All packages installed in the project are local to this project and don’t interfere
with packages installed in the general environment. This allows us, for example, to
pin specific version numbers, and gives us reproducibility of our results.

Code structure

The code is structured as a standard Julia project. The root folder layout looks
like this

├── Manifest.toml
├── Project.toml
├── README.md
└── src
    ├── Manifest.toml
    ├── mnist_gan.jl
    ├── models.jl
    ├── Project.toml
    ├── runme.jl
    └── training.jl

The root folder contains Manifest.toml and Project.toml which include information
about dependencies, versions, package names. More information is given in the
Pkg.jl documentation.

The src folder contains all source code files. In particular, it contains a
mnist_gan.jl file. This file is named after the package and, in the simple case
here, only defines the package as a module and includes my two other source
files:

module mnist_gan

using NNlib
using Flux
using Zygote
using Base:Fix2

# All GAN models
include("models.jl")
# Functions used for training
include("training.jl")
end #module

As additional structure I put the models in models.jl and training functions in
training.jl.

Command line arguments

To quickly train the GAN with specific hyperparameters, one can either read the
hyperparameters from a configuration file or pass them on the command line.
Here I take the second approach. To comfortably parse command line arguments I’m
using ArgParse.jl (https://argparsejl.readthedocs.io/en/latest/argparse.html).

Condensed to a single argument, my code looks like this:

using ArgParse

s = ArgParseSettings()
@add_arg_table s begin
    "--lr_dscr"
        help = "Learning rate for the discriminator. Default = 0.0002"
        arg_type = Float64
        default = 0.0002
end

args = parse_args(s)

That’s it. Now I can access the individual command line arguments via args["lr_dscr"].

Logging

Keeping track of the model performance during training is crucial when performing
parameter scans. For the vanilla GAN alone I defined 10 parameters that can be
varied. Letting each parameter assume only two distinct values already allows for
1024 combinations. Julia’s logging facilities (https://github.com/JuliaLogging)
provide the means to systematically log model training for a large hyperparameter scan.

In particular, we can use TensorBoardLogger.jl. TensorBoard
provides a visualization of training and includes numerous useful features, such
as visualization of loss curves, display of model output images and more. To
use TensorBoardLogger.jl in my code I have to include the module and instantiate
a logger. Then I can easily log my experiments:

# Import the modules
using TensorBoardLogger
...
# Instantiate TensorBoardLogger
# Let's log the hyperparamters of the current run. 
dir_name = join(["$(k)_$(v)" for (k, v) in args])
tb_logger = TBLogger("logs/" * dir_name)
with_logger(tb_logger) do
    @info "hyperparameters" args
end

# Wrap the main training loop in a with clause to enable logging
lossvec_gen = zeros(Float32, args["num_iterations"])
lossvec_dscr = zeros(Float32, args["num_iterations"])

with_logger(tb_logger) do
    for n in 1:args["num_iterations"]
        # Do machine learning ...
        ...
        # Code to log PNG images to tensorboard, inside the main training loop
        if n % args["output_period"] == 0
            noise = randn(args["latent_dim"], 4) |> gpu;
            fake_img = reshape(generator(noise), 28, 4*28) |> cpu;
            # I need to clip pixel values to [0.0; 1.0]
            fake_img[fake_img .> 1.0] .= 1.0
            fake_img[fake_img .< -1.0] .= -1.0
            fake_img = (fake_img .+ 1.0) .* 0.5
            # 
            log_image(tb_logger, "generatedimage", fake_img, ImageFormat(202))
        end
        # Log the generator and discriminator loss
        @info "test" loss_generator=lossvec_gen[n] loss_discriminator=lossvec_dscr[n]
    end # for
end #  Logger

First, I’m generating a string from all keys and values defined in the command
line argument dictionary. Later this will allow me to filter runs by their arguments.
Then I’m logging the args dictionary, which contains the hyperparameters of
the current experiment. In the training loop I’m generating a fake image using
the generator and log it as well. Here I need to clip the pixel values to
[0.0; 1.0]: since the generator is trained on images with pixel values between
-1.0 and 1.0, I need to transform the pixel space. Note that the last argument
to the log_image call encodes the layout of the fake_img array. I had to look up
the available encodings in the REPL help mode, i.e. by typing ? followed by
ImageFormat.

Activation functions on-the-fly

To resolve the correct activation function from the command line arguments I’m
using the getfield method. To make it a little more convoluted, we also need to
distinguish between activation functions that take an additional, tunable parameter,
like celu, elu, leakyrelu and trelu, and activation functions that do not.
The following code block shows how to map a string that encodes the function name
to the actual function using getfield. To create a closure over the optional
parameter I’m using Fix2. The code below is from models.jl

function get_vanilla_discriminator(args)
    ...
    if args["activation"] in ["celu", "elu", "leakyrelu", "trelu"]
        # Now continue: We want to use Base.Fix2
        act = Fix2(getfield(NNlib, Symbol(args["activation"])), Float32(args["activation_alpha"]))
    else
        act = getfield(NNlib, Symbol(args["activation"]));
    end

    return Chain(Dense(28 * 28, 1024, act), 
        ...);
end
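The Fix2 mechanism can be checked in isolation. The sketch below (standalone, using only NNlib; the string and alpha value are made up for illustration) maps a string to an activation function and fixes its second argument:

```julia
using NNlib
using Base: Fix2

# Map the string name to the function object defined in NNlib
act_name = "leakyrelu"
alpha = 0.2f0
act = Fix2(getfield(NNlib, Symbol(act_name)), alpha)

# act is now a one-argument callable: act(x) == leakyrelu(x, 0.2f0)
act(-1.0f0)   # == -0.2f0
act(2.0f0)    # == 2.0f0
```

Since leakyrelu(x, a) = max(a*x, x), the fixed closure behaves like a one-argument activation that a Dense layer can consume directly.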

I found that how I pass the activation function as an argument to the dense layer
can have an impact on performance. By passing only the function, the
implementation of Dense decides how the activation function is applied to the linear
transformation. This is how it should be. If I manually prescribe how to apply
the broadcast, I find slower performance:

julia> d1 = Dense(100, 100, act)
Dense(100, 100, relu)  # 10_100 parameters

julia> @btime d1(randn(Float32, 100, 100));
  163.863 μs (6 allocations: 117.33 KiB)

julia> d2 = Dense(100, 100, x -> act(x))
Dense(100, 100, #5)  # 10_100 parameters

julia> @btime d2(randn(Float32, 100, 100));
  3.041 ms (20016 allocations: 430.30 KiB)

So manually prescribing how to perform the broadcast is about 20 times slower.
Instead, I let the code above return a function that Flux knows how to broadcast
efficiently.

Running a parameter scan

Now we are set up to run a parameter scan. For this I generate run scripts
in which I vary the command line arguments. The resulting scripts look like this

#SBATCH things

cd /location/of/the/repo
julia --project=. src/runme.jl --lr_dscr=0.0002 --lr_gen=0.0002 --batch_size=8 --num_iterations=10000 --latent_dim=100 --optimizer=ADAM --activation=leakyrelu --activation_alpha=0.2 --train_k=8 --prob_dropout=0.3 --output_period=250

Of course the arguments vary across the scripts. After crunching all the numbers,
the log directory is populated with the tensorboard log files. The next
blog post will discuss what the results look like and how to pick the best
hyperparameters.

References

[1]
I. Goodfellow et al. Generative Adversarial Networks

Backpropagating through QR decomposition

Re-posted from: https://rkube.github.io/jekyll/update/2021/07/06/backpropagating-through-qr.html

One of the most useful decompositions of Linear Algebra is the QR decomposition.
This decomposition is particularly important when we are interested in the
vector space spanned by the columns of a matrix A. Formally, we write the QR
decomposition of \(A \in \mathbb{R}^{m \times n}\), where m ≥ n, as

\[\begin{align}
A = QR =
\left[
\begin{array}{c|c|c|c}
& & & \\
q_1 & q_2 & \cdots & q_n \\
& & &
\end{array}
\right]
%
\left[
\begin{array}{cccc}
r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\
0 & r_{2,2} & \cdots & r_{2,n} \\
0 & 0 & \ddots & \vdots \\
0 & \cdots & & r_{n,n}
\end{array}
\right].
\end{align}\]

where Q is an m-by-n matrix with orthonormal columns, i.e.
\(q_{i} \cdot q_{j} = \delta_{i,j}\), and R is a square, upper triangular n-by-n matrix.
Using the decomposition, we can reconstruct the columns of A as

\[\begin{align}
a_1 & = r_{1,1} q_1 \\
a_2 & = r_{1,2} q_1 + r_{2,2} q_2 \\
& \cdots \\
a_n & = \sum_{j=1}^{n} r_{j,n} q_j.
\end{align}\]
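These relations are easy to check numerically. A quick sketch using only the standard library LinearAlgebra (the matrix size is arbitrary):

```julia
using LinearAlgebra

A = randn(5, 3)     # tall matrix, m >= n
F = qr(A)
Q = Matrix(F.Q)     # thin m-by-n Q
R = F.R

Q' * Q ≈ I          # columns of Q are orthonormal
A ≈ Q * R           # reconstruct A from the factors
A[:, 2] ≈ R[1, 2] * Q[:, 1] + R[2, 2] * Q[:, 2]   # second column, as above
```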

Internally, Julia represents QR-factorized matrices in a packed format [1].
This format does not store the matrices Q and R explicitly.
All matrix algebra and arithmetic is implemented through methods that expect
the packed format as input, and the QR decomposition itself calls the LAPACK
routine geqrt, which returns exactly this packed format.

A computational graph where one may want to backpropagate through a QR factorization
may look similar to this one:
Computational graph with QR factorization

While the incoming gradients, \(\bar{f} \partial f / \partial Q\) and
\(\bar{f} \partial f / \partial R\) depend on \(f\), the gradients for the QR
decomposition, \(\bar{Q} \partial Q / \partial A\) and \(\bar{R} \partial R / \partial A\), are defined through the QR factorization. While these can in principle be
computed through an automatic differentiation framework, it can be beneficial to
implement a pullback. For one, this saves compilation time before the first execution.
An additional benefit is lower memory use, as the gradients will be propagated through
fewer functions. A pullback for the QR factorization may thus also aid numerical
stability, as fewer accumulations and propagations are performed.

Formulas for the pullback of the QR factorization are given in [2],
[3] and [4]. Pytorch for example implements the method described in [3], see here.

In this blog post, we implement the pullback for the QR factorization in Julia.
Specifically, we implement a so-called rrule for ChainRules.jl. While you are here, please take a moment and read
the documentation of this package. It is very well written and helped me tremendously
to understand how automatic differentiation works and is implemented in Julia.
Also, if you want to skip ahead, here is my example implementation of the
QR pullback for ChainRules.

Now, let’s take a look at the code and how to implement a custom rrule. The first
thing we need to look at is the definition of the QR struct

struct QR{T,S<:AbstractMatrix{T}} <: Factorization{T}
    factors::S
    τ::Vector{T}

    function QR{T,S}(factors, τ) where {T,S<:AbstractMatrix{T}}
        require_one_based_indexing(factors)
        new{T,S}(factors, τ)
    end
end

There are two fields in the structure, factors and τ. The matrices Q and R are
returned through an accompanying getproperty function:

function getproperty(F::QR, d::Symbol)
    m, n = size(F)
    if d === :R
        return triu!(getfield(F, :factors)[1:min(m,n), 1:n])
    elseif d === :Q
        return QRPackedQ(getfield(F, :factors), F.τ)
    else
        getfield(F, d)
    end
end

For the QR pullback we first need to implement a pullback for the getproperty
function. The pullback for this function only propagates incoming gradients
backwards. Incoming gradients are described using a Tangent. This struct can only
have fields that are present in the primal type, in our case
LinearAlgebra.QRCompactWY, which has the fields factors and T as discussed
above. The incoming gradients are \(\bar{Q}\) and \(\bar{R}\), so the pullback
needs to map between these two. It can be implemented like this:


function ChainRulesCore.rrule(::typeof(getproperty), F::LinearAlgebra.QRCompactWY, d::Symbol) 
    function getproperty_qr_pullback(Ȳ)
        # The incoming gradient Ȳ is either Q̄ or R̄, depending on d
        ∂factors = if d === :Q
            Ȳ
        else
            nothing
        end

        ∂T = if d === :R
            Ȳ
        else
            nothing
        end

        ∂F = Tangent{LinearAlgebra.QRCompactWY}(; factors=∂factors, T=∂T)
        return (NoTangent(), ∂F)
    end

    return getproperty(F, d), getproperty_qr_pullback
end

Notice the call signature of the rrule. The first argument is always
::typeof(functionname), where functionname is the name of the function that
we want a custom rrule for. Following this argument are the actual arguments
that one normally passes to functionname.

Finally we need to implement the pullback that actually performs the relevant
calculations. The typical way of implementing custom pullbacks with
ChainRules is to write a function
that returns a tuple containing the result of the forward pass,
as well as the pullback \(\mathcal{B}^{x}_{f}(\bar{y})\). Doing it this way allows
the pullback to re-use results from the forward pass. In other words, defining
the pullback as a closure inside the forward pass makes it easy to use cached results.
There is more discussion on the reasons behind this design choice in the
ChainRules documentation.
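As a minimal illustration of this pattern (not part of the QR code; mysin is a made-up name to avoid clashing with Base.sin), an rrule for a scalar function could look like this. The closure captures the forward-pass input x:

```julia
using ChainRulesCore

# A toy scalar function
mysin(x) = sin(x)

function ChainRulesCore.rrule(::typeof(mysin), x)
    y = mysin(x)                        # forward pass
    function mysin_pullback(ȳ)
        # the pullback re-uses x, captured from the forward pass
        return (NoTangent(), ȳ * cos(x))
    end
    return y, mysin_pullback
end

y, back = ChainRulesCore.rrule(mysin, 0.0)
back(1.0)   # (NoTangent(), 1.0), since cos(0) == 1
```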

Returning to the pullback for the QR factorization, here is a possible implementation:


function ChainRules.rrule(::typeof(qr), A::AbstractMatrix{T}) where {T} 
    QR = qr(A)
    m, n = size(A)
    function qr_pullback(Ȳ::Tangent)
        function qr_pullback_square_deep(Q̄, R̄, A, Q, R)
            M = R̄*R' - Q'*Q̄
            # M <- copyltu(M)
            M = triu(M) + transpose(triu(M,1))
            Ā = (Q̄ + Q * M) / R'
        end 
        Q̄ = Ȳ.factors
        R̄ = Ȳ.T 
        Q = QR.Q
        R = QR.R
        if m ≥ n 
            Q̄ = Q̄ isa ChainRules.AbstractZero ? Q̄ : @view Q̄[:, axes(Q, 2)] 
            Ā = qr_pullback_square_deep(Q̄, R̄, A, Q, R)
        else
            # partition A = [X | Y]
            # X = A[1:m, 1:m]
            Y = A[1:m, m + 1:end]
    
            # partition R = [U | V], and we don't need V
            U = R[1:m, 1:m]
            if R̄ isa ChainRules.AbstractZero
                V̄ = zeros(size(Y))
                Q̄_prime = zeros(size(Q))
                Ū = R̄ 
            else
                # partition R̄ = [Ū | V̄]
                Ū = R̄[1:m, 1:m]
                V̄ = R̄[1:m, m + 1:end]
                Q̄_prime = Y * V̄'
            end 

            Q̄_prime = Q̄ isa ChainRules.AbstractZero ? Q̄_prime : Q̄_prime + Q̄ 

            X̄ = qr_pullback_square_deep(Q̄_prime, Ū, A, Q, U)
            Ȳ = Q * V̄ 
            # partition Ā = [X̄ | Ȳ]
            Ā = [X̄ Ȳ]
        end 
        return (NoTangent(), Ā)
    end 
    return QR, qr_pullback
end

The part of the rrule that calculates the forward pass just calculates the QR
decomposition and returns the result. The pullback consumes the
incoming gradients, \(\bar{R}\) and \(\bar{Q}\), which are stored as defined
in the pullback for getproperty. The actual calculations differ for the case
where the matrix A is square or deep, m ≥ n, and the case where A is wide,
m < n. But they can re-use some code; that is what the
local function qr_pullback_square_deep does. The rest is mostly matrix slicing
and the occasional matrix product. The implementation borrows strongly from
PyTorch’s autograd implementation.

Finally, I need to verify that my implementation calculates correct results. For
this, I define two small test functions which each work on exactly one output of the
QR factorization, either the matrix Q or the matrix R. Then I generate some random
input and compare the gradient calculated through reverse-mode AD in Zygote
to the gradient calculated by FiniteDifferences. If they are approximately equal,
I take the result to be correct.

using LinearAlgebra
using Zygote
using FiniteDifferences

V1 = rand(Float32, (4, 4));

function f1(V)
    Q, _ = qr(V)
    return sum(Q)
end

function f2(V)
    _, R = qr(V)
    return sum(R)
end

res1_V1_ad = Zygote.gradient(f1, V1)
res1_V1_fd = FiniteDifferences.grad(central_fdm(5, 1), f1, V1)
@assert res1_V1_ad[1] ≈ res1_V1_fd[1]

[1]
Schreiber et al. A Storage-Efficient WY Representation for Products of Householder Transformations

[2]
HJ Liao, JG Liu et al. Differentiable Programming Tensor Networks

[3]
M. Seeger, A. Hetzel et al. Auto-Differentiating Linear Algebra

[4]
Walter and Lehmann, 2018, Algorithmic Differentiation of Linear Algebra Functions with Application in Optimum Experimental Design

Automatic differentiation – Reverse Mode

Re-posted from: https://rkube.github.io/julia/autodiff/2021/05/16/autodiff2.html

This post is the follow-up to my post on forward mode AD.
There I motivated the use of automatic differentiation by noting that it is helpful for
gradient-based optimization. In this post I will discuss how reverse mode allows us to
take the derivative of a function output with respect to basically an arbitrary number
of parameters in one sweep. To do this comfortably, one often evaluates local
derivatives, whose use is facilitated by introducing the bar notation.

To see this, let’s go the other way around. Given our computational graph we
start at the output \(y\) and calculate derivatives with respect to intermediate
values. For convenience we define the notation

\[\begin{align}
\bar{u} = \frac{\partial y}{\partial u}
\end{align}\]

that is, the symbol under the bar is the variable we wish to take the derivative with
respect to. Now let’s dive right in and take derivatives from the example. We start
out with

\[\begin{align}
\bar{v}_6 & = \frac{\partial y}{\partial v_6} = 1 \\
\bar{v}_5 & = \frac{\partial y}{\partial v_5} = \frac{\partial y}{\partial v_6} \frac{\partial v_6}{\partial v_5} = \bar{v}_6 \frac{\partial v_6(v_4, v_5)}{\partial v_5} = \bar{v}_6 v_4 \\
\end{align}\]

The first expression is often called the seed gradient and is trivial to evaluate.
In order to evaluate \(\bar{v}_5\) we had to use the chain-rule.

Continuing, we now have to evaluate \(\bar{v}_4\). Let’s do it first using the
chain rule:

\[\begin{align}
\frac{\partial y}{\partial v_4} & =
\frac{\partial y}{\partial v_6} \left(
\frac{\partial v_6}{\partial v_5} \frac{\partial v_5}{\partial v_4} +
\frac{\partial v_6}{\partial v_4}
\right) \\
& = v_4 (-1) + v_5.
\end{align}\]

where we have used the chain rule and that \(v_6 = v_6(v_5(v_4), v_4)\). Looking at
the computational graph, we see that we had to split the product using a plus-sign
precisely at a position where there are multiple paths from the origin \(y\) to
the target node \(v_4\), one via \(v_6\) and one via \(v_5\). Now as trivial as
this example is, real-world programs are often much more complicated. And to
calculate the partial derivatives, the formulas can become ever more complex.

Now the bar notation is here to make our life easier. Essentially, it gives us an
easy way to replace chain-rule evaluation with local values \(\bar{v}_i\) that are
stored in all child nodes of the target node \(v_i\). Put another way,
they are used as cached values when traversing the graph from right to left, as we
do in reverse-mode AD. This is illustrated in the sketch below:

Computational graph for reverse-mode autodiff

We see that there are two gradients incoming to the \(v_4\) node:
\(\bar{v}_5 \partial v_5 / \partial v_4\) and \(\bar{v}_6 \partial v_6 / \partial v_4\).
Assuming that \(\bar{v}_6\) and \(\bar{v}_5\) are already evaluated, we just need the
partial derivatives \(\partial v_5 / \partial v_4\) and
\(\partial v_6 / \partial v_4\) to find \(\bar{v}_4\):

\[\begin{align}
\bar{v}_4 = \bar{v}_5 \frac{\partial v_5}{\partial v_4} + \bar{v}_6 \frac{\partial v_6}{\partial v_4} = \bar{v}_5 (-1) + \bar{v}_6 v_5.
\end{align}\]

And indeed, we have recovered the same rule as when applying the chain rule.
When we continue to calculate derivatives \(\bar{v}_i\), we see that this
notation becomes very handy. For example, using the chain rule we have

\[\begin{align}
\bar{v}_3 = \frac{\partial y}{\partial v_3} = \bar{v}_6 \frac{\partial v_6\left(v_4(v_1, v_3), v_5(v_4(v_1, v_3))\right)}{\partial v_3}
\end{align}\]

where we again need to use the product rule. Using the bar-notation however, the
derivative becomes trivial

\[\begin{align}
\bar{v}_3 = \bar{v}_4 \frac{\partial v_4}{\partial v_3} = \bar{v}_4 v_1,
\end{align}\]

given that both \(\bar{v}_4\) and \(v_1\) have been evaluated. But in typical
reverse-mode use this is the case, as we have completed one forward pass through
the graph and we have traversed the barred values from right to left.

For completeness’ sake, let’s calculate the reverse-mode AD evaluation trace as we did
for the forward mode.

Reverse-mode AD evaluation trace
\(v_{-1} = x_1 = 1.0\)
\(v_0 = x_2 = 1.1\)
\(v_1 = v_0 v_{-1} = (1.1) (1.0) = 1.1\)
\(v_2 = \exp(v_1) = 3.004\)
\(v_3 = \cos(v_0) = 0.4536\)
\(v_4 = v_1 v_3 = 0.4990\)
\(v_5 = v_2 - v_4 = (3.004) - (0.4990) = 2.5052\)
\(v_6 = v_4 v_5 = (0.4990) (2.5052) = 1.2500\)
\(\bar{v}_6 = \frac{\partial y}{\partial v_6} = 1\)
\(\bar{v}_5 = \frac{\partial y}{\partial v_5} = \bar{v}_6 v_4 = (1) (0.4990) = 0.4990\)
\(\bar{v}_4 = \frac{\partial y}{\partial v_4} = \bar{v}_6 \frac{\partial v_6}{\partial v_4} + \bar{v}_5 \frac{\partial v_5}{\partial v_4} = \bar{v}_6 v_5 + \bar{v}_5 (-1) = (1)(2.5052) + (0.4990)(-1) = 2.0062\)
\(\bar{v}_3 = \frac{\partial y}{\partial v_3} = \bar{v}_4 \frac{\partial v_4}{\partial v_3} = \bar{v}_4 v_1 = (2.0062) (1.1) = 2.2069\)
\(\bar{v}_2 = \bar{v}_5 \frac{\partial v_5}{\partial v_2} = \bar{v}_5 (1) = 0.4990\)
\(\bar{v}_1 = \bar{v}_2 \frac{\partial v_2}{\partial v_1} + \bar{v}_4 \frac{\partial v_4}{\partial v_1} = \bar{v}_2 \exp(v_1) + \bar{v}_4 v_3 = (0.4990) (3.004) + (2.0062)(0.4536) = 2.4090\)
\(\bar{v}_0 = \bar{v}_3 \frac{\partial v_3}{\partial v_0} + \bar{v}_1 \frac{\partial v_1}{\partial v_0} = \bar{v}_3 (- \sin v_0) + \bar{v}_1 v_{-1} = (2.2069) (-0.8912) + (2.4090)(1) = 0.4422\)
\(\bar{v}_{-1} = \bar{v}_1 \frac{\partial v_1}{\partial v_{-1}} = \bar{v}_1 v_0 = (2.4090) (1.1) = 2.6500\)
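The numbers in this trace can be verified programmatically. The sketch below recomputes the forward values and the barred quantities as a hand-written reverse sweep, without any AD library:

```julia
x1, x2 = 1.0, 1.1

# forward pass
v1 = x1 * x2
v2 = exp(v1)
v3 = cos(x2)
v4 = v1 * v3
v5 = v2 - v4
v6 = v4 * v5                      # y ≈ 1.2500

# reverse sweep
v̄6 = 1.0
v̄5 = v̄6 * v4                     # ≈ 0.4990
v̄4 = v̄6 * v5 + v̄5 * (-1)         # ≈ 2.0062
v̄3 = v̄4 * v1                     # ≈ 2.2069
v̄2 = v̄5                          # ≈ 0.4990
v̄1 = v̄2 * exp(v1) + v̄4 * v3      # ≈ 2.4090
v̄0 = v̄3 * (-sin(x2)) + v̄1 * x1   # ∂y/∂x2 ≈ 0.4422
v̄m1 = v̄1 * x2                    # ∂y/∂x1 ≈ 2.6500
```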

As a first step, we are evaluating all the un-barred quantities \(v_i\) before we begin the
backward pass, where we evaluate all the \(\bar{v}_i\). We also see that we need the values
of the child nodes to calculate the barred values. For example, to calculate \(\bar{v}_3\)
we need the value of \(v_1\), which is a child-node of \(v_4\), when evaluating the
partial derivative \(\partial v_4 / \partial v_1\).

We also see that this procedure allows us to calculate the derivatives \(\partial y / \partial v_i\) for an arbitrary number of \(v_i\)s in a single reverse pass. This is
exactly the behaviour that makes reverse-mode automatic differentiation so powerful in the
context of gradient-based optimization, where we want to take the derivative of
a scalar loss function with respect to a possibly large number of parameters.

Reverse-mode AD in higher dimensions: Jacobian-Vector products

We are often working in higher-dimensional spaces; hidden layers in multilayer perceptrons,
for example, can have hundreds or even thousands of features. It is therefore instructive to take
a look at how reverse-mode AD works in this setting. For this we are looking at a new example:

\[\begin{align}
x \in \mathbb{R}^{n} \quad f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} \quad g: \mathbb{R}^{m} \rightarrow \mathbb{R}^{l} \quad u \in \mathbb{R}^{m} \quad y \in \mathbb{R}^{l} \\
u = f(x) \quad y = g(u) = g(f(x))
\end{align}\]

The computational graph for the program \(y = g(f(x))\) is just

Computational graph for simple example with gradient backprop

Now the formulas for the higher-dimensional case arise naturally from the ones derived in the
previous section. Since we are discussing reverse mode, we start at the right. The
barred quantity for each element of the first hidden layer is given by:

\[\begin{align}
\bar{u}_j & = \sum_{i} \bar{y}_i \frac{\partial y_i}{\partial u_j} \\
\end{align}\]

Using vector notation, this equation can be written as

\[\begin{align}
\left[ \bar{y}_1 \ldots \bar{y}_l \right]
\left[
\begin{matrix}
\frac{\partial y_1}{\partial u_1} & \ldots & \frac{\partial y_1}{\partial u_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_l}{\partial u_1} & \ldots & \frac{\partial y_l}{\partial u_m}
\end{matrix}

\right]
& = \left[ \bar{u}_1 \ldots \bar{u}_m \right]
\end{align}\]

Now \(\bar{u}\) is the vector that we get as output from this calculation and that will be
propagated leftwards. We obtain it by calculating a vector-Jacobian product (vjp). In practice,
one usually does not build the Jacobian matrix explicitly; it is more efficient to calculate
the vjp directly.
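Zygote, for example, exposes this directly: Zygote.pullback returns the forward value together with a function that computes the vjp without ever forming the Jacobian. A small sketch (assumes Zygote is installed; the function g is made up for illustration):

```julia
using Zygote

g(u) = [u[1]^2 + u[2], 3 * u[2]]

# forward pass returns y and the pullback
y, back = Zygote.pullback(g, [1.0, 2.0])

# ȳᵀ J, evaluated without building J explicitly
ū, = back([1.0, 1.0])
# The Jacobian at u = [1, 2] is [2 1; 0 3], so ū == [2.0, 4.0]
```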

Moving leftwards in the graph, we use the same procedure to calculate \(\bar{x}\):

\[\begin{align}
\left[ \bar{u}_1 \ldots \bar{u}_m \right]
\left[
\begin{matrix}
\frac{\partial u_1}{\partial x_1} & \ldots & \frac{\partial u_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial u_m}{\partial x_1} & \ldots & \frac{\partial u_m}{\partial x_n}
\end{matrix}
\right]
& = \left[ \bar{x}_1 \ldots \bar{x}_n \right]
\end{align}\]

Combining these results we find

\[\begin{align}
\bar{x} = \bar{y} \left[ \left\{ \frac{\partial y_i}{\partial u_j} \right\}_{i,j} \right] \left[ \left\{ \frac{\partial u_r}{\partial x_s} \right\}_{r,s} \right].
\end{align}\]

This equation is evaluated left-to-right, so only vector-matrix products are required.

As a final note, these vector-jacobian products are often called pullbacks in the
context of automatic differentiation software. A notation used for them is

\[\begin{align}
\bar{u} = \sum_{i} \bar{y}_i \frac{\partial y_i}{\partial u} = \mathcal{B}^{u}_{f}(\bar{y}).
\end{align}\]

Let’s unpack the expression \(\mathcal{B}^{u}_{f}(\bar{y})\). The upper index denotes simply
that the pullback is a function of \(u\), since it needs the values from the forward-pass
at the node to be evaluated. The lower index \(f\) denotes that it depends on the mapping
\(f\), since this is the mapping from \(x\) to \(u: f(x) = u\). Finally, the argument
\(\bar{y}\) denotes that the pullback needs the incoming gradient from the right.

In practice, automatic differentiation software uses the chain rule to split the computational
graph of a program into finer units until it can identify a pullback for each segment of the
computational graph. In some cases this may be the desired way to work, and in other cases
a user-defined backpropagation rule may be desirable.

Summary

We have introduced reverse-mode automatic differentiation. By starting with a seed gradient
at the end of the computational graph, this mode allows us to quickly calculate the
sensitivity of an output with respect to an arbitrarily large number of intermediate values.
To aid the necessary computations, the bar notation has been introduced.
Finally, we showed that reverse-mode AD only needs to calculate vector-Jacobian products
and introduced the notion of the pullback as the primitive object of reverse-mode AD.
Please also check out the references below, which I used for this blog post.

References

[1]
C. Rackauckas 18.337J/6.338J: Parallel Computing and Scientific Machine Learning

[2]
S. Radcliffe Autodiff in python

[3]
R. Grosse Lecture notes CSC321

[4]
A. Griewank and A. Walther, Evaluating Derivatives, SIAM (2008)