Tag Archives: autodiff

Automatic differentiation – Reverse Mode

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/autodiff/2021/05/16/autodiff2.html

This post is the follow-up of my post on forward mode AD.
There I motivated motivated the use of automatic differentiation by noting that it is helpful for
gradient-baseed optimization. In this post I will discuss how reverse mode allows us to
take the derivative of a function output with respect to basically an arbitrary number
of parameters in one sweep. To do this comfortably, one often evaluates local
derivatives, whose use is facilitated by introducing the bar notation.

To see this, let’s go the other way around. Given our computational graph we
start at the output \(yy\) and calculate derivatives with respect to intermediate
values. For convenience we define the notation

\[\begin{align}
\bar{u} = \frac{\partial y}{\partial u}
\end{align}\]

that is, the symbol under the bar is the variable we wish to take the derivative with
respect to. Now let’s dive right in and take derivatives from the example. We start
out with

\[\begin{align}
\bar{v}_6 & = \frac{\partial y}{\partial v_6} = 1 \\
\bar{v}_5 & = \frac{\partial y}{\partial v_5} = \frac{\partial y}{\partial v_6} \frac{\partial v_6}{\partial v_5} = \bar{v}_6 \frac{\partial v_6(v_4, v_5)}{\partial v_5} = \bar{v}_6 v_4 \\
\end{align}\]

The first expression is often called the seed gradient and is trivial to evaluate.
In order to evaluate \(\bar{v}_5\) we had to use the chain-rule.

Continuing, we now have to evaluate \(\bar{v}_4\). Let’s do it first using the
chain rule:

\[\begin{align}
\frac{\partial y}{\partial v_4} & =
\frac{\partial y}{\partial v_6} \left(
\frac{\partial v_6}{\partial v_5} \frac{\partial v_5}{\partial v_4} +
\frac{\partial v_6}{\partial v_4}
\right) \\
& = v_4 (-1) + v_5.
\end{align}\]

where we have used the chain rule and that \(v_6 = v_6(v_5(v_4), v_4)\). Looking at
the computational graph, we see that we had to split the product using a plus-sign
precisely at a position where there are multiple paths from the origin \(y\) to
the target node \(v_4\), one via \(v_6\) and one via \(v_5\). Now as trivial as
this example is, real-world programs are often much more complicated. And to
calculate the partial derivatives, the formulas can become ever more complex.

Now the bar notation is here to make our life easier. Essentially, it gives us an
easy way to replace chain-rule evaluation with local values that are stored in
all child-nodes of the target node \(v_i\) in \(\bar{v}_i\). Put in another way,
they are used as cache-values when traverseing the graph from right-to-left, as we
do in reverse-mode AD. This is illustrated in the sketch below:

Coputational graph for reverse-mode autodiff

We see that there are two gradients incoming to the \(v_4\) node:
\(\bar{v}_5 \partial v_5 / \partial v_4\) and \(\bar{v}_6 \partial v_6 / \partial v_4\).
Assuming that \(\bar{v}_6\) and \(\bar{v}_5\) are already evaluated, we just need the
partial derivatives \(\partial v_5 / \partial v_4\) and
\(\partial v_6 / \partial v_4\) to find \(\bar{v}_4\):

\[\begin{align}
\bar{v}_4 = \bar{v}_5 \frac{\partial v_5}{\partial v_4} + \bar{v}_6 \frac{\partial v_6}{\partial v_4} = \bar{v}_5 (-1) + \bar{v}_6 v_5.
\end{align}\]

And indeed, we have recovered the same rule as when applying the chain rule.
When we continue to calculate derivatives \(\bar{v}_i\), we see that this
notation becomes very handy. For example, using the chain rule we have

\[\begin{align}
\bar{v}_3 = \frac{\partial y}{v_3} = \bar{v}_6 \frac{\partial v_6(v_5(v_4(v_1,v_3), v_4(v_3, v_1)))}{\partial v_3}
\end{align}\]

where we again need to use the product rule. Using the bar-notation however, the
derivative becomes trivial

\[\begin{align}
\bar{v}_3 = \bar{v}_4 \frac{\partial v_4}{v_3} = \bar{v}_4 v_1,
\end{align}\]

given that both \(\bar{v}_4\) and \(v_1\) have been evaluated. But in typical
reverse-mode use this is the case, as we have completed one forward pass through
the graph and we have traversed the barred values from right to left.

For completeness sake, let’s calculate the Reverse-mode AD evaluation trace as we did
for the forward mode.

Reverse-mode AD evaluation trace
\(v_{-1} = x_1 = 1.0\)
\(v_0 = x_2 = 1.1\)
\(v_1 = v_0 v_{-1} = (1.1) (1.0) = 1.1\)
\(v_2 = exp(v_1) = 3.004\)
\(v_3 = cos(v_0) = 0.4536\)
\(v_4 = v_1 v_3 = 0.4990\)
\(v_5 = v_2 – v_4 = (3.004) – (0.4990) = 2.5052\)
\(v_6 = v_4 v_5 = (0.4990) (2.5052) = 1.2500\)
\(\bar{v}_6 = \frac{\partial y}{\partial v_6} = 1\)
\(\bar{v}_5 = \frac{\partial y}{\partial v_5} = \bar{v}_6 v_4 = (1) (0.4990) = 0.4990\)
\(\bar{v}_4 = \frac{\partial y}{\partial v_4} = \bar{v}_6 \frac{\partial v_6}{\partial v_4} + \bar{v}_5 \frac{\partial v_5}{v_4} = \bar{v}_6 v_5 + \bar{v}_5 (-1) = (1)(2.5052) + (0.4990)(-1) = 2.006\)
\(\bar{v}_3 = \frac{\partial y}{v_3} = \bar{v}_4 \frac{\partial v_4}{\partial v_3} = \bar{v}_4 v_1 = (2.006) * (1.1) = 2.2069\)
\(\bar{v}_2 = \bar{v}_5 \frac{\partial v_5}{\partial v_2} = \bar{v}_5 (1) = 0.4990\)
\(\bar{v}_1 = \bar{v}_2 \frac{\partial v_2}{\partial v_1} + \bar{v}_4 \frac{\partial v_4}{\partial v_1} = \bar{v}_2 \exp(v_1) + \bar{v}_4 v_3 = (0.4990) (3.004) + (2.006)(0.4536) = 2.4090\)
\(\bar{v}_0 = \bar{v}_3 \frac{\partial v_3}{\partial v_0} + \bar{v}_1 \frac{\partial v_1}{\partial v_0} = \bar{v}_3 (- \sin v_0) + \bar{v}_1 v_{-1} = (2.2069) (-0.8912) + (2.4090)(1) = 4.3758\)
\(\bar{v}_{-1} = \bar{v}_1 \frac{\partial v_1}{\partial v_{-1}} = \bar{v}_1 v_0 = (2.4090) * (1.1) = 2.6500\)

As a first step, we are evaluating all the un-barred quantities \(v_i\) before we begin the
backward pass, where we evaluate all the \(\bar{v}_i\). We also see that we need the values
of the child nodes to calculate the barred values. For example, to calculate \(\bar{v}_3\)
we need the value of \(v_1\), which is a child-node of \(v_4\), when evaluating the
partial derivative \(\partial v_4 / \partial v_1\).

We also see that this procedure would allow us to calculate the derivatives \(\partial y / \partial v_i\) for an arbitrary number of \(v_i\)s in a single reverse pass. This is the
exactly the behaviour that makes reverse-mode automatic differentiation so powerful in
context of gradient-based optimization. In that situation we want to take the derivative of
a scalar loss function with respect to a possible large amount of parameters.

Reverse-mode AD in higher dimensions: Jacobian-Vector products

We are often working in higher dimensional spaces, hidden layers in multilayer perceptrons
for example can have hundreds or even thousands of features. It is therefore instructive to take
a look how reverse-mode AD works in this setting. For this we are looking at a new example:

\[\begin{align}
x \in \mathbb{R}^{n} \quad f: \mathbb{R}^{n} \rightarrow \mathbb{R}^{m} \quad g: \mathbb{R}^{m} \rightarrow \mathbb{R}^{l} \quad u \in \mathbb{R}^{m} \quad y \in \mathbb{R}^{l} \\
u = f(x) \quad y = g(u) = g(f(x))
\end{align}\]

The computational graph for the program \(y = g(f(x))\) is just

Computational graph for simple example with gradient backprop

Now the formulas for the higher-dimensional case arise naturally from the ones derived in the
previous section. Since we are discussing reverse mode, we start at the right. The
bar quantities for each element of the first hidden layer is given by:

\[\begin{align}
\bar{u}_j & = \sum_{i} \bar{y}_i \frac{\partial y_i}{\partial u_j} \\
\end{align}\]

Using vector notation, this equation can be written as

\[\begin{align}
\left[ \bar{y}_1 \ldots \bar{y}_l \right]
\left[
\begin{matrix}
\frac{\partial y_1}{\partial u_1} & \ldots & \frac{\partial y_1}{\partial u_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_l}{\partial u_1} & \ldots & \frac{\partial y_l}{\partial u_m}
\end{matrix}

\right]
& = \left[ \bar{u}_1 \ldots \bar{u}_m \right]
\end{align}\]

Now \(\bar{f}\) is the vector that we get as an output from that calculation and will be
propagated left-wards. We obtain it by calculating a vector-jacobian product. In practice,
one usually does not calculate the jacobian matrix, but it is more efficient to calculate
the vjp directly.

Moving leftwards to the graph, we use the same procedure to calculate \(\bar{x}\):

\[\begin{align}
\left[ \bar{u}_1 \ldots \bar{u}_n \right]
\left[
\begin{matrix}
\frac{\partial u_1}{\partial x_1} & \ldots & \frac{\partial u_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial u_m}{\partial x_1} & \ldots & \frac{\partial u_m}{\partial x_n}
\end{matrix}

\right]
& = \left[ \bar{x}_1 \ldots \bar{x}_n \right]
\end{align}\]

Combining these results we find

\[\begin{align}
\bar{x} = \bar{y} \left[ \left\{ \frac{\partial y_i}{\partial u_j} \right\}_{i,j} \right] \left[ \left\{ \frac{\partial u_i}{\partial x_j} \right\}_{r,s} \right].
\end{align}\]

This equation is evaluated left-to-right, so that it requires to calculate only vector-matrix products.

As a final note, these vector-jacobian products are often called pullbacks in the
context of automatic differentiation software. A notation used for them is

\[\begin{align}
\bar{u} = \sum_{i} \bar{y}_i \frac{\partial y_i}{\partial u} = \mathcal{B}^{u}_{f}(\bar{y}).
\end{align}\]

Let’s unpack the expression \(\mathcal{B}^{u}_{f}(\bar{y})\). The upper index denotes simply
that the pullback is a function of \(u\), since it needs the values from the forward-pass
at the node to be evaluated. The lower index \(f\) denotes that it depends on the mapping
\(f\), since this is the mapping from \(x\) to \(u: f(x) = u\). Finally, the argument
\(\bar{y}\) denotes that the pullback needs the incoming gradient from the right.

In practice, automatic differentiation software uses the chain rule to split the computational
graph of a program into finer units until it can identify a pullback for a segment of the
computational graph. In some cases this may be the desired way to work and in some cases,
a user-defined backpropagation rule may be desireable.

Summary

We have introduced reverse-mode automatic differentiation. By starting with a gradient
at the end of the computational graph, this mode allows to quickly calculate the
sensitivity of an output with respect to an arbitrary large amount of intermediate values.
To aid the necessary computations for this, the bar-notation has been introduced.
Finally, we motivated that reverse-mode AD only needs to calculate vector-jacobian products
and introduced the notation of pullback as the primitive object of reverse-mode AD.
And finally, please check out the references below which I used for this blog-post.

References

[1]
C. Rackauckas 18.337J/6.338J: Parallel Computing and Scientific Machine Learning

[2]
S. Radcliffe Autodiff in python

[3]
R. Grosse Lecture notes CSC321

[4]
A. Griewank, A. Walther – Evaluating Derivatives – SIAM(2008)

Automatic differentiation – Forward Mode

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/julia/autodiff/2021/05/10/autodiff1.html

Automatic differentiation allows one to take derivatives of computer programs which
calculate numerical values. The name refers to a set of techniques that produce a
transformed computer program which in turn calculates various derivatives of these values.

To see why this is extremely useful, consider a traditional data fitting problem.
We have a model \(f_\theta: \mathbb{R}^m \rightarrow \mathbb{R}^n\), where
\(\theta\) are the model parameters and observations \(y_i \in \mathbb{R}^n\)
taken at \(x_i \in \mathbb{R}^{m}\). To fit the model on the data we can
use a loss function, for example the mean-squared error

\[\begin{align}
\mathcal{L} = \sum_{i} \left( f_\theta(x_i) – y_i \right)^2
\end{align}\]

and tune the parameters \(\theta\) in order to minimize \(\mathcal{L}\). This process
is also called learning or solving an inverse problem.

A common method to tune the parameters \(\theta\) is gradient descent. Given a learning
rate \(\eta\) one iterates

\[\theta \leftarrow \theta – \eta \frac{\partial \mathcal{L}}{\partial \theta}\]

until \(\mathcal{L}\) is approximately at a local minimum. Applying the chain rule, we
can expand the derivative as

\[\frac{\partial \mathcal{L}}{\partial \theta} = \sum_i 2 \left( f_\theta(x_i) – y_i \right)
\frac{\partial f_\theta(x_i)}{\partial \theta}.\]

To perform the optization procedure we need to calculate the derivatives described
by the last term in the equation above. This is where automatic differentiation
comes into play.

A Toy Example

To better understand what automatic differentiation does and how it works let us consider
the toy example

\[f(x_1, x_2) = x_1 x_2 \cos(x_2) \left[ \exp(x_1 x_2) – 1 \right] = y.\]

In the following we will investigate how to calculate the derivative of \(f\) with respect
to its inputs \(x_1\) and \(x_2\). In the context of the previous section, we can think of
\(x_1\) as a parameter \(\theta\) that we wish to tone. While the method we present here
is as simple as possible – real-world examples of tunable models often have thousands,
millions, or even billions of parameters – it serves well to introduce the fundamental way
in which automatic differentiation works. For the work below we are following [1]
closely.

This mathematical formula for \(f\) can be decomposed into a succession of elementary functions,
like \(+\), \(\sin\), \(\exp\), \(*\). That is, we can define intermediate variables like
\(v_1(a,b) = a \cdot b\), \(v_2(a) = \cos a\) etc, which are only defined once. Each of these
variables is also initialized by applying a simple expression to previous variables.
Applying this procedure to our toy example gives us the following sequence:

A simple program broken up into intermediate values
\(v_{-1} = x_1\)
\(v_{0} = x_2\)
\(v_{1} = v_{-1} \cdot v_{0}\)
\(v_{2} = \exp v_1\)
\(v_{3} = \cos v_0\)
\(v_{4} = v_1 \cdot v_3\)
\(v_{5} = v_2 – v_4\)
\(v_6 = v_4 v_5\)
\(y = v_6\)

We now define the so-called Tangential Derivative \(\dot{v_i} = \partial v_i / \partial x_1\).
For the tangential derivative, we are fixing the denominator to \(x_1\) so that it describes the sensitivity of the functions
output with respect to the given input. Let us calculate this derivative for the v’s now:

\[\begin{align}
\dot{v}_1 & = \frac{\partial v_1}{\partial x_1} = v_0 \frac{\partial v_{-1}}{\partial v_{-1}} = v_0 \\
\dot{v}_2 & = \frac{\partial v_2}{\partial v_1} \dot{v}_1 = v_0 \exp v_1\\
\dot{v}_3 & = \frac{\partial v_3}{\partial v_0} \dot{v}_0 = 0 \\
\dot{v}_4 & = \frac{\partial v_4}{\partial v_1} \dot{v}_1 + \frac{\partial v_4}{\partial v_3} \dot{v}_3 = v_3 v_0 \\
\dot{v}_5 & = \frac{\partial v_5}{\partial v_2} \dot{v}_2 + \frac{\partial v_5}{\partial v_4} \dot{v}_4 = \dot{v}_2 – \dot{v}_4 \\
\dot{v}_6 & = \frac{\partial v_6}{\partial v_5} \dot{v}_5 + \frac{\partial v_6}{\partial v_4} \dot{v}_4 = \dot{v}_5 v_4 + \dot{v}_4 v_5 \\
\end{align}\]

Note that we have to use \(\dot{v}_{-1} = 1\) and \(\dot{v}_{0} = 0\) in order to calculate \(v\dot{v}_1\). That
is, the seed determines which part of the sum is non-zero. From the expressions above we can alos see that we
will need to evaluate the derivatives starting at \(v_1\). That is, we begin at the start of the function,
calculate simple derivatives and push this calculation through. Therefore this is also called forward mode automatic differentiation.

In practice, forward mode AD is often implemented by operator overloading. An implementation
would need seed an initial derivative \(x_i = 1\) and then calculate the program and the derivatives simultaneously.
This approach works for our toy example as we see from the evaluation trace shown below:

Forward-mode AD evaluation trace  
\(v_{-1} = x_1 = 1.0\) \(\dot{v}_{-1} = 1.0\)
\(v_{0} = x_2 = 1.1\) \(\dot{v}_{0} = 0.0\)
\(v_1 = v_0 v_{-1} = (1.1) (1.0) = 1.1\) \(\dot{v}_{1} = v_{0} = (1.1)\)
\(v_{4} = v_1 v_3 = (1.1) (0.4536) = 0.4990\) \(\dot{v}_4 = v_3 \dot{v}_1 = (0.4536) * (1.1) = 0.499\)
\(v_{5} = v_2 – v_4 = (3.004) – (0.4990) = 2.5052\) \(\dot{v}_5 = \dot{v}_2 – \dot{v}_4 = 3.305 – 0.499 = 2.851\)
\(v_6 = v_4 v_5 = (0.4990) (2.7180) = 1.2500\) \(\dot{v}_6 = v_4 \dot{v}_5 + \dot{v}_4 v_5 = (0.499) (2.851) + (0.499) (2.505) = 2.650\)

It is illustrative to visualize how the values and derivatives are propagated in the computational graph.
Coputational graph for forward-mode autodiff

Here the intermediate values are shown in yellow and the gradient values are shown in turquoise. Note that
both, values and derivatives are propagated together from the left, the input, to the output on the right.
The implication of this is that forward-mode autodiff can calculate the sensitivities of all outputs y
with respect to one input x in a multidimensional setting.

Summary

Automatic differentiation is a set of tools that allows to calculate the derivatives of
computer programs that return numerical values. Here we looked at the so-called forward mode of
automatic differentiation. It works by de-composing a program into simple functions for which
we know the derivative. We can initialize a program and set the input variable for which we want
to know the output’s derivatives for. Then we can calculate the function value and the derivative
of the output with respect to the chosen value in one sweep.

This works well for situations where we have a small number of inputs and a large number of outputs.
But in data-fitting problems we often end up with the reverse situation: We have a large number
of outputs and few, or maybe only a single output. In these situations one may want to use
reverse-mode automatic differentiation.

References

[1]
A. Griewank, A. Walther – Evaluating Derivatives – SIAM(2008)