
Backpropagating through QR decomposition

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/jekyll/update/2021/07/06/backpropagating-through-qr.html

One of the most useful decompositions of Linear Algebra is the QR decomposition.
This decomposition is particularly important when we are interested in the
vector space spanned by the columns of a matrix A. Formally, we write the QR
decomposition of \(A \in \mathbb{R}^{m \times n}\), where m ≥ n, as

\[\begin{align}
A = QR =
\left[
\begin{array}{c|c|c|c}
& & & \\
q_1 & q_2 & \cdots & q_n \\
& & &
\end{array}
\right]
%
\left[
\begin{array}{cccc}
r_{1,1} & r_{1,2} & \cdots & r_{1,n} \\
0 & r_{2,2} & \cdots & r_{2,n} \\
0 & 0 & \ddots & \vdots \\
0 & \cdots & & r_{n,n}
\end{array}
\right].
\end{align}\]

Here Q is an m-by-n matrix with orthonormal columns, i.e.
\(q_{i} \cdot q_{j} = \delta_{i,j}\), and R is an upper-triangular n-by-n matrix.
Using the decomposition, we can reconstruct the columns of A as

\[\begin{align}
a_1 & = r_{1,1} q_1 \\
a_2 & = r_{1,2} q_1 + r_{2,2} q_2 \\
& \cdots \\
a_n & = \sum_{j=1}^{n} r_{j,n} q_j.
\end{align}\]
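This column-wise reconstruction is easy to check numerically with Julia's built-in qr; the matrix size below is arbitrary:

```julia
using LinearAlgebra

A = rand(6, 3)
F = qr(A)
Q, R = Matrix(F.Q), F.R   # thin Q (6×3) and square upper-triangular R (3×3)

# a_2 = r_{1,2} q_1 + r_{2,2} q_2, as in the formula above
a2 = R[1, 2] * Q[:, 1] + R[2, 2] * Q[:, 2]
```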

Internally, Julia stores QR-factorized matrices in a packed format [1].
This format does not hold the matrices Q and R explicitly; instead, both are
encoded together in a single array plus a vector of Householder coefficients.
All matrix algebra and arithmetic is implemented through methods that expect
the packed format as input, and the QR decomposition itself calls the LAPACK
routine geqrt, which returns exactly this packed format.
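We can take a quick look at this: F.factors holds the packed representation, while F.Q and F.R unpack it on access. The matrix below is just a random example:

```julia
using LinearAlgebra

A = rand(4, 4)
F = qr(A)     # a LinearAlgebra.QRCompactWY factorization object
# Packed storage: R sits in the upper triangle of F.factors,
# the Householder reflectors are encoded below the diagonal
F.factors
# The factors are materialized on property access
Q, R = Matrix(F.Q), F.R
```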

A computational graph where one may want to backpropagate through a QR factorization
may look similar to this one:
Computational graph with QR factorization

The incoming gradients, \(\bar{f} \partial f / \partial Q\) and
\(\bar{f} \partial f / \partial R\), depend on \(f\), while the gradients of the QR
decomposition itself, \(\bar{Q} \partial Q / \partial A\) and \(\bar{R} \partial R / \partial A\), are defined through the factorization. These can in principle be
computed by an automatic differentiation framework, but it can be beneficial to
implement a pullback by hand. For one, this saves compilation time before first execution.
An additional benefit is lower memory use, as the gradients are propagated through
fewer functions. A hand-written pullback for the QR factorization may also aid numerical
stability, as fewer accumulations and propagations are performed.

Formulas for the pullback of the QR factorization are given in [2],
[3] and [4]. Pytorch for example implements the method described in [3], see here.

In this blog post, we implement the pullback for the QR factorization in Julia.
Specifically, we implement a so-called rrule for ChainRules.jl. While you are here, please take a moment and read
the documentation of this package. It is very well written and helped me tremendously
to understand how automatic differentiation works and is implemented in Julia.
Also, if you want to skip ahead, here is my example implementation of the
QR pullback for ChainRules.

Now, let’s take a look at the code and how to implement a custom rrule. The first
thing we need is the definition of the QR struct:

struct QR{T,S<:AbstractMatrix{T}} <: Factorization{T}
    factors::S
    τ::Vector{T}

    function QR{T,S}(factors, τ) where {T,S<:AbstractMatrix{T}}
        require_one_based_indexing(factors)
        new{T,S}(factors, τ)
    end
end

There are two fields in the structure, factors and τ. The matrices Q and R are
returned through an accompanying getproperty function:

function getproperty(F::QR, d::Symbol)
    m, n = size(F)
    if d === :R
        return triu!(getfield(F, :factors)[1:min(m,n), 1:n])
    elseif d === :Q
        return QRPackedQ(getfield(F, :factors), F.τ)
    else
        getfield(F, d)
    end
end

For the QR pullback we first need to implement a pullback for the getproperty
function. This pullback only propagates incoming gradients backwards. Incoming
gradients are described using a Tangent. This struct can only have fields that
are present in the primal type, in our case LinearAlgebra.QRCompactWY, which
stores the fields factors and T (the analogues of the factors and τ fields
discussed above). The incoming gradients are \(\bar{Q}\) and \(\bar{R}\), so the
pullback needs to map between these two representations. It can be implemented
like this:


function ChainRulesCore.rrule(::typeof(getproperty), F::LinearAlgebra.QRCompactWY, d::Symbol)
    function getproperty_qr_pullback(Ȳ)
        # Route the incoming gradient to the field it was extracted from:
        # Q̄ arrives through :Q and maps to factors, R̄ arrives through :R and maps to T
        ∂factors = if d === :Q
            Ȳ
        else
            nothing
        end

        ∂T = if d === :R
            Ȳ
        else
            nothing
        end

        ∂F = Tangent{LinearAlgebra.QRCompactWY}(; factors=∂factors, T=∂T)
        return (NoTangent(), ∂F, NoTangent())
    end

    return getproperty(F, d), getproperty_qr_pullback
end

Notice the call signature of the rrule. The first argument is always
::typeof(functionname), where functionname is the name of the function that
we want a custom rrule for. Following this argument are the actual arguments
that one normally passes to functionname.

Finally, we need to implement the pullback that actually performs the relevant
calculations. The typical way of implementing custom pullbacks with
ChainRules is to write a function that returns a tuple containing the result
of the forward pass as well as the pullback \(\mathcal{B}^{x}_{f}(\bar{y})\).
Doing it this way allows the pullback to re-use results from the forward pass:
defining the pullback as a closure inside the forward pass makes it easy to use
cached results. There is more discussion of the reasoning behind this design
choice in the ChainRules documentation.
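The pattern can be illustrated without any package machinery. Here is a dependency-free sketch for a toy function \(f(x) = x^2\); the names are mine, not part of ChainRules:

```julia
# Forward pass returns the primal value together with a pullback closure.
# The closure captures x from the forward pass, i.e. a cached result.
function square_with_pullback(x)
    y = x^2
    square_pullback(ȳ) = 2x * ȳ   # d(x²)/dx = 2x, scaled by the incoming gradient
    return y, square_pullback
end

y, back = square_with_pullback(3.0)
# y == 9.0, back(1.0) == 6.0
```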

Returning to the pullback for the QR factorization, here is a possible implementation:


function ChainRules.rrule(::typeof(qr), A::AbstractMatrix{T}) where {T}
    QR = qr(A)
    m, n = size(A)
    function qr_pullback(ȳ::Tangent)
        function qr_pullback_square_deep(Q̄, R̄, A, Q, R)
            M = R̄*R' - Q'*Q̄
            # M <- copyltu(M)
            M = triu(M) + transpose(triu(M, 1))
            Ā = (Q̄ + Q * M) / R'
        end
        Q̄ = ȳ.factors
        R̄ = ȳ.T
        Q = QR.Q
        R = QR.R
        if m ≥ n
            Q̄ = Q̄ isa ChainRules.AbstractZero ? Q̄ : @view Q̄[:, axes(Q, 2)]
            Ā = qr_pullback_square_deep(Q̄, R̄, A, Q, R)
        else
            # partition A = [X | Y]
            # X = A[1:m, 1:m]
            Y = A[1:m, m + 1:end]

            # partition R = [U | V], and we don't need V
            U = R[1:m, 1:m]
            if R̄ isa ChainRules.AbstractZero
                V̄ = zeros(size(Y))
                Q̄_prime = zeros(size(Q))
                Ū = R̄
            else
                # partition R̄ = [Ū | V̄]
                Ū = R̄[1:m, 1:m]
                V̄ = R̄[1:m, m + 1:end]
                Q̄_prime = Y * V̄'
            end

            Q̄_prime = Q̄ isa ChainRules.AbstractZero ? Q̄_prime : Q̄_prime + Q̄

            X̄ = qr_pullback_square_deep(Q̄_prime, Ū, A, Q, U)
            Ȳ = Q * V̄
            # partition Ā = [X̄ | Ȳ]
            Ā = [X̄ Ȳ]
        end
        return (NoTangent(), Ā)
    end
    return QR, qr_pullback
end

The part of the rrule that calculates the forward pass just computes the QR
decomposition and returns the result. The pullback consumes the incoming
gradients, \(\bar{R}\) and \(\bar{Q}\), which are stored as defined in the
pullback for getproperty. The calculation differs between the case where the
matrix A is square or tall and skinny, m ≥ n, and the wide case, m < n, but the
two cases can re-use some code. That shared part is the local function
qr_pullback_square_deep. The rest is mostly matrix slicing and the occasional
matrix product. The implementation borrows strongly from pytorch’s autograd
implementation.

Finally, I need to verify that my implementation calculates correct results. For
this, I define two small test functions which each work on exactly one output of the
QR factorization, either the matrix Q or the matrix R. Then I generate some random
input and compare the gradient calculated through reverse-mode AD in Zygote
to the gradient calculated by FiniteDifferences. If they are approximately equal,
I take the result to be correct.



using LinearAlgebra
using Zygote
using FiniteDifferences

V1 = rand(Float32, (4, 4));

function f1(V)
    Q, _ = qr(V)
    return sum(Q)
end

function f2(V)
    _, R = qr(V)
    return sum(R)
end


res1_V1_ad = Zygote.gradient(f1, V1)
res1_V1_fd = FiniteDifferences.grad(central_fdm(5, 1), f1, V1)
@assert res1_V1_ad[1] ≈ res1_V1_fd[1]

[1]
Schreiber et al. A Storage-Efficient WY Representation for Products of Householder Transformations

[2]
HJ Liao, JG Liu et al. Differentiable Programming Tensor Networks

[3]
M. Seeger, A. Hetzel et al. Auto-Differentiating Linear Algebra

[4]
Walter and Lehmann, 2018, Algorithmic Differentiation of Linear Algebra Functions with Application in Optimum Experimental Design

Machine-learned preconditioners – Part 2

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/jekyll/update/2021/04/09/nn-precond-2.html

This post is a follow-up from the last one. But now we are focusing on solving
a non-linear system of the form

\[F(x) = 0\]

using a quasi-Newton method. You find the definition on
Wikipedia. Basically it
is Newton’s method but you replace the Jacobian with an approximation.
In this blog post we are considering a simple two-dimensional system.
Using Newton’s method for this example is trivial and leads to a converged
solution after only two or three steps, since it converges quadratically.
A quasi-Newton method on the other hand converges only linearly. And we
aim to improve on that convergence behaviour by using a machine-learned
preconditioner. Now, preconditioners are usually used to accelerate convergence of
linear solvers, like GMRES. And I don’t claim that they are generally useful for
non-linear systems. So we do it just to see if this works in principle.

Quasi-Newton iteration

We consider the simple two-dimensional system

\[\begin{align}
F_1(x_1, x_2) = x_1 + \alpha x_1 x_2 - s_1 & = 0 \\
F_2(x_1, x_2) = \beta x_1 x_2 + x_2 - s_2 & = 0
\end{align}\]

where \(\alpha\) and \(\beta\) are of the order 0.01 and the source terms are
\(s_{1,2} \sim \mathcal{N}(1, 10^{-2})\). Picking the source terms from a
Normal distribution allows us to consider a large number of instances of this
problem. Later we will use this to generate training and test sets.

Newton’s method can be used to find the root of \(F(x)\). Starting from an
initial guess \(x_0\), it updates the guess using the linearization
\(F(x_0 + \delta x) \approx F(x_0) + J(x) \delta x \approx 0\). Here
\(J(x)\) is the Jacobian matrix, whose entries are defined as
\((J(x))_{ij} = \partial F_i / \partial x_j\).
Solving for \(\delta x\) gives the update \(\delta x = -J^{-1}(x) F(x)\). The
Jacobian matrix of our system can be easily calculated. For the quasi-Newton
method we consider the approximation

\[\widetilde{J}(x) = \begin{pmatrix}
1 + \alpha x_2 & 1 \\
1 & 1 + \beta x_1
\end{pmatrix}\]

where the off-diagonal elements of the original Jacobian,
\(\partial F_1 / \partial x_2 = \alpha x_1\) and
\(\partial F_2 / \partial x_1 = \beta x_2\), have been replaced by 1.
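As a quick sanity check, the exact Jacobian together with a plain Newton iteration solves an instance of this system in a handful of steps. The values of \(\alpha\), \(\beta\), \(s\), and the starting point below are illustrative:

```julia
using LinearAlgebra

α, β = 0.01, 0.01
s = [1.0, 1.0]
F(x) = [x[1] + α*x[1]*x[2] - s[1],
        β*x[1]*x[2] + x[2] - s[2]]
# Exact Jacobian, (J)_{ij} = ∂F_i/∂x_j
J(x) = [1 + α*x[2]  α*x[1];
        β*x[2]      1 + β*x[1]]

function newton_solve(F, J, x; nsteps=5)
    for _ in 1:nsteps
        x = x - J(x) \ F(x)   # full Newton step
    end
    return x
end

x = newton_solve(F, J, [0.5, 0.5])
# norm(F(x)) is at machine precision after a handful of steps
```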

A quasi-Newton method aims to find the root of \(F(x)=0\) using the same iteration
scheme, but with the approximate Jacobian instead:
\(\begin{align}
x_{k+1} & = x_k - \widetilde{J}^{-1}(x_k) F(x_k).
\end{align}\)

If you use the true Jacobian you have Newton’s method and the rate of
convergence is quadratic. If you don’t, you have a quasi-Newton method and the
rate of convergence is linear. How can we accelerate convergence in this
case?

To accelerate convergence, we can use a preconditioner \(P^{-1}\). This is
an invertible matrix that we squeeze into the update by inserting a one:

\[\begin{align}
J(x_k) \delta x_k &= -F(x_k) \Leftrightarrow \\
\left( J(x_k)P^{-1} \right) \left( P \delta x_k \right) &= -F(x_k)
\end{align}\]

The update from \(x_k \rightarrow x_{k+1}\) then needs to be calculated in
three steps.

  • Calculate \(\delta w = \widetilde{J}^{-1} P^{-1} F(x_k)\)
  • Calculate \(\delta x_{k} = P^{-1} \delta w\).
  • Update \(x_{k+1} = x_{k} - \delta x_k\).

That is, we apply \(P^{-1}\) twice. A closed form of \(P\) is not required.
Next we will see how to implement \(P^{-1}\) as a neural network and
how to train it using differentiable code.

Neural-Network preconditioner

We parameterize the preconditioner with a Multilayer perceptron as
\(P^{-1}_{\theta}: \mathbb{R}^{m} \rightarrow \mathbb{R}^2\):

using Flux
dim = 2
Pinv = Chain(Dense(2*dim + 2, 50, celu), Dense(50, 50, celu), Dense(50, dim))

Here we allow for \(m > 2\) to pass additional inputs to the Network. We are
also using CeLU activation functions to avoid having a non-differentiable point
at \(x=0\) and we are choosing a simple 3-layer feed-forward architecture to start
out with.

While this gives us a parameterization, we now have to find out how to train
the network. The task of the preconditioner is to minimize the residual
after \(n_\mathrm{i}\) iterations of Newton’s method. The following loss
function tells us how well \(P^{-1}\) is doing over a number of examples
\(F(x) = 0\), where the RHS terms \(s_1, s_2 \sim \mathcal{N}\left(2, 0.01\right)\).

sqnorm(x) = sum(abs2, x) 


function loss2(batch)
    # batch is a 2d-array. dim1: vector component. dim2: index within the batch
    # Loss of the current sample. Defined as the norm of f(x) after nᵢ Newton iterations
    loss_local = 0.0
    # Batch-size
    batch_size = size(batch)[2]
    # Iterate over source vectors in the current mini-batch
    for sidx in 1:batch_size
        # Grab the current source terms
        s = batch[:, sidx]
        # Define a RHS
        function f(x)
            F = Zygote.Buffer(zeros(dim))
            F[1] = x[1] + α*x[1]*x[2] - s[1]
            F[2] = β * x[1] * x[2] + x[2] - s[2]
            return copy(F)
        end

        # Run one step of the preconditioned iteration for the current RHS
        sol = picard(f, x0, Pinv, s, 1)
        loss_local += norm(f(sol))
    end

    return log(loss_local) / batch_size
end

One has to be careful when evaluating the preconditioner within the
Picard iteration. After each iteration, the input \(f(x)\) decreases by
an order of magnitude. Neural networks, on the other hand, work best if the
input is of order one. Thus we need to find a way to provide order-one input
to \(P^{-1}\) at each iteration. In the routine below the input is

  • \(f(x) / \vert f(x) \vert_{2}\) – Input \(f(x)\), scaled to its L² norm
  • \(s\) – The source terms in \(F(x) = 0\).
  • \(\log \vert f(x) \vert\) – Gives the order of magnitude of the input
  • \(i\) – The current iteration

The first term represents the desired input, re-scaled to order unity, and
the logarithm \(\log \vert f(x) \vert\) encodes the order of magnitude of this
term. For example, \(\vert f(x) \vert = 10^{-5}\) gives \(\log \vert f(x) \vert \approx -11.5\). I also pass the source term \(s\) and the
current Picard iteration \(i\) into \(P^{-1}\).
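A small numerical example of this scaling trick; the residual value below is made up:

```julia
using LinearAlgebra

f_val = [3.0e-6, -4.0e-6]          # a residual that has already shrunk a lot
direction = f_val / norm(f_val)    # order-one input for the network: [0.6, -0.8]
magnitude = log(norm(f_val))       # ≈ -12.2, carries the lost scale information
```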

function picard(f, x, P⁻¹, s, num_steps)
# Run num_steps of the preconditioned quasi-Newton iteration
# P⁻¹: Preconditioner
# s: source terms in F(x) = 0

    for i = 1:num_steps
        # P⁻¹ receives the following inputs
        # f(x) / norm(f(x)) : Scaled input
        # s                 : source term
        # log(norm(f(x)))   : Order of magnitude of f(x)
        # i                 : Iteration number I
        δw = pinv(J(x)) * P⁻¹(vcat(f(x) / norm(f(x)), s, log(norm(f(x))), i))
        δx = P⁻¹(vcat(δw / norm(δw), s, log(norm(δw)), i))
        x = x - δx
    end
    return x
end

I am generating 100 training samples which I divide into 10 mini-batches
for training. Similarly, the test set is 100 examples. I don’t use a separate
validation set; instead I monitor the loss on the training set during training.
If it has not decreased for 5 epochs, I halve the learning rate.

In order to optimize the neural network so that it gives a smaller residual
after one iteration, one needs to calculate the gradients of the loss
with respect to the parameters of the neural network. Zygote allows us
to do this with a single call to gradient:

   
# Start training loop
loss_vec = zeros(num_epochs)
η_vec = zeros(num_epochs)
# Halve the learning rate after this many epochs without improvement
stall_thresh = 5

for epoch in 1:num_epochs
    # Randomly shuffle the training examples
    batch_idx = shuffle(1:num_train)
    for batch in 1:num_batch
        # Get random indices for the current batch
        this_batch = batch_idx[(batch - 1) * batch_size + 1:batch * batch_size]
        grads = Flux.gradient(Flux.params(Pinv)) do
            loss2(source_train[:, this_batch])
        end
        for p in Flux.params(Pinv)
            # Gradient descent step: update! subtracts its second argument from p
            Flux.Optimise.update!(p, η * grads[p])
        end
        Zygote.ignore() do 
            loss_vec[epoch] = loss2(source_train[:, this_batch])
            η_vec[epoch] = η
            if mod(epoch, 10) == 0
                println("Epoch $(epoch) loss $(loss_vec[epoch])  η $(η)")
            end
        end
    end

    if (epoch > stall_thresh) && (loss_vec[epoch] > mean(loss_vec[epoch - stall_thresh:epoch]))
        global η *= 0.5
        println("    new η = $(η)   ")
    end
    if η < 1e-10
        break 
    end
end

Training the preconditioner is a bit tricky. I had to play quite a bit with
the layout of the network, the learning rate, and the batch size to get useful
results. One setup that works is

x0 = [1.9; 2.2]
α = 0.03
β = -0.07

Pinv = Chain(Dense(2*dim + 2, 100, celu), Dense(100, 200, celu), Dense(200, 100, celu), Dense(100, dim))

num_train = 100
num_epochs = 100
num_batch = 10
num_test = 100

Let’s look at the loss on the training set:

Training loss

For the first few epochs, there is no significant training progress. But around
epochs 10 and 20, there are significant drops. After epoch 20, training proceeds
uniformly and the loss flattens out around epoch 60. Learning-rate scheduling
is essential here; the rate is halved around the steep drops. Finally we end up with
an L² loss of about exp(-4) ≈ 0.02 per sample.

We can evaluate the performance of the learned preconditioner by testing
it on unseen examples:

for idx in 1:num_test
    # Load the source vector from the test set
    s = source_test[:, idx]
    # Define the non-linear system with the current source vector
    function f(x)
        F = zeros(dim)
        F[1] = x[1] + α * x[1] * x[2] - s[1]
        F[2] = β * x[1] * x[2] + x[2] - s[2]
        return F
    end

    # Run 1 iteration with ML preconditioner, store solution after 1 iteration in x1_ml
    x1_ml = newton(f, x0, P⁻¹, s, 1)
    # Run 1 iteration without preconditioner, store solution after 1 iteration in x1_no
    x1_no = newton_no(f, x0, 1)

    # Run 5 more newton iterations, starting at x1_ml
    vsteps_ml = map(n -> norm(f(newton_no(f, x1_ml, n))), 0:5)
    # Run 5 more newton iterations, starting at x1_no
    vsteps_no = map(n -> norm(f(newton_no(f, x1_no, n))), 0:5)
end

The first iteration is performed using the preconditioner, followed by 5
iterations without preconditioning. This is compared to the norm of the
residuals over 6 iterations performed entirely without the preconditioner.
The plot below shows the convergence history of 100 test samples. Green lines
show ML-preconditioned iterations, red lines show the iteration history without
the preconditioner.

Residual Loss

We find that the neural network can act as a preconditioner in the first iteration.
The residual norm after one iteration shows large scatter, spanning almost two
orders of magnitude. After the accelerated initial step, the iteration continues
with a linear rate of convergence for all test samples, the same rate as the
un-preconditioned samples shown in red.

In summary, yes, we can use a machine-learned preconditioner to accelerate
Picard iteration for solving non-linear systems. It is a bit tricky to train and
I only showed how to precondition the first iteration, so there are still many
details that have not been explored. I would also like to add that for the case
here, Newton’s method is the way to go. It has a quadratic rate of convergence,
much faster than the linear rate of Picard iteration; for the samples here,
residuals reach machine precision after about 3 to 4 iterations. But if Newton’s
method is not an option, a neural network can accelerate your Picard iteration.

Machine-learned preconditioners – Part 1

By: Random blog posts about machine learning in Julia

Re-posted from: https://rkube.github.io/jekyll/update/2021/03/02/neural-networks-and-iterative-solvers.html

What is a preconditioner?

A common task in scientific computing is to solve a system of linear equations

\[Ax = b\]

with a matrix \(A \in \mathbb{R}^{d \times d}\) that gives the coefficients of the
system, a right-hand-side vector \(b \in \mathbb{R}^{d}\),
and a solution vector \(x \in \mathbb{R}^{d}\).

Instead of solving the linear problem directly, one often solves a
preconditioned problem. We obtain this problem by inserting a one into the
linear system and re-parenthesizing:

\(\left( A P^{-1} \right) \left( P x \right) = b\)
In this equation we have changed the coefficient matrix from \(A\) to
\(A P^{-1}\) and the solution vector is now \(y = P x\) instead of \(x\). To
retrieve the original vector, simply calculate \(x = P^{-1}y\). The matrix
\(P^{-1}\) is called the preconditioner.
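The bookkeeping can be checked with a tiny example; the system and the invertible \(P\) below are arbitrary:

```julia
using LinearAlgebra

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
P = [2.0 0.0; 0.0 5.0]   # any invertible matrix can serve as a preconditioner

y = (A * inv(P)) \ b     # solve the preconditioned system (A P⁻¹) y = b
x = inv(P) * y           # retrieve the original solution x = P⁻¹ y
```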

The dependence of \(P^{-1}\) on \(x\), introduced below, is by choice. There is no deeper reason
why the preconditioner should depend on the initial guess for the iteration.
We include this dependency here to foreshadow later applications, where
such a dependency may be useful. In practice, this choice is rather limiting:
since the matrix \(A_0 P^{-1}\) needs to be positive definite, and by construction
only the bias term \(\widetilde{b}\), defined below, is positive definite, the method as
constructed here is only valid for \(x \approx 0\).

The Gauss-Seidel method is an iterative method to find the solution of
a linear system. Given an initial guess \(x_0\), update the entries of this
vector following an iterative scheme and after some, or many, iterations, \(x_0\)
solves the equation \(A x_0 = b\).
Wikipedia has a nice
and instructive page on this scheme.

A good preconditioner gives you a converged solution after fewer iterations.
In other words, with a good preconditioner you are closer to the true solution
of the linear system after N iterations than you are with a bad preconditioner.
Choosing a good preconditioner depends on the problem at hand and can become
a dark art. Let’s make it even more dark and train a neural network to be
a preconditioner.

Modeling preconditioner as a neural network

As a first step, let’s model \(P^{-1}\) as a multi-layer perceptron
(technically a single-layer):

\[P^{-1} = \sigma \left( W x + \widetilde{b} \right)\]

with a weight matrix \(W \in \mathcal{R}^{d^2 \times d}\), a bias
\(\widetilde{b} \in \mathcal{R}^{d^2}\), and a ReLU \(\sigma\). With
\(x \in \mathcal{R}^{d}\) the matrix dimensions are chosen such that \(P^{-1}\)
is of dimension \(d^2\). Simply reshape it to \(d \times d\) to have it act like
a matrix.

Here I’m discussing a simple test case and am working with the Gauss-Seidel
iterative solver. For real problems one would probably use a different iterative
algorithm, but it serves as a proof-of-concept. Anyway, for Gauss-Seidel to work,
the system coefficient matrix \(AP^{-1}\) needs to be positive-definite.
How do we achieve this for our neural network? A simple hack is to consider only
vectors whose entries are close to zero.

That way we get away with requiring only \(\widetilde{b}\) to be
positive semi-definite. We get this property by letting
\(\widetilde{b} = b_0 b_0^{T}\) and sampling the entries of \(b_0\) as
\(b_{0,ij} \sim \mathcal{N}(0, 1)\). In a similar fashion, we sample the entries
of the weight matrix \(W\) as \(w_{i,j} \sim \mathcal{N}(0,1)\).
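That a Gram matrix \(b_0 b_0^T\) is positive semi-definite is easy to verify numerically:

```julia
using LinearAlgebra
using Random

Random.seed!(0)
b0 = randn(4, 4)
btilde = b0 * b0'   # Gram matrix: x' btilde x = |b0' x|² ≥ 0 for every x
# All eigenvalues are non-negative (up to round-off)
eigvals(Symmetric(btilde))
```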

How can we train a preconditioner?

To make this preconditioner useful, the weights \(W\) and bias term \(b\) need
to be optimized such that the residual after \(N\) iterations is as small
as possible. And this needs to be true for all vectors from a training set.
Using automatic differentiation, we can train \(W\) and \(b\) using gradient
descent like this:

  • Pick an initial guess \(x_0\) with \(x_{0,i} \sim \mathcal{N}(0, 0.01)\)
  • Build the preconditioner \(P^{-1} = \sigma(W, x_0, \widetilde{b})\)
  • Calculate 5 Gauss-Seidel steps. Let’s call the solution here \(y_5\).
  • Un-apply the preconditioner: \(x_5 = P^{-1} y_5\)
  • Calculate the distance to the true solution \(\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( x_{\text{true}, i} – x_{5,i} \right)^2\)
  • Calculate the gradients \(\nabla_{W} \mathcal{L}\) and
    \(\nabla_{B} \mathcal{L}\)
  • Update the weight matrix and bias using gradient descent: \(W \leftarrow W – \alpha \nabla_{W} \mathcal{L}\) and \(b \leftarrow b – \alpha \nabla_{b} \mathcal{L}\). Here \(\alpha\) is the learning rate

The approach here is to directly back-propagate from the loss function \(\mathcal{L}\), through
the numerical solver, to the parameters of the preconditioner \(W\) and \(b\).
We are not working with an offline training and test-data set, but the data is
taken directly from the numerical calculations. This way we directly capture the
reaction of the numerical solver to updates proposed by gradient descent.

The derivatives \(\nabla_\theta \mathcal{L}\) can be calculated using automatic differentiation
packages, such as Zygote.

Implementation


# Test differentiation through control flow

# Use an iterative Gauss-Seidel solver with a preconditioner

using Random
using Zygote
using LinearAlgebra
using Distributions
using NNlib

I copy-and-pasted the Gauss-Seidel code from Wikipedia:

function gauss_seidel(A, b, x, niter)
    # This is from https://en.wikipedia.org/wiki/Gauss%E2%80%93Seidel_method
    x_int = Zygote.Buffer(x)
    x_int[:] = x[:]
    for n  1:niter
        for j  1:size(A, 1)
            x_int[j] = (b[j] - A[j, :]' * x_int[:] + A[j, j] * x_int[j]) / A[j, j]
        end
    end
    return copy(x_int)
end
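The Zygote.Buffer is only needed so that Zygote can differentiate through the in-place updates. Outside of AD, the same update rule can be checked with a plain array, using the same example system that is defined further below:

```julia
using LinearAlgebra

# Plain (non-AD) Gauss-Seidel with the same update rule as above
function gauss_seidel_plain(A, b, x, niter)
    x = copy(x)
    for n in 1:niter
        for j in 1:size(A, 1)
            # Sweep: each x[j] uses the already-updated entries x[1:j-1]
            x[j] = (b[j] - A[j, :]' * x + A[j, j] * x[j]) / A[j, j]
        end
    end
    return x
end

A = [10.0 -1.0 2.0 0.0; -1.0 11.0 -1.0 3.0; 2.0 -1.0 10.0 -1.0; 0.0 3.0 -1.0 8.0]
b = [6.0; 25.0; -11.0; 15.0]
x = gauss_seidel_plain(A, b, zeros(4), 25)
# x converges to the known solution [1.0, 2.0, -1.0, 1.0]
```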

The block below sets things up.


Random.seed!(1)
dim = 4

# Define a matrix
A0 = [10.0 -1.0 2.0 0.0; -1 11 -1 3; 2 -1 10 -1; 0 3 -1 8]
# Define the RHS
b0 = [6.0; 25.0; -11.0; 15.0]
# This is the true solution
x_true = [1.0, 2.0, -1.0, 1.0]

# Define the size of the training and test set
N_train = 100
N_test = 10

# Define an initial state. Draw this from a narrow distribution around zero
# We need to do this so that the preconditioner eigenvalues are positive
x0 = rand(Normal(0.0, 0.01), (N_train, dim))

# Define an MLP. This will later be our preconditioner
# The output has length dim*dim and is reshaped into a dim-by-dim matrix
W = rand(dim*dim, dim)
# For Gauss-Seidel to work, the matrix A*P⁻¹ needs to be positive semi-definite.
bmat = rand(dim, dim)
# We know that any matrix A*A' is positive semi-definite
bvec = reshape(bmat * bmat', (dim * dim))
# Now Wx + bvec is positive semi-definite if x is very small
P(x, W, b) = NNlib.relu.(reshape(W*x .+ b, (dim, dim)))
# Positive-definite means positive eigenvalues. We should check this.
@show eigvals(A0*P(x0[1, :], W, bvec))

This function will serve as our loss function. It evaluates the NN preconditioner and then runs
some Gauss-Seidel iterations. Finally, the iterative solution
approximation is transformed back by applying \(P^{-1}\).

function loss_fun(W, bmat, A0, b0, y0, niter=5)
    # W - Weight matrix for NN-preconditioner
    # bmat - Bias vector for NN preconditioner
    # A0: Linear system coefficient matrix
    # b0: RHS of linear system
    # y0 - initial guess for Linear system. Strictly, this is x0. But we call it the same
    # assuming that P⁻¹x0 = x0.
    # niter - Number of Gauss-Seidel iterations to perform
    
    loss = 0.0
    nsamples = size(y0)[1]
    
    for idx ∈ 1:nsamples
        # Evaluate the preconditioner
        P⁻¹ = P(y0[idx, :], W, reshape(bmat * bmat', (dim * dim)))
        # Initial guess
        # Now we solve A(Px)⁻¹y = rhs for y with 3 Gauss-Seidel iterations
        y_sol = gauss_seidel(A0 * P⁻¹, b0, y0[idx, :], niter)
        # And reconstruct x
        x = P⁻¹ * y_sol
        loss += norm(x - x_true) / length(x)
    end

    return loss / nsamples
end

Now comes the fun part. Zygote calculates the partial derivatives of the
loss function with respect to its input, in our case \(W\) and \(b\).
Given the gradients, we can actually update \(W\) and \(b\).

# Number of epochs to train
num_epochs = 10
loss_arr = zeros(num_epochs)
# Store the Weight and bias matrix at each iteration
W_arr = zeros((size(W)..., num_epochs))
bmat_arr = zeros((size(bmat)..., num_epochs))
# Learning rate
α = 0.005

for epoch ∈ 1:num_epochs
    loss_arr[epoch], grad = Zygote.pullback(loss_fun, W, bmat, A0, b0, x0)
    res = grad(1.0)
    # Gradient descent
    global W -= α * res[1]
    global bmat -= α * res[2]
    # Store W and b
    W_arr[:, :, epoch] = W
    bmat_arr[:, :, epoch] = bmat
end

Finally, let’s evaluate the performance

# This functions returns a vector with the residual of the iterative
# solution to the true solution at each step
function eval_performance(W, bmat, A0, y0, b0, niter=20)
    # Instantiate the preconditioner with the initial guess
    P⁻¹ = P(y0, W, reshape(bmat * bmat', (dim * dim)))
    y_sol = copy(y0)
    loss_vec = zeros(niter)
    for n ∈ 1:niter
        # Advance the preconditioned system (A P⁻¹) y = b by one Gauss-Seidel step
        y_sol = gauss_seidel(A0 * P⁻¹, b0, y_sol, 1)
        # And reconstruct x
        x = P⁻¹ * y_sol
        loss_vec[n] = norm(x - x_true) / length(x)
    end

    return loss_vec
end

# Get the residual at each Gauss-Seidel iteration
sol_err = zeros(20, num_epochs)
for i ∈ 1:num_epochs
    # Calculate the residual at each iteration, averaged over the test data
    sol_here = zeros(20)
    for idx ∈ 1:N_test
        sol_here[:] += eval_performance(W_arr[:, :, i], bmat_arr[:, :, i], A0, x0[idx, :], b0)
    end
    sol_err[:, i] = sol_here[:] / N_test
end

Results

A plot says many words, so here we go

Trained preconditioner

The plot shows the average residual to the true solution vector as a function of
Gauss-Seidel iterations. Training for 1 epoch, the residual decreases as a power law
over all 20 GS iterations. The longer we train the preconditioner, the faster
the residual decreases. Remember that we trained for only 5 iterations. But in the plot
we see that the preconditioned GS scheme proceeds at an accelerated rate of convergence,
even after the fifth iteration. So for this toy example, the NN preconditioner performs
quite well.

Finally, a word on what we learn. Since we only take small, non-zero vectors, we update mostly
the bias term and not the weight matrix. We can verify this in the code:

julia> (W_arr[:, :, end] - W_arr[:, :, 1])
16×4 Array{Float64,2}:
  9.2298e-7    3.56583e-6    9.63813e-6   -3.62526e-6
  7.02612e-6   1.83826e-6    4.87793e-6    1.06628e-5
 -1.18359e-5  -5.52383e-6   -1.00905e-5   -2.49149e-5
  4.2548e-6   -4.93141e-7   -5.37068e-6    1.46914e-5
 -4.80919e-6  -1.97348e-5   -5.14656e-5    1.82806e-5
 -2.99721e-5  -1.85244e-5   -4.37366e-5   -3.33036e-5
  5.28254e-5   5.17799e-5    0.000115429   8.27112e-5
 -1.70783e-5  -1.11531e-5   -8.03195e-6   -5.73307e-5
  6.09361e-6   3.89891e-5    8.93109e-5   -3.54379e-5
  5.58404e-5   5.60942e-5    0.000112264   6.67546e-5
 -8.3024e-5   -0.000129324  -0.00020506   -0.000139751
  2.75378e-5   3.47865e-5    1.79561e-5    0.000101882
 -2.89687e-6  -2.14357e-5   -4.965e-5      2.24125e-5
 -3.26029e-5  -3.11671e-5   -5.59135e-5   -4.08413e-5
  5.10635e-5   7.48976e-5    0.000101988   9.20182e-5
 -1.64133e-5  -1.88765e-5   -3.41413e-6   -6.0346e-5

The average matrix element of W has changed only a little during learning, on average by
about 0.00001. Looking at how much the elements of b have changed during training,

julia> (bmat_arr[:, :, end]*bmat_arr[:, :, 1]') - (bmat_arr[:, :, 1]*bmat_arr[:, :, 1]')
4×4 Array{Float64,2}:
 -0.000674214   0.00163171   0.0012703    0.000217951
  0.00631572    0.00148005  -0.0105803    0.00372173
 -0.050246     -0.0453774   -0.0181303   -0.0511274
  0.011839      0.0128648   -0.00398132   0.00899787

we find that they have changed more, on average by a factor of 100 more than entries of W.
But as discussed earlier, including x0 in the preconditioner is a modeling choice which
one does not have to make.

Conclusions

To summarize, we proposed to use a Neural Network to accelerate an iterative solver
for linear systems by acting as a preconditioner matrix. We propose that the weights
of the neural network can be optimized by automatic differentiation in reverse mode.
By putting the solver in the loop, the training and inference steps couple to the
simulation in a very simple way.

One drawback of the method as written here is that we are limiting ourselves to initial
guesses \(x \approx 0\). This is due to the requirement of the Gauss-Seidel scheme that the
linear system is positive definite. In more practical settings this can be circumvented
either by passing different parameters than \(x\) to \(P\), or by using an iterative
solver that doesn’t pose such restrictions, such as Jacobian-free Newton-Krylov
methods or GMRES.