Mutability, scope, and separation of concerns in library code


Re-posted from: https://matbesancon.xyz/post/2021-12-11-mutability-library/

It has been about a year since I joined the Zuse Institute
to work on optimization methods and computation.
One of the key projects of the first half of 2021 has been on building up
FrankWolfe.jl,
a framework for nonlinear optimization in Julia using Frank-Wolfe
methods. You can find a paper introducing the package here.
This was an opportunity to experiment with different design choices
for efficient, scalable, and flexible optimization tools
while keeping the code simple to read and close to the algorithms.

I will walk through a few of the roads we explored, experimenting with what reads and works best to achieve these goals.


No mutability

Probably the simplest pattern to follow. It is also a good one when
the created objects are light, ideally stack-allocated.

This is typically the code you would write to get the leanest version
of the Frank-Wolfe algorithm:

x = initialize(feasible_set)
terminated = false
while !terminated
    direction = grad(x)                                 # gradient at the current iterate
    v = compute_extreme_point(feasible_set, direction)  # linear minimization oracle
    γ = find_stepsize(stepsize_strategy, x, v)          # step-size rule
    x = x + γ * (v - x)                                 # convex combination update
    terminated = check_termination(x)
end

This program is highly generic: users can define their own
grad function, and typically implement
compute_extreme_point and find_stepsize methods for their custom
feasible set and step-size strategy types.
If you push it further, you can use a custom abstract vector type
for x and v; not necessarily a vector in the programming sense,
since you can use unusual vector spaces
as long as addition and scaling are defined.
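
For instance, supporting a new feasible set only requires implementing the linear oracle for it. Here is a minimal sketch, with a hypothetical L1Ball type and following the compute_extreme_point name from the loop above: the vertex minimizing ⟨direction, v⟩ over an ℓ1 ball is supported on the coordinate with the largest absolute entry of the direction.

# hypothetical feasible set: an ℓ1-norm ball of radius τ
struct L1Ball
    τ::Float64
end

# linear minimization oracle: the vertex of the ball minimizing ⟨direction, v⟩
function compute_extreme_point(ball::L1Ball, direction)
    i = argmax(abs.(direction))
    v = zero(direction)
    v[i] = -ball.τ * sign(direction[i])
    return v
end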

What would be the problem, then? If you have seen high-performance code
before, you are probably screaming at the allocations happening all over the place. Every line allocates a new object in memory,
first the direction, then the extreme point v.
The worst might be the x update step, which allocates three vectors
because of the intermediate expressions.

If you come from another performance-enabling programming environment
(Fortran, C, C++, Rust), what I am saying is probably obvious.
If you come from interpreted languages like Python or R, you may wonder why one should bother about these allocations. If you do not need performance, then indeed maybe you shouldn't, but when developing a library, users will probably expect not
to have to rewrite your code for a larger-scale use case.
Also, these interpreted languages are typically slow across the board
when performing operations in the language itself instead of moving them
to external kernels written in a compiled language (or getting lucky with Numba).
In Julia, operations will typically be as fast as they can get if you pay
attention to a few minor things, so the bottleneck quickly becomes
the allocation of large objects.
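
To make this concrete, here is a small check (not from the package) comparing the naive update with a fused in-place version, assuming plain Vector{Float64} iterates:

using BenchmarkTools

x = rand(10_000); v = rand(10_000); γ = 0.5
@btime $x + $γ * ($v - $x);      # allocates three temporary vectors
@btime $x .+= $γ .* ($v .- $x);  # fused, in-place broadcast: zero allocations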
The other objection people may raise is that it is the role of the compiler
to take high-level expressions and reformulate them to avoid allocations.
This is a common argument in some functional programming circles:
everything is immutable because the compiler will figure everything out.
To some extent this is true, of course, but pushing too much program
transformation onto the compiler introduces complexity
for all users, not just the ones focusing on performance.
You may typically get bitten by iterator methods (filter, map)
in Rust yielding a result of a custom type, which changes if
a type annotation is given first.
Without this type annotation, when expecting a consistent type
to be inferred, one can get an error complaining about a weird
type generated by the chaining of all operations.
Finally, pushing this onto the compiler means expecting it to optimize
your code consistently and always in the way you would, because in
most cases "overriding" the compiler behaviour is far from trivial,
and even verifying the decisions the compiler took requires inspecting
the lowered emitted code (down to LLVM IR or assembly).

Finally, worrying about the performance of the inner loop is also a consequence
of the nature of the algorithm itself: Frank-Wolfe, as is typical for
first-order methods, performs a lot of iterations that are relatively
cheap, as opposed to, say, interior point methods, which typically
converge in few iterations but with each of them doing significant
work. In the latter case, allocating a few vectors might be fine because
linear algebra will dominate the runtime, but not in Frank-Wolfe, where each
individual operation is relatively cheap compared to allocations.

Passing containers

This is the typical style of C function signatures: almost all
heap-allocated containers are received as arguments.
A typical example would be replacing the gradient computation with:

grad!(storage, x)

which computes the gradient at x in place, in the storage container.
Note the !, which is just a Julia naming convention indicating a function that mutates
one of its arguments. Adding storage arguments to function calls is also
used in Optim.jl
for the definition of a gradient or Hessian, or in
DifferentialEquations.jl to pass the function describing the system dynamics.
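
A minimal, self-contained sketch of the resulting usage (the objective here is a hypothetical ½‖x‖², whose gradient is x): the storage is allocated once and then reused across iterations.

# in-place gradient of the toy objective f(x) = ‖x‖² / 2
grad!(storage, x) = (storage .= x)

x = rand(100)
gradient = similar(x)    # allocated once, outside the hot loop
for iter in 1:10
    grad!(gradient, x)   # overwrites `gradient` in place, no per-iteration allocation
end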

This has the strong advantage of letting users reduce
their memory consumption and runtime. It also means that composing calls
can be made performant: typically, library developers pay attention
to hot loops, which they spend more time optimizing. But what if your main
top-level algorithm is someone else's hot loop? Then they need to be able
to control that setup cost in some way.

Why not use these additional arguments and in-place notation everywhere, then?

Consider the same gradient computation with multiple intermediate expressions that must be held in different structures: where should one keep that storage?
Adding more storage arguments:

grad!(storage1, storage2, storage3, x)

means users would need to implement a method with a very large number of arguments
that they would rarely use.
Remember that we cannot just bundle all the storage containers into one,
because the "main" one is supposed to contain the actual gradient
value at x in the end.
Alternatively, all additional storage elements could be passed as keyword arguments, but that also quickly makes for obscure signatures.
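
As a sketch of that keyword-argument variant (the signature and body are deliberately artificial), note that the allocating defaults defeat the purpose unless callers remember to override every buffer:

# hypothetical gradient needing two scratch buffers besides the result `storage`
function grad!(storage, x; buffer1 = similar(x), buffer2 = similar(x))
    buffer1 .= 2 .* x          # first intermediate result
    buffer2 .= buffer1 .* x    # second intermediate result
    storage .= buffer2         # final value written into the main storage
    return storage
end

x = rand(5)
storage, b1, b2 = similar(x), similar(x), similar(x)
grad!(storage, x; buffer1 = b1, buffer2 = b2)  # allocation-free only if both buffers are passed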

Dedicated workspace

This is the approach taken in Nonconvex.jl: all temporary containers required by an algorithm are gathered
in a type specific to that algorithm:

struct MyAlgorithm end

# workspace bundling all temporary containers the algorithm needs
mutable struct MyAlgorithmWorkspace
    v::Vector{Float64}
end

# builds a default workspace matching the shape of the starting point
prepare_workspace(::MyAlgorithm, x) = MyAlgorithmWorkspace(similar(x))

function run_algorithm(alg, x0; workspace = prepare_workspace(alg, x0))
    # compute_step! stands in for the in-place work of the algorithm
    compute_step!(workspace.v, x0)
end

This pattern avoids the monstrous signature with 10+ arguments
that are not "real" inputs of the function in a mathematical sense;
it gives most users a good default by letting the keyword argument initialize itself,
while allowing more advanced users to pass down the workspace if required.
The workspace of a given algorithm can also contain multiple fields
of different types without requiring changes to the other algorithms.
This is exactly the path experimented with for the step-size computation
in FrankWolfe.jl#259.
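
A usage sketch, reusing the hypothetical types from the snippet above (and assuming a compute_step! method exists): most users rely on the default keyword, while advanced users allocate the workspace once and reuse it.

alg = MyAlgorithm()
x0 = rand(100)

# default path: the workspace is created behind the scenes
run_algorithm(alg, x0)

# advanced path: build the workspace once and reuse it across repeated runs
ws = prepare_workspace(alg, x0)
for _ in 1:10
    run_algorithm(alg, x0; workspace = ws)
end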

Functors

Sadly, the workspace pattern is not a silver bullet, even if it covers a lot of cases.
What if one needs not only some internal workspace, but also to return a large object?

The Frank-Wolfe algorithm is also composed of different building blocks: gradient computation,
linear minimization oracle, step size. Should we have a dedicated workspace for each of them?
That would also be a load placed on all advanced users defining a new component;
it is acceptable for the oracles, which typically have a dedicated type, but it quickly becomes
awkward for something like the objective function.
Would we expect users to write something like the following?

using LinearAlgebra
f(x, workspace) = dot(x, workspace.A, x)
struct FWorkspace{T}
    A::Matrix{T}
end
build_objective_workspace(::typeof(f), x) = ...

It reads oddly and would be hard to explain to users relatively new to Julia,
even though it is not a very advanced feature of the package: defining an objective function is
part of the basic workflow.

A typical pattern would be to use closures for such parameters:

function build_objective(A)
    f(x) = dot(x, A, x)
    function grad!(storage, x)
        storage .= 2 * A * x
    end
    return (f, grad!)
end

(f, grad!) = build_objective(A)
# ...

The two functions close over a common parameter A, which can be accessed from within them.
But what if you need to access A outside build_objective, once the functions are created?

You actually can do f.A, but it is definitely a hack relying on an implementation
detail rather than a feature; do not reproduce this at home!
Users or library contributors might also be confused when trying to find where the f.A
accessor is defined.

Instead, we should transparently define the fields we access and use functors or callable structures.

struct GradientWithMatrix{T} <: Function
    A::Matrix{T}
end

# defines the behaviour of a call to GradientWithMatrix
function (gm::GradientWithMatrix)(storage, x)
    storage .= 2 * gm.A * x
end

grad! = GradientWithMatrix(ones(3,3))
# ...

grad! is named like our previous function and can be called in the same way,
but it is also an instance of the struct type GradientWithMatrix{Float64}.
Furthermore, this parameterized gradient type is set up by the user, so we,
the package developers, let users implement an allocation-free version by setting up their gradient only
once.
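
A short usage sketch continuing from the snippet above: the parameter is now an explicit field of the type instead of a hidden closure capture.

storage = zeros(3)
grad!(storage, ones(3))  # callable exactly like the closure-based version
grad!.A                  # the matrix is a documented field, not an implementation detail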

This pattern can also come in handy for dynamic parameters evolving with the iterations,
like the number of gradient calls:

mutable struct GradientWithIterationCounter{T} <: Function
    A::Matrix{T}
    counter::Int
end

# convenience constructor starting the counter at zero
GradientWithIterationCounter(A) = GradientWithIterationCounter(A, 0)

# a call to GradientWithIterationCounter increments the counter and updates the storage
function (gm::GradientWithIterationCounter)(storage, x)
    gm.counter += 1
    storage .= 2 * gm.A * x
end

grad! = GradientWithIterationCounter(ones(3, 3))
# ...

This allows us to add gradient-call tracking to Frank-Wolfe in a way that is
non-intrusive for the algorithm code.
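
A short usage sketch of the counting functor defined above:

storage = zeros(3)
for _ in 1:5
    grad!(storage, rand(3))
end
grad!.counter  # == 5, the number of gradient calls performed so far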

Thanks Wikunia for proofreading and feedback!