Author Archives: Julia on μβ

Sets, chains and rules – part II

By: Julia on μβ

Re-posted from: https://matbesancon.xyz/post/2020-12-24-chains_sets2/

In a previous post, I detailed some of the features of
MathOptSetDistances.jl
and the evolution of the idea behind it. This is part II focusing on derivatives.

Table of Contents

The most interesting part of the packages is the projection onto a set.
For some applications, what we need is not only the projection but also the
derivative of this projection.

One answer here would be to let Automatic Differentiation (AD) do the work.
However:

  • Just like there are closed-form expressions for the projection, many sets admit closed-form projection derivatives that can be computed cheaply,
  • Some projections may require to perform steps impossible or expensive with AD, as a root-finding procedure1 or an eigendecomposition2;
  • Some functions might make calls into deeper water. JuMP for instance supports a lot of optimization solvers implemented in C and called as shared libraries. AD will not propagate through these calls.

For these reasons, AD systems often let users implement some derivatives themselves,
but as a library developer, I do not want to depend on a full AD package
(and force downstream users to do so).

Meet ChainRules.jl

ChainRules.jl is a Julia package
addressing exactly the issue mentioned above: it defines a set of primitives
to talk about derivatives in Julia.
Library developers can implement custom derivatives for their own functions and types.
Finally, AD library developers can leverage ChainRules.jl to obtain derivatives
from functions when available, and otherwise use AD mechanisms to obtain them from
more elementary functions.

The logic and motivation is explained in more details in Frame’s talk
at JuliaCon 2020 and the package documentation
which is very instructive on AD in general.

Projection derivative

We are interested in computing
$D\Pi_{\mathcal{S}}(v)$, the derivative of the projection with respect to the
initial point. As a refresher, if $\Pi_s(\cdot)$ is a function from $V$ onto itself,
and if $V$ then the derivative $D\Pi$ maps a point in $V$ onto a linear map
from the *tangent space* of $V$ onto itself.
The tangent space of $V$ is roughly speaking the space where differences of
values in $V$ live. If $V$ corresponds to real numbers, then the tangent space
will also be real numbers, but if $V$ is a space of time/dates, then the tangent
space is a duration/time period. See here3 for more references.
Again, roughly speaking, this linear map takes perturbations of the input $\Delta v$
and maps them to perturbation of the projected point $\Delta v_p$.

As an example warm-up:

  • $S$ is the whole domain of $v$ $\Rightarrow$ the projection is $v$ itself, $D\Pi_{\mathcal{S}}(v)$ is the identity operator.
  • $S$ is $\{0\}^n$ $\Rightarrow$ the projection is always $\{0\}^n$, $D\Pi_{\mathcal{S}}(v)$ maps every $Δv$ to a zero vector: perturbations in the input do not change the output.

$D\Pi_{\mathcal{S}}(v)$ is a linear map from $\mathcal{V}$ to $\mathcal{V}$.
If $v \in \mathbb{R}^n$, it can be represented as a
$n\times n$ matrix.
There are several ways of representing linear maps, see the LinearOperators.jl
package for some insight. Two approaches (for now) are implemented for set distances:

  1. Matrix approach: given $v \in \mathbb{R}^n$, return the linear operator as an $n\times n$ matrix.
  2. Forward mode: given $v$ and a direction $\Delta v$, provide the directional derivative $D\Pi_{\mathcal{S}}(v) \Delta v$.
  3. Reverse mode: given $v$, provide a closure corresponding to the adjoint of the derivative.

(1) has been implemented by Akshay for many sets
during his GSoC this summer, along with the projections themselves.

(1) corresponds to computing the derivative eagerly as a full matrix, thus
paying storage and computation cost upfront. The advantage is the simplicity for standard vectors,
take v, s, build and return the matrix.
(2) is the building block for forward-mode differentiation:
given a point $v$ and an input perturbation $\Delta v$, compute the output perturbation.
(3) corresponds to a building block for reverse-mode differentiation.
An aspect of the matrix approach is that it works well for 1-D arrays
but gets complex quite quickly for other structures, including multi-argument
functions or matrices. Concatenating everything into a vector is too rigid.

Example on the nonnegative orthant

The nonnegative orthant cone is the set $\mathbb{R}^n_+$; it is represented in MOI
as MOI.Nonnegatives(n) with n the dimension.
The projection is simple because it can be done elementwise:
$$
(\Pi_S(v))_i = max(v_i, 0) \,\,\forall i.
$$

In other terms, any non-diagonal term of the gradient matrix is 0 for any $v$.
Here is a visualization made with haste for $n=2$ using the very promising Javis.jl:

Projection

The red circle is a vector in the plane and the blue square its projection.4

The Julia implementation follows the same idea, here in a simplified version:

function projection_on_set(v::AbstractVector{T}, s::MOI.Nonnegatives) where {T}
    return max.(v, zero(T))
end

For each component $i \in 1..n$, there are two cases to compute its derivative, either
the constraint is active or not.

$$
\begin{align}
v_i < 0 & \Rightarrow \frac{\partial \Pi_i}{\partial v_i}(v) = 0\\
v_i > 0 & \Rightarrow \frac{\partial \Pi_i}{\partial v_i}(v) = 1.
\end{align}
$$

The projection is not differentiable on points where one of the components is 0.
The convention usually taken is to return any quantity on such point
(to the best of my knowledge, no system guarantees a subgradient).
The Julia implementation holds on two lines:

function projection_gradient_on_set(v::AbstractVector{T}, ::MOI.Nonnegatives) where {T}
    y = (sign.(v) .+ one(T)) / 2
    return LinearAlgebra.Diagonal(y)
end

First the diagonal of the matrix is computed using broadcasting and the sign function.
Then a LinearAlgebra.Diagonal matrix is constructed. This matrix type is sparsity-aware,
in the sense that it encodes the information of having only non-zero entries on
the diagonal. We save on space, using $O(n)$ memory instead of $O(n^2)$ for a
full matrix, and can benefit from specialized methods down the line.

We implemented the matrix approach from scratch. Even though we materialize the
derivative as a diagonal matrix, it still costs storage, which will become a
burden when we compose this projection with other functions and compute derivatives
on the composition.

Forward rule

For a function f, value v and tangent Δv, the forward rule, or frule
in ChainRules.jl does two things at once:

  1. Compute the function value y = f(v),
  2. Compute the directional derivative ∂y = Df(v) Δv.

The motivation for computing the two values at once is detailed in the
documentation.
Quite often, computing the derivative will require computing f(v) itself
so it is likely to be interesting to return it anyway instead of forcing the user
to call the function again.

The exact signature of ChainRulesCore.frule involves some details we want to
ignore for now, but the essence is as follows:

function frule((Δself, v...), ::typeof(f), v...; kwargs...)
    ...
    return y, ∂y
end

∂Y is the directional derivative using the direction Δx. Note here the variadic
Δx and x, since we do not want to impose a rigid, single-argument structure
to functions. The Δself argument is out of scope for this post but you can read
on its use in the docs.

For our set projection, it may look like this:

function ChainRulesCore.frule(
        (_, Δv, _),
        ::typeof(projection_on_set),
        v::AbstractVector{T}, s::MOI.Nonnegatives) where {T}
    vproj = projection_on_set(v, s)
    ∂vproj = Δv .* (v .>= 0)
    return vproj, ∂vproj
end

The last computation line leverages broadcast to express elementwise the
multiplication of Δv with the indicator of v[i] being nonnegative.
The important thing to note here is that we never build the derivative as a data
structure. Instead, we implement it as a function. An equivalent using our
projection_gradient_on_set would be:

function projection_directional_derivative(v, Δv, s)
    vproj = projection_on_set(v, s)
    DΠ = projection_gradient_on_set(v, s)
    ∂vproj =* Δv
    return vproj, ∂vproj
end

Notice the additional allocation and matrix-vector product.

Reverse rules

The forward mode is fairly intuitive, the backward mode less so.
The motivation for using it, and the reason it is the favoured one for several
important fields using AD, is that it can differentiate a composition of functions
with only matrix-vector products, instead of requiring matrix-matrix products.
What it computed is, given a perturbation in the output (or seed), provide the
corresponding perturbation in the input.
There are great resources online which will explain it in better terms than I could
so we will leave it at that.

Looking at the rrule signature from ChainRules.jl:

function rrule(::typeof(f), x...; kwargs...)
    y = f(x...)
    function pullback_f(Δy)
        # implement the pullback here
        return ∂self, ∂x
    end
    return y, pullback_f
end

This is a bit denser. rrule takes the function as input and its arguments.
So far so good. It returns two things, the value y of the function, similalry to frule
and a pullback. This term comes from differential geometry and in the context
of AD, is also referred to as a backpropagator. Again, the ChainRules
docs
got your back with great explanations.

It also corresponds to the Jacobian-transpose vector product if you prefer the term.
In the body of pullback_f, we compute the variation of the output with respect to each input.
If we give the pullback a 1 or 1-like as input, we compute the gradient,
the partial derivative of f with respect to each input x[i] evaluated at the
point x.

Here is the result for our positive orthant (again, simplified for conciseness):

function ChainRulesCore.rrule(::typeof(projection_on_set), v, s::MOI.Nonnegatives)
    vproj = projection_on_set(v, s)
    function pullback(Δvproj)
        n = length(v)
        v̄ = zeros(eltype(Δvproj), n)
        for i in 1:n
            if vproj[i] == v[i]
                v̄[i] = Δvproj[i]
            end
        end
        return (ChainRulesCore.NO_FIELDS, v̄, ChainRulesCore.DoesNotExist())
    end
    return (vproj, pullback)
end

The first step is computing the projection, here we do not bother with saving
for loops and just call the projection function.
For each index i of the vector, if the i-th projection component is equal to
the i-th initial point, $v_i$ is in the positive orthant and variations of
the output are directly equal to variations of the input. Otherwise,
this means the non-negativity constraint is tight, the projection lies on
the boundary vproj[i] = 0, and output variations are not propagated to the input
since the partial derivative is zero.

We see here that a tuple of 3 elements is returned. The first corresponds to
∂self, out of the scope for this package. The second is the interesting one,
, the derivative with respect to the input point.
The last one ChainRulesCore.DoesNotExist() indicates that there is no derivative
with respect to the last argument of projection_on_set, namely the set s.
This makes sense because there is nothing to differentiate in the set.

An interesting point to notice is that the implementation, not the types defines the derivatives.
A non-trivial example would be a floating-point argument p only used to extract
the sign bit. This means it would not have a notion of local perturbation.
The type (a floating-point) would be interpreted as differentiable.
To my understanding, Swift for Tensorflow uses
a type-first approach, where types indicate what field gets differentiated.

If you imagine using this in practice, in an AD library for instance,
one would first call rrule forward, computing primal values and collecting the
successive pullbacks. Once we arrive at the end of our chain of functions,
we could backpropagate from $\Delta Y_{final} = 1$, walking our way back to
the primary input parameters.

Conclusion

This post comes after a few weeks of work on MathOptSetDistances.jl,
the package with the actual implementation of the presented features.
There is still a lot to learn and do on the topic, including solutions to more
projections and derivatives thereof, but also interesting things to build upon.
Defining derivatives and projections is after all a foundation for greater things to
happen.

Notes


  1. See H. Friberg’s talk on exponential cone projection in Mosek at ISMP 2018 ↩︎

  2. An example case for the projection onto the Positive Semidefinite cone ↩︎

  3. If like me you haven’t spent much time lying around differential geometry books,
    the ChainRules.jl
    documentation has a great developer-oriented explanation.
    For more visual explanations, Keno Fischer had a recent talk on
    the topic↩︎

  4. See the source code here↩︎

Sets, chains and rules – part II

By: Julia on μβ

Re-posted from: https://matbesancon.github.io/post/2020-12-24-chains_sets2/

In a previous post, I detailed some of the features of
MathOptSetDistances.jl
and the evolution of the idea behind it. This is part II focusing on derivatives.

Table of Contents

The most interesting part of the packages is the projection onto a set.
For some applications, what we need is not only the projection but also the
derivative of this projection.

One answer here would be to let Automatic Differentiation (AD) do the work.
However:

  • Just like there are closed-form expressions for the projection, many sets admit closed-form projection derivatives that can be computed cheaply,
  • Some projections may require to perform steps impossible or expensive with AD, as a root-finding procedure1 or an eigendecomposition2;
  • Some functions might make calls into deeper water. JuMP for instance supports a lot of optimization solvers implemented in C and called as shared libraries. AD will not propagate through these calls.

For these reasons, AD systems often let users implement some derivatives themselves,
but as a library developer, I do not want to depend on a full AD package
(and force downstream users to do so).

Meet ChainRules.jl

ChainRules.jl is a Julia package
addressing exactly the issue mentioned above: it defines a set of primitives
to talk about derivatives in Julia.
Library developers can implement custom derivatives for their own functions and types.
Finally, AD library developers can leverage ChainRules.jl to obtain derivatives
from functions when available, and otherwise use AD mechanisms to obtain them from
more elementary functions.

The logic and motivation is explained in more details in Lyndon’s talk
at JuliaCon 2020 and the package documentation
which is very instructive on AD in general.

Projection derivative

We are interested in computing
$D\Pi_{\mathcal{S}}(v)$, the derivative of the projection with respect to the
initial point. As a refresher, if $\Pi_s(\cdot)$ is a function from $V$ onto itself,
and if $V$ then the derivative $D\Pi$ maps a point in $V$ onto a linear map
from the *tangent space* of $V$ onto itself.
The tangent space of $V$ is roughly speaking the space where differences of
values in $V$ live. If $V$ corresponds to real numbers, then the tangent space
will also be real numbers, but if $V$ is a space of time/dates, then the tangent
space is a duration/time period. See here3 for more references.
Again, roughly speaking, this linear map takes perturbations of the input $\Delta v$
and maps them to perturbation of the projected point $\Delta v_p$.

As an example warm-up:

  • $S$ is the whole domain of $v$ $\Rightarrow$ the projection is $v$ itself, $D\Pi_{\mathcal{S}}(v)$ is the identity operator.
  • $S$ is $\{0\}^n$ $\Rightarrow$ the projection is always $\{0\}^n$, $D\Pi_{\mathcal{S}}(v)$ maps every $Δv$ to a zero vector: perturbations in the input do not change the output.

$D\Pi_{\mathcal{S}}(v)$ is a linear map from $\mathcal{V}$ to $\mathcal{V}$.
If $v \in \mathbb{R}^n$, it can be represented as a
$n\times n$ matrix.
There are several ways of representing linear maps, see the LinearOperators.jl
package for some insight. Two approaches (for now) are implemented for set distances:

  1. Matrix approach: given $v \in \mathbb{R}^n$, return the linear operator as an $n\times n$ matrix.
  2. Forward mode: given $v$ and a direction $\Delta v$, provide the directional derivative $D\Pi_{\mathcal{S}}(v) \Delta v$.
  3. Reverse mode: given $v$, provide a closure corresponding to the adjoint of the derivative.

(1) has been implemented by Akshay for many sets
during his GSoC this summer, along with the projections themselves.

(1) corresponds to computing the derivative eagerly as a full matrix, thus
paying storage and computation cost upfront. The advantage is the simplicity for standard vectors,
take v, s, build and return the matrix.
(2) is the building block for forward-mode differentiation:
given a point $v$ and an input perturbation $\Delta v$, compute the output perturbation.
(3) corresponds to a building block for reverse-mode differentiation.
An aspect of the matrix approach is that it works well for 1-D arrays
but gets complex quite quickly for other structures, including multi-argument
functions or matrices. Concatenating everything into a vector is too rigid.

Example on the nonnegative orthant

The nonnegative orthant cone is the set $\mathbb{R}^n_+$; it is represented in MOI
as MOI.Nonnegatives(n) with n the dimension.
The projection is simple because it can be done elementwise:
$$
(\Pi_S(v))_i = max(v_i, 0) \,\,\forall i.
$$

In other terms, any non-diagonal term of the gradient matrix is 0 for any $v$.
Here is a visualization made with haste for $n=2$ using the very promising Javis.jl:

Projection

The red circle is a vector in the plane and the blue square its projection.4

The Julia implementation follows the same idea, here in a simplified version:

function projection_on_set(v::AbstractVector{T}, s::MOI.Nonnegatives) where {T}
    return max.(v, zero(T))
end

For each component $i \in 1..n$, there are two cases to compute its derivative, either
the constraint is active or not.

$$
\begin{align}
v_i < 0 & \Rightarrow \frac{\partial \Pi_i}{\partial v_i}(v) = 0\\
v_i > 0 & \Rightarrow \frac{\partial \Pi_i}{\partial v_i}(v) = 1.
\end{align}
$$

The projection is not differentiable on points where one of the components is 0.
The convention usually taken is to return any quantity on such point
(to the best of my knowledge, no system guarantees a subgradient).
The Julia implementation holds on two lines:

function projection_gradient_on_set(v::AbstractVector{T}, ::MOI.Nonnegatives) where {T}
    y = (sign.(v) .+ one(T)) / 2
    return LinearAlgebra.Diagonal(y)
end

First the diagonal of the matrix is computed using broadcasting and the sign function.
Then a LinearAlgebra.Diagonal matrix is constructed. This matrix type is sparsity-aware,
in the sense that it encodes the information of having only non-zero entries on
the diagonal. We save on space, using $O(n)$ memory instead of $O(n^2)$ for a
full matrix, and can benefit from specialized methods down the line.

We implemented the matrix approach from scratch. Even though we materialize the
derivative as a diagonal matrix, it still costs storage, which will become a
burden when we compose this projection with other functions and compute derivatives
on the composition.

Forward rule

For a function f, value v and tangent Δv, the forward rule, or frule
in ChainRules.jl does two things at once:

  1. Compute the function value y = f(v),
  2. Compute the directional derivative ∂y = Df(v) Δv.

The motivation for computing the two values at once is detailed in the
documentation.
Quite often, computing the derivative will require computing f(v) itself
so it is likely to be interesting to return it anyway instead of forcing the user
to call the function again.

The exact signature of ChainRulesCore.frule involves some details we want to
ignore for now, but the essence is as follows:

function frule((Δself, v...), ::typeof(f), v...; kwargs...)
    ...
    return y, ∂y
end

∂Y is the directional derivative using the direction Δx. Note here the variadic
Δx and x, since we do not want to impose a rigid, single-argument structure
to functions. The Δself argument is out of scope for this post but you can read
on its use in the docs.

For our set projection, it may look like this:

function ChainRulesCore.frule(
        (_, Δv, _),
        ::typeof(projection_on_set),
        v::AbstractVector{T}, s::MOI.Nonnegatives) where {T}
    vproj = projection_on_set(v, s)
    ∂vproj = Δv .* (v .>= 0)
    return vproj, ∂vproj
end

The last computation line leverages broadcast to express elementwise the
multiplication of Δv with the indicator of v[i] being nonnegative.
The important thing to note here is that we never build the derivative as a data
structure. Instead, we implement it as a function. An equivalent using our
projection_gradient_on_set would be:

function projection_directional_derivative(v, Δv, s)
    vproj = projection_on_set(v, s)
    DΠ = projection_gradient_on_set(v, s)
    ∂vproj =* Δv
    return vproj, ∂vproj
end

Notice the additional allocation and matrix-vector product.

Reverse rules

The forward mode is fairly intuitive, the backward mode less so.
The motivation for using it, and the reason it is the favoured one for several
important fields using AD, is that it can differentiate a composition of functions
with only matrix-vector products, instead of requiring matrix-matrix products.
What it computed is, given a perturbation in the output (or seed), provide the
corresponding perturbation in the input.
There are great resources online which will explain it in better terms than I could
so we will leave it at that.

Looking at the rrule signature from ChainRules.jl:

function rrule(::typeof(f), x...; kwargs...)
    y = f(x...)
    function pullback_f(Δy)
        # implement the pullback here
        return ∂self, ∂x
    end
    return y, pullback_f
end

This is a bit denser. rrule takes the function as input and its arguments.
So far so good. It returns two things, the value y of the function, similalry to frule
and a pullback. This term comes from differential geometry and in the context
of AD, is also referred to as a backpropagator. Again, the ChainRules
docs
got your back with great explanations.

It also corresponds to the Jacobian-transpose vector product if you prefer the term.
In the body of pullback_f, we compute the variation of the output with respect to each input.
If we give the pullback a 1 or 1-like as input, we compute the gradient,
the partial derivative of f with respect to each input x[i] evaluated at the
point x.

Here is the result for our positive orthant (again, simplified for conciseness):

function ChainRulesCore.rrule(::typeof(projection_on_set), v, s::MOI.Nonnegatives)
    vproj = projection_on_set(v, s)
    function pullback(Δvproj)
        n = length(v)
        v̄ = zeros(eltype(Δvproj), n)
        for i in 1:n
            if vproj[i] == v[i]
                v̄[i] = Δvproj[i]
            end
        end
        return (ChainRulesCore.NO_FIELDS, v̄, ChainRulesCore.DoesNotExist())
    end
    return (vproj, pullback)
end

The first step is computing the projection, here we do not bother with saving
for loops and just call the projection function.
For each index i of the vector, if the i-th projection component is equal to
the i-th initial point, $v_i$ is in the positive orthant and variations of
the output are directly equal to variations of the input. Otherwise,
this means the non-negativity constraint is tight, the projection lies on
the boundary vproj[i] = 0, and output variations are not propagated to the input
since the partial derivative is zero.

We see here that a tuple of 3 elements is returned. The first corresponds to
∂self, out of the scope for this package. The second is the interesting one,
, the derivative with respect to the input point.
The last one ChainRulesCore.DoesNotExist() indicates that there is no derivative
with respect to the last argument of projection_on_set, namely the set s.
This makes sense because there is nothing to differentiate in the set.

An interesting point to notice is that the implementation, not the types defines the derivatives.
A non-trivial example would be a floating-point argument p only used to extract
the sign bit. This means it would not have a notion of local perturbation.
The type (a floating-point) would be interpreted as differentiable.
To my understanding, Swift for Tensorflow uses
a type-first approach, where types indicate what field gets differentiated.

If you imagine using this in practice, in an AD library for instance,
one would first call rrule forward, computing primal values and collecting the
successive pullbacks. Once we arrive at the end of our chain of functions,
we could backpropagate from $\Delta Y_{final} = 1$, walking our way back to
the primary input parameters.

Conclusion

This post comes after a few weeks of work on MathOptSetDistances.jl,
the package with the actual implementation of the presented features.
There is still a lot to learn and do on the topic, including solutions to more
projections and derivatives thereof, but also interesting things to build upon.
Defining derivatives and projections is after all a foundation for greater things to
happen.

Notes


  1. See H. Friberg’s talk on exponential cone projection in Mosek at ISMP 2018 ↩︎

  2. An example case for the projection onto the Positive Semidefinite cone ↩︎

  3. If like me you haven’t spent much time lying around differential geometry books,
    the ChainRules.jl
    documentation has a great developer-oriented explanation.
    For more visual explanations, Keno Fischer had a recent talk on
    the topic. ↩︎

  4. See the source code here. ↩︎

Sets, chains and rules – part I

By: Julia on μβ

Re-posted from: https://matbesancon.xyz/post/2020-12-23-chains_sets/

Table of Contents

In this post, I will develop the process through which the
MathOptSetDistances.jl
package has been created and evolved. In the second one, I will go over the differentiation part.

MathOptInterface and the motivation

MathOptInterface.jl or MOI
for short is a Julia package to unify structured constrained optimization problems.
The abstract representation of problems MOI addresses is as follows:

$$
\begin{align}
\min_{x}\,\, & F(x) \\\\
\text{s.t.}\,\, & G_k(x) \in \mathcal{S}_k \,\, \forall k \\\\
& x \in \mathcal{X}.
\end{align}
$$

$\mathcal{X}$ is the domain of the decision variables,
$F$ is the objective function, mapping values of the variables to the real line.
The constrained aspect comes from the constraints $G_k(x) \in \mathcal{S}_k$,
some mappings of the variables $G_k$ have to belong to a certain set $\mathcal{S}_k$.
See this recent paper on MOI for more information
on this representation.

The structured aspect comes from the fact that a specific form of $F$, $G$
and $\mathcal{S}$ is known in advance by the modeller. In other words, MOI
does not deal with arbitrary unknown functions or black-box sets.
For such cases, other tools are more adapted.

From a given problem in this representation, two operations can be of interest
within a solution algorithm or from a user perspective:

  1. Given a value for $x$, evaluating a function $F(x)$ or $G(x)$,
  2. Given a value $v$ in the co-domain of $G_k$, asserting whether $v \in S_k$.

The first point is addressed by the function eval_variables in the MOI.Utilities submodule
(documentation).

The second point appears as simple (or at least it did to me) but is trickier.
What tolerance should be set?
Most solvers include a numerical tolerance on constraint violations, should this
be propagated from user choices, and how?

The deceivingly simple feature ended up opening one of the
longest discussions
in the MOI repository.

Fairly straightforward[…]

Optimistic me, beginning of the PR, February 2020

A more meaningful query for solvers is, given a value $v$, what is the
distance from $v$ to the set $\mathcal{S}$:

$$
\begin{align}
(\text{δ(v, s)})\,\,\min_{v_p}\,\, & \text{dist}(v_p, v) \\\\
\text{s.t.}\,\, & v_p \in \mathcal{S}.
\end{align}
$$

The optimal value of the problem above noted $δ(v, s)$ depends on the
notion of the distance taken between two values in the domain $\mathcal{V}$,
noted $dist(\cdot,\cdot)$ here.
In terms of implementation, the signature is roughly:

distance_to_set(v::V, s::S) -> Real

Aside:
this is an example where multiple dispatch brings great value to the design:
the implementation of distance_to_set depends on both the value type V
and the type of set S. See why it’s useful in the
Bonus section.

If $\mathcal{S}$ was a generic set, computing this distance would be as hard as
solving an optimization problem with constraints $v \in \mathcal{S}$ but
since we are dealing with structured optimization, many particular sets have
closed-form solutions for the problem above.

Examples

$\|\cdot\|$ will denote the $l_2-$norm if not specified.

The distance computation problem defined by the following data:

$$
\begin{align}
& v \in \mathcal{V} = \mathbb{R}^n,\\
& \mathcal{S} = \mathbb{Z}^n,\\
& dist(a, b) = \|a – b\|
\end{align}
$$

consists of rounding element-wise to the closest integer.

The following data:

$$
\begin{align}
& v \in \mathcal{V} = \mathbb{R}^n,\\
& \mathcal{S} = \mathbb{R}^n_+,\\
& dist(a, b) = \|a – b\|
\end{align}
$$

find the closest point in the positive orthant, with a result:

$$
v_{p}\left[i\right] = \text{max}(v\left[i\right], 0) \,\, \forall i \in \{1..n\}.
$$

Set projections

The distance from a point to a set tells us how far a given candidate is from
respecting a constraint. But for many algorithms, the quantity of interest is
the projection itself:

$$
\Pi_{\mathcal{S}}(v) \equiv \text{arg} \min_{v_p \in \mathcal{S}} \text{dist}(v, v_p).
$$

Like the optimal distance, the best projection onto a set can often be defined
in closed form i.e. without using generic optimization methods.

We also keep the convention that the projection of a point already in the set is
always itself:
$$
δ(v, \mathcal{S}) = 0 \,\, \Leftrightarrow \,\, v \in \mathcal{S} \,\, \Leftrightarrow \,\, \Pi_{\mathcal{S}}(v) = v.
$$

The interesting thing about projections is that once obtained, a distance
can be computed easily, although only computing the distance can be slightly
more efficient, since we do not need to allocate the projected point.

User-defined distance notions

Imagine a set defined using two functions:
$$
\begin{align}
\mathcal{S} = \{v \in \mathcal{V}\,|\, f(v) \leq 0, g(v)\leq 0 \}.
\end{align}
$$

The distance must be evaluated with respect to two values:
$$
(max(f(v), 0), max(g(v), 0)).
$$

Here, the choice boils down to a norm, but hard-coding it seems harsh and rigid for users.
Even if we plan correctly and add most norms people would expect, someone will
end up with new exotic problems on sets,
complex numbers or function spaces.

The solution that came up after discussions is adding a type to dispatch on,
specifying the notion of distance used:

function distance_to_set(d::D, v::V, s::S)
        where {D <: AbstractDistance, V, S <: MOI.AbstractSet}
    # ...
end

which can for instance encode a p-norm or anything else.
In many cases, there is no ambiguity, and the package defines DefaultDistance()
exactly for this.

Bonus

If you are coming from a class-based object-oriented background, a common
design choice is to define a Set abstract class with a method project_on_set(v::V) to implement.
This would work for most situations, since a set often implies a domain V.
What about the following:

# Projecting onto the reals (no-op)
project_on_set(v::AbstractVector{T}, s::Reals) where {T <: Real}

# Projecting onto the reals (actual work)
project_on_set(v::AbstractVector{T}, s::Reals) where {T <: Complex}

Which “class” should own the implementation in that case?
From what I observed, libraries end up with either an enumeration:

if typeof(v) == AbstractVector{<:Reals}
    # ...
elseif # ...
end

or when the number of possible domains is expected to be low, with several methods:

# in the set class Reals
function project_real(v::AbstractVector{T}) where {T <: Real}
end

function project_complex(v::AbstractVector{T}) where {T <: Complex}
end

function project_scalar(v::T) where {T <: Real}
end

As a last remark, one may wonder why would one define trivial sets as the MOI.Reals
or the MOI.Zeros. A good example where this is needed is the polyhedral cone:
$$
A x = 0
$$
with $x$ a vector. This makes more sense to define $Ax$ as the function and
MOI.Zeros as the set.