Tag Archives: modelingtoolkit

Machine learning with hard constraints: Neural Differential-Algebraic Equations (DAEs) as a general formalism

By: Christopher Rackauckas

Re-posted from: https://www.stochasticlifestyle.com/machine-learning-with-hard-constraints-neural-differential-algebraic-equations-daes-as-a-general-formalism/

We recently released a new manuscript Semi-Explicit Neural DAEs: Learning Long-Horizon Dynamical Systems with Algebraic Constraints where we showed a way to develop neural networks where any arbitrary constraint function can be directly imposed throughout the evolution equation to near floating point accuracy. However, in true academic form it focuses directly on getting to the point about the architecture, but here I want to elaborate about the mathematical structures that surround the object, particularly the differential-algebraic equation (DAE), how its various formulations lead to the various architectures (such as stabilized neural ODEs), and elaborate on the other related architectures which haven’t had a paper yet but how you’d do it (and in what circumstances they would make sense).

What I want to describe in this blog post is:

  1. What is a differential-algebraic equation (DAE)?
  2. Why are DAEs a general way to express machine learning with hard constraints (for time series problems)?
  3. How can Neural DAEs be solved and trained? How do the methods of the different papers relate to each other?
  4. Can Neural DAEs always be solved? The answer is no.
  5. The special Julia package ModelingToolkit.jl mixes in symbolic transformations into the DAE pipeline to fix that, i.e. MTK transformed equations are always solvable/trainable versions of the constraints. How does that work?

From this you should have a complete overview of space of all ways to treat and handle the neural DAEs. Let’s dive in!

Primer: What is a differential-algebraic equation?

A differential-algebraic equation is a differential equation with algebraic equations attached to it, duh! But okay let’s look at the form of it. Generally you can write this as:

$$f(u’,u,p,t)=0$$

but this may not be so clear, so instead let’s write it in semi-explicit form:

$$u’ = f(u,v,p,t)$$
$$0 = g(u,v,p,t)$$

where the u, the ones with derivative terms, are called the derivative variables while the v, the ones without v’ showing up in the equations and only showing up in the constraint function $g$, are the algebraic equations. An example of a DAE is the Robertson equations:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
1 &= y_{1} + y_{2} + y_{3} \\
\end{aligned}$$

Thus instead of just saying “solve for (y_1,y_2,y_3)”, we are saying “solve for (y_1,y_2) subject to y_1 + y_2 + y_3 = 1”.

How this leads to machine learning with arbitrary hard constraints

As we first pointed out and demonstrated in the 2020 version of Universal Differential Equations for Scientific Machine Learning, using a DAE formulation allows you to build neural networks which impose any arbitrary hard constraint you would like. Cool…

First, what is a hard constraint? A hard constraint is a constraint that is always satisfied by design. This is as opposed to a soft constraint which is a constraint that is satisfied if the loss is sufficiently small. For example, with physics-informed neural networks, you can add to the loss function that the solution should conserve mass, and if the loss is small, then the violation of mass loss will be small. That is a soft constraint: the constraint is actually violated in every prediction, but hopefully the violation is sufficiently small if you have trained well enough.

The issue is that constraint violations can compound. Let’s see how this works in more detail on the mass-spring model.

$$x’ = v$$
$$v’ = -x$$

What you’re looking at is the solution to these equations plotted as x(t) vs v(t):

The analytical solution is actually easy to compute: it’s just a circle $x(t) = sin(t)$ and $v(t) = cos(t)$. The neural ODE is trying to learn the equations by just learning the ODE from data, i.e. $$u’=NN(u)$$ where the loss function is $$||u(t_i) – d_i||$$ with $$d_i$$ being the data at the ith time point. For the version with the soft constraint, we add to the loss that $$||x(t_i)^2 + v(t_i)^2 – C||$$, i.e that the energy should be constant (the solution should be on the circle). What you see in the picture is the solution in phase space, i.e. x-axis is x and y-axis is v, continuing to go around and around in the circle, and the reason it looks “blurred” is because it’s actually spiraling outwards. So let’s look at a different view of the system, instead of in phase space we look at the solution x(t) over time:

When you look at x(t) over time, what you see is a very clear pattern, that the solution just keeps going up and up. That’s the “spiraling outward”. So effectively what is happening is that even with the soft constraint, we violate energy conservation. And so if we keep simulating the neural network prediction, it keeps compounding that energy growth, so the circle keeps spiraling out. If we only did a short term simulation it’s okay, but once we begin to extrapolate the energy growth pulls it out of where the training regime is and it does something very unphysical and diverges to infinity.

One clear way to solve this would be to say “well why not force it so that at every step you always have $$x(t_i)^2 + v(t_i)^2 = C$$? What is a neural network architecture such that $$NN(x,v) = [x’,v’]$$ has $$x(t_i)^2 + v(t_i)^2 = C$$? There is some literature going down in this direction, such as for example, defining neural networks that naturally give the incompressibility condition for fluids, but the difficulty of going down this path is that you have to come up with a new architecture for every new constraint that you encounter. Is there a way to magically get an architecture for any hard constraint that you can think of?

Doing this is formally a DAE! Instead of imposing two differential equations, you impose one differential equation and one algebraic constraint:

$$x’ = v$$
$$x^2 + v^2 – C = 0$$

If this DAE is solved, then every step will always conserve energy, so your solution has to, by design, always be on the circle. Thus, if we know energy must be conserved, we can instead say “learn the dynamics under the assumption that energy is conserved”, which in the DAE formalism looks like:

$$x’ = NN(x,v)$$
$$x^2 + v^2 – C = 0$$

i.e. you have a neural network on the differential equations but directly impose the algebraic constraint.

Notice that the nice thing about this formalism is thus that you can design any network to enforce any hard constraints through this form. Generally the form is:

$$u’ = NN(u,v)$$
$$0 = g(u,v)$$

where $$g$$ is any hard constraint you wish to have.

This is thus a nice way to build many different potential behaviors into a neural network. Do you want to have a neural network blur an image where the total brightness of the picture is guaranteed to be the same as the brightness before? The brightness is just the sum of the pixels, that’s a neural DAE. Want to push forward a PDE solution and ensure the boundary constraints are always satisfied? Just code the boundary conditions as constraints and train a neural DAE. There are many things to explore here, but the point is that just by choosing $$g$$, if you can train and solve that DAE, then you get a model which directly imposes any hard constraint you want.

If you can solve and train a Neural DAE… how do you do that exactly? Directly Solving and Differentiating the DAE

So how do you solve and train the Neural DAE? Well there are a few ways to transform DAEs into solvable form, and what I want to show you is that the differences between the papers in the literature are simply different ways of doing this transformation.

First of all, you can tackle the DAE directly. The DifferentialEquations.jl Differential Algebraic Equations tutorial goes into detail on this. There are effectively two ways. In one way, you give a numerical solver the implicit form

$$f(u’,u,p,t)=0$$

For example, for the Rosenbrock equations shown above (repeating for clarity):

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
1 &= y_{1} + y_{2} + y_{3} \\
\end{aligned}$$

we can write the following code to solve the equation directly in this form:

function f2(out, du, u, p, t)
    out[1] = -0.04u[1] + 1e4 * u[2] * u[3] - du[1]
    out[2] = +0.04u[1] - 3e7 * u[2]^2 - 1e4 * u[2] * u[3] - du[2]
    out[3] = u[1] + u[2] + u[3] - 1.0
end
u₀ = [1.0, 0, 0]
du₀ = [-0.04, 0.04, 0.0]
tspan = (0.0, 100000.0)
differential_vars = [true, true, false]
prob = DAEProblem(f2, du₀, u₀, tspan, differential_vars = differential_vars)
 
using Sundials
sol = solve(prob, IDA())
 
plot(sol, xscale = :log10, tspan = (1e-6, 1e5), layout = (3, 1))

Turning this into the neural network form is relatively straightforward, you simply replace the analytical dynamics with the neural network on there:

function f2(out, du, u, p, t)
    out[1:2] = NN(du,u,p)
    out[3] = u[1] + u[2] + u[3] - 1.0
end

The solvers here are either IDAS from Sundials (available through Sundials.jl in Julia), DASKR/DASPK (wrapped in Julia) or OrdinaryDiffEq.jl’s native Julia DFBDF. Currently IDA is the best of the lot (DFBDF is a similar algorithm but is missing an optimization on the Jacobain, so IDA is simply better unless you need Julia features like arbitrary precision). Only the first and last are compatible with automatic differentiation for training though. IDAS’ adjoint (i.e. reverse mode automatic differentiation) dates back to a really nice read from 2003 while the Julia DFBDF uses direct adjoints (and will implement that IDAS adjoint sometime soon). Thus training a neural DAE in this form simply amounts to writing it down, solving it, asking for the adjoint to give you the gradient (i.e. Zygote or Enzyme etc. with the right adjoint choice) and tada there you go. Problem solved?

Not quite, and the reason is because this formulation ends up being quite slow and not so robust (don’t worry, I’ll back this up in a second). The DAE in this form is a bit too flexible, and so it can help to constrain its behaviors a bit. You can prove that any (nice and simulateable) fully nonlinear DAE $$f(u’,u,p,t)=0$$ can be rewritten into its semi-explicit form (more on this later):

$$u’ = f(u,v,p,t)$$
$$0 = g(u,v,p,t)$$

In this form, you can instead treat it as a mass matrix equation. If you let $x = [u,v]$, then we can write:

$$Mx’ = h(x,p,t)$$

where if $M$ has zeros for some of its rows (i.e. it’s a singular matrix in the right way), it will be equivalent to the equation above. It is easiest to explain by writing an example out. Let’s again take the Rosenbrock equations:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
1 &= y_{1} + y_{2} + y_{3} \\
\end{aligned}$$

Now let’s write this equation in mass matrix form:

$$\begin{align}
\begin{bmatrix}
1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 0
\end{bmatrix} \begin{bmatrix}
y_1’\\
y_2’\\
y_3′
\end{bmatrix} = \begin{bmatrix}
-0.04 y_1 + 10^4 y_2 y_3\\
0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2\\
y_{1} + y_{2} + y_{3} – 1
\end{bmatrix}
\end{align}$$

If you do the matrix-vector multiplication out, you see that last row of zeros simply gives that the last equation is $0 = y_{1} + y_{2} + y_{3} – 1$. Once you see that trick, it’s immediately clear how mass matrices can write out any constraint equation $g$. Done.

Now solving this is done as follows:

using DifferentialEquations
using Plots
function rober(du, u, p, t)
    y₁, y₂, y₃ = u
    k₁, k₂, k₃ = p
    du[1] = -k₁ * y₁ + k₃ * y₂ * y₃
    du[2] = k₁ * y₁ - k₃ * y₂ * y₃ - k₂ * y₂^2
    du[3] = y₁ + y₂ + y₃ - 1
    nothing
end
M = [1.0 0 0
     0 1.0 0
     0 0 0]
f = ODEFunction(rober, mass_matrix = M)
prob_mm = ODEProblem(f, [1.0, 0.0, 0.0], (0.0, 1e5), (0.04, 3e7, 1e4))
sol = solve(prob_mm, Rodas5P(), reltol = 1e-8, abstol = 1e-8)
 
plot(sol, xscale = :log10, tspan = (1e-6, 1e5), layout = (3, 1))

It turns out that you can solve these equations with many (but not all) of the robust solvers for stiff ODEs through a quick trick. Effectively in the Newton method of an implicit solver you have $I – gamma*J$ where $gamma$ is problem-dependent, but if you go back through that derivation you can see that swapping out to $M – gamma*J$ gives an implicit method for the mass matrix equations. Now there are other behaviors I’m glossing over here, especially with respect to initialization, so there are other things you need to be aware of (see Linda Petzold’s classic Differential/Algebraic Equations are not ODE’s paper) (and if you’re an expert, yes I’m glossing over differential index right now, I’ll get to that at the end). But basically you can think of the mass matrix form as a linearized form of the previous, i.e. for $$f(u’,u,p,t)=0$$ you can differentiate with respect to u’ and thus $M = \partial f/\partial u’$ factor that out and move it to the other side. So in a sense, the mass matrix form (with a constant mass matrix) is a simplification of the $$f(u’,u,p,t)=0$$ where we have pre-linearized the equations with respect to $u’$ and isolated them to one side already. And the solvers exploit this structure. So, they should definitely be faster.

But don’t just take my word for it, go to the SciMLBenchmarks.jl and see it for yourself. On the Robertson equations you get:

while for a transistor amplifier you get:

Here it’s showing work-precision, so it’s showing the solution of the equation with many different choices of tolerances. The error is on the x-axis is computed against a reference solution computed at high accuracy (i.e. it’s the error of the solution, not the tolerance chosen with the solver, since the two can be quite different!) and the y-axis is the amount of time it takes to solve the DAE with a given method. Thus if you place a vertical line at a given accuracy, you’d say “for this amount of accuracy, this is how long it takes each solver to compute the solution” where the lower the line is the better. What you see is that IDA does not come close to being the fastest. And in the transistor amplifier benchmark it only has 3 dots instead of 6 that the others had: this indicates that for some choices of the tolerances IDA actually diverged and did not compute an accurate solution. So, as stated, slower and less robust. But that should come as no surprised because by the very formalization of the problem it is solving a harder mathematical problem than the mass matrix solvers.

So that’s how you solve it, but how do you differentiate it? It turns out this is a not-so-often noticed part of the 2020 version of Universal Differential Equations for Scientific Machine Learning paper, in particular Supplemental Information Part 1.1 derives the adjoint method for mass matrix ODEs along with the consistent initialization of index-1 DAEs. Thus if the chosen method can solve it, then using Julia’s SciMLSensitivity.jl system for adjoint overloads it already defines reverse-mode automatic differentiation for the DAE system. Thus getting gradients is trivial using this software. And changing to a neural network is straightforward, i.e.:

function rober(du, u, p, t)
    y₁, y₂, y₃ = u
    k₁, k₂, k₃ = p
    du[1:2] = NN(u,p)
    du[3] = y₁ + y₂ + y₃ - 1
    nothing
end

Algebraic in Time Formalism for the DAE solve

One other method to mention is Incorporating Neural ODEs into DAE-Constrained Optimization Problems. This uses the “Optimal Control approach” to DAEs. I.e. what you can do is write out all of the time steps of the DAE according to some implicit method, like a Radau method, and then make a gigantic nonlinear constrained optimization problem which when solved gives you a $u(t)$ that satisfies the dynamics equations. This can be solved via algebraic optimization frameworks like JuMP, InfiniteOpt.jl, Pyomo, etc. The downside to this approach is manifold. For one, you don’t have adaptivity in the time stepping, and conservative choice of time steps can lead to larger sets of equations. Additionally, since you are solving all of the steps at once, and the cost of the solving scales cubically with the non-zeros of the constraint Jacobian (i.e. the cost of QR factorization on the Jacobian), the computational cost of this approach can be really large in comparison to a normal DAE solver when used on IVPs. There are some interesting things that can be done here to weave in training with the solving approach, and the computational overhead of this method is “not so bad” when considering training, and it can be a very stable method (if dt is small enough), but generally its memory and computational cost precludes it from being used in any large-scale scenarios such as PDEs. I’ll follow up with a post that gives many more benchmarks on this in the near future (now that now that ModelingToolkit.jl has a new backend which can codegen to this form, thus making the benchmarking on large difficult problems easier), but the preliminary results of the benchmarks are that this is around 1000x slower than using an optimized differentiable solver for IVPs when doing the fitting process. Ouch. Again, more benchmarks to follow, that’s the next post…

Not the end of the story: other ways to solve DAEs

So that’s final, we know that we can define any evolution equation with hard constraints as a DAE, and that there shows how to solve and differentiate any DAE, so therefore we’re done right? Not quite. The downside to the previously mentioned methods is that they all treat the equations implicitly. That is, when I described how to treat the mass matrix DAEs, I noted that implicit ODE solvers can be “upgraded” to DAE solvers in a specific way (plus some things around consistent initialization that are somewhat off topic). And the other choice is to use the fully implicit form $$f(u’,u,p,t)=0$$ which also must use implicit solvers like BDF methods. Implicit solvers can be slower than explicit solvers, like explicit Runge-Kutta methods, if you do not have stiffness (i.e. the ratio of your largest to smallest eigenvalue is not too large… is one way to say it, though it’s a lot more complicated than that). Thus if we were to take the pendulum system and treat it as a mass matrix DAE, we would take a major performance hit. Thus the question is, are there other ways to get a simulatable form for the DAE system?

Dimensional Reduction: The ODAEProblem Formulation

One way to do this is through a dimensional reduction technique. In the realm of PDEs this is known the “the ghost point method” but it’s actually rather general. For example, take the Robertson equation again:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
1 &= y_{1} + y_{2} + y_{3} \\
\end{aligned}$$

Now instead of solving for (y_1, y_2, y_3), let’s only solve for (y_1, y_2). How can we do that? Well we can define $y_3 = 1 – y_1 – y_2$. Thus using just an ODE solver:

function rober(du, u, p, t)
    y₁, y₂ = u
    k₁, k₂, k₃ = p
    y₃ = 1 - y₁ - y₂
    du[1] = -k₁ * y₁ + k₃ * y₂ * y₃
    du[2] = k₁ * y₁ - k₃ * y₂ * y₃ - k₂ * y₂^2
    nothing
end

there is no DAE solver necessary for this form as you only end up with a system of ODEs to solve since the algebraic conditions are embedded into the step. More generally, the procedure is in code:

function f(u, p, t)
    v = h(u,t)
    du = f(u,v,p,t)
end

However, in order to do this you need some $v = h(u,t)$ from the implicit constraints $0 = g(u,v,t)$. It turns out that this is not easy to do in general. You could just do a rootfind, i.e. $h(u,t) = NewtonSolve(g,u,t)$, but a Newton solve needs guesses, and it turns out that the Newton solve can be very sensitive to the guess choices here when the equation is nonlinear. The Julia system for sophisticated DAE handling, ModelingToolkit.jl, actually had a feature to construct equations of this form called ODAEProblem which was removed in the MTKv10 because it fails on any sufficiently difficult problem due to the relationship between guess choice propagation and branch selection. Thus this formulation of the equations is only recommended if you have some way of transforming the implicit constraints into an explicit equation $v = h(u,t)$, which generally rules out a large set of very nonlinear DAEs. In this vein, the paper titled Neural Differential Algebraic Equations
(later renamed Learning Neural Differential Algebraic Equations via Operator Splitting) takes this route by training a neural network surrogate as an approximate relationship $v = \tilde{h}(u,t)$ and embedding that into the equations. However, since this now uses a neural network for the explicit relationship $h$, the original equation $0 = g(u,v,t)$ is not exactly preserved and thus only held as a soft constraint on the correctness of $h$. Thus the soft constraint is added to the loss function to try to preserve the behavior. This formalism can be nice in the sense that it only requires an ODE solver, but because of this neural network inversion you lose the hard constraint behavior of the original DAE formalism and thus get the drift issues mentioned with the neural ODE with soft constraints. Thus it is easier to create software-wise but as a trade-off you lose the strong property of the DAE solver.

Singular Perturbation Formulation

Another approach is known as the singular perturbation formulation of the DAE. Again, let’s take the Robertson equations:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
1 &= y_{1} + y_{2} + y_{3} \\
\end{aligned}$$

Now what I’m going to do is introduce a small change to the final equation:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
\epsilon \frac{dy_3}{dt} &= y_{1} + y_{2} + y_{3} – 1 \\
\end{aligned}$$

In the limit as $\epsilon \rightarrow 0$, we are back at the same equation! Thus all we have to do is choose $\epsilon$ sufficiently small and we have a “good enough” approximation to our DAE. How small is small enough? It completely depends on the equation, and there’s another factor here to consider. As epsilon gets smaller, multiplying it to the other side $\mu = 1/\epsilon$ goes towards infinity. In other words, this is a “speed at which the equation returns to the condition $y_{1} + y_{2} + y_{3} – 1$”, where the condition is algebraic made to hold if that speed is infinitely fast. Unfortunately, having an equation which runs extremely fast in comparison to the other equations has another name: stiffness. So using the singular perturbation equation form has a classic trade-off: you either introduce numerical error by having $\epsilon$ too large, or by sending $\epsilon$ to zero you get numerical stiffness and thus need to use implicit methods (the same methods as you would use for mass matrix DAEs) and it becomes much more difficult to solve (many times it’s even more difficult than making it zero!).

Again, the benefit of this approach is that the relaxation to an ODE means that you can use an ODE solver for the process, which means that there are many more softwares readily available and it’s mechanically “much easier to solve”, but for high accuracy it can actually be more difficult because of the induced stiffness and any non-zero $\epsilon$ induces a constraint violation, which means your hard constraint will not be preserved over time. Once again, this will inhibit extrapolated trajectories since you will have violations of conservation of mass / conservation of energy which can propagate and cause a divergence over time. I don’t know of a machine learning paper yet that has introduced this form, but when they do this will be the trade-off they forget to mention in order to get the NeurIPS paper.

The Index Reduction Formulation

Once again…. let’s take the Robertson equations:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04 y_1 + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
1 &= y_{1} + y_{2} + y_{3} \\
\end{aligned}$$

If we want to get an ODE, one thing we can do is differentiate the last equation w.r.t. to time. If we do that we get the relationship that $y_{1}’ + y_{2}’ + y_{3}’ = 0$. Since the first equation is $y_{1}’$ and the second equation is $y_{2}’$, we can plug in these equations re-arrange to isolate $y_{3}’$ and we get:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04y₁ + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
\frac{dy_3}{dt} &= 3*10^7 y_{2}^2 \\
\end{aligned}$$

If you have ever seen “the Robertson equations as a stiff ODE test case” like in this tutorial, this is likely the formulation that you saw the equations before. Once again we arrive at something different, but has similar characteristics. This can be solved by ODE solvers, thus we don’t need DAE solvers, but numerical errors in the solution process cause a drift away from directly enforcing our constraint $y_{1} + y_{2} + y_{3} = 1$, and thus this can be a nice formulation to place soft constraints on but it’s not going to reach our ultimate goal for hard constraints.

The Manifold Relaxation Formulation

Again, let’s take the Robertson equations, but now let’s start from the Index-0 form:

$$\begin{aligned}
\frac{dy_1}{dt} &= -0.04y₁ + 10^4 y_2 y_3 \\
\frac{dy_2}{dt} &= 0.04 y_1 – 10^4 y_2 y_3 – 3*10^7 y_{2}^2 \\
\frac{dy_3}{dt} &= 3*10^7 y_{2}^2 \\
\end{aligned}$$

Let’s say we took a step forward with a naive ODE method, for example Euler’s method $$u_{n+1} = u_n + hf(u_n,p,t_n)$$. Then even if the current step $u_n$ satisfied the algebraic constraints, the next step $u_{n+1}$ does not necessarily satisfy the constraint. But, we if we know for example that $y_{1} + y_{2} + y_{3} = 1$, then we can simply change $\tilde{u}_{n+1} = u_{n+1} / ||u_{n+1}||$, i.e. normalize $u$ after each ODE solver step, and tada we’re back onto the manifold! This is actually a classic idea mentioned in Hairer’s third book Geometric Numerical Integration which has a lot of nice qualities for the problem at hand. In general, instead of normalizing we can solve an nonlinear least squares problem via a method like Guass-Newton and, if a solution exists (which one is guaranteed to if index reduced) then $\tilde{u}_{n+1} = NLLS(g, u_{n+1}, t_{n+1})$ is a well-defined projection to the manifold of $g(u,t)=0$. Hairer’s book even has a quick proof via the triangle inequality that such a projection does not lose order, and thus if you use a 5th order ODE integrator, this is a 5th order approximation to the solution on the manifold. This is the manifold relation formulation of DAE solving, which in Julia is rather straightforward as you simply add a ManifoldProjection discrete callback to the ODE solver.

This is the method that we demonstrate in the new paper Semi-Explicit Neural DAEs: Learning Long-Horizon Dynamical Systems with Algebraic Constraints:

It should then be clear why this method is able to always get a constraint violation of approximately 1e-12:

2e-16 is the maximum accuracy allowed by IEEE 64-bit floating point numbers, so with the condition number of the Jacobian etc. etc. with 64-bit float point numbers you can get an NLLS solver to converge to about 1e-12 for all of these problems, hence the constraint always being satisfied. This lets us both use the explicit ODE solver while also having hard constraints, which is why our method extrapolates without the energy loss/gain and does well for long-time trajectories.

Comment: Stabilized Neural ODE

Note that as part of this work, we also show that the Stabilized Neural Differential Equations for Learning Dynamics with Explicit Constraints is a continuous relaxation which is a simplification derived from the manifold projection. In particular:

Thus using the Jacobian reuse trick mentioned in the paper for the manifold project has a similar computational cost to the stabilized neural differential equation form while also enforcing the hard constraint more directly, and thus this derivation shows that in most cases you should prefer at least using the single Jacobian formulation of the manifold projection instead.

Differential Index of DAEs and the Reason for ModelingToolkit.jl

Oops, I glossed over a huge point. Near the very beginning of this discussion I introduced DAEs as equations of the form $$f(u’,u,p,t) = 0$$, and then somewhere down the line said that we can just treat all DAEs as split up:

$$u’ = f(u,v,p,t)$$
$$0 = g(u,v,p,t)$$

where $u$ are the differential variables and $v$ are the algebraic variables, which then leads to the mass matrix form $Mu’ = f(u,p,t)$. Cool… but if I was to give you some DAE like:

$$u_1′ + u_2′ = f_1(u,p,t)$$
$$sin(u_2′) = f_2(u,p,t)$$

in this case, what would the differential variables be? And the algebraic variables? It’s not cleanly split like I had described. Are we actually doomed? It turns out that I glossed over a few crucial details to make the DAE story a little bit easier. But just for completeness, let me give a crash course into why this works out:

  1. You can always transform one of these more difficult DAEs into one of the simpler semi-explicit DAEs through a technique known as index reduction. It’s basically what I showed in the index reduction section, i.e. differentiate and substitute, but the difficulty is finding out what to differentiate and how many times. Standard methods exist for this such as the Pantelides algorithm.
  2. Index reduction by Pantelides results in loss of exact constraints. A method known as the Dummy Derivative Method (first pioneered in Modelica tools) can be used to delete some differential equations and replace them with derived implicit constraints
  3. In order for the system to be solvable by IDA or mass matrix ODE solvers, we need to guarantee that the Jacobian matrix of the implicit solve, i.e. $M – gamma*J$, is non-singular. Luckily we have that this is true if the DAE is of index-1, i.e. if we apply index reduction with dummy derivatives and stop 1 differentiation short before we get to an ODE, then we get a DAE with hard constraints that is solvable by these methods
  4. In the same vein, the Jacobian of the nonlinear least squares solve for the manifold projection must be non-singular for the projection to exist, and luckily the condition that matrix is non-singular is once again the same as the DAE being of index 1, and thus if we apply index reduction to index 0 and solve with the manifold projection given to us by the dummy derivative technique then we have a valid projection

Example of the Index Reduction Process

As an example of what I mean by this, take the Pendulum written in Cartesian coordinates:

$$\begin{aligned}
x^\prime &= v_x\\
v_x^\prime &= Tx\\
y^\prime &= v_y\\
v_y^\prime &= Ty – g\\
0 &= x^2 + y^2 – L^2
\end{aligned}$$

It turns out that this way of writing the DAE is not amenable to being a hard constraint that can be used by any of the previously mentioned methods. The reason is because the algebraic variable $T$ does not show up in the algebraic constraint $0 = x^2 + y^2 – L^2$ and thus we cannot choose $T$ s.t. the constraint is satisfied as a separate projection, and the solvers have singularities in the Jacobians during their stepping. But through an automated procedure we can realize that if we differentiate the last equation 2 times and then substitute other derivatives in, we get:

$$\begin{aligned}
x^\prime &= v_x\\
v_x^\prime &= Tx\\
y^\prime &= v_y\\
v_y^\prime &= Ty – g\\
0 &= 2 \left(v_x^{2} + v_y^{2} + y ( y T – g ) + T x^2 \right)
\end{aligned}$$

which is physically the same system, just with the last equation rewritten, and this formulation of the constraint is compatible with the Neural DAE solvers / training routines. However, this algebraic condition is not complete and leads to a numerical drift over time:

this is “almost a pendulum”, but the length of the bar just keeps increasing, and you can see it drifting. What happened was that by doing index reduction we lost our original constraint. But we can remove some equations to place back in our original constraints to recover a better DAE. This looks like:

$$\begin{aligned}
x^\prime &= v_x\\
v_x^\prime &= Tx\\
0 &= x^2 + y^2 – L^2\\
0 &= \left(x v_x + y v_y\right)\\
0 &= \left(v_x^{2} + v_y^{2} + y ( y T – g ) + T x^2 \right)
\end{aligned}$$

We get these equations by, well $x^2 + y^2 – L^2 = 0$ is our original constraint, $x v_x + y v_y = 0$ is the derivative of the constraint w.r.t. time, and $v_x^{2} + v_y^{2} + y ( y T – g ) + T x^2 = 0$ is the second derivative w.r.t. time. If we keep these two extra equations and delete the two differential equations for $y$, then $y$ and $v_y$ change from being differential variables to algebraic variables. So I changed the system from having 4 differential variables $(x, v_x, y, v_y)$ and one algebraic variable $T$, to an equation with 2 differential variables $(x, v_x)$, and 3 algebraic variables $(y, v_y, T)$. And after doing this rewrite, the system has effectively no issues with drift and is compatible with DAE solvers!

(Note: while a fully implicit DAE solver will allow you to give a $$f(u’,u,p,t) = 0$$ that describes an arbitrarily high index DAE, the solver will simply fail to have its nonlinear solver steps converge. Thus for example IDAS requires a DAE of at most index-2 (and only in special circumstances, generally only index-1). Therefore, just because a DAE solver can take a $$f(u’,u,p,t) = 0$$ does not mean it can solve it!)

Okay… but that seems very difficult to figure out by hand? Can this be done automatically?

In total, not every DAE $$f(u’,u,p,t) = 0$$ that you write down will be compatible with the aforementioned solvers. The DAE solvers need index-1 and sometimes can allow index-2, the derivative rules need index-1 or 2, manifold projection has constraints, etc. but there is an automated symbolic approach that can transform any DAE $$f(u’,u,p,t) = 0$$ into a format that is compatible with all of the previously mentioned methods. The software that does that is ModelingToolkit.jl. It would take way too much time to describe exactly how this works, but if you’re curious see some of the lecture notes from the MIT IAP class on symbolic-numeric methods.

But what we have done is created a library ModelingToolkitNeuralNets.jl which allows you to embed neural networks into DAEs symbolically described by ModelingToolkit, which will automatically transform the equations into index-1 form to give you a reformulation of the constraints that is satisfiable.

And there you go, all of the published methods for Neural DAEs, plus all of the symbolic reformulations, all in one package.

Is there more to do?

Heck yes, reach out if you’re interested in doing research in this area. There’s still lots of edge cases to work out. I can guarantee you that there will be 50 NeurIPS papers in the next 5 years that simply put neural networks into ODE solvers in all of the different ways I described above to say they did Neural DAEs, but that’s not the interesting stuff. The interesting work is in making the symbolic-numeric aspect, i.e. the ModelingToolkit compiler, actually do well and handle more of the extreme cases. In particular, if the equation changes its differentiation index at a point, then you can have some constraints become $0=0$ which messes up the symbolic transformations. What should be done in this case? Hopefully from that discussion above you understand why this is the hard part, and no other research groups are really looking into how these DAE transformations interact with neural networks, so if you’re interested… I’m currently looking for students to work on this.

Conclusion

In conclusion, there’s still a lot more to say about DAEs, DAEs are hard. But under the assumptions that you have transformed your constraints into index-1 form (which symbolic-numeric compilers like ModelingToolkit.jl can automatically derive for you), this gives a complete scope of the different methods for solving DAEs and training neural networks on them. Every method tends to have its place, so let’s create a table of the engineering trade-offs:

DAE Formulation Neural DAE Paper that uses this form Pros Cons
Fully implicit DAE, f(u’,u,p,t)=0 Universal Differential Equations for Scientific Machine Learning, though adjoint is Adjoint Sensitivity Analysis for Differential-Algebraic Equations (2003) Easiest way to write a general DAE (i.e. don’t have to put in mass matrix form, no constant mass matrix required) Slower and less robust than mass matrix form, still requires index-1/2
Mass Matrix DAE, Mu’=f(u,p,t) Universal Differential Equations for Scientific Machine Learning General form that works for any index-1 DAE written in mass matrix form Requires implicit solvers even if the equation would otherwise be non-stiff
Algebraic in Time Formalism Incorporating Neural ODEs into DAE-Constrained Optimization Problems Robust, can exploit parallelism over time Very computationally costly in comparison to dedicated IVP approaches
ODAEProblem Reduction Learning Neural Differential Algebraic Equations via Operator Splitting Fast, only requires an explicit ODE solver Introduces drift if algebraic condition is not exact, has branch switching issues with guesses on hard problems
Singular Perturbation Form None yet Only requires an ODE solver Constraint error related to the size of epsilon, as epsilon -> 0 the stiffness increases, requiring implicit methods
Manifold Relaxation Semi-Explicit Neural DAEs: Learning Long-Horizon Dynamical Systems with Algebraic Constraints Only requires an ODE solver, can be fast (explicit) if non-stiff Requires a separate nonlinear solve from the stepper, thus slower if the solver is implicit
Continuous Manifold Relaxation Stabilized Neural Differential Equations for Learning Dynamics with Explicit Constraints Only requires and ODE solver, can be fast (explicit) if non-stiff Introduces drift in the constraints over time, equivalent in cost to manifold project with Jacobian trick
Index Reduction to ODE None yet Only requires an ODE solver Introduces drift in the constraints over time, thus violating the hard constraint over long solves

From this, we chose to go the approach of the manifold projection in the latest publication since that seems to be the best fit for many Neural DAE cases that may arise in machine learning, but we have been working on the ModelingToolkit.jl software to cover all forms and will continue to keep adding benchmarks to SciMLBenchmarks.jl showcasing the difference between all of the different ways of handling DAEs as they continue to improve. But, we still have much more research to do, specifically in the symbolic-numeric transformations, so stay tuned as we will have a pretty big announcement along these lines in just a few months (benchmarks already look promising!)!

The post Machine learning with hard constraints: Neural Differential-Algebraic Equations (DAEs) as a general formalism appeared first on Stochastic Lifestyle.

ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/modelingtoolkit-modelica-and-modia-the-composable-modeling-future-in-julia/

Let me take a bit of time here to write out a complete canonical answer to ModelingToolkit and how it relates to Modia and Modelica. This question comes up a lot: why does ModelingToolkit exist instead of building on tooling for Modelica compilers? I’ll start out by saying I am a huge fan of Martin and Hilding’s work, I respect them a ton, and they have made major advances in this space. But I think ModelingToolkit tops what they have developed in a not-so-subtle way. And it all comes down to the founding principle, the foundational philosophy, of what a modeling language needs to do.

Composable Abstractions for Model Transformations

There is a major philosophical difference which is seen in both the development and usage of the tools. Everything in the SciML organization is built around a principle of confederated modular development: let other packages influence the capabilities of your own. This is highlighted in a paper about the package structure of DifferentialEquations.jl. The underlying principle is that not everyone wants or needs to be a developer of the package, but still may want to contribute. For example, it’s not uncommon that a researcher in ODE solvers wants to build a package that adds one solver to the SciML ecosystem. Doing this in their own package for their own academic credit, but with the free bonus that now it exists in the multiple dispatch world. In the design of DifferentialEquations.jl, solve(prob,IRKGL16()) now exists because of their package, and so we add it to the documentation. Some of this work is not even inside the organization, but we still support it. The philosophy is to include every researcher as a budding artist in the space of computational research, including all of the possible methods, and building an infrastructure that promotes a free research atmosphere in the methods. Top level defaults and documentation may lead people to the most stable aspects of the ecosystem, but with a flip of a switch you can be testing out the latest research.

When approaching modeling languages like Modelica, I noticed this idea was completely foreign to modeling languages. Modelica is created by a committee, but the implementations that people use are closed like Dymola, or monolithic like OpenModelica. This is not a coincidence but instead a fact of the design of the language. In the Modelica language, there is no reference to what transformations are being done to your models in order to make them “simulatable”. People know about Pantelides algorithm, and “singularity elimination”, but this is outside the language. It’s something that the compiler maybe gives you a few options for, but not something the user or the code actively interacts with. Every compiler is different, advances in one compiler do not help your model when you use another compiler, and the whole world is siloed. By this design, it is impossible for an external user to write compiler passes in Modelica which effects this model lowering process. You can tweak knobs, or write a new compiler. Or fork OpenModelica and hack on the whole compiler to just do the change you wanted.

I do not think that the symbolic transformations that are performed by Modelica are the complete set that everyone will need for all models for all time. I think in many cases you might want to write your own. For example, on SDEs there’s a Lamperti transformation which can exist which transforms general SDEs to SDEs with additive noise. It doesn’t always apply, but when it does it can greatly enhance solver speed and stability. This is niche enough that it’ll never be in a commercial Modelica compiler (in fact, they don’t even have SDEs), but it’s something that some user might want to be able to add to the process.

ModelingToolkit: Opening the Development Process

So the starting goal of ModelingToolkit is to give an open and modular transformation system on which a whole modeling ecosystem can thrive. My previous blog post exemplified how unfamiliar use cases for code transformations can solve many difficult mathematical problems, and my goal is to give this power to the whole development community. `structural_simplify` is something built into ModelingToolkit to do “the standard transformations” on the standard systems, but the world of transformations is so much larger. Log-transforming a few variables? Exponentiating a few to ensure positivity? Lamperti transforms of SDEs? Transforming to the sensitivity equations? And not just transformations, but functionality for inspecting and analyzing models. Are the equations linear? Which parameters are structurally identifiable?

From that you can see that Modia was a major inspiration for ModelingToolkit, but Modia did not go in this direction of decomposing the modeling language: it essentially is a simplified Modelica compiler in Julia. But ModelingToolkit is a deconstruction of what a modeling language is. It pulls it down to its component pieces and then makes it easy to build new modeling languages like Catalyst.jl which internally use ModelingToolkit for all of the difficult transformations. The deconstructed form is a jumping point for building new domain-based languages, along with new transformations which optimize the compiler for specific models. And then in the end, everybody who builds off of it gets improved stability, performance, and parallelism as the core MTK passes improve.

Bringing the Power to the People

Now there’s two major aspects that need to be handle to fully achieve such a vision though. If you want people to be able to reuse code between transformations, what you want is to expose how you are changing code. To achieve this goal, a new Computer Algebra System (CAS), Symbolics.jl, was created for ModelingToolkit.jl. The idea being, if we want everyone writing code transformations, they should all have easy access to a general mathematical toolset for doing such code transformations. We shouldn’t have everyone building a new code for differentiation, simplify, and substitution. And we shouldn’t have everyone relying on undocumented internals of ModelingToolkit.jl either: this should be something that is open, well-tested, documented, and a well-known system so that everyone can easily become a “ModelingToolkit compiler developer”. By building a CAS and making it a Julia standard, we can bridge that developer gap because now everyone knows how to easily manipulate models: they are just Symbolics.jl expressions.

The second major aspect is to achieve a natural embedding into the host language. Modelica is not a language in which people can write compiler passes, which introduces a major gap between the modeler and the developer of extensions to the modeling language. If we want to bridge this gap, we need to ensure the whole modeling language is used from a host which is a complete imperative programming language. And you need to do so in a language that is interactive, high performance, and has a well-developed ecosystem for modeling and simulation. Martin and Hilding had seen this fact as the synthesis for Modia with how Julia uniquely satisfies this need, but I think we need to take it a step further. To really make the embedding natural, you should be able to on the fly automatically convert code to and from the symbolic form. In the previous blog post I showcased how ModelingToolkit.jl could improve people’s code by automatically parallelizing it and performing index reduction even if the code was not written in ModelingToolkit.jl. This grows the developer audience of the transformation language from “anyone who wants to transform models” to “anyone who wants to automate improving models and general code”. This expansion of the audience is thus pulling in developers who are interested in things like automating parallelism and GPU codegen and bringing them into the MTK developer community.

Intern, since all of these advances then apply to the MTK internals and code generation tools such as Symbolics.jl’s build_function, new features are coming all of the time because of how the community is composed. The CTarget build_function was first created to transpile Julia code to C, and thus ModelingToolkit models can generate C outputs for compiling into embedded systems. This is serendipity when seeing one example, but it’s design when you notice that this is how the entire system is growing so fast.

But Can Distributed Development Be As Good As Specialized Code?

Now one of the questions we received early on was, won’t you not be able to match the performance a specialized compiler which was only made to work on Modelica, right? While at face value it may seem like hyperspecialization could be beneficial, the true effect of hyperspecialization is that algorithms are simply less efficient because less work has been put into them. Symbolics.jl has become a phenomenon of its own, with multiple different hundred comment threads digging through many aspects of the pros and cons of its designs, and that’s not even including the 200 person chat channel which has had tens of thousands of messages in the less than 2 months since the CAS was released. Tons of people are advising how to improve every single plus and multiply operation.

So it shouldn’t be a surprise that there are many details that have quickly been added which are still years away from a Modelica implementation. It automatically multithreads tree traversals and rewrite rules. It automatically generates fast parallelized code, and can do so in a way that composes with tearing of nonlinear equations. It lets users define their own high-performance and parallelized functions, register them, and stick them into the right hand side. And that is even excluding the higher level results, like the fact that it is fully differentiable and thus allows training neural networks decomposed within the models, and automatically discover equations from data.

Just at the very basic level we can see that the CAS is transforming the workflows of scientists and engineers in many aspects of the modeling process. By distributing the work of improving symbolic computing, we have already taken examples which were essentially not obtainable and making them instant with Symbolics.jl:

We are building out a full benchmarking system for the symbolic ecosystem to track performance over time and ensure it reaches the top level. It’s integrating pieces from The OSCAR project, getting lots of people tracking performance in their own work, and building a community. Each step is another major improvement and this ecosystem is making these steps fast. It will be hard for a few people working on the internals of a single Modelica compiler to keep up with such an environment, let alone repeating this work to every new Modelica-based project.

But How Do You Connect To Modelica?

This is a rather good question because there are a lot of models already written in Modelica, and it would be a shame for us to not be able to connect with that ecosystem. I will hint that there is coming tooling as part of JuliaSim for connecting to many pre-existing model libraries. In addition, we hope to make use of tooling like Modia.jl and TinyModia.jl will help us make a bridge.

Conclusion: Designing Around the Developer Community Has Many Benefits

The composability and distributed development nature of ModelingToolkit.jl is its catalyst. This is why ModelingToolkit.jl looks like it has rocket shoes on: it is fast and it is moving fast. And it’s because of the thought put into the design. It’s because ModelingToolkit.jl is including the entire research community as its asset instead of just its user. I plan to keep moving forward from here, looking back to learn from the greats, but building it in our own image. We’re taking the idea of a modeling language, distributing it throughout one of the most active developer communities in modeling and simulation, in a language which is made to build fast and parallelized code. And you’re invited.

PS: what about Simulink?

I’m just going to post a self-explanatory recent talk by Jonathan at the NASA Launch Services Program who saw a 15,000x acceleration by moving from Simulink to ModelingToolkit.jl.

Enough said.

The post ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia appeared first on Stochastic Lifestyle.

JuliaCall Update: Automated Julia Installation for R Packages

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/juliacall-update-automated-julia-installation-for-r-packages/

Some sneakily cool features made it into the JuliaCall v0.17.2 CRAN release. With the latest version there is now an install_julia function for automatically installing Julia. This makes Julia a great high performance back end for R packages. For example, the following is an example from the diffeqr package that will work, even without Julia installed:

install.packages("diffeqr")
library(diffeqr)
de <- diffeqr::diffeq_setup()
 
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
sol <- de$solve(fastprob,de$Tsit5(),saveat=0.01)

Under the hood it’s using the DifferentialEquations.jl package and the SciML stack, but it’s abstracted from users so much that Julia is essentially an alternative to Rcpp with easier interactive development. The following example really brings the seamless integration home:

install.packages(diffeqr)
library(diffeqr)
de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
 
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
 
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol <- de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=10000,saveat=0.01)

This example does the following:

  1. Automatically installs Julia
  2. Automatically installs DifferentialEquations.jl
  3. Automatically installs CUDA (via CUDA.jl)
  4. Automatically installs ModelingToolkit.jl and DiffEqGPU.jl
  5. JIT transpiles the R function to Julia via ModelingToolkit
  6. Uses KernelAbstractions (in DiffEqGPU) to generate a CUDA kernel from the Julia code
  7. Solves the ODE 10,000 times with different parameters 350x over deSolve

What a complicated code! Well maybe it would shock you to know that the source code for the diffeqr package is only 150 lines of code. Of course, it’s powered by a lot of Julia magic in the backend, and so can your next package.

The post JuliaCall Update: Automated Julia Installation for R Packages appeared first on Stochastic Lifestyle.