Tag Archives: mathematics

Direct Automatic Differentiation of (Differential Equation) Solvers vs Analytical Adjoints: Which is Better?

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/direct-automatic-differentiation-of-solvers-vs-analytical-adjoints-which-is-better/

Automatic differentiation of a “solver” is a subject with many details for doing it in the most effective form. For this reason, there are a lot of talks and courses that go into lots of depth on the topic. I recently gave a talk on some of the latest stuff in differentiable simulation with the American Statistical Association, and have some detailed notes on such adjoint derivations as part of the 18.337 Parallel Computing and Scientific Machine Learning graduate course at MIT. And there are entire organizations like my SciML Open Source Software Organization which work day-in and day-out on the development of new differentiable solvers.

I’ll give a brief summary of all my materials here below.

Continuous vs Discrete Differentiation of Solvers

AD of a solver can be done in essentially two different ways: either directly performing automatic differentiation to the steps of the solver, or by defining higher level adjoint rules that will compute the derivative. In some cases these can be mathematically equivalent. For example, forward sensitivity analysis of an ODE $$u’ = f(u,p,t)$$ follows by the chain rule:

$$\frac{d}{dp} \frac{du}{dt} = \frac{d}{dp} f(u,p,t) = \frac{df}{du} \frac{du}{dp} + \frac{\partial f}{\partial p}$$

Thus if you solve the extended system of equations:

$$u’ = f(u,p,t)$$
$$s’ = \frac{df}{du} s + \frac{\partial f}{\partial p}$$

then you get $$s = \frac{du}{dp}$$ as the solution to the new equations. So therefore, solve these bigger ODEs and you get the derivative of the solution with respect to parameters as the extra piece. One way to do “automatic differentiation” is to add a derivative rule to the AD library that “if you see ODE solve, then replace the solve with this extended solve and take the latter part as the derivative”. The other way of course is to simply do forward-mode automatic differentiation of the ODE solver library steps itself. It turns out that in this case, if you work out the math the two are mathematically equivalent. Note that it’s not computationally equivalent though since the AD process may SIMD the expressions in a different way, doing some constant folding and common subexpression elimination (CSE) in a way that’s different from the hand-coded version, and thus the performance can be very different even though it’s computationally the same algorithm.

However, there are cases where the “analytical” way of writing the derivative is not equivalent to its automatic differentiation counterpart. For example, the adjoint method is a different way to get $$\frac{du}{dp}$$ values in $$\mathcal{O}(n+p)$$ time (instead of the $$\mathcal{O}(np)$$ time of the forward sensitivities above) by solving an ODE forward and some related ODE backwards (for a full derivation and description, see the lecture notes or the recorded video). If you were to do reverse-mode automatic differentiation of the solver, you do not get a mathematically equivalent algorithm. For example, if the solver for the ODE was Euler’s method, reverse-mode AD would be mathematically equivalent to solving the forward ODE with Euler’s method and the reverse ODE with something like implicit Euler (where part of the implicit equation is solved exactly using a cached value from the forward solve).

So What is Better, Continuous Derivative Rules or Discrete Derivatives of the Solver?

Like any complex question, it depends. We had a manuscript which looked at this in quite some detail (and a biologically-oriented follow-up), and can boil it down to a few basic notes:

  • Forward-mode outperforms reverse-mode / adjoint techniques when the equations are “sufficiently small”. For modern implementations this seems to be at around 100.
  • For forward-mode cases, “good” automatic differentiation libraries can make use of structure between the primal and derivative constructions to better CSE/SIMD the generated code for the derivative term, thus forward-mode AD of the solver can be much faster than forward sensitivity analysis even though the two are mathematically the same operation.
  • For reverse-mode cases, the continuous adjoints seem to be faster with current implementations.

But that last bit then has many caveats to put on it. For one, there seems to be a trade-off between performance and stability here. This is noted in the appendix of the paper “Universal Differential Equations for Scientific Machine Learning, which states:

Previous research has shown that the discrete adjoint approach is more stable than continuous adjoints in some cases [53, 47, 94, 95, 96, 97] while continuous adjoints have been demonstrated to be more stable in others [98, 95] and can reduce spurious oscillations [99, 100, 101]. This trade-off between discrete and continuous adjoint approaches has been demonstrated on some equations as a trade-off between stability and computational efficiency [102, 103, 104, 105, 106, 107, 108, 109, 110]. Care has to be taken as the stability of an adjoint approach can be dependent on the chosen discretization method [111, 112, 113, 114, 115]

with the references pointing to those in the manuscript.

This is discussed in even more detail in the manuscript Stiff Neural Ordinary Differential Equations which showcases how there are many ways to implement “the adjoint method”, and they can have major differences in stability, essentially trading off memory or performance for improved stability properties.

Special Case: Implicit Equations

The above discussion shows that there are good reasons to differentiate solvers directly, and good reasons to instead write derivative rules for solvers which use forward/adjoint equations. For time series equations, this always has a trade-off. There is an important special case here though that for methods which iterate to convergence, automatic differentiation of the solver is essentially never a good idea. The reason is because the implicit function theorem gives that the derivative of the solution is directly defined at the solution point. For example, for solving $$f(x,p) = 0$$, if $$x^\ast$$ is the value of $$x$$ which satisfies the equation, then $$\frac{d x^\ast}{dp} = …$$. In other words, Newton’s method might take $$n$$ steps, and thus automatic differentation will need to differentiate $$f$$ at least $$n$$ times. But if you use the implicit function theorem result, then you only need to differentiate it once!

Note of course a similar performance vs stability trade-off does apply here. Since this derivation assumes you have $$x^\ast$$ such that $$f(x^\ast,p) = 0$$ exactly, but you don’t. Newton’s method from the solve will give you something that satisfies the equation to tolerance, so maybe $$f(x^\ast,p) \approx 10^{-8}$$, which means that the derivative expression is also only approximate, and this then induces an error in the gradient etc. Thus direct differentiation of Newton’s method can be more accurate, and you need to worry about tolerance here if the gradients seem sufficiently off.

This does lead to some counter-intuitive results. For example, we had a paper where we exploited this to note that differentiating and ODE solve which goes to infinity (steady state) is faster than a “long ODE”, since steady states have a similar implicit definition. It’s quicker to go to infinity than it is to go to 1000, who would’ve thought? Math is fun.

Does Differentiation of Solver Internals Make Sense or Have a Meaning?

“ODE solvers” have all sorts of things in there, like adaptivity parameters and heuristics. One of the things that happens when you do automatic differentiation of the solver is that you aren’t just differentiating the solver’s states and parameters, but the process will differentiate everything. It turns out that AD of a solver can thus be useful in some tricky ways which put this to use. For example, at ICML we had a paper which regularized the parameters of a neural ODE by the sum of the computed error estimates of the adaptivity heuristics. This would then push the learned equation towards an area of parameter space where the adaptivity gives the largest time steps possible, and thus the learned equation is the “fastest possible learned equation that fits the time series”. Such a trick is only possible if you are doing automatic differentiation of the solver since you’d need to differentiate the solver’s internals in order to have access to those values in the loss function! This just shows one of many ways in which AD’s “extra information” which analytical continuous derivative definitions don’t have could potentially be useful for some applications.

Automatic Differentiation in Continuous Sensitivity Methods

Finally, I want to note that even when you attempt to avoid automatic differentiation of the solver by using continuous sensitivity methods, it turns out that the optimal way to build the extended equations is to use automatic differentiation!

Summary: there are many good reasons to do automatic differentiation of solvers, but there are also many good reasons to use some analytical derivative techniques. But even if you do analytical derivative techniques, you still want to automatic differentiate something in order to do it optimally!

For example, let’s return to the forward sensitivity equations:

$$u’ = f(u,p,t)$$
$$s’ = \frac{df}{du} s + \frac{\partial f}{\partial p}$$

It turns out that $$\frac{df}{du} s$$ does not require computing the full Jacobian. This operation, known as Jacobian-vector products or jvps, are the primitive operation of forward-mode automatic differentiation and thus special seeding of a forward-mode AD tool gives a faster and more robust algorithm than a finite difference form. When done correctly, this operation is computed without ever building the full Jacobian. A trick for this does exist in the finite difference sense as well:

$$\frac{df}{du} s \approx \frac{f(u + \epsilon s) – f(u)}{\epsilon}$$

since it is equivalent to the directional derivative. This is explained in more detail in these lectures (or accompanying video).

In the same vein, continuous adjoints of ODE solves boil down to defining a differential equation which is solved backwards and that differential equation which is solved backwards has a term which is $$\frac{df}{du}^T s$$, i.e. Jacobian transposed times a vector, also known as the vector-Jacobian product because it’s equivalent to $$s^T \frac{df}{du}$$ when transposed. It turns out that this is the primitive operation of reverse-mode AD, which then allows for computing this operation without fully building the Jacobian. There is no analogue for this operation with finite differencing, which means that there’s a pretty massive performance gain from doing this properly. Our paper A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions measures this effect on a stiff partial differential equation, getting:

The takeaway from this plot is that using these AD tricks results in a few orders of magnitude performance improvements (by avoiding the Jacobian construction, which are the “seeding” versions on the left, the right shows the difference that different AD techniques make, which itself is another few orders of magnitude). When people note that the Julia differential equation adjoint solvers are much faster than the adjoints from Sundials COVDES and IDAS on large equations, this part right here is one of the major factors because Sundials does not embed a reverse AD engine into its adjoint code to do the vjp definitions, and instead falls back to using a numerical formulation unless the user provides a vjp override, which is seemingly to be uncommon to do but from these plots clearly should be done more often.

Summary

In total, what can we takeaway so far about differentiating solvers?

  • There are some advantages to differentiating solvers, but there are also some advantages to mixing in analytical continuous adjoints. It’s context-dependent which is better.
  • Even when mixing in analytical continuous derivative rules, these are best defined with automatic differentiation within their constructed equations, so one cannot avoid AD completely if one wishes to achieve full performance on arbitrary models.
  • For cases which converge to some kind of implicitly defined solution, using special adjoint tricks will be much better than direct differentiation of the solver.

There’s still a lot more to mention, especially as stochastic simulation gets involved, but I’ll cut this here for now. As you can see, there’s still some open questions that are being investigated in the field, so if you find this interesting please feel free to get in touch.

The post Direct Automatic Differentiation of (Differential Equation) Solvers vs Analytical Adjoints: Which is Better? appeared first on Stochastic Lifestyle.

Learning Epidemic Models That Extrapolate, AI4Pandemics

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/learning-epidemic-models-that-extrapolate-ai4pandemics/

I think this talk was pretty good so I wanted to link it here!

Title: Learning Epidemic Models That Extrapolate

Speaker Chris Rackauckas, https://chrisrackauckas.com/

Abstract:
Modern techniques of machine learning are uncanny in their ability to automatically learn predictive models directly from data. However, they do not tend to work beyond their original training dataset. Mechanistic models utilize characteristics of the problem to ensure accurate qualitative extrapolation but can lack in predictive power. How can we build techniques which integrate the best of both approaches? In this talk we will discuss the body of work around universal differential equations, a technique which mixes traditional differential equation modeling with machine learning for accurate extrapolation from small data. We will showcase how incorporating different variations of the technique, such as Bayesian symbolic regression and optimizing the choice of architectures, can lead to the recovery of predictive epidemic models in a robust way. The numerical difficulties of learning potentially stiff and chaotic models will highlight how most of the adjoint techniques used throughout machine learning are inappropriate for learning scientific models, and techniques which mitigate these numerical ills will be demonstrated. We end by showing how these improved stability techniques have been automated and optimized by the software of the SciML organization, allowing practitioners to quickly scale these techniques to real-world applications.

See more on: https://ai4pandemics.org/

The post Learning Epidemic Models That Extrapolate, AI4Pandemics appeared first on Stochastic Lifestyle.

ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/modelingtoolkit-modelica-and-modia-the-composable-modeling-future-in-julia/

Let me take a bit of time here to write out a complete canonical answer to ModelingToolkit and how it relates to Modia and Modelica. This question comes up a lot: why does ModelingToolkit exist instead of building on tooling for Modelica compilers? I’ll start out by saying I am a huge fan of Martin and Hilding’s work, I respect them a ton, and they have made major advances in this space. But I think ModelingToolkit tops what they have developed in a not-so-subtle way. And it all comes down to the founding principle, the foundational philosophy, of what a modeling language needs to do.

Composable Abstractions for Model Transformations

There is a major philosophical difference which is seen in both the development and usage of the tools. Everything in the SciML organization is built around a principle of confederated modular development: let other packages influence the capabilities of your own. This is highlighted in a paper about the package structure of DifferentialEquations.jl. The underlying principle is that not everyone wants or needs to be a developer of the package, but still may want to contribute. For example, it’s not uncommon that a researcher in ODE solvers wants to build a package that adds one solver to the SciML ecosystem. Doing this in their own package for their own academic credit, but with the free bonus that now it exists in the multiple dispatch world. In the design of DifferentialEquations.jl, solve(prob,IRKGL16()) now exists because of their package, and so we add it to the documentation. Some of this work is not even inside the organization, but we still support it. The philosophy is to include every researcher as a budding artist in the space of computational research, including all of the possible methods, and building an infrastructure that promotes a free research atmosphere in the methods. Top level defaults and documentation may lead people to the most stable aspects of the ecosystem, but with a flip of a switch you can be testing out the latest research.

When approaching modeling languages like Modelica, I noticed this idea was completely foreign to modeling languages. Modelica is created by a committee, but the implementations that people use are closed like Dymola, or monolithic like OpenModelica. This is not a coincidence but instead a fact of the design of the language. In the Modelica language, there is no reference to what transformations are being done to your models in order to make them “simulatable”. People know about Pantelides algorithm, and “singularity elimination”, but this is outside the language. It’s something that the compiler maybe gives you a few options for, but not something the user or the code actively interacts with. Every compiler is different, advances in one compiler do not help your model when you use another compiler, and the whole world is siloed. By this design, it is impossible for an external user to write compiler passes in Modelica which effects this model lowering process. You can tweak knobs, or write a new compiler. Or fork OpenModelica and hack on the whole compiler to just do the change you wanted.

I do not think that the symbolic transformations that are performed by Modelica are the complete set that everyone will need for all models for all time. I think in many cases you might want to write your own. For example, on SDEs there’s a Lamperti transformation which can exist which transforms general SDEs to SDEs with additive noise. It doesn’t always apply, but when it does it can greatly enhance solver speed and stability. This is niche enough that it’ll never be in a commercial Modelica compiler (in fact, they don’t even have SDEs), but it’s something that some user might want to be able to add to the process.

ModelingToolkit: Opening the Development Process

So the starting goal of ModelingToolkit is to give an open and modular transformation system on which a whole modeling ecosystem can thrive. My previous blog post exemplified how unfamiliar use cases for code transformations can solve many difficult mathematical problems, and my goal is to give this power to the whole development community. `structural_simplify` is something built into ModelingToolkit to do “the standard transformations” on the standard systems, but the world of transformations is so much larger. Log-transforming a few variables? Exponentiating a few to ensure positivity? Lamperti transforms of SDEs? Transforming to the sensitivity equations? And not just transformations, but functionality for inspecting and analyzing models. Are the equations linear? Which parameters are structurally identifiable?

From that you can see that Modia was a major inspiration for ModelingToolkit, but Modia did not go in this direction of decomposing the modeling language: it essentially is a simplified Modelica compiler in Julia. But ModelingToolkit is a deconstruction of what a modeling language is. It pulls it down to its component pieces and then makes it easy to build new modeling languages like Catalyst.jl which internally use ModelingToolkit for all of the difficult transformations. The deconstructed form is a jumping point for building new domain-based languages, along with new transformations which optimize the compiler for specific models. And then in the end, everybody who builds off of it gets improved stability, performance, and parallelism as the core MTK passes improve.

Bringing the Power to the People

Now there’s two major aspects that need to be handle to fully achieve such a vision though. If you want people to be able to reuse code between transformations, what you want is to expose how you are changing code. To achieve this goal, a new Computer Algebra System (CAS), Symbolics.jl, was created for ModelingToolkit.jl. The idea being, if we want everyone writing code transformations, they should all have easy access to a general mathematical toolset for doing such code transformations. We shouldn’t have everyone building a new code for differentiation, simplify, and substitution. And we shouldn’t have everyone relying on undocumented internals of ModelingToolkit.jl either: this should be something that is open, well-tested, documented, and a well-known system so that everyone can easily become a “ModelingToolkit compiler developer”. By building a CAS and making it a Julia standard, we can bridge that developer gap because now everyone knows how to easily manipulate models: they are just Symbolics.jl expressions.

The second major aspect is to achieve a natural embedding into the host language. Modelica is not a language in which people can write compiler passes, which introduces a major gap between the modeler and the developer of extensions to the modeling language. If we want to bridge this gap, we need to ensure the whole modeling language is used from a host which is a complete imperative programming language. And you need to do so in a language that is interactive, high performance, and has a well-developed ecosystem for modeling and simulation. Martin and Hilding had seen this fact as the synthesis for Modia with how Julia uniquely satisfies this need, but I think we need to take it a step further. To really make the embedding natural, you should be able to on the fly automatically convert code to and from the symbolic form. In the previous blog post I showcased how ModelingToolkit.jl could improve people’s code by automatically parallelizing it and performing index reduction even if the code was not written in ModelingToolkit.jl. This grows the developer audience of the transformation language from “anyone who wants to transform models” to “anyone who wants to automate improving models and general code”. This expansion of the audience is thus pulling in developers who are interested in things like automating parallelism and GPU codegen and bringing them into the MTK developer community.

Intern, since all of these advances then apply to the MTK internals and code generation tools such as Symbolics.jl’s build_function, new features are coming all of the time because of how the community is composed. The CTarget build_function was first created to transpile Julia code to C, and thus ModelingToolkit models can generate C outputs for compiling into embedded systems. This is serendipity when seeing one example, but it’s design when you notice that this is how the entire system is growing so fast.

But Can Distributed Development Be As Good As Specialized Code?

Now one of the questions we received early on was, won’t you not be able to match the performance a specialized compiler which was only made to work on Modelica, right? While at face value it may seem like hyperspecialization could be beneficial, the true effect of hyperspecialization is that algorithms are simply less efficient because less work has been put into them. Symbolics.jl has become a phenomenon of its own, with multiple different hundred comment threads digging through many aspects of the pros and cons of its designs, and that’s not even including the 200 person chat channel which has had tens of thousands of messages in the less than 2 months since the CAS was released. Tons of people are advising how to improve every single plus and multiply operation.

So it shouldn’t be a surprise that there are many details that have quickly been added which are still years away from a Modelica implementation. It automatically multithreads tree traversals and rewrite rules. It automatically generates fast parallelized code, and can do so in a way that composes with tearing of nonlinear equations. It lets users define their own high-performance and parallelized functions, register them, and stick them into the right hand side. And that is even excluding the higher level results, like the fact that it is fully differentiable and thus allows training neural networks decomposed within the models, and automatically discover equations from data.

Just at the very basic level we can see that the CAS is transforming the workflows of scientists and engineers in many aspects of the modeling process. By distributing the work of improving symbolic computing, we have already taken examples which were essentially not obtainable and making them instant with Symbolics.jl:

We are building out a full benchmarking system for the symbolic ecosystem to track performance over time and ensure it reaches the top level. It’s integrating pieces from The OSCAR project, getting lots of people tracking performance in their own work, and building a community. Each step is another major improvement and this ecosystem is making these steps fast. It will be hard for a few people working on the internals of a single Modelica compiler to keep up with such an environment, let alone repeating this work to every new Modelica-based project.

But How Do You Connect To Modelica?

This is a rather good question because there are a lot of models already written in Modelica, and it would be a shame for us to not be able to connect with that ecosystem. I will hint that there is coming tooling as part of JuliaSim for connecting to many pre-existing model libraries. In addition, we hope to make use of tooling like Modia.jl and TinyModia.jl will help us make a bridge.

Conclusion: Designing Around the Developer Community Has Many Benefits

The composability and distributed development nature of ModelingToolkit.jl is its catalyst. This is why ModelingToolkit.jl looks like it has rocket shoes on: it is fast and it is moving fast. And it’s because of the thought put into the design. It’s because ModelingToolkit.jl is including the entire research community as its asset instead of just its user. I plan to keep moving forward from here, looking back to learn from the greats, but building it in our own image. We’re taking the idea of a modeling language, distributing it throughout one of the most active developer communities in modeling and simulation, in a language which is made to build fast and parallelized code. And you’re invited.

PS: what about Simulink?

I’m just going to post a self-explanatory recent talk by Jonathan at the NASA Launch Services Program who saw a 15,000x acceleration by moving from Simulink to ModelingToolkit.jl.

Enough said.

The post ModelingToolkit, Modelica, and Modia: The Composable Modeling Future in Julia appeared first on Stochastic Lifestyle.