# Algorithm efficiency comes from problem information

This is a high level post about algorithms (especially mathematical, scientific, and data analysis algorithms) which I hope can help people who are not researchers or numerical software developers better understand how to choose and evaluate algorithms. When scientists and programmers think about efficiency of algorithms, they tend to think about high level ideas like temporary arrays, choice of language, and parallelism. Or they tend to think about low level ideas like pointer indirection, cache efficiency, and SIMD vectorization. However, while these features matter, most decent implementations of the same algorithm tend to get quite close in efficiency to each other (probably <3x if everyone is using a performant enough programming language or constructs). If you're sitting down thinking "what is the next algorithm that I am going to invent that will be truly a step beyond what existed before?", these computational details matter, but not as much as the answer to one important question:

## What does my algorithm know about my problem?

Let me highlight one of the biggest success stories in numerical analysis: solving linear systems. The problem is simple, for what values of $x$ does $Ax=b$? Most people learn early on in their math career that you solve this via Gaussian elimination which is a very nice $\mathcal{O}(n^3)$ algorithm. In fact, you can show that solving a linear systems has to have $\mathcal{O}(n^3)$ as a lower bound, so we’re done now right? It’s all a game of who can write the most efficient Gaussian elimination?

Not even close. MATLAB’s backslash (A\b) operator solves the linear system $Ax=b$, but what algorithm does it use? Well, Mathworks describes it with a picture:

MATLAB_dense

Do you think that’s it? If so, then you’re wrong. That’s only when the matrix is full/dense. If it’s a sparse matrix, then there’s another algorithm for it:

MATLAB_sparse

Julia’s backslash operator is similar, but what’s cool is you can actually inspect Julia’s source code and see exactly what it’s doing. But it’s the same idea. This kind of algorithm is known as a polyalgorithm since it’s many different algorithms together. The algorithm starts by checking what kind of type the matrix is and then performing a specialized algorithm on that type, and if no specialized algorithm exists, falls back to Gaussian elimination through a pivoted QR-factorization.

Why would you write an algorithm like this? The trivial answer is because it’s faster. If you know the matrix is diagonal then the solution is a $\mathcal{O}(n)$ element-wise division by the diagonal. If you know the matrix is positive-definite then you can use a Cholesky decomposition to solve the equation which takes about half of the operations of Gaussian elimination. However, I think it’s better to realize the common thread here. The diagonal method is faster because it knows about properties of diagonal matrices and uses them. The Cholesky decomposition is faster because it utilizes details about positive-definite matrices in order to get rid of possible operations.

### These specialized algorithms are fast because they use more “information” about the problem

And that’s only the start of this entire field of mathematics known as numerical linear algebra. There are iterative methods which can also be used. (Dense) Matrix factorizations are dense and thus require the same amount of memory as the full matrix itself (though some sparse matrix factorizations similarly need the memory for non-zero elements and maybe some more). Iterative solvers require only the ability to do $A*x$ and thus are much more memory efficient and can scale well on sparse matrices. This also makes them easily to parallelize across large clusters and thus are the method of choice for large solving large sparse systems that arise from things like PDEs. Even then, you can still make it more efficient by choosing a preconditioner, but good choices of preconditioners are dependent on, you guessed it, properties of your $A$ or the equation it is derived from.

As you can see, every significant step of this tool chain is about baking more information about the problem into the solution methods. This is why I would suggest that one thinks about the amount of information an algorithm contains about a problem since that’s where the true gains are. Let’s take another example.

## Are neural networks “efficient”?

If there’s one area of computational data science that everyone is excited about right now, it’s neural networks. The quickest way to explain a neural network is that it is a computationally efficient way to approximate any mapping from inputs to outputs, $f(x)=y$. The $f$ that it creates uses a lot of matrix multiplies which are easy to GPU-parallelize, and the deep in “deep neural networks” simply is the application of more matrix multiplies to get a better approximation of $f$. Movie recommendations are where you take in x = (data about movies the person previously watched) and then y = (the score that you think they’d give to other movies), and you use enough data to get a good enough approximation for the movies they would like and spit out the top few. Classification problems like “does this picture contain a bird?” is just about creating a mapping between x = (a picture represented by its pixel brightness), to y = 1 or 0 (yes or no: it contains a bird). You train this on enough data and it’s correct enough of the time. Amazing, right?

The next question people like to ask is, what neural network frameworks are good for these problems? TensorFlow, PyTorch, KNet.jl will all auto-GPU you things for you and in the end all give around the same performance. Of course, package developers will fight over these 2x-5x differences over the next decade, showing that their internal setup is good for these problems, while others will show it’s not good for these other problems. But what this doesn’t tell you is whether you should be using a neural network in the first place.

(Yes, you can use these computational tools for things other than neural networks, but let’s ignore that for now)

If you think about the point of this blog post, it should trigger something in you. I mentioned that neural networks can approximate any function $f(x)=y$… so what does the neural network “know” about the problem? Pretty much nothing. The reason why neural networks are popular is because they are a nice hammer that you can use on pretty much any problem, but that doesn’t mean it’s always good. In fact, because you can use a neural network on any problem it pretty much implies that it won’t be great on any problem. That’s the fundamental trade off between specificity and generality! My friend Lyndon White displayed this very nicely when he showed 7 different ways to solve a classification problem in Julia. Notice that the neural network method is dead last in efficiency. Neural networks aren’t bad, but they just don’t know anything about a problem. If the problem is linear, then a linear regression or linear SVM will DDDDOMINATE on the problem! If it’s a classification problem, then algorithms which incorporate knowledge about what a classification problem means, more so than “spit out y in [0,1] via sigmoid and say yes if y is greater than some threshold”, will do better. For this domain, it includes things like decision trees. For movie recommendation problems, factorization machines more succinctly capture our internal model of how a user’s movie ratings should be related, and thus it’s no surprise that they tend to win most of the Netflix prize type of competitions.

Note that neural networks can be made problem-specific as well. Convolutional neural networks are successful in image processing because they add an extra constraint about the relations of the data, specifically that small “stencils” of the larger matrix (i.e. groups of nearby pixels) should be interrelated. By adding that kind of information into the structure of the neural network and its optimization algorithms, this then becomes a very efficient tool for tasks where the data has this structure.

So neural networks are not useless even though they are not always efficient. They are extremely useful since they can be applied to pretty much any problem. But the general neural network architectures (like deep feed-forward neural networks) lack knowledge about the specific problems they are approximating, and so they cannot hope to be as efficient as domain-optimized algorithms. How do you make them better? You introduce ideas like “memory” to get recurrent neural networks, and it’s no surprise that the research shows that these methods are more efficient on problems which have some memory relation. But the generality of neural networks leads me to my favorite example.

## Algorithm specificity for differential equations (okay, this goes into more depth!)

I spend my time developing new methods for (stochastic) differential equations and developing differential equation solving software. There are hundreds and hundreds of different algorithms for finding the function $u(t)$ whose derivative satisfies $u'(t) = f(t,u(t))$ where $f$ is the data given by the user starting to $t_0$ and ending at $t_f$ (this is the definition of an ODE BTW. If this is new to you, think I’m given a model $f$ which describes how things change and I want to compute where I will be in the future). And again, solving this problem efficiently is all about incorporating more information about the problem into the solver algorithm.

Just as a interesting note, you can solve a differential equation with a neural network, but it’s not good. I was hoping that the efficiency gained by using optimized GPU-libraries would be enough to overcome the efficiency lost by not specifying on the problem, but it wasn’t even close and the end result was that it was difficult for a deep neural network to solve non-stiff nonlinear problems like the Lotka-Volterra equations which standard differential equations software solve in microseconds. But what’s more interesting is why it failed. The neural network simply failed to understand the temporal dependencies of the problem. If you look at the failed solutions on the blog post, you see what happens is that the neural network doesn’t know to prioritize the early time points because, well, it’s obvious in a temporal model that if you are wrong now then you will be wrong in the future. But the neural network sees each time point as an unconnected matrix and from there the problem becomes much harder. We could try convolutional nets or recurrent nets, or bias the cost function itself, but this is still trying to get around “the information problem” which differential equations don’t have.

With a differential equation, we are modeling reality. In many cases, we know things about that reality. We might know that the trajectory of our rocket is smooth (it has all derivatives), and so we can develop high order algorithms which require 9 derivatives of the user’s $f$ and it’s not surprise that these Vern algorithms benchmark really well. But is it the best integrator? No, there’s no such thing. First of all, these methods are only for “non-stiff” ODEs, that is an ODE which has a single timescale. If there are multiple timescales, then one can show that these are “unstable” and require a very small time step. So what’s the best algorithm?

What I try to highlight in the post on differential equation libraries is that a large choice of methods is essential to actually being efficient. For example, many only have a few integrators, any in many cases this is simply multistep methods (because integrators like LSODE and VODE, provided in ancient Fortran code, have an option to change between stiff and non-stiff problems). But are these actually good methods? That’s not a good way to look at it. Instead, look at what these multistep methods mean. The form for the BDF2 method is:

$y_{n+2} - \frac{4}{3}y_{n+1} + \frac{1}{3}y_n = \frac{2}{3}hf(t_{n+2},y_{n+2})$

What are the advantages of this algorithm? Well, because it uses two past data points (assume you already know $y_n$ and $y_{n+1}$ and want $y_{n+2}$), you only have to solve an implicit function for the variable $y_{n+2}$. If the cost of evaluating the function is high (as is the case for example with large PDE discretizations), then you want to evaluate it less so this does well! Since this only has a single $f$ call per update, that’s doing quite well.

However, it made a trade off. In order to decrease the number of function evaluations, it requires more than one previous data point. It also requires that “nothing happened” between those two points. If you have some discontinuity for example, like modeling a ball when it bounces or modeling the amount of chemicals in a patient and you give a dose, then the previous data is invalid. Adaptive order algorithms like LSODE or CVODE go back to the more error prone backwards Euler method

$y_{n+1} - y_n = hf(t_{n+1},y_{n+1})$

to compute one small step (it has to be small to make the error low since the error is $\mathcal{O}(\Delta t)$), then uses that step as the previous step in a BDF2, then goes up to BDF3 etc., where each higher method requires more previous data and has a lower error order. But if you’re hitting events quite often (Pk/Pd simulations where a drug dose happens every few hours), notice that this is really really bad. Like, awful: you might as well just have had an implicit Euler scheme!

And not only that, the assumption that this method is efficient requires that $f$ is costly. Other methods, like Rosenbrock or (E)SDIRK methods, make the trade off to take more evaluations of $f$ to get less error, and in turn use this decreased error to take larger time step (and thus less steps overall). And so it’s not surprise that these methods benchmark better on small system and even “medium sized systems”, while the BDF method benchmarks as more efficient on larger systems with a more expensive function evaluation (the Rosenbrock methods are things like Rosenbrock23 and Rodas4, and the BDF method is CVODE_BDF). Again, this is nothing more than incorporating more details about the problem into the choice of solution method.

This is why having a wide variety of methods which incorporate different types of problem information is really what matters for efficiency. IMEX methods are a great example of this because instead of the user specifying a single ODE $u' = f(t,u)$, the user specifies two functions $u' = f_1(t,u) + f_2(t,u)$ where one of the functions is “stiff” (and thus should be solved implicitly by things like Rosenbrock or BDF methods) while the other is “non-stiff” (and thus should be solved by explicit methods like the Verner method). By specifying this split form, libraries like ARKODE or DifferentialEquations.jl can utilize algorithms which incorporate this information and thus will be more efficient than any of the previously mentioned methods on appropriately split problems. Another example are symplectic integrators which incorporate the mathematics of the underlying symplectic manifold into the solver itself, for problems which have a symplectic manifold. What kinds of problems lie on a symplectic manifold? Well, those arising from second order differential equations and physical Hamiltonians like N-body problems of astrodynamics and molecular dynamics. These methods, by utilizing a fundamental property of the problem specification, can noticeably reduce the amount of drift in the energy and angular momentum of the approximated solution and make the resulting simulations closer to reality. The changes an algorithm choice can make can be orders of magnitude. In the paper I showed earlier on methods for stochastic differential equations, incorporating the idea of time correlation and interpolation of Brownian motion (the Brownian bridge) to build an adaptive method resulted in an average of 100,000x less time steps to compute solutions to the SDE, so much that I couldn’t even benchmark how long it would take for algorithms which didn’t make use of this because they could not finish! (In my next paper, I am able to estimate that it would take almost 6 years to solve this problem vs the 22 seconds we’re at now). Choice of language, cache efficiency, etc. pale in comparison to proper algorithm choice.

## The Julia Programming Language

I hope I am getting across the point that the way to evaluate algorithms is by the information they use about the problem. This way of thinking about algorithms extends very far. I show in another blog post that the Julia programming language achieves C/Fortran levels of efficiency because it allows the LLVM compiler to have total type information about your program, as opposed to more dynamic languages like R/MATLAB/Python. Julia is specifically designed so that compilers can at compile-time compute as much type information as possible to essentially build the C code you would have written. When you understand the information that is being optimized on, these performance differences aren’t magical anymore: it’s just using more information, more specificity, and getting more performance out because of that.

In fact, I have to put a little blurb out here for the Julia programming language as well. Its multiple dispatch and abstract typing system is a very nice way to organize algorithms by their information. Let’s take how Julia’s linear algebra works. For a standard dense matrix, the backslash operator internally calls factorize(::AbstractMatrix) (and A_ldiv_B!, which is an inplace way of writing A\b) which I showed above is a polyalgorithm. But if you call factorize on some special matrix type, like a Diagonal or Tridiagonal matrix, it uses a specialization of factorize or A_ldiv_B! to perform the most efficient algorithm for that type of matrix. Then in user codes and packages, other people can define a matrix type MyMatrix <: AbstractMatrix and overload factorize or A_ldiv_B! to be efficient for their matrix type. Then in my algorithm like a differential equation solver which uses linear algebra under the hood, I can simply call factorize and Julia will then pick the most efficient form based on the matrix type that I currently have (even if that factorize method was defined in a package I do not depend on!). This means that my generic code (meaning type-insensitive code: I don't know the matrix types and just call factorize without caring what kind of matrix it is) can be made more efficient by a user specifying more problem information in the form of type information and dispatches. The result is that the type system with multiple dispatch doesn't just let the compilers get more information, but it also makes it easy for developers to pass along information and specializations that in tern customize other solvers. So the most efficient algorithm for your differential equation may be a Rosenbrock23 algorithm with a GMRES iterative solver that doesn't use the actual Jacobian because $J*x$ can be computed directly from a function call (this is quite common in PDEs), and you don’t need to re-write your own. Instead you can make my generic Rosenbrock23 algorithm compile a code which does that… even though it was never written to know what kind of method factorize should be. This kind of “opening up of the black box” and being able to optimize pieces of it via information injection for your specific problem is a powerful tool for optimizing numerical algorithms to large scale problems. Given this post I hope it’s easy to understand why I think this will (and already has started to) lead to a new generation of efficient implementations.

(Okay okay, you can do generic programming with C++ templates, but you can’t make me do it! Look at this “gentle introduction” and you’ll be in my boat asking for a high-level language designed to have that information in mind, and I’ll point you to Julia)

## Moral of the story

Getting your perfect caches lined up is small potatoes. BLAS and LAPACK are well-known super fast Fortran linear algebra libraries. They are extremely fast not just because they are bit twiddling with assembly (though they are), but they are super fast because they incorporate information about floating point precision and use that to their advantage. Neural networks, ODE solvers, etc. are all just tools for problems. The more specific they are to the problem you’re trying to solve, the more efficient they can be. So next time you’re looking to optimize some numerical codes, don’t ask “what can I do to make the computation more clean/efficient”, ask “what information about the problem is the solution method not using?” That’s where the big potatoes come from.

The post Algorithm efficiency comes from problem information appeared first on Stochastic Lifestyle.

# DifferentialEquations.jl 3.0 and a Roadmap for 4.0

I am pleased to announce the release of DifferentialEquations.jl 3.0. In the last DiffEq blog post I described the current state of JuliaDiffEq and DifferentialEquations.jl along with the trajectory that we hoped to take. We identified (at that time) current shortcomings of the software and our plans to remedy them. I also recently did a survey of differential equation suites in order to understand where we stand and see where we need to improve. These research efforts were used to put together a list of goals that were systematically achieved during 3.0. What I would like to do this time around is give a broad overview of what we have released in the 3.0 timeframe, the goals that we have achieved, and the goals that we are putting off (for next Google Summer of Code?). And then, more importantly, I want to set some milestones for the next version. If you want to dig into our new features and start using them, please see the documentation. If you want to read the release posts, see the official JuliaDiffEq blog

## A Quick Review of DifferentialEquations.jl Pre-3.0

In 1.0, we made every thing work with generic types and event handling. In all of the native Julia solvers you could use arbitrary arithmetic and use events to have the ODEs do crazy things like change size over time. This was about features. In 2.0, we expanded our capabilities to cover “most” of what users tend to need. A broad array of ordinary differential equation (ODE) solvers, a broad array of stochastic differential equation (SDE) solvers, delay differential equation (DDE) solvers, and some partial differential equation (PDE) solvers. We added addons for parameter estimation, sensitivity analysis, uncertainty quantification, etc. This was really exciting because it was the first set of differential equation solvers which had this range of applicability. What this did was make it possible to solve many different types of problems. You “could” solve the problems. There were some edge cases for sure, but the main areas where the vast majority of individuals were looking was hit. But there were two major remaining warts: stiff problems and PDEs.

## Introducing DifferentialEquations.jl 3.0

There was an issue. There are some specific types of problems, namely stiff differential equations, which require specific types of methods. We had wrappers to common C/Fortran solver for these, but this meant that we lost the type flexibility and event handling when solving these equations. We couldn’t handle some of the more difficult problems like state-dependent delays as well. Thus these types of problems were the focus of 3.0: to have some semblance of “completeness” or “coverage” with native Julia methods. The quick summary of DifferentialEquations.jl 3.0 is the following: for hard problems, we now have methods specifically suited for the problem. We have methods for stiff ODEs, SDEs, DDEs, etc. and these work with the differential-algebraic forms of each of these equations. We still need to round out the suite, but I am pleased to say that for hard problems which require special methods that can be difficult to implement, we do have options available for you. Let’s go into some details.

### Solvers for Stiff ODEs and DAEs

This is probably the area that will impact the most individuals. In DifferentialEquations.jl 3.0 we are happy to announce the release of a vast array of methods for solving stiff differential equations. While before we had wrappers for methods like CVODE from Sundials, LSODA, and radau from Hairer’s software, our offering here wasn’t too unique. However, we now have a wide array of high order methods for solving stiff ODEs. The centerpiece here are Rosenbrock methods and (E)SDIRK methods.

Rosenbrock methods are methods which are generally very good at lower tolerances. Hairer’s second book showed that high order Rosenbrock methods tend to be the most efficient methods when the required error is less than something around 5 digits. This is huge because this is the amount of accuracy many people want. Our new offerings of ODE solvers includes pretty much every Rosenbrock method that we could find that has been proposed in the literature. These methods have special interpolations so that sol(t) not only acts as a continuous function of the solution, but this continuous function is in some sense “stiffness-aware” and can with high accuracy produce the solution and its sharp turns between the solver’s steps. Being a “generic” Julia implementation, these all work with a wide array of Julia-defined number types, including high-precision arithmetic and (if the Jacobian is defined, see below) complex numbers. As far as we know, this is the first set of stiff ODE solvers with this flexibility. They all fully conform to the framework of DifferentialEquations.jl, meaning that they have event handling, the integrator interface, and all of the other extra goodies. They have adaptive timestepping with automatic initial dt calculations and all of the other features to make them a “fully automatic” solver. Using the mass matrices, these solvers can also handle DAEs.

In addition to the Rosenbrock methods we released a large set of (E)SDIRK methods. These methods have a specialized quasi-Newton solver for their implicit equations, making them highly efficient, especially in the case where there are times in which the Jacobian changes less quickly (it will skip factorizations when it determines that it can). As with the Rosenbrock methods, these are “fully automatic” and work with all of the event handling, generic numbers, etc.

So, how did we do? What do the benchmarks look like? DiffEqBenchmarks.jl is how we’ve been tracking some of our progress. In the benchmarks there we show that in the common test problems for stiff ODEs, these newest methods are the fastest we have available, even faster than CVODE from Sundials and radau from Hairer, in the “range of reasonable tolerances”, i.e. where the user wants an the error to be in the 9th digit or lower. I think that satisfies most uses cases and so we are pretty happy with the results. There are many other tests from users which report similar results that our new Rosenbrock methods and (E)SDIRK methods benchmark as the fastest for achieving the desired accuracy.

We are not surprised though: multistep methods like CVODE are specialized to decrease the number of function (f) calls which in turn is only useful when the system is sufficiently large. And fully implicit methods make use of larger linear systems to be efficient when more steps are required. Thus in the case where the system function is very costly or the number of ODEs is huge, Sundials will still be a good choice. Or if you need really high accuracy, radau will still be a good choice (note: these methods are still wrapped so you can keep using them). But other than these cases, we find our new methods to perform really well.

Let me mention a few caveats here which are left. One of them is complex numbers. Complex number handling is a little bit spotty in Julia packages right now. DifferentialEquations.jl fully supports them in each of the *DiffEq (OrdinaryDiffEq.jl, StochasticDiffEq.jl, DelayDiffEq.jl) solvers, along with all of the addons. However, where we run into issues is when interfacing with other packages. For example, ForwardDiff.jl and Calculus.jl cannot handle complex numbers. This becomes an issue only for the stiff solvers because the stiff solvers require the ability to calculate Jocobians, for which we use these packages. This is why I added the caveat “if you provide your own Jacobian”. But actually, we get the capability to numerically compute Jacobians with complex numbers by replacing Calculus.jl with DiffEqDiffTools.jl in DifferentialEquations.jl 3.0 (a little pre-mature, but the PR is just about done here). But in a week or so this will be a problem of the past, and we hope to integrate our tools into things like Optim.jl and NLsolve.jl so that more packages support complex numbers (you quantum physicists keep emailing me! ).

### Solvers for Stiff Delay Differential Equations

Our new methods in OrdinaryDiffEq.jl, the high-order Rosenbrock and (E)SDIRK methods, extend over to stiff delay differential equations. These are specialized so that way they can “re-use” step information to be more efficient than classic designs. We only know of one other available free stiff delay differential equation solver (Radar5), but since we couldn’t find out how to get it to work (it requires some really intricate compilation binding so I don’t think it can be wrapped) we don’t have anything to benchmark against. But from what we’ve seen it works well! Once again (as always), DiffEqBenchmarks.jl is the open resource for seeing how things are doing.

### Solvers for Stiff Stochastic Differential Equations

We wouldn’t be complete without saying that we also have methods for stiff stochastic differential equations. These are based on the SDIRK architecture of the ODE solvers and thus employ the same tricks to get efficiency. Not much more to say here.

### Solvers for Ordinary, Stochastic, and Delay Differential-Algebraic Equations

Many of our stiff solvers allow for defining a mass matrix. The mass matrix is allowed to be singular, in which case the stiff solver will solve a differential-algebraic version of the ODE, SDE, or DDE. As far as we know, this is the first available set of solvers for SDAEs and DDAEs.

### Solvers for for Second Order ODEs

Okay, you can convert first-order ODEs to second order ODEs and solve them like that. However, when doing so you don’t make use of the full structure of the second order equation, and thus you don’t get the full efficiency out. Runge-Kutta Nystrom methods are made directly for second order ODEs and we now have these methods implemented. In addition, in many cases one may want to solve an equation in a way that you know certain quantities are preserved over long-time integrations. These methods are known as symplectic integrators, and we have implemented a large array of symplectic integrators.

The format for these is what we call a “dynamical ODE”. The basic way to specify a DynamicalODEProblem is by specifying it as a second order ODE. However, we also allow one to directly specify the Hamiltonian for a physical system from which we use autodifferentiation to derive the equations of motion. Additionally, we allow a partitioned ODE form which allows one to specify the velocity component directly and thus allows for more advanced dynamics than a simple second order ODE. All of these problems can also directly be solved by first-order ODE solvers which will automatically do the conversion.

### Solvers for State-Dependent Delay Differential Equations

State-dependent delay differential equations are delay differential equations where the delay factor depends on the differential equation itself. For example, you can say something like, the amount of growth in the fish population is dependent on the population from a few weeks ago (since that’s when conception would have occurred), but when there’s more fish there’s a longer delay since the development process is slowed when resources are scarce. This means that the derivative of now is dependent on the derivative of the past, but how long in the past is dependent on the value right now!

Delay differential equations are complicated to solve because these delays propagate discontinuities. If you don’t properly handle the discontinuities then you will not achieve high accuracy. For constant-time delays you can know exactly where all of the discontinuities will be a priori, and thus you can have the solver hit exactly those points in time in order to not have issues. For state-dependent delays, the timepoints for the discontinuities are dependent on the solution itself, so you need the numerical solution in order to know how to handle the discontinuities!

If you ever step over a discontinuity, you will suffer from increased error. Or do you? Questioning this assumption gives you the residual control methods. These are adaptive methods which have a robust form of error estimation and thus try to detect discontinuities by stepping over them and seeing if the resulting error is high. This is the method that MATLAB’s ddesd uses, and thus we implemented this as well. However, shortly before doing so, I received an email from some numerical delay differential equation researchers who questioned the validity of this approach because they ran MATLAB’s ddesd on some test problems and found out that its error was quite high. Well, our resdiual control methods match this behavior: they don’t tend to get more than 3 digits of accuracy, but they are pretty fast. To be fair, Shampine’s paper on ddesd said that it was for getting plotting accuracy, and not necessarily scientific computing accuracy.

So that method handles one case, what about the high accuracy case? The JuliaDiffEq contributor David Widmann is who to thank for this. Using the event handling setup in the ODE solvers, we setup a system by which the solver would continuously track and detect discontinuities, and use this to pullback and hit discontinuities “exactly”. Testing against numerical solutions this method is able to get to full floating point accuracy. This method is also compatible with all ODE solvers via method of steps, and thus allows for using stiff solvers and differential-algebraic delay equations via mass matrices.

### Solvers for Boundary Value Problem (BVP) Solvers

This is a result from which you can thank Google Summer of Code. Boundary-value problems are extensions of ODEs which allow you to set conditions which the ODE must satisfy. Normally one thinks of the two-point boundary value problem where these conditions specify values that the ODE must be at the start and the end of the solution interval. We did create a method for two-point BVPs which mirrors that of bvp4c (though adaptivity is coming soon), but we generalized the allowed BVPs quite a bit. For many of the methods, you are able to specify conditions using the full solution and its interpolation. Thus one can make a “boundary condition” that the maximum of the second derivative over the full interval is 1. We honestly do not know of problems which utilize this full generality yet (though I do know of “multipoint BVP problems” which are of course a subset of this), so we’d like to hear if you end up using this for something crazy.

### Partial Differential Equation (PDE) Toolkits: Linear Operators and FEniCS

Oh PDEs. If you watch my JuliaCon workshop, you’ll see that the same two questions always come up: what about stiff solvers, and what about PDEs? I just told you about solvers for stiff differential equations for a wide variety of problems, and now lets address our PDE tools.

Early on in DifferentialEquations.jl I created a finite element toolbox to go along with the software. It was very basic, and I realized that approach will not scale, so instead what we decided to do was wrap the popular FEM library FEniCS. This was part of a Google Summer of Code project which created FEniCS.jl. You can read the blog post which introduces it and what’s cool is that the pieces that FEniCS created, the assembled operator equations, can be directly converted to sparse matrices in Julia which can be used to solve time-dependent PDEs in our ODE solvers (or time-independent problems can just solve the implicit equation using whatever linear solver you choose from Julia).

But not every problem needs finite element methods. To help with finite difference methods, another GSoC project developed DiffEqOperators.jl which makes it easy to discretize PDEs via finite difference methods. Essentially, you tell it the derivatives you want to discretize and it spits out lazy (matrix-free) linear operators which are fully multithreaded and perform the stencil. Once again, this makes it easy to define the discretized ODE system from the PDE and then solve it using the ODE solvers. We also include upwind operators for stable discretizations of hyperbolic PDEs.

As you can see, this is a toolbox for solving PDEs. People who have a little bit of prior knowledge in solving PDEs can easily use these tools to build a method that solves their specific PDE. However, what we plan to do next is to use this toolbox to make some pre-made solvers for some common PDEs like diffusion-advection equations. With this development it will really complete our PDE story.

## Conclusion: DifferentialEquations.jl 3.0 addresses the major concerns of the past

The main conclusion is this: people wanted methods for all sorts of solvers for stiff ordinary, stochastic, and delay differential equations, along with the differential-algebraic and partial differential variants, and they wanted these to work with generic numbers, event handling, all of the addons, etc. This announcement’s tl;dr is simply that we listened, and we released. Of course, we aren’t done (there’s always more to do), but what we can say is that it is highly likely that one of our offering will solve your differential equation well.

So what’s next? Well, we can always add some more methods which handle specific special cases better. That’s the goal of DifferentialEquations.jl 4.0 (and beyond). Here’s a look into what we have planned.

### Multistep Methods (Adams, BDF) and Implicit ODEs

Classic multistep method solvers like LSODE and CVODE are some of the most commonly used methods. In most cases they aren’t the most efficient: this is a fact noted in Hairer’s benchmarks and now in ours. However they have major upsides when the user’s function f is expensive, or when the system of ODEs is large. This is something that comes up in large PDE discretizations and is why these methods are central to solving large-scale PDEs. We have put this off because it is a niche area and we have pretty good wrappers to the classics like Sundials, LSODA, DASKR… but it is definitely time that we tackle these methods with a native Julia implementation.

One other thing to mention here is that multistep BDF methods are also what have traditionally gave rise to fully implicit DAE solvers like DASSL, DASPK, and IDA. This is an area where we have been lacking quite a bit in terms of native solver capabilities (mostly relying on wrappers), so we will need to spend some time here. We also need to build tooling for finding consistent initial conditions like is done in these solvers… we’re a ways off here. Modeling tools like Sims.jl and Modia.jl directly utilize IDA since we don’t offer anything of interest in this area, but hopefully we can develop some native Julia tooling here and help link it to these other packages so that we can expand the capabilities of not only us but modeling packages as well. If we can provide a better solving backend and they provide a great modeling front end, Julia will be the star of this field.

### Fully Implicit Runge-Kutta Methods (radau)

Fully implicit Runge-Kutta methods also have a niche. Some methods, like radau, are great for high-accuracy (low tolerance) solving of stiff equations. Others are high order symplectic methods for stiff differential equations. They also do really well with DAEs in mass matrix form. Now that we have the availability, these are definitely areas that we will tackle in the near future as well, not just in ODEs, but also in SDEs (and note that the ODE part builds DDE solvers for free as well).

### Exponential Integrators

Exponential integrators allow you to exploit linearity in the definition of an ODE, SDE, or DDE. There are two forms of interest: $u'=A(t)u$ for $A$ is a time-dependent linear operator, and $u'=Au + f(t,u)$ where $A$ is a time-indpendent linear operator. The first form shows up in a lot of quantum mechanics situations. The latter comes from discretizations of semilinear PDEs. Both of these can be solved with standard first-order ODE solvers, but the efficiency can be improved by using $A$ directly.

We have already made great strides in this direction. There are some solvers released for both types of equations, and we have developed an interface, the DiffEqOperator, for handling the definition of $A$ in a way the solvers can exploit. However, the crucial linear algebra tools were picked up by Marcelo Forets and implemented in ExpoKit.jl, and using their expmv and phimv implementations we can tackle the higher order methods. I wouldn’t expect this until the next summer since I see portions of this project as a great Google Summer of Code project, so if you’re interested please feel free to get in contact with us.

### Implicit-Explicit (IMEX) Methods

IMEX methods, where the user can split the function f into two portions so that way one part is explicit and one part is implicitly solved (i.e. a stiff and nonstiff part) has recently received lots of popularity for solving PDEs. We have most of the pieces for high order IMEX methods. The ESDIRK methods from Kennedy and Carpenter are the additive Runge-Kutta methods that are used in Sundials’ ARKODE IMEX solvers. We just haven’t added the explicit part.

But we have the architecture which allows the user to define IMEX methods. There’s also plenty of other IMEX methods which can be implemented as well. Since the architecture here is “already done”, it’s simply a matter of coding in a the inner loop for a few new methods. To me, this sounds like a great Google Summer of Code project as well, so we may be holding off on development here until the next summer.

### Finding Out Our Partial Differential Equation (PDE) Interface

I was happy to release (at least the beginnings of) PDE toolkits… but that satisfies a small group of people who know numerical methods for PDEs and want these pieces in order to write solvers more easily. In practice, many scientists probably don’t know how to (or don’t want to) do this (they are specialists in science! Not solving PDEs!). We need to provide something like MATLAB’s pdepe: simple interfaces for solving common PDEs. While one way we will build these solvers will be to use our toolbox and build method of lines integrators, we will need to make use of the distributed architecture of DifferentialEquations.jl in order to get good coverage of the PDE landscape. Indeed, I know of some individuals like John Gibson who are building spectral PDE solvers and really seem to know what they are doing, and so it would be a shame if we didn’t have a way to allow users to directly interface with these tools (it’ll also be a great way for methods researchers to easily benchmark, hopefully pulling an even larger community of developers in).

The answer will be some kind of problem hierarchy where the user defines something like a DiffusionAdvectionProblem where it contains a bunch of functions and there is a common interpretation of what the problem definition is and how solvers should treat it, and then they should all use that to spit out a similar solution. We have the pieces of how that can work in our (fantastically outdated) FEM Heat and Poisson methods, but the issues are finding out how to do things like boundary conditions in a way that we can make the most out of every’s PDE solver while not complicating the common interface. Once we have solvers all wrapped together, there will still be a nightmare: how do you document this? If we have a different set of four packages available for 5 different PDEs, the combinatoric explosion in documenting the nuances means we can’t add all of this to our current documentation (which is already huge). So we may need a “different section” of the docs somehow? To me it’s very unclear how we will document this well, so my plan is to just start adding the functionality and find out how to document it as we go along. It will probably need to be re-written a few times before it’s any good, so please bear with us!

### High Order Adaptive Solvers for Stiff SDEs

This is actually one of my recent research projects. I have new methods for high order adaptive solvers for stiff stochastic differential equations, and will be submitting the publication soon. When this is published, the associated methods will be released and we will have high order adaptive methods for stiff SDEs. So stay tuned!

### Flesh out the BVP solvers

Our Shooting methods are very flexible, but they will never do well on problems which are sensitive to the initial condition. We need to instead make our MIRK methods get all of the bells and whistles: continuous extension, adaptivity, etc., and we need to wrap some of the classic Netlib solvers into the same interface so we can start to do some comprehensive benchmarking.

### Parallel-in-time ODE Solvers

During the last Google Summer of Code we took a stab at developing some parallel-in-time ODE solvers using Neural Networks. While the student did a great job at trying a bunch of different strategies, we have come to realize that neural networks simply do not “know” enough about the structure of the differential equations in order to be efficient. When talking to a friend about PDE solvers, he noted that he tested the efficiency of TensorFlow’s neural net PDE solver (which it has in a tutorial) against more standard methods and noted similar issues with efficiency. So we tried, and it was definitely a very interesting research project, but this direction didn’t yield the results that we hoped.

Thus instead I am hoping to take a step over to different methods. Parallel-in-time integration methods like parareal integration and the XBraid software do exist, and so I plan on taking a stab on adding these to the repertoire of DiffEq. Note that for large and expensive ODEs, SDEs, and DDEs, one can already parallelize the calculation of f using the current tooling. Thus these methods are for when you want to solve problems where the set of ODEs is quite small, yet you need to solve over a large timespan and have a lot of parallel computing power available. So once again, there’s no rush as this is quite a niche, but we plan to get to it in the next release cycle. In fact, I hear it’s so niche that I was told in an email that parareal is only good for long time integration when you have >128 cores available… let’s make an open-source implementation and try it out ourselves.

### Automatic Stiffness Detection and Switching

What’s a stiff differential equation? That can be hard to explain and predict. For this reason, many people like the idea of automatic stiffness detection and allowing the method to automatically switch between solvers to handle the different types of equations more effectively without user input. We have all of the tooling for building this, and the user can actually specify switching strategies themselves using our CompositeAlgorithm setup, but in the next release cycle I hope to release some methods which have this built in, akin to something like LSODA. Note that my newest paper on SDE methods includes stiffness detection as well, so in one fell swoop we plan to add stiffness detection and switching for ODEs, SDEs, and DDEs (and of course differential-algebraic version via mass matrices, and …, you get the picture about how all of this stuff is composed together!)

This is actually at a point where it would not be too difficult to do but would take a good chunk of time and also take some research time, so I am putting it off and hoping this can be a Google Summer of Code project or another project with a student.

### Efficiency Improvements for Cheap DEs

One final wart in the DiffEq architecture. Due to issues with Julia, it seems that if an ODE is “sufficiency cheap”, i.e. takes less than a millisecond to solve, our setup has some inefficiencies. This is from how we do the setup of the integrator and limitations Julia has on type inference. This issue is constant, meaning that for more expensive ODEs/SDEs/DDEs it’s still just a millisecond. However, it’s annoying and we hope to remedy this. In reality, we haven’t seen this affect real-world benchmarks since < millisecond ODEs are not necessarily where people are looking at performance, but it does start to become an issue when attempting to do something like parameter estimation on cheap ODEs since this involves solving the same ODE thousands of times. Part of the issue is due to the use of keyword arguments whose performance will be fixed in Julia v0.7. For the inference issues, we hope to get the right updates into an early Julia v1.x, along with making a few structural changes, and then this issue will go away. Other enhancements along this line will be a reinit interface to make it easier to reuse internal caches and cleaning up the solution type pre-allocation interface.

## Conclusion

At this point, we are very satisfied with our offering. The 30 people who make up the JuliaDiffEq team have really built a software which has the methods to solve “most” differential equations that users encounter, and also do so efficiently. In the coming months we hope to add extra methods for specific and important niches to our offerings and fill in some holes. But together, I think we have a pretty solid offering, and everything else is (important) icing on the cake.

The post DifferentialEquations.jl 3.0 and a Roadmap for 4.0 appeared first on Stochastic Lifestyle.

# I Like Julia Because It Scales and Is Productive: Some Insights From A Julia Developer

In this post I would like to reflect a bit on Julia. These are my personal views and I have had more than a year developing a lot of packages for the Julia programming language. After roaming around many different languages including R, MATLAB, C, and Python; Julia is finally a language I am sticking to. In this post I would like to explain why. I want to go back through some thoughts about what the current state of the language is, who it’s good for, and what changes I would like to see. My opinions changed a lot since first starting to work on Julia, so I’d just like to share the changed mindset one has after using the language deeply.

## Quick Summary

Here’s a quick summary of my views.

1. Julia is not only a fast language, but what makes it unique is how predictable the performance and the compilation process is.
2. The language gives you lots of introspection tools to be able to easily isolate issues.
3. The opt-in type checking and allowing many different architectures to be fast is a strong bonus for software development, especially when scaling to large software ecosystems, over other scripting languages since it allows you to write simple and self-documenting code.
4. Julia’s unique features make it easy to make packages which are type-generic and parallel.
5. Most of these benefits will be seen by package developers. “Users” will probably not see as much of a difference in their own codes because the majority of their performance will be determined by the packages they use.
6. To attract a new wave of users, Julia needs to start taking a “package-first” mentality and push package-level unique features rather than language-level features. Language level is what developers care about, but the majority of programmers are not developers.
7. We have all of the basics in Julia, but we need to start showing off (and working towards) how we can be different. Every package should be picking some special features and types to support. Speed is just one feature.
8. Julia can have a bright future, but we may need to start advertising and teaching it differently.

I think that to start, I need to discuss a topic which might be new to newcomers.

## Julia’s JiT is not like other JiTs, and it helps package development

LuaJIT is a runtime that just-in-time compiles Lua, so is it the same thing or similar to Julia? What about MATLAB’s JIT? The answer is no, Julia’s compilation strategy is very different. Other JIT setups for scripting languages use something that’s known as a tracing JIT. What these do is they track the commands you are running and then via some algorithm (possibly probabilistic in some setups like LuaJIT) it determines what parts of the code are repetitively ran enough that they warrant compilation, and then code for those specific sets of commands are compiled and when the parser hits those areas again it runs the JIT code.

Julia on the other hand is almost static. When running any Julia function, it will first take your function calls and then auto-specialize it down to concrete types. So f(x::Number)=x^2 will specialize to Float64 when you call it with f(1.0). Using this type information, it has a compile-time stage where it propagates all type information as far as it can go. So if you internally do x^2, it will replace that with ^(::Float64,2) (the 2 is a compile-time constant), and then since it can determine the types of everything, it will push this all the way down to generate clean LLVM IR (and thus clean assembly code) for x^2 on Float64s. It will cache this compiled version of the function so that way f(2.0) uses the same compiled code (notice that the compilation process only needs the type and not the runtime values!). However, when you then call it with f(1), it compiles a separate version for Int.

In some sense, this is very different. The obvious difference is that Julia’s compilation process is deterministic and is open to introspection. If you do

@code_warntype f(1.0)
@code_llvm f(1.0)

Julia will spit back at you its internal AST of the method specialized on Float64, and then the LLVM IR. The nice thing about this is that you can check every step of the process. I never participated in programming language development conversations before working with Julia, but this makes it easy. You see this kick something out, you find out yourself how to optimize it, and then you can easily ask someone “why doesn’t the compiler optimize xyz away here?” and then that leads to code changes (i.e. “oh, array aliasing is preventing compiler optimizations!”), new work on compiler optimizations, and better performance. A recent place where just happened was in this Discourse discussion where a user points out that a separate faster LLVM instruction could be used than the one that is actually omitted.

But what I really like about this process is that it’s clear and deterministic. I can know exactly what my packages are doing and why. This makes it very easy to optimize packages because you always know what’s going on. When trying to piece together 50+ packages like DifferentialEquations.jl, deterministic rules for how to optimize code and easy means of introspection are essential to know what’s going on. This kind of transparency is something that’s really easy to appreciate after working on something like MATLAB where the internals are opaque.

There is a tradeoff though. In order for this process to work well it requires being able to propagate type information. This requires that the types of the variables inside of a function are what’s known as “type-stable” or “type-inferrable”. For example,

function g(a,n)
x = 0
for i in 1:n
x += a
end
end

will add a n times. If you call this with g(1.0,10), you’ll notice this is not as fast as you’d hope. The reason is because the type of x starts as a Int, and then Int+Float64 produces a Float64 and so x changes types. Think about writing a C-code for doing this: you can’t just have a variable change types like that? So what Julia does is compile x in a way such that it is dynamically typed, and the lost assumptions associated with doing this is why the performance drops down to that of Python/MATLAB/R since now it has the same amount of dynamicness as a real scripting language: it’s all just a game of how much information is known by the compiler. So the way to think about it is, if your code can be type-stable, Julia will take full advantage of this and make a completely static function and compile that, otherwise it will make the necessary parts dynamic in order to handle the extra features. Together, this is a pretty nifty way to mix ease of use and performance. At any time, you can @code_warntype your function call and it will explicitly highlight which variables are “being dynamic”, and that tells you exactly what to optimize. For example,

Variables:
#self#::#g
a::Float64
n::Int64
i::Int64
#temp#@_5::Int64
x::UNION{FLOAT64, INT64}
#temp#@_7::Core.MethodInstance
#temp#@_8::Float64

Body:
begin
x::UNION{FLOAT64, INT64} = 0 # line 3:
SSAValue(2) = (Base.select_value)((Base.sle_int)(1, n::Int64)::Bool, n::Int64, (Base.sub_int)(1, 1)::Int64)::Int64
#temp#@_5::Int64 = 1
5:
unless (Base.not_int)((#temp#@_5::Int64 === (Base.add_int)(SSAValue(2), 1)::Int64)::Bool)::Bool goto 29
SSAValue(3) = #temp#@_5::Int64
#temp#@_5::Int64 = SSAValue(4) # line 4:
unless (x::UNION{FLOAT64, INT64} isa Int64)::Bool goto 14
#temp#@_7::Core.MethodInstance = MethodInstance for +(::Int64, ::Float64)
goto 23
14:
unless (x::UNION{FLOAT64, INT64} isa Float64)::Bool goto 18
#temp#@_7::Core.MethodInstance = MethodInstance for +(::Float64, ::Float64)
goto 23
18:
goto 20
20:
#temp#@_8::Float64 = (x::UNION{FLOAT64, INT64} + a::Float64)::Float64
goto 25
23:
#temp#@_8::Float64 = \$(Expr(:invoke, :(#temp#@_7), :(Main.+), :(x), :(a)))
25:
x::UNION{FLOAT64, INT64} = #temp#@_8::Float64
27:
goto 5
29:
return
end::Void

This is spitting back and telling me that x is typed as both Float64 and Int64 which is exactly the problem I identified. But have no fear, Julia is based around typing, so it has ways to handle this. For example, zero(a) gives you a 0 in the type of a, so

function g(a,n)
x = zero(a)
for i in 1:n
x += a
end
end

is a good generic type-stable function for any number which has zero and + defined. That’s quite unique: I can put floating point numbers in here, symbolic expressions from SymEngine.jl, ArbFloats from the user-defined ArbFloats.jl fast arbitrary precision library, etc. Since Julia always auto-specializes on the types, no matter what I give it, it will know how to replace zero with whatever it needs to and do so at compile-time since at compile-time there’s enough type information to know what that constant should be, and then it will replace the + with calls to the correct version of +. Thus there is no runtime overhead for handling genericness. All you have to do is remember to write your code in a manner such that the types match what comes in (other helpful functions for this include typeof, eltype, and similar). So Julia is not only built to make generic functions performant, but it also has the right tools for handling types to make this actually easy to do.

Sure, Numba uses a similar strategy as Julia. If you stick to only using Float64s and a few other basic types (Float32, and some integer types?) you’ll get something very similar to what’s described here. But I’ve described before that making robust generic software with this type-inference approach requires a strong understanding and use of the type system, and I have never found Python+Numba close to matching Julia in its ability to let the user directly handle types, and it’s this combination of the type system + the compilation strategy that makes Julia code scale well. This difference may not be appreciated in one-off scripts, but when building and maintaining robust software I have found it necessary.

But lastly, using this strategy Julia actually can produce static binaries like compiled C or Fortran code, so I think this fact will come in handy for making Julia into a package development language. Basically, you can write packages in Julia, and someday it will be easy to generate a runtime-free binary (like C/Fortran) others can link to. That leads me to my next point.

In some ways, I see Julia actually as a more productive C++ instead of a faster Python. At least as a package developer I tend to use it like that, and a lot of its features essentially give you full generic template programming that you’d attempt in C++, just with a lot less code. I think that for many developers, this is a more appropriate way to consider Julia. But that’s not Julia’s whole audience, which leads to my next point.

## You Will See The Most Benefits in Package Development

How much does this language choice actually effect real use cases? Well, I think it depends on who you are. One thing that you’ll notice is that, for the vast majority of users, the performance of your scripts is almost entirely determined by the speed of your packages. This is true in pretty much any scientific computing application that I know of, in Julia, R, MATLAB, Python, etc. If 90% of your time is spent calling functions from DataFrames or JuMP, then the speed of your analysis code really doesn’t impact the final runtime all that much.

So in some sense I don’t believe that Julia’s performance will directly effect most users of the language (and I think that this is true for R/Python/MATLAB as well). However, where it really manifests itself is indirectly through package development productivity. Simply put, a Python package written in pure Python is slow, like really really slow. Same with R and MATLAB. This impacts the development time and resources since all of the major projects like SciPy, NumPy, and Pandas are almost entirely C++ code underneath. And this also impacts features. It’s really hard to make traditional languages handle things like arbitrary precision arithmetic well, and so you’ll see that most functionality in things like SciPy do not support all of the glory that Python allows since it’s not really Python code! Then Python packages are built on SciPy and its limitations, and that just further propagates the impossibility of extending the algorithms to something more generic. Python in fact has so many issues here that projects have had to internally develop their own JITs in order to get the performance, features, and syntax they want. JitCODE and PyDSTool are two ODE solvers which developed DSLs and ways to compile their code. One nice example is this video on PyTorch where the developer describes the tracing JIT they built into their package. I seriously commend these individuals for building such amazing systems, but dear god I do not want to spend my own development time building a compilation scheme for each new scientific project I am involved in! To me, these are language level issues and should be addressed by the language. What the PyTorch developers had to build into their package is what Julia gives you package developers for free, and as a scientist who doesn’t want to write compilers, I am super satisfied.

In Julia, packages can be written in Julia and be performant. As demonstrated above and in this post, they can also be type-agnostic or type-generic and still be performant. This means that it’s much easier/quicker to develop “good” libraries in Julia, and that it’s easy for any of these libraries to support just about any type. An example for this is GenericSVD.jl which just implements an SVD-factorization in Julia, and this compiles to fast code for things which are not supported by common Fortran bindings like BigFloats, ArbFloats, complex numbers, etc. and the code is simple. It’s a great simple package which likely didn’t take much development time but is already something very unique due to its type-genericness and speed.

This makes me think that most users should look at Julia a little differently. To me, Julia is a language which is designed to make it easy to roll your own package and have them be performant. If you want to develop libraries, then Julia is great for you and the benchmarks (among other things like multiple dispatch and type-genericness) show why you should choose Julia over Python/R/MATLAB/etc. To me it’s almost a no-brainer to choose Julia for your next methods project or package. But if you’re a user who uses libraries, you cannot expect a blanket performance difference from your previous language “just” because it’s Julia. For example, Pandas is in Python, but it’s fast because it’s written in C++. Just because you’re switching to Julia you shouldn’t expect functions on data table objects to be faster (in fact, they might not be since Pandas is pretty well-optimized. One place where the package ecosystem is currently not as well-optimized is plotting, where the plotting libraries need some restructuring to improve performance.). What you can expect is that the development of the associated Julia libraries is more transparent (it’s in Julia) and generally performant as well. But also, all of the tools you’d use to help out your own scripts make it easy to find out what’s wrong in a library and contribute some code to speed it up. Going from user to contributor and actually helping package development is super trivial and probably the most underappreciated “feature” (or more, side effect of the design) of Julia.

But given the market for who actually chooses Julia, Julia packages do tend to be fast in the end Julia is really top-of-the-class in many disciplines, though that is not true all around. But the fact that development is fast and easy to write scalable packages in means that you can find a lot of odd unique packages: lots of small numerical linear algebra or new optimization algorithm packages which run performantly and just don’t exist anywhere else make up a good chunk of what I find in Julia. To top it off, Julia makes it easy to setup continuous integration in Windows, Mac, and Linux to get your package off the ground and well-tested. Here’s a video that goes through all of that. Again, package-development level features.

## Fast Multiparadigms Makes Maintenance Easier

The big thing which has helped me scale packages is that Julia doesn’t care how you program. Looping, vectorization, functional styles with lots of recursion, etc.: it doesn’t care. You can just choose the form that is natural for your problem, write a code which uses that style, and know it’s going to be performant. This is very different than the R/MATLAB/Python mentality of “vectorize everything!”, even if it’s not natural to vectorize. I remember doing some very “fun” hacks in MATLAB and Python specifically in order to make good use of vectorized functions, and having to document inside of comments a step-by-step explanation of what the hell the code is actually doing. Many of those times, a 3 line loop which is almost copy-pasted from the paper’s psudocode would’ve done the trick, but in the name of performance I mangled the code. Julia doesn’t require this, and I have very much appreciated this when reviewing code from others.

This is probably the biggest difference. If you have appropriately vectorized code in MATLAB/Python/R, the performance difference from Julia isn’t actually too big. There’s still a difference because vectorization is particularly wasteful with memory and isn’t actually optimal in most cases, but it’s within an order of magnitude. However, you are required to program in a specific way in these other languages to get here, and if you’re doing anything more than scripting you’ve probably found that this doesn’t scale because, well, sometimes code gets mangled when you force vectorization!

## There is Beauty in Using the Full Language

Sure, Python isn’t too bad if for performance you stick to arrays of 64-bit numbers as allowed in NumPy arrays, and you use pre-written functions (C-bindings) in SciPy. But what if you really want to create your own numbers as some object, and you want to build data structures using objects in order to more easily scale your software? This is where this setup begins to take its toll. When saying that a small part of the language is required to be used in order to get good performance, you get pidgeon-holed away from the most intuitive code and have to move towards big arrays of numbers. Instead of “naming” separate arrays, items are thrown into matrices with comments says “1:3 is chemical A, 4:7 is chemical B, …”, which I did not find scales very well.

But in Julia, using types is actually performant. Structs in Julia are value-types, which means that they create structures which inline the type-values instead of using indirection via pointers. This means that a Complex{Float64}, which is a struct of two values (the real and imaginary parts), is in memory as a 128-bit chunk with two numbers. This is as performant as if you implemented some kind of intrinsic type in C. You can use this without worry. And mutable structs, while including pointers, are much more performant than objects in most scripting languages because functions auto-specialize on them. The result is that you can make use of all of the language when building software that needs to be performant, once again making it easier to scale projects.

## GPUs and Parallelism is Straightforward and Generic

Serial performance only gets you so far. Scaling up your project means two things: scaling up the size and features of the codebase, and scaling up the computing power. I discussed how Julia excels in the first part, but what about the second? What I have found really great about Julia is that its parallelization is dead simple. Add @threads in front of a loop and done, its multithreaded. @parallel and pmap do distributed parallelism via multiprocessing, and I have shown how in 5 minutes you can parallelize your code across a whole cluster. Explicitly writing GPU kernels and using multiple GPUs is easy.

I cannot forget one of my favorite developments as of late: GPUArrays.jl. Essentially, when you write vectorized (broadcasted) code with a GPUArray, it automatically parallelizes on the GPU. Neat! But this means that any generic function written in this style is already compatible to automatically compile a GPU version. So things like OrdinaryDiffEq.jl (the ODE solvers of DifferentialEquations.jl), even though they don’t have a single piece of GPU-specific code in them, compile performant versions of their solvers for GPUs, automatically, because Julia feels like being nice. Like seriously, once you get a strong understanding about how types and dispatching works, Julia is absolutely mindblowing.

## Summary: Julia Is Great For Package Development, But Users Just See Packages

And this leads me to my conclusion. In each of the sections above I note why Julia is great for building packages in. In comparison to the other scripting languages, I find nothing comes close in terms of productivity, scalability, resulting performance, and resulting features. No other languages makes it so easy to make a function which is performant yet doesn’t care what number types you use! And being allowed to use the whole language “correctly” means that your code is much easier to understand and grow. If you’re looking to publish a package along with your algorithm, Julia is definitely the right place to be. In that sense, this group will see Julia as an easier or more productive C++.

But for end users throwing together a 100 line script for a data analysis? I don’t think that this crowd will actually see as much of a difference between other scripting languages if the packages in the other languages they are using are sufficiently performant (this isn’t always true, but let’s assume it is). To people who aren’t “pros” in the language, it will probably look like it just has a different syntax. It will be a little faster than vectorized code in other languages if code is in type-stable functions, but most of the differences a user will notice will come from the mixture of features and performance of packages. Because of this, I am not sure if marketing the features of the language is actually the best way to approach the general audience. The general audience will be convinced Julia is worthwhile only by the package offering.

I want to end though with the fact that, since Julia packages are written in Julia, a Julia user is qualified to write, debug, and contribute to packages. I myself never saw myself becoming a package dev until about a year ago, and this transition only was because Julia makes the change so easy (it wasn’t any different than the Julia development I was already doing!). While this is a good highlight to people who would read a blog post about programming languages, I still think this is a small niche when considering the average programmer. I don’t think that the average programmer sees this as an upside. Most don’t have the time to invest in this kind of development, and see that push that “you can do it all yourself!” as a turn-off (even though it’s “can” instead of “have to”!). It’s a very different crowd to be catered to.

## My Recommendations

### Package Accessibility and Discoverability

So in the end, I see Julia as in a state of transition. The way it was marketed before in terms of performance benchmarks, the type-system, etc. are all things which appeal to package developers, and Julia already has had great adoption by developers. But now Julia needs to start targeting general audiences by sharing its packages. There are a few things in the Base Julia language like adding noalias scopes that I would like to see, but pretty much everything (other than a few compiler optimizations) I put as “not essential” now. What I see as essential is making packages more accessible.

As someone well-aware of the packages which are available, I can tell you that “lack of packages” isn’t really a problem with the ecosystem: you can find a great package that does what you’re looking for. In fact, this weird idea that Julia doesn’t have packages yet leads to some silly chatroom discussions where a newcomer joins the Gitter channel and asks “I wanna contribute to the language… has x been done yet?”, give back 10 example, “y?”, give back 5 examples, “z?”, 5 examples, “oh, I’m surprised how much is already done!”.

This chat happens quite often, which is fine, but it points to a problem. We really need to work more on “discoverability” and “noob-friendly docs”, both items which I’m not entirely sure what the easiest way to handle are. But, I think that this is what is required for Julia to grow (word choice is on purpose: grow instead of succeed, because Julia to me is clearly already successful enough to sustain a development community and thus invest your own time in it, though the community could grow much more). If we think of newcomers as coming in waves: wave 1 brought in language devs and was extremely successful (as evidenced by the over 500 committers to just the Base language, more than projects like Cython has ever had!), wave 2 brought in package devs and has been very successful as well, but now we need to gear up for wave 3: the package users. This means that package discoverability is what will bring in the third wave.

### Package Uniqueness

Julia packages should spend more time on their unique parts. Everyone has optimization packages, and people in Python are using things written in C/C++/Fortran like NLopt and IPOPT and so the performance difference is essentially nil in these well-developed cases. But a native Julia package can have other advantages as well. Working out of the box with complex and arbitrary precision numbers, using autodifferentiation, being compatible with DistributedArrays and GPUArrays to allow these forms of parallelism without having to transfer back to the CPU for the optimizer’s parts. These features are huge for large classes of researchers (I know that quantum physicists want better complex number support everywhere!) and this is where Julia’s packages shine: genericness. Supporting genericness and parallelism should be front and center with every big Julia package, and it should have unique examples to show it off.

We already have very strong libraries in quite a few domains of scientific computing. Here’s what Julia sounds like when, instead of saying “Julia can do ______”, you say some unique things about the libraries:

1. Constrainted optimization and mathematical programming. JuMP‘s DSL automatically defines Jacobians and Hessians via autodifferentiation. Its solver independence makes it easy to switch between libraries to choose the most efficient one for the problem.
2. Numerical linear algebra. Not only does th standard library expose more of BLAS and LAPACK via special matrix types than other languages, there are many special-purpose libraries for linear solving. IterativeSolvers.jl allows you to use any LinearMap, meaning that by passing a lazy LinearMap it automatically works as matrix-free preconditioned GMRES etc., and there are many (and a growing number) of methods to choose from. This library also supported complex and arbitrary precision numbers. In case you were wondering, there does exist PETSc.jl is a binding to the infamous PETSc library, and it is compatible with MPI.
3. Differential equations. I’ve already discussed the feature set of DifferentialEquations.jl.
4. Autodifferentiation. With ForwardDiff.jl and its operator overloading approach, pretty much any Julia code can use forward-mode autodifferentiation without any modifications, including the Base library. This means there’s almost no reason to use numerical differentiation of any pure-Julia code! Reverse-mode autodifferentiation exists as well, but I’ll leave a note pointing to the future Cassette.jl.

I’m not an image processing guy but Images.jl is a large library developed by Tim Holy which must have some unique points. A major contributor to Julia is Steven Johnson, the creator of FFTW (and NLopt which has a Julia binding), the widely used FFT library, is a big Julia contributor and there has been discussions about building a generic FFT library in Julia. In addition, many applied mathematicians are pointing to mixed-precision algorithms as a big possibility for increasing the performance of scientific applications, and Julia’s strong control over types and specialization on types is perfect for handling such algorithms.

The main thing is that, if we only talk about “Python does this, does Julia do _____?”, then we can only talk about speed. We have lots of unique developments already, so we should change the conversation to highlight the things that packages in Julia can do that other packages cannot.

### Package Distributions and Branding

Another thing I think we should be doing is bundling together packages as distributions. We somewhat do this with the organizations (JuliaStats, JuliaOpt, etc.), but it’s not the same as having a single cohesive documentation. People know and respect SciPy. It’s a hodgepodge of different things, but you know who’s going to look at your bug reports and you’ve other parts of SciPy and that’s why over time you respect it. It’s very hard to get that kind of respect for a lone 5 star github package. I think cohesive documentation for orgs and metapackages, much like I have done with DifferentialEquations.jl is required in order to get big “mainstream” packages that users can know and trust.

### “User”-focused Tutorials

I think Julia can become big since it gives package developers so many tools to make big performant package ecosystems with ease. But this has made most people in the community focused on talking to other “developers”. We should reach out to “users” with simple examples and tutorials, and understand that most people don’t want to contribute to a package or anything like that, but really just want to add a package and use a function to spit out a plot as fast as possible. For the next wave of Julia users, we should show should how the package ecosystem enables this kind of usage, and that’s how Julia will grow.

I believe that we should target teaching resources directly to them, similar to what’s seen in other languages. I see workshops for “learn how to do regression in R!”, “learn how to build websites with Shiny!”, “learn how to use Pandas!”, but for Julia I only seem to see “let’s learn Julia in depth: how to write fast code and the type system”. We should instead run workshops directly on Optim, JuMP, DifferentialEquations, etc. at various universities where we are already “teaching Julia”, and have it setup as “direct to skill” for specific disciplines instead of teaching fancy language-level features. Although it’s hard because I too am a purist in “learn the language and you can do it all!”, I think we need to reach out more to those who only want to do a very specific thing, and train them in Julia for psychology research, Julia for climate models, etc. And honestly, I can’t think off of the top of my head a good tutorial that says “here’s how to do scientific computing in Julia”, and works through some specific issues and skills that piece together a few important libraries. We need to write some resources along these lines.

At least, that’s what I think. Feel free to agree/disagree in the comments below.