Tag Archives: gpu

GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/gpu-accelerated-ode-solving-in-r-with-julia-the-language-of-libraries/

R is a widely used language for data science, but due to performance most of its underlying library are written in C, C++, or Fortran. Julia is a relative newcomer to the field which has busted out since its 1.0 to become one of the top 20 most used languages due to its high performance libraries for scientific computing and machine learning. Julia’a value proposition has been its high performance in high level language, known as solving the two language problem, which has allowed allowed the language to build a robust, mature, and expansive package ecosystem. While this has been a major strength for package developers, the fact remains that there are still large and robust communities in other high level languages like R and Python. Instead of spawning distracting language wars, we should ask the question: can Julia become a language of libraries to accelerate these other languages as well?

This is definitely not the first time this question was asked. The statistics libraries in Julia were developed by individuals like Douglas Bates who built some of R’s most widely used packages like lme4, Bioconductor, and Matrix. Doug had written a blog post in 2018 showing how to get top notch performance in linear mixed effects model fitting via JuliaCall. In 2018 the JuliaDiffEq organization had written a blog post demonstrating the use of DifferentialEquations.jl in R and Python (the Jupyter of Diffrential Equations). Now rebranded as SciML for Scientific Machine Learning, we looked to expand our mission and bring automated model discovery and acceleration include other languages like R and Python with Julia as the base.

With the release of diffeqr v1.0, we can now demonstrate many advances in R through the connection to Julia. Specifically, I would like to use this blog post to showcase:

  1. The new direct wrapping interface of diffeqr
  2. JIT compilation and symbolic analysis of ODEs and SDEs in R using Julia and ModelingToolkit.jl
  3. GPU-accelerated simulations of ensembles using Julia’s DiffEqGPU.jl

Together we will demonstrate how models in R can be accelerated by 1000x without a user ever having to write anything but R.

A Quick Note Before Continuing

Before continuing on with showing all of the features, I wanted to ask for support so that we can continue developing these bridged libraries. Specifically, I would like to be able to support developers interested in providing a fully automated Julia installation and static compilation so that calling into Julia libraries is just as easy as any Rcpp library. To show support, the easiest thing to do is to star our libraries. The work of this blog post is build on DifferentialEquations.jl, diffeqr, ModelingToolkit.jl, and DiffEqGPU.jl. Thank you for your patience and now back to our regularly scheduled program.

diffeqr v1.0: Direct wrappers for Differential Equation Solving in R

First let me start with the new direct wrappers of differential equations solvers in R. In the previous iterations of diffeqr, we had relied on specifically designed high level functions, like “ode_solve”, to compensate for the fact that one could not directly use Julia’s original DifferentialEquations.jl interface directly from R. However, the new diffeqr v1.0 directly exposes the entirety of the Julia library in an easy to use framework.

To demonstrate this, let’s see how to define the Lorenz ODE with diffeqr. In Julia’s DifferentialEquations.jl, we would start by defining an “ODEProblem” that contains the initial condition u0, the time span, the parameters, and the f in terms of `u’ = f(u,p,t)` that defines the derivative. In Julia, this would look like:

using DifferentialEquations
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
sol = solve(prob,saveat=1.0)

With the new diffeqr, diffeq_setup() is a function that does a few things:

  1. It instantiates a Julia process to utilize as its underlying compute engine
  2. It first checks if the correct Julia libraries are installed and, if not, it installs them for you
  3. Then it exposes all of the functions of DifferentialEquations.jl into its object

What this means is that the following is the complete diffeqr v1.0 code for solving the Lorenz equation is:

library(diffeqr)
de <- diffeqr::diffeq_setup()
f <- function(u,p,t) {
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  return(c(du1,du2,du3))
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(f, u0, tspan, p)
sol <- de$solve(prob,saveat=1.0)

This then carries on through SDEs, DDEs, DAEs, and more. Through this direct exposing form, the whole library of DifferentialEquations.jl is at the finger tips of any R user, making it a truly cross-language platform.

(Note that the only caveat is that diffeq_setup requires that the user has already installed Julia and julia is included in the path. We hope that future developments can eliminate this need.)

JIT compilation and symbolic analysis of ODEs and SDEs in R via ModelingToolkit.jl

The reason for Julia is speed (well and other things, but here, SPEED!). Using the pure Julia library, we can solve the Lorenz equation 100 times in about 0.05 seconds:

@time for i in 1:100 solve(prob,saveat=1.0) end
0.048237 seconds (156.80 k allocations: 6.842 MiB)
0.048231 seconds (156.80 k allocations: 6.842 MiB)
0.048659 seconds (156.80 k allocations: 6.842 MiB)

Using diffeqr connected version, we get:

lorenz_solve <- function (i){
  de$solve(prob,saveat=1.0)
}
 
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.81    0.02    6.83
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   7.09    0.00    7.10
> system.time({ lapply(1:100,lorenz_solve) })
   user  system elapsed
   6.78    0.00    6.79

That’s not good, that’s about 100x difference! In this blog post I described that interpreter overhead and context switching are the main causes of this issue. We’ve also demonstrated that ML accelerators like PyTorch generally do not perform well in this regime since those kinds of accelerators rely on heavy array operations, unlike the scalarized nonlinear interactions seen in a lot of differential equation modeling. For this reason we cannot just slap any old JIT compiler onto the f call and then put it into the function since there would still be left over. So we need to do something a bit tricky.

In my JuliaCon 2020 talk, Automated Optimization and Parallelism in DifferentialEquations.jl I demonstrated how ModelingToolkit.jl can be used to trace functions and generate highly optimized sparse and parallel code for scientific computing all in an automated fashion. It turns out that JuliaCall can do a form of tracing on R functions, something that exploited to allow autodiffr to automatically differentiate R code with Julia’s AD libraries. Thus it turns out that the same modelingtoolkitization methods used in AutoOptimize.jl can be used on a subset of R codes which includes a large majority of differential equation models.

In short, we can perform automated acceleration of R code by turning it into sparse parallel Julia code. This was exposed in diffeqr v1.0 as the `jitoptimize_ode(de,prob)` function (also `jitoptimize_sde(de,prob)`). Let’s try it out on this example. All you need to do is give it the ODEProblem which you wish to accelerate. Let’s take the last problem and turn it into a pure Julia defined problem and then time it:

fastprob <- diffeqr::jitoptimize_ode(de,prob)
fast_lorenz_solve <- function (i){
  de$solve(fastprob,saveat=1.0)
}
 
system.time({ lapply(1:100,fast_lorenz_solve) })
 
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.05    0.00    0.04
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06
> system.time({ lapply(1:100,fast_lorenz_solve) })
   user  system elapsed
   0.07    0.00    0.06

And there you go, an R user can get the full benefits of Julia’s optimizing JIT compiler without having to write lick of Julia code! This function also did a few other things, like automatically defined the Jacobian code to make implicit solving of stiff ODEs much faster as well, and it can perform sparsity detection and automatically optimize computations on that.

To see how much of an advance this is, note that this Lorenz equation is the same from the deSolve examples page. So let’s take their example and see how well it performs:

library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <-  a * X + Y * Z
    dY <-  b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
 
parameters <- c(a = -8/3, b = -10, c = 28)
state      <- c(X = 1, Y = 1, Z = 1)
times      <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
 
desolve_lorenz_solve <- function (i){
  state      <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}
 
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.03    0.03    5.07
>
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.42    0.00    5.44
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.41    0.00    5.41

Thus we see 100x acceleration over the leading R library without users having to write anything but R code. This is the true promise in action of a “language of libraries” helping to extend all other high level languages!

GPU-acceleration of ODEs in R via DiffEqGPU.jl

Can we go deeper? Yes we can. In many cases like in optimization and sensitivity analysis of models for pharmacology the users need to solve the same ODE thousands or millions of times to understand the behavior over a large parameter space. To solve this problem well in Julia, we built DiffEqGPU.jl which transforms the pure Julia function into a .ptx kernel to then parallelize the ODE solver over. What this looks like is the following solves the Lorenz equation 100,000 times with randomized initial conditions and parameters:

using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = [1.0,1.0,1.0]
tspan = (0.0,100.0)
p = [10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(3).*u0,p=rand(3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)

Notice how this is only two lines of code different from what we had before, and now everything is GPU accelerated! The requirement for this to work is that the ODE/SDE/DAE function has to be written in Julia… but diffeqr::jitoptimize_ode(de,prob) accelerates the ODE solving in R by generating a Julia function, so could that mean…?

Yes, it does mean we can use DiffEqGPU directly on ODEs defined in R. Let’s see this in action. Once again, we will write almost exactly the same code as in Julia, except with `de$` and with diffeqr::jitoptimize_ode(de,prob) to JIT compile our ODE definition. What this looks like is the following:

de <- diffeqr::diffeq_setup()
degpu <- diffeqr::diffeqgpu_setup()
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
fastprob <- diffeqr::jitoptimize_ode(de,prob)
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
sol = de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0)

Note that diffeqr::diffeqgpu_setup() does the following:

  1. It sets up the drivers and installs the right version of CUDA for the user if not already available
  2. It installs the DiffEqGPU library
  3. It exposes the pieces of the DiffEqGPU library for the user to then call onto ensembles

This means that this portion of the library is fully automated, all the way down to the installation of CUDA! Let’s time this out a bit. 100,000 ODE solves in serial:

@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,saveat=1.0f0)
15.045104 seconds (18.60 M allocations: 2.135 GiB, 4.64% gc time)
14.235984 seconds (16.10 M allocations: 2.022 GiB, 5.62% gc time)

100,000 ODE solves on the GPU in Julia:

@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
2.071817 seconds (6.56 M allocations: 1.077 GiB)
2.148678 seconds (6.56 M allocations: 1.077 GiB)

Now let’s check R in serial:

> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  24.16    1.27   25.42
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=100000,saveat=1.0) })
   user  system elapsed
  25.45    0.94   26.44

and R on GPUs:

> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
 user  system elapsed
12.39    1.51   13.95
> system.time({ de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(),trajectories=100000,saveat=1.0) })
 user  system elapsed
12.55    1.36   13.90

R doesn’t reach quite the level of Julia here, and if you profile you’ll see it’s because the `prob_func`, i.e. the function that tells you which problems to solve, is still a function written in R and this becomes the bottleneck as the computation becomes faster and faster. Thus you will get closer and closer to the Julia speed with longer and harder ODEs, but it still means there’s work to be done. Another detail is that the Julia code is able to be further accelerated by using 32-bit numbers. Let’s see that in action:

using DifferentialEquations, DiffEqGPU
function lorenz(du,u,p,t)
  du[1] = p[1]*(u[2]-u[1])
  du[2] = u[1]*(p[2]-u[3]) - u[2]
  du[3] = u[1]*u[2] - p[3]*u[3]
end
u0 = Float32[1.0,1.0,1.0]
tspan = (0.0f0,100.0f0)
p = Float32[10.0,28.0,8/3]
prob = ODEProblem(lorenz,u0,tspan,p)
prob_func = (prob,i,repeat) -> remake(prob,u0=rand(Float32,3).*u0,p=rand(Float32,3).*p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy=false)
@time sol = solve(monteprob,Tsit5(),EnsembleGPUArray(),trajectories=100_000,saveat=1.0f0)
 
# 1.781718 seconds (6.55 M allocations: 918.051 MiB)
# 1.873190 seconds (6.56 M allocations: 917.875 MiB)

Right now the Julia to R bridge converts all 32-bit numbers back to 64-bit numbers so this doesn’t seem to be possible without the user writing some Julia code, but we hope to get this fixed in one of our coming releases.

To figure out where that leaves us, let’s use deSolve to solve that same Lorenz equation 100 and 1,000 times:

library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <-  a * X + Y * Z
    dY <-  b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}
 
parameters <- c(a = -8/3, b = -10, c = 28)
state      <- c(X = 1, Y = 1, Z = 1)
times      <- seq(0, 100, by = 1.0)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
 
desolve_lorenz_solve <- function (i){
  state      <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}
> system.time({ lapply(1:100,desolve_lorenz_solve) })
   user  system elapsed
   5.06    0.00    5.13
> system.time({ lapply(1:1000,desolve_lorenz_solve) })
   user  system elapsed
  55.68    0.03   55.75

We see the expected linear scaling of a scalar code, so we can extrapolate out and see that to solve 100,000 ODEs it would take deSolve 5000 seconds, as opposed to the 14 seconds of diffeqr or the 1.8 seconds of Julia. In summary:

  1. Pure R diffeqr offers a 350x acceleation over deSolve
  2. Pure Julia DifferentialEquations.jl offers a 2777x acceleration over deSolve

Conclusion

We hope that the R community has enjoyed this release and will enjoy our future releases as well. We hope to continue building further connections to Python as well, and make the installation process as seamless as possibly by having the diffeq_setup() function automatically download and install a version of Julia if it’s not acessible from your path. Together this will make Julia a true language of libraries that can be used to accelerate scientific computation in the surrounding higher level scientific ecosystem.

The post GPU-Accelerated ODE Solving in R with Julia, the Language of Libraries appeared first on Stochastic Lifestyle.

Neural Jump SDEs (Jump Diffusions) and Neural PDEs

By: Christopher Rackauckas

Re-posted from: http://www.stochasticlifestyle.com/neural-jump-sdes-jump-diffusions-and-neural-pdes/

This is just an exploration of some new neural models I decided to jot down for safe keeping. DiffEqFlux.jl gives you the differentiable programming tools to allow you to use any DifferentialEquations.jl problem type (DEProblem) mixed with neural networks. We demonstrated this before, not just with neural ordinary differential equations, but also with things like neural stochastic differential equations and neural delay differential equations.

At the time we made DiffEqFlux, we were the “first to the gate” for many of these differential equations types and left it as an open question for people to find a use for these tools. And judging by the Arxiv papers that went out days after NeurIPS submissions were due, it looks like people now have justified some machine learning use cases for them. There were two separate papers on neural stochastic differential equations, showing them to be the limit of deep latent Gaussian models. Thus when you stick these new mathematical results on our existing adaptive high order GPU-accelerated neural SDE solvers, you get some very interesting and fast ways to learn some of the most cutting edge machine learning methods.

So I wanted to help you guys out with staying one step ahead of the trend by going to the next differential equations. One of the interesting NeurIPS-timed Arxiv papers was on jump ODEs. Following the DiffEqFlux.jl spirit, you can just follow the DifferentialEquations.jl tutorials on these problems, implement them, add a neural network, and it will differentiate through them. So let’s take it one step further and show an example of how you’d do that. I wanted to take a look at jump diffusions, or jump stochastic differential equations, which are exactly what they sound like. They are a mixture of these two methods. After that, I wanted to show how using some methods for stiff differential equations plus a method of lines discretization gives a way to train neural partial differential equations.

Instead of being fully defined by neural networks, I will also be showcasing how you can selectively make parts of a differential equation neuralitized and other parts pre-defined, something we’ve been calling mixed neural differential equations, so we’ll demonstrate a mixed neural jump stochastic differential equation and a mixed neural partial differential equation with fancy GPU-accelerated adaptive etc. methods. I’ll then leave as homework how to train a mixed neural jump stochastic partial differential equation with the fanciest methods, which should be easy to see from this blog post (so yes, that will be the MIT 18.337 homework). This blog post will highlight that these equations are all already possible within our framework, and will also show the specific places we see that we need to accelerate to really put these types of models into production.

Neural Jump Stochastic Differential Equations (Jump Diffusions)

To get to jump diffusions, let’s start with a stochastic differential equation. A stochastic differential equation is defined via

 dX_t = f(t,X_t)dt + g(t,X_t)dW_t

which is essentially saying that there is a deterministic term f and a continuous randomness term g driven by a Brownian motion. Theorems like Donsker’s theorem can be thought of as a generalization of the central limit theorem, saying that continuous stochastic processes of some large class can be reinterpreted as this kind of process (due to the Gaussian-ness of Brownian motion), so in some sense this is a very large encompassing class. If you haven’t seen the previous blog post which mentions how to define neural SDEs, please check that out now. Let’s start with a code that uses reverse-mode automatic differentiation through a GPU-accelerated high order adaptive SDE solver. The code looks like:

using Flux, DiffEqFlux, StochasticDiffEq, Plots, DiffEqMonteCarlo
 
u0 = Float32[2.; 0.] |> gpu
datasize = 30
tspan = (0.0f0,1.0f0)
 
function trueODEfunc(du,u,p,t)
    true_A = [-0.1 2.0; -2.0 -0.1] |> gpu
    du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
mp = Float32[0.2,0.2] |> gpu
function true_noise_func(du,u,p,t)
    du .= mp.*u
end
prob = SDEProblem(trueODEfunc,true_noise_func,u0,tspan)
 
# Take a typical sample from the mean
monte_prob = MonteCarloProblem(prob)
monte_sol = solve(monte_prob,SOSRI(),num_monte = 100)
monte_sum = MonteCarloSummary(monte_sol)
sde_data = Array(timeseries_point_mean(monte_sol,t))
 
dudt = Chain(x -&gt; x.^3,
             Dense(2,50,tanh),
             Dense(50,2)) |> gpu
ps = Flux.params(dudt)
n_sde = x-&gt;neural_dmsde(dudt,x,mp,tspan,SOSRI(),saveat=t,reltol=1e-1,abstol=1e-1)
 
pred = n_sde(u0) # Get the prediction using the correct initial condition
 
dudt_(u,p,t) = Flux.data(dudt(u))
g(u,p,t) = mp.*u
nprob = SDEProblem(dudt_,g,u0,(0.0f0,1.2f0),nothing)
 
monte_nprob = MonteCarloProblem(nprob)
monte_nsol = solve(monte_nprob,SOSRI(),num_monte = 100)
monte_nsum = MonteCarloSummary(monte_nsol)
#plot(monte_nsol,color=1,alpha=0.3)
p1 = plot(monte_nsum, title = "Neural SDE: Before Training")
scatter!(p1,t,sde_data',lw=3)
 
scatter(t,sde_data[1,:],label="data")
scatter!(t,Flux.data(pred[1,:]),label="prediction")
 
function predict_n_sde()
  n_sde(u0)
end
loss_n_sde1() = sum(abs2,sde_data .- predict_n_sde())
loss_n_sde10() = sum([sum(abs2,sde_data .- predict_n_sde()) for i in 1:10])
Flux.back!(loss_n_sde1())
 
data = Iterators.repeated((), 10)
opt = ADAM(0.025)
cb = function () #callback function to observe training
  sample = predict_n_sde()
  # loss against current data
  display(sum(abs2,sde_data .- sample))
  # plot current prediction against data
  cur_pred = Flux.data(sample)
  pl = scatter(t,sde_data[1,:],label="data")
  scatter!(pl,t,cur_pred[1,:],label="prediction")
  display(plot(pl))
end
 
# Display the SDE with the initial parameter values.
cb()
 
Flux.train!(loss_n_sde1 , ps, Iterators.repeated((), 100), opt, cb = cb)
Flux.train!(loss_n_sde10, ps, Iterators.repeated((), 100), opt, cb = cb)
 
dudt_(u,p,t) = Flux.data(dudt(u))
g(u,p,t) = mp.*u
nprob = SDEProblem(dudt_,g,u0,(0.0f0,1.2f0),nothing)
 
monte_nprob = MonteCarloProblem(nprob)
monte_nsol = solve(monte_nprob,SOSRI(),num_monte = 100)
monte_nsum = MonteCarloSummary(monte_nsol)
#plot(monte_nsol,color=1,alpha=0.3)
p2 = plot(monte_nsum, title = "Neural SDE: After Training", xlabel="Time")
scatter!(p2,t,sde_data',lw=3,label=["x" "y" "z" "y"])
 
plot(p1,p2,layout=(2,1))
 
savefig("neural_sde.pdf")
savefig("neural_sde.png")

This just uses the diffeq_rd layer function to tell Flux to use reverse-mode AD (using Tracker.jl, unless you check out a bunch of weird Zygote.jl branches: wait for Zygote) and then trains the neural network using a discrete adjoint. While the previously posted example uses forward-mode, we have found that this is much much faster on neural SDEs, so if you’re trying to train them, I would recommend using this code instead (and I’ll get the examples updated).

Now to this equation let’s add jumps. A jump diffusion is defined like:

 du = f(u,p,t)dt + \sum g_i(u,t)dW^i + \sum c_i(u,p,t)dp_i

where dp_i are the jump terms. The jump terms differ from the Brownian terms because they are non-continuous: they are zero except at countably many time points where you “hit” the equation with an amount  c_i(u,p,t) . The timing at which these occur is based on an internal rate  \lambda_i of the jump  dp_i .

Jump diffusions are important because, just as there is a justification for the universality of stochastic differential equations, there is a justification here as well. The Levy Decomposition says that essentially any Markov process can be decomposed into something of this form. They also form the basis for many financial models, because for example changing regimes into a recession isn’t gradual but rather sudden. Models like Merton’s model thus use these as an essential tool in quantitative finance. So let’s train a neural network on that!

What we have to do is define jump processes and append them onto an existing differential equation. The documentation shows how to use the different jump definitions along with their pros and cons, so for now we will use ContinuousRateJump. Let’s define a ContinuousRateJump which has a constant rate and a neural network that decides what the effect of the jump ( c_i(u,p,t) ) will be. To do this, you’d simply put the neural network in there:

rate(u,p,t) = 2.0
affect!(integrator) = (integrator.u = dudt2(integrator.u))
jump = ConstantRateJump(rate,affect!)

where dudt2 is another neural network, and then wrap that into a jump problem:

prob = SDEProblem(dudt_,g,gpu(param(x)),tspan,nothing)
jump_prob = JumpProblem(prob,Direct(),jump,save_positions=(false,false))

And of course you can make this fancier: just replace that rate 2.0 with another neural network, make the g(u,p,t) term also have a neural network, etc.: explore this as you wish and go find some cool stuff. Let’s just stick with this as our example though, but please go ahead and make these changes and allow DiffEqFlux.jl to help you to explore your craziest mathematical idea!

Now when you solve this, the jumps also occur along with the stochastic differential equation. To show what that looks like, let’s define a jump diffusion and solve it 100 times, taking its mean as our training data:

using Flux, DiffEqFlux, StochasticDiffEq, Plots, DiffEqMonteCarlo,
    DiffEqJump
 
u0 = Float32[2.; 0.]
datasize = 30
tspan = (0.0f0,1.0f0)
 
function trueODEfunc(du,u,p,t)
  true_A = [-0.1 2.0; -2.0 -0.1]
  du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
const mp = Float32[0.2,0.2]
function true_noise_func(du,u,p,t)
  du .= mp.*u
end
 
true_rate(u,p,t) = 2.0
true_affect!(integrator) = (integrator.u[1] = integrator.u[1]/2)
true_jump = ConstantRateJump(true_rate,true_affect!)
prob = SDEProblem(trueODEfunc,true_noise_func,u0,tspan)
jump_prob = JumpProblem(prob,Direct(),true_jump,save_positions=(false,false))
 
# Take a typical sample from the mean
monte_prob = MonteCarloProblem(jump_prob)
monte_sol = solve(monte_prob,SOSRI(),num_monte = 100,parallel_type=:none)
plot(monte_sol,title="Training Data")
 
monte_sum = MonteCarloSummary(monte_sol)
sde_data = Array(timeseries_point_mean(monte_sol,t))

From the plot you can see wild discontinuities mixed in with an equation with continuous randomness. Just lovely.

A full code for training a neural jump diffusion thus is:

using Flux, DiffEqFlux, StochasticDiffEq, Plots, DiffEqMonteCarlo,
    DiffEqJump
 
u0 = Float32[2.; 0.] |> gpu
datasize = 30
tspan = (0.0f0,1.0f0)
 
function trueODEfunc(du,u,p,t)
  true_A = [-0.1 2.0; -2.0 -0.1] |> gpu
  du .= ((u.^3)'true_A)'
end
t = range(tspan[1],tspan[2],length=datasize)
const mp = Float32[0.2,0.2] |> gpu
function true_noise_func(du,u,p,t)
  du .= mp.*u
end
 
true_rate(u,p,t) = 2.0
true_affect!(integrator) = (integrator.u[1] = integrator.u[1]/2)
true_jump = ConstantRateJump(true_rate,true_affect!)
prob = SDEProblem(trueODEfunc,true_noise_func,u0,tspan)
jump_prob = JumpProblem(prob,Direct(),true_jump,save_positions=(false,false))
 
# Take a typical sample from the mean
monte_prob = MonteCarloProblem(jump_prob)
monte_sol = solve(monte_prob,SOSRI(),num_monte = 100,parallel_type=:none)
monte_sum = MonteCarloSummary(monte_sol)
sde_data = Array(timeseries_point_mean(monte_sol,t))
 
dudt = Chain(x -&gt; x.^3,
           Dense(2,50,tanh),
           Dense(50,2)) |> gpu
dudt2 = Chain(Dense(2,50,tanh),
            Dense(50,2)) |> gpu
ps = Flux.params(dudt,dudt2)
 
g(u,p,t) = mp.*u
n_sde = function (x)
    dudt_(u,p,t) = dudt(u)
    rate(u,p,t) = 2.0
    affect!(integrator) = (integrator.u = dudt2(integrator.u))
    jump = ConstantRateJump(rate,affect!)
    prob = SDEProblem(dudt_,g,param(x),tspan,nothing)
    jump_prob = JumpProblem(prob,Direct(),jump,save_positions=(false,false))
    solve(jump_prob, SOSRI(); saveat=t ,abstol = 0.1, reltol = 0.1) |> Tracker.collect
end
 
pred = n_sde(u0) # Get the prediction using the correct initial condition
 
dudt__(u,p,t) = Flux.data(dudt(u))
rate__(u,p,t) = 2.0
affect!__(integrator) = (integrator.u = Flux.data(dudt2(integrator.u)))
jump = ConstantRateJump(rate__,affect!__)
nprob = SDEProblem(dudt__,g,u0,(0.0f0,1.0f0),nothing)
njump_prob = JumpProblem(prob,Direct(),jump, save_positions = (false,false))
 
monte_nprob = MonteCarloProblem(njump_prob)
monte_nsol = solve(monte_nprob,SOSRI(),num_monte = 1000,parallel_type=:none, abstol = 0.1, reltol = 0.1)
monte_nsum = MonteCarloSummary(monte_nsol)
 
#plot(monte_nsol,color=1,alpha=0.3)
p1 = plot(monte_nsum, title = "Neural Jump Diffusion: Before Training")
scatter!(p1,t,sde_data',lw=3)
 
scatter(t,sde_data[1,:],label="data")
scatter!(t,Flux.data(pred[1,:]),label="prediction")
 
function predict_n_sde()
    n_sde(u0)
end
 
loss_n_sde1() = sum(abs2,sde_data .- predict_n_sde())
 function loss_n_sde100()
    loss = sum([sum(abs2,sde_data .- predict_n_sde()) for i in 1:100])
    @show loss
    loss
end
function loss_n_sde500()
    loss = sum([sum(abs2,sde_data .- predict_n_sde()) for i in 1:500])
    @show loss
    loss
end 
Flux.back!(loss_n_sde1())
 
data = Iterators.repeated((), 10)
opt = ADAM(0.025)
cb = function () #callback function to observe training
    sample = predict_n_sde()
    # loss against current data
    display(sum(abs2,sde_data .- sample))
    # plot current prediction against data
    cur_pred = Flux.data(sample)
    pl = scatter(t,sde_data[1,:],label="data")
    scatter!(pl,t,cur_pred[1,:],label="prediction")
    display(plot(pl))
end
 
# Display the SDE with the initial parameter values.
cb()
 
Flux.train!(loss_n_sde1 , ps, Iterators.repeated((), 100), opt, cb = cb)

Notice how it’s almost exactly the same as the SDE code but with the definition of the jumps. You still get the same high order adaptive GPU-accelerated (choice of implicit, etc.) SDE solvers, but now to this more generalized class of problems. Using the GPU gives a good speedup in the neural network case, but slows it down quite a bit when generating the training data since it’s not very parallel. Finding out new ways to use GPUs is one thing I am interested in perusing here. Additionally, using a lower tolerance StackOverflows Tracker.jl, which is something we have fixed with Zygote.jl and will be coming to releases once Zygote.jl on the differential equation solvers is more robust. Lastly, the plotting with GPU-based arrays is wonky right now, we’ll need to make the interface a little bit nicer. However, this is a proof of concept that this stuff does indeed work, though it takes awhile to train it to a “decent” loss (way more than the number of repetitions showcased in here).

[Note: you need to add using CuArrays to enable the GPU support. I turned it off by default because I was training this on my dinky laptop :)]

Neural Partial Differential Equations

Now let’s do a neural partial differential equation (PDE). We can start by pulling code from this older blog post on solving systems of stochastic partial differential equations with GPUs. Here I’m going to strip the stochastic part off, simply because I want to train this on my laptop before the flight ends, so again I’ll leave it as an exercise to do the same jump diffusion treatment to this PDE. Let’s start by defining the method of lines discretization for our PDE. If you don’t know what that is, please go read that blog post on defining SPDEs. What happens is the discretization gives you a set of ODEs to solve, which looks like:

using OrdinaryDiffEq, RecursiveArrayTools, LinearAlgebra,
      DiffEqOperators, Flux, CuArrays
 
# Define the constants for the PDE
const α₂ = 1.0f0
const α₃ = 1.0f0
const β₁ = 1.0f0
const β₂ = 1.0f0
const β₃ = 1.0f0
const r₁ = 1.0f0
const r₂ = 1.0f0
const D = 100.0f0
const γ₁ = 0.1f0
const γ₂ = 0.1f0
const γ₃ = 0.1f0
const N = 100
const X = reshape([i for i in 1:N for j in 1:N],N,N) |> gpu
const Y = reshape([j for i in 1:N for j in 1:N],N,N) |> gpu
const α₁ = 1.0f0.*(X.&gt;=80)
 
const Mx = Array(Tridiagonal([1.0f0 for i in 1:N-1],[-2.0f0 for i in 1:N],[1.0f0 for i in 1:N-1])) |> gpu
const My = copy(Mx)
Mx[2,1] = 2.0
Mx[end-1,end] = 2.0
My[1,2] = 2.0
My[end,end-1] = 2.0
 
# Define the initial condition as normal arrays
u0 = rand(Float32,N,N,3) |> gpu
const MyA = zeros(Float32,N,N) |> gpu
const AMx = zeros(Float32,N,N) |> gpu
const DA = zeros(Float32,N,N) |> gpu
 
# Define the discretized PDE as an ODE function
function f(_du,_u,p,t)
  u = reshape(_u,N,N,3)
  du= reshape(_du,N,N,3)
  A = @view u[:,:,1]
  B = @view u[:,:,2]
  C = @view u[:,:,3]
  dA = @view du[:,:,1]
  dB = @view du[:,:,2]
  dC = @view du[:,:,3]
  mul!(MyA,My,A)
  mul!(AMx,A,Mx)
  @. DA = D*(MyA + AMx)
  @. dA = DA + α₁ - β₁*A - r₁*A*B + r₂*C
  @. dB = α₂ - β₂*B - r₁*A*B + r₂*C
  @. dC = α₃ - β₃*C + r₁*A*B - r₂*C
end
 
# Solve the ODE
prob = ODEProblem(f,vec(u0),(0.0f0,100.0f0))
@time sol = solve(prob,BS3(),  progress=true,saveat = 5.0)
@time sol = solve(prob,ROCK2(),progress=true,saveat = 5.0)
 
 
using Plots; pyplot()
p1 = surface(X,Y,reshape(sol[end],N,N,3)[:,:,1],title = "[A]")
p2 = surface(X,Y,reshape(sol[end],N,N,3)[:,:,2],title = "[B]")
p3 = surface(X,Y,reshape(sol[end],N,N,3)[:,:,3],title = "[C]")
plot(p1,p2,p3,layout=grid(3,1))
savefig("neural_pde_training_data.png")
 
using DiffEqFlux, Flux
 
u0 = param(u0)
tspan = (0.0f0,100.0f0)
 
ann = Chain(Dense(3,50,tanh), Dense(50,3)) |> gpu
p1 = DiffEqFlux.destructure(ann)
ps = Flux.params(ann)
 
_ann = (u,p) -&gt; reshape(p[3*50+51 : 2*3*50+50],3,50)*
                    tanh.(reshape(p[1:3*50],50,3)*u + p[3*50+1:3*50+50]) + p[2*3*50+51:end]
 
function dudt_(_u,p,t)
  u = reshape(_u,N,N,3)
  A = u[:,:,1]
  DA = D .* (A*Mx + My*A)
  _du = mapslices(x -&gt; _ann(x,p),u,dims=3) |> gpu
  du = reshape(_du,N,N,3)
  x = vec(cat(du[:,:,1]+DA,du[:,:,2],du[:,:,3],dims=3))
end
 
prob = ODEProblem(dudt_,vec(Flux.data(u0)),tspan,Flux.data(p1))
@time diffeq_fd(p1,Array,length(u0)*length(0.0f0:5.0f0:100.0f0),prob,ROCK2(),progress=true,
                saveat=0.0f0:5.0f0:100.0f0)
 
function predict_fd()
  diffeq_fd(p1,Array,length(u0)*length(0.0f0:5.0f0:100.0f0),prob,ROCK2(),progress=true,
                  saveat=0.0f0:5.0f0:100.0f0)
end
 
function loss_fd()
  _sol = predict_fd()
  loss = sum(abs2,Array(sol) .- _sol)
  @show loss
  loss
end
loss_fd()
 
data = Iterators.repeated((), 10)
opt = ADAM(0.025)
 
Flux.train!(loss_fd, ps, data, opt)

The interesting part of this neural differential equation is the local/global aspect of parts. The mapslices call makes it so that way there’s a local nonlinear function of 3 variables applied at each point in space. While it keeps the neural network small, this currently does not do well with reverse-mode automatic differentiation or GPUs. That isn’t a major problem here because, since the neural network is kept small in this architecture, the number of parameters is also quite small. That said, reverse-mode AD will be required for fast adjoint passes, so this is still a work in progress / proof of concept, with a very specific point made (all that’s necessary here is overloads to make mapslices work well).

One point that really came out of this was the ODE solver methods. The ROCK2 method is much faster when generating the training data and when running diffeq_fd. It was a difference of 3 minutes with ROCK2 vs 40 minutes with BS3 (on the CPU), showing how specialized methods really are the difference between the problem being solvable or not. The standard implicit methods like Rodas5 aren’t performing well here either since the 30,000×30,000 dense matrix, and I didn’t take the time to specify sparsity patterns or whatnot to actually make them viable competitors. So for the lazy neural ODE use with sparsity, ROCK2 seems like a very interesting option. This is a testament to our newest GSoC crew’s results since it’s one of the newer methods implemented by our student Deepesh Thakur. There are still a few improvements that need to be made to make the eigenvalue estimates more GPU-friendly as well, making this performance result soon carry over to GPUs as well (currently, the indexing in this part of the code gives it trouble, so a PR is coming probably in a week or so). Lastly, I’m not sure what’s a good picture for these kinds of things, so I’m going to have to think about how to represent a global neural PDE fit.

Conclusion

Have fun with this. There are still some rough edges, for example plotting is still a little wonky because all of the automatic DiffEq solution plotting seems to index, so the GPU-based arrays don’t like that (I’ll update that soon now that it’s becoming a standard part of the workflow). Use it as starter code and find some cool stuff. Note that the examples shown here are not the only ones that are possible. This all just uses Julia’s generic programming and differentiable programming infrastructure in order to automatically generate code that is compatible with GPUs and automatic differentiation, so it’s impossible for me to enumerate all of the possible combinations. That means there’s plenty of things to explore. These are very early preliminary results, but shows that these equations are all possible. These examples show some places where we want to continue accelerating by both improving the methods and their implementation details. I look forward to doing an update with Zygote soon.

CITATION:

 Christopher Rackauckas, Neural Jump SDEs (Jump Diffusions) and Neural PDEs, The Winnower6:e155975.53637 (2019). DOI:10.15200/winn.155975.53637

The post Neural Jump SDEs (Jump Diffusions) and Neural PDEs appeared first on Stochastic Lifestyle.

Deep Learning on the New Ubuntu-Based Data Science Virtual Machine for Linux

Authored by Paul Shealy, Senior Software Engineer, and Gopi Kumar, Principal Program Manager, at Microsoft.

Deep learning has received significant attention recently for its ability to create machine learning models with very high accuracy. It’s especially popular in image and speech recognition tasks, where the availability of massive datasets with rich information make it feasible to train ever-larger neural networks on powerful GPUs and achieve groundbreaking results. Although there are a variety of deep learning frameworks available, getting started with one means taking time to download and install the framework, libraries, and other tools before writing your first line of code.

Microsoft’s Data Science Virtual Machine (DSVM) is a family of popular VM images published on the Azure marketplace with a broad choice of machine learning and data science tools. Microsoft is extending it with the introduction of a brand-new offering in this family – the Data Science Virtual Machine for Linux, based on Ubuntu 16.04LTS – that also includes a comprehensive set of popular deep learning frameworks.

Deep learning frameworks in the new VM include:

  • Microsoft Cognitive Toolkit
  • Caffe and Caffe2
  • TensorFlow
  • H2O
  • MXNet
  • NVIDIA DIGITS
  • Theano
  • Torch, including PyTorch
  • Keras

The image can be deployed on VMs with GPUs or CPU-only VMs. It also includes OpenCV, matplotlib and many other libraries that you will find useful.

Run dsvm-more-info at a command prompt or visit the documentation for more information about these frameworks and how to get started.

Sample Jupyter notebooks are included for most frameworks. Start Jupyter or log in to JupyterHub to browse the samples for an easy way to explore the frameworks and get started with deep learning.

GPU Support

Training a deep neural network requires considerable computational resources, so things can be made significantly faster by running on one or more GPUs. Azure now offers NC-class VM sizes with 1-4 NVIDIA K80 GPUs for computational workloads. All deep learning frameworks on the VM are compiled with GPU support, and the NVIDIA driver, CUDA and cuDNN are included. You may also choose to run the VM on a CPU if you prefer, and that is supported without code changes. And because this is running on Azure, you can choose a smaller VM size for setup and exploration, then scale up to one or more GPUs for training.

The VM comes with nvidia-smi to monitor GPU usage during training and help optimize parameters to make full use of the GPU. It also includes NVIDIA Docker if you want to run Docker containers with GPU access.

Data Science Virtual Machine

The Data Science Virtual Machine family of VM images on Azure includes the DSVM for Windows, a CentOS-based DSVM for Linux, and an Ubuntu-based DSVM for Linux. These images come with popular data science and machine learning tools, including Microsoft R Server Developer Edition, Microsoft R Open, Anaconda Python, Julia, Jupyter notebooks, Visual Studio Code, RStudio, xgboost, and many more. A full list of tools for all editions of the DSVM is available here. The DSVM has proven popular with data scientists as it helps them focus on their tasks and skip mundane steps around tool installation and configuration.


To try deep learning on Windows with GPUs, the Deep Learning Toolkit for DSVM contains all tools from the Windows DSVM plus GPU drivers, CUDA, cuDNN, and GPU versions of CNTK, MXNet, and TensorFlow.

Get Started Today

We invite you to use the new image to explore deep learning frameworks or for your machine learning and data science projects – DSVM for Linux (Ubuntu) is available today through the Marketplace. Free Azure credits are available to help get you started.

Paul & Gopi