The latest version of GPUArrays.jl involved a port of all vendor-neutral kernels to KernelAbstractions.jl. This should make it easier to add new functionality and improve the performance of existing kernels.
Vendor-neutral kernel DSL
Back in the day, we created GPUArrays.jl to avoid having to write separate kernels for each GPU back-end, by relying on a very simple vendor-neutral domain-specific language (DSL) that could be translated very easily to the back-end's native kernel language. As a simple example, the following kernel was used to compute the adjoint of a vector:
function LinearAlgebra.adjoint!(B::AbstractGPUMatrix, A::AbstractGPUVector)
gpu_call(B, A) do ctx, B, A
idx = @linearidx A
@inbounds B[1, idx] = adjoint(A[idx])
return
end
return B
end
This DSL was designed almost a decade ago, by Simon Danisch, and has served us well! Since then, KernelAbstractions.jl has been developed by Valentin Churavy, providing a more principled and powerful DSL. With many application developers switching to KernelAbstractions.jl, it was time to port GPUArrays.jl to this new DSL as well.
Thanks to the tireless work by James Schloss, GPUArrays.jl v11 now uses KernelAbstractions.jl for all vendor-neutral kernels. The aforementioned adjoint! kernel now looks like this:
function LinearAlgebra.adjoint!(B::AbstractGPUMatrix, A::AbstractGPUVector)
@kernel function adjoint_kernel!(B, A)
idx = @index(Global, Linear)
@inbounds B[1, idx] = adjoint(A[idx])
end
adjoint_kernel!(get_backend(A))(B, A; ndrange=size(A))
return B
end
As shown above, the KernelAbstractions.jl DSL is very similar to the old DSL, but it provides more flexibility and power (e.g., support for atomics through Atomix.jl). In addition, many more users are familiar with KernelAbstractions.jl, making it easier for them to contribute to GPUArrays.jl. A good first step here would be to port some of the vendor-specific kernels from CUDA.jl to GPUArrays.jl, making them available to all GPU back-ends. If you are interested in contributing, please reach out!
That said, the change is not without its challenges. The added flexibility offered by KernelAbstractions.jl with respect to indexing currently results in certain kernels being slower than before, specifically when there is not much computational complexity to amortise the cost of indexing (e.g., when doing very simple broadcasts). We are working on improving this, but it will take some time. Not to hold back the rest of the JuliaGPU ecosystem, we are releasing despite these performance issues. It's recommended to carefully benchmark your application after upgrading to v11, and to report any performance regressions
Back-end package versions
As GPUArrays.jl is not a direct dependency of most applications, the update will be pulled in by the following back-end package versions (some of which may not be released yet):
Here you will learn about different techniques to infer parameters of a
(differentiable) process-based model against data. This is useful to in
the context of mechanistic inference, where we want to explain patterns
in a system by understanding the processes that generate them, in
contrast to purely statistical or empirical inference, which might
identify patterns or correlations in data without necessarily
understanding the causes. We’ll mostly focus on differential equation
models. Make sure that you stick to the end, where we’ll see how we can
not only infer parameter values but also functional forms, by
parametrizing models’ components with neural networks.
Preliminaries
Wait, what is a differentiable model?
One can usually write a model as a map ℳ mapping some parameters p, an
initial state u0 and a time t to a future state ut
ut = ℳ(u0, t, p).
We call differentiable a model ℳ for which we can calculate its
derivative with respect to p or u0. The derivative
$\frac{\partial \mathcal{M}}{\partial \theta}$ expresses how much the
model output changes with respect to a small change in θ.
A ComponentArray is a convenient Array type that allows to access
array elements with symbols, similarly to a NamedTuple, while
behaving like a standard array. For instance, you could do something
like
cv=ComponentVector(;a=1,b=2)cv.=[3,4]
ComponentVector{Int64}(a = 3, b = 4)
This is useful, because you can only calculate a gradient w.r.t a Vector!
Now let’s try to calculate the gradient of this model. While you could
in this case derive the gradient analytically, an analytic derivation is
generally tricky with complex models. And what about models that can
only be simulated numerically, with no analytic expressions? We need to
find a more automatized way to calculate gradients.
Implement the function ∂mymodel_∂K(h, u0, p, t) which returns the
model’s derivative with respect to K, calculated with a small h to
be provided by the user.
The gradient of the model is useful to understand how a parameter
influences the output of the model. Let’s calculate the importance of
the carrying capacity K on the model output:
As you can observe, the carrying capacity has no effect at small t
where population is small, and its influence on the dynamics grows as
the population grows. We expect the reverse effect for r.
On the importance of gradients for inference
The ability to calculate the derivative of a model is crucial when it
comes to inference. Both within a full Bayesian inference context, where
one wants to sample the posterior distribution of parameters θ given
data u, p(θ|u), or when one wants to obtain a point estimate
$\theta^\star = \text{argmax}_\theta (p(\theta | u))$ (frequentist or machine
learning context), the model gradient proves very useful. In a full
Bayesian inference context, they are used e.g. with Hamiltonian Markov
Chains methods, such as the NUTS sampler, and in a machine learning
context, they are used with gradient-based optimizer.
Gradient descent
The best way to grasp the importance of gradients in inference is to
understand the gradient descent
algorithm.
The following picture illustrates the algorithm in the special case
where p is one-dimensional.
Given an initial estimate of the parameter value p0,
$\frac{d \mathcal{M}}{dp}$ is used to suggest a new, better estimate,
following
Gradient-based methods are usually very efficient in high-dimensional
spaces.
Automatic differentiation
Let’s go back to our method ∂mymodel_∂p. What is the optimal value of h to calculate the derivative? This is a tricky question, because a
too small h can lead to round off errors (see more explanations
here)
while h too large also leads to a bad approximation of the asymptotic
definition.
Fortunately, a bunch of techniques referred to as automatic
differentiation
(AD) allows to exactly differentiate any piece of numerical
functions. In practice, your code must be exclusively written within an
AD-backend, such as Torch, JAX or Tensorflow. Those libraries do not
know how to differentiate code not written in their own language, such
as normal Python code.
Fortunately, Julia is an AD-pervasive language! This means that any
piece of Julia code is theoretically differentiable with AD.
This is what makes Julia great for model calibration and inference!
Write your model in Julia, and any inference method using AD will be
able to work with your model!
We’ll use a simple dynamical community model, the Lotka
Volterra model,
to generate data. We’ll then contaminate this data with noise, and try
to recover the parameters that have generated the data. The goal of the
session will be to estimate those parameters from the data, using a
bunch of different techniques.
This is the true state of the system. Now let’s contaminate it with
observational noise.
Exercise: contaminate data with noise
Create a data_mat array, which consists of the ODE solution
contaminated with a lognormally-distributed noise with standard
deviation 0.3.
Note
Note that we add lognormally-distributed noise instead of
normally-distributed because we are observing population abundance,
which can only be positive.
Solution
data_mat=Array(sol_true).*exp.(0.3*randn(size(sol_true)))# Plot simulation and noisy observations.plot(sol_true;alpha=0.3)scatter!(sol_true.t,data_mat';color=[12],label="")
Now that we have our data, let’s do some inference!
Mechanistic inference as a supervised learning task
We’ll get started with a very crude approach to inference, where we’ll
treat the calibration of our LV model similarly to a supervised machine
learning task. To do so, we’ll write a loss function, defining a
distance between our model and the data, and we’ll try to minimize this
loss. The parameter minimizing this loss will be our best model
parameter estimate.
Current loss after 1 iterations: 251.10349846646116
false
Our initial predictions are bad, but you’ll likely get even worse
predictions in a real-case scenario!
We’ll use the library Optimization, which is a wrapper library around
many optimization libraries in Julia. Optimization therefore provides
us with many different types of optimizers to find parameters minimizing loss. We’ll specifically use the infamous Adam
optimizer (187k citations!!!), widely
used in ML.
In supervised learning, it is common practice to regularize the model to
prevent overfitting. Regularization can also help the model to converge.
Regularization is done by adding a penalty term to the loss function:
Loss(θ) = Lossdata(θ) + λ Reg(θ)
Exercise: regularization
Add a regularization term to the loss, which penalizes the loss when
the inferred initial conditions are less than 0.
Multiple shooting
Another trick that can greatly improve the convergence of the
optimization is to break down the prediction task into simpler tasks.
Namely, instead of trying to predict in one shot the whole time series,
the idea of multiple shooting is to predict for shorter time horizon.
Exercise: multiple shooting
Can you modify your loss function to implement this idea?
Solution
functionmultiple_shooting_idx(N,length_interval=10)K=N÷length_interval@assertN%K==1"`N - 1` is not a multiple of `length_interval`"interval_idxs=[k*length_interval+1:(k+1)*length_interval+1forkin0:(K-1)]returninterval_idxsendfunctionloss_multiple_shooting(p)interval_idxs=multiple_shooting_idx(length(tsteps))l=0.foridxininterval_idxssaveat=tsteps[idx]# u0_i = sol_true.u[idx[1]] # here we are cheating, using true states for initial conditions!u0_i=data_mat[:,idx[1]]# this is not cheating, but it does not work very wellpredicted=solve(prob,alg;u0=u0_i,p,saveat,tspan=(saveat[1],saveat[end]),abstol=1e-6,reltol=1e-6)foriin1:length(predicted)ifall(predicted[i].>0)l+=sum(abs2,log.(data_mat[:,idx[i]])-log.(predicted[i]))endendendpredicted=solve(prob,alg;p,saveat=tsteps,abstol=1e-6,reltol=1e-6)returnl,predictedendlosses=[]pinit=ComponentArray(;α=1.,β=1.5,γ=1.0,δ=0.5)adtype=Optimization.AutoZygote()optf=Optimization.OptimizationFunction((x,p)->loss_multiple_shooting(x),adtype)optprob=Optimization.OptimizationProblem(optf,pinit)@timeres_ada=Optimization.solve(optprob,Adam(0.1);callback,maxiters=500)res_ada.minimizer
Current loss after 1 iterations: 57.64884717929634
Current loss after 101 iterations: 15.985881478253205
Current loss after 201 iterations: 15.984751300361513
Current loss after 301 iterations: 15.984751280519914
Current loss after 401 iterations: 15.98475128052433
Current loss after 501 iterations: 15.984751280410928
Did you wonder why do we need to load SciMLSensitivity? and why did we
specify adtype = Optimization.AutoZygote()?
AD comes in different flavours, with broadly two types of AD methods – forward methods and backward methods -, and a bunch of different
implementations.
You can specify which ones Optimization.jl will use to differentiate loss with adtype, see available options here.
But when it comes to differentiating the solve function from OrdinaryDiffEq, you want to use AutoZygote(), because when trying to
differentiate solve, a specific adjoint rule provided by the SciMLSensitivity package will be used.
What are adjoint rules?
These are algoirithmic rules that specify to the AD backend how to
calculate the derivative of a specific function.
These adjoint rules can be specificed by the keyword sensealg when
calling solve and have been designed for best performance when
differentiating solutions of an ODEProblem. There exists a lot of them
(see a review here), and if sensealg is not provided, a smart polyalgorithm is going to pick up
one for you.
You can have a look in the documentation here
for hints on how to choose an algorithm.
Exercise: benchmarking sensitivity methods
Can you evaluate the performance of ForwardDiffSensitivity() and ReverseDiffAdjoint()?
For a small number of parameters, forward methods tend to perform best,
but with higher number of parameters, the other way around is true.
Well done! Now, let’s jump into the Bayesian world…
Bayesian inference
Julia has a very strong library for Bayesian inference: Turing.jl.
Let’s declare our first Turing model!
This is done with the @model macro, which allows the library to
automatically construct the posterior distribution based on the
definition of your model’s random variables.
Frequentist (supervised learning) vs. Bayesian approach
The main difference between a frequentist approach and a Bayesian
approach is that the latter considers that parameters are random
variables. Hence instead of trying to estimate a single value for the
parameters, the Bayesian will try to estimate the posterior (joint)
distribution of those parameters.
How many threads do you have running? Threads.nthreads() will tell
you!
Let’s see if our chains have converged.
usingStatsPlotsplot(chain)
Data retrodiction
Let’s now generate simulated data using samples from the posterior
distribution, and compare to the original data.
functionplot_predictions(chain,sol,data_mat)myplot=plot(;legend=false)posterior_samples=sample(chain[[:α,:β,:γ,:δ]],300;replace=false)forparrineachrow(Array(posterior_samples))p=NamedTuple([:α,:β,:γ,:δ].=>parr)sol_p=solve(prob,Tsit5();p,saveat)plot!(sol_p;alpha=0.1,color="#BBBBBB")end# Plot simulation and noisy observations.plot!(sol;color=[12],linewidth=1)scatter!(sol.t,data_mat';color=[12])returnmyplotendplot_predictions(chain,sol_true,data_mat)
Exercise: Hey, this is cheating!
Notice that we use the true u0, as if we were to know exactly the
initial state. In a real situation, we need also to infer the true
state!
Can you modify the model to infer the true state?
Solution
@modelfunctionfitlv2(data,prob)# Prior distributions.σ~InverseGamma(2,3)α~truncated(Normal(1.5,0.5);lower=0.5,upper=2.5)β~truncated(Normal(1.2,0.5);lower=0,upper=2)γ~truncated(Normal(3.0,0.5);lower=1,upper=4)δ~truncated(Normal(1.0,0.5);lower=0,upper=2)u0~MvLogNormal(data[:,1],σ^2*I)# Simulate Lotka-Volterra model but save only the second state of the system (predators).p=(;α,β,γ,δ)predicted=solve(prob,alg;p,u0,saveat)# Observations.foriin2:length(predicted)ifall(predicted[i].>0)data[:,i]~MvLogNormal(log.(predicted[i]),σ^2*I)endendreturnnothingendmodel2=fitlv2(data_mat,prob)# Sample 3 independent chains.chain2=sample(model2,NUTS(),MCMCThreads(),3000,3;progress=true)plot(chain2)
Here is a small utility function to visualize your results.
`plot_predictions2`
functionplot_predictions2(chain,sol,data_mat)myplot=plot(;legend=false)posterior_samples=sample(chain,300;replace=false)foriin1:length(posterior_samples)ps=posterior_samples[i]p=get(ps,[:α,:β,:γ,:δ],flatten=true)u0=get(ps,:u0,flatten=true)u0=[u0[1][1],u0[2][1]]sol_p=solve(prob,Tsit5();u0,p,saveat)plot!(sol_p;alpha=0.1,color="#BBBBBB")end# Plot simulation and noisy observations.plot!(sol;color=[12],linewidth=1)scatter!(sol.t,data_mat';color=[12])returnmyplotendplot_predictions2(chain2,sol_true,data_mat)
Mode estimation
Turing allows you to find the maximum likelihood estimate (MLE) or
maximum a posteriori estimate (MAP).
Although Bayesian inference seems very different from the supervised
learning approach we developed in the first part, estimating the MAP,
which can be still considered as Bayesian inference, transforms in an
optimization problem that can be seen as a supervised task.
To see that, we can log-transform the posterior:
log P(θ|?) = log P(?|θ) + log P(θ) − log P(?)
Since the evidence P(?) is independent of θ, it can be ignored
when maximizing with respect to θ. Therefore, the MAP estimate
simplifies to:
Here, log P(?|θ) can be seen as our previous non-regularized loss and log P(θ) acts as a regularization term, penalizing
unlikely parameter values based on our prior beliefs. Priors on
parameters can be seen as regularization term.
MLE and MAP can be obtained by maximum_likelihood and maximum_a_posteriori.
What’s very nice is that Turing.jl provides you with utility functions
to analyse your mode estimation results.
usingStatsBasecoeftable(map_res)
Coef.
Std. Error
z
Pr(>
z
)
σ
0.354554
0.0250558
14.1506
1.85249e-45
0.305445
0.403662
α
1.47077
0.157711
9.32571
1.10241e-20
1.16166
1.77988
β
0.917194
0.125103
7.3315
2.27594e-13
0.671996
1.16239
γ
3.26146
0.335101
9.73279
2.18526e-22
2.60468
3.91825
δ
1.02352
0.124744
8.20497
2.30644e-16
0.779026
1.26801
u0[1]
2.15065
0.18462
11.6491
2.31953e-31
1.7888
2.51249
u0[2]
2.47891
0.247352
10.0218
1.22272e-23
1.99411
2.96371
Exercise: Partially observed state
Let’s assume the following situation: for some reason, you only have
observation data for the predator. Could you still infer all
parameters of your model, including those of the prey?
Could be! Because the signal of the variation in abundance of the
predator contains information on the dynamics of the whole
predator-prey system.
Do it!
You’ll need to assume so prior state for the prey. Just assume that it
is the same as that of the predator.
Solution
@modelfunctionfitlv3(data::AbstractVector,prob)# Prior distributions.σ~InverseGamma(2,3)α~truncated(Normal(1.5,0.5);lower=0.5,upper=2.5)β~truncated(Normal(1.2,0.5);lower=0,upper=2)γ~truncated(Normal(3.0,0.5);lower=1,upper=4)δ~truncated(Normal(1.0,0.5);lower=0,upper=2)u0~MvLogNormal([data[1],data[1]],σ^2*I)# Simulate Lotka-Volterra model but save only the second state of the system (predators).p=(;α,β,γ,δ)predicted=solve(prob,Tsit5();p,u0,saveat,save_idxs=2)# Observations of the predators.foriin2:length(predicted)ifpredicted[i]>0data[i]~LogNormal(log.(predicted[i]),σ^2)endendreturnnothingendmodel3=fitlv3(data_mat[2,:],prob)# Sample 3 independent chains.chain3=sample(model3,NUTS(),MCMCThreads(),3000,3;progress=true)plot(chain3)p=plot_predictions2(chain3,sol_true,data_mat)plot!(p,yaxis=:log10)
Now you need to realise that up to now, we had a relatively simple model. How would this model scale, should we have a much larger model? Let’s cook-up some idealised LV model. –>
AD backends and sensealg
The NUTS sampler uses automatic differentiation under the hood.
By default, Turing.jl uses ForwardDiff.jl as an AD backend, meaning
that the SciML sensitivity methods are not used when the solve
function is called. However, you could change the AD backend to Zygote
with adtype=AutoZygote().
Can you evaluate the performance of ForwardDiffSensitivity() and ReverseDiffAdjoint()?
Variational Inference
Variational inference (VI) consists in approximating the true posterior
distribution P(θ|?) by an approximate distribution Q(θ; ϕ),
where ϕ is a parameter vector defining the shape, location, and other
characteristics of the approximate distribution Q, to be optimzed so
that Q is as close as possible to P. This is achieved by minimizing
the Kullback-Leibler (KL) divergence between the true posterior P(θ|?) and the approximate distribution :
The advantage of VI over traditional MCMC sampling methods is that VI is
generally faster and more scalable to large datasets, as it transforms
the inference problem into an optimization problem.
Let’s do VI in Turing!
importFluxusingTuring:Variationalmodel=fitlv2(data_mat,prob)q0=Variational.meanfield(model)advi=ADVI(10,10_000)# first arg is the q=vi(model,advi,q0;optimizer=Flux.ADAM(1e-2))functionplot_predictions_vi(q,sol,data_mat)myplot=plot(;legend=false)z=rand(q,300)forparrineachcol(z)p=NamedTuple([:α,:β,:γ,:δ].=>parr[2:5])u0=parr[6:7]sol_p=solve(prob,Tsit5();u0,p,saveat)plot!(sol_p;alpha=0.1,color="#BBBBBB")end# Plot simulation and noisy observations.plot!(sol;color=[12],linewidth=1)scatter!(sol.t,data_mat';color=[12])returnmyplotendplot_predictions_vi(q,sol_true,data_mat)
The cool thing with VI that we can sample from the resulting q with
ease.
Up to now, we have been infering the value of the model’s parameters,
assuming that the structure of our model is correct. But this is very
idealistic, specifically in ecology. As a general trend, we have little
idea of how does e.g. the functional response of a
species look like.
What if instead of inferring parameter values, we could infer functional
forms, or components within our model for which we have little idea on
how to express it mathematically?
In Julia, we can do that.
To illustrate this, we’ll assume that we do not know the functional
response of both prey and predator, i.e. the terms β * y and δ * x.
Instead, we will parametrize this component in our DE model by a neural
network, which can be seen as a simple non-linear regressor dependent on
some extra parameters p_nn.
We then simply have to optimize those parameters, along with the other
model’s parameters!
Let’s get started. To make the neural network, we’ll use the deep
learning library Lux.jl, which is similar to Flux.jl but where
models are explicitly parametrized. This explicit parametrization makes
it simpler to integrate with an ODE model.
To make things simpler, we will define a single layer neural network
StatefulLuxLayer{true}(
Dense(2 => 2, relu), # 6 parameters
) # Total: 6 parameters,
# plus 0 states.
We use a StatefulLuxLayer to not having to carry around st_nn, a
struct containing states of a Lux model, which is essentially useless
for a multi-layer perceptron.
st_nn
NamedTuple()
We can now evaluate our neural network model as follows:
nn(u0,p_nn_init)
2-element Vector{Float64}:
0.0
0.0
instead of
nn_init(u0,p_nn_init,st_nn)
([0.0, 0.0], NamedTuple())
Let’s define a new parameter vectors, which will consist of the ODE
model parameters as well as the neural net parameters
Now we can define our Turing Model. We’ll need to use a utility function vector_to_parameters that reconstructs the neural network parameter
type based on a sampled parameter vector (taken from this Turing
tutorial). You
do not need to worry about this. Note that we could have used a
component vector, but for some reason this did not work at the time of
the writing of this tutorial…
`vector_to_parameters`
usingFunctors# for the `fmap`functionvector_to_parameters(ps_new::AbstractVector,ps::NamedTuple)@assertlength(ps_new)==Lux.parameterlength(ps)i=1functionget_ps(x)z=reshape(view(ps_new,i:(i+length(x)-1)),size(x))i+=length(x)returnzendreturnfmap(get_ps,ps)end
vector_to_parameters (generic function with 1 method)
# Create a regularization term and a Gaussian prior variance term.sigma=0.2@modelfunctionfitlv_nn(data,prob)# Prior distributions.σ~InverseGamma(3,0.5)α~truncated(Normal(1.5,0.5);lower=0.5,upper=2.5)γ~truncated(Normal(3.0,0.5);lower=1,upper=4)nparameters=Lux.parameterlength(nn)p_nn_vec~MvNormal(zeros(nparameters),sigma^2*I)p_nn=vector_to_parameters(p_nn_vec,p_nn_init)# Simulate Lotka-Volterra model. p=(;α,γ,p_nn)predicted=solve(prob,alg;p,saveat)# Observations.foriin1:length(predicted)ifall(predicted[i].>0)data[:,i]~MvLogNormal(log.(predicted[i]),σ^2*I)endendreturnnothingendmodel=fitlv_nn(data_mat,prob_nn)