Category Archives: Julia

On combining machine learning-based and theoretical ecosystem models

By: julia | Victor Boussange

Re-posted from: https://vboussange.github.io/post/hybridmodelling/

In this post, I explore the benefits and drawbacks of using empirical (ML)-based models versus mechanistic models for predicting ecosystem responses to perturbations. To evaluate these different modelling approaches, I use a mechanistic ecosystem model to generate a synthetic time series dataset.

By applying both modelling approaches to this dataset, I can evaluate their performance. While the ML-based approach yields accurate forecasts under unperturbed dynamics, it inevitably fails when it comes to predicting ecosystem response to perturbations. On the other hand, the mechanistic model, which is simplified version of the ground truth model to reflect a realistic scenario, is inaccurate and cannot forecast, but provides a more adequate approach to predict ecosystem response to unobserved scenarios.

To improve the accuracy of mechanistic models, I introduce inverse modelling, and in particular an approach that I have developed called piecewise inference. This approach allows to accurately calibrate complex mechanistic models, and doing so open doors to improve our understanding ecosystems by performing model selection.

Finally, I discuss how hybrid models, which incorporate both ML-based and mechanistic components, offer the potential to benefit from the strengths of both modelling approaches. By examining the strengths and limitations of these different modelling approaches, I hope to provide insights into how best to use them to advance our knowledge of ecological and evolutionary dynamics.

Notes

  • This post is under construction, and contain typos! If you find some, please contact me so that I can correct
  • For the sake of clarity, some pieces of code have voluntarily been hidden in external Julia files, which are loaded throughout the post. If you want to inspect them, check out those files in the corresponding GitHub repository

Generating a synthetic dataset

To generate the synthetic dataset, I consider a 3 species ecosystem model, composed of a resource, consumer and prey species. The resource growth rate depends on water availability. Here is a simplified version of the dynamics

$$\begin{aligned}
\text{basal growth of } \text{?} &= f(\text{?})\\
\text{per capita growth rate }\text{?} &= \text{basal growth} – \text{competition} – \text{grazing} – \text{death}\\
\text{per capita growth rate }\text{?} &= \text{grazing} – \text{predation} – \text{death}\\
\text{per capita growth rate }\text{?} &= \text{predation} – \text{death}
\end{aligned}$$

Let’s implement that in Julia!

cd(@__DIR__)
import Pkg; Pkg.activate(".")
using PythonCall
nx = pyimport("networkx")
np = pyimport("numpy")
include("model.jl")
include("utils.jl");

To implement the model, I use the library EcoEvoModelZoo, which provides to the user ready-to-use mechanistic eco-evolutionary models. I use the model type SimpleEcosystemModel, cf. documentation of EcoEvoModelZoo. Let’s first construct the trophic network, and plot it

using EcoEvoModelZoo

N = 3 # number of species

pos = Dict(1 => [0, 0], 2 => [0.2, 1], 3 => [0, 2])
labs = Dict(1 => "Resource", 2 => "Consumer", 3 => "Prey")

foodweb = DiGraph(N)
add_edge!(foodweb, 2 => 1) # Consumer (node 2) feeds on Resource (node 1)
add_edge!(foodweb, 3 => 2) # Predator (node 3) fonds on consumer (node 2)
println(foodweb)
Graphs.SimpleGraphs.SimpleDiGraph{Int64}(2, [Int64[], [1], [2]], [[2], [3],
 Int64[]])
plot_foodweb(foodweb, pos, labs)
(<py Figure size 640x480 with 1 Axes>, <py Axes: >)

Then, I implement the processes that drive the dynamics of the ecoystem. Those include resource limitation for the resource species (e.g. limitation in nutrients), intraspecific competition for the resource species, reproduction, and feeding interactions (grazing and predation). To better understand this piece of code, you may want to refer to one of my previous blog post, “Inverse ecosystem modelling made easy with PiecewiseInference.jl”.

function carrying_capacity(p, t)
    @unpack K₁₁ = p
    K = vcat(K₁₁, ones(Float32, N - 1))
    return K
end
carrying_capacity (generic function with 1 method)
function competition(u, p, t)
    @unpack A₁₁ = p
    A = spdiagm(vcat(A₁₁, zeros(Float32, 2)))
    return A * u
end
competition (generic function with 1 method)
resource_conversion_efficiency(p, t) = ones(Float32, N)
resource_conversion_efficiency (generic function with 1 method)

Functional responses

The feeding processes implemented are based on a functional response of type II. The attack rates q define the slope of the functional response, while the handling times H define the saturation of this response.

W = adjacency_matrix(foodweb)
I, J, _ = findnz(W)

function feeding(u, p, t)
    @unpack H₂₁, H₃₂, q₂₁, q₃₂ = p

    # handling time
    H = sparse(I, J, vcat(H₂₁, H₃₂), N, N)

    # attack rates
    q = sparse(I, J, vcat(q₂₁, q₃₂), N, N)

    return q .* W ./ (one(eltype(u)) .+ q .* H .* (W * u))
end
feeding (generic function with 1 method)

Dependence of resource growth rate on water availability

We model a time-varying water availability, and a growth rate of the resource which depends on the water availability.

water_availability(t) = sin.(2 * pi / 600 * 5 * t)

growth_rate_resource(r, water) = r * exp(-0.5*(water)^2)

intinsic_growth_rate(p, t) = [growth_rate_resource(p.r[1], water_availability(t)); p.r[2:end]]
intinsic_growth_rate (generic function with 1 method)

Let’s plot what does the water availability looks like through time

ts = tspan[1]:1:tspan[end]
fig = figure()
plot(ts, water_availability.(ts))
xlabel("Time (days)")
ylabel("Water availability (normalized)")
fig.set_facecolor("None")
Python None

Now we define the numerical values for the parameter of the ecosystem model.

p_true = ComponentArray(H₂₁=Float32[1.24], # handling times
                        H₃₂=Float32[2.5],
                        q₂₁=Float32[4.98], # attack rates
                        q₃₂=Float32[0.8],
                        r=Float32[1.0, -0.4, -0.08], # growth rates
                        K₁₁=Float32[1.0], # carrying capacity for the resource
                        A₁₁=Float32[1.0]) # competition for the resource
ComponentVector{Float32}(H₂₁ = Float32[1.24], H₃₂ = Float32[2.5], q₂₁ = Flo
at32[4.98], q₃₂ = Float32[0.8], r = Float32[1.0, -0.4, -0.08], K₁₁ = Float3
2[1.0], A₁₁ = Float32[1.0])

And with that, we can plot how we implemented the dependence between the resource growth rate and the water availability. It is a gaussian dependence, where the resource growth rate is maximum at a certain value of water availability. The intuition is that too much or too little water is detrimental to the resource.

water_avail_range = reshape(sort!(water_availability.(tsteps)),length(tsteps))

fig = figure()
plot(water_avail_range, growth_rate_resource.(p_true.r[1], water_avail_range))
xlabel("Water availability (normalized)")
ylabel("Resource basal growth rate")
fig.set_facecolor("none")
Python None

Now let’s simulate the dynamics

u0_true = Float32[0.77, 0.060, 0.945]

mp = ModelParams(;p=p_true,
                tspan,
                u0=u0_true,
                solve_params...)

model = SimpleEcosystemModel(; 
                            mp, 
                            intinsic_growth_rate,
                            carrying_capacity,
                            competition,
                            resource_conversion_efficiency,
                            feeding)
`Model` SimpleEcosystemModel
fig, ax = plot_time_series(simulate(model));

And let’s generate a dataset by contaminating the model output with lognormally distributed noise.

data = simulate(model) |> Array

# contaminating raw data with noise
data = data .* exp.(1f-1*randn(Float32, size(data)))

# plotting
fig, ax = subplots(figsize=(7,4))

for i in 1:size(data,1)
    ax.scatter(tsteps, data[i, :], label=labels_sp[i], color = species_colors[i], s=10.)
end
# ax.set_yscale("log")
ax.set_ylabel("Species abundance")
ax.set_xlabel("Time (days)")
fig.set_facecolor("None")
ax.set_facecolor("None")
fig.legend()
display(fig)

Empirical modelling

Let’s build a ML-model, and train the model on the time series.

We’ll use a recurrent neural network model

More specifically, we use a Long Short Term Memory cell, connected to two dense layers with relu and a radial basis (rbf) activation functions.

using Flux # Julia deep learning library

include("rnn.jl") # Load some utility functions
args = ArgsEco() # Set up hyperparameters

rbf(x) = exp.(-(x.^2)) # custom activation function

hls = 64 # hidden layer size

# Definition of the RNN
# our model takes in the ecosystem state variables, together with current water availability
rnn_model = Flux.Chain(LSTM(N + 1, hls),
            Flux.Dense(hls, hls, relu),
                Flux.Dense(hls, N, rbf))

@time train_model!(rnn_model, data, args)   # Train and output model
75.555727 seconds (137.67 M allocations: 50.030 GiB, 6.62% gc time, 37.42%
 compilation time)

We can now simulate our trained model in an autoregressive manner, which allows us to forecast over time steps beyond the training dataset.

tsteps_forecast = tspan[end]+dt:dt:tspan[end]+100*dt

water_availability_range = water_availability.(vcat(tsteps, tsteps_forecast))
init_states = data[:,1:2]

pred_rnn = simulate(rnn_model, init_states, water_availability_range)
3×200 Matrix{Float32}:
 0.762119  0.808972   0.814921   0.789933   …  0.871319  0.816685  0.742632
 0.118674  0.0965346  0.0672258  0.0434324     0.119743  0.14609   0.185878
 0.530488  0.531594   0.518911   0.476789      0.301607  0.29676   0.31418

And let’s plot it. Plain lines are the model output for the training time span, and dashed lines are the model forecast.



fig, ax = subplots(1, figsize=(7,4))
for i in 1:N
    ax.scatter(tsteps[3:end], 
                data[i, 3:end], 
                label = labels_sp[i], 
                color = species_colors[i],
                s = 6.)
    ax.plot(tsteps[2:end], 
            pred_rnn[i,1:length(tsteps)-1], 
            c = species_colors[i])
    ax.plot(tsteps_forecast .+ dt, 
            pred_rnn[i,length(tsteps):end], 
            linestyle="--", 
            c = species_colors[i])
end
ax.set_ylabel("Species abundance")
ax.set_xlabel("Time (days)")
fig.set_facecolor("None")
ax.set_facecolor("None")
fig.legend()
display(fig)

Looks pretty good!

Now let’s see what happens if the resource species go extinct. Because this species is at the bottom of the trophic chain, we expect a collapse of the ecosystem.

water_availability_range = water_availability.(tsteps)
init_states = data[:,1:2]
init_states[1,:] .= 0.
pred_rnn = simulate(rnn_model, init_states, water_availability_range)
3×100 Matrix{Float32}:
 0.904891   0.767216   0.741785   0.774109  …  0.183886  0.0288432  0.04645
4
 0.0890969  0.0426226  0.0287545  0.019797     0.596416  0.293802   0.08636
74
 0.59085    0.542144   0.487364   0.441334     0.178284  0.282761   0.29444
7
fig, ax = subplots(1, figsize=(7,4))
for i in 1:N
    ax.plot( 
            pred_rnn[i,:], 
            c = species_colors[i])
end
ax.set_ylabel("Species abundance")
ax.set_xlabel("Time (days)")
fig.set_facecolor("None")
ax.set_facecolor("None")
display(fig)

There is clearly a problem! The RNN model outputs almost unchanged dynamics, predicting that magically, the resource would revive. This were to be expected: the ML model did not see this type of collapse dynamics, so it cannot invent it.

In summary, ML-based model

  • ? Very good for interpolation
  • ? Does not require any knowledge from the system, only data!
  • ? Cannot extrapolate to unobserved scenarios

Mechanistic modelling

We now define a new mechanistic model, deliberatively falsifying the numerical value of the parameters compared to the baseline ecosystem model, and simplifying the resource dependence on water availbility by assuming that its growth rate is constant. The resulting model_mechanistic is as such an inaccurate representation of the baseline ecosystem model.

p_mech = ComponentArray(H₂₁=Float32[1.1],
                        H₃₂=Float32[2.3],
                        q₂₁=Float32[4.98],
                        q₃₂=Float32[0.8],
                        r = Float32[1.0, -0.4, -0.08],
                        K₁₁=Float32[0.8],
                        A₁₁=Float32[1.2])

u0_mech = Float32[0.77, 0.060, 0.945]

mp = ModelParams(; p=p_mech,
                u0=u0_mech,
                solve_params...)

growth_rate_mech(p, t) = p.r

model_mechanistic = SimpleEcosystemModel(; mp, intinsic_growth_rate = growth_rate_mech ,
                                        carrying_capacity,
                                        competition,
                                        resource_conversion_efficiency,
                                        feeding)

simul_data = simulate(model_mechanistic)

Let’s now plot its dynamics

plot_time_series(simul_data);

The dynamics looks quite different from the time series, and cannot provide any realistic forecast of the ecosystem state in the future. Yet, it does capture some of the dynamics, since it reproduces an oscillatory behavior. As such, we can assume that this ecosystem model captures some of the processes driving the baseline ecosystem dynamics.

We can now use this mechanistic model to again try to understand what happens if the resource species go extinct.

simul_data = simulate(model_mechanistic, 
                u0 = Float32[0., 0.060, 0.945], 
                saveat=0:dt:100.)

plot_time_series(simul_data);

That makes much more sense!!! The model does predict a collapse, and we can even estimate how long it will take for the system to fully collapse.

In summary, mechanistic models

  • ? Very good for extrapolating to novel scenarios
  • ? Hard to parametrise
  • ? Inacurate

Now let’s see how we can improve this ecosystem model, by making better use of the dataset that we have at hand.

Inverse modelling

Use the observed data to infer the parameters of a model

There are two broad classes of methods to perform inverse modelling:

  • Bayesian inference

    • provide uncertainties estimations
    • does not scale well with the number of parameters to explore
  • Variational optimization

    • Does not suffer from the curse of dimensionality
    • Convergence to local minima
    • Need for parameter sensitivity

To better understand the caveats of the variational optimization approach, let’s further explain it. This method consists in defining a certain loss function, which allows to measure a distance between the model output and the observations:

$$
\text{L} = \sum_i [\text{observations} – \text{simulations}]^2
$$

Variational optimization methods seek to minimize $L$ by using its gradient (sensitivity) with respect to the model parameters. This gradient indicates how to update the parameters to reduce the distance between the observations and the simulation outputs.

This can be done iteratively until finding the parameters that minimize the loss, as illustrated below.

This works very well! In theory.

In practice, the landscape is rugged, as in the picture below

By following the gradient in such a landscape, variational optimization methods tend to get stuck in wrong regions of the parameter space, and provide false estimate.

PiecewiseInference.jl

PiecewiseInference.jl is a julia package that I have authored, which implements a method to smoothen the loss landscape, and that permits to automatically obtain the model parameter sensitivity. It is detailed in the following preprint

Boussange, V., Vilimelis-Aceituno, P., Pellissier, L., Mini-batching ecological data to improve ecosystem models with machine learning. bioRxiv (2022), 46 pages.

⚠️ We will shortly rename this preprint in a revised version, as the term “mini batching” is confusing. We now prefer the term “partitioning”

The method works by training the model on small chunks of data

Let’s use it to tune the paramters of our mechanitic model. We’ll group data points in groups of 11, indicated by the argument group_size = 11

using PiecewiseInference
include("utils.jl"); # some utility functions defining `inference_problem_args` and `piecewise_MLE_args`

loss_likelihood(data, pred, rg) = sum((log.(data) .- log.(pred)).^2)

infprob = InferenceProblem(model_mechanistic, 
                            p_mech; 
                            loss_likelihood, 
                            inference_problem_args...);

@time res_mech = piecewise_MLE(infprob;
                        group_size = 11,
                        data = data,
                        tsteps = tsteps,
                        optimizers = [Adam(1e-2)],
                        epochs = [500],
                        piecewise_MLE_args...)
piecewise_MLE with 101 points and 10 groups.
Current loss after 50 iterations: 14.972400665283203
Current loss after 100 iterations: 12.230918884277344
Current loss after 150 iterations: 11.497014045715332
Current loss after 200 iterations: 11.264657020568848
Current loss after 250 iterations: 11.19422721862793
Current loss after 300 iterations: 11.16870403289795
Current loss after 350 iterations: 11.156006813049316
Current loss after 400 iterations: 11.14731216430664
Current loss after 450 iterations: 11.141851425170898
Current loss after 500 iterations: 11.138410568237305
166.285066 seconds (1.17 G allocations: 177.010 GiB, 10.44% gc time, 27.02%
 compilation time: 0% of which was recompilation)
`InferenceResult` with model SimpleEcosystemModel

Now let’s plot what does this trained model predicts

simul_mech_forecast = simulate(model_mechanistic, 
                                p = res_mech.p_trained, 
                                u0 = data[:,1],
                                saveat = vcat(tsteps, tsteps_forecast), 
                                tspan = (0, tsteps_forecast[end]));

fig, ax = subplots(1, figsize=(7,4))
for i in 1:N
    ax.scatter(tsteps, 
                data[i,:], 
                label = labels_sp[i], 
                color = species_colors[i],
                s = 6.)
    ax.plot(tsteps[1:end], 
            simul_mech_forecast[i,1:length(tsteps)], 
            c = species_colors[i])
    ax.plot(tsteps_forecast, 
            simul_mech_forecast[i,length(tsteps)+1:end], 
            linestyle="--", 
            c = species_colors[i])
end
ax.set_ylabel("Species abundance")
ax.set_xlabel("Time (days)")
fig.set_facecolor("None")
ax.set_facecolor("None")
fig.legend()
display(fig)

Looks much better! The model now captures doubling period oscillations.

In summary, mechanistic models

  • ? Very good for extrapolating to novel scenarios
  • ? Hard to parametrise
  • ? Inacurate

Model selection

To improve the accuracy, one can try to formulate different models corresponding to alternative hypotheses about the processes driving the ecosystem dynamics. For instance, we could compare the performance of this model to an alternative model, which would capture some sort of dependence between the water availability and the resource growth rate.

If we find that the refined model performs better than the model with constant growth rate, we have learnt that water availability is an important driver that affects the ecosystem dynamics.

Hybrid models

Assume that we have no idea on what the dependence of the resource growth rate on water availability may look like.

To proceed, we can define a very generic parametric function that can capture any sort of dependence, and then try to learn the parameters of this function from the data.

Neural networks are parametric functions that are highly suited for this sort of task. So we’ll build a hyrbid model, which contains the mechanistic components of the previous model, but where the resource growth rate is parametrized by a neural network.

? + ?= ?

Below, we define the neural network, and the resource growth rate based on this neural net.

using Lux
rng = Random.default_rng()
# Multilayer FeedForward
hlsize = 5
neural_net = Lux.Chain(Lux.Dense(1,hlsize,rbf), 
                        Lux.Dense(hlsize,hlsize, rbf), 
                        Lux.Dense(hlsize,hlsize, rbf), 
                        Lux.Dense(hlsize, 1))
# Get the initial parameters and state variables of the model
p_nn, st = Lux.setup(rng, neural_net)
p_nn = p_nn |> ComponentArray

growth_rate_resource_nn(p_nn, water) = neural_net([water], p_nn, st)[1]

function hybrid_growth_rate(p, t)
    return [growth_rate_resource_nn(p.p_nn, water_availability(t)); p.r]
end
hybrid_growth_rate (generic function with 1 method)

Now we define our new hybrid model that implements this hybrid_growth_rate function

p_hybr = ComponentArray(H₂₁=p_mech.H₂₁,
                        H₃₂=p_mech.H₃₂,
                        q₂₁=p_mech.q₂₁,
                        q₃₂=p_mech.q₃₂,
                        r = p_mech.r[2:3],
                        K₁₁=p_mech.K₁₁,
                        A₁₁=p_mech.A₁₁,
                        p_nn = p_nn)

mp = ModelParams(;p=p_hybr,
                u0=u0_mech,
                solve_params...)

model_hybrid = SimpleEcosystemModel(; mp, 
                                    intinsic_growth_rate = hybrid_growth_rate,
                                    carrying_capacity,
                                    competition,
                                    resource_conversion_efficiency,
                                    feeding)
`Model` SimpleEcosystemModel

And we train it

infprob = InferenceProblem(model_hybrid, 
                            p_hybr; 
                            loss_likelihood, 
                            inference_problem_args...);

res_hybr = piecewise_MLE(infprob;
                    group_size = 11,
                    data = data,
                    tsteps = tsteps,
                    optimizers = [Adam(2e-2)],
                    epochs = [500],
                    piecewise_MLE_args...)
piecewise_MLE with 101 points and 10 groups.
Current loss after 50 iterations: 5.201906204223633
Current loss after 100 iterations: 3.6290767192840576
Current loss after 150 iterations: 3.519521713256836
Current loss after 200 iterations: 3.4955437183380127
Current loss after 250 iterations: 3.4787349700927734
Current loss after 300 iterations: 3.5343759059906006
Current loss after 350 iterations: 3.4698166847229004
Current loss after 400 iterations: 3.4563584327697754
Current loss after 450 iterations: 3.5980613231658936
Current loss after 500 iterations: 3.4506189823150635
`InferenceResult` with model SimpleEcosystemModel

The loss has signficiantly reduced. This means that this model better explains the dynamics, and as such, we have discovered that water avilability is indeed an important driver of ecosystem dynamics! Let’s plot the simulation output of this hyrbid model

simul_hybr_forecast = simulate(model_hybrid, 
                                p = res_hybr.p_trained, 
#                                 u0 = res_hybr.u0s_trained[1],
                                u0 = data[:,1],
                                saveat = vcat(tsteps, tsteps_forecast), 
                                tspan = (0, tsteps_forecast[end]));
fig, ax = subplots(1, figsize=(7,4))
for i in 1:N
    ax.scatter(tsteps, 
                data[i,:], 
                label = labels_sp[i], 
                color = species_colors[i],
                s = 6.)
    ax.plot(tsteps[1:end], 
        simul_hybr_forecast[i,1:length(tsteps)], 
            c = species_colors[i])
    ax.plot(tsteps_forecast, 
            simul_hybr_forecast[i,length(tsteps)+1:end], 
            linestyle="--", 
            c = species_colors[i])
end
ax.set_ylabel("Species abundance")
ax.set_xlabel("Time (days)")
fig.set_facecolor("None")
ax.set_facecolor("None")
fig.legend()
display(fig)

The cool thing is that although it contains a fully parametric component (the neural network), this hybrid can still extrapolate, because it is constrained by mechanistic processes.

plot_time_series(simulate(model_hybrid, 
                p = res_hybr.p_trained, 
                u0 = Float32[0., 0.060, 0.945], 
                saveat=0:dt:100.));

And what’s even cooler is that by interpreting the neural network, we can actually learn a new process: the shape of the dependence between the resource growth rate and the water availability!

water_avail = reshape(sort!(water_availability.(tsteps)),1,:)

p_nn_trained = res_hybr.p_trained.p_nn
gr = neural_net(water_avail, p_nn_trained, st)[1]


fig, ax = subplots(1)
ax.plot(water_avail[:], 
        gr[:], 
        label="Neural network",
        linestyle="--")

gr_true = model.mp.p[1] .* exp.(-0.5 .* water_avail.^2)
ax.plot(water_avail[:], gr_true[:], label="Ground truth")
ax.set_ylim(0,2)
ax.set_xlim(-2,2)
ax.legend()
xlabel("Water availability (normalized)")
ylabel("Resource basal growth rate")
display(fig)

Wrap-up

Hybrid approaches are the future! But they

  • Require
  • Requires differentiable programming and interoperability between scientific libraries
    • What
      is best at

    • But JAX is also very cool ?

Inverse ecosystem modeling made easy with PiecewiseInference.jl

By: julia | Victor Boussange

Re-posted from: https://vboussange.github.io/post/piecewiseinference/

Mechanistic ecosystem models permit to quantiatively describe how population, species or communities grow, interact and evolve. Yet calibrating them to fit real-world data is a daunting task. That’s why I’m excited to introduce PiecewiseInference.jl, a new Julia package that provides a user-friendly and efficient framework for inverse ecosystem modeling. In this blog post, I will guide you through the main features of PiecewiseInference.jl and provide a step-by-step tutorial on how to use it with a three-compartment ecosystem model. Whether you’re a quantitative ecologist or a curious data scientist, I hope this post will encourage you to join the effort and use and develop inverse ecosystem modelling methods to improve our understanding and predictions of ecosystems.

Preliminary steps

This tutorial relies on three packages that I have authored but are (yet) not registered on the official Julia registry. Those are

  • PiecewiseInference,
  • EcoEvoModelZoo: a package which provides access to a collection of ecosystem models,
  • ParametricModels: a wrapper package to manipulate dynamical models. Specifically, ParametricModels avoids the hassle of specifying, at each time you want to simulate an ODE model, boring details such as the algorithm to solve it, the time span, etc…

To easily install them on your machine, you’ll have to add my personal registry by doing the following:

using Pkg; Pkg.Registry.add(RegistrySpec(url = "https://github.com/vboussange/VBoussangeRegistry.git"))

Once this is done, let’s import those together with other necessary Julia packages for this tutorial.

using Graphs
using EcoEvoModelZoo
using ParametricModels
using LinearAlgebra
using UnPack
using OrdinaryDiffEq
using Statistics
using SparseArrays
using ComponentArrays
using PythonPlot

We use Graphs to create a directed graph to represent the food web to be considered The OrdinaryDiffEq package provides tools for solving ordinary differential equations, while the LinearAlgebra package is used for linear algebraic computations. The UnPack package provides a convenient way to extract fields from structures, and the ComponentArrays package is used to store and manipulate the model parameters conveniently. Finally, the PythonCall package is used to interface with Python’s Matplotlib library for visualization.

Definition of the forward model

Defining hyperparameters for the forward simulation of the model.

Next, we define the algorithm used for solving the ODE model. We also define the
absolute tolerance (abstol) and relative tolerance (reltol) for the solver.
tspan is a tuple representing the time range we will simulate the system for,
and tsteps is a vector representing the times we want to output the simulated
data.

alg = BS3()
abstol = 1e-6
reltol = 1e-6
tspan = (0.0, 600)
tsteps = range(300, tspan[end], length=100)
300.0:3.0303030303030303:600.0

Defining the foodweb structure

We’ll define a 3-compartment ecosystem as presented in McCann et al. (1994). We will use SimpleEcosystemModel from EcoEvoModeZoo.jl, which requires as input a foodweb structure. Let’s use a DiGraph to represent it.

N = 3 # number of compartment

foodweb = DiGraph(N)
add_edge!(foodweb, 2 => 1) # C to R
add_edge!(foodweb, 3 => 2) # P to C
true

The N variable specifies the number of
compartments in the model. The add_edge! function is used to add edges to the
graph, specifying the flow of resources between compartments.

For fun, let’s just plot the foodweb. Here we use the PythonCall and PythonPlot
packages to visualize the food web as a directed graph using networkx and numpy.
We create a color list for the different species, and then create a directed
graph g_nx with networkx using the adjacency matrix of the food web. We also
specify the position of each node in the graph, and use nx.draw to draw the
graph with

using PythonCall
nx = pyimport("networkx")
np = pyimport("numpy")
species_colors = ["tab:red", "tab:green", "tab:blue"]

g_nx = nx.DiGraph(np.array(adjacency_matrix(foodweb)))
pos = Dict(0 => [0, 0], 1 => [0.2, 1], 2 => [0, 2])
labs = Dict(0 => "Resource", 1 => "Consumer", 2 => "Prey")

fig, ax = subplots(1)
nx.draw(g_nx, pos, ax=ax, node_color=species_colors, node_size=1000, labels=labs)
display(fig)

Defining the ecosystem model

Now that we have defined the foodweb structure, we can build the ecosystem
model, which will be a SimpleEcosystemModel from EcoEvoModelZoo.

The next several functions are required by SimpleEcosystemModel and define the
specific dynamics of the model. The intinsic_growth_rate function specifies
the intrinsic growth rate of each compartment, while the carrying_capacity
function specifies the carrying capacity of each compartment. The competition
function specifies the competition between and within compartments, while the
resource_conversion_efficiency function specifies the efficiency with which
resources are converted into consumer biomass. The feeding function specifies
the feeding interactions between compartments.

intinsic_growth_rate(p, t) = p.r

function carrying_capacity(p, t)
    @unpack K₁₁ = p
    K = vcat(K₁₁, ones(N - 1))
    return K
end

function competition(u, p, t)
    @unpack A₁₁ = p
    A = spdiagm(vcat(A₁₁, 0, 0))
    return A * u
end

resource_conversion_efficiency(p, t) = ones(N)
resource_conversion_efficiency (generic function with 1 method)

To define the feeding processes, we use adjacency_matrix to get the adjacency matrix of the food web. We then use findnz from SparseArrays to get the row and column indices of the non-zero entries in the adjacency matrix, which we store in I and J. Those are then used to generate sparse matrices required for defining the functional responses of each species considered. The sparse matrices’ non-zero coefficients are the model parameters to be fitted.

using SparseArrays
W = adjacency_matrix(foodweb)
I, J, _ = findnz(W)
([2, 3], [1, 2], [1, 1])
function feeding(u, p, t)
    @unpack H₂₁, H₃₂, q₂₁, q₃₂ = p

    # handling time
    H = sparse(I, J, vcat(H₂₁, H₃₂), N, N)

    # attack rates
    q = sparse(I, J, vcat(q₂₁, q₃₂), N, N)

    return q .* W ./ (one(eltype(u)) .+ q .* H .* (W * u))
end
feeding (generic function with 1 method)

We are done defining the ecological processes.

Defining the ecosystem model parameters for generating a dataset

The parameters for the ecosystem model are defined using a ComponentArray. The
u0_true variable specifies the initial conditions for the simulation. The
ModelParams type from the ParametricModels package is used to specify the
model parameters and simulation settings. Finally, the SimpleEcosystemModel
type from the EcoEvoModelZoo package is used to define the ecosystem model.

p_true = ComponentArray(H₂₁=[1.24],
                        H₃₂=[2.5],
                        q₂₁=[4.98],
                        q₃₂=[0.8],
                        r=[1.0, -0.4, -0.08],
                        K₁₁=[1.0],
                        A₁₁=[1.0])

u0_true = [0.77, 0.060, 0.945]

mp = ModelParams(; p=p_true,
    tspan,
    u0=u0_true,
    alg,
    reltol,
    abstol,
    saveat=tsteps,
    verbose=false, # suppresses warnings for maxiters
    maxiters=50_000
)
model = SimpleEcosystemModel(; mp, intinsic_growth_rate,
    carrying_capacity,
    competition,
    resource_conversion_efficiency,
    feeding)
`Model` SimpleEcosystemModel

Let’s run the model to generate a dataset! There is nothing more simple than that. Let’s also plot it,
to get a sense of what it looks like.

data = simulate(model, u0=u0_true) |> Array

# plotting
using PythonPlot;
function plot_time_series(data)
    fig, ax = subplots()
    for i in 1:N
        ax.plot(data[i, :], label="Species $i", color = species_colors[i])
    end
    # ax.set_yscale("log")
    ax.set_ylabel("Species abundance")
    ax.set_xlabel("Time (days)")
    fig.set_facecolor("None")
    ax.set_facecolor("None")
    fig.legend()
    return fig
end

display(plot_time_series(data))

Let’s add a bit of noise to the data to simulate experimental errors. We proceed by adding
log normally distributed noise, so that abundance are always positive (negative abundance would not make sense, but could happen when adding normally distributed noise!).

data = data .* exp.(0.1 * randn(size(data)))

display(plot_time_series(data))

Inversion with PiecewiseInference.jl

Now that we have set up our model and generated some data, we can proceed with the inverse modelling using PiecewiseInference.jl.

PiecewiseInference.jl allows to perform inversion based on a segmentation method that partitions the data into short time series (segments), each treated independently and matched against simulations of the model considered. The segmentation approach helps to avoid the ill-behaved loss functions that arise from the strong nonlinearities of ecosystem models, when formulation the inference problem. Note that during the inversion, not only the parameters are inferred, but also the initial conditions, which are necessary to simulate the ODE model.

Definition of the InferenceProblem

We first import the packages required for the inversion. PiecewiseInference is the
main package used, but we also need OptimizationFlux for the Adam optimizer,
and SciMLSensitivity to define the sensitivity method used to differentiate
the forward model.

using PiecewiseInference
using OptimizationFlux
using SciMLSensitivity

To initialize the inversion, we set the initial values for the parameters in p_init to those of p_true but modify the H₂₁ parameter.

p_init = p_true
p_init.H₂₁ .= 2.0 #
1-element view(::Vector{Float64}, 1:1) with eltype Float64:
 2.0

Next, we define a loss function loss_likelihood that compares the observed data
with the predicted data. Here, we use a simple mean-squared error loss function while log transforming the abundance, since the noise is log-normally distributed.

loss_likelihood(data, pred, rg) = sum((log.(data) .- log.(pred)) .^ 2)# loss_fn_lognormal_distrib(data, pred, noise_distrib)
loss_likelihood (generic function with 1 method)

We then define the InferenceProblem, which contains the forward
model, the initial parameter values, and the loss function.

infprob = InferenceProblem(model, p_init; loss_likelihood);

It is also handy to use a callback function, that will be called after each iteration of the optimization routine, for visualizing the progress of the inference. Here, we use it to track the loss value and plot the data against the model predictions.

info_per_its = 50
include("cb.jl") # defines the `plotting_fit` function
function callback(p_trained, losses, pred, ranges)
    if length(losses) % info_per_its == 0
        plotting_fit(losses, pred, ranges, data, tsteps)
    end
end
callback (generic function with 1 method)

piecewise_MLE hyperparameters

To use piecewise_MLE, the main function of PiecewiseInference to estimate the parameters that fit the observed data, we need to decide on two critical hyperparameters

  • group_size: the number of data points that define an interval, or segment. This number is usually small, but should be decided upon the dynamics of the model: to more nonlinear is the model, the lower group_size should be. We set it here to 11
  • batch_size: the number of intervals, or segments, to consider on a single epoch. The higher the batch_size, the more computationally expensive a single iteration of piecewise_MLE, but the faster the convergence. Here, we set it to 5, but could increase it to 10, which is the total number of segments that we have.

Another critical parameter to be decided upon is the automatic differentiation backend used to differentiate the ODE model. Two are supported, Optimization.AutoForwardDiff() and Optimization.Autozygote(). Simply put, Optimization.AutoForwardDiff() is used for forward mode sensitivity analysis, while Optimization.Autozygote() is used for backward mode sensitivity analysis. For more information on those, please refer to the documentation of Optimization.jl.

Other parameters required by piecewise_MLE are

  • optimizers specifies the optimization algorithm to be used for each batch. We use the Adam optimizer, which is the go-to optimizer to train deep learning models. It has a learning rate parameter that controls the step size at each iteration. We have chosen a value of 1e-2 because it provides good convergence without causing numerical instability,

  • epochs specifies the number of epochs to be used for each batch. We chose a value of 500 because it is sufficient to achieve good convergence,

  • info_per_its specifies after how many iterations the callback function should be called

  • verbose_loss prints the value of the loss function during training,

@time res = piecewise_MLE(infprob;
                        adtype = Optimization.AutoZygote(),
                        group_size = 11,
                        batchsizes = [5],
                        data = data,
                        tsteps = tsteps,
                        optimizers = [Adam(1e-2)],
                        epochs = [500],
                        verbose_loss = true,
                        info_per_its = info_per_its,
                        multi_threading = false,
                        cb = callback)
piecewise_MLE with 100 points and 10 groups.
Current loss after 50 iterations: 36.745952018526495
Current loss after 100 iterations: 15.064711239454626
Current loss after 150 iterations: 9.993029255013324
Current loss after 200 iterations: 7.994491307947515
Current loss after 250 iterations: 6.500818892986831
Current loss after 300 iterations: 5.3892647156988565
Current loss after 350 iterations: 3.0351181646280514
Current loss after 400 iterations: 2.674445730720996
Current loss after 450 iterations: 3.1591980829795676
Current loss after 500 iterations: 2.4343376293865995
157.765049 seconds (1.70 G allocations: 154.827 GiB, 10.16% gc time, 31.31%
 compilation time: 1% of which was recompilation)
`InferenceResult` with model SimpleEcosystemModel

Finally, we can examine the results of the inversion. We can look at the final parameters, and the initial conditions inferred for each segement:

# Some more code
p_trained = res.p_trained
u0s_trained = res.u0s_trained

function print_param_values(p_trained, p_true)
    for k in keys(p_trained)
        println(string(k))
        println("trained value = "); display(p_trained[k])
        println("true value ="); display(p_true[k])
    end
end

print_param_values(p_trained, p_true)
H₂₁
trained value = 
1-element Vector{Float64}:
 1.4786844814887716
true value =
1-element Vector{Float64}:
 2.0
H₃₂
trained value = 
1-element Vector{Float64}:
 1.891238277791975
true value =
1-element Vector{Float64}:
 2.5
q₂₁
trained value = 
1-element Vector{Float64}:
 4.550896291686214
true value =
1-element Vector{Float64}:
 4.98
q₃₂
trained value = 
1-element Vector{Float64}:
 0.7250599871665505
true value =
1-element Vector{Float64}:
 0.8
r
trained value = 
3-element Vector{Float64}:
  0.8705446490535288
 -0.30124597815843823
 -0.08241879418666838
true value =
3-element Vector{Float64}:
  1.0
 -0.4
 -0.08
K₁₁
trained value = 
1-element Vector{Float64}:
 1.0397189294700315
true value =
1-element Vector{Float64}:
 1.0
A₁₁
trained value = 
1-element Vector{Float64}:
 0.972947353012839
true value =
1-element Vector{Float64}:
 1.0

Your turn to play!

You can try to change e.g. the batch_sizes and the group_size. How do those parameters influence the quality of the inversion?

Conclusion

PiecewiseInference.jl provides an efficient and flexible way to perform inference on complex ecological models, making use of automatic differentiation and optimizers traditionally used in Machine Learning. The segmentation method implemented in PiecewiseInference.jl regularizes the inference problem and enables inverse modelling of complex dynamical systems, for which standard methods would otherwise fail.

Furthermore, PiecewiseInference.jl together with EcoEvoModelZoo.jl offer a powerful toolkit for ecologists and evolutionary biologists to benchmark and validate models against data. The combination of theoretical modelling and data can provide new insights into complex ecological systems, helping us to better understand and predict the dynamics of biodiversity.

We invite users to explore these packages and contribute to their development, by adding new models to the EcoEvoModelZoo.jl and improve the features of PiecewiseInference.jl. With these tools, we can continue to push the boundaries of ecological modelling and make important strides towards a more sustainable future.

Appendix

You can find the corresponding tutorial as a .jmd file at https://github.com/vboussange/MyTutorials.
Please contact me, if you have found a mistake, or if you have any comment or suggestion on how to improve this tutorial.

What is new in DataFrames.jl 1.5

By: Blog by Bogumił Kamiński

Re-posted from: https://bkamins.github.io/julialang/2023/03/24/df15.html

Introduction

The objective of this post is simple: I want to discuss
the most important new functionalities that DataFrames.jl 1.5 offers.

The changes I cover today are:

  • more powerful Cols selector;
  • group order setting in groupby;
  • setting row order in joins;
  • new options of handling duplicate rows in unique;
  • making flatten function scalar-aware.

The post was tested under Julia 1.9.0-rc1 and DataFrames.jl 1.5.0.

More powerful Cols selector

The Cols selector has gotten two new features:

  • ability to mix condition function selector with other selectors;
  • new operator keyword argument.

Let me explain both by example. Start with using condition function selector.

This is the old behavior that was supported:

julia> using DataFrames

julia> df = DataFrame(x1=1, x2=2, y1=3, y2=4)
1×4 DataFrame
 Row │ x1     x2     y1     y2
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4

julia> select(df, Cols(startswith("x")))
1×2 DataFrame
 Row │ x1     x2
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

The startswith("x") condition function was required to be the only argument to Cols.
Now you can flexibly mix it with other selectors:

julia> select(df, Cols(startswith("x"), r"2"))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

julia> select(df, Cols(startswith("x"), :y2))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

The second change is operator keyword argument. By default Cols, when passed multiple arguments
takes their union:

julia> select(df, Cols(startswith("x"), endswith("2")))
1×3 DataFrame
 Row │ x1     x2     y2
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      2      4

However, you can pass other operators that specify the way how columns selected by individual
arguments should be combined together. For example you can take their intersection:

julia> select(df, Cols(startswith("x"), endswith("2"), operator=intersect))
1×1 DataFrame
 Row │ x2
     │ Int64
─────┼───────
   1 │     2

or set difference:

julia> select(df, Cols(startswith("x"), endswith("2"), operator=setdiff))
1×1 DataFrame
 Row │ x1
     │ Int64
─────┼───────
   1 │     1

Group order setting in groupby

Previously grouby when you passed sort=true sorted groups in ascending order.
Here is an example:

julia> df = DataFrame(id=["a", "c", "b"], row=1:3)
3×2 DataFrame
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1
   2 │ c           2
   3 │ b           3

julia> show(groupby(df, :id, sort=true), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "a"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1
Group 2 (1 row): id = "b"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ b           3
Group 3 (1 row): id = "c"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ c           2

Now you can use pass in sort any set of keyword arguments that sort accepts as a named tuple.
For example if you wanted groups to be sorted in reverse order do:

julia> show(groupby(df, :id, sort=(rev=true,)), allgroups=true)
GroupedDataFrame with 3 groups based on key: id
Group 1 (1 row): id = "c"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ c           2
Group 2 (1 row): id = "b"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ b           3
Group 3 (1 row): id = "a"
 Row │ id      row
     │ String  Int64
─────┼───────────────
   1 │ a           1

Setting row order in joins

By default join operations do not guarantee the order of rows in the output
(just like databases):

julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
 Row │ id     left
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     4      3
   4 │     5      4

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
 Row │ id     right
     │ Int64  Int64
─────┼──────────────
   1 │     2      1
   2 │     1      2
   3 │     3      3
   4 │     6      4
   5 │     7      5

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

However, often users want the result to follow the order of rows of one of the source tables.
This can be achieved now using the order keyword argument.

If you want the result to have rows in the order of left table (and then added
non-matching rows from the right table at the end, if needed) do:

julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     1        1        2
   2 │     2        2        1
   3 │     4        3  missing
   4 │     5        4  missing
   5 │     3  missing        3
   6 │     6  missing        4
   7 │     7  missing        5

Similar option is available if you want to keep the row order of the right table:

julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
 Row │ id     left     right
     │ Int64  Int64?   Int64?
─────┼─────────────────────────
   1 │     2        2        1
   2 │     1        1        2
   3 │     3  missing        3
   4 │     6  missing        4
   5 │     7  missing        5
   6 │     4        3  missing
   7 │     5        4  missing

New options of handling duplicate rows in unique

By default unique (and related functions unique! and nonunique) kept
the first duplicate row in case of duplicates. Here is an example:

julia> df = DataFrame(a=[1, 2, 3, 1, 2, 4], id=1:6)
6×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     1      4
   5 │     2      5
   6 │     4      6

julia> unique(df, :a)
4×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      2
   3 │     3      3
   4 │     4      6

However, users sometimes wanted a different behavior. Two more are now supported.
If instead you want to keep the last of duplicate rows pass keep=:last:

julia> unique(df, :a, keep=:last)
4×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     1      4
   3 │     2      5
   4 │     4      6

While, if you do not want to keep any duplicate rows pass keep=:noduplicates:

julia> unique(df, :a, keep=:noduplicates)
2×2 DataFrame
 Row │ a      id
     │ Int64  Int64
─────┼──────────────
   1 │     3      3
   2 │     4      6

Making flatten function scalar-aware

The flatten function is often useful when one wants to unnest a column that
holds collections (e.g. vectors). Here is an example:

julia> df = DataFrame(id=1:3, col=[["a", "b"], ["c", "d"], ["e", "f"]])
3×2 DataFrame
 Row │ id     col
     │ Int64  Array…
─────┼───────────────────
   1 │     1  ["a", "b"]
   2 │     2  ["c", "d"]
   3 │     3  ["e", "f"]

julia> flatten(df, :col)
6×2 DataFrame
 Row │ id     col
     │ Int64  String
─────┼───────────────
   1 │     1  a
   2 │     1  b
   3 │     2  c
   4 │     2  d
   5 │     3  e
   6 │     3  f

However, sometimes such a column might hold values that are not collections and we do not want to try expanding them.
The most common case is missing:

julia> df = DataFrame(id=1:3, col=[["a", "b"], missing, ["e", "f"]])
3×2 DataFrame
 Row │ id     col
     │ Int64  Array…?
─────┼───────────────────
   1 │     1  ["a", "b"]
   2 │     2  missing
   3 │     3  ["e", "f"]

julia> flatten(df, :col)
ERROR: MethodError: no method matching length(::Missing)

As you can see the operation failed as missing is not a collection that has length (thus it cannot be expanded).

Now you can specify that missing is a scalar that, when encountered, should not be expanded.
You do it by passing scalar=Missing keyword argument:

julia> flatten(df, :col, scalar=Missing)
5×2 DataFrame
 Row │ id     col
     │ Int64  String?
─────┼────────────────
   1 │     1  a
   2 │     1  b
   3 │     2  missing
   4 │     3  e
   5 │     3  f

Conclusions

I hope you will find the new additions that were introduced in DataFrames.jl useful for your data wrangling tasks!