Estimating SIR model parameters with ABC

By: Tim Poisot

Re-posted from:

I am currently working with Approximate Bayesian Computation for a project. As
always when learning a new method, I like to revisit examples i already
know about, to see how it compares. So I decided to look at classical data
about a flu epidemics in an English Boarding School. It’s a good dataset to
estimate SIR parameters, because (1) Influenza transmits in a flu-like way,
(2) boarding schools are a closed system from a demographic point of view, and
(3) there was no mortality event during the epidemics. Also, it is likely that
there was a single individual responsible for the epidemics at the beginning
(but this is something I’ll look into).

distribution of parameter values when there is no analytical expression of
the likelihood function. Just to give a very short overview: knowing data
(the number of cases every day), I want to estimate the parameters of the
SIR model that represents the epidemics. The SIR model is a compartmental
model, in which individuals start as susceptible, become infectious, then
finally recover. Formally, it is expressed as a set of ordinary differential
equations, $\dot S = -\beta I S$, $\dot I = \beta I S – \nu I$, and $\dot R =
\nu I$. There are two parameters of the model itself (the transmission rate
$\beta$ and recovery rate $\nu$), to which one should add the initial state
$S, I, R = {S_0, I_0, R_0}$.

There are a few noteworthy things about this model. First, at any time $t$,
$S_t+I_t+R_t=N$, where $N$ is the total population size – this assumes no
demography, but this is a very reasonable assumption for diseases with a
short course and low/no mortality. The other noteworthy things are not really
relevant to what I’m about to do, so I’ll skip them. The important information
is the value of $R_0$, which gives the expected number of infections from a
single infected case in an otherwise entirely susceptible population. In the
SIR model, $R_0 = N\beta/\nu$. The $R_0$ of influenza viruses is usually in
the $[1.3-1.9]$ range. Values of $R_0 > 1$ imply that epidemic progression
will likely happen.

What I want to do is use ABC to estimate the distribution of $\beta$ and
$\nu$, as well as $I_0$, and thus estimate the $R_0$ of this particular
epidemics. Here is the basic approach. I will first establish the prior
distribution of my three parameters. I will assume that there are no way to
guesstimate values of $\beta$ and $\nu$, so I will sample a uniform $[0;1]$
distribution for $\nu$, and 10 to the power of random numbers in $[-4;
0]$ for $\beta$ (these are reasonable estimates in my experience). The
only thing I know about $I_0$ is that it is most likely 1, so I will use a
Poisson distribution with $\lambda = 1$ and add 1, which will make $I_0 =
1$ and $I_0 = 2$ equally likely.

With these random parameters, I will simulate a SIR epidemics, and measure
$\rho(I_t,\hat I_t)$, which is to say the distance between the empirical time
series, and the simulated one. If this distance is lower than an arbitrary
treshold $\epsilon$, I will keep the random set of parameters as likely
– by replicating this process enough times, I should hopefully end up with
a posterior distribution of all three parameters.

Oh, and also I will do this in julia. Let’s go!

Writing the code

First, let’s load a bunch of packages needed to do the processing and output:

using Distances
using Distributions
using DataFrames
using Gadfly

Once this is done, we need a function sir to run the simulations. I will use
an extremely simple one with Euler integration, because using a sophisticated
numerical integration scheme is not the point.

function sir(N, I0, beta, nu, t)
   # SIR model
   # N     - total population size
   # I0    - number of infected cases at t = 0
   # beta  - transmission
   # nu    - recovery
   # t     - number of timesteps
   s = zeros(t)
   i = zeros(t)
   r = zeros(t)
   s[1] = N-I0
   i[1] = I0
   for step in [1:(t-1)]
      ds = -beta*s[step]*i[step]
      di =  beta*s[step]*i[step] - nu*i[step]
      dr =  nu*i[step]
      s[step+1] = s[step] + ds
      i[step+1] = i[step] + di
      r[step+1] = r[step] + dr
   return (s, i, r)

This function will return a tuple of time series, for the S, I, and R
values. The next step is to define a bunch of arrays, parameters, etc etc.

number_samples = 10000

beta_prior = 10.^rand(Uniform(-4, -2), number_samples)
nu_prior   =     rand(Uniform(0, 1), number_samples)
i0_prior   =     rand(Poisson(1), number_samples).+1

param_N = 763   # We know the total number of students
param_t = 15    # The data cover 15 days
param_d = 200   # This is my arbitraty treshold, feel free  to experiment

beta_posterior = Float64[]
nu_posterior = Float64[]
i0_posterior = Int64[]
fit_posterior = Float64[]

Then, we can load the data which are stored in a file called data.csv,
available as a gist here:

observed_cases = convert(Array{Float64}, readtable("data.csv",separator=';', header=true)[:I])

And now, we need to loop through all sets of parameters, and measure the
distance. Each iteration of this loop will run a simulation of the SIR model,
then measure the Euclidean distance between the observed and simulated time
series, and finally if this distance is lower than a treshold, consider that
the parameter set can be kept:

for run in [1:number_samples]
   simulated_timeseries = sir(
   distance = evaluate(Euclidean(), observed_cases, simulated_timeseries[2])
   if distance <= param_d
      push!(beta_posterior, beta_prior[run])
      push!(nu_posterior, nu_prior[run])
      push!(i0_posterior, i0_prior[run])
      push!(fit_posterior, distance)

To make plotting faster, we will wrap up the results in a DataFrame:

posteriors = DataFrame(beta = beta_posterior, nu = nu_posterior, i0 = i0_posterior)

By the way, I ran the following figures with 10^6^ prior samples – in under 2 seconds.

The results

Because this dataset is well-known, there are estimates of the parameters
using $I_0 = 1$. They are, respectively, $\beta \approx 2.18\times 10^{-3}$,
and $\nu \approx 4.41\times 10^{-1}$. So, what does the fitting says?

julia> mean(posteriors[:beta])

julia> mean(posteriors[:nu])

julia> mean(posteriors[:i0])

First, although $\beta$ estimated this way is quite close to the proposed
estimate, the value of $\nu$ differs. Specifically, I predict an easier
remission than the original estimation (the recovery time is $1/\nu$ days,
i.e. 2.27 in the original model, and 1.7 in my estimate). Why? Simply because
I decided not to trust the assumption that $I_0 = 1$. The original $I_0$
is based on the fact that there was a single student admitted for flu-like
symptoms on day 0, so it can be that by day 0, there were already infected
(but asymptomatic) individuals. My estimation tends to predict that there
were between two and three infected students.

With this information, we can easily plot the predicted timecourse of the

pred = sir(763, 3, 0.00238, 0.5571, 15)
      PNG("infected.png", 12cm, 8cm),
         layer(x=[0:14], y=pred[2], Geom.point),
         layer(dat, x="day", y="I", Geom.line)

This yields the following figure:

The data are the continuous line, and the predictions are the dots. It’s
not a perfect fit, but it reproduces the main features of the epidemics
quite well. When compared to the original parameters, notably, it predicts
the correct date at which no more cases are observed, but it slightly
over-estimates the number of cases during the peak.

We can also have a look at the distribution of $\beta$

and $\nu$

In both figures, the vertical lines indicate the previous estimations of the
parameters. While they are well within the posterior distribution, it seems
that (i) using a different method for fitting and (ii) estimating the value
of $I_0$ gives different estimates.

Finally, what about the distribution of $R_0$? The original $R0$ was estimated
to 3.76. Using the result of the ABC, I found

julia> mean(763.*(posteriors[:beta]./posteriors[:nu]))

Which is not all that different from the original prediction. Note that
a consistent result is that the epidemic parameters estimated through ABC
make the epidemic less aggresive than the original one. The recovery rate
is higher, the $R_0$ is lower, but the transmission rate is higher.

As a brief wrap-up…

I love julia. I haven’t even thought about possible optimizations and it’s
fast already. Also, it’s simple to write. Clearly sliced bread was the best
thing until julia.

ABC – at least as long as a simple rejection scheme is used – is not that
hard. This gives me a bunch of ideas.

I love epidemiology. For some reason, I just find it fascinating. This is the
reason why I switched from immunology to parasitology to ecology. I love it.