By: Tim Poisot

Re-posted from: http://timotheepoisot.fr/2014/07/25/sir-abc/

I am currently working with Approximate Bayesian Computation for a project. As

always when learning a new method, I like to revisit examples i already

know about, to see how it compares. So I decided to look at classical data

about a flu epidemics in an English Boarding School. It’s a good dataset to

estimate SIR parameters, because (1) Influenza transmits in a flu-like way,

(2) boarding schools are a closed system from a demographic point of view, and

(3) there was no mortality event during the epidemics. Also, it is likely that

there was a single individual responsible for the epidemics at the beginning

(but this is something I’ll look into).

😡

distribution of parameter values when there is no analytical expression of

the likelihood function. Just to give a very short overview: knowing data

(the number of cases every day), I want to estimate the parameters of the

SIR model that represents the epidemics. The SIR model is a compartmental

model, in which individuals start as susceptible, become infectious, then

finally recover. Formally, it is expressed as a set of ordinary differential

equations, $\dot S = -\beta I S$, $\dot I = \beta I S – \nu I$, and $\dot R =

\nu I$. There are two parameters of the model itself (the transmission rate

$\beta$ and recovery rate $\nu$), to which one should add the initial state

$S, I, R = {S_0, I_0, R_0}$.

There are a few noteworthy things about this model. First, at any time $t$,

$S_t+I_t+R_t=N$, where $N$ is the total population size – this assumes no

demography, but this is a very reasonable assumption for diseases with a

short course and low/no mortality. The other noteworthy things are not really

relevant to what I’m about to do, so I’ll skip them. The important information

is the value of $R_0$, which gives the expected number of infections from a

single infected case in an otherwise entirely susceptible population. In the

SIR model, $R_0 = N\beta/\nu$. The $R_0$ of influenza viruses is usually in

the $[1.3-1.9]$ range. Values of $R_0 > 1$ imply that epidemic progression

will likely happen.

What I want to do is use ABC to estimate the distribution of $\beta$ and

$\nu$, as well as $I_0$, and thus estimate the $R_0$ of this particular

epidemics. Here is the basic approach. I will first establish the prior

distribution of my three parameters. I will assume that there are no way to

guesstimate values of $\beta$ and $\nu$, so I will sample a uniform $[0;1]$

distribution for $\nu$, and 10 to the power of random numbers in $[-4;

0]$ for $\beta$ (these are reasonable estimates in my experience). The

only thing I know about $I_0$ is that it is most likely 1, so I will use a

Poisson distribution with $\lambda = 1$ and add 1, which will make $I_0 =

1$ and $I_0 = 2$ equally likely.

With these random parameters, I will simulate a SIR epidemics, and measure

$\rho(I_t,\hat I_t)$, which is to say the distance between the empirical time

series, and the simulated one. If this distance is lower than an arbitrary

treshold $\epsilon$, I will keep the random set of parameters as *likely*

– by replicating this process enough times, I should hopefully end up with

a posterior distribution of all three parameters.

Oh, and also I will do this in `julia`

. Let’s go!

# Writing the code

First, let’s load a bunch of packages needed to do the processing and output:

```
using Distances
using Distributions
using DataFrames
using Gadfly
```

Once this is done, we need a function `sir`

to run the simulations. I will use

an extremely simple one with Euler integration, because using a sophisticated

numerical integration scheme is *not* the point.

```
function sir(N, I0, beta, nu, t)
# SIR model
# N - total population size
# I0 - number of infected cases at t = 0
# beta - transmission
# nu - recovery
# t - number of timesteps
s = zeros(t)
i = zeros(t)
r = zeros(t)
s[1] = N-I0
i[1] = I0
for step in [1:(t-1)]
ds = -beta*s[step]*i[step]
di = beta*s[step]*i[step] - nu*i[step]
dr = nu*i[step]
s[step+1] = s[step] + ds
i[step+1] = i[step] + di
r[step+1] = r[step] + dr
end
return (s, i, r)
end
```

This function will return a tuple of time series, for the S, I, and R

values. The next step is to define a bunch of arrays, parameters, etc etc.

```
number_samples = 10000
beta_prior = 10.^rand(Uniform(-4, -2), number_samples)
nu_prior = rand(Uniform(0, 1), number_samples)
i0_prior = rand(Poisson(1), number_samples).+1
param_N = 763 # We know the total number of students
param_t = 15 # The data cover 15 days
param_d = 200 # This is my arbitraty treshold, feel free to experiment
beta_posterior = Float64[]
nu_posterior = Float64[]
i0_posterior = Int64[]
fit_posterior = Float64[]
```

Then, we can load the data which are stored in a file called `data.csv`

,

available as a gist here:

```
observed_cases = convert(Array{Float64}, readtable("data.csv",separator=';', header=true)[:I])
```

And now, we need to loop through all sets of parameters, and measure the

distance. Each iteration of this loop will run a simulation of the SIR model,

then measure the Euclidean distance between the observed and simulated time

series, and finally if this distance is lower than a treshold, consider that

the parameter set can be kept:

```
for run in [1:number_samples]
simulated_timeseries = sir(
param_N,
i0_prior[run],
beta_prior[run],
nu_prior[run],
param_t)
distance = evaluate(Euclidean(), observed_cases, simulated_timeseries[2])
if distance <= param_d
push!(beta_posterior, beta_prior[run])
push!(nu_posterior, nu_prior[run])
push!(i0_posterior, i0_prior[run])
push!(fit_posterior, distance)
end
end
```

To make plotting faster, we will wrap up the results in a `DataFrame`

:

```
posteriors = DataFrame(beta = beta_posterior, nu = nu_posterior, i0 = i0_posterior)
```

By the way, I ran the following figures with 10^6^ prior samples – in under 2 seconds.

# The results

Because this dataset is well-known, there are estimates of the parameters

using $I_0 = 1$. They are, respectively, $\beta \approx 2.18\times 10^{-3}$,

and $\nu \approx 4.41\times 10^{-1}$. **So**, what does the fitting says?

```
julia> mean(posteriors[:beta])
0.0023823647134406517
julia> mean(posteriors[:nu])
0.5571488709952898
julia> mean(posteriors[:i0])
2.8266301035953685
```

First, although $\beta$ estimated this way is quite close to the proposed

estimate, the value of $\nu$ differs. Specifically, I predict an easier

remission than the original estimation (the recovery time is $1/\nu$ days,

*i.e.* 2.27 in the original model, and 1.7 in my estimate). Why? Simply because

I decided not to trust the assumption that $I_0 = 1$. The original $I_0$

is based on the fact that there was a single student admitted for flu-like

symptoms on day 0, so it can be that by day 0, there were already infected

(but asymptomatic) individuals. My estimation tends to predict that there

were between two and three infected students.

With this information, we can easily plot the predicted timecourse of the

epidemics:

```
pred = sir(763, 3, 0.00238, 0.5571, 15)
draw(
PNG("infected.png", 12cm, 8cm),
plot(
layer(x=[0:14], y=pred[2], Geom.point),
layer(dat, x="day", y="I", Geom.line)
)
)
```

This yields the following figure:

The data are the continuous line, and the predictions are the dots. It’s

not a perfect fit, but it reproduces the main features of the epidemics

quite well. When compared to the original parameters, notably, it predicts

the correct date at which no more cases are observed, but it slightly

over-estimates the number of cases during the peak.

We can also have a look at the distribution of $\beta$

and $\nu$

In both figures, the vertical lines indicate the previous estimations of the

parameters. While they are well within the posterior distribution, it seems

that (i) using a different method for fitting and (ii) estimating the value

of $I_0$ gives different estimates.

Finally, what about the distribution of $R_0$? The original $R0$ was estimated

to 3.76. Using the result of the ABC, I found

```
julia> mean(763.*(posteriors[:beta]./posteriors[:nu]))
3.292957996735042
```

Which is not *all* that different from the original prediction. Note that

a consistent result is that the epidemic parameters estimated through ABC

make the epidemic *less* aggresive than the original one. The recovery rate

is higher, the $R_0$ is lower, but the transmission rate is higher.

# As a brief wrap-up…

I love `julia`

. I haven’t even thought about possible optimizations and it’s

fast already. Also, it’s simple to write. Clearly sliced bread was the best

thing until `julia`

.

ABC – at least as long as a simple rejection scheme is used – is not that

hard. This gives me a bunch of ideas.

I love epidemiology. For some reason, I just find it fascinating. This is the

reason why I switched from immunology to parasitology to ecology. I love it.