Author Archives: Tim Poisot

Don’t enforce R as a standard

Yesterday, I received the reviews for a paper of mine. It was rejected with an
invitation to resubmit; so far, so good. In this paper, we present a number of
new measures for working on probabilistic networks, and it’s all in the
preprint if you want to read more about that. To do the paper,
as in, to produce the figures and run the analysis, I wrote a package in
Julia. I’m proud of this package. It’s fast, defensively programmed, well
tested, already parallelized if you use several CPUs, and all that. It’s also
released under the MIT “Expat” license, so you are free to do with it as you
please.

It’s important to specify that this paper is not a software paper. It’s the
description of methods, and it so happens that we decided to release the package
alongside it. This is where things got complicated.

Reviewer 1 wants more practical details about the software (despite it not being
the purpose of the paper), and says that we have to justify that it is indeed
usable. Some of this point is fair (I will write a short introduction, and the
code used for the figures will be published alongside the paper). Reviewer 3
states that since Julia is not frequently used by ecologists, it’s dubious
that the methods are usable, or even useful, and that an R package would be better.

The three reviews were helpful and constructive, but these two comments
infuriated me. Let me start with a bit of background. I’m not anti-R. I gave
department-wide R training for graduate students and faculty. I wrote R
packages to go with papers, and released them on GitHub, and I keep on fixing
bugs and maintaining them. I use R on a daily basis. I contributed code to
other people’s R packages. One of my lab machines runs RStudio Server for
colleagues who lack computational power. But it’s a tool. It’s neither a
religion, nor a standard, nor a pre-requisite for getting your paper accepted in
an ecological journal.

Let me put things in perspective.

First, a package written in anything is better than no package at all. And as
far as I’m concerned, this should be the end of the argument. And especially
if this is not a software paper (but even then), the language a piece of
software is written in has nothing to do with the scientific merit of the paper.
Unless the choice of language means bad performance, a significant risk of
errors, or the impossibility of running the code, it can be written in Lisp or
COBOL for all I care. I cannot emphasize this point enough: any code organized
in a software library and released under a FOSS license is better than no code at all.

Second, what to write a piece of software in is, as the person actually in
charge of writing the code, a conscious decision. Knowing the type of data, the
use cases (because I’ve since used this package in a few projects), the
algorithms used, and all that, I decided that R was not the best choice I could
make. Could I write an R version of it? Yes. But I won’t, because I don’t have
that much time, it wouldn’t be as efficient, and I’m not interested in doing it.

Third, there is this thing in the open-source crowd: you don’t get to be
entitled about a piece of software that is available for free. This is
extremely important. That something is released does not imply that the
maintainers will do free work for you. Especially when the message is “I don’t
like this because it is not the way I do it, so can you start again?”. The
whole point of FOSS is that if you don’t like it, you can fork it. But you don’t
get to complain about something that is given away for free. This is just rude
(see also the first point – it’s better than nothing). I don’t expect praise for
writing code (except from my continuous integration engine, which sends me
comforting “Build passed” emails). But if I go the extra mile of writing and
organizing and optimizing code, it’s grating that people complain because it
does not match their personal preferences.

Fourth, this points back to a larger issue – R is not the only computational
tool we have. Every time there is pushback against something on the basis that
it is not R, we lose opportunities. I’ve had discussions with people
saying that they won’t even try to publish software, because it’s not in R and
they don’t feel like fighting an uphill battle where most of their energy will
go into defending their language choice. And because of this “R or bust”
mentality that some people are so vocal about, cool things are not properly
released. There is a cost.

Every time you complain about the language code is written in, Ethan White has
to buy a new computer.

I know that accessibility is important – this is what picking FOSS platforms
and languages is about, and this is why FOSS licenses for scientific code
should be enforced. But beyond that, what should matter is: does it work as
advertised? Software is not something that happens magically out of thin air.
People write it, people like me, and our choice of how to write it takes
precedence over your ideal language in which to run it. Replying to reviewers
is an exercise in picking your battles – this one I’ll fight.

Estimating SIR model parameters with ABC

I am currently working with Approximate Bayesian Computation for a project. As
always when learning a new method, I like to revisit examples I already
know about, to see how it compares. So I decided to look at classical data
about a flu epidemic in an English boarding school. It’s a good dataset to
estimate SIR parameters, because (1) influenza transmits in a flu-like way,
(2) boarding schools are a closed system from a demographic point of view, and
(3) there was no mortality during the epidemic. Also, it is likely that
a single individual was responsible for starting the epidemic
(but this is something I’ll look into).

ABC is a method to estimate the posterior
distribution of parameter values when there is no analytical expression of
the likelihood function. Just to give a very short overview: knowing data
(the number of cases every day), I want to estimate the parameters of the
SIR model that represents the epidemic. The SIR model is a compartmental
model, in which individuals start as susceptible, become infectious, then
finally recover. Formally, it is expressed as a set of ordinary differential
equations: $\dot S = -\beta I S$, $\dot I = \beta I S - \nu I$, and $\dot R =
\nu I$. There are two parameters of the model itself (the transmission rate
$\beta$ and recovery rate $\nu$), to which one should add the initial state
$S, I, R = {S_0, I_0, R_0}$.

There are a few noteworthy things about this model. First, at any time $t$,
$S_t+I_t+R_t=N$, where $N$ is the total population size – this assumes no
demography, but this is a very reasonable assumption for diseases with a
short course and low/no mortality. The other noteworthy things are not really
relevant to what I’m about to do, so I’ll skip them. The important information
is the value of $R_0$, which gives the expected number of infections from a
single infected case in an otherwise entirely susceptible population. In the
SIR model, $R_0 = N\beta/\nu$. The $R_0$ of influenza viruses is usually in
the $[1.3-1.9]$ range. Values of $R_0 > 1$ imply that epidemic progression
will likely happen.
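To make the threshold concrete, here is a quick check of $R_0 = N\beta/\nu$ in Julia, using the published estimates for this dataset that are quoted later in the post ($N = 763$, $\beta \approx 2.18\times 10^{-3}$, $\nu \approx 4.41\times 10^{-1}$):

```julia
# R0 = N * beta / nu for the SIR model as parameterized above
# (values are the published estimates for the boarding-school data)
N    = 763
beta = 2.18e-3
nu   = 4.41e-1

R0 = N * beta / nu
println(R0)   # ≈ 3.77, well above 1, so the epidemic takes off
```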

What I want to do is use ABC to estimate the distribution of $\beta$ and
$\nu$, as well as $I_0$, and thus estimate the $R_0$ of this particular
epidemic. Here is the basic approach. I will first establish the prior
distribution of my three parameters. I will assume that there is no way to
guesstimate values of $\beta$ and $\nu$, so I will sample a uniform $[0;1]$
distribution for $\nu$, and 10 to the power of uniform random numbers in
$[-4;-2]$ for $\beta$ (these are reasonable ranges in my experience). The
only thing I know about $I_0$ is that it is most likely 1, so I will draw from
a Poisson distribution with $\lambda = 1$ and add 1, which makes $I_0 = 1$
and $I_0 = 2$ equally likely.

With these random parameters, I will simulate an SIR epidemic, and measure
$\rho(I_t,\hat I_t)$, which is to say the distance between the empirical time
series and the simulated one. If this distance is lower than an arbitrary
threshold $\epsilon$, I will keep the random set of parameters as likely
– by replicating this process enough times, I should hopefully end up with
a posterior distribution of all three parameters.

Oh, and also, I will do this in Julia. Let’s go!

Writing the code

First, let’s load a bunch of packages needed to do the processing and output:

using Distances
using Distributions
using DataFrames
using Gadfly

Once this is done, we need a function sir to run the simulations. I will use
an extremely simple one with Euler integration, because using a sophisticated
numerical integration scheme is not the point.

function sir(N, I0, beta, nu, t)
   # SIR model
   # N     - total population size
   # I0    - number of infected cases at t = 0
   # beta  - transmission
   # nu    - recovery
   # t     - number of timesteps
   s = zeros(t)
   i = zeros(t)
   r = zeros(t)
   s[1] = N-I0
   i[1] = I0
   for step in 1:(t-1)
      ds = -beta*s[step]*i[step]
      di =  beta*s[step]*i[step] - nu*i[step]
      dr =  nu*i[step]
      s[step+1] = s[step] + ds
      i[step+1] = i[step] + di
      r[step+1] = r[step] + dr
   end
   return (s, i, r)
end
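For reference, with a time step of one day, the Euler loop above implements the discretized updates:

```latex
S_{t+1} = S_t - \beta S_t I_t \\
I_{t+1} = I_t + \beta S_t I_t - \nu I_t \\
R_{t+1} = R_t + \nu I_t
```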

This function will return a tuple of time series, for the S, I, and R
values. The next step is to define a bunch of arrays, parameters, etc etc.

number_samples = 10000

beta_prior = 10 .^ rand(Uniform(-4, -2), number_samples)
nu_prior   = rand(Uniform(0, 1), number_samples)
i0_prior   = rand(Poisson(1), number_samples) .+ 1

param_N = 763   # We know the total number of students
param_t = 15    # The data cover 15 days
param_d = 200   # This is my arbitrary threshold, feel free to experiment

beta_posterior = Float64[]
nu_posterior = Float64[]
i0_posterior = Int64[]
fit_posterior = Float64[]

Then, we can load the data which are stored in a file called data.csv,
available as a gist here:

observed_cases = convert(Array{Float64}, readtable("data.csv", separator=';', header=true)[:I])

And now, we need to loop through all sets of parameters, and measure the
distance. Each iteration of this loop will run a simulation of the SIR model,
then measure the Euclidean distance between the observed and simulated time
series, and finally, if this distance is lower than the threshold, consider that
the parameter set can be kept:

for run in 1:number_samples
   simulated_timeseries = sir(
         param_N, 
         i0_prior[run],
         beta_prior[run],
         nu_prior[run],
         param_t)
   distance = evaluate(Euclidean(), observed_cases, simulated_timeseries[2])
   if distance <= param_d
      push!(beta_posterior, beta_prior[run])
      push!(nu_posterior, nu_prior[run])
      push!(i0_posterior, i0_prior[run])
      push!(fit_posterior, distance)
   end
end

To make plotting faster, we will wrap up the results in a DataFrame:

posteriors = DataFrame(beta = beta_posterior, nu = nu_posterior, i0 = i0_posterior)

By the way, I ran the following figures with $10^6$ prior samples – in under 2 seconds.

The results

Because this dataset is well-known, there are estimates of the parameters
using $I_0 = 1$. They are, respectively, $\beta \approx 2.18\times 10^{-3}$,
and $\nu \approx 4.41\times 10^{-1}$. So, what does the fitting say?

julia> mean(posteriors[:beta])
   0.0023823647134406517

julia> mean(posteriors[:nu])
   0.5571488709952898

julia> mean(posteriors[:i0])
   2.8266301035953685

First, although $\beta$ estimated this way is quite close to the proposed
estimate, the value of $\nu$ differs. Specifically, I predict a faster
recovery than the original estimate (the recovery time is $1/\nu$ days,
i.e. 2.27 in the original model, and about 1.8 in my estimate). Why? Simply because
I decided not to trust the assumption that $I_0 = 1$. The original $I_0$
is based on the fact that there was a single student admitted for flu-like
symptoms on day 0, so it can be that by day 0, there were already infected
(but asymptomatic) individuals. My estimation tends to predict that there
were between two and three infected students.
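The recovery times mentioned above are just $1/\nu$; a quick check in Julia (the values are the published estimate of $\nu$ and the posterior mean from the run above):

```julia
# Mean infectious period is 1/nu (in days)
nu_published = 0.441    # published estimate for this dataset
nu_abc       = 0.5571   # posterior mean from the ABC run above

println(1 / nu_published)   # ≈ 2.27 days
println(1 / nu_abc)         # ≈ 1.8 days
```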

With this information, we can easily plot the predicted timecourse of the
epidemics:

pred = sir(763, 3, 0.00238, 0.5571, 15)
draw(
      PNG("infected.png", 12cm, 8cm),
      plot(
         layer(x=0:14, y=pred[2], Geom.point),
         layer(x=0:14, y=observed_cases, Geom.line)
         )
    )

This yields the following figure:

The data are the continuous line, and the predictions are the dots. It’s
not a perfect fit, but it reproduces the main features of the epidemics
quite well. When compared to the original parameters, notably, it predicts
the correct date at which no more cases are observed, but it slightly
over-estimates the number of cases during the peak.

We can also have a look at the distributions of $\beta$ and $\nu$.

In both figures, the vertical lines indicate the previous estimations of the
parameters. While they are well within the posterior distribution, it seems
that (i) using a different method for fitting and (ii) estimating the value
of $I_0$ gives different estimates.

Finally, what about the distribution of $R_0$? The original $R_0$ was estimated
at 3.76. Using the result of the ABC, I found

julia> mean(763 .* posteriors[:beta] ./ posteriors[:nu])
   3.292957996735042

which is not all that different from the original prediction. Note that
a consistent result is that the epidemic parameters estimated through ABC
make the epidemic less aggressive than the original ones: the recovery rate
is higher and the $R_0$ is lower, even though the transmission rate is higher.

As a brief wrap-up…

I love Julia. I haven’t even thought about possible optimizations and it’s
fast already. Also, it’s simple to write. Clearly, sliced bread was the best
thing until Julia came along.

ABC – at least as long as a simple rejection scheme is used – is not that
hard. This gives me a bunch of ideas.

I love epidemiology. For some reason, I just find it fascinating. This is the
reason why I switched from immunology to parasitology to ecology. I love it.