In 2009, the new influenza virus A/H1N1 rapidly spread worldwide [1]. In the World Health Organization guidance document [2] detailing the epidemiological parameters to quickly determine after identification of the disease were: the incubation period, i.e. time between infection and symptoms; the serial interval, i.e. time between symptoms onset in primary and secondary cases; and the initial reproduction ratio, i.e. the average number of secondary cases per primary case. In a systematic review of all articles presenting such estimates for the 2009 H1N1 influenza pandemic [3], we found high variability in the methods used to estimate the same parameters.
Numerical differences in the reported estimates were therefore due in part to the chosen method. Applying all methods on the same dataset would help to understand what variation is due to the method, and this will be encouraged if the required code is widely distributed. It is also worth noting that subtile variation arises from the actual implementation of the same methods. For example, the initial “exponential growth rate” of the epidemic curve, used in the method described by Wallinga & Lipsitch [4] has been estimated using linear regression on logged incidence [5], Poisson regression on incidence data [6] or renewal equations [7].
Authors may have provided code implementing their methods, but no effort has yet been made to provide end users with a unique framework, with standardized approach allowing easy comparisons. To allow comparisons and provide more standardized approaches, we developed an R package implementing five methods that were the most commonly used during the 2009 H1N1 influenza pandemic. These methods are “plug-in” methods, requiring only data that are commonly recorded during an outbreak (epidemic curve, serial interval), and have been applied in a variety of situations.
After briefly recalling the principle of these methods, we illustrate their use, propose some tools to critically examine results and finally discuss applicability and limitations.
Implementation
We recall and describe the implementation of methods to estimate the serial interval distribution and reproduction numbers in epidemics. We also propose tools to explore the sensitivity of estimates to required assumptions.
Defining a generation time distribution
The generation time is the time lag between infection in a primary case and a secondary case. The generation time distribution should be obtained from the time lag between all infectee/infector pairs [8]. As it cannot be observed directly, it is often substituted with the serial interval distribution that measures time between symptoms onset. In our software package, the ‘generation.time’ function is used to represent a discretized generation time distribution. Discretization is carried out on the grid [0,0.5), [0.5, 1.5), [1.5, 2.5), etc.… where the unit is a user chosen time interval (hour, day, week…). Several descriptions are supported: “empirical” requiring the full specification of the distribution, or parametric distributions taken among “gamma”, “lognormal” or “weibull”. In the latter case, the mean and standard deviation must be provided in the desired time units.
A function (‘est.GT’) is also provided to estimate the serial interval distribution from a sample of observed time intervals between symptom onsets in primary cases and secondary cases by maximum likelihood.
Estimation of initial reproduction numbers
Reproduction numbers may be estimated at different times during an epidemic. In the following, we recall methods for estimating the “initial” reproduction number, i.e. at the beginning of an outbreak, and for estimating the “time-dependent” reproduction number at any time during an outbreak, as well as the required hypotheses for the methods. Proposed extensions and options implemented in the software are also presented.
Attack rate (AR)
In the classical SIR model of disease transmission, the attack rate (AR : the percentage of the population eventually infected) is linked to the basic reproduction number [9], by where S
0 is the initial percentage of susceptible population. The required assumptions are homogeneous mixing, closed population, and no intervention during the outbreak.
Exponential growth (EG)
As summarized by Wallinga & Lipsitch [4], the exponential growth rate during the early phase of an outbreak can be linked to the initial reproduction ratio. The exponential growth rate, denoted by r, is defined by the per capita change in number of new cases per unit of time. As incidence data are integer valued, Poisson regression is indicated to estimate this parameter [6, 10], rather than linear regression of the logged incidence. The reproduction number is computed as where M is the moment generating function of the (discretized) generation time distribution. It is necessary to choose a period in the epidemic curve over which growth is exponential. We propose to use the deviance based R-squared statistic to guide this choice. No assumption is made on mixing in the population.
Maximum likelihood estimation (ML)
This model, proposed by White & Pagano [11], relies on the assumption that the number of secondary cases caused by an index case is Poisson distributed with expected value R. Given observation of (N
0,
N
1, …, N
T
) incident cases over consecutive time units, and a generation time distribution w, R is estimated by maximizing the log-likelihood where μ
t
= R ∑
i = 1
t
N
t − i
w
i
. Here again, the likelihood must be calculated on a period of exponential growth, and the deviance R-squared measure may be used to select the best period. No assumption is made on mixing in the population.
The approach assumes that the epidemic curve is analysed from the first case on. If this is not the case, the initial reproduction number will be overestimated, as secondary cases will be assigned to too few index cases: we implemented a correction as described in Additional file 1: Supplementary material S1. It is also possible to account for importation of cases during the course of the epidemic.
Sequential bayesian method (SB)
This method, although introduced as “real-time bayesian” by its authors, more exactly allows sequential estimation of the initial reproduction number. It relies on an approximation to the SIR model, whereby incidence at time t + 1, N(t + 1) is approximately Poisson distributed with mean N(t)e
(γ(R − 1))[12], where the average duration of the infectious period. The proposed algorithm, described in a Bayesian framework, starts with a non-informative prior on the distribution of the reproduction number R. The distribution is updated as new data is observed, using . In other words, the prior distribution for R used on each new day is the posterior distribution from the previous day. At each time, the mode of the posterior may be computed along with the highest probability density interval. As before, the method requires that the epidemic is in a period of exponential growth, i.e. it does not account for susceptible depletion; it implicitly uses an exponential distribution for the generation time; and assumes random mixing in the population.
Estimation of time dependent reproduction numbers (TD)
The time-dependant method, proposed by Wallinga & Teunis [13], computes reproduction numbers by averaging over all transmission networks compatible with observations. The probability p
ij
that case i with onset at time ti was infected by case j with onset at time tj is calculated as . The effective reproduction number for case j is therefore R
j
= ∑
i
p
ij
, and is averaged as over all cases with the same date of onset. The confidence interval for Rt can be obtained by simulation. Correction for real time estimation, where not yet observed secondary cases are taken into account is possible [14]. It is possible to account for importation cases during the course of the epidemic.