Outbreak definition by change point analysis: a tool for public health decision?

Background Most studies of epidemic detection focus on their start and rarely on the whole signal or the end of the epidemic. In some cases, it may be necessary to retrospectively identify outbreak signals from surveillance data. Our study aims at evaluating the ability of change point analysis (CPA) methods to locate the whole disease outbreak signal. We will compare our approach with the results coming from experts’ signal inspections, considered as the gold standard method. Methods We simulated 840 time series, each of which includes an epidemic-free baseline (7 options) and a type of epidemic (4 options). We tested the ability of 4 CPA methods (Max-likelihood, Kruskall-Wallis, Kernel, Bayesian) methods and expert inspection to identify the simulated outbreaks. We evaluated the performances using metrics including delay, accuracy, bias, sensitivity, specificity and Bayesian probability of correct classification (PCC). Results A minimum of 15 h was required for experts for analyzing the 840 curves and a maximum of 25 min for a CPA algorithm. The Kernel algorithm was the most effective overall in terms of accuracy, bias and global decision (PCC = 0.904), compared to PCC of 0.848 for human expert review. Conclusions For the aim of retrospectively identifying the start and end of a disease outbreak, in the absence of human resources available to do this work, we recommend using the Kernel change point model. And in case of experts’ availability, we also suggest to supplement the Human expertise with a CPA, especially when the signal noise difference is below 0. Electronic supplementary material The online version of this article (doi:10.1186/s12911-016-0271-x) contains supplementary material, which is available to authorized users.


Introduction
The US Centers for Disease Control and Prevention (CDC) define an epidemic as "the occurrence of more cases of disease than expected in a given area or among a specific group of people over a particular period of time" [1]. Because the course of many epidemics can be altered through early public health action, considerable research has been directed towards early detection of epidemics using public health surveillance data [2][3][4]. Nearly all of this work has focused on detecting the start of the epidemic, in order to initiate a timely response. However, identifying the full time course of an epidemic is useful for several reasons. First, identifying the end of the epidemic helps public health officials know the when response activities can cease. Second, defining the end of an epidemic in disease surveillance helps determine whether new cases are part of a known or a new outbreak. The notion of end of epidemic underlies also a large literature on disease elimination or eradication [5][6][7][8] and (re-) emerging pathogens. From an economical point of view, declaring the end of an epidemic impacts trade and tourism. INSEE (France's National Institute of Statistics and Economic Studies) evaluated the consequences on tourism of the Chikungunya outbreak that occurred in La Reunion Island in 2006. The result was a decrease of 130 000 visitors compared with 2005, a general increase of 8 % of unemployment in tourism activities, growing to 25 % for the only hostel trade. The total cost of the public aid for 2006 was evaluated to 20 millions euros for this activity [9]. Finally, detecting the full extent of past outbreaks in surveillance data can improve future outbreak identification. When undetected prior outbreaks are included in surveillance baseline data, the surveillance baselines are biased upward, and future outbreaks will be more difficult to detect [10,11].
Our study aims at evaluating the ability of change point analysis (CPA) methods to identify the beginning and ending dates of a disease outbreak from weekly counts. We will compare our approach with the results coming from experts' signal inspections, considered by many as the gold standard method.

Human signal inspection
Historically, identifying a whole outbreak signal in surveillance data has relied more on human judgment, for example through review by a committee of experts [12], than on signal processing as for the prospective detection of outbreak starts. This visual inspection of the time series is still considered by many authors as the gold standard approach. But the difficulty to find available experts for this task, the evidence of variability in opinion among the experts, and the poor knowledge about the factors influencing this variability (Watkins 2006 [13]) lead us to say with Buckeridge [14] that this method can be considered as a "time-consuming procedure whose reliability is unknown". All these reasons raise the interest for a statistical approach.

Change point analysis Epidemics as state changes in surveillance series
Following several authors [15][16][17][18][19], we consider that the weekly count of an infectious disease is a time series resulting from a model combining two endemic and epidemic components. If no outbreak occurs, the endemic component is alone. A complete time series can be then seen as a succession of model regimes corresponding to the alternation of no outbreak (endemic) and outbreak (epidemic) state. The identification of a regime change associated with its time in the series is known as the "change point problem". Indeed, considering a time series {x 1 , x 2 ,…, x n } measured with an index of time τ ∈ {1, 2,…,n}, a change point is a time index where a structure change occurs in data.
As all previous authors working on use of CPA in the context of surveillance data of infectious disease counts [15][16][17][18][19], we will assume that the endemic or epidemic component of the process is driven by parameter (as mean and/or variance incidence) which were considered as piecewise constant. The epidemic change point problem is a special multiple change point problem developed in Chen 2011 [20]. Identifying an epidemic regime requires to identify the endemic states that come before and after this epidemic segment. In this context, we need to identify pre-epidemic period (endemic state), epidemic period (epidemic state) and post-epidemic period (endemic state) building a free state model.

Outbreak change point problem formulation
A change point model is a model assuming the existence of at least one change point (at the time index τ when the signal has changed) and partitioning the data into disjoint segments (with parameters similar within each segment and different without). This analysis is recursively repeated for each segment, allowing the detection of multiple changes.
Usually the state change occurs at an unknown time index τ. It is the reason why the problem is formulated using a change point detection (detection formulation: "How many changes during the time series?") and a change point estimation (locations formulation: "When do they occur within the time series?") Let {x 1 , x 2 ,…, x n } a time series of independent random variables and θ i, i = 1,…, n the corresponding structure parameters. The change-point analysis of the time series consists in the following two steps:

1) Decide between
where 1 < k < τ < n, and α, β, γ are unknown, k and τ representing the start and end dates of the outbreak. A change point is detected when H 0 is rejected. The next step of change point estimation is carried out only when the null hypothesis of no change point is rejected. The number of change points and their respective unknown positions has to be estimated.
2) Estimate k and τ from the sample {x 1 , x 2 ,…, x n }, if H 1 is true.
Depending on the underlying model used to solve the change point problem, we may distinguish parametric (based on Maximum Likelihood) non-parametric (Kernel, Kruskall-Wallis) and Bayesian change point model.

Methods
All the experiments were performed using R 3.1.1 [21].
To evaluate precisely the ability of change point analysis methods to identify a whole outbreak time curve, we need to build a gold standard by controlling perfectly when the outbreak starts in time series and when it ends. In the absence of a consensual gold standard defined in real world, only simulated data allow this evaluation.

Outbreak generation
For the realism of the generated outbreak curve, as proposed by Jackson [22], we used an Inverse Transform Sampling Method algorithm (ITSM) to simulate the signal from a real outbreak of Norovirus already published [23]. According to these outbreaks, we generated 100 curves for each outbreak magnitude (10, 30, 50, 100 cases) while conserving the original number of day (n = 12). Due to the use of a probabilistic process to generate this curve, the resulting duration of the simulated outbreak can be shorter than the original one, in particular for outbreak with small number of cases (n = 10).

Evaluation data sets
With an objective of result reproducibility, we choose to build, for each combination of outbreak magnitude and baseline level of daily incidence, a data set of 30 times series. Each time series was created by adding a randomly selected outbreak among the 100 to a randomly selected baseline among the 200. For fully controlling the beginning and ending dates of the outbreak within the time series, we systematically added the outbreak after the first 30 days of baseline and kept 30 days after the outbreak end. Finally, 840 time series (4 level of curves x 7 level of baseline x 30 replicates) in 28 data sets were produced for evaluating the algorithms.
The building process was slightly different concerning the human experts. For avoiding a process of learning, we have chosen to randomly place the epidemic period in each series, with the constraint of keeping at least 10 days of baseline before the outbreak beginning and 10 days of baseline after the outbreak ending. Taking in account the workload, we built 28 data sets of 2 time series only. However, we have randomly reordered the 56 resulting time series presented to each expert for controlling a possible ordering effect.

Expert evaluation
We enlisted 15 experts who have at least 1 year's experience in daily disease surveillance. To allow comparison with the algorithms, we gave the experts the information that one and only one outbreak was present in each time series. Each expert independently evaluated 56 time series. The time series were presented in both graphical and numerical format (see Figure S1 in Additional file 1). Experts were asked to visually identify the beginning and ending dates of the simulated outbreaks without the aid of calculators, spreadsheets, or other tools. Results from all the experts were pooled to treat the expert review as a single unique "algorithm" analyzing a total of 840 times series.

CPA models Maximum likelihood CPA model
As proposed by Chen 2011 [20], an efficient strategy to identify change-points is to select the partition of the time series sample {x 1 , x 2 ,…, x n } that yields a maximum of heterogeneity between segments. This method tries to cut the time series {x 1 , x 2 ,…, x n } by maximizing the likelihood in 3 continuous states {x 1 +2 ,…, x τ2 } and {x τ2+1 , x τ2+2 ,…, x n } described previously. If this 3 states are confirmed, and if we consider that the time series is organized into 2 endemic periods encircling a epidemic period, τ 1 and τ 2 correspond to interesting change points: the start and end dates of the outbreak. Because x τ are counting data (number of cases), we can hypothesize that x τ follows a Poisson distribution. Even if infectious disease data are often overdispersed, this distribution hypothesis may be kept, as it usually done in disease surveillance [19], and to ensure evaluation fairness between algorithms. The detailed method is presented in Additional file 1.

Kernel CPA model
This CPA is detailed in Harchaoui et al. [24].
Let {x 1 , x 2 ,…, x n } a time series of independent random variables. We can define a Kernel function K as: In a second time we can define a Kernel Fisher discriminant ratio (KFDR), which measures the heterogeneity between the successive segments S 1 , S 2 , S 3 S 1 = {x 1 , x 2 ,…, x i } with i observations,: pre-epidemic S 2 = {x τ1+1 , x τ1+2 ,…, x j } with (j-i) observations,: epidemic S 3 = {x τ2+1 , x τ2+2 ,…, x n } with (n-j) observations.: postepidemic A kernel function is required to calculate this KFDR. In literature, many functions exist (Gaussian, Laplace…) for k, but we choose for this work a simple linear kernel function: k(x,y) = xy that amounts calculating a Fisher test statistic to compare the heterogeneity between S1, S2, S3. The detailed method is presented in Additional file 1.

Kruskal-Wallis model
Tests against the epidemic alternative H 1 : "a change occurs in the baseline" with a non-parametric technique are proposed in Yao (1993) [25] and Emad-Eldin (1996) [16]. Adapted to our hypothesis, let L 1 the law of the random variable X on the first segment (pre-epidemic), L2 the law on the second segment (epidemic) and L 3 the law on the last segment (post-epidemic). The Kruskal-Wallis test is used to check the following hypotheses: H 0 : L 1 = L 2 = L 3 . H 1 : Laws L 1 , L 2 , and L 3 are not identical.
The detailed method is presented in Additional file 1.

Bayesian model
The Bayesian model used here is the one introduced in 1998 by Siddhartha Chib [26]. Chib proposes to introduce a latent variable (S t ) that takes discrete values from 1 to the total number of hidden regimes (m) in the series. Each discrete value indicates the kind of datagenerating regime at each time unit (τ). This approach allows reproducing the epidemic and non epidemic latent regimes that generate the disease surveillance time series X n = {x 1 ,x 2 ,…,x n }, with n observations. This Bayesian approach ensures some flexibility by using a limited number of dependent variables, while keeping the capacity of managing multiple change points [27].
Let x t the number of events at time t and m a hidden state (or regime) at t. The x t distribution according X t-1 = {x 1 ,x 2 ,…,x t-1 } depends on the transition parameters (transition kernel) ξ t , which values {θ 1 ,θ 2 ,…,θ m } change at unknown dates {τ 1 , τ 2 ,…,τ m-1 }. In Chib's model, the transition of hidden states is constrained to move forward by a non-ergodic Markov chain that makes the regime changes irreversible. Discrete variable S t is modeled by a Markov chain process with probability matrix P, without possibility to return back to a previous state. The detailed method is presented in Additional file 1.
Data analysis was done with the MCMCpoissonChange function provided by MCMCpack [28].

Evaluation metrics and signal noise difference
We evaluated the algorithms using several metrics.
The beginning detection delay (d 1 ), measured in days, is equal to the absolute value of the difference between the start date according the algorithm and the real start date.
The ending detection delay (d 2 ), measured in days, is equal to the absolute value of the difference between the end date according the algorithm and the real end date.
As a general measure of accuracy we used an overall detection error, measured in days, defined as the sum of all the delays considering d 1 and d 2 with the same importance: The specificity (Sp) expresses the capability to consider a day as non-epidemic while it is really not epidemic.
The sensitivity (Se) expresses the capability to consider a day as an outbreak day while it is really epidemic.
For the last metric we selected the Bayesian probability of correct classification. Our detection problem consists in deciding which is the true binary state of a day in the time series (baseline or outbreak), given the binary result of the algorithm for this day. If the 2 possible realizations for a day are noted H 0 (non epidemic day) and H 1 (epidemic day), their prior probabilities are P 0 and P 1 . The Bayesian risk associated with this binary detection problem is then: where C .. are the costs associated to each possibility and P(H . /H . ) are the conditional probabilities for each realization.
If C 00 = C 11 = 1 and C 01 = C 10 = 0, the Bayesian risk is the binary Bayesian probability of correct classification (PCC), or the probability of exact decision.
To account for the influence of the outbreak and baseline sizes in the evaluation results we have used the difference between the signal (outbreak) and the noise (baseline), and not the signal to noise ratio as usual. This was required because of the existence of null baselines in the datasets. This signal to noise difference (SND) is defined as: sum of baseline cases during the outbreak period noise ð Þ A positive SND corresponds to a higher number of cases in the outbreak than in the baseline during the outbreak period, and a negative one to the opposite.

Accuracy Evaluation Overall evaluations
Across all algorithms, the baseline and outbreak sizes affect accuracy and dispersion (Table 1). Larger outbreaks are associated with lower d 1 and d 2 , while the inverse is true for larger mean baseline counts. Delay d 1 (d 2 ) goes from 13.4 to 2.4 days (11.7 to 2.8 days) when the outbreak size grows from 10 to 100 cases, and from 0.5 to 20.9 days (0.5 to 21.1 days) when the baseline level grows from 0 to 30. The precision of the dates increases with the outbreak size (16.5 to 1.1) and decreases when the baseline level grows (16.3 to 0.4).
The Kernel algorithm provides the more precise and unbiased global results according to d 1 (σ =7.0, mean =4.9) and d 2 (σ =6.9, mean = 5.4). Figure 1 illustrates the influence of the baseline level and outbreak size on the combined d 1 and d 2 errors across the 840 time series. No point is below the first diagonal with a 4-days offset (the minimal outbreak duration) because this area corresponds to the impossible case of an outbreak beginning after its ending. The maximum likelihood and the kernel algorithms show their results mainly along this diagonal. This reveals that these algorithms try to find the shortest outbreak possible while exploring the time series. In contrast, the experts' results are mainly grouped around the true values and do not show a specific alignment along the first diagonal.
The top left quadrant in each figure corresponds to the situations where all the real outbreak is included by excess within the detected outbreak, meaning that algorithm make an error not only for d 1 but also for d 2 . It is most frequent for the Kruskal Wallis (26 % of the results) and Bayes (18 %) algorithms, and less frequent for the kernel (12,5 %) and the Max-Likelihood (16 %) algorithms, and least frequent for the experts (10 %).
The Kernel algorithm has the lower overall detection error (considering with the same importance d 1 and d 2 errors), whatever the outbreak size and the baseline level (Fig. 2).

Influence of baseline fluctuations
We focused on the CPA results with an error equal or greater than 2 days and then 4 days. Suspecting an effect of the random fluctuations of the baseline, we applied the algorithms on the only baselines and compared the results with the corresponding complete time series. The dates were the same in 34.8 % of the series with an error equal or greater than 2 days and 58.5 % of the series with an error equal or greater than 4 days.

Signal-noise influence
All algorithms show an important decrease in their capacity to identify d 1 , d 2 when the SND goes under 0 (when the outbreak size is lower than the cumulated baseline during the outbreak period) (Fig. 3). The loss of accuracy starts before this threshold for the Bayes and Kruskall-Wallis algorithms and for the experts. There is a general tendency to delay the detection of the outbreak beginning and to anticipate the detection of its ending. Additional file 1: Table S1 supplements these results, showing that kernel and max-likelihood algorithms are the less biased and dispersed.
The cumulative standard deviation (used as a proxy of the cumulative dispersion of the date detection) according to SND shows that the accuracy decreases with SND (Fig. 4), the slope being the accuracy loss rate. The curves clearly show 2 regimes with a slow increase of the dispersion when the number of cases in the outbreak is equal or higher than in the baseline during the outbreak period. It is possible to rank the algorithms and to define 3 groups: Kernel and Max-likehood, Human and     Kruskall-Wallis, and Bayes. This decrease of accuracy is less important for the Kernel algorithm than for the other ones.

Global detection performance Sensitivity and specificity
The algorithm sensitivity and specificity are highly influenced by the outbreak size and the baseline level (Additional file 1: Figure S2), but the Kernel and Maximum likelihood algorithms have the best results in term of specificity (resp. 0.957 and 0.949) and the Kruskall-Wallis algorithm in term of sensitivity (0.675).
Only the Bayes algorithm and some experts have sometime considered all the time series as the signal to detect (null specificity). This situation has been encountered with small outbreak sizes and high baseline levels. However, the CPA algorithms show global results characterized by high specificity associated with low sensitivity. In this sense these algorithms can be seen as conservative.

Probability of correct classification
The probability of correct classification (PCC) can be considered as a probability for an algorithm to provide the good decision (according the real days status: endemic or epidemic) for all days in a time series. A PCC equal to 1 correspond to perfect status identification for

Time
On average, an expert could process 56 curves per hour (range, 36 to 105). An average of 15 h is then required to manually process all the 840 curves. In contrast, the slowest algorithm handles the whole set in 24 mn 37 s (Kruskall-Wallis), with a minimum of 3 mn 18 s for the Max-Likelihood algorithm.

Discussion
Buckeridge [14] wrote that Experts can be considered as a "time-consuming procedure whose reliability is unknown" to identify epidemics. And if we agree with the fact that expert should be considered as a weakness for this purpose, the issue is to find an alternative and to evaluate its practicability.
Concerning the time-consuming aspect, a recent study [29] by Debin et al. assessed the capacity of experts to identify the starts and ends of influenza outbreaks with a web-based Delphi survey. The time required to mobilize the experts (n = 69 recruited among 288 eligible), to assess 34 times series, and to obtain a consensus was approximately 3 months (without including the time needed for recruiting the experts).
Although data-processing time is hardware configuration dependent we notice, only for comparison purpose, that expert need at least 15 h to assess 840 curves, in contrast with the maximum of 25 min and minimum of 3 mm 18 s of processing time for the CPA models. Taking in account the workload, the small number of skilled experts, and the probable volunteer fatigue as a result of solicitation increase, it is clear that considering expert review can be only an occasional solution for research purpose and is unusable in a routine context (surveillance and public health context).
Considering the reliability of the different methods (human and statistics), we tried to identify their limits. Our approach was to control the characteristics of the time series submitted to the different methods of outbreak identification by injecting controlled signals coming from real outbreaks within well-known baselines. Among the factors most influencing the evaluation results, we showed that the excess of case in the time series linked to the epidemics, represented by the SND, is the most pertinent factor and emphasizes the interaction between the outbreak size and the baseline level. Concerning accuracy, a drop out point, which is associated with an important accuracy decrease and dispersion increase and which is easy to materialize by the cumulative dispersion, can characterize each algorithm. Experts outperform the algorithms in term of PCC for low level of baseline but this ability quickly declines when SND  decreases. Kernel and Maximum-Likelihood are the most accurate and less biased algorithms when the SND decreases under 0. All other algorithms loss their capacities as soon as SND reach 10. Kernel and Maximum-Likelihood can be considered, whatever the baseline level and outbreak size, as the best algorithms for providing a correct classification of the days in the time series according to their real status. For all algorithms, we observed a disposition to delay the detection of the first outbreak day and to anticipate the detection of the last outbreak day, probably in association with the slope of the outbreak curve.
Among the differences observed between Experts and algorithms, we noticed that when an Expert doesn't find the outbreak signal he has a propensity to include the whole time series as the outbreak signal (increasing his sensitivity). Another individual variability observed in the Expert population, and increasing the result dispersion, is the way the instructions are understood and assimilated. In example, considering the instruction "End date for the outbreak is the last day always included in the epidemic signal but before a return to a normal situation" and the 0-level baseline, some Experts included in the outbreak signal the first baseline days (0 case) after the ending day, while in the same time considering outbreak cases occurring ahead of time as baseline noise and excluding them from the outbreak. As suggested by Wagner [4], this phenomenon may be explained by a possible mixing of detection and intervention objectives in the Experts' cognitive processes. The need to have the same experimental condition for Experts and algorithms drove us to give the Experts the only time series without contextual information (as the agent or the target population) and to force them to give a result, even if they did not precisely identify the outbreaks. In reality, we believe the Experts are able to increase their performance in using information about the context and simple tools during a situational diagnosis (i.e. using the nature of the agent for inferring a plausible outbreak duration or size).
We also observed some limitations of both expert and CPA when the outbreak starts during a low level of the baseline fluctuation, making impossible the identification of the true beginning date.
Our study shows that nonparametric and parametric methods are more accurate and less biased than Bayesian CPA. Kass-Hout et al found the same result [30] and presumed it was because Bayesian CPA makes the assumption of a normal distribution of the time-series data, adopting non-informative priors on the model parameters. But in reason of the nature of disease surveillance data, the normality assumption may interfere with the estimation of the posterior distribution and the results obtained by the Bayesian CPA. Kernel algorithm has the most effective overall features in term of accuracy, bias and global decision concerning the outbreak day identification.
All this work has been done on detrended signals, without seasonality or autocorrelation. However, using a 3 states model, we integrate already the possibility of trend existence, as this allows the baseline to have different levels before and after the outbreak. Taylor [31] considers that only high levels of autocorrelation can influence the CPA results. In a same way, Zou wrote that if CPA remains useful in disease surveillance it is probably because most outbreaks are not strongly correlated in a limited time period. See Kass-Hout et al [30] for a larger discussion on influence of autocorrelation and trend on CPA model results in disease surveillance.
Despite the numerous advantages of the CPA algorithms, three main problems can be raised. Firstly, they are unable to detect unique aberrations in time series, as a single peak outbreak (not evaluated here), which are still easy to identify by an expert and usually best detected by usual time aberration detection methods. It's a reason why we recommend that CPA models should be compared to threshold-based methods in future studies as for example the Moving Epidemic Method [32]. Secondly, they cannot be considered as monitoring tools (i.e a prospective detection method) but only as retrospective technics allowing the analysis of historical time series. Thirdly, because CPA algorithms are able to identify subtle changes [31], they may be confused by some particular sequences within the baseline, considering them as change points (see 4.1.2 above). If the first important change they detect in the time series is a decrease they may even consider erroneously this low level of the baseline as the signal to detect. This blind change detection is an inaccuracy that cannot be observed when the series are processed by Experts, because they know they have to identify only changes corresponding to increases of the number of cases. This erroneous CPA behavior could be avoided if it would be possible to add a knowledge defining the state to detect (i.e. number of cases during the epidemic state greater than the baseline state). This work explores the results of each algorithm but it should be interesting, first to confirm our results by evaluating algorithms with real data (probably more complex to analyse), and second, with the goal of improving the result accuracy, to combine the results of two or several algorithms, taking in account their specific skills.

Conclusions
In conclusion, for the aim of retrospectively identifying the start and end of a disease outbreak, in the absence of human resources available to do this work, we recommend using the Kernel change point model. And in case of experts' availability, we also suggest to supplement the Human expertise with this kind of technics, especially when the SND is below 0 (in practice when the signal is considered as difficult to identify by the experts).

Availability of supporting data
Data are simulated from a real outbreaks of Norovirus already published [23].