Our syndromic surveillance system receives data from 18 of 19 emergency departments (EDs) in King County. Each morning, EDs send data on all visits that occurred the previous day, including the date and time of the visit; the patient's age, sex, and home zip code; a free-text chief complaint; diagnosis, if available; and disposition. Chief complaints are classified into syndromes based on the presence or absence of key words, using a modified version of the chief complaint coder developed by the New York City Department of Health and Mental Hygiene[15]. For each syndrome, daily case counts are determined both by age group and aggregated across age groups. In this study, we used the aggregated counts. Because historical data were not available from all EDs when we began this study, we restricted our analysis to the nine EDs with complete reporting from 2001 through 2004.
Because we could not use all of the syndromes we monitor in this analysis, we chose a representative set by grouping syndromes into four categories based on their mean daily counts and then selecting one syndrome from each group. The final syndromes we used as baselines had mean (standard deviation) daily counts of 60 (16.0), 35 (9.9), 10 (4.0), and 2 (1.6) visits per day, corresponding to ED visits for respiratory illness, influenza-like illness, and asthma syndromes, and to pneumonia hospitalizations, respectively. For each baseline, data from 2001 through 2003 provided background counts for the algorithms, and data from 2004 were used for testing the algorithms. For each of these syndromes, we calculated the variability in daily counts by weekday and month using Poisson regression [see Additional file 1]. Notably, all four baselines had significant month effects, indicating the presence of seasonal trends. The four baselines also showed significant day-of-week effects.
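To illustrate the kind of model behind this check (a minimal Python sketch, not the authors' SAS code from Additional file 1; the column names are assumptions):

```python
import pandas as pd
import statsmodels.api as sm

def weekday_month_effects(daily):
    """Poisson regression of daily syndrome counts on day-of-week and month
    indicators; the fitted coefficients and their Wald tests indicate whether
    day-of-week and seasonal (month) effects are present.

    `daily` is assumed to have columns: count, dow (0-6), and month (1-12).
    """
    X = pd.concat([pd.get_dummies(daily["dow"], prefix="dow", drop_first=True),
                   pd.get_dummies(daily["month"], prefix="mon", drop_first=True)],
                  axis=1)
    X = sm.add_constant(X).astype(float)
    return sm.GLM(daily["count"], X, family=sm.families.Poisson()).fit()

# fit = weekday_month_effects(daily); print(fit.summary())
```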
We compared the performance of six aberration detection methods. We evaluated the three control-chart-based algorithms commonly referred to as C1, C2, and C3[13]. For C1 and C2, the test statistic on day $t$ was calculated as
$$S_t = \max\left(0, \frac{X_t - (\mu_t + k\,\sigma_t)}{\sigma_t}\right)$$
where $X_t$ is the count on day $t$, $k$ is the shift from the mean to be detected, and $\mu_t$ and $\sigma_t$ are the mean and standard deviation of the counts during the baseline period. For C1, the baseline period is $(t-7, \ldots, t-1)$; for C2 the baseline is $(t-9, \ldots, t-3)$. The test statistic for C3 is the sum $S_t + S_{t-1} + S_{t-2}$ of the statistics from the C2 algorithm.
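As a rough illustration of these statistics (a Python sketch, not the authors' SAS implementation), where the choice of $k$ and the guard against a zero baseline standard deviation are our assumptions:

```python
import numpy as np

def c1_c2_statistic(counts, t, k=1.0, lag=0):
    """C1/C2 statistic for day t (0-indexed into `counts`).

    lag=0 gives the C1 baseline (t-7,...,t-1);
    lag=2 gives the C2 baseline (t-9,...,t-3).
    k is the shift from the mean to be detected (value here is illustrative).
    """
    baseline = np.asarray(counts[t - 7 - lag : t - lag], dtype=float)
    mu, sigma = baseline.mean(), baseline.std(ddof=1)
    sigma = max(sigma, 1e-6)                     # guard against zero variance (assumption)
    return max(0.0, (counts[t] - (mu + k * sigma)) / sigma)

def c3_statistic(counts, t, k=1.0):
    """C3: sum of the C2 statistics for days t, t-1, and t-2."""
    return sum(c1_c2_statistic(counts, d, k=k, lag=2) for d in (t, t - 1, t - 2))

# Example with arbitrary counts (the last day is unusually high)
counts = np.array([60, 55, 62, 58, 61, 59, 57, 63, 60, 64, 58, 61, 59, 90], dtype=float)
t = len(counts) - 1
print(c1_c2_statistic(counts, t, lag=0),   # C1
      c1_c2_statistic(counts, t, lag=2),   # C2
      c3_statistic(counts, t))             # C3
```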
We evaluated a generalized linear model (GLM), using a three-year baseline and Poisson errors, with terms for day of the week, month, linear time trend, and holidays. The full model for the expected count on day $t$ was
$$E(X_t) = \beta_0 + \beta_1(\text{Sunday}) + \cdots + \beta_6(\text{Friday}) + \beta_7(\text{January}) + \cdots + \beta_{17}(\text{November}) + \beta_{18}(\text{Holiday}) + \beta_{19}(\text{time trend})$$
and the test statistic was the Poisson probability of observing at least $X_t$ cases given $E(X_t)$.
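A hypothetical Python/statsmodels analogue of this model (not the authors' SAS code): note that statsmodels' Poisson family uses a log link by default, so the sketch models the linear predictor on the log scale of $E(X_t)$, and the column names are assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from scipy.stats import poisson

def glm_tail_probability(df, n_baseline):
    """Poisson GLM test statistic: P(X >= observed count | E(X_t)).

    `df` holds the baseline days followed by the day(s) to evaluate, with
    columns: count, dow (0-6), month (1-12), holiday (0/1), and trend
    (days since the start of the baseline).
    """
    X = pd.concat([pd.get_dummies(df["dow"], prefix="dow", drop_first=True),
                   pd.get_dummies(df["month"], prefix="mon", drop_first=True),
                   df[["holiday", "trend"]]], axis=1)
    X = sm.add_constant(X, has_constant="add").astype(float)

    fit = sm.GLM(df["count"].iloc[:n_baseline], X.iloc[:n_baseline],
                 family=sm.families.Poisson()).fit()      # log link by default
    mu = fit.predict(X.iloc[n_baseline:])                 # expected counts E(X_t)
    return poisson.sf(df["count"].iloc[n_baseline:].values - 1, mu)  # P(X >= x_t)
```

Small tail probabilities correspond to counts well above the model's expectation; the alert threshold on this statistic is calibrated empirically, as described under "Algorithm testing" below.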
We also included two exponentially weighted moving average (EWMA) models[16], using a 28-day baseline and smoothing constants of 0.4 (EWMA4) and 0.9 (EWMA9). The smoothed daily counts were calculated as
$$Y_1 = X_1; \qquad Y_t = \omega X_t + (1 - \omega) Y_{t-1}$$
and the test statistic was calculated as
$$T_t = \frac{Y_t - \mu_t}{\sigma_t \sqrt{\omega / (2 - \omega)}}$$
where $\omega$ is the smoothing constant and the other notation is as defined above. The baseline period for $\mu_t$ and $\sigma_t$ was set to $(t-30, \ldots, t-3)$.
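A minimal Python sketch of the EWMA statistic, assuming the daily counts are available as an array; the function name and zero-based indexing are illustrative.

```python
import numpy as np

def ewma_statistic(counts, t, omega=0.4):
    """EWMA test statistic T_t for day t (0-indexed).

    Smoothing: Y_1 = X_1, then Y_t = omega*X_t + (1 - omega)*Y_{t-1}.
    Baseline for mu_t and sigma_t: days t-30,...,t-3 (28 days).
    omega = 0.4 corresponds to EWMA4, omega = 0.9 to EWMA9.
    """
    y = float(counts[0])
    for x in counts[1 : t + 1]:
        y = omega * x + (1.0 - omega) * y               # smoothed count Y_t
    baseline = np.asarray(counts[t - 30 : t - 2], dtype=float)
    mu, sigma = baseline.mean(), baseline.std(ddof=1)
    return (y - mu) / (sigma * np.sqrt(omega / (2.0 - omega)))
```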
Outbreak simulation
Rather than trying to model the full outbreak process from infection to ED visit, as has been done for some limited outbreak types[17], we modeled a wide range of outbreak signals, representing various ways that outbreaks in the community could alter ED syndrome counts. We based our simulations on the approach described by Mandl et al[18]. First, we chose five temporal distributions, using the epidemic curves of historical outbreaks, to represent several ways in which a pathogen could spread through a community (Figure 1). These were (a) the airborne release of a bioweapon[19]; (b) point-source exposure to an infectious agent[20]; (c) transmission of a pathogen spread by close contact[21]; (d) transmission of an airborne pathogen[22]; and (e) transmission resulting in a multi-modal distribution[23]. Next, we simulated a range of outbreak signal durations (lasting 1, 2, 4, 6, 8, 16, and 32 days) and a range of sizes based on the forecast errors of the six algorithms[18]. We determined the forecast errors for each algorithm for each day of 2004, and calculated the standard deviation of these errors for each algorithm. We set the size of the largest outbreak signal to roughly four times the largest standard deviation, rounded to a convenient number. The largest standard deviation was 13.2 counts, and our outbreak signals ranged from 5 to 50 cases in increments of 5.
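For the sizing step, a brief sketch of the forecast-error calculation (the variable names are assumptions, and the final rounding "to a convenient number" is a judgment the code does not encode):

```python
import numpy as np

def forecast_error_sd(observed, forecasts):
    """Standard deviation of one algorithm's one-day-ahead forecast errors
    over the test year (2004 in this study)."""
    return np.std(np.asarray(observed, dtype=float)
                  - np.asarray(forecasts, dtype=float), ddof=1)

# largest_sd = max(forecast_error_sd(observed_2004, f)
#                  for f in forecasts_by_algorithm.values())
# The maximum signal size was then taken as roughly 4 * largest_sd
# (about 4 * 13.2 here), rounded to a convenient number (50 cases).
```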
Next, we simulated daily outbreak signal counts by creating every unique combination of the temporal distributions, durations, and sizes. Because all outbreaks of one day's duration have the same temporal distribution, this gave us 310 different outbreak signals. For each signal, we converted the epidemic curve of the appropriate historical outbreak into an empirical probability distribution and divided this distribution into the appropriate number of days for the simulated outbreak. We then calculated the cumulative probability of a case occurring on or before day $d$, for each day $d$ of the outbreak signal. Finally, we assigned each case a random number between 0 and 1, drawn from a uniform distribution. The number of cases on day $d$ was the number of cases whose random number was greater than the cumulative probability for day $d-1$ and less than or equal to the cumulative probability for day $d$. This random assignment was repeated for each of the 310 outbreak signals.
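A minimal sketch of this case-allocation step, assuming the historical epidemic curve is available as a vector of daily case counts; rescaling the curve to the simulated duration by linear interpolation is our assumption, since the paper does not specify how the distribution was divided into days.

```python
import numpy as np

def simulate_signal(epi_curve, n_days, n_cases, rng=None):
    """Allocate `n_cases` simulated cases across `n_days`, following the shape
    of a historical epidemic curve given as daily case counts."""
    rng = rng or np.random.default_rng()
    # Rescale the historical curve to the simulated duration (assumption)
    probs = np.interp(np.linspace(0, 1, n_days),
                      np.linspace(0, 1, len(epi_curve)),
                      np.asarray(epi_curve, dtype=float))
    cum = np.cumsum(probs) / probs.sum()        # cumulative probability for each day
    u = rng.uniform(0.0, 1.0, size=n_cases)     # one uniform draw per case
    # A case falls on day d where cum[d-1] < u <= cum[d] (0-indexed)
    days = np.searchsorted(cum, u, side="left")
    return np.bincount(days, minlength=n_days)  # daily outbreak signal counts
```

The resulting daily signal counts are what we added to the baseline series when constructing the evaluation datasets described below.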
Algorithm testing
We created evaluation datasets by adding the simulated outbreak signals to each of the four baselines, with the outbreak counts starting on January 2nd, 2004. To avoid bias due to day-of-the-week effects and seasonality, we repeated this process starting the outbreak on every other day of 2004 for each of the four baselines. This gave us 183 datasets per outbreak per baseline, for a total of 226,920 datasets for analysis.
We applied the six algorithms to the evaluation datasets and calculated two outcome measures for each algorithm on each dataset. The first was whether the algorithm ever detected the outbreak signal. The second was the earliest day of an alert, among signals that were detected. The earliest day of alert was counted from the day on which the first simulated case was added; because of the stochastic nature of the simulations, this was not always the first day of the signal. For example, consider a simulated outbreak signal of five cases and 32 days' duration starting on January 2nd, 2004. The cases might appear on days 3, 5, 8, 17, and 30 of the signal. In this situation, we would begin counting the days until detection from day 3 (January 4th).
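A sketch of how these two outcomes could be computed for one algorithm on one evaluation dataset; restricting alerts to the period from the first simulated case through the end of the signal, and counting the first possible day of detection as day 1, are our assumptions.

```python
import numpy as np

def detection_outcomes(test_stats, threshold, signal_counts, start_index):
    """Return (detected, earliest day of alert) for one evaluation dataset.

    `test_stats`: the algorithm's daily statistics on the evaluation series;
    `threshold`: the empirically calibrated alert threshold;
    `signal_counts`: daily counts of the simulated outbreak signal;
    `start_index`: index in the series where the signal was added.
    """
    signal = np.asarray(signal_counts)
    first_case_day = start_index + int(np.argmax(signal > 0))  # first day with a case
    last_day = start_index + len(signal) - 1
    alerts = np.flatnonzero(np.asarray(test_stats) >= threshold)
    alerts = alerts[(alerts >= first_case_day) & (alerts <= last_day)]
    if alerts.size == 0:
        return False, None
    return True, int(alerts[0] - first_case_day) + 1   # day 1 = first day with a case
```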
We calculated these outcomes while running the algorithms at an alert rate of 0.01 (an average of one alert every hundred days). We set this alert rate empirically by applying each algorithm to each baseline without any added outbreak signals and determining the threshold that would yield an average of one alert per 100 days. Note that in this study we defined an alert rate rather than a false positive rate (which is 1 − specificity). Calculating a false positive rate (or, equivalently, a specificity) requires assuming that the baseline syndromes did not contain any signals from true outbreaks in the community. This is an unreasonable assumption, given the known yearly influenza epidemics and the probable presence of other, unrecognized outbreaks. By using an alert rate instead of a false positive rate, we allowed for the existence of these signals in our baseline data.
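One way to calibrate this empirically, assuming the daily test statistics on the signal-free baseline are available, is to take the (1 − alert rate) quantile of those statistics; the paper states only that the thresholds were determined empirically.

```python
import numpy as np

def empirical_threshold(baseline_stats, alert_rate=0.01):
    """Alert threshold yielding roughly one alert per 1/alert_rate days when
    the algorithm is run on baseline data with no added outbreak signals."""
    return float(np.quantile(np.asarray(baseline_stats), 1.0 - alert_rate))
```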
Comparing algorithm performance
For each of the 310 outbreak signals, in each of the four baseline syndromes, we computed the sensitivity (that is, the probability of detecting the signal given the presence of an outbreak) by averaging the detection outcome across all 183 analysis runs. We also computed the median of the earliest day of detection. We used ANOVA to test for significant differences in the algorithms' sensitivities. Separate tests were conducted by baseline and by outbreak distribution, duration, and size. In each stratum of these four grouping variables we tested for significant differences between algorithms in the probability of detection. We also tested for differences within each algorithm across the strata. For all ANOVA comparisons that were significant at the 0.05 level, we compared the performance of all pairs of algorithms by t-test.
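A sketch of the summary and stratified ANOVA, assuming the run-level results are in a data frame with the columns named below; applying the ANOVA to per-signal sensitivities (rather than some other summary) is our assumption.

```python
import pandas as pd
from scipy.stats import f_oneway

# `runs`: one row per analysis run, with columns algorithm, baseline,
# distribution, duration, size, detected (0/1), and day_detected.
def summarize(runs):
    """Sensitivity and median earliest day of detection per outbreak signal."""
    g = runs.groupby(["algorithm", "baseline", "distribution", "duration", "size"])
    return g.agg(sensitivity=("detected", "mean"),
                 median_day=("day_detected", "median")).reset_index()

def anova_by_stratum(summary, stratum_col):
    """One-way ANOVA comparing the algorithms' sensitivities within each
    stratum of `stratum_col` (baseline, distribution, duration, or size)."""
    pvalues = {}
    for level, grp in summary.groupby(stratum_col):
        samples = [g["sensitivity"].values for _, g in grp.groupby("algorithm")]
        pvalues[level] = f_oneway(*samples).pvalue
    return pvalues
```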
We also present a figure showing the performance of one algorithm on each of the 310 simulated outbreak signals, for each of the four baselines. Because of the multiple comparisons problem, we did not test for significant differences between the cells of the figure. Instead, we present the figure for qualitative comparison, so that the reader can visualize the effects of baseline and of outbreak temporal distribution, size, and duration on relative algorithm performance over the range of outbreak signals we tested (Figure 2). All simulations and analyses were performed using SAS version 8.2 (SAS Institute Inc., Cary, North Carolina).