This article has Open Peer Review reports available.
Syndromic surveillance: STL for modeling, visualizing, and monitoring disease counts
© Hafen et al; licensee BioMed Central Ltd. 2009
Received: 21 January 2009
Accepted: 21 April 2009
Published: 21 April 2009
Public health surveillance is the monitoring of data to detect and quantify unusual health events. Monitoring pre-diagnostic data, such as emergency department (ED) patient chief complaints, enables rapid detection of disease outbreaks. There are many sources of variation in such data; statistical methods need to accurately model them as a basis for timely and accurate disease outbreak methods.
Our new methods for modeling daily chief complaint counts are based on a seasonal-trend decomposition procedure based on loess (STL) and were developed using data from the 76 EDs of the Indiana surveillance program from 2004 to 2008. Square root counts are decomposed into inter-annual, yearly-seasonal, day-of-the-week, and random-error components. Using this decomposition method, we develop a new synoptic-scale (days to weeks) outbreak detection method and carry out a simulation study to compare detection performance to four well-known methods for nine outbreak scenarios.
The components of the STL decomposition reveal insights into the variability of the Indiana ED data. Day-of-the-week components tend to peak Sunday or Monday, fall steadily to a minimum Thursday or Friday, and then rise to the peak. Yearly-seasonal components show seasonal influenza, some with bimodal peaks.
Some inter-annual components increase slightly due to increasing patient populations. A new outbreak detection method based on the decomposition modeling performs well with 90 days or more of data. Control limits were set empirically so that all methods had a specificity of 97%. STL had the largest sensitivity in all nine outbreak scenarios. The STL method also exhibited a well-behaved false positive rate when run on the data with no outbreaks injected.
The STL decomposition method for chief complaint counts leads to a rapid and accurate detection method for disease outbreaks, and requires only 90 days of historical data to be put into operation. The visualization tools that accompany the decomposition and outbreak methods provide much insight into patterns in the data, which is useful for surveillance operations.
The development of statistical methods for accurately modeling disease count time series for the early detection of bioterrorism and pandemic events is a very active area of research in syndromic surveillance [1–15]. Early detection requires accurate modeling of the data.
A challenge to modeling is systematic components of time variation whose patterns have a level of predictability – inter-annual (long-term trend), day-of-the-week, and yearly-seasonal components [2, 10]. The variation can be ascribed to causal factors: for the inter-annual component the factor can be an increase or decrease in the population of people from which patients come; for the yearly-seasonal component it is changing weather and human activities over the course of a year; and for the day-of-the-week component it is a changing propensity of patients to seek medical attention according to the day of the week. In addition to these components is a noise component: random errors not assignable to causal factors that often can be modeled as independent random variables.
The most effective approach to early outbreak detection is to account for the systematic components of variation and the noise component through a model, and then base a detection method on values of the systematic components and the statistical properties of the random errors. Methods that do not accurately model the components can suffer in performance. The model predicts the systematic recurring behavior of the systematic components so it can be discounted by the detection method, and specifies the statistical properties of the counts through the stochastic modeling of the noise component. The two dangers are lack of fit – patterns in the systematic components are missed – and overfitting – the fitted values are unnecessarily noisy. Both are harmful. Lack of fit results in increased prediction squared bias, and overfitting results in increased prediction variance. The prediction mean-squared error is the sum of squared bias and variance.
Many modeling methods require large amounts of historical data. Examples are the extended baseline methods [11–15] of the Early Aberration Reporting System (EARS) , which require at least 3 years of data. Another popular example is a seasonal autoregressive integrated moving average (ARIMA) model in  that is fitted to 8 years of data. Such methods are not useful for many surveillance systems that have recently come online, so methods have been proposed that require little historical data . They typically employ moving averages of recent data, such as C1-MILD (C1), C2-MEDIUM (C2), and C3-ULTRA (C3), which are used in EARS. But such methods do not provide optimal performance because they do not exploit the full information in the data; they smooth out some component effects and operate locally to avoid others, rather than accounting for the components.
Our research goal is accurate modeling of components of variation, but without requiring a large amount of historical data. The research was carried as part of our analysis of the daily counts of chief complaints from emergency departments (EDs) of the Indiana Public Health Emergency Surveillance System (PHESS) . The complaints are divided into eight classifications by CoCo . Data for the first EDs go back to November 2004, and new EDs have come online continually since then. There are now 76 EDs in the system.
There are many ways to proceed in modeling chief-complaint time series. One is to develop parametric models. However, our new modeling methods are based on a nonparametric method, seasonal-trend decomposition using loess (STL) , which is very flexible and can account for a much wider range of component patterns than any single parametric model. Just as importantly, it can be used with as little as 90 days of data. We have also developed a new synoptic-scale (days to weeks) outbreak detection method based on the STL modeling.
The STL decomposition was run on all Indiana EDs for respiratory and gastro-intestinal counts. Both are fundamental markers for a number of naturally occurring diseases, and research has shown that diseases from bioweapons have early characterization of influenza-like illness , which typically results in respiratory complaints.
We present results for the respiratory time series for the 30 EDs that came online the earliest; these series end in April 2008, and start at times ranging from November 2004 to September 2005. All analyses were performed using the R statistical software environment , and an R package is available as a supplemental download [see additional file 1].
where t is time in units of years and increments daily, Y t is the respiratory count on day t, T t is an inter-annual component that models long-term trend, S t is a yearly-seasonal component, D t is a day-of-the-week component, and N t is a random-error noise component.
STL components of variation arise from smoothing the data using moving weighted-least-squares polynomial fitting with a moving window bandwidth in days. The degree of the polynomial is 0 (locally constant), 1 (locally linear), or 2 (locally quadratic).
The inter-annual component, T t , arises from locally linear fitting with a bandwidth of 1000 days, a very low-frequency component. (Window bandwidths can be formally larger than the number of available data points.)
For the yearly-seasonal component, S t , a critical idea of our method is not pooling values across years with the same time of year – values of January 1, values of January 2, etc. – as is often done. STL methodology can provide pooling, but because our methods are designed to work with limited data, S t at a time t uses data in a neighborhood of t, and not across years. S t is a low-frequency component with a bandwidth of 90 days and with a blending of locally quadratic and locally constant fitting. It is possible that this local method would be better in many surveillance applications even with substantially more data if the phase and shape are sufficiently variable from one year to the next. For example, for some Indiana EDs, seasonal influenza peaks are unimodal in some years and bimodal in others. Our modeling tracks this accurately, but averaging across years could easily distort the patterns, for example, making the bimodal peaks merge into unimodal ones. Other research has also found that the "one season fits all" assumption, leading to pooling across years, is not suitable for disease surveillance data .
The blending for the S t results in nearly stationary noise, N t . Local smoothing methods tend to produce systematic components that have higher standard deviation as the time approaches the ends of the data, making the noise component have a lower standard deviation at the ends. We countered this with a new smoothing method, blending. S t results from blending locally quadratic smoothing and locally constant smoothing, a weighted average of the two for the 50 days at each end of the data. The weight for the quadratic smoother is 0.7 at the first or last observation, and increases linearly moving away from each end, becoming 1 at the 50th observation from each end. This means the weight for the constant smoother goes from 0.3 to 0.
The fitting algorithm begins with computation of a day-of-the-week component, D t , directly employing the method described in . STL can allow D t to slowly evolve though time but we found that D t is a strictly periodic component. D t results from an iterative process. First, a low-middle-frequency component is fitted using locally linear fitting with a bandwidth of 39 days. Then D t is the result of means for each day-of-the-week of the minus the low-middle-frequency component. Then the current D t is subtracted from the and the low-middle-frequency component is re-computed. This iterative process is continued to convergence. We subtract the final day-of-the-week component from the , and then use loess smoothing on the result to obtain the inter-annual component T t as described above. Finally, we subtract the day-of-the-week and inter-annual components from the and smooth the result to obtain S t as described above.
We selected the stable D t assumption and the bandwidths of T t and S t through extensive use of visualization and numerical methods of model checking in order to prevent overfitting or a lack of fit of the components of variation, and to keep two components from competing for the same variation in the data.
STL modeling provides public health officials with a clear view of yearly-seasonal variation for visual monitoring of the onset and magnitude of seasonal peaks. This is facilitated by the smoothness of S t as opposed to the raw data.
Another critical task is the detection of outbreaks that rise up on the order of several days to several weeks and are not part of the systematic patterns in disease counts; the causes are bioterrorism and uncommon highly pernicious diseases. Following the terminology of meteorology, we will refer to this as a "synoptic time-scale outbreak".
With Y t as a Poisson random variable with with mean λ t , our method declares an outbreak alarm when Y t results in a value y t for which P (Y t ≥ y t ; λ t ) is less than a threshold ρ. We evaluate this synoptic-scale outbreak method using the historical data for baselines and adding artificial outbreaks, as has been done in other research [24, 25]. We use three baselines: respiratory daily counts from three EDs with low, medium, and high daily means – 9.25, 17.33, and 23.86.
Detectability at a point in time depends on the the number of outbreak cases per day compared with the variability of the baseline counts over the outbreak period. As we have emphasized earlier, the variability of an ED count time series is larger when the level of the series is larger than when the level is smaller. Because O is fixed, detectability changes through time.
We used three values of O for each baseline, selected based on a method in . There are many ways to make the selection, but we chose this one to remain consistent with past work. The maximum of our lognormal epicurve is 0.087 and occurs at = 8.9 days. If we suppose the density is this value for 8.9 ± 0.5 days, then the expected number of cases in this peak interval is 0.087O. Detectability can be controlled by making the expected peak-interval cases equal to a magnitude factor f times the sample standard deviation of the residual counts Y t - (T t + S t + D t )2. The parameter f controls the detectability.
The three values of O in our evaluation for each baseline were chosen by this method with f = 1, 1.5, and 2. For example, for the baseline with the smallest mean daily counts, the standard deviation of the residual counts is 3.324. For f = 2, we have 2 × 3.324/0.087 = 76.414, and rounding to the nearest integer, O = 76. This case is shown in Figure 2. The curve is the lognormal density times 76, and the histogram shows 76 draws from the lognormal.
where μ t and σ t are a 7-day baseline mean and standard deviation. For the C1 method, the baseline window for μ t and σ t is days t - 7,..., t - 1 and for the C2 method, the baseline window is days t - 9,..., t - 3. The test statistic for the C3 method is S t + S t-1+ S t-2, where S t-1and S t-2are added if they do not exceed the threshold.
Since some EDs have more than 3 years of data, we also considered some of the EARS longer historical baseline methods. However, we chose not to implement any of them since it has been shown that these do not offer much of an advantage over the limited baseline methods . Instead, we compared with a Poisson generalized linear model (GLM) method , which has terms for day-of-the-week, month, linear trend, and holidays. We did not include holidays since we were not able to find the details of its implementation. We also compared with a seasonal ARIMA model  but found its results unsatisfactory, so we do not include it here. This could be due to insufficient data for pooling across years, or that such pooling through the seasonal terms of the model cannot accommodate the substantial change from one year to the next in the seasonal patterns.
Outbreak detection was tested using 1004 days of each baseline. The first 365 days acted as historical data for the GLM method to have sufficient data to estimate coefficients. For each of the 9 combinations of outbreak detectability and baseline, we simulated 625 different outbreaks; the first started on day 366, the second on day 367, and so forth to the final on day 990.
Each outbreak was a different random sample from the lognormal epicurve. One sample is shown in Figure 2. The counts of each outbreak were added to the baseline counts to form Y t . We ran each method through each day of an outbreak as would be done in real practice, and tracked whether the outbreak was detected and, if so, how long it took to detect. Time until detection is measured by the number of days after first exposure until the day of detection. Any evaluation that did not result in detection on or before day 14 of the outbreak was classified as not-detected.
We also investigated the minimum amount of data needed by the STL method to achieve good performance. Because T t is very stable, and the window for S t is 90 days, we would expect STL to do well for 90 days or more. To test this, we ran our methods at each time point from 366 to 990 using just the most recent 90 days of baseline data for each outbreak scenario.
We ran all methods at a false positive rate of 0.03, achieved empirically by choosing a cutoff, ρ, for each method from the historical data with no outbreaks injected. For the C1, C2, and C3 methods, this means choosing ρ to be the 97th percentile of the daily test statistic values. Since the STL and GLM methods update past fitted values as they progress, choosing ρ for these methods cannot be done by simply fitting the entire time series and retrospectively choosing ρ. Instead, we use the fitted values obtained at the last day of each fit over time to chose ρ.
In practice, limits for the STL and GLM methods would be set by using the theoretical false positive rate. If a model fits the data, the observed rates should be consistent with the theoretical rate. We compared the GLM and STL methods by studying their observed rates for the 30 EDs using ρ = 0.03.
The day-of-the-week components, D t , in Figure 3 tend to peak on Sunday or Monday, fall to a minimum Thursday or Friday, and then rise. For some EDs, D t is very high over the weekend but for others is considerably lower.
Figure 4 shows the yearly-seasonal component, S t . The flu season peaks for 2004–2005 and 2007–2008 tend to be larger than those for 2005–2006 and 2006–2007. Many EDs show a double peak for 2006–2007.
Some T t in Figure 5 show an increase and others are nearly constant. We attribute the increase to a growing population using an ED, and no growth to a stable population. Our conclusion is in part based on observing a similar pattern for gastro-intestinal counts.
Diagnostic plots validated the STL modeling. One example, Figure 6, shows normal quantile plots of the N t . The lines on the plots are drawn through the lower and upper quartile points. Because the data of the panels are close to the lines, the normal is a good approximation of the distribution. Plots of the autocorrelation functions of the N t showed the noise component can be modeled as independent random variables. Loess smoothing of the N t against t verified that no variation that should have been in T t or S t leaked into the N t . Separate loess plots of the N t for each day-of-the-week verified that all significant day-of-the-week effects were captured by D t .
Modeling on the Counts Scale
The three systematic components – T t , S t , and D t – are each weighted averages of a large number of observations, so the values are are comparatively stable. The statistical variability in is chiefly the result of N t , modeled as independent, identically distributed normal random variables with variance . Thus the sample mean of Y t is λ t from Equation 2.
The square root of a Poisson variable whose mean is not too small is approximately normal with a standard deviation of 0.5 . The sample standard deviations of the 30 noise components are all greater than 0.5, indicating a systematic departure, but are not far from 0.5. The upper quartile of the estimates is 0.58 and the median is 0.54; two estimates reach as high as 0.63 but this is due to a few outliers. The consistency of the values can be seen in Figure 6; the slopes of the lines on the display are nearly the same.
The near normality of N t and the closeness of the to 0.5 are consistent with the Y t having a distribution that is well approximated by the Poisson with mean λ t . However, in carrying out the detection method, we use in place of the theoretical value 0.5 to provide a measure of robustness to the Poisson model.
Outbreak detection simulation results.
At the lowest outbreak magnitude, the STL method outperforms all other methods by a difference of 10% sensitivity. At higher magnitudes, the gap closes, but this is not surprising because all reasonable methods should eventually converge to 100% detection as magnitude increases. Comparisons of sensitivity across baselines within each method and magnitude show quite a bit of variability. This may be due to unique characteristics of each ED, the random outbreak generation, or the fact that the cutoff values were set empirically within each baseline.
The results of running the STL method on the same date range, but using only 90 days of baseline data for each scenario, are shown in final column of Table 1. These numbers are very comparable to the results using all available data. This indicates that the STL detection method is effective when 90 days of data are available.
To better understand the difference in performance across outbreak methods, we investigated the model fitted values (predicted systematic values) and residuals (data minus the fits) from each method. As we have emphasized, the methods of statistical modeling upon which each detection method is based have a large impact on performance. For the EARS methods, the fitted value for day t is the seven-day baseline raw data mean μ t from Equation 4. For the GLM and the STL methods, the fitted value for day t is the evaluation of the systematic components at time t for a fit through time t, just as it would be used in practice for outbreak detection. GLM bases this on data for a number of years; STL bases it on 90 days to a good approximation unless there is substantial long-term trend, which there is not in our Indiana surveillance data.
For our data, the same smoothing parameters were successfully used across all EDs. We advise those interested in applying these methods to their syndromic data to perform their own parameter selection and model validation. For outbreak detection, our data behaved like Poisson random variables, but this may not be the case in other applications. We advise checking this assumption and if it does not hold, investigating other possibilities such as the over-dispersed Poisson. Another alternative is validating distributional properties of the remainder term and using this term for monitoring.
Although our detection performance results favor the STL method, this does not necessarily mean that it is the best method available. There are certainly other methods with which we could have compared. We chose the EARS methods since they are widely used and can be applied to very short disease time series. Our detection study was limited to one type of outbreak. Further use and study of this method will determine its merit. The visualization capabilities of STL modeling make it a useful tool for visual analytics.
The STL method should not be used for small counts close to and including zero. While they work well with the large counts of the respiratory and gastro-intestinal categories, many other categories such as botulinic have counts that are too small for the square roots to be approximately normally distributed. Future work can investigate employing STL ideas for small counts by replacing the square root normal distribution and local least-squares fitting with the Poisson distribution and local Poisson maximum likelihood fitting.
The STL decomposition methods presented here effectively model chief complaint counts for syndromic surveillance without significant lack of fit or undue noise, and lead to a synoptic time-scale disease outbreak detection method that in our testing scenarios performs better than other methods to which is it compared – EARS C1, C2, and C3 methods , as well as a Poisson GLM modeling method . The methods can be used for disease time series as short as 90 days, which is important because many surveillance systems have started only recently and have a limited number of observations. Visualization of the components of variation – inter-annual, yearly-seasonal, day-of-the-week, and random-error – provides much insight into the properties of disease time series. We recommend the methods be considered by those who manage public health surveillance systems.
This work has been funded by the Regional Visualization and Analytics Center (RVAC) program of the U.S. Department of Homeland Security.
- Burkom H: Development, adaptation, and assessment of alerting algorithms for biosurveillance. Johns Hopkins APL Technical Digest. 2003, 24 (4): 335-342.Google Scholar
- Burkom H, Murphy S, Shmueli G: Automated time series forecasting for biosurveillance. Stat in Med. 2007, 26 (22): 4202-4218. 10.1002/sim.2835.View ArticleGoogle Scholar
- Reis B, Mandl K: Time series modeling for syndromic surveillance. BMC Med Inform Decis Mak. 2003, 3: 2-10.1186/1472-6947-3-2.View ArticlePubMedPubMed CentralGoogle Scholar
- Hutwagner L, Thompson W, Seeman G, Treadwell T: The bioterrorism preparedness and response Early Aberration Reporting System (EARS). J Urban Health. 2003, 80: i89-i96.PubMedPubMed CentralGoogle Scholar
- Wallenstein S, Naus J: Scan statistics for temporal surveillance for biologic terrorism. MMWR Morb Mortal Wkly Rep. 2004, 53 Suppl: 74-78.Google Scholar
- Brillman J, Burr T, Forslund D, Joyce E, Picard R, Umland E: Modeling emergency department visit patterns for infectious disease complaints: results and application to disease surveillance. BMC Med Inform Decis Mak. 2005, 5 (4): 1-14.Google Scholar
- Dafni U, Tsiodras S, Panagiotakos D, Gkolfinopoulou K, Kouvatseas G, Tsourti Z, Saroglou G: Algorithm for statistical detection of peaks – syndromic surveillance system for the Athens 2004 olympic games. MMWR Morb Mortal Wkly Rep. 2004, 53: 86-94.Google Scholar
- Kleinman K, Abrams A, Kulldorff M, Platt R: A model-adjusted space-time scan statistic with an application to syndromic surveillance. Epidemiology and Infection. 2005, 133 (03): 409-419. 10.1017/S0950268804003528.View ArticlePubMedPubMed CentralGoogle Scholar
- Wong W, Moore A, Cooper G, Wagner M: Rule-based anomaly pattern detection for detecting disease outbreaks. Proc. 18th Nat. Conf. on Artificial Intelligence. 1999, Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press, 2002: 217-223.Google Scholar
- Burr T, Graves T, Klamann R, Michalak S, Picard R, Hengartner N: Accounting for seasonal patterns in syndromic surveillance data for outbreak detection. BMC Med Inform Decis Mak. 2006, 6: 40-10.1186/1472-6947-6-40.View ArticlePubMedPubMed CentralGoogle Scholar
- Stroup D, Wharton M, Kafadar K, Dean A: Evaluation of a method for detecting aberrations in public health surveillance data. Am J Epidemiol. 1993, 137 (3): 373-380.PubMedGoogle Scholar
- Hutwagner L, Maloney E, Bean N, Slutsker L, Martin S: Using laboratory-based surveillance data for prevention: an algorithm for detecting salmonella outbreaks. Emerg Infect Dis. 1997, 3: 395-400.View ArticlePubMedPubMed CentralGoogle Scholar
- Farrington C, Andrews N, Beale A, Catchpole M: A statistical algorithm for the early detection of outbreaks of infectious disease. J R Slat Soc Ser A Stat Soc. 1996, 159 (3): 547-563. 10.2307/2983331.View ArticleGoogle Scholar
- Stern L, Lightfoot D: Automated outbreak detection: a quantitative retrospective analysis. Epidemiol Infect. 1999, 122 (01): 103-110. 10.1017/S0950268898001939.View ArticlePubMedPubMed CentralGoogle Scholar
- Simonsen L: The impact of influenza epidemics on mortality: introducing a severity index. American Journal of Public Health. 1997, 87 (12): 1944-1950. 10.2105/AJPH.87.12.1944.View ArticlePubMedPubMed CentralGoogle Scholar
- Hutwagner L, Browne T, Seeman G, Fleischauer A: Comparing aberration detection methods with simulated data. Emerg Infect Dis. 2005, 11 (2): 314-316.View ArticlePubMedPubMed CentralGoogle Scholar
- Grannis S, Wade M, Gibson J, Overhage J: The Indiana public health emergency surveillance system: Ongoing progress, early findings, and future directions. AMIA Annu Symp Proc. 2006, 304-308.Google Scholar
- Olszewski R: Bayesian classification of triage diagnoses for the early detection of epidemics. Proceedings of the Sixteenth International Florida Artificial Intelligence Research Society Conference. 2003, Menlo Park, CA: AAAI, 412-7.Google Scholar
- Cleveland R, Cleveland W, McRae J, Terpenning I: STL: A seasonal-trend decomposition procedure based on loess (with discussion). J Offic Stat. 1990, 6: 3-73.Google Scholar
- Franz D, Jahrling P, Friedlander A, McClain D, Hoover D, Bryne W, Pavlin J, Christopher G, Eitzen E: Clinical recognition and management of patients exposed to biological warfare agents. JAMA. 1997, 278 (5): 399-411. 10.1001/jama.278.5.399.View ArticlePubMedGoogle Scholar
- R Development Core Team: R: A language and environment for statistical computing. 2005, R Foundation for Statistical Computing, Vienna, Austria, [http://www.R-project.org]Google Scholar
- Cleveland W, Devlin S: Locally weighted regression: an approach to regression analysis by local fitting. JASA. 1988, 83 (403): 596-610.View ArticleGoogle Scholar
- Johnson NL, Kotz S, Kemp A: Univariate Discrete Distributions. 1992, New York: WileyGoogle Scholar
- Jackson M, Baer A, Painter I, Duchin J: A simulation study comparing aberration detection algorithms for syndromic surveillance. BMC Med Inform Decis Mak. 2007, 7: 6-6. 10.1186/1472-6947-7-6.View ArticlePubMedPubMed CentralGoogle Scholar
- Mandl K, Reis B, Cassa C: Measuring outbreak-detection performance by using controlled feature set simulations. MMWR Morb Mortal Wkly Rep. 2004, 53: 130-6.Google Scholar
- Sartwell P: The distribution of incubation periods of infectious disease. Am J Hyg. 1950, 51 (3): 310-318.PubMedGoogle Scholar
- Meselson M, Guillemin J, Hugh-Jones M, Langmuir A, Popova I, Shelokov A, Yampolskaya O: The Sverdlovsk anthrax outbreak of 1979. Science. 1994, 266 (5188): 1202-1208. 10.1126/science.7973702.View ArticlePubMedGoogle Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1472-6947/9/21/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.