An adaptive prediction and detection algorithm for multistream syndromic surveillance
 Amir-Homayoon Najmi^{1} and
 Steve F Magruder^{1}
DOI: 10.1186/1472-6947-5-33
© Najmi and Magruder; licensee BioMed Central Ltd. 2005
Received: 25 February 2005
Accepted: 12 October 2005
Published: 12 October 2005
Abstract
Background
Surveillance of Over-the-Counter pharmaceutical (OTC) sales as a potential early indicator of developing public health conditions, in particular in cases of interest to biosurveillance, has been suggested in the literature. This paper is a continuation of a previous study in which we formulated the problem of estimating clinical data from OTC sales in terms of optimal LMS linear and Finite Impulse Response (FIR) filters. In this paper we extend our results to predict clinical data multiple steps ahead using OTC sales as well as the clinical data itself.
Methods
The OTC data are grouped into a few categories and we predict the clinical data using a multichannel filter that encompasses all the past OTC categories as well as the past clinical data itself. The prediction is performed using FIR (Finite Impulse Response) filters and the recursive least squares method in order to adapt rapidly to nonstationary behaviour. In addition, we inject simulated events in both clinical and OTC data streams to evaluate the predictions by computing the Receiver Operating Characteristic curves of a threshold detector based on predicted outputs.
Results
We present all prediction results showing the effectiveness of the combined filtering operation. In addition, we compute and present the performance of a detector using the prediction output.
Conclusion
Multichannel adaptive FIR least squares filtering provides a viable method of predicting public health conditions, as represented by clinical data, from OTC sales, and/or the clinical data. The potential value to a biosurveillance system cannot, however, be determined without studying this approach in the presence of transient events (nonstationary events of relatively short duration and fast rise times). Our simulated events superimposed on actual OTC and clinical data allow us to provide an upper bound on that potential value under some restricted conditions. Based on our ROC curves we argue that a biosurveillance system can provide early warning of an impending clinical event using ancillary data streams (such as OTC) with established correlations with the clinical data, and a prediction method that can react to nonstationary events sufficiently fast. Whether OTC (or other data streams yet to be identified) provide the best source of predicting clinical data is still an open question. We present a framework and an example to show how to measure the effectiveness of predictions, and compute an upper bound on this performance for the Recursive Least Squares method when the following two conditions are met: (1) an event of sufficient strength exists in both data streams, without distortion, and (2) it occurs in the OTC (or other ancillary streams) earlier than in the clinical data.
Background
Surveillance of Over-the-Counter pharmaceutical (OTC) sales as a potential early indicator of developing public health conditions, in particular in cases of interest to biosurveillance, has been suggested in the literature [1]. Sales of over-the-counter pharmaceuticals (OTCs) offer several advantages as possible early indicators of public health. They are very widely used [2], and reliable and detailed electronic records of their sales exist.
Another possible advantage is the timeliness of OTC sales relative to other observable events that might occur when the public health is threatened. This is a particularly difficult aspect since it requires the identification of specific events in all the data streams before a judgment can be reached as to the correlations and the timeliness of those events.
We have, in a previous article [3], provided evidence that, when judiciously grouped, the OTC data show time-dependent correlations with clinical data, and that the present day's values of the latter can be estimated well from the present and past values of the former using a set of linear filters $h_{j}[m]$, where the subscript $j$ refers to the particular OTC product group (multiple groups are used) and the index $m$ refers to the time step. If we denote the clinical data time series on day number $n$ by $y[n]$, and the OTC time series of product group $j$ on the same day by $x_{j}[n]$, then the estimation problem discussed in our previous paper uses today's and past days' OTC data to estimate today's clinical data, in the sense that the estimated quantity is $\hat{y}[n] = \sum_{j} \sum_{m=0}^{M-1} h_{j}[m]\, x_{j}[n-m]$. The linear filters $h_{j}[m]$ are assumed to have a span of $M$ points (days).
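The multichannel estimate described above, a sum over product groups of each group's filter applied to its present and past values, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `fir_estimate`, the toy coefficients, and the choice to treat days before the start of the record as contributing nothing are all assumptions made for the example.

```python
from typing import Sequence


def fir_estimate(h: Sequence[Sequence[float]],
                 x: Sequence[Sequence[float]],
                 n: int) -> float:
    """Estimate y[n] as sum_j sum_m h[j][m] * x[j][n - m].

    h[j] holds the M filter coefficients of product group j and
    x[j] holds that group's daily series.
    """
    total = 0.0
    for h_j, x_j in zip(h, x):
        for m, coeff in enumerate(h_j):
            if n - m < 0:
                break  # assumption: no data before day 0, so no contribution
            total += coeff * x_j[n - m]
    return total


# Two hypothetical product groups with a filter span of M = 3 days.
h = [[0.5, 0.25, 0.0], [1.0, 0.0, 0.0]]
x = [[2.0, 4.0, 6.0, 8.0], [1.0, 1.0, 1.0, 1.0]]
estimate_day3 = fir_estimate(h, x, n=3)
```

With these toy numbers the first group contributes 0.5·8 + 0.25·6 and the second contributes 1·1 to the estimate for day 3.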
We present a prediction method based on an adaptive recursive least squares filter. In addition, we compare these predictions with similar predictions, which we term auto predictions, that use the same method applied to the clinical data alone without referencing any OTC channels. It is our contention that when the auto prediction results (i.e. when using the clinical data in the past to predict its future values) are as effective as (in the sense of minimum squared error), or better than, those predictions based solely on OTC streams, in all time intervals, then it is highly probable that no event of interest to biosurveillance actually exists in the clinical data. This is based on the fundamental premise of linear optimal predictors that a nonstationary and relatively short duration event superimposed on an otherwise stationary and predictable background cannot be predicted from the stationary background data alone. We argue that the best performance comparison among all method/data stream combinations, in the context of a biosurveillance system whose objective is to detect an outbreak early, is tied closely to the existence of such events. Lacking any real specific events of sufficient signal strength, we perform a study based on simulated events in order to compute an upper bound on the indicated performances. We emphasize that the system whose performance we are investigating here is a predict-and-detect system, in the sense that it uses historical clinical and other ancillary data streams in order to predict clinical data many days into the future. The detection performance is then based on a study of probabilities of true detections versus the probabilities of false alarms.
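The trade-off between true detections and false alarms for a threshold detector can be tabulated as follows. This is only a sketch under stated assumptions: `roc_points` is a hypothetical helper, and the two score lists are made-up detector outputs for days without and with an injected event, not values from this study.

```python
from typing import List, Sequence, Tuple


def roc_points(background_scores: Sequence[float],
               event_scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false-alarm probability, detection probability) pairs.

    A day is flagged when its score is at or above the threshold; every
    observed score is tried as a threshold, from highest to lowest.
    """
    thresholds = sorted(set(background_scores) | set(event_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for thr in thresholds:
        p_fa = sum(s >= thr for s in background_scores) / len(background_scores)
        p_d = sum(s >= thr for s in event_scores) / len(event_scores)
        points.append((p_fa, p_d))
    return points


# Made-up detector outputs for illustration only.
background_scores = [0.1, 0.4, 0.35, 0.8]  # days with no event
event_scores = [0.9, 0.7, 0.6, 0.3]        # days with an injected event
points = roc_points(background_scores, event_scores)
```

Plotting the second coordinate against the first gives the Receiver Operating Characteristic curve; a curve that rises quickly towards a detection probability of 1 at small false-alarm probability indicates a more useful detector.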
The meaning of an upper bound on the detection performance in this context is as follows. Given a data stream that includes an event of short duration, the detection performance of a specific prediction method is related to the quantity $\hat{y}_{t} - y_{t}$, where $\hat{y}_{t}$ is the predicted value of the data stream when an event exists and $y_{t}$ is the value of the data stream in the absence of the event. This predicted value could be based on the data stream itself, or it could be based on a combination of the data stream and several other correlated data sets. In a real-time situation one might perform detections based on the quantity $\hat{y}_{t} - \tilde{y}_{t}$, where $\tilde{y}_{t}$ is a prediction of the data stream in the absence of the signal, because the actual quantity $y_{t}$ is not available when the predictions are made at $t - \Delta t$. We contend that an upper bound on detection performance is obtained when we use the "actual background" $y_{t}$ instead of the "predicted background" $\tilde{y}_{t}$.
Methods
Data grouping and recursive least squares prediction
JHU/APL is currently collecting large quantities of daily OTC sales data. We receive sales records of 622 different products under the general category of cold remedies from a single vendor, with similar numbers from other vendors. Many of these products are used to treat very similar conditions. Product sales from some of these product groups are known to be good indicators of the corresponding clinical data. For instance, chest rub sales are highly correlated with the count of physician diagnoses of acute bronchitis or acute bronchiolitis [4].
OTC Adult Medication Product Groups
Product Group Name 

ALLERGY 
BRONCHIAL 
COLDALLERGY 
COUGH 
FLU 
POWDER 
SINUS 
THROAT 
Here, we consider the clinical data, the dependent variable, as the primary data channel (in the parlance of adaptive filter theory) whose values are to be predicted. The OTC product groups (the independent variables) are then used to predict the daily clinical data in the following manner. Today's and several past days' OTC data are combined to make a future clinical data prediction, which is then compared to the actual value of that day's clinical data when it becomes available, and the error is used to update the filter coefficients in such a way as to minimize the square of the error.
For simplicity, and to illustrate the method, we consider the estimation problem in which there is only one reference channel whose value at each time $n$ is denoted by $x[n]$ (note that the subscript $j$ denoting the particular product group is now missing, since we are using an example with only 1 product group). The latter is used to estimate the present value $y[n]$ of the primary channel (the dependent variable, i.e. the office visit data). The estimation equations, once put into the recursive form, are then easily generalized to the prediction problem. A linear estimate of the primary channel in terms of a single reference channel is given by $\hat{y}[n] = \sum_{m=0}^{M-1} h[m]\, x[n-m]$, where we have assumed a filter of length $M$. The last equation can be written as a vector dot product $\hat{y}[n] = \mathbf{h}^{T} \mathbf{x}[n]$, where $\mathbf{h} = [h[0], h[1], \ldots, h[M-1]]^{T}$, $\mathbf{x}[n] = [x[n], x[n-1], \ldots, x[n-(M-1)]]^{T}$, and the superscript $T$ denotes the transposition operation (the transpose of a row vector is a column vector). A linear predictor of the primary channel (the dependent variable) at $k$ steps ahead is then given by $\hat{y}[n+k] = \mathbf{h}^{T} \mathbf{x}[n]$.
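The predict, compare, and update cycle described above can be sketched for a single reference channel. The code below uses the standard textbook exponentially weighted recursive least squares update; it is not the minimum multiple-look error feedback variant of Additional file 1. The forgetting factor, the initialisation constant `delta`, the zero padding before the start of the record, and the synthetic data are all assumptions chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple


def regressor(x: Sequence[float], n: int, num_taps: int) -> List[float]:
    """Return [x[n], x[n-1], ..., x[n-(M-1)]], zero-padded before day 0."""
    return [x[n - m] if n - m >= 0 else 0.0 for m in range(num_taps)]


def rls_update(h: List[float], P: List[List[float]], u: Sequence[float],
               target: float, forgetting: float) -> float:
    """One RLS step; updates h and P in place, returns the a priori error."""
    size = len(h)
    Pu = [sum(P[i][j] * u[j] for j in range(size)) for i in range(size)]
    denom = forgetting + sum(u[i] * Pu[i] for i in range(size))
    gain = [value / denom for value in Pu]
    error = target - sum(h[i] * u[i] for i in range(size))
    for i in range(size):
        h[i] += gain[i] * error
    for i in range(size):
        for j in range(size):
            # P is symmetric, so the row vector u^T P equals Pu
            P[i][j] = (P[i][j] - gain[i] * Pu[j]) / forgetting
    return error


def rls_predict(x: Sequence[float], y: Sequence[float], num_taps: int,
                horizon: int, forgetting: float = 0.98,
                delta: float = 100.0) -> Tuple[List[float], List[float]]:
    """Return the final coefficients and, for each day n, a prediction of y[n + horizon]."""
    h = [0.0] * num_taps
    P = [[delta if i == j else 0.0 for j in range(num_taps)]
         for i in range(num_taps)]
    predictions = []
    for n in range(len(y)):
        if n >= horizon:
            # y[n] is now available; it was predicted from the data at n - horizon
            rls_update(h, P, regressor(x, n - horizon, num_taps), y[n], forgetting)
        u = regressor(x, n, num_taps)
        predictions.append(sum(h[i] * u[i] for i in range(num_taps)))
    return h, predictions


# Synthetic check: y two days ahead is an exact FIR function of x.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
true_h = [0.5, 0.25, 0.0]
horizon = 2
y = [sum(c * v for c, v in zip(true_h, regressor(x, n - horizon, 3)))
     if n >= horizon else 0.0 for n in range(300)]
h, predictions = rls_predict(x, y, num_taps=3, horizon=horizon)
```

On this noise-free synthetic series the recovered coefficients should approach `true_h`. A forgetting factor below 1 discounts older errors, which is what lets the filter follow nonstationary behaviour at the cost of noisier coefficients.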
Clearly we could perform a similar prediction process when we use the clinical data by itself instead of the OTC data streams, as well as simply including the clinical data as an "additional" reference data stream. Let $P$ denote the number of days ahead to predict, $M$ denote the number of linear filter coefficients, and $N$ denote the number of OTC data channels. Then the filter vector $\mathbf{h}$ will have $M \times N$ elements when we predict the clinical data using the OTC channels, $M$ elements when we use the clinical data to predict itself, and $M \times (N + 1)$ elements when we combine the OTC channels and the clinical data to predict the clinical data. The covariance/correlation matrix of the reference channels will then be a square matrix of the appropriate dimension in each case. The filter application and updates are recursive. Denoting the clinical data (the dependent variable) on day $n$ by $y[n]$ and the reference data (the independent variables) by $x_{j}$, the Recursive Least Squares $P$-days-ahead prediction equation is [7] $\hat{y}[n+P] = \sum_{j=1}^{L} \sum_{m=0}^{M-1} h_{j}[m]\, x_{j}[n-m]$, where $L$ denotes the number of reference data streams: it is equal to $N$ when only OTC data are used, 1 if the clinical data is used to predict itself, or $N + 1$ if both clinical and OTC channels are used to predict the clinical data. This step is repeated as many times as required in order to obtain the predicted values $\hat{y}[n+P]$. The recursion equations and the new method of minimum multiple look error feedback are described in Additional file 1. We should point out that when we use the clinical data alone (i.e. for self prediction) then the prediction equation is of the form $\hat{y}[n+P] = \sum_{m=0}^{M-1} h[m]\, y[n-m]$, where the subscript $j$ has been left out since only one filter is used.
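One way to see how the size of the filter vector follows from the choice of reference streams is to build the stacked regressor explicitly: with L streams and M coefficients per stream, it has M × L entries. The sketch below is illustrative only; `stacked_regressor` is a hypothetical helper, the series are placeholders, and using N = 8 channels (one per product group in the table above) with M = 5 is an assumption.

```python
from typing import List, Sequence


def stacked_regressor(streams: Sequence[Sequence[float]], n: int,
                      num_taps: int) -> List[float]:
    """Concatenate [s[n], s[n-1], ..., s[n-(M-1)]] over all reference streams s."""
    vector: List[float] = []
    for s in streams:
        vector.extend(s[n - m] if n - m >= 0 else 0.0 for m in range(num_taps))
    return vector


M, N = 5, 8  # assumed: 5 filter coefficients, 8 OTC channels
otc = [[float(j + t) for t in range(30)] for j in range(N)]  # placeholder series
clinical = [float(t) for t in range(30)]                     # placeholder series

len_otc_only = len(stacked_regressor(otc, 29, M))             # L = N
len_auto = len(stacked_regressor([clinical], 29, M))          # L = 1
len_combined = len(stacked_regressor(otc + [clinical], 29, M))  # L = N + 1
```

The filter vector has the same length as the regressor in each case, and the correlation matrix maintained by the recursion is square with that dimension.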
Simulation parameters are as follows. The signal maximum strength takes on values 10%, 100% and 200% (percentage increases refer to the background counts). We have chosen 2 signal lag times of 5 and 10 days (lag times refer to the lag between the application of the maximum signal strength to the OTC channels and the office visit count channel). The predictor uses a filter length of 5 days and we try 2 sets of predictions: 5 days ahead and 10 days ahead.
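For bookkeeping, the parameter values listed above can be enumerated as a grid of simulation cases. This is a sketch only: the text does not state that every combination was run, so treating the parameters as a full Cartesian product is an assumption, as are the variable names.

```python
from itertools import product

SIGNAL_STRENGTHS = (0.10, 1.00, 2.00)  # peak increase relative to background counts
LAG_DAYS = (5, 10)                     # lag of the clinical event behind the OTC event
HORIZON_DAYS = (5, 10)                 # days ahead to predict
FILTER_LENGTH = 5                      # filter length in days

# Assumption: every combination of strength, lag, and horizon is one case.
cases = [
    {"strength": s, "lag": lag, "horizon": p, "filter_length": FILTER_LENGTH}
    for s, lag, p in product(SIGNAL_STRENGTHS, LAG_DAYS, HORIZON_DAYS)
]
```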
Results and discussion
Our interpretation is that this particular office visit data set has sufficiently strong autocorrelations at long lags to allow for a better prediction than the predictions made using the cross correlations. We emphasize that one cannot draw any conclusions as to the best combination of method/data for prediction from these results when in fact no identifiable and significant events of interest exist in the present data set. For instance, one cannot state that OTC data can be safely ignored in the prediction problem in favour of using the clinical data itself. In order to illustrate this point and to place an upper bound on detection performance of a biosurveillance system that relies on predicting the clinical data from OTC and/or clinical data, we have performed an analysis based on simulated events superimposed on the present data sets.
Our choice of the synthetic signal requires further discussion. Since we are interested in placing upper bounds on the performance of a multistream syndromic surveillance system that uses a prediction method to detect an outbreak, we decided to concentrate on a type of epidemic curve that has a reasonably fast rise time and a slower fall-off. We chose the 1-week rise time and 2-week fall-off because they are reasonable numbers in the context of early detection of most biological attack scenarios. It so happens that these numbers appear to fit the observations by Sartwell [8], and so we used a lognormal shape. It turns out that the results are quite insensitive to the analytic form of the signal; for instance, we could have used a "triangle" signal with the same rise-time characteristics and reached similar conclusions.
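The two signal shapes mentioned above can be sketched as unit-peak daily profiles. This is an illustration under assumptions, not the authors' signal: the lognormal width `sigma = 0.5` is a guess, the mode of the lognormal is placed at day 7 to represent the 1-week rise, and the function names are hypothetical.

```python
import math
from typing import List


def lognormal_shape(num_days: int, peak_day: float = 7.0,
                    sigma: float = 0.5) -> List[float]:
    """Lognormal density sampled daily and scaled to a peak of 1 at peak_day."""
    mu = math.log(peak_day) + sigma ** 2  # the mode of a lognormal is exp(mu - sigma^2)

    def pdf(t: float) -> float:
        if t <= 0.0:
            return 0.0
        z = (math.log(t) - mu) / sigma
        return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

    peak = pdf(peak_day)
    return [pdf(float(t)) / peak for t in range(num_days)]


def triangle_shape(num_days: int, rise_days: int = 7,
                   fall_days: int = 14) -> List[float]:
    """Linear rise to 1 over rise_days, then linear fall to 0 over fall_days."""
    shape = []
    for t in range(num_days):
        if t <= rise_days:
            shape.append(t / rise_days)
        elif t < rise_days + fall_days:
            shape.append(1.0 - (t - rise_days) / fall_days)
        else:
            shape.append(0.0)
    return shape


lognormal = lognormal_shape(40)
triangle = triangle_shape(40)
```

Both profiles start at zero, reach their maximum of 1 on day 7, and decay more slowly than they rise, which is the property the argument above relies on.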
Finally, we should discuss our results in view of the fact that we applied the same multiplicative signal in all data streams without distortion. A complete simulation study of the performance of a multistream syndromic surveillance system that uses a prediction method to detect an outbreak would include all possible signal distortions (including amplitude reduction, but also changes in the shape of the signal), and all reasonable time delays; this is a huge task and well beyond the scope of this publication. What we have attempted here is to obtain an upper bound on this performance by varying the time delay and the maximum amplitude of the signal while keeping the signal undistorted. Any distortion of the signal would clearly degrade the ROC curves. In the absence of a general theory of infectious disease evolution, and given the uncertainties associated with the impact of an infectious disease outbreak upon all the data streams in a multistream syndromic surveillance system, we have found performance upper bounds for the limited number of cases we have studied, in conjunction with the prediction algorithm presented here.
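Applying the same undistorted multiplicative signal to two streams with a time delay can be sketched as below. The exact form used here, background × (1 + strength × shape), is an assumption consistent with signal strengths being quoted as percentage increases over background; `inject_event`, the flat backgrounds, and the short placeholder shape are hypothetical.

```python
from typing import List, Sequence


def inject_event(background: Sequence[float], shape: Sequence[float],
                 strength: float, start_day: int) -> List[float]:
    """Scale background counts by 1 + strength * shape, beginning at start_day."""
    out = list(background)
    for offset, s in enumerate(shape):
        day = start_day + offset
        if 0 <= day < len(out):
            out[day] *= 1.0 + strength * s
    return out


shape = [0.0, 0.5, 1.0, 0.5, 0.0]  # placeholder unit-peak profile
otc_background = [100.0] * 30       # placeholder daily sales
clinical_background = [50.0] * 30   # placeholder daily visit counts
lag = 5                             # clinical event trails the OTC event by 5 days

otc_with_event = inject_event(otc_background, shape, strength=1.0, start_day=10)
clinical_with_event = inject_event(clinical_background, shape, strength=1.0,
                                   start_day=10 + lag)
```

With a strength of 1.0 (a 100% increase) the OTC stream doubles at its peak on day 12, and the clinical stream doubles five days later on day 17.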
Conclusion
Based on our simulation results we can state the following broad conclusions regarding a multistream syndromic surveillance system that operates by predicting the clinical data several days in advance and issuing early warnings if the predicted values exceed a given threshold. This predict-and-detect system must include ancillary data streams (such as OTC) with established correlations with the clinical data, and a prediction method that can react to nonstationary events sufficiently fast. Any predictions of the clinical data using only the clinical data, i.e. relying on self-correlations of the clinical data rather than cross-correlations with other data streams such as OTC data, can be an effective estimate of the background conditions. Whether OTC (or other data streams yet to be identified) can provide the best source of predicting clinical data is still an open question. The system must also include a prediction algorithm that can react sufficiently fast to nonstationary changes. The Recursive Least Squares Minimum Distance Error algorithm presented here seems to satisfy this condition. Finally, we have no way of knowing the likelihood that events of interest will always be present in both the clinical data and the ancillary streams, without significant distortion, and with reasonable time lags. But if any event satisfies these conditions, we have provided the framework for a system that has an excellent chance of detecting it in advance.
Abbreviations
 OTC: over the counter (medications).
Declarations
Acknowledgements
This research is sponsored by the Defense Advanced Research Projects Agency and managed under Naval Sea Systems Command (NAVSEA) contract N00024-98-D-8124. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, NAVSEA, or the United States Government.
References
 Goldenberg A: PNAS. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. 2002, 99 (8): 5237-5240.
 Self-care in the New Millennium, report by Roper Starch Worldwide, Inc., prepared for the Consumer Healthcare Products Association, Roper Starch. 2001
 Najmi AH, Magruder SF: BMC Medical Informatics and Decision Making. Estimation of Hospital Emergency Room Data Using OTC Pharmaceutical Sales and Least Mean Square Filters. 2004, 4: 5
 Magruder SF: Johns Hopkins University Applied Physics Laboratory Technical Digest. Evaluation of OTC pharmaceutical sales as a possible early warning indicator of public health. 2003, 24:
 Magruder SF, Happel-Lewis S, Najmi AH, Florio AE: Special Issue on the Proceedings of the National Syndromic Surveillance Conference. Progress in Understanding and Using Over-the-Counter Pharmaceuticals for Syndromic Surveillance of Public Health, Morbidity and Mortality Weekly Report. 2004, 53 (Suppl): S117-122.
 See for example Diagnostic Coding Essentials. 2001, Ingenix Publishing Group, Salt Lake City, Utah
 Mathematical Methods and Algorithms for Signal Processing. 2000, Moon and Stirling, Prentice Hall
 The Incubation Period of Poliomyelitis. 1952, Philip Sartwell, American Journal of Public Health, 42: 1403-1408.
 Detection of Signals in Noise. 1995, McDonough and Whalen, Academic Press
Prepublication history
 The prepublication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/5/33/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.