### Models

Long-term trends can occur in surveillance data for multiple reasons. Changes may occur in local resources (e.g., more or specialty EDs), in access and reimbursement practices (facilities change the insurance plans with which they are associated, and major shifts in insurers drive patients to other facilities), in the underlying population (shifts in population size or age), and in the local economy. Moving averages were not used because, although they generate visually pleasing curves, they smooth over day-of-week and seasonal effects that are important for developing baselines.

Concerning model quality, one useful test is whether the forecast error variance in the testing data is approximately the same as that in the training data. When the forecast errors in the test data are divided by their training-data standard deviations, the resulting scaled forecast error variances range from 0.87 to 1.07 across the seven CC categories (ideally, these ratios should be near 1). Further, the fraction of scaled forecast errors exceeding 1.96 ranged from 0.0 to 0.033 (when the model holds and the residuals are Gaussian, the expected one-sided fraction exceeding 1.96σ is 2.5%). Thus, departures from stationarity in the time series are mild enough that future complaints can be reasonably well predicted using a single baseline model for each category.
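The two diagnostics above can be sketched in a few lines of Python; the function name and the simulated Gaussian errors below are illustrative, not part of the study's software.

```python
import numpy as np

def scaled_error_checks(train_errors, test_errors):
    """Scale the test-period forecast errors by the training-period
    standard deviation, then compute the two diagnostics described in
    the text: the scaled-error variance (ideally near 1) and the
    fraction of scaled errors exceeding 1.96 (near 0.025 when the
    model holds and residuals are Gaussian)."""
    sigma_train = np.std(train_errors, ddof=1)
    scaled = np.asarray(test_errors) / sigma_train
    return np.var(scaled, ddof=1), np.mean(scaled > 1.96)

# Illustration with simulated Gaussian forecast errors:
rng = np.random.default_rng(0)
variance_ratio, tail_fraction = scaled_error_checks(
    rng.normal(0, 5, 2000), rng.normal(0, 5, 500))
```

For a well-specified baseline model, `variance_ratio` should be close to 1 and `tail_fraction` close to 0.025, as observed for the seven CC categories.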

When monitoring complaint levels over multi-year time frames, it is necessary to periodically update baseline model coefficients in order to minimize the extrapolation in forecasting. One approach to choosing an update frequency is to do a planned update every year, but also monitor residuals for patterns, including shifting variance, that have not been observed previously to check whether additional updates are needed.

### Hierarchical modeling to capture season-to-season differences

The first-order model is useful for routine monitoring. It has the obvious shortcoming, however, of describing each season in a one-size-fits-all fashion. As noted above in the evaluation of the model's goodness of fit, forecast errors reflect modelling imperfections as well as random variability, somewhat limiting the sensitivity of surveillance for smaller anomalies. Improving on this situation requires more refined baselining.

Hierarchical methods [23] can overcome the one-size-fits-all shortcoming or, at a minimum, provide information that is valuable in assessing the quality of one-size-fits-all modelling assumptions. In the hierarchical approach, each season is allowed its own time of peak activity, its own seasonal duration, and its own peak magnitude. For practical purposes, the hierarchical model shares the global characteristics of the first-order cyclical regression model. The seasonal component is modelled with a scalable Gaussian function, in contrast with the fixed-width sine and cosine harmonics used previously, and the underlying baseline changes linearly within a season, as opposed to behaving linearly over a longer time period.
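The contrast between the two seasonal shapes can be sketched as follows; the baseline level, amplitudes, peak day, and duration below are illustrative numbers, not fitted estimates.

```python
import numpy as np

days = np.arange(365)

# First-order cyclical regression: one fixed-width annual sine/cosine
# harmonic, so every season has the same shape and width.
harmonic = 10 + 4 * np.cos(2 * np.pi * (days - 22) / 365)

# Hierarchical alternative: a Gaussian-shaped peak whose center (peak
# day), width (season duration), and height may each vary season to
# season. All three parameters here are illustrative.
peak_day, duration_sd, magnitude = 22, 20, 8
gaussian = 10 + magnitude * np.exp(
    -0.5 * ((days - peak_day) / duration_sd) ** 2)
```

Both curves peak on day 22 here, but only the Gaussian form lets the peak day, width, and height be re-estimated for each season in the hierarchy.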

Applying the hierarchical model to respiratory CC data illustrates the season-specific nature of chief complaints. On average, our respiratory complaints peak on January 22, with a season-to-season standard deviation in the day of the peak of 12 days. The durations of individual seasons, defined in terms of the standard deviations of the Gaussian-shaped peaks, vary by a factor of two over the monitoring period. And there is no apparent relation between the timing of the peak and the magnitude of the flu season.

Use of hierarchical models for real-time syndromic monitoring could be considered, but at significant computational cost. To capture the peak time and magnitude of an ongoing season, the model must be updated frequently (e.g., weekly), involving lengthy runs of Markov chain Monte Carlo software. Because the first-order cyclical regression model fits the data well enough to detect anomalies of interest, we have used it for routine monitoring. A similar first-order cyclical regression model is used successfully by the Centers for Disease Control to monitor pneumonia- and influenza-related mortality data [24].

### Related efforts

An influenza surveillance approach basing alerts on comparisons to historical data was described by Irvine [25]. Daily counts were compared to historical averages and standard deviations. Their data demonstrated a peak in CCs during influenza season.

Lazarus et al. [26] use a generalized linear mixed model based on four years of data from ambulatory health encounters. They find that indicators for day-of-week, month, and holiday effects, as well as a secular trend term, contribute significantly to their model fit. There may be ED data from other hospitals in which month-to-month effects exist without being part of a longer seasonal trend, but we do not see them in our data. Logistic regression [26] is useful for scaling over census tracts of different population sizes and, when complaint counts behave proportionally to the underlying census populations, is also useful for modelling overall complaint levels.

Reis and Mandl [27] used CCs for their time series models (autoregressive integrated moving average, ARIMA, models) for total and respiratory visits. After fitting a day-of-week effect and a seasonal effect, positive autocorrelation remained in their forecast errors, which they modelled with an ARIMA component. In our CC data, there is negligible autocorrelation in the errors after fitting our model, except that due to variation in when the seasonal peak occurs. For example, if a peak occurs early, we observe a run of positive errors, which produces positive autocorrelation of the type reported. For our CC data, the best-fitting ARIMA-type model for the residuals (after fitting the trend, seasonality, and day-of-week effects) was the EWMA, which is equivalent to a moving average fit to the first differences, i.e., ARIMA(0,1,1). Reis and Mandl [27] note that the ARIMA model adjusts to multi-day outbreaks and so reduces the apparent error on days 2, 3, ... of such an outbreak. They therefore suggest using both the original errors (containing serial correlation) and the ARIMA-model-adjusted errors in two monitoring schemes.
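The EWMA view of ARIMA(0,1,1) forecasting can be sketched directly: each one-step-ahead forecast is a weighted blend of the latest observation and the previous forecast. The function name, smoothing weight, and initialization below are illustrative choices.

```python
import numpy as np

def ewma_forecast(residuals, lam=0.2):
    """One-step-ahead EWMA forecasts of a residual series, the forecast
    form implied by an ARIMA(0,1,1) model: each new forecast blends the
    latest observation with the previous forecast. The smoothing weight
    lam and the choice to initialize at the first observation are
    illustrative, not fitted."""
    forecasts = [residuals[0]]  # initialize at the first observation
    for x in residuals[:-1]:
        forecasts.append(lam * x + (1 - lam) * forecasts[-1])
    return np.array(forecasts)

f = ewma_forecast([1.0, 2.0, 3.0], lam=0.5)
```

Because each forecast leans on the most recent observations, the EWMA tracks a rising series closely, which is exactly why it adapts to (and partly absorbs) a multi-day outbreak.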

The EWMA also adjusts to multi-day outbreaks and therefore suffers from signal loss if the outbreak persists for multiple days (the results in Table 2 illustrate this effect). The Reis and Mandl [27] suggestion to monitor two residual series is therefore relevant whenever we use the EWMA, or any approach (including the hierarchical model) that modifies the current forecast using the very recent past in addition to the trend, seasonality, and day-of-week effects (leading to two or more forecast methods).

Sequential tests were not used in [27] for their 7-day simulated outbreaks. Each simulated outbreak added simulated additional counts to the daily CC data. Forecasts were made (on the basis of a model that used the overall mean, the day-of-week means, and the trimmed day-of-year means) and were modified using the ARIMA modelling of the residuals. If any single-day forecast error exceeded a threshold, the simulated outbreak was said to be detected. Sequential tests are well suited to multi-day outbreaks, so the performance (the false negative rate at a fixed false positive rate) of Page's statistic, of a moving window such as in [28], or of the scan statistic such as in [29] would be better than that of one-at-a-time tests in the case where all simulated outbreaks lasted 7 days. Reis et al. [28], for example, applied several sliding detection windows, each of at most 7 days, to daily ED visit counts to which simulated outbreaks (in the form of additional ED visits) lasting 3, 7, and 14 days had been added in a simulation study. If, on the other hand, each outbreak lasted only one day, then monitoring single-day errors would be optimal. In summary, we concur with [30] regarding the robustness and simplicity of Page's test. Alternatively, there are occasions when using a modest number of specific tests is effective, as was done in [29].
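Page's test accumulates evidence of an upward shift across consecutive days, which is what makes it well suited to multi-day outbreaks. A minimal sketch on standardized forecast errors follows; the reference value `k` and decision limit `h` are conventional illustrative choices, not the study's tuned parameters.

```python
def page_cusum(scaled_errors, k=0.5, h=4.0):
    """Page's one-sided CUSUM on standardized forecast errors:
    accumulate excess over the reference value k, resetting at zero,
    and signal when the statistic crosses the decision limit h.
    k and h are illustrative defaults."""
    s, alarms = 0.0, []
    for t, e in enumerate(scaled_errors):
        s = max(0.0, s + e - k)
        if s > h:
            alarms.append(t)
            s = 0.0  # reset after signalling
    return alarms

# A sustained +2-sigma elevation signals on its third day (index 4),
# whereas any single elevated day would be absorbed:
alarms = page_cusum([0.0, 0.0, 2.0, 2.0, 2.0, 2.0, 0.0])
```

A one-at-a-time test at the same per-day threshold would never signal on this series, illustrating the advantage of sequential accumulation for multi-day outbreaks.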

### Use of free text chief complaints

Many surveillance systems report the use of CCs or discharge diagnoses based on ICD-9 codes. Most of these use discharge diagnosis ICD-9 codes in specialized settings such as the military [13] or in HMOs [23]. ED ICD-9 codes have also been used when processing data retrospectively [4]. In most EDs, however, ICD-9 discharge diagnosis coding is not performed on a "near-real time" basis and would not be available for "near-real time" surveillance.

By contrast, free text chief complaints are obtained at the time of patient entry into all Emergency Departments and free text discharge diagnoses are determined at or close to the time of ED discharge. Therefore, use of our grouping scheme is relevant to the majority of EDs in which chief complaints and discharge diagnoses are recorded as free text and are available for "near-real time" surveillance.

Because of their timeliness, CCs are used in our "near-real time" surveillance system B-SAFER [31, 32]. We considered using CoCo, a naive Bayesian free-text classifier developed by the University of Pittsburgh [33], but it was not made available to us. Another automated classification system, based on weighted key words, is used by ESSENCE [34]. The New York City Department of Health uses a key word and key phrase SAS-based coding system [35]. A comparison of the performance of expert-based classification systems such as ours with that of automated classification systems has not been done.

There are other potential limitations in using ICD-9 codes for surveillance. It is to be expected that early cases of unusual diseases will be misdiagnosed. Assigned ICD-9 diagnostic codes may be more reflective of the diagnostic bias or practice patterns of the provider than of the true incidence of disease. Furthermore, ICD-9 diagnosis code assignment is potentially subject to billing bias: codes that garner the highest reimbursement may be used, rather than those that most accurately represent the disease process. Use of ICD-9 codes for chief complaints is also problematic. Because ICD-9 codes were developed for classification of diagnoses, the dictionary for chief complaints is not robust. Therefore, the use of free text chief complaints may result in increased sensitivity, although the B-SAFER team believes that some types of coding standards would be beneficial [36].

### Results and applications

Day-of-week patterns in EDs have been previously reported in the literature [4, 26, 27, 37, 38]. Although the magnitudes of the day-of-week effects vary depending on the setting, the first day of the work week typically exhibits the greatest number of events. And, as we have shown, there is day-of-week variability within infectious disease syndromes which can be obscured by the failure to consider each body system's pattern individually. It is also important to understand utilization patterns on weekends, which ED data provides, as the effects of bio-terrorist or natural outbreaks are unlikely to be limited to weekdays.

Seasonal effects in infectious diseases are best known for respiratory infections (e.g., influenza and respiratory syncytial virus). This seasonality is in large part due to the yearly winter influenza epidemics. Our data are quite consistent with these findings. However, we provide an analysis of the pattern of respiratory-related chief complaints based on the longest (8.6 years) historical database. Furthermore, the seasonality of infectious disease complaints for other body systems, or for non-ID-related complaints, has not previously been reported. Seasonality in specific GI infections has been noted in other settings. Some gastrointestinal infections are more common in the winter (e.g., rotavirus) while others (e.g., *Campylobacter*, *Cryptosporidium*, enterovirus) are more common in the warm months, owing to outdoor cooking and recreational water exposures.

Seasonality is also important because the signal-to-noise ratio in complaint counts may change depending on the time of year. As illustrated in Figure 1, the error bars are larger during the height of flu season than at other times of the year, leading to reduced sensitivity for detecting an increase in respiratory complaints at that time.
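One way to account for this is to standardize each day's excess by a season-specific standard deviation, so the same z-threshold applies year-round; the sketch below uses hypothetical sigma values, not estimates from our data.

```python
def standardized_excess(observed, forecast, seasonal_sigma):
    """Standardize a day's forecast error by a season-specific sigma,
    so one z-threshold can be used year-round even though raw
    variability is larger at the height of flu season. All values
    passed in below are illustrative."""
    return (observed - forecast) / seasonal_sigma

# The same excess of 10 complaints is a weaker signal at peak season
# (larger sigma) than in summer (smaller sigma):
winter_z = standardized_excess(60, 50, seasonal_sigma=8.0)
summer_z = standardized_excess(25, 15, seasonal_sigma=4.0)
```

With a fixed z-threshold, the identical raw excess clears the bar in summer but may not in winter, which is the reduced flu-season sensitivity described above.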

It is unclear why, in our data, the number of respiratory complaints fell with time while the number of fever complaints rose. This may reflect the relative mildness of recent influenza seasons. Alternatively, rather than actual differences in patient presentations, this may represent changes in the practices of choosing or recording chief complaints. This would require further investigation. Implementation of a standard drop-down menu for chief complaints might prevent some bias over time in the selection of chief complaints. Use of a new system, however, would likely change the distribution of chief complaints and thus not allow for the creation of baselines based on historic data.

### Surveillance

Using the models described above, we successfully identified a respiratory outbreak in advance of the traditional flu-reporting data streams described in the Methods section. Incoming B-SAFER reports were monitored at least once daily, seven days a week, by the project epidemiologist. This allowed for prompt handling of events indicating a condition reportable by statute to the NM Department of Health. Because there is approximately a 2-week delay for traditional flu-related data sources, provided our respiratory CC category captures some of the NM flu cases, we expected to, and did, identify a flu-related respiratory peak in advance of these other sources.

### Limitations

This analysis is based on data from one ED, and the patterns identified may be somewhat specific to metropolitan Albuquerque. Indigent or Hispanic populations may be over-represented in the ED studied as compared to other EDs. Visit patterns may differ by local health care infrastructure, population insurance status, access to care, or local climate. We were fortunate that electronic ED data were available for the previous eight years. Other institutions may lack the source data for a similar analysis.

CCs are determined by a nurse and recorded in free text by a clerk. This process may conceivably distort patients' literal CCs. Free text CCs are quite variable and require extensive processing. These CCs may also have varied had they been recorded by a physician. Although the rationale for using CCs rather than discharge diagnoses was provided above, there is a tradeoff between the better timeliness of CC data and the better sensitivity of discharge diagnoses [39]. Note that our modelling is as easily applied to diagnosis codes as to chief complaints.

Any approach to disease surveillance using either CCs or discharge diagnoses requires large numbers of symptomatic patients. Analyses based on such large-scale counts are unlikely to discover a small and geographically dispersed event such as the anthrax outbreak of October 2001.

As we work more with our data, we will understand it better. Opportunities exist for performing sensitivity analyses, comparing the baseline patterns for CCs to those for discharge diagnoses, and more thoroughly evaluating the performance of our signals as compared to existing standards.