Through decision-analytic modeling, we were able to translate the number of people affected by a hypothetical bioterrorist attack into relevant outcomes and incorporate those outcomes into the evaluation metric. The outcomes were lives lost, QALYs lost, and costs incurred, with costs being the most comprehensive of the three. In the base case, using the trapezoidal method of AUC calculation, the relative order of performance remained fairly consistent across the three outcome weights. However, we found the relative order of performance to be sensitive to the model inputs as well as to the method of AUC calculation.

Our model predicts that, if undetected until the ninth day, a bioterrorist attack with *Bacillus anthracis* would have a significant detrimental effect on the health of the population. If a surveillance system were successful in detecting the attack before the ninth day, and measures were immediately taken to deliver treatment to the population, the lives, QALYs, and dollars that would be lost could be reduced considerably. Earlier detection results in better outcomes: our model estimates an absolute cost savings of several billion dollars for detection on the eighth day rather than the ninth. Conversely, a false-positive alarm has negative consequences associated with the unnecessary use of prophylactic antibiotics, namely the cost incurred and the adverse effects of the medication. Consequently, a high-performing surveillance system should not only be capable of detecting an attack before the ninth day, but should also detect the attack as early as possible and with a low rate of false positives.

Public health authorities must consider both the positive and negative aspects of the programs they choose to implement. In the case of surveillance for bioterrorist attacks, the benefits of early detection must be balanced against the adverse effects of false-positive alarms, an aspect of surveillance systems that supports the use of weighted ROC curve analysis in their evaluation. Results comparing alternative surveillance algorithms could be used to select an optimal algorithm depending on the outcome public decision makers choose to optimize. This kind of information, together with the cost information assembled in Table 1, can help inform discussions about the value and appropriate role of syndromic surveillance.

When timeliness was incorporated into the evaluation metric by Kleinman et al. [4], the Time-series method was the best performer, consistent with our results. The relative order of performance of the other systems, however, was different in the present analysis. Using a different weighting scheme, Kleinman et al. also performed an evaluation of the systems that incorporated the number of people affected by the attacks [5], and found an order of relative performance that differs from both their previous analysis and the present one. As noted in the results, the impact of a delay in detection varies with the day of detection. We have shown that this variation also affects the apparent performance of the surveillance systems, and thus that incorporating outcomes into the evaluation metric has an important effect on their ranking.

It is reasonable to conclude that the shape of the outbreak also plays a part in the relative order of performance. If the increase in the number affected is greatest during the first days following the attack, it follows that a detection system that performs best in those first days would result in the fewest losses. If, however, the increase in the number of affected people peaks several days after the attack, a detection system with a higher cumulative sensitivity by those later days would have the best performance, even if its detections are delayed by several days. Regardless of the shape of the outbreak, for anything other than a linear relationship between the number affected and time, the incorporation of outcomes has a significant impact on relative performance. The modified ROC curves described in this paper allow several dependent variables to be taken into account in one evaluation metric.
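The dependence of system ranking on outbreak shape can be sketched numerically. The daily detection probabilities, case curves, and system labels below are hypothetical illustrations, not values from the present analysis; the sketch assumes a day-nine clinical recognition fallback, as in the model:

```python
import numpy as np

# Hypothetical daily probabilities of first detection for two systems over
# days 1-8 (not the algorithms evaluated in this analysis). System A
# concentrates its sensitivity in the first days after the attack; System B
# reaches a higher cumulative sensitivity, but only later.
p_detect_a = np.array([0.30, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05])
p_detect_b = np.array([0.05, 0.10, 0.15, 0.20, 0.20, 0.10, 0.10, 0.05])

# Hypothetical cumulative cases by day under two outbreak shapes, both
# reaching 1,000 cases if the attack goes undetected until the ninth day.
front_loaded = np.array([400, 700, 850, 920, 960, 980, 990, 1000])
back_loaded  = np.array([ 10,  30,  80, 180, 350, 600, 850, 1000])

def expected_cases_at_detection(p_daily, cum_cases, fallback=1000):
    """Expected cumulative cases at the moment of first detection.
    If the system never alarms, the outbreak is assumed to be recognized
    clinically on day nine, after all `fallback` cases have occurred."""
    p_miss = np.cumprod(1.0 - p_daily)           # P(no alarm through day d)
    p_first = p_daily * np.concatenate(([1.0], p_miss[:-1]))  # P(first alarm on day d)
    return float(np.sum(p_first * cum_cases) + p_miss[-1] * fallback)

# System A wins when cases accrue early; System B wins when they accrue late.
for shape, cases in [("front-loaded", front_loaded), ("back-loaded", back_loaded)]:
    ea = expected_cases_at_detection(p_detect_a, cases)
    eb = expected_cases_at_detection(p_detect_b, cases)
    print(f"{shape}: A = {ea:.0f} expected cases, B = {eb:.0f} expected cases")
```

Under these illustrative numbers the ranking of the two systems reverses between the two outbreak shapes, which is the sense in which an outcome-weighted metric, unlike raw sensitivity, is sensitive to when the cases accrue.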

The results remained fairly constant in the 'best-case' and 'worst-case' scenario analyses, indicating that our model is robust to variation in model inputs. However, the relative order of performance is heavily dependent on the choice of AUC calculation method. The rectangular and truncated methods produced results quite different from those of the classic trapezoidal method. The trapezoidal method assumes that the surveillance system threshold can be adjusted continuously, such that the false-positive rate can be set anywhere between zero and one. This may not be a reasonable assumption, given the complexity underlying the statistical algorithms used by the surveillance systems. Moreover, because of the need for extrapolation and the resulting shape of the curves (Figure 2), this method gives more weight to the latter portion of the curve, where there are fewer data points. The rectangular method assumes that the thresholds are discrete and that the sensitivity of each detection algorithm has a preset maximum. The area is therefore limited by that maximum, which in turn affects the relative performance as measured by the AUC. The truncated method goes a step further and assumes that there is a false-positive rate beyond which the negative consequences are too great to consider using the system, allowing the remainder of the graph to be disregarded. In this case, we arbitrarily chose 0.1 false positives per day as the cutoff. As demonstrated in the results, the relative ordering changed substantially from that determined in the base case.
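The three calculation methods can be contrasted on a small example. The operating points below are illustrative, not drawn from the evaluated systems, and the x-axis is treated as a generic false-positive rate on [0, 1], with the truncation cutoff at 0.1 to mirror the cutoff used in the analysis:

```python
import numpy as np

# Hypothetical operating points for one detection algorithm,
# sorted by increasing false-positive rate.
fpr  = np.array([0.01, 0.03, 0.05, 0.10, 0.25])
sens = np.array([0.20, 0.45, 0.60, 0.70, 0.80])

def _trapezoid(y, x):
    # Area under a piecewise-linear curve through the points (x, y).
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def trapezoidal_auc(fpr, sens, x_max=1.0):
    """Classic method: linear interpolation between operating points, with the
    final sensitivity carried flat out to x_max. The long extrapolated tail is
    why this method weights the sparse right-hand portion of the curve."""
    x = np.concatenate(([0.0], fpr, [x_max]))
    y = np.concatenate(([0.0], sens, [sens[-1]]))
    return _trapezoid(y, x)

def rectangular_auc(fpr, sens, x_max=1.0):
    """Discrete thresholds: a step function that holds each observed
    sensitivity until the next threshold, capped at the preset maximum."""
    widths = np.diff(np.concatenate((fpr, [x_max])))
    return float(np.sum(sens * widths))

def truncated_auc(fpr, sens, cutoff=0.1):
    """Ignore the curve beyond an acceptable false-positive cutoff.
    (If the cutoff fell between observed points, this sketch would hold the
    last kept sensitivity flat to the cutoff rather than interpolating.)"""
    keep = fpr <= cutoff
    x = np.concatenate(([0.0], fpr[keep], [cutoff]))
    y = np.concatenate(([0.0], sens[keep], [sens[keep][-1]]))
    return _trapezoid(y, x)
```

On these points the three methods give areas of roughly 0.76, 0.75, and 0.05, respectively; because different systems' curves can cross to the right of the cutoff, restricting or reshaping the integration range in this way can reorder the systems' AUCs.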

Further research into the differences between the methods of area calculation is needed. If these evaluation methods were adopted by public health authorities, consideration should be given to the assumptions underlying the construction of the ROC curve, including the shape of the outbreak, the flexibility of the detection algorithms, and the threshold for an acceptable false-positive rate. For example, defining an acceptable false-positive rate would restrict the portion of the curve to be studied and potentially minimize the variation across alternative calculation methods. The relative performance of the surveillance systems within these bounds would be more directly applicable to a real-world setting.

Furthermore, the interpretation of these weighted ROC curves is limited by the nature of the data used to construct the axes, a constraint shared by earlier analyses of this type [4]. The false-positive rate on the x-axis is based on only one year of historical data, while the sensitivity is calculated from a simulated data set in which events occur an arbitrary number of times. Rather than using the analysis to draw conclusions about the absolute performance of each system, the intention is to compare the areas under the weighted ROC curves of the seven statistical algorithms in order to assess their performance relative to one another.

Although the probability and utility estimates were the best available from the literature, data on some model inputs were sparse because of the small number of documented anthrax cases. For example, the probability of recovery from anthrax disease was based on the only data available: the reported case series of seven individuals treated for prodromal anthrax in the 2001 outbreak [8]. Furthermore, the analysis assumed that all individuals with symptomatic anthrax illness would be treated on the day of detection and that antibiotic prophylaxis would be provided within one day to all persons at risk, an idealized scenario that may not be met in practice.