Below, we describe and investigate an algorithm for the detection of multiple rectangular spatial clusters. The algorithm is applied to the simple outbreak model described next. While we restrict the analysis to this particular model, the algorithm is general and can be applied to any underlying model that can provide a likelihood function such as the function lik described in Equation (8) below. We then describe the scanning algorithm itself.
Outbreak model
We define the disease model as a Bayesian Network (BN) model, which is a type of graphical model for representing joint probability distributions; it is particularly suited for modeling conditional independence between variables. In the graphical representation each node represents a random variable and the distribution of each variable is defined as a conditional distribution that is conditioned on the variables from which it receives incoming edges, or as a prior probability distribution if the node has no incoming edges [22]. An extension to the BN representation employs plates when subgraphs of the network are replicated many times. In the graphical plate representation, plates are shown as boxes, the nodes and edges that are on each plate are replicated the number of times shown on the plate, replicated nodes are indexed, and edges that cross plate boundaries are also replicated [23]. In particular, in Figure 1 we have m copies of nodes L
i
, D
i
, and I
i
. We further extend the notation by allowing the number of replications to depend on random variables, as represented by the dotted arrow from SUB to the plate indexed by j. Here, the values that SUB takes are ordered sets and the number of replications n is defined to equal the size of the set |SUB|. The model is explained in further detail below, where we first describe the model and the meanings of the variables in general terms, and then provide the particular parameterization that we used to instantiate the variables in our tests.
Model definition and likelihood function
We employ an agent-based model where each cell of the R × C grid contains a number of individuals. The entire population, made up of m individuals, is modeled.
Figure 1 shows a BN plate representation of the model that governs the state of each individual: SUB = (rect
1, ..., rect
n
) represents the (possibly disconnected) hypothesized location of an ongoing outbreak as an ordered collection of n non-intersecting rectangles, OB
j
are binary variables each of which determines whether an outbreak is present in a given rectangle indexed by j, and F
j
∈ [0,1] represent frequencies of outbreak disease cases per day, each corresponding to an outbreak rectangle indexed by j. We will use the vector notation shorthand F = (F
1, ..., F
n
) where convenient. K ∈ [0,1] represents the proportion of the population that visits the ED per day in the absence of an outbreak, D
i
represents the disease state of a given individual, I
i
represents the observable evidence about the individual, L
i
∈ 1, ..., R} × {1, ..., C} represents the grid cell in which the individual is located, and Φ
i
is the manifested outbreak frequency, which is the frequency of the outbreak disease per day in the individual's grid cell. The dotted arrow pointing from SUB to the plate containing OB
j
and F
j
is meant to make explicit that the number of indexes j, and hence copies of these nodes, is dependent on the size of SUB.
The random variable Φ
i
, which represents the outbreak frequency manifested in individual i's grid cell, is not used to capture uncertainty as random variables normally do, but rather plays the role of a logic switch. It is defined as follows:
(3)
(4)
Note that this is a well-defined probability distribution only under the constraint that the members of SUB do not intersect. We could alternatively merge this logic into the definition of D
i
, however, this decomposition of the model yields a clearer presentation.
The presence of OB
j
as separate from F
j
serves a similar logical purpose. The distribution of OB
j
will depend on the hypothesis space and is left for the model instantiation section. The distribution of F
j
is defined below in terms of OB
j
and a "template" frequency distribution F that is also a detail of model instantiation.
(5)
(6)
Where (F
j
|OB
j
= true) ~ F indicates that F
j
is i.i.d. according to the distribution of F when OB
j
= true.
The primary purpose of the model is to enable us to calculate the likelihood of the presence of an outbreak in a given subregion, where a subregion is taken to be any set of cells. Throughout this paper, we will often also refer to rectangular subregions, or simply rectangles, to make explicit instances where we require the sets of cells to form multi-cell rectangles on the grid. We will also introduce below the concept of tilings and tiles. In the context of this paper, tiles are always rectangular subregions.
The data likelihood of the presence or absence of an outbreak in a subregion is given by the likelihood of the data about individuals located within the subregion's limits supporting a uniform outbreak of some frequency f from the frequency distribution F or the absence of an outbreak, respectively. In order to arrive at the likelihood of an outbreak, consider the case of calculating the likelihood of a particular observation I
i
given a manifested outbreak frequency Φ
i
and a particular K:
(7)
Note that the likelihood for the individual given the absence of an outbreak can be obtained from the expression above by letting Φ
i
= 0. To obtain the likelihood of an outbreak state (not just a particular outbreak frequency) in a subregion of the grid S, we obtain a likelihood for each frequency by taking the product over all per-person likelihoods and take an expectation over those likelihoods to get the expected likelihood of the desired outbreak state. Formally, we will use the variable x to represent the outbreak state. Let x = 0 indicate that a subregion S contains no outbreak, x = 1 indicate that S is an outbreak subregion, and let the notation E
K,F
represent the expectation over the random variables K and F. Then the likelihood for any set of grid cells S under an outbreak state x is given by:
(8)
Here we are simplifying some of the calculation by capturing the logic behind determining the manifested frequency Φ
i
. Particularly, we are using the fact that the manifested outbreak frequency in a non-outbreak cell is 0, which is the end result of multiplying x by F when x = 0 to indicate the absence of an outbreak.
Model instantiation
Three disease states are modeled: noED, flu, and other, which respectively represent the events that an individual did not come in to the ED, came in to the ED due to influenza, or came to the ED for some other reason. The probability that an individual i comes in to the ED due to influenza is equal to the manifested outbreak frequency Φ
i
. Those individuals that do not come in to the ED due to influenza have a probability K of coming in to the ED for some other reason (e.g. due to acute appendicitis). Under this model the distribution of D
i
is defined by:
(9)
In the initial stages of the investigation K was treated as a constant value, i.e., a distribution with probability mass 1 at K = 3.904 × 10-4 which is an estimate that comes from data obtained for ED visits in Allegheny County, Pennsylvania in May of the years 2003-2005. We also consider a model where K is a discrete distribution with positive mass at multiple values estimated from that data. However, due to practical limitations mentioned in our discussion of additional tests below, most tests were performed with the model that uses a constant K.
To estimate the distribution F that ultimately governs the distributions of outbreak frequencies in the model, we base it on a distribution previously used for a similar, more complex disease model in [17]. We adjusted the distribution to reflect a uniform density over the interval (0, 6.50 × 10-4].
Four evidence states are modeled for the evidence variable I
i
, in the form of chief complaints, taking the values cough, fever, other, and missing. I
i
is governed by the following distribution:
(10)
(11)
(12)
(13)
(14)
(15)
(16)
(17)
A value of "missing" indicates that individual i did not come in to the ED. The values that I
i
takes for individuals that did come in to the ED (either due to influenza or other reasons) are governed by a probability distribution that is based on expert assessments that are informed by the medical literature.
The proper choice of prior distributions is closely tied to the proper choice of the prior probability of an influenza outbreak being present anywhere in the region. For that purpose, we take the value P( outbreak) = 0.04 from previous work [17, 24]. This value will appear at multiple points throughout this paper.
The distributions of the subregion, outbreak, and frequency variables
The distributions of SUB, OB
j
, and F
j
are inherently tied together, as made explicit in the BN structure, and since the space of SUB is the space of considered outbreak subregions, it is different for different algorithms that consider different hypothesis spaces. This section describes the distributions of SUB and F
j
for the two baseline methods we use in our evaluation and for the dynamic programming algorithm.
The single-rectangle case
The simplest case is the case where only single-rectangle hypotheses are considered. Under that model, SUB can represent any of the R(R + 1)C(C + 1)/4 rectangles that can be placed on the R × C grid, each representing a hypothesis of an outbreak within the rectangle and no outbreak outside the rectangle. SUB can take an additional value to represent a non-outbreak hypothesis. Since we defined SUB to be an ordered set of rectangles, formally, each single-rectangle hypothesis is represented as a single-element set, and the non-outbreak hypothesis is represented by the empty set ∅. The associated prior distribution of SUB is:
(18)
(19)
Note that when SUB is the empty set, OB
j
and F
j
do not need to appear in the BN as they play no role in the distribution of Φ
i
and the rest of the BN. In the case when SUB represents a rectangle, it is a single-element set, and there is only one j, namely j = 1. OB
1 is then taken to be true and consequently F
1 ~ F.
The multi-rectangle case
The case that corresponds to the greedy algorithm to which we will be comparing our algorithm is the case where each element sub of the hypothesis space is an ordered set of non-overlapping rectangles. Call this hypothesis space , that is, is a set of values that the random variable SUB can take, where each value sub is an ordered set of rectangles. Here the prior distribution of SUB is governed by a structure prior parameter q as follows:
(20)
(21)
The parameter α can be used to adjust the prior probability of the absence of an outbreak, while the parameter q controls the relative prior probability of having a more complex outbreak hypothesis (that is, a hypothesis with more hypothesized rectangles) vs. a less complex outbreak hypothesis. Z
M
is a normalization constant described by:
(22)
Where \ is the set difference operator and |sub| denotes the size of the ordered set sub (a particular value that SUB can take) in terms of the number of rectangles in the set.
Since in our experiments this model is used only in the context of selecting the most likely outbreak hypothesis, there is no need to specify α or calculate Z
M
. The value q = P(outbreak) × 4/(R(R + 1)C(C + 1)) was used for the structure prior parameter.
Since in this model every rectangle that appears in SUB is an outbreak rectangle, we again have OB
j
= true for all j and consequently F
j
~ F are i.i.d., one per rectangle.
The tiling case
The hypothesis space used by the dynamic programming algorithm, the centerpiece of this work, is a space of tilings (discussed in more detail in the next section.) To express this hypothesis space, each value of SUB is a tiling: a collection of non-overlapping rectangles such that every cell of the grid is covered by exactly one of these rectangles. In order to represent an outbreak hypothesis as a tiling we use the variable OB
j
to indicate whether a tile is hypothesized to represent an outbreak or an outbreak-free region. In the tiling context the distribution over the variables SUB and OB
j
is defined as follows:
(23)
(24)
(25)
Here Z
T
is a normalization constant and p is a structure prior representing the conditional probability that the rectangle rect
j
∈ sub (where sub is a value in the range of the r.v. SUB) is an outbreak rectangle given that it is indeed a rectangle of the hypothesis. The prior distribution of SUB is taken to be uniform since we assume that we have no information that leads us to favor one tiling over another. The calculation of Z
T
is left for a later section that also discusses the relationship between p and the prior probability of an outbreak, as well as the process which was used to select the value of p that corresponds to the prior probability of a flu outbreak equaling P( outbreak) = 0.04.
In the text that follows, we will refer to the choice of a set of tiles sub and the associated assignment of the variables OB
1, ..., OB
n
as a colored tiling, or when it is clear from the context we may refer to colored tilings simply as tilings. We also discuss the particular space of colored tilings that the dynamic programming algorithm considers in more depth, since this space is a subset of the space of all possible tilings. The possibility of using a structure prior P(OB
j
= true|SUB) that is not identical for every rectangle in every tiling SUB is also discussed.
Tilings
The algorithm presented in this paper searches a space of hypotheses where each hypothesis is a possible tiling of the surveillance grid by non-outbreak and outbreak rectangles. For example, Figure 2 shows the six possible tilings of a 1 × 2 grid.
Note that we make a distinction between two adjacent outbreak tiles (e.g., Hypothesis 5) and one large outbreak tile (e.g., Hypothesis 6). The rationale for this is that each tile represents a region over which the outbreak frequency is uniform. In the 1 × 2 example, we would expect Hypothesis 5 to be more likely than Hypothesis 6 in a case where, for example, 5% of the population in the left cell has the outbreak disease and 10% of the population in the right cell has the outbreak disease. Conversely, if the outbreak disease cases are distributed uniformly among the two cells we would expect Hypothesis 6 to be more likely.
Computing tiling scores
In computing tiling scores, we assume conditional independence among separate tiles given a particular tiling, even if the tiles are adjacent. As an alternative, modeling dependencies could help in the cluster detection task to adjust the prior probability of the presence of a disease in one cluster when it is near another cluster. A model that takes such effects into account could achieve improved outbreak detection, especially when modeling infectious diseases like influenza where being near an infected individual increases the probability of transmission of an infection. Modeling such dependencies incorrectly, however, could also hinder detection. In this sense, our choice not to model spatial dependencies can be seen as a cautious approach to avoid making informative spatial dependence assumptions that may be incorrect and therefore deleterious to outbreak detection performance. Even so, our basic choice of priors does favor finding a smaller number of tiles for a region, which often leads to spatially grouping neighboring disease cases.
The independence assumptions we make enable the dramatic computational efficiency that we gain by using dynamic programming, as explained in detail below. Assuming conditional independence does not constrain the outbreak hypotheses that can be identified in principle from the data, if given enough data. Although valid assumptions of conditional dependence may yield a more accurate performance in light of available data, representing and reasoning with conditional dependence carries an enormous computational burden that we avoid. It may be possible to extend our method to take spatial dependencies into account and still maintain computational tractability. We leave this issue as an open problem for future research.
Conditional independence allows us to define the score of a given tiling to be the product of the scores of the individual tiles it is composed of. The score of each individual tile is given by the data likelihood of that tile, as defined by Equation (8), multiplied by the prior probability of the tile's outbreak state. That is, the score of the hypothesis that tile T contains an outbreak is p ⋅ lik(1, T) and the score of the hypothesis that it does not contain an outbreak is (1 - p) ⋅ lik(0, T).
To illustrate the effects of multiplying the likelihood by a prior, suppose that in the 1 × 2 example in Figure 2 the likelihoods of the tiles in Hypothesis 5 and Hypothesis 6 were the same, both equal to some value l. In that case the score of Hypothesis 6, which contains only one tile, would be lp while the score of Hypothesis 5, which contains two tiles, would have the lower value of lp
2. Hence, all other things being equal, our prior favors tilings composed of fewer tiles.
A multiplicative prior for each tile is used to allow the decomposition of a tiling score into a product of individual tile scores. While we use the same prior for all tiles, it is possible to assign a different prior probability for each tile based on its location and size while maintaining the multiplicative property that the score of a tiling is a product of tile scored. Also note that while the priors for each tile add up to 1, the priors for tilings, which are a product of tile priors, are not normalized. We discuss the details of normalization when we present Bayesian model averaging, since it is especially relevant in that context. The process of selecting the value of the structure prior p and its relationship to the prior probability that an outbreak is present anywhere in the region (the "global prior") is also discussed alongside normalization. Below, let us denote the score of an outbreak state x for a tile T that spans rows R
L
through R
H
and columns C
L
through C
H
by
(26)
Dynamic programming algorithm
It is clear that the space of possible outbreak hypotheses is exponential in the size of the grid, since there are 2R×Cpossible outbreak subregions and an even larger number of possible tilings. In order to efficiently search the space of hypotheses, a dynamic programming algorithm is presented for finding the most likely tiling that exploits the fact that the tiling of a large grid can be decomposed into multiple tilings of smaller grids.
To present the operation of the algorithm, we will first describe the algorithm for finding the highest-scoring tiling of a 1 × C horizontal strip in general terms, then walk through an example of tiling the top row of a 5 × 5 grid illustrated in Figure 3. Next we describe how the algorithm is extended to two dimensions with the aid of the example on a 5 × 5 grid illustrated in Figure 4, and finally we provide a formal definition of the algorithm as pseudocode.
In order to find the best (highest scoring) tiling of a 1 × C strip, we first number the cells consecutively from 1 to C. Let denote the best tiling of cells numbered C
L
- 1 (inclusive) and lower. When C
L
= 1, (or equivalently ) is a tiling of an empty set of cells. There is only one such possible tiling, the empty tiling, and we assign it a score of 1 because it is the multiplicative identity. Having laid out this grounding we can describe the operation of the algorithm iteratively: Given that we have found the best tiling for each C
L
such that 1 ≤ C
L
≤ C
H
, we can determine the best tiling for cells 1, ..., C
H
as follows: For each C
L
≤ C
H
, determine whether the tile that spans the cells C
L
, ..., C
H
(inclusive) should be an outbreak tile or a non-outbreak tile to maximize its score, then multiply the score by the score of the best tiling . This product is the score of the tiling obtained by appending tile to tiling , and this resulting tiling is then a candidate for tiling . Then out of all the candidates obtained for the different values of C
L
in 1, ..., C
H
, the highest scoring one is guaranteed to be the highest scoring tiling of the range of cells 1, ..., C
H
. Thus we obtain the best tiling for the entire strip by iterating over values of C
H
from 1 to C.
Figure 3 illustrates a single iteration of this algorithm when looking for the best scoring tiling of the first (top) row of a 5 × 5 grid. The cells are numbered left to right. Particularly, the figure shows us the iteration that obtains the best tiling of the first (leftmost) four cells, thus, throughout this iteration, C
H
= 4. At this point, we have already obtained the best tilings of the lower ranges of cells in previous iterations. These are shown in Figure 3(a). Figure 3(b) shows that for each value of C
L
from 1 through C
H
= 4, we consider whether an outbreak or a non-outbreak tile scores higher. The best scoring tile from each pair is indicated. In Figure 3(c), we show each of those highest scoring tiles combined with the corresponding best tiling from Figure 3(a). In our example suppose that of those resulting tilings the bottom one (obtained with C
L
= 4) resulted in the highest score. This tiling is then the best tiling of cells 1...4, and it is cached for use in the next iteration.
This method is extended to the two-dimensional case by performing the same iteration over rows, where each row takes the role analogous to a cell in the one-dimensional version discussed above. For example, Figure 4 is analogous to Figure 3(c) in that we know the best tiling for the rectangular region that spans rows 1 through R
L
- 1, and we are adding the column-wise tiling of the rectangular region spanning rows R
L
through R
H
to get a candidate tiling of the entire area above the line labeled R
H
. The main difference is that, where in Figure 3(b) we were considering whether to add an outbreak tile or a non-outbreak tile, in Figure 4 we are simply adding the best column-wise tiling of the range of rows between R
L
and R
H
, as is illustrated by the fact that the best tiling of row 1 comes from an extension of the example in Figure 3.
For readability purposes, a version of the algorithm that only computes the score of the best tiling is presented in Figure 5. A version of the algorithm that keeps track of the actual tiling is detailed in Additional file 1: Appendix A.
The algorithm takes Θ(R
2 ⋅ C
2) iterations. In the general case, the computational cost of each iteration depends on the likelihood model used. In our particular implementation we compute a table of likelihoods for all possible grid rectangles prior to running the algorithm. Since our model is agent-based, the computational cost of populating the table is linear in the population of the surveillance region. The computational advantage that dynamic programming gives us is the ability to scan an exponential number of possible outbreak subregions. The exact number of possible tilings scanned is (2 × 3C-1+ 1)R-1× 2 × 3C-1= Θ(2R(Clg 3+1-lg 3)); of these tilings, (2C-1+ 1)R -1× 2C-1are non-outbreak hypotheses, and the rest are outbreak hypotheses, that is, hypotheses where at least one tile is colored. The calculation of the number of tilings is presented in Additional file 2: Appendix B.
We can express the value computed by the algorithm formally as follows: Let a row-wise partition of the grid be denoted by where is the space of all row-wise partitions of the grid. Each partition is a set of pairs of row indexes, where each pair (R
L
, R
H
) bounds the rectangular region R
LH
that spans the width of the grid (columns 1 through C) and rows R
L
through R
H
(inclusive). Let a column-wise partition of the rectangular region R
LH
be denoted by , where is the space of all column-wise partitions of R
LH
. We use the notation to mean that there is a tile spanning columns C
L
through C
H
(and rows R
L
through R
H
) in the partition. Using this notation, the algorithm finds:
(27)
The space of tilings considered by the algorithm can be described as the space of row-wise tilings of column-wise sub-tilings. Intuitively, this space has the desired property of being able to capture every cell-wise coloring of the grid, however, this is not an exhaustive search over all general colored tilings. Figure 6 illustrates this on a 10 × 10 grid with a pair of examples: Figures 6(a) and Figure 6(c) show two configuration of outbreak rectangles representing multiple outbreaks. It is desirable for a tiling found by the algorithm to be able to (1) cover all outbreak regions with outbreak tiles and all non-outbreak regions with non outbreak rectangles (that is, get the coloring right), (2) cover each outbreak rectangle with a single tile, and (3) minimize any unnecessary fragmentation of the non-outbreak region. For the outbreak configuration in 6(a), there exists a tiling in the search space of the algorithm that satisfy all three conditions, namely, the tiling in Figure 6(b). However, such a tiling does not always exist: The outbreak configuration in Figure 6(c) shows this limitation. A tiling in the space of all colored tilings that satisfies all three conditions is shown in Figure 6(e). Note that this tiling is not a row-wise tiling of column-wise sub-tilings and is hence not in the search space of the algorithm. In fact, a row-wise tiling that minimizes the fragmentation of the outbreak rectangles (to the extent possible for a row-wise tiling) is shown in Figure 6(d). Note that in Figure 6(d) the non-outbreak region has to be broken up into eleven tiles (as opposed to just eight in Figure 6(e)) and the three outbreak rectangles need to be broken up into five tiles. Thus, the most parsimonious row-wise tiling shown in Figure 6(d) is still not as parsimonious as the tiling in Figure 6(e).
Model averaging
In the above algorithm it was assumed that the task at hand is model selection, that is, finding the most likely tiling given the data. Below we present an adaptation of the dynamic programming algorithm to perform Bayesian model averaging as well, in order to derive a posterior probability that an outbreak is occurring given the data. In simple terms, this is done by replacing maximization with summation to obtain two sums: S
0(Data), the sum of the scores of all tilings that do not include outbreaks (blank tiles only); and S
01(Data), the sum of the scores of all tilings considered. More specifically, replacing maximization by summation leads to the dynamic programming algorithm for summation over tilings in Figure 7.
The summation algorithm calculates the sum
(28)
Here, the function h(R
L
, R
H
, C
L
, C
H
) defines the tile-wise values we would like to sum over. By substituting each of the following alternative functions for h we obtain the desired sums S
01(Data) and S
0(Data), respectively:
(29)
(30)
Score normalization
Under the above setup, the tiling scores need to be normalized in order to have the prior probabilities over all hypotheses sum to 1. The normalization constant Z
T
can be simply calculated as a sum over the priors associated with all tilings:
(31)
Hence the normalization constant Z
T
is simply the sum over tilings in (28) with h ≡ 1. Additional file 3: Appendix C shows that for an R × C grid, this value can be calculated in closed form as Z
T
= f(R, C, y) using Equation (32) with y = 1:
(32)
Similarly, the prior probability, as governed by the structure prior parameter p, of the absence of an outbreak anywhere in the grid can be calculated as a sum over the priors of all non-outbreak tilings:
(33)
This relationship was used to pick the value of p that matches the particular value of P( outbreak) = 0.04 for the prior probability of an influenza outbreak based on expert assessment. Since we have not found a closed-form inverse to this relationship, the appropriate the value of p was found numerically using a binary search over numbers in [0,1].
In order to obtain the posterior probability of an outbreak, we normalize the sums of scores to obtain: S
0,1(Data)/Z
T
= P(Data) and S
0(Data)/Z
T
= P(Data, no outbreak). This allows us to obtain the posterior probability of an outbreak
(34)
In a typical biosurveillance application this is the posterior that we would use to detect whether an outbreak is occurring anywhere in the surveillance region by testing whether it rises above a pre-defined alert threshold.
Evaluation methods
This section describes an evaluation of the dynamic programming (DP) algorithm, both in terms of model selection and in terms of model averaging. The evaluation was preformed using real background data consisting of information about chief complaints and home ZIP codes of ED patients who are presumed not to have an outbreak disease. Synthetic data were generated to simulate the presence of influenza outbreak cases. The evaluation consists of a comparison to a baseline that scans over single-rectangle hypotheses (SR), as well as a comparison for model selection against a greedy algorithm (GR) that hypothesizes one or more non-overlapping outbreak rectangles. The next section describes these algorithms and the section that follows it describes the data in further detail.
Baseline methods
The simpler of the baseline methods that the DP algorithm is compared to is a method of scanning over single-rectangle hypotheses (SR). For model selection, it consists of iterating over all possible placements of a single rectangle on the grid and finding the placement that maximizes the posterior probability of an outbreak, calculated as:
(35)
where S
i
represents some rectangular region on the grid and is its complement. In the implementation used in this evaluation, the prior probability P(H
1(S
i
)) of having an outbreak in rectangle S
i
is set to be equal for all rectangles, hence maximization of the posterior probability here is equivalent to maximization of the likelihood . The prior probabilities P(H
1(S
i
)) are chosen so that the prior probability P( Outbreak) of an outbreak anywhere in the region matches the one for the dynamic programming algorithm.
Let the notation S(R
L
, R
H
, C
L
, C
H
) denote the rectangle S
i
defined by those row and column boundaries, and let . Then model averaging using the SR algorithm consists of simply calculating the posterior:
(36)
The other baseline method that is used in the evaluation is the greedy algorithm (GR) without recursion by Jiang and Cooper [18]. We use this algorithm for model selection only, and it operates iteratively: First it selects the single rectangle that maximizes the score , where q is a structure prior. Next the non-overlapping rectangle that maximizes the score is selected. This process of adding one rectangle at each step and multiplying by the structure prior q with each new added rectangle is repeated until the score can no longer be increased. The result is a collection of non-overlapping rectangles with the associated score
(37)
Outbreak rectangles are considered to be independent conditioned on their positions. The value of q is chosen to be equal to the value used in the SR algorithm for P(H
1(S
i
)).
Data generation
Multiple outbreak scenarios were generated on a 10 × 10 grid with a population distribution based on ZIP Code Tabulation Area populations in Allegheny County, Pennsylvania as reported in the 2000 Census. Real ED data with home ZIP code information was used as a non-outbreak background scenario. Data from the months of June-August of the years 2003-2005, a total of 181 days, were used to evaluate the background scenario. The data for each day consisted of a de-identified list of records for patients who visited emergency departments in Allegheny county on that day, where each record contained the patient's home zip code and chief complaints. Ethics approval for use of the data was provided by the University of Pittsburgh IRB.
Outbreaks of varying shapes, sizes, and frequencies were then injected into the data to obtain outbreak scenarios. In simulating outbreaks, the outbreak frequency F was sampled as a continuous uniform random variable from one of the following four frequency ranges: low (0 to 5.91 × 10-5), mid (5.91 × 10-5 to 3.55 × 10-4), high (3.55 × 10-4 to 6.50 × 10-4), and very high (6.50 × 10-4 to 0.5).
For each frequency range, four sets of outbreak specifications were generated where the rectangle sizes were limited to a maximum of 10, 40, 60, and 100 cells. Each of these sets was in turn composed of five sets of specifications with n = 1 to 5 non-overlapping rectangles, 10 scenarios for each value of n, giving a total of 800 outbreak specifications. The positioning of each rectangle on the grid was uniformly random. Figure 8 shows a sample of the randomly generated outbreak shapes that were used. Each rectangle j was assigned an outbreak frequency F
j
sampled from the previously determined frequency range.
For each of the 181 days of background data, each of the 800 outbreak specifications was used to create an outbreak scenario by injecting disease cases into that day by selecting an individual in a an outbreak rectangle indexed by j with probability F
j
and setting the corresponding symptom I
i
by sampling the distribution of P(I
i
|D
i
= flu) defined in our disease model.
For each outbreak scenario, the outbreak's coverage (the proportion of the grid that is covered by outbreak rectangles) was also recorded. This value is later used to stratify the test results.