
# A data-driven epidemiological prediction method for dengue outbreaks using local and remote sensing data

Anna L Buczak^{1} (corresponding author), Phillip T Koshute^{1}, Steven M Babin^{1}, Brian H Feighner^{1} and Sheryl H Lewis^{1}

**12**:124

https://doi.org/10.1186/1472-6947-12-124

© Buczak et al.; licensee BioMed Central Ltd. 2012

**Received:** 1 May 2012 **Accepted:** 25 October 2012 **Published:** 5 November 2012

## Abstract

### Background

Dengue is the most common arboviral disease of humans, with more than one third of the world’s population at risk. Accurate prediction of dengue outbreaks may lead to public health interventions that mitigate the effect of the disease. Predicting infectious disease outbreaks is a challenging task; truly predictive methods are still in their infancy.

### Methods

We describe a novel prediction method utilizing Fuzzy Association Rule Mining to extract relationships between clinical, meteorological, climatic, and socio-political data from Peru. These relationships are in the form of rules. The best set of rules is automatically chosen and forms a classifier. That classifier is then used to predict future dengue incidence as either HIGH (outbreak) or LOW (no outbreak), where these values are defined as being above and below the mean previous dengue incidence plus two standard deviations, respectively.

### Results

Our automated method built three different fuzzy association rule models. Using the first two weekly models, we predicted dengue incidence three and four weeks in advance, respectively. The third prediction encompassed a four-week period, specifically four to seven weeks from time of prediction. Using previously unused test data for the period 4–7 weeks from time of prediction yielded a positive predictive value of 0.686, a negative predictive value of 0.976, a sensitivity of 0.615, and a specificity of 0.982.

### Conclusions

We have developed a novel approach for dengue outbreak prediction. The method is general, could be extended for use in any geographical region, and has the potential to be extended to other environmentally influenced infections. The variables used in our method are widely available for most, if not all countries, enhancing the generalizability of our method.

## Keywords

- Dengue fever
- Prediction
- Association rule mining
- Fuzzy logic
- Predictor variables

## Background

Dengue is an acute febrile disease of humans caused by a single-stranded RNA flavivirus transmitted by *Aedes* mosquitoes, primarily *Aedes aegypti*. These mosquitoes thrive in tropical urban areas by breeding in uncovered containers capable of holding rain water, such as tires, buckets, flower pots, etc.[1]. Dengue is now the most common arboviral disease of humans in the world[2, 3], recognized in over 100 countries, with an estimated 50 – 100 million cases annually[4, 5]. More than one third of the world’s population lives in the areas where there is a risk of dengue virus transmission. Recent dengue outbreaks have occurred in the Philippines, Singapore, Thailand, Cambodia, Peru, Ecuador, and Brazil[6]. Dengue is endemic in Puerto Rico and recently re-emerged in the Florida Keys in the United States (US)[7].

Dengue presents with a wide range of symptoms[2]. Minimally symptomatic or mild flu-like presentations may be seen in young children. The classic presentation (called dengue fever or DF), seen most commonly in older children and adults, is an abrupt onset of a high fever, severe muscle and joint pain, and headache that may occur with nausea and vomiting. Recovery is prolonged and marked by fatigue and depression[8]. A hemorrhagic form of the disease may develop, especially in patients who have been exposed to more than one of the four known strains of the virus[2, 6]. This presentation, called dengue hemorrhagic fever (DHF), includes increased capillary permeability with potentially significant vascular leakage that compromises organ function and may lead to shock[2–4]. Mortality in DHF with excellent medical care is generally less than 10%, but has been reported to be as high as 40% in austere settings[2].

Efforts to develop a dengue vaccine have been hampered by lack of appropriate animal models. Additionally, the empirical observation of increased incidence of DHF with prior immunologic response to dengue virus infection raises the theoretical possibility that immunization may result in an increased incidence of DHF[9]. Several dengue vaccines are currently undergoing clinical trials; however, no dengue vaccine is licensed for use in the US. Therefore, it is important to find ways to accurately predict dengue outbreaks in order that preventive public health interventions may be used to mitigate the effect of these outbreaks, particularly in areas where resources for such efforts are limited and where medical treatment facilities may become overwhelmed by an outbreak.

## Methods

### Predictor variables

**Sources of data**

Data type | Source |
---|---|
Rainfall | NASA Tropical Rainfall Measuring Mission http://mirador.gsfc.nasa.gov/ |
Temperature | USGS Land Processes Distributed Active Archive Center https://lpdaac.usgs.gov/get_data |
Altitude | NOAA National Geophysical Data Center http://www.ngdc.noaa.gov/cgi-bin/mgg/ff/nph-newform.pl/mgg/topo/ |
Demographics | Peru National Institute of Statistics and Information http://www.inei.gob.pe/ |
NDVI | USGS Land Processes Distributed Active Archive Center https://lpdaac.usgs.gov/get_data |
EVI | USGS Land Processes Distributed Active Archive Center https://lpdaac.usgs.gov/get_data |
Political Stability | Worldwide Governance Indicators Project http://info.worldbank.org/governance/wgi/index.asp |
Southern Oscillation Index | US National Center for Atmospheric Research http://mirador.gsfc.nasa.gov/ |
Sea Surface Temperature Anomaly | NASA Global Change Master Directory https://lpdaac.usgs.gov/get_data |

In order to perform spatiotemporal predictions, all the variables need to fit the same spatiotemporal scale. The spatiotemporal scale used in this work was selected based on the distribution of the dengue data: the chosen temporal scale was one week and the chosen spatial distribution was one district. In the following sections, when describing the different variables used, we also describe the extensive preprocessing done for each variable to fit the selected spatiotemporal scale.

#### Dengue case data

In the calculations, we assumed that the population of a district was constant throughout a given year. To derive these population values, we obtained district population data from the Peru National Institute of Statistics and Information for the 1993 and 2007 censuses (no census was taken between these years). For each district, we used linear interpolation to obtain the population for each of the years between 1993 and 2007, and linear extrapolation to obtain the population for 2008 and 2009. When portions of Iquitos were reassigned to Belen and San Juan Bautista in 2000, we assumed that the three districts’ total population increased linearly and that the ratio between them remained constant.
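The interpolation and extrapolation of district populations described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the census figures in the example are hypothetical, and the sketch assumes a single straight line through the 1993 and 2007 census values.

```python
def district_population(year, pop_1993, pop_2007):
    """Estimate a district's population for a given year.

    Years between 1993 and 2007 are linearly interpolated; years after
    2007 (e.g. 2008 and 2009) lie on the same line, i.e. they are
    linearly extrapolated.
    """
    slope = (pop_2007 - pop_1993) / (2007 - 1993)  # people per year
    return pop_1993 + slope * (year - 1993)


# Hypothetical census counts for one district (illustration only).
POP_1993 = 100_000
POP_2007 = 128_000

for year in (1993, 2000, 2007, 2009):
    print(year, district_population(year, POP_1993, POP_2007))
```

The value returned for a year would then be held constant for every week of that year, as stated in the text.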

*i* to week *i* + 3 can be obtained with:

Because dengue cases were provided in weekly intervals, we converted all other input variables to weekly intervals. Following US Centers for Disease Control and Prevention (CDC) conventions[28], all weekly intervals begin on a Sunday.
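As a small illustration of the Sunday-based weekly convention, the sketch below maps any calendar date to the Sunday that starts its week. The function name is hypothetical, and the sketch covers only the start-of-week rule, not the CDC week-numbering scheme.

```python
from datetime import date, timedelta


def week_start(day):
    """Return the Sunday that begins the week containing `day`."""
    # date.weekday() numbers Monday as 0 and Sunday as 6, so
    # (weekday + 1) % 7 is the number of days elapsed since Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


print(week_start(date(2012, 11, 5)))   # a Monday
print(week_start(date(2012, 11, 10)))  # the following Saturday
```

Grouping daily or multi-day records by `week_start` gives one bucket per Sunday-to-Saturday week.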

#### Rainfall

*subcells* and counted (using MATLAB[30]) the number of centroids of these subcells contained within that district. We then divided these counts by the total number of subcell centroids encompassed by each district. An example is shown in Figure 6: 11 subcell centroids fell within district A (shown in blue), 47 within district B (red), and 42 within district C (orange). Table 2 gives counts of subcell centroids in each district in other grid cells (not shown). For some of the cells, the total of subcell centroids is less than 100 because other districts (not listed) encompass some of the subcell centroids in those grid cells. The following calculations can be used to determine the weights *W*_{i,d} for each grid cell for district A:

**Numbers of subcell centroids from each grid cell in districts A, B, and C. GC stands for Grid Cell**

District | GC1 | GC2 | GC3 | GC4 | GC5 | GC6 | GC7 | GC8 | GC9 |
---|---|---|---|---|---|---|---|---|---|
A | 7 | 22 | 0 | 12 | 11 | 0 | 0 | 0 | 0 |
B | 0 | 2 | 0 | 67 | 47 | 0 | 18 | 13 | 0 |
C | 0 | 5 | 15 | 0 | 42 | 34 | 0 | 65 | 25 |

Similar calculations were performed to determine the grid cell weights for other districts, and the same technique was used to convert other variables’ gridded data to a single value for each district.
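The weight calculation can be illustrated with the centroid counts from the table above. This sketch assumes that the weight of grid cell *i* for district *d* is the district's centroid count in that cell divided by the district's total centroid count (as the text describes), and that a district's value is the weighted sum of the grid cell values. The function names and the rainfall values are hypothetical.

```python
# Subcell centroid counts per grid cell (GC1..GC9), taken from the table.
COUNTS = {
    "A": [7, 22, 0, 12, 11, 0, 0, 0, 0],
    "B": [0, 2, 0, 67, 47, 0, 18, 13, 0],
    "C": [0, 5, 15, 0, 42, 34, 0, 65, 25],
}


def grid_cell_weights(counts):
    """Weight of each grid cell = its centroid count / district total."""
    total = sum(counts)
    return [count / total for count in counts]


def district_value(weights, cell_values):
    """Weighted sum of gridded values (assumed aggregation rule)."""
    return sum(w * v for w, v in zip(weights, cell_values))


weights_a = grid_cell_weights(COUNTS["A"])
print([round(w, 3) for w in weights_a])

# Hypothetical weekly rainfall (mm) in GC1..GC9, for illustration only.
rainfall = [40.0, 55.0, 30.0, 60.0, 50.0, 45.0, 35.0, 70.0, 65.0]
print(round(district_value(weights_a, rainfall), 2))
```

District A has 52 centroids in total, so grid cell 5 (11 centroids) receives a weight of 11/52, and cells with no centroids contribute nothing.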

*i* to week *i* + 3 can be obtained with:

#### Temperature

where *d*_{i} is the number of days in the week overlapping the *i*th 8-day interval and *T*_{8,i} is the temperature in that 8-day interval.

In cases where one of two 8-day intervals had missing data from a given grid cell, we used the mean from the other 8-day interval exclusively. In cells where all values from corresponding 8-day intervals were “missing” data, we excluded these temperature values from the subsequent spatial aggregation whenever possible. (In a small number of cases, all temperature data from all grid cells comprising a given district were missing for an entire week and we set that district’s temperature to “missing”). To determine grid cell weights for each district, we applied the subcell centroid counting method (described earlier in the section entitled Rainfall) that we used to determine grid cell weights for rainfall.
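A minimal sketch of this temporal conversion for a single grid cell follows. It assumes that the weekly value is the day-weighted mean of the overlapping 8-day values, which is consistent with the definitions of *d*_{i} and *T*_{8,i} but is an assumption rather than a formula confirmed by the text. The function name and the temperatures are hypothetical; `None` marks missing data.

```python
def weekly_value(overlaps):
    """Day-weighted mean of the interval values that overlap one week.

    `overlaps` is a list of (days_in_week, interval_value) pairs, where
    interval_value is None when the interval has missing data.
    Returns None when every overlapping interval is missing.
    """
    valid = [(days, value) for days, value in overlaps if value is not None]
    if not valid:
        return None  # the cell is excluded from the spatial aggregation
    total_days = sum(days for days, _ in valid)
    return sum(days * value for days, value in valid) / total_days


# A week that overlaps one 8-day interval for 3 days and the next for 4.
print(weekly_value([(3, 28.0), (4, 30.0)]))  # both intervals present
print(weekly_value([(3, 28.0), (4, None)]))  # second interval missing
print(weekly_value([(3, None), (4, None)]))  # all data missing
```

Renormalizing by the days that have valid data means that when one of two intervals is missing, the other interval's value is used exclusively, as described above.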

*i* to week *i* + 3 can be obtained with:

#### Vegetation indices: NDVI and EVI

These data consist of satellite measurements of leaf area indices that provide a surrogate assessment of green leaf biomass, photosynthetic activity, and the effects of seasonal rainfall, which may then be related to vector habitat characteristics and disease outbreaks[32]. NDVI is closely related to photosynthesis, while EVI is closely related to leaf display[33]. Values of both NDVI and EVI were obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS).

Negative values (approaching −1) of NDVI correspond to water. Values close to zero (−0.1 to 0.1) correspond to barren areas of rock, sand, or snow. Low positive values (approximately 0.2 to 0.4) represent shrub and grassland. High values (approaching 1) indicate temperate and tropical rainforests. NDVI seasonal variations closely follow human-induced patterns, resulting in a significant correlation between NDVI and landscape disturbance[33].

EVI is an optimized index designed to enhance the vegetation signal, with improved sensitivity in high biomass regions and improved vegetation monitoring through a decoupling of the canopy background signal and a reduction in atmospheric influences. EVI is calculated similarly to NDVI, but corrects for some distortions in the reflected light. EVI is considered to be more responsive than NDVI to canopy structural variations. Xiao et al.[34] note that the inclusion of the blue band in EVI for atmospheric correction is particularly important for the Amazon basin, where seasonal burning of pasture and forest takes place throughout the dry season. They note that, unlike EVI, NDVI could be substantially impacted by the smoke and aerosols from biomass burning, regardless of the vegetation changes.

where *d*_{i} is the number of days in the week overlapping the *i*th 16-day interval and *N*_{16,i} is the NDVI or EVI value for that 16-day interval.

In cases where one of two 16-day intervals was missing data from a given grid cell, we used the value from the other 16-day interval exclusively. In cells where all values from corresponding 16-day intervals were missing data, we excluded these NDVI and EVI values from the subsequent spatial aggregation. To assign single NDVI/EVI values for each district, we applied the subcell centroid counting method described earlier in the section entitled Rainfall.

*i* to week *i* + 3 can be obtained with:

where Index stands for NDVI or EVI, depending on which of the two is being calculated.

#### Southern Oscillation Index

We obtained monthly Southern Oscillation Index (SOI) values from the US National Center for Atmospheric Research Climate Analysis Section website[35]. SOI is based on the pressure difference between Darwin (Australia) and Tahiti (French Polynesia), which influences the strength of the prevailing easterly winds. These data provide a measure of the El Niño Southern Oscillation (ENSO) climate effect. Only a single SOI value is available for each month, so the SOI is not location-specific.

where *d*_{i} is the number of days in the week overlapping the *i*th month and *S*_{m,i} is the SOI for that month.
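The overlap counts *d*_{i} can be derived directly from the calendar. The sketch below assumes that the weekly SOI is the day-weighted mean of the monthly values over the seven days of a Sunday-based week; this is an assumption consistent with the definitions above, and the function name and SOI values are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_soi(week_start, monthly_soi):
    """Day-weighted weekly SOI from monthly values.

    `monthly_soi` maps (year, month) to that month's SOI value.
    """
    days_per_month = Counter()
    for offset in range(7):
        day = week_start + timedelta(days=offset)
        days_per_month[(day.year, day.month)] += 1  # this is d_i
    weighted = sum(d * monthly_soi[m] for m, d in days_per_month.items())
    return weighted / 7


# Hypothetical monthly SOI values, for illustration only.
soi = {(2012, 10): 0.5, (2012, 11): -0.2}

# The week starting Sunday 28 October 2012 has 4 days in October
# and 3 days in November.
print(round(weekly_soi(date(2012, 10, 28), soi), 3))
```

A week that falls entirely within one month simply receives that month's value.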

#### Sea Surface Temperature Anomaly

where *d*_{i} is the number of days in the week overlapping the *i*th week and *A*_{w,i} is the SSTA for that week.

#### Socio-economic and demographic data

We considered several socio-economic variables that reflected potentially relevant information. We obtained political stability data from the Worldwide Governance Indicators Project[37]. These data consisted of a single value for Peru for most years between 1996 and 2009. To obtain values for the missing years, we performed a linear interpolation. From the Peru National Institute of Statistics and Information 2007 census[38], we obtained population density and proportions with electric lighting, running water, and hygienic services. These data also included numbers of *viviendas particulares* (private dwellings), *viviendas con abastecimiento de agua* (private dwellings with running water), *viviendas con servicio higiénico* (private dwellings with toilets), and *viviendas con alumbrado eléctrico* (private dwellings with electricity) for each district. We then calculated percentages of private dwellings with running water, toilets, and electricity. Because these values were only available from the 2007 census, we used a single value for each district for all weeks.

#### Elevation

We obtained elevation data from the NOAA National Geophysical Data Center website[39]. We assigned missing data (typically for ocean locations) an elevation of zero. By averaging the elevation in 30-by-30 grids, we changed the scale from 1/120 degree to 0.25 degree resolution, consistent with the scale of rainfall data. Subsequently, we used the subcell centroid counting method (described earlier in the section entitled Rainfall) to determine a single average elevation value for each district.
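The rescaling step can be sketched as a block average. Averaging 30-by-30 blocks of 1/120 degree cells gives cells of 30 × 1/120 = 0.25 degree. The function name below is hypothetical, the example uses 2-by-2 blocks on a tiny grid to stay readable, and `None` stands for missing elevations, which are set to zero as described in the text.

```python
def block_average(grid, block=30):
    """Average a 2-D grid over non-overlapping block-by-block windows.

    Missing cells (None) are treated as elevation 0, as in the text.
    """
    rows, cols = len(grid), len(grid[0])
    coarse = []
    for r in range(0, rows, block):
        coarse_row = []
        for c in range(0, cols, block):
            cells = [
                grid[i][j] if grid[i][j] is not None else 0.0
                for i in range(r, min(r + block, rows))
                for j in range(c, min(c + block, cols))
            ]
            coarse_row.append(sum(cells) / len(cells))
        coarse.append(coarse_row)
    return coarse


# Tiny hypothetical elevation grid (metres); None marks ocean cells.
fine = [
    [100, 100, 300, 300],
    [100, 100, 300, 300],
    [None, None, 800, 800],
    [None, None, 800, 800],
]
print(block_average(fine, block=2))
```

With `block=30`, the same function would reduce a 1/120 degree grid to the 0.25 degree resolution of the rainfall data, after which the subcell centroid weights can be applied.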

### Prediction methodology

#### Overview

1. Definition of spatiotemporal resolution and data preprocessing to fit that resolution.
2. Division of the data set into disjoint training, validation and test subsets.
3. Rule extraction from training data using Fuzzy Association Rule Mining (FARM).
4. Automatic building of classifiers from the rules extracted in step 3.
5. Choice of the best classifier based on its performance on the validation data set.
6. Computation of predictions on the test data using the classifier from step 5. Computation of performance metrics.

Because different model input data come in disparate spatiotemporal scales, they were converted to one spatiotemporal scale to be used in the prediction method. The chosen temporal scale was one week and the chosen spatial distribution was one district. In step 1 all the predictor and epidemiological data were converted into this spatiotemporal scale. Details of this conversion were described earlier in the section entitled **Predictor variables**.

The second step was to divide the data set into disjoint training, validation and test subsets. FARM[40] was used on the training subset in step 3 to extract rules predicting future dengue incidence (details described later in the section entitled **Association rule mining and fuzzy association rule mining**). Step 4 involved the automatic building of classifiers from rules extracted in step 3. A separate, validation subset was used to choose the best performing classifier in step 5. Finally, in step 6, a third data subset was used to predict the dengue incidence and determine the accuracy of the method.

Rule extraction from the training data (step 3) is the most important and novel step of the whole methodology. It is performed using FARM, a set of data mining methods that automatically extract from data so-called fuzzy association rules[41]. Fuzzy association rules are of the form:

IF (X is A) → (Y is B).

where X and Y are variables, and A and B are fuzzy sets that characterize X and Y respectively. The following is a simple example of a fuzzy association rule (not actually used in the method):

IF (Temperature is HOT) AND (Humidity is HIGH) → (Energy usage is HIGH).

Fuzzy association rules are easily understood by humans because of the linguistic terms that they employ (e.g., HOT, HIGH). Fuzzy set theory[42] assigns a degree of membership between 0 and 1 (e.g., 0.4) to each element of a set, allowing for a smooth transition between full membership (degree = 1) and non-membership (degree = 0). The degree of membership in a set is generally considered to be the extent to which the corresponding fuzzy set applies. For example, if the variable is temperature and the linguistic term (fuzzy set) is HOT, then a temperature of 70 °F might have a degree of membership of 0.1 in the fuzzy set HOT, while a temperature of 80 °F might have a membership degree of 0.8 and a temperature of 100 °F might have a membership degree equal to 1.
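One way to realize such a membership function is piecewise-linear interpolation between a few breakpoints. The sketch below reproduces the three membership degrees quoted in the example (0.1 at 70 °F, 0.8 at 80 °F, 1 at 100 °F); the breakpoint at 60 °F with membership 0, the function name, and the piecewise-linear shape are assumptions made for illustration, not the membership functions used in the method.

```python
# (temperature in °F, degree of membership in HOT); 60 °F is assumed.
HOT_BREAKPOINTS = [(60.0, 0.0), (70.0, 0.1), (80.0, 0.8), (100.0, 1.0)]


def membership(x, breakpoints):
    """Piecewise-linear membership degree of x in a fuzzy set."""
    if x <= breakpoints[0][0]:
        return breakpoints[0][1]
    if x >= breakpoints[-1][0]:
        return breakpoints[-1][1]
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


for temp in (50, 70, 75, 80, 100):
    print(temp, round(membership(temp, HOT_BREAKPOINTS), 2))
```

Intermediate temperatures receive intermediate degrees, for example 0.45 at 75 °F, which is the smooth transition between non-membership and full membership that fuzzy sets provide.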

FARM extracts a large number of rules (possibly hundreds or even thousands) from a training data set. When the classifier is automatically built from those rules, the rules have only one consequent, which is the variable to be predicted (i.e., future dengue incidence). When building a classifier, a subset of rules must be chosen; the subset chosen is the one that results in the smallest misclassification error for the validation set. For building the classifier, we have extended the method of Liu et al.[43], as will be described in the section entitled Building the classifier. Certain class weights need to be assigned, and the final classifier is the one that has the lowest misclassification error on the validation set.

The final step was the computation of predictions by the classifier. The outcome variable (predicted dengue incidence) was converted to a binary variable, either HIGH or LOW dengue incidence (where the threshold between high and low values is quantitatively defined in the section entitled FARM-based method results as the mean dengue incidence + 2 standard deviations). Testing was performed on the test data and the following performance metrics were used to assess the accuracy of this prediction: Positive Predictive Value (PPV), Negative Predictive Value (NPV), sensitivity, and specificity. PPV is the proportion of HIGH predictions that were followed by an actual HIGH incidence, while NPV is the proportion of LOW predictions that were followed by an actual LOW incidence. Sensitivity is the proportion of actual HIGH incidence periods that were correctly predicted, and specificity is the proportion of actual LOW incidence periods that were correctly predicted.
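The four performance metrics follow directly from the counts of a two-class confusion matrix. In the sketch below, HIGH is treated as the positive class; the function name and the counts in the example are hypothetical and are not the study's results.

```python
def prediction_metrics(tp, fp, tn, fn):
    """PPV, NPV, sensitivity and specificity from confusion-matrix counts.

    tp: predicted HIGH, actually HIGH    fp: predicted HIGH, actually LOW
    tn: predicted LOW, actually LOW      fn: predicted LOW, actually HIGH
    """
    return {
        "ppv": tp / (tp + fp),          # correct among HIGH predictions
        "npv": tn / (tn + fn),          # correct among LOW predictions
        "sensitivity": tp / (tp + fn),  # actual HIGH that were predicted
        "specificity": tn / (tn + fp),  # actual LOW that were predicted
    }


# Hypothetical counts for illustration only.
metrics = prediction_metrics(tp=8, fp=4, tn=180, fn=8)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because outbreaks are rare, NPV and specificity tend to be high even for a weak classifier, so PPV and sensitivity are the more informative figures for the HIGH class.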

#### Details of the prediction methodology

##### Association rule mining and fuzzy association rule mining

The goal of data mining is to discover inherent and previously unknown information from data. When the knowledge discovered is in the form of association rules, the methodology is called association rule mining (ARM). An association rule describes a relationship among different attributes. Association rule mining was introduced by Agrawal et al.[44] as a way to discover interesting co-occurrences in supermarket data (the market basket analysis problem). It finds frequent sets of items (i.e., combinations of items that are purchased together in at least *N* transactions in the database), and from frequent itemsets such as {X, Y}, generates association rules of the form: X → Y and/or Y → X. A simple example of an association rule pertaining to the items that people buy together is:

IF (Bread AND Butter) → Milk

The above rule states that if a person buys bread and butter, then they also buy milk. Such rules are very useful for store managers when deciding how to group items on the shelves. Many extracted rules are obvious, like the one mentioned above. However, ARM methods extract not only well-known rules but, more importantly, novel rules unknown to Subject Matter Experts (SMEs). Those rules are often surprising to SMEs, as with the now-famous rule:

IF Diapers → Beer.

The store managers did not want to believe that there was a relationship between buying diapers and buying beer, and thought that the ARM methodology that extracted that rule from data was flawed. However, after carefully checking the store transactions, they noticed that in the evenings this rule was very prominent: when somebody was buying diapers they were also buying beer. After further investigation, they concluded that in the afternoons and evenings, moms often ask dads to buy some diapers; dads do that and reward themselves by buying beer.

A limitation of traditional association rule mining is that it only works on binary data (i.e., an item was either purchased in a transaction (1) or not (0)). In many real-world applications, data is either categorical (e.g., district name, type of public health intervention) or quantitative (e.g., rainfall, temperature, age). For numerical and categorical attributes, Boolean rules are unsatisfactory. Extensions have been proposed to operate on these data, such as quantitative association rule mining[45] and fuzzy association rule mining[41].

Fuzzy association rules are of the form:

IF (X is A) → (Y is B).

where X and Y are variables, and A and B are fuzzy sets that characterize X and Y respectively. A simple example of fuzzy association rule for a medical application is the following:

IF (Temperature is Strong Fever) AND (Skin is Yellowish) AND (Loss of appetite is Profound) → (Hepatitis is Acute).

More formally, let D = {t_{1}, t_{2}, …, t_{n}} be the transaction database and let t_{i} represent the i^{th} transaction in D. Let I = {i_{1}, i_{2}, …, i_{m}} be the universe of items. A set *X* ⊆ *I* of items is called an itemset. When X has k elements, it is called a k-itemset. An association rule is an implication of the form X → Y, where *X* ⊂ *I*, *Y* ⊂ *I* and *X* ∩ *Y* = ∅.

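Three metrics are commonly used to evaluate association rules: *Support*, *Confidence* and *Lift*, and we will be using these metrics in the method’s development. The *support* of an itemset X is defined as:

$$\mathrm{supp}(X) = \frac{n_{x}}{n} = p_{x}$$

where n_{x} is the number of records with X, n is the total number of records in D, and p_{x} is the associated probability. The support of a rule (X → Y) is defined as:

$$\mathrm{supp}(X \rightarrow Y) = \frac{n_{xy}}{n} = p_{xy}$$

where n_{xy} is the number of records with X and Y, and p_{xy} is the associated probability.

The *confidence* of a rule (X → Y) is defined as:

$$\mathrm{conf}(X \rightarrow Y) = \frac{p_{xy}}{p_{x}} = \frac{n_{xy}}{n_{x}}$$

Confidence can be treated as the conditional probability (P(Y|X)) that a transaction containing X also contains Y. A high confidence value suggests a strong association rule. However, this can be deceptive. For example, if the antecedent (X) or consequent (Y) has a high support, the rule could have a high confidence even if X and Y were independent. This is why the measure of lift was suggested as a useful metric.

The *lift* of a rule (X → Y) measures the deviation from independence of X and Y:

$$\mathrm{lift}(X \rightarrow Y) = \frac{p_{xy}}{p_{x}\,p_{y}}$$

where p_{y} is the probability associated with the records containing Y.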

A lift greater than 1.0 indicates that transactions containing the antecedent (X) tend to contain the consequent (Y) more often than transactions that do not contain the antecedent (X). The higher the lift, the more likely that the existence of X and Y together is not just a random occurrence, but rather due to the relationship between them.
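The three metrics can be computed by counting transactions. The sketch below uses the standard crisp (non-fuzzy) definitions: support is the fraction of transactions containing an itemset, confidence is support(X and Y) divided by support(X), and lift is confidence divided by support(Y). The function names and the toy market-basket transactions are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Estimated P(Y | X) for the rule X -> Y."""
    return support(x | y, transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Deviation from independence: P(X and Y) / (P(X) * P(Y))."""
    return confidence(x, y, transactions) / support(y, transactions)


# Toy transactions, for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread"},
]
x, y = {"bread", "butter"}, {"milk"}
print(round(support(x | y, baskets), 3))
print(round(confidence(x, y, baskets), 3))
print(round(lift(x, y, baskets), 3))
```

In this toy data the rule (Bread AND Butter) → Milk has a lift only slightly above 1, so milk is bought with bread and butter barely more often than it is bought overall.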

#### Building the classifier

FARM extracts a large set of rules from the training data. For the disease prediction application, the rules of interest are called class association rules (CARs), meaning that they have only one consequent - the class. An example of a CAR extracted by FARM is:

IF (Past_Incidence_Rate_T-1 is HIGH) AND (Past_Incidence_Rate_T-5 is HIGH) AND (Rainfall_T-3 is LARGE) → (Predicted_Incidence_Rate_T+4 is HIGH), confidence = 0.95, support = 0.01, lift = 5.3.

The question is which rules from the hundreds extracted by FARM to use in the final classifier and in which sequence to use them. When building the classifier, we first employed the method of Liu et al.[43]. Let *R* be the set of generated rules and *D* be the training data. The basic idea of the algorithm is to choose a subset of rules from *R* to cover all the training examples (*D*). The classifier will have the following format: <r_{1}, r_{2}, …, r_{m}, default class>. Default class is the one into which a case will be classified, if none of the rules satisfies it. The order of the rules in the classifier is important and in classifying a case, the first rule that satisfies it will classify it.

The algorithm has the following steps:

Step 1: Sort the rules in *R* by:

1. Confidence (from highest to lowest);
2. Support (from highest to lowest);
3. Number of antecedents (from lowest to highest).

Step 2: Select the rules for the classifier from *R*. For a given rule *r*, find cases in *D* that are covered by *r* (i.e. they satisfy the conditions of *r*). Remove from *D* the cases covered. Compute the number of errors that the rule makes and add the rule to the classifier (*C*). A default class is also selected – this is the majority class in the remaining data in *D*. When there is no rule or no training case left, then the rule selection process is completed.

Step 3: Discard the rules from *C* that do not improve the accuracy of the classifier.

We extended this algorithm in two ways:

1. The rules are ordered first by confidence, then by lift, and finally by the number of antecedents.
2. The misclassification error is weighted. The user has the opportunity to give a much higher weight for misclassifying the cases that should be HIGH than those that should be LOW. We used 10 and 1, respectively.
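The rule-selection procedure can be sketched in simplified, crisp (non-fuzzy) form. The code below orders rules by confidence, lift and number of antecedents, adds each rule that correctly covers at least one remaining training case, tracks the weighted error of each growing rule list plus its default class, and finally keeps the prefix with the lowest weighted error. It is an illustrative approximation of the approach of Liu et al. with the weighting described above, not the authors' implementation; all names, rules and cases are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedents: frozenset
    consequent: str
    confidence: float
    lift: float


def build_classifier(rules, cases, weights):
    """Return (ordered rule list, default class) with lowest weighted error."""
    ordered = sorted(
        rules, key=lambda r: (-r.confidence, -r.lift, len(r.antecedents))
    )
    fallback = Counter(label for _, label in cases).most_common(1)[0][0]
    remaining, chosen, stats, rule_error = list(cases), [], [], 0
    for rule in ordered:
        if not remaining:
            break
        covered = [c for c in remaining if rule.antecedents <= c[0]]
        if not any(label == rule.consequent for _, label in covered):
            continue  # rule classifies no remaining case correctly
        chosen.append(rule)
        # Weighted errors made by this rule on the cases it covers.
        rule_error += sum(
            weights[label] for _, label in covered if label != rule.consequent
        )
        remaining = [c for c in remaining if not rule.antecedents <= c[0]]
        labels = Counter(label for _, label in remaining)
        default = labels.most_common(1)[0][0] if labels else fallback
        default_error = sum(
            weights[label] for _, label in remaining if label != default
        )
        stats.append((rule_error + default_error, default))
    if not chosen:
        return [], fallback
    best = min(range(len(stats)), key=lambda i: stats[i][0])
    return chosen[: best + 1], stats[best][1]


def classify(items, classifier):
    """The first rule whose antecedents are all present decides the class."""
    rules, default = classifier
    for rule in rules:
        if rule.antecedents <= items:
            return rule.consequent
    return default


WEIGHTS = {"HIGH": 10, "LOW": 1}  # cost of misclassifying each true class
RULES = [
    Rule(frozenset({"rain_large"}), "HIGH", 0.60, 3.0),
    Rule(frozenset({"past_low"}), "LOW", 0.90, 1.2),
    Rule(frozenset({"past_high", "rain_large"}), "HIGH", 0.95, 5.3),
]
CASES = [
    ({"past_high", "rain_large"}, "HIGH"),
    ({"past_high", "rain_large"}, "HIGH"),
    ({"past_low", "rain_small"}, "LOW"),
    ({"past_low", "rain_large"}, "LOW"),
    ({"past_low", "rain_small"}, "LOW"),
    ({"past_high", "rain_small"}, "LOW"),
]
classifier = build_classifier(RULES, CASES, WEIGHTS)
print(len(classifier[0]), classifier[1])
print(classify({"past_high", "rain_large"}, classifier))
```

In this toy example a single rule plus the default class LOW already classifies every training case correctly, so the later rules are discarded. With weights of 10 and 1, a rule list that misses one HIGH case is penalized as heavily as one that mislabels ten LOW cases.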

## Results

### FARM-based method results

An example rule extracted by FARM from the data is:

IF (Past_Incidence_Rate_T-1 is HIGH) AND (Past_Incidence_Rate_T-5 is HIGH) AND (Rainfall_T-3 is LARGE) → (Predicted_Incidence_Rate_T+4 is HIGH), confidence = 0.95

The above rule states that if the dengue incidence rate a week ago (T-1) was HIGH, the dengue incidence rate five weeks ago (T-5) was HIGH, and the rainfall three weeks ago was LARGE, then the predicted dengue incidence rate in four weeks (T+4) will be HIGH. Each extracted rule has an associated confidence that measures the conditional probability that if the left hand side of the rule was true, the right hand side is also true.

Three different prediction models (classifiers) were automatically built using the methodology developed. The first two are weekly, i.e. we predicted either HIGH or LOW dengue incidence for a given future week (T+3 and T+4). The third prediction encompassed a four-week period, specifically four to seven weeks from time of prediction (T+4 to T+7). This is a single prediction for whether dengue incidence rate will be LOW or HIGH over the entire four-week period.

In order for a dengue incidence prediction to fall exclusively into one class (LOW or HIGH), we needed to set the threshold between LOW and HIGH. For weekly data, this was achieved by computing the mean (0.103) and standard deviation (0.175) of past weekly incidences. The threshold between LOW and HIGH was set at mean + 2 standard deviations (rounded to 0.45). For 4–week data, the mean was 0.343 and standard deviation was 0.583. The threshold between LOW and HIGH was set at mean + 2 standard deviations (rounded to 1.5).
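The threshold arithmetic can be checked directly from the reported means and standard deviations. The function names are hypothetical, and treating an incidence exactly equal to the threshold as LOW is an assumption, since the text only specifies values above and below the threshold.

```python
def outbreak_threshold(mean, sd, n_sd=2):
    """Threshold between LOW and HIGH: mean + n_sd standard deviations."""
    return mean + n_sd * sd


def incidence_class(incidence, threshold):
    """HIGH if the incidence is above the threshold, otherwise LOW."""
    return "HIGH" if incidence > threshold else "LOW"


weekly = outbreak_threshold(0.103, 0.175)     # 0.453 before rounding
four_week = outbreak_threshold(0.343, 0.583)  # 1.509 before rounding
print(round(weekly, 2), round(four_week, 1))
print(incidence_class(0.60, round(weekly, 2)))
```

The rounded values, 0.45 for weekly data and 1.5 for four-week data, match the thresholds stated above.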

**Variables used**

| Weekly prediction, 3 weeks ahead | Weekly prediction, 4 weeks ahead | 4-week prediction, 4–7 weeks ahead |
|---|---|---|
| Past Incidence Rate (T-12, T-11, …, T-1, T) | Past Incidence Rate (T-12, T-11, …, T-1) | Past Incidence Rate (T-12_T-9, T-8_T-5, T-4_T-1) |
| Rainfall (T-12, T-11, …, T-1, T) | Rainfall (T-12, T-11, …, T-1) | Rainfall (T-12_T-9, T-8_T-5, T-4_T-1) |
| NDVI (T-12, T-11, …, T-1, T) | NDVI (T-12, T-11, …, T-1) | NDVI (T-12_T-9, T-8_T-5, T-7_T-4) |
| SOI (T) | SSTA (T-4, T-3, T-2, T-1) | EVI (T-12_T-9, T-8_T-5, T-7_T-4) |
| Week number | Week number | SSTA (T-12, T-11, …, T-1) |
| | | SOI (T-12_T-9, T-9_T-6) |
| | | Temperature Day (T-12_T-9, T-8_T-5, T-5_T-2) |
| | | Temperature Night (T-12_T-9, T-8_T-5, T-5_T-2) |
| | | Week number |
| | | Running water |
| | | Sanitation |
| | | Electric lighting |

The weekly T+3 (i.e., three weeks in advance) incidence prediction achieved a PPV of 0.667, an NPV of 0.983, a sensitivity of 0.593, and a specificity of 0.987 on the test data set. This means that if the method predicted there would be a HIGH dengue incidence three weeks in the future, a HIGH dengue incidence occurred 66.7% of the time, and 59.3% of the total HIGH dengue incidence rates three weeks in the future were captured (sensitivity). Similarly, if the method predicted a LOW dengue incidence three weeks in the future, a LOW dengue incidence occurred 98.3% of the time, and 98.7% of the total LOW dengue incidence rates three weeks in the future were captured (specificity). The weekly T+4 predictions were slightly less accurate, with a PPV of 0.556, an NPV of 0.973, a sensitivity of 0.469, and a specificity of 0.981.

**Examples of the 166 chosen rules from the classifier for predicting 4–7 weeks ahead**

Rule # | Antecedent 1 | Antecedent 2 | Antecedent 3 | Consequent | Confidence | Support | Lift |
---|---|---|---|---|---|---|---|
1 | Week_26-29 | PastIncidenceRateT-4_T-1_Low | PastIncidenceRateT-8_T-5_Low | PredictedIncidenceRateT+4_T+7_Low | 1.0 | 0.0454 | 1.29 |
2 | Week_26-29 | PastIncidenceRateT-8_T-5_Low | ElectricLighting_High | PredictedIncidenceRateT+4_T+7_Low | 1.0 | 0.0402 | 1.29 |
… | … | … | … | … | … | … | … |
46 | Week_42-45 | PastIncidenceRateT-4_T-1_High | SSTAT-10_High | PredictedIncidenceRateT+4_T+7_High | 1.0 | 0.0047 | 18.83 |
… | … | … | … | … | … | … | … |
58 | NDVIT-8_T-5_Med | Sanitation_High | SSTAT-2_High | PredictedIncidenceRateT+4_T+7_Low | 0.998 | 0.0122 | 1.29 |
… | … | … | … | … | … | … | … |
166 | Week_50-53 | RainfallT-12_T-9_Med | SOIT-12_T-9_High | PredictedIncidenceRateT+4_T+7_High | 0.553 | 0.0038 | 10.41 |

IF (NDVI_T-8_T-5 is MED) AND (Sanitation is HIGH) AND (SSTA_T-2 is HIGH) → (Predicted_Incidence_Rate_T+4_T+7 is LOW), confidence = 0.998, support = 0.0038, lift = 1.29

### Logistic regression results

In order to compare the results of the novel technique proposed with those of an established method, we compared FARM results with those of a method often used by epidemiologists: logistic regression (LR). We used the same input data for the LR models as for the FARM methods (see Table 3), for both weekly and four-week interval predictions.

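LR models the probability of a binary outcome Y as a function of predictor variables X_{1} through X_{k}. The form of the logistic model formula is:

$$\ln\left(\frac{p}{1-p}\right) = B_{0} + B_{1}X_{1} + \dots + B_{k}X_{k}$$

where p is the probability that Y is 1, B_{0} is a constant (called the intercept), and B_{1} through B_{k} are the coefficients for the predictor variables X_{1} through X_{k}.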

In our application, the LR result gives the probability that a HIGH incidence rate will occur. Specifically, if the estimate exceeds a predefined threshold (0.5 in our work), the model predicts a HIGH incidence rate; otherwise, a HIGH is not predicted.
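Turning a fitted logistic model into a HIGH/LOW prediction can be sketched as follows. The sketch uses the standard logistic function applied to the linear combination of predictors and the 0.5 threshold mentioned above. The function names, the coefficients and the predictor values are hypothetical and chosen only to make the arithmetic easy to follow; they are not the fitted coefficients.

```python
import math


def predict_probability(intercept, coefficients, x):
    """Probability of HIGH from a logistic model.

    `coefficients` and `x` map predictor names to B_j and X_j.
    """
    z = intercept + sum(coefficients[name] * x[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(probability, threshold=0.5):
    """HIGH only if the estimated probability exceeds the threshold."""
    return "HIGH" if probability > threshold else "LOW"


# Hypothetical model and inputs, for illustration only.
intercept = -3.0
coefficients = {"past_incidence_t1": 4.0, "rainfall_t1": 0.01}
week = {"past_incidence_t1": 1.0, "rainfall_t1": 50.0}

p = predict_probability(intercept, coefficients, week)
print(round(p, 3), predict_class(p))
```

Here the linear term is −3 + 4 + 0.5 = 1.5, which corresponds to a probability of about 0.82 and therefore a HIGH prediction.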

**LR coefficients obtained for weekly predictions 4 weeks ahead (T+4)**

Variable | LR Coefficient |
---|---|
Intercept | 17.37 |
Week | −0.01 |
PastIncidenceRate T-1 | −2.30 |
PastIncidenceRate T-2 | −3.33 |
PastIncidenceRate T-3 | −1.51 |
PastIncidenceRate T-4 | −0.22 |
PastIncidenceRate T-5 | 0.30 |
PastIncidenceRate T-6 | 0.46 |
PastIncidenceRate T-7 | 2.49 |
PastIncidenceRate T-8 | −0.09 |
PastIncidenceRate T-9 | 1.19 |
PastIncidenceRate T-10 | −5.89 |
PastIncidenceRate T-11 | 1.19 |
PastIncidenceRate T-12 | 0.64 |
Rainfall T-1 | 0.01 |
Rainfall T-2 | 0.01 |
Rainfall T-3 | 0.00 |
Rainfall T-4 | 0.00 |
Rainfall T-5 | 0.00 |
Rainfall T-6 | −0.01 |
Rainfall T-7 | −0.01 |
Rainfall T-8 | 0.00 |
Rainfall T-9 | 0.00 |
Rainfall T-10 | 0.00 |
Rainfall T-11 | 0.00 |
Rainfall T-12 | −0.01 |
NDVI T-4 | 0.20 |
NDVI T-5 | 3.83 |
NDVI T-6 | −5.59 |
NDVI T-7 | 1.86 |
NDVI T-8 | −3.45 |
NDVI T-9 | 1.84 |
NDVI T-10 | −5.75 |
NDVI T-11 | −1.99 |
NDVI T-12 | −5.21 |
SSTA T-1 | −2.11 |
SSTA T-2 | 1.21 |
SSTA T-3 | −0.79 |
SSTA T-4 | 1.16 |

## Discussion

A truly rigorous predictive method must satisfy two requirements. The first is that the method cannot be developed and tested on the same data. If it is, a high value of a performance metric such as R^{2} reveals nothing about the accuracy to expect on previously unseen data: even R^{2}=1 on the development data does not guarantee good predictions on data that were not used to develop the model.
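A deliberately extreme sketch makes the point concrete. The "model" below simply memorizes its development data and answers with the nearest stored value, so it scores R^{2}=1 on the data it was built from while doing worse on fresh data drawn from the same process. The data-generating process, the memorizing model, and all names are hypothetical and are not part of the method described in this paper.

```python
import random


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_memorizer(xs, ys):
    """Return a predictor that replays the y of the nearest stored x."""
    table = list(zip(xs, ys))

    def predict(x):
        return min(table, key=lambda pair: abs(pair[0] - x))[1]

    return predict


rng = random.Random(0)


def noisy_sample(n):
    """Hypothetical process: a linear trend plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


dev_x, dev_y = noisy_sample(30)        # data used to build the model
unseen_x, unseen_y = noisy_sample(30)  # data the model has never seen

model = fit_memorizer(dev_x, dev_y)
dev_r2 = r_squared(dev_y, [model(x) for x in dev_x])
unseen_r2 = r_squared(unseen_y, [model(x) for x in unseen_x])
```

Here `dev_r2` is exactly 1 because every development point retrieves its own stored value, whereas `unseen_r2` is lower because the memorized noise does not carry over to new observations.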

The second requirement is that every predictor variable be collected for an earlier time period (e.g., a week) than the period for which outbreaks are predicted. This keeps the prediction realistic, because the values of all predictor variables can be obtained before the prediction is made. Methods that use variables at time T to predict another variable at the same time T (i.e., zero time lag) do not perform a useful prediction, because prediction means using past or currently available data to describe a future event.
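One way to enforce this requirement when assembling a data set is to build each row from lagged predictor values and a target that lies strictly in the future. The sketch below is a minimal illustration under that assumption; the function name, the single-series input, and the toy incidence values are hypothetical.

```python
def build_lagged_rows(series, lags, horizon):
    """Pair the target at week t + horizon with predictors from weeks t - lag.

    series: list of weekly values, indexed by week number.
    lags: non-negative offsets into the past (1 means week t - 1).
    horizon: number of weeks between week t and the target week (>= 1).
    Returns a list of (t, predictors, target) tuples.
    """
    if horizon < 1 or min(lags) < 0:
        raise ValueError("predictors must precede the target week")
    rows = []
    for t in range(max(lags), len(series) - horizon):
        predictors = [series[t - lag] for lag in lags]
        rows.append((t, predictors, series[t + horizon]))
    return rows


# Hypothetical weekly incidence values.
weekly_incidence = [3, 5, 4, 8, 12, 20, 15, 9, 6, 4]
rows = build_lagged_rows(weekly_incidence, lags=[1, 2, 3], horizon=4)
```

With `horizon=4`, the row built at week 3 uses the values from weeks 2, 1, and 0 to predict week 7, so no predictor shares a time step with its target.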

When designing a prediction method that learns from data, machine learning scientists divide the data set with great care. Simply dividing it into training data (to develop the model) and testing data (to test the model and report performance) is considered insufficient. The data should instead be divided into three subsets: training, validation, and testing [46]. The training subset is used to develop the model. Models are usually not parameter-free; they have parameters that the developers can adjust, and the best model is the one with the lowest error on the validation subset. For example, in a feed-forward neural network, the number of hidden layers and the number of neurons in each hidden layer must be chosen; when several such networks are compared, the best is the one with the smallest validation error. Once the best model is chosen, it becomes the final version, and its performance is assessed on the previously unseen testing subset.
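The three-subset workflow can be sketched with a small stand-in model. The example below uses a one-dimensional k-nearest-neighbour classifier, whose tunable parameter k plays the role of the network size in the example above; it is not the FARM classifier. The chronological 60/20/20 split, the synthetic data, and all names are assumptions made for illustration.

```python
import random


def chronological_split(rows, train_frac=0.6, validation_frac=0.2):
    """Split time-ordered rows into training, validation, and testing subsets."""
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * validation_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def knn_predict(train, x, k):
    """Majority vote among the k training rows whose predictor is nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes_for_high = sum(1 for _, is_high in nearest if is_high)
    return 2 * votes_for_high > k


def error_rate(train, rows, k):
    """Fraction of rows misclassified by a k-NN model built on train."""
    wrong = sum(1 for x, is_high in rows if knn_predict(train, x, k) != is_high)
    return wrong / len(rows)


# Hypothetical data: one predictor and a noisy HIGH/LOW label per week.
rng = random.Random(1)
rows = []
for _ in range(60):
    x = rng.uniform(0, 10)
    rows.append((x, x + rng.gauss(0, 1.5) > 5))

train, validation, test = chronological_split(rows)

# The tunable parameter k is chosen on the validation subset only.
candidate_ks = [1, 3, 5, 7]
best_k = min(candidate_ks, key=lambda k: error_rate(train, validation, k))

# The testing subset is used once, after the model is final.
test_error = error_rate(train, test, best_k)
```

Because `best_k` is selected without looking at `test`, `test_error` is an estimate of performance on data that influenced neither the fitting nor the model selection.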

It is important to note that our method uses only input data that are actually available before the model is run at a given time. For example, for the Temperature variable, we ignored data within two weeks of the current week to avoid using values that would not yet be available when the prediction is made. Temperature is provided in 8-day intervals; if such an interval ends on a Sunday, none of the Temperature values for the preceding week are available, because the interval from which some of those values are derived is not released until the following Monday.
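A simple availability filter of this kind can be sketched as follows. The sketch assumes that each 8-day composite is identified by its start date and that a composite is usable only if its whole interval ended at least 14 days before the prediction date; this is a simplification of the two-week rule described above, and the function name, dates, and values are hypothetical.

```python
from datetime import date, timedelta


def available_composites(composites, prediction_date,
                         safety_gap_days=14, interval_days=8):
    """Keep composites whose interval ended safety_gap_days or more before prediction_date.

    composites: list of (start_date, value) pairs, one per 8-day interval.
    """
    cutoff = prediction_date - timedelta(days=safety_gap_days)
    usable = []
    for start, value in composites:
        end = start + timedelta(days=interval_days - 1)  # last day covered
        if end <= cutoff:
            usable.append((start, value))
    return usable


# Hypothetical composites starting on 1 January 2010, one every 8 days.
composites = [(date(2010, 1, 1) + timedelta(days=8 * i), 25.0 + i) for i in range(6)]
usable = available_composites(composites, prediction_date=date(2010, 2, 15))
```

In this toy case the cutoff is 1 February 2010, so the four composites ending on or before that day are kept and the two most recent ones are excluded.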

Disease outbreak detection differs from prediction in that evidence of the incipient outbreak is already present, though not yet obvious, when it is first detected, and a response should begin immediately. Although it may complement disease detection, the work presented here differs in that it can be used when no outbreak is currently present, and it predicts whether or not an outbreak may occur at a specific time in the future. The response to such a prediction may include planning as well as mitigation activities. Our method is designed to produce a dengue outbreak prediction four to seven weeks in advance, thereby giving public health officials more time to intervene and perhaps mitigate the impact of an outbreak. Discussions with our Peruvian collaborators revealed that this response timeline is reasonable for their public health departments. They did caution, however, that it is important to have a method with few false alarms because funding for public health interventions is limited. The metric of greatest importance to these public health practitioners is therefore PPV, with specificity second in priority.
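For reference, PPV and specificity, together with NPV and sensitivity, follow directly from the counts of true and false HIGH and LOW predictions. The sketch below uses the standard definitions; the counts are hypothetical and are not the confusion matrix from this study.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard metrics from a 2x2 confusion matrix (HIGH is the positive class)."""
    return {
        "ppv": tp / (tp + fp),          # predicted HIGH that were truly HIGH
        "npv": tn / (tn + fn),          # predicted LOW that were truly LOW
        "sensitivity": tp / (tp + fn),  # true HIGH that were predicted HIGH
        "specificity": tn / (tn + fp),  # true LOW that were predicted LOW
    }


# Hypothetical counts for 100 district-periods.
metrics = screening_metrics(tp=8, fp=2, tn=85, fn=5)
```

A false alarm is a false positive, so reducing false alarms raises both PPV and specificity, which is why these two metrics matter most when intervention budgets are limited.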

Our method has several weaknesses. First, it has no input variables that directly measure vector behavior, e.g., mosquito biting behavior or the prevalence of dengue virus in the vector. This information is important, yet it is expensive and labor-intensive to obtain and can be collected only for small areas over short time periods. We do not use such variables because we do not have access to these data for Peru. Second, because the socio-political and sanitation data are available only as annual updates, our method cannot predict the effects of concentrated sanitation programs such as house-to-house efforts to remove vector breeding containers. Third, people’s travel patterns could be useful for prediction: if people were traveling from a district with an ongoing outbreak, these data could be important predictors. Finally, we do not have historical data describing the serotypes present, even though existing dogma supports a role for pre-existing immunity to a serotype with which a person was previously infected. Because the methodology developed here automatically extracts rules from existing data, the weaknesses described above can be overcome should the data become available.

## Conclusions

The method described above was developed to use local and remote sensing data to predict dengue outbreaks with high PPV and NPV, where an outbreak is defined as dengue incidence above the long-term mean of previous dengue incidence plus two standard deviations. Effective methodologies to predict disease outbreaks may allow preventive interventions to avert large epidemics. For best results, researchers must have access to data streams with timely, detailed, and accurate values of the predictor variables. Model validation is of paramount importance, as health officials would be unlikely to spend resources on mitigation efforts based on model predictions without evidence of accuracy on past outbreaks. The input variables used in our model are widely available for most, if not all, countries. Although additional local data, such as mosquito biting activity or the percentage of mosquitoes carrying dengue virus, might improve the accuracy of our method, such data are generally difficult and expensive to obtain. The use of widely available data enhances the generalizability of our method.
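The outbreak definition can be written as a short calculation. The sketch below assumes the population standard deviation and a single list of past incidence values; which standard deviation was used, and over what window, is not specified here, so both choices are assumptions, as are the function names and the toy data.

```python
from statistics import mean, pstdev


def outbreak_threshold(past_incidence):
    """Mean of previous incidence plus two (population) standard deviations."""
    return mean(past_incidence) + 2 * pstdev(past_incidence)


def label_incidence(value, threshold):
    """HIGH (outbreak) if incidence is above the threshold, otherwise LOW."""
    return "HIGH" if value > threshold else "LOW"


# Hypothetical past incidence rates: mean 5, population standard deviation 2.
past_incidence = [2, 4, 4, 4, 5, 5, 7, 9]
threshold = outbreak_threshold(past_incidence)
labels = [label_incidence(v, threshold) for v in (6, 12)]
```

For these toy values the threshold is 9, so an incidence of 6 is labeled LOW and an incidence of 12 is labeled HIGH.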

## Disclaimer

The views expressed here are the opinions of the authors and are not to be construed as official or as representing the views of the US Department of the Navy or the US Department of Defense.

## Declarations

### Acknowledgements

We gratefully acknowledge the support of the Ministerio de Salud del Peru, especially Dr. Omar Napanga, for the dengue case surveillance data and many helpful discussions. We are also indebted to Dr. Carlos Sanchez and Dr. Joel Montgomery of the US Naval Medical Research Center, Lima, Peru, and CAPT Clair Witt of the US Armed Forces Health Surveillance Center. Funding for this study was provided by the US Department of Defense Global Emerging Infections Surveillance and Response System, a division of the US Armed Forces Health Surveillance Center.

## References

1. Focks DA, Daniels E, Haile DG, Keesling JE: A simulation model of the epidemiology of urban dengue fever: literature analysis, model development, preliminary validation, and samples of simulation results. Am J Trop Med Hyg. 1995, 53 (5): 489-506.
2. Gibbons RV, Vaughn DW: Dengue: an escalating problem. BMJ. 2002, 324 (7353): 1563-1566. 10.1136/bmj.324.7353.1563.
3. TDR/WHO: Dengue Guidelines for Diagnosis, Treatment, Prevention and Control. 2009, Geneva: World Health Organization. Available: http://whqlibdoc.who.int/publications/2009/9789241547871_eng.pdf. Accessed 26 October 2012.
4. Ranjit S, Kissoon N: Dengue hemorrhagic fever and shock syndromes. Pediatr Crit Care Med. 2010, 12 (1): 90-100.
5. Guzman MG, Kouri G: Dengue: an update. Lancet Infect Dis. 2002, 2: 33-42. 10.1016/S1473-3099(01)00171-2.
6. Halsted SB: Dengue. Lancet Infect Dis. 2007, 370: 1644-1652.
7. US Centers for Disease Control and Prevention: Locally Acquired Dengue – Key West, Florida 2009–2010. MMWR. 2010, 59 (19): 577-581.
8. Control of Communicable Diseases Manual. Edited by: Heymann DL. 2008, Washington DC: American Public Health Association, 19.
9. Tan GK, Alonso S: Pathogenesis and prevention of dengue virus infection: state of the art. Curr Opin Infect Dis. 2009, 22 (3): 302-308. 10.1097/QCO.0b013e328329ae32.
10. Fuller DO, Troyo A, Beier JC: El Nino Southern Oscillation and vegetation dynamics as predictors of dengue fever cases in Costa Rica. Environ Res Lett. 2009, 4: 014011. 10.1088/1748-9326/4/1/014011.
11. Choudhury MAHZ, Banu S, Islam MA: Forecasting dengue incidence in Dhaka, Bangladesh: a time series analysis. DF Bulletin. 2008, 32: 29-37.
12. Wongkoon S, Jaroensutasinee M, Jaroensutasinee K: Predicting DHF incidence in northern Thailand using time series analysis technique. World Academy of Science, Engineering and Technology. 2007, 32: 216-220.
13. Cummings DAT, Irizarry RA, Huang NE, Endy TP, Nisalak A, Ungchusak K, Burke DS: Travelling waves in the occurrence of dengue haemorrhagic fever in Thailand. Nature. 2004, 427: 344-347. 10.1038/nature02225.
14. Bhandari KP, Raju PLN, Sokhi BS: Application of GIS modeling for dengue fever prone area based on socio-cultural and environmental factors – a case study of Delhi city zone. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XXXVII, Part B8. 2008, 165-170. Beijing, China.
15. Husin NA, Salim N, Ahmad AR: Modeling of dengue outbreak prediction in Malaysia: a comparison of neural network and nonlinear regression model. IEEE International Symposium on Information Technology (ITSim 2008). 2008, 3: 1-4. Kuala Lumpur, Malaysia.
16. Cetiner BG, Sari M, Aburas HM: Recognition of dengue disease patterns using artificial neural networks. 5th International Advanced Technologies Symposium (IATS’09). 2009, Karabuk, 359-362.
17. Rachata N, Charoenkwan P, Yooyativong T, Camnongthai K, Lursinsap C, Higachi K: Automatic prediction system of dengue haemorrhagic-fever outbreak risk by using entropy and artificial neural network. IEEE International Symposium on Communications and Information Technologies (ISCIT 2008). 2008, 210-214. 21–23 October 2008, Lao.
18. Fu X, Liew C, Soh H, Lee G, Hung T, Ng L-C: Time-series infectious disease data analysis using SVM and genetic algorithm. 2007 IEEE Congress on Evolutionary Computation (CEC 2007). 2007, 1276-1280. Singapore, 25–28 September 2007.
19. Halide H, Ridd P: A predictive model for dengue hemorrhagic fever epidemics. International J Environ Health Research. 2008, 4: 253-265.
20. Honório NA, Nogueira RMR, Codeco CT, Carvalho MS, Cruz OG, Magalhaes MAFM, de Araujo JMG, de Araujo ESM, Gomes MQ, Pinheiro LS, Pinel CS, de Oliveira L: Spatial evaluation and modeling of dengue seroprevalence and vector density in Rio de Janeiro, Brazil. PLoS Negl Trop Dis. 2009, 3: e545.
21. Barbazan P, Guiserix M, Boonyuan W, Tuntaprasart W, Pontier D, Gonzalez J-P: Modelling the effect of temperature on transmission of dengue. Medical and Veterinary Entomology. 2010, 24: 66-73. 10.1111/j.1365-2915.2009.00848.x.
22. Kummerow C, Barnes W, Kozu T, Shiue J, Simpson J: The Tropical Rainfall Measuring Mission (TRMM) sensor package. J. Atmos. Ocean. Technol. 1998, 15: 809-817. 10.1175/1520-0426(1998)015<0809:TTRMMT>2.0.CO;2.
23. Soebiyanto RP, Adimi F, Kiang RK: Modeling and predicting seasonal influenza transmission in warm regions using climatological parameters. PLoS One. 2010, 5: e9450. 10.1371/journal.pone.0009450.
24. Anyamba A, Linthicum KJ, Mahoney R, Tucker CJ, Kelley PW: Mapping potential risk of Rift Valley Fever outbreaks in African savannas using vegetation index time series data. Photogramm. Engr. Remote Sens. 2002, 68: 137-145.
25. Hales S, Weinstein P, Souares Y, Woodward A: El Nino and the dynamics of vectorborne disease transmission. Environ Health Perspect. 1999, 107: 99-102.
26. Johansson MA, Cummings DAT, Glass GE: Multiyear climate variability and dengue – El Nino Southern Oscillation, weather, and dengue incidence in Puerto Rico, Mexico, and Thailand: a longitudinal data analysis. PLoS Med. 2009, 6: e1000168. 10.1371/journal.pmed.1000168.
27. Hu W, Clements A, Williams G, Tong S: Dengue fever and El Nino/Southern Oscillation in Queensland, Australia: a time series predictive model. Occup Environ Med. 2010, 67: 307-311. 10.1136/oem.2008.044966.
28. US Centers for Disease Control and Prevention: MMWR Weeks. 2012. Available: http://www.cdc.gov/nndss/document/MMWR_Week_overview.pdf Accessed 26 October 2012.
29. US National Aeronautics and Space Administration Goddard Earth Sciences Data and Information Services Center: Mirador Earth Science Data Search Tool. 2012. Available: http://mirador.gsfc.nasa.gov/ Accessed 26 October 2012.
30. Mathworks: MATLAB. Available: http://www.mathworks.com/products/matlab/index.html Accessed 26 October 2012.
31. US Geological Survey: Land Processes Distributed Active Archive Center. Available: https://lpdaac.usgs.gov/get_data Accessed 26 October 2012.
32. Kalluri S, Gilruth P, Rogers D, Szczur M: Surveillance of arthropod vector-borne infectious diseases using remote sensing techniques: a review. PLoS Pathog. 2007, 3 (10): e116. 10.1371/journal.ppat.0030116.
33. Ferreira NC, Ferreira LG, Huete AR: Assessing the response of the MODIS vegetation indices to landscape disturbance in the forested areas of the legal Brazilian Amazon. International J Remote Sensing. 31 (3): 745-759.
34. Xiao X, Zhang Q, Saleska S, Hutyra L, De Camargo P, Wofsy S, Frolking S, Boles S, Keller M, Moore B: Satellite-based modeling of gross primary production in a seasonally moist tropical evergreen forest. Remote Sensing Environ. 2005, 94: 105-122. 10.1016/j.rse.2004.08.015.
35. Climate and Global Dynamics Section, US National Center for Atmospheric Research, University Corporation for Atmospheric Research: Southern Oscillation Index Data. Available: http://www.cgd.ucar.edu/cas/catalog/climind/SOI.signal.ascii Accessed 26 October 2012.
36. Global Change Master Directory, US National Aeronautics and Space Administration Goddard Space Flight Center: Monthly and Weekly Nino 3.4 Region SST Index: East Central Tropical Pacific. 2012. Available: http://gcmd.nasa.gov/records/GCMD_NOAA_NWS_CPC_NINO34.html Accessed 26 October 2012.
37. The World Bank Group: Worldwide Governance Indicators. 2012. Available: http://info.worldbank.org/governance/wgi/index.asp Accessed 26 October 2012.
38. Peru National Institute of Statistics and Information. 2007, Peru: Census. Available: http://www.inei.gob.pe/ Accessed 26 October 2012.
39. US National Oceanic and Atmospheric Administration, National Geophysical Data Center: Topographic and Digital Terrain Data. Available: http://www.ngdc.noaa.gov/cgi-bin/mgg/ff/nph-newform.pl/mgg/topo/ Accessed 26 October 2012.
40. Buczak AL, Gifford CM: Fuzzy association rule mining for community crime pattern discovery. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining: Workshop on Intelligence and Security Informatics. 2010, Washington, D.C: Association for Computing Machinery, July 2010.
41. Kuok CM, Fu A, Wong MH: Mining fuzzy association rules in databases. ACM SIGMOD Record. 1998, 27 (1): 41-46. 10.1145/273244.273257. New York, NY.
42. Zadeh LA: Fuzzy Sets. Information and Control. 1965, 8 (3): 338-353. 10.1016/S0019-9958(65)90241-X.
43. Liu B, Hsu W, Ma Y: Integrating classification and association rule mining. Proceedings of 4th International Conference on Knowledge Discovery Data Mining (KDD). 1998, New York: AAAI Press, 80-86.
44. Agrawal R, Imielinski T, Swami A: Mining association rules between sets of items in large databases. Proceedings of the International Conference on Management of Data. 1993, Washington, DC: Association for Computing Machinery, 207-216.
45. Srikant R, Agrawal R: Mining quantitative association rules in large relational tables. Proceedings of the International Conference on Management of Data. 1996, Montreal: Association for Computing Machinery, 1-12.
46. Cawley GC, Talbot NLC: On over-fitting in model selection and subsequent selection bias in performance evaluation. J Machine Learning Res. 2010, 11: 2079-2107.

### Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/12/124/prepub

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.