Using record linkage to validate notification and laboratory data for a more accurate assessment of notifiable infectious diseases

Background Infectious disease burden is commonly assessed using notification data. Using retrospective record linkage in Western Australia, we described how well notification data captures laboratory detections of influenza, pertussis and invasive pneumococcal disease (IPD). Methods We linked data from the Western Australian Notifiable Infectious Diseases Database (WANIDD) and the PathWest Laboratory Database (PathWest) pertaining to the Triple I birth cohort, born in Western Australia in 1996–2012. These were combined to calculate the number of unique cases captured in each dataset alone or in both datasets. To assess the impact of under-ascertainment, we compared incidence rates calculated using WANIDD data alone and using combined data. Results Overall, there were 5550 influenza, 513 IPD (2001–2012) and 4434 pertussis cases (2000–2012). Approximately 2% of pertussis and IPD cases and 7% of influenza cases were solely recorded in PathWest. Notification of influenza and pertussis cases to WANIDD improved over time. Overall incidence rates of influenza in children aged <5 years using both datasets was 10% higher than using WANIDD data alone (IRR = 1.1, 95% CI = 1.1–1.2). Conclusions This is the first time WANIDD data have been validated against routinely collected laboratory data. We anticipated all cases would be captured in WANIDD but found additional laboratory-confirmed cases that were not notified. Studies investigating pathogen-specific infectious disease would benefit from using multiple data sources. Electronic supplementary material The online version of this article (doi:10.1186/s12911-017-0484-7) contains supplementary material, which is available to authorized users.


Background
Population estimates of the burden of some infectious diseases may be calculated using surveillance data based on statutory notifications. In Australia, a disease is listed on the National Notifiable Diseases list if it is a public health priority and data collection is feasible [1]. Data on these diseases are collected by surveillance or other methods at individual state and territory Departments of Health and subsequently sent to the national Department of Health for national surveillance [2].
While automated notification-based surveillance systems are cost-effective, timely and ease the reporting burden on healthcare providers, if these systems are not audited and updated regularly, incomplete data capture may occur with changes over times (e.g. changes in case definitions or testing methods). These may lead to inaccurate estimates of disease burden. In Western Australia (WA), we have the opportunity to use record linkage (also known as data linkage) to assess the completeness of data capture using these systems. Record linkage is the process of combining multiple, usually administrative, datasets that relate to the same person, and has been used successfully in epidemiological studies of infectious diseases in Australia and the United Kingdom [3][4][5].
Infectious diseases, particularly acute respiratory infections, are the leading cause of hospitalisation in young children in WA [6]. Influenza, pertussis (whooping cough) and invasive pneumococcal disease (IPD) are notifiable infectious diseases in WA, which are usually associated with pathogens causing respiratory infections. In Australia, pertussis became notifiable in 1991 while influenza and IPD became notifiable in 2001 [7,8].
In WA, the responsibility for notification lies with the attending healthcare provider [9]. In addition, diagnostic laboratories responsible for notifiable disease testing are also required to report detections of such pathogens [9]. Both groups are encouraged to complete a disease notification regardless of whether the case has been reported previously by another party, with duplicates managed by the WA Department of Health [9]. These data are stored on the Western Australian Notifiable Infectious Diseases Database (WANIDD) [10].
As part of a whole population-based birth cohort study investigating the pathogen-specific burden of respiratory infections in children (referred to as the Triple I cohort), we assembled a linked dataset of WANIDD detections and routine laboratory data from the Path-West Laboratory Database using the WA Data Linkage System. PathWest Laboratory Medicine is a state-wide public laboratory that provides pathology testing services to all public hospitals in WA, some general practitioners from private clinics, and referred samples from private pathology providers [4,11]. Approximately half of the specimens tested at PathWest Laboratory Medicine are from children presenting to general practices or outpatient clinics, with the remainder from children admitted to hospital (personal communication, A Levy). The PathWest Laboratory Database stores all data relating to pathology testing conducted at PathWest Laboratory Medicine. When pathogens associated with notifiable diseases are reported in the PathWest Laboratory Database (PathWest), a report of these cases is sent automatically to WANIDD.
We assess the completeness of data capture in WANIDD by comparing recorded detections in PathWest. We wanted to describe the differences, if any, in the number of cases with notifiable diseases recorded in WANIDD compared with PathWest, as well as to identify the effects of these differences on estimates of laboratory-confirmed infections. To do this, we focused on three key infections: laboratory-proven influenza, Bordetella pertussis, and invasive Streptococcus pneumoniae infection (associated with IPD). Given the statutory requirements for notification of these infections and the automated reporting system in PathWest, we anticipated that all PathWest detections of these pathogens would also be captured in WANIDD. As a proportion of diagnostic samples are tested only by private laboratories, we expected detections in these laboratories to be captured solely in WANIDD.

Methods
The Triple I cohort included all children born in WA between 1996 and 2012. The Triple I cohort was identified using data extracted from the Midwives Notification System and the Birth and Death Registry via the WA Data Linkage System. An estimated 95% of children in this study were successfully linked across datasets through this system (personal communication, WA Data Linkage Branch). The Triple I cohort comprised of 469,589 children, of whom 51.2% were male and 6.7% were of Aboriginal or Torres Strait Islander descent (hereafter referred to as Aboriginal). The majority of these children (74.7%) were born in metropolitan Perth.
PathWest and WANIDD records of children in the Triple I cohort that reported laboratory-proven influenza virus infection, B. pertussis or invasive S. pneumoniae infections were extracted for this study. Comparisons were restricted to data with a date of specimen collection between January 2000 and December 2012. For WANIDD records where date of specimen collection was missing, optimal date of onset was used instead (further details on this variable below). Records with a specimen collection date on or after the date of death were considered post-mortem specimens and were excluded from these analyses. As the cohort consisted of all births from 1996 to 2012, this study includes notification and laboratory data for children up to 16 years of age.

Definitions and linkage rules
Aboriginal children were identified using the 'Get Our Story Right' variable, which was provided through the WA Data Linkage System [12]. Geographical region was assigned using the residential postcode at the time of birth. The same child in PathWest and WANIDD datasets was identified using a study-specific personal identifier assigned through the WA Data Linkage System [3].
Date of specimen collection was used to identify WANIDD records with a corresponding PathWest record. Approximately 19.4% of all WANIDD records of any notifiable disease pertaining to the Triple I cohort had missing date of specimen collection. For these cases, optimal date of onset, which is a derived date variable calculated by WANIDD, was used instead. Optimal date of onset was calculated in hierarchical order based on date of onset, date of specimen collection, date of clinical notification and date of receipt of notification (personal communication, C Giele). If a date of onset was listed for a particular notification, this was used as the optimal date of onset. Date of receipt of notification was only used as the optimal date of onset if all other date variables were missing. Age of the child was calculated using the date of specimen collection and date of birth.

Influenza
Notification to WANIDD is mandatory for any laboratoryproven case of influenza infection detected in a respiratory specimen or serum by culture, polymerase chain reaction (PCR) and antigen testing or by serology [13]. Equivocal results were coded as not detected. Respiratory specimens were defined as nasal, sputum, throat, tracheal, lung or bronchial samples. Influenza became notifiable in 2001 [8]; hence cases were restricted to those detected between 2001 and 2012. Consistent with the WANIDD definition of duplicates, notifications or laboratory detections for the same child up to 8 weeks (56 days) from the date of initial specimen collection were considered duplicates unless a different subtype of influenza was detected. All duplicates were excluded from the following analyses.

Invasive pneumococcal disease (IPD)
An IPD case was defined as detection of S. pneumoniae by culture or PCR from a normally sterile site [13] including blood, cerebrospinal fluid or pleural fluid specimens. Detections from other sterile sites (e.g. joint fluid) were excluded for this study. Equivocal results were coded as not detected. As with influenza, IPD became notifiable in WA from 2001 onwards [8], hence, cases were restricted to those in 2001-2012. As per current rules applied to the WANIDD database, unless a different invasive serotype was identified, any subsequent records of IPD for the same person recorded on either WANIDD or PathWest were considered duplicates and excluded from these analyses.

Pertussis
Unlike influenza and IPD, notification is required for both laboratory-confirmed and probable cases of pertussis. Laboratory confirmation of pertussis was defined as the detection of B. pertussis by culture, PCR or serology [13]. PCR testing was in use throughout the study period and equivocal results were coded as not detected. Probable cases were defined as the presence of clinical (e.g. coughing illness for 2 or more weeks) together with epidemiological evidence of infection (e.g. epidemiological link to a laboratory-confirmed case) [13]. Clinicians are encouraged to submit a notification if they suspect an individual to have pertussis based on these criteria. WANIDD data for both laboratory-confirmed and probable cases were included in the following analyses. All PathWest cases that met the definition for laboratory confirmation of pertussis from 2000 onwards were included in the analyses. WANIDD or PathWest records for the same person up to a year (365 days) from the date of initial specimen collection were considered duplicates and excluded from the analyses.

Statistical analyses
After extracting data from PathWest and WANIDD that met the criteria for notification, we compared the demographic factors of cases from the PathWest dataset to those from the WANIDD dataset. We then combined both datasets to calculate the total number of unique influenza, IPD and pertussis cases as well as the number of cases that were recorded in the WANIDD dataset only, the PathWest dataset only, or in both datasets. We then estimated and compared the incidence/notification rates (hereafter referred to as incidence rates) using person-time-at-risk as the denominator. Person-time was calculated using date of birth, death and the end of the study period. As the highest burden of respiratory infections are in children aged less than 5 years [14], only incidence rates for these children are presented by year of specimen collection using data from WANIDD alone and data from both datasets. Data cleaning and analyses were performed in IBM SPSS version 22 and 23. Exact 95% confidence intervals (CI) and incidence rate ratios (IRR) were calculated using EpiBasic [15].

Results
Overall, between 2001 and 2012, there were 4885 influenza cases and 342 IPD cases recorded in PathWest from children in the Triple I cohort. WANIDD recorded 5159 influenza cases and 502 IPD cases over the same period. Between 2000 and 2012, there were a total of 2850 pertussis cases from children in the Triple I cohort recorded in PathWest while WANIDD recorded 4361 pertussis cases.
Demographics of children with influenza or IPD recorded in PathWest and WANIDD were similar (Table 1). Among children with pertussis, although the age distribution in both WANIDD and PathWest were similar, older children accounted for a larger proportion of cases reported in WANIDD compared to PathWest ( Table 1). The majority of pertussis cases documented on WANIDD (n = 4247, 97.4%) were reported as laboratory-confirmed pertussis cases.
Notification of influenza cases to WANIDD improved over time with at least 95% of all cases being captured by WANIDD from 2007 onwards (Fig. 2). Similarly, the proportion of pertussis cases captured by WANIDD improved from 95.0% of all cases in 2001 to almost all cases (99.1%) in 2012 (Additional file 1: Figure S1). However, there was an increase in the proportion of pertussis cases captured by PathWest alone in 2006 and 2007, although the number of cases were small (<50 in each year). The WANIDD dataset captured nearly all IPD cases during the study period, with the exception of 2008, where 7.1% of cases were only recorded on Path-West. However, this represented only a small number of IPD cases (n < 5; Additional file 2: Figure S2).
In children aged less than 5 years, the overall incidence rate of influenza was 168 per 100,000 child-years in 2001-2012 when using data from WANIDD alone.
Using data from both datasets yielded a 10% increase in the overall incidence rate to 186 per 100,000 child-years (IRR = 1.1, 95% CI = 1.1-1.2), with the most marked difference in 2003 (Fig. 3). Overall incidence rates of pertussis was similar (IRR = 1.0, 95% CI = 1.0-1.1) when using data from both datasets (110 per 100,000 childyears) compared to data from WANIDD alone (107 per 100,000 child-years) with minimal difference over the study period (Fig. 4). Likewise, incidence rates of IPD were similar across all years when using data from either WANIDD only or both datasets (Fig. 5).

Discussion
Respiratory infections are a major cause of hospitalisation in young children. Accurate, complete and reliable   measures of disease burden, particularly with laboratory confirmation, are essential to guide the control of infectious diseases and assess the impact of prevention programs such as vaccination. Using a birth cohort and linked data, we described the discrepancies in the number of influenza, IPD and pertussis cases recorded in the WANIDD and PathWest datasets as well as the effect of these discrepancies on estimates of incidence rates. We anticipated that all PathWest cases would be captured in WANIDD but found additional cases of influenza, IPD and pertussis that were only documented in the Path-West dataset. While over 98% of IPD and pertussis cases were captured by WANIDD, it failed to capture 7% of all influenza cases. Incidence of influenza in children aged less than 5 years increased by 10% as a result of using data from both datasets compared to using only data from WANIDD. Despite having an automated reporting system at Path-West to report notifiable pathogens to WANIDD, the WANIDD dataset failed to capture between 2 and 7% of notifiable disease cases. Our investigation uncovered two reasons why this occurred. Firstly, the automated notification system did fail on occasion, particularly in the early years. Secondly, influenza cases that were only detected by antigen detection were not electronically notified by the laboratory. Reports on the test results of these cases requested that the heath care provider notify the case, which did not reliably happen. This emphasises the importance of direct notification by laboratories in achieving high notification rates.
Discrepancies between the two datasets were greatest for influenza but this decreased over time. Both influenza and IPD became notifiable in 2001, with the laboratories developing automatic notification systems shortly thereafter. A decrease in automatic notification failures over time is consistent with development, implementation and audit of automated notifications. However, we observed an increase in the proportion of pertussis cases that were only recorded in the PathWest dataset in 2007. As new tests are introduced or refined over time, changes in the way these tests are coded could lead to errors in the automated reporting system. As PathWest Laboratories underwent significant changes to its database in 2006, both of these factors may have contributed to the sudden changes in notification patterns for influenza and pertussis in 2006-2007. In addition, a manufacturing error in the cut-off titres for serological testing for pertussis that was reported in 2006 [16] could have also contributed to the discrepancies between the two datasets during this period.
As notification data help to shape national responses to seasonal diseases like influenza [17] and guide immunisation policy [18], using notification data alone will likely underestimate the incidence of laboratory-confirmed infections as we have observed in this study. Similar issues with other surveillance models have been reported internationally [19]; we can only speculate if the issues presented here are applicable to other pathology providers and jurisdictions around Australia. We would welcome validation of these findings elsewhere using similar methodologies, although it may be more complicated if multiple pathology In the meantime, we would suggest using multiple data sources alongside statistical models or other methods, as appropriate, to generate future estimates of laboratoryconfirmed infections wherever possible. These results have been generated using linked data on the Triple I cohort. As such, it represents only a subset of the total population with a number of tests and notifications occurring in children outside the cohort. Furthermore, children exhibiting milder disease symptoms, particularly for influenza [20], may not be tested for the associated pathogen. It is estimated that Path-West contributes 64% of influenza, 44% of IPD and 34% of pertussis laboratory-confirmed notifications among WA residents (personal communication, C Giele). However, using record linkage provides a unique opportunity to externally validate notification data.
A further limitation of this study was missing data, particularly for key variables such as date of specimen collection. This may have flow on effects whereby relevant records were not extracted, which has occurred in previous studies [21], and associated PathWest records may not be identified. Due to the administrative nature of these datasets, investigators have little control over the quality and completeness of individual variables in each dataset. As the data are de-identified, we are unable to look up individual records to determine why a particular infection episode in one dataset was not found in the other. While this study is not an exhaustive investigation into the reasons for discrepancies between the two datasets, nevertheless, it provides a descriptive overview of areas for further investigation.

Conclusions
To our knowledge, this is the first study that directly validates infectious disease notification data with routinely collected laboratory data for notifiable respiratory infections extracted through population-based record linkage. We identified discrepancies in the number of notifiable infectious diseases reported in WANIDD compared to PathWest, most notably for influenza. While we conducted investigations into the reasons for these discrepancies, the scope of investigations were limited by privacy restrictions with using linked data.
Periodic validation of passive surveillance systems like WANIDD, help identify reporting gaps and instil confidence in public health policy recommendations made from these data. A strength of this study is consultation with staff at PathWest and WANIDD to help verify our findings and ensure both laboratory and notification data were interpreted appropriately. Future studies describing the burden of laboratoryconfirmed infections would benefit from the use of multiple data sources where feasible.

Additional files
Additional file 1: Figure S1