Assessing the suitability of general practice electronic health records for clinical prediction model development: a data quality assessment

Background The use of general practice electronic health records (EHRs) for research purposes is in its infancy in Australia. Given these data were collected for clinical purposes, questions remain around data quality and whether these data are suitable for use in prediction model development. In this study we assess the quality of data recorded in 201,462 patient EHRs from 483 Australian general practices to determine their usefulness in the development of a clinical prediction model for total knee replacement (TKR) surgery in patients with osteoarthritis (OA).
Methods Variables to be used in model development were assessed for completeness and plausibility. Accuracy for the outcome and competing risk was assessed through record level linkage with two gold standard national registries, the Australian Orthopaedic Association National Joint Replacement Registry (AOANJRR) and the National Death Index (NDI). The validity of the EHR data was tested using participant characteristics from the 2014–15 Australian National Health Survey (NHS).
Results There were substantial missing data for body mass index and weight gain between early adulthood and middle age. TKR and death were recorded with good accuracy; however, year of TKR, year of death and side of TKR were poorly recorded. Patient characteristics recorded in the EHR were comparable to participant characteristics from the NHS, except for OA medication and metastatic solid tumour.
Conclusions In this study, data relating to the outcome, competing risk and two predictors were unfit for prediction model development. This study highlights the need for more accurate and complete recording of patient data within EHRs if these data are to be used to develop clinical prediction models. Data linkage with other gold standard data sets/registries may in the meantime help overcome some of the current data quality challenges in general practice EHRs when developing prediction models.
Supplementary Information The online version contains supplementary material available at 10.1186/s12911-021-01669-6.

There have been significant developments in general practice EHR software systems over time, resulting in increases in the volume, detail and quality of patient data that can be stored within these records [1]. As a result of this, and of technological advancements in computer processing power, general practice EHRs have the potential to answer various research questions [2].
Despite this potential, the large volumes of data within EHRs do not necessarily mean these data are fit to answer a given research question [2]. EHRs are a secondary data source collected for clinical purposes and not for research. Therefore, the quality of the data in EHRs may be influenced by the methods and practices used to record, extract, collate and disseminate the data [3][4][5]. For example, chronic conditions that are incentivised or that are national health priorities, such as asthma, may be more completely recorded in the EHR [2]. The prevalence of these conditions may therefore be more accurately recorded in EHRs than that of unincentivised or lower priority conditions.
In addition, there is no standardisation of general practice EHR software in Australia and no national standards for EHR software [6]. Each EHR software system differs in the clinical terminologies and classifications used and many still utilise text heavy fields for data capture [6,7]. Hence the quality of data in EHRs may be influenced by the design and layout of the software system used to collect patient information. Kahn et al. (2016) devised a framework for assessing the quality of data within EHRs to assist researchers in determining whether their data are fit to answer the research question. In this study we utilised this framework to assess whether the quality of data in a sample of Australian general practice EHRs was suitable for use in the development of a clinical prediction tool for total knee replacement (TKR) in patients with osteoarthritis (OA) for use in primary care. Thirty-two predictors were identified through a literature review and by consultation with experts in the field of OA [8]. Nine of these predictors were available in general practice EHRs and were therefore selected for use in model development. The planned outcome of the model was time to primary TKR, with death treated as a competing risk due to the age of the study cohort. The specific methods used to develop the prediction tool are detailed in Thuraisingam et al. [8] and are outside the scope of this study. Here we assessed the quality of the EHR data prior to model development. Accuracy was verified through linkage with gold standard registries, and validity was assessed through comparison with national survey data. The data linkage process was also assessed to identify potential bias that may have been introduced by linkage.

Study design
NPS MedicineWise manage the MedicineInsight data set consisting of deidentified EHRs from 2.9 million patients from 671 general practices across Australia [9,10]. These data are provided by consenting practices and are extracted from two different EHR software systems using two third-party data extraction tools [10]. Data from the two EHR software systems are amalgamated within the data warehouse into a single consistent structure [10]. The coding used to merge data fields is proprietary to NPS MedicineWise and has been developed with input from general practitioners, pharmacists, business analysts and data warehouse architects. For this study, NPS MedicineWise provided EHR data extracted from 475,870 patients with a recorded diagnosis of OA (see Additional file 1: Table 1 for coding of OA). These records included patient clinical data recorded in the EHR as at the 31st of December 2017. Patient encounter data were provided for the years 2013 to 2017.
These EHR data were linked with the Australian Orthopaedic Association National Joint Replacement Registry (AOANJRR) [11] and the National Death Index (NDI) [12]. The AOANJRR contains data on TKRs performed in Australia since the 1st of September 1999, with near complete capture of all TKRs in Australia from 2002 onwards. The NDI includes all deaths that have occurred in Australia since 1999. The data linkage process is outlined in Additional files 2-4.
Study baseline was the 1st of January 2014 and the study end date the 31st of December 2017 (inclusive). The inclusion criteria for the study were patients: (i) with at least two visits to the clinic in the year prior to baseline (i.e. in 2013); (ii) aged 45 years and over at baseline; (iii) alive at study baseline (i.e. no record of death in the NDI) and (iv) no recorded evidence of bilateral TKR prior to study baseline. We were unable to determine active patients of a clinic according to the RACGP definition (at least three clinic visits in a two-year period) given the encounter data provided did not include the two years prior to study baseline [13]. Hence criterion (i) was chosen as a proxy.
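The four inclusion criteria above can be expressed as a simple filter. The following is a minimal sketch only, not the code used in the study: the record structure and all field names are hypothetical, and the handling of criterion (iii) (excluding deaths dated before baseline and patients whose NDI date of death was uncertain) is an assumption based on the description of the cohort selection.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

BASELINE = date(2014, 1, 1)  # study baseline given in the text


@dataclass
class PatientRecord:
    """Hypothetical per-patient summary; field names are illustrative only."""
    visits_2013: int                     # clinic visits in the year prior to baseline
    age_at_baseline: int                 # age in whole years at baseline
    ndi_death_date: Optional[date]       # linked NDI date of death, None if no record
    ndi_death_date_uncertain: bool       # True if the NDI link gave no single usable date
    bilateral_tkr_before_baseline: bool  # recorded evidence of bilateral TKR


def meets_inclusion_criteria(p: PatientRecord) -> bool:
    """Apply criteria (i) to (iv) as described in the study design."""
    enough_visits = p.visits_2013 >= 2    # (i) proxy for an active patient
    old_enough = p.age_at_baseline >= 45  # (ii)
    # (iii) Assumption: "alive at baseline" means no NDI death dated before
    # baseline, and patients with uncertain death dates are excluded.
    died_before = p.ndi_death_date is not None and p.ndi_death_date < BASELINE
    confirmed_alive = not died_before and not p.ndi_death_date_uncertain
    no_bilateral_tkr = not p.bilateral_tkr_before_baseline  # (iv)
    return enough_visits and old_enough and confirmed_alive and no_bilateral_tkr
```

Keeping each criterion as a named boolean makes it straightforward to tabulate how many patients fail each criterion separately, as is done when reporting cohort selection.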

Coding of variables
The nine candidate predictors were age, body mass index (BMI), weight gain between early adulthood and middle age, prescribing of OA medication in the year prior to baseline, multimorbidity count, diagnosis of a mental health condition, previous contralateral TKR, other knee surgery (excluding TKR) and geographical residence of the patient. Each of the predictors was coded from the EHR at study baseline except for BMI and the prescribing of OA medications. The last BMI measurement recorded in the EHR in the year prior to the start of the study was included. The strength, dosage and frequency fields for medications data were used to determine whether patients were likely to be taking medications for OA at study baseline, using prescriptions issued in the 12 months prior to baseline. Death was coded from the patient status variable in the EHR. The patient's geographical residence was based on the Australian Bureau of Statistics (ABS) Australian Statistical Geography Standard (ASGS) remoteness areas [14]. Multimorbidity count was used as a proxy measure for overall health. Three different ways of counting multimorbidity were considered: (i) count of chronic conditions listed in the Charlson Comorbidity Index (CCI), which predicts ten-year survival in patients with multiple comorbidities [15], (ii) count of 17 frequently managed chronic conditions in primary care as identified in the Bettering the Evaluation and Care of Health (BEACH) study [16], and (iii) a combination of (i) and (ii). The conditions included in the CCI and from the BEACH study are listed in Additional file 1 Table 2, and coding in Table 3. Coding for mental health conditions, past knee surgeries and medications for osteoarthritis is detailed in Additional file 1 Tables 4-6 respectively.
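The three multimorbidity counts described above could be derived as follows. This is an illustrative sketch: the condition sets below are small placeholder subsets (the full lists are in Additional file 1 Table 2), the function name is hypothetical, and treating the "combination" as a count of distinct conditions across both lists is an assumption.

```python
from typing import Dict, Iterable, Set

# Placeholder subsets only; the full condition lists are in Additional file 1.
CCI_CONDITIONS: Set[str] = {"diabetes", "congestive heart failure", "metastatic solid tumour"}
BEACH_CONDITIONS: Set[str] = {"diabetes", "hypertension", "asthma"}


def multimorbidity_counts(patient_conditions: Iterable[str]) -> Dict[str, int]:
    """Return the three candidate multimorbidity counts for one patient."""
    present = set(patient_conditions)
    return {
        "cci": len(present & CCI_CONDITIONS),
        "beach": len(present & BEACH_CONDITIONS),
        # Assumption: a condition appearing in both lists is counted once.
        "combined": len(present & (CCI_CONDITIONS | BEACH_CONDITIONS)),
    }
```

For example, a patient with diabetes, asthma and one condition outside both lists would score 1 on the CCI count, 2 on the BEACH count and 2 on the combined count.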
Patients were coded as having missing data for the prescribing of OA medications if they had a prescription issued in the 12 months prior to baseline with missing dosage, strength or frequency and no other OA medication prescription with a clear end date. In these cases it was not possible to determine whether the patient was likely to be taking medications for OA at study baseline. Similarly, patients were coded as having missing data for chronic conditions and past surgeries if the corresponding diagnosis or surgery dates were missing, since it was not possible to determine whether the condition was present, or the surgery had occurred, by study baseline. Patients who did not have a text entry in the diagnosis field relating to any of the chronic conditions in Additional file 1 Table 2 were coded as negative for that condition, and those without a prescription entry for the medications listed in Additional file 1 Table 6 were coded as negative for OA medications. Patients without a BMI measurement recorded in the year prior to baseline were coded as having missing values for BMI. The same approach was used for weight. Those with a recorded patient status of "deceased" with no year of death recorded were coded as having missing year of death.
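The coding rules for BMI and chronic conditions distinguish three states: present, absent and missing. A minimal sketch of that logic is given below. The function names and input shapes are hypothetical, and the tie-breaking choices (a dated entry before baseline takes precedence over an undated one; an entry dated after baseline counts as absent at baseline) are assumptions rather than rules stated in the text.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

WINDOW_START = date(2013, 1, 1)  # start of the year prior to baseline
BASELINE = date(2014, 1, 1)      # study baseline


def baseline_bmi(measurements: Sequence[Tuple[date, float]]) -> Optional[float]:
    """Last BMI recorded in the year prior to baseline, or None (missing)."""
    in_window = [(d, v) for d, v in measurements if WINDOW_START <= d < BASELINE]
    if not in_window:
        return None
    return max(in_window, key=lambda m: m[0])[1]


def condition_at_baseline(diagnosis_dates: Sequence[Optional[date]]) -> Optional[bool]:
    """Three-state coding for one condition.

    diagnosis_dates holds one item per matching diagnosis entry;
    None means the entry exists but its date is missing.
    """
    if any(d is not None and d < BASELINE for d in diagnosis_dates):
        return True   # dated evidence before baseline
    if any(d is None for d in diagnosis_dates):
        return None   # entry exists but timing cannot be determined: missing
    return False      # no entry (or only entries after baseline): coded negative
```

Returning `None` rather than `False` for undated entries keeps "unknown" separate from "absent", which matters later when deciding which variables to impute.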

Data quality assessment
The data quality assessment of the MedicineInsight EHR data included the following steps:
(i) Identification of missing and implausible data
(ii) Assessment of accuracy of recording of TKR (outcome) and death (competing risk) in the EHR
(iii) Assessment of external validity of EHR data
The methods used in steps (i), (ii) and (iii) are detailed below.
(i) Identification of missing and implausible data
The completeness and plausibility of the predictors, outcome and competing risk were assessed. Counts and percentages were used to summarise the amounts of missing data and implausible values. Definitions for implausible data entries are listed in Additional file 5. Examples include a year of birth documented as a date after the data extraction date and a year of death documented before the year of birth.

(ii) Assessment of accuracy of recording of TKR and death in the EHR
The accuracy, sensitivity and specificity of recording of TKR and death in the MedicineInsight EHR data set were assessed through data linkage with the AOANJRR and NDI respectively. Sensitivity, specificity and accuracy were calculated using the definitions in Altman and Bland (1994). In assessing the accuracy of recording of TKR side and year of surgery, the denominator was the total number of true TKRs regardless of whether a side or year was recorded. For model building purposes we were interested in the proportion of true TKRs that had a side and year of surgery correctly recorded, not the proportion of recorded TKR sides and years that were correct. The same approach was used to assess the accuracy of recording of year of death. The data linkage process used to link the NPS MedicineWise EHR data set to the AOANJRR and NDI was assessed using the checklist developed by Pratt et al. [17].
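The accuracy measures in step (ii) can be illustrated with a short sketch. It uses the standard definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), accuracy = (TP+TN)/N) with the registry treated as the truth. The function names are hypothetical, and the second function reflects the denominator choice described above: every true TKR counts, so a missing side or year is scored as incorrect.

```python
from typing import Dict, Optional, Sequence


def diagnostic_metrics(ehr: Sequence[bool], registry: Sequence[bool]) -> Dict[str, float]:
    """Compare EHR flags against linked registry flags (registry = truth).

    Assumes at least one registry-positive and one registry-negative record.
    """
    if len(ehr) != len(registry):
        raise ValueError("ehr and registry must be the same length")
    pairs = list(zip(ehr, registry))
    tp = sum(1 for e, r in pairs if e and r)
    tn = sum(1 for e, r in pairs if not e and not r)
    fp = sum(1 for e, r in pairs if e and not r)
    fn = sum(1 for e, r in pairs if not e and r)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


def detail_accuracy(recorded: Sequence[Optional[str]], truth: Sequence[str]) -> float:
    """Proportion of true events whose detail (e.g. side or year) is correct.

    Both sequences cover true events only; None marks a missing EHR entry,
    which is counted as incorrect because the denominator is all true events.
    """
    if len(recorded) != len(truth):
        raise ValueError("recorded and truth must be the same length")
    correct = sum(1 for rec, tru in zip(recorded, truth) if rec is not None and rec == tru)
    return correct / len(truth)
```

With this denominator, a field that is rarely filled in scores poorly even if every filled-in value is right, which is the behaviour wanted when judging whether the field can serve as a model outcome.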

(iii) Assessment of external validity of EHR data
The external validity of the EHR data set was assessed by comparing socio-demographic and clinical characteristics of our cohort with those of OA patients aged 45 years and over from the 2014-2015 National Health Survey (NHS) [18] carried out by the ABS. The ABS NHS condition level codes [19] used to code the various chronic conditions are listed in Additional file 6. Not all chronic conditions included in our multimorbidity measures were available in the NHS, hence we compared the most commonly occurring chronic conditions for patients with OA as determined by the Australian Institute of Health and Welfare (AIHW) [20]. Proportions from the NHS data set were adjusted to account for the survey sampling strategy. The participant household record identifier was defined as the primary sampling unit in the NHS data and standard errors were estimated using replicate weights and the jackknife variance estimator [21]. Due to the NHS sampling method, counts have not been provided for these data. Instead, estimated population proportions and standard errors for these proportions have been calculated using the methods outlined in Donath (2005). For the EHR data, the general practice clinic was used as the primary sampling unit.
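The replicate-weight approach to estimating a survey proportion and its standard error can be sketched as follows. This is an illustration under stated assumptions, not the study's Stata code: it assumes the common delete-a-group jackknife form, in which the variance is (G-1)/G times the sum of squared differences between each replicate estimate and the full-sample estimate, where G is the number of replicate weight sets. The multiplier appropriate to a particular survey should be taken from that survey's documentation. Function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def weighted_proportion(indicator: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted proportion of respondents with indicator == True."""
    return sum(w for x, w in zip(indicator, weights) if x) / sum(weights)


def jackknife_proportion(
    indicator: Sequence[bool],
    main_weights: Sequence[float],
    replicate_weights: Sequence[Sequence[float]],
) -> Tuple[float, float]:
    """Return (estimate, standard error) using replicate weights.

    Assumes a delete-a-group jackknife with multiplier (G - 1) / G.
    """
    theta = weighted_proportion(indicator, main_weights)
    g = len(replicate_weights)
    sum_sq = sum(
        (weighted_proportion(indicator, rw) - theta) ** 2 for rw in replicate_weights
    )
    return theta, math.sqrt((g - 1) / g * sum_sq)
```

Each replicate weight set re-estimates the proportion with one group of respondents removed and the rest reweighted, so the spread of the replicate estimates reflects the sampling design without needing the design variables themselves.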

Statistical analyses
Categorical variables were summarised using frequency and percentage. Continuous variables were summarised using mean and standard deviation (SD) or median and inter-quartile range (IQR) as appropriate. All analyses were conducted using Stata MP version 16.1 (StataCorp, College Station, Texas) [22].

Selection of study cohort
Of the 475,870 patient EHRs, 236,412 patients with a recorded diagnosis of OA prior to study baseline who attended their general practice clinic in the year prior to baseline were identified (Fig. 1). A total of 34,950 (14.8%) patients were excluded from the study. Of these, 28,069 (11.9%) were excluded due to (i) fewer than two visits to the clinic in the year prior to baseline (n = 9776), (ii) being less than 45 years of age (n = 16,362) or (iii) both (i) and (ii) (n = 1931). After linkage with the NDI, a further 0.9% (n = 2117) of the 236,412 patients were excluded because they were either not alive at study baseline (n = 491, 0.2%) or could not be confirmed as being alive (n = 1626, 0.7%) due to uncertain dates of death recorded in the NDI. Uncertain dates of death arose from patients with common names and dates of birth having links to multiple records in the NDI and hence multiple possible dates of death, from deaths being discovered some time after the event, or from missing date of death data.
Socio-demographic and clinical characteristics were similar between the two cohorts (Table 3) except for proportions relating to OA medication (EHR 34% vs NHS 55%) and metastatic solid tumour (EHR 17% vs NHS 26%).
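The cohort selection counts reported under "Selection of study cohort" can be checked with simple arithmetic. The sketch below uses only the figures stated in that paragraph; the variable names are illustrative. The exclusions not itemised in this section are obtained purely by subtraction, and no reason is assigned to them here.

```python
# Figures as reported in the cohort selection paragraph.
identified = 236_412
excluded_total = 34_950
visits_only, age_only, visits_and_age = 9_776, 16_362, 1_931
not_alive, not_confirmed_alive = 491, 1_626

visits_or_age = visits_only + age_only + visits_and_age  # criteria (i) and (ii)
ndi_excluded = not_alive + not_confirmed_alive           # excluded after NDI linkage
not_itemised = excluded_total - visits_or_age - ndi_excluded  # derived by subtraction
final_cohort = identified - excluded_total


def pct(n: int) -> float:
    """Percentage of the 236,412 identified patients, to one decimal place."""
    return round(100 * n / identified, 1)


print(visits_or_age, pct(visits_or_age))    # 28069 11.9
print(ndi_excluded, pct(ndi_excluded))      # 2117 0.9
print(excluded_total, pct(excluded_total))  # 34950 14.8
print(final_cohort)                         # 201462
```

The final figure matches the 201,462 patient EHRs analysed in the study.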

Discussion
Are these data fit for use?
In this data quality assessment, we considered the completeness, plausibility, accuracy and validity of data contained in general practice EHRs, specifically for the purpose of developing a prediction model from these data. We found data fields relating to the outcome and competing risk (TKR side, TKR year and year of death) to be incomplete and inaccurate and therefore unfit for use in model development. The predictors BMI and weight gain between early adulthood and middle age were also unfit for use due to high proportions of missing data. The remaining predictors had less than 35% missing data or implausible values, which would allow us to perform multiple imputation to impute missing predictor values prior to model development as outlined in our published statistical analysis plan [8], provided we include variables to explain the missing data. We were unable to assess the accuracy and external validity of the candidate predictors due to restricted access to other data sets containing this information. We therefore cannot be certain that these predictors were accurately recorded in the EHRs. However, the socio-demographic and clinical characteristics of our cohort and the NHS cohort were similar, except for the prescribing of OA medication and recording of metastatic solid tumour. The NHS data had a higher proportion of OA medication prescribed compared to the EHR and a likely explanation is that the NHS data included over the counter medications as well as medications prescribed by specialists which might not be communicated to the general practitioner. The lower proportion of metastatic solid tumours in the EHR data compared to the NHS data may be due to diagnoses by specialists not being communicated to the general practitioner and the inconsistent manner in which metastatic solid tumours are recorded between general practitioners.
Whilst it seems important to assess the accuracy and validity of all predictors prior to model development, if we consider the context in which the model will be used, it may not be necessary to validate the predictors outside of the EHR setting. Our intention is to embed the prediction model in a clinical decision support tool within the EHR such that the predictors are drawn directly from the record. The main aim of the model is to produce accurate predictions, hence provided these data are complete/near complete, the predictors may only need to be representative of data within the EHRs. This viewpoint suggests that the predictors (excluding BMI and weight gain) in our study may be fit for our purpose. Should these predictors be used for model development (with outcome data obtained through data linkage) and recording practices in EHRs change over time, the model may not perform well and will need to be updated periodically with new data from the EHRs.
Table notes: Counts and percentages presented unless otherwise indicated. Percentages may not add to 100% due to rounding. BMI body mass index, OA osteoarthritis, CCI Charlson Comorbidity Index, IQR inter-quartile range, BEACH Bettering the Evaluation and Care of Health, TKR total knee replacement, N/A not applicable. ^ excluding mental health conditions. * Based on the Australian Bureau of Statistics (ABS) Australian Statistical Geography Standard (ASGS) remoteness areas [14]. Notes: BMI includes measurements recorded within one year of study baseline; Early adulthood = 18-21 years; Middle age = 45-65 years; Patient considered to be on OA medication if estimated to be on medication at study baseline using prescription date and medication strength, dosage and frequency; Patient considered to have chronic condition or undergone past knee surgery if record of this exists prior to study baseline.

Strengths and limitations
Our study adds to the limited literature on the quality of data within Australian general practice EHRs. It is the first Australian study to provide insight into the use of these data specifically for prediction models through the development of a real-world clinical prediction tool for use in practice. Our study highlights the importance of assessing the suitability of EHR data prior to model development through data quality assessment and demonstrates how to conduct such a study. It provides insight into which data fields in Australian EHRs are prone to missing and inaccurate data and the value of data linkage for data validation. In this study we followed established guidelines for assessing data quality and the data linkage process [17,23]. Our assessment was based on a large sample of EHRs, which provides a realistic representation of how data are recorded in general practice EHRs. Our coding of diagnoses was consistent with the NPS MedicineWise MedicineInsight Data Book [9]. Lastly, the data quality assessment included input from general practitioners, epidemiologists and biostatisticians.
Table 3 Summary statistics of patient characteristics from EHR data (N = 201,462) and ABS NHS data. CI confidence interval. * most commonly occurring chronic conditions as identified by AIHW [20]. Note: Estimates from the NHS have been calculated using replicate weights and the jackknife variance estimator [21].
Whilst the NDI data are validated annually against the Australian mortality data [24], the results are not published and it is possible that there may be some uncertainty in these data. Although the AOANJRR has near complete capture of every joint replacement performed in Australia and good external validity of these data has been demonstrated [11], it is possible that some patients within our data set underwent TKR in another country.
The EHRs provided by NPS MedicineWise to our research team consisted of patients identified as having a recorded diagnosis of OA in their EHR. The selection of this cohort was performed by NPS MedicineWise and we were provided the free-text terms used to identify the cohort. Whilst it is possible that data cleaning errors may have occurred during data pre-processing, the proportion of patient EHRs with a recorded diagnosis of OA provided to us out of the total number of patients was approximately 10.2% (304,725/2,974,031). This is comparable to the estimate provided by AIHW of 9.3% of Australians living with OA in 2017-18 [20]. Similarly, the rate of TKR in our cohort (229 per 100,000 per year) obtained from linkage with the AOANJRR was similar to that published by AIHW (218 per 100,000 per year) [20]. We were unable to verify the rate of death obtained through linkage with the NDI as we were unable to find another data source containing this information. Therefore, it is possible that the observed death rate differs from the expected death rate and that our assessment of the accuracy of recording of death is inaccurate. However, this seems unlikely, as only a small proportion of patients were excluded from our study for uncertain dates of death in the NDI (0.7%) and a small proportion of records from the NDI (0.05%) were excluded because hashes for linkage could not be generated. Further, data linkage errors, such as errors in generating or matching patient hashes, are expected to be uncommon and to have little impact on our assessment of accuracy.
This study identified 201,462 patients with OA from 483 general practices across Australia. It is possible that duplicate patients exist within our cohort given patients are not registered to one general practice clinic in Australia and are able to attend multiple clinics. The MedicineInsight General Practice Insights Report from 2017-2018 estimates that approximately 3% of patients in the 2017-2018 MedicineInsight cohort are duplicate patients [25]. It is therefore plausible to assume that the proportion of duplicated patients in our cohort is small.
Further, given there is no global measure for overall health we have considered using a count of conditions from the CCI [15] and a count of chronic conditions that were identified from the BEACH study as being frequently managed in general practice [16]. The latter is not a validated measure and may not accurately represent a patient's overall health status. All data fields used in this study were provided as raw text fields except for age, gender, patient status (active, inactive or deceased), geographical location of the patient and clinic state. Data extracted for these variables from the two EHR software systems were merged into a common variable within the data warehouse. There were minimal missing data for these merged data fields. Patient status is recorded similarly in the two EHR software systems and therefore inaccuracies in the recording of deceased patients were likely due to clinics not being informed of the death of a patient or data entry errors, rather than the merging of these data. Given the majority of data fields were text, extensive data cleaning was carried out by our research team on prescriptions and diagnoses data fields to extract the required information, and data sets were extensively reshaped and merged to prepare the data into the format required for assessment and later, modelling. Whilst we were able to externally validate some of the predictors, we cannot be certain bias was not introduced from data pre-processing errors.
Lastly, we were not provided data from the "progress notes" text field within the EHR, to ensure patient privacy was maintained. It is possible that important patient information such as BMI was recorded in this field as opposed to the field allocated specifically for clinical observations. This highlights the need for regulated EHR recording practices and enforced EHR software standards across Australia. Without this, Australia will continue to fall behind other developed nations in its use of health information [6,26].

Conclusions
The use of general practice EHRs for clinical prediction model development is in its infancy in Australia. In this study, data relating to the outcome, competing risk and two predictors were unfit for prediction model development. This study highlights the importance of conducting thorough data quality assessments prior to model development to assess the suitability of the EHR data. These assessments need to extend beyond missing and implausible data and include assessments of accuracy. External validity can provide useful insights about the study cohort and is important for models being applied outside the EHR setting. There is a need for more accurate and complete recording of patient data within EHRs if these data are to be used to develop clinical prediction models. Data linkage with other gold standard data sets/registries may in the meantime help overcome some of the current data quality challenges in general practice EHRs when developing prediction models.