To overcome the challenges characterised in the previous section for harmonising EHRs from the SAIL Databank for Wales and the NHS Digital TRE for England, we adopted a four-layer process for the CVD-COVID-UK projects within SAIL, aiming to optimise reusability and reproducibility. We used multiple demographic and EHR data sources, including primary and secondary care-related data sources, prescribing and dispensing records, COVID-19 testing and vaccination data, and mortality records. The data sources contributed varying lengths of follow-up, together covering the years 1990–2022 (Additional file 1).
To address transparency and communication challenges, we have followed the best practices and rules established within the SAIL Databank for Wales and the NHS Digital TRE for England for naming files, folders, and any database assets created and maintained. This keeps the resources organised and understandable for users who are actively working on a proposal or who wish to learn from or reuse existing components. We have also produced data visualisations that show the flow and layers of data preparation used to deliver the required data assets and research, and these align with the underlying file names and locations. Figure 1 shows a simplified example of the four-layer process applied for some of these projects, and Fig. 2 illustrates the detailed version of the data harmonisation process used in [43].
Layer 1 (raw data sources)
Layer 1 consists of raw data sources in SAIL which are available to all approved users conducting CVD-COVID-UK projects in a “read-only” database schema. Additional file 1 shows key details of all data sources currently available for CVD-COVID-UK projects within SAIL Databank and the NHS Digital TRE for England. All these raw data sources (apart from the data for ONS 2011 Census, and Congenital Anomalies Register and Information Services for Wales) are updated regularly within these TREs at daily, weekly, fortnightly, or quarterly intervals, depending on the data source. More information about these data sources and their meta-data can be found in the Health Data Research Innovation Gateway [44].
Layer 2 (curated data)
There are two types of data tables in Layer 2 (derived from raw data sources in Layer 1). The first type are general purpose, pre-prepared, cleaned tables with derived columns, known as Research Ready Data Assets (RRDAs). These are generated from two or more raw data sources by applying quality checks, linkage, and pre-processing procedures [45]. The RRDAs are maintained by the Population Data Science group at Swansea University [46] and made available for several projects including CVD-COVID-UK.
An example of an RRDA is the COVID-19 “C20 electronic cohort” [47] which provides a population spine of 3.2 million Welsh residents alive and registered within the NHS in Wales from the 1st January 2020, including those who have moved into Wales or were born after 1st January 2020. Multiple demographic and healthcare data sources have been used to create the cohort (see [47] for more details), which is updated monthly. Columns cover information regarding demographics (e.g. age, sex, week of birth, date of death), residence (e.g. date moved in and out of Wales, residential anonymised linkage field, Lower-layer Super Output Area (LSOA, a geographic hierarchy in England and Wales used to estimate the characteristics of the people who live in a particular area) version 2011), and registration with primary care general practices. An equivalent entity in the NHS Digital TRE for England is the “key patient characteristics” table, which includes > 56 million individuals alive on 1st January 2020 and registered with an NHS general practice in England [32]. Primary care and hospital episode records (covering inpatient, outpatient, and emergency department episodes) have been combined prior to the index date of 1st January 2020 to define key characteristics, including sex and age. We have used these population denominators to derive harmonised variables for age, sex, date of death, and deprivation.
Another key RRDA in Layer 2 is the derived ethnicity data table in the SAIL Databank based on 26 data sources and harmonised into a national ethnicity spine for the population of Wales [48]. While ethnicity is usually considered a single variable in studies, each patient might have their ethnicity recorded once, many times, or never [9]. These codes might differ and even conflict for various reasons, including different categories being used across data sources and TREs [49]. Harmonised ethnic groups corresponding to the ONS categories (i.e. White, Mixed, Asian and Asian British, Black and Black British, and Other ethnic groups) are available in the ethnicity spine RRDA in SAIL and the key patient characteristics table in the NHS Digital TRE.
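The rule used to resolve conflicting ethnicity records is described in [48] and is not reproduced here. Purely as an illustration of the problem, the sketch below assumes one possible rule: take the most frequently recorded group for a person and break ties by the most recent record. The function name, the input layout, and the rule itself are assumptions for this example, not the method used to build the ethnicity spine.

```python
from collections import Counter
from datetime import date


def resolve_ethnicity(records):
    """Pick one ethnic group from a person's records.

    records: list of (record_date, group) tuples for a single person.
    Assumed rule (illustrative only): most frequent group wins,
    ties are broken by the most recent record date.
    Returns None when ethnicity was never recorded.
    """
    if not records:
        return None
    counts = Counter(group for _, group in records)
    top = max(counts.values())
    tied = {group for group, n in counts.items() if n == top}
    # Among the tied groups, keep the one with the latest record date.
    _, chosen = max((d, g) for d, g in records if g in tied)
    return chosen


example = [
    (date(2010, 1, 1), "White"),
    (date(2015, 6, 1), "Mixed"),
    (date(2018, 3, 1), "White"),
]
print(resolve_ethnicity(example))
```

A person with no records yields `None`, which keeps "never recorded" distinct from any harmonised category.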
In addition to the RRDAs in Layer 2, we have generated a curated version of other raw data sources by applying initial data cleaning, which is common across projects. This cleaning process includes linking the data sources to two population spines for Wales, the C20 cohort and the C16 cohort (a counterfactual and contextual comparative population spine consisting of the whole ~ 3 million population of Wales from the 1st January 2016 to 31st December 2019 [47], see Additional file 1 for more details), removing records whose unique anonymised identifier is missing, and applying some pre-processing procedures (e.g. removing records whose date is out of data coverage). Examples include the curated versions of the Welsh Longitudinal General Practice (WLGP), Patient Episode Dataset for Wales (PEDW), Outpatient Dataset for Wales (OPDW), and COVID-19 Test Results (PATD). All RRDAs and generated data tables in Layer 2 are updated monthly.
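The cleaning steps described above can be sketched as a simple filter. This is a minimal illustration rather than the project's code: the field names `person_id` and `event_date` are hypothetical, linkage to a spine is approximated by keeping only identifiers present in it, and dropping records with a missing date is an added assumption.

```python
from datetime import date


def curate(records, spine_ids, coverage_start, coverage_end):
    """Return records that pass the initial cleaning steps.

    records: list of dicts with hypothetical keys 'person_id' and 'event_date'.
    spine_ids: set of identifiers in the population spine (e.g. C20 or C16).
    """
    curated = []
    for rec in records:
        person_id = rec.get("person_id")
        event_date = rec.get("event_date")
        if person_id is None:
            continue  # unique anonymised identifier is missing
        if person_id not in spine_ids:
            continue  # cannot be linked to the population spine (assumed rule)
        if event_date is None:
            continue  # missing date: dropped here by assumption
        if not (coverage_start <= event_date <= coverage_end):
            continue  # date is outside data coverage
        curated.append(rec)
    return curated


raw = [
    {"person_id": 1, "event_date": date(2020, 3, 1)},
    {"person_id": None, "event_date": date(2020, 3, 1)},
    {"person_id": 2, "event_date": date(1800, 1, 1)},
    {"person_id": 99, "event_date": date(2020, 3, 1)},
]
clean = curate(raw, {1, 2}, date(1990, 1, 1), date(2022, 12, 31))
print(clean)
```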
Layer 3 (phenotyped data)
Layer 3 consists of phenotype-related data tables, named “PHEN DataSourceName PhenotypeName/Category”. These include all records related to a phenotype or group of phenotypes within a data table in Layer 2. Examples of Layer 3 data tables are “PHEN PEDW COVID19” (which includes all confirmed or suspected cases of COVID-19 in the curated version of PEDW data in Layer 2), and “PHEN PEDW CVD” (which contains all records related to cardiovascular diseases in the curated version of PEDW).
Data from secondary care systems in Wales and England, such as hospital admissions, outpatient episodes, and mortality registers, use the same clinical coding systems: the International Classification of Diseases, 10th revision (ICD-10) for diagnoses and causes of death, and the Office of Population Censuses and Surveys codes version 4 (OPCS-4) for hospital interventions and procedures [44]. Therefore, phenotype code-lists developed using ICD-10 or OPCS-4 in either TRE can be used to generate the harmonised data tables related to these phenotypes.
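Because the coding system is shared, the same code-list can select phenotype records in either TRE. The sketch below shows one way to do this by prefix matching on normalised ICD-10 codes. The field name `diag_code` and the two-code list are illustrative assumptions and not the project's published code-lists, which are available from the sources cited in this section.

```python
def normalise(code):
    """Upper-case a code and strip dots and spaces so 'u07.1' matches 'U071'."""
    return code.replace(".", "").strip().upper()


def select_phenotype(records, code_list):
    """Return records whose diagnosis code starts with any code in code_list.

    records: list of dicts with a hypothetical 'diag_code' key.
    code_list: iterable of ICD-10 codes or code prefixes.
    """
    prefixes = tuple(normalise(c) for c in code_list)
    return [r for r in records if normalise(r["diag_code"]).startswith(prefixes)]


# Illustrative code-list only: ICD-10 U07.1 and U07.2 are the COVID-19 codes.
covid_codes = ["U07.1", "U07.2"]
episodes = [
    {"id": 1, "diag_code": "U07.1"},
    {"id": 2, "diag_code": "u072"},
    {"id": 3, "diag_code": "I21.9"},
]
print(select_phenotype(episodes, covid_codes))
```

Prefix matching lets a three-character entry such as `I21` also capture its four-character subcodes, which is a design choice that should be confirmed against each code-list's intent.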
Data on medications dispensed through community pharmacies are available in both SAIL Databank (for COVID-19 purposes) and the NHS Digital TRE for England. In Wales, dispensing data are available within the Welsh Dispensing DataSet (WDDS), which includes all NHS prescription items dispensed from all community pharmacies remunerated by NHS Wales, and is coded in the Dictionary of Medicines and Devices (DM+D). Work has been done in SAIL to add British National Formulary (BNF) coding to these data by creating an RRDA version of WDDS [45]. This RRDA is linked to the C20 cohort and is part of Layer 2 in our four-layer process. In England, the NHS Business Services Authority (NHSBSA) dispensing data include prescriptions for all medicines dispensed in the community in England and are coded in BNF and DM+D. Therefore, any phenotype developed using DM+D or BNF in either TRE can be used in the other.
However, this is not the case for other data sources, such as primary care general practice event data. In Wales, WLGP data is recorded in Read V2 codes, whilst in England, the General Practice Extraction Service (GPES) Data for Pandemic Planning and Research (GDPPR) is recorded in Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) [44]. For primary care phenotypes, previously validated phenotypes in Read V2 code format from reputable sources such as CALIBER [37, 50] are used directly in SAIL, while phenotypes developed in SNOMED-CT for use in the NHS Digital TRE for England have been converted to Read V2 code format in collaboration with healthcare professionals and domain experts. Novel primary care phenotypes, such as COVID-19 diagnosis, have been developed in both SNOMED-CT and Read V2 in parallel (see Additional file 2). Furthermore, in the NHS Digital TRE for England, an assessment of comorbidity burden uses the number of disorders for each individual based on a SNOMED code-list obtained via an algorithm. The same approach could not be implemented with Read V2 codes (due to differences in the structure of this coding system compared with SNOMED-CT). Hence existing comorbidity indexes, such as Charlson and Elixhauser [51, 52], have been used to obtain this comorbidity burden variable for the Welsh population.
Emergency Department (ED) data, also known as Accident and Emergency (A&E) data, are available in the Emergency Department Data Set (EDDS) within SAIL and in the Hospital Episode Statistics (HES) Accident and Emergency (HES-AE) data within the NHS Digital TRE for England. These data have their own coding systems and variable formats for diagnosis and treatment information. In addition, some Welsh hospitals use ICD-10 codes at a 3-character level in EDDS. ED-related phenotypes in these TREs have therefore been harmonised following a detailed clinical review of mappings between these coding systems. For example, the diagnosis in HES-AE is a 6-character code consisting of diagnosis condition (n2), sub-analysis (n1), anatomical area (n2) and anatomical side (an1) [53]. In EDDS, by contrast, the diagnosis code has eight characters, consisting of diagnosis condition (an3), anatomical area (n3) and anatomical side (n2) [54]. A phenotype such as lower limb fracture can therefore be defined for each of these data sources using the related look-up tables for diagnosis condition, sub-analysis (where applicable), and anatomical area and side.
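The two fixed-width layouts can be made explicit with a small parser that splits a diagnosis code into its components before the look-up tables are applied. This is a sketch under the field widths stated above. The layout constants, the function name, and the example code strings are illustrative, and the example strings are arbitrary rather than real look-up values.

```python
# Field widths taken from the text: HES-AE is 2+1+2+1 = 6 characters,
# EDDS is 3+3+2 = 8 characters.
HES_AE_LAYOUT = (
    ("condition", 2),
    ("sub_analysis", 1),
    ("anatomical_area", 2),
    ("anatomical_side", 1),
)
EDDS_LAYOUT = (
    ("condition", 3),
    ("anatomical_area", 3),
    ("anatomical_side", 2),
)


def split_diagnosis(code, layout):
    """Split a fixed-width diagnosis code into named components."""
    expected = sum(width for _, width in layout)
    if len(code) != expected:
        raise ValueError(f"expected {expected} characters, got {len(code)}")
    parts = {}
    pos = 0
    for name, width in layout:
        parts[name] = code[pos:pos + width]
        pos += width
    return parts


# Arbitrary example strings, not real look-up values.
print(split_diagnosis("05112L", HES_AE_LAYOUT))
print(split_diagnosis("05A01201", EDDS_LAYOUT))
```

Once split, each component can be joined to its own look-up table, so a phenotype is expressed as a combination of condition and anatomical area values per data source. Codes of unexpected length (for example the 3-character ICD-10 codes used by some Welsh hospitals) raise an error here and would need separate handling.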
Some methods developed based on specific data sources in one TRE might not apply to another TRE due to differences in the structure and fields contained within the corresponding data source(s) or the lack of a similar data source between TREs. For example, the phenotypes defined for COVID-19 intensive care unit (ICU) admission, invasive and non-invasive ventilation for the NHS Digital TRE for England in [55] use code-lists in OPCS-4 for the HES Admitted Patient Care (HES-APC) data source as well as specific fields (and not clinical coding) in the following data sources: HES for Adult Critical Care (HES-CC) and COVID-19 hospitalisation information from the COVID-19 Hospitalisations in England Surveillance System (CHESS). In SAIL, hospital interventions and procedures are recorded in PEDW in OPCS-4, and so phenotypes coded in this coding system can be used in SAIL. However, the intensive care and critical care data in SAIL (available in the Critical Care Data Set (CCDS) and the Intensive Care National Audit and Research Centre (ICNARC) data (ICNC)) are different, and independent approaches [56] have been developed with a similar goal to identify and derive the outcomes needed.
Unified phenotypes related to COVID-19 polymerase chain reaction (PCR) tests, lateral flow tests, and vaccination have been defined using similar data sources in these TREs. In addition, based on the project’s need, phenotypes specific to Wales Results Reporting Service (WRRS, which contains all pathology laboratory results in Wales) have also been developed [39]. Examples are phenotypes (including test codes, their description, unit, and reference ranges) for influenza, pneumonia and other respiratory tract infections.
All phenotypes are documented and uploaded to the Health Data Research UK Phenotype Library [12], and BHF DSC GitHub repository [57] upon completion, signoff, and implementation as part of submitted published work. All generated data tables in Layer 3 are updated following the monthly update of Layer 2 data tables.
Layer 4 (project-specific data tables)
Finally, in Layer 4, project-specific data tables are created that are fully harmonised, with the same structure and format in both TREs. That is, all data table names, column names, and applicable values and ranges are the same between TREs. For example, demographic categories and outcomes of interest such as sex, age, ethnic groups, smoking status, or cardiovascular-related outcomes are the same for use in research analyses. Also, given the difference in geographic scale and population size between Wales and England, Wales has been considered one region when combining results with England, which has nine defined regions (North West, North East, East of England, London, East Midlands, West Midlands, Yorkshire and the Humber, South East, South West). When evaluating the impact of socioeconomic factors, the Welsh Index of Multiple Deprivation [58] and the English Index of Multiple Deprivation [59] have been used with consideration of the differences between the respective indexes, as the quintiles are not directly comparable between them. As a result of this harmonisation, any analytical pipeline developed in one TRE can be applied to the other with minimal or no change, and results from these TREs can then be combined across nations using appropriate meta-analysis methods.
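As a minimal sketch of what identical column names and values mean in practice, the example below maps source-specific rows onto a common layout. All source column names (`wales_id`, `eng_sex`, and so on) and the sex codings are hypothetical, chosen only to illustrate the idea. The one detail taken from the text is that Wales is treated as a single region.

```python
# Hypothetical source codings mapped to one shared set of values.
COMMON_SEX = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}


def harmonise_row(row, column_map, region=None):
    """Rename source columns to common names and recode shared values.

    column_map: {common_name: source_name}.
    region: fixed region label to apply to every row (used for Wales).
    """
    out = {common: row.get(source) for common, source in column_map.items()}
    # Unknown or missing sex codes become None instead of being guessed.
    out["sex"] = COMMON_SEX.get(str(out.get("sex")).strip().upper())
    if region is not None:
        out["region"] = region
    return out


wales_row = harmonise_row(
    {"wales_id": 1, "wales_sex": "1"},
    {"person_id": "wales_id", "sex": "wales_sex"},
    region="Wales",
)
england_row = harmonise_row(
    {"eng_id": 9, "eng_sex": "f", "eng_region": "London"},
    {"person_id": "eng_id", "sex": "eng_sex", "region": "eng_region"},
)
print(wales_row)
print(england_row)
```

Both outputs share the keys `person_id`, `sex`, and `region`, so the same downstream analysis code can read either table.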
Initial quality checks and descriptive statistics (e.g., frequencies, median, mean, standard deviation, and ranges) were used to assess the quality of the process and project-specific variables and to compare the consistency (distribution and missing values) of the harmonised data with corresponding data for England. Where required, researchers from both TREs engaged in discussions to understand any potential causes of inconsistencies and to clarify potential solutions.
Figure 3 shows the process for combining results of analyses from SAIL and NHS Digital TRE for England for CVD-COVID-UK projects. In SAIL, disclosure control through file-out requests does not permit outputs that would intentionally or unintentionally break the privacy protection of the anonymised data. This is primarily handled through a small number policy (< 5 as standard, and < 10 when using any data obtained through the DEA including the ONS 2011 Census data): results based on counts below these thresholds are considered disclosive and should therefore be suppressed. Very similar processes are used for disclosure control in the NHS Digital TRE. So, if any results requested out of either TRE (which are required for meta-analysis and/or to be included in the final output(s)) fall below these thresholds, an issue arises, as the unadjusted analysis should be excluded and counts < 5 for the adjusted analysis should be masked. One solution could be to use composite outcomes at a different or higher level of aggregation.
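The small number policy can be illustrated with a simple masking step applied to a table of counts before a file-out request. This is a sketch only: the function name and the mask format are assumptions, zero counts are masked like any other count below the threshold, and secondary suppression (preventing a masked cell from being recovered from totals) is not handled.

```python
def suppress_small_counts(counts, threshold=5):
    """Replace counts below the threshold with a mask such as '<5'.

    counts: {category: count}.
    threshold: 5 as standard, 10 for the stricter rule described above.
    """
    mask = f"<{threshold}"
    return {key: (n if n >= threshold else mask) for key, n in counts.items()}


events = {"stroke": 12, "myocarditis": 3, "pericarditis": 5}
print(suppress_small_counts(events))
print(suppress_small_counts(events, threshold=10))
```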
We note that the software used for data preparation and for generating analytical outputs, visualisations, and results has differed between these TREs because different software tools are available in each. For example, for population-wide analyses, the size of the data in the NHS Digital TRE for England requires distributed computing. Apache Spark (a data processing engine for distributed computing which sits between the data source and the analysis tool) is therefore provided in this TRE to run SQL queries, and can be used through Spark SQL or Python query tools such as PySpark [60]. In SAIL, similar tools (such as Eclipse and Jupyter notebooks) can be used to run SQL queries. For more details about the analytical and version control tools available in SAIL and the NHS Digital TRE for England, see [7, 61].
All data tables generated as part of the harmonisation process include individual-level details. Hence, these tables are only accessible within the SAIL Databank TRE. To access these resources, researchers working on the CVD-COVID-UK programme are required to submit their proposals to SAIL via https://www.saildatabank.com/application-process, and all applications are reviewed by an independent Information Governance Review Panel (IGRP). The IGRP considers each project to ensure proper and appropriate use of SAIL data. When access has been granted, it is gained through a privacy-protecting safe haven and remote access system, referred to as the SAIL Gateway. Further details of this process can be found on the SAIL Databank website (https://saildatabank.com/).
All SQL and R scripts used to generate data tables in Layers 2 and 3 and their associated documentation, as well as all scripts used to derive project-specific data tables (in Layer 4) and related meta-data, are available in GitLab within the SAIL Gateway, and are made publicly available via the BHF DSC GitHub repository [57] following completion of the project.
Finally, although this harmonisation process has been implemented as part of the CVD-COVID-UK programme to enable cross-nation COVID-19 related analysis in England and Wales, the data harmonisation methodology, data curation and linkage techniques, phenotype definitions, and derivation of analysis variables can be generalised and used by other projects using the SAIL Databank, and replicated in other TREs across the UK with similar data sources.