The healthcare sector and regulatory bodies increasingly need to understand the real-world implications of diseases and healthcare interventions, which requires access to good-quality, fully integrated healthcare datasets. England’s National Health Service (NHS) is well placed to deliver this for several reasons. With a single healthcare system, it is possible to follow patients from birth to death. With a low proportion of healthcare services being provided outside of the NHS (£9bn compared with £126bn in the NHS in 2017) [1], it is possible to obtain a near-complete view of both existing and new services and treatments that patients access. The computerisation of UK general practice records and the fact that 98% of the population is registered with a GP lead to almost whole-population coverage. Unlike in clinical trials and biobanks, the denominator resulting from such databases is relatively free from selection bias and represents the entire population.
While primary care sees most contacts with patients, linkage with secondary care and other sectors is needed for a full picture of the patient’s journey through the health and social care system and their outcomes. In England, as in the other UK countries, hospital inpatient data are combined to give a single database, but primary care uses several different systems, which has so far prevented the creation of an equivalent database for primary care. Instead, large samples of vendor-specific primary care data for research are available from several sources. The Clinical Practice Research Datalink (CPRD), for example, now includes records of over 11 million currently registered patients (16% of the UK population), with linkage to hospital records, the national cancer registry, area-level social deprivation information and national mortality data, though some of these sources are for England only; the Health Improvement Network (THIN) [2] and QResearch [3] databases are similar but smaller. CPRD has generated over 1000 research papers [4]. Various initiatives have created local data warehouses such as the KID in Kent [5]. The KID uses pseudonymisation-at-source to link patient-level records from services including general practices, hospitals, community health services, hospices, and adult social care for its population of nearly two million in SE England. It was established to track service use by patients with any of a set of long-term conditions but has since expanded to cover all patients. It is overseen by a steering group, one of whose subgroups considers requests for access to the data. Such local data warehouses are not epidemiological cohorts or resources like UK Biobank [6] but were developed primarily for direct patient care and commissioning. Technical issues such as the interoperability of data systems, together with ethical considerations, have complicated their construction.
Over the past 5 years, the team behind one such local data warehouse in North West London has overcome such issues to make the dataset available for research. We cover the origin, funding, contents and structure of this data warehouse, derived from the Whole System Integrated Care (WSIC) programme, its anonymised research version Discover, and its consent-to-contact feature. We then compare it with CPRD and discuss access to Discover and its current and future uses and developments for research.
Origins and uses of the WSIC database
Commissioning is the process by which health and care services are planned, purchased and monitored. Within the NHS, local Clinical Commissioning Groups (CCGs) are responsible for planning, designing, buying, and paying for most NHS services including urgent and emergency care, acute care, mental health and community services across England (the commissioning landscape is changing: see [7] for a review). The need for a data warehouse was identified during a programme of consultation on the journey towards integrated care led by eight CCGs in North West London. As in many healthcare systems, medical records are held in database silos, and the need to share information about how patients go through healthcare organisations was recognised as a critical success factor.
The initial requirement for information sharing was to improve patient care, including by developing analytics to prioritise patients who may benefit from proactive intervention, e.g. through risk stratification [8]. Whole-system activity was used to calculate a patient system cost. There was also a need for population-based data to inform the development of what were known as “accountable care partnerships”, in which healthcare providers work with a single pooled budget to take joint responsibility for delivering services for a defined population. The WSIC dataset was therefore created, covering primary care, community and mental health care, secondary and tertiary care, emergency departments and social care.
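The patient system cost mentioned above can be pictured as the sum of costed activity across all care settings for each patient. The following is a minimal sketch only: the record fields (`patient_id`, `setting`, `cost`) and the example values are hypothetical and are not taken from the WSIC data specification.

```python
from collections import defaultdict


def patient_system_cost(activity):
    """Sum the cost of every activity record, across all care settings, per patient.

    `activity` is an iterable of dicts with the hypothetical keys
    'patient_id', 'setting' and 'cost' (in pounds).
    """
    totals = defaultdict(float)
    for record in activity:
        totals[record["patient_id"]] += record["cost"]
    return dict(totals)


# Illustrative, made-up activity records (not real data)
example_activity = [
    {"patient_id": "P1", "setting": "primary care", "cost": 35.5},
    {"patient_id": "P1", "setting": "emergency department", "cost": 160.0},
    {"patient_id": "P2", "setting": "community", "cost": 80.0},
    {"patient_id": "P1", "setting": "social care", "cost": 240.0},
]

costs = patient_system_cost(example_activity)
```

Because the total is built from linked records, a patient seen in only one setting and a patient seen in several are costed on the same basis.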
The WSIC database is currently used for direct patient care, service evaluation, commissioning and now also for research as Discover. For direct patient care, the WSIC team developed disease-specific dashboards, which can be accessed by healthcare professionals with a legitimate relationship with WSIC. For other uses, the database is de-identified. The challenges for using linked databases for service evaluation include data quality (coverage, completeness and accuracy) and producing actionable information from the data. For example, evaluating whether a new service reaches the target population better than the old model requires sufficient years of comparable data before and after the change. It also requires appropriate denominator data, i.e. the whole target population and not just those who actually use the service and are thereby captured electronically. Capturing clinical processes in hospital for audit is still usually done using purpose-built audit databases, as administrative data are very limited in what process measures can be constructed from them. Ideally, processes should be captured electronically during routine care, as is done for neonatology [9].
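The point about denominators can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name `service_reach` and the patient identifiers are hypothetical, and the target population is assumed to come from a whole-population source such as GP registrations rather than from the service's own records.

```python
def service_reach(target_population, service_users):
    """Proportion of the whole target population that used the service.

    The denominator is the full target population, not only the people
    captured electronically because they used the service.
    """
    target = set(target_population)
    if not target:
        raise ValueError("target population is empty")
    reached = target & set(service_users)
    return len(reached) / len(target)


# Made-up identifiers: four people are eligible for the service
target = {"A", "B", "C", "D"}
old_model_users = {"A", "X"}        # "X" is outside the target population
new_model_users = {"A", "B", "C"}

old_reach = service_reach(target, old_model_users)
new_reach = service_reach(target, new_model_users)
```

Dividing by the number of service users instead would hide the eligible people who never made contact, which is the group a reach evaluation is trying to measure.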
For commissioning, WSIC enables examination of healthcare activity in segments of the population. This can support developing integrated services for individuals with similar needs and monitoring their outcomes. This functionality is under development as providers and commissioners move towards integrated care and start to define population outcomes. CCGs need a range of information, crucially including patient information. CCGs not only draw on evidence about what is most clinically or cost-effective but also consider patient experience and clinical staff’s local knowledge.
WSIC/Discover has been funded by the NW London Collaboration of CCGs and by Imperial College Health Partners, a not-for-profit company owned by a partnership of NHS providers of healthcare services, CCGs and leading local universities. These funders have supported the initiative for 7 years, and we are currently exploring whether it can be sustained through licences. The fees for research access currently cover the administrative costs, but we expect to move towards a data licence fee to ensure sustainability. Any organisation wishing to follow this example would need to invest up-front so that the data asset and associated products are developed before they are licensed: this should secure better buy-in from customers, as their use cases will be met.
Database technicalities
The WSIC database uses the Microsoft SQL Server 2012 Enterprise Edition platform and has a combined storage of approximately 1.5 TB. As commissioners are not legally permitted to view patient-level data, the data are provided by an intermediary service, the Data Services for Commissioners Regional Offices (DSCRO). Their task is to provide the acute, mental health and community activity data submitted by providers with clear patient identifiable information to the WSIC team, who carry out the data loading process and create the integrated care record through NHS Number linkage.
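Linkage on NHS Number can be sketched as follows. The source does not describe the WSIC linkage logic, so this is an illustration rather than the team's method: it assumes each incoming record carries an `nhs_number` field (a hypothetical name), checks the number with the standard Modulus 11 check digit, and groups valid records from each source under the same number.

```python
from collections import defaultdict


def is_valid_nhs_number(value):
    """Check a 10-digit NHS Number using the Modulus 11 check digit."""
    digits = value.replace(" ", "")
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights 10 down to 2 are applied to the first nine digits
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check value of 10 means the number is invalid
    return check != 10 and check == int(digits[9])


def link_by_nhs_number(sources):
    """Group records from several sources into one record per NHS Number.

    `sources` maps a source name to a list of dicts; records without a
    valid NHS Number are returned separately instead of being linked.
    """
    linked = defaultdict(lambda: defaultdict(list))
    unlinked = []
    for source_name, records in sources.items():
        for record in records:
            nhs_number = str(record.get("nhs_number", "")).replace(" ", "")
            if is_valid_nhs_number(nhs_number):
                linked[nhs_number][source_name].append(record)
            else:
                unlinked.append((source_name, record))
    return {number: dict(by_source) for number, by_source in linked.items()}, unlinked
```

Keeping unlinkable records in a separate list, rather than dropping them silently, makes it possible to report how much activity could not be attached to an integrated record.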
The primary care clinical systems SystmOne and EMIS are used in the WSIC area, and the data extraction company Apollo extracts the data directly from them. Apollo purges the sensitive codes (e.g. abortions) and patient opt-outs (patients who do not wish their records to be used except for direct care) and then passes the raw data files to the WSIC team. All the data are imported using the WSIC ETL (Extract, Transform, Load) layer, which is built on Microsoft’s Integration Services platform. The primary care data are processed in a separate ‘black box’ environment with restricted access and relevant security provisions to ensure that users are unable to view potentially sensitive data without permission. After ‘purging’, the data are transferred into the WSIC warehouse environment to be linked with secondary care and other data. The WSIC ETL layer contains error-handling features to ensure that invalid data are either redirected and removed from the reporting layer, or imported to the reporting layer while being logged and reported to the clinical users in the form of a Tableau dashboard. Figure 1 shows the architecture.
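The purge and error-handling steps described above can be sketched in a few lines. The real pipeline is built on Integration Services, so this Python version is only an illustration of the logic: the field names (`patient_id`, `code`, `event_date`), the placeholder sensitive code, and the rule deciding which problems redirect a row and which only raise a warning are all assumptions, not details from the source.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ImportResult:
    reporting_rows: list = field(default_factory=list)  # rows that reach the reporting layer
    redirected: list = field(default_factory=list)      # (row, reason): kept out of reporting
    warnings: list = field(default_factory=list)        # (row, reason): imported but reported


def purge(rows, opted_out_ids, sensitive_codes):
    """Drop rows for opted-out patients and rows carrying a sensitive code."""
    return [
        row for row in rows
        if row.get("patient_id") not in opted_out_ids
        and row.get("code") not in sensitive_codes
    ]


def import_rows(rows):
    """Route each row: redirect unusable rows, log a warning for questionable ones."""
    result = ImportResult()
    for row in rows:
        if not row.get("patient_id"):
            # Assumed rule: a row without a patient identifier cannot be linked
            result.redirected.append((row, "missing patient_id"))
            continue
        try:
            date.fromisoformat(row.get("event_date") or "")
        except (TypeError, ValueError):
            # Assumed rule: a bad date is imported but flagged for review
            result.warnings.append((row, "unparseable event_date"))
        result.reporting_rows.append(row)
    return result


# Made-up rows; "SENS-1" is a placeholder, not a real clinical code
raw_rows = [
    {"patient_id": "P1", "code": "C10", "event_date": "2019-03-01"},
    {"patient_id": "P9", "code": "C10", "event_date": "2019-03-02"},
    {"patient_id": "P2", "code": "SENS-1", "event_date": "2019-03-03"},
    {"patient_id": "", "code": "C11", "event_date": "2019-03-04"},
    {"patient_id": "P3", "code": "C12", "event_date": "03/2019"},
]

purged_rows = purge(raw_rows, opted_out_ids={"P9"}, sensitive_codes={"SENS-1"})
result = import_rows(purged_rows)
```

The `warnings` list plays the role of the feed to the dashboard in this sketch, while `redirected` holds the rows that never reach reporting.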
A copy of the WSIC data is available in de-identified form that meets NHS data minimum standards. The version for service evaluation is stored on a dedicated server hosted by the Commissioning Support Unit. To gain access to the de-identified data set, a requester must submit a data access request form to the Security and Access Subgroup for approval. Access is only provided for legitimate use by employees of an organisation that is a signatory of the NWL Digital Information Sharing Agreement (ISA); access may be sponsored by an ISA signatory. The data are provided as SQL tables.
Data held in WSIC follow an agreed data specification that has been signed off by the NWL Digital and Cyber Security Governance Group. The group has been in operation since the development of the original Information Sharing Agreement (2015) and continues to meet monthly. Any changes to the WSIC data specification need to be approved by the NWL Governance Group.
Accessing Discover for research
Researchers use the WSIC dataset on the platform set up by the Discover team. This use is managed by the governance structure in Fig. 2. The Discover Steering Group meets every 2 months, with broader membership coming from the R&D Directors from the Trusts, WSIC, the National Institute for Health Research, patient representatives and Imperial College Health Partners (ICHP). The Steering Group reports to both the ICHP Board and the NWL Digital Information Sharing Group. Its purpose is to hold the Discover Data Access Group (DRAG) to account, to inform wider stakeholder engagement and to provide Discover with strategic direction and an executive decision-making function. The DRAG is chaired by a patient representative and meets monthly to review research proposals on Discover. It is responsible for evaluating whether applications to access Discover are consistent with the Discover Principles Charter and whether the requests pose undue risk to the individuals, communities or organisations to which they relate; this includes evaluating the risk of loss of privacy and obtaining assurance that appropriate protections of confidentiality and ethics review are in place. The Discover team has HRA approval for any retrospective studies submitted to the DRAG until 2023. See Appendix for details and links on how to access Discover.
Consent-to-contact register
As well as retrospective studies with cross-sectional, time-series and cohort designs, WSIC can also be used for prospective follow-up studies, including randomised controlled trials and cohort studies, by tagging the electronic records of patients who have consented to take part. To do this, Discover is developing a register for people interested in contributing to health research. The register is open to anyone aged 18 and over living in NW London, whether healthy or living with a medical condition. It allows the Discover team to contact people who have already consented to be contacted for research, speeding up recruitment. Launched in 2018, it has so far recruited over 3000 volunteers.