Leveraging electronic medical records for HIV testing, care, and treatment programming in Kenya—the national data warehouse project

Background Aggregate electronic data repositories and population-level cross-sectional surveys play a critical role in HIV programme monitoring and surveillance for data-driven decision-making. However, these data sources have inherent limitations including inability to respond to public health priorities in real-time and to longitudinally follow up clients for ascertainment of long-term outcomes. Electronic medical records (EMRs) have tremendous potential to bridge these gaps when harnessed into a centralised data repository. We describe the evolution of EMRs and the development of a centralised national data warehouse (NDW) repository. Further, we describe the distribution and representativeness of data from the NDW and explore its potential for population-level surveillance of HIV testing, care and treatment in Kenya. Main body Health information systems in Kenya have evolved from simple paper records to web-based EMRs with features that support data transmission to the NDW. The NDW design includes four layers: data warehouse application programming interface (DWAPI), central staging, integration service, and data visualization application. The number of health facilities uploading individual-level data to the NDW increased from 666 in 2016 to 1,516 in 2020, covering 41 of 47 counties in Kenya. By the end of 2020, the NDW hosted longitudinal data from 1,928,458 individuals ever started on antiretroviral therapy (ART). In 2020, there were 936,869 individuals who were active on ART in the NDW, compared to 1,219,276 individuals on ART reported in the aggregate-level Kenya Health Information System (KHIS), suggesting 77% coverage. The proportional distribution of individuals on ART by counties in the NDW was consistent with that from KHIS, suggesting representativeness and generalizability at the population level. Conclusion The NDW presents opportunities for individual-level HIV programme monitoring and surveillance because of its longitudinal design and its ability to respond to public health priorities in real-time. A comparison with estimates from KHIS demonstrates that the NDW has high coverage and that the data maybe representative and generalizable at the population-level. The NDW is therefore a unique and complementary resource for HIV programme monitoring and surveillance with potential to strengthen timely data driven decision-making towards HIV epidemic control in Kenya. Database link (https://dwh.nascop.org/).


Background
The HIV pandemic remains a public health challenge four decades since the first documented case.In 2020, there were an estimated 1.5 million new HIV infections and 680,000 HIV-related deaths [1].Significant strides have been made in the fight against the pandemic including a reduction in the number of new HIV infections and HIV-related mortality by 52% and 64% since their peaks in 1997 and 2004, respectively [1].Whilst evidence from research and modelling studies cautiously point towards the feasibility of achieving epidemic control by the year 2030 [2], translating research interventions into programme policies remains arduous [3,4], and robust surveillance of HIV programme activities for timely and data-driven decision-making remains a challenge.
Historically, paper medical records (PMRs) were the primary data source for HIV programme monitoring and surveillance in Kenya.Whilst practical, PMRs had inherent challenges, including prescription and transcription errors, leading to data quality concerns.Other PMRrelated challenges include tedious reporting processes, duplication of tasks, and lack of robust audit trails [5].These challenges rendered PMRs less effective for patient management, reporting, and timely decision-making.
Over the last decade, Kenya adopted the District Health Information System (DHIS) for aggregate-level indicator reporting purposes [6].The DHIS is an opensource web-based data platform that has revolutionised aggregate-level data reporting, making it a rich resource for real-time disease surveillance and mapping of outbreaks.The major limitation with aggregate-level data repositories, including the DHIS, is their inability to provide granular data at the individual-level for a robust analysis of the burden, distribution, and risk factors of prevalent and emerging public health challenges for targeted interventions.
Because of the challenges with PMRs and aggregatelevel electronic data repositories, Kenya also conducts periodic population-based surveys.These include the Kenya demographic health surveys (KDHS) [7][8][9], Kenya AIDS indicator surveys (KAIS) [10,11] and, more recently, Kenya population-based HIV impact assessments (KENPHIA) [12].These cross-sectional stratified random sampling surveys are a unique source of representative and well-characterised individual-level data.They have proved invaluable in providing an evidencebase for decision-making and policy formulation.However, these surveys also have inherent limitations.First, the cross-sectional design implies that the surveys cannot be used to elucidate longitudinal outcomes.Longitudinal surveillance is increasingly relevant in HIV programming as the pandemic has transitioned from an acutely fatal disease to a chronic manageable, and long-term condition [13].Second, these surveys are done periodically, mostly after every five years, and therefore cannot be leveraged to respond to new and emerging HIV programme priorities in real-time for decision-making.
In addition to population-level surveys, Kenya has also leveraged technology to design electronic stand-alone data platforms for monitoring programme-specific indicators like HIV RNA viral load and early infant diagnosis databases [14,15].These platforms offer rich sources of individual-level data, albeit indicator-specific, that are mostly used to monitor point estimates and temporal trends.The primary limitation with these indicator-specific stand-alone repositories is their inability to delineate the HIV care and treatment continuum for a holistic approach to HIV programme surveillance.
With technological advances, many health facilities in Kenya have adopted electronic medical records (EMR) for patient management.If harnessed into a centralised repository, EMR data have immense potential to enhance HIV programme monitoring and surveillance.In recent years, the country embarked on a project aimed at leveraging EMR to develop a national-level data repository for the HIV programme referred to as the national EMR data warehouse (NDW) project.Here, we describe the evolution of EMR systems in Kenya and their contribution to the NDW.Then, we describe the design and features of the NDW.Lastly, we describe the distribution of the PLHIV captured in the NDW and compare with other national level systems to explore its coverage, representativeness and generalisability for HIV programme surveillance in Kenya.

Evolution of electronic medical records in Kenya Introduction of HIV/AIDS management guidelines and scale-up of antiretroviral therapy
In Kenya, EMRs have been in use over many years for various purposes, including patient management, supply chain management, and human resource management.Introduction of EMR for HIV programming, initially in the form of simple windows-based (Ms Excel, Access) platforms, coincided with the launch of the first national guidelines for HIV voluntary counseling and testing in 2001 [16].The introduction of combination antiretroviral therapy (ART) in 2004 necessitated initial discussions for establishing national electronic data capture systems.Thereafter, EMR were increasingly adopted to improve HIV programme efficiencies in data collection, data management, reporting, and decision-making (Fig. 1).

Establishment of electronic medical records standards and guidelines
Between 2007 and 2009, three separate reviews of EMR systems in Kenya were done to assess their design, functionality, and operationalization [17].Combined, these assessments reviewed 33 EMR systems deployed in 69 facilities in Kenya.Overall, the assessments observed that the systems were developed in the absence of clear guidelines and standards for EMR.Specifically, the assessments identified weak infrastructural frameworks, gaps in functionalities, weak interoperability features and poorly defined operationalization processes.These findings informed the recommendation to establish standards to guide the design, development, deployment, and national scalability of EMR.
In 2010, guidelines for the standardization of EMR were developed with an initial focus on the HIV/AIDS programme [18].In brief, the guidelines were broken down into three sections: (i) EMR systems development, with pre-requisite processes to be undertaken, minimum essential functional features to be met, and critical software attributes to be included to ensure high-quality data collection, (ii) EMR system deployment, facilitating interaction between-and-within modules, interoperability with other HIS and data transmission (e.g., via Health Level 7 (HL7) and Statistical Data and Metadata eXchange (SDMX) messaging) to a centralised electronic data repository, and (iii) EMR implementation, with best practice requirements for seamless operationalization, oversight, maintenance support, data governance and a continuous review of processes.
In 2011, a review of EMRs was done to assess compliance to established standards and guidelines, provide a framework for technical support towards standardization, and identify high-performance applications for national scalability [19,20].The review was carried out in 28 facilities supporting 17 EMR across the country.The applications were assessed and scored against a predetermined minimum set of features and functionalities as defined in the standards and guidelines document [18].The systems reviewed were supported by various backend platforms, including MySQL, MSSQL, MS Access, Oracle, and PostgreSQL.Only 6 systems were web-based.EMR systems with the highest weighted scores were OpenMRS AMPATH (AMRS), IQ Care, C-PAD, and Funsoft.Findings from this review informed the development of a road map for system-specific upgrades to meet established standards and guidelines and scale-up at the national level.

Scale up of electronic medical record systems and development of the national data EMR warehouse
In 2014, EMR data were leveraged to inform the HIV programme at the national level for the first time.In brief, data from ~ 500,000 HIV-infected individuals captured in 17 applications deployed in 345 facilities covering 41 of the 47 counties were used to assess twelve-months retention amongst individuals starting ART in Kenya.This inaugural analysis demonstrated the unique potential of EMRs as sources of individual-level data to support decision-making.However, the manual process of data extraction, data collation, and data cleaning underscored In 2015, the development of technical capacity and physical infrastructure for automating individual-level data transmission from EMR into a centralised repository began.In 2016, the NDW was officially launched.The inaugural upload was done in May 2016 and comprised data from two EMR applications deployed in three facilities covering one county (Meru).By the end of 2020, the NDW was hosting individual-level longitudinal data from five EMR applications deployed in ~ 1,500 facilities (out of the 3,458 [44%] ART sites), covering 41 of the 47 counties in Kenya (Fig. 1).

Sources of data
The NDW is a centralised electronic data repository of routinely collected individual-level HIV programme data transmitted from EMR deployed in health facilities offering HIV testing, care, and treatment services in Kenya.Data is collected from clients and either captured in case report forms (CRFs) for retrospective data entry or entered directly into EMR via mobile devices or desktops, or a hybrid of both.CRFs include the HIV Testing Services form (HTS form, or MOH 362) [21], and the clinical encounter services form (also called the green card, or MOH257) [22].Based on variables from CRFs, a comprehensive data dictionary was developed and informed the database schema in the NDW.Then, the data warehouse application programming interface (DWAPI) was installed and configured alongside the EMR at the health facility server to facilitate data transmission to the NDW.DWAPI facilitates vertical interoperability by allowing disparate EMR systems to all transmit data in similar formats into the NDW.However, there are currently no feature to facilitate horizontal (between EMRs) interoperability.Data transmission is done routinely at the end of every month.

Architecture and design
The NDW is comprised of four layers, including DWAPI, the central staging, the integration service, and the data visualization application layers (Fig. 2).DWAPI enables secure data transmission from multiple sources to a centralised repository through the Secure Sockets Layer (SSL).DWAPI's core architecture is developed using Angular and .NET Core frameworks and supports both MSSQL and MySQL environments.Before uploading, data are processed for de-duplication, de-identification, and generation of a patient key value (PKV) identifier.In brief, a PKV is generated using patient demographics (<gender><soundex(fname)><double metaphone (lname)><dob(yyyymmdd).Probabilistic matching is Fig. 2 Architecture and design of the national EMR data warehouse (NDW) outlining the five components of the repository and illustrating infrastructure, processes, and function involved in each component, through from individual-level data collection to data utility at the national level also undertaken where an index record is further compared with other variables (name, date of birth, gender, date of enrolment, etc.) for matches, and a Jaro-Winkler distance score is calculated on the PKV [23].Those that score a match of > 95% are considered as possible matches.DWAPI also generates data quality logs for system administrators to synthesize, and detailed feedback is reverted to users at facilities for corrective measures.
The central staging layer consists of encrypted API endpoints that use REST protocols to receive data over the internet in JSON format.This layer temporarily stores raw data as received from DWAPI and supports MSSQL, MySQL and Postgres environments.Due to high volumes of data received from multiple EMRs simultaneously, data are queued using features from RabbitMQ, MSMQ, and Hangfire for asynchronous background processing services.Data from all sources are merged into one dataset.An extract, transform and load (ETL) pipeline is then applied to extract and process data into separate data marts based on different data transformation cleanup routines.
The integration service layer comprises high-capacity data servers that store processed data marts.Processed data marts are organized and integrated into the servers using the integration service layer.When needed for analytics, relational data models are applied to extract and process data marts, which are subsequently transmitted for different use cases, including the generation of routine automated or non-routine ad hoc informational data products.By the end of 2020, the server's mainframe was running on a Windows Server 2019 with both MySQL version 8 and MS SQL 2017 database platforms with an Intel Xeon 2.2 GHz Central Processing Unit (CPU).
The data visualization application layer receives processed data marts from the servers and applies them for data visualizations using Highcharts, an interactive open-source data visualization software.Highcharts dashboards are developed and configured to allow for interaction, analysis, visualization, and exportation of data outputs based on pre-defined indicators.Informational products are made accessible on the web-front end of the NDW servers using Java Scripts for easy access by stakeholders and the public.This layer also runs customized SQL queries for download of line-listed datasets to be used for in-depth hypothesis driven data analyses.
The NDW continues to evolve in response to changing HIV programme guidelines and to keep abreast with technological advancements towards enhancing its functionalities for efficient, seamless, and secure data transmission, data archiving, and generation of interactive informational products.

Data quality
Data quality concerns in the NDW project are inherent and consistent with any repository leveraging routine program data.Addressing data quality challenges in the NDW is an on-going priority.Data quality is ensured through established structures and policies that are routinely evaluated and upgraded to conform to the evolving NDW environment.In brief, these include: (i) policies governing training for data entry, transmission, management and archiving, (ii) systemic structures embedded within EMR applications including quality checks for mandatory fields, null values, missing values, range checks, logic validations and data completeness, and (iii) Institutionalization of routine data quality assessment (RDQA), data review meetings and feedback structures for EMR end-users, mid-level managers and NDW administrators.

Data use
Routine automated data products The NDW is used to generate routine automated informational data products.These are broadly classified into process indicator reports, quality monitoring reports and HIV programme performance reports.Process indicator reports are generated monthly and include indicators like proportion of facilities uploading data and timely upload of EMR data.Quality monitoring reports are generated monthly and include indicators like proportion of facilities with erroneous data entries and missing data.HIV programme performance reports are auto-generated on-the-fly to provide real-time updates on national performance indicators of priority including numbers tested for HIV (and positivity yield), numbers started and current on ART and numbers tested for HIV viral load (and virologic suppression).All automated reports are configured, and the results visualized using high-charts dashboards on the publicly accessible NDW web interface.
Non-routine ad hoc data products Data from the repository is also leveraged to respond to non-routine and emerging HIV/AIDS programme priorities for timely decision-making.Individual level data is extracted from the repository and an analytic approach is applied to respond to specific questions and hypotheses.Data analysis is carried out using statistical software like R, Stata or SPSS, and results presented in the form of powerpoint slide sets, technical reports, bulletins, policy briefs, and peer-reviewed manuscripts to counties, national and global HIV programme teams.

Data safety, security, and backup
The NDW uses a 256-bit encryption to ensure that data are secured during transmission.A secure socket layer (SSL) has also been integrated to enable a secure connection between the DWAPI client and the API (server).A Sophos XG firewall and VPNs limit access to and secure the warehouse from the public internet.All EMR and the NDW have role-based access controls and password restrictions, with different cadres of staff having different levels of authorization depending on the organization units.
The NDW is configured to run on two separate (primary and secondary) sites located away from each other and are secured from fire and theft.Data in the warehouse is backed up weekly to an off-site backup server and an external drive kept in a secure location.The server rooms are kept in air-conditioned locked rooms accessible only to registered persons with biometric credentials and server room access keys.The main frame of the NDW and all the servers are only accessible remotely via secure FORTI-CLIENT and SOPHOS VPNs.

Data governance and ethical considerations
The NDW project is establishing data governance structures comprising a steering committee and a policy document.The mandate of the data governance steering committee is to oversee the development, implementation, and compliance with data governance policies and regulations.The data governance steering committee will be comprised of stakeholders from the Ministry of Health, the National HIV/AIDS and STI Control Programme, US government agencies, service delivery partners and patient representative groups.Data governance policy and regulations are under development to outline a framework that will uphold privacy and confidentiality, enhance data protection and security, ensure compliance with data regulations, and inform data access, sharing and data use.
Since data are collected from routine service delivery and are intended to be used for national HIV programme evaluation, clients have not been asked for their informed consent.Instead, ethics approval to collect, archive and use data from the repository without retrospectively seeking informed consent from clients, has been obtained under a non-routine non-research determination (NRD) protocol (P716/2019) from AMREF's Health Africa Ethics and Scientific Review Committee (ESRC).This project was reviewed in accordance with CDC human research protection procedures and was determined to be research, but CDC investigators did not interact with human subjects or have access to identifiable data or specimens for research purposes (CDC, reference number 2018-528).

Scale up of data upload and enhancement of features over time
The inaugural upload of data to the NDW was May 2016.By the end of 2016, an estimated 666 facilities had uploaded data from a cumulative 1,162,778 individuals onto the repository.These included individuals seeking HIV testing services and those enrolled for HIV care  and treatment services.By the end of 2020, an estimated 1,516 facilities had uploaded data from a cumulative 4,833,786 individuals onto the repository (Fig. 3).Over time, major enhancements were also implemented on the NDW infrastructure and functionalities in tandem, in response to the large upload volumes, and to keep abreast with technological advancements (Table 1).Plans are underway to enhance the NDW to include data marts and corresponding reports and dashboard for priority HIV program areas including prevention of mother to child transmission (PMTCT), pre-exposure prophylaxis (PreP), key populations (KP) and voluntary medical male circumcision (VMMC).

Distribution of the HIV care and treatment surveillance population
By the end of 2020, the NDW hosted individual-level longitudinal data from 1,928,458 clients ever started on antiretroviral therapy (ART) in Kenya.These data were transmitted from six EMR applications deployed by 35 service delivery partners supporting 1,516 out of 3,458 (44%) facilities offering care and treatment services, covering 41 of the 47 counties in Kenya (Table 2).
Overall, the NDW hosts longitudinal data from individuals who started ART dating back to the early 2000s.Individuals starting ART increased exponentially from the year 2004.Two peaks in the number of individuals starting ART were observed around 2010-2011 and 2014-2016 (Fig. 4a).Majority of clients starting ART were female (n = 1,287,674 [66.8%]), and this was consistent over the years.The median age at ART initiation among children (< 15 years) was 5.1 (IQR, 1.8 -9.4) years, with the majority starting ART at < 5 years.Among adults, male clients were older at ART initiation compared to females (median age, 38.4 [IQR, 32.0 -46.1] vs. 32.6 [IQR, 26.7 -40.4] years, respectively).The highest proportion of male clients starting ART were 35-39 years (n = 105,517 [16.5%]) whilst that of female clients was 25-29 years (n = 241,836 [18.8%]) (Fig. 4b).

Comparing the spatial distribution of the NDW surveillance population to other data sources
Indicators from the Kenya Health information Systems (KHIS), a well-established national aggregate-level data repository, were used.The number of individuals on ART and on follow-up care in 2020 was used as a proxy indicator for comparison with data from the NDW.From the NDW, an estimated 936,869 individuals picked an ART drug re-fill at least once in 2020.Drugs were picked from facilities supported by the six EMR applications deployed in 41 counties.Most of the facilities were using the Keny-aEMR system.Most of the facilities are clustered around the western part of the country (Fig. 5a).
Overall, the proportional distribution of individuals by counties from the NDW was consistent with that from KHIS (Fig. 5b).In brief, data from both the NDW and KHIS demonstrated that Nairobi County contributed the highest proportion of individuals on ART in 2020.Indeed, Nairobi (14.1% vs 13.6%), Siaya (9.8% vs 7.7%), Homa bay (9.6% vs 9.4%) and Kisumu (9.3 vs 9.2%%) were the top four counties with the highest proportional contribution of individuals on ART from both the NDW and KHIS, respectively.Further and by 2020, there were no EMRs deployed in six counties in the north-eastern part of Kenya, including Wajir, Mandera, Garissa, Marsabit, Isiolo and Lamu, because of their very low to negligible burden of HIV.Similarly, data from KHIS reported that Wajir County had the lowest proportional contribution (0.02%), followed by Mandera (0.06%), Garissa (0.09%), Marsabit (0.10%), Isiolo (0.11%) and Lamu (0.14%).

Discussion
Stand-alone electronic data platforms, aggregate-reporting data repositories, and population-level cross-sectional surveys play a critical role in disease surveillance and data-driven decision-making in the HIV/AIDS programme.However, these data sources have inherent limitations in situations requiring real-time public health responses and in longitudinal follow-up of clients diagnosed with HIV.When integrated into the NDW, EMR data have tremendous potential to bridge this gap by providing a longitudinal data platform for tracking HIV programme performance at the patient level facilitating real-time evidence-based decision-making.
Covering 41 of 47 counties, the NDW has extensive geographic coverage.Six counties are yet to be supported to contribute data to the NDW.These are HIV low-burden counties with an estimated zero number of new HIV infections in 2018 [24].Most of the facilities using EMR systems were concentrated in counties around the Lake Victoria in Western Kenya, particularly Homa Bay, Siaya, and Kisumu.This is a reflection of the high HIV burden, with prevalence estimates of up to 19% in the region, and is consistent with the National HIV/ AIDS programme strategy of targeted interventions for maximum impact [24].Importantly and when compared with data from the KHIS as a proxy, the proportional distribution of counties were consistent in ranking and comparable with those from the NDW [24].In 2020, the NDW captured visits from 936,869 clients on ART, compared to an estimated 1,219,276 clients on ART reported by KHIS, suggesting a 77% representation of PLHIVs on ART in Kenya at the time.Thus, while the NDW has low EMR coverage (1,516 out of 3,483 ART-providing facilities in Kenya [44%]), it seems to achieve high (77%) individual level coverage at the population-level.This observation also suggests that the ~ 56% of facilities not included in the NDW are likely from low volume facilities that capture < 25% of PLHIV on ART in Kenya.
These results indicate notable progress towards representativeness and generalisability of the NDW at the population-level.There was an exponential increase in clients starting ART from 2004, with two subsequent peaks in 2010-2011 and 2014-2016.In 2004, only clients who had WHO clinical staging III/IV or CD4 T-cell count of < 200 cells/mm 3 were eligible for ART initiation [25].The CD4 classification was revised over time to CD4 < 350 cells/mm 3 in 2010, and for all HIV-infected clients regardless of the clinical or immunological status in 2014 [26,27].The peaks in ART initiation around these periods reflect the changing guidelines and demonstrate the unique potential for the NDW as a resource to assess impact of changes in guidelines and policies on the HIV programme.Currently, the NDW is the only programmatic resource that offers this unique perspective on the HIV programme in Kenya.
The NDW has two main strengths.First, the longitudinal structure makes it a unique and incomparable resource for ascertaining longitudinal outcomes at the patient-level in Kenya.Second, the monthly data transmission to the NDW makes it possible to respond to new and emerging HIV programme priorities in realtime for decision-making.However, the NDW is not without limitations.First, while tremendous effort has been put towards setting up structures and policies for generation and archiving of high-quality data in the NDW, data quality concerns cannot be entirely ruled out.Second, while the national programme recommends consistent use of the comprehensive care clinic (CCC) unique identifier, compliance continues to present a challenge.To mitigate this, DWAPI applies a de-duplication algorithm that generates a secondary PKV at the source before data are uploaded, which has been shown to yield up to 96% of unique probabilistic matches [23].Still, mismatches cannot be ruled out.Last, while there is notable progress in the DWH coverage, the remaining sites, which are likely low volume sites, may cater to a different population with unique challenges.This may lead to selection bias and a lack of generalizability and therefore solutions to collect data from non-EMR sites should be explored.

Conclusions
Tremendous gains have been made by leveraging technological advancements for monitoring and surveillance of the HIV programme in Kenya.Over the last two decades, medical information systems have significantly evolved from simple PMRs to sophisticated web-based EMRs with APIs that support data transmission to a centralised repository.The NDW is a powerful resource with capabilities to monitor longitudinal outcomes and to respond to new and emerging public health priorities in real time.A comparison with estimates from KHIS demonstrates that the NDW has high coverage and that the data maybe representative and generalizable at the population-level.The NDW is, therefore a unique and complementary resource for HIV programme monitoring and surveillance with potential to strengthen timely data driven decision-making towards HIV epidemic control in Kenya.

Fig. 1
Fig. 1 Background and evolution of electronic medical record applications and the national EMR data warehouse (NDW) in the backdrop of ever-evolving technology and HIV program guidelines and policies between 2001 and 2020 in Kenya

Fig. 3
Fig. 3 Graph showing number of EMR sites uploading data into the NDW and the number of client records uploaded onto the National EMR Data Warehouse (NDW) between 2016 and 2020

Fig. 4 aFig. 5 a
Fig. 4 a Graph illustrating absolute and cumulative frequency of clients starting antiretroviral treatment by gender over calendar years, and b Population pyramid showing distribution of clients by gender and age groups, from facilities transmitting individual level data from electronic medical record applications to the national EMR data warehouse (NDW) in Kenya (N = 1,928,458)

Table 1
Enhancements implemented in the NDW infrastructure and functionalities to respond to the increasing number of sites and volumes uploaded in the repository between 2016 and 2020

Table 2
A summary of service delivery partners supporting electronic medical record applications that have ever transmitted individual-level data from antiretroviral treatment facilities to the National EMR Data Warehouse (NDW) by the end of 2020 in Kenya