Data Cleaning Process for HIV Indicator Data Extracted from DHIS2 National Reporting System: Case Example of Kenya


 Background

The District Health Information Software 2 (DHIS2) is widely used by countries for national-level aggregate reporting of health data. To best leverage DHIS2 data for decision-making, countries need to ensure that data within their systems are of the highest quality. Comprehensive, systematic and transparent data cleaning approaches form a core component of preparing DHIS2 data for use. Unfortunately, there is a paucity of exhaustive and systematic descriptions of data cleaning processes employed on DHIS2-based data. In this paper, we describe the results of a systematic data cleaning approach applied to a national-level DHIS2 instance, using Kenya as the case example.
Methods

Broeck et al's framework, involving repeated cycles of a three-phase process (data screening, data diagnosis and data treatment), was employed on six HIV indicator reports collected monthly from all care facilities in Kenya from 2011 to 2018. This resulted in repeated facility reporting instances. The quality dimensions evaluated included reporting rate, reporting timeliness, and indicator completeness of submitted reports, each assessed per facility per year. The various error types were categorized, and Friedman analyses of variance were conducted to examine differences in the distribution of facilities by error type. Data cleaning was done during the treatment phases.
Results

A generic five-step data cleaning sequence was developed and applied in cleaning HIV indicator data reports extracted from DHIS2. Initially, 93,179 facility reporting instances were extracted for the years 2011 to 2018. Of these, 50.23% submitted no reports and were removed. Of the remaining reporting instances, 0.03% showed over-reporting. Quality issues related to timeliness included scenarios where reports were empty or had data but were never on time; the percentage of reporting instances in these scenarios varied by report type. Among submitted reports, empty reports also varied by report type, ranging from 1.32% to 18.04%. Report quality varied significantly by facility distribution (p = 0.00) and report type.
Conclusions

The case instance of Kenya reveals significant data quality issues for HIV reported data that were not detected by the inbuilt error detection procedures within DHIS2. More robust and systematic data cleaning processes should be integrated into current DHIS2 implementations to ensure the highest quality data.


Background
Monitoring and Evaluation (M&E) plays a key role in the planning of any national health program. DeLay et al. defined M&E as "acquiring, analysing and making use of relevant, accurate, timely and affordable information from multiple sources for the purpose of program improvement (1,2)." With good M&E systems, data can be used to provide information and generate knowledge to monitor delivery of care, support reporting, inform healthcare planning, measure progress and improve accountability, among others (3). However, these goals can only be achieved if the data collected are of high quality.
To help with M&E in low- and middle-income countries (LMICs), reporting indicators have been highly advocated for use across many disease domains, with HIV indicators among the most common ones reported at national level in many countries. Over the years, national-level data aggregation systems, such as the District Health Information Software 2 (DHIS2) (4), have been widely adopted for use in collecting, aggregating and analyzing indicator data. DHIS2 has been implemented in over 40 LMICs, with the health indicator data reported within the system used for national- and regional-level health-related decision-making, advocacy, and M&E (5).
It is well-recognized that the data within aggregate systems, such as DHIS2, are only as good as their quality. As such, various approaches have been implemented within systems like DHIS2 to improve data quality. Some of these approaches include: (a) validation during data entry in order to ensure data are captured using the right formats and within pre-defined ranges and constraints; (b) user-defined validation rules; (c) automated outlier analysis functions; and (d) automated calculations and reporting of data coverage and completeness (6).
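As an illustration of approaches (a) and (b), a data-entry range check and a user-defined consistency rule can be sketched as below. This is a minimal sketch, not DHIS2's actual validation engine; the function names, the bounds, and the tested/positive rule are hypothetical.

```python
def validate_entry(value, min_allowed=0, max_allowed=10000):
    """Range check mimicking data-entry validation: flag values that
    are non-numeric or outside a pre-defined plausible range
    (the bounds here are hypothetical)."""
    errors = []
    if not isinstance(value, (int, float)):
        errors.append("not numeric")
    elif not (min_allowed <= value <= max_allowed):
        errors.append(f"out of range [{min_allowed}, {max_allowed}]")
    return errors


def rule_positive_lte_tested(tested, positive):
    """A user-defined validation rule: the number testing positive
    cannot exceed the number tested (hypothetical rule)."""
    return positive <= tested
```

In a real deployment, rules like these would be evaluated at data-entry time so that errors are caught before reports reach the aggregate level.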
Despite data quality approaches having been implemented within DHIS2, data quality issues remain a thorny problem. A number of evaluations have looked at how well the existing data quality approaches actually ensure that the data contained within the systems are of the highest quality for use in decision-making and analysis (7-10). Nevertheless, these studies largely fail to exhaustively and systematically describe the steps used in cleaning DHIS2 data before analysis is done.
Ideally, data cleaning should be done systematically, and good data cleaning practice requires transparency and proper documentation of all procedures taken to clean the data (11,12). A closer and systematic look into data cleaning approaches, and a clear outlining of the distribution or characteristics of data quality issues encountered in DHIS2 could be instructive in informing approaches to further ensure higher quality data for decision-making. Further, employment of additional data cleaning steps will ensure that good quality data is available from the widely deployed DHIS2 system for use in accurate decision making and knowledge generation.
In this paper, we report on application of systematic and replicable data cleaning approaches that have specific applicability to the broadly implemented DHIS2 national reporting system. We also present the distribution of data quality issues encountered during the cleaning process of DHIS2, using Kenya as a reference country case. Our approach is guided by a conceptual data-cleaning framework, with a focus on uncovering data quality issues often missed by existing automated approaches. From our evaluation, we provide recommendations on extracting and cleaning data for analysis from DHIS2, which can be used by M&E teams within Ministries of Health in LMICs and by researchers to ensure high quality data for analysis and decision-making.

Method
Data cleaning approaches
Data cleaning is defined as "the process used to determine inaccurate, incomplete, or unreasonable data and then improving the quality through correction of detected errors and omissions" (13). Data cleaning is essential to transform raw data into quality data for purposes such as analysis and data mining (14). An extensive body of work exists on how to clean data (15-17). The approaches that can be employed include quantitative and qualitative methods. Quantitative approaches employ statistical methods and are largely used to detect outliers (18-20). On the other hand, qualitative techniques use patterns, constraints, and rules to detect errors (21). These approaches can be applied within automated data cleaning tools such as ARKTOS, AJAX, FraQL, Potter's Wheel and IntelliClean (17,21,22).
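A quantitative (statistical) screen of the kind referenced above can be as simple as a z-score rule. This is a sketch only; the threshold of 3 is a common convention, not one prescribed by the cited tools.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose z-score exceeds the threshold --
    a minimal statistical outlier screen."""
    m, s = mean(values), stdev(values)
    if s == 0:  # all values identical: nothing can be an outlier
        return []
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]
```

A qualitative technique, by contrast, would encode a pattern or rule (e.g. "a percentage must lie between 0 and 100") rather than a statistical property of the distribution.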

Data Cleaning Framework
While tools and approaches exist for data cleaning, no standard consensus-based approach exists to ensure that replicable and rigorous data cleaning standards are applied to DHIS2 data, which is widely used by countries for health decision-making. Consequently, ad hoc data cleaning approaches have been employed, and implementations often fail to explicitly disclose the systematic data cleaning strategies used and the resulting errors identified. This makes it difficult to replicate data cleaning procedures and to ensure that all types of quality issues are systematically addressed prior to the use of data for analysis and decision-making.
A limited number of frameworks exist to guide error detection within data sets, and these can be adapted to recommend a systematic approach for cleaning DHIS2 data. Oftentimes, specific frameworks are applied based on the data set and the aims of the cleaning exercise (23,24).
Our study's data cleaning approach was informed by a conceptual data-cleaning framework recommended by Broeck et al. (11). Broeck et al's framework was used because it provides a deliberate and systematic data cleaning guideline that is amenable to being tailored towards cleaning data extracted from DHIS2. This framework presents data cleaning as a three-phase process involving repeated cycles of data screening, data diagnosis, and data editing of suspected data abnormalities.
The screening process involves identification of lacking or excess data, outliers, inconsistencies, and strange patterns (11). Diagnosis involves determination of errors or missing data, and of any true extremes and true normals (11). Editing involves correction or deletion of any identified errors (11).
Broeck et al's framework has also been extensively applied and validated in various settings (25,26).
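The repeated screening-diagnosis-treatment cycle can be sketched as a generic driver loop. The screen/diagnose/treat callables are placeholders for whatever checks a given data set needs, and the stopping rule (screening finds nothing suspicious) is our reading of the framework, not a prescription from it.

```python
def clean(dataset, screen, diagnose, treat, max_cycles=10):
    """Apply repeated cycles of data screening, diagnosis and
    treatment (editing), stopping once screening finds nothing
    suspicious or the cycle budget is exhausted."""
    for _ in range(max_cycles):
        suspects = screen(dataset)          # phase 1: screening
        if not suspects:
            break
        findings = diagnose(suspects)       # phase 2: diagnosis
        dataset = treat(dataset, findings)  # phase 3: treatment
    return dataset

# Toy usage: screen for negatives, diagnose them all as errors, delete them.
cleaned = clean(
    [1, -2, 3, -4],
    screen=lambda d: [x for x in d if x < 0],
    diagnose=lambda s: set(s),
    treat=lambda d, f: [x for x in d if x not in f],
)
```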

Study Setting
This study was conducted in Kenya, a country in East Africa. Kenya adopted DHIS2 for use for its national reporting in 2011 (4). The country has 47 administrative counties and all the counties report a range of healthcare indicator data from care facilities and settings into the DHIS2 system. For the purposes of this study, we focused specifically on HIV indicator data reported within Kenya's DHIS2 system, given that these are the most comprehensively reported set of indicators into the system.

Data Cleaning Process
Adapting Broeck et al's framework, a step-by-step approach was used during extraction and cleaning of the data from DHIS2. These steps are generic and can be replicated by others conducting robust data cleaning on DHIS2. The steps are outlined below:
i. Step 1 - Outline the analyses or evaluation questions: Prior to applying Broeck et al's conceptual framework, it is important to identify the exact evaluations or analyses to be conducted, as this helps define the data cleaning exercise.
ii. Step 2 - Describe the data and study variables: This step is important for defining the data elements that will be needed for the evaluation data set.
iii. Step 3 - Create the database: This step involves identifying the data needed and extracting data from relevant databases to generate the final data set. Oftentimes, development of this database might require combining data from different sources.
iv. Step 4 - Apply the framework for data cleaning: During this step, the three data cleaning phases (screening, diagnosis, and treatment) in Broeck et al's framework are applied on the data set created.
v. Step 5 - Analyze the data: This step provides a summary of the data quality issues discovered, the data eliminated during the treatment exercise, and the retained final data set on which analyses can then be done.

Application of data cleaning process: Kenya HIV indicator reporting case example
In this section, we present the application of the data cleaning sequence above using Kenya as case example.
Step 1: Outline the analyses or evaluation questions and goals
For this reference case, DHIS2 data had to undergo the data cleaning process prior to use of the data for an evaluation question on 'Performance of health care facilities at meeting the mandated HIV-indicator reporting requirements by the Kenyan Ministry of Health (MOH)'. The goal was to identify the best- and poorest-performing health facilities at reporting within the country, using the completeness and timeliness of their reports into DHIS2.
Step 2: Description of data and study variables
For our use case, we wanted to create a data set to determine the performance of facilities at meeting the MOH reporting requirements by evaluating completeness and timeliness of reporting.
Completeness in reporting by facilities within Kenya's DHIS2 is measured as a continuous variable ranging from 0 to 100% and is identified within the system by a variable called 'Reporting Rate' (RR). RR is calculated automatically within DHIS2 as the percentage of the actual number of reports submitted by each facility into DHIS2 divided by the expected number of reports from the facility (Percent RR = # submitted reports / expected # of reports * 100). It should be noted that this RR calculation only looks at report submission and not the content within the reports. As such, a report may be submitted blank or with missing indicators, but will be counted as complete simply because it was submitted. At the end of each year, DHIS2 calculates the cumulative RR for the whole year. Timeliness is calculated based on whether reports were submitted by the 15th day of the reporting period, as set by the MOH. Timeliness is represented in DHIS2 as 'Reporting Rate on Time' (RRT), and is also calculated automatically. The RRT for a facility is measured as the percentage of the actual number of reports submitted on time by the facility divided by the expected number of reports (Percent RRT = # reports submitted on time / expected # of reports * 100).
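The two formulas above can be written directly as code. This is a sketch of the arithmetic only, not DHIS2's internal implementation.

```python
def reporting_rate(submitted, expected):
    """Percent RR = # submitted reports / expected # of reports * 100."""
    return submitted / expected * 100


def reporting_rate_on_time(on_time, expected):
    """Percent RRT = # reports submitted on time / expected # of reports * 100."""
    return on_time / expected * 100


# A facility reporting monthly is expected to file 12 reports per year.
annual_rr = reporting_rate(12, 12)          # all 12 submitted -> 100.0
annual_rrt = reporting_rate_on_time(9, 12)  # 9 on time -> 75.0
```

Note that, as the text points out, RR counts a blank submitted report as complete; the formula has no view into report content.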
Step 3: Create the database
After obtaining Institutional Review Board (IRB) approval for this work, we set out to create our database from three data sources.
Step 4: Application of the framework for data cleaning
During diagnosis, suspected abnormalities were classified as errors, missing data, true extremes, true normals, or idiopathic (no diagnosis found, but the data still suspected of having errors) (11). We used a combination of RR, RRT and CPC to detect the various types of situations (errors or no errors) for each facility per annual report, categorized for every year a facility reported into DHIS2 (Table 1). In this table, "0" represents a situation where the percentage is zero; "X" represents a situation where the percentage is above zero; and ">100%" represents a situation where the percentage is more than 100. Based on the values of each of the three variables, it was possible to diagnose the various issues within DHIS2 (Diagnosis column). Table 2 provides a sectional illustration of the first data set by year and organisation unit. As per Table 1, reports in situations A-F were deleted and hence excluded from the study. Duplicates identified in the scenarios mentioned were also excluded from the study. Thus, only reports in situations G and H were considered ideal for the final clean data set.
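The Table 1 coding can be sketched as a small classifier over the three variables. Only the two patterns spelled out in the text (Situation D = X00 and Situation B = 0XX) are mapped here; the remaining Table 1 situations are not reproduced in this paper excerpt, so unmatched patterns are returned as 'unmapped'.

```python
def code(pct):
    """Encode a percentage using the Table 1 notation: '0' when the
    percentage is zero, '>100%' when above 100, 'X' otherwise."""
    if pct == 0:
        return "0"
    if pct > 100:
        return ">100%"
    return "X"


# Patterns are (CPC, RR, RRT) codes; only those described in the text.
KNOWN_SITUATIONS = {
    ("X", "0", "0"): "D",  # data present in the report but RR and RRT are zero
    ("0", "X", "X"): "B",  # empty report, yet RR and RRT were recorded
}


def diagnose(cpc, rr, rrt):
    """Diagnose one facility-year by matching its coded pattern."""
    pattern = (code(cpc), code(rr), code(rrt))
    return KNOWN_SITUATIONS.get(pattern, "unmapped")
```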
Step 5: Data analysis
The data were then disaggregated to form six individual data sets representing each of the programmatic areas, containing the facility and year. The disaggregation was done because facilities offer different services and do not necessarily report in all the programmatic areas. SPSS was used to analyze the data using frequency distributions and cross tabulations in order to screen for duplication and outliers. Individual health facilities with more than eight reporting instances for a specific report type (data set) were identified as duplicates. The basis for this is that the maximum number of reporting instances for an individual health facility is eight, given that data were extracted over an eight-year period. From the cross tabulations, percentage RR and RRT values that were above 100% were identified as outliers.
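The duplicate and outlier screens described for Step 5 can be sketched with plain frequency counts. The record layout (dictionaries with facility, report_type, rr and rrt keys) is an illustrative choice of ours, not the SPSS workflow actually used.

```python
from collections import Counter

def screen_instances(records, max_instances=8):
    """Flag (facility, report type) pairs with more than `max_instances`
    rows as duplicates (only eight annual instances are possible over an
    eight-year extraction window), and flag RR or RRT above 100% as
    outliers."""
    counts = Counter((r["facility"], r["report_type"]) for r in records)
    duplicates = {key for key, n in counts.items() if n > max_instances}
    outliers = [r for r in records if r["rr"] > 100 or r["rrt"] > 100]
    return duplicates, outliers


# Toy data: nine CRT rows for one facility (one too many) and one outlier.
rows = [{"facility": "F1", "report_type": "CRT", "rr": 50, "rrt": 40}] * 9
rows.append({"facility": "F2", "report_type": "PEP", "rr": 120, "rrt": 90})
dups, outs = screen_instances(rows)
```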
After the multiple iterations of data cleaning as per Fig. 2, where erroneous data were removed by situation type (identified in Table 1), a final clean data set was available and brought forward for use in answering the evaluation question. At the end of the data cleaning exercise, we determined the percentage distribution of the various situation types that resulted in the final data set. Using this analysis and the descriptions from Table 1, situations where data were present in reports but no values were present for RR and RRT (Situation D), and scenarios with empty reports (Situation B), were analyzed (Fig. 4). This was in order to examine whether there were differences in the distribution of facilities by programmatic area across the eight years, categorized by Situation D and empty reports. Facilities were most likely to submit empty PEP reports. Table 4 reveals that PMTCT and CRT had a higher distribution of facilities with D (X00) issues on their reports in all eight years, and Table 5 reveals that PEP had a higher distribution of facilities with empty reports, B (0XX), in all eight years.

Discussion
Data quality problems due to replicated entries, missing information or other invalid data are common in integrated data sources such as data warehouses and nationally aggregated systems (27). As such, data in these systems cannot be used in their current form for decision-making. Failure to detect data quality issues and to clean these data can lead to inaccurate analysis outcomes, which impact decision-making. The need for data cleaning therefore increases significantly when multiple sources need to be integrated. DHIS2, which acts as a data warehouse for storing routine aggregate data submitted by multiple health facilities, is not immune to data quality issues, as revealed in this paper.
Although various approaches have been implemented within systems like DHIS2 to improve data quality, there is still a need for systematic and more robust data cleaning mechanisms.
Systematic data cleaning approaches are salient in identifying and resolving issues within the data, resulting in a clean data set that can be used for analysis and decision-making. In addition, identifying various issues within the data may require a human-driven approach, as inbuilt data quality checking mechanisms within systems may not have the benefit of particular domain knowledge. For instance, our knowledge about health facility reporting enabled us to identify the various situations described in Table 1. This entailed examining more than one column at a time of manually integrated databases. In addition, descriptive statistics such as cross tabulations and frequency counts complemented the human-driven processes.
As revealed in the screening, diagnosis and treatment phases presented in this paper, the data cleaning process can be more time consuming than the analysis process. Real-world data such as the DHIS2 data, and merged real-world data sets as shown in this paper, may be noisy, inconsistent and incomplete. In the treatment stage, we present the actions taken to ensure that only meaningful data were included for analysis. Data cleaning also resulted in a smaller data set than the original, as demonstrated in the results. A perceived belief is that more data could lead to more results; nevertheless, meaningful data that come about through the process of cleaning are beneficial for decision-making (14).
In this paper, we used Broeck et al's framework to identify various issues within the data, such as duplicate records, data present in reports but no values present for RR and RRT (Situation D), and over-reporting (Situations E and F) (11). The non-parametric tests conducted (Friedman ANOVA and Wilcoxon Signed-Rank Test) brought into perspective significant differences in the distribution of facilities with selected situation types across different periods. As observed, PMTCT and CRT (mean ranks of 5.88 and 5.13, respectively) had the highest distribution of facilities with Situation D (X00), while PEP (mean rank 6.00) had the highest distribution of facilities with empty reports (B (0XX)). As such, for facilities submitting reports, the distribution of facilities with the selected situation types varied significantly by the type of report submitted. In addition, a shortfall in DHIS2 is the lack of recording of actual zeros (0) in the reports, which may give an impression of incomplete reporting of indicators because the values are blank. This is similar to observations in other studies (10), and makes it difficult to distinguish between missing values and true zero values. There is therefore a need to ensure that cases reported as zero appear in DHIS2.
The human-augmented processes used in this study facilitated diagnosis of the different situations, which would otherwise have gone unidentified. The quantitative techniques presented by Hellerstein (19) further underscore the need for human involvement, as some techniques, for instance those used in outlier detection, may fail to flag outliers and instead "mask" them based on underlying automatic procedures ingrained in the system. Nevertheless, there are also limitations with human-augmented procedures, as humans are prone to error, especially when dealing with extremely large data sets. In addition, data cleaning for large data sets can be time consuming. Nonetheless, identifying and understanding issues within the data using a human-driven approach provides better perspective prior to developing automatic procedures, which can then detect the identified issues.
There is therefore a need to develop automated procedures that can identify the various situations addressed in this paper. For example, various approaches can be developed for detecting the different situation types in Table 1, such as implementing validation rules, univariate outlier detection (discovering issues within the data by examining one column at a time) and multivariate outlier detection (discovering issues within the data by examining more than one column at a time), accompanied by data visualizations. Further still, automated analytic procedures can be developed within the system to perform various analyses, such as calculating the number of empty reports submitted by a facility over a given period of time. This could provide beneficial practical implications, such as enabling decision-makers to understand the frequency of provision of certain services among the six programmatic areas within a particular period among health facilities. Such findings could be used to improve the quality of reporting. Automatic procedures should also be accompanied by data visualizations and analyses, integrated within the iterative process, in order to provide insights (19).
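The univariate and multivariate checks proposed above could look like the following sketch. The column names and the RRT-versus-RR consistency rule are illustrative choices of ours, not features of DHIS2.

```python
def univariate_flags(rows, column, low=0, high=100):
    """Univariate check: examine one column at a time and flag
    row indices whose value falls outside the allowed range."""
    return [i for i, r in enumerate(rows) if not (low <= r[column] <= high)]


def multivariate_flags(rows):
    """Multivariate check: examine columns jointly -- a report cannot be
    on time more often than it is submitted, so RRT must not exceed RR."""
    return [i for i, r in enumerate(rows) if r["rrt"] > r["rr"]]


rows = [{"rr": 80, "rrt": 60}, {"rr": 120, "rrt": 50}, {"rr": 70, "rrt": 90}]
univariate_hits = univariate_flags(rows, "rr")  # row 1: RR above 100%
multivariate_hits = multivariate_flags(rows)    # row 2: RRT exceeds RR
```

Note that the second row passes the multivariate check and the third passes the univariate one: neither check alone catches both errors, which is why the text recommends combining them.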
In addition, user engagement in the development of automatic procedures, and actively training users to identify and discover various issues within the data, may contribute to better data quality (19,21).

Conclusion
Comprehensive, transparent and systematic reporting of the cleaning process is important for the validity of research studies. The data cleaning described in this article was semi-automatic. It complemented the automatic procedures and resulted in improved data quality, which could not be achieved by the automated procedures alone. In addition, this was the first systematic attempt to explore data quality and design data cleaning procedures for HIV indicator data reporting in DHIS2.

Consent for publication
Not applicable

Availability of data and materials
The data sets generated and analysed during the current study are available in the national District Health Information Software 2 online database at https://hiskenya.org/.

Figure legends: Creation of the evaluation data set; Repeated cycles of data cleaning; Distribution of facility instances based on empty reports (B) and error type (D) against programmatic area.