Mapping command line functions to GUI elements
The sections of the PDF report created by DQAstats were gradually migrated to the GUI and tailored to the web interface. The summary overview of the completeness and conformance checks was integrated into the main GUI dashboard, which is displayed automatically after the DQ analyses have completed. The results of the automated comparison between two databases (completeness checks) are highlighted in color to draw attention to detected irregularities (see Additional file 1: Figs. S2 and S3). Tabs were introduced for easy navigation between the different DQ check results.
Characteristic details of each analyzed data element were previously provided in the PDF report’s section “Detailed Descriptive Results”. This information is now provided on a new GUI screen named Descriptive Results. Since the results for one data element are often dispersed across several pages of the PDF report (see Additional file 1: Fig. S4), comparing the characteristics of the data and the results of the DQ analysis between both systems can be cumbersome and error-prone. In the GUI, the findings for each data element in the source and target database are now displayed side by side, which simplifies the comprehension of the results and enables direct comparison (see Figs. 3 and 4).
Similar to the PDF report, the descriptive results show whether a data element’s values adhere to the conformance criteria (value conformance) specified in the metadata repository (MDR). Furthermore, the available metadata of the data element itself, such as the variable name, a short description, and its data type, as well as information on the data element’s mappings in the data sets, are displayed here; for example, the table from which the data element was loaded and the variable name in the respective database are visible. The visualization of the DQ analysis results depends on the variable type: basic distribution parameters (such as minimum, median, mean, standard deviation, and maximum) are calculated and displayed for numeric values or dates, whereas unique values and frequency counts are shown for categorical data elements or strings (see Figs. 3 and 4).
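The type-dependent presentation can be illustrated with a minimal base R sketch; the function name and the exact set of statistics shown here are illustrative and do not reproduce the internal DQAstats implementation.

```r
# Minimal sketch of type-dependent descriptive results (illustrative only,
# not the actual DQAstats implementation).
describe_element <- function(x) {
  if (is.numeric(x) || inherits(x, "Date")) {
    # numeric values or dates: basic distribution parameters
    c(min = min(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      mean = mean(x, na.rm = TRUE),
      sd = sd(x, na.rm = TRUE),
      max = max(x, na.rm = TRUE))
  } else {
    # categorical data elements or strings: unique values and frequency counts
    table(x, useNA = "ifany")
  }
}

describe_element(c(1.2, 3.4, 5.6, NA))          # distribution parameters
describe_element(c("male", "female", "female")) # frequency counts
```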
As a further enhancement when analyzing SQL databases, the SQL statement that underpins a data element can now be viewed at the click of a button (see Fig. 4) instead of being provided only in the appendix of the PDF report. These SQL statements can be copied from the GUI and pasted into a database tool to serve as a starting point for a more detailed investigation of possible irregularities identified by the DQA tool.
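As an illustration of this workflow, a statement copied from the GUI could also be re-executed directly in R via the DBI package; the connection parameters, table, and column names below are hypothetical and do not correspond to the actual MIRACUM schema or the SQL generated by the tool.

```r
# Illustrative only: re-running an SQL statement copied from the GUI.
# Connection parameters, table, and column names are hypothetical.
library(DBI)

con <- dbConnect(
  RPostgres::Postgres(),
  dbname = "source_db", host = "localhost",
  user = "dqa", password = "secret"
)

sql <- "SELECT person_id, birth_date FROM person WHERE birth_date IS NULL;"
flagged_rows <- dbGetQuery(con, sql)  # inspect the potentially irregular records

dbDisconnect(con)
```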
The Plausibility Checks are visualized in a separate tab that is organized similarly to the Descriptive Results tab. Whereas these checks are listed sequentially in the PDF report, the results of each plausibility check are now displayed side by side to allow easy comparison of the two databases under consideration. A sub-menu allows the user to switch between the two implemented subcategories of plausibility checks (Atemporal Plausibility checks and Uniqueness Plausibility checks), which are displayed as distinct screens (see Additional file 1: Fig. S5).
The Completeness Check screen presents the user with a tabular summary of absolute and relative counts of missing values per data element. Although these counts can be examined and compared for the source and target database, this view neither offers an automated comparison of the two databases nor highlights notable attributes. Instead, the automated evaluation of the comparison of absolute missing-value counts between a source and a target data set is presented on the main dashboard along with the “Completeness Checks (Validation)” (see right column “Check Missings” in Additional file 1: Fig. S3).
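The tabular summary can be sketched in a few lines of base R; the helper name and the example columns are illustrative only.

```r
# Minimal sketch of a completeness summary: absolute and relative counts of
# missing values per data element (column names are illustrative).
completeness_summary <- function(df) {
  data.frame(
    data_element = names(df),
    n = nrow(df),
    missing_abs = vapply(df, function(x) sum(is.na(x)), integer(1)),
    missing_rel = vapply(df, function(x) mean(is.na(x)), numeric(1)),
    row.names = NULL
  )
}

completeness_summary(data.frame(age = c(34, NA, 51), sex = c("f", "m", NA)))
```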
Besides the interactive presentation of the results in the GUI, the former PDF report can still be downloaded from the Reporting tab (see Additional file 1: Fig. S6). A list of all database IDs associated with “conspicuous” values, for example those that violate the value conformance or plausibility checks, as well as a summary of the check results presented on the dashboard, can also be downloaded here as CSV files. This information can be used to track and follow up on detected DQ irregularities directly in the databases.
To parametrize the DQA tool, a screen was designed that allows default values to be set during GUI deployment, which is helpful when setting up the GUI for long-term use within a fixed infrastructure environment. On a new Config page, users can select the databases to be tested (the information on available systems is taken from the MDR). If predefined connection parameters for the various databases were provided during the initial deployment of the tool, they are automatically inserted into the respective fields (see Additional file 1: Fig. S7), allowing users without technical knowledge to connect to the databases. Once all required parameters have been defined properly, a button is enabled from which the analysis can be triggered directly from the Config page.
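How such defaults might be provided at deployment time is sketched below, for example via an .Renviron file or the container environment; the variable names are illustrative placeholders rather than the exact identifiers read by the tool.

```r
# Hypothetical sketch of pre-seeding connection defaults at deployment time.
# The GUI would then pre-fill the corresponding fields on the Config page.
# Variable names are illustrative placeholders.
Sys.setenv(
  SOURCE_DB_HOST = "source-db.example.org",
  SOURCE_DB_PORT = "5432",
  SOURCE_DB_NAME = "fhir_gateway",
  TARGET_DB_HOST = "target-db.example.org",
  TARGET_DB_PORT = "5432",
  TARGET_DB_NAME = "omop_cdm"
)
```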
Finally, the Logfile tab displays all internal messages created during analysis and provides a full breakdown of the completed program steps (see Additional file 1: Fig. S8). During the iterative development of the interface, this was also a helpful source of information for troubleshooting software faults reported in the user feedback.
Runtime
One item of user feedback from the first evaluation round addressed the rather long runtimes of the DQA tool when analyzing large data sets. Three enhancements, outlined in detail in the following sections, addressed this aspect.
Selecting data elements
When DQAstats is used to perform a DQ analysis, the full set of data elements defined in the MDR for one database, or the intersection of the data elements defined for two databases, is examined. Sometimes, however, it is necessary to test only certain elements, e.g., only newly added data elements. To address this scenario, the configuration page of the GUI was extended with the option to select the desired data elements for a DQ analysis (see Additional file 1: Fig. S7), as sketched below. This restricts the analysis to the data elements of interest and thus reduces the tool’s overall runtime.
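A minimal sketch of this restriction, assuming the MDR is available as a table with a designation column; the file name, column names, and the dqa_run() entry point are hypothetical.

```r
# Illustrative sketch: restricting a DQ run to a subset of data elements by
# filtering the MDR before the analysis (names are hypothetical).
mdr <- read.csv("mdr.csv")                           # metadata repository export
selected <- c("age", "sex", "icd10_code")            # data elements of interest
mdr_subset <- mdr[mdr$designation %in% selected, ]   # keep only selected elements
# dqa_run(mdr = mdr_subset, source = "source_db", target = "target_db")
```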
Time constraint for testing real-time data sets
During the initial phase of the MIRACUM project, most extract, transform, and load (ETL) jobs extracted the data from a clinical source system and transferred it to a research database in one batch; all data was processed at once to make large data sets quickly available for analysis in the MIRACUM research data repositories. At that time, the MII-wide harmonization process to create a core data set and the accompanying MII FHIR profiles was still in its early stages. As these processes advanced, and since clinical routine data is dynamic and grows over time, the data-processing infrastructure was redesigned to take these developments into account. As a result, the former batch ETL jobs were re-implemented using Apache Kafka [29, 30] to support incremental data streams while ensuring compatibility with the MII FHIR profiles. This re-implementation makes the research infrastructure scalable, allowing it to handle real-time data generated in clinical practice and to provide researchers with the most up-to-date information available. To allow meaningful DQ checks on continuously growing databases, a new feature was added to the DQA tool that analyzes subsets of the databases based on time frames. The DQA tool is thus capable of examining subsets of research data repositories that are filled in real time, as well as evaluating the ETL processes that fill them. As a side effect, selecting a smaller time frame also reduces the runtime.
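Conceptually, the time-frame restriction filters each data set to records whose reference timestamp falls inside the selected interval before the DQ checks are run; the following sketch uses illustrative column and function names, not the DQAstats API.

```r
# Hypothetical sketch of a time-constrained analysis: keep only records whose
# reference timestamp falls inside the selected time frame.
restrict_time_frame <- function(df, ts_col, start, end) {
  df[df[[ts_col]] >= start & df[[ts_col]] <= end, , drop = FALSE]
}

encounters <- data.frame(
  encounter_id = 1:4,
  admission = as.Date(c("2021-01-10", "2021-06-02", "2022-03-15", "2022-11-30"))
)

# analyze only the 2022 subset of a continuously growing repository
restrict_time_frame(encounters, "admission",
                    start = as.Date("2022-01-01"), end = as.Date("2022-12-31"))
```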
Performance optimization
In the first version of the DQA tool, all DQ checks were processed sequentially for each data element. Suitable parts of the code were parallelized [31] to make better use of the available computing capacity and to reduce the analysis time.
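A common way to parallelize such per-element work in R is the future framework; the sketch below illustrates the idea, although the packages and granularity used in the actual implementation [31] may differ.

```r
# Illustrative sketch of per-data-element parallelization with the 'future'
# framework (not necessarily the packages used by the DQA tool).
library(future.apply)

plan(multisession, workers = 4)  # use up to four parallel R sessions

data_elements <- list(
  age = c(34, 51, NA, 28),
  sex = c("f", "m", "m", NA)
)

# run the (here trivial) per-element check concurrently instead of sequentially
results <- future_lapply(data_elements, function(x) sum(is.na(x)))

plan(sequential)  # reset to sequential processing
```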
Datamap
To centrally collect and report aggregated counts of selected data elements, a so-called datamap feature has been added.
If an item is marked in the MDR for inclusion in the datamap, it is displayed prominently on the dashboard after a DQ analysis. Once the analysis is complete, the partner sites can send the datamap to a predetermined recipient by clicking a button. Within the MIRACUM project, the datamap functionality was prototypically extended to send aggregated counts to a central database. These counts provide a publicly available, project-wide visualization of the availability of selected data elements across all sites [32] (see Fig. 5). The datamap is intended to give researchers an initial overview of the quantity of selected data elements available at all sites before requesting the data to investigate their research questions.
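The payload of such a transfer contains only aggregated counts; the following sketch shows a hypothetical structure and endpoint, not the actual MIRACUM interface.

```r
# Hypothetical sketch of a datamap payload sent from a site after an analysis.
# Structure, counts, and URL are illustrative only.
library(httr)

datamap <- data.frame(
  site = "site_a",
  data_element = c("Patient number", "Encounter number", "Laboratory values"),
  count = c(125000L, 480000L, 9200000L)
)

# POST the aggregated counts (no record-level data) to the central database
response <- POST(
  url = "https://datamap.example.org/api/upload",
  body = jsonlite::toJSON(datamap),
  content_type_json()
)
```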
User feedback evaluation
Each release of the DQA tool was deployed at all ten MIRACUM sites. The first feedback round (FR1) finished at the end of 2019. During this assessment, a total of 36 unique issues were reported. Each issue was subsequently allocated to one or more of four classes: 18 issues to the class LOGIC (problems caused by semantic or syntactic errors in the programming code), 3 issues to the class ETL (problems caused by discrepancies in the ETL processes that populate the systems under test rather than by faults in the DQA tool), 6 issues to the class MDR (problems caused by inconsistencies in the metadata of the analyzed data elements, which, for example, led to incorrectly permitted value ranges), and 9 issues to the class GUI (feedback on the graphical user interface of the DQA tool).
Furthermore, the feedback was prioritized based on its urgency and relevance (see Additional file 1: section “Feedback round 1 (FR1)”). The most critical concerns were addressed by April 2020 and made available to all sites in an interim release. Additional issues were continuously addressed and provided to the sites in mid-2021, followed by another project-wide feedback round (FR2).
Container-based application with Kubernetes support in MIRACUM
To provide the GUI version of the DQA tool to the MIRACUM partner sites and to allow seamless integration into their local DIC, Docker was used to simplify the deployment of the application across different environments [33]. A container image [34, 35] was developed that integrates well with the MIRACUM DIC infrastructure, similar to the deployment of the command-line-based version of the DQA tool [20]. Since some sites already use Kubernetes for container orchestration, a Kubernetes manifest [35, 36] was also provided. The manifest leverages Argo Workflows [37], an open-source container-native workflow engine for orchestrating jobs on Kubernetes, to run automated DQ checks on a regular schedule. During its prototype development, this setup also updated the MIRACUM datamap with the latest metrics. Furthermore, practical aspects such as container availability (scheduling, scaling, and inter-container communication) were addressed, enabling container orchestration in real time [33, 35, 36].
MIRACUM enhancements and customizations
DQAgui was developed as a generic GUI frontend built on top of DQAstats. Users without prior R programming knowledge can use it to analyze the data quality of databases and to compare different data sets. To customize the DQA tool to the specific requirements within MIRACUM, such as connecting it to the central M-MDR and providing the configurations for the MIRACUM research data repositories, the R package miRacumDQA had previously been developed [20]. During the development of the GUI, this package was extended to also establish the connection to the prototype of the MIRACUM datamap.
Like DQAstats, the GUI was designed in a generic manner so that it can be used for DQ checks independently of the project or data context. To demonstrate its applicability, the DQA tool includes synthetic data sets, and a demo instance is publicly available [38]. In the context of this paper and within the MIRACUM project, health data were analyzed.
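Assuming the exported launch_app() entry point of the DQAgui package, a local instance can be started as follows; pointing it at the bundled synthetic data may require additional configuration.

```r
# Starting the GUI locally (assuming DQAgui's exported launch_app() entry point).
install.packages("DQAgui")
DQAgui::launch_app()
```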