An ontology-based documentation of data discovery and integration process in cancer outcomes research

Background To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on systematically identifying RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, threatening transparency and reproducibility. Methods Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. We then developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and the data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility.
We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion Our ontology-based reporting guideline addresses several key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized, ontology-powered documentation of the data selection and integration processes, thus enabling the sharing of mIDA study reports among researchers.


Background
Cancer is a major disease burden worldwide [1]. As the second leading cause of death in the United States (US), cancer accounts for about 1 in 4 deaths [2]. For 2019, the American Cancer Society (ACS) estimated 1,762,450 new cancer cases and 606,880 cancer deaths in the US [2]. The lifetime probabilities of being diagnosed with cancer are 39.3% for males and 37.4% for females [3]. However, the risk factors (RFs) underlying these high cancer incidence and mortality rates are still not fully understood.
Over the past two decades, increasing efforts have been directed toward identifying and understanding cancer RFs using various methods, such as genome-wide association studies (GWAS) [4,5] or more recent machine learning-based approaches [6,7]. Nevertheless, emerging evidence suggests that it is the interaction among many risk factors, rather than any single cause, that affects the risk of cancer and cancer outcomes [8]. Further, the RFs involved span different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, there is not yet agreement in the cancer research community regarding how these multi-level cancer RFs interact with each other. To reach such an understanding, the first and most crucial step is to gain a comprehensive view of the potential multi-level RFs associated with various cancer outcomes, such as the stage at diagnosis (the most important prognostic factor) and survival.
Surveying existing research on RFs for late-stage cancer diagnosis and poor survival, we found that current studies of RFs for cancer outcomes are mostly single-level analyses of individual patient-level data. For instance, Andrew et al. assessed individual patient characteristics (e.g., age, gender, family history) and lifestyle factors (e.g., education, insurance, and socioeconomic status) to study their associations with late-stage colorectal cancer [9]. Similar individual-level RFs have also been reported for other major types of cancer such as breast and cervical cancers [10][11][12][13]. Further, prior studies of cancer RFs often analyzed data from only a single source, such as SEER [14], SEER-Medicare [15], or a state or hospital cancer registry [16]. Among these cancer risk factor studies, the complex interplay between RFs at different levels (e.g., county-level smoking rate vs. individual smoking behavior) is often ignored. These single-level RF analyses (1) lead to biased effect estimates of RFs due to potential confounding from omitted factors, and (2) omit critical cross-level RF interactions, such as race by residence, that could inform multi-level intervention design.
Advances in technology have created new ways to determine and measure disease risk factors across different levels (e.g., from advancements in genome sequencing for genetic markers to better sensors producing more accurate estimates of environmental pollutants). The availability of such abundant data online in electronic formats enables researchers to pool data on an unprecedented scale and offers a great opportunity to thoroughly examine multi-level RFs in a multi-level integrative data analysis (mIDA) so that confounding effects and cross-level interactions can be studied. However, researchers face significant barriers to doing so, especially because there is a lack of consensus and proper guidance to help them systematically identify and discover these variables from heterogeneous sources. In 2017, the National Institute on Minority Health and Health Disparities (NIMHD) of the National Institutes of Health (NIH) proposed a research framework [17], an extension of the well-known social ecological model [18], to help investigators systematically study health disparities. As recognized in the NIMHD framework, individuals are embedded within a larger social system and constrained by the physical environment in which they live. Within this framework, cancer outcomes are influenced by RFs from different levels (i.e., individual, interpersonal, community, and societal) and multiple domains (i.e., biological, behavioral, physical/built environment, sociocultural environment, and healthcare system). In this work, we adopted the NIMHD framework as the guiding theory for risk factor discovery and data source selection.
Further, mIDA for cancer outcomes research requires the integration of data from multiple sources. However, data integration processes in the very limited number of existing mIDA studies [19,20] are inconsistently performed and poorly documented, threatening transparency and reproducibility [21,22]. The data integration processes are often summarized in one or two sentences without explicit documentation of the steps. For example, Guo et al. explored the impact of the relationships among socioeconomic status, individual smoking status, and community-level smoking rate on pharyngeal cancer survival [20]. The multi-level risk factors above were obtained and integrated from three different data sources (i.e., the Florida Cancer Data System [FCDS], the U.S. Census, and the Behavioral Risk Factor Surveillance System [BRFSS]), as mentioned in the abstract. However, in the rest of the paper, there is no description of how the individual-level records from FCDS were linked with the county-level smoking rate from BRFSS and the census tract-level poverty rate from the U.S. Census. Even if the integration process is as simple as linking these multi-level variables through a geographic code (e.g., county code), it still needs to be standardized and explicitly documented to avoid ambiguity. For example, the paper stated that "regional smoking was measured as the average percentage of adult current smokers at the county level between 1996 and 2010", and readers might make an educated guess that the regional smoking rates were generated from the BRFSS data rather than the FCDS data; however, explicit documentation is needed because both BRFSS and FCDS contain individual smoking status. As another example, Keegan et al. explored whether breast cancer survival patterns are influenced by factors such as nativity (individual level) and neighborhood socioeconomic status (community level).
Similarly, they summarized the integration process in one sentence, stating that each patient was assigned a neighborhood socioeconomic status variable based on census block groups. However, details such as the variable names in each data source, or whether the original geographic variables required pre-processing (e.g., deriving census tracts from zip codes), are not clearly documented [19]. Explicit documentation of these variable selection and data integration processes will not only help readers better understand the study results and benefit other researchers who want to replicate the studies, but, more importantly, make it possible for machines to understand and replicate the steps (when the documentation is encoded in a computable format such as an ontology).
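To make the ambiguity concrete, the following minimal sketch shows one way the kind of cross-level linkage discussed above could be performed: a county-level rate is layered onto individual-level records via a shared county code. All field names, codes, and values are hypothetical, not drawn from FCDS or BRFSS.

```python
# Hypothetical sketch of a cross-level linkage step: attaching a
# county-level smoking rate to individual-level records through a
# shared county code. Field names and values are illustrative only.

def link_county_rate(patients, county_rates, code_field="county_fips"):
    """Attach a county-level variable to each individual record.

    patients     : list of dicts, each carrying a county code
    county_rates : dict mapping county code -> county smoking rate
    """
    linked = []
    for p in patients:
        rate = county_rates.get(p[code_field])  # None if county not covered
        linked.append({**p, "county_smoking_rate": rate})
    return linked

patients = [
    {"patient_id": 1, "smoker": True,  "county_fips": "12001"},
    {"patient_id": 2, "smoker": False, "county_fips": "12003"},
]
county_rates = {"12001": 0.21, "12003": 0.17}
result = link_county_rate(patients, county_rates)
```

Even this trivial join embodies choices (which code field, how uncovered counties are handled) that a report should state explicitly.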
Further, even though the mIDA studies above did not emphasize the need for data integration or integrated datasets, the fact that they could only investigate a handful of variables at a time indicates the lack of, yet clear need for, data integration support. Even studies on building frameworks or platforms to support or automate the data integration process (especially those related to creating integrated datasets to support cancer research) often ignored the need for documenting the integration steps to guarantee the transparency and reproducibility of their approaches. For example, semantic data integration (i.e., connecting variables across different databases at the semantic level by mapping them to standardized concepts in a global schema, often a global ontology) has been proposed in recent years to support generating integrated datasets for cancer research [23][24][25]. However, none of these studies mentioned the need for standardizing and documenting their integration steps; most did not even discuss the rationale for selecting the specific data sources to integrate. Nevertheless, when reporting mIDA studies, it is critical to document the steps that were followed to select, integrate, and process the data so that others can repeat the same steps and reproduce the findings.
To address the challenges above, in this paper we first developed a reporting guideline to guide and document the RF variable selection, data source selection, and data integration process. The guideline is informed by (1) the NIMHD research framework, which provides guidance and promotes structured thinking on identifying multi-level cancer RFs; and (2) a review of existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network [26]. Then, we proposed an ontology-based approach to annotate and document the RF selection and data integration process in mIDA studies based on the reporting guideline we developed. To do so, we developed the Ontology for the Documentation of vAriable selecTion and daTa sourcE Selection and inTegration process (OD-ATTEST), so that the RF selection and data integration report can be (1) explicitly modeled with a shared, controlled vocabulary, (2) understandable to humans and computable by machines, and (3) adaptive to changes when the reporting process is refined.
In our prior work [27], we proposed a preliminary reporting guideline for RF variable and data source selection based on our own experience of pooling multi-level RFs from different data sources to support mIDAs of cancer survival [28,29]. In this extended journal paper, we significantly expanded our ontology-based reporting guideline, ATTEST (vAriable selecTion and daTa sourcE Selection and inTegration): we (1) conducted a systematic search of existing reporting guidelines from the EQUATOR network to extract reporting elements relevant to variable selection and data integration; (2) updated our reporting guideline based on the results of the systematic review to include new items regarding data integration (e.g., data processing, data integration strategy, and data validation) as well as variable and data source selection; (3) completed building OD-ATTEST following best practices in ontology development to provide a formal representation of the reporting guideline with standardized and controlled vocabularies; and (4) provided an OD-ATTEST-annotated report generated from a prior mIDA study to represent the annotated items and their relationships in the reporting guideline.

Methods
Development of a reporting guideline for risk factor selection, data source selection, and data integration
To develop the reporting guideline, we started by summarizing our previous studies, in which we assessed the effect of data integration on the predictive ability of cancer survival models [28] and created a semantic data integration framework to pool multi-level RFs from heterogeneous data sources to support mIDA [29]. In those studies, we went through the process of RF selection, data source selection, and data integration. To ensure the reproducibility of these studies, a number of intermediate steps need to be documented, as detailed in our previous paper [27]. For example, both rural-urban commuting area (RUCA) codes [30] and the National Center for Health Statistics (NCHS) urban-rural classification scheme [31] are often used to represent a geographic area's rurality status. The difference between the two lies in the classification granularity: RUCA classifies U.S. census tracts (with ten levels from rural to metropolitan), while the NCHS urban-rural classification scheme classifies U.S. counties (a hierarchical definition with six levels). Thus, we need to clearly document which rural definition is used in the data analysis, since different representations of the same variable (i.e., rurality in this case) have different impacts on model results, as shown in our prior work [28]. Further, before integrating RFs from various data sources that are at different levels (e.g., census tract level vs. county level) and cover different time periods, we had to assume that area-level characteristics (e.g., social vulnerability index) derived from 2000 U.S. Census data were applicable across different time periods (as our individual-level data from FCDS covered 1996 to 2010).
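As an illustration of why the chosen rurality definition must be documented, the sketch below collapses the six-level NCHS county codes into a binary rural/urban flag. The cut point used here (codes 5 and 6 treated as rural) is an assumption for illustration, not part of the NCHS scheme; it is exactly the kind of analytic choice that should be reported.

```python
# Illustrative sketch: the same concept (rurality) under two different
# representations. The binarization cut point is a study-level choice
# (assumed here to be codes 5-6 = rural) that must be documented.

NCHS_LEVELS = {
    1: "large central metro", 2: "large fringe metro", 3: "medium metro",
    4: "small metro", 5: "micropolitan", 6: "noncore",
}

def nchs_to_binary(code, rural_codes=(5, 6)):
    """Collapse a six-level NCHS urban-rural code to a binary flag."""
    if code not in NCHS_LEVELS:
        raise ValueError(f"unknown NCHS code: {code}")
    return "rural" if code in rural_codes else "urban"
```

A study using RUCA tract-level codes instead would need an analogous, separately documented mapping, since the two schemes classify different geographic units.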
The experiences above suggest that we must document such data integration nuances so that other researchers can repeat our data integration and data processing pipeline and reproduce the same results (e.g., the integrated dataset). In sum, three key items need to be documented: (1) RF selection (e.g., individual- vs. county-level variables), (2) data source selection (e.g., individual-level data from FCDS and contextual-level data from the U.S. Census), and (3) data integration and data pre-processing strategies.
Through discussions with expert biostatisticians, data analysts, and cancer outcomes researchers, we summarized the typical mIDA process and found that there is little structured thinking when investigators select and identify risk factors and their data sources. We thus propose using the NIMHD research framework to provide theory-driven guidance for multi-level and multi-domain RF and data source selection. The NIMHD framework was originally designed to depict a wide range of health determinants (i.e., RFs from different levels and domains) relevant to understanding and addressing minority health and health disparities. The goal of using the NIMHD framework is to help investigators structurally and comprehensively identify relevant RFs and corresponding data sources in their mIDA studies.
To build upon established reporting guidelines, we searched and identified relevant reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network, a comprehensive searchable database of guidelines for health research reporting. The EQUATOR network categorizes health research into 13 study types (e.g., quantitative studies, experimental studies, and observational studies), of which reporting guidelines for observational studies are the most relevant to our mIDA use case. To further identify relevant reporting guidelines in EQUATOR, we developed a set of screening criteria to determine whether a reporting guideline contains information that can be used to improve our ATTEST reporting guideline: (1) the reporting guideline is designed for secondary data analysis studies; (2) the reporting guideline contains at least one of the following sections: data, outcomes (variables), and methods, as these sections will contain information related to variable selection, data source selection, and data integration methods; (3) the reported data within the guideline must be health related; and (4) the use of the guideline (or at least part of it) can be extended to cancer outcomes research, especially the parts related to variable selection, data source selection, and data integration.
We reviewed all reporting guidelines designed for observational studies and eliminated guidelines that do not involve the tasks of RF and data source selection and integration. We then identified all reporting guidelines that contain the following sections: data, outcomes (variables), and methods. For those that do not have sections clearly marked, we manually reviewed the entire reporting guideline to identify whether they discussed one of the three aspects. We then extracted reporting items in the selected reporting guidelines that are relevant to RF selection, data source selection, and data integration.
Two reviewers (HZ and JB) independently extracted these reporting items of interest and resolved conflicts with a third reviewer (YG). We further analyzed the extracted reporting items and discussed them with experts (i.e., biostatisticians, data analysts, and cancer outcomes researchers) to summarize the items needed in our reporting guideline, especially those related to the data integration process.
Construction of an ontology for the documentation of variable and data source selection and integration process (OD-ATTEST)
The ATTEST reporting guideline we developed guides the variable and data source selection and integration process in cancer outcomes research. We propose an ontology-based approach to annotate and document the items in the reporting guideline. The goal of the OD-ATTEST ontology is to standardize the terminology used in documenting the selection and integration steps of RF variables and data sources to support mIDA.
OD-ATTEST was developed using Protégé 5. We used the Basic Formal Ontology (BFO) [32] as the upper-level ontology. We first adopted a top-down approach to enumerate important entities (classes and relations) based on the reporting guideline we developed. Following best practices, we reviewed existing widely accepted ontologies using the National Center for Biomedical Ontology (NCBO) BioPortal [33] to find entities that can be reused in OD-ATTEST. Then, we started with the definitions of the most general concepts in the domain and subsequently specialized them to develop the class hierarchy. We also took a bottom-up approach, in which we started with the definitions of the most specific classes and then grouped similar classes into more general ones. For example, we started by identifying the most specific classes (i.e., the leaf nodes in the ontology hierarchy) for "median", "maximum value", "minimum value", and "percentile", and then created a common superclass for them named "descriptive statistic". We also examined how these reporting items are associated with each other (e.g., "sample size" is determined by "primary outcome") and determined what additional classes and relations were needed to fully represent these entities in OD-ATTEST.
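The bottom-up grouping step can be sketched as a simple parent map over class labels. The leaf labels follow the example above; the upper class "data item" is a hypothetical placeholder, not necessarily the actual OD-ATTEST superclass.

```python
# Minimal sketch of the bottom-up grouping described above: specific
# leaf classes share a common superclass, and the resulting parent map
# can be walked to list a class's ancestors. Structure is illustrative;
# "data item" is a hypothetical upper class.

parent = {
    "median": "descriptive statistic",
    "maximum value": "descriptive statistic",
    "minimum value": "descriptive statistic",
    "percentile": "descriptive statistic",
    "descriptive statistic": "data item",
}

def ancestors(cls):
    """Walk the parent map from a class up to the hierarchy root."""
    chain = []
    while cls in parent:
        cls = parent[cls]
        chain.append(cls)
    return chain
```

In Protégé, the same step corresponds to asserting each leaf as a subclass of the new superclass; the map here only mimics that asserted hierarchy.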
An OD-ATTEST-annotated report generated from a mIDA case study following the reporting guideline
To test the developed ATTEST reporting guideline and the OD-ATTEST ontology, we first created an ATTEST report based on our previous mIDA case study, in which we explored the impact of the relationships among socioeconomic status, individual smoking status, and community-level smoking rate on pharyngeal cancer survival [20]. To annotate the ATTEST report using OD-ATTEST, we used the following annotation process: (1) identify information related to the reporting items in ATTEST by reviewing the original publication and supplementary materials; (2) annotate the information using the entities in OD-ATTEST; and (3) transform the annotation results into semantic triples in Resource Description Framework (RDF) format using Turtle syntax [34].
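Step 3 of this annotation process can be sketched as follows. The prefix IRI, class names, and property names below are placeholders, not the actual OD-ATTEST identifiers.

```python
# Rough sketch of transforming annotation results into RDF triples
# emitted in Turtle syntax. The "odattest" prefix, classes, and
# properties are hypothetical placeholders, not real OD-ATTEST terms.

PREFIXES = {"odattest": "http://example.org/od-attest#"}

def to_turtle(triples):
    """Serialize (subject, predicate, object) triples into a Turtle string."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    lines += [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(lines)

triples = [
    ("odattest:smoking_status", "a", "odattest:RiskFactorVariable"),
    ("odattest:smoking_status", "odattest:hasDataSource", "odattest:FCDS"),
]
turtle_doc = to_turtle(triples)
```

A real pipeline would typically use an RDF library rather than string templating, but the shape of the output (prefix declarations followed by subject-predicate-object statements) is the same.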

Results
The ATTEST reporting guideline for RF variable and data source selection and data integration
We extended our preliminary reporting guideline [27] through a review of existing relevant reporting guidelines published in the EQUATOR network. Fig. 1 shows our review process. We reviewed 94 reporting guidelines designed for observational studies in the EQUATOR network. Out of the 94 reporting guidelines, 30 contain the required data, outcomes (variables), and method sections, and these were retained for data extraction. In the data extraction step, for each reporting guideline, we extracted items relevant to RF and data source selection and integration: the data and outcomes (variables) sections often contain information about how RF variables and data sources are selected, while the method section contains information about how data are processed and integrated.
We categorized these reporting guidelines (Table 1) based on the domains and levels of the data sources reported in the guidelines and mapped them to the NIMHD framework. As shown in Table 1, these reporting guidelines cover data sources from all domains and levels of influence. Among them, 9 guidelines provide general reporting guidance for observational studies without specifying a domain of influence, while the rest are designed for specific domains. For example, the Genetic RIsk Prediction Studies (GRIPS) statement [45] is designed for risk prediction studies using genetic data. Furthermore, most guidelines considered data sources only at the individual level, while 2 of them considered data sources from multiple levels. For example, the Checklist for One Health Epidemiological Reporting of Evidence (COHERE) [55] considered both individual and environmental risk factors when studying a disease.
In our preliminary reporting guideline [27], we focused only on reporting items relevant to RF variable and data source selection. In this review, we extracted items that can be used to improve our initial reporting guideline, with a focus on documenting the data integration process. In total, three reporting guidelines [57][58][59] were found to contain information about data integration processes. However, the items included in these 3 guidelines focus on data linkage and do not contain enough detail about how to resolve the heterogeneities of data from different sources. For example, when integrating variables across different levels (e.g., combining individual-level patient data and a county-level smoking rate), none of the 3 guidelines have items for documenting cross-level integration choices (e.g., layering the county-level smoking rate onto individuals based on their county of residence), even though this type of choice is frequently encountered in mIDA studies. Further, data processing steps such as the choices and algorithms used for creating new data elements (e.g., computing a body mass index variable from two separate variables, weight and height) are not documented in existing reporting guidelines. Therefore, we further extended ATTEST to include these important data integration and data processing procedures based on our previous research experience in building a data integration framework [29].

Table 1 Summary of reporting guidelines based on the data source domains and levels guided by the NIMHD framework

Domain of influence | Level of influence | Guidelines
Not specified a | Individual level | [35][36][37][38][39][40][41][42][43]
Not specified a | Societal level | [44]
Biological data: genetics data | Individual level | [45,46]
Biological data: immunogenomic data | Individual level | [47]
Biological data: molecular epidemiological data | Individual level | [48,49]
Biological data: drug safety data from biologics registers | Individual level | [50,51]
Behavioral data: crime, violence data | Individual level | [52]
Behavioral data: dietary or nutritional data | Individual level | [53]
Behavioral data: medication adherence | Individual level | [54]
Sociocultural environment
a When reporting data sources or RF variables, these studies did not specify a specific data domain

Informed by the NIMHD research framework and consistent with our prior work, the ATTEST reporting guideline consists of two main parts, as shown in Fig. 2: reporting (1) the objective of the study, including explaining the background and rationale for designing the study in one or two sentences and describing the hypothesis of the study; and (2) the study design for the variable and data source selection processes, describing the data along with the data integration and processing strategies. The variable and data source selection process consists of five key steps: (1) define the outcome variables for the primary and (if necessary) secondary outcomes; (2) for each outcome variable, follow an iterative process (see Fig. 2a) to determine the data sources according to the NIMHD framework; after selecting each outcome variable and its data sources, investigators need to consider how to select or consolidate similar outcome variables from the different selected data sources.
For example, if the outcome of interest is an individual's lung cancer risk, we would first identify potential data sources (e.g., cancer registries or electronic health records [EHRs]) that contain individual-level patient data where lung cancer incidence data are available. Then, based on the cohort criteria and other information, such as the required sample size and data range (e.g., time coverage and geographic information) of the potential data sources, the investigator can determine the qualified data sources and choose an adequate one based on the objective and design of the study. For example, if two data sources, a cancer registry and an EHR system, are both available and contain individual-level lung cancer incidence data, the investigator can (1) choose one data source over the other, or (2) link the two data sources and integrate variables from both. If the investigator chooses to link and integrate the two data sources, she needs to explicitly document the linkage and integration processes for each of the variables, as shown in Fig. 2 (Report - Variables - E, F, G, H), so that others can repeat the processes and generate the same analytical dataset; (3) determine the individual-level predictors and covariates of the study; (4) for each individual-level predictor or covariate, follow loop B in Fig. 2 to identify the different levels/domains of predictors or covariates according to the NIMHD framework. As with the outcome variables, different data sources may contain the same predictor or covariate variable; thus, it is important to contrast and consolidate each new predictor or covariate with the existing selected predictors and covariates to resolve duplicates.
If an investigator chooses among the "duplicate" variables (e.g., choosing smoking status from cancer registry data over EHRs because cancer registry data are manually abstracted and typically have better data quality than raw EHRs), these data integration choices also need to be explicitly documented. Nevertheless, this is often a difficult choice, and the "duplicate" variables might all need to be tested in models before a selection can be made. Regardless, these decisions and data processing steps need to be clearly documented; and (5) after selecting individual-level predictors and covariates, one can use a similar process, following loop C in Fig. 2, to identify additional contextual-level predictors and covariates and data sources of interest. In the end, a report of the selected data and data sources, as well as the data integration processes, is generated as shown in Fig. 2. The corresponding ATTEST reporting guideline checklist is shown in Table 2.
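As a concrete example of the kind of derived-variable step the guideline asks investigators to document, the sketch below computes a body mass index variable from two separately sourced variables. The units (kg, m) and the rounding rule are assumptions that would themselves need to be stated in the report.

```python
# Hedged sketch of a documented derived-variable computation: body
# mass index from weight and height. Units and rounding are assumed
# choices that a report following the guideline would make explicit.

def derive_bmi(weight_kg, height_m):
    """BMI = weight (kg) / height (m) squared, rounded to one decimal."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return round(weight_kg / (height_m ** 2), 1)
```

Two studies that both "computed BMI" can still produce different analytical datasets if one rounds and the other does not, which is precisely why such steps belong in the report.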

Development of the OD-ATTEST ontology
Based on the ATTEST reporting protocol above, we identified 48 classes and 25 properties needed in OD-ATTEST to represent the ATTEST reporting guideline. Fig. 3 shows the class hierarchy of OD-ATTEST. We reused classes from the following well-known existing ontologies: the Ontology for Biomedical Investigations (OBI), the Information Artifact Ontology (IAO), the National Cancer Institute Thesaurus (NCIt), the Statistics Ontology (STATO), and the Semanticscience Integrated Ontology (SIO), as shown in Table 3. Note that very few existing ontologies are designed for the purpose of documenting the variable and data source selection and data integration process. The limited number of properties in these existing ontologies is insufficient to represent the elements in the reporting guideline and their relationships, requiring us to create a large number of new properties in OD-ATTEST.
An OD-ATTEST-annotated report generated from a mIDA case study following the reporting guideline
We annotated two of our previously published mIDA case studies: (1) a study that explored the impact of the relationships among socioeconomic status, individual smoking status, and community-level smoking rate on pharyngeal cancer survival [20], and (2) a study that created a semantic data integration framework to pool multi-level RFs from heterogeneous data sources to support mIDA [29]. Table 4 shows the completed ATTEST checklist for the two studies. Fig. 4 shows a snippet of the ontology-annotated variable and data source selection and integration process for the second study [29], while the corresponding semantic triples in RDF format using Turtle syntax are shown in Table 5. The reported items (RF variables, data sources, and data integration steps) and their relationships are explicitly standardized and modeled using the classes and properties from OD-ATTEST.

Discussion
In this study, we first developed a reporting guideline, ATTEST, to provide a theory-driven approach to guide the RF variable and data source selection and integration process in cancer outcomes research. We then proposed an ontology-based approach to annotate the items in our reporting guideline so that information relevant to variables, data sources, and data integration in mIDA studies can be explicitly documented. To develop the reporting guideline, we conducted a systematic search to identify useful reporting items to improve our selection and data integration process. We categorized the identified reporting guidelines based on the data source domains and levels they report, according to the NIMHD framework, so that we could identify items that need to be reported when selecting variables or data sources from different domains and levels. For example, when reporting population-level estimates (variables) [44], information regarding the sources of bias (e.g., selection bias) needs to be documented. Therefore, we updated our previous reporting guideline and added "sources of bias" as a reporting item when documenting data sources. This is important because subsequent data processing steps might be needed to correct the bias. Further, the use of the NIMHD framework can also help researchers systematically structure the variable and data source selection process when considering multi-level RF variables from heterogeneous data sources.
For example, if an investigator is considering smoking-related risk factors in cancer outcomes research, then, following the NIMHD framework, one can start with variables in the behavioral domain and list potential smoking-related variables for each level of influence step by step: individual smoking status at the individual level, secondhand smoke exposure at the interpersonal level, county-level smoking rate at the community level, and smoking policies or laws (e.g., the federal minimum age to purchase tobacco products) at the societal level. The same process can be applied to select other smoking-related variables from other domains of influence. In this way, investigators can systematically consider and evaluate the confounding effects and cross-level interactions among the selected variables, which are usually ignored in previous cancer outcomes studies that use a single data source.
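The level-by-level enumeration described above can be sketched as a simple nested mapping from framework cells (domain × level) to candidate variables. The variable names below are the examples from the text; the data structure and function are illustrative, not part of ATTEST itself:

```python
# Sketch: organizing candidate smoking-related risk factors by the NIMHD
# framework's domains and levels of influence. The entries reproduce the
# examples given in the text; this is not an exhaustive variable list.
candidate_rfs = {
    "behavioral": {
        "individual":    ["individual smoking status"],
        "interpersonal": ["secondhand smoke exposure"],
        "community":     ["county-level smoking rate"],
        "societal":      ["federal minimum age to purchase tobacco products"],
    },
}

def variables_at(domain, level):
    """Return the candidate variables listed for one cell of the framework."""
    return candidate_rfs.get(domain, {}).get(level, [])

print(variables_at("behavioral", "community"))  # ['county-level smoking rate']
```

Walking the framework cell by cell in this way makes gaps visible: an empty cell signals a level or domain for which no variable (or data source) has yet been considered.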
We provided an ATTEST checklist (1) to help researchers clearly document each step of their RF and data source selection and integration process, and (2) to improve the completeness and transparency of their mIDA studies. As shown in Table 4, we used the ATTEST checklist to report two previous mIDA studies. Based on the checklist, we can easily (1) check whether these mIDA studies document the required items that would help other researchers replicate them, and (2) compare their variables, data sources, and data integration processes. As shown in Table 4, three items were never discussed in either of the two studies: "sources of bias", "missing data" for the selected variables, and "data cleaning" (i.e., the methods used to detect and correct or remove incorrect records, missing values, or outliers). All three items relate to data quality issues, which are rarely discussed or documented in mIDA studies, or even more broadly in cancer outcomes research. Nevertheless, data quality issues such as missing data can dramatically affect the results of cancer outcomes research (e.g., in cancer survival prediction) [64].
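The completeness check described above is mechanical once a report is expressed against the checklist. The sketch below illustrates the idea; the three item names come from the comparison in the text, while the report contents and function name are hypothetical:

```python
# Sketch: flagging ATTEST checklist items a study report leaves undocumented.
# The three required item names come from the text; the example report is
# hypothetical.
def undocumented_items(report, required_items):
    """Return required checklist items that are missing or empty in a report."""
    return [item for item in required_items if not report.get(item)]

required = ["sources of bias", "missing data", "data cleaning"]
study_report = {"missing data": "", "data cleaning": None}  # nothing filled in

print(undocumented_items(study_report, required))
```

Running such a check over each study's filled checklist yields exactly the kind of gap report summarized in Table 4.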
Comparing the two mIDA case studies, the data integration process was not well documented in the first study [20], where most of the checklist items relevant to data integration are blank; in the second study [29], the processes of data processing, data integration, and data validation were all clearly documented according to the ATTEST checklist. Therefore, using this checklist, one can improve the completeness of the documentation of the selection and integration process, as shown in Table 4. The OD-ATTEST ontology provides a way to standardize the documentation of the mIDA study process from variable and data source selection to data integration. Also, the ontology-based annotation of the report is beneficial because it is an initial step towards a report that is not only readable and understandable by humans but also potentially executable by machines. After transforming these annotations into semantic triples, the report can be stored in a knowledge base and represented as a knowledge graph (Fig. 4) to facilitate the examination and analysis of mIDA reports, enabling robust sharing and comparison of different mIDA studies.
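Such cross-study comparison reduces to pattern matching over the stored triples. The toy graph below is hypothetical (the study identifiers and property name are placeholders), but it shows how a wildcard match can surface, for example, the variables two studies share:

```python
# Sketch: comparing two mIDA study reports stored as triples in a knowledge
# base. The graph contents and the "usesVariable" property are hypothetical
# placeholders, not OD-ATTEST terms.
def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

graph = [
    ("study1", "usesVariable", "IndividualSmokingStatus"),
    ("study1", "usesVariable", "CountySmokingRate"),
    ("study2", "usesVariable", "CountySmokingRate"),
]

# Variables shared by both studies:
shared = {o for _, _, o in match(graph, s="study1", p="usesVariable")} & \
         {o for _, _, o in match(graph, s="study2", p="usesVariable")}
print(shared)  # {'CountySmokingRate'}
```

In practice the same query would be written in SPARQL against a triple store rather than in plain Python, but the matching logic is the same.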

Limitations and future work
Most of the reporting guidelines we reviewed from the EQUATOR network have limited information on how to document the data integration process, indicating a significant gap in existing practice. Nevertheless, we were able to summarize the key elements that need to be reported for the integration process based on three existing guidelines and our own previous experience with semantic data integration case studies. As a future study, one should conduct a systematic review of the data integration literature to summarize relevant reporting items and improve the reporting guideline. Meanwhile, we will conduct a yearly review of existing reporting guidelines following the review process described in Fig. 1 to identify new reporting items of interest and keep our framework up to date.
Further, beyond standardized reporting, our ultimate goal is to let computers understand the ontology-annotated report (in RDF triples) in terms of (1) how different variables are defined and represented and (2) how different variables are selected and integrated, so that machines can automatically repeat these processes and generate an integrated dataset from an executable ontology-annotated report. For variable definition and representation, it is important to recognize and be interoperable with existing data standards and common data models (CDMs), such as those that standardize the exchange of EHR data, including the national Patient-Centered Clinical Research Network (PCORnet) CDM, the Observational Medical Outcomes Partnership (OMOP) CDM from the Observational Health Data Sciences and Informatics (OHDSI) network, and the emerging Fast Healthcare Interoperability Resources (FHIR) standard adopted by major EHR system vendors. Developing the ontology against these CDMs, which have already standardized existing data resources, would be critical to ensure the generalizability of our framework. Nevertheless, for modeling the variable selection and integration processes as shown in Fig. 4, more fine-grained information regarding the variables, data sources, and the integration process is currently documented as free-text descriptions. We face challenges in transforming this free-text information into executable algorithms (e.g., a data processing step that calculates BMI using weight and height). Such information is related to the concept of data provenance: "a type of metadata, concerned with the history of data, its origin and changes made to it" [65]. The importance of data provenance is widely recognized, especially for study reproducibility and replicability. More than half of the systematic efforts to reproduce computational results across different fields have failed, mainly due to insufficient detail on digital artifacts such as data, code, and computational workflows [66]. However, descriptions of data provenance are often neglected or inadequate in the scientific literature due to the lack of a tractable, easy-to-operate approach with supporting tools. Future studies are urgently needed to develop easy-to-use tools with a standardized framework to persist end-to-end data provenance with high integrity, including intermediate processes and data products. Further, future development of tools and platforms to automate the documentation process, in which the data elements and associated information (e.g., levels and domains) are also automatically annotated with the standardized ontology, is warranted.
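To make the provenance idea concrete, the BMI example mentioned above can be sketched as a processing step that emits a provenance record alongside its result. The metadata fields below are illustrative and do not follow any particular provenance standard (e.g., PROV-O):

```python
# Sketch: capturing provenance for a single derived variable (the BMI
# calculation mentioned in the text). The provenance fields are illustrative,
# not a standard provenance vocabulary.
import datetime

def compute_bmi_with_provenance(weight_kg, height_m):
    """Compute BMI and return it with a provenance record of the step."""
    bmi = weight_kg / (height_m ** 2)
    provenance = {
        "step": "derive BMI",
        "inputs": {"weight_kg": weight_kg, "height_m": height_m},
        "formula": "weight_kg / height_m ** 2",
        "executed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return bmi, provenance

bmi, prov = compute_bmi_with_provenance(70.0, 1.75)
print(round(bmi, 1))  # 22.9
```

Persisting such records for every intermediate step, and linking them into the ontology-annotated report, is the kind of end-to-end provenance capture the envisioned tooling would need to automate.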

Conclusions
In this paper, we proposed and developed an ontology-based reporting guideline that addresses some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection, and (2) standardized, ontology-powered documentation of the data selection and integration processes, thereby enabling the sharing of mIDA study reports among researchers.