As the second leading cause of death, cancer is responsible for one in every four deaths in the United States. In 2017, the American Cancer Society estimated approximately 1.68 million new cancer cases and 600 thousand cancer deaths in the US. When first diagnosed with cancer, patients ask about their prognosis, whether their cancer is relatively easy or more difficult to treat, and the likelihood of survival. Survival varies widely across cancer types, stages, age groups, races/ethnicities, genders, and many other factors. For example, among some of the frequently diagnosed cancers, including lung, colorectal, breast, and prostate cancers, the 5-year overall survival rates are 18.3, 64.9, 89.7, and 98.3%, respectively. These rates are much worse when the cancer has metastasized: 4.5, 13.9, 26.9, and 29.8% for the same cancer types, respectively.
To improve cancer survival rates and prognosis, one of the first steps is to improve our understanding of the contributory factors associated with cancer survival. Prior research such as the National Institute on Minority Health and Health Disparities (NIMHD) Research Framework and the social-ecological model recognizes that individuals are embedded within the larger social system and constrained by the physical environment they live in. Thus, the determinants of individuals’ health span different domains of influence (i.e., biological, behavioral, physical/built environment, sociocultural environment, and healthcare system) as well as different levels of influence (i.e., individual, interpersonal, community, and societal). Within these frameworks, cancer survival is influenced by multiple factors from multiple levels and multiple domains. At the individual level, cancer survival is influenced not only by cancer stage at diagnosis, treatment, demographics, and financial status, but also by risky health behaviors such as smoking, alcohol drinking, and physical inactivity. For example, cigarette smoking is by far the most important risk factor for lung cancer; 80% of lung cancer deaths in the US were caused by smoking. Beyond individual-level factors, occupational or environmental exposures to secondhand smoke, air pollution, radiation, and some organic chemicals are also significant risk factors. Further, at the contextual level, cancer survival is influenced by public policies that shape health care delivery, which could affect patients’ travel distance to the treatment facility.
Prior epidemiologic research on cancer survival in the US, however, has primarily focused on contributory factors at the individual level because of limited data availability. Very few studies have explored contextual factors, and certainly no study has explored all possible factors together. Most of these analyses used data from a single source, such as a hospital (e.g., electronic health records, EHRs), a cancer registry (e.g., the Surveillance, Epidemiology, and End Results, SEER, registry), or an administrative claims system (e.g., data from the Centers for Medicare and Medicaid Services, CMS) [7,8,9,10]. SEER is an extremely popular data source for studying cancer survival [8,9,10]. However, it is important to pool heterogeneous data sets with variables beyond the individual level for integrative data analysis (IDA) that simultaneously examines as many cancer survival predictors as possible (i.e., a top-down approach to model building), so that confounding effects and interactions among predictors can be fully understood. For example, the linked SEER-Medicare data give a more complete picture of cancer patients beyond their cancer status, adding other clinical characteristics such as comorbidity as well as healthcare utilization patterns [11,12,13,14]. Nonetheless, the ability to integrate risk factors from more domains and levels, using other data sources such as community socioeconomic status from US Census data and community smoking rates from the Behavioral Risk Factor Surveillance System (BRFSS), will further advance our understanding of the determinants of cancer survival.
Nevertheless, researchers face key challenges when integrating data from different sources. Data integration is a daunting task because data from different sources can be heterogeneous in syntax (e.g., file formats, access protocols), schema (e.g., data structures), and semantics (e.g., meanings or interpretations). The effort required to connect different sources is substantial because variables, measures, and constructs often lack clear definitions (i.e., data semantics). Many traditional data integration techniques have been used on a large scale in biomedical research [15,16,17], such as rule-based links (i.e., linking variables from different data sources directly based on their names and definitions), data warehouses (i.e., creating a new system that stores a copy of the data from different data sources and manages the data separately from the original data systems), ad hoc query optimizers (i.e., rephrasing a user’s query into multiple subqueries according to the structures of the individual distributed databases), and federated middleware frameworks (i.e., linking multiple applications and user interfaces to multiple data sources and acting as the overarching facade across multiple applications). However, none of these traditional methods considers semantic knowledge, that is, integrating information based on the meaning of the data elements. For example, they do not address how to distinguish synonyms, homonyms, and related terms (e.g., different representations of the same disease using different coding standards) across different data sources. Therefore, adopting a semantic data integration approach, we propose to generate a universal conceptual representation of “information” to bridge the data heterogeneities across different sources. The “information” includes not only data elements but also their relationships, represented via “ontologies”.
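To make the synonym problem concrete, the following minimal sketch contrasts a literal (rule-based) comparison of diagnosis codes with a comparison made after mapping each code to a shared concept. The record layouts, field names, and the concept label `concept:lung_cancer` are hypothetical and chosen only for illustration; the two codes are the ICD-9-CM and ICD-10-CM codes for unspecified lung cancer. This is not the mapping mechanism used in this work, only an illustration of the idea.

```python
# Hypothetical records from two sources that encode the same diagnosis
# with different coding standards and different field names.
registry_record = {"patient_id": "p1", "dx_code": "162.9", "coding_system": "ICD-9-CM"}
claims_record = {"patient_id": "p1", "diagnosis": "C34.90", "coding_system": "ICD-10-CM"}

# Assumed lookup from (coding system, code) to a shared concept label.
# In an ontology-based approach this role is played by ontology classes.
CONCEPT_MAP = {
    ("ICD-9-CM", "162.9"): "concept:lung_cancer",
    ("ICD-10-CM", "C34.90"): "concept:lung_cancer",
}


def to_concept(coding_system, code):
    """Return the shared concept for a source code, or None if unmapped."""
    return CONCEPT_MAP.get((coding_system, code))


def same_diagnosis(system_a, code_a, system_b, code_b):
    """Two codes match semantically if both map to the same known concept."""
    concept_a = to_concept(system_a, code_a)
    concept_b = to_concept(system_b, code_b)
    return concept_a is not None and concept_a == concept_b


# A literal comparison of the raw codes misses the match.
literal_match = registry_record["dx_code"] == claims_record["diagnosis"]

# A comparison at the concept level finds it.
semantic_match = same_diagnosis(
    registry_record["coding_system"], registry_record["dx_code"],
    claims_record["coding_system"], claims_record["diagnosis"],
)

print(literal_match, semantic_match)
```

Unmapped codes return `None` and never match, so a gap in the mapping shows up as a missed link, not as a false one.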
An ontology is a computational representation of a domain of knowledge based upon a controlled, standardized vocabulary for describing entities and the semantic relationships between them [18,19,20,21]. The use of ontologies can facilitate data integration in many ways, including metadata representation, automatic data verification, global conceptualization, and support for high-level semantic queries, and it extends beyond traditional approaches that use common data elements (CDEs) and common data models (CDMs) [22,23,24], especially in the biomedical domain [15, 25].
Marenco et al. developed a Query Integrator System (QIS) to address robust data integration from heterogeneous data sources in the biosciences in 2004. An ontology server was used in QIS to map data sources’ metadata to the concepts in standard vocabularies. Cheung et al. developed a prototype web application called YeastHub based on a Resource Description Framework (RDF) database to support the integration of different types of yeast genome data from different sources in 2005. Lam et al. used the Web Ontology Language (OWL) to integrate two heterogeneous neuroscience databases in 2005. In a follow-up study, Lam et al. designed AlzPharm, which used RDF and its extension vocabulary, RDF Schema (RDFS), to facilitate both data representation and integration. Smith et al. built the LinkHub system, leveraging Semantic Web technologies (i.e., RDF and RDF queries) to facilitate cross-database queries and information retrieval in proteomics in 2007. In 2008, Shironoshita et al. introduced a query formulation method, named Semantic caBIG Data Integration (semCDI), to execute semantic queries across multiple data services in the cancer Biomedical Informatics Grid (caBIG). Mercadé et al. developed an ontology-based application called Orymold for dynamic gene expression data annotation, integration, and exploration in 2009. Based on the QIS, Luis et al. designed an automated approach for integrating federated databases using ontological metadata mappings in 2009. Chisham et al. created the Comparative Data Analysis Ontology (CDAO) and developed the CDAO-Store system to support data integration for phylogenetic analysis in 2011. Kama et al. built a Data Definition Ontology (DDO) using D2RQ (i.e., a platform that provides RDF-based access to relational databases) for accessing heterogeneous clinical data sources. Pang et al. developed BiobankConnect in 2014 to speed up the integration of comparable data from different biobanks into a pooled data set using ontological and lexical indexing. Ethier et al. designed the Clinical Data Integration Model (CDIM) based on the Basic Formal Ontology (BFO) to support biomedical data integration in 2015. Mate et al. proposed an ontology-based approach to organize and describe the medical concepts of both source and target systems in order to integrate data across different clinical and research systems. Livingston et al. created an integrated knowledge base of biomedical data from multiple sources, called KaBOB, based on Open Biomedical Ontologies. In 2016, Liang et al. proposed an ontology-oriented approach to represent the relations between genes, drugs, phenotypes, symptoms, and diseases from multiple information sources to aid the analysis of psychiatric drug repurposing. Similar to our approach, Kock-Schoppenhauer et al. used the ontology-based data access (OBDA) model and the Ontop framework to access relational clinical databases with SPARQL queries. However, most of these existing semantic data integration systems and frameworks have focused on 1) the harmonization and alignment of data elements using semantic resources; 2) the creation of tailored ad hoc resources for specific use cases that may not be generalizable; and 3) the integration of data from similar data sources (e.g., data from different electronic health record systems), addressing syntactic (i.e., data formats) and schematic (i.e., data models) heterogeneity. Very few studies have fully leveraged the reasoning ability provided by ontologically structured data, and none has used ontologies as a knowledge representation tool to document the data integration process.
This paper describes a case study of semantic data integration linking five data sets that cover both individual and contextual level factors for the purpose of assessing the association of predictors of interest with cancer survival. The main contribution of our work is that we applied an ontology-based data integration framework to integrate both individual and contextual level factors to facilitate integrative data analysis (i.e., pooling heterogeneous data sets). Unlike existing ontology-driven data integration methods, our study focused on encoding the different data integration scenarios explicitly using a formal and computational model with a shared vocabulary, the Ontology for Cancer Research Variables (OCRV). Our goal is not only to make the data integration process easier, but also to facilitate documentation and communication of the data integration processes among scientists. This is significant for research rigor, transparency, and reproducibility, as well as for data reusability.
In our previous short paper, we prototyped an ontology-based data access approach to integrate three different datasets to support IDA of cancer survival. In this extended journal paper, we significantly expanded our ontology-based data integration framework as follows.
We used n-ary relations in our ontology to represent relations among more than two individuals. For example, we created an ‘ocrv:diagnosis_relation’ class to link the ‘ocrv:date of diagnosis’, the ‘ocrv:diagnosed tumor type’, and the ‘ncit:patient’.
We adopted the Time Event Ontology (TEO) [43,44,45] for representing events, time, and their relationships. For example, the ‘ocrv:date of diagnosis’ was represented as an ‘ocrv:diagnosis_relation’ instance (event) associated with a ‘teo:timeInstance’ (time).
We improved the reasoning ability by using OWL restrictions, so that we can encode certain knowledge (i.e., constraints on properties) in the ontology. For example, in our model, a current smoker is defined as a patient who (1) is a current everyday or someday smoker, and (2) has smoked at least 100 cigarettes in their lifetime. Thus, we created restrictions for the object property ‘ocrv:has_smoking_status’.
We leveraged the ontology to examine the consistency of the source data. For example, we used an individual’s ‘ocrv:date of diagnosis’ and ‘ncit:birth year’ to calculate the age at diagnosis and then compared it with the value obtained directly from the ‘ncit:age at diagnosis’ variable.
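The extensions above can be illustrated with a minimal, plain-Python sketch that stands in for the ontology machinery. It reifies a diagnosis as an n-ary relation node stored as subject-predicate-object triples, expresses the current-smoker definition as a predicate, and checks the reported age at diagnosis against the age derived from birth year and diagnosis date. Only ‘ocrv:diagnosis_relation’ comes from the text; the `ex:` identifiers, function names, sample values, and the one-year tolerance are assumptions made for illustration, and in the actual framework the smoker definition is encoded as OWL restrictions and evaluated by a reasoner, not by Python code.

```python
from datetime import date

# A tiny in-memory triple store: a set of (subject, predicate, object) tuples.
triples = set()


def add_diagnosis(store, relation_id, patient_id, tumor_type, diagnosis_date):
    """Reify one diagnosis as an n-ary relation node linking patient,
    tumor type, and time. The ex: predicates are hypothetical names."""
    store.add((relation_id, "rdf:type", "ocrv:diagnosis_relation"))
    store.add((relation_id, "ex:has_patient", patient_id))
    store.add((relation_id, "ex:has_tumor_type", tumor_type))
    store.add((relation_id, "ex:has_time_instance", diagnosis_date.isoformat()))


def objects(store, subject, predicate):
    """Return all objects for a given subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]


def is_current_smoker(smoking_frequency, lifetime_cigarettes):
    """Current smoker: smokes every day or some days AND has smoked at
    least 100 cigarettes in their lifetime."""
    return smoking_frequency in {"everyday", "someday"} and lifetime_cigarettes >= 100


def age_is_consistent(birth_year, diagnosis_date, reported_age, tolerance=1):
    """Compare the reported age with the age derived from birth year.
    Only the birth year is known, so the derived age can be off by one
    year; the default tolerance (an assumption) allows for that."""
    derived_age = diagnosis_date.year - birth_year
    return abs(derived_age - reported_age) <= tolerance


# Illustrative usage with made-up values.
add_diagnosis(triples, "ex:dx1", "ex:patient1", "ex:lung_cancer", date(2010, 5, 17))
dx_date = date.fromisoformat(objects(triples, "ex:dx1", "ex:has_time_instance")[0])

print(is_current_smoker("someday", 250))
print(age_is_consistent(1950, dx_date, 59))
print(age_is_consistent(1950, dx_date, 45))
```

Because the diagnosis is a node of its own instead of a single patient-to-tumor link, further attributes of the same event can be attached to it without changing the existing triples.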