Data
Patient cohort selection
We included patients who were newly diagnosed with ICC on histopathology and treated at Beijing 302 Hospital between July 2007 and July 2017. The exclusion criteria were as follows: hilar or distal ICC, intrahepatic metastasis of extrahepatic cholangiocarcinoma (ECC), tumors mixed with hepatocellular carcinoma (HCC), masses of uncertain origin or benign masses, perioperative mortality (defined as death within 1 month after operation), and concurrent other malignancies. Patients with only a single medical record and subsequent loss to follow-up, and those with incomplete information, were also excluded.
Data collection
We collected the following clinical data on the ICC patients: demographic characteristics, medical history, smoking status, laboratory results, and tumor features. All clinical data were abstracted at the time of diagnosis prior to specific anti-cancer therapy. Patients were followed up for death or recurrence of ICC from the time of diagnosis to July 31, 2017.
The OHDSI CDM
The OHDSI CDM is designed to include all observational health data elements to support the generation of reliable scientific evidence. It is essentially a relational database representing observational data derived from the EHR. In the OHDSI CDM data tables, the meaning of each portion of content is represented using standard concepts. Content-related concepts are stored with their concept_ids as foreign keys to the CONCEPT table in the standardized vocabularies. We used the V5.3 OHDSI CDM (https://github.com/OHDSI/CommonDataModel) for our local table schemas. It contains 37 tables and 42 vocabularies.
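The role of concept_ids as foreign keys can be illustrated with a minimal sketch. It is written in Python with the built-in sqlite3 module purely for illustration and is not part of our implementation. The two tables are trimmed to a few columns, the person rows are invented, and 8507 and 8532 are the standard gender concepts for MALE and FEMALE.

```python
import sqlite3

# In-memory stand-in for two CDM tables; columns are trimmed to a small
# subset and the person rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concept (
        concept_id   INTEGER PRIMARY KEY,
        concept_name TEXT NOT NULL,
        domain_id    TEXT NOT NULL
    );
    CREATE TABLE person (
        person_id         INTEGER PRIMARY KEY,
        gender_concept_id INTEGER NOT NULL REFERENCES concept (concept_id),
        year_of_birth     INTEGER NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO concept VALUES (?, ?, ?)",
    [(8507, "MALE", "Gender"), (8532, "FEMALE", "Gender")],
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, 8507, 1955), (2, 8532, 1962)],
)

# Resolve the coded value back to a readable name through the foreign key.
rows = conn.execute("""
    SELECT p.person_id, c.concept_name
    FROM person AS p
    JOIN concept AS c ON c.concept_id = p.gender_concept_id
    ORDER BY p.person_id
""").fetchall()
print(rows)
```

Because every coded column is resolved through the same CONCEPT table, a query written against one CDM instance can in principle be reused against another without renaming local codes.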
R packages
We used two R packages in our experiments. The first is the survival package (https://cran.r-project.org/web/packages/survival/index.html), which supports core survival analysis, including the definition of survival objects, Kaplan-Meier estimation, and Cox model analysis. The second is the mice package (https://cran.r-project.org/web/packages/mice/index.html). Missing data are a ubiquitous problem in clinical research. For example, blood pressure measurements may be missing because an automatic sphygmomanometer broke down. Many current analysis tools can only handle complete data sets; we therefore used the mice package to impute missing values.
Overall method framework
We designed an OHDSI CDM-supported survival analysis framework to facilitate large-scale multi-center survival analysis, as shown in Fig. 1. The framework contains the following three key modules: mapping local terms to standard concepts, Extraction-Transformation-Loading (ETL) of patient data into OHDSI CDM, and developing a generic analysis interface with the OHDSI CDM.
Mapping local terms to standard concepts
We analyzed all of the tables and vocabularies of the OHDSI CDM, manually created mappings from the analytical variables in the patient data to the corresponding CDM tables and concepts, and normalized the value expressions. The collected ICC patient data were used as the source EHR data for the mapping study.
Six main categories of ICC patient data were used for survival analysis. 1) Demographic characteristics, including date of birth, gender, country, ethnicity, and blood type. 2) Medical history, including comorbidities prior to ICC diagnosis (e.g., diabetes, hypertension, cholelithiasis, hyperlipidemia, coronary artery disease, and cholecystectomy). Underlying liver disease was either abstracted and confirmed by reviewing the medical records or ascertained through related clinical observations. 3) Laboratory results, with parameters including, among others, leukocytes, erythrocytes, platelets, albumin, total bilirubin, creatinine, carbohydrate antigen (CA) 19–9, and CA 125. 4) Smoking status, manually abstracted from the admission medical records of patients. 5) Therapeutic procedures conducted on each patient. 6) Tumor features, such as tumor number, maximum size, vascular invasion, and lymph node involvement, assessed using contrast computed tomography (CT), magnetic resonance imaging (MRI) examination, or pathological reports of the patients.
Creating mappings from the source variables and their values to the target OHDSI concepts is crucial for patient data standardization. Because source variable names are typically expressed in non-standard terms, and textual variable values are often free text written in varying local expressions, both the terms and the textual values must be standardized into standard concepts. The mappings between the source EHR data terms and the OHDSI CDM concepts were created as shown in Fig. 2. We used the OHDSI vocabulary browser Athena (http://athena.ohdsi.org/) to find the corresponding standard concept in OHDSI when conducting concept mappings. To minimize mapping errors and information loss, two authors reviewed the concept mappings to reach agreement.
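The normalization and lookup described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not our mapping tables: the synonym list, the function names normalize and map_term, and the negative concept_ids are all hypothetical placeholders, and real concept_ids would be looked up in Athena.

```python
import re

# Hypothetical mapping table: canonical local term -> (target CDM table,
# placeholder concept_id). Negative ids are deliberate placeholders.
LOCAL_TO_STANDARD = {
    "diabetes": ("condition_occurrence", -1),
    "hypertension": ("condition_occurrence", -2),
    "total bilirubin": ("measurement", -3),
}

# Hypothetical local spelling variants folded into one canonical term.
SYNONYMS = {
    "dm": "diabetes",
    "diabetes mellitus": "diabetes",
    "high blood pressure": "hypertension",
    "tbil": "total bilirubin",
}


def normalize(term):
    """Lower-case, collapse whitespace, and resolve known synonyms."""
    cleaned = re.sub(r"\s+", " ", term.strip().lower())
    return SYNONYMS.get(cleaned, cleaned)


def map_term(term):
    """Return (cdm_table, concept_id), or None when no mapping exists yet."""
    return LOCAL_TO_STANDARD.get(normalize(term))


source_terms = ["DM", "  High  Blood Pressure", "TBIL", "ascites"]
# Terms without a mapping are collected for manual review.
unmapped = [t for t in source_terms if map_term(t) is None]
print(map_term("DM"))
print(unmapped)
```

Returning None for unknown terms, rather than guessing, keeps unmapped variables visible so that they can be resolved by manual review.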
Transformation and loading of patient data into the OHDSI CDM
To load patient data into the OHDSI CDM, we developed a data transformation and loading algorithm that populates the CDM with the ICC patient data using the above concept mappings. There are four key steps for patient data transformation and loading. 1) Variable grouping. We manually categorized the source variables with respect to the target OHDSI tables. Six core tables are involved in our data loading: condition_occurrence [medical history], measurement [laboratory results], observation [tumor features and smoking status], procedure_occurrence [procedures], person [demographic characteristics], and death [vital status]. 2) De-identification. To maintain patient confidentiality, privacy, and security, the original patient_ids were removed when the patient data were loaded into the OHDSI CDM, and random person_ids were generated in the CDM for OHDSI research data management and analysis purposes. 3) Missing data imputation. The R package mice provides general approaches for dealing with missing data in multivariate scenarios [14]. The mice package currently offers more than 20 different methods for different situations. Random forest imputation is one of the most commonly used methods in the mice framework, and [15] recommends this method for imputing complex research data sets. Therefore, we used the function mice(data, method = "rf") to impute the missing ICC patient data. 4) Loading patient data into the CDM. According to the concept mappings, we developed a set of transformation scripts to load the patient data directly into the corresponding six core tables. The relationships with other indirect tables, such as concept, vocabulary, and domain, were established through “concept_id”.
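The de-identification step can be sketched as follows. This Python sketch is an illustration only and does not reproduce our transformation scripts: the function names assign_person_ids and deidentify, the record layout, and the id range are hypothetical, and the sample values are invented.

```python
import secrets


def assign_person_ids(source_ids):
    """Map each distinct source patient_id to a unique random person_id."""
    rng = secrets.SystemRandom()  # OS entropy; not reproducible from a seed
    unique_ids = sorted(set(source_ids))
    # Sampling without replacement guarantees no two patients share an id.
    new_ids = rng.sample(range(1, 10 ** 9), k=len(unique_ids))
    return dict(zip(unique_ids, new_ids))


def deidentify(records, id_map):
    """Replace patient_id with person_id and drop the original identifier."""
    out = []
    for rec in records:
        row = {k: v for k, v in rec.items() if k != "patient_id"}
        row["person_id"] = id_map[rec["patient_id"]]
        out.append(row)
    return out


# Invented example rows in a hypothetical long format.
records = [
    {"patient_id": "H-0001", "variable": "total bilirubin", "value": 21.3},
    {"patient_id": "H-0001", "variable": "albumin", "value": 38.0},
    {"patient_id": "H-0002", "variable": "albumin", "value": 41.5},
]
id_map = assign_person_ids(r["patient_id"] for r in records)
clean = deidentify(records, id_map)
print(sorted(clean[0].keys()))
```

In this sketch the id_map is the only link back to the source identifiers, so whether it is discarded or kept under restricted access determines whether re-identification remains possible.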
Developing a generic analysis interface in the OHDSI CDM
Once the patient data are loaded into the OHDSI CDM, scalable survival analysis across multiple sites becomes feasible. We developed a generic survival analysis interface on the OHDSI CDM that provides reusable R functions built on the R ‘survival’ package. The event of interest is patient death, and overall survival (OS) is defined as the interval from diagnosis to death or, for patients without a recorded death, to the date of the last contact. Our R interface supports general survival analysis functions in the CDM and currently includes: 1) Creating the analysis dataset from a set of input ‘concept_id’ values, because ‘concept_id’ is the standard identifier of each analysis object. Furthermore, ‘concept_id’ is platform-independent and offers extensible interoperability with other OHDSI CDM applications, such as cohort query results. 2) Constructing the baseline demographics study, which includes predefined analysis functions such as frequency, percentage, mean ± standard deviation, and interquartile range (IQR). 3) Building the Kaplan-Meier survival curve, defined as the probability of surviving in a given length of time, considering time in many small intervals [16].
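The OS definition and the Kaplan-Meier estimate described above can be illustrated with a small from-scratch sketch. It is written in Python for illustration and is not our R interface, which relies on the ‘survival’ package. The function names overall_survival_days and kaplan_meier are hypothetical, and the observations are toy values rather than study data.

```python
from datetime import date


def overall_survival_days(diagnosis, death=None, last_contact=None):
    """Return (time_in_days, event); event is 1 for death, 0 if censored."""
    if death is not None:
        return (death - diagnosis).days, 1
    return (last_contact - diagnosis).days, 0


def kaplan_meier(observations):
    """Product-limit estimate from (time, event) pairs.

    Returns [(t, S(t))] at each distinct time with at least one death.
    """
    at_risk = len(observations)
    survival = 1.0
    curve = []
    for t in sorted({time for time, _ in observations}):
        deaths = sum(1 for time, ev in observations if time == t and ev == 1)
        leaving = sum(1 for time, _ in observations if time == t)
        if deaths:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set after t.
        at_risk -= leaving
    return curve


# One patient with a recorded death, expressed through calendar dates.
first = overall_survival_days(date(2015, 1, 1), death=date(2015, 1, 6))

# Toy (time, event) pairs: two subjects are censored, three die.
observations = [first, (8, 0), (12, 1), (12, 1), (20, 0)]
curve = kaplan_meier(observations)
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects contribute to the risk set up to their last contact and then drop out without lowering the curve, which is how the estimate uses incomplete follow-up.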
The CDM-based R analysis tool runs in the local environment within a single institute to ensure the security of protected health information (PHI). Our methods are also portable across institutes that deploy the OHDSI CDM.
Evaluation of the CDM-based results
To evaluate the CDM-based analysis results, we performed the analysis using the source ICC patient data with the common analysis tool Intercooled Stata 13.0 (https://www.stata.com/), and compared the results for a group of specific analysis tasks.