Health administrative data sources
This study utilizes a repository of Ontario’s administrative health databases housed at the ICES [14]. ICES is an independent, non-profit research institute whose legal status under Ontario’s health information privacy law allows it to collect and analyze health care and demographic data, without consent, for health system evaluation and improvement. These datasets were linked using unique encoded identifiers and analyzed at ICES. Since all Ontario residents are covered through a single-payer insurance system for physician, hospital-based care and home care services, and drugs for residents 65 years of age and older, healthcare encounters can be linked across systems through individual health card numbers and each resident receiving a unique ICES Key Number (IKN) [16]. Population-based ICES data sources linked for this analysis include the Canadian Institute for Health Information’s Discharge Abstract Database and Same Day Surgery database (CIHI-DAD/SDS), which report all hospital visits dated back to 1988, the CIHI National Ambulatory Care Reporting System (NACRS), which reports hospital and community-based ambulatory care visits starting from the year 2000, and the Ontario Health Insurance Plan (OHIP) database reporting outpatient physician services since 1991. The use of data in this project was authorized under Sect. 45 of Ontario’s Personal Health Information Protection Act, which does not require review by a Research Ethics Board.
Cohort creation
Our study cohort comprised of all Ontario residents 18 years of age and older who had a transcatheter closure procedure for ASD or PFO closure recorded in CIHI-DAD/SDS (CCI code 1HN80GPFL) between October 2002 and December 2017.
Reference standard database
CorHealth Ontario’s cardiac registry was selected as the reference standard [17, 18]. The CorHealth cardiac registry captures select clinical data on all cardiac procedures performed in Ontario catheterization laboratories [17, 18]. Two distinct fields in the catheterization laboratory data indicate if the procedure was a PFO closure (Yes/No) or an ASD closure (Yes/No), and were used in our study to identify if the procedure was closure of PFO, ASD, both or neither [19]. The index event date for each patient in the study sample was the date of the procedure.
If patients had multiple interventions, only the first intervention was kept for this analysis. Patient records were excluded from the study dataset if their ICES Key Number (IKN) was missing, invalid, or repeating, if their gender code was missing or invalid, if the patient was not a resident of Ontario, or if at the time of intervention, the patient was younger than 18 years of age. Records Cases labeled as having both PFO and ASD or neither diagnoses were excluded from the building of this classification algorithm.
Algorithm variable selection and definitions
Variables extracted from ICES data were considered for inclusion into our algorithm to identify PFO cases based on clinical relevance and review of the literature. Please see Additional file 1: Appendix B for the full list of variables and their respective codes. Patient demographic information was captured through sex and age group. All of the following variables were reported during a 5-year lookback period prior to TC. History of stroke and TIA were available as dichotomous variables (i.e. presence/absence or yes/no flags) and total number of stroke or TIA events. An overall Charlson Comorbidity Index score was also retrieved from ICES [20]. Other comorbidities were defined ICD-based yes/no flags only. Healthcare utilization was captured by intervention codes reported during index admission, and any history of admission for ASD, PFO, or other congenital heart diseases (CHD) 5 years prior to closure.
Random forest classification
Random forest models are made up of several decision trees, a non-parametric and supervised machine learning approach that may be used for both regression and classification tasks [21,22,23]. Decision trees are constructed by recursively splitting data based on simple rules learned from the input variables provided from a given dataset of interest [21,22,23]. With random forest models, each individual decision tree therein analyzes a different sample of the data, and then all trees “vote” as an ensemble what a given observation should be categorized as, in this case whether a patient has undergone transcatheter closure for a PFO or an ASD [21,22,23].
A random forest method was chosen because it is non-parametric and builds upon the positive attributes of the popular decision tree method such as providing implicit feature selection, and decreased sensitivity to outliers compared to other classification techniques such as logistic or linear regression [21, 23, 24]. Given the novel nature of this classification model, minimal a priori feature selection was preferred. Furthermore, by combining the results of multiple individual decision trees, it follows that a combination of all resultant outputs may result in a higher predictive accuracy than each constituent tree alone, especially with complex and high-dimensional data [23, 24]. The combination of this majority voting approach on sub-samples of the data is known as bootstrap aggregating, or bagging [21, 24]. Bagging decreases the likelihood of overfitting and improves model generalization by decreasing outlier influence and model variance [21, 24]. This then provides a unique advantage when encountering high-dimensional data with complex interactions [23, 24].
All versions of the classification model were run in R using the randomForest package with 500 trees generated within each random forest [25]. To assess model performance, the reference standard was randomly sampled and split 40/60 into a training and a test set. Performance measures were compared between test and training sets to assess models for degree of overfitting, i.e., if training values were much higher than test values. Overall model performance was based on test accuracy, sensitivity, and specificity.
Variable importance was assessed through a mean decrease in Gini index. The Gini index indicates a level of partition “purity” which the random forest model uses to determine its classifications [21, 23, 24]. The higher the mean decrease in Gini for a given variable, the less likely it is that variable will lead to misclassified patients across all constructed trees [21, 23, 24]. Variable importance scores were compared among covariates to determine their relative ranking.
The final model was chosen once performance measures were optimized via hyperparameter tuning of mtry and the decision threshold. Mtry is a hyperparameter that pertains to the randomness of the forest, namely how many of the variables are considered at each split [26]. To determine the correct value, a grid search was run with the caret package, where a linear search was performed for a vector of candidate mtry values, and the value resulting in the highest accuracy was used for the final model [27]. The classification threshold, at default set at 0.5, reflects the probability required for an observation, in this case a patient in the CorHealth dataset, to be classified as ASD or PFO [28]. Different values for this threshold were attempted until the resultant tuned model performance was optimized.
As a sensitivity analysis, model performance was also compared to prior classification methods, using the same reference data and performance measures, by designating patients who had experienced an ischemic stroke, a hemorrhagic stroke, or a TIA within 1 year of closure as PFO patients, and the rest as ASD patients, without using any machine learning methods. Please refer to Additional file 1: Appendix C for a reproducible example of utilized code with simulated data.
Descriptive statistics of classified cohort
Following classification of patients by the random forest model as having undergone ASD or PFO transcatheter closure, the clinical and demographic characteristics were descriptively summarized in R by counts and percentages using the tableone package [29]. Clinical and demographic characteristics were compared between groups through chi-squared tests, with a significance level of p = 0.05.