- Research
- Open access
- Published:
Machine learning in a real-world PFO study: analysis of data from multi-centers in China
BMC Medical Informatics and Decision Making volume 22, Article number: 305 (2022)
Abstract
Purpose
The association of patent foreman ovale (PFO) and cryptogenic stroke has been studied for years. Although device closure overall decreases the risk for recurrent stroke, treatment effects varied across different studies. In this study, we aimed to detect sub-clusters in post-closure PFO patients and identify potential predictors for adverse outcomes.
Methods
We analyzed patients with embolic stroke of undetermined sources and PFO from 7 centers in China. Machine learning and Cox regression analysis were used.
Results
Using unsupervised hierarchical clustering on principal components, two main clusters were identified and a total of 196 patients were included. The average age was 42.7 (12.37) years and 64.80% (127/196) were female. During a median follow-up of 739 days, 12 (6.9%) adverse events happened, including 6 (3.45%) recurrent stroke, 5 (2.87%) transient ischemic attack (TIA) and one death (0.6%). Compared to cluster 1 (n = 77, 39.20%), patients in cluster 2 (n = 119, 60.71%) were more likely to be male, had higher systolic and diastolic blood pressure, higher body mass index, lower high-density lipoprotein cholesterol and increased proportion of presence of atrial septal aneurysm. Using random forest survival (RFS) analysis, eight top ranking features were selected and used for prediction model construction. As a result, the RFS model outperformed the traditional Cox regression model (C-index: 0.87 vs. 0.54).
Conclusions
There were 2 main clusters in post-closure PFO patients. Traditional cardiovascular profiles remain top ranking predictors for future recurrence of stroke or TIA. However, whether maximizing the management of these factors would provide extra benefits warrants further investigations.
Introduction
Foreman ovale is a bridging structure during embryological development. Normally, this structure closes spontaneously after birth. If it is not the case, a patent channel will be formed and named as patent foreman ovale (PFO), predisposing an increased risk of paradoxical embolisms [1, 2]. The association of PFO and stroke was first proposed in 1988 [3]. Since then, numerous studies, including observational or randomized control trials (RCTs), have shown potential causal effect of PFO on cryptogenic stroke (CS) [4,5,6,7,8,9,10]. The prevalence of PFO in general population is around 25%, but reaches up to 40% in CS patients [3].
Although the first three RCTs suggested a neutral effect of PFO closure on stroke prevention [4,5,6], latest RCTs [7, 9, 10] and updated meta-analyses [11, 12] all supported the benefits of PFO closure. In view of these data, current guidelines have recommended PFO closure in patients with proved PFO-associated stroke (PFO-AS), a new concept created in 2020 [13, 14]. Although device closure on average decreases the risk for recurrent stroke, treatment effects varied substantially across different studies [15].
In recent years, the applications of artificial intelligence or machine learning have shown promising results in health care system. It not only helps the process of data management, but also aids in disease prediction or patient sub-clustering. For example, in a data-driven machine-learning analysis, the authors applied hierarchical k-means clustering algorithm to explore potential sources of embolic stroke [16]. In another study, machine learning was used to automatically discriminate cardioembolic from non-cardioembolic strokes in large datasets [17].
Understanding different causes and classifying strokes based on etiologic subtypes are prerequisites for effective treatments. With the state-of-the-art machine learning, we could easily identify subsets of patients that would benefit most from PFO closure or sustain elevated risk for recurrent stroke. Thus, in this study, we applied unsupervised machine learning to detect sub-clusters in post-closure PFO patients and assess their associated risk with adverse outcomes. As the second aim, we used supervised machine learning to identify potential predictors for adverse outcomes.
Methods
Study design and population
The analyzed population was from 7 centers in China, including Guangdong Cardiovascular Institute, Zhongnan Hospital of Wuhan University, Wuhan Asian Heart Hospital, Hubei Huiyi Cardiovascular Center, Jiang Men Central Hospital, the first people's hospital of Foshan and General Hospital of Southern Theatre Command of PLA. Patients with embolic stroke of undetermined sources (ESUS) and PFO were included during June 1st, 2013 and May 31st, 2020. The diagnosis of ESUS was deliberately and systematically assessed by both a neurologist and a cardiologist after excluding the other common etiologies of stroke. PFO was initially discovered by either transthoracic/transesophageal or right heart contrast echocardiography and finally confirmed during cardiac catheterization. Patients with overt alternative causes of their strokes or not receiving PFO closure were excluded. We collected demographic information, laboratory and echocardiographic data for the included subjects.
Outcome ascertainment
Patients were followed by regular telephone interviews or outpatient examinations. The main outcome in our study was a composite of recurrent ischemic stroke, transient ischemic attack (TIA) or all-cause death. Major bleeding or new-onset atrial fibrillation (AF) was examined as secondary outcomes. Cardiac rhythm was assessed by cardiac auscultation, which was followed by electrocardiography if abnormal auscultation was found. At each telephone interview or outpatient visit, a standardized and validated questionnaire (Questionnaire for verifying stroke-free status) was used to detect potential stroke or TIA.
Statistical analysis
Student t-test and chi-square test were used for comparisons of differences between groups. Data were shown as mean (SD), median (interquartile range[IQR]) or number (percentage). Two-sided P < 0.05 was considered to be significant.
Data pre-processing was conducted before machine learning. The summary for missing data is shown in Additional file 1: eTable1. As suggested by the missing data pattern (Additional file 1: eFigure1), it was considered as random missing data case with no particular trend among all the variables. Missing values were then computed with multiple imputation using R package of “mice”. Thirty-four variables from demographic, laboratory and echocardiographic data were finally included.
We first used principal component analysis (PCA) to reduce the dimensions with the function of FAMD (Factor Analysis of Mixed Data). Next, we applied cluster analysis on the PCA outputs using the function of HCPC (Hierarchical Clustering on Principal Components) in FactoMineR package. The partitioning of the HCPC is performed by cutting the hierarchical tree (dendrogram). To consolidate the final partitioning solution, we further performed k-means clustering. The binary data was treated as numeric values before clustering [18]. The average silhouette of observations for different values of k (1 to 10) were computed. The location of the maximum is considered as the optimal number of clusters. Cox proportional hazards regression was then applied to calculate the hazard ratio (HR) and 95% confidence interval (95%CI) of adverse events by different clusters. Proportional-hazards assumption was tested and no violation was found. To examine potential bias from the imputed datasets, we performed complete case analysis as one of the sensitivity analyses.
In supervised learning, we first used all available variables to construct the random forest survival (RFS) model and accessed the variable importance (VIMP). We then selected the top ranking features to reconstruct the prediction models and assessed the performance using concordance index (C-index) and Brier score (BS). A higher C-index and lower BS suggest a better prediction performance. In addition, we applied supervised self-organizing maps to help visualize features associated with individual cluster within the studied patients. All statistical analyses were performed using R version 4.1.2 or Stata 15.1 (StataCorp/SE, College Station, TX). Detailed descriptions of R source code were disclosed in the supplement.
Results
197 PFO patients receiving percutaneous interventions were initially included in this study. The first 12 principal components with an eigenvalue ≥ 1, which accumulatively counted for 70.62% of the dataset, were used as input for the HCPC method (Additional file 1: eTable 2). Using HCPC, 3 clusters were identified (Fig. 1-A). Since the middle cluster included only 1 patient, we excluded this cluster, leaving a final number of 196 patients for subsequent analysis. Briefly, the average age of the included subjects was 42.7 (12.37) years and 64.80% (127/196) were female. During a median follow-up of 739 (IQR 731–924) days, 22 (11.22%) patients were lost to follow up. A total number of 12 adverse events (12/174, 6.9%) were reported, including 6 recurrent stroke (6/174, 3.45%), 5 TIA (5/174, 2.87%) and one death (1/174, 0.6%). No AF or major bleeding was documented.
Among the analyzed subjects, 77 (39.29%) patients were assigned to cluster 1 and 119 (60.71%) were assigned to cluster 2. Compared to cluster 1, patients in cluster 2 were more likely to be male, had higher systolic and diastolic blood pressure, higher body mass index (BMI), lower high-density lipoprotein cholesterol (HDL-C) and increased proportion of presence of atrial septal aneurysm (ASA). The values of red blood cell, hemoglobin, creatinine, uric acid, left atrium (LA), left ventricular end-diastolic dimension (LVEDD), interventricular septum (IVS) and posterior wall thickness (PW) were also higher in patients of cluster 2. Detailed descriptions of these variables were summarized in Table 1 and vividly visualized in Fig. 2. In Cox regression analysis, patients in cluster 2 tended to have 21% increased hazards for adverse events than those in cluster 1 (HR 1.21, 95%CI 0.62–2.36, P = 0.58, Fig. 3-A).
In k-means clustering analysis, the highest average silhouette was located at k = 2, suggesting 2 as the optimal number of clusters (Fig. 1-B). Detailed descriptions of baseline characteristics across the 2 clusters were summarized in Additional file 1: eTable 3. Generally, the results were similar to what we found from HCPC. Cox regression analysis also suggested that patients in high risk cluster tended to have increased hazards for adverse events (HR 2.11, 95% CI 0.63–6.96, P = 0.21, Fig. 3-B). And the high risk cluster was characterized by higher proportion of male gender, higher blood pressure, higher BMI and lower HDL-C. Likewise, the finding from complete case analysis was largely identical to that of the primary analysis, except that the analyzed sample was significantly reduced (Additional file 1: eFigure2 and eTable4).
Figure 4 plots the variable importance of the full model using random forest survival analysis. We then selected the eight top ranking features to construct the prediction models, including fasting blood glucose, thickness of interventricular septum, the ratio of mitral peak early (E) to late (A) diastolic filling velocity, left ventricular end-systolic dimension, BMI, systolic blood pressure, thickness of the posterior wall and PTA. As presented in Table 2, the RFS model had similar Brier Score (2.6% vs 2.4%) but higher C-index than the traditional Cox proportional hazard regression model (0.87 vs 0.54), suggesting a better predictive model for adverse events.
Discussion
Increasing data have supported that PFO closure could significantly reduce the risk of stroke or TIA compared to medical therapy [7, 9,10,11,12]. The reported rate of recurrent stroke or TIA after PFO closure varied across different studies, ranging from 0 to 5.61% [4,5,6,7,8,9,10]. As pointed out by previous researchers, the key determination of treatment effect relies mainly on whether the discovered PFO is causally related to the stroke or just an innocent bystander [19]. Currently, there are two prediction systems used to evaluate the likelihood of a stoke-related PFO-the risk of paradoxical embolism (RoPE) score and the PASCAL classification system [14, 20].
Main components for RoPE score include age, smoking status, history of hypertension, diabetes, stroke or TIA, characteristics of the infarct on imaging [20]. PASCAL classification system is based on RoPE score, with combined consideration of PFO features, like PFO shunt size or the presence/absence of an ASA [2, 14]. Although the prediction systems are widely used in clinical practice, these estimations are sometimes violated by model assumptions or limited by subjective feature selections.
Machine learning is a promising technique increasingly applied in health care system. It allows phenotyping or sub-clustering of the analyzed population without knowing the definite outcomes, which helps reveal the underlying etiologies. In this study, we identified two main clusters in post-closure PFO patients, with the high risk cluster characterized by higher proportion of male gender and poorer cardiovascular profiles. Machine learning also enables objective feature selections and efficient prediction model constructions. The analysis of RFS further supported the predictive value of traditional risks factors, suggesting that high-risk groups should continue to be targeted to prevent stroke recurrence even after PFO closure [21, 22]. However, whether maximizing the management of these factors would provide extra benefits for these patients warrants further investigations.
Till now, few studies are conducted on machine learning and PFO [16, 17]. Owing to the small number of adverse outcomes, statistical power to identify independent predictors of recurrent stroke/TIA was often limited, when using the traditional Cox regression model [23]. The application of supervised machine learning to some extent helps settle this matter [24]. As shown in this study, RFS model did display better performance compared to Cox regression model after selecting the top ranking features. Additionally, RFS model was able to identify predictive features that were neglected in previous studies, for example, the ratio of mitral peak early (E) to late (A) diastolic filling velocity and thickness of the interventricular or posterior wall.
Although this is a pioneer study, several limitations should still be acknowledged. First, this is a post-hoc analysis, some data are not available, for example, PFO shunt size before closure or residual shunting after closure. Second, the small datasets and missing data could potentially bias the results, although the missing pattern suggests a random missing data case and the results from complete case analysis was similar to that of the imputed datasets. Third, the constructed model was not further validated by external datasets, which to some extent limits its generalizability. Finally, the evaluation of AF was based on cardiac assessment during follow-up visits. Occult AF might still be possible, leading to an underestimation of the prevalence of AF being reported.
Conclusions
There were 2 main clusters in PFO patients receiving device closure. The supervised and unsupervised machine learning both suggest that traditional cardiovascular profiles remain important predictors for future recurrence of stroke or TIA. However, whether maximizing the management of these factors would provide extra benefits in post-closure PFO patients warrants further investigations.
Availability of data and materials
The datasets and analyzed codes are available upon reasonable request from the corresponding author.
Abbreviations
- PFO:
-
Patent foreman ovale
- TIA:
-
Transient ischemic attack
- ASA:
-
Atrial septal aneurysm
- HCPC:
-
Hierarchical clustering on principal components
- RFS:
-
Random forest survival
- SBP:
-
Systolic blood pressure
- BMI:
-
Body mass index
- HDL-C:
-
High-density lipoprotein cholesterol
- LDL-C:
-
Low-density lipoprotein cholesterol
References
Rana BS, Thomas MR, Calvert PA, Monaghan MJ, Hildick-Smith D. Echocardiographic evaluation of patent foramen ovale prior to device closure. JACC Cardiovasc Imaging. 2010;3(7):749–60.
Cho KK, Khanna S, Lo P, Cheng D, Roy D. Persistent pathology of the patent foramen ovale: a review of the literature. Med J Aust. 2021;215(2):89–93.
Lechat P, Mas JL, Lascault G, Loron P, Theard M, Klimczac M, et al. Prevalence of patent foramen ovale in patients with stroke. N Engl J Med. 1988;318(18):1148–52.
Furlan AJ, Reisman M, Massaro J, Mauri L, Adams H, Albers GW, et al. Closure or medical therapy for cryptogenic stroke with patent foramen ovale. N Engl J Med. 2012;366(11):991–9.
Carroll JD, Saver JL, Thaler DE, Smalling RW, Berry S, MacDonald LA, et al. Closure of patent foramen ovale versus medical therapy after cryptogenic stroke. N Engl J Med. 2013;368(12):1092–100.
Meier B, Kalesan B, Mattle HP, Khattab AA, Hildick-Smith D, Dudek D, et al. Percutaneous closure of patent foramen ovale in cryptogenic embolism. N Engl J Med. 2013;368(12):1083–91.
Mas JL, Derumeaux G, Guillon B, Massardier E, Hosseini H, Mechtouff L, et al. Patent foramen ovale closure or anticoagulation vs. antiplatelets after stroke. N Engl J Med. 2017;377(11):1011–21.
Saver JL, Carroll JD, Thaler DE, Smalling RW, MacDonald LA, Marks DS, et al. Long-term outcomes of patent foramen ovale closure or medical therapy after stroke. N Engl J Med. 2017;377(11):1022–32.
Søndergaard L, Kasner SE, Rhodes JF, Andersen G, Iversen HK, Nielsen-Kudsk JE, et al. Patent foramen ovale closure or antiplatelet therapy for cryptogenic stroke. N Engl J Med. 2017;377(11):1033–42.
Lee PH, Song JK, Kim JS, Heo R, Lee S, Kim DH, et al. Cryptogenic stroke and high-risk patent foramen ovale: the DEFENSE-PFO trial. J Am Coll Cardiol. 2018;71(20):2335–42.
Smer A, Salih M, Mahfood Haddad T, Guddeti R, Saadi A, Saurav A, et al. Meta-analysis of randomized controlled trials on patent foramen ovale closure versus medical therapy for secondary prevention of cryptogenic stroke. Am J Cardiol. 2018;121(11):1393–9.
Tsivgoulis G, Katsanos AH, Mavridis D, Frogoudaki A, Vrettou AR, Ikonomidis I, et al. Percutaneous patent foramen ovale closure for secondary stroke prevention: network meta-analysis. Neurology. 2018;91(1):e8–18.
Kleindorfer DO, Towfighi A, Chaturvedi S, Cockroft KM, Gutierrez J, Lombardi-Hill D, et al. 2021 guideline for the prevention of stroke in patients with stroke and transient ischemic attack: a guideline from the American heart association/american stroke association. Stroke. 2021;52(7):e364–467.
Elgendy AY, Saver JL, Amin Z, Boudoulas KD, Carroll JD, Elgendy IY, et al. Proposal for updated nomenclature and classification of potential causative mechanism in patent foramen ovale-associated stroke. JAMA Neurol. 2020;77(7):878–86.
Kent DM, Saver JL, Kasner SE, Nelson J, Carroll JD, Chatellier G, et al. Heterogeneity of treatment effects in an analysis of pooled individual patient data from randomized trials of device closure of patent foramen ovale after stroke. JAMA. 2021;326(22):2277–86.
Ntaios G, Weng SF, Perlepe K, Akyea R, Condon L, Lambrou D, et al. Data-driven machine-learning analysis of potential embolic sources in embolic stroke of undetermined source. Eur J Neurol. 2021;28(1):192–201.
Guan W, Ko D, Khurshid S, Trisini Lipsanopoulos AT, Ashburner JM, Harrington LX, et al. Automated electronic phenotyping of cardioembolic stroke. Stroke. 2021;52(1):181–9.
Ralambondrainy H. A conceptual version of the K-means algorithm. Pattern Recogn Lett. 1995;16(11):1147–57.
Kent DM, Ruthazer R, Weimar C, Mas JL, Serena J, Homma S, et al. An index to identify stroke-related vs incidental patent foramen ovale in cryptogenic stroke. Neurology. 2013;81(7):619–25.
Mojadidi MK, Zaman MO, Elgendy IY, Mahmoud AN, Patel NK, Agarwal N, et al. Cryptogenic stroke and patent foramen ovale. J Am Coll Cardiol. 2018;71(9):1035–43.
Zonneveld TP, Richard E, Vergouwen MD, Nederkoorn PJ, de Haan R, Roos YB, et al. Blood pressure-lowering treatment for preventing recurrent stroke, major vascular events, and dementia in patients with a history of stroke or transient ischaemic attack. Cochrane Database Syst Rev. 2018;7(7):Cd007858.
Redfern J, McKevitt C, Dundas R, Rudd AG, Wolfe CD. Behavioral risk factor prevalence and lifestyle change after stroke: a prospective study. Stroke. 2000;31(8):1877–81.
Elmariah S, Furlan AJ, Reisman M, Burke D, Vardi M, Wimmer NJ, et al. Predictors of recurrent events in patients with cryptogenic stroke and patent foramen ovale within the CLOSURE I (Evaluation of the STARFlex septal closure system in patients with a stroke and/or transient ischemic attack due to presumed paradoxical embolism through a patent foramen ovale) trial. JACC Cardiovasc Interv. 2014;7(8):913–20.
Zhou J, Chou OHI, Wong KHG, Lee S, Leung KSK, Liu T, et al. Development of an electronic frailty index for predicting mortality and complications analysis in pulmonary hypertension using random survival forest model. Front Cardiovasc Med. 2022;9:735906.
Acknowledgements
Not applicable.
Funding
This work was supported by the Guangdong Medical Association Program [H022018002] and Scientific Program of Guangdong Provincial People’s Hospital [2016dzx03].
Author information
Authors and Affiliations
Contributions
LDL, YZY and ZCJ have full access to all the data in this study and take full responsibility as guarantors for the integrity of the data and the accuracy of the data analysis. ZCJ, LDL WSL and YZY contributed to the study design. ZGC, SQS, ZHW, LJX, HH and HJX contributed to analysis and data interpretation. LDL, YZY, ZCJ and WSL drafted the manuscript and contributed to the final approval of the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
The study conformed to the ethical guidelines of the 1975 Declaration of Helsinki and was approval by the Ethics Committee of Guangdong General Hospital. Written informed consent was obtained from each patient.
Consent for publications
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
eTable1 Variable summary for missing data. eTable2 Hierarchical clustering on principal components analysis of the patients. eTable3 Baseline characteristic of the studied patients across the clusters (k-means clustering analysis). eTable4 Baseline characteristics of the studied patients across the clusters (in complete case analysis). eFigure1 Missing data pattern. eFigure2 Clusters identified by different methods in complete case analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Luo, D., Yang, Z., Zhang, G. et al. Machine learning in a real-world PFO study: analysis of data from multi-centers in China. BMC Med Inform Decis Mak 22, 305 (2022). https://doi.org/10.1186/s12911-022-02048-5
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12911-022-02048-5