Skip to main content

Machine learning in a real-world PFO study: analysis of data from multi-centers in China



The association of patent foreman ovale (PFO) and cryptogenic stroke has been studied for years. Although device closure overall decreases the risk for recurrent stroke, treatment effects varied across different studies. In this study, we aimed to detect sub-clusters in post-closure PFO patients and identify potential predictors for adverse outcomes.


We analyzed patients with embolic stroke of undetermined sources and PFO from 7 centers in China. Machine learning and Cox regression analysis were used.


Using unsupervised hierarchical clustering on principal components, two main clusters were identified and a total of 196 patients were included. The average age was 42.7 (12.37) years and 64.80% (127/196) were female. During a median follow-up of 739 days, 12 (6.9%) adverse events happened, including 6 (3.45%) recurrent stroke, 5 (2.87%) transient ischemic attack (TIA) and one death (0.6%). Compared to cluster 1 (n = 77, 39.20%), patients in cluster 2 (n = 119, 60.71%) were more likely to be male, had higher systolic and diastolic blood pressure, higher body mass index, lower high-density lipoprotein cholesterol and increased proportion of presence of atrial septal aneurysm. Using random forest survival (RFS) analysis, eight top ranking features were selected and used for prediction model construction. As a result, the RFS model outperformed the traditional Cox regression model (C-index: 0.87 vs. 0.54).


There were 2 main clusters in post-closure PFO patients. Traditional cardiovascular profiles remain top ranking predictors for future recurrence of stroke or TIA. However, whether maximizing the management of these factors would provide extra benefits warrants further investigations.

Peer Review reports


Foreman ovale is a bridging structure during embryological development. Normally, this structure closes spontaneously after birth. If it is not the case, a patent channel will be formed and named as patent foreman ovale (PFO), predisposing an increased risk of paradoxical embolisms [1, 2]. The association of PFO and stroke was first proposed in 1988 [3]. Since then, numerous studies, including observational or randomized control trials (RCTs), have shown potential causal effect of PFO on cryptogenic stroke (CS) [4,5,6,7,8,9,10]. The prevalence of PFO in general population is around 25%, but reaches up to 40% in CS patients [3].

Although the first three RCTs suggested a neutral effect of PFO closure on stroke prevention [4,5,6], latest RCTs [7, 9, 10] and updated meta-analyses [11, 12] all supported the benefits of PFO closure. In view of these data, current guidelines have recommended PFO closure in patients with proved PFO-associated stroke (PFO-AS), a new concept created in 2020 [13, 14]. Although device closure on average decreases the risk for recurrent stroke, treatment effects varied substantially across different studies [15].

In recent years, the applications of artificial intelligence or machine learning have shown promising results in health care system. It not only helps the process of data management, but also aids in disease prediction or patient sub-clustering. For example, in a data-driven machine-learning analysis, the authors applied hierarchical k-means clustering algorithm to explore potential sources of embolic stroke [16]. In another study, machine learning was used to automatically discriminate cardioembolic from non-cardioembolic strokes in large datasets [17].

Understanding different causes and classifying strokes based on etiologic subtypes are prerequisites for effective treatments. With the state-of-the-art machine learning, we could easily identify subsets of patients that would benefit most from PFO closure or sustain elevated risk for recurrent stroke. Thus, in this study, we applied unsupervised machine learning to detect sub-clusters in post-closure PFO patients and assess their associated risk with adverse outcomes. As the second aim, we used supervised machine learning to identify potential predictors for adverse outcomes.


Study design and population

The analyzed population was from 7 centers in China, including Guangdong Cardiovascular Institute, Zhongnan Hospital of Wuhan University, Wuhan Asian Heart Hospital, Hubei Huiyi Cardiovascular Center, Jiang Men Central Hospital, the first people's hospital of Foshan and General Hospital of Southern Theatre Command of PLA. Patients with embolic stroke of undetermined sources (ESUS) and PFO were included during June 1st, 2013 and May 31st, 2020. The diagnosis of ESUS was deliberately and systematically assessed by both a neurologist and a cardiologist after excluding the other common etiologies of stroke. PFO was initially discovered by either transthoracic/transesophageal or right heart contrast echocardiography and finally confirmed during cardiac catheterization. Patients with overt alternative causes of their strokes or not receiving PFO closure were excluded. We collected demographic information, laboratory and echocardiographic data for the included subjects.

Outcome ascertainment

Patients were followed by regular telephone interviews or outpatient examinations. The main outcome in our study was a composite of recurrent ischemic stroke, transient ischemic attack (TIA) or all-cause death. Major bleeding or new-onset atrial fibrillation (AF) was examined as secondary outcomes. Cardiac rhythm was assessed by cardiac auscultation, which was followed by electrocardiography if abnormal auscultation was found. At each telephone interview or outpatient visit, a standardized and validated questionnaire (Questionnaire for verifying stroke-free status) was used to detect potential stroke or TIA.

Statistical analysis

Student t-test and chi-square test were used for comparisons of differences between groups. Data were shown as mean (SD), median (interquartile range[IQR]) or number (percentage). Two-sided P < 0.05 was considered to be significant.

Data pre-processing was conducted before machine learning. The summary for missing data is shown in Additional file 1: eTable1. As suggested by the missing data pattern (Additional file 1: eFigure1), it was considered as random missing data case with no particular trend among all the variables. Missing values were then computed with multiple imputation using R package of “mice”. Thirty-four variables from demographic, laboratory and echocardiographic data were finally included.

We first used principal component analysis (PCA) to reduce the dimensions with the function of FAMD (Factor Analysis of Mixed Data). Next, we applied cluster analysis on the PCA outputs using the function of HCPC (Hierarchical Clustering on Principal Components) in FactoMineR package. The partitioning of the HCPC is performed by cutting the hierarchical tree (dendrogram). To consolidate the final partitioning solution, we further performed k-means clustering. The binary data was treated as numeric values before clustering [18]. The average silhouette of observations for different values of k (1 to 10) were computed. The location of the maximum is considered as the optimal number of clusters. Cox proportional hazards regression was then applied to calculate the hazard ratio (HR) and 95% confidence interval (95%CI) of adverse events by different clusters. Proportional-hazards assumption was tested and no violation was found. To examine potential bias from the imputed datasets, we performed complete case analysis as one of the sensitivity analyses.

In supervised learning, we first used all available variables to construct the random forest survival (RFS) model and accessed the variable importance (VIMP). We then selected the top ranking features to reconstruct the prediction models and assessed the performance using concordance index (C-index) and Brier score (BS). A higher C-index and lower BS suggest a better prediction performance. In addition, we applied supervised self-organizing maps to help visualize features associated with individual cluster within the studied patients. All statistical analyses were performed using R version 4.1.2 or Stata 15.1 (StataCorp/SE, College Station, TX). Detailed descriptions of R source code were disclosed in the supplement.


197 PFO patients receiving percutaneous interventions were initially included in this study. The first 12 principal components with an eigenvalue ≥ 1, which accumulatively counted for 70.62% of the dataset, were used as input for the HCPC method (Additional file 1: eTable 2). Using HCPC, 3 clusters were identified (Fig. 1-A). Since the middle cluster included only 1 patient, we excluded this cluster, leaving a final number of 196 patients for subsequent analysis. Briefly, the average age of the included subjects was 42.7 (12.37) years and 64.80% (127/196) were female. During a median follow-up of 739 (IQR 731–924) days, 22 (11.22%) patients were lost to follow up. A total number of 12 adverse events (12/174, 6.9%) were reported, including 6 recurrent stroke (6/174, 3.45%), 5 TIA (5/174, 2.87%) and one death (1/174, 0.6%). No AF or major bleeding was documented.

Fig. 1
figure 1

Clusters identified by different methods. A Dendrogram from hierarchical clustering on principal components analysis. B The average silhouette of observations for different values of k (1 to 10) using k-means clustering analysis. The highest average silhouette was located at k = 2

Among the analyzed subjects, 77 (39.29%) patients were assigned to cluster 1 and 119 (60.71%) were assigned to cluster 2. Compared to cluster 1, patients in cluster 2 were more likely to be male, had higher systolic and diastolic blood pressure, higher body mass index (BMI), lower high-density lipoprotein cholesterol (HDL-C) and increased proportion of presence of atrial septal aneurysm (ASA). The values of red blood cell, hemoglobin, creatinine, uric acid, left atrium (LA), left ventricular end-diastolic dimension (LVEDD), interventricular septum (IVS) and posterior wall thickness (PW) were also higher in patients of cluster 2. Detailed descriptions of these variables were summarized in Table 1 and vividly visualized in Fig. 2. In Cox regression analysis, patients in cluster 2 tended to have 21% increased hazards for adverse events than those in cluster 1 (HR 1.21, 95%CI 0.62–2.36, P = 0.58, Fig. 3-A).

Table 1 Baseline characteristics of the study patients according to the clusters (from HCPC)
Fig. 2
figure 2

Self-organizing maps supervised by clusters identified by HCPC analysis. HCPC, hierarchical clustering on principal components; sbp, systolic blood pressure; dbp, diastolic blood pressure; bmi, body mass index; hdlc, high-density lipoprotein cholesterol; hgb, hemoglobin; ast, aspartate aminotransferase; alt, alanine aminotransferase; alb, albumin; ua, uric acid, asa, atrial septal aneurysm

Fig. 3
figure 3

Cumulative hazard estimates of adverse events according to the identified clusters. A Clusters from hierarchical clustering on principal components analysis. B Clusters from k-means clustering. HR, hazard ratio

In k-means clustering analysis, the highest average silhouette was located at k = 2, suggesting 2 as the optimal number of clusters (Fig. 1-B). Detailed descriptions of baseline characteristics across the 2 clusters were summarized in Additional file 1: eTable 3. Generally, the results were similar to what we found from HCPC. Cox regression analysis also suggested that patients in high risk cluster tended to have increased hazards for adverse events (HR 2.11, 95% CI 0.63–6.96, P = 0.21, Fig. 3-B). And the high risk cluster was characterized by higher proportion of male gender, higher blood pressure, higher BMI and lower HDL-C. Likewise, the finding from complete case analysis was largely identical to that of the primary analysis, except that the analyzed sample was significantly reduced (Additional file 1: eFigure2 and eTable4).

Figure 4 plots the variable importance of the full model using random forest survival analysis. We then selected the eight top ranking features to construct the prediction models, including fasting blood glucose, thickness of interventricular septum, the ratio of mitral peak early (E) to late (A) diastolic filling velocity, left ventricular end-systolic dimension, BMI, systolic blood pressure, thickness of the posterior wall and PTA. As presented in Table 2, the RFS model had similar Brier Score (2.6% vs 2.4%) but higher C-index than the traditional Cox proportional hazard regression model (0.87 vs 0.54), suggesting a better predictive model for adverse events.

Fig. 4
figure 4

Random forest variable importance (VIMP). Blue bars indicate positive VIMP, red indicates negative VIMP. Importance is relative to positive length of bars

Table 2 Performance metrics for different prediction models*


Increasing data have supported that PFO closure could significantly reduce the risk of stroke or TIA compared to medical therapy [7, 9,10,11,12]. The reported rate of recurrent stroke or TIA after PFO closure varied across different studies, ranging from 0 to 5.61% [4,5,6,7,8,9,10]. As pointed out by previous researchers, the key determination of treatment effect relies mainly on whether the discovered PFO is causally related to the stroke or just an innocent bystander [19]. Currently, there are two prediction systems used to evaluate the likelihood of a stoke-related PFO-the risk of paradoxical embolism (RoPE) score and the PASCAL classification system [14, 20].

Main components for RoPE score include age, smoking status, history of hypertension, diabetes, stroke or TIA, characteristics of the infarct on imaging [20]. PASCAL classification system is based on RoPE score, with combined consideration of PFO features, like PFO shunt size or the presence/absence of an ASA [2, 14]. Although the prediction systems are widely used in clinical practice, these estimations are sometimes violated by model assumptions or limited by subjective feature selections.

Machine learning is a promising technique increasingly applied in health care system. It allows phenotyping or sub-clustering of the analyzed population without knowing the definite outcomes, which helps reveal the underlying etiologies. In this study, we identified two main clusters in post-closure PFO patients, with the high risk cluster characterized by higher proportion of male gender and poorer cardiovascular profiles. Machine learning also enables objective feature selections and efficient prediction model constructions. The analysis of RFS further supported the predictive value of traditional risks factors, suggesting that high-risk groups should continue to be targeted to prevent stroke recurrence even after PFO closure [21, 22]. However, whether maximizing the management of these factors would provide extra benefits for these patients warrants further investigations.

Till now, few studies are conducted on machine learning and PFO [16, 17]. Owing to the small number of adverse outcomes, statistical power to identify independent predictors of recurrent stroke/TIA was often limited, when using the traditional Cox regression model [23]. The application of supervised machine learning to some extent helps settle this matter [24]. As shown in this study, RFS model did display better performance compared to Cox regression model after selecting the top ranking features. Additionally, RFS model was able to identify predictive features that were neglected in previous studies, for example, the ratio of mitral peak early (E) to late (A) diastolic filling velocity and thickness of the interventricular or posterior wall.

Although this is a pioneer study, several limitations should still be acknowledged. First, this is a post-hoc analysis, some data are not available, for example, PFO shunt size before closure or residual shunting after closure. Second, the small datasets and missing data could potentially bias the results, although the missing pattern suggests a random missing data case and the results from complete case analysis was similar to that of the imputed datasets. Third, the constructed model was not further validated by external datasets, which to some extent limits its generalizability. Finally, the evaluation of AF was based on cardiac assessment during follow-up visits. Occult AF might still be possible, leading to an underestimation of the prevalence of AF being reported.


There were 2 main clusters in PFO patients receiving device closure. The supervised and unsupervised machine learning both suggest that traditional cardiovascular profiles remain important predictors for future recurrence of stroke or TIA. However, whether maximizing the management of these factors would provide extra benefits in post-closure PFO patients warrants further investigations.

Availability of data and materials

The datasets and analyzed codes are available upon reasonable request from the corresponding author.



Patent foreman ovale


Transient ischemic attack


Atrial septal aneurysm


Hierarchical clustering on principal components


Random forest survival


Systolic blood pressure


Body mass index


High-density lipoprotein cholesterol


Low-density lipoprotein cholesterol


  1. Rana BS, Thomas MR, Calvert PA, Monaghan MJ, Hildick-Smith D. Echocardiographic evaluation of patent foramen ovale prior to device closure. JACC Cardiovasc Imaging. 2010;3(7):749–60.

    Article  PubMed  Google Scholar 

  2. Cho KK, Khanna S, Lo P, Cheng D, Roy D. Persistent pathology of the patent foramen ovale: a review of the literature. Med J Aust. 2021;215(2):89–93.

    Article  PubMed  Google Scholar 

  3. Lechat P, Mas JL, Lascault G, Loron P, Theard M, Klimczac M, et al. Prevalence of patent foramen ovale in patients with stroke. N Engl J Med. 1988;318(18):1148–52.

    Article  CAS  PubMed  Google Scholar 

  4. Furlan AJ, Reisman M, Massaro J, Mauri L, Adams H, Albers GW, et al. Closure or medical therapy for cryptogenic stroke with patent foramen ovale. N Engl J Med. 2012;366(11):991–9.

    Article  CAS  PubMed  Google Scholar 

  5. Carroll JD, Saver JL, Thaler DE, Smalling RW, Berry S, MacDonald LA, et al. Closure of patent foramen ovale versus medical therapy after cryptogenic stroke. N Engl J Med. 2013;368(12):1092–100.

    Article  CAS  PubMed  Google Scholar 

  6. Meier B, Kalesan B, Mattle HP, Khattab AA, Hildick-Smith D, Dudek D, et al. Percutaneous closure of patent foramen ovale in cryptogenic embolism. N Engl J Med. 2013;368(12):1083–91.

    Article  CAS  PubMed  Google Scholar 

  7. Mas JL, Derumeaux G, Guillon B, Massardier E, Hosseini H, Mechtouff L, et al. Patent foramen ovale closure or anticoagulation vs. antiplatelets after stroke. N Engl J Med. 2017;377(11):1011–21.

    Article  CAS  PubMed  Google Scholar 

  8. Saver JL, Carroll JD, Thaler DE, Smalling RW, MacDonald LA, Marks DS, et al. Long-term outcomes of patent foramen ovale closure or medical therapy after stroke. N Engl J Med. 2017;377(11):1022–32.

    Article  PubMed  Google Scholar 

  9. Søndergaard L, Kasner SE, Rhodes JF, Andersen G, Iversen HK, Nielsen-Kudsk JE, et al. Patent foramen ovale closure or antiplatelet therapy for cryptogenic stroke. N Engl J Med. 2017;377(11):1033–42.

    Article  PubMed  Google Scholar 

  10. Lee PH, Song JK, Kim JS, Heo R, Lee S, Kim DH, et al. Cryptogenic stroke and high-risk patent foramen ovale: the DEFENSE-PFO trial. J Am Coll Cardiol. 2018;71(20):2335–42.

    Article  PubMed  Google Scholar 

  11. Smer A, Salih M, Mahfood Haddad T, Guddeti R, Saadi A, Saurav A, et al. Meta-analysis of randomized controlled trials on patent foramen ovale closure versus medical therapy for secondary prevention of cryptogenic stroke. Am J Cardiol. 2018;121(11):1393–9.

    Article  PubMed  Google Scholar 

  12. Tsivgoulis G, Katsanos AH, Mavridis D, Frogoudaki A, Vrettou AR, Ikonomidis I, et al. Percutaneous patent foramen ovale closure for secondary stroke prevention: network meta-analysis. Neurology. 2018;91(1):e8–18.

    Article  PubMed  Google Scholar 

  13. Kleindorfer DO, Towfighi A, Chaturvedi S, Cockroft KM, Gutierrez J, Lombardi-Hill D, et al. 2021 guideline for the prevention of stroke in patients with stroke and transient ischemic attack: a guideline from the American heart association/american stroke association. Stroke. 2021;52(7):e364–467.

    Article  PubMed  Google Scholar 

  14. Elgendy AY, Saver JL, Amin Z, Boudoulas KD, Carroll JD, Elgendy IY, et al. Proposal for updated nomenclature and classification of potential causative mechanism in patent foramen ovale-associated stroke. JAMA Neurol. 2020;77(7):878–86.

    Article  PubMed  Google Scholar 

  15. Kent DM, Saver JL, Kasner SE, Nelson J, Carroll JD, Chatellier G, et al. Heterogeneity of treatment effects in an analysis of pooled individual patient data from randomized trials of device closure of patent foramen ovale after stroke. JAMA. 2021;326(22):2277–86.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Ntaios G, Weng SF, Perlepe K, Akyea R, Condon L, Lambrou D, et al. Data-driven machine-learning analysis of potential embolic sources in embolic stroke of undetermined source. Eur J Neurol. 2021;28(1):192–201.

    Article  CAS  PubMed  Google Scholar 

  17. Guan W, Ko D, Khurshid S, Trisini Lipsanopoulos AT, Ashburner JM, Harrington LX, et al. Automated electronic phenotyping of cardioembolic stroke. Stroke. 2021;52(1):181–9.

    Article  CAS  PubMed  Google Scholar 

  18. Ralambondrainy H. A conceptual version of the K-means algorithm. Pattern Recogn Lett. 1995;16(11):1147–57.

    Article  Google Scholar 

  19. Kent DM, Ruthazer R, Weimar C, Mas JL, Serena J, Homma S, et al. An index to identify stroke-related vs incidental patent foramen ovale in cryptogenic stroke. Neurology. 2013;81(7):619–25.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Mojadidi MK, Zaman MO, Elgendy IY, Mahmoud AN, Patel NK, Agarwal N, et al. Cryptogenic stroke and patent foramen ovale. J Am Coll Cardiol. 2018;71(9):1035–43.

    Article  PubMed  Google Scholar 

  21. Zonneveld TP, Richard E, Vergouwen MD, Nederkoorn PJ, de Haan R, Roos YB, et al. Blood pressure-lowering treatment for preventing recurrent stroke, major vascular events, and dementia in patients with a history of stroke or transient ischaemic attack. Cochrane Database Syst Rev. 2018;7(7):Cd007858.

    PubMed  Google Scholar 

  22. Redfern J, McKevitt C, Dundas R, Rudd AG, Wolfe CD. Behavioral risk factor prevalence and lifestyle change after stroke: a prospective study. Stroke. 2000;31(8):1877–81.

    Article  CAS  PubMed  Google Scholar 

  23. Elmariah S, Furlan AJ, Reisman M, Burke D, Vardi M, Wimmer NJ, et al. Predictors of recurrent events in patients with cryptogenic stroke and patent foramen ovale within the CLOSURE I (Evaluation of the STARFlex septal closure system in patients with a stroke and/or transient ischemic attack due to presumed paradoxical embolism through a patent foramen ovale) trial. JACC Cardiovasc Interv. 2014;7(8):913–20.

    Article  PubMed  Google Scholar 

  24. Zhou J, Chou OHI, Wong KHG, Lee S, Leung KSK, Liu T, et al. Development of an electronic frailty index for predicting mortality and complications analysis in pulmonary hypertension using random survival forest model. Front Cardiovasc Med. 2022;9:735906.

    Article  PubMed  PubMed Central  Google Scholar 

Download references


Not applicable.


This work was supported by the Guangdong Medical Association Program [H022018002] and Scientific Program of Guangdong Provincial People’s Hospital [2016dzx03].

Author information

Authors and Affiliations



LDL, YZY and ZCJ have full access to
all the data in this study and take full responsibility as guarantors for the integrity of the data and the accuracy of the data analysis. ZCJ, LDL WSL and YZY contributed to the study design. ZGC, SQS, ZHW, LJX, HH and HJX contributed to analysis and data interpretation. LDL, YZY, ZCJ and WSL drafted the manuscript and contributed to the final approval of the manuscript.

Corresponding authors

Correspondence to Shulin Wu or Caojin Zhang.

Ethics declarations

Ethics approval and consent to participate

The study conformed to the ethical guidelines of the 1975 Declaration of Helsinki and was approval by the Ethics Committee of Guangdong General Hospital. Written informed consent was obtained from each patient.

Consent for publications

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

eTable1 Variable summary for missing data. eTable2 Hierarchical clustering on principal components analysis of the patients. eTable3 Baseline characteristic of the studied patients across the clusters (k-means clustering analysis). eTable4 Baseline characteristics of the studied patients across the clusters (in complete case analysis). eFigure1 Missing data pattern. eFigure2 Clusters identified by different methods in complete case analysis.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Luo, D., Yang, Z., Zhang, G. et al. Machine learning in a real-world PFO study: analysis of data from multi-centers in China. BMC Med Inform Decis Mak 22, 305 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: