Imbalanced target prediction with pattern discovery on clinical data repositories
 Tak-Ming Chan^{1} (corresponding author),
 Yuxi Li^{2},
 Choo-Chiap Chiau^{1},
 Jane Zhu^{1},
 Jie Jiang^{2} and
 Yong Huo^{2}
DOI: 10.1186/s12911-017-0443-3
© The Author(s). 2017
Received: 20 October 2016
Accepted: 11 April 2017
Published: 20 April 2017
Abstract
Background
Clinical data repositories (CDR) have great potential to improve outcome prediction and risk modeling. However, most clinical studies require careful study design, dedicated data collection efforts, and sophisticated modeling techniques before a hypothesis can be tested. We aim to bridge this gap so that clinical domain users can perform first-hand prediction on existing repository data without complicated handling, and obtain insightful patterns of imbalanced targets before a formal study is conducted. We specifically target interpretability for domain users, where the model can be conveniently explained and applied in clinical practice.
Methods
We propose an interpretable pattern model which is noise (missing-value) tolerant for practice data. To address the challenge of imbalanced targets of interest in clinical research, e.g., deaths occurring in less than a few percent of cases, the geometric mean of sensitivity and specificity (G-mean) is employed as the optimization criterion, for which a simple but effective heuristic algorithm is developed.
Results
We compared pattern discovery to clinically interpretable methods on two retrospective clinical datasets: a thoracic dataset with 14.9% deaths within 1 year, and a cardiac dataset with 9.1% deaths. Despite the imbalance challenge evident in the other methods, pattern discovery consistently shows competitive cross-validated prediction performance. Compared to logistic regression, Naïve Bayes, and decision tree, pattern discovery achieves statistically significant (p-values < 0.01, Wilcoxon signed rank test) favorable averaged testing G-means and F1-scores (harmonic mean of precision and sensitivity). Without requiring sophisticated technical processing of data or tweaking, the prediction performance of pattern discovery is consistently comparable to the best achievable performance.
Conclusions
Pattern discovery has demonstrated to be robust and valuable for target prediction on existing clinical data repositories with imbalance and noise. The prediction results and interpretable patterns can provide insights in an agile and inexpensive way for the potential formal studies.
Keywords
Pattern discovery, Data mining, Prediction, Imbalanced data, Clinical data repository
Background
Data analytics on clinical data repositories
Healthcare Information Systems (HIS) such as Cardiovascular Information Systems (CVIS) have been available for decades [1]. Their main function is to store and provide access to patient records with deeper information than Electronic Medical Records (EMR). Integrated with EMR, Radiology Information Systems (RIS), Laboratory Information Systems (LIS), etc., HIS and CVIS have been useful for monitoring, reporting, operating, scheduling and managing purposes, with graphical user interfaces (GUI) such as dashboards.
With the emerging technology and availability of clinical registries and clinical data repositories [2], advanced predictive data analytics has great potential to add value to clinical research and the improvement of clinical outcomes [3]. Traditional clinical studies, whether retrospective or prospective, require tremendous effort in design, data collection and sophisticated processing before any hypothesis can be tested or a target can be predicted. Mining existing massive practice data from repositories offers a promising way to create value and provide insights without the heavy overhead of a traditional clinical study. The challenge lies in the noise of practice data and the imbalance of prediction targets of major clinical importance, such as bleeding after percutaneous coronary intervention (PCI) [4], or cardiac death [5]. Because data directly available from clinical data repositories is not subject to strict inclusion/exclusion criteria or sample matching to balance cases and controls [6], typical data mining methods for prediction (classification) are not designed to handle such challenges. The dilemma is that, with rich existing data, domain users desire to generate initial data-driven hypotheses and gain insight into whether a specific clinical target of interest is predictable and which attributes (predictor variables) should be considered, before they take on the more involved path of a formal clinical study. While imbalanced target prediction is more challenging, it is also a realistically meaningful challenge offering high practical value for outcome prediction and quality improvement under the real-world distribution of cases and controls.
We aim to bridge the gap between the sophisticated preparation needed to handle imbalanced, noisy data and first-hand data-driven insights for predicting targets of interest directly on existing data. In this way, domain users can generate meaningful hypotheses, gain insight into the predictability of their targets of interest, and discover informative patterns of potential predictors that distinguish the targets from the rest. The patterns discovered this way are comparable to the best achievable discoveries requiring a series of sophisticated data processing steps, such as upsampling with tweaking, before typical prediction methods can be applied. As a result, more involved clinical studies can be guided on which data samples to include/exclude and which predictors (variables) to collect, for example in a case report form (CRF) for a formal study.
Model interpretability for domain users
Model interpretability is also highly desired in the clinical context. Technically speaking, the ability to draw classification boundaries on data is a valid form of interpretability, but we specifically aim at clinical interpretability. In particular, it requires that a prediction model can be explained by a clinical domain practitioner and applied, for example, to select patient cohort characteristics expected to be consistent with the model's predictions when conducting his/her formal follow-up study. Therefore, interpretability throughout this paper represents a domain-specific challenge rather than a technical one.
While domain users are gradually accepting more sophisticated prediction models in the clinical domains [7–13], we exclude the following models, which are not considered domain interpretable in our scope: support vector machines (SVM) and artificial neural networks (ANN), which do not generate explainable rules for domain users [11, 14]; and random tree and random forest, which generate an excessive number of trivial rules that are overwhelming for clinical reasoning [15]. For example, we test-ran random tree on the first real dataset in our evaluation, and it generated a tree spanning 228 lines (attribute-value occurrences). Though technically this can be considered a decision tree, such lengthy rules are not feasibly interpretable for domain users.
In order to identify predictive patterns to guide potential formal studies, interpretability is critical not only for the selection of attributes (predictors) but also for the specific properties (values) of the predictors to look into. Therefore, among the numerous classification methods available, we focus on interpretable ones in our proposed method and comparisons: those whose trained models (classifiers) expose the explicit attributes and values, with a model length digestible by human users. The representative interpretable models included in our evaluation comparisons are logistic regression [16], Naïve Bayes [17] and decision tree [18].
In this paper, we propose a predictive and intuitively interpretable pattern model that is noise tolerant for real data. We develop a simple pattern discovery algorithm in which an optimization criterion is employed for prediction targets that are rare but of clinical importance, such as cardiac death. To evaluate its effectiveness, we employed two retrospective clinical datasets with imbalance and compared pattern discovery with the above representative interpretable prediction methods. Evaluation with cross-validation shows competitive prediction performance of pattern discovery. Pattern discovery is expected to be a handy and valuable analytics tool for domain users to predict imbalanced targets from existing practice data without sophisticated processing, and to provide first-hand insights for formal research and studies to follow.
Problem definition and related works
We have listed the existing interpretable classifiers included for comparison: logistic regression, Naïve Bayes, and decision tree (C4.5). None of them was designed for imbalanced datasets. Naïve Bayes would be less influenced, as the target proportion can be used as the prior in training, but a moderately high imbalance ratio would outweigh the prior and impact prediction performance, as will be shown in the experimental results and recent work [13]. Both logistic regression and decision tree optimize towards overall accuracy, so the prediction performance on a minority target can be significantly affected.
Other non-interpretable methods, such as k-nearest-neighbor [19], support vector machines [20] and artificial neural networks [3], are beyond our scope of comparison, as they do not directly provide explicit human-readable “patterns” for domain users to follow up on.
The proposed pattern discovery in this work bears some resemblance to association rule mining [21], motif discovery from biological sequences [22] and feature selection for data mining [23]. Association rule mining finds only frequent itemsets and does not model prediction (classification). One critical limitation of association-rule-based methods is that the target has to be frequent, which is not the case for clinical outcomes of interest [6]. Further extensions to classification after association rule mining suffer from scalability, because non-trivial rules (over 3 attributes) can take intractable time to compute [24]. Furthermore, association rule mining works only with exact occurrences, which cannot tolerate the noise in healthcare data. These two limitations also apply to rule-extraction-based prediction methods [25]. Motif discovery works on sequential and contiguous patterns, which is not the case in mining healthcare data (attributes are disjoint, unordered and not contiguous) [22, 26]. Nonetheless, the approximate matching used to model biological motifs [27] inspires us to introduce a control to tolerate noise and increase the flexibility of the pattern model. Feature selection usually serves as an auxiliary method in combination with formal data mining methods for target prediction [23], but it works only at the attribute level (not attribute-value) and does not explicitly generate a prediction model for direct interpretation. On the other hand, the wide spectrum of feature selection methods provides many choices for selecting attributes for pattern discovery, such as Chi-squared test based feature selection [28].
Motivated by these, this work presents a pattern discovery classifier featuring a highly interpretable predictive pattern model on noisy, imbalanced healthcare data in practice for domain users.
Methods
Data
In this study, we utilize two published datasets to evaluate how pattern discovery can be applied to imbalanced target prediction, in a way similar to clinical data repositories where minimal data processing is needed. The two datasets have been de-identified and published online for scientific research. Availability and approval information can be found in the corresponding references.
The thoracic dataset concerns surgical risk on real-life clinical data from the thoracic surgery domain. The data was originally collected retrospectively at Wroclaw Thoracic Surgery Centre for patients who underwent major lung resections for primary lung cancer in the years 2007–2011 [20]. The publicly available dataset is the version after feature selection and elimination of missing values. It is composed of 470 samples, 16 preoperative attributes after feature selection, and the target attribute of 1-year survival labels (denoted Risk1Yr = Yes if the patient died; prevalence = 14.9%). To simulate the target scenario without much tweaking, the original numeric attributes (PRE4, PRE5, and age) without well-established categorization were skipped, and the 22 missing values (0.3%) in the data were kept as-is with no imputation, to evaluate noise handling. Instead, PRE4 and PRE5 were combined into the well-established chronic obstructive pulmonary disease (COPD, Yes/No) category with the auxiliary function. The attribute list is detailed in Additional file 1.
The cardiac death dataset contains patients with coronary artery disease (CAD). Peripheral blood samples from 338 subjects aged 62 ± 11 years with CAD were analyzed and followed for a mean of 2.4 years for cardiovascular death (31 deaths). The available dataset is composed of 43 attributes (41 non-trivial) covering both clinical attributes and ones derived from gene expression [5]. While that study discovered associations between gene expression profiles and cardiac death, the next question of both great interest and challenge to domain users is whether a predictive pattern can be discovered for further follow-up studies. Therefore, in the experiments we tried to predict Cardiac Death = Yes (prevalence = 9.1%) on the available data as-is, with definition-dependent attributes removed to properly evaluate prediction performance. In our experiments, data of both phases was combined for evaluation. In this dataset, gene expressions were transformed into more concise principal components (Prin*) and conserved axes of variation (snmAxis*). In our experiments, the gene expression components/axes were categorized by their signs (>0 or ≤0) with the auxiliary function. Other clinical indicator attributes were categorized according to typical normal/abnormal ranges. The 417 missing values (2.9%) were kept to evaluate noise handling. The attribute list and more details on categorization are available in Additional file 1.
Pattern discovery
We first propose the pattern model to support interpretability and tolerate noise for real data. An optimization criterion for prediction performance on imbalanced targets is then employed. A simple algorithm is then presented to computationally discover a predictive pattern according to the optimization goal.
The proposed model is a combination of attributes and their corresponding (categorized) values for a chosen prediction target. An auxiliary configuration function is implemented to transform numeric values into categories based on clinical guidelines or domain knowledge. To make the pattern practical and flexible for noisy, realistic data, a matching ratio threshold is introduced. It controls the minimal percentage of attribute-value pairs that must match for a sample to be considered an (imperfect) match of the pattern. Formally, a pattern Pat = {P, S, r} consists of:
 1) a subset of attributes P = {P_{1}, P_{2}, …, P_{w}} ⊂ R,
 2) a specific set of their corresponding values S = {v_{1}, v_{2}, …, v_{w}}, and
 3) a matching ratio threshold 0 < r ≤ 1 to control the ratio of matching values of a data sample on P.
It can also be represented as P_{1} = v_{1}, P_{2} = v_{2}, …, P_{w} = v_{w} (with matching ratio threshold r).
Pattern matching: a sample D_{i} = {d_{i1}, d_{i2}, …, d_{im}} is defined to match a pattern Pat = {P, S, r} = {{P_{1}, P_{2}, …, P_{w}}, {v_{1}, v_{2}, …, v_{w}}, r} if count(d_{iP1} == v_{1}, d_{iP2} == v_{2}, …, d_{iPw} == v_{w})/w ≥ r, i.e., at least a fraction r of the w pattern pairs are matched. We denote this case as match(Pat, D_{i}) = TRUE; otherwise match(Pat, D_{i}) = FALSE.
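The matching rule above can be sketched in a few lines of Python (a minimal illustration under our reading of the definition, not the authors' implementation; the dictionary representation of a sample is an assumption):

```python
# Minimal sketch of pattern matching with a ratio threshold r.
# A sample matches if at least a fraction r of the pattern's w
# attribute-value pairs agree; missing values (absent keys or None)
# simply count as mismatches, which is how noise is tolerated.

def match(pattern_attrs, pattern_values, r, sample):
    """pattern_attrs: attribute names; pattern_values: required values;
    sample: dict mapping attribute -> observed value (None if missing)."""
    w = len(pattern_attrs)
    hits = sum(1 for a, v in zip(pattern_attrs, pattern_values)
               if sample.get(a) == v)
    return hits / w >= r

# Patient 1 from the illustrative table
patient1 = {"Gender": "Male", "PCI History": "Yes", "Hemoglobin": "Abnormal",
            "Diabetes": "No", "CRP": "Abnormal"}
print(match(["PCI History", "CRP"], ["Yes", "Abnormal"], 1.0, patient1))  # True
```

With r < 1 a sample can still match even when some pairs disagree or are missing, which is exactly the imperfect-match behavior described above.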
An illustrative example of categorical CVIS patient data

ID | Gender | PCI History | Hemoglobin | Diabetes | CRP | Bleeding
1 | Male | Yes | Abnormal | No | Abnormal | Yes
2 | Female | No | Abnormal | N/A | Abnormal | No
3 | Male | No | N/A | No | Normal | No
4 | N/A | Yes | Normal | No | Normal | No
5 | Female | Yes | N/A | No | Abnormal | Yes
6 | Male | No | Normal | No | Normal | No
The threshold r thus tolerates missing values by allowing them as mismatches, and the pattern model remains intuitively interpretable by clinical users. The challenge lies in discovering, de novo, a pattern from data that maximizes a chosen prediction criterion.
To optimize and evaluate the pattern model specifically on imbalanced targets, the following criteria are employed.
For a dataset D with m attributes R = {R_{1}, R_{2}, …, R_{m}}, there is an exponential number of attribute-value combinations as pattern candidates, so we need an optimization criterion to distinguish informative candidates from spurious ones. For the imbalanced minority target of interest T = t, prediction performance should be evaluated by criteria other than accuracy, as accuracy is non-informatively high (= 1 − prevalence) if one simply predicts all samples to be the majority case T ≠ t.
All these measures range over [0, 1], the higher the better, with 1 being the ideal value. The evaluation steps of the candidate pattern on the illustrative data are shown in Additional file 1.
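For concreteness, the measures referenced above can be computed from a binary confusion matrix as follows (a sketch using the standard definitions of sensitivity, specificity, precision, G-mean and F1-score; the function name is ours):

```python
import math

def metrics(tp, fp, tn, fn):
    """Standard binary measures for the minority target T = t."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on T = t
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    gmean = math.sqrt(sensitivity * specificity)      # geometric mean
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)        # harmonic mean
    return sensitivity, specificity, precision, gmean, f1
```

Because G-mean multiplies sensitivity and specificity, predicting everything as the majority class drives it to 0, which is why it stays informative under imbalance while accuracy does not.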
These evaluation measures therefore serve as potential optimization criteria for a classifier targeting the prediction of the minority T = t. In this work, we employ G-mean as the optimization criterion, which showed stronger trends towards performance balance than the F1-score in optimization (geometric mean versus harmonic mean) in initial experiments (details not shown). The optimization of G-mean is carried out only on training data, never on testing data.
The pattern discovery problem can therefore be defined as: given an input dataset D with input attributes R = {R_{1}, R_{2}, …, R_{m}} and target attribute T, a specified target of interest T = t, and a maximal pattern width W (≤ m), find a pattern Pat = {P, S, r} with P ⊂ R and |P| ≤ W such that the optimization criterion, G-mean for T = t, is maximized on D.
The next challenge is to discover a pattern de novo that maximizes the optimization criterion on the training data. We introduce a simple pattern discovery algorithm and further integrate it with independent log likelihoods, for cases whose patterns are too weak, to form the pattern discovery classifier.
For pattern discovery, exhaustive search is computationally intractable. In a simplified view, the search space can be broken down into three steps: the candidate attributes; the optimal combination of possible values of those attributes; and the optimal matching threshold. The first two steps remain computationally intractable if optimal solutions with respect to measures such as the F-measure are required [23]. A heuristic computational method is developed to discover a feasible pattern candidate, first eliminating hundreds of thousands of less predictive candidates, so that clinical users have a feasible pattern to start from during interactions.
Identifying pattern candidate attributes is a feature selection problem [23]. The Chi-squared test of independence [28] is employed, which is well established and interpretable for domain users. To determine the pattern width W, a p-value cutoff (≤ 0.05) or the top K significant attributes can be used.
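The attribute selection step can be sketched as follows (an assumed implementation using pandas and `scipy.stats.chi2_contingency`, not the authors' code; it ranks attributes by p-value and applies either a cutoff or a top-K cap, as described above):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def select_attributes(df, target, alpha=0.05, top_k=None):
    """Rank categorical columns by chi-squared p-value against the target."""
    pvals = {}
    for col in df.columns:
        if col == target:
            continue
        table = pd.crosstab(df[col], df[target])  # contingency table
        _, p, _, _ = chi2_contingency(table)
        pvals[col] = p
    ranked = sorted(pvals, key=pvals.get)         # most significant first
    if top_k is not None:
        return ranked[:top_k]
    return [c for c in ranked if pvals[c] <= alpha]
```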
To tackle the challenge of determining the attribute-value combinations for imbalanced target prediction, we develop a heuristic method based on attribute-value percentage comparison. For a candidate attribute, all its values are listed against the target value (T = t) and non-target value (T ≠ t) in a table, and the count of samples belonging to each specific attribute-target value combination is filled in. The row-wise percentages are then calculated. The heuristic compares these percentages column-wise and selects the value with the maximal percentage to associate with the target value. An illustrative example is shown in Additional file 1.
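One reading of this row-wise percentage heuristic can be sketched as follows (our interpretation: `pick_value` is a hypothetical helper, and resolving "maximal percentage" as the largest gap between the T = t and T ≠ t rows is an assumption):

```python
from collections import Counter

def pick_value(values, targets, t):
    """Pick the attribute value most enriched in the T = t row.

    values:  the attribute's value for each sample (None if missing)
    targets: the target label for each sample
    """
    pos = Counter(v for v, y in zip(values, targets) if y == t and v is not None)
    neg = Counter(v for v, y in zip(values, targets) if y != t and v is not None)
    n_pos = sum(pos.values()) or 1
    n_neg = sum(neg.values()) or 1
    # column-wise comparison of the row-wise percentages
    return max(pos, key=lambda v: pos[v] / n_pos - neg.get(v, 0) / n_neg)
```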
Lastly, the matching ratio threshold r is determined by exhaustive search over the range from at least one attribute (1/W) up to all attributes (W/W = 100%), and the value yielding the best optimization criterion is chosen as the output r.
Though the pattern model is intuitively interpretable, there can be cases whose patterns are too weak and ambiguous to discover when imbalance exists. To construct a robust classifier that does not miss such cases, we calculate the log likelihood of T = t from the attribute-values of the case along the pattern, and accept the case if this log likelihood is larger than that of T ≠ t. This intuitively integrates Naïve Bayes scoring to classify cases without any explicit pattern. We set a relatively loose criterion of positive/negative ratio < 2 to trigger the log likelihood scoring. This setting is for convenience of use, as 2 is intuitively the minimal integer above 1, the boundary case of balanced data. Further tuning with decimal values might improve the results, but that is not our current focus.
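The fallback scoring can be sketched as follows (a hedged illustration: Laplace smoothing and the per-class count representation are our assumptions; the text only specifies a Naïve Bayes-style log likelihood comparison):

```python
import math

def loglik(sample, class_stats, class_count, n_values=2):
    """Naive-Bayes-style log likelihood of a class for one sample.

    class_stats: attr -> {value: count} within the class
    class_count: number of training samples in the class
    """
    ll = math.log(class_count)  # prior, up to a shared constant
    for attr, value in sample.items():
        count = class_stats.get(attr, {}).get(value, 0)
        ll += math.log((count + 1) / (class_count + n_values))  # Laplace smoothing
    return ll
```

A case without an explicit pattern is then accepted as T = t when `loglik(sample, stats_t, n_t)` exceeds `loglik(sample, stats_not_t, n_not_t)`.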
The training and classifying procedures of the pattern discovery classifier are summarized as follows:
Train classifier on training set D
1. Chi-squared test to select W attributes (W specified by the user or by p-value cutoff): P = {P_{1}, P_{2}, …, P_{w}} ⊂ R
2. Heuristic method to find the values with the maximal row-wise percentages across the columns for those attributes: S = {v_{1}, v_{2}, …, v_{w}} for T = t
3. For r = 1/W to W/W: evaluate {P, S, r} on D and keep the pattern Pat with the best G-mean
4. If imbalance exists, calculate the log likelihoods for all values of P = {P_{1}, P_{2}, …, P_{w}}

Classify D_{i} in a test set
Return match(Pat, D_{i}) OR (log likelihood(T = t | D_{i}) > log likelihood(T ≠ t | D_{i}), if calculated)
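The exhaustive search over the matching ratio r in the training procedure above can be sketched as follows (assuming the number of matched pattern pairs per sample has been precomputed; `best_threshold` is a hypothetical helper):

```python
import math

def best_threshold(match_counts, labels, W):
    """Pick r = k/W maximizing training G-mean.

    match_counts: matched pattern pairs per sample (0..W)
    labels: True for T = t, False otherwise
    """
    best_r, best_g = None, -1.0
    for k in range(1, W + 1):
        r = k / W
        tp = fp = tn = fn = 0
        for hits, y in zip(match_counts, labels):
            pred = hits / W >= r
            if pred and y: tp += 1
            elif pred: fp += 1
            elif y: fn += 1
            else: tn += 1
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        g = math.sqrt(sens * spec)
        if g > best_g:
            best_r, best_g = r, g
    return best_r, best_g
```

Only W candidate thresholds exist, so this step is cheap compared to the attribute and value searches.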
Evaluation methods and experiment design
To evaluate prediction performance, a typical approach is to use hold-out testing data after building prediction models on training data. The training data can be further split to optimize parameters and select the model with the best generality before testing, by applying cross-validation [30]. In this work, our aim is to evaluate model prediction generality for domain users with minimal tweaking, and the datasets are retrospective. Instead of a one-off training-testing split, which may introduce bias, we repeated training-testing multiple times (10) and recorded the hold-out testing performance each time. This is effectively a stratified 10-fold cross-validation, but without optimizing parameters or selecting top models. We further performed this rotated 10-time hold-out testing for 20 runs, resulting in a distribution for each prediction performance metric: precision, sensitivity, F1-score and G-mean. Besides comparing the averaged 20-run metrics along with their standard deviations (±), we further evaluated the statistical significance of the performance distributions, as illustrated at the bottom of Fig. 1.
The non-parametric paired Wilcoxon signed rank test [31] was applied to assess whether the favorable (higher) F1-scores and G-means of pattern discovery were statistically significant compared to each other method. We used R (version 3.2.1), specifically wilcox.test() with the parameters paired = T, alternative = ‘greater’.
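The same paired one-sided test can be reproduced in Python with SciPy, as an equivalent of the quoted R call (the score vectors below are purely illustrative, not our measured results):

```python
from scipy.stats import wilcoxon

# Paired per-fold G-means for two methods (illustrative numbers only)
pattern_discovery = [0.62, 0.65, 0.61, 0.66, 0.63]
other_method      = [0.55, 0.58, 0.54, 0.60, 0.57]

# Equivalent of wilcox.test(x, y, paired = TRUE, alternative = "greater")
stat, p = wilcoxon(pattern_discovery, other_method, alternative="greater")
```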
A 10-fold cross-validation used in this training-testing way generates 10 (slightly) different models due to the hold-out differences, and it is impractical to list all models of the 20 runs or to synthesize a unified one. We employed the common practice for illustration on retrospective data [32], which is to use the full data to train a final model, consistent with how rules are illustrated in the thoracic dataset reference [20]. Note that in this regard the discovered pattern is for illustration simplicity only, and a future testing set should be used to validate it. The final pattern generation is illustrated in the top right of Fig. 1.
The compared methods, logistic regression, Naïve Bayes, and decision tree (C4.5), were run with the Weka 3.6 APIs, which are able to handle missing values [32]. A random baseline classifier with equal chance of predicting positive/negative for any sample was implemented, serving as a non-informative random guess; it has theoretical sensitivity = 0.5 and precision = prevalence for any specified target, so no standard deviation is available. The Weka evaluation APIs were employed to compute the metrics for all methods. All methods were run with default parameters on the same set of attributes; the cross-validation thus served for evaluating hold-out testing each time rather than for parameter optimization. Note that all models were trained on the same generated folds in each run for fair comparison.
Targeting the domain user scenario, we focus on performance evaluation with the original prevalence (imbalance) of the data. We note that there are specific methods based on downsampling [11], upsampling [10], or generating new artificial samples (such as the synthetic minority oversampling technique, SMOTE [33]) to address imbalanced data, besides the typical cost matrix approaches (high penalty on misclassified target cases in training) [34]. While they have yielded promising results in many other applications, in our target scenario of gaining initial insights from practice data, clinical domain users would be confused and disengaged by statistics not reflecting the real data, either linked to non-existent samples or prevented from viewing certain real samples. Such concerns are beyond our scope, as we aim to provide interpretability that lets domain users investigate and connect to the actual samples.
Nevertheless, we performed extended experiments with upsampling. We used the same evaluation framework, but additionally upsampled the minority positive cases to certain positive/negative ratios (upsampling ratios) in the training set only, and evaluated the hold-out testing set WITHOUT any upsampling. Our purpose is to illustrate that pattern discovery can achieve comparably robust performance at the original imbalanced prevalence; upsampling itself is not the scenario desired by healthcare domain users, as interpretability is affected by non-existent samples and distorted case proportions.
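The upsampling protocol can be sketched as follows (replication of minority positives by sampling with replacement is our assumption; `upsample_train` is a hypothetical helper applied to the training fold only, never to the hold-out fold):

```python
import random

def upsample_train(train, is_positive, ratio, seed=0):
    """Replicate minority positives until positives/negatives reach ratio."""
    rng = random.Random(seed)
    pos = [s for s in train if is_positive(s)]
    neg = [s for s in train if not is_positive(s)]
    need = int(ratio * len(neg)) - len(pos)   # extra positives required
    extra = [rng.choice(pos) for _ in range(max(0, need))]
    return train + extra
```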
Results
Results on the thoracic dataset
Following the experiment design, we first report the average 20-run testing results on the thoracic dataset with the original prevalence. We then cover the extended experiment results with upsampling. Statistical test results are then summarized, and the discovered pattern is illustrated with references to results beyond our scope. Detailed evaluation results with standard deviations (±) are available in Additional file 1.
Therefore, finding the optimal upsampling ratios beforehand is not trivial, especially for domain users. Pattern discovery shows the advantage of robust and consistent prediction performance even without upsampling, being the least sensitive to training upsampling ratios. This is desirable for our target scenario, where domain users would like to discover insights from noisy and imbalanced practice data before investing heavily in formal studies.
Wilcoxon test (paired, greater-than) p-values between pattern discovery and the other methods on testing F1-scores and G-means of the cross-validations on the thoracic dataset

Upsampling ratio | Logistic Regression | Naive Bayes | Decision Tree
F1-score
Original | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.2 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.4 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.6 | 4.43 × 10^{−5} | 8.14 × 10^{−5} | 4.43 × 10^{−5}
0.8 | 7.54 × 10^{−4} | 0.0045 | 4.43 × 10^{−5}
1.0 | 5.17 × 10^{−5} | 0.0895 | 4.43 × 10^{−5}
G-mean
Original | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.2 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.4 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.6 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.8 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
1.0 | 4.43 × 10^{−5} | 9.72 × 10^{−4} | 4.43 × 10^{−5}
For illustration, we show the pattern discovered with all data for Risk1Yr = Yes (in the data, T means Yes and F means No).
Discovered pattern from the full thoracic dataset for illustration

Pattern (Rule) | Coverage | Accuracy
PRE11 = T, PRE10 = T, PRE9 = T, PRE8 = T, PRE7 = T, PRE6 = PRZ2, COPD = Yes, PRE25 = T, DGN = DGN5, PRE17 = T, PRE14 = OC14, PRE30 = T; r = 25% → Risk1Yr = T | 0.42 | 0.23
OTHERWISE → Risk1Yr = F | 0.58 | 0.91
Although our focus is on interpretable models and minimal sampling handling to fit the target scenario for domain users, we are aware that advanced sampling combined with non-interpretable methods such as support vector machines (SVM) can generate very promising prediction performance [10, 11, 33], with evaluation results also reported on this dataset [20]. We list the reference G-means and projected F1-scores together with pattern discovery's results generated at upsampling ratio 1.0, solely for the audience's information. Note that this is not a formal comparison, as the listed methods are not interpretable methods within our scope, and the referenced cross-validation experiment did not use the same folds or run numbers.
F1-scores and G-means of pattern discovery and the referenced non-interpretable methods

Methods | F1-score | G-mean
pattern discovery | 0.345^{a} | 0.633^{a}
RUSBoost (RUS) [10] | 0.302 | 0.588
SVM + SMOTE (SSVM) | 0.338 | 0.625
boosted SVM for imbalanced data (BSI) | 0.375 | 0.657
JRip + BSI | 0.362 | 0.648
UnderBagging (UB) | 0.354 | 0.651
^{a} Pattern discovery results generated at upsampling ratio 1.0
Further categorizing PRE4, PRE5, and age, we are able to obtain a pattern with F1-score 0.41 (precision 0.40, sensitivity 0.42; including AGE ≥ 80 and PRE5 ≤ 3.62), comparable to the 0.44 of the JRip + BSI rules. Compared to the 9 rules extracted from JRip + BSI, which are complex and less handy in practice (details in reference [20], also available in Additional file 1), our discovered pattern is concise and practically interpretable for domain users, demonstrating its value in the target scenario. The results show great potential when pattern discovery is fully utilized with domain knowledge.
Results on the cardiac death dataset
This subsection reports the same experiment results and comparisons on the cardiac death dataset.
Wilcoxon test (paired, greater-than) p-values between pattern discovery and the other methods on testing F1-scores and G-means of the cross-validations on the cardiac death dataset

Upsampling ratio | Logistic Regression | Naive Bayes | Decision Tree
F1-score
Original | 4.43 × 10^{−5} | 5.17 × 10^{−5} | 4.43 × 10^{−5}
0.2 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 6.02 × 10^{−5}
0.4 | 4.43 × 10^{−5} | 0.0012 | 9.45 × 10^{−5}
0.6 | 4.43 × 10^{−5} | 0.1161 | 4.43 × 10^{−5}
0.8 | 4.43 × 10^{−5} | 0.7492 | 4.43 × 10^{−5}
1.0 | 4.43 × 10^{−5} | 0.1953 | 4.43 × 10^{−5}
G-mean
Original | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.2 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.4 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.6 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
0.8 | 4.43 × 10^{−5} | 4.43 × 10^{−5} | 4.43 × 10^{−5}
1.0 | 4.43 × 10^{−5} | 1.27 × 10^{−4} | 4.43 × 10^{−5}
The experiment results on the cardiac death dataset demonstrate consistently robust prediction performance of pattern discovery, which reaches average testing F1-scores and G-means comparable to the best results achievable from various upsampling ratios on training data. Furthermore, pattern discovery offers good interpretability for domain users, fitting best our target scenario where initial insights are desired before potential formal follow-ups.
We illustrate the pattern discovered from the full data and discuss its details in Additional file 1. The interpretable pattern sheds light on predictive modeling of cardiac death before more data can be obtained, and can be used as a screening reference for more in-depth follow-up and cohort studies of further clinical and biological significance.
Discussion
In this work we have targeted a practical scenario in which domain users would like to perform firsthand prediction on clinical data repositories with existing practice data, without requiring sophisticated handling, so that they can plan more precisely before more involved efforts are spent. On the two retrospective datasets, pattern discovery has shown promising results with good interpretability and competitive prediction performance without sophisticated data handling.
Pattern discovery is novel in its intuitively interpretable model combined with an optimized matching threshold to accommodate noise. It is designed for the minority-class and noise challenges that association rule mining does not address. Unlike non-interpretable methods (such as SVM and ANN) or impractically complex models (such as random forest and random tree), pattern discovery offers domain interpretability. It also shows competitive performance compared with representative interpretable methods including naive Bayes, logistic regression, and decision tree. Without sophisticated processing or tweaking (such as boosting and sampling techniques), pattern discovery achieves predictive performance on imbalanced data comparable to the best achievable.
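To make the imbalance point concrete: with roughly 9% positives, a trivial majority-class predictor attains high accuracy yet a G-mean of zero, which is why G-mean (and F1) rather than accuracy drive the comparisons above. A minimal sketch with made-up confusion-matrix counts:

```python
import math

def gmean(tp, fn, tn, fp):
    # Geometric mean of sensitivity and specificity.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical dataset: 1000 patients, 91 deaths (9.1% positives).
# A classifier that always predicts "survive":
acc_trivial = 909 / 1000                        # 90.9% accuracy, looks good
g_trivial = gmean(tp=0, fn=91, tn=909, fp=0)    # 0.0: it learned nothing

# A classifier catching ~70% of deaths at ~80% specificity:
g_balanced = gmean(tp=64, fn=27, tn=727, fp=182)
```

The trivial classifier's accuracy of 0.909 conceals a G-mean of exactly zero, while the balanced classifier scores around 0.75 despite a lower raw accuracy; G-mean therefore rewards catching minority-class deaths, matching the optimization criterion described in the Methods.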
As a good starting point for domain users to gain insights from clinical data repositories with existing practice data, pattern discovery can be further enhanced, first into pattern visual analytics. With good interpretability, patterns can be visualized and updated by users interactively. Clinical users can conveniently incorporate their knowledge into discovered patterns and instantly check how the prediction performance is influenced. As a result, they are engaged in developing a detailed understanding of both the predictive pattern and the patient data, which can be utilized in follow-ups such as patient cohort design.
We are also aware of the limitations of this work, which suggest future improvements. We focused on comparisons among domain-interpretable methods and excluded methods that could provide stronger predictive performance by compromising interpretability. Our experiments were limited to the two retrospective datasets, and rotated training-testing splits were employed in cross-validation; a real clinical application with a held-out training-testing split would better evaluate the actual predictive performance. Furthermore, the search/optimization toward optimal patterns will become more critical, especially with extensions to more advanced pattern modeling, such as auto-categorization of numeric attributes and multi-value and multi-pattern support for better descriptive power.
Conclusions
Pattern discovery has been developed with good interpretability and a simple but effective algorithm. On the two retrospective datasets with high imbalance ratios and noise, where the other interpretable methods face difficulty without sophisticated technical data handling, pattern discovery has been demonstrated to be robust and valuable for minority target prediction. The prediction results and interpretable patterns can provide insights in an agile and inexpensive way for potential formal studies. We are looking into several directions to further enhance the value of pattern discovery.
Abbreviations
ANN: Artificial neural network
BSI: Boosted SVM for imbalanced data
CAD: Coronary artery disease
CDR: Clinical data repository
CRF: Case report form
CVIS: Cardiovascular Information System
EMR: Electronic Medical Record
G-mean: Geometric mean of sensitivity and specificity
GUI: Graphical user interface
HIS: Healthcare Information System
LIS: Laboratory Information System
PCI: Percutaneous coronary intervention
pre: Precision
RIS: Radiology Information System
RUS: RUSBoost
sen: Sensitivity
SSVM: SVM + SMOTE
SVM: Support vector machine
UB: UnderBagging
Declarations
Acknowledgements
We would like to thank Prof. Greg Gibson for providing detailed and updated information of the cardiac death dataset.
Funding
None.
Availability of data and materials
The thoracic dataset is available from the University of California Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data. The cardiac death dataset is available upon request to the authors of the original reference.
Authors’ contributions
TC developed the computational method, carried out the experiments and drafted the manuscript. YL, JJ, JZ participated in the design of the methodology and coordination of experiments. CC and YH helped to conceive of the study and draft the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not required, as the datasets were previously published retrospective datasets for which the corresponding approvals and consents were handled in the original studies.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 1. Taylor GS, Muhlestein JB, Wagner GS, Bair TL, Li P, Anderson JL. Implementation of a computerized cardiovascular information system in a private hospital setting. Am Heart J. 1998;136:792–803.
 2. Anderson HV, Shaw RE, Brindis RG, Hewitt K, Krone RJ, Block PC, McKay CR, Weintraub WS. A contemporary overview of percutaneous coronary interventions: The American College of Cardiology-National Cardiovascular Data Registry (ACC-NCDR). J Am Coll Cardiol. 2002;39:1096–103.
 3. Yoo I, Alafaireet P, Marinov M, Pena-Hernandez K, Gopidi R, Chang JF, Hua L. Data mining in healthcare and biomedicine: A survey of the literature. J Med Syst. 2012;36:2431–48.
 4. Rao SV, McCoy LA, Spertus JA, Krone RJ, Singh M, Fitzgerald S, Peterson ED. An updated bleeding model to predict the risk of post-procedure bleeding among patients undergoing percutaneous coronary intervention: A report using an expanded bleeding definition from the National Cardiovascular Data Registry CathPCI registry. JACC Cardiovasc Interv. 2013;6:897–904.
 5. Kim J, Ghasemzadeh N, Eapen DJ, Chung NC, Storey JD, Quyyumi AA, Gibson G. Gene expression profiles associated with acute myocardial infarction and risk of cardiovascular death. Genome Med. 2014;6:40.
 6. Wasfy JH, Singal G, O'Brien C, Blumenthal DM, Kennedy KF, Strom JB, Spertus JA, Mauri L, Normand SLT, Yeh RW. Enhancing the prediction of 30-day readmission after percutaneous coronary intervention using data extracted by querying of the electronic health record. Circ Cardiovasc Qual Outcomes. 2015;8:477–85.
 7. Ziȩba M, Tomczak JM. Boosted SVM with active learning strategy for imbalanced data. Soft Comput. 2015;19:3357–68.
 8. Tomczak JM, Ziȩba M. Probabilistic combination of classification rules and its application to medical diagnosis. Mach Learn. 2015;101:105–35.
 9. Oh S, Lee MS, Zhang BT. Ensemble learning with active example selection for imbalanced biomedical data classification. IEEE/ACM Trans Comput Biol Bioinform. 2011;8:316–25.
10. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Part A Syst Hum. 2010;40:185–97.
11. Tao D, Tang X, Li X, Wu X. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Trans Pattern Anal Mach Intell. 2006;28:1088–99.
12. Khalilia M, Chakraborty S, Popescu M. Predicting disease risks from highly imbalanced data using random forest. BMC Med Inform Decis Mak. 2011;11:51.
13. Huang Z, Chan TM, Dong W. MACE prediction of acute coronary syndrome via boosted resampling classification using electronic medical records. J Biomed Inform. 2017;66:161–70.
14. Werbos PJ. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis. Harvard University; 1975.
15. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
16. Gortmaker SL, Hosmer DW, Lemeshow S. Applied Logistic Regression. Contemp Sociol. 1994;23:159.
17. John GH, Langley P. Estimating continuous distributions in Bayesian classifiers. Proc Elev Conf Uncertain Artif Intell, Montreal, Quebec, Canada. 1995;1:338–45.
18. Quinlan JR. C4.5: Programs for Machine Learning. 1992.
19. Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6:37–66.
20. Ziȩba M, Tomczak JM, Lubicz M, Świątek J. Boosted SVM for extracting rules from imbalanced data in application to prediction of the postoperative life expectancy in the lung cancer patients. Appl Soft Comput. 2014;14:99–108.
21. Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. Proc VLDB. 1994:487–99.
22. Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, Makeev VJ, Mironov AA, Noble WS, Pavesi G, Pesole G, Regnier M, Simonis N, Sinha S, Thijs G, van Helden J, Vandenbogaert M, Weng Z, Workman C, Ye C, Zhu Z. Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol. 2005;23:137–44.
23. Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3:1157–82.
24. Liu B, Hsu W, Ma Y. Integrating classification and association rule mining. Knowl Discov Data Min. 1998:80–6.
25. Cohen WW. Fast effective rule induction. Proc Twelfth Int Conf Mach Learn. 1995;95:115–23.
26. Leung KS, Wong KC, Chan TM, Wong MH, Lee KH, Lau CK, Tsui SKW. Discovering protein-DNA binding sequence patterns using association rule mining. Nucleic Acids Res. 2010;38:6324–37.
27. Chan TM, Wong KC, Lee KH, Wong MH, Lau CK, Tsui SKW, Leung KS. Discovering approximate-associated sequence patterns for protein-DNA interactions. Bioinformatics. 2011;27:471–8.
28. Lawrence J. A guide to Chi-squared testing. J Stat Plan Inference. 1997;64:157–8.
29. Hripcsak G, Rothschild AS. Agreement, the F-measure, and reliability in information retrieval. J Am Med Inform Assoc. 2005;12:296–8.
30. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. Int Jt Conf Artif Intell. 1995;14:1137–43.
31. Woolson RF. Wilcoxon signed-rank test. Wiley Encycl Clin Trials. 2008:1–3.
32. Garner SR. WEKA: The Waikato Environment for Knowledge Analysis. Proc New Zeal Comput Sci. 1995:57–64.
33. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
34. Ling CX, Sheng VS. Cost-sensitive learning and the class imbalance problem. Encycl Mach Learn. 2008:231–5.