 Research article
 Open Access
 Open Peer Review
 Published:
Comparison of machine learning techniques to predict allcause mortality using fitness data: the Henry ford exercIse testing (FIT) project
BMC Medical Informatics and Decision Makingvolume 17, Article number: 174 (2017)
Abstract
Background
Prior studies have demonstrated that cardiorespiratory fitness (CRF) is a strong marker of cardiovascular health. Machine learning (ML) can enhance the prediction of outcomes through classification techniques that classify the data into predetermined categories. The aim of this study is to present an evaluation and comparison of how machine learning techniques can be applied on medical records of cardiorespiratory fitness and how the various techniques differ in terms of capabilities of predicting medical outcomes (e.g. mortality).
Methods
We use data of 34,212 patients free of known coronary artery disease or heart failure who underwent clinicianreferred exercise treadmill stress testing at Henry Ford Health Systems Between 1991 and 2009 and had a complete 10year followup. Seven machine learning classification techniques were evaluated: Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayesian Classifier (BC), Bayesian Network (BN), KNearest Neighbor (KNN) and Random Forest (RF). In order to handle the imbalanced dataset used, the Synthetic Minority OverSampling Technique (SMOTE) is used.
Results
Two set of experiments have been conducted with and without the SMOTE sampling technique. On average over different evaluation metrics, SVM Classifier has shown the lowest performance while other models like BN, BC and DT performed better. The RF classifier has shown the best performance (AUC = 0.97) among all models trained using the SMOTE sampling.
Conclusions
The results show that various ML techniques can significantly vary in terms of its performance for the different evaluation metrics. It is also not necessarily that the more complex the ML model, the more prediction accuracy can be achieved. The prediction performance of all models trained with SMOTE is much better than the performance of models trained without SMOTE. The study shows the potential of machine learning methods for predicting allcause mortality using cardiorespiratory fitness data.
Background
Using data to make decisions and predications is not new. However, the nature of data availability is changing and the changes bring with them complexity in managing the volumes and analysis of these data. The marriage between mathematics and computer science is driven by the unique computational challenges of building predictive models from large data sets and getting into untapped hidden knowledge. Machine learning (ML) [1, 2] is a modern data analysis technique with the unique ability to learn and improve its performance without being explicitly programmed and without human instruction. The main goal of supervised ML classification algorithms [3] is to explain the dependent variable in terms of the independent variables. The algorithms get adjusted based on the training sample and the error signal. In general, conventional statistical techniques commonly rely on the process of hypothesis testing. This process is very userdriven where user specifies variables, functional form and type of interaction. Therefore, user intervention may influence resulting models. With ML techniques, the primary hypothesis is that there is a pattern (rather than an association) in the set of predictor variables that will identify the outcome. ML algorithms automatically scan and analyze all predictor variables in a way that prevents overlooking potentially important predictor variables even if it was unexpected. Therefore, it has been acknowledged as a powerful tool which dramatically changes the mode and accessibility of science, research and practice in all domains [4]. Medicine and Healthcare are no different [5,6,7].
The Henry Ford exercIse Testing (FIT) Project [8] is a retrospective cohort that included 69,985 patients who had undergone exercise cardiopulmonary treadmill stress testing at Henry Ford Health System in Detroit, MI from January 1, 1991 May 28, 2009. Briefly, the study population was limited to patients over 18 years of age at the time of stress testing and excluded patients undergoing modified or non Bruce protocol [9] exercise stress tests. Information regarding a patient’s medical history, demographics, medications, cardiovascular disease risk factors were obtained at the time of testing by nurses and exercise physiologists, as well as searches through the electronic medical records. For the full details of The FIT Project, we refer to prior work by AlMallah et al. [8]. Several studies [10,11,12,13] have used conventional statistical techniques to predict various medical outcomes using the FIT project data. In general, ML is an exploratory process, where there is no onemodelfitsall solution. In particular, there is no model that is known to achieve the highest accuracy for all domains, problem types or datasets [14]. The best performing model varies from one problem to another based on the characteristics of the variables and observation. In this study, we evaluate and compare seven popular supervised ML algorithms in terms of its accuracy of prediction for mortality based on exercise capacity (e.g., fitness) data. In particular, we conducted experiments using the following ML techniques: Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayesian Classifier (BC), Bayesian Network (BN), KNearest Neighbor (KNN), and Random Forest (RF). We applied the 10fold crossvalidation evaluation method for all techniques where several evaluation metrics are compared and reported. The study shows the potential of machine learning methods for predicting allcause mortality using cardiorespiratory fitness data.
Methods
Cohort study
In this study, we have excluded from the original registry of the FIT project the patients with known coronary artery disease (n = 10,190) or heart failure (n = 1162) at the time of the exercise test or with less than 10year followup (n = 22; 890). Therefore, a total of 34,212 patients were included in this study. The baseline characteristics of the included cohort are shown in Table 1 and indicate a high prevalence of traditional risk factors for cardiovascular disease. After a followup duration of 10 years, a total of 3921 patients (11.5%) died as verified by the national social security death index. All included patients had a social security number and were accounted for. In this study, we have classified the patients into two categories: low risk of allcause mortality (ACM) and high risk of ACM. In particular, patients were considered to have high risk for ACM if the predicted event rate is more than or equal to 3%.
Data Preprocessing
Data preprocessing is a crucial step in ML. Data that have not preprocessed carefully may lead to misleading prediction results. In our study, we have conducted the following preprocessing steps.

Outliers: The dataset used has been preprocessed by removing outliers (values that deviate from the expected value for a specific attribute) using the statistical measure namely interquartile range (IQR) [15]. The authors in [1] compare different outlier detection methods on biomedical datasets. The results show that the IQR is the fastest method in detecting all outliers correctly. Since the dataset used in this study is nearly symmetric, its mean equals its median equals its midrange, then the IQR is a good choice for handling outliers. The IQR measure is used to preprocess and identify the outliers from the training dataset. The IQR finds the outliers from the dataset by identifying the data that is over ranging from the dataset. The IQR is evaluated as IQR = Q3Q1 where Q3 and Q1 are the upper and lower quartiles, respectively. The number of records that are identified as outliers and has been removed is 808 records.

Missing values: It has been noted that some attributes such as the Percentage of Achieved Heart Rate and Metabolic Equivalent (METS) have missing values. The missing data for such attributes has been handled by replacing the missing values by the attribute mean.
Feature selection
The FIT project dataset includes 49 demographic and clinical variables.^{Footnote 1} In general, it is a common case that a few or several of the variables used in ML predictive models are in fact not associated with the response. In practice, including such irrelevant variables leads to unnecessary complexity in the resulting model. Therefore, before developing our model, we utilized an automated Rbased popular feature selection algorithm, information gain [16], to choose the most effective attributes in classifying the training data. In particular, this algorithm assesses the weight of each variable by evaluating the entropy gain with respect to the outcome, and then ranks the variables according to their weights. Only attributes with information gain >0 were subsequently used in model building.
Sampling
One of the main issues we encountered with the dataset used in this study is that it is imbalanced. In particular, the dataset included 3946 records with class label Yes (high risk of allcause mortality) and 30,985 records with class label No (low risk of allcause mortality). In general, the predication accuracy is significantly affected with imbalanced data [17]. In practice, there are two ways to handle the imbalanced class problem. One way is assign distinct costs to examples in the training dataset [18]. The other way is to either oversampling the minority class or to undersampling the majority class [19,20,21,22]. In order to handle the imbalanced dataset used in this study, we use Synthetic Minority Oversampling (SMOTE) Technique [23]. It is an oversampling technique in which the minority class is oversampled by creating synthetic examples rather than by oversampling with replacement. SMOTE selects the minority class samples and creates “synthetic” samples along the same line segment joining some or all K nearest neighbors belonging to the minority class [24, 25]. In other words, the oversampling is done as follows:

1.
Take sample of the dataset and find its nearest neighbors

2.
To create a synthetic data point, take the vector between a data point P in the sample dataset and one of P’s knearest neighbors.

3.
Multiply this vector by a random number x which lies between 0 and 1.

4.
Add this to P to create the new synthetic data point.
The percentage of SMOTE instances created in our experiment is 300% (11,838 records from the minority class).
Machine learning classification techniques
In our experiments, we studied the following seven popular ML classification techniques: Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayesian Classifier (BC), Bayesian Network (BN), KNearest Neighbor (KNN) and Random Forest (RF). We explore the space of parameters and common variations for each machine learning algorithm as thoroughly as is computationally feasible.
Decision Tree (DT) [26] is a model that uses a treelike graph to predict the value of a target variable by learning simple decision rules inferred from the data features. We use J48 decision tree algorithm (Weka implementation of C4.5 [27]). We tested the J48 classifier with confidence factor of 0.1, 0.25, 0.5, 0.75 and 1. The confidence factor parameter tests the effectiveness of postpruning and lowering the confidence factor decreases the amount of postpruning.
Support Vector Machine (SVM) [28] represents the instances as a set of points of 2 types in N dimensional place and generates a (N  1) dimensional hyperplane to separate those points into 2 groups. SVM attempts to find a straight line that separates those points into 2 types and is situated as far as possible from all those points. Training the SVM is done using Sequential Minimal Optimization algorithm [2]. We used Weka implementation of SMO [29]. We tested SVM using polynomial, normalized polynomial, puk kernels and varied the complexity parameter {0.1, 10, and 30}. The value of the complexity parameter controls the tradeoff between fitting the training data and maximizing the separating margin.
Artificial Neural Network (ANN) [30] attempts to mimic the human brain in order to learn complex tasks. It is modeled as an interconnected group of nodes in a way that is similar to the vast network of neurons in the human brain. Each node of the network receives inputs from other nodes, combines them in some way, performs a generally nonlinear operation on the result and outputs the final result. We trained the Neural Networks with gradient descent backpropagation. We varied the number of hidden units {1, 2, 4, 8, 32} and the momentum {0,0.2,0.5,0.9}.
Naïve Bayesian Classifier [31] applies Bayes’ theorem [32] with the naive assumption of independence between every pair of features. We use Weka implementation of Multilayer Perceptron [33]. We try three different Weka options for handling continuous attributes: modeling them as a single normal, modeling them with kernel estimation, or discretizing them using supervised discretization. Bayesian Network [34] is designed for modeling under uncertainty where the nodes represent variables and arcs represent direct connections between them. BNs model allows probabilistic beliefs about the variables to be updated automatically as new information becomes available. We tried different search algorithms including K2 [33], Hill Climbing [35], Repeated Hill Climber, LAGD Hill Climbing, TAN [36], Tabu search [53] and Simulated annealing [37].
KNearest Neighbors (KNN) [38] identifies from the neighbors, K similar points in the training data that are closest to the test observation and classifies it by estimating the conditional probability of belonging to each class and choosing the class with the largest probability. We varied the number of k {1, 3, 5, 10} neighbors. We considered three distance functions: Euclidean distance, Manhattan distance and Minkowski distance.
Random Forest (RF) [39, 40] is a classification algorithm that works by forming multiple decision trees at training and at testing it outputs the class that is the mode of the classes (classification). Decision tree works by learning simple decision rules extracted from the data features. The deeper the tree, the more complex the decision rules and the fitter the model. Random decision forests overcome the problem of over fitting of the decision trees. We use Random Forest Weka implementation. We varied the forests to have 10, 50, and 100 trees. The size of the feature set considered at each split is 1, 2, 4, 8, and 12.
Model evaluation and validation
In order to evaluate our models, we used the 10fold crossvalidation [39] evaluation method where the data are randomly partitioned into 10 mutually exclusive subsets {D_{1}, D_{2}, …, D_{K}} with approximately equal size. The testing operation is then repeated 10 times where at the i^{th} evaluation iteration, the D_{i} subset is used as the test set and the others as the training set. In general, a main advantage of the 10fold crossvalidation evaluation method is that it has a lower variance than a single holdout set evaluator. In particular, it reduces this variance by averaging over 10 different partitions, therefore, it is less sensitive to any partitioning bias on the training or testing data. For each iteration of the evaluation process, the following metrics are calculated:

Sensitivity: True Positive recognition rate
Sensitivity = TP/TP + FN

Specificity: True Negative recognition rate
Specificity = TN/TN + FP

Precision: It represents the percentage of tuples that the classifier has labeled as positive are actually positive
Precision = TP/TP + FP

Fscore: It represents the harmonic mean of precision and sensitivity
Fscore = 2 * TP / 2* TP + FP + FN

Root Mean Squared Error (RMSE): It is defined as the square root of the mean square error that measures the difference between values predicted by the model and the actual values observed, where y ^{′} is a vector of n predictions and y is the vector of n observed (actual) values

ROC: Receiver Operating Characteristic (ROC) Curve [40] is a way to quantify the diagnostic value of a test over its whole range of possible cutoffs for classifying patients as positive vs. negative. In each possible cutoff, the true positive rate and false positive rate is calculated as the X and Y coordinates in the ROC Curve.
True Positive (TP) refers to the number of high risk patients who are classified as high risk, whereas False Negative (FN) refers to the number of high risk patients who are classified as low risk patients. On the other hand, False Positive (FP) refers to the number of low risk patients who are classified as highrisk patients and False Negative (FN) refers to the number of low risk patients who are classified as low risk patients. All results of the different metrics are then averaged to return the final result.
Results
As an outcome of the feature selection process, the ML models have been developed using only 15 variables where Age, METS, Percentage HR achieved, HX Hypertension, Reason for test are ranked as the top significant five variables. The full list of the outcome variables is presented in Fig. 1.
Tables 2 and 3 show the performance of the DT classifier, with confidence parameter (Conf) equals 0.1, 0.25, 0.5, 0.75 and 1, using sampling and without using sampling, respectively. The results show that the AUC increased as the confidence factor increased up to about 0.75 at a peak of 0.88 AUC using sampling and up to about 0.25 at a peak of 0.73 AUC without using sampling, after which the classifier exhibited effects of overtraining. These effects are seen by a decrease in the AUC value with a confidence factor above 0.75 using sampling and above 0.25 without using sampling.
The results of the SVM classifier using sampling and without using sampling are reported in Tables 4 and 5, respectively. Different kernels (polynomial kernel, normalized polynomial kernel and puk kernel) and complexity parameters (C) (0.1, 10 and 30) are tested. The results show that the AUC increased as the complexity parameter increased up to 30 using sampling. In addition, the SVM using puk kernel outperforms the SVM using other kernels achieving AUC of 0.80 using sampling and 0.59 without using sampling with complexity parameter C = 30.
Tables 6 and 7 show the performance of Neural Networks with gradient descent backpropagation using hidden units H = {1, 2, 4, 8, 32} and the momentum M = {0, 0.2, 0.5, 0.9} using sampling and without using sampling, respectively. The number of hidden units and momentum rate that gives better AUC value is considered here. For neural networks, the highest performance is achieved when H = 4 and M = 0.5 for the case of using sampling (AUC = 0.82) while when H = 8 and M = 0 for the case of not using sampling (AUC = 0.80).
The performance of the Naïve Bayesian Classifier using sampling and without using sampling is reported in Tables 8 and 9, respectively. Three different Weka options for handling continuous attributes are explored (single normal, kernel estimation and supervised discretization). Results show that BC using supervised discretization achieves the highest AUC value of 0.82 using sampling and without using sampling. The performance results of the Bayesian Network classifier with different search algorithms (K2, Hill Climbing, Repeated Hill Climber, LAGD Hill Climbing, TAN, Tabu and Simulated Annealing) using sampling and without using sampling are reported in Tables 10 and 11, respectively. Bayesian Network classifier using Tan search algorithm achieves the highest AUC value of 0.84 using Sampling and 0.83 without using sampling.
Tables 12 and 13 report the performance of the KNN classifier, with different values of k {1, 3, 5, 10} neighbors, and using sampling and without using sampling. In our experiments, we used different distance functions; Euclidean distance, Manhattan distance and Minkowski distance. The results show that the KNN classifier using sampling has its best performance (AUC = 0.88) with K value equals 1 using any of the three distance functions while the KNN classifier without using sampling has its best performance (AUC = 0.74) with K value equals 10 using any of the three distance functions.
Tables 14 and 15 report the performance of the Random Forest (RF) classifier using 10, 50 and 100 trees. The size of the feature set (F) considered at each split is 1, 2, 4, 8, and 12. The results show that the highest AUC (0.97) is achieved using a forest of 50 trees with a feature set of 1, 2, 4, 8 or 12 using sampling whereas the highest AUC (0.82) is achieved using a forest of 100 trees with a feature set of 4.
We compared the impact of using different percentage of synthetic examples of the class “yes” (patients who are considered to have high risk for ACM). Figure 2 shows the area under the curve of seven different machine learning models trained using Decision Tree (DT), Support Vector Machine (SVM), Artificial Neural Networks (ANN), Naïve Bayesian Classifier (BC), Bayesian Network (BN), KNearest Neighbor (KNN) and Radom Forest (RF). All the models have been evaluated using datasets with 100%, 200% and 300% of synthetic example created using the SMOTE sampling technique on the training dataset and evaluated using 10fold cross validation. The results show that increasing the percentage of synthetic examples improves the prediction accuracy for all models except for the BC. For example, the SVM model achieves AUC of 0.62 using the sampled dataset with 100% synthetic examples compared to 0.72 using the sampled dataset with 200% synthetic examples. Increasing the percentage of synthetic examples to 300% improves the AUC of the BN to achieve 0.8. The performance of KNN, DT and RTF models using SMOTE has shown great improvement. The RF has shown the best improvement using SMOTE achieving 0.83 using 100% synthetic examples compared to 0.95 and 0.97 using 200% and 300% synthetic examples respectively. In our experiments, further increasing the synthetic examples to 400% and 500% did not show any improvement in the performance of the prediction models.
In order to evaluate the impact of using the SMOTE sampling techniques in handling the problem of the imbalanced dataset, we build different prediction models with and without SMOTE. Tables 16 and 17 show the prediction performance of different prediction models using various evaluation metrics without and with the SMOTE sampling technique (300%), respectively. For each metric (row), we highlighted the highest value in bold font and underlined the lowest value. As shown in Tables 16 and 17, after applying the 10fold crossvalidation on the training dataset, the AUC and sensitivity for all models used SMOTE have been significantly improved over the training results without SMOTE except for the BC. In addition, the performance of each model can differ from one metric to another. In general, the Random Forest (RF) classifier using SMOTE sampling achieves the best performance improvement. In particular, it achieves the best performance in terms of Sensitivity (95.07%), RMSE (0.18), FScore (84.55%) and AUC (0.97). However, the same model without using SMOTE achieves Sensitivity of (59.09%), RMSE of (0.29), FScore (29.35%) and AUC of (0.82). The KNN models using SMOTE achieves the best performance in terms of Specificity (96.98%) and Precision (77.18%). The KNN model without SMOTE achieves Specificity of 89.31%% and Precision of 11.12%. This improved performance of the prediction models is due to the imbalanced data size. It is noted that all the models with SMOTE achieve a more balanced sensitivity. Figure 3(a) and (b) illustrates the ROC curves for the different ML models with and without using SMOTE, respectively.
Discussion
Using machine learning methods to predict different medical outcomes (e.g., diabetics, hypertension and death) from medical datasets is gaining an increasing attention in the medical domain. This study is designed to take advantage of the unique opportunity provided by our access to a large and rich clinical research dataset, a total of 34,212 patients, collected by the FIT project to investigate the relative performance of various machine learning classification methods for predicting allcause mortality (ACM) using medical records of cardiorespiratory fitness. The large number of attributes of the dataset, 49 attributes, is used to uncover new potential predictors of ACM. To the best of our knowledge, this is the first study that compares the performance of ML model for predicting ACM using cardiorespiratory fitness data. We have evaluated seven models trained with and without SMOTE using various evaluation metrics.
Knuiman et al. [41] presented an empirical comparison of four different techniques for estimating the risk of death using mortality followup data on 1701 men. The four techniques used are binary tree, logistic regression, survival tree and Cox regression. The Cox regression outperformed the other three techniques achieving area under the AUC of 0.78 followed by logistic regression (AUC = 0.72), survival tree (AUC = 0.71) and binary tree (AUC = 0.66), respectively. Vomlel et al. [42] presented a predictive model for mortality using five different machine learning techniques on a data of 603 patients from University Hospital in Olomouc. The machine learning techniques used are logistic regression, decision tree, Naive Bayes classifier, Artificial Neural Network and Bayesian Network classifier. Using 10 fold cross validation logistic regression achieves the highest area under curve of 0.82, whereas the decision tree has the lowest AUC value of 0.61. Allyn et al. [43] compared the performance of logistic regression model and different machine learning models to predict the mortality inhospital after elective cardiac surgery. The study includes database of 6520 patients from December 2005 to December 2012, from a cardiac surgical center at University Hospital. Five different machine learning models have been evaluated: logistic regression, gradient boosting machine, random forest, support vector machine and naive bayes. The area under the ROC curve for the machine learning model (AUC = 0.795) was significantly higher than the logistic regression model (AUC = 0.742). Taylor et al. [44] studied the prediction of mortality of 4676 patients with sepsis at the emergency department using logistic regression and machine learning model. The machine learning model (AUC 0.86) outperforms the logistic regression model (AUC 0.76). Sherri [45] studied the Physical Performance and AgeRelated Changes in Sonomans (SPPARCS) to predict death among 2066 residents of Sonoma, California over the period between 1993 and 1995. In this study, a super learner has been used for death prediction. A super learner is an ensembling machine learning approach that combines multiple machine learning algorithms into a single algorithm and returns a prediction function with the best crossvalidated mean squared error. The super learner outperforms all single algorithms in the collection of algorithms, although its performance was quite similar to that of some algorithms. Super learner outperformed the worst algorithm (neural networks) by 44% with respect to estimated crossvalidated mean squared error. In principle, the datasets of both studies (Knuiman et al. [41] and Allyn et al. [43]) are considered to be relatively small in comparison to the number of patients for our dataset. In general, in Machine Learning, the bigger the size of the dataset, the higher the accuracy and robustness of the developed prediction models. In these studies, the highest AUC achieved by the developed prediction models is 0.86. In our experiments, the Random Forest (RF) model using SMOTE sampling achieved AUC of 0.97which significantly outperform the models of both studies.
Sullivan et al. [46] investigated the literature related to the comparisons made between established risk prediction models for perioperative mortality used in the setting of cardiac surgery. Metaanalysis was conducted to calculate a summary estimate of the difference in AUCs between models. The comparisons include 22 studies. The authors noted that all he investigated studies relied on relatively small datasets. This highlights the strengths and uniqueness of our study which is relying on large datasets reflected on the number of patients and the number of variables.
In general, an important observation from the results of our experiments is that for all metrics, the results show that it is not necessarily that the complex ML models (e.g., Support Vector Machine (SVM), Artificial Neural Networks (ANN)) can always outperform simpler models (e.g., Decision Tree (DT) model [47]). In particular, the Decision Tree (DT) model has been outperforming the complex models in terms of all evaluation metrics. The RF and KNN classifiers are considered to be less complex than SVM and ANN. However, it achieved the best performance for all metrics for model trained using SMOTE. In general, KNN is a nonlinear classifier, therefore, it tends to perform very well with a lot of data points. It is also very sensitive to bad features (variables). Therefore, effective feature selection [27] is an important step before using the KNN classifier and tends to improve its results. The Decision Tree (DT) model benefits from the feature selection and removing colinear variables steps as well. In general, decision trees do not require any assumptions of linearity in the data and thus they work well for nonlinearly related variables.
On the other hand, the SVM model tends to perform well in highdimensioned classification problems that may have over hundreds of thousands of dimensions, which is not the case of this study. In addition, the SVM model does not tend to perform well if the classes of the problem are strongly overlapping. In general, parametric models (e.g., SVM, Bayesian Network) can suffer from remembering local groupings as by their nature they summarize information in some way. ANN can usually outperform other methods if the dataset is very large and if the structure of the data is complex (e.g., they have many layers). This is an advantage for the KNN classifier which makes the least number of assumptions regarding the input data.
The results also show that the performance of the KNN and ANN classifiers, similar to the other models, can be very sensitive for the values of its parameters and thus these parameters need to be carefully explored and tuned in order to reach an adequate configuration. For example, the results show that setting the K parameter to the value of 1 achieves the best performance for all the evaluation metrics. For example, for K = 1, the model achieves AUC of (0.94) while for K = 3, 5 and 10, the model achieves the accuracy of (0.93), (0.91) and (0.90), respectively. In general, increasing the value of the K parameter has a mostly negative impact on the performance of the classifier for all metrics. The risk of model overfilling by using a low K value has been overcome by using the 10fold crossvalidation evaluation method. However, clearly, the optimal value of the K parameter can significantly differ from one problem to another.
Conclusion
ML techniques have shown solid prediction capabilities in various application domains including medicine and healthcare. In this study, we presented an evaluation and comparison of seven popular ML techniques on predicating allcause mortality (ACM) using medical records of Cardiorespiratory Fitness for the Henry Ford Testing (FIT) Project. The results show that various ML techniques can significantly vary in terms of its performance for the different evaluation metrics. It is also not necessarily that the more complex the ML model, the more prediction accuracy can be achieved. Simpler models can perform better in some cases as well. Therefore, there is no onesizefitsall model that can be well performing for all domains or datasets. Each problem and dataset need to be carefully evaluated, modeled and studied in order to reach an effective predictive model design. The results have also shown that it is critical to carefully explore and evaluate the performance of the ML models using various tuned values for their parameters. These results confirm the explorative nature of the ML process that requires iterative and explorative experiments in order to discover the model design that can achieve the target accuracy.
Notes
 1.
The detailed descriptions of the variables of the dataset are available on the appendix of the article.
Abbreviations
 ACM:

Allcause mortality
 ANN:

Artificial Neural Networks
 BC:

Naïve Bayesian Classifier
 BN:

Bayesian Network
 CRF:

Cardiorespiratory Fitness.
 DT:

Decision Tree
 KNN:

KNearest Neighbor
 ML:

Machine Learning
 RF:

Random Forest.
 SMOTE:

Synthetic Minority OverSampling Technique (SMOTE)
 SVM:

Support Vector Machine
References
 1.
Alpaydin E. Introduction to machine learning. MIT press; 2014. https://mitpress.mit.edu/books/introductionmachinelearning0.
 2.
Marsland S. Machine learning: an algorithmic perspective. CRC press; 2015. https://www.crcpress.com/MachineLearningAnAlgorithmicPerspectiveSecondEdition/Marsland/p/book/9781466583283.
 3.
Aggarwal CC. Data classification: algorithms and applications. CRC Press; 2014. https://www.crcpress.com/DataClassificationAlgorithmsandApplications/Aggarwal/p/book/9781466586741.
 4.
MayerSchonberger V, Cukier K. Big data: a revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt; 2013. https://www.amazon.com/BigDataRevolutionTransformThink/dp/0544227751.
 5.
Waljee AK, Higgins PD. Machine learning in medicine: a primer for physicians. Am J Gastroenterol. 2010;105(6):1224.
 6.
Kayyali B, Knott D, Van Kuiken S. “The bigdata revolution in us health care: Accelerating value and innovation,” Mc Kinsey & Company; 2013. https://www.mckinsey.com/industries/healthcaresystemsandservices/ourinsights/thebigdatarevolutioninushealthcare.
 7.
Burke J. Health analytics: gaining the insights to transform health care, 1st ed. Wiley; 2013. http://eu.wiley.com/WileyCDA/WileyTitle/productCd1118383044.html.
 8.
AlMallah MH, Keteyian SJ, Brawner CA, Whelton S, Blaha MJ. Rationale and design of the henry ford exercise testing project (the fit project). Clin Cardiol. 2014;37(8):456–61.
 9.
Bruce R, Kusumi F, Hosmer D. Maximal oxygen intake and nomographic assessment of functional aerobic impairment in cardiovascular disease. Am Heart J. 1973;85(4):546–62.
 10.
Juraschek SP, Blaha MJ, Whelton SP, Blumenthal R, Jones SR, Keteyian SJ, Schairer J, Brawner CA, AlMallah MH. Physical fitness and hypertension in a population at risk for cardiovascular disease: the henry ford exercise testing (fit) project. J Am Heart Assoc. 2014;3(6):e001268.
 11.
Hung RK, AlMallah MH, McEvoy JW, Whelton SP, Blumenthal RS, Nasir K, Schairer JR, Brawner C, Alam M, Keteyian SJ, et al. Prognostic value of exercise capacity in patients with coronary artery disease: the fit (henry ford exercise testing) project. Mayo Clin Proc. 2014;89(12. Elsevier):1644–54.
 12.
Juraschek SP, Blaha MJ, Blumenthal RS, Brawner C, Qureshi W, Keteyian SJ, Schairer J, Ehrman JK, AlMallah MH. Cardiorespiratory fitness and incident diabetes: the fit (henry ford exercise testing) project. Diabetes Care. 2015;38(6):1075–81.
 13.
Qureshi WT, Alirhayim Z, Blaha MJ, Juraschek SP, Keteyian SJ, Brawner CA, AlMallah MH. “Cardiorespiratory fitness and risk of incident atrial fibrillation: results from the henry ford exercise tesing (fit) project,”. Circulation. 2015:CIRCULATIONAHA–114. https://www.ncbi.nlm.nih.gov/pubmed/25904645.
 14.
Austin PC, Tu JV, Ho JE, Levy D, Lee DS. Using methods from the datamining and machinelearning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. J Clin Epidemiol. 2013;66(4):398–407.
 15.
Bu Y, Howe B, Balazinska M, Ernst MD. The HaLoop approach to largescale iterative data analysis. VLDB J. 2012;21(2):169.
 16.
Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd Explorations Newsletter. 2004;6(1):20–9.
 17.
M. J. Pazzani, C. J. Merz, P. M. Murphy, K. M. Ali, T. Hume, and C. Brunk, “Reducing misclassification costs,” in machine learning, proceedings of the eleventh international conference, Rutgers University, New Brunswick1994, 1994, pp. 217–225.
 18.
Kubat M, Matwin S. “Addressing the curse of imbalanced training sets: Onesided selection”. In: Proceedings of the Fourteenth International Conference on Machine Learning, vol 97. Nashville: ICML; 1997. pp. 179–86.
 19.
Japkowicz N. The class imbalance problem: significance and strategies. In: In proceedings of the 2000 international conference on artificial intelligence (ICAI); 2000. p. 111–7.
 20.
Lewis DD, Catlett J. “Heterogenous uncertainty sampling for supervised learning,” in machine learning, proceedings of the eleventh international conference, vol. 1994. New Brunswick: Rutgers University; 1994. p. 148–56.
 21.
Ling CX, Li C. “Data mining for direct marketing: Problems and solutions”. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD98). New York City; 1998. pp. 73–79. [Online]. Available: http://www.aaai.org/Library/KDD/1998/kdd98011.php. Acceesed 1 May 2017.
 22.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority oversampling technique. J Artif Intell Res (JAIR). 2002;16:321–57.
 23.
Li DC, Liu CW, Hu SC. A learning method for the class imbalance problem with medical data sets. Comput Biol Med. 2010;40(5):509–18.
 24.
Ramentol E, Caballero Y, Bello R, Herrera F. Smotersb*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced datasets using smote and rough sets theory. Knowl Inf Syst. 2012;33(2):245–65.
 25.
Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.
 26.
Hearst MA, Dumais ST, Osman E, Platt J, Scholkopf B. Support vector machines. IEEE Intelligent Systems and their Applications. 1998;13(4):18–28.
 27.
Platt J. “Fast Training of Support Vector Machines using Sequential Minimal Optimization.” In Advances in Kernel Methods  Support Vector Learning. MIT Press; 1998. https://dl.acm.org/citation.cfm?id=299094.299105.
 28.
Arbib MA. The handbook of brain theory and neural networks. MIT press; 2003. https://mitpress.mit.edu/books/handbookbraintheoryandneuralnetworks.
 29.
Cooper G, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992;9:309–47.
 30.
Murphy KP. “Naive bayes classifiers,” University of British Columbia; 2006. https://datajobsboard.com/wpcontent/uploads/2017/01/NaiveBayesKevinMurphy.pdf.
 31.
J. M. Bernardo and A. F. Smith, “Bayesian theory,” 2001.
 32.
Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Mach Learn. 1997;29(23):131–63.
 33.
Buntine WL. A guide to the literature on learning probabilistic networks from data. IEEE Trans Knowl Data Eng. 1996;8:195–210.
 34.
Cunningham P, Delany SJ. Knearest neighbour classifiers. Multiple Classifier Systems. 2007;34:1–17.
 35.
Cheng J, Greiner R. “Comparing bayesian network classifiers.” Proceedings UAI, 101–107; 1999. https://dl.acm.org/citation.cfm?id=2073808.
 36.
Bouckaert RR. “Bayesian Belief Networks: from Construction to Inference.” Ph.D. thesis. University of Utrecht; 1995. https://dspace.library.uu.nl/handle/1874/845.
 37.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
 38.
Ho TK. Random decision forests. In: Proceedings of the third international conference on document analysis and recognition (volume 1)  volume 1. ICDAR '95. Washington, DC: IEEE Computer Society; 1995. p. 278.
 39.
Prasad AM, Iverson LR, Liaw A. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems. 2006;9(2):181–99.
 40.
Refaeilzadeh P, Tang L, Liu H. “Crossvalidation,” in Encyclopedia of database systems. Spring. 2009:532–8.
 41.
Knuiman MW, Vu HT, Segal MR. An empirical comparison of multivariable methods for estimating risk of death from coronary heart disease. J Cardiovasc Risk. 1997;4(2):127–34.
 42.
Vomlel J, Kruzık H, Tuma P, Precek J, Hutyra M. Machine learning methods for mortality prediction in patients with st elevation myocardial infarction. Proceedings of WUPES. 2012;2012:204–13.
 43.
Allyn J, Allou N, Augustin P, Philip I, Martinet O, Belghiti M, Provenchere S, Montravers P, Ferdynus C. A comparison of a machine learning model with euroscore ii in predicting mortality after elective cardiac surgery: a decision curve analysis. PLoS One. 2017;12(1):e0169772.
 44.
Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, Hall MK. Prediction of inhospital mortality in emergency department patients with sepsis: a local big datadriven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78.
 45.
Rose S. Mortality risk score prediction in an elderly population using machine learning. Am J Epidemiol. 2013;177(5):443.
 46.
Sullivan PG, Wallach JD, Ioannidis JP. Metaanalysis comparing established risk prediction models (euroscore ii, sts score, and acef score) for perioperative mortality during cardiac surgery. Am J Cardiol. 2016;118(10):1574–82.
 47.
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology. 1982;143(1):29–36.
Acknowledgments
The authors thank the staff and patients involved in the FIT project for their contributions.
Availability of data and material
The FIT project includes data from a single institution which was collected under IRB approval and did not utilize public funding or resources. Resources from Henry Ford Hospital were utilized in this project. The IRB approval clearly stated that the data will remain with the PI (Dr. Mouaz AlMallah  mouaz74@gmail.com) and the study investigators. We would like to note that there many ongoing analyses from the project. Data sharing will be only on a collaborative basis after the approval of the all the investigators who have invested time and effort on this project. This also has to be subject to IRB approval from Henry Ford Hospital and data sharing agreements.
Funding
Funding was provided by King Abdullah International Medical Research Center. Funding grant number SP16/100. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Affiliations
Contributions
SS: Data analysis and manuscript drafting. RE: Data analysis and manuscript drafting. AA: Data collection, critical review of manuscript. WQ: Data collection, critical review of manuscript. CB: Data collection, critical review of manuscript. SK: Data collection, critical review of manuscript. MB: Data analysis, critical review of manuscript. MA: Data collection, Data analysis and critical review of manuscript. All authors read and approved the final manuscript.
Corresponding author
Correspondence to Mouaz H. AlMallah.
Ethics declarations
Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any of the authors. The FIT project is approved by the IRB (ethics committee) of HFH hospital (IRB #: 5812). Informed consent was waived due to retrospective nature of the study. The consent to participate is not required.
Consent for publication
Not applicable. The manuscript doesn’t contain any individual identifying data.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 FIT (Henry ford ExercIse testing) project
 Allcause mortality
 Machine learning