Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods
BMC Medical Informatics and Decision Making volume 22, Article number: 36 (2022)
Abstract
Background
Early detection and prediction of type 2 diabetes mellitus incidence from baseline measurements could reduce associated complications in the future. The low incidence rate of diabetes in comparison with non-diabetes makes accurate prediction of the minority diabetes class more challenging.
Methods
The performance of a deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest (RF) is compared in predicting the minority diabetes class in Tehran Lipid and Glucose Study (TLGS) cohort data. The impact of threshold moving, cost-sensitive learning, and over- and undersampling strategies as solutions to class imbalance is compared in terms of improving the algorithms' performance.
Results
DNN, with the highest accuracy in predicting diabetes (54.8%), outperformed XGBoost and RF in terms of AUROC, G-mean, and F1-measure on the original imbalanced data. Changing the threshold based on the maximum of the F1-measure improved G-mean and F1-measure in all three algorithms. Repeated edited nearest neighbors (RENN) undersampling in DNN and cost-sensitive learning in the tree-based algorithms were the best solutions to the imbalance issue. In DNN, RENN increased ROC AUC, Precision-Recall AUC, G-mean, and F1-measure from 0.857, 0.603, 0.713, and 0.575 to 0.862, 0.608, 0.773, and 0.583, respectively. Weighting improved G-mean and F1-measure from 0.667 and 0.554 to 0.776 and 0.588 in XGBoost, and from 0.659 and 0.543 to 0.775 and 0.566 in RF, respectively. In addition, ROC and Precision-Recall AUCs in RF increased from 0.840 and 0.578 to 0.846 and 0.591, respectively.
Conclusion
G-mean showed the largest increase under all imbalance solutions. Weighting and threshold moving are efficient strategies and are faster than resampling methods for handling class imbalance. Among sampling strategies, undersampling methods performed better than the others.
Introduction
Diabetes mellitus (DM) is a chronic disease and, according to the International Diabetes Federation (IDF), one of the fastest growing global health emergencies of this century. About 463 million people worldwide were living with diabetes in 2019, of whom 352 million were of working age (between 20 and 64 years old). The number of working-age adults with diabetes is projected to reach 417 million by 2030. In 2019, the proportion of undiagnosed diabetes was estimated at 50.1% worldwide. Untreated diabetes can damage the heart, kidneys, and nerves and can cause eye problems such as diabetic retinopathy [1]. According to the IDF, total health expenditure on diabetes in 2019 was 760.3 billion dollars, and it is expected to increase to 824.7 billion dollars by 2030 [2]. Identifying at-risk people, in addition to preventing health problems and promoting quality of life, can save billions of dollars.
In recent years, machine learning methods, specifically deep neural networks, have found considerable application in the health system [3,4,5,6,7]. Machine learning algorithms can model complicated and nonlinear patterns to identify at-risk people. In addition, some algorithms can extract and determine feature importance [8, 9]. Healthcare researchers are often interested in predicting disease cases that are rare in comparison with the normal population. As a result, class imbalance is a common issue in most medical datasets. In the presence of class imbalance, the minority class has significantly fewer instances than the other class. Most classifiers aim to achieve optimal performance over all classes, and it has been shown that algorithms tend not to perform well on the minority class [10, 11]. There are several reasons for the poor results of learning algorithms in classifying the minority class: rare samples may be treated as noise, a small sample size can make it difficult for models to detect rare patterns, and evaluation metrics are biased towards the majority class [12, 13]. In healthcare applications, misclassifying a patient from the minority class imposes higher costs than an error in classifying a healthy person. However, standard learning algorithms mostly assume equal misclassification costs and a balanced class distribution [14]. Analyzing diabetes data to predict the occurrence of diabetes has therefore mostly been challenging. Complex and nonlinear patterns of risk factors, in addition to the imbalanced distribution of diabetes, are major issues for prediction models.
To cope with the class imbalance problem, two main approaches have been established in the literature [15]. At the data level, the class distribution is made fairly balanced with sampling techniques [16, 17]. At the algorithm level, the distribution of the data remains unchanged, but the cost of misclassifying the minority class is modified so that the model focuses more on learning the rare class [18]. In threshold moving, which is categorized under the algorithm-level approach, class labels are predicted using an optimized threshold instead of the routinely used default threshold (0.5) [15].
In this study, we evaluate three state-of-the-art machine learning algorithms, deep neural network (DNN), extreme gradient boosting (XGBoost), and random forest, with various imbalance-handling strategies, including sampling methods, cost-sensitive learning, and threshold moving, to improve prediction accuracy for the risk of diabetes. We compare the effect of each strategy on the algorithms' performance based on various metrics and determine the best solution.
Methods and materials
Data description
We used data from the Tehran Lipid and Glucose Study (TLGS), whose details have been published previously [19,20,21]. Briefly, this study aims to determine atherosclerosis risk factors in a representative sample of residents of District 13 of Tehran (n = 15,005, age \(\ge\) 3). It started in 1999–2001 as a cross-sectional prevalence study (phase 1). To determine the efficacy of population-based measures in preventing the incidence of diabetes mellitus and dyslipidemia, a lifestyle intervention was implemented in selected people, starting in 2002–2005 as a prospective follow-up study (phase 2). Data from all participants were collected repeatedly every three years. The TLGS study was approved by the ethics committee at the Research Institute for Endocrine Sciences at Shahid Beheshti University of Medical Sciences. The study procedure and its aims were explained to all participants prior to data collection, and all participants in the study provided informed consent. All methods were carried out in accordance with relevant guidelines and regulations. Approval for undertaking the current project was also obtained from the Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences. Non-diabetic people aged > 20 years were selected from phase 3 (2005–2008; i.e., the second re-examination) of this ongoing population-based study. These subjects were followed for the next three phases (phase 4, 2008–2011; phase 5, 2011–2014; phase 6, 2014–2017). During phases 4 to 6, 315, 326, and 326 new cases of type 2 DM were identified, respectively. Type 2 diabetes was defined as fasting plasma glucose (FPG) \(\ge\) 126 mg/dL, 2-h post-challenge plasma glucose (2-hPG) \(\ge\) 200 mg/dL, or taking anti-diabetic drugs. We considered people diabetic if they met one of these conditions in any of the three consecutive follow-up phases. Non-diabetic subjects who were lost to follow-up in the last phase were discarded because we could not be sure that they remained non-diabetic until the end of follow-up. Hence, 1930 of 7600 individuals were excluded. In the end, 967 of the remaining 5670 subjects were diagnosed with type 2 diabetes, while 4703 were non-diabetic.
The variables selected for the study included demographic and anthropometric measures, physical activity, family history of CVD and diabetes, biochemical blood parameters, systolic and diastolic blood pressure, smoking status, and medication for hypertension and hyperlipidemia. The dependent variable was the incidence of diabetes during the 9-year follow-up period.
Data preprocessing
To detect outliers in this high-dimensional data, we used the Isolation Forest method, an efficient way of identifying outliers with an ensemble of randomly built trees. When a sample is an outlier, recursive splitting in the fitted trees isolates it along a shorter path than a normal sample [22].
To impute missing values, we implemented a multivariate iterative method based on extremely randomized trees (extra trees). In this method, each feature with missing values is treated as the dependent variable and the other features as predictors in a model. This is repeated for each feature in turn and for a fixed number of rounds. We used an extra tree classifier for categorical features and extra tree regressors for continuous variables. The extra tree algorithm is an ensemble of randomized decision trees fitted on various subsamples of the dataset [23, 24].
Since the class distribution is imbalanced, a stratified split with 70% for training and 30% for testing was used. With a stratified split, the ratio of diabetic to non-diabetic individuals remains the same in the training and testing data. The parameters of all preprocessing steps, including outlier detection and imputation, were learned from the training data only and then applied to the testing data. This prevents information leakage from the testing data into the learning process, which could lead to an optimistic evaluation of model performance. In other words, the testing dataset made no contribution to the learning process and was used only to evaluate the final models' performance.
All programming was carried out in Python version 3.6 using Scikit-learn, Imbalanced-learn, Keras, and other related libraries.
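The stratified 70/30 split described above can be sketched in a few lines. This is a minimal standard-library illustration, not the study's code: the function name `stratified_split`, the toy labels, and the seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                      # shuffle within each class
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels: 17 positives out of 100 samples.
labels = [1] * 17 + [0] * 83
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Any preprocessing statistic (for example an imputation model) would then be fitted on the rows in `train_idx` only and merely applied to the rows in `test_idx`.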
Machine learning algorithms
To compare the performance of various algorithms in predicting patients with diabetes, we applied deep neural network, extreme gradient boosting, and random forest methods.
Algorithms
Deep neural network
In a neural network, the independent variables are fed to the first (input) layer, and all neurons in each layer are fully connected to the neurons in the subsequent layers, which are called hidden layers. The last layer outputs the prediction of the network. Deep neural networks have more than one hidden layer. Each input to a neuron is weighted, and a bias value is added to the weighted sum of the inputs. Weights control the contribution of each neuron to the learning of the network. In a neural network architecture, random initial weights are first assigned to the input neurons, and then an activation function is used to calculate the output of each neuron in the hidden layer:
\[a = f\left(\sum_{i} w_i x_i + b\right)\]
In this formula, \(x_i\) represents the value of a neuron in the layer and \(w_i\) the corresponding weight; the products are summed and added to a bias term \(b\). The activation function \(f\) is then applied to this value. The most popular activation function in hidden layers is the Rectified Linear Unit (ReLU), which is calculated as follows:
\[\mathrm{ReLU}(z) = \max(0, z)\]
ReLU mitigates the vanishing gradient problem that traditionally arises in training deep neural networks [25]. For non-negative values it simply returns the input value, and for negative values it returns zero. Because the output of each layer is the input of the next layer, chaining multiple activation functions allows the network to represent nonlinear and complex relations between variables. In the last layer, a sigmoid activation function is applied to project values onto the range from 0 to 1 (Fig. 1). This estimated value represents the probability of being diabetic given the input variables.
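The forward pass described above (a weighted sum plus bias per neuron, ReLU in the hidden layers, and a sigmoid at the output) can be illustrated with a small sketch. The layer sizes and all weight values below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the architecture reported in Table 2.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(sum_i w_i * x_i + b) for each neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hypothetical 3-input network with two hidden layers of two neurons each.
x = [1.0, -2.0, 0.5]
h1 = dense(x, [[0.5, 0.1, -0.3], [-0.2, 0.4, 0.8]], [0.0, 0.1], relu)
h2 = dense(h1, [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.2], relu)
p = dense(h2, [[0.7, -0.4]], [0.1], sigmoid)[0]   # predicted probability of the positive class
```

The second neuron of the first hidden layer receives a negative pre-activation, so ReLU sets its output to zero, while the sigmoid keeps the final value strictly between 0 and 1.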
Based on the error of the network in predicting diabetes for each individual, the initial weights are updated until a stopping criterion is reached. To prevent overfitting and to improve the generalization of the trained model to unseen data, we used early stopping and dropout when training the model. As the number of neurons in the hidden layers increases, training error decreases, but beyond some point testing error increases. Dropout is a type of regularization that randomly deactivates a fraction of the neurons in a hidden layer, together with all their connections, during the learning process. With this approach, some neurons are omitted in each training iteration, so different neurons contribute to training the model, which leads to an ensemble of subnetworks. Each subnetwork can learn a different aspect of the data. Early stopping is another kind of regularization; it stops the learning process when the performance of the model on hold-out validation data starts to decrease [26].
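The early-stopping rule can be made concrete with a short sketch. It assumes a patience-based criterion (stop after a fixed number of epochs without improvement in validation loss), which is a common convention; the source does not state which criterion or patience value was used, and the validation losses below are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch seen before training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break                      # stop: no improvement for `patience` epochs
    return best_epoch

# Hypothetical validation losses per epoch.
val_losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.44]
best = early_stopping_epoch(val_losses)
```

With a patience of 3, training halts before the late improvement in the last epoch is ever seen, which shows the trade-off involved in choosing the patience value.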
Extreme gradient boosting
XGBoost is an efficient implementation of the gradient boosting algorithm. Classification and regression trees (CART), which assign a prediction score to each leaf, are the base learners. First, an initial value is taken as the prediction. This prediction is then improved by adding a new tree fitted to the residuals of its predecessor. This structure is shown in Fig. 2. After each tree is learned, its contribution to the final model is scaled by the learning rate, which is commonly between 0.1 and 0.3. In addition to using a regularization term and a shrinking learning rate to reduce overfitting, XGBoost can subsample columns and rows before building each tree [27].
Random forest
Random forest is an ensemble of decision trees constructed from bootstrap samples. Each tree is learned from a random sample drawn with replacement from the training data. If every predictor were available at every split, then in the presence of a strong predictor most of the constructed trees would use it in the top split, and the trees would be highly correlated [28]. In the random forest algorithm, each split is therefore built from a random subsample of predictors. With this approach, all predictors get a chance to contribute to learning, and the generalization of the model to unseen data is improved. In classification, the class predicted by most trees is the final prediction of the ensemble.
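The idea of improving a prediction by repeatedly fitting a tree to the residuals of the current model can be sketched as follows. This is a simplified illustration under stated assumptions: it uses squared-error residuals, one-split trees (stumps) on a single feature, and no regularization, whereas XGBoost itself uses gradient and second-order information of the chosen loss together with regularized trees. The data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (a single split) on one feature."""
    best = None
    for t in sorted(set(xs))[:-1]:            # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=50, learning_rate=0.3):
    init = sum(ys) / len(ys)                  # initial constant prediction
    trees, preds = [], [init] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)       # new tree is fitted to the residuals
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: init + learning_rate * sum(t(x) for t in trees)

# Hypothetical one-feature data with a noisy step.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline = sum((sum(ys) / len(ys) - y) ** 2 for y in ys) / len(ys)
```

Each added stump moves the predictions a fraction (`learning_rate`) of the way towards the residual means, so the training error falls below that of the constant initial prediction.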
Evaluation metrics
Accuracy measures the overall performance of an algorithm, but on imbalanced data this metric can be misleading. If an algorithm always classifies all samples as the majority class, accuracy will be as high as the proportion of the majority class, yet the algorithm is useless. G-mean is the geometric mean of sensitivity and specificity. Poor performance in the diabetic class leads to a low G-mean even if all non-diabetic persons are predicted correctly. F1-measure is the harmonic mean of recall (sensitivity) and precision, weighting precision and recall equally. The Matthews Correlation Coefficient (MCC) is robust to data imbalance. It is a discretization of the Pearson correlation between the observed and predicted classes [29]. The Receiver Operating Characteristic (ROC) curve plots sensitivity (recall) against 1 − specificity for all possible thresholds. The area under it (AUC) summarizes this curve [30]. The Precision-Recall (PR) curve plots precision against recall for all possible thresholds. On imbalanced datasets, the PR curve is more informative than the ROC curve [31]. When the focus is on classification successes, G-mean is not biased towards the majority class. If classification errors are also to be considered, MCC is preferred [29]. Selecting a suitable metric to determine the best algorithm has always been challenging [32].
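The threshold-dependent metrics above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the helper names and the example counts are hypothetical and are not taken from the study's tables. Undefined ratios (zero denominators) are reported as 0.0, which is a convention assumed here.

```python
import math

def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

def metrics(tp, tn, fp, fn):
    sensitivity = safe_div(tp, tp + fn)       # recall
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }

# A classifier that labels every sample as the majority class (170 positives, 830 negatives).
always_majority = metrics(tp=0, tn=830, fp=0, fn=170)
# A hypothetical classifier that detects about half of the positives.
some_model = metrics(tp=90, tn=800, fp=30, fn=80)
```

The always-majority classifier reaches an accuracy equal to the majority share while its G-mean, F1-measure, and MCC are all zero, which is the failure mode the paragraph warns about.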
Parameter and feature selection
To determine the hyperparameters of the classifiers (parameters that are specified by the analyst to optimize the performance of the model and cannot be estimated from the data), we used a grid search with fivefold stratified cross-validation. In this method, a set of candidate values is considered for each hyperparameter. Then, for each combination of these values, the model is fitted to four training folds and evaluated on the remaining fold. Finally, the results are averaged over the folds. The combination that leads to the highest G-mean is chosen as the best set of hyperparameters.
After selecting the optimal hyperparameter values, we used SHAP (SHapley Additive exPlanation) [33] values, which can explain black-box machine learning algorithms, to determine the most important features. The Shapley value, a concept from game theory, quantifies each player's contribution to the final result of the team. It is the average marginal contribution of a player over all possible combinations of players. For machine learning algorithms, SHAP estimates Shapley values to determine each feature's contribution to the output of the model.
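The Shapley value described above can be computed exactly for a small game by averaging each player's marginal contribution over all orderings. The sketch below does this by brute force; the two "features" and their coalition payoffs are hypothetical. SHAP itself relies on much more efficient estimators, so this is only an illustration of the underlying definition.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)   # marginal contribution of p
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical payoff of each feature coalition (for example, a model score).
payoffs = {
    frozenset(): 0.0,
    frozenset({"FPG"}): 0.6,
    frozenset({"BMI"}): 0.2,
    frozenset({"FPG", "BMI"}): 1.0,
}
phi = shapley_values(["FPG", "BMI"], payoffs.__getitem__)
```

The two values sum to the payoff of the full coalition minus that of the empty one, which is the efficiency property that makes Shapley values additive explanations.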
Tackling class imbalance
Threshold moving
This is the simplest way of handling class imbalance [34]. With an unequal class distribution and unequal misclassification costs, the default threshold (0.5) is not appropriate for predicting class labels. The optimized threshold can be selected as the one that maximizes G-mean on the ROC curve or F1-measure on the PR curve. These two approaches yield different thresholds.
Cost-sensitive learning
Algorithms learn by minimizing a loss function. By default, each instance of the training dataset has equal weight in updating the unknown parameter values during the iterative learning process. By assigning higher weights to the minority class and minimizing a weighted loss function, instances from this class play a greater role in the learning process [35].
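Threshold moving amounts to scanning candidate thresholds over the predicted probabilities and keeping the one that maximizes the chosen criterion. The following sketch assumes the candidates are the distinct predicted probabilities; the labels and probabilities are invented toy values, and the helper names are hypothetical.

```python
import math

def counts(y_true, probs, threshold):
    """Confusion-matrix counts when predicting 1 for probs >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 0 and pred == 1:
            fp += 1
        elif y == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def g_mean(tp, fp, fn, tn):
    return math.sqrt(tp / (tp + fn) * tn / (tn + fp))

def f1(tp, fp, fn, tn):
    return 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true, probs, score):
    """Candidate threshold (taken from the observed probabilities) with the highest score."""
    return max(sorted(set(probs)), key=lambda t: score(*counts(y_true, probs, t)))

# Hypothetical toy data: 8 negatives followed by 4 positives.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.22, 0.40, 0.60, 0.80]
t_g = best_threshold(y, probs, g_mean)
t_f = best_threshold(y, probs, f1)
```

On this toy data both criteria happen to select the same threshold below 0.5; on real data they need not agree. Because the fitted model is untouched, only the label assignment changes, not the ROC or PR curves.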
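A weighted loss can be illustrated with binary cross-entropy in which minority (positive) instances receive a larger weight. The sketch below is an assumption-laden example: the parameter name `pos_weight`, the toy data, and the choice of setting the weight to the negative-to-positive ratio are illustrative conventions, not the weights tuned in the study.

```python
import math

def weighted_log_loss(y_true, probs, pos_weight=1.0):
    """Binary cross-entropy in which positive instances are weighted by pos_weight."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, probs):
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum

# Hypothetical toy data: one poorly predicted positive among four well-predicted negatives.
y = [1, 0, 0, 0, 0]
probs = [0.3, 0.1, 0.1, 0.1, 0.1]
unweighted = weighted_log_loss(y, probs)
weighted = weighted_log_loss(y, probs, pos_weight=4.0)   # ratio of negatives to positives
```

With the larger weight, the single misjudged positive dominates the loss, so an optimizer minimizing it is pushed harder to correct errors on the minority class.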
Sampling
Repeated edited nearest neighbors (RENN) is an undersampling method that removes noisy, redundant, and borderline samples. Each instance in the majority class is classified by its k nearest neighbors. If the sample is misclassified by its neighbors, it is removed; otherwise it is retained. In repeated edited nearest neighbors, this editing is repeated several times [36].
One-sided selection (OSS) is an undersampling method that starts by combining all minority class samples with one randomly chosen sample from the majority class to construct a new, smaller training set (C). Then, all original training samples are classified by a 1-nearest-neighbor classifier built on C. Each sample from the majority class that is misclassified by its nearest neighbor is added to C. In the next step, all majority class samples that take part in Tomek links, that is, pairs of mutual nearest neighbors from different classes [34], are removed from C. As a result, the undersampled training set (C) contains all minority class samples in addition to a majority class set cleaned of redundant, noisy, and borderline samples [37].
Synthetic minority oversampling technique (SMOTE) generates synthetic examples by operating in feature space [16]. For oversampling, a minority instance is selected and its nearest minority neighbors are found. Then, depending on the desired amount of oversampling, some of these neighbors are chosen at random. After that, the difference between the selected sample and its neighbor in feature space is taken. This difference is multiplied by a random number from the interval (0, 1) and added to the selected instance. With this approach, synthetic samples are generated between two neighbors in the minority class.
SVM-SMOTE oversamples only those minority class instances that lie on the borderline. To identify borderline instances, the support vector machine (SVM) algorithm is applied. SVM finds the hyperplane that separates the samples of the two classes with maximum margin. This optimal hyperplane is determined by only a few samples, which are called support vectors. In SVM-SMOTE, minority class samples around these borderline support vectors are oversampled by interpolation or extrapolation, depending on the number of majority class samples among the nearest neighbors of each minority support vector. If most of the m nearest neighbors of the chosen minority support vector are from the majority class, new samples are generated by interpolation, as in SMOTE. If fewer than half of the m nearest neighbors are from the majority class, oversampling is applied by extrapolation (Fig. 3) [38].
In the hybrid approach, class balance is improved by combining under- and oversampling. ENN-SMOTE is a hybrid technique that undersamples the majority class with the edited nearest neighbor method and oversamples the minority class with SMOTE.
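The editing rule of RENN can be sketched with a brute-force nearest-neighbor search. This standalone example is not the Imbalanced-learn implementation; the value k = 3, the stopping rule (repeat until nothing is removed), and the toy points are assumptions for illustration.

```python
from collections import Counter

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def knn_label(i, X, y, k):
    """Majority label among the k nearest neighbours of sample i (excluding i itself)."""
    nearest = sorted((sq_dist(X[i], X[j]), j) for j in range(len(X)) if j != i)[:k]
    return Counter(y[j] for _, j in nearest).most_common(1)[0][0]

def enn(X, y, majority=0, k=3):
    """One editing pass: drop majority samples that their neighbours misclassify."""
    keep = [i for i in range(len(X))
            if y[i] != majority or knn_label(i, X, y, k) == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]

def renn(X, y, majority=0, k=3, max_iter=100):
    """Repeat the editing pass until no further sample is removed."""
    for _ in range(max_iter):
        X_new, y_new = enn(X, y, majority, k)
        if len(y_new) == len(y):
            break
        X, y = X_new, y_new
    return X, y

# Hypothetical toy data: one majority point sits inside the minority cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
X_res, y_res = renn(X, y)
```

Only the majority point surrounded by minority neighbors is removed; all minority samples and the well-separated majority samples are kept.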
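The Tomek-link cleaning step used in one-sided selection can be sketched as a search for mutual nearest neighbors with different labels. The example covers only this cleaning step, not the preceding 1-nearest-neighbor condensing step, and the toy points and helper names are hypothetical.

```python
def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(i, X):
    """Index of the nearest neighbour of sample i (excluding i itself)."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: sq_dist(X[i], X[j]))

def tomek_links(X, y):
    """Pairs (i, j), i < j, of mutual nearest neighbours that carry different labels."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if i < j and y[i] != y[j] and nearest(j, X) == i:
            links.append((i, j))
    return links

# Hypothetical toy data: majority class 0, minority class 1.
X = [(0, 0), (1, 0), (0, 1), (3, 3), (3.2, 3.1), (5, 5), (5, 6)]
y = [0, 0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)
drop = {i if y[i] == 0 else j for i, j in links}      # majority member of each link
X_clean = [x for idx, x in enumerate(X) if idx not in drop]
y_clean = [lab for idx, lab in enumerate(y) if idx not in drop]
```

Removing only the majority member of each link thins the class boundary without discarding any minority sample.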
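The interpolation step of SMOTE can be written down directly from the description above: pick a minority sample, pick one of its nearest minority neighbors, and move a random fraction of the way towards it. The sketch below is a simplified illustration with hypothetical names and toy data; it generates a fixed number of points rather than a fixed amount per sample.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()                    # uniform in [0, 1)
        # new point = x + gap * (neighbour - x), i.e. a point on the connecting segment
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# Hypothetical minority samples in a two-dimensional feature space.
minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0)]
new_points = smote(minority, n_new=6)
```

Every synthetic point lies on a segment between two existing minority samples, so it cannot fall outside the region spanned by the minority class.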
Results
Of the 5670 samples considered in this study, 57 were identified as outliers and discarded. Family history of cardiovascular disease (CVD), family history of diabetes, and exposure to secondhand smoke at home or work had 29, 24, and 18 percent missing values, respectively. Other variables had less than 3 percent missing values. The diabetic class made up 16.5% of the training data. The baseline (phase 3) characteristics of the individuals are summarized in Table 1.
The optimal hyperparameter values for each algorithm are reported in Table 2. Based on SHAP values, the most important variables in predicting diabetes for the XGBoost model were fasting plasma glucose, two-hour postprandial plasma glucose, BMI, waist-hip ratio, age, and family history of diabetes, with mean absolute values of 0.637, 0.586, 0.356, 0.214, 0.201, and 0.201, respectively. For random forest, the top variables were two-hour postprandial plasma glucose, fasting plasma glucose, BMI, age, and triglyceride, with mean absolute values of 0.0848, 0.0775, 0.0286, 0.0144, and 0.0125, respectively.
The results indicated that XGBoost and DNN (except for accuracy) outperform random forest in terms of all metrics (Table 3). In comparison with XGBoost, DNN has higher values of F1-measure, G-mean, and AUROC. Based on MCC, these two algorithms perform approximately equally, but in terms of AUPRC, XGBoost performs better than DNN.
Figure 4 depicts the thresholds that maximize G-mean on the ROC curve and F1-measure on the PR curve for all algorithms. Based on the G-mean criterion, the thresholds are 0.266, 0.12, and 0.168 for DNN, XGBoost, and random forest, respectively. For F1-measure, these thresholds are 0.427, 0.310, and 0.294, respectively.
Changing the threshold from the default value (0.5) to the one that maximizes G-mean led to a higher G-mean, but the other metrics dropped in all algorithms (Table 4). In contrast, changing the threshold based on the maximum of F1-measure yielded better F1-measure and G-mean in all algorithms, and better MCC in DNN and XGBoost. The improvements in F1-measure and G-mean were 1.6 and 4.4 percent in DNN, 3.2 and 7.1 percent in XGBoost, and 2.1 and 7.4 percent in random forest, respectively. MCC increased by 0.3 and 0.1 percent in DNN and XGBoost, respectively. Only for random forest was there a 0.7 percent decrease in MCC. Moving the threshold does not affect ROC and PR AUCs, because they are independent of the selected threshold. With both approaches, accuracy decreased in all algorithms.
Weighting the diabetic class increased F1-measure, G-mean, AUROC, and AUPRC relative to the original models. Only for weighted XGBoost was there a slight drop in the ROC and PR AUCs (0.1 and 0.2 percent, respectively). Among the improved metrics, G-mean showed the largest increase: 6.7 percent in DNN, 10.9 percent in XGBoost, and 11.6 percent in random forest. Compared with changing the threshold based on the maximum of G-mean, weighting boosted accuracy, F1-measure, and MCC in all algorithms, and G-mean in XGBoost and random forest. On the other hand, only G-mean improved in the weighted algorithms in comparison with changing the threshold based on the maximum of F1-measure.
As the last approach to enhancing accuracy in predicting diabetes, we used five sampling methods, whose effect on the class distribution is shown in Fig. 5.
The RENN undersampling method consistently increased F1-measure, G-mean, and AUROC in all algorithms (Table 5). RENN also improved AUPRC in the DNN algorithm. The other undersampling method, OSS, boosted only G-mean in all algorithms. In terms of F1-measure, G-mean, MCC, AUROC, and AUPRC, one of the undersampling methods outperforms the other sampling strategies in DNN and random forest. For the XGBoost algorithm, this superiority holds for the F1-measure and G-mean metrics.
In comparison with the original data, SMOTE increased G-mean in all algorithms and F1-measure in XGBoost. SVM-SMOTE improved both G-mean and F1-measure in all three algorithms, and AUROC in the tree-based algorithms. Lastly, the ENN-SMOTE sampler boosted AUROC in random forest and G-mean in all three classifiers.
Comparing the sampling methods, RENN is the best in all algorithms based on AUROC; based on AUPRC, SVM-SMOTE in XGBoost, OSS in random forest, and RENN in DNN perform best.
To summarize the results, for the DNN algorithm the best strategies for dealing with the imbalance issue among the three applied approaches are OSS in terms of accuracy, F1-measure, and MCC, and RENN undersampling in terms of ROC and PR AUCs. For the XGBoost algorithm, with approximately the same values of MCC, AUROC, and AUPRC, weighting yields an improvement of 3.4 and 10.9 percent in F1-measure and G-mean, respectively. Weighting is the best approach in terms of these two metrics, and RENN in terms of AUROC. For the random forest algorithm, as for XGBoost, weighting increased F1-measure and G-mean, by 2.3 and 11.6 percent, respectively. In addition, AUROC and AUPRC improved by 0.6 and 1.3 percent, respectively. Based on all these metrics, as well as MCC, weighting is the best solution to the imbalance issue for random forest.
Discussion
We studied three powerful machine learning algorithms for predicting future diabetes incidence based on demographic, biochemical, and anthropometric measures. To tackle the imbalance of the minority diabetes class, we used three strategies: threshold moving as a simple strategy, and cost-sensitive learning and sampling, which involve more searching to fit an optimal algorithm.
We evaluated the performance of the algorithms before and after applying a solution to the imbalance issue by examining various metrics. Each metric focuses on a particular aspect of performance. Except for ROC and PR AUCs, all metrics are constructed from the confusion matrix. Accuracy consistently decreased after applying the imbalance solutions, while G-mean, an unbiased metric on imbalanced data [29], rose substantially. The other metrics behaved variably.
Our results show that changing the threshold to the value that maximizes F1-measure improved F1-measure, G-mean, and MCC (except for random forest) in the three investigated algorithms. In the threshold-moving approach, the algorithm is not refitted. As a consequence, training time is reduced in comparison with other strategies, which introduce new hyperparameters. This effortless solution can give results comparable to those of other solutions [34], and our study also demonstrates its efficiency. Although ROC and PR AUCs remain constant, for a well-trained algorithm changing the threshold can be a first solution for enhancing overall performance and increasing prediction accuracy in the minority diabetes class.
For the tree-based algorithms, XGBoost and random forest, cost-sensitive learning was the best approach based on F1-measure and G-mean. It also gave good results in DNN. In comparison with sampling strategies, weighting has only one hyperparameter to tune. As a result, the complexity of the training procedure and the runtime are lower than for sampling methods. Increasing the weight of the minority diabetes class consistently increases sensitivity but, on the other hand, decreases specificity [39, 40].
Sampling strategies are usually applied to address the imbalance problem [41,42,43]. We studied five sampling methods. Among them, one of the undersampling methods outperformed the oversampling and hybrid procedures based on F1-measure and G-mean in all algorithms. Although sampling resulted in better performance than the original data, it was not the best solution to the imbalanced distribution between the diabetic and healthy classes. Only for DNN did a sampling method outperform the other approaches. Sampling strategies have multiple hyperparameters that should be tuned precisely.
Overall, on the original imbalanced data, DNN had the highest accuracy for the minority diabetes class and outperformed the other classifiers based on the mean of the metrics. After applying solutions to class imbalance, undersampled DNN and weighted XGBoost were the better performers in terms of AUROC and AUPRC, respectively, among the combinations of algorithms and imbalance-handling approaches. One practical advantage of XGBoost is its ability to model data with missing values, which are common in medical data [27]. In addition, it trains very fast, and as a powerful algorithm it has attracted attention for modeling challenging data [44, 45].
One limitation of our work is the small number of sampling methods investigated. SMOTE oversampling is frequently applied to handle class imbalance [12], but in our study it was not the best performer. A possible explanation is the high overlap between the two classes in our data. Applying SMOTE could result in a more ambiguous borderline between the diabetes and non-diabetes classes. To explore the efficiency of sampling strategies, we will study a larger number of methods with other datasets in the future.
Conclusion
To conclude, we studied three main approaches to addressing class imbalance in predicting diabetes risk. Our optimized algorithms led to a considerable rise in accurate prediction of the rare diabetes class, both before and after applying imbalance solutions, for TLGS data [43]. Weighting and threshold moving are faster solutions for handling class imbalance than resampling methods. The results of our study could help researchers choose the best way to deal with class imbalance in medical data.
Availability of data and materials
The datasets generated and/or analysed during the current study are not publicly available because these data are only available for approved proposals at the Research Institute for Endocrine Sciences (RIES) at Shahid Beheshti University of Medical Sciences, but they are available from Davood Khalili, head of the Department of Biostatistics and Epidemiology at RIES (email: dkhalili@endocrine.ac.ir), on reasonable request.
Abbreviations
AUC: Area under the curve
BMI: Body mass index
CART: Classification and regression trees
CVD: Cardiovascular disease
DM: Diabetes mellitus
DNN: Deep neural network
IDF: International Diabetes Federation
MCC: Matthews Correlation Coefficient
OSS: One-sided selection
PR: Precision-Recall
RENN: Repeated edited nearest neighbors
RF: Random forest
ROC: Receiver Operating Characteristic
SMOTE: Synthetic minority oversampling technique
SVM: Support vector machine
TLGS: Tehran Lipid and Glucose Study
XGBoost: Extreme gradient boosting
References
Qummar S, Khan FG, Shah S, Khan A, Shamshirband S, Rehman ZU, Khan IA, Jadoon W. A deep learning ensemble approach for diabetic retinopathy detection. IEEE Access. 2019;7:150530–9.
IDF DIABETES ATLAS. 9th ed. https://www.diabetesatlas.org/upload/resources/material/20200302_133351_IDFATLAS9efinalweb.pdf.
Shishvan OR, Zois DS, Soyata T. Machine intelligence in healthcare and medical cyber physical systems: a survey. IEEE Access. 2018;6:46419–94.
Jothi N, Husain WJ. Data mining in healthcare–a review. Procedia Comput Sci. 2015;72:306–13.
Xie Z, Nikolayeva O, Luo J, Li D. Building risk prediction models for type 2 diabetes using machine learning techniques. Prev Chronic Dis. 2019;16:E130.
Mezzatesta S, Torino C, Meo P, Fiumara G, Vilasi A. A machine learning-based approach for predicting the outbreak of cardiovascular diseases in patients on dialysis. Comput Methods Programs Biomed. 2019;177:9–15.
Shamshirband S, Fathi M, Dehzangi A, Chronopoulos AT, Alinejad-Rokny H. A review on deep learning approaches in healthcare systems: taxonomies, challenges, and open issues. J Biomed Inform. 2021;113:103627.
Joloudari JH, Hassannataj Joloudari E, Saadatfar H, Ghasemigol M, Razavi SM, Mosavi A, Nabipour N, Shamshirband S, Nadai L. Coronary artery disease diagnosis; ranking the significant features using a random trees model. Int J Environ Res Public Health. 2020;17(3):731.
Joloudari JH, Saadatfar H, Dehzangi A, Shamshirband S. Computeraided decisionmaking for predicting liver disease using PSObased optimized SVM with feature selection. Inform Med Unlocked. 2019;17:100255.
He H, Ma Y. Imbalanced learning: foundations, algorithms, and applications. Hoboken: WileyIEEE Press; 2013.
Chawla NV. Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L, editors. Data mining and knowledge discovery handbook. Berlin: Springer; 2005. p. 853–67.
Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from classimbalanced data: review of methods and applications. Expert Syst Appl. 2017;73:220–39.
Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221–32.
Sun Y, Wong AK, Kamel MS. Classification of imbalanced data: a review. Int J Pattern Recognit Artif Intell. 2009;23(04):687–719.
Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018;106:249–59.
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority oversampling technique. J Artif Intell Res. 2002;16:321–57.
He H, Bai Y, Garcia EA, Li S: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence): 2008. IEEE; 2008. pp. 1322–1328.
Fernández A, García S, Galar M, Prati RC, Krawczyk B, Herrera F. Learning from imbalanced data sets. Berlin: Springer; 2018.
Azizi F, Ghanbarian A, Momenan AA, Hadaegh F, Mirmiran P, Hedayati M, Mehrabi Y, ZahediAsl S. Prevention of noncommunicable disease in a population in nutrition transition: Tehran Lipid and Glucose Study phase II. Trials. 2009;10(1):5.
Azizi F, Rahmani M, Emami H, Mirmiran P, Hajipour R, Madjid M, Ghanbili J, Ghanbarian A, Mehrabi J, Saadat N. Cardiovascular risk factors in an Iranian urban population: Tehran lipid and glucose study (phase 1). Soc Prev Med. 2002;47(6):408–26.
Azizi F, Madjid M, Rahmani M, Emami H. MIRMIRAN P, Hadjipour R: Tehran Lipid and Glucose Study (TLGS): rationale and design. Iran J Endocrinol Metab. 2000;2(2):77–86.
Liu FT, Ting KM, Zhou ZH. Isolationbased anomaly detection. ACM Trans Knowl Discov Data: TKDD. 2012;6(1):1–39.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikitlearn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
van Buuren S, GroothuisOudshoorn K. mice: multivariate imputation by chained equations in R. J Stat Soft. 2010;45:1–68.
Glorot X, Bordes A, Bengio Y. Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics: 2011: JMLR Workshop and Conference Proceedings; 2011. p. 315–23.
Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning, vol. 1. Cambridge: MIT Press; 2016.
Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining: 2016; 2016. p. 785–794.
James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning with applications in R. Berlin: Springer; 2013.
Luque A, Carrasco A, Martín A, de las Heras A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. 2019;91:216–31.
Hand DJ. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach Learn. 2009;77(1):103–23.
Davis J, Goadrich M. The relationship between PrecisionRecall and ROC curves. In: Proceedings of the 23rd international conference on Machine learning: 2006; 2006. p. 233–240.
Chicco D, Tötsch N, Jurman G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in twoclass confusion matrix evaluation. BioData Min. 2021;14(1):1–22.
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Proceedings of the 31st international conference on neural information processing systems: 2017; 2017. p. 4768–4777.
He H, Ma Y. Imbalanced learning: foundations, algorithms, and applications. WileyIEEE Press; 2013.
Elkan C. The foundations of costsensitive learning. In: International joint conference on artificial intelligence: 2001. Lawrence Erlbaum Associates Ltd; 2001. p. 973–978.
Tomek I. An experiment with the edited nearestnieghbor rule. IEEE Trans Syst Man Cybernet. 1976;6(6):448–52.
Kubat M, Matwin S. Addressing the curse of imbalanced training sets: onesided selection. In: Icml: 1997. Citeseer; 1997. p. 179–186.
Nguyen HM, Cooper EW, Kamei K. Borderline oversampling for imbalanced data classification. Int J Knowl Eng Soft Data Paradig. 2011;3(1):4–21.
Wong A, Anantrasirichai N, Chalidabhongse TH, Palasuwan D, Palasuwan A, Bull D. Analysis of visionbased abnormal red blood cell classification. arXiv:210600389 2021.
Yang PT, Wu WS, Wu CC, Shih YN, Hsieh CH, Hsu JL. Breast cancer recurrence prediction with ensemble methods and costsensitive learning. Open Med. 2021;16(1):754–68.
Teh K, Armitage P, Tesfaye S, Selvarajah D, Wilkinson ID. Imbalanced learning: improving classification of diabetic neuropathy from magnetic resonance imaging. PLoS ONE. 2020;15(12):e0243907.
Barbieri D, Chawla N, Zaccagni L, Grgurinović T, Šarac J, Čoklo M, Missoni S. Predicting cardiovascular risk in athletes: resampling improves classification performance. Int J Environ Res Public Health. 2020;17(21):7923.
Ramezankhani A, Pournik O, Shahrabi J, Azizi F, Hadaegh F, Khalili D. The impact of oversampling with SMOTE on the performance of 3 classifiers in prediction of type 2 diabetes. Med Decis Mak. 2016;36(1):137–44.
XGBoost: Machine Learning Challenge Winning Solutions. https://github.com/dmlc/xgboost/tree/master/demo#machinelearningchallengewinningsolutions.
Tang X, Tang R, Sun X, Yan X, Huang G, Zhou H, Xie G, Li X, Zhou Z. A clinical diagnostic model based on an eXtreme Gradient Boosting algorithm to distinguish type 1 diabetes. Ann Transl Med. 2021;9(5):409.
Acknowledgements
We would like to acknowledge the Research Institute for Endocrine Sciences in Shahid Beheshti University of Medical Sciences for preparing research data in the framework of the Tehran Lipid and Glucose Study.
Funding
This research received no funding.
Author information
Contributions
SS, MP, and MAM contributed to study planning and design. DK and AR contributed to data collection and data preparation. SS conducted the statistical analyses. MP, MAM, AR, and DK commented on the results. SS drafted the manuscript and revised it after input from the other authors. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Ethical approval for the TLGS study was obtained from the Ethics Committee of the Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences. All of the participants provided written informed consent. All methods were carried out in accordance with relevant guidelines and regulations. Approval for undertaking the current project was also obtained from the Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sadeghi, S., Khalili, D., Ramezankhani, A. et al. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med Inform Decis Mak 22, 36 (2022). https://doi.org/10.1186/s12911-022-01775-z
DOI: https://doi.org/10.1186/s12911-022-01775-z
Keywords
 Diabetes mellitus
 Machine learning
 Imbalanced data
 Sampling strategies
 Cost-sensitive learning