- Research article
- Open Access
- Open Peer Review
Accurate and rapid screening model for potential diabetes mellitus
BMC Medical Informatics and Decision Making volume 19, Article number: 41 (2019)
Prediction or early diagnosis of diabetes is crucial for populations with high risk of diabetes.
In this study, we assessed the ability of five popular classifiers (J48, AdaboostM1, SMO, Bayes Net, and Naïve Bayes) to identify individuals with diabetes based on nine non-invasive and easily obtained clinical features, including age, gender, body mass index (BMI), hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work stress, and salty food preference. A total of 4205 data entries were obtained from annual physical examination reports for adults in the Shengjing Hospital of China Medical University during January–April 2017. Weka data mining software was used to identify the best algorithm for diabetes classification.
The results indicate that decision tree classifier J48 has the best performance (accuracy = 0.9503, precision = 0.950, recall = 0.950, F-measure = 0.948, and AUC = 0.964). The decision tree structure shows that age is the most significant feature, followed by family history of diabetes, work stress, BMI, salty food preference, physical activity, hypertension, gender, and history of cardiovascular disease or stroke.
Our study shows that decision tree analyses can be applied to screen individuals for early diabetes risk without the need for invasive tests. This procedure will be particularly useful in developing regions with high epidemiological risk and poor socioeconomic status, and enable clinical practitioners to rapidly screen patients for increased risk of diabetes. The key features in the tree structure could further facilitate diabetes prevention through targeted community interventions, which can potentially improve early diabetes diagnosis and reduce burdens on the healthcare system.
The worldwide incidence of diabetes rose from 108 million in 1980 to 422 million in 2014, and could potentially be the seventh-leading cause of death in 2030 . However, half of the patients with diabetes are unaware of their disease. The incidence of diabetes (100 million adult patients) in China was the highest worldwide in 2015, whereas 52.7% of these patients (50 million) are undiagnosed [2, 3]. Hence, early detection and prevention of diabetes is a severe challenge in China.
The American Diabetes Association recommends annual screening for diabetes in patients older than 45 years and in younger patients with major risk factors . China’s National Plan for Non-Communicable Diseases Prevention and Treatment (2012–2015) identified diabetes as one of the priority diseases in China, and proposed several recommendations to predict diabetes based on blood glucose tests and routine physical examinations .
The main challenge in screening for diabetes is economic, including expensive blood work and additional human labor, which is even more challenging in developing countries . The World Health Organization recommends that simple strategies should be developed to identify patients with risk for diabetes and then implement early lifestyle interventions . To achieve these recommendations, it is crucial to develop a simple and accurate diabetes screening method.
Developing appropriate disease prediction algorithms can be technically challenging. In a Brazilian investigation, Lélis et al.  applied seven classification techniques to make a diagnosis of meningococcal meningitis and demonstrated this model is accurate and affordable. Choi et al.  developed two models to screen for prediabetes of 9251 individuals using an artificial neural network (ANN) and support vector machine (SVM) and performed a systematic evaluation of the models using internal and external validation, and concluded that the SVM model is superior to the ANN model in the screening for prediabetes. In another Brazilian study, Olivera et al.  utilized and compared machine-learning algorithms to develop predictive models using data from ELSA-Brasil and found that most of these predictive models yielded similar results and demonstrated the feasibility of identifying individuals with highest risk of having undiagnosed diabetes through easily-obtained clinical data. Data mining and machine learning are analytical methods that leverage artificial intelligence to identify patterns in large data sets, make decisions with minimal human intervention, and build models. There is considerable interest in determining how different classification techniques from machine learning can be utilized as disease prediction tools [10,11,12,13,14,15,16,17,18,19]. These tools have been used to diagnose diabetes [3, 8,9,10, 20,21,22], meningitis , glaucoma , asthma , coronary artery disease , cancer [14,15,16,17, 23], tuberculosis [18, 24], hypertension , and heart arrhythmia .
The objective of this study is to use easily obtained and directly observable clinical data to construct a predictive model to identify patients with increased risk for diabetes. Specifically, we utilize data mining and machine learning to develop an accurate diabetes classifier that can rapidly screen clinical data. Our approach will be particularly useful in locations with high epidemiological risk and poor socioeconomic status, where patients cannot afford medical laboratory costs . Rapid identification of patients with high diabetes risk can help to avoid disease progression and prevent the incidence of disease complications.
A total of 8452 annual physical examination reports between January 2017 and April 2017 were collected from the electronic health records database in Shengjing Hospital of China Medical University, located in the center of Liaoning Province in China. We adopted the nine most frequently used features from previous studies of diabetes prediction models [8, 20, 27,28,29,30]. These features are either directly observable or easily obtained without expensive and invasive tests. Approval for this study was obtained from the Shengjing Hospital (reference number 2017PS42K).
The nine features included age, gender, body mass index (BMI), hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work stress, and salty food preference (eating the salty meat or fish 4–7 times a week). Among 8452 records, a total of 3956 records were excluded due to missing data for BMI, blood pressure, family history of diabetes, history of cardiovascular disease or stroke, physical activity, work stress, or salty food preference. Records with past history of diabetes (291 records) also were excluded because we focused on predicting prediabetes and diabetes. Finally, a total of 4205 records were included in this study as shown in Fig. 1.
Data collection and transformation
The nine features were characterized for data analysis. Age and gender were demographic characteristics. Family history of diabetes was defined as any family member previously diagnosed by a physician as diabetic or prediabetic (Yes = 1, No = 0). BMI was calculated as body weight divided by the square of height in meters and BMI ≥ 25 was defined as overweight. History of cardiovascular disease or stroke was defined as the patient previously diagnosed with coronary heart disease or stroke by a physician (Yes = 1, No = 0). Physical activity indicated if the patient engaged in more than 30 min of exercise 3 days a week (More = 1, Less = 0). Work stress was grouped into three levels according to the patients’ subjective impression (High = 2, Moderate = 1, Low = 0). Salty food preference (salty meat or fish) indicated if the person preferred salty food for 4–7 days a week (Yes = 1, No = 0).
BMI and hypertension were defined and measured as below. BMI was calculated as weight in kilograms divided by the square of height in meters (kg/m2); BMI ≥ 25 was defined as overweight. Hypertension was defined as systolic blood pressure ≥ 140 mmHg, or diastolic blood pressure ≥ 90 mmHg, and/or use of medication for blood pressure control.
Each report included a diagnosis (diabetes or normal) based on fasting plasma glucose. Diabetes diagnoses included prediabetes and type 2 diabetes, and was defined as fasting plasma glucose ≥5.6 mmol/L [8, 20].
After data preparation and transformation, the final database consisted of 4205 records and 10 variables. These 10 variables included 9 input variables and one target variable. The target variable consisted of two classes: one class was the diagnosis of diabetes, the other class was normal. The characteristics of participants and chi-square test results between two groups are presented in Table 1. There were statistically significant differences in the nine features between the two groups, at a significance level of 0.05.
We applied five popular classifiers to train the dataset, including J48 (class for generating a pruned or unpruned), AdaboostM1 (method for boosting a nominal class classifier), SMO (implements John Platt’s sequential minimal optimization algorithm for training a support vector classifier), Bayes Net (Bayes network learning method that implements a hill climbing algorithm restricted by an order on the variables), and Naïve Bayes (class for a naïve Bayes classifier using estimator classes). Weka software (version 3.8; University of Waikato, Hamilton, NZ) [6, 16] was used to assess the classifiers and identify the best algorithm for diabetes classification. To avoid over-fitting and unnecessary complexity, the decision tree created by the J48 algorithm was pruned by removing nonessential terminal branches. This pruning method was based on defined algorithms and did not affect the classification accuracy [6, 21, 25].
Classifier accuracy and performance evaluation
The entire dataset was randomly divided into two parts: the training set consisted of 70% of the data for model development, and the test set consisted of the remaining data (30%) for model validation [21, 31]. The algorithms were compared based on accuracy, precision, recall, F-measure, and the area under the receiver operating characteristic (ROC) curve (AUC), and the best-performing algorithm was selected [21, 32]. Eqs. 1–2 were used to calculate the accuracy, precision, recall, and F-measure, respectively.
The AUC summarizes ROC curves by indicating whether the classifier is more likely to distribute the score as positive rather than the randomly selected negative sample. Better models have larger AUC values. The relative accuracy of the classification test is graded according to the following scale : Excellent = 0.90–1; Good = 0.80–0.90; Fair = 0.70–0.80; Poor = 0.60–0.70; Fail = 0.50–0.60.
A total of 4205 records (2734 females and 1471 males) were selected for this analysis, which included 709 (16.86%) diabetes diagnoses and 3496 (83.14%) normal patients. Table 2 presents the performance of all classifiers, and shows that J48 exhibits better results than others (accuracy = 0.9503, precision = 0.950, recall = 0.950, F-measure = 0.948, and AUC = 0.964). Figure 2 presents the ROC curves of all classifiers.
The final tree contains 18 nodes and 19 leaves, as shown in Fig. 3.
The decision tree shows that age was assigned by as the first and most informative node, followed by family history of diabetes, work stress, BMI, salty food preference, physical activity, hypertension, gender, and history of cardiovascular disease or stroke. Most leaves in the left half of the decision tree (≤49 years old) were classified as normal, whereas most leaves in the right half of the decision tree (> 49 years old) were classified as diabetes.
The decision tree can be converted into a set of if-then rules by tracing the path from the root node to each terminal (leaf) node. The if-then rules created by the model are presented in Table 3.
In this study, we employed data mining and machine learning to examine the performance of five classifiers (J48, AdaboostM1, SMO, Bayes Net, and Naïve Bayes) and nine non-invasive and easily obtained clinical features (age, gender, BMI, hypertension, history of cardiovascular disease or stroke, family history of diabetes, physical activity, work stress, and salty food preference) for the rapid and accurate identification of individuals with diabetes. The best classifier was trained with the decision tree generated by the J48 algorithm, which had accuracy = 0.9503, precision = 0.950, recall = 0.950, F-measure = 0.948, and AUC = 0.964. The results indicate that this strategy successfully achieves accurate and rapid diabetes screening. This approach can be applied for non-invasive prediction of prediabetes and diabetes without the need for expensive lab tests. Thus, this test could be particularly useful in regions with high epidemiological risk and low socioeconomic status.
Decision trees are powerful classification algorithms used in parallel with data mining methods [20, 21, 24, 33, 34]. The first variable (root) in the tree is the most important factor, whereas consecutively distant variables further from the root are ranked in order as less important factors for data classification . This study shows that age is the most important attribute discriminating between those with and without diabetes. Age is followed by family history of diabetes, work stress, BMI, salty food preference, physical activity, hypertension, gender, and history of cardiovascular disease or stroke. These results are consistent with those reported in previous studies [20, 35, 36].
The decision tree shows that family history of diabetes, work stress and BMI are the following important factors after age. The tree identified a subgroup of individuals [1457 patients (99%)] with age ≤ 49, without a family history of diabetes and BMI ≤ 25 that were normal cases. Another subgroup of individuals [98 patients (97%)] with age ≤ 49, with a family history of diabetes, BMI > 25 that were identified as diabetes cases. A subgroup of individuals [85 patients (100%)] with age > 49 and work stress high, with a family history of diabetes were identified as diabetes cases (Table 3). These key features could facilitate diabetes prevention through community interventions. Several large-scale trials have demonstrated the benefits of preventing diabetes with targeted lifestyle interventions [20, 37,38,39]. By reducing these risk factors, would be rewarded as Therefore, patients who are at a high risk of developing diabetes could be targeted to reduce established risk factors and provide educational programs, which will reduce the public health burden and the number of undiagnosed individuals [8, 40].
A major strength of this study is that we used a real medical dataset of annual physical examinations from Shengjing Hospital of China Medical University. All subjects were subjected to laboratory glucose tests to diagnose prediabetes or diabetes, so the results were more reliable than if the individuals were diagnosed by self-reporting. In 2014, Shengjing Hospital received the Stage Seven award from the Healthcare Information and Management Systems Society for successful implementation of electronic health records and rapid sharing of clinical information via standardized electronic transactions, data warehousing, and data continuity with the emergency department and other ambulatory care departments. Shengjing Hospital routinely collects and stores a large amount of data in the electronic hospital records. We used data mining, machine learning, and knowledge discovery capabilities to identify potential data patterns and specific features containing enough information to increase the accuracy of diabetes predictions [10, 41]. A large-scale study conducted in Iran compared different classification algorithms in the diagnosis of type 2 diabetes and demonstrated that it is therefore highly recommended that the choice and selection of features for data mining applications in disease diagnosis, be done by the help and advice of experts to obtain the best possible results. Artificial neural network is the most accurate method of classification with an accuracy of 97.18% .
In the future, we will test the model and develop prediction models with more sensitivity and specificity. We will focus on applying similar methods in different populations using more data. When the amount of data increases, the results will be more robust . Our approach can be extended to larger databases that store more variables and risk factors related to diabetes . The results of these studies could provide novel evidence-based prevention and treatment strategies. Clinical researchers can help to establish new priorities for further analyses by diabetes researchers.
Our study has two limitations. Data were collected from only one large hospital in China. Further studies with additional data from this hospital and other centers need to be performed. This was a cross-sectional design study. The results should be confirmed in a prospective study.
We utilized data mining classifiers and machine learning to generate a decision tree that identified potential prediabetes and diabetes in clinical data extracted from annual health examination reports in a large Chinese hospital. We assessed the classifiers using nine clinical features that were easily obtained and non-invasive. The J48 classifier had the best performance, and indicates that decision tree analyses can be successfully applied to rapidly and accurately screen for diabetes in clinical practice. This type of work is essential in regions with high epidemiological risk and low socioeconomic status. The tree structure identifies the most important risk factors, and suggests that diabetes prevention programs could be applied through targeted community interventions. This would help improve early diabetes diagnosis and reduce burdens on the healthcare system.
The area under the receiver operating characteristic curve
Body mass index
Sequential minimal optimization
World Health Organization, Global report on diabetes 2016. http://apps.who.int/iris/bitstream/10665/204871/1/9789241565257_eng.pdf.
Hadaegh F, et al. High prevalence of undiagnosed diabetes and abnormal glucose tolerance in the Iranian urban population: Tehran lipid and glucose study. BMC Public Health. 2008;8:176.
Jahani M, Mahdavi M. Comparison of predictive models for the early diagnosis of diabetes. Healthc Inform Res. 2016;22(2):95–100.
Pippitt K, Li M, Gurgle HE. Diabetes mellitus: screening and diagnosis. Am Fam Physician. 2016;93(2):103–9.
The National Guideline for Basic Public Health Services. Available from: http://www.nhc.gov.cn/ewebeditor/uploadfile/2017/04/20170417104506514.pdf. (In Chinese).
Lelis VM, Guzman E, Belmonte MV. A statistical classifier to support diagnose meningitis in less developed areas of Brazil. J Med Syst. 2017;41(9):145.
World Health Organization, 2008–2013 action plan for the global strategy for the prevention and control of non-communicable disease.
Choi SB, et al. Screening for prediabetes using machine learning models. Comput Math Methods Med. 2014;2014:618976.
Olivera AR, et al. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes - ELSA-Brasil: accuracy study. Sao Paulo Med J. 2017;135(3):234–46.
Habibi S, Ahmadi M, Alizadeh S. Type 2 diabetes mellitus screening and risk factors using decision tree: results of data mining. Glob J Health Sci. 2015;7(5):304–10.
Huang ML, Chen HY. Glaucoma classification model based on GDx VCC measured parameters by decision tree. J Med Syst. 2010;34(6):1141–7.
Farion K, et al. A tree-based decision model to support prediction of the severity of asthma exacerbations in children. J Med Syst. 2010;34(4):551–62.
Gregori D, et al. Non-invasive risk stratification of coronary artery disease: an evaluation of some commonly used statistical classifiers in terms of predictive accuracy and clinical usefulness. J Eval Clin Pract. 2009;15(5):777–81.
Chao CM, et al. Construction the model on the breast cancer survival analysis use support vector machine, logistic regression and decision tree. J Med Syst. 2014;38(10):106.
Kate RJ, Nadig R. Stage-specific predictive models for breast cancer survivability. Int J Med Inform. 2017;97:304–11.
Takada M, et al. Prediction of axillary lymph node metastasis in primary breast cancer patients using a decision tree-based model. BMC Med Inform Decis Mak. 2012;12:54.
Cakir A, Demirel B. A software tool for determination of breast cancer treatment methods using data mining approach. J Med Syst. 2011;35(6):1503–11.
Saybani MR, et al. Diagnosing tuberculosis with a novel support vector machine-based artificial immune recognition system. Iran Red Crescent Med J. 2015;17(4):e24557.
Zmiri D, Shahar Y, Taieb-Maimon M. Classification of patients by severity grades during triage in the emergency department using data mining methods. J Eval Clin Pract. 2012;18(2):378–88.
Meng X-H, et al. Comparison of three data mining models for predicting diabetes or prediabetes by risk factors. Kaohsiung J Med Sci. 2013;29(2):93–9.
Ramezankhani A, et al. Applying decision tree for identification of a low risk population for type 2 diabetes. Tehran lipid and glucose study. Diabetes Res Clin Pract. 2014;105(3):391–8.
Yu W, et al. Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inform Decis Mak. 2010;10:16.
Boyce S, et al. Evaluation of prediction models for the staging of prostate cancer. BMC Med Inform Decis Mak. 2013;13:126.
Kammerer JS, et al. Tuberculosis transmission in nontraditional settings: a decision-tree approach. Am J Prev Med. 2005;28(2):201–7.
Tayefi M, et al. The application of a decision tree to establish the parameters associated with hypertension. Comput Methods Prog Biomed. 2017;139:83–91.
Alickovic E, Subasi A. Medical decision support system for diagnosis of heart arrhythmia using DWT and random forests classifier. J Med Syst. 2016;40(4):108.
Hu FB, et al. Diet, lifestyle, and the risk of type 2 diabetes mellitus in women. N Engl J Med. 2001;345(11):790–7.
Anderson JW, et al. Carbohydrate and fiber recommendations for individuals with diabetes: a quantitative assessment and meta-analysis of the evidence. J Am Coll Nutr. 2004;23(1):5–17.
Colditz GA, et al. Weight gain as a risk factor for clinical diabetes mellitus in women. Ann Intern Med. 1995;122(7):481–6.
Koppes LL, et al. Moderate alcohol consumption lowers the risk of type 2 diabetes: a meta-analysis of prospective observational studies. Diabetes Care. 2005;28(3):719–25.
Li CP, et al. Performance comparison between logistic regression, decision trees, and multilayer perceptron in predicting peripheral neuropathy in type 2 diabetes mellitus. Chin Med J. 2012;125(5):851–7.
Lavrac N. Selected techniques for data mining in medicine. Artif Intell Med. 1999;16(1):3–23.
Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.
Marinov M, et al. Data-mining technologies for diabetes: a systematic review. J Diabetes Sci Technol. 2011;5(6):1549–56.
Ravikumar P, et al. Prevalence and risk factors of diabetes in a community-based study in North India: the Chandigarh urban diabetes study (CUDS). Diabetes Metab. 2011;37(3):216–21.
Reis JP, et al. Lifestyle factors and risk for new-onset diabetes: a population-based cohort study. Ann Intern Med. 2011;155(5):292–9.
Li G, et al. The long-term effect of lifestyle interventions to prevent diabetes in the China Da Qing diabetes prevention study: a 20-year follow-up study. Lancet. 2008;371(9626):1783–9.
Saaristo T, et al. Lifestyle intervention for prevention of type 2 diabetes in primary health care: one-year follow-up of the Finnish National Diabetes Prevention Program (FIN-D2D). Diabetes Care. 2010;33(10):2146–51.
Tuomilehto J, et al. Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. N Engl J Med. 2001;344(18):1343–50.
Anderson AE, et al. Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: a cross-sectional, unselected, retrospective study. J Biomed Inform. 2016;60:162–8.
Devoe JE, et al. Electronic health records vs Medicaid claims: completeness of diabetes preventive care data in community health centers. Ann Fam Med. 2011;9(4):351–8.
Heydari, M, et al. Comparison of various classification algorithms in the diagnosis of type 2 diabetes in Iran. Int J Diabetes Dev C. 2016;36(2):167–73.
China Medical Board (Grant number 15–219). The funding bodies play a role in the design of the study and collection, analysis of data and in writing the manuscript.
Availability of data and materials
The datasets analyzed during the current study are available from the corresponding author on reasonable request.
Ethics approval and consent to participate
Ethics approval was obtained by Shengjing Hospital of China Medical University Ethics Committee (ref. Ethics 2017PS42K) without requirement of consent for participation.
Consent for publication
All authors declare that they have no conflicts of interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.