Identification of risk factors for patients with diabetes: diabetic polyneuropathy case study

Metsker, Oleg; Magoev, Kirill; Yakovlev, Alexey; Yanishevskiy, Stanislav; Kopanitsa, Georgy; Kovalchuk, Sergey; Krzhizhanovskaya, Valeria V.

doi:10.1186/s12911-020-01215-w

Research article
Open access
Published: 24 August 2020


Oleg Metsker¹,
Kirill Magoev^2,3,
Alexey Yakovlev^1,2,
Stanislav Yanishevskiy¹,
Georgy Kopanitsa ORCID: orcid.org/0000-0002-6231-8036²,
Sergey Kovalchuk² &
…
Valeria V. Krzhizhanovskaya^2,3

BMC Medical Informatics and Decision Making volume 20, Article number: 201 (2020)


Abstract

Background

Methods of data mining and analytics can be efficiently applied in medicine to develop models that use patient-specific data to predict the development of diabetic polyneuropathy. However, there is room for improvement in the accuracy of predictive models. Existing studies of diabetic polyneuropathy considered too few predictors within a single study to allow a comparison of different machine learning methods with different predictors and to find the most efficient one. The purpose of this study is to implement machine learning methods for identifying the risk of diabetic polyneuropathy based on structured electronic medical records collected in databases of medical information systems.

Methods

For the purposes of our study, we developed a structured procedure for predictive modelling, which includes data extraction and preprocessing, model adjustment and performance assessment, selection of the best models, and interpretation of results. The dataset contained a total of 238,590 laboratory records. Each record included 27 laboratory tests, age, gender, and the presence of retinopathy or nephropathy. The records included information about 5846 patients with diabetes. Diagnosis served as a source of information about the target class values for classification.

Results

We found that including two comorbidity features, namely “nephropathy” and “retinopathy”, increases performance, achieving up to 79.82% precision, 81.52% recall, 80.64% F1 score, 82.61% accuracy, and 89.88% AUC with the neural network classifier. Additionally, the models differed in how their results could be interpreted: random forest confirmed that the most important risk factor for polyneuropathy is an increased neutrophil level, which indicates the presence of inflammation in the body. Linear models showed a linear dependence of the presence of polyneuropathy on blood glucose levels, which agrees with the clinical importance of blood glucose control.

Conclusion

Depending on whether one needs to identify pathophysiological mechanisms for a prospective study or to identify early or late predictors, the choice of model will vary. In comparison with previous studies, our research provides a comprehensive comparison of different decision support models using a large and well-structured dataset applied to different decision support tasks.


Background

Every third patient with diabetes suffers from diabetic polyneuropathy (DPN) [1]. This complication is a severe problem for a chronic patient because it can be accompanied by neuropathic pain and lead to a decrease in the quality of life. Neuropathic pain significantly disrupts sleep and negatively affects daily activity and life satisfaction. Because of its asymptomatic nature, it is important to have methods for identifying and predicting the development of this complication [2]. Diabetes is usually accompanied by other comorbid conditions, such as hypertension and chronic heart failure, that should be considered when analyzing risk factors [3].

Earlier diagnosis leads to more efficient management of the disease. Machine learning methods can give doctors a preliminary judgment about diabetic polyneuropathy based on routinely collected physical examination data and thus support their decision-making.

Selecting the valid features and the correct classifier are the most important problems for machine learning methods [4, 5]. Predictive models for diabetes already provide highly accurate prognosis of the disease.

Several studies on the prediction of diabetes mellitus that use linear regression (LR) [6, 7] developed predictive models based on records with up to 967 independent variables available from large electronic health record archives. These models reached up to 80% area under the receiver operating characteristic curve (AUC-ROC) [6] and enabled the formulation of the most likely trajectory of comorbidities [7].

Applying an ensemble approach based on naïve Bayes, linear regression (LR), instance-based learner, support vector machine (SVM) [8], artificial neural networks (ANN) [9], decision trees [10], and random forests (RF) helps to increase the accuracy of diabetes prediction up to 74% [11, 12].

Prediction of risks of specific complications of diabetes can achieve higher accuracy than general diabetes prediction models. For example, Sudharsan et al. [13] applied methods of machine learning to predict hypoglycemia in patients with diabetes as a result of insulin treatment. The authors used a variable number of blood glucose measurements to predict hypoglycemia (low blood glucose). According to the research, the RF model showed the best results at 10 measurements per week (92% sensitivity and 70% specificity). When using additional data on prescribed medications and reducing the prognosis window to 1 h, the specificity of the model increases to 90%.

Specific predictive models for diabetic polyneuropathy are based on screening methods. For example, nerve conduction studies (NCS) [14] can reach an AUC of 65.8–84.7% for the conditional diagnosis of DPN in primary care.

Prediction methods that use data from personal health records deal with large non-specific datasets and apply different prediction methods. Li et al. [15] used 30 independent variables, which yielded a model with AUC = 88.63% for a multilayer perceptron (MLP). Linear regression (LR) based methods [6, 16] produced an AUC of up to 80.0%.

Huang et al. [17] developed a decision tree (DT) based model to predict diabetic nephropathy. Their method is based on laboratory analyses and genetic data in conjunction with gender-specific rules to achieve maximum prediction accuracy. Depending on the gender of a patient, the model uses the set of variables that shows the highest results in the corresponding group. According to the study, the model achieved an accuracy of 78.50%, a specificity of 80.64%, and a sensitivity of 81.40%.

Lagani et al. [18, 19] developed models to determine the risk of diabetes-related complications. Complications considered by the authors include cardiovascular disease, hypoglycemia, ketoacidosis, microalbuminuria, proteinuria, neuropathy, and retinopathy. For all outcomes, the authors determined the smallest set of clinical parameters and the algorithm to achieve the highest accuracy of diagnosis. The set of developed models achieves the accuracy from 62 to 83% depending on the specific complication.

The progression of polyneuropathy depends not only on the duration of diabetes but also on the treatment. It is known that with adequate control of the blood glucose level, the incidence of polyneuropathy 15 years after the onset of diabetes does not exceed 10%, whereas with poor glycemic control it increases to 50%. Also, no direct relationship was discovered between the severity of diabetes and the progression of polyneuropathy (for example, severe forms of polyneuropathy can be observed in patients with a relatively mild course of diabetes). According to the EURODIAB IDDM Complication Study (2001), the risk factors for polyneuropathy were old age, duration of diabetes mellitus, hemoglobin level, excessive weight, proliferative diabetic retinopathy, a high level of low-density lipoproteins, and cardiovascular disease. The Seattle Prospective Diabetic Foot Study (1996) identified further factors, including increased diastolic blood pressure, ketoacidosis, increased triglyceride levels, and microalbuminuria. Currently, the theory of “metabolic memory” (2007) is being developed, according to which the first (early) metabolic or inflammatory changes can, owing to a certain inertia of the pathological process, have a long-term effect and contribute to the further progression of late complications of diabetes mellitus.

In clinical practice, the importance of early diagnosis of polyneuropathy is often underestimated by many specialists, and many cases remain unrecognized. The prognostic significance of risk factors for the occurrence and progression of polyneuropathy has not previously been assessed in studies. Identifying groups at high risk of developing polyneuropathy, and of its progression, among patients with diabetes mellitus will make it possible to improve the quality of medical care and reduce the number of disabling forms of the disease. A prediction model can also be used to set up focused clinical studies with hypotheses supported by real-world evidence (RWE).

Methods of data mining and analytics can be efficiently applied in medicine to develop models that use patient-specific data to predict the development of diabetic polyneuropathy; however, there is still room to improve the efficiency of the predictive models. None of the studies dealing with diabetic polyneuropathy considered a significant number of predictors within a single study, which would enable a comparison of different machine learning methods with different predictors to find the most efficient one. The studies [11, 12, 18, 19] that applied machine learning to diabetes complications did not provide a comprehensive comparison of different decision support models using a well-structured and sufficiently large dataset applied to different decision support tasks.

The purpose of this study is the implementation of machine learning methods for early identification of the risk of diabetes polyneuropathy based on structured electronic medical records collected in databases of medical information systems. The models developed in this study consider comorbidity with retinopathy and nephropathy.

Methods

Data from the medical information system of the Almazov specialized medical center were used as the empirical basis of the study. The combination of several specialized departments in a single system creates the prerequisites for the validity of this study. The medical center has diagnostic, intensive care, rehabilitation, and other departments. It is one of the best-equipped medical centers in Russia, and its treatment processes are highly digitalized. On this basis, it is possible to extract data from the medical system that correspond closely to the real process. The local ethics committee of ITMO University approved the study.

For our study, we applied a basic procedure for predictive modelling (Fig. 1), which includes data extraction and preprocessing, model adjustment and performance assessment, selection of the best models, and interpretation of the results. The basic methods and technologies are described further in this section. We used Python 3.6.3 and scikit-learn 0.19.1 [20] as the basic framework for machine learning models.

Data extraction and preprocessing

The dataset contains the total number of 238,590 laboratory records. Each record contains patient and episode identifiers, a timestamp, and a number of measured parameters, the total number of which is 31 (27 laboratory tests, retinopathy, nephropathy, age, and gender).

The patients were eligible for the study based on the following inclusion and exclusion criteria:

Inclusion criteria:

Age ≥ 16 years old.

Diagnoses:

E08 Diabetes mellitus due to underlying condition
E09 Drug or chemical induced diabetes mellitus
E10 Type 1 diabetes mellitus
E11 Type 2 diabetes mellitus

Timeframe: Cases were opened from 03-07-2010 to 22-08-2017.

Exclusion criteria:

Age < 16 years old.

E12 Malnutrition-related diabetes mellitus.

O24 Diabetes mellitus in pregnancy, childbirth, and the puerperium.

Case opened outside of the timeframe 03-07-2010 to 22-08-2017.

The records cover the time period from 03-07-2010 to 22-08-2017 and include information about 5846 patients with diabetes. Diagnosis served as a source of information about the target class values for classification (whether or not a patient had developed polyneuropathy).
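The eligibility criteria above can be expressed as a small filter. This is only a minimal sketch: the function name, its arguments, and the flat list of ICD-10 codes are hypothetical, since the record schema is not described here, and the dates are read as day-month-year.

```python
from datetime import date

# ICD-10 code prefixes taken from the inclusion and exclusion lists above.
INCLUDED_PREFIXES = ("E08", "E09", "E10", "E11")
EXCLUDED_PREFIXES = ("E12", "O24")
WINDOW_START = date(2010, 7, 3)   # 03-07-2010, assumed to be day-month-year
WINDOW_END = date(2017, 8, 22)    # 22-08-2017
MIN_AGE = 16


def is_eligible(age, diagnosis_codes, case_opened):
    """Return True if a case meets every inclusion and no exclusion criterion.

    The argument layout is hypothetical; it only illustrates the criteria.
    """
    if age < MIN_AGE:
        return False
    if not (WINDOW_START <= case_opened <= WINDOW_END):
        return False
    # str.startswith accepts a tuple of prefixes.
    if any(code.startswith(EXCLUDED_PREFIXES) for code in diagnosis_codes):
        return False
    return any(code.startswith(INCLUDED_PREFIXES) for code in diagnosis_codes)


print(is_eligible(54, ["E11.4"], date(2015, 1, 10)))
```

Checking the exclusion codes before the inclusion codes means that a case carrying both an included and an excluded diagnosis is rejected, which is one possible reading of the criteria.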

In this article we focus on the analysis of complication factors in patients in general. Because diabetes mellitus is a chronic disease that a patient acquires for life, we did not divide the dataset into episodes.

In the first phase of the data preprocessing, we selected a set of 31 parameters with the largest populations of patients and the longest average time series length and filtered out the rest, normalized the remaining values by parameter medians, and interpolated and extrapolated the patients’ data to convert it into regularly sampled time series having equal lengths.
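The resampling step can be illustrated with a short sketch. Several details are assumptions rather than facts from the study: normalization is taken to mean division by the parameter median, interpolation is linear, and extrapolation simply holds the first and last observed values constant (the behaviour of `numpy.interp`). The function name and the time grid are hypothetical.

```python
import numpy as np


def to_regular_series(times, values, grid, parameter_median):
    """Normalize one irregular series and resample it onto a regular grid.

    times must be sorted in increasing order (required by np.interp).
    """
    times = np.asarray(times, dtype=float)
    # Assumed normalization: divide by the median of the parameter.
    normalized = np.asarray(values, dtype=float) / parameter_median
    # Linear interpolation inside the observed range; outside it, np.interp
    # repeats the first or last value, a simple form of extrapolation.
    return np.interp(grid, times, normalized)


grid = np.linspace(-1.0, 5.0, 7)  # seven equally spaced time points
series = to_regular_series([0, 2, 4], [2.0, 6.0, 10.0], grid, parameter_median=2.0)
print(series)
```

Applying the same grid to every patient yields series of equal length, which is what the later summarization steps rely on.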

Laboratory test datasets were extracted from the medical database to tables for validation and analysis. All data was anonymized. The laboratory tests included:

1. Hemoglobin (HGB), 2. Leukocytes (LEU), 3. Platelets (PLT), 4. pH, 5. Mean platelet volume (MPV), 6. Creatinine, 7. Mean cell hemoglobin (MCH), 8. Neutrophils (NEUT), 9. Mean corpuscular volume (MCV), 10. Cholesterol, 11. Glucose, 12. Procalcitonin (PCT), 13. Red blood cell distribution width (RDW), 14. Alanine transaminase (ALT), 15. Bilirubin, 16. Platelet distribution width (PDW), 17. High-density lipoprotein (HDL), 18. Aspartate aminotransferase (AST), 19. White blood count (WBC), 20. Troponin, 21. Monocytes, 22. Bilirubin, 23. Red blood cell count (RBC), 24. Triglycerides, 25. Hematocrit (HCT), 26. Low-density lipoproteins (LDL), 27. Blood in urine (BLD).

In addition to the laboratory data, the dataset contained patients’ gender, age, and the presence of comorbidities: retinopathy, nephropathy (true or false).

We removed from the dataset the patients with an insufficient amount of data (< 6 parameters) and those with no data on retinopathy and nephropathy. We also removed the 1% of values with the highest z-scores to filter out obvious outliers.
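A minimal sketch of these two filters follows. It assumes that the z-score is computed per parameter, that its absolute value is used, and that a patient is stored as a dictionary with a `labs` mapping; none of these details is stated in the text, and all names are hypothetical.

```python
import statistics


def drop_extreme_values(values, fraction=0.01):
    """Drop the given fraction of values with the largest absolute z-score."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    n_drop = int(len(values) * fraction)
    if std == 0 or n_drop == 0:
        return list(values)
    # Rank positions by |z| and discard the top n_drop of them.
    ranked = sorted(
        range(len(values)),
        key=lambda i: abs(values[i] - mean) / std,
        reverse=True,
    )
    to_drop = set(ranked[:n_drop])
    return [v for i, v in enumerate(values) if i not in to_drop]


def has_enough_data(patient, min_parameters=6):
    """Keep a patient with at least min_parameters measured parameters
    and known retinopathy and nephropathy flags."""
    measured = sum(1 for series in patient["labs"].values() if series)
    flags_known = (
        patient.get("retinopathy") is not None
        and patient.get("nephropathy") is not None
    )
    return measured >= min_parameters and flags_known


cleaned = drop_extreme_values([5.0] * 99 + [500.0])
print(len(cleaned), max(cleaned))
```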

We employed three approaches to preprocessing of the time series information:

1. Replacement of each time series with its last value;
2. Replacement of each time series with three of its characteristics: mean, length, and standard deviation;
3. Replacement of each time series with its maximum, with the exception of hemoglobin time series, which were replaced with their minimums.
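The three summarization approaches can be sketched as follows. The function names are hypothetical, the hemoglobin series is assumed to be keyed as "HGB", and the population standard deviation is used here because the text does not say which estimator was applied.

```python
import statistics


def summarize_last(series):
    """Approach 1: keep only the last value of the series."""
    return {"last": series[-1]}


def summarize_stats(series):
    """Approach 2: mean, length, and standard deviation of the series."""
    return {
        "mean": statistics.fmean(series),
        "length": len(series),
        # Population standard deviation (an assumption); it is also
        # defined for series that contain a single measurement.
        "std": statistics.pstdev(series),
    }


def summarize_extreme(name, series):
    """Approach 3: maximum, except for hemoglobin, where the minimum is used."""
    return {"extreme": min(series) if name == "HGB" else max(series)}


glucose = [10.0, 12.0, 14.0]
print(summarize_last(glucose))
print(summarize_stats(glucose))
print(summarize_extreme("GLU", glucose), summarize_extreme("HGB", glucose))
```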

After that, we applied two strategies of dealing with missing values to ensure that all patients have the same set of variables:

1. Replacement of missing data with the medians of the corresponding parameters;
2. Deletion of parameters that have too many missing values (> 500 by default) and removal of all patients that have any missing values in the remaining parameters.
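Both strategies can be sketched on a list of per-patient dictionaries in which a missing value is stored as `None`. This layout and the function names are hypothetical; the threshold of 500 is the default quoted above.

```python
import statistics


def impute_with_medians(rows, parameters):
    """Strategy 1: replace each missing value with the parameter's median.

    Assumes every parameter has at least one observed value.
    """
    medians = {
        p: statistics.median([r[p] for r in rows if r.get(p) is not None])
        for p in parameters
    }
    return [
        {p: (r[p] if r.get(p) is not None else medians[p]) for p in parameters}
        for r in rows
    ]


def drop_sparse_then_incomplete(rows, parameters, max_missing=500):
    """Strategy 2: drop parameters with more than max_missing gaps,
    then drop patients that still have a gap in the kept parameters."""
    kept = [
        p for p in parameters
        if sum(1 for r in rows if r.get(p) is None) <= max_missing
    ]
    complete = [
        {p: r[p] for p in kept}
        for r in rows
        if all(r.get(p) is not None for p in kept)
    ]
    return kept, complete


rows = [
    {"GLU": 5.0, "TROP": None},
    {"GLU": None, "TROP": None},
    {"GLU": 7.0, "TROP": None},
    {"GLU": 9.0, "TROP": 0.1},
]
print(impute_with_medians(rows, ["GLU", "TROP"]))
print(drop_sparse_then_incomplete(rows, ["GLU", "TROP"], max_missing=2))
```

The first strategy keeps every patient at the cost of synthetic values, while the second keeps only observed values at the cost of a smaller dataset.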

The resulting dataset was randomly split into training (80%) and test (20%) samples. All the models were fitted on the training set.

Cluster identification in diabetes population

To identify subclasses within the class of patients with diabetes, the clustering problem was solved. The scikit-learn library’s method of T-distributed Stochastic Neighbour Embedding (T-SNE) was used. T-SNE was used as the most efficient method to organize unsupervised learning. It excludes bias related to the known number of clusters (e.g. this bias exists in the principal component analysis). In comparison with SNE, T-SNE uses a Student distribution that minimizes the influence of outliers that are common in medical data sets [21, 22]. The values presented in Table 1 for T-SNE parameters in scikit-learn were selected after several iterations with expert interpretation.

Table 1 T-SNE parameters


All the parameters of the dataset were used for training the clustering model. The features were selected on the basis of correlation analysis for the target class. Grid search was used to select the optimal values of hyperparameters (Table 1).
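For orientation, a t-SNE embedding with scikit-learn might look like the sketch below. The data are synthetic and the parameter values are placeholders chosen for illustration; they are not the values of Table 1, which are not reproduced in this text. The sketch only produces the two-dimensional embedding; how the subclasses were delineated from it is not specified here.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for the patient feature matrix:
# rows are patients, columns are the 31 parameters.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 31)),
    rng.normal(4.0, 1.0, size=(60, 31)),
])

# Placeholder hyperparameters (assumptions, not the study's settings).
tsne = TSNE(
    n_components=2,
    perplexity=30.0,
    learning_rate=200.0,
    init="random",
    random_state=0,
)
embedding = tsne.fit_transform(X)
print(embedding.shape)
```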

Sensitivity analysis and grid search for the classification model

Each experiment was run with stratified 5-fold cross-validation, i.e., a random 80% of the training dataset was used for training and the remaining 20% for testing, with the target class ratios preserved in the folds. For the performance assessment of the SVM and DT classifiers, we ran the procedure 100 times; 100 × 5-fold cross-validation resulted in 500 predictions. All the measurements were performed separately per dataset and per model parameter value to determine the best parameters for the classifiers as well as the optimal data preprocessing. After determining the optimal dataset and model parameters, we performed a validation with the testing dataset. As an additional performance assessment score, we used the AUC of the ROC, which represents the trade-off between the sensitivity and specificity of the model. We used a series of classification models available within scikit-learn as a pool for the selection of the best predictive methods to be applied within the proposed scheme. A summary of the required model parameters is presented in Table 2.
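A repeated stratified 5-fold evaluation of this kind can be sketched with scikit-learn as follows. The data are synthetic, the decision tree settings are placeholders, and only 3 repeats are used to keep the example quick, whereas the study reports 100.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (31 features, two classes).
X, y = make_classification(
    n_samples=400, n_features=31, n_informative=8,
    weights=[0.57, 0.43], random_state=0,
)

# Stratification preserves the class ratio in every fold.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=0),
    X, y, cv=cv, scoring="roc_auc",
)
# One AUC value per fold and repeat: 5 x 3 = 15 values.
print(len(scores), scores.mean())
```

Setting `n_repeats=100` would give the 500 evaluations mentioned above.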

Table 2 Classification models


Model evaluation

All the models were evaluated using the test data set. To evaluate the classification results, we used common measures for machine learning models: accuracy, precision, recall, and F-measure. For medical data, recall is the more appropriate metric, since the most critical problem in predicting diagnoses is the false negative (the disease exists but is not detected).
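The four measures follow directly from the confusion matrix counts. The sketch below uses made-up counts purely for illustration; they are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 score, and accuracy from
    true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # share of actual cases that are detected
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(metrics)
```

Recall is the only one of these measures that is driven by the number of missed cases (`fn`), which is why it is singled out for diagnostic prediction.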

We performed all the experiments with all the classifiers on the full dataset (with comorbidities) and reduced dataset (without comorbidities) to measure the influence of comorbidities on the resulting performance of the classifiers.

Results interpretation

For the interpretation of the results, we employed LIME [23], an explanation technique that explains the predictions of each classifier. We applied it to all classifiers we experimented with except the linear regression classifier, since it already has the exact solution for what the method is trying to approximate. Every classifier was trained 5 times using randomly drawn 80% of all the patients, then the predicted outcomes for the remaining 20% of patients were produced, after which each prediction was interpreted by LIME one-by-one. For every prediction, LIME produced a set of linear model coefficients corresponding to the variables, which can be interpreted as a set of variable contributions (given that the input was normalized).
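The idea behind such a local explanation can be illustrated with a simplified sketch. This is not the LIME package and omits much of what it does (sampling from the training distribution, discretization, feature selection); it only shows the core step of fitting a proximity-weighted linear model around one prediction. All names and default values are hypothetical.

```python
import numpy as np


def local_linear_explanation(predict_fn, instance, n_samples=2000,
                             scale=1.0, kernel_width=1.0, seed=0):
    """Fit a proximity-weighted linear surrogate around one instance and
    return one coefficient per input variable."""
    rng = np.random.default_rng(seed)
    instance = np.asarray(instance, dtype=float)
    # Sample perturbed points around the instance.
    noise = rng.normal(0.0, scale, size=(n_samples, instance.size))
    perturbed = instance + noise
    targets = predict_fn(perturbed)
    # Weight each sample by its proximity to the instance.
    distances = np.linalg.norm(noise, axis=1)
    weights = np.exp(-(distances ** 2) / (kernel_width ** 2))
    # Weighted least squares: scale rows by sqrt(weight), then solve.
    design = np.hstack([noise, np.ones((n_samples, 1))])
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(
        design * root_w[:, None], targets * root_w, rcond=None
    )
    return coef[:-1]  # drop the intercept


# A known linear "black box" makes the result easy to check.
true_weights = np.array([2.0, -1.0, 0.0])
contributions = local_linear_explanation(
    lambda X: X @ true_weights + 0.5, instance=[1.0, 1.0, 1.0]
)
print(np.round(contributions, 6))
```

For a model that is already linear, the surrogate recovers the model's own coefficients, which is consistent with the reason given above for not applying LIME to the linear classifier.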

The interpretation was carried out by two independent experts in the field of vascular disease and neurology, as well as in the field of endocrine diseases. The interpretation method included validation of the results and exclusion of invalid values. Further, by cross-comparison, conflicting elements were excluded if at least one expert considered it contradictory. After that, the conclusions of both experts were combined.

Results

Data preprocessing

After the patients with an insufficient amount of data were removed, the dataset contained 5425 patients, of whom 2342 (43.17%) developed polyneuropathy and 3083 (56.83%) did not. Combining the three ways of preprocessing the time series with the two approaches to handling missing data resulted in six datasets to which we applied our methods. The summary of these datasets is presented in Table 3.

Table 3 Properties of datasets after preprocessing


Subclasses in diabetes population

As a result of clustering using T-SNE, 6 subclasses were identified (Fig. 2, Table 4).

Table 4 Cluster description


There are two large clusters (Cluster #0 and Cluster #3) with opposite polyneuropathy rates, namely the largest and the smallest. Cluster #0 has an increased rate of polyneuropathy: 59%. In cluster #3, by contrast, the rate of polyneuropathy is 0%. Both clusters are female. In male clusters #1 and #2, the percentage is close to the average of 40%. Cluster #4 is slightly below average. Cluster #5 is characterized by low polyneuropathy rates. Considering the age distribution in the clusters, polyneuropathy in most cases occurs in patients who are 45 years or older. Polyneuropathy was observed only in clusters #0, 1, 2, and 5. No polyneuropathy was observed in clusters 3 and 4. This conclusion supports the validity of the results as they correlate with clinical practices. In clusters 0, 1, and 2, patients are about the same age. The peculiarity of cluster 1 is that there is an outlier on the left and that it is male, in contrast to cluster 0. Cluster 3 patients are men who are younger than the patients in clusters 0, 1, and 2. Cluster 5 consists mostly of men of different ages with a low rate of polyneuropathy. It is evident that female gender and an age of more than 50 years are risk factors for the development of polyneuropathy in patients with diabetes mellitus.

Correlation analysis and feature importance

Figure 3a shows the feature importance of the decision tree trained in the experiments described below. Figure 3b shows an example of a correlation matrix for selected patient parameters and polyneuropathy.

The most significant features are the levels of neutrophils and glucose level in blood and urine. The correlation matrix shows the high correlation of MPV, RBC, and HGB with polyneuropathy.

Results of sensitivity analysis and grid search for the classification model

This section presents the results of the sensitivity analysis and model identification. The selected predictive models were trained and compared according to the methodology described earlier in the Methods section. The key performance indicators for the validation of the best selected options are presented in Table 5. Further subsections provide more details for each of the models.

Table 5 Performance of the classifiers without comorbidities


Results of artificial neural network classifier and sensitivity analysis

As stated before, sensitivity analysis for the ANN consists of determining the optimal hidden layer configuration. We decided to experiment only with single-hidden-layer networks, since a single hidden layer is commonly considered sufficient for simple machine learning tasks, which was confirmed by a number of experiments. For each of our datasets, we ran 10 × 5-fold cross-validation for a varying number of nodes in the hidden layer of the network to determine the optimal value for each preprocessing and filtering option. Figures A3.1–A3.6 in Additional file 3 present the results of the process.

According to the figures, the method’s performance grows as the number of nodes in the hidden layer increases up to a certain critical value, after which performance remains relatively stable. The resulting curves are presented in Figures A3.5–A3.12 in Additional file 3.
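A sweep over the hidden layer size of a single-hidden-layer network can be sketched with scikit-learn as below. The data are synthetic, the candidate sizes and the scaling step are assumptions, and a single 5-fold run is used instead of the 10 repeats reported above to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one preprocessed dataset.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_f1_by_size = {}
for n_nodes in (2, 8, 32):  # hypothetical candidate sizes
    model = make_pipeline(
        StandardScaler(),
        # One hidden layer with n_nodes units.
        MLPClassifier(hidden_layer_sizes=(n_nodes,), max_iter=1000,
                      random_state=0),
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    mean_f1_by_size[n_nodes] = scores.mean()

print(mean_f1_by_size)
```

Plotting the mean score against the number of nodes gives a curve of the kind referred to in the additional file.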

Results of SVM

As mentioned before, we applied the SVM classifier with a linear kernel to all six preprocessed datasets in the setting of 100 × 5-fold cross-validation. The results of the analysis are presented in Table A1.1 in Additional file 1, while the ROC curves can be found in Figures A1.1–A1.6 in Additional file 1.

Decision tree

The classifier was applied to all datasets with 4 different depth limits: 2, 4, 8, and 16, with lower and upper limits being empirically determined to be too low and too high respectively for each dataset. The results of the analysis are presented in Table A2.1 in Additional file 2. Figures A2.1–A2.6 in Additional file 2 present ROC curves for each dataset and for each tree depth limit.

When using the decision tree classifier, the first preprocessing option (replacing time series with their last values) seems to be showing the best results (highest F1 score and overall accuracy). However, this preprocessing technique is the least valuable option for application in a real clinical setting, since the value sufficient for correct discrimination might be generated too late to be of any practical use. Depth limit of 4 usually exhibits the best performance with a single exception of the last preprocessing option.
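The comparison of depth limits can be sketched as a small grid search. The data are synthetic and the F1 scoring and fold settings are assumptions made for illustration; only the four depth values come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one preprocessed dataset.
X, y = make_classification(
    n_samples=400, n_features=31, n_informative=8, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16]},  # depth limits from the text
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

On real data, the per-depth scores in `search.cv_results_` would be compared across the six preprocessing options.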

Influence of comorbidities

The final performance measurements are presented in Table 6. All the classifiers were applied to the dataset with ANN showing the best results.

Table 6 Performance evaluation with comorbidities


According to the table, ANN classifier produces the highest recall, F1 score, and accuracy of prediction, while SVM classifier shows the highest precision. Recall is a much more valuable measure for medical predictions, since in clinical settings it is much more important to detect as many patients having the condition as possible, rather than detecting some of them with higher precision. This means that ANN classifier is the most relevant for prediction of polyneuropathy.

Discussion

We conducted an exploratory study of real-world evidence on the risks of polyneuropathy development. Although we considered polyneuropathy endpoint events, we tried to draw more general conclusions about which features are important for the prediction. For example, as the study revealed, blood sugar is one of the most important features. This real-world evidence can be used to set up a study on, for example, the influence of portable and wearable devices on polyneuropathy risk management.