Risk factor mining and prediction of urine protein progression in chronic kidney disease: a machine learning-based study

Background Chronic kidney disease (CKD) is a global public health concern. To provide timely intervention for non-hospitalized high-risk patients and to allocate limited clinical resources rationally, it is important to mine the key risk factors when designing a CKD prediction model. Methods This study included data from 1,358 patients with CKD that was pathologically confirmed at Zhongshan Hospital between December 2017 and September 2020. A machine learning-based framework for CKD prediction and interpretation was proposed. From among 100 variables, 17 were selected for model construction through recursive feature elimination with logistic regression (RFE-LR). Several machine learning classifiers, including extreme gradient boosting, Gaussian naive Bayes, a neural network, ridge regression, and logistic regression (LR), were trained, and an ensemble model was developed to predict 24-hour urine protein. The detailed relationship between the risk of CKD progression and these predictors was determined using a global interpretation, and a patient-specific analysis was conducted using a local interpretation. Results Among the single machine learning models, LR achieved the best performance, with an area under the curve (AUC) of 0.850. The ensemble model constructed using the voting integration method further improved the AUC to 0.856. The major predictors of moderate-to-severe disease were lower levels of 25-hydroxyvitamin D, albumin, and transferrin, male sex, and higher levels of cystatin C. Conclusions Compared with the single clinical indicators of kidney function (eGFR, Scr), the machine learning model proposed in this study improved the prediction accuracy of CKD progression by 17.6% and 24.6%, respectively, and improved the AUC by 0.250 and 0.236, respectively. Our framework can achieve a good predictive interpretation and provide effective clinical decision support.
Supplementary Information The online version contains supplementary material available at 10.1186/s12911-023-02269-2.

Yufei Lu 1†, Yichun Ning 1†, Yang Li 1, Bowen Zhu 1, Jian Zhang 1, Yan Yang 1, Weize Chen 1, Zhixin Yan 1, Annan Chen 1, Bo Shen 1, Yi Fang 1, Dong Wang 2*, Nana Song 1* and Xiaoqiang Ding 1*

Background
Chronic kidney disease (CKD) affects 5-10% of the global population and is the leading cause of catastrophic health expenditure. It has therefore become a major global public health problem [1]. Furthermore, CKD is projected to become the fifth leading cause of death worldwide by 2040. The compensatory effects of the kidneys make the monitoring of CKD difficult [2]. Clinicians have made significant efforts to determine the key factors that can delay the progression of CKD [3]. Therefore, a risk prediction model for monitoring such progression would be an economical and effective tool [4][5][6].
With the loss of renal function in CKD patients, the interval between follow-ups recommended by nephrologists becomes shorter, which makes the 24-h urine protein test a heavier medical burden [7]. This burden can be effectively reduced through follow-ups and less time-consuming inspections. Patients with a 24-h urinary protein content less than 1 g/24 h are classified as low-risk, and outpatient follow-up is considered the main treatment. Patients with a 24-h urinary protein content higher than 1 g/24 h are classified as high-risk and assigned to centralized in-hospital management. However, the 24-h quantitative urine protein detection process is complex, involving a lengthy measurement cycle, a high patient-compliance requirement, and numerous influencing factors. We are therefore committed to the development of a simple and rapid method to replace the traditional approach.
Compared with traditional scale-based scoring, machine learning (ML) models are widely used in interdisciplinary fields owing to their efficiency, accuracy, and reproducibility. Moreover, they demonstrate significant potential for disease prediction [8]. In a comparison with six other machine learning models, Lee et al. achieved an excellent performance when applying a gradient boosting model to malaria prediction [9]. The results of Huang et al. showed that random forest can effectively predict stroke incidence in adult patients with hypertension [10]. The application of machine learning in the field of kidney disease has long been a topic of interest. Various functional methods have been developed for purposes such as predicting the survival rate of dialysis patients [11] and early screening of CKD [12]. Although considerable progress has been made, achieving both good predictability and good interpretability remains a considerable challenge. Existing risk prediction models primarily focus on identifying risk factors, and further investigations into the detailed relationship between high-risk factors and CKD risk have rarely been reported.
In current medical studies, new prognostic indicators and their clinical interpretation have received an increasing amount of attention. The screening of such potential clinical indicators has become an important problem. Therefore, several novel feature reduction algorithms have been proposed, including a novel feature reduction (NFR) model [13], an advanced hybrid ensemble gain ratio feature selection (AHEGFS) model [14], and a bioinspired ensemble feature selection (BEFS) model [15]. Meanwhile, the Shapley additive explanations (SHAP) algorithm has enabled notable discoveries as an interpretability technique in the medical field. SHAP is a method introduced by Lundberg and Lee in 2017 for explaining the predictions of ML models using SHAP values. The key idea of SHAP is to compute SHAP values for each feature of the sample to be explained, in order to estimate the total effect, main effects, and interaction effects of the variables [16]. Zhao et al. first identified mechanical ventilation and pressure support ventilation as the most important predictive features of extubation failure in intensive care units based on SHAP values [17]. Tseng et al. used SHAP to identify important risk factors in acute kidney injury that were ignored by traditional risk scoring models, including intraoperative urine output, IV fluid infusion, blood product transfusion, and dynamic changes in hemodynamics [18]. SHAP interpreters are used to provide a personalized assessment and interpretation of models from both global and local perspectives, ensuring the reliability of prediction results and providing more evidence for solving clinical problems.
Herein, we describe a study conducted with patients having CKD and report a method for CKD prediction and interpretation. First, recursive feature elimination with logistic regression (RFE-LR) was used to identify the risk factors for the progression of kidney disease. Second, based on the random forest (RF) algorithm and voting integration method combined with logistic regression, a risk stratification system for CKD was developed. Finally, the SHAP method was used to explain the prediction model in order to support clinical practice and ensure the reliability of the results.

Materials and methods
The study protocol (Fig. 1) received ethical approval from the Ethics Committee of Zhongshan Hospital. The study was conducted in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects, and the national research regulations. Considering the retrospective nature of this study, informed consent was waived by the Ethics Committee of Zhongshan Hospital.

Study participants
From the database, we retrospectively selected 1,358 patients with pathologically confirmed CKD from December 2017 to September 2020. Patients younger than 18 years and those who underwent kidney transplantation or dialysis or had a diagnosis of hereditary hyperuricemia, severe cardiopulmonary dysfunction, infection, tumor blood disease, shock, or hyperparathyroidism were excluded from all analyses.
We collected treatment data and then retrospectively extracted the clinical characteristics, such as demographics, routine blood tests, blood biochemistry, and blood immunity of the patients from electronic medical records and entered them into our structured database.

Study outcome
In our study, the prediction targets are represented in binary form (0 = negative, 1 = positive). The outcome of the present study was the status of 24-h urinary protein, which was judged based on whether the urine protein level was lower or higher than 1 g/24 h, defined as mild (negative) or advanced (positive), respectively.
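
As a minimal sketch of this binarization, the snippet below maps a 24-h urinary protein value to the outcome label. The function name and the handling of a value of exactly 1 g/24 h (treated here as negative) are assumptions, because the text specifies only "lower or higher than 1 g/24 h".

```python
THRESHOLD_G_PER_24H = 1.0  # cut-off stated in the text


def label_outcome(urine_protein_g_per_24h: float) -> int:
    """Return 1 (advanced, positive) above the cut-off, else 0 (mild, negative).

    Assumption: a value exactly at the cut-off is labelled negative; the
    source text does not define this boundary case.
    """
    return int(urine_protein_g_per_24h > THRESHOLD_G_PER_24H)


# Three illustrative (made-up) measurements in g/24 h.
labels = [label_outcome(value) for value in (0.3, 1.0, 2.4)]
```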

Data construction and feature selection
We collected 100 easily obtainable clinical features from our database. The proportion of missing values for all features was < 10%. Missing categorical data were filled in based on the mode, and missing values of continuous features were imputed using RF [19]. The categorical features were then transformed into binary dummy variables. The dataset was randomly divided into a training cohort (80%) and an independent test cohort (20%), and the synthetic minority oversampling technique (SMOTE) was used on the training set to balance the dataset. To identify whether any subset of the features can achieve a better discrimination than the initial set of features and to determine the informative characteristic variables (features) in the prediction of CKD, the RFE-LR and least absolute shrinkage and selection operator (LASSO) algorithms were used.
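
The split, balancing, and screening steps described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the data are synthetic (make_classification), the split is stratified although the text says only "randomly divided", features are standardized before LR, the RFE step size of 5 is arbitrary, and smote_like_oversample is a simplified hand-written stand-in for SMOTE rather than the imbalanced-learn implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE: create minority rows by interpolating each sampled
    minority row toward one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    X_min = X[y == minority]
    # k + 1 neighbours because each row is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in for the cohort: 1,358 rows, 100 features, imbalanced.
X, y = make_classification(n_samples=1358, n_features=100, n_informative=17,
                           weights=[0.33, 0.67], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fit the scaler on the training cohort only, then balance the training set.
scaler = StandardScaler().fit(X_train)
X_bal, y_bal = smote_like_oversample(scaler.transform(X_train), y_train)

# RFE-LR: repeatedly drop the features with the smallest LR coefficients.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=17, step=5)
selector.fit(X_bal, y_bal)
selected = np.flatnonzero(selector.support_)
```

Oversampling is applied only after the split, so that synthetic rows never leak into the test cohort.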

Model development and assessment
The analyses were run on macOS Monterey (Apple M1 Pro) with 16 GB of memory, using Python v3.10 and the scikit-learn (sklearn) v1.1.1 machine learning library as the main analysis tools. The model development included trials using several different machine learning classifiers, such as extreme gradient boosting (XGBoost) models [20], Gaussian naive Bayes (GNB) [21], a neural network (NN) [22], ridge regression [23], and linear model logistic regression (LR) [24]. A brief description of these algorithms is provided in the model establishment and brief illustrations section (Additional file 1). We trained the models using a stratified k-fold cross-validation (k = 10) applied to the training cohort, and determined the best hyperparameter combinations through a grid search approach.
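
A stratified 10-fold grid search of this kind might look like the sketch below for the LR classifier. The synthetic data, the standardization step, and the grid of C values are assumptions chosen for illustration; the hyperparameter grids actually searched are those reported by the authors in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training cohort with 17 selected features.
X_train, y_train = make_classification(n_samples=400, n_features=17,
                                       n_informative=8, random_state=0)

# Scaling lives inside the pipeline so it is re-fitted within every fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
best_cv_auc = search.best_score_
```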
To quantify the discriminative capabilities of the model, we plotted the receiver operating characteristic (ROC) and precision-recall curves based on a confusion matrix, and then calculated the area under the ROC curve (AUC), which was used as the main metric to assess the model performance. Furthermore, the sensitivity, specificity, accuracy, average precision, and execution time were used to evaluate the model performance from multiple perspectives. The calculation principles of these assessment indicators are described in the performance metrics section (Additional file 1). In addition, we adopted a soft voting ensemble model by integrating the two models with the best AUC.
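
To make these metrics and the soft voting step concrete, the following dependency-free sketch computes sensitivity, specificity, accuracy, and the AUC, and averages the predicted probabilities of two classifiers. All probabilities are made-up toy values, and equal voting weights and a 0.5 decision threshold are assumptions; the text does not state the weights or threshold used.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def summary_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_vote(prob_a, prob_b):
    """Soft voting with equal weights: average the positive-class probabilities."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


# Toy positive-class probabilities from two hypothetical classifiers.
y_true = [1, 1, 1, 0, 0, 0]
lr_prob = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]
xgb_prob = [0.8, 0.3, 0.7, 0.2, 0.6, 0.1]

ensemble_prob = soft_vote(lr_prob, xgb_prob)
ensemble_pred = [int(p >= 0.5) for p in ensemble_prob]
metrics = summary_metrics(y_true, ensemble_pred)
aucs = {
    "lr": roc_auc(y_true, lr_prob),
    "xgb": roc_auc(y_true, xgb_prob),
    "ensemble": roc_auc(y_true, ensemble_prob),
}
```

In this toy case each classifier misranks a different pair of patients, so averaging the probabilities removes both errors; real data need not behave this way.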

Feature interpretation
Feature importance refers to the extent to which the elimination of feature information increases the model error, which provides a highly compressed global insight into the behavior of the model. We computed the SHAP values to evaluate the correctness of the feature interpretation in the best-performing model and explain the global interpretations of each feature contribution to the risk of CKD.
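
For intuition about what a SHAP value measures, the sketch below computes exact Shapley values by enumerating all feature coalitions, with features outside a coalition held at a baseline value. This is a simplified illustration and not the kernel-based approximation used in the study; the three-feature log-odds model, its weights, the patient values, and the baseline are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x relative to a baseline input."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Coalition members take the patient's value; every other
                # feature stays at the baseline value.
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


# Hypothetical log-odds model over (ALB in g/L, CYSC in mg/L, 25OHD in nmol/L).
WEIGHTS = (-0.10, 0.80, -0.02)


def log_odds(features):
    return 0.5 + sum(w * v for w, v in zip(WEIGHTS, features))


patient = (30.0, 2.5, 25.0)      # made-up patient
cohort_mean = (40.0, 1.2, 50.0)  # made-up baseline
phi = shapley_values(log_odds, patient, cohort_mean)
```

For a linear model each value reduces to the weight times the deviation from the baseline, and the values sum to the difference between the patient's prediction and the baseline prediction, which is the additivity property that SHAP plots rely on.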

Patients and clinical characteristics
In the final cohort, we reviewed the medical records of 1,358 patients with CKD who underwent treatment at Zhongshan Hospital from December 2017 to September 2020. The mean age was 51.12 ± 16.09 years, and 910 (67.01%) of the patients were male. A total of 906 (68%) and 452 (32%) subjects were classified as patients having advanced (positive) and mild (negative) CKD, respectively. In addition, after applying the SMOTE algorithm to balance the training set, 364 negative samples were synthesized through oversampling, and thus the class ratio of the final training dataset was 1:1 (725 samples in each class). The estimated glomerular filtration rate (eGFR) was calculated using the MDRD formula. The proportions of missing values of the included clinical features were all < 10%. After data preprocessing, 100 complete clinical variables were used as predictive variables, the baseline characteristics of which are shown in Table 1.
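
For readers unfamiliar with the MDRD calculation, a minimal sketch is given below. It assumes the IDMS-traceable four-variable MDRD Study equation with serum creatinine in mg/dL; the text does not state which MDRD variant or creatinine unit was used, so the coefficients and the function signature here are assumptions rather than the authors' exact computation.

```python
def egfr_mdrd(scr_mg_dl: float, age_years: float, female: bool,
              black: bool = False) -> float:
    """eGFR in mL/min/1.73 m^2 from the four-variable MDRD Study equation.

    Assumption: IDMS-traceable form (constant 175) with serum creatinine
    in mg/dL; the source does not specify the variant it used.
    """
    egfr = 175.0 * scr_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr


# Illustrative values only: serum creatinine 1.0 mg/dL at 50 years of age.
egfr_male = egfr_mdrd(1.0, 50, female=False)
egfr_female = egfr_mdrd(1.0, 50, female=True)
```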

Feature selection
After imputation, we compared the results of the model construction without feature screening and with RFE-LR and LASSO feature screening, using the AUC as the main evaluation index of the model. In the model construction results without feature screening, the highest AUC was 0.833 (Table S3). A total of 21 feature indexes were obtained through LASSO feature screening based on the optimal penalty parameter λ (0.035) using the 1 − standard error (SE) criterion (Figure S1), which achieved the highest AUC of 0.828 (Table S4). In the results of the RFE-LR feature screening, the performance of the model was significantly improved when 17 variables were used (Fig. 2a), and the model showed overfitting with a further increase in the number of variables. The highest AUC was 0.85 after RFE-LR feature screening (Table 2). In brief, the RFE-LR algorithm reduced the number of feature variables to 17, which achieved a higher accuracy and AUC than using all features, with improvements of 3.3% and 0.017, respectively. Based on the results of the AUC comparison, we conducted the follow-up study using the results of RFE-LR. We then used these 17 variables for subsequent model building: gender, total protein (TP), albumin (ALB), serum protein electrophoresis-albumin (SPE-ALB), serum protein electrophoresis-alpha2 (SPE-alpha2), serum protein electrophoresis-beta (SPE-beta), eGFR, cystatin C (CYSC), uric acid (UA), glycated albumin (GA), non-high-density lipoprotein (N-HDL), apolipoprotein A-I (APO-A-I), creatine phosphokinase (CPK), retinol-binding protein (RBP), transferrin (TRF), lambda light chain (LAM), and 25-hydroxyvitamin D (25OHD).
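
The 1 − SE criterion used for the LASSO penalty can be sketched as follows: among all candidate penalties whose mean cross-validated error lies within one standard error of the minimum, the largest (most parsimonious) one is chosen. The function name and the error values below are hypothetical and are not taken from Figure S1.

```python
def lambda_one_se(lambdas, mean_errors, standard_errors):
    """Largest penalty whose mean CV error is within one standard error
    of the minimum mean CV error."""
    best = min(range(len(lambdas)), key=lambda i: mean_errors[i])
    limit = mean_errors[best] + standard_errors[best]
    eligible = [lam for lam, err in zip(lambdas, mean_errors) if err <= limit]
    return max(eligible)


# Made-up cross-validation summary for four candidate penalties.
lambdas = [0.005, 0.02, 0.05, 0.2]
mean_errors = [0.41, 0.38, 0.39, 0.45]
standard_errors = [0.02, 0.02, 0.02, 0.02]

lambda_min = lambdas[mean_errors.index(min(mean_errors))]
lambda_1se = lambda_one_se(lambdas, mean_errors, standard_errors)
```

Choosing the larger penalty trades a statistically negligible loss in error for a sparser set of retained features.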

Model comparison
The results of the hyperparameter tuning are summarized first. Before adjusting the hyperparameters, the LR model achieved the highest AUC (0.839) (Table S5). Four machine learning models were constructed based on the best hyperparameter combinations of the algorithms (Table 2). The confusion matrices are summarized in Table 3, where XGBoost produced the minimum number of false positives (15) and LR produced the maximum number of true positives (153). As can be seen from Table 4 and Fig. 2b and c, LR achieved the best AUC (0.850) among the single machine learning models. The ensemble model constructed using the voting ensemble method further improved the predictive power and achieved the highest performance (AUC: 0.856). The model with the best sensitivity was LR (0.845). The specificity values of XGBoost, NN, and the traditional creatinine (Cre) indicator were all above 0.8, whereas the sensitivity of Cre was low (0.392).
When compared with the pre-existing single renal function evaluation indices (eGFR, Scr), the prediction performance of machine learning for the progression of CKD was significantly improved (Table 4). In addition, we compared the running time of the different machine learning models under the same hardware conditions. As shown in Table S6, there is little difference among the models on the test cohort. However, when training on the training cohort with each hyperparameter combination, GNB had the fastest and XGBoost the slowest execution time.

Most important predictors of CKD risk
To identify the features influencing the model and their impact on the risk of CKD as a way to support clinical decision-making, a particular variant of SHAP for kernel-based explainers was used for the ensemble model. Features other than Scr and eGFR were discussed to highlight those that may need to be closely monitored. As shown in Fig. 3, lower levels of 25OHD, ALB, and transferrin (TRF), male sex, and higher levels of CYSC were the major predictors of moderate-to-high severity. In addition, to obtain the exact form of the relationship, SHAP-dependence plots (Fig. 4) were employed. The point at which the SHAP value crosses zero is regarded as the cut-off point, from which the critical value of each feature can be read.
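
One simple way to read such a cut-off from a dependence plot is to locate where the SHAP value changes sign along the feature axis. The sketch below does this by linear interpolation between neighbouring points; it assumes one (already smoothed) SHAP value per feature value, whereas real dependence plots are scattered, and the albumin and SHAP numbers are made up for illustration.

```python
def zero_crossings(feature_values, shap_values):
    """Feature values at which the SHAP value changes sign, found by linear
    interpolation between neighbouring points sorted by feature value."""
    pairs = sorted(zip(feature_values, shap_values))
    crossings = []
    for (x0, s0), (x1, s1) in zip(pairs, pairs[1:]):
        if s0 * s1 < 0:  # sign change between the two neighbouring points
            crossings.append(x0 + (0.0 - s0) * (x1 - x0) / (s1 - s0))
    return crossings


# Hypothetical smoothed dependence curve for serum albumin (g/L).
albumin = [30.0, 34.0, 38.0, 42.0]
shap_albumin = [0.6, 0.2, -0.2, -0.5]

cutoffs = zero_crossings(albumin, shap_albumin)
```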
According to the results, 25OHD levels lower than 30 nmol/L indicate a moderate or even severe loss of renal function. In addition, when the 25OHD level was higher than 75 nmol/L, the individual differences increased. A decrease in serum ALB level predicts an increase in the risk of CKD. ALB levels below 37 g/L were correlated with a positive predictive value. We also found that the accumulation of CYSC indicates an increased risk of CKD; that is, when the CYSC level is higher than 2 mg/L, the same level of CYSC accounts for a greater difference among the patients. In addition, a higher glycated albumin (GA) level (%) indicates an increased risk of CKD. The results also illustrate the tendency of CKD risk when eGFR levels decrease. An eGFR level below 60 mL/min/1.73 m² is correlated with a positive predictive value. Within the range of 1.5-2.0 g/L, TRF changes slightly, whereas SHAP increases sharply, which shows that attention should be paid to changes in the TRF. Such analyses can help clinicians understand the results of potential

Discussion
The 24-h urine protein test has stringent patient compliance requirements and difficult follow-up procedures. The use of routine laboratory biochemical tests to replace 24-h urine protein quantification will improve convenience for outpatients and follow-up patients. In this retrospective cohort study, we developed machine learning algorithms using 100 easily obtainable clinical features for predicting CKD based on the severity of proteinuria (Fig. 5). Some studies have shown that changes in proteinuria are significantly associated with certain kidney function metrics, including a doubling of serum creatinine levels, rapid eGFR decline, and progression to end-stage kidney disease [25][26][27]. However, the detection of 24-h proteinuria is difficult owing to several factors, such as better applicability to hospitalized patients than to outpatients, poor patient compliance, and increased medical pressure. In the present study, the linear LR model exhibited the best AUC performance for single-model prediction, whereas the ensemble model (LR + XGBoost) exhibited the best AUC (0.856) among all models considered, with balanced specificity and sensitivity. Model fusion technology is therefore suitable for clinical decision support. Owing to the diversity of the available data and an adequate AUC performance, it can be concluded that the results of this study are informative for the rapid diagnostic identification of patients with CKD, with the mining of key risk factors contributing to subsequent treatment.

Artificial intelligence is being increasingly used in the medical field to predict various outcomes. Several longitudinal studies involving CKD have reported progress regarding the use of machine learning algorithms in CKD prediction. A survey by Huang et al. showed that 125 metabolites and 14 clinical variables can be used as predictors to establish a CKD prediction model for patients with type 2 diabetes (AUC = 0.857) [28]. Rashed-Al-Mahfuz et al. developed five models for predicting CKD using low-cost diagnostic screening. The RF accurately predicted (at a rate of 99.5%) patients at risk of CKD, but this high predictive power may be due to overfitting caused by the small amount of data [29]. Ferguson et al. also used routinely collected laboratory data and machine learning models to identify those at high risk of developing advanced CKD within the next 5 years [30]. However, none of these studies can provide personalized information for individual patients, thus hindering the ability of predictive models to support decision-making in clinical settings. This study provides a comprehensive framework for combining the predictive accuracy of CKD risk with interpretable results for the important characteristics of individual patients. Interestingly, consistent with the research by Xiao et al. on the use of proteinuria as a standard for CKD [31], the linear model achieved the best prediction performance among the multiple models; the ML model fusion used in this study can further improve the model AUC, which suggests that the model fusion scheme has practical potential. Similarly, some common factors such as ALB, TP, and eGFR have been found to be significantly related to CKD progression. More details about the above-mentioned studies are shown in Table S7. Notably, our outcome differs from that of most existing reports: we used 24-hour urinary protein as the outcome, whereas others relied mostly on eGFR, for which systemic changes in tubular creatinine secretion and extrarenal creatinine clearance could bias the results. Routinely, 24-hour urinary protein quantification is the gold standard for assessing the severity of CKD, but few studies have stratified the risk of CKD with 24-hour urinary protein as the outcome, resulting in limited comparable studies; this may be due to the difficulty in obtaining the results of 24-hour urinary protein quantification.

In CKD, the combination of a limited vitamin D intake and a reduced renal capacity to activate 25(OH)D into 1,25(OH)2D leads to a progressive vitamin D deficiency [32]. Additionally, this study found that patients with 25OHD in the 28-35 nmol/L range require close monitoring to delay the progression of CKD. A close and regular review of 25OHD in such patients is recommended in clinical practice. However, further analysis of the actual health status of the patient is required to determine whether the dosing schedule of vitamin D can be adjusted. ALB was determined to be of the next highest importance based on the SHAP values; lower ALB levels are associated with the loss of kidney function. ALB accounts for approximately 60% of the total serum protein content, maintains colloidal osmotic pressure, and binds a variety of compounds under physiological conditions [33]. The glomerular filtration barrier prevents ALB from entering the ultrafiltrate. However, under the pathological condition of CKD, an increase in the effective radius of the barrier leads to protein loss, which further leads to a decrease in serum albumin levels [34]. The ALB level at the point where the SHAP value is zero was approximately 36-37 g/L. This means that for patients with reduced renal function, even though the reference range for ALB is 35-55 g/L, ALB levels below 37 g/L may indicate moderate-to-severe renal impairment and require closer monitoring. Although the production rate of CYSC is more stable and its internal variability is smaller than that of Scr, there have been fewer studies on the renal function marker CYSC, which is a low-molecular-weight protein produced by nucleated cells at a constant rate that acts as an inhibitor of lysosomal cysteine proteases [35]. A recent meta-analysis showed similar findings; in particular, CYSC has a stronger correlation with renal function than Scr. We speculate that the underlying explanation is that CYSC, unlike Scr, is not affected by muscle mass [36].

The interpretation based on the SHAP value is model-independent; that is, the SHAP value can be applied to different models. Therefore, although this research focused on CKD, the framework can be easily extended to the risk prediction and interpretation of other diseases to better support clinical decision-making.
Overall, in this study, an integrated framework for CKD risk prediction and interpretation is proposed to provide clinicians with decision support and model interpretation. Specifically, an integrated algorithm was developed to achieve a good prediction performance on the CKD dataset. While accurately predicting high-risk patients, it also achieves a strong interpretability for specific indicators. This study has certain limitations. First, this is a single-center retrospective study, and there may be variations in the clinical characteristics of the data across different regions. Therefore, to assess the generalizability of the model, the conclusions drawn from this study need to be validated in external cohorts. Second, this study only considered the correlation between predictive factors and CKD, without considering causality. Third, this study used conventional feature selection models, and the application of more recent advanced techniques such as NFR, AHEGFS, and BEFS may help identify more reliable CKD progression risk factors. Finally, the dataset used in this study only included blood-related indicators and ignored medical prescriptions and imaging examinations.

Fig. 5 Overall summary of the study. Using common clinical variables, machine learning-based approaches can effectively predict and explain the progression of CKD. Furthermore, decision support is provided for early intervention, and medical resource allocation is given for outpatients and those requiring a follow-up

Conclusions
In conclusion, we developed a machine learning model for predicting CKD based on proteinuria severity. The experimental results show that constructing a predictive interpretation framework can lead to a good predictive interpretation and provide effective clinical decision support. Another essential value is in providing new clinical insights for the management of patients requiring follow-up examinations for different diseases in large hospitals.

Fig. 1 Chronic kidney disease (CKD) prediction and decision support framework. A total of 1,358 patients were included in this study, with 100 clinical variables applied. The data were divided into training (80%) and validation (20%) sets. The model was trained using k-fold cross-validation (k = 10), and a grid search was conducted to determine the best parameter combinations

Fig. 2 Screening of predictors and evaluation of models. (a) RFE-LR used to examine whether any subset of the input features can achieve a better discrimination than the initial set of features. (b) ROC curves of different models on the validation sets. (c) Precision-recall (PR) curves of different models on the validation sets

Fig. 3 SHAP summary plot of the top-17 features of the ensemble model. The abscissa is the SHAP value, which represents the impact on the model output. The ordinate lists the features, with red representing larger feature values and blue indicating smaller feature values

Fig. 4 SHAP dependence plots for the ensemble model. The SHAP-dependence plot shows the effect of a single feature on the output of the ensemble prediction model. SHAP values for specific features exceeding zero represent an increased risk of CKD progression. (a-f) 25-hydroxyvitamin D, albumin, cystatin C, glycated albumin, estimated glomerular filtration rate (eGFR), transferrin, protein A1, uric acid, and total protein

Table 1
Baseline characteristics of included CKD patients
Lu et al. BMC Medical Informatics and Decision Making (2023) 23:173

Table 2
Results of hyper-parameter optimization of different machine learning algorithms

Table 3
Confusion matrices

Table 4
Performance summary