
Risk factor mining and prediction of urine protein progression in chronic kidney disease: a machine learning-based study



Chronic kidney disease (CKD) is a global public health concern. To provide timely intervention for non-hospitalized high-risk patients and to allocate limited clinical resources rationally, it is important to mine the key risk factors when designing a CKD prediction model.


This study included data from 1,358 patients with CKD pathologically confirmed at Zhongshan Hospital between December 2017 and September 2020. A machine learning-based framework for CKD prediction and interpretation was proposed. From 100 candidate variables, 17 were selected for model construction through recursive feature elimination with logistic regression (RFE-LR). Several machine learning classifiers, including extreme gradient boosting, Gaussian naive Bayes, a neural network, ridge regression, and logistic regression (LR), were trained, and an ensemble model was developed to predict the 24-hour urine protein status. The detailed relationship between the risk of CKD progression and these predictors was determined using a global interpretation, and a patient-specific analysis was conducted using a local interpretation.


The results showed that LR achieved the best performance among the single machine learning models, with an area under the curve (AUC) of 0.850. The ensemble model constructed using the voting integration method further improved the AUC to 0.856. The major predictors of moderate-to-severe disease included lower levels of 25-hydroxyvitamin D, albumin, and transferrin, male sex, and higher levels of cystatin C.


Compared with the clinical single kidney function evaluation indicators (eGFR, Scr), the machine learning model proposed in this study improved the prediction accuracy of CKD progression by 17.6% and 24.6%, respectively, and the AUC was improved by 0.250 and 0.236, respectively. Our framework can achieve a good predictive interpretation and provide effective clinical decision support.



Chronic kidney disease (CKD) affects 5–10% of the global population and is the leading cause of catastrophic health expenditure. It has therefore become a major global public health problem [1]. Furthermore, CKD is projected to become the fifth leading cause of death worldwide by 2040. The compensatory effects of the kidneys make the monitoring of CKD difficult [2]. Clinicians have made significant efforts to determine the key factors that can delay the progression of CKD [3]. Therefore, a risk prediction model for monitoring such progression would be an economical and effective tool [4,5,6].

With the loss of renal function in CKD patients, the interval between follow-ups recommended by nephrologists becomes shorter, which makes the 24-h urine protein test a heavier medical burden [7]. This burden can be effectively reduced through follow-ups and less time-consuming inspections. Patients with a 24-h urinary protein content less than 1 g/24 h are classified as low-risk, and outpatient follow-up is considered the main treatment. Patients with a 24-h urinary protein content higher than 1 g/24 h are classified as high-risk and assigned to centralized in-hospital management. However, the 24-h quantitative urine protein detection process is complex, involving a lengthy measurement cycle, high patient-compliance requirement, and numerous influencing factors. We are therefore committed to the development of a simple and rapid method to replace the traditional approach.

Compared with traditional scale-based scoring, machine learning (ML) models are widely used in interdisciplinary fields owing to their efficiency, accuracy, and reproducibility. Moreover, they demonstrate significant potential for disease prediction [8]. In a comparison with six other machine learning models, Lee et al. achieved excellent performance when applying a gradient boosting model to malaria prediction [9]. The results of Huang et al. showed that random forest can effectively predict stroke incidence in adult patients with hypertension [10]. The application of machine learning in the field of kidney disease has long been a topic of interest. Various methods have been developed for purposes such as predicting the survival rate of dialysis patients [11] and early screening of CKD [12]. Although considerable progress has been made, achieving good predictability together with good interpretability remains a considerable challenge. Existing risk prediction models primarily focus on identifying risk factors, and further investigations into the detailed relationship between high-risk factors and CKD risk have rarely been reported.

In current medical studies, new prognostic indicators and their clinical interpretation have received increasing attention, and the screening of such potential clinical indicators has become an important problem. Several novel feature reduction algorithms have therefore been proposed, including a novel feature reduction (NFR) model [13], an advanced hybrid ensemble gain ratio feature selection (AHEGFS) model [14], and a bio-inspired ensemble feature selection (BEFS) model [15]. Meanwhile, the Shapley additive explanations (SHAP) algorithm has enabled notable findings through interpretable techniques in the medical field. SHAP is a method introduced by Lundberg and Lee in 2017 for explaining the predictions of ML models. Its key idea is to compute a SHAP value for each feature of the sample to be explained, in order to estimate the total effect, main effects, and interaction effects of the variables [16]. Zhao et al. first identified mechanical ventilation and pressure support ventilation as the most important predictive features of extubation failure in intensive care units based on SHAP values [17]. Tseng et al. used SHAP to identify important risk factors in acute kidney injury that were ignored by traditional risk scoring models, including intraoperative urine output, IV fluid infusion, blood product transfusion, and dynamic changes in hemodynamics [18]. SHAP interpreters are used to provide a personalized assessment and interpretation of models from both global and local perspectives, ensuring the reliability of prediction results and providing more evidence for solving clinical problems.

Herein, we describe a study conducted with patients having CKD and report a method for CKD prediction and interpretation. First, recursive feature elimination with logistic regression (RFE-LR) was used to identify the risk factors for the progression of kidney disease. Second, based on the random forest (RF) algorithm and a voting integration method combined with logistic regression, a risk stratification system for CKD was developed. Finally, the SHAP method was used to explain the prediction model, to support clinical practice and ensure the reliability of the results.

Materials and methods

The study protocol (Fig. 1) received ethical approval from the Ethics Committee of Zhongshan Hospital. The study was conducted in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects, and the national research regulations. Considering the retrospective nature of this study, informed consent was waived by the Ethics Committee of Zhongshan Hospital.

Fig. 1

Chronic kidney disease (CKD) prediction and decision support framework. A total of 1,358 patients were included in this study, with 100 clinical variables applied. The data were divided into training (80%) and validation (20%) sets. The model was trained using k-fold cross-validation (k = 10), and a grid search was conducted to determine the best parameter combinations

Study participants

From the database, we retrospectively selected 1,358 patients with pathologically confirmed CKD from December 2017 to September 2020. Patients younger than 18 years and those who underwent kidney transplantation or dialysis or had a diagnosis of hereditary hyperuricemia, severe cardiopulmonary dysfunction, infection, tumor, blood disease, shock, or hyperparathyroidism were excluded from all analyses.

We collected treatment data and then retrospectively extracted the clinical characteristics, such as demographics, routine blood tests, blood biochemistry, and blood immunity of the patients from electronic medical records and entered them into our structured database.

Study outcome

In our study, the prediction targets are represented in binary form (0 = negative, 1 = positive). The outcome of the present study was the status of 24-h urinary protein, which was judged based on whether the urine protein level was lower or higher than 1 g/24 h, defined as mild (negative) or advanced (positive), respectively.
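The binary encoding of the outcome can be sketched in a few lines of Python. This is only an illustration: the function name `label_proteinuria` is hypothetical, and the handling of a value exactly equal to 1 g/24 h is an assumption, because the text specifies only "lower or higher than" the cut-off.

```python
# Minimal sketch of the outcome encoding; names are illustrative only.
THRESHOLD_G_PER_24H = 1.0  # cut-off stated in the text


def label_proteinuria(urine_protein_g_per_24h: float) -> int:
    """Return 1 (advanced, positive) above the cut-off, else 0 (mild, negative).

    Assumption: a value exactly equal to the cut-off is labelled 0, because
    the text does not say how ties are handled.
    """
    return int(urine_protein_g_per_24h > THRESHOLD_G_PER_24H)


if __name__ == "__main__":
    for value in (0.3, 1.0, 2.4):
        print(value, label_proteinuria(value))
```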

Data construction and feature selection

We collected 100 easily obtainable clinical features from our database. The proportion of missing values for all features was < 10%. Missing categorical values were filled in with the mode, and missing continuous values were imputed using RF [19]. The categorical features were then transformed into binary dummy variables. The dataset was randomly divided into a training cohort (80%) and an independent test cohort (20%), and the synthetic minority oversampling technique (SMOTE) was applied to the training set to balance the classes. To identify whether any subset of the features could achieve better discrimination than the initial set, and to determine the informative variables for the prediction of CKD, the RFE-LR and least absolute shrinkage and selection operator (LASSO) algorithms were used.
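The split and the RFE-LR screening described above might look roughly as follows in scikit-learn. This is a sketch under stated assumptions, not the study's code: the data are synthetic (`make_classification`), the split is stratified, features are standardized before logistic regression, and one feature is dropped per iteration; none of these details are given in the text. Imputation and SMOTE are omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the clinical table: 1,358 rows, 100 features.
X, y = make_classification(
    n_samples=1358, n_features=100, n_informative=17,
    weights=[0.33, 0.67], random_state=0,
)

# 80/20 split; stratification is an assumption, the text says only "randomly".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale with training statistics only, so nothing leaks from the test cohort.
# (Scaling is an assumption; it helps logistic regression converge.)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Recursive feature elimination around logistic regression:
# fit, drop the feature with the weakest coefficient, repeat until 17 remain.
selector = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=17,
    step=1,
)
selector.fit(X_train_s, y_train)

selected_idx = [i for i, keep in enumerate(selector.support_) if keep]
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)
```

Fitting the selector on the training cohort only keeps the test cohort independent of the feature choice.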

Model development and assessment

The analyses were run on macOS Monterey (Apple M1 Pro) with 16 GB of memory, using Python v3.10 and the scikit-learn (sklearn) v1.1.1 machine learning library as the main analysis tools. Model development included trials with several machine learning classifiers: extreme gradient boosting (XGBoost) [20], Gaussian naive Bayes (GNB) [21], a neural network (NN) [22], ridge regression [23], and logistic regression (LR) [24]. A brief description of these algorithms is provided in the model establishment and brief illustrations section (Additional file 1). We trained the models using stratified k-fold cross-validation (k = 10) applied to the training cohort, and determined the best hyperparameter combinations through a grid search.
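Stratified 10-fold cross-validation combined with a grid search can be sketched with scikit-learn's `GridSearchCV`. The data here are synthetic, and the parameter grid, the shuffling, and the use of ROC AUC as the selection score are assumptions for illustration; the actual grids are reported in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for a training cohort with 17 selected features.
X_train, y_train = make_classification(
    n_samples=400, n_features=17, random_state=0
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Hypothetical grid over the inverse regularization strength of LR.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="roc_auc",  # select the setting with the best mean CV AUC
    cv=cv,
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_auc = search.best_score_
```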

To quantify the discriminative capabilities of the model, we plotted the receiver operating characteristic (ROC) and precision–recall curves based on a confusion matrix, and then calculated the area under the ROC curve (AUC), which was used as the main metric to assess the model performance. Furthermore, the sensitivity, specificity, accuracy, average precision, and execution time were used to evaluate the model performance from multiple perspectives. The calculation principles of these assessment indicators are described in the performance metrics section (Additional file 1). In addition, we adopted a soft voting ensemble model by integrating the two models with the best AUC.
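For readers who want the metrics spelled out, the following pure-Python sketch computes sensitivity, specificity, and accuracy from confusion-matrix counts, the AUC as the probability that a positive case outscores a negative one, and a soft vote as the plain average of two models' predicted probabilities. The labels and probabilities are invented toy values, and equal weighting of the two models is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)


def roc_auc(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_vote(prob_a, prob_b):
    """Average the positive-class probabilities of two models."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]


# Invented toy data: true labels and two models' predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1]
prob_model_a = [0.90, 0.60, 0.40, 0.30, 0.55, 0.80]
prob_model_b = [0.80, 0.70, 0.70, 0.20, 0.35, 0.45]

prob_ensemble = soft_vote(prob_model_a, prob_model_b)
pred_a = [int(p >= 0.5) for p in prob_model_a]
pred_ensemble = [int(p >= 0.5) for p in prob_ensemble]

counts_a = confusion_counts(y_true, pred_a)
auc_a = roc_auc(y_true, prob_model_a)
auc_ensemble = roc_auc(y_true, prob_ensemble)
```

In this toy case the averaged probabilities separate the classes better than the first model alone, which is the effect a soft voting ensemble aims for; it is not guaranteed in general.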

Feature interpretation

Feature importance refers to the extent to which the elimination of feature information increases the model error, which provides a highly compressed global insight into the behavior of the model. We computed the SHAP values to evaluate the correctness of the feature interpretation in the best-performing model and explain the global interpretations of each feature contribution to the risk of CKD.
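To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values for a tiny model by enumerating all feature subsets. It is a simplification: absent features are replaced by a single baseline value rather than averaged over a background dataset as SHAP explainers do, and the model `toy_risk_score`, its coefficients, and the feature values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to `baseline`.

    Features outside a coalition take their baseline value (a simplification).
    The cost grows as 2**n, so this is only feasible for a few features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk_score(z):
    # Hypothetical linear score: z[0] plays the role of 25OHD (nmol/L),
    # z[1] the role of cystatin C (mg/L). Coefficients are invented.
    return 0.5 - 0.02 * z[0] + 0.3 * z[1]


baseline = [50.0, 1.0]  # hypothetical reference patient
patient = [25.0, 2.0]   # hypothetical patient to explain
phi = shapley_values(toy_risk_score, patient, baseline)
```

The values sum to the difference between the patient's score and the baseline score, which is the additivity property that lets per-feature contributions be read as a decomposition of one prediction.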


Results

Patients and clinical characteristics

In the final cohort, we reviewed the medical records of 1,358 patients with CKD who underwent treatment at Zhongshan Hospital from December 2017 to September 2020. The mean age was 51.12 ± 16.09 years, and 910 (67.01%) of the patients were male. A total of 906 (66.7%) and 452 (33.3%) subjects were classified as having advanced (positive) and mild (negative) CKD, respectively. After the SMOTE algorithm was applied to balance the training set, 364 synthetic negative samples were added, so that the class ratio of the final training dataset was 1 (725 samples in each class). The estimated glomerular filtration rate (eGFR) was calculated using the MDRD formula. The proportions of missing values of the included clinical features were all < 10%. After data preprocessing, 100 complete clinical variables were used as predictive variables, the baseline characteristics of which are shown in Table 1.
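The oversampling step can be illustrated with a SMOTE-style interpolation written in plain Python. This is a simplified sketch, not the imbalanced-learn implementation: each synthetic point lies on the segment between a randomly chosen minority sample and one of its k nearest minority neighbours. The two-dimensional data are random, k = 5 is an assumption, and the minority count of 361 is inferred from the reported figures (725 − 364).

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Random stand-in for the minority (negative) class of the training set.
data_rng = random.Random(1)
minority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(361)]
n_majority = 725

synthetic = smote_like(minority, n_new=n_majority - len(minority))
balanced_minority = minority + synthetic
```

Oversampling only the training set, as described in the text, leaves the test cohort at its natural class ratio.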

Table 1 Baseline characteristics of included CKD patients

Feature selection

After imputation, we compared model construction without feature screening against model construction with RFE-LR and LASSO feature screening, using the AUC as the main evaluation index. Without feature screening, the highest AUC was 0.833 (Table S3). LASSO feature screening yielded 21 features at the optimal penalty parameter λ (0.035), chosen using the one-standard-error (1-SE) criterion (Figure S1), and achieved a highest AUC of 0.828 (Table S4). With RFE-LR feature screening, the performance of the model was significantly improved when 17 variables were used (Fig. 2a), and the model showed overfitting as the number of variables increased further. The highest AUC after RFE-LR feature screening was 0.850 (Table 2). In brief, reducing the feature set to 17 variables with RFE-LR achieved the highest accuracy and AUC, an improvement of 3.3% and 0.017, respectively, over using all features. Based on the AUC comparison, we carried the RFE-LR results forward and used these 17 variables for subsequent model building: gender, total protein (TP), albumin (ALB), serum protein electrophoresis-albumin (SPE-ALB), serum protein electrophoresis-alpha2 (SPE-alpha2), serum protein electrophoresis-beta (SPE-beta), eGFR, cystatin C (CYSC), uric acid (UA), glycated albumin (GA), non-high-density lipoprotein (N-HDL), apolipoprotein A-I (APO-A-I), creatine phosphokinase (CPK), retinol-binding protein (RBP), transferrin (TRF), lambda light chain (LAM), and 25-hydroxyvitamin D (25OHD).
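The one-standard-error criterion used to pick the LASSO penalty can be written out as a short function: find the λ with the lowest mean cross-validated error, then take the largest λ whose mean error is still within one standard error of that minimum. The function name and the numbers below are invented for illustration and are not the values behind Figure S1.

```python
def one_se_lambda(lambdas, mean_errors, std_errors):
    """Return the largest lambda whose CV error is within 1 SE of the minimum."""
    best = min(range(len(lambdas)), key=lambda i: mean_errors[i])
    limit = mean_errors[best] + std_errors[best]
    eligible = [lam for lam, err in zip(lambdas, mean_errors) if err <= limit]
    return max(eligible)


# Invented cross-validation summary for five candidate penalties.
lambdas = [0.005, 0.01, 0.02, 0.05, 0.1]
mean_errors = [0.420, 0.405, 0.410, 0.413, 0.470]
std_errors = [0.010, 0.010, 0.010, 0.010, 0.010]

chosen = one_se_lambda(lambdas, mean_errors, std_errors)
```

A larger penalty shrinks more coefficients to zero, so the 1-SE choice trades a small, statistically negligible loss in fit for a sparser feature set.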

Fig. 2

Screening of predictors and evaluation of models. (a) RFE-LR used to examine whether any subset of the input features can achieve a better discrimination than the initial set of features. (b) ROC curves of different models on the validation sets. (c) Precision–recall (PR) curves of different models on the validation sets

Table 2 Results of hyper-parameter optimization of different machine learning algorithms

Model comparison

We first summarized the hyperparameter tuning results. Before tuning, the LR model achieved the highest AUC (0.839) (Table S5). Four machine learning models were constructed based on the best hyperparameter combinations of the algorithms (Table 2). The confusion matrices are summarized in Table 3: XGBoost produced the fewest false positives (15), and LR produced the most true positives (153). As can be seen from Table 4 and Fig. 2b and c, LR achieved the best AUC (0.850) among the single machine learning models. The ensemble model constructed using the voting ensemble method further improved the predictive power and achieved the highest performance (AUC: 0.856). LR had the best sensitivity (0.845). The specificity values of XGBoost, NN, and the traditional creatinine (Cre) indicator were all above 0.8, whereas the sensitivity of Cre was low (0.392). Compared with the existing single renal function evaluation indices (eGFR, Scr), machine learning significantly improved the prediction of CKD progression (Table 4). We also compared the running time of the machine learning models under the same hardware conditions. As shown in Table S6, the models differed little on the test cohort. However, during training over the hyperparameter grid, GNB had the fastest and XGBoost the slowest execution time.

Table 3 Confusion matrices
Table 4 Performance summary

Most important predictors of CKD risk

To identify the features influencing the model and their impact on the risk of CKD as a way to support clinical decision-making, a kernel-based SHAP explainer was used to interpret the ensemble model, which had the best AUC. The features ranked by SHAP value in the training dataset are shown in Fig. 3. Features other than Scr and eGFR are discussed to highlight those that may need to be closely monitored. As shown in Fig. 3, lower levels of 25OHD, ALB, and transferrin (TRF), male sex, and higher levels of CYSC were the major predictors of moderate-to-severe disease. In addition, to obtain the exact form of each relationship, SHAP dependence plots (Fig. 4) were employed. A SHAP value of zero is regarded as the cut-off, and the feature value at which the SHAP value crosses zero gives the critical point for that feature. According to the results, 25OHD levels lower than 30 nmol/L indicate a moderate or even severe loss of renal function. In addition, when the 25OHD level was higher than 75 nmol/L, the individual differences increased. A decrease in the serum ALB level predicts an increase in the risk of CKD: ALB levels below 37 g/L were associated with positive SHAP values, that is, with a higher predicted risk. We also found that the accumulation of CYSC indicates an increased risk of CKD, and that when the CYSC level is higher than 2 mg/L, the SHAP values at a given CYSC level vary more widely among patients. In addition, a higher glycated albumin (GA) level (%) indicates an increased risk of CKD. The results also illustrate the trend in CKD risk as eGFR decreases: an eGFR below 60 mL/min/1.73 m² was associated with positive SHAP values. Within the range of 1.5–2.0 g/L, TRF changes only slightly, whereas the SHAP value increases sharply, which shows that attention should be paid to changes in TRF.
Such analyses can help clinicians understand the results of potential interventions and design appropriate personalized care plans to reduce the risk of CKD.
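Reading a critical point off a SHAP dependence plot amounts to finding where the SHAP value changes sign along the feature axis. The helper below does this by linear interpolation between neighbouring points. It assumes a smoothed, one-value-per-feature-level curve (real dependence plots are noisy scatters), and the albumin and SHAP numbers are invented for illustration.

```python
def zero_crossings(feature_values, shap_values):
    """Feature values at which the SHAP curve crosses zero (linear interpolation)."""
    pairs = sorted(zip(feature_values, shap_values))
    crossings = []
    for (x0, s0), (x1, s1) in zip(pairs, pairs[1:]):
        if s0 == 0:
            crossings.append(x0)
        elif s0 * s1 < 0:
            # Sign change between x0 and x1: interpolate the zero position.
            crossings.append(x0 + (x1 - x0) * (0 - s0) / (s1 - s0))
    if pairs and pairs[-1][1] == 0:
        crossings.append(pairs[-1][0])
    return crossings


# Invented, smoothed dependence curve: albumin (g/L) versus SHAP value.
albumin = [30.0, 34.0, 36.0, 38.0, 42.0]
shap_albumin = [0.20, 0.08, 0.02, -0.02, -0.10]

cutoffs = zero_crossings(albumin, shap_albumin)
```

With noisy data, one would typically smooth the scatter first (for example with a moving average) before looking for the sign change.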

Fig. 3

SHAP summary plot of the top-17 features of the ensemble model. The abscissa is the SHAP value, which represents the impact on the model output. The ordinate lists the features, with red representing larger feature values and blue indicating smaller feature values

Fig. 4

SHAP dependence plots for the ensemble model. The SHAP dependence plot shows the effect of a single feature on the output of the ensemble prediction model. SHAP values for specific features exceeding zero represent an increased risk of CKD progression. (a-f) 25-hydroxyvitamin D, albumin, cystatin C, glycated albumin, estimated glomerular filtration rate (eGFR), transferrin, apolipoprotein A-I, uric acid, and total protein

Discussion


The 24-h urine protein test has stringent patient compliance requirements and is difficult to repeat during follow-up. Replacing 24-h urine protein quantification with routine laboratory biochemical tests would make monitoring more convenient for outpatients and follow-up patients. In this retrospective cohort study, we developed machine learning algorithms using 100 easily obtainable clinical features for predicting CKD based on the severity of proteinuria (Fig. 5). Some studies have shown that changes in proteinuria are significantly associated with certain kidney function outcomes, including a doubling of serum creatinine levels, rapid eGFR decline, and progression to end-stage kidney disease [25,26,27]. However, 24-h proteinuria is difficult to measure in practice because the test is better suited to hospitalized patients than to outpatients, patient compliance is poor, and it adds to the clinical workload. In the present study, the LR model exhibited the best AUC among the single models, whereas the ensemble model (LR + XGBoost) exhibited the best AUC (0.856) among all models considered, with balanced specificity and sensitivity. Model fusion is therefore suitable for clinical decision support. Given the diversity of the available data and the adequate AUC, the results of this study are informative for the rapid identification of patients with CKD, and the mining of key risk factors can contribute to subsequent treatment.

Fig. 5

Overall summary of the study. Using common clinical variables, machine learning-based approaches can effectively predict and explain the progression of CKD. Furthermore, the framework supports decisions on early intervention and on medical resource allocation for outpatients and patients requiring follow-up

Artificial intelligence is being increasingly used in the medical field to predict various outcomes. Several longitudinal studies involving CKD have reported progress regarding the use of machine learning algorithms in CKD prediction. A study by Huang et al. showed that 125 metabolites and 14 clinical variables can be used as predictors to establish a CKD prediction model for patients with type 2 diabetes (AUC = 0.857) [28]. Rashed-Al-Mahfuz et al. developed five models for predicting CKD using low-cost diagnostic screening. The RF accurately predicted (at a rate of 99.5%) patients at risk of CKD, but this high predictive power may be due to overfitting caused by the small size of the dataset [29]. Ferguson et al. also used routinely collected laboratory data and machine learning models to identify those at high risk of developing advanced CKD within the next 5 years [30]. However, none of these studies can provide personalized information for individual patients, which hinders the ability of predictive models to support decision-making in clinical settings. This study provides a comprehensive framework that combines accurate prediction of CKD risk with interpretable results for the important characteristics of individual patients. Interestingly, consistent with the research by Xiao et al. on the use of proteinuria as a standard for CKD [31], the linear model achieved the best prediction performance among the multiple single models; the ML model fusion used in this study further improved the AUC, which suggests that the model fusion scheme has practical potential. Similarly, some common factors such as ALB, TP, and eGFR have been found to be significantly related to CKD progression. More details about the above-mentioned studies are shown in Table S7.
In particular, our outcome differs from that of most existing reports: we used 24-hour urinary protein as the outcome, whereas others mostly relied on eGFR, which can be biased by systemic changes in tubular creatinine secretion and extrarenal creatinine clearance. Routinely, 24-hour urinary protein quantification is the gold standard for assessing the severity of CKD, but few studies have stratified CKD risk using it as the outcome, which limits the number of comparable studies. This may be due to the difficulty of obtaining 24-hour urinary protein results in clinical practice. For each patient, risk stratification of CKD and timely identification of high risk are of great significance for the rational allocation of limited clinical resources and for treatment intervention. Our research addresses these gaps to a certain extent.

Unlike many studies on CKD risk factors, we used the RFE-LR algorithm to screen the most important variables for prediction and applied SHAP values to explain the machine learning model. Based on the SHAP values, 25OHD was assigned the highest importance. The kidney is one of the main organs regulating vitamin D metabolism. The kidney internalizes 25OHD and converts it into 1,25(OH)2D. In CKD, the combination of a limited vitamin D intake and a reduced renal capacity to activate 25OHD into 1,25(OH)2D leads to a progressive vitamin D deficiency [32]. Additionally, this study found that patients with 25OHD in the 28–35 nmol/L range require close monitoring to delay the progression of CKD. A close and regular review of 25OHD in such patients is recommended in clinical practice. However, further analysis of the actual health status of the patient is required to determine whether the dosing schedule of vitamin D can be adjusted. ALB had the next highest importance based on the SHAP values; lower ALB levels are associated with the loss of kidney function. ALB accounts for approximately 60% of the total serum protein content, maintains colloidal osmotic pressure, and binds a variety of compounds under physiological conditions [33]. The glomerular filtration barrier prevents ALB from entering the ultrafiltrate. However, under the pathological condition of CKD, an increase in the effective radius of the barrier leads to protein loss, which in turn leads to a decrease in serum albumin levels [34]. The SHAP value crossed zero at an ALB level of approximately 36–37 g/L. This means that for patients with reduced renal function, even though the reference range for ALB is 35–55 g/L, ALB levels below 37 g/L may indicate moderate-to-severe renal impairment and require closer monitoring.
Although the production rate of CYSC is more stable and its internal variability is smaller than that of Scr, there have been fewer studies on CYSC as a renal function marker. CYSC is a low-molecular-weight protein produced by nucleated cells at a constant rate that acts as an inhibitor of lysosomal cysteine proteases [35]. A recent meta-analysis showed similar findings; in particular, CYSC has a stronger correlation with renal function than Scr. We speculate that an underlying explanation is that CYSC, unlike Scr, is unaffected by muscle mass [36]. The interpretation based on the SHAP value is model-independent; that is, SHAP values can be computed for different models. Therefore, although this research focused on CKD, the framework can be easily extended to the risk prediction and interpretation of other diseases to better support clinical decision-making.

Overall, in this study, an integrated framework for CKD risk prediction and interpretation is proposed to provide clinicians with decision support and model interpretation. Specifically, an integrated algorithm was developed to achieve good prediction performance on the CKD dataset. While accurately predicting high-risk patients, it also achieves strong interpretability for specific indicators. Finally, this study has certain limitations. Firstly, this is a single-center retrospective study, and there may be variations in the clinical characteristics of the data across different regions. Therefore, to assess the generalizability of the model, the conclusions drawn from this study need to be validated in external cohorts. Secondly, this study only considered the correlation between predictive factors and CKD, without considering causality. Thirdly, this study used conventional feature selection models, and the application of more recent advanced techniques such as NFR, AHEGFS, and BEFS may help identify more reliable CKD progression risk factors. Finally, the dataset used in this study only included blood-related indicators and ignored medical prescriptions and imaging examinations.

Conclusions


In conclusion, we developed a machine learning model for predicting CKD based on proteinuria severity. The experimental results show that constructing a predictive interpretation framework can lead to a good predictive interpretation and provide effective clinical decision support. Another essential value is in providing new clinical insights for the management of patients requiring follow-up examinations for different diseases in large hospitals.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.



Abbreviations

ROC: Receiver operating characteristic
AUC: Area under the curve
CKD: Chronic kidney disease
LR: Logistic regression
RF: Random forest
RFE-LR: Recursive feature elimination with logistic regression
SHAP: Shapley additive explanations


  1. Luyckx VA, Al-Aly Z, Bello AK, Bellorin-Font E, Carlini RG, Fabian J, Garcia-Garcia G, Iyengar A, Sekkarie M, van Biesen W, et al. Sustainable development goals relevant to kidney health: an update on progress. Nat Rev Nephrol. 2021;17(1):15–32.


  2. Methven S, MacGregor MS, Traynor JP, Hair M, O’Reilly DS, Deighan CJ. Comparison of urinary albumin and urinary total protein as predictors of patient outcomes in CKD. Am J Kidney Dis. 2011;57(1):21–8.


  3. Robinson BM, Akizawa T, Jager KJ, Kerr PG, Saran R, Pisoni RL. Factors affecting outcomes in patients reaching end-stage kidney disease worldwide: differences in access to renal replacement therapy, modality use, and haemodialysis practices. Lancet. 2016;388(10041):294–306.


  4. Fishbane S, Spinowitz B. Update on Anemia in ESRD and earlier Stages of CKD: Core Curriculum 2018. Am J Kidney Dis. 2018;71(3):423–35.


  5. Ruiz-Ortega M, Rayego-Mateos S, Lamas S, Ortiz A, Rodrigues-Diez RR. Targeting the progression of chronic kidney disease. Nat Rev Nephrol. 2020;16(5):269–88.


  6. Yang C, Wang H, Zhao X, Matsushita K, Coresh J, Zhang L, Zhao MH. CKD in China: evolving Spectrum and Public Health Implications. Am J Kidney Dis. 2020;76(2):258–64.


  7. Hirano K, Kobayashi D, Kohtani N, Uemura Y, Ohashi Y, Komatsu Y, Yanagita M, Hishida A. Optimal follow-up intervals for different stages of chronic kidney disease: a prospective observational study. Clin Exp Nephrol. 2019;23(5):613–20.


  8. Goecks J, Jalili V, Heiser LM, Gray JW. How machine learning will transform Biomedicine. Cell. 2020;181(1):92–101.


  9. Lee YW, Choi JW, Shin EH. Machine learning model for predicting malaria using clinical information. Comput Biol Med. 2021;129:104151.


  10. Huang X, Cao T, Chen L, Li J, Tan Z, Xu B, Xu R, Song Y, Zhou Z, Wang Z, et al. Novel insights on establishing machine learning-based stroke prediction models among hypertensive adults. Front Cardiovasc Med. 2022;9:901240.


  11. Kang MW, Kim J, Kim DK, Oh KH, Joo KW, Kim YS, Han SS. Machine learning algorithm to predict mortality in patients undergoing continuous renal replacement therapy. Crit Care. 2020;24(1):42.


  12. Ketteler M, Ambuhl P. Where are we now? Emerging opportunities and challenges in the management of secondary hyperparathyroidism in patients with non-dialysis chronic kidney disease. J Nephrol. 2021;34(5):1405–18.


  13. Pasha SJ, Mohamed ES. Novel feature reduction (NFR) model with machine learning and data mining algorithms for effective disease risk prediction. IEEE Access. 2020;8:184087–108.

  14. Pasha SJ, Mohamed ES. Advanced hybrid ensemble gain ratio feature selection model using machine learning for enhanced disease risk prediction. Inf Med Unlocked. 2022;32:101064.

  15. Pasha SJ, Mohamed ES. Bio inspired ensemble feature selection (BEFS) model with machine learning and data mining algorithms for disease risk prediction. In: 2019 5th International Conference on Computing, Communication, Control and Automation (ICCUBEA). IEEE; 2019. p. 1–6.

  16. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4768–77.

  17. Zhao QY, Wang H, Luo JC, Luo MH, Liu LP, Yu SJ, Liu K, Zhang YJ, Sun P, Tu GW, et al. Development and validation of a machine-learning model for prediction of Extubation failure in Intensive Care Units. Front Med (Lausanne). 2021;8:676343.

  18. Tseng PY, Chen YT, Wang CH, Chiu KM, Peng YS, Hsu SP, Chen KL, Yang CY, Lee OK. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Crit Care. 2020;24(1):478.

  19. Stekhoven DJ, Bühlmann P. MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–8.

  20. Zopluoglu C. Detecting examinees with item Preknowledge in large-scale testing using Extreme Gradient Boosting (XGBoost). Educ Psychol Meas. 2019;79(5):931–61.

  21. Zhang H, Jiang T, Shan G. Identification of hot spots in protein structures using Gaussian Network Model and Gaussian Naive Bayes. Biomed Res Int. 2016;2016:4354901.

  22. Kriegeskorte N, Golan T. Neural network models and deep learning. Curr Biol. 2019;29(7):R231–6.

  23. Rokem A, Kay K. Fractional ridge regression: a fast, interpretable reparameterization of ridge regression. Gigascience. 2020;9(12).

  24. Schober P, Vetter TR. Logistic regression in Medical Research. Anesth Analg. 2021;132(2):365–6.

  25. Anderson AH, Xie D, Wang X, Baudier RL, Orlandi P, Appel LJ, Dember LM, He J, Kusek JW, Lash JP, et al. Novel risk factors for Progression of Diabetic and nondiabetic CKD: findings from the chronic renal insufficiency cohort (CRIC) study. Am J Kidney Dis. 2021;77(1):56–73e51.

  26. Inaguma D, Imai E, Takeuchi A, Ohashi Y, Watanabe T, Nitta K, Akizawa T, Matsuo S, Makino H, Hishida A, et al. Risk factors for CKD progression in japanese patients: findings from the chronic kidney Disease Japan Cohort (CKD-JAC) study. Clin Exp Nephrol. 2017;21(3):446–56.

  27. Inaguma D, Kitagawa A, Yanagiya R, Koseki A, Iwamori T, Kudo M, Yuzawa Y. Increasing tendency of urine protein is a risk factor for rapid eGFR decline in patients with CKD: a machine learning-based prediction model by using a big database. PLoS ONE. 2020;15(9):e0239262.

  28. Huang J, Huth C, Covic M, Troll M, Adam J, Zukunft S, Prehn C, Wang L, Nano J, Scheerer MF, et al. Machine learning approaches reveal metabolic signatures of incident chronic kidney disease in individuals with Prediabetes and Type 2 diabetes. Diabetes. 2020;69(12):2756–65.

  29. Rashed-Al-Mahfuz M, Haque A, Azad A, Alyami SA, Quinn JMW, Moni MA. Clinically Applicable Machine Learning Approaches to identify attributes of chronic kidney disease (CKD) for use in low-cost diagnostic screening. IEEE J Transl Eng Health Med. 2021;9:4900511.

  30. Ferguson T, Ravani P, Sood MM, Clarke A, Komenda P, Rigatto C, Tangri N. Development and External Validation of a machine learning model for progression of CKD. Kidney Int Rep. 2022;7(8):1772–81.

  31. Xiao J, Ding R, Xu X, Guan H, Feng X, Sun T, Zhu S, Ye Z. Comparison and development of machine learning tools in the prediction of chronic kidney disease progression. J Transl Med. 2019;17(1):119.

  32. Christodoulou M, Aspray TJ, Schoenmakers I. Vitamin D supplementation for patients with chronic kidney disease: a systematic review and Meta-analyses of trials investigating the response to supplementation and an overview of Guidelines. Calcif Tissue Int. 2021;109(2):157–78.

  33. Figueroa SM, Araos P, Reyes J, Gravez B, Barrera-Chimal J, Amador CA. Oxidized albumin as a mediator of kidney disease. Antioxid (Basel). 2021;10(3).

  34. Levitt DG, Levitt MD. Human serum albumin homeostasis: a new look at the roles of synthesis, catabolism, renal and gastrointestinal excretion, and the clinical value of serum albumin measurements. Int J Gen Med. 2016;9:229–55.

  35. Obert LA, Elmore SA, Ennulat D, Frazier KS. A review of specific biomarkers of Chronic Renal Injury and their potential application in Nonclinical Safety Assessment Studies. Toxicol Pathol. 2021;49(5):996–1023.

  36. Lopez-Giacoman S, Madero M. Biomarkers in chronic kidney disease, from kidney function to kidney damage. World J Nephrol. 2015;4(1):57–73.


Not applicable.


This work was supported by grants from the National Natural Science Foundation of China (82070710), Shanghai Science and Technology Innovation Action Plan (21S21902900, 19DZ2205600, and 21002411500), Shanghai Municipal Key Clinical Specialty (shslczdzk02501), Shanghai Clinical Research Center for Kidney Disease (22MC1940100), Shanghai Key Laboratory of Kidney and Blood Purification, Shanghai Science and Technology Commission (20DZ2271600), Shanghai Municipal Hospital Frontier Technology Project supported by Shanghai Shen Kang Hospital Development Center (SHDC12018127 and SHDC2202230), and Shanghai Municipal Natural Science Foundation (20ZR1455600).

Author information

Study concept and design: DW, NS, YF and XD; acquisition of data: Y Li, BZ, JZ, YY, WC, ZY, BS and AC; analysis and interpretation of data: all authors; first drafting of the manuscript: YL and YN; critical revision of the manuscript for important intellectual content: DW, NS, and XD; statistical analysis: Y Lu, YN, and Y Li; obtained funding: DW, NS, YF, BS and XD; study supervision: DW, NS, and XD. Y Lu and YN had full access to all data used in the study and were responsible for the integrity of the data and accuracy of the data analysis. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Dong Wang, Nana Song or Xiaoqiang Ding.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

The study received ethical approval from the Ethics Committee of Zhongshan Hospital (Approval No: B2021-740). It was conducted in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects and with applicable national research regulations. Because the study was retrospective, the Ethics Committee of Zhongshan Hospital waived the requirement for informed consent.

Consent for publication

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Lu, Y., Ning, Y., Li, Y. et al. Risk factor mining and prediction of urine protein progression in chronic kidney disease: a machine learning- based study. BMC Med Inform Decis Mak 23, 173 (2023).
