
Improved liver disease prediction from clinical data through an evaluation of ensemble learning approaches



Liver disease causes two million deaths annually, accounting for 4% of all deaths globally. Prediction or early detection of the disease via machine learning algorithms on large clinical data has become promising and potentially powerful, but such methods are often limited by the complexity of the data. In this regard, ensemble learning has shown promising results. There is an urgent need to evaluate different algorithms and then suggest a robust ensemble algorithm for liver disease prediction.


Three ensemble approaches with nine algorithms are evaluated on a large dataset of liver patients comprising 30,691 samples with 11 features. Various preprocessing procedures are utilized to feed the proposed model with better quality data, in addition to the appropriate tuning of hyperparameters and selection of features.


The models’ performances with each algorithm are extensively evaluated with several positive and negative performance metrics along with runtime. Gradient boosting is found to have the best overall performance, with 98.80% accuracy and 98.50% each for precision, recall, and F1-score.


The proposed model with gradient boosting outperformed several recent similar works on most metrics, suggesting its efficacy in predicting liver disease. It can be further applied to predict other diseases that share common predictor indicators.



Liver disease is a significant global health burden, accounting for two million deaths annually, with approximately two-thirds in men [1]. Liver-related fatalities constituted 4% of the deaths observed in the current century [2]. Liver disease encompasses a spectrum of conditions, including fatty liver disease, cirrhosis, and hepatocellular carcinoma, which can lead to liver failure and death. The primary factors contributing to the development of liver disease are the frequent and prolonged consumption of drugs and alcohol, as well as the presence of obesity and diabetes [3]. Intervention and early diagnosis are essential for enhancing patient outcomes in liver disease. However, the sensitivity and specificity of conventional diagnostic techniques, including liver function tests and biopsies, are currently limited.

Machine learning (ML) has emerged as a promising tool for improving the diagnosis and prognosis of human diseases, including liver disease. ML algorithms can support the analysis of large and complex clinical data, often including patient demographics, family history, patient medical records, laboratory results, and imaging findings, to identify patterns and relationships associated with liver disease. This information can then be used to develop predictive models for early disease detection and risk stratification. Several studies have investigated the application of ML for liver disease prediction using clinical data [4, 5]. These studies have explored various ML algorithms, including support vector machines (SVMs), random forests (RFs), and artificial neural networks (ANNs). However, the performance of these algorithms can be affected by factors such as data quality, feature selection, and model parameters.

One of the most powerful ML approaches for medical diagnosis is ensemble learning. Ensemble methods combine multiple base learners to create a single, more robust model. This can improve the accuracy and generalizability of predictions compared to individual models. Ensemble learning has numerous advantages compared to conventional ML methodologies, rendering it a potent methodology for enhancing prediction efficacy across diverse workloads. Some of the notable advantages of ensemble learning are summarised in Fig. 1. Ensemble learning methods are becoming increasingly popular for more precise disease prediction [6,7,8,9,10]. Considering its success in other disease prediction, ensemble learning has also been explored to predict liver disease because this is a major disease type with a large amount of data [11].

Fig. 1 Advantages of ensemble learning approaches

Several ensemble learning strategies have been developed. Among them, the most common ones include bagging (e.g., BDT (bagged decision tree), RF (random forest), ET (extra trees), etc.) [12], boosting (e.g., AdaBoost, GB (gradient boosting), and XGB (eXtreme gradient boosting)) [13], and stacking/voting (e.g., LR, DT, SVM, etc.) [14]. The selection of an ensemble technique should be determined by the particular problem at hand, the characteristics of the dataset being used, and the computational resources that are accessible.

This paper aims to extensively evaluate ensemble learning methods for liver disease prediction and find the best-performing one. The main contributions of this paper are as follows:

  • An EDA is conducted to augment the dataset under consideration so that it can be utilized more effectively in experiments.

  • Different subsidiary methods are employed, such as data sampling, standardization, normalization, hyperparameter tuning, and feature selection.

  • Nine ensemble algorithms are applied for prediction model development.

  • The model’s performances with the considered ensemble algorithms are exhaustively evaluated and compared using several performance metrics.

  • The best performance of the proposed model is compared with other recent research works.

The remainder of the paper is structured as follows. Related work is discussed in Sect. 2. Section 3 briefly discusses the considered ensemble learning algorithms and the research methodology adopted in this paper. Details of the dataset and data preprocessing are discussed in Sect. 4. The details of the experimental setup are described in Sect. 5. The comparative analysis of the performance of the ensemble learning algorithms, along with other similar works, is presented in Sect. 6. The conclusion of our work, mentioning the limitations and future scopes, is given in Sect. 7. The acronyms used in this paper are listed in Table 1.

Table 1 List of abbreviations used in this paper

Related work

The rise of ML has led it to be applied in various application areas, including diagnoses and predictions of diseases [15,16,17]. Pasha et al. [18] offered a prediction model for liver disease. They also compared their model’s prediction accuracy with other ML algorithms like RF, LR and SVM. Mutlu et al. [19] built a CNN-based model to identify liver disease. For the experiment, they used two datasets, BUPA (from BUPA Medical Research Ltd.; footnote 1) and ILPD. Both datasets are available in the UCI ML repository (footnotes 2 and 3). The model attained 75.55% and 72% accuracy for the BUPA and ILPD datasets, respectively. The authors also compared this model’s performance with other ML techniques such as NB, SVM, kNN, and LR.

Kalaiselvi et al. [20] experimented with different ML algorithms like kNN, DT and ANFIS to determine which is more appropriate for liver disease prediction. They used the ILPD, which is available at Kaggle (footnote 4). It was observed that ANFIS performed best in terms of all the performance metrics. Thirunavukkarasu et al. [21] attempted to predict liver disease using classification algorithms like LR, kNN and SVM. The experimental results on the ILPD showed that LR and kNN achieved equal accuracy and were better than SVM; however, LR performed better in sensitivity and specificity. Velu et al. [22] experimented with NB and C4.5 DT on ILPD to predict liver disease. The latter achieved a better accuracy of 98.40% with the test dataset.

In ensemble ML, complex and more efficient models are built by combining diverse ML techniques to gain their combined advantages. This collaborative approach has been proven to be successful in the prediction, detection, diagnosis, and prognosis of different diseases [23,24,25,26,27].

Amin et al. [28] proposed an integrated feature extraction approach to predict liver disease. They applied different dimensionality reduction methods like PCA, FA, and LDA on ILPD. Various ML classifiers like LR, RF, KNN, SVM, MLP and ensemble were evaluated on the extracted features using 10-fold cross-validation. RF achieved the best performance with 88.1% accuracy, 85.33% precision, 92.3% recall and 88.68% F1-score on the integrated feature space. Afrin et al. [29] used ensemble learning to predict liver disease using various classification algorithms like LR, DT, RF, AdaBoost, kNN, LDA, GB, and SVM. They used the ILPD and applied LASSO to identify the most important features correlated to liver disease. When using all features, LR performed best, with an accuracy of 77.14%. However, DT performed the best with LASSO features with 94.29% accuracy. DT also had the highest precision of 92%, sensitivity of 99% and F1-score 96% based on LASSO features.

Dritsas and Trigka [30] compared various ML models (NB, LR, SVM, J48, RT, and RepTree) and ensemble methods (bagging, RF, RotF, AdaBoostM1, voting, stacking, MLP, and kNN) for liver disease risk prediction. They applied SMOTE and 10-fold cross-validation. Voting performed the best, with an accuracy of 80.1%, a precision of 80.4% and a recall of 80.1%. Nahar [31] compared different ensemble methods (AdaBoost, LogitBoost, RF, and bagging with J48 and RepTree) for liver disease prediction. They used the ILPD for the experiment and the WEKA toolkit to build and evaluate the model. The authors analyzed the performance of the ensemble methods over multiple iterations, showing how accuracy improves with more models. They evaluated the models using accuracy, RMSE, TPR, FPR and ROC curve, providing a comprehensive model performance analysis. The results indicate that LogitBoost has the best accuracy of 71.53%. Kuzhippallil et al. [32] compared various ML classification models and feature selection techniques to predict liver disease. They used a genetic algorithm and XGB to select features. They evaluated various models, including LR, kNN, DT, RF, GB, AdaBoost, XGB, LGBM, and the stacking model. After feature selection and outlier removal, LGBM and the stacking model achieved the highest accuracy of 86%. To find a better potential solution for liver disease prediction, Naseem et al. [33] presented an extensive comparison of ten classifiers, viz. A1DE, MLP, NB, kNN, SVM, CHIRP, CDT, Forest-PA, J48, and RF. They experimented with two different datasets taken from the UCI ML repository (BUPA; footnote 5) and the GitHub repository (SanikaVT; footnote 6). For the first dataset, RF exhibited overall better performance, while for the second, SVM was observed to be the best.

Quadir et al. [34] proposed an ensemble ML approach using enhanced preprocessing techniques to classify liver disease. They applied various preprocessing techniques like imputation, balancing, scaling, and selection to improve the model’s performance. The authors applied six ensemble algorithms (GB, XGB, bagging, RF, ET, and stacking) and evaluated them on the preprocessed data derived from ILPD. The extra trees classifier achieved the highest testing accuracy of 91.82% for liver disease classification. Dalal et al. [35] proposed a hybrid XGB model for predicting liver disease. When evaluated, the proposed model achieved a significantly higher accuracy of 93.65% compared to the individual DT models like CHAID and CART. It also had better performance metrics like AUC and Gini coefficient. Bulucu et al. [36] conducted a study to predict liver disease from clinical data using ensemble learning methods like RF, J48, AdaBoost, GB and LGBM. They performed SMOTE oversampling to balance the classes before classification. The LGBM algorithm performed best with 98.8% accuracy, 98.1% precision, 99.4% recall and 0.98% kappa statistic in 10-fold cross-validation.

Edeh et al. [37] experimented with an ensemble model comprising MLP, Bayesian network, and QUEST for Hepatitis C prediction. They used the HCV data set (footnote 7), which allowed them to integrate the clinical data and blood biomarkers. An accuracy of 95.59% was achieved by the ensemble model, which was better than the individual performances of the considered algorithms. A predictive ML model of clinical outcomes presented by Meng et al. [38] aimed to assess the progression of Alpha-1 antitrypsin deficiency associated with liver disease (AATD-LD). They applied a supervised stacking ensemble learning technique combining RF, ENRR, GB, and ANN-MLP. They further mapped the feature importance for better interpretability of the predictive model. The authors extracted liver patient data from the UK Biobank for the experiment. Bayani et al. [39] identified the factors that have the most influence on the prediction of EV grades among cirrhosis patients. To select the most potent predictors of EV grades, the authors used Catboost and XGB. In the experiment on a dataset of 490 patients with cirrhosis, 100% precision was attained with the Catboost model, while the XGB model had 91.02% accuracy. Child score, WBC, vitamin K, and INR were the most significant factors for predicting EV grades among cirrhosis patients. Gupta et al. [40] conducted a comparison of various ML approaches, such as GB, XGB, and LGB, to forecast liver disease. The dataset utilized for this purpose was the ILPD. The highest accuracy attained, using RF and LGB, was 63%. To predict liver disease using ILPD, Hameed et al. [41] also implemented many ML techniques, including boosting methods such as AdaBoost and GB. The findings indicate that the DT, AdaBoost, and RF achieved the highest accuracy during training, whereas the RF achieved the highest accuracy (80.36%) during testing. Zhao et al. [42] considered single classifiers (SVM and Gaussian process) and ensemble classifiers (XGB, bagging, and RF) for predicting liver disorders. The prediction performance was evaluated through accuracy, balanced accuracy, precision, recall, and F1-score. Experimenting with the BUPA dataset, the best performance was achieved through RF with an accuracy of 80.35%. However, bagging turned out to be a better performer in terms of recall.

All the above-mentioned studies used some basic machine learning models along with one or two ensemble models for liver disease prediction. As a result, they do not provide a dedicated performance assessment of ensemble learning methods. In this study, we built models using boosting, bagging and voting. Since aggregation is the fundamental policy of both stacking and voting, we kept only voting in this study. We performed a comprehensive comparison, considering algorithms from different families of ensemble learning. In the actual experiment, we considered five algorithms from each category; here, however, we report the top three performers for each category.

Furthermore, most previous works reported only limited evaluation metrics that are generally common, e.g., accuracy, precision and recall. In this paper, we conducted thirteen statistical measurements to show the effectiveness of the proposed model from different aspects.

Research methodology

A synopsis of the research procedures undertaken and the ensemble learning methods implemented in the experiment are described in this section.

Research workflow

Figure 2 summarises the workflow of this study. First, we performed EDA to assess and augment the quality of the considered dataset. Here, we searched for the missing values and replaced them by employing data imputation methods. Further, for spotting possible outliers, the IQR method was used. Besides, other libraries were used to check for corrupt and noisy data, if any, in the dataset. Afterwards, data sampling, normalization, standardization, hyperparameter tuning, and ranking of features as per their importance were made. To develop the prediction model, we used and compared nine ensemble algorithms. The results were assessed through various performance metrics. The ensemble algorithms were trained using 60% of the dataset, while the remaining 40% was allocated for testing and validating their effectiveness.

Fig. 2 Proposed methodology for research work
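The 60%/40% division mentioned in the workflow can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the stratification by class label, the function name `stratified_split`, and the toy labels are all assumptions made for illustration, since the text does not say how the split was drawn.

```python
import random

def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx), keeping class ratios similar in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 70 positive and 30 negative records.
labels = [1] * 70 + [0] * 30
train_idx, test_idx = stratified_split(labels)
```

Splitting per class keeps the positive-to-negative ratio of an imbalanced dataset roughly the same in the training and testing portions.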

Ensemble learning models

Ensemble learning is an ML methodology that improves the accuracy and robustness of predictions by combining multiple models, instead of relying solely on individual models [43]. The basic idea behind ensemble learning is that it can make up for the shortcomings of any single model by combining the strengths of different models, leading to better performance. A number of ensemble learning methods are suggested [44, 45]. We took into consideration the following ensemble learning techniques in this study:


The boosting algorithm is a prominent method within the ensemble learning framework. Boosting methods involve an iterative training process where base models are trained, with increasing emphasis on misclassified examples in each iteration. In this manner, the emphasis is placed on rectifying errors committed by preceding models. Various boosting algorithms can be found in the literature [46, 47]. In this experiment, we considered the following three boosting algorithms.

  • XGBoost: XGB is a popular boosting algorithm that combines different kinds of DTs (weak learners) to independently calculate similarity scores [48]. It is known for its speed, accuracy, and ability to handle complex data.

  • Gradient boost: In this method, the weak learners undergo sequential training, while the weights of each estimator are adjusted individually before being added [49]. Predicting residual errors introduced by prior estimators, the GB algorithm attempts to minimize the discrepancy between predicted and actual values.

  • LightGBM: LGBM is another popular boosting algorithm similar to XGB, but it is faster and more memory-efficient. It can manage sizable datasets while consuming less memory during model evaluation [50]. LGBM also has several features that make it well-suited for real-world applications, such as parallelization and out-of-core training.
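The residual-fitting idea behind gradient boosting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' model: it uses squared-error loss on a single feature with one-split trees (stumps), whereas a classifier such as the one evaluated here would normally use a classification loss and deeper trees. The names `fit_stump` and `gradient_boost` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:        # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals left by the current ensemble."""
    base = sum(ys) / len(ys)                      # initial constant prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data: the target switches from 0 to 1 between x = 3 and x = 4.
model = gradient_boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

Each round shrinks the remaining error by a factor controlled by `learning_rate`, which is why boosting libraries expose both the learning rate and the number of estimators as hyperparameters.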


The bagging (bootstrap aggregating) technique entails the independent training of multiple base models on randomly selected subsets of the training data, with replacement. The final prediction is typically determined by taking the average (in the case of regression) or by voting (in the case of classification) the predictions generated by the base models. There are several bagging algorithms; however, in this study, the following methods gave the best results.

  • Bagged decision tree: BDTs are the most basic implementation of the bagging technique [51]. They are generated by aggregating the predictions of numerous DTs trained on bootstrap samples of the data. Bagged DTs have demonstrated efficacy in mitigating variation and enhancing accuracy. However, it is worth noting that there is a potential for overfitting to the training data in certain cases.

  • Random forest: RF is a more sophisticated bagging method that adds an element of randomness to the DT by randomly selecting a subset of features to examine at each split [52]. This further decorrelates the trees and can better the overall performance of the ensemble.

  • Extra trees: ET is another kind of bagging that employs a different splitting rule for the DTs than standard bagging does [53]. Instead of searching for the optimal split at each node, extra trees randomize the split: they draw candidate split points at random for a randomly chosen subset of attributes and consider only those candidates when making the split. This approach can reduce the correlation among trees and enhance the overall efficacy of the ensemble.
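The two steps of bagging described above, bootstrap sampling and vote aggregation, can be sketched as follows. The base learner here is a deliberately trivial threshold rule rather than a decision tree, and it assumes the positive class has larger feature values; the function names and the toy rows are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_threshold_classifier(rows):
    """Toy base learner: predict 1 when the feature exceeds the midpoint of the class means."""
    ones = [x for x, y in rows if y == 1]
    zeros = [x for x, y in rows if y == 0]
    if not ones or not zeros:                     # sample contains a single class
        constant = 1 if ones else 0
        return lambda x: constant
    midpoint = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
    return lambda x: 1 if x > midpoint else 0

def bagging_predict(models, x):
    """Aggregate the base learners' predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
rows = [(0.5, 0), (0.8, 0), (1.1, 0), (3.9, 1), (4.4, 1), (5.2, 1)]  # (feature, label)
models = [fit_threshold_classifier(bootstrap_sample(rows, rng)) for _ in range(25)]
```

Because every base learner sees a different resample, their individual errors are partly independent, and the vote averages them out; random forest and extra trees add further randomness to the trees themselves.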


Voting combines the predictions of base learners to generate new features for the training sets and improve the desired outcomes [54]. This approach generates the meta-features required for the final prediction by integrating both conventional and sophisticated classifiers. The outputs of the base classifiers are aggregated using majority votes and weighted techniques.

  • Logistic regression: LR combines multiple logistic regression models to improve overall prediction accuracy [55]. The process involves training logistic regression models iteratively, with each model concentrating on the misclassified instances from the preceding model. This approach exhibits notable efficacy when applied to binary classification tasks.

  • Decision tree: Boosted DTs sequentially build a series of weak DTs and combine their outputs to create a strong predictive model [56]. It achieves this by repeatedly training DTs in an iterative manner, with each tree concentrating on the most challenging examples from the preceding tree.

  • SVM: Boosting SVM involves combining the outputs of multiple SVMs to improve classification performance [57]. It trains SVMs iteratively, with each SVM concentrating on the support vectors from the preceding SVM. This method is especially useful for classification tasks involving high-dimensional data.
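The aggregation step of voting, by majority vote or by weighted combination, can be illustrated with a short sketch. The function names, the 0.5 decision threshold, and the example probabilities are assumptions for illustration; the three inputs stand in for the outputs of base classifiers such as LR, DT and SVM.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Weighted average of each classifier's P(class = 1), thresholded at 0.5."""
    weights = weights or [1.0] * len(probabilities)
    average = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return 1 if average >= 0.5 else 0

# Hypothetical outputs of three base classifiers for one patient.
majority = hard_vote([1, 0, 1])
unweighted = soft_vote([0.9, 0.4, 0.3])
weighted = soft_vote([0.9, 0.4, 0.3], weights=[1, 3, 3])
```

In this example the unweighted average favours the positive class, while giving more weight to the two less confident classifiers flips the decision, which shows how the choice of weights can change the ensemble's output.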

Dataset collection and manipulation

We used the Liver Disease Patient Dataset (footnote 8) as the experimental data set, collected from liver patients worldwide and publicly available at the UCI ML repository. This section discusses the details of the dataset and various data preprocessing.

Dataset description

This data set contains records of 30,691 people, of whom 21,917 had liver disease and the remaining 8,774 did not. The dataset contains eleven attributes for each record: the first ten are predictor attributes, and the last is the target attribute. Among these, four attributes are of integer type, five are decimal, and two are categorical.

Table 2 shows the attribute information such as mean, standard deviation (std), and value range (minimum and maximum). For example, the minimum and maximum values of the total bilirubin (TB) attribute are 0.4 and 75, respectively, and its mean and std values are 3.370 and 6.256, respectively. It has also been observed that 25% of the patients have a TB value of at most 0.8, while 50% and 75% have TB values of at most 1 and 2.7, respectively.

Table 2 Summary of attributes of the dataset

Exploratory data analysis

We employed a variety of data visualization techniques to examine and illustrate the data samples’ distribution. The histograms depicted in Fig. 3 are normally distributed and combine the dataset attributes within a given range of values. The X- and Y-axes represent the attribute values and number of patients having those values, respectively. The probability density generated by the KDE method is illustrated in Fig. 4. The X- and Y-axes represent each attribute’s parameter value and probability density function, respectively. It can be observed, for instance, that most patients’ ages in the dataset are between 25 and 65. The IQR approach was exercised to address the presence of outliers in the dataset.

Fig. 3 Histogram of dataset attributes

Fig. 4 Density plot for KDE (kernel density estimation)

We employed the CCA approach to determine and visualize the relationship between the attributes in the dataset. A substantial correlation or association between the predictor attributes and the target attribute indicates a higher-quality dataset. The CCA for the experimental dataset attributes is shown in Fig. 5. The correlation values are bounded by +1 and −1.

Fig. 5 Correlation coefficient analysis

Data preprocessing

Before applying ML techniques to the model, preparing the data to build a strong and reliable system is important. Several approaches were utilized to handle different data preparation concerns in this study.

Outlier detection

Identifying outliers and neutralizing them, especially in predictive modelling, is vital in the initial data preparation phase. The process entails identifying data points that exhibit substantial deviation from the other data within the dataset. If outliers are not correctly addressed, they can significantly affect the accuracy of prediction models. We used the IQR method to better visualize outliers in the dataset, if any. We set the threshold of an IQR factor of three for all the features. It was found that the attributes AP, ALA, and ASA had most of the outliers, which is shown in the left column of Fig. 6. The Z-score method, defined by Eq. 1, where x = observed value, µ = mean of the sample, and σ = standard deviation of the sample, was used to replace the outliers. To neutralize the outliers, we set the range for AP, ALA, and ASA as 175–275, 25–45, and 25–55, respectively. The right column of Fig. 6 shows that the outliers of the three attributes are completely removed.

$$Z=\frac{x-\mu }{\sigma }$$

Fig. 6 Detecting and replacing outliers in the dataset for (a) AP, (b) ALA and (c) ASA
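The IQR rule with a factor of three and the Z-score of Eq. 1 can be sketched in plain Python as below. The sketch makes two assumptions that the text does not confirm: outliers are neutralized by clipping them to the IQR fences, and the quartiles are computed with the default method of `statistics.quantiles`. The sample values are hypothetical.

```python
import statistics

def iqr_bounds(values, factor=3.0):
    """Return (low, high) fences at factor * IQR beyond the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def clip_outliers(values, factor=3.0):
    """Pull every value outside the IQR fences back to the nearest fence."""
    low, high = iqr_bounds(values, factor)
    return [min(max(v, low), high) for v in values]

def z_scores(values):
    """Z = (x - mean) / std, using the sample standard deviation as in Eq. 1."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

values = [10, 12, 11, 13, 12, 11, 14, 200]   # hypothetical readings; 200 is an outlier
clipped = clip_outliers(values)
scores = z_scores(values)
```

A factor of three is more permissive than the common 1.5 rule, so only strongly deviating points are flagged.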

Missing value imputation

Missing value imputation is an important part of predictive modelling because it improves model performance, reduces bias, improves stability, and improves data representation. It helps ensure that predictive models are accurate, reliable, and useful in many situations. The process entails substituting absent values with credible estimates to guarantee the completeness and coherence of the data before constructing a predictive model. Figure 7 shows the total number of missing values for each attribute in the dataset. We used isnull() to find missing values and calculate each attribute’s percentage of null values. Afterwards, we filled in the missing values with the mean or median of the attribute’s available values. Figure 8 shows the process of the missing value imputation method. Figure 9 represents the dataset before and after applying the imputation method.

Fig. 7 Total number of missing values for each attribute

Fig. 8 Imputation process of missing values

Fig. 9 Comparison of missing value identification and replacement. Left panel: before missing value imputation. Right panel: after missing value imputation
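The counting and filling steps of imputation can be illustrated with a small plain-Python sketch. It mirrors the idea rather than the authors' implementation, which used isnull() on the full dataset; the function names, the use of `None` to mark a missing value, and the sample column are assumptions.

```python
import statistics

def missing_percentage(column):
    """Share of entries that are missing (None), in percent."""
    return 100.0 * sum(v is None for v in column) / len(column)

def impute(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in column]

ages = [45, None, 62, 38, None, 51]          # hypothetical column with two gaps
mean_filled = impute(ages, "mean")
median_filled = impute(ages, "median")
```

The median is usually the safer choice for skewed attributes, because a few extreme values can pull the mean away from the typical patient.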

Data sampling

ML algorithms tend to perform poorly on imbalanced datasets. The dataset used in this study was significantly skewed toward the positive class (liver disease) rather than the negative class (no liver disease). Originally, out of 30,691 records, 21,917 were of patients with liver disease, whereas 8,774 were of patients who did not have liver disease. We balanced the training dataset with respect to the target variable using SMOTE, as shown in Fig. 10.

Fig. 10 Class balancing of the target variable
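The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched in plain Python. This is a simplified illustration, not the library implementation the authors would have used; the function name `smote`, the choice of k = 3, and the toy points are assumptions.

```python
import math
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()                        # position along the connecting segment
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical two-feature minority samples.
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote(minority, n_new=6)
```

Applying the oversampling only to the training portion, as the text states, keeps synthetic records out of the test set, so the reported test metrics reflect real patients only.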

Data normalisation and standardization

For scaling the features, we used the MinMaxScaler() function. We chose this method for two reasons. First, it preserves the relative spacing of the original feature values while mapping them to a common range. Second, it bounds every value, including any residual extreme value, within that range. Because min-max scaling depends directly on the minimum and maximum values in the dataset, it is sensitive to outliers; our dataset originally had outliers, so they were handled beforehand, and scaling the cleaned data provides an additional safeguard.

By applying Eq. 2, we scaled the data values to achieve standardization and batch normalization, with mean and standard deviation values being 0 and 1, respectively.

$$N\left(X\right)=\frac{\sum _{i=1}^{N}{x}_{i}-{x}_{min}}{{x}_{max }-{x}_{min} }$$

where N is the total number of data samples, X is the attribute being scaled, x_i is its ith value, and x_min and x_max are the sample’s minimum and maximum values, respectively.

The feature scaling procedure includes normalization, which places the data samples inside a predetermined range that can be determined by the dataset’s type. All of the attributes in our study were scaled from 0 to 1 using min-max as defined by Eq. 3.

$$x'=\frac{x-{x}_{min}}{{x}_{max}-{x}_{min}}$$

where x is the attribute value, x′ is its scaled value, and xmin and xmax denote the minimum and maximum values of x, respectively.
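As a worked illustration of min-max scaling to the range 0 to 1, the sketch below applies the transformation to a few total bilirubin values taken from the summary statistics quoted earlier (minimum 0.4, maximum 75). The function name and the handling of a constant column are assumptions for illustration.

```python
def min_max_scale(values):
    """Map values linearly so that the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(values), max(values)
    if high == low:                    # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

tb_values = [0.4, 1.0, 2.7, 75.0]      # TB minimum, median, third quartile, maximum
scaled = min_max_scale(tb_values)
```

Here the third-quartile value 2.7 maps to roughly 0.03, which shows how a single large maximum compresses most of the data into a narrow band near zero.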


This section contains the experimental details of predicting liver disease using ensemble learning algorithms. The details of the experimental setup and configuration are shown in Table 3.

Table 3 Hardware and software used to conduct the experiment

Hyperparameter tuning

Hyperparameter tuning is crucial since it governs the behaviour of the training algorithm and has a big impact on the model’s performance. We tuned the hyperparameters using the grid search and random search methods to optimise the performance of the suggested model. We preferred these two techniques because they are widely used in the recent literature and are fairly straightforward to implement. Also, most machine learning frameworks and libraries provide built-in functions or modules for grid and random search. However, we took the final values from the grid search because of its better convergence. Grid search also provides better customization and flexibility: it systematically explores combinations of hyperparameters by defining a grid, i.e., a specific set of candidate values for each hyperparameter. This guarantees that every combination in the grid is evaluated, so the best values among the listed candidates are identified. Grid search is deterministic, meaning that it consistently produces the same results when the same grid and data are used. This attribute enables transparent testing and assessment by ensuring that outcomes are easy to reproduce and compare. Table 4 displays the hyperparameters for every method; the listed values are those we found to be optimal for each parameter in our experiment.

Table 4 Hyperparameters for the boosting algorithms
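The exhaustive, deterministic nature of grid search can be shown with a minimal sketch. The grid values, the parameter names, and the scoring function below are hypothetical stand-ins; in practice the score would be a cross-validated metric of the trained model, and the grid would hold the candidate values for each algorithm.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid: 3 * 2 * 2 = 12 combinations.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
}

def toy_score(params):
    """Stand-in for cross-validated accuracy; peaks at 0.1 / 200 / 3."""
    return (-abs(params["learning_rate"] - 0.1)
            - abs(params["n_estimators"] - 200) / 1000
            - abs(params["max_depth"] - 3) / 10)

best_params, best_score = grid_search(param_grid, toy_score)
```

The number of evaluations is the product of the list lengths, so the cost grows quickly as hyperparameters or candidate values are added; random search trades this exhaustiveness for a fixed evaluation budget.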

Cross validation

K-fold cross-validation is commonly employed to mitigate bias in the dataset. This approach involves dividing the dataset into k subsets of roughly equal size, referred to as “folds”. The experiment involved implementing k-fold cross-validation on the training dataset. We tested with different values of k from 4 to 12. For k = 4 to 9, we found overfitting for most of the considered models, while values 11 and 12 of k introduced underfitting to the models. Our training and testing evaluation for all the models indicated the best balance between overfitting (smaller k values) and underfitting (higher k values) is k = 10.
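The division of the training data into k folds can be sketched as follows. This is a minimal illustration with a hypothetical function name; it does not shuffle or stratify the samples, which a real experiment on an imbalanced dataset would typically do.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]                  # held-out fold
        train_idx = indices[:start] + indices[start + size:]   # remaining k - 1 folds
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(25, k=10))
```

Each sample is used for validation exactly once and for training k − 1 times, and the k validation scores are averaged to estimate performance.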

Feature importance and selection

The feature significance procedure ranks the predictor variables (input attributes) according to how well they help predict the target variable (output feature). This stage is critical for generating more accurate predictions for ML and ensemble learning models. We used the feature significance score (F-score), a metric that counts how often an attribute is used for splitting during the training process. Figure 11 illustrates the contributions made by each predictor parameter utilized in this investigation. The features and their degree of significance are plotted on the Y- and X-axis, respectively. As seen in the figure, DB, AP, ALA, and ASA are the most significant factors for an accurate prediction of liver disease; on the other hand, the demographic parameters (GN and AGE) are the least significant factors influencing the prediction of liver disease. We also checked for potential collinearity among features using the VIF method and found that none of the attributes had high collinearity. The observed VIF values lay between 0 and 4, which reduces the risk that collinear features distort the models.

Fig. 11 Feature importance for prediction using (a) boosting, (b) bagging, and (c) voting

Results and performance evaluation

This section discusses the performance of the designed prediction model for considered ensemble algorithms using various performance indicators.

Evaluation metrics

Evaluation metrics are used to assess the performance of a model on a given problem. Different evaluation metrics are used depending on the problem type and the nature of the data [58]. In this study, the experimental findings for the presented model are evaluated using the performance metrics summarised in Table 5 [59]. These metrics are defined in terms of four outcomes: true positive (TP), the patient has liver disease and the model predicts liver disease; true negative (TN), the patient does not have liver disease and the model predicts no liver disease; false positive (FP), the patient does not have liver disease but the model predicts liver disease; and false negative (FN), the patient has liver disease but the model predicts no liver disease.
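To make the four outcomes concrete, the following sketch derives the metrics reported later in this section from TP, TN, FP, and FN. It uses the standard textbook definitions and reads MCR as the misclassification rate; both are assumptions here, since the formulas of Table 5 are not reproduced in this text. The function name and the example counts are hypothetical.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity or TPR
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),   # false positive rate = 1 - specificity
        "fnr": fn / (fn + tp),   # false negative rate = 1 - recall
        "fdr": fp / (fp + tp),   # false discovery rate = 1 - precision
        "npv": tn / (tn + fn),   # negative predictive value
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "mcr": (fp + fn) / total,  # misclassification rate = 1 - accuracy
    }


# Hypothetical counts, not taken from the paper's confusion matrices.
metrics = confusion_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes every denominator is non-zero; a production version would need to guard against classes that never occur or are never predicted.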

The predictive capability of the ensemble algorithms is also assessed across classification thresholds using the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR). A curve that lies closer to the top-left corner indicates a better ability to differentiate between the two classes [60]. The area under the curve (AUC) summarises this separability in a single number. An AUC close to 1 indicates good separability, an AUC of 0.5 indicates that the model discriminates no better than random guessing, and an AUC close to 0 indicates that the model's predictions are systematically inverted.
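One way to see why these AUC values carry that meaning is the pairwise interpretation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes the AUC directly from that definition. It is an illustration only; the function name and the toy scores are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
print(auc_from_scores(labels, [0.1, 0.2, 0.8, 0.9]))  # perfect separation
print(auc_from_scores(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
print(auc_from_scores(labels, [0.9, 0.8, 0.2, 0.1]))  # inverted ranking
```

The three calls correspond to the three regimes described above: perfect separation, no discrimination, and systematically inverted predictions.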

Table 5 Performance evaluation metrics

Comparing bagging, boosting and voting methods

The classification performance of the algorithms is evaluated by means of confusion matrices, which are shown for all the considered algorithms in Fig. 12. Figure 13 depicts the testing accuracies of all algorithms. In our experiment, GB outperformed the other algorithms, attaining the highest accuracy of 98.80%, followed by XGB and LGBM, while ET attained the lowest accuracy of 81.86%. The precision, recall, F1-score, and support of the algorithms are shown in Figs. 14, 15, 16 and 17, respectively. In most cases, GB performed best. Its nearest competitor was XGB, whereas LGBM and RF had fair overall performance.

The other comparison measures (FPR, FNR, FDR, NPV, specificity, MCC, MCR, and run time) are shown in Fig. 18. GB excels in FPR, FDR, specificity, MCC, and MCR, whereas XGB is better in FNR and NPV. GB falls short in only one measure, run time (RT): it took the second-longest time (after BDT), while LGBM took the least.

The AUC-ROC curves for the considered algorithms are shown in Fig. 19. According to the curves, XGB (0.987) performed best, marginally ahead of GB (0.986), while RF (0.866) performed the worst of the algorithms tested.

Fig. 12 Confusion matrices of (a) boosting, (b) bagging and (c) voting algorithms

Fig. 13 Accuracy comparison of the considered algorithms

Fig. 14 Comparison of precision values of the considered algorithms

Fig. 15 Comparison of recall values of the considered algorithms

Fig. 16 Comparison of F1-score values of the considered algorithms

Fig. 17 Comparison of support values of the considered algorithms

Fig. 18 Comparisons of (a) FPR, (b) FNR, (c) FDR, (d) NPV, (e) specificity, (f) MCC, (g) MCR, and (h) run time of the considered algorithms

Fig. 19 The AUC-ROC curves for the considered algorithms

Comparative analysis with literature

To establish the performance of our model, we compared it with several similar research papers with respect to various metrics, as shown in Table 6. Because GB demonstrated the best overall performance in predicting liver disease in our experiment, we used only the GB results in this comparison. The better performance attained by our model can be ascribed to the implemented methodologies, which include data imputation to account for missing values, identification and substitution of outliers, and efficient data normalization and standardization.

Table 6 Comparing the proposed model with recent literature


Liver disease causes two million deaths annually and affects many more patients worldwide. In this paper, we designed ensemble-learning-based models and evaluated them to find the model that most accurately predicts liver disease. We examined the effectiveness of three ensemble learning approaches: boosting, bagging and voting. For each approach, we considered three algorithms: GB, XGB, and LGBM for boosting; RF, ET and BDT for bagging; and LR, DT and SVM for voting.

GB demonstrated the highest overall performance in the experiment, attaining an accuracy of 98.80%. However, in some measures (e.g., precision (liver disease), recall (no liver disease), false negative rate, negative predictive value, and AUC-ROC), XGB performed better. The performances of LGBM and BDT were also fair. LGBM was the fastest to execute, while GB was among the slowest, second only to BDT. Compared with several similar works, our proposed model outperformed them in most metrics.

Due to their simplicity and convenience, we used mean and median methods to fill in the missing values. However, the straightforwardness of these methods brings some obvious limitations, such as loss of variability, distortion of relationships, introduction of biases, underestimation of uncertainty, and sensitivity to missingness patterns. To mitigate these limitations, alternative imputation methods that consider the underlying characteristics of the data and the missingness mechanism can be explored. Also, we used the SMOTE method to balance the dataset, which may introduce issues like overfitting, data leakage, noise amplification, parameter-sensitivity, and imbalanced feature representation. Though we carefully evaluated the impact of SMOTE on the ensemble models’ performance and took measures such as feature selection, alternative techniques, such as modified versions of SMOTE (e.g., borderline-SMOTE, ADASYN) or other data resampling methods, can be explored to address class imbalance while minimizing the potential drawbacks associated with the SMOTE method.
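The loss of variability mentioned above can be seen in a small example. The sketch below fills missing entries with the median of the observed values and compares the variance before and after; it is a hypothetical illustration using the Python standard library, not the preprocessing code used in this study, and the function name and the toy column are assumptions.

```python
from statistics import median, pvariance


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical feature column with two missing measurements.
raw = [1.0, None, 3.0, None, 10.0]
imputed = impute_median(raw)

observed = [v for v in raw if v is not None]
print(imputed)
print(pvariance(observed), pvariance(imputed))  # the imputed column varies less
```

Because every missing entry receives the same value, the imputed column is more concentrated around its centre than the observed data, which is one reason model-based or multiple-imputation methods may be preferable.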

To broaden the applicability of this study, the proposed method may be extended to encompass additional healthcare datasets that possess similar characteristics. In subsequent research, investigating deep learning techniques might result in improved liver disease detection and prediction. The developments in deep learning and advanced machine learning may lead to more precise and effective medical treatments.

Availability of data and materials

No datasets were generated or analysed during the current study.











  1. Devarbhavi H, Asrani SK, Arab JP, Nartey YA, Pose E, Kamath PS. Global burden of liver disease: 2023 update. J Hepatol. 2023;79:516–37.

  2. Shaheamlung G, Kaur H. The diagnosis of chronic liver disease using machine learning techniques. Inform Technol Ind. 2021;9(2):554–65.

  3. Tapper EB, Parikh ND. Mortality due to cirrhosis and liver cancer in the United States, 1999–2016: observational study. BMJ. 2018;362:k2817.

  4. Mostafa F, Hasan E, Williamson M, Khan H. Statistical machine learning approaches to liver disease prediction. Livers. 2021;1(4):294–312.

  5. Tanwar N, F Rahman K. Machine learning in liver disease diagnosis: current progress and future opportunities. IOP Conf Series: Mater Sci Eng (ICCRDA 2020). 2021;1022:012029.

  6. Ganie SM, Malik MB. An ensemble machine learning approach for predicting type-II diabetes mellitus based on lifestyle indicators. Healthc Analytics. 2022;2:100092.

  7. Naveen RK, Sharma, Nair AR. Efficient breast cancer prediction using ensemble machine learning models, in 4th International conference on recent trends on electronics, information, communication & technology (RTEICT), Bangalore, India, 2019.

  8. Ganie S, Pramanik PKD, BashirMalik M, Nayyar A. An improved ensemble learning approach for heart disease prediction using boosting algorithms. Comput Syst Sci Eng. 2023;46(3):3993–4006.

  9. Shanbhag PA, Prabhu KA, Reddy Subba NV, Rao BA. Prediction of lung cancer using ensemble classifiers. J Phys Conf Ser. 2022;2161:012007.

  10. Verma AK, Pal S, Tiwari BB. Skin disease prediction using ensemble methods and a new hybrid feature selection technique. Iran J Comput Sci. 2020;3:207–16.

  11. Ganie SM, Pramanik PKD. Predicting chronic liver disease using boosting technique. in 1st International conference on artificial intelligence for innovations in healthcare industries (ICAIIHI-2023). Raipur, India; 2024.

  12. Dai P, Gwadry-Sridhar F, Bauer M, Borrie M. Bagging ensembles for the diagnosis and prognostication of Alzheimer’s disease, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.

  13. Ganie SM, Pramanik PKD, Mallik S, Zhao Z. Chronic kidney disease prediction using boosting techniques based on clinical parameters. PLoS ONE. 2023;18(12):e0295234.

  14. Mahajan P, Uddin S, Hajati F, Moni MA. Ensemble learning for disease prediction: A review. Healthcare. 2023;11(12):1808.

  15. Ganie SM, Malik MB. Comparative analysis of various supervised machine learning algorithms for the early prediction of type-II diabetes mellitus. Int J Med Eng Inf. 2022;14(6):473–83.

  16. Nissa N, Jamwal S, Mohammad S. Early detection of cardiovascular disease using machine learning techniques an experimental study. Int J Recent Technol Eng. 2020;9(3):635–41.

  17. Shaikh FJ, Rao DS. Prediction of cancer disease using machine learning approach. Materials Today: Proceedings. 2022;50(Part 1):40–47.

  18. Pasha SN, Ramesh D, Mohmmad S, Anil Kishan NPP, Sandeep CH. Liver disease prediction using ML techniques. AIP Conference Proceedings. 2022;2418(1):020010.

  19. Mutlu EN, Devim A, Hameed AA, Jamil A. Deep learning for liver disease prediction. In: Djeddi C, Siddiqi I, Jamil A, Ali Hameed A, Kucuk İ, editors. Pattern recognition and artificial intelligence (MedPRAI 2021). Communications in computer and information science. Volume 1543. Cham: Springer; 2022. pp. 95–107.

  20. Kalaiselvi R, Meena K, Vanitha V. Liver disease prediction using machine learning algorithms. In international conference on advancements in electrical, electronics, communication, computing and automation (ICAECA). Coimbatore, India; 2021.

  21. Thirunavukkarasu K, Singh AS, Irfan M, Chowdhury A. Prediction of liver disease using classification algorithms, in 4th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 2018.

  22. Velu SR, Ravi V, Tabianan K. Identifying predictors of varices grading in patients with cirrhosis using ensemble learning. Health Technol. 2022;12:1211–35.

  23. Latha CBC, Jeeva SC. Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Inf Med Unlocked. 2019;16:100203.

  24. Senthilkumar B, Zodinpuii D, Pachuau L, Chenkual S, Zohmingthanga J, Kumar NcS, Hmingliana L. Ensemble modelling for early breast cancer prediction from diet and lifestyle. IFAC-PapersOnLine. 2022;55(1):429–35.

  25. Verma AK, Pal S, Kumar S. Comparison of skin disease prediction by feature selection using ensemble data mining techniques. Inf Med Unlocked. 2019;16:100202.

  26. Yadav DC, Pal S. Prediction of thyroid disease using decision tree ensemble method. Human-Intelligent Syst Integr. 2020;2:89–95.

  27. Hakim MA, Jahan N, Zerin ZA, Farha AB. Performance evaluation and comparison of ensemble based bagging and boosting machine learning methods for automated early prediction of myocardial infarction, in 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 2021.

  28. Amin R, Yasmin R, Ruhi S, Rahman MH, Reza MS. Prediction of chronic liver disease patients using integrated projection based statistical feature extraction with machine learning algorithms. Inf Med Unlocked. 2023;36:101155.

  29. Afrin S, Shamrat FMJM, Nibir TI, Muntasim MF, Moharram MS, Imran MM, Abdulla M. Supervised machine learning based liver disease prediction approach with LASSO feature selection. Bull Electr Eng Inf. 2021;10(6):3369–76.

  30. Dritsas E, Trigka M. Supervised machine learning models for liver disease risk prediction. Computers. 2023;12(1):19.

  31. Nahar N, Ara F, Neloy MAI, Barua V, Hossain MS, Andersson K. A comparative analysis of the ensemble method for liver disease prediction. In: 2nd International conference on innovation in engineering and technology (ICIET). Dhaka, Bangladesh; 2019.

  32. Kuzhippallil M, Joseph C, Kannan A. Comparative analysis of machine learning techniques for indian liver disease patients. in 6th International Conference on advanced computing and communication systems (ICACCS). Coimbatore, India; 2020.

  33. Naseem R, Khan B, Shah MA, Wakil K, Khan A, Alosaimi W, Uddin MI, Alouffi B. Performance assessment of classification algorithms on early detection of liver syndrome. J Healthc Eng. 2020;2020:6680002.

  34. MD Quadir A, Kulkarni S, Joshua CJ, Vaichole T, Mohan Sk, Iwendi C. Enhanced preprocessing approach using ensemble machine learning algorithms for detecting liver disease. Biomedicines. 2023;11(2):581.

  35. Dalal S, Onyema EM, Malik A. Hybrid XGBoost model with hyperparameter tuning for prediction of liver disease with better accuracy. World J Gastroenterol. 2022;28(46):6551–63.

  36. Bulucu FO, Acer İ, Latifoğlu F. Predicting liver disease using decision tree ensemble methods. J Institute Sci Technol. 2022;38(2):261–7.

  37. Edeh MO, Dalal S, Dhaou IB, Agubosim CC, Umoke CC, Richard-Nnabu NE, Dahiya N. Artificial intelligence-based ensemble learning model for prediction of hepatitis C disease. Front Public Health. 2022;10:892371.

  38. Meng L, Treem W, Heap GA, Chen J. A stacking ensemble machine learning model to predict alpha-1 antitrypsin deficiency-associated liver disease clinical outcomes based on UK Biobank data. Sci Rep. 2022;12(1):17001.

  39. Bayani A, Hosseini A, Asadi F, Hatami B, Kavousi K, Aria M, Zali MR. Identifying predictors of varices grading in patients with cirrhosis using ensemble learning. Clin Chem Lab Med (CCLM). 2022;60(12):1938.

  40. Gupta K, Jiwani N, Afreen N, Divyarani D. Liver disease prediction using machine learning classification techniques, in 11th international conference on communication systems and network technologies (CSNT). Indore, India; 2022.

  41. Hameed EM, Hussein IS, Altameemi HG, Kadhim QK. Liver disease detection and prediction using SVM techniques. 3rd Information Technology to enhance e-learning and other application (IT-ELA). Iraq: Baghdad; 2022.

  42. Zhao J, Wang P, Pan Y. Predicting liver disorder based on machine learning models. J Eng. 2022;2022(10):978–84.

  43. Brown G. Ensemble learning. In: Sammut C, Webb GI, editors. Encyclopedia of machine learning. Boston, MA: Springer; 2011. pp. 312–20.

  44. Sagi O, Rokach L. Ensemble learning: a survey. WIREs Data Min Knowl Discov. 2018;8(4):e1249.

  45. Zhang C, Ma Y, editors. Ensemble machine learning: methods and applications. New York: Springer; 2012.

  46. Ferreira AJ, Figueiredo MAT. Boosting algorithms: a review of methods, theory, and applications. In: Zhang C, Ma Y, editors. Ensemble machine learning. Boston, MA: Springer; 2012. pp. 35–85.

  47. Tanha J, Abdi Y, Samadi N, Razzaghi N, Asadpour M. Boosting methods for multi-class imbalanced data classification: an experimental review. J Big Data. 2020;7:1.

  48. Chen T, Guestrin C. XGBoost: A scalable and portable parallel tree boosting framework. In: 22nd ACM SIGKDD international conference on knowledge discovery and data mining. San Francisco, USA; 2016.

  49. Aziz N, Akhir EAP, Aziz IA, Jaafar J, Hasan MH, Abas ANC. A study on gradient boosting algorithms for development of AI monitoring and prediction systems. in International conference on computational intelligence (ICCI). Malaysia; 2020.

  50. Ke G, Meng Q, Finley T, Wang T, Chen W, Chen W, Ma W, Ye Q, Liu T-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst (NIPS 2017). 2017;30:3146–54.

  51. Breiman L. Bagging predictors. Mach Learn. 1996;24:123–40.

  52. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

  53. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63:3–42.

  54. Ganie SM, Malik MB. An ensemble machine learning approach for predicting Type-II diabetes mellitus based on lifestyle indicators. Healthc Analytics. 2022;2:100092.

  55. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction. Springer; 2009.

  56. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–39.

  57. Schapire RE, Singer Y. Improved boosting algorithms using confidence-rated predictions. Mach Learn. 1999;37:297–336.

  58. Le NQK, Do DT, Nguyen TTD, Le QA. A sequence-based prediction of Kruppel-like factors proteins using XGBoost and optimized features. Gene. 2021;787:145643.

  59. Pramanik PKD, Bandyopadhyay G, Choudhury P. Predicting relative topological stability of mobile users in a P2P mobile cloud. SN Appl Sci. 2020;2(1827):11.

  60. Ganie SM, Pramanik PKD, Malik MB, Mallik S, Qin H. An ensemble learning approach for diabetes prediction using boosting techniques. Front Genet. 2023;14:1252159.



Not applicable.

Author information

Authors and Affiliations



SMG: Conceptualization, Data curation, Methodology; Formal analysis, Validation, Visualization, Writing - review & editing; PKDP: Investigation, Formal analysis, Validation, Prepared figures, Writing - original draft, Writing - review & editing; ZZ: Supervision, Funding, Writing - review & editing.

Corresponding authors

Correspondence to Pijush Kanti Dutta Pramanik or Zhongming Zhao.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ganie, S.M., Dutta Pramanik, P.K. & Zhao, Z. Improved liver disease prediction from clinical data through an evaluation of ensemble learning approaches. BMC Med Inform Decis Mak 24, 160 (2024).
