 Research article
Applying probability calibration to ensemble methods to predict 2-year mortality in patients with DLBCL
BMC Medical Informatics and Decision Making volume 21, Article number: 14 (2021)
Abstract
Background
Under the influence of chemotherapy regimens, clinical staging, immunologic expression and other factors, the survival rates of patients with diffuse large B-cell lymphoma (DLBCL) differ. Accurate prediction of mortality hazards is key to precision medicine, as it can help clinicians make optimal therapeutic decisions to extend the survival times of individual patients with DLBCL. Thus, we have developed a model to predict the mortality hazard of DLBCL patients within 2 years of treatment.
Methods
We evaluated 406 patients with DLBCL and collected 17 variables from each patient. The predictive variables were selected by the Cox model, the logistic model and the random forest algorithm. Five classifiers were chosen as the base models for ensemble learning: the naïve Bayes, logistic regression, random forest, support vector machine and feedforward neural network models. We first calibrated the biased outputs from the five base models by using probability calibration methods (including shape-restricted polynomial regression, Platt scaling and isotonic regression). Then, we aggregated the outputs from the various base models to predict the 2-year mortality of DLBCL patients by using three strategies (stacking, simple averaging and weighted averaging). Finally, we assessed model performance over 300 holdout tests.
Results
Gender, stage, IPI, KPS and rituximab were significant factors for predicting the death of DLBCL patients within 2 years of treatment. The stacking model whose base models were first calibrated by shape-restricted polynomial regression performed best among all methods (AUC = 0.820, ECE = 8.983, MCE = 21.265). In contrast, the performance of the stacking model without probability calibration was inferior (AUC = 0.806, ECE = 9.866, MCE = 24.850). With probability calibration, the prediction errors of the simple averaging and weighted averaging ensemble models also decreased.
Conclusions
Among all the methods compared, the proposed model has the lowest prediction error when predicting the 2year mortality of DLBCL patients. These promising results may indicate that our modeling strategy of applying probability calibration to ensemble learning is successful.
Background
Diffuse large B-cell lymphoma (DLBCL), the most common subtype of B-cell non-Hodgkin lymphoma (NHL), accounts for 30–40% of all NHLs [1]. Due to its heterogeneity in clinical presentation and prognosis, DLBCL remains a significant clinical challenge [2, 3]. Although the application of rituximab has improved the overall survival rate, 30–50% of DLBCL patients do not respond to chemotherapy or relapse after remission and eventually die [4, 5]. In addition, depending on the chemotherapy regimen, clinical stage, immunologic expression and other factors, the prognoses of patients differ markedly [2, 6, 7]. Thus, we aim to use clinicopathologic factors to predict the mortality hazard for patients with DLBCL at the individual level. Clinicians need precise risk estimates for individuals to help them make optimal therapeutic decisions and achieve precision medicine [8]. Through early risk assessment, appropriate therapies may be initiated quickly, ultimately improving the clinical outcomes of individual cases [9, 10].
Instead of using a single model, we use ensemble learning to predict the 2-year mortality of patients with DLBCL. Ensemble learning can often achieve better performance than a single model by constructing and combining multiple models [11, 12], also called individual models or base models. Dietterich [13] explained why the ensemble method works from statistical, computational and representational points of view, and Kohavi [14] explained its success in terms of bias-variance decomposition. Due to its superior performance, ensemble learning has been applied in many fields. Wang et al. [15] proposed a heterogeneous ensemble of the random forest and XGBoost algorithms to classify pulsar candidates, which achieved a higher recall rate than either algorithm alone. In [16], an ensemble of deep belief networks with different parameters was introduced for regression and time series forecasting; across several evaluation datasets, the ensemble-based model outperformed competing methods. Ensemble learning is also used to assist physicians in decision-making; for example, [17] developed an ensemble of deep belief networks for the early diagnosis of Alzheimer's disease.
To achieve high performance, the base models in ensemble learning should have good accuracy and diversity [11, 12]. Diversity is key to ensemble learning, and there are three main ways to achieve it: data diversity, parameter diversity and structural diversity methods [12]. The data diversity method uses multiple datasets generated from the original dataset to train different base models. Since the multiple datasets differ from each other, the trained base models generate diverse outputs. The parameter diversity method trains different base models using different parameter sets. Even with the same algorithm or the same training data, the outputs from base models may vary with different parameters. The structural diversity method uses different algorithms to generate different base models; such ensemble models are also called heterogeneous ensembles. In our work, we consider five different algorithms for heterogeneous ensemble prediction: the naïve Bayes (NB), logistic regression (logit), random forest (RF), support vector machine (SVM) and feedforward neural network (FNN) algorithms. However, some evidence [18,19,20,21,22,23] suggests that a few classifiers, such as the NB, RF and SVM classifiers, cannot generate precise probability estimates, although they achieve good classification performance on many real-world problems. Therefore, a different model structure is proposed to achieve both high accuracy and high diversity for the base models. That is, before aggregating the outputs from the various base models, we first correct the poorly calibrated base models by using probability calibration methods. Probability calibration methods try to find a calibration function that maps the initial outputs of classifiers into more accurate posterior probabilities [18]. Many methods, such as the popular Platt scaling (Platt) and isotonic regression (IsoReg) approaches, have been proposed in this field. However, they all have their own preconditions.
For example, Platt is effective when the output of the classifier is sigmoid-shaped, and IsoReg is prone to overfitting when the sample size is small. Thus, we calibrate the base models with shape-restricted polynomial regression (RPR), which is a flexible and powerful calibration method that is not constrained to a specific classifier [24]. By calibrating the base models first and then aggregating them, we hope to achieve strong ensemble performance.
Our research has three characteristics. First, this paper focuses not on the prediction of categories, but on membership probability. Second, a different model structure is proposed that applies probability calibration to ensemble learning. Third, both discrimination and calibration are considered in the model comparison.
Methods
Data sources and predictive variables
The data used in this study were derived from Shanxi Cancer Hospital, China. We assessed 406 patients diagnosed with DLBCL between April 2010 and May 2017, 116 of whom died within 2 years after treatment. We collected 17 variables for each patient from electronic medical records. See Table 1 for the names and groupings of each variable.
We applied three methods to select the predictive variables: the Cox model, the logistic model and the random forest algorithm. The Cox model uses outcomes and survival times as dependent variables to analyze the influencing factors of events. The results produced by the Cox model may not be reliable when the number of positive events is less than 10 times the number of covariates [25, 26]. Therefore, we first analyzed each variable by using the univariate Cox model. Then, the variables (including age, gender, stage, IPI, KPS, LDH, β2-MG, rituximab, and Bcl-6) that showed a univariate relationship (P < 0.1) with death were included in the multivariate Cox model. In addition to these 9 variables, we also added Ki-67 to the model. Ki-67 is a nuclear protein that can be used as a biomarker for cell proliferation, and its expression is widely used to evaluate the prognoses of many kinds of cancers, including lymphoma. A meta-analysis suggested that Ki-67 expression is associated with the prognoses of lymphoma patients. Subgroup analysis showed that high Ki-67 expression is highly associated with a poor overall survival rate for DLBCL [27]. Another study also suggested that patients with high Ki-67 expression have poor prognoses with standard first-line therapy [28]. Therefore, Ki-67 was also included in the multivariate Cox model in our study, although it was not significant in the univariate Cox analysis. Thus, these 10 variables were entered into the multivariate Cox model, and 8 variables remained (P < 0.1). The results of the multivariate Cox analysis are displayed in Table 2.
We also analyzed all the collected variables using the logit and RF approaches. The variables selected (P < 0.1) by the logit model were consistent with those selected by the Cox model (see Table 2). If the threshold was 0.05, the Cox model selected 5 variables (gender, stage, IPI, KPS, and rituximab), while logit selected 4 variables (stage, IPI, KPS, and rituximab). The only difference is that the Cox model contains “gender” and the logit model does not. Figure 1 shows the ranking of variable importance obtained by the RF algorithm. The ranking showed that stage, IPI, KPS, and rituximab were the four most important variables regardless of which importance metric was used. Moreover, we found that the ranking based on the Gini index (Fig. 1b) also included “gender” in the top eight variables.
According to the results of these three methods, we first used 0.05 as the threshold and took the union of the variables selected by the Cox and logit models as predictive variables (including gender, stage, IPI, KPS, and rituximab). Based on these 5 variables, we pre-trained all the base models with 100 iterations. Then, we further added the remaining 3 variables (age, LDH, and β2-MG) to each base model. Although each base model at this point contained all 8 variables selected by the Cox and logit models when the threshold was 0.1, the predictive performance was not significantly improved. Thus, we excluded these 3 variables and only used gender, stage, IPI, KPS, and rituximab for prediction.
Model structure
As Fig. 2 shows, there are three main components of the model.
The first part is the construction of the base models. We used five common classifiers with good classification performance on many real-world problems as the base models: the NB, logit, RF, SVM, and FNN models. NB, which calculates the posterior probability that an observation belongs to each class based on Bayes' theorem, classifies an observation as a member of the class with the largest posterior probability. Logit is a generalized linear model used to solve classification problems; since it takes the logistic function as the link function, it generates posterior probabilities directly. The RF algorithm builds a number of decision trees on bootstrapped training sets for prediction. In our RF, we used the voting ratio of all decision trees as the initial probability estimate. SVMs attempt to find an optimal separating hyperplane to classify observations into different classes. The sign of the output determines the class of the sample, and the magnitude of the output can be used as a measure of predictive confidence, since examples far from the hyperplane are more likely to be classified correctly than examples close to it. The FNN, in which the neurons in each layer are fully connected to the neurons in the next layer and there is no loop in the network, is a common neural network structure. In our study, we built a 3-layer network whose hidden layer included 500 units. We designed our network with a large number of hidden nodes because several studies suggest that a neural network with excess capacity generalizes better than a simple network when trained with back-propagation and early stopping [29,30,31]. Other studies have also shown that a feedforward neural network with a single hidden layer containing enough units can fit any continuous function of arbitrary complexity [32, 33].
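For illustration, the five base classifiers could be instantiated roughly as follows with scikit-learn. The paper itself used R packages for the first four models and a Keras network for the FNN, so the class choices and parameter values below are our stand-ins, not the authors' code:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-ins: the study used R's e1071/glm/randomForest and a
# Keras network with one 500-unit hidden layer trained with early stopping.
base_models = {
    "NB": GaussianNB(),
    "logit": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=1000),  # vote ratio ~ predict_proba
    "SVM": SVC(kernel="rbf"),  # decision_function gives the raw margin
    "FNN": MLPClassifier(hidden_layer_sizes=(500,), early_stopping=True),
}
```

Each model exposes either a probability estimate (`predict_proba`) or a raw score (`decision_function` for the SVM) that the later calibration step can work on.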
The second part is the probability calibration of the base models. We selected RPR to calibrate the base models since it is a flexible and powerful method that is not constrained by a specific classifier [24]. We also applied two other methods (Platt, IsoReg) to investigate whether RPR was the best.
RPR calibrates the initial probability using polynomial regression, in which the calibration function f has the following form [24]:

\(f(s) = \sum_{j=0}^{k} a_j s^j\)

where s is the initial probability of an observation and \(a = (a_0, a_1, \ldots, a_k)\) are the polynomial coefficients. The coefficients are obtained by solving the following optimization problem:

\(\min_{a} \sum_{i=1}^{N} \left( f(s_i) - y_i \right)^2 \quad \text{s.t.} \quad \text{(a)}\ 0 \le f(s) \le 1,\ \ \text{(b)}\ f'(s) \ge 0 \ \ \forall s \in [0,1], \quad \text{(c)}\ \lVert a \rVert_1 \le \lambda\)

Constraint (a) guarantees that all corrected probabilities fall in the interval between 0 and 1. Constraint (b), which requires the derivative of f(s) to be nonnegative, ensures the monotonicity of the polynomial. In constraint (c), the l1-norm of a is used to prevent the polynomial from overfitting.
Platt is a parametric method that uses the sigmoid function as the calibration function [19]:

\(P(y = 1 \mid s) = \frac{1}{1 + \exp(As + B)}\)

where A and B are estimated by maximum likelihood on the training set \((s_i, y_i)\).
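As a sketch, the sigmoid fit of Platt scaling can be reproduced with a one-dimensional logistic regression on the classifier scores. Using scikit-learn with a near-unregularized `C`, and omitting Platt's original smoothing of the 0/1 targets, are our simplifications:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Fit P(y=1|s) = 1 / (1 + exp(A*s + B)) by logistic regression on the
    raw score s (A = -coef_, B = -intercept_). Note: Platt's original method
    also smooths the 0/1 targets, which this sketch omits."""
    lr = LogisticRegression(C=1e6)  # large C ~ plain maximum likelihood
    lr.fit(np.asarray(scores, float).reshape(-1, 1), labels)
    return lambda s: lr.predict_proba(np.asarray(s, float).reshape(-1, 1))[:, 1]
```

The returned function maps any new score to a calibrated probability in [0, 1], increasing monotonically in the score.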
IsoReg is a nonparametric method that attempts to find an isotonic (i.e., nondecreasing) function m that minimizes the following objective [21]:

\(\min_{m} \sum_{i=1}^{N} \left( m(s_i) - y_i \right)^2\)

where \((y_1, y_2, \ldots, y_N)\) is the label sequence of all samples sorted by their initial probabilities. A well-known method called the pair-adjacent violators (PAV) algorithm is used to estimate the isotonic function [34].
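Isotonic calibration is available off the shelf: scikit-learn's `IsotonicRegression` implements the PAV algorithm described above. The toy scores and labels below are purely illustrative:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy data: initial classifier scores and true 0/1 labels.
scores = np.array([0.1, 0.3, 0.2, 0.8, 0.7, 0.9])
labels = np.array([0, 0, 1, 1, 0, 1])

# Fit sorts by score, then pools adjacent violators into a step function.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(scores, labels)
calibrated = iso.predict(np.array([0.15, 0.5, 0.85]))  # nondecreasing in the score
```

Because the fitted function is a piecewise-constant (interpolated) step function, IsoReg can overfit when the calibration set is small, which is the failure mode the text mentions.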
The third part is the combination of the base models. We used three methods (simple averaging, weighted averaging, and stacking) to combine the above 5 base models. Stacking, or stacked generalization, takes the outputs of the base models as its inputs and uses another machine learning algorithm (also called a meta-learner) to estimate the weight of each base model [35]. In our research, logit was used as the meta-learner. Since the outputs of all base models are numeric in our study, we also used simple averaging and weighted averaging to combine these base models and compared the results with those obtained by stacking. In weighted averaging, the weight of each base model was set to the reciprocal of its expected calibration error (ECE) or maximum calibration error (MCE). We first combined the 5 uncalibrated base models (NB, logit, RF, SVM, FNN) using these three methods. Then, according to our modeling strategy, we used the same methods to combine the calibrated base models (NB-RPR, logit, RF-RPR, SVM-RPR, FNN). We did not calibrate the logit and FNN models because they could already generate accurate probability estimates, and their performance after calibration was not significantly improved.
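The three combination strategies can be sketched as follows. The function shape and names are our illustration, not the authors' code; as in the paper, the stacking meta-learner is a logistic regression trained on validation-set predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def combine(val_preds, val_y, test_preds, ece_per_model):
    """Aggregate base-model probabilities three ways (sketch).
    val_preds / test_preds: arrays of shape (n_samples, n_models)."""
    # 1) Simple averaging: equal weight for every base model.
    simple = test_preds.mean(axis=1)
    # 2) Weighted averaging: weights proportional to 1 / calibration error.
    w = 1.0 / np.asarray(ece_per_model, float)
    weighted = test_preds @ (w / w.sum())
    # 3) Stacking: logit meta-learner fit on held-out validation predictions.
    meta = LogisticRegression().fit(val_preds, val_y)
    stacked = meta.predict_proba(test_preds)[:, 1]
    return simple, weighted, stacked
```

Training the meta-learner on validation-set predictions rather than training-set predictions is what guards stacking against overfitting, as the paper notes later.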
Model construction and evaluation
We combined the holdout test with 3-fold cross-validation to complete the model construction and evaluation processes. To reduce statistical variability, all experiments were repeated 300 times. In each experiment, we randomly sampled 80% of the data (320 samples) as the training set and the remaining 20% (86 samples) as the testing set. Stratified sampling was used for each division to keep the data distribution consistent across the 300 experiments. In 3-fold cross-validation, the data were divided into 3 mutually exclusive subsets of approximately equal size. In each iteration, 2 of the 3 subsets were used as the training set, and the remaining subset was used as the validation set. The final evaluation was the average performance of the model on the 3 validation sets.
During the construction of the 5 base models, we first performed 3-fold cross-validation within the training set to find the optimal hyperparameters of the RF, SVM, and FNN models. Then, we used all training data to train these three models with the optimal hyperparameters and assessed them on the testing set. For the RF algorithm, the number of candidate variables at each split was set to 2 or 3, and the choices for the number of trees were {500, 600, …, 1500}. For the SVM, the kernel function was selected from the linear and Gaussian kernels, and the choices for the parameters C and gamma were \(\{10^i\}_{i=-4}^{4}\). In addition, we used all training data to fit the NB and logit models and assessed them on the testing set. Although they have no hyperparameters, we also performed 3-fold cross-validation for them within the training set. For each base model, we finally extracted all the predictive values in the 3 validation sets as the training set of the calibration function.
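The SVM hyperparameter search described above can be sketched with scikit-learn's `GridSearchCV`. The grids mirror those stated in the text (we read the garbled exponent range as \(10^{-4}\) to \(10^{4}\)); the use of `SVC` and the default scoring are our assumptions about the implementation:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two sub-grids: gamma only applies to the Gaussian (rbf) kernel.
param_grid = [
    {"kernel": ["linear"], "C": [10.0**i for i in range(-4, 5)]},
    {"kernel": ["rbf"],
     "C": [10.0**i for i in range(-4, 5)],
     "gamma": [10.0**i for i in range(-4, 5)]},
]
# 3-fold CV inside the training set, matching the paper's protocol.
search = GridSearchCV(SVC(), param_grid, cv=3)
```

After `search.fit(X_train, y_train)`, `search.best_estimator_` is refit on the full training set with the winning parameters, which matches the "train on all training data with optimal hyperparameters" step in the text.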
During the probability calibration of the 5 base models, we first used 3-fold cross-validation within the calibration training set to determine the k and λ parameters of RPR. Then, we used the entire calibration training set to train the Platt and IsoReg models, and the RPR model with the optimal k and λ. Finally, we calibrated the predictive values of the 5 base models in the testing set by using the trained Platt, IsoReg, and RPR models and then evaluated them. For RPR, the choices for the degree k were {4, 5, 6, …, 20} and the choices for λ were \(\{4^i\}_{i=1}^{5}\).
During the combination of the 5 base models, the weight of each base model was set to 0.2 for simple averaging. For weighted averaging, the weight of each base model was set to the reciprocal of its ECE or MCE on the training data. For stacking, the outputs from the 5 base models were used as the inputs of the meta-learner (the logit model, in our study). To avoid overfitting, we used the union of the predictions of the base models on the validation sets to train the meta-learner. Finally, we used the weights learned by stacking to combine the 5 base models.
Both discrimination and calibration were considered in the model evaluation. Discrimination and calibration are indispensable properties for assessing the accuracy of risk prediction models [36]. Discrimination is the ability to distinguish between patients who will have an event and those who will not. Calibration measures the consistency between the predicted probability and the true (observed) probability of patients at different risk strata. Although providing accurate probability estimates is our purpose, there is no need to further assess the calibration when the model has poor discrimination [36]. Therefore, we used the AUC to measure discrimination. The HL test, ECE and MCE were used to evaluate the calibration of the models.
The HL test evaluates whether the difference between the predicted probability and the true probability is caused by chance [37]. The ECE and MCE are two measures related to the reliability diagram [38]. To compute them, the predictions are first sorted and then partitioned into k bins of equal size. For each bin, the predicted probability is the mean of the predictions in this bin, and the true (observed) probability is the proportion of positive observations in this bin. The ECE and MCE calculate the average prediction error and maximum prediction error over these bins, respectively:

\(\mathrm{ECE} = \frac{1}{k} \sum_{i=1}^{k} \left| p_i - o_i \right|, \qquad \mathrm{MCE} = \max_{1 \le i \le k} \left| p_i - o_i \right|\)

where p_i and o_i are the predicted probability and the observed probability in the i-th bin, respectively. The lower the values of the ECE and MCE, the lower the calibration error of the prediction.
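The binning procedure described above (sort the predictions, split into k equal-size bins, compare the mean prediction with the observed positive rate per bin) can be computed in a few lines of NumPy; the function name and the default k are illustrative:

```python
import numpy as np

def ece_mce(y_true, y_prob, k=10):
    """ECE and MCE via equal-size bins: sort by predicted probability, split
    into k bins, then take the mean/max gap between the mean prediction and
    the observed positive rate in each bin."""
    order = np.argsort(y_prob)
    bins_p = np.array_split(np.asarray(y_prob)[order], k)
    bins_y = np.array_split(np.asarray(y_true)[order], k)
    gaps = np.array([abs(p.mean() - y.mean()) for p, y in zip(bins_p, bins_y)])
    return gaps.mean(), gaps.max()
```

Note that the ECE/MCE values reported in this paper appear to be on a percentage scale (e.g., ECE = 8.983); multiply the output above by 100 to match that convention.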
The NB, logit, RF, and SVM models were implemented in R 3.6 using the "e1071", "glm", "randomForest" and "e1071" packages, respectively. The FNN and RPR models were implemented with Keras and CVXPY in Python 3.6 [39, 40].
Results
To reduce the variability caused by the data partition, the holdout test was executed 300 times. The final evaluation was based on the average results of these 300 iterations.
Performances of the 5 base models before and after calibration
We calibrated the NB, Logit, RF, SVM, and FNN models using RPR and compared the results with those of the Platt and IsoReg models. The performances of the 5 base models before and after calibration are shown in Table 3. The main features are summarized as follows:
The AUCs of the 5 uncalibrated base models were all greater than 0.75, so they were considered to have good discrimination. Except for that of the SVM, the AUCs of the other 4 models were all greater than 0.8. In terms of calibration performance, the FNN had the lowest calibration error (ECE = 9.211, MCE = 23.500). The logit model could also generate accurate probability estimates (ECE = 9.517, MCE = 24.400). According to the HL test, these two models achieved good calibration (P > 0.05) in 228 and 257 of the 300 experiments, respectively. See Fig. 3a. By comparison, the NB, RF and SVM models had large prediction errors and achieved good calibration in only 7, 1, and 107 of the 300 experiments, respectively.
After probability calibration, the errors of the NB, RF and SVM models decreased significantly, and their frequencies of generating accurate probability estimates increased to 244, 219, and 196, respectively. However, regardless of which calibration method was used, the calibration errors of the logit and FNN models did not decrease further. Of the 3 evaluated probability calibration methods, RPR achieved the best correction for the SVM (ECE = 10.893, MCE = 26.300). IsoReg achieved a good correction effect only for the RF, while NB-IsoReg and SVM-IsoReg still had large calibration errors. According to Fig. 3a, NB-RPR, RF-RPR and SVM-RPR achieved good calibration in the 300 experiments more frequently than the Platt- and IsoReg-calibrated models.
Using 3 methods to combine the base models
First, we combined the NB, logit, RF, SVM, and FNN models by simple averaging, weighted averaging using the ECE, weighted averaging using the MCE, and stacking. The results are presented in Table 4. Of all the models, the FNN had the best performance in terms of both discrimination and calibration (AUC = 0.813, ECE = 9.211, MCE = 23.500). Compared with those of the NB, RF and SVM models, the calibration errors of the 4 ensemble models decreased significantly.
Then, we combined the NB-RPR, logit, RF-RPR, SVM-RPR, and FNN models in the same ways. The results are presented in Table 5. Stacking-ENC performed the best of all the models (AUC = 0.820, ECE = 8.983, MCE = 21.265). The MCEs of the 4 ensemble models were all lower than those of any of the base models. Except for SA-ENC, the remaining 3 ensemble models had lower ECEs than any base model.
Last, we compared the 4 ensemble models (SA-ENC, ECE-ENC, MCE-ENC, Stacking-ENC) whose base models underwent probability calibration with the 4 ensemble models (SA-EN, ECE-EN, MCE-EN, Stacking-EN) whose base models did not. Regardless of which combination method was used, the 4 ensemble models with probability calibration had lower calibration errors, in terms of both the ECE and MCE, than the ensemble models without it. According to the HL test, our 4 ensemble models had the highest frequencies of achieving good calibration (P > 0.05): 270, 274, 272 and 272 out of 300 experiments. By comparison, the frequencies were 204, 219, 221 and 238 for the 4 ensemble models without probability calibration. See Fig. 3b for the results.
Discussion
We predicted 2year mortality in patients with DLBCL at the individual level. The proposed method that applied probability calibration to ensemble learning performed satisfactorily in terms of both discrimination and calibration.
We used 5 variables (gender, stage, IPI, KPS, and rituximab) to predict the 2-year mortality of patients with DLBCL. Most of these 5 variables have been proven to affect the prognoses of patients with DLBCL. The development of rituximab is a breakthrough in the treatment of DLBCL, and current research indicates that rituximab can improve the clinical outcomes for almost all DLBCL subgroups [41,42,43,44]. IPI is a recognized prognostic indicator of DLBCL, and high IPI values are associated with poor prognoses [45, 46]. An advanced disease stage has been reported to be a poor prognostic factor of DLBCL, and it is associated with a low overall survival rate and a short disease-free survival time [46]. Although there is no research on the relationship between KPS and the prognoses of patients with DLBCL, the performance status of patients [47] affects treatment options and, perhaps, patient outcomes.
Although the discrimination abilities of the 5 base models are very similar, the differences in calibration are obvious. The logit and FNN models can accurately predict the 2-year mortality of DLBCL patients. However, the errors of the initial probabilities of the NB, RF and SVM models are large. These results are consistent with several reports [18,19,20,21,22,23]. The logit and FNN models return well-calibrated predictions because they directly optimize the log-loss of the probability [23]. For the NB classifier, the output is often pushed toward 0 or 1 because the required conditional independence assumption may not hold in reality [21,22,23]. Conversely, the output of the RF algorithm is often pushed away from 0 and 1 because it is difficult to obtain identical predictions from all decision trees [18, 20, 23]. Furthermore, the SVM also pushes predictions away from 0 and 1 and induces a sigmoid-shaped distortion. Although the magnitude of the output can be taken as a measure of confidence in the prediction, these values are often poorly calibrated [19, 23].
The Platt model is effective when the distortion of the prediction is sigmoid-shaped. In our study, the biased probabilities produced by the NB, RF and SVM models were corrected well by the Platt model. IsoReg is a universal calibration method whose only restriction is that the mapping function be isotonic (i.e., nondecreasing). However, the calibration errors of the NB and SVM models were still large after calibration by IsoReg. This may be caused by overfitting resulting from scarce data. A report by Niculescu-Mizil suggested that IsoReg may not be suitable for calibration with a small sample size, especially when the sample size is less than 1000 [23]. By comparison, RPR is more flexible and powerful. Unlike the Platt model, RPR can calibrate any classifier since it places no constraints on the distribution of the initial output. Compared with IsoReg, RPR uses polynomial regression as the calibration function, so it is continuous over the entire interval. In addition, RPR strictly satisfies the monotonicity requirement and can be solved conveniently by optimization tools such as CVXPY [39, 40]. RPR can theoretically fit any calibration function as the polynomial degree increases. Therefore, RPR was used to correct poorly calibrated classifiers in our study. According to the HL test, NB-RPR, RF-RPR and SVM-RPR achieved good calibration in the 300 experiments more frequently than the Platt and IsoReg models. Although the MCE of NB-Platt was lower than that of NB-RPR and the ECE of RF-IsoReg was lower than that of RF-RPR, we believe that RPR achieved better calibration for the NB, RF and SVM models than the other calibration methods when the results of the HL test are also considered.
Ensemble methods can achieve better performance than single models by combining multiple models, especially when the individual models are weak. Many theories in this field were initially proposed for weak models, such as the family of boosting methods. Although aggregating weak models is sufficient for good performance in theory, models with good accuracy are often used as base models in practice, especially when few base models are available. In our work, we used probability calibration to improve the accuracy of the base models' probabilistic predictions and thereby obtain strong ensemble performance. The results show that the calibration performance of the ensemble model was improved by calibrating the base models first. In terms of both the ECE and MCE, the calibration errors of our 4 ensemble models (SA-ENC, ECE-ENC, MCE-ENC, Stacking-ENC) were lower than those of the 4 ensemble models (SA-EN, ECE-EN, MCE-EN, Stacking-EN) without probability calibration. According to the HL test, our 4 ensemble models had the highest frequencies of achieving good calibration in the 300 experiments: 270, 274, 272 and 272, compared with 204, 219, 221 and 238 for the ensembles without probability calibration. Of all models, the stacking method whose base models were first calibrated by RPR performed best in terms of both discrimination and calibration (AUC = 0.820, ECE = 8.983, MCE = 21.265). By contrast, the NB classifier generated probabilistic predictions with the highest error (ECE = 14.206, MCE = 38.900). Overall, our model, which applies probability calibration to ensemble learning, achieves the desired results.
Our research has limitations. First, the bootstrapping method may be more appropriate than the holdout method for model evaluation in our study. Since the sample size of this research is small, a model trained on a bootstrapped dataset, which has the same sample size as the original data, may have a lower estimation error. Second, it may be possible to improve the maximum calibration error. The probability after calibration may not change significantly, as the calibration function has to ensure global monotonicity over the entire range, which means that the calibration error is largely influenced by misclassified samples. In the future, we will collect more feature information about DLBCL patients to improve the discrimination ability of the model, which may further increase the accuracy of its probabilistic predictions. Third, we selected only five common classifiers as the base models. The impact of increasing or decreasing the number of base models, or of performing selective ensemble learning, on the final prediction can be further investigated. Fourth, this research was based on data provided by a single hospital, so external validation is required to evaluate the generalizability of the model.
Conclusions
In conclusion, we predicted the 2year mortality of patients with DLBCL at the individual level. The proposed method performs satisfactorily and has the potential to help clinicians improve the outcomes of individuals by providing accurate risk predictions. Furthermore, these promising results may indicate that our modeling strategy of applying probability calibration to ensemble learning is successful.
Availability of data and materials
The dataset generated and analyzed during the current study is not publicly available because subsequent studies have not been completed, but it is available from the corresponding author on reasonable request. In addition, we agree to share all the code used in this study, which has been written into a separate file, so that other interested researchers can easily implement our model.
Abbreviations
 DLBCL:

Diffuse Large Bcell Lymphoma
 NB:

Naïve Bayes
 Logit:

Logistic Regression
 RF:

Random Forest
 SVM:

Support Vector Machine
 FNN:

Feedforward Neural Network
 Platt:

Platt Scaling
 IsoReg:

Isotonic Regression
 RPR:

Shape-restricted Polynomial Regression
 IPI:

International Prognostic Index
 KPS:

Karnofsky Performance Status
 AUC:

Area Under the ROC Curve
 HL:

Hosmer-Lemeshow
 ECE:

Expected Calibration Error
 MCE:

Maximum Calibration Error
References
 1.
Jemal A, Siegel R, Xu JQ, et al. Cancer statistics. CA Cancer J Clin. 2013;52(5):1–24.
 2.
Roschewski M, Staudt LM, Wilson WH. Diffuse large B-cell lymphoma—treatment approaches in the molecular era. Nat Rev Clin Oncol. 2014;11(1):12–23.
 3.
Pasqualucci L, Dalla-Favera R. Genetics of diffuse large B-cell lymphoma. Blood. 2018;131(21):2307–19.
 4.
Martelli M, Ferreri AJM, Agostinelli C, et al. Diffuse large B-cell lymphoma. Crit Rev Oncol Hematol. 2013;87(2):146–71.
 5.
Tilly H, Vitolo U, Walewski J, et al. Diffuse large B-cell lymphoma (DLBCL): ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann Oncol. 2012;23(Suppl 7):vii78–82.
 6.
Horn H, Ziepert M, Wartenberg M, et al. Different biological risk factors in young poor-prognosis and elderly patients with diffuse large B-cell lymphoma. Leukemia. 2015;29(7):1564–70.
 7.
Morrison VA, Hamlin P, Soubeyran P, et al. Diffuse large B-cell lymphoma in the elderly: impact of prognosis, comorbidities, geriatric assessment, and supportive care on clinical practice. An International Society of Geriatric Oncology (SIOG) expert position paper. J Geriatr Oncol. 2015;6(2):141–52.
 8.
Jameson JL, Longo DL. Precision medicine — personalized, problematic, and promising. N Engl J Med. 2015;372(23):2229–34.
 9.
Stenberg E, Cao Y, Szabo E, et al. Risk prediction model for severe postoperative complication in bariatric surgery. Obes Surg. 2018;28(7):1869–75.
 10.
Degnim AC, Winham SJ, Frank RD, et al. Model for predicting breast cancer risk in women with atypical hyperplasia. J Clin Oncol. 2018;36(18):1840–6.
 11.
Zhou ZH. Ensemble learning. In: Machine learning. Beijing: Tsinghua University Press; 2016. p. 171–96.
 12.
Ren Y, Zhang L, Suganthan PN. Ensemble classification and regression-recent developments, applications and future directions. IEEE Comput Intell Mag. 2016;11(1):41–53.
 13.
Dietterich TG. Ensemble methods in machine learning. In: Multiple classifier systems. MCS 2000. Lecture Notes in Computer Science, vol 1857. Berlin, Heidelberg: Springer; 2000. https://doi.org/10.1007/3-540-45014-9_1.
 14.
Kohavi R, Wolpert DH. Bias plus variance decomposition for zero-one loss functions. In: ICML'96: Proceedings of the thirteenth international conference on machine learning; 1996. p. 275–83.
 15.
Wang Y, Pan Z, Zheng J, et al. A hybrid ensemble method for pulsar candidate classification. Astrophys Space Sci. 2019;364(8):139.
 16.
Qiu X, Zhang L, Ren Y, et al. Ensemble deep learning for regression and time series forecasting. Computational intelligence in ensemble learning, 2015.
 17.
Ortiz A, Munilla J, Górriz JM, et al. Ensembles of deep learning architectures for the early diagnosis of the Alzheimer’s disease. Int J Neural Syst. 2016;26(7):1650025.
 18.
Boström H. Calibrating random forests. In: Seventh International Conference on Machine Learning and Applications (ICMLA 2008); 2008. p. 121–6.
 19.
Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers. 1999;10(3):61–74.
 20.
Boström H. Estimating class probabilities in random forests. In: Sixth international conference on machine learning and applications (ICMLA 2007), Cincinnati, OH. 2007. p. 211–216. https://doi.org/10.1109/ICMLA.2007.64.
 21.
Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 694–9.
 22.
Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. ICML. 2001;1:609–16.
 23.
Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. Bonn: Association for Computing Machinery; 2005. p. 625–32.
 24.
Wang Y, Li L, Dang C. Calibrating classification probabilities with shape-restricted polynomial regression. IEEE Trans Pattern Anal Mach Intell. 2019;41(8):1813–27.
 25.
Stone GW, Maehara A, Lansky AJ, et al. A prospective naturalhistory study of coronary atherosclerosis. N Engl J Med. 2011;364(3):226–35.
 26.
Peduzzi P, Concato J, Feinstein AR, et al. Importance of events per independent variable in proportional hazards regression analysis II. Accuracy and precision of regression estimates. J Clin Epidemiol. 1995;48(12):1503–10.
 27.
He X, Chen Z, Fu T, et al. Ki-67 is a valuable prognostic predictor of lymphoma but its utility varies in lymphoma subtypes: evidence from a systematic meta-analysis. BMC Cancer. 2014;14(1):153.
 28.
Song MK, Chung JS, Lee JJ, et al. High Ki-67 expression in involved bone marrow predicts worse clinical outcome in diffuse large B cell lymphoma patients treated with R-CHOP therapy. Int J Hematol. 2015;101(2):140–7.
 29.
Weigend A. On overfitting and the effective number of hidden units. Proc Connectionist Models Summer School. 1993;4(4):381–91.
 30.
Caruana R, Lawrence S, Giles CL. Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. In: Advances in neural information processing systems. 2000. p. 402–408.
 31.
Lawrence S, Giles CL, Tsoi AC. Lessons in neural network training: overfitting may be harder than expected. In: National conference on artificial intelligence. 1997. p. 540–545.
 32.
Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989;2(5):359–66.
 33.
Zhou ZH. Neural networks. In: Machine learning. Beijing: Tsinghua University Press; 2016. p. 97–120.
 34.
Ayer M, Brunk HD, Ewing GM, et al. An empirical distribution function for sampling with incomplete information. Ann Math Stat. 1955;26(4):641–7.
 35.
Wolpert DH. Stacked generalization. Neural Netw. 1992;5(2):241–59.
 36.
Alba AC, Agoritsas T, Walsh M, et al. Discrimination and calibration of clinical prediction models: users’ guides to the medical literature. JAMA. 2017;318(14):1377–84.
 37.
Hosmer DW, Hosmer T, Le Cessie S, et al. A comparison of goodnessoffit tests for the logistic regression model. Stat Med. 1997;16(9):965–80.
 38.
Naeini MP, Cooper GF, Hauskrecht M. Binary classifier calibration using a Bayesian nonparametric approach. In: Proceedings of the 2015 SIAM International Conference on Data Mining; 2015. p. 208–16.
 39.
Diamond S, Boyd S. CVXPY: a Python-embedded modeling language for convex optimization. J Mach Learn Res. 2016;17(1):2909–13.
 40.
Agrawal A, Verschueren R, Diamond S, et al. A rewriting system for convex optimization problems. J Control Decis. 2018;5(1):42–60.
 41.
Coiffier B, Lepage E, Brière J, et al. CHOP chemotherapy plus rituximab compared with CHOP alone in elderly patients with diffuse large-B-cell lymphoma. N Engl J Med. 2002;346(4):235–42.
 42.
Pfreundschuh M, Trümper L, Osterborg A, et al. CHOP-like chemotherapy plus rituximab versus CHOP-like chemotherapy alone in young patients with good-prognosis diffuse large-B-cell lymphoma: a randomised controlled trial by the MabThera International Trial (MInT) Group. Lancet Oncol. 2006;7:379–91.
 43.
Coiffier B, Thieblemont C, Van Den Neste E, et al. Long-term outcome of patients in the LNH-98.5 trial, the first randomized study comparing rituximab-CHOP to standard CHOP chemotherapy in DLBCL patients: a study by the Groupe d'Etudes des Lymphomes de l'Adulte. Blood. 2010;116(12):2040–5.
 44.
Fu K, Weisenburger DD, Choi WWL, et al. Addition of rituximab to standard chemotherapy improves the survival of both the germinal center B-cell-like and non-germinal center B-cell-like subtypes of diffuse large B-cell lymphoma. J Clin Oncol. 2008;26(28):4587–94.
 45.
Chinese Society of Hematology. Guidelines for the diagnosis and treatment of diffuse large B-cell lymphoma in China (2013 edition). Chin J Hematol. 2013;34(9):816–9.
 46.
Zhang A, Ohshima K, Sato K, et al. Prognostic clinicopathologic factors, including immunologic expression in diffuse large B-cell lymphomas. Pathol Int. 2010;49(12):1043–52.
 47.
Zelenetz A, Gordon L, Abramson J, et al. NCCN clinical practice guidelines in oncology: B-cell lymphomas, Version 3.2019. J Natl Compr Canc Netw. 2019;17(6):650–61.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Natural Science Foundation of China [Grant Numbers 81502897 and 81973154] and the PhD Fund of Shanxi Medical University [Grant Number BS2017029]. The funders, HMY and YHL, provided many valuable suggestions for the design, analysis and writing of the study.
Author information
Affiliations
Contributions
SLF analyzed and interpreted the data and drafted the manuscript. ZQZ and HMY were responsible for preprocessing the data and checking the results. LW, CCZ, XQH, ZY and MX participated in sample collection. QL and YHL provided the methods and reviewed the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
We obtained ethics approval from the Shan Xi Tumor Hospital Ethics Committee (reference number 201835). Because our research did not involve any participant identifiers, such as names or addresses, only oral consent was obtained from each participant after they were informed of the study's purpose and procedure and the confidentiality of the data. This procedure was approved by the ethics committee.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Fan, S., Zhao, Z., Yu, H. et al. Applying probability calibration to ensemble methods to predict 2-year mortality in patients with DLBCL. BMC Med Inform Decis Mak 21, 14 (2021). https://doi.org/10.1186/s12911-020-01354-0
Keywords
 DLBCL
 Risk prediction
 Probability calibration
 Ensemble method
 Discrimination
 Calibration