Applying probability calibration to ensemble methods to predict 2-year mortality in patients with DLBCL

Background Under the influences of chemotherapy regimens, clinical staging, immunologic expressions and other factors, the survival rates of patients with diffuse large B-cell lymphoma (DLBCL) are different. The accurate prediction of mortality hazards is key to precision medicine, which can help clinicians make optimal therapeutic decisions to extend the survival times of individual patients with DLBCL. Thus, we have developed a predictive model to predict the mortality hazard of DLBCL patients within 2 years of treatment. Methods We evaluated 406 patients with DLBCL and collected 17 variables from each patient. The predictive variables were selected by the Cox model, the logistic model and the random forest algorithm. Five classifiers were chosen as the base models for ensemble learning: the naïve Bayes, logistic regression, random forest, support vector machine and feedforward neural network models. We first calibrated the biased outputs from the five base models by using probability calibration methods (including shape-restricted polynomial regression, Platt scaling and isotonic regression). Then, we aggregated the outputs from the various base models to predict the 2-year mortality of DLBCL patients by using three strategies (stacking, simple averaging and weighted averaging). Finally, we assessed model performance over 300 hold-out tests. Results Gender, stage, IPI, KPS and rituximab were significant factors for predicting the deaths of DLBCL patients within 2 years of treatment. The stacking model that first calibrated the base model by shape-restricted polynomial regression performed best (AUC = 0.820, ECE = 8.983, MCE = 21.265) in all methods. In contrast, the performance of the stacking model without undergoing probability calibration is inferior (AUC = 0.806, ECE = 9.866, MCE = 24.850). In the simple averaging model and weighted averaging model, the prediction error of the ensemble model also decreased with probability calibration. Conclusions Among all the methods compared, the proposed model has the lowest prediction error when predicting the 2-year mortality of DLBCL patients. These promising results may indicate that our modeling strategy of applying probability calibration to ensemble learning is successful.

currently remains a significant clinical challenge [2,3]. Although the application of rituximab has improved the overall survival rate, 30-50% of DLBCL patients remain sensitive to chemotherapy or relapse after remission and eventually die [4,5]. In addition, under the influence of a chemotherapy regimen, a clinical staging, an immunologic expression or other factors, the prognoses of patients are markedly different [2,6,7]. Thus, according to clinicopathologic factors, we aim to predict the mortality hazard for patients with DLBCL at the individual level. Actually, clinicians need precise risk estimates for individuals to help them make optimal therapeutic decisions to achieve precision medicine [8]. Through early risk assessment, appropriate therapies may be initiated quickly and ultimately improve the clinical outcomes of individual cases [9,10].
Instead of using a single model, we consider using ensemble learning to predict the 2-year mortality of patients with DLBCL. Ensemble learning can often achieve better performance than a single model by constructing and combining multiple models [11,12]. These multiple models are also called individual models or base models. Dietterich [13] explained why the ensemble method works from statistical, computational and representational points of view. In addition, Kohavi [14] explained the reason for the success of the ensemble method from bias-variance decomposition. Due to its superior performance, ensemble learning has been applied in many fields. Wang Yuanchao [15] proposed a heterogeneous ensemble model composed of the random forest and XGBoost algorithms to classify pulsar candidates which achieved higher recall rate than other two algorithms. In [16], an ensemble of deep belief networks with different parameters was introduced into regression and time series forecasting. Several datasets were used for evaluation and eventually the ensemble-based model achieved better performance compared with other methods. Ensemble learning is also used to assist physicians in decision-making such as [17] which developed an ensemble model of deep belief networks for the early diagnosis of the Alzheimer's Disease.
To achieve high performance, the base model in ensemble learning should have good accuracy and diversity [11,12]. Diversity is key to ensemble learning, and there are three main ways to achieve diversity: data diversity, parameter diversity and structural diversity methods [12]. The data diversity method uses multiple datasets generated from the original dataset to train different base models. Since the multiple datasets are different from each other, the trained base models generate diverse outputs. The parameter diversity method trains different base models using different parameter sets. Even with the same algorithm or the same training data, the outputs from base models may vary with different parameters. The structural diversity method uses different algorithms to generate different base models and such ensemble models are also called heterogeneous ensembles. In our work, we consider five different algorithms for heterogeneous ensemble prediction, including the naïve Bayes (NB), logistic regression (logit), random forest (RF), support vector machine (SVM) and feedforward neural network (FNN) algorithms. However, some evidence [18][19][20][21][22][23] suggests that a few classifiers, such as the NB, RF and SVM classifiers, cannot generate precise probability estimates, although they achieve good classification performance in many real-world problems. Therefore, a different model structure is proposed to achieve both high accuracy and high diversity for the base models. That is, before aggregating the outputs from various base models, we first correct the poorly calibrated base models by using probability calibration methods. Probability calibration methods try to find a calibration function that maps the initial outputs of classifiers into more accurate posterior probabilities [18]. Many methods, such as the popular Platt scaling (Platt) and isotonic regression (IsoReg) approaches, have been proposed in this field. However, they all have their own preconditions. For example, Platt is effective when the output of the classifier is sigmoid-shaped, and IsoReg is prone to overfitting when the sample size is small. Thus, we calibrate the base model with shape-restricted polynomial regression (RPR), which is a flexible and powerful calibration method that is not constrained by a specific classifier [24]. By calibrating the base models first and then aggregating them, we hope to achieve strong ensemble performance.
Our research has three characteristics. First, this paper focuses not on the prediction of categories, but on membership probability. Second, a different model structure is proposed that applies probability calibration to ensemble learning. Third, both discrimination and calibration are considered in the model comparison.

Data sources and predictive variables
The data used in this study were derived from Shanxi Cancer Hospital, China. We assessed 406 patients diagnosed with DLBCL between April 2010 and May 2017, 116 of whom died within 2 years after treatment. We collected 17 variables for each patient from electronic medical records. See Table 1 for the names and groupings of each variable.
We applied three methods to select the predictive variables, including the Cox model, the logistic model and the random forest algorithm. The Cox model uses outcomes and survival times as dependent variables to analyze the influencing factors of events. The results produced by the Cox model may not be reliable when the number of positive events is less than 10 times the number of covariates [25,26]. Therefore, we first analyzed each variable by using the univariate Cox model. Then, the variables (including age, gender, stage, IPI, KPS, LDH, β 2 -MG, rituximab, and Bcl-6) that showed a univariate relationship (P < 0.1) with death were included in the multivariate Cox model. In addition to these 9 variables, we also added Ki-67 to the model. Ki-67 is a nuclear protein that can be used as a biomarker for cell proliferation, and its expression is widely used to evaluate the prognoses of many kinds of cancers, including lymphoma. A meta-analysis suggested that Ki-67 expression is associated with the prognoses of lymphoma patients. Subgroup analysis showed that high Ki-67 expression is highly associated with a poor overall survival rate for DLBCL [27]. Another study also suggested that patients with high Ki-67 expression have poor prognoses with standard first-line therapy [28]. Therefore, Ki-67 was also included in the multivariate Cox model in our study, although it was not significant in the univariate Cox analysis. Thus, these 10 variables were entered into the multivariate Cox model, and 8 variables remained (P < 0.1). The results of the multivariate Cox analysis are displayed in Table 2.
We also analyzed all the collected variables using the logit and RF approaches. The variables selected (P < 0.1) by the logit model were consistent with those selected  by the Cox model (see Table 2). If the threshold was 0.05, the Cox model selected 5 variables (gender, stage, IPI, KPS, and rituximab), while logit selected 4 variables (stage, IPI, KPS, and rituximab). The only difference is that the Cox model contains "gender" and the logit model does not. Figure 1 shows the ranking of variable importance obtained by the RF algorithm. The ranking showed that stage, IPI, KPS, and rituximab were the four most important variables regardless of which importance metric was used. Moreover, we found that the ranking based on the Gini index ( Fig. 1b) also included "gender" in the top eight variables.
According to the results of these three methods, we first used 0.05 as the threshold and took the union of the variables selected by the Cox and logit models as predictive variables (including gender, stage, IPI, KPS, and rituximab). Based on these 5 variables, we pretrained all the base models with 100 iterations. Then, we further added the remaining 3 variables (age, LDH, and β 2 -MG) to each base model. Although each base model at this point contained all 8 variables selected by the Cox and logit models when the threshold was 0.1, the predictive performance was not significantly improved. Thus, we excluded these 3 variables and only used gender, stage, IPI, KPS, and rituximab for prediction.

Model structure
As Fig. 2 shows, there are three main components of the model.
The first part is the construction of the base models. We used five common classifiers with good classification performances on many real-world problems as the base models, including the NB, Logit, RF, SVM, and FNN models. NB, which calculates the posterior probability that an observation belongs to each class based on Bayes' theorem, classifies an observation as a member of the class for which the posterior probability is the largest. Logit is a generalized linear model used to solve classification problems. Since the model takes the logistic function as the link function, the posterior probability can be generated. The RF algorithm builds a number of decision trees based on bootstrapped training sets for prediction. In our RF, we used the voting ratio of all decision trees as the initial probability estimate. SVMs attempt to find an optimal separating hyperplane to classify observations into different classes. The sign of the output determines the class of the sample, and the magnitude of the output can be used as a measure of predictive confidence, since examples far from the hyperplane are more likely to be classified correctly than examples close to the hyperplane. The FNN, in which the neurons in each layer are fully connected to the neurons in the next layer and there is no loop in the network, is a common neural network structure. In our study, we built a 3-layer network, and the hidden layer included 500 units. We designed our network with a large number of hidden nodes because several studies suggest that a neural network with excess capacity generalizes better than a simple network when trained with back propagation and early stopping [29][30][31]. Other studies have also shown that a feedforward neural network can fit any continuous function of arbitrary complexity, and it only needs a single hidden layer containing enough units [32,33].
The second part is the probability calibration of the base models. We selected RPR to calibrate the base models since it is a flexible and powerful method that is not constrained by a specific classifier [24]. We also applied two other methods (Platt, IsoReg) to investigate whether RPR was the best.
RPR calibrates initial probability using polynomial regression in which calibration function f has the following form [24]: where s is the initial probability of an observation. The solution is the following optimization problem: With constraint (a) all corrected probabilities are guaranteed to fall in the interval between 0 and 1. Constraint (b), which derives from the differentiability of f(s), ensures the monotonicity of the polynomial. In constraint (c), a l 1 -norm of a is used to prevent the polynomial from overfitting.
Platt is a parametric method which uses the sigmoid function as a calibration function [19]: where A and B are estimated using the maximum likelihood estimation on the training set (s i , y i ).
IsoReg is a nonparametric method which attempts to find an isotonic (i.e. non-decreasing) function that satisfies the following objective [21]: where y i = [ y 1 , y 2 , y 3 , …, y N ] is the label sequence of all samples that has been sorted by their initial probabilities. A well-known method called pair adjacent violators (PAV) algorithm is used to estimate the isotonic function [34].
The third part is the combination of the base models. We used three methods (simple averaging, weighted averaging, and stacking) to combine the above 5 base models. Stacking or stacked generalization, which takes the outputs of the base models as its inputs, uses another machine learning algorithm (also called a meta-learner) to estimate the weight of each base model [35]. In our research, logit was used as a meta-learner. Since the outputs of all base models are numeric in our study, we also used simple averaging and weighted averaging to combine these base models and compared the results with those obtained by stacking. In weighted averaging, the weight of each base model was set to the reciprocal of the expected calibration error (ECE) and maximum calibration error (MCE). We first combined 5 uncalibrated base models (NB, logit, RF, SVM, FNN) by using these three methods. Then, according to our modeling strategy, we used the same methods to combine the calibrated base models (NB-RPR, logit, RF-RPR, SVM-RPR, FNN). We did not calibrate logit and FNN models because they could already generate accurate probability estimates, and their performances after calibration were not significantly improved.

Model construction and evaluation
We combined the hold-out test with 3-fold cross-validation to complete the model construction and evaluation processes. To reduce statistical variability, all experiments were repeated 300 times. In each dataset, we randomly sampled 80% of the data (320 samples) as the training set and the residual 20% of the data (86 samples) as the testing set. Stratified sampling was used for each division to ensure the consistency of the data distribution across 300 experiments. In 3-fold cross-validation, the data were divided into 3 mutually exclusive subsets that were approximately equal in size. In each iteration, 2 of the 3 subsets were used as the training set, and the remaining subset was used as the validation set. The final evaluation depended on the average performance of the model on the 3 validation sets. Amid the construction of the 5 base models, we first performed 3-fold cross-validation within the training set to find the optimal hyperparameters of the RF, SVM, and FNN models. Then, we used all training data to train these three models with the optimal hyperparameters, and assessed them within the testing set. For the RF algorithm, the number of candidate variables at each split was set to 2 and 3, and the choices for the number of trees were {500, 600, …, 1500}. For the SVM, the kernel function was selected from linear and Gaussian. The choices for the parameters C and Gamma were 10 i 4 i=−4 . In addition, we used all training data to fit the NB and logit models and assessed them within the testing set. Although they do not include hyperparameters, we also performed 3-fold cross-validation for them within the training set. For each base model, we finally extracted all the predictive values in the 3 validation sets as the training set of the calibration function.
Amid the probability calibration of the 5 base models, we first used 3-fold cross-validation within the calibration training set to determine the k and λ parameters of RPR. Then, we used the entire calibration training set to train the Platt, IsoReg, and RPR models with the optimal k and λ parameters. Finally, we calibrated the predictive values of the 5 base models in the testing set by using the trained Platt, IsoReg, and RPR models and then evaluated them. For the RPR, the choices for the degree k were {4, 5, 6, …, 20} and the choices for λ were 4 i 5 Amid the combination of the 5 base models, the weight of each base model was set to 0.2 for simple averaging. For weighted averaging, the weight of each base model was set to the reciprocal of its ECE and MCE in the training data. For stacking, the outputs from the 5 base models were used as the inputs of the meta-learner (the logit model, in our study). To avoid overfitting, we used the union of the predictions of the base models on validation sets to train the meta-learner. Finally, we used the weights learned by stacking to combine the 5 base models.
Both discrimination and calibration were considered in the model evaluation. Discrimination and calibration are indispensable properties for assessing the accuracy of risk prediction models [36]. Discrimination is the ability to distinguish between patients who will have an event and those who will not. Calibration measures the consistency between the predicted probability and the true (observed) probability of patients at different risk strata. Although providing accurate probability estimates is our purpose, there is no need to further assess the calibration when the model has poor discrimination [36]. Therefore, we used the AUC to measure discrimination. The H-L test, ECE and MCE were used to evaluate the calibration of the models.
The H-L test is used to evaluate whether the difference between the predicted probability and the true probability is caused by chance [37]. The ECE and MCE are two measures related to the reliability diagram [38]. In computing these, the predictions are first sorted and then partitioned into k bins of equal size. For each bin, the predicted probability is the mean of the predictions in this bin, and the true (observed) probability is the proportion of positive observations in this bin. The ECE and MCE calculate the average prediction error and maximum prediction error over these bins, respectively: where p i and o i are the predicted probability and the observed probability in the i-th bin, respectively. The lower the values of the ECE and MCE are, the lower the calibration error of the prediction.

Results
To reduce the variability caused by the data partition, the hold-out test was executed 300 times. The final evaluation was based on the average results of these 300 iterations.

Performances of the 5 base models before and after calibration
We calibrated the NB, Logit, RF, SVM, and FNN models using RPR and compared the results with those of the Platt and IsoReg models. The performances of the 5 base models before and after calibration are shown in Table 3. The main features are summarized as follows: The AUCs of the 5 uncalibrated base models were all greater than 0.75, so they were considered as having good discrimination. Except for that of the SVM, the AUCs of the other 4 models were all greater than 0.8. In terms of calibration performance, the FNN had the lowest calibration error (ECE = 9.211, MCE = 23.500). The logit could also generate accurate probability estimates (ECE = 9.517, MCE = 24.400). According to the H-L test, these two models achieved good calibration (P > 0.05) 228 and 257 times out of 300 experiments. See Fig. 3a. By comparison, the NB, RF and SVM models had large prediction errors and they achieved good calibration only 7, 1, and 107 times out of 300 experiments, respectively.
After probability calibration was completed, the errors of the NB, RF and SVM models decreased significantly, and their frequencies of generating accurate probability estimates increased to 244, 219, and 196, respectively. However, regardless of which calibration method was used, the calibration error of the logit and FNN models were not further decreased. Out of the 3 evaluated probability calibration methods, RPR achieved the best correction for the SVM (ECE = 10.893, MCE = 26.300). IsoReg only achieved a good correction effect for the RF, while NB-IsoReg and SVM-IsoReg still had large calibration errors. According to Fig. 3a, NB-RPR, RF-RPR and SVM-RPR had higher frequencies of achieving good calibration in 300 experiments than the Platt and IsoReg models.

Using 3 methods to combine the base models
First, we combined the NB, Logit, RF, SVM, and FNN models by simple averaging, weighted averaging using the ECE, weighted averaging using the MCE, and stacking. The results are presented in Table 4. Of all the models, the FNN had the best performance in terms of both discrimination and calibration (AUC = 0.813, ECE = 9.211, MCE = 23.500). Compared with the NB, RF and SVM models, the calibration errors of the 4 ensemble models decreased significantly.
Then, we combined the NB-RPR, Logit, RF-RPR, SVM-RPR, and FNN models in the same ways. The results are presented in Table 5. Stacking-EN-C performed the best out of all the models (AUC = 0.820, ECE = 8.983, MCE = 21.265). The MCEs of the 4 ensemble models were all lower than those of any of the base models. Except for SA-EN-C, the remaining 3 ensemble models had lower ECEs than that of any base model.
Last, we compared the 4 ensemble models (SA-EN-C, ECE-EN-C, MCE-EN-C, Stacking-EN-C) that underwent probability calibration and the 4 ensemble models (SA-EN, ECE-EN, MCE-EN, Stacking-EN) that did not. Regardless of which combination of methods was used, the 4 ensemble models that underwent probability calibration had lower calibration errors in terms of both the ECE and MCE than the ensemble models that did not undergo probability calibration. According to the H-L test, our 4 ensemble models had the highest frequencies of achieving good calibration (P > 0.05), which were 270, 274, 272 and 272 out of 300 experiments. By comparison,

Table 3 Performance of the 5 base models before and after calibration
In each cell M (P 25 -P 75 ): M is the median, P 25 is the 25th percentile and P 75 is the 75th percentile of 300 tests. For each individual model, the best performance in each column is in bold and the secondary best performance in each column is underlined NB Naïve Bayes, Logit Logistic regression, RF Random forest, SVM Support vector machine, FNN Feedforward neural network, Platt Platt scaling, IsoReg Isotonic regression, RPR Shape-restricted polynomial regression. "-Platt", "-IsoReg" and "-RPR" represent performing probability calibration using corresponding method the frequencies were 204, 219, 221 and 238 for the 4 ensemble models that did not undergo probability calibration. See Fig. 3b for the results.

Discussions
We predicted 2-year mortality in patients with DLBCL at the individual level. The proposed method that applied probability calibration to ensemble learning performed satisfactorily in terms of both discrimination and calibration. We used 5 variables (gender, stage, IPI, KPS, and rituximab) to predict the 2-year mortality of patients with DLBCL. Most of the 5 variables have been proven to affect the prognoses of patients with DLBCL. The development of rituximab is a breakthrough in the treatment of DLBCL, and current research indicates that rituximab can improve the clinical outcomes for almost all DLBCL subgroups [41][42][43][44]. IPI is a recognized prognostic indicator of DLBCL, and high IPI values are associated with poor prognoses of patients [45,46]. An advanced disease stage has been reported to be a poor prognostic factor of DLBCL, and it is associated with a low overall survival rate and a short disease-free survival time [46]. Although there is no research about the relationship between KPS Fig. 3 The frequency of achieving good calibration performance in 300 experiments and the prognoses of patients with DLBCL, the statuses of patients [47] affect treatment options and, perhaps, patient outcomes.
Although the discrimination abilities of the 5 base models are very similar, the differences in calibration are obvious. The logit and FNN models can accurately predict the 2-year mortality of DLBCL patients. However, the errors of the initial probabilities of the NB, RF and SVM models are large. These results are consistent with those of several reports [18][19][20][21][22][23]. The logit and FNN models return well-calibrated predictions because they directly optimize the log-loss of probability [23]. For the NB classifier, the output is often pushed to 0 or 1 because the required conditional independence assumption may not be valid in reality [21][22][23]. Inversely, the output of the RF algorithm is often pushed away form 0 and 1 because it is difficult to obtain identical predictions from all decision trees [18,20,23]. Furthermore, the SVM also pushes the prediction away from 0 and 1 and induces a sigmoidshaped distortion. Although the magnitude of the output can be taken as a measure of confidence in the prediction, these values are often poorly calibrated [19,23].
The Platt model is effective when the distortion of the prediction is sigmoid-shaped. In our study, the biased probabilities obtained by the NB, RF and SVM models achieved good correction effects by using the Platt model. IsoReg is a universal calibration method in which the only restriction is that the mapping function is isotonic (i.e. non-decreasing). However, the calibration errors of the NB and SVM models are still large after calibration  by IsoReg. This may be caused by an overfitting effect resulting from scarce data. A report by Niculescu-Mizil suggested that IsoReg may not be suitable for calibration with a small sample size, especially when the sample size is less than 1000 [23]. By comparison, RPR is more flexible and powerful. Unlike the Platt model, RPR can calibrate any classifier since it has no constraints on the distribution of the initial output. Compared with IsoReg, RPR uses polynomial regression as a calibration function so that it is continuous over the entire interval. In addition, RPR strictly satisfies the monotonicity requirement and can be solved conveniently by optimization tools such as CVXPY [39,40]. RPR can theoretically fit any calibration function as the polynomial degree increases. Therefore, RPR was used to correct poorly calibrated classifiers in our study. According to the H-L test, NB-RPR, RF-RPR and SVM-RPR had the highest frequencies of achieving good calibration in 300 experiments compared with the results of the Platt and IsoReg models. Although the MCE of NB-Platt was lower than that of NB-RPR and the ECE of RF-IsoReg was lower than that of RF-RPR, we believe that RPR achieved better calibration for the NB, RF and SVM models than the other calibration methods if we also consider the results of the H-L test.
Ensemble methods can obtain better performance than single models by combining multiple models, especially when the individual models are weak. Many theories in this field were initially proposed for weak models, such as the family of boosting models. Although aggregating weak models is sufficient for obtaining good performance in theory, models with good accuracy are often used as the base model in practice, especially when few base models are available. In our work, we used probability calibration to improve the accuracy of the base models' probabilistic predictions to obtain strong ensemble performance. The results show that the calibration performance of the ensemble model was improved by calibrating the base models first. In terms of both the ECE and MCE, the calibration errors of our 4 ensemble models (SA-EN-C, ECE-EN-C, MCE-EN-C, Stacking-EN-C) were lower than those of the 4 ensemble models (SA-EN, ECE-EN, MCE-EN, Stacking-EN) that did not undergo probability calibration. According to the H-L test, our 4 ensemble models had the highest frequencies of achieving good calibration performance in 300 experiments, which were 270, 274, 272 and 272. However, they were 204, 219, 221 and 238 for the 4 ensemble models that did not undergo probability calibration. For all models, the stacking method that first calibrated the base model by RPR performed best in terms of both discrimination and calibration (AUC = 0.820, ECE = 8.983, MCE = 21.265). By contrast, the NB classifier generates probabilistic predictions with the highest error (ECE = 14.206, MCE = 38.900). Overall, our model, which applies probability calibration to ensemble learning, achieves the desired results.
Our research has limitations. First, the bootstrapping method may be more appropriate for the model evaluation of our study than the hold-out method. Since the sample size of this research is small, the model trained on the bootstrapped dataset, which has the same sample size with the training data, may be has lower estimation error. Second, it may be possible to improve the maximum calibration error in our study. The probability after calibration may not change significantly, as the calibration function has to ensure global monotonicity over the entire range. This means that the calibration error is largely influenced by misclassified samples. In the future, we will collect more feature information about DLBCL patients to improve the discrimination ability of the model, which may further increase the accuracy of its probabilistic predictions. Third, we only selected five common classifiers as the base models in our work. The impact of increasing or decreasing the number of base models or performing selective ensemble learning on the final prediction can be further investigated. Fourth, this research was based on the data provided by a certain hospital. Therefore, an external validation is required to evaluate the generalizability of the model.

Conclusions
In conclusion, we predicted the 2-year mortality of patients with DLBCL at the individual level. The proposed method performs satisfactorily and has the potential to help clinicians improve the outcomes of individuals by providing accurate risk predictions. Furthermore, these promising results may indicate that our modeling strategy of applying probability calibration to ensemble learning is successful.