Prognostic prediction for postoperative NSCLC patients is a typical imbalanced learning problem, especially for short-term prognosis prediction, so directly applying traditional machine learning algorithms may lead to poor performance [13]. In this study, we propose the ELAS to alleviate this problem. Figure 1 illustrates the process of the ELAS, which mainly consists of three parts, i.e., data initialization, active sampling, and model ensemble. We elaborate on each part below.
Data initialization
Let the training set be \(D_{{{\text{train}}}} = \{ x_{1} ,x_{2} , \ldots ,x_{{N_{{{\text{train}}}} }} \}\), where \(x_{i}\) denotes a patient sample and \(N_{{{\text{train}}}}\) is the sample size of the training set. Before active sampling, we first randomly select 20% of the samples from \(D_{{{\text{train}}}}\) as the internal validation set \(D_{{{\text{internalVal}}}}\). Note that \(D_{{{\text{internalVal}}}}\) is designed for selecting the base classifiers in the ELAS model, which differs from the traditional validation set \(D_{{{\text{val}}}}\) used for hyperparameter selection or early stopping. The remaining 80% of the samples in \(D_{{{\text{train}}}}\) form the training data pool \(D_{{{\text{trainPool}}}}\) with sample size \(N_{{{\text{trainPool}}}}\). After obtaining \(D_{{{\text{trainPool}}}}\), we randomly select \(N_{{{\text{seed}}}} /2\) samples without replacement from the majority class and the minority class of \(D_{{{\text{trainPool}}}}\), respectively, to form a balanced initial training data seed \(D_{{{\text{trainSeed}}}}\) for training the first base classifier, where \(N_{{{\text{seed}}}}\) is the sample size of \(D_{{{\text{trainSeed}}}}\). \(D_{{{\text{trainPool}}}}\) is then updated by removing the samples in \(D_{{{\text{trainSeed}}}}\).
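As a concrete illustration, a minimal sketch of this initialization step is given below, assuming NumPy and scikit-learn; the function and variable names (e.g., `initialize_data`, `X_train`, `n_seed`) are ours and not taken from the original implementation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def initialize_data(X_train, y_train, n_seed, random_state=0):
    """Split D_train into an internal validation set, a balanced training
    seed D_trainSeed, and the remaining training pool D_trainPool."""
    # 20% of D_train is randomly held out as the internal validation set
    X_pool, X_ival, y_pool, y_ival = train_test_split(
        X_train, y_train, test_size=0.2, random_state=random_state)

    rng = np.random.default_rng(random_state)
    # Sample n_seed / 2 patients without replacement from each class
    # (0: majority class, 1: minority class in this sketch)
    seed_idx = np.concatenate([
        rng.choice(np.flatnonzero(y_pool == label), n_seed // 2, replace=False)
        for label in (0, 1)])

    pool_idx = np.setdiff1d(np.arange(len(y_pool)), seed_idx)
    return (X_pool[seed_idx], y_pool[seed_idx],   # D_trainSeed
            X_pool[pool_idx], y_pool[pool_idx],   # D_trainPool (seed removed)
            X_ival, y_ival)                       # D_internalVal
```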
Active sampling
Using the balanced \(D_{{{\text{trainSeed}}}}\), we train the first base classifier \(c_{1}\) with any reasonable supervised machine learning algorithm. Once \(c_{1}\) is trained, we employ it to predict the risks of the samples in \(D_{{{\text{trainPool}}}}\) and select the \(N_{{{\text{batch}}}}\) most informative samples from \(D_{{{\text{trainPool}}}}\) using any reasonable query strategy. In this study, we employ the ranked batch-mode sampling (RBMS) described in the literature [18] as the query strategy. Unlike traditional active learning query strategies such as uncertainty sampling, RBMS uses Eq. (1) to assign the final score to each sample in a batch, considering not only the informativeness of each sample but also its similarity to the samples already selected.
$$S_{\text{final}} = \alpha \times \left( 1.0 - S_{\text{similarity}} \right) + \left( 1.0 - \alpha \right) \times S_{\text{uncertainty}} \tag{1}$$
Note that the parameter α weights the impact of the similarity score \(S_{{{\text{similarity}}}}\) and the uncertainty score \(S_{{{\text{uncertainty}}}}\) in the sample's final score \(S_{{{\text{final}}}}\). Computed by Eq. (2), α leads the query strategy to prioritize diversity in the initial iterations, where \(N_{{{\text{trainData}}}}\) is much smaller than \(N_{{{\text{trainPool}}}}\), and, as more samples are queried, to shift the priority to the samples the classifier is most uncertain about. \(N_{{{\text{trainData}}}}\) is equal to \(N_{{{\text{seed}}}}\) at the first active sampling iteration.
$$\alpha = \frac{N_{\text{trainData}}}{N_{\text{trainPool}} + N_{\text{trainData}}} \tag{2}$$
To determine the uncertainty of a sample, RBMS uses the least-confidence uncertainty score. Let \(y_{{x_{i} }}^{j}\) be the probability, predicted by the classifier, that sample \(x_{i}\) belongs to class j; the uncertainty score is then calculated by Eq. (3).
$$S_{\text{uncertainty}} = 1.0 - \max_{j} y_{x_{i}}^{j} \tag{3}$$
Moreover, RBMS employs Eq. (4) to compute the similarity score, where \(x_{i}\) is the current sample and \(D_{{{\text{estimated}}}}\) is the dataset comprising the samples in \(D_{{{\text{trainData}}}}\) and the samples already selected in the current query round. \(\phi\) is the similarity function measuring the distance between \(x_{i}\) and a sample in \(D_{{{\text{estimated}}}}\); we used the Euclidean distance as the similarity function in this study.
$$S_{\text{similarity}} = \max_{x_{j} \in D_{\text{estimated}}} \phi \left( x_{i}, x_{j} \right) \tag{4}$$
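The sketch below combines Eqs. (1)–(4) into a single batch-selection routine, assuming NumPy and SciPy and a fitted probabilistic base classifier; it illustrates the scoring scheme under our own assumptions rather than reproducing the authors' implementation (ranked batch-mode sampling is also available in active learning libraries such as modAL).

```python
import numpy as np
from scipy.spatial.distance import cdist

def ranked_batch_sample(clf, X_pool, X_train_data, n_batch):
    """Select n_batch informative samples from X_pool following Eqs. (1)-(4)."""
    # Eq. (3): least-confidence uncertainty score for every pool sample
    uncertainty = 1.0 - clf.predict_proba(X_pool).max(axis=1)

    # Eq. (2): weight that shifts priority from diversity to uncertainty
    alpha = len(X_train_data) / (len(X_pool) + len(X_train_data))

    selected = []
    estimated = list(X_train_data)        # D_estimated = D_trainData + selected samples
    candidates = list(range(len(X_pool)))

    for _ in range(n_batch):
        # Eq. (4): similarity of each candidate to D_estimated
        # (Euclidean distance as the similarity function, as stated in the text)
        similarity = cdist(X_pool[candidates], np.asarray(estimated)).max(axis=1)

        # Eq. (1): final score trading off diversity and uncertainty
        final = alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty[candidates]

        best = candidates[int(np.argmax(final))]
        selected.append(best)
        estimated.append(X_pool[best])    # queried sample joins D_estimated
        candidates.remove(best)

    return np.array(selected)             # indices of queried samples in X_pool
```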
Based on the RBMS, we can avoid the sub-optimal sample selection caused by traditional active learning query strategies when selecting the \(N_{{{\text{batch}}}}\) informative samples. The queried \(N_{{{\text{batch}}}}\) patient samples are added to \(D_{{{\text{trainSeed}}}}\) to form the new training data \(D_{{{\text{trainData}}}}\) and are removed from \(D_{{{\text{trainPool}}}}\). At this point, the first active sampling iteration is complete, and we have the first classifier \(c_{1}\), the new training data \(D_{{{\text{trainData}}}}\), and the training data pool \(D_{{{\text{trainPool}}}}\). Based on the new \(D_{{{\text{trainData}}}}\) and \(D_{{{\text{trainPool}}}}\), we start the next round of active sampling and repeat until \(D_{{{\text{trainPool}}}}\) no longer contains enough samples to be added to \(D_{{{\text{trainData}}}}\) for base classifier development. During each active sampling iteration, one base classifier is trained and used to query new samples for the next base classifier. All base classifiers trained during this process are stored in the base classifier list L, awaiting the final base classifier selection. In this study, we do not use a stopping criterion to terminate the training process early [19,20,21], because the discrimination ability of the base classifiers does not always improve as queried samples are added when using real clinical data.
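Under the same assumptions, the overall active sampling loop can be sketched as follows, where `make_classifier` stands for any reasonable supervised learning algorithm (e.g., SVM, L2-LR, or CART) and `ranked_batch_sample` is the routine sketched after Eq. (4).

```python
import numpy as np

def active_sampling(make_classifier, X_seed, y_seed, X_pool, y_pool, n_batch):
    """Iteratively train base classifiers while growing D_trainData from D_trainPool.

    Returns the base classifier list L produced by one active sampling run.
    """
    X_data, y_data = X_seed.copy(), y_seed.copy()   # D_trainData starts as D_trainSeed
    classifier_list = []                            # base classifier list L

    while True:
        clf = make_classifier().fit(X_data, y_data)  # train one base classifier
        classifier_list.append(clf)

        if len(X_pool) < n_batch:                    # no early-stopping criterion:
            break                                    # stop only when the pool runs out

        # Query the N_batch most informative samples from D_trainPool
        idx = ranked_batch_sample(clf, X_pool, X_data, n_batch)

        # Move queried samples from D_trainPool into D_trainData
        X_data = np.vstack([X_data, X_pool[idx]])
        y_data = np.concatenate([y_data, y_pool[idx]])
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx, axis=0)

    return classifier_list
```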
Model ensemble
After active sampling, we obtain a base classifier list L with \(\frac{{N_{{{\text{trainPool}}}} - N_{{{\text{seed}}}} }}{{N_{{{\text{batch}}}} }} + 1\) base classifiers, where \(N_{{{\text{trainPool}}}}\) is the sample size of \(D_{{{\text{trainPool}}}}\) before the training data seed is sampled. Among these base classifiers, we select the top K base classifiers with the best prediction performance on the internal validation set \(D_{{{\text{internalVal}}}}\) for the ensemble model.
However, \(D_{{{\text{internalVal}}}}\) only accounts for 20% of \(D_{{{\text{train}}}}\), which may lead the selected base classifiers to overfit this \(D_{{{\text{internalVal}}}}\) and deteriorate the generalization ability of the ensemble model. Thus, we apply a stratified fivefold cross-validation mechanism to generate \(D_{{{\text{internalVal}}}}\). Each fold is regarded as one \(D_{{{\text{internalVal}}}}\) for base classifier evaluation, and the remaining four folds are combined as the \(D_{{{\text{trainPool}}}}\) for base classifier training. With this strategy, every sample in \(D_{{{\text{train}}}}\) is used to evaluate and select base classifiers, and we obtain five base classifier lists, each corresponding to one \(D_{{{\text{trainPool}}}}\), which avoids overfitting to one specific \(D_{{{\text{internalVal}}}}\).
Moreover, we note that a different initial training data seed \(D_{{{\text{trainSeed}}}}\) leads to a different first base classifier, different active sampling results, and therefore different subsequent base classifiers. To obtain more stable and robust prognostic prediction performance, we initialize \(D_{{{\text{trainSeed}}}}\) \(T_{{{\text{seed}}}}\) times with different random seeds and repeat the whole active sampling process separately, obtaining \(T_{{{\text{seed}}}}\) base classifier lists for each \(D_{{{\text{internalVal}}}}\) fold. Thus, with fivefold cross-validation for generating multiple \(D_{{{\text{internalVal}}}}\) sets and \(T_{{{\text{seed}}}}\) \(D_{{{\text{trainSeed}}}}\) initializations, we obtain a total of \(5 \times T_{{{\text{seed}}}}\) base classifier lists. We select the top K base classifiers from each list L based on their performance on the corresponding internal validation set. The ELAS averages the outputs of the \(5 \times T_{{{\text{seed}}}} \times K\) base classifiers as the final ensemble result. The details of the whole training process of the ELAS are given in Algorithm I.
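The ensemble step can be sketched as below, again as an illustration under our assumptions: from each of the \(5 \times T_{{{\text{seed}}}}\) base classifier lists, the top K classifiers on the corresponding internal validation set are kept, and their predicted probabilities are averaged at prediction time (AUROC is used here as the selection metric purely for illustration).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def select_top_k(classifier_list, X_ival, y_ival, k):
    """Keep the K base classifiers with the best internal-validation performance."""
    scores = [roc_auc_score(y_ival, clf.predict_proba(X_ival)[:, 1])
              for clf in classifier_list]
    top_idx = np.argsort(scores)[::-1][:k]          # best K classifiers in this list
    return [classifier_list[i] for i in top_idx]

def elas_predict_proba(selected_classifiers, X):
    """Average the outputs of all 5 * T_seed * K selected base classifiers."""
    probs = [clf.predict_proba(X)[:, 1] for clf in selected_classifiers]
    return np.mean(probs, axis=0)                   # final ensemble risk prediction
```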
Experimental setup
To develop the ELAS model, we selected support vector machine (SVM) [22], logistic regression with L2 regularization (L2-LR) [23], and classification and regression trees (CART) [24] to train the base classifiers. We randomly divided the samples, 80% into the training set and the remaining 20% into the test set. To tune the hyper-parameters, fivefold cross-validation was employed on the training set, and a grid search was applied for the base classifiers over the following hyper-parameter spaces: \(C \in \{ 0.1,1,10\}\) for SVM, \(C \in \{ 1,10,100\}\) for L2-LR, and \(\max \_depth \in \{ {\text{None}},5, 10\}\) and \(\min \_sample\_leaf \in \{ 1,3,5\}\) for CART. To limit the size of the hyper-parameter search space, we fixed the radial basis function kernel for SVM and the Gini impurity criterion for CART, and set \(N_{{{\text{seed}}}} \in \{ 50,100\}\), \(N_{{{\text{batch}}}} = 10\), \(T_{{{\text{seed}}}} = 3\), and \(K = 20\). Note that \(N_{{{\text{seed}}}} /2\) should be no more than the sample size of the minority class, because \(D_{{{\text{trainSeed}}}}\) must remain a balanced dataset. Likewise, K should be no more than \(\frac{{N_{{{\text{trainPool}}}} - N_{{{\text{seed}}}} }}{{N_{{{\text{batch}}}} }} + 1\) so that each list contains enough base classifiers to select the top K from.
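For reference, the base classifier algorithms and the hyper-parameter grids listed above could be expressed with scikit-learn roughly as follows (a sketch, not the authors' code); `X_train` and `y_train` are assumed to be the training set arrays from the earlier sketches, and scikit-learn's `min_samples_leaf` corresponds to the \(\min \_sample\_leaf\) parameter in the text.

```python
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Base classifier algorithms and their hyper-parameter grids (fivefold CV grid search)
search_spaces = {
    "SVM": (SVC(kernel="rbf", probability=True),          # RBF kernel fixed
            {"C": [0.1, 1, 10]}),
    "L2-LR": (LogisticRegression(penalty="l2"),
              {"C": [1, 10, 100]}),
    "CART": (DecisionTreeClassifier(criterion="gini"),    # Gini impurity fixed
             {"max_depth": [None, 5, 10], "min_samples_leaf": [1, 3, 5]}),
}

tuned = {name: GridSearchCV(estimator, grid, cv=5).fit(X_train, y_train)
         for name, (estimator, grid) in search_spaces.items()}
```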
In this study, we conducted extensive experiments to explore the effectiveness of the proposed ELAS approach. First, we compared the ELAS with the base classifier algorithms, i.e., SVM, L2-LR, and CART, to explore whether the ELAS can improve prognostic prediction performance. Then, as the ELAS is an ensemble method, we also selected two well-known ensemble methods, i.e., AdaBoost [25] and Bagging [26, 27], as benchmarks. Moreover, we applied two resampling methods for imbalanced data, namely SMOTE [28] and TomekLinks [29], to explore which strategy is better. To evaluate the performance of the ELAS and the benchmarks, we employed the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC) as metrics. To eliminate the bias caused by the test set partition, the whole process of data set segmentation, model development, and evaluation was repeated 10 times with different random seeds, so that we obtained the averaged AUROC and AUPRC values with their standard deviations (SD) for each prognostic task. A paired Student's t-test was performed to determine whether the AUROC and AUPRC values of the ELAS differed statistically significantly from those of the benchmark algorithms, and a p value less than 0.05 was considered significant.
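A minimal sketch of the evaluation procedure is given below, assuming scikit-learn and SciPy; `average_precision_score` is used here as the AUPRC estimate, and the array names for the per-repetition metric values are hypothetical.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(y_true, y_prob):
    """AUROC and AUPRC for one test-set split."""
    return roc_auc_score(y_true, y_prob), average_precision_score(y_true, y_prob)

def compare(elas_metric, benchmark_metric):
    """Compare per-repetition metric values of ELAS against one benchmark.

    elas_metric, benchmark_metric: arrays of the 10 repetition-level values
    (hypothetical names). Returns mean, SD, and significance at p < 0.05.
    """
    mean, sd = np.mean(elas_metric), np.std(elas_metric)
    t_stat, p_value = ttest_rel(elas_metric, benchmark_metric)  # paired t-test
    return mean, sd, p_value < 0.05
```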