Improving random forest predictions in small datasets from two-phase sampling designs

Background: While random forests are one of the most successful machine learning methods, their performance must be optimized for use with datasets resulting from a two-phase sampling design with a small number of cases, a common situation in biomedical studies, which often have rare outcomes and covariates whose measurement is resource-intensive.

Methods: Using an immunologic marker dataset from a phase III HIV vaccine efficacy trial, we seek to optimize random forest prediction performance using combinations of variable screening, class balancing, weighting, and hyperparameter tuning.

Results: Our experiments show that while class balancing improves random forest prediction performance when variable screening is not applied, it degrades performance in the presence of variable screening. The impact of weighting similarly depends on whether variable screening is applied. Hyperparameter tuning is ineffective in situations with small sample sizes. We further show that random forests under-perform generalized linear models for some subsets of markers; that prediction performance on this dataset can be improved by stacking random forests and generalized linear models trained on different subsets of predictors; and that the extent of improvement depends critically on the dissimilarities between candidate learner predictions.

Conclusion: In small datasets from two-phase sampling designs, variable screening and inverse sampling probability weighting are important for achieving good prediction performance of random forests. In addition, stacking random forests and simple linear models can offer improvements over random forests alone.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12911-021-01688-3.


B Additional information for hyperparameters in RF
Standard RF is fitted with the ranger function from the ranger R package (Wright and Ziegler, 2017). The key parameters of the function and their default values are listed below:
• mtry: the number of variables randomly sampled as split candidates at each node. The default is the square root of the number of variables, rounded down to the nearest integer.
• min.node.size: the minimum size of terminal nodes. The default is 1 for classification problems, in which case trees are grown to maximal depth. Setting larger values causes smaller trees to be grown, akin to pruning.
• sample.fraction: fraction of observations that are drawn for each tree during bootstrapping. The default is 1, indicating no subsampling.
• case.weights: weights for bootstrap sampling. The default is NULL, in which case the bootstrap is not weighted.
• num.trees: the number of trees. The default is 500.
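For readers more familiar with Python, the defaults above can be mirrored approximately with scikit-learn's RandomForestClassifier, whose max_features, min_samples_leaf, max_samples, and n_estimators roughly correspond to mtry, min.node.size, sample.fraction, and num.trees. This is an illustrative sketch with simulated data, not the study's code:

```python
from math import floor, sqrt

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for an immunologic marker dataset.
X, y = make_classification(n_samples=200, n_features=25, random_state=0)

p = X.shape[1]
rf = RandomForestClassifier(
    n_estimators=500,             # num.trees = 500
    max_features=floor(sqrt(p)),  # mtry = floor(sqrt(p))
    min_samples_leaf=1,           # min.node.size = 1: trees grown to maximal depth
    max_samples=None,             # analogous to sample.fraction = 1
    random_state=0,
)
# Note: case.weights has no exact constructor analogue in scikit-learn;
# sample_weight in fit() is the closest substitute.
rf.fit(X, y)
print(rf.n_estimators, rf.max_features)
```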
The tuning algorithm is performed with the tuneRanger function from the tuneRanger R package (Probst et al., 2019). The major parameters of the function and their default values are listed below:
• measure: the performance measure for which hyperparameters are optimized. The default is the Brier score for classification, but AUC is used in this study.
• iters.warmup: the minimum number of search iterations used to find optimal hyperparameters. The default is 30, but 50 is used in this study.
• iters: the maximum number of search iterations. The default is 70, but 100 is used in this study.
• tune.parameters: the hyperparameters to tune. The default is mtry, sample.fraction, and min.node.size, each tuned over a prespecified range.
We do not consider the number of trees (num.trees) in the hyperparameter tuning because more trees generally produce more stable estimates. The default of 500 trees may suffice to obtain a stable estimate in the HVTN 505 dataset with its small sample sizes. In the literature, empirical results across many datasets showed that constructing the first 100 trees yields the biggest performance gain (Oshiro et al., 2012; Probst and Boulesteix, 2017).
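A rough sketch of AUC-guided tuning over the three hyperparameters above, in Python rather than the authors' R/tuneRanger code: RandomizedSearchCV stands in for tuneRanger's model-based optimization, and the data, tree counts, and iteration counts are illustrative only (kept small for speed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the study dataset.
X, y = make_classification(n_samples=150, n_features=20, random_state=1)

param_distributions = {
    "max_features": range(1, 21),            # analogue of mtry
    "min_samples_leaf": range(1, 11),        # analogue of min.node.size
    "max_samples": [0.5, 0.632, 0.8, None],  # analogue of sample.fraction
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=1),
    param_distributions,
    n_iter=5,            # tuneRanger's iters.warmup/iters play a similar role
    scoring="roc_auc",   # optimize AUC rather than the Brier score
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_)
```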

C Comparison of random forest models with and without variable screening
We compared three sets of prediction performance between RF without screening and RF with screening using all immunologic markers: (1) AUC on the bootstrapped data, i.e., the in-bag data used to construct trees in RF; (2) AUC on the training data (the training data includes observations sampled into the bootstrapped data and observations left out of it); and (3) CV-AUC, which is the AUC on the validation data. For (3), the results are the same as those in Table 1 in the manuscript. For (1) and (2), we averaged the AUCs from 100 replicates of 5-fold cross-validation, just as for (3).

First, the results show that without screening the AUC on the bootstrapped data is always 1. This is because the RF algorithm constructs individual trees of maximal depth without pruning (Breiman, 2001). When there are many features to choose from, all the terminal nodes are pure (all cases or all controls; the right panel in Figure C.2), and the resulting RF model has perfect prediction performance on the bootstrapped data. With screening, the number of features is limited, and the RF algorithm sometimes (in 3.4% of all trees) fails to produce all pure terminal nodes (the left panel in Figure C.2), so the AUC is close to, but not exactly, 1. This provides the first indication that RF without screening has a bigger overfitting problem than RF with screening.

Second, RF with screening has a larger AUC on the training data than RF without screening. The training data differ from the bootstrapped data in that they contain the out-of-bag observations that are not sampled into the bootstrapped data. This suggests that the RF model with screening generalizes better than the RF model without screening.

Third, RF with screening has a far larger AUC on the validation data than RF without screening. These results are similar to the second set of results, but the overfitting of RF without screening is more pronounced here because none of the validation data are used in model training.
The overfitting is fundamentally caused by the fact that the RF algorithm can achieve near-perfect separation using any predictors. Without screening, there are many noise predictors, and we expect RF to use many of them in tree construction, which leads to poor generalization performance. Figure C.3 shows the number of times each predictor is used for splitting in RF in the experiments described above. RF with screening uses only the eight predictors that pass screening. In contrast, RF without screening uses almost all the predictors, as we expect (the first eight predictors, shown in blue, are the same ones used by RF with screening). Altogether, these results suggest that screening helps reduce overfitting by using only informative predictors in tree construction.
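The mechanism behind the in-bag AUC of 1 can be demonstrated in a few lines. This is an illustrative Python sketch with simulated data, unrelated to the HVTN 505 dataset: a fully grown, unpruned tree perfectly separates the data it was fit on even when every predictor is pure noise.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))    # 50 noise predictors, no signal
y = rng.integers(0, 2, size=100)  # labels independent of X

tree = DecisionTreeClassifier(random_state=0)  # grown to maximal depth, no pruning
tree.fit(X, y)

# Every leaf is pure, so the in-sample predictions reproduce y exactly.
in_sample_auc = roc_auc_score(y, tree.predict_proba(X)[:, 1])
print(in_sample_auc)  # 1.0
```

Such a tree has memorized noise, so its out-of-sample performance is no better than chance; limiting the feature pool via screening makes this memorization harder.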

D Two-phase studies
Two-phase studies are useful when we have a large study (phase 1) in which it is only practical to measure some predictor variables of interest for a subset of study participants (phase 2). More formally, in phase 1 we draw a cohort sample that is representative of the target population relevant to the clinical application at hand. Every participant's disease status and easy-to-measure covariates are obtained. In phase 2, a subsample is drawn without replacement from the phase 1 sample. The probability of a phase 1 individual being drawn into phase 2 can depend on certain stratification variables, including disease status.
There are different types of two-phase study designs, including case-control (Breslow, 1996) and case-cohort (Prentice, 1986; Borgan et al., 2000) studies. So long as each phase 1 participant's sampling probability is known, the bias in sampling can be corrected through inverse probability weighting or other techniques. We illustrate two-phase studies with the example used in Breslow et al. (2009). The phase 1 sample is a cohort of 12,345 participants of the Atherosclerosis Risk in Communities (ARIC) study who were free from coronary heart disease, had plasma samples taken at their second follow-up visit, and had no missing values of phase 1 covariates. The biomarkers Lp-PLA2 and C-reactive protein could not be measured for every participant in the study. A phase 2 sample was drawn stratified by race, gender, and age, as well as case status, for a total of k = 9 strata. The table below lists the number of phase 1 and phase 2 participants in each stratum. A weight is computed for each stratum and equals the number of phase 1 participants divided by the number of phase 2 participants in that stratum.
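The stratum-weight computation can be sketched as follows. The strata and counts here are hypothetical, chosen only so that the phase 1 totals sum to a 12,345-person cohort; they are not the ARIC strata:

```python
# Hypothetical stratum counts (not the ARIC data).
phase1_counts = {"stratum_A": 4000, "stratum_B": 3000, "stratum_C": 5345}
phase2_counts = {"stratum_A": 200, "stratum_B": 150, "stratum_C": 268}

# weight = (number of phase 1 participants) / (number of phase 2 participants)
weights = {s: phase1_counts[s] / phase2_counts[s] for s in phase1_counts}
print(weights)
```

Each phase 2 participant then "stands in" for weight-many phase 1 participants, so the weighted phase 2 counts reproduce the phase 1 totals; these weights are the inverse sampling probability weights used elsewhere in the paper.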