Experimental data
The raw data for the injury microarray experiment by [14] were downloaded from the Gene Expression Omnibus repository (http://www.ncbi.nlm.nih.gov/geo/). The accession number is GSE36809. This experiment was hybridized on Affymetrix HGU133 plus2 chips. Because we focus on identifying genes that present longitudinal expression change pattern between the trauma patients with complication and those without complication, only patients with uncomplicated recoveries and patients with complicated recoveries were considered here.
Based on the duration of recovery, uncomplicated recovery represents recovery within 5 days while complicated recovery includes recovery after 14 days, no recovery by 28 days, or death [14]. Thus, the potential time points for an uncomplicated recovery are days 1/2, 1, 4, 7 and 14, while those for a complicated recovery are days 1/2, 1, 4, 7,14, 21, and 28. Here, we collected 18 complicated patients with 7 full measures and 25 uncomplicated patients with 5 full measures and used the resulting dataset to train final models. Next, the data for patients with complications were truncated at 14 days in order to make comparisons with uncomplicated patients at each single time point. Lastly, we used the rest of complicated and uncomplicated patients as a test set to validate the results given by the proposed method. There were 50 uncomplicated patients and 23 complicated patients in the test set, and the time points considered in the test data were days 1/2, 1, 4, and 7 since early discharge of patients occurred before day 14. The characteristics of patients in the training set and the test set may be different since the patients in the test set were those had been discharged early from the hospitals. Of note, since different pre-processing procedures may impact the data analysis, the summary expression values of the experimental data provided by the GEO were downloaded and used directly. No pre-processing procedures were carried out.
Statistical methods
In this subsection, we first gave an introduction to the SAMGSR algorithm, and then provided a detailed description on the proposed procedure, which is referred to as the longitudinal SAMGSR method herein.
SAMGSR
The SAMGSR method extends the SAMGS method [15] by providing an additional reduction of significant gene sets into respective core subsets. This reduction step may approximately result in a 90% of reduction in the size of selected genes, in an effort to improve predictive performance and allow biological patterns to become more obvious. The SAMGSR method consists of two major steps [1]. The first step is the SAMGS process in which an SAMGS statistic is calculated. This statistic is the squared sum of SAM statistics over all genes within the specific gene set. Using a permutation test by perturbing phenotype labels to calculate a p-value for the SAMGS statistic, the statistical significance of a gene set is determined. In the second step, only those genes within significant gene sets identified by the first step are considered. The SAMGSR method reorders the genes within gene set j based on the magnitude of its SAM statistic and gradually partitions the entire gene set into two subsets: the reduced subset Rk which includes the first k genes with largest SAM statistic, and the residual subset \( {\overline{R}}_k \) being the complement of Rk for k = 1,…, |S-1|. Here S is the size of gene set j. Let ck be the SAMGS p-value of the residual subset \( {\overline{R}}_k \). The optimal size of reduced set Rk is the smallest k such that ck is larger than a pre-specified cut-off, e.g., 0.05. Conceptually, the significance level of a gene within a gene set is determined by the magnitude of its SAM statistic. It implies that if in a gene set |SAMi| > |SAMj| for genes i and j, gene j cannot enter the reduced subset unless gene i is inside the reduced subset.
Modification to SAMGSR for longitudinal data
In the longitudinal SAMGSR method, a gene set has different meaning, namely, it refers to a gene’s expression profiles across time. Firstly, the significant genes were selected in which the original SAMGS statistic was modified to as,
$$ {SAMGS}_g={\sum}_{t=1}^T{d}_t^2,\kern0.5em {d}_t=\left({\overline{x}}_d(t)-{\overline{x}}_c(t)\right)/\left(s(t)+{s}_{0t}\right) $$
(1)
here, dt is the SAM statistic [16] of gene g (g = 1,…,G) at time point t (t = 1,…, T). In the SAM statistic, \( {\overline{x}}_d(t) \) and \( {\overline{x}}_c(t) \) are the sample averages of gene g at time point t for the diseased and control group, respectively. Parameter s(t) is a pooled standard deviation that is estimated by pooling the expression values of all samples at the specific time point t, while s0t is a small positive constant used to offset the small variability in microarray expression measurements and thus to avoid the denominator of the SAM statistic being zero. Both s(t) and s0t are specific for individual time points because the variability of expression measurements may differ over time.
From the above equation, it is observed that each gene’s expression profiles over time are combined into a gene set. Namely, a gene set represents one specific gene over different time points. Our rational is that a gene’s expression values for the same individual over time are correlated, mimicking a gene set/pathway. This method first calculates the SAMGS statistics for all genes to select the relevant genes and then determines exact time point(s) where the expression values of the specific gene differ between two phenotypes.
In the method, ck is regarded as a tuning parameter. Using the sequence of 0.05, 0.1, …, 0.5, the optimal value of ck corresponds to the one associated with the minimum 5-fold cross-validation (CV) error. Figure 1 elucidates graphically the longitudinal SAMGSR algorithm. Of note, the essential difference between the SAMGSR method and the longitudinal SAMGSR algorithm is what a gene set refers — while one corresponds to a set of genes and a single time point, the other contains only a single gene but many time points.
Performance statistics
In this study, we use four metrics - Belief Confusion Metric (BCM), Area Under the Precision-Recall Curve (AUPR), Generalized Brier Score (GBS) and the misclassified error – to evaluate the performance of a classifier. Our previous study [17] described those metrics in detail. Briefly, they all range from 0 to 1. For the first two metrics the closer to 1 the better a classifier is, whereas a value of 0 is optimal for the GBS and the misclassified error.
Since an evaluation on individual time points using the selected statistical metrics might be unfair for the SAMGSR extension in that its tendency to identify those genes that are insignificant at isolated time points but significant jointly over time, we used the following steps to calculate overall performance statistics. Specifically, for those methods incapable of providing the final model simultaneously with the selection of relevant genes, e.g., the longitudinal SAMGSR method, a linear support vector machine (SVM) model using the selected genes was fit to estimate the β coefficients at individual time points. Then, the posterior probabilities of the samples belonging to the diseased group were calculated at each time point on the basis of the SVM models, and the averages of those probabilities over time were taken and used to obtain these performance statistics. Figure 2 shows graphically how to calculate BCM, AUPR, GBS and the error rate. For those methods that are able to provide final models, e.g., LASSO, no extra SVM fitting was needed.
Statistical language and packages
Statistical analysis was conducted in the R language version 3.2.1 (http://www.r-project.org), and R codes for the longitudinal SAMGSR algorithm are available in an additional file (see Additional file 1).