Bayesian variable selection for high dimensional predictors and self-reported outcomes

Background: The onset of silent diseases such as type 2 diabetes is often registered through self-report in large prospective cohorts. Self-reported outcomes are cost-effective; however, they are subject to error. Diagnosis of silent events may also occur through the use of imperfect laboratory-based diagnostic tests. In this paper, we describe an approach for variable selection in high dimensional datasets for settings in which the outcome is observed with error.

Methods: We adapt the spike and slab Bayesian Variable Selection approach to the setting of error-prone, self-reported outcomes. The performance of the proposed approach is studied through simulation studies. An illustrative application is included using data from the Women’s Health Initiative SNP Health Association Resource, which includes extensive genotypic (>900,000 SNPs) and phenotypic data on 9,873 African American and Hispanic American women.

Results: Simulation studies show improved sensitivity of our proposed method when compared to a naive approach that ignores error in the self-reported outcomes. Application of the proposed method resulted in the discovery of several single nucleotide polymorphisms (SNPs) associated with risk of type 2 diabetes in the dataset of 9,873 African American and Hispanic American participants in the Women’s Health Initiative. There was little overlap among the top-ranking SNPs associated with type 2 diabetes risk between the two racial/ethnic groups, supporting previous observations in the literature that disease-associated genetic loci often do not generalize across race/ethnicity populations. The adapted Bayesian variable selection algorithm is implemented in R, and the source code for the simulations is available in the Supplement.

Conclusions: Variable selection accuracy is reduced when the outcome is ascertained by error-prone self-reports. For this setting, our proposed algorithm has improved variable selection performance when compared to approaches that neglect to account for the error-prone nature of self-reports.


Likelihood incorporating self-reports
Here we present the assumptions underlying, and the derivation of, the observed-data log-likelihood that incorporates error-prone self-reports. This derivation has been previously published in [1].
Let X refer to the random variable denoting the unobserved time to event for an individual, with associated survival, density and hazard functions denoted by S(x), f(x) and λ(x), for x ≥ 0. The time origin is set to 0, corresponding to the baseline visit at which all subjects enrolled in the study are assumed to be event-free. In other words, Pr(X > 0) = 1. This assumption reflects the design and goals of population-based studies in which one is interested in estimating predictors of incident disease (i.e. X > 0).
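For reference, these three functions are linked by the standard survival-analysis identities (a general fact, not specific to this paper), which are used implicitly when the likelihood is later re-expressed in terms of the survival function alone:

```latex
S(x) = \Pr(X > x), \qquad
f(x) = -\frac{d}{dx}\, S(x), \qquad
\lambda(x) = \frac{f(x)}{S(x)}, \qquad
S(x) = \exp\!\left\{-\int_0^x \lambda(u)\, du\right\}.
```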
Without loss of generality, we set X = ∞ when the event of interest does not occur. Let N denote the number of subjects and n_i denote the number of visits for the i-th subject during the follow-up period. At each visit, we assume that each subject would self-report his/her disease status as either positive or negative. For example, at each semi-annual (WHI-CT) or annual contact (WHI-OS), all participants were asked, "Since the date given on the front of this form, has a doctor prescribed any of the following pills or treatments?" Choices included "pills for diabetes" and "insulin shots for diabetes". Thus, incident treated diabetes was ascertained, and was defined as a self-report of a new physician diagnosis of diabetes treated with oral drugs or insulin.
For the i-th subject, we let R_i = {R_i1, ..., R_in_i} and t_i = {t_i1, ..., t_in_i} denote the 1 × n_i vectors of self-reported, binary outcomes and corresponding visit times, respectively. R_ik is equal to 1 if the k-th self-report for the i-th subject is positive (indicating occurrence of the event of interest such as diabetes) and 0 otherwise. We assume that self-reports are collected at pre-scheduled visits up to the time of the first positive self-report; thus, the vectors of test results or self-reports (R_i), visit times (t_i) and the number of self-reports collected per subject (n_i) are random. Let τ_1, ..., τ_J denote the distinct, ordered visit times in the dataset among the N subjects, where 0 = τ_0 < τ_1 < ... < τ_J < τ_{J+1} = ∞. Thus, the time axis can be divided into J + 1 intervals defined by the boundaries τ_0, τ_1, ..., τ_J, τ_{J+1}. The joint probability of the observed data for the i-th subject can be expressed as:

Pr(R_i, t_i, n_i) = Σ_{j=1}^{J+1} Pr(R_i | τ_{j-1} < X ≤ τ_j) Pr(τ_{j-1} < X ≤ τ_j).

To simplify, we make the assumption that an individual's n_i self-reports are independent conditional on the true time to event X_i. That is,

Pr(R_i | X_i) = Π_{k=1}^{n_i} Pr(R_ik | X_i).

This assumption implies that the observed values of other self-reported outcomes do not provide additional information about the distribution of a particular self-reported outcome beyond that provided by the actual time of the event. We note that this assumption is similar in spirit to the common assumption in measurement error regression models that the conditional distribution of the response variable, given the covariate and its proxy, is the same as the conditional distribution given only the covariate.
It can be shown that the joint probability of the observed data for the i-th subject can be simplified as:

Pr(R_i, t_i, n_i) = Σ_{j=1}^{J+1} C_ij θ_j,    (1)

where θ_j = Pr(τ_{j-1} < X ≤ τ_j) denotes the probability that the event time falls in the j-th interval and C_ij = Π_{k=1}^{n_i} Pr(R_ik | τ_{j-1} < X ≤ τ_j). See details presented in [1]. We assume that the probability of a positive self-report at the k-th visit (R_ik = 1), conditional on the interval containing the true event time and on the visit time, can be expressed as:

Pr(R_ik = 1 | τ_{j-1} < X ≤ τ_j, t_ik) = ϕ_1 if t_ik ≥ τ_j (the event has occurred by the visit) and 1 − ϕ_0 if t_ik ≤ τ_{j-1} (the event has not yet occurred at the visit).

Here, ϕ_1 and ϕ_0 denote the sensitivity and specificity of self-reports, respectively. Thus the terms C_ij, for j = 1, ..., J + 1 in equation (1) can be expressed as a product involving the constants ϕ_1 and ϕ_0. Thus, in the absence of covariates, the log likelihood for a random sample of N subjects can be expressed as:

l(θ) = Σ_{i=1}^{N} log ( Σ_{j=1}^{J+1} C_ij θ_j ).

For the special case where self-reports are perfect (ϕ_1 = ϕ_0 = 1), the likelihood above reduces to the non-parametric likelihood for interval censored observations given in Turnbull (1976).
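As a concrete illustration, the following R sketch (not the authors' released implementation; all object names are hypothetical) constructs the N × (J+1) matrix of coefficients C_ij from the self-reports, visit times and the constants ϕ_1 and ϕ_0, under the assumption that every visit time coincides with one of the ordered times τ_1, ..., τ_J:

```r
## visit_times: list of length N; numeric vector of visit times t_i per subject
## results:     list of length N; 0/1 vector of self-reports R_i per subject
## tau:         ordered distinct visit times tau_1 < ... < tau_J
## phi1, phi0:  sensitivity and specificity of the self-reports
build_C <- function(visit_times, results, tau, phi1, phi0) {
  N <- length(visit_times)
  J <- length(tau)
  C <- matrix(1, nrow = N, ncol = J + 1)
  bounds <- c(0, tau, Inf)  # interval j is (bounds[j], bounds[j + 1]]
  for (i in seq_len(N)) {
    t_i <- visit_times[[i]]
    r_i <- results[[i]]
    for (j in seq_len(J + 1)) {
      # Given X in (bounds[j], bounds[j + 1]], a visit at t >= bounds[j + 1]
      # takes place after the event (positive true state, Pr(R = 1) = phi1);
      # a visit at t <= bounds[j] takes place before the event
      # (negative true state, Pr(R = 1) = 1 - phi0).
      p_pos <- ifelse(t_i >= bounds[j + 1], phi1, 1 - phi0)
      C[i, j] <- prod(ifelse(r_i == 1, p_pos, 1 - p_pos))
    }
  }
  C
}
```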
Let Z denote the P × 1 vector of covariates with corresponding P × 1 vector of regression coefficients denoted by β. To incorporate the effect of covariates, we assume the proportional hazards model, λ(x | Z = z) = λ_0(x) e^{z'β}, or equivalently, S(x | Z = z) = {S_0(x)}^{e^{z'β}}, where λ_0(x) and S_0(x) denote the baseline hazard and survival functions. To derive the form of the log-likelihood based on the assumption of the proportional hazards model, we re-parameterize the log-likelihood derived from (1) in terms of the survival function, S = (S_1 = 1, S_2, ..., S_{J+1})^T, where S_j = Pr(X > τ_{j-1}). Since S_j = Σ_{l=j}^{J+1} θ_l, the vector of interval probabilities can be expressed as θ = T_r S, where T_r is the (J + 1) × (J + 1) transformation matrix (with ones on the diagonal and −1 on the superdiagonal, so that θ_j = S_j − S_{j+1} for j ≤ J and θ_{J+1} = S_{J+1}). Let C = [C_ij] denote the N × (J + 1) matrix of the coefficients C_ij, and let the N × (J + 1) matrix D be defined as D = C T_r. Then, the log-likelihood function for the one-sample setting can be expressed as:

l(S) = Σ_{i=1}^{N} log ( Σ_{j=1}^{J+1} D_ij S_j ),    (2)

where S_1 = 1 and S_2, S_3, ..., S_{J+1} are the unknown parameters of interest.
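Continuing the sketch above (again with illustrative names), the transformation matrix T_r and the matrix D = C T_r follow directly from the relation θ_j = S_j − S_{j+1}:

```r
## T_r maps S = (S_1 = 1, S_2, ..., S_{J+1}) to the interval probabilities
## theta, i.e. theta_j = S_j - S_{j+1} for j <= J and theta_{J+1} = S_{J+1}.
make_Tr <- function(J) {
  Tr <- diag(J + 1)
  for (j in seq_len(J)) Tr[j, j + 1] <- -1
  Tr
}

J  <- length(tau)       # tau as in the previous sketch
Tr <- make_Tr(J)
D  <- C %*% Tr          # C from build_C() above; D is N x (J + 1)
```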
Let 1 = S_1 > S_2 > ... > S_{J+1} denote the baseline survival function (i.e. corresponding to Z = 0), evaluated at the left boundaries of the intervals [0, τ_1), [τ_1, τ_2), ..., [τ_J, ∞). Then, for subject i with corresponding covariate vector z_i, S_j^{(i)} = (S_j)^{e^{z_i'β}}. Thus, the log-likelihood function for a random sample of N subjects in equation (2) can be extended to incorporate covariates as:

l(β, S) = Σ_{i=1}^{N} log ( Σ_{j=1}^{J+1} D_ij (S_j)^{e^{z_i'β}} ).    (3)

The elements of the D matrix are functions of the observed data, including the visit times and the corresponding self-reported results (t_i, R_i for i = 1, ..., N), and of the constants ϕ_0 and ϕ_1.
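A minimal R sketch of the log-likelihood in equation (3), assuming the matrix D from the previous sketch and illustrative names for the covariate matrix and coefficient vector:

```r
## D:    N x (J + 1) matrix defined above
## S:    vector (S_1 = 1, S_2, ..., S_{J+1}) of baseline survival values
## Z:    N x P covariate matrix; beta: length-P vector of log hazard ratios
loglik <- function(beta, S, D, Z) {
  eta  <- exp(drop(Z %*% beta))                  # e^{z_i' beta} for each subject
  Smat <- outer(eta, S, function(e, s) s^e)      # Smat[i, j] = S_j^{e^{z_i' beta}}
  sum(log(rowSums(D * Smat)))                    # sum_i log( sum_j D_ij S_j^{e^{z_i' beta}} )
}
```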

Metropolis Hastings algorithm
In both Steps 2 and 3 of the MH algorithm, the acceptance probabilities (e^Δ) are obtained according to the following:

e^Δ = [π(θ* | data) Q(θ^(t−1) | θ*)] / [π(θ^(t−1) | data) Q(θ* | θ^(t−1))],

where π(· | data) denotes the posterior density, Q(·) denotes the transition probability, θ* denotes the proposed value and θ^(t−1) denotes the current value of θ.
In Step 2 of the MH algorithm, the randomly selected predictor and its corresponding regression coefficient are updated; the proposed update is accepted with probability e^Δ, where Δ is computed from the ratio of posterior densities and transition probabilities under the proposed and current states, as described above. In Step 3 of the MH algorithm, the regression coefficients for all included covariates and the interval probabilities are updated. The acceptance probability in Step 3c) is likewise e^Δ, obtained in the same manner.
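The generic accept/reject step underlying Steps 2 and 3 can be sketched in R as follows; log_post() and propose() are hypothetical placeholders for the log posterior density and the proposal mechanism, and this is not the exact update used in the algorithm:

```r
## One generic Metropolis-Hastings step: propose a new state, compute
## Delta on the log scale, and accept with probability min(1, e^Delta).
mh_step <- function(theta_curr, log_post, propose) {
  prop  <- propose(theta_curr)  # list(value = theta*, log_q_ratio = log Q(curr | prop) - log Q(prop | curr))
  delta <- log_post(prop$value) - log_post(theta_curr) + prop$log_q_ratio
  if (log(runif(1)) < delta) prop$value else theta_curr
}
```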

Model diagnostics
The simulation studies reported in the Results section were based on individual MCMC chains that were run for up to 100,000 iterations, with the first 20,000 iterations discarded as burn-in. To justify the choice of the number of iterations and the burn-in period, we carried out several diagnostic tests.
The Heidelberg and Welch convergence diagnostic [3] was used to test the null hypothesis that the sampled values of ω are drawn from a stationary distribution. The total number of iterations needed and the estimated burn-in were calculated using the Raftery and Lewis diagnostic [4]; this run-length control diagnostic estimates the number of iterations required to estimate the 2.5th percentile to within an accuracy of ±0.005, with probability 0.95. Model diagnostics were estimated using the R package coda. Table 1 presents model diagnostics for the simulation settings considered in the paper, exploring various values of cumulative incidence, sensitivity and specificity. The Heidelberg and Welch convergence diagnostic p values were all greater than 0.05, indicating no evidence to reject the null hypothesis of a stationary distribution; however, for three of the eight settings considered the p values were < 0.10. For two of these settings, the estimated number of iterations and burn-in were under 100,000 and 20,000, respectively.
Additionally, trace plots of the log posterior and of the number of included variables in the model were examined.
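For illustration, the diagnostics described above can be obtained with the coda package roughly as follows, where omega_draws is a hypothetical numeric vector of sampled values of ω from a single chain:

```r
library(coda)

chain <- mcmc(omega_draws)

# Heidelberger-Welch test of the null hypothesis of a stationary distribution
heidel.diag(chain)

# Raftery-Lewis run-length control: iterations required to estimate the
# 2.5th percentile to within +/- 0.005 with probability 0.95
raftery.diag(chain, q = 0.025, r = 0.005, s = 0.95)

# Trace plot of the sampled values
traceplot(chain)
```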

Simulation study: continuous features motivated by metabolomics
The 100 × 100 design matrix for the simulation study was randomly selected from data profiled in a cardiovascular disease metabolomics study that was conducted to discover prognostic biomarkers in blood plasma for near-term cardiovascular events [2]. Participants were selected from the CATHGEN project, which collected peripheral blood samples from consenting research participants undergoing cardiac catheterization at Duke University Medical Center from 2001 through 2011. A total of 68 cases were selected from among individuals who had a major adverse cardiac event (MACE) within two years following the time of their sample collection. In a 1:1 matched study design, 68 controls were selected from individuals who were MACE-free for the two years following sample collection and were matched to cases on age, gender, race/ethnicity and severity of coronary artery disease. High-content mass spectrometry and multiplexed immunoassay-based techniques were employed to quantify 472 metabolites from each subject's serum specimen. Comprehensive metabolite profiling of the individual samples was based on a combination of four platforms employing mass spectrometry (MS) based techniques to profile lipids, fatty acids, amino acids, sugars and other metabolites. A detailed description of the mass spectrometry based platforms can be found in a previous publication [2]. Metabolite levels in the 100 × 100 randomly selected design matrix were natural-log transformed and standardized to have mean zero and unit variance.

Table 2 shows the proportion of simulated datasets in which a metabolite that is associated with the outcome ranked among the top five metabolites by the posterior probability of inclusion; results are averaged over the five metabolites with true associations with the outcome. Trends similar to those observed in Section 3.1 were observed here. In particular, the largest gains in performance by the proposed approach (BVS_e), when compared to BVS (assuming perfect self-reports) and RSF, were observed when there are no missed visits and when specificity (ϕ_0) is less than perfect. For example, when CIR = 0.3, ϕ_1 = 1.00 and ϕ_0 = 0.90, the probability of a metabolite ranking among the top five by BVS_e when all self-reports are included in the analysis (NMISS) is 0.84 (±0.01), as compared to 0.57 (±0.01) and 0.44 (±0.01) by BVS (assuming perfect self-reports) and RSF, respectively. The false positive rates were comparable across algorithms for each simulation setting.
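The pre-processing step described above (natural-log transformation followed by standardization of each metabolite to mean zero and unit variance) can be sketched in R as follows, where met is a hypothetical matrix of raw metabolite levels with one column per metabolite:

```r
# Column-wise centering and scaling of the log-transformed metabolite levels
X <- scale(log(met))
```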

Application: WHI Clinical Trial and Observational Study SHARe
Within each race/ethnicity subgroup, the following two-step procedure was followed to reduce dimensionality. First, SNPs that satisfied one or more of the following three criteria were excluded from the analysis: (i) more than 1% missing values; (ii) minor allele frequency less than 5%; or (iii) a Hardy-Weinberg equilibrium test p value less than 10^-6. Second, the univariate association of each SNP with incident diabetes was estimated using a two-sided likelihood ratio test based on the observed data likelihood in equation (3), implemented in the R package icensmis [1]. The SNPs were ranked by increasing p value and the top 300,000 SNPs were included in our analysis. All missing genotypes were imputed to be the homozygous wildtype ("AA"). Each SNP was coded as a numeric variable by setting genotype "AA" to 0, genotype "Aa" to 1, and "aa" to 2.
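A rough R sketch of the quality-control filter and genotype coding described above (illustrative only; geno is a hypothetical matrix of genotype calls coded "AA"/"Aa"/"aa" with NA for missing values, and hwe_p is a hypothetical vector of pre-computed Hardy-Weinberg equilibrium test p values, one per SNP):

```r
miss_rate <- colMeans(is.na(geno))                   # per-SNP missingness
num <- matrix(match(geno, c("AA", "Aa", "aa")) - 1,  # "AA" -> 0, "Aa" -> 1, "aa" -> 2
              nrow = nrow(geno))
maf <- colMeans(num, na.rm = TRUE) / 2               # frequency of the "a" allele

# Keep SNPs with <= 1% missingness, MAF >= 5% and HWE p value >= 1e-6
keep <- miss_rate <= 0.01 & maf >= 0.05 & hwe_p >= 1e-6
num  <- num[, keep]

# Impute missing genotypes to homozygous wildtype ("AA" -> 0)
num[is.na(num)] <- 0
```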
To quantify the extent of genetic variability within each self-reported race category (African American and Hispanic American), we carried out a principal components analysis and extracted the top two principal components [2]. In the African American dataset, the first and second principal components (PC1, PC2) accounted for 1.9% and 0.1% of the total variability, respectively. Similarly, in the Hispanic American dataset, the first and second principal components (PC1, PC2) accounted for 2.1% and 0.9% of the total variability, respectively. Figures 2-3 in the Supplement show scatterplots of PC1 versus PC2, estimated separately from the 300K SNPs in the African American and Hispanic American datasets within the WHI. The top two principal components were incorporated into the analysis to account for population stratification within each self-reported race/ethnicity group.
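A conceptual R sketch of this population-stratification step, assuming the filtered 0/1/2 genotype matrix num from the previous sketch (in practice, a matrix of this size is usually handled with a specialized or truncated singular value decomposition rather than prcomp):

```r
pc  <- prcomp(num, center = TRUE, scale. = TRUE)
PCs <- pc$x[, 1:2]                       # PC1 and PC2, one row per participant
summary(pc)$importance[2, 1:2]           # proportion of total variability explained
```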
The 300K SNPs included in each analysis were ranked from most to least important by the following four methods:

1. Proposed BVS algorithm (BVS_e), where ϕ_1 = 0.61 and ϕ_0 = 0.995. The values of the hyper-parameters governing the prior distributions were set to the following: b = 1.0, w_1 = 5, w_2 = 100. We note that the choice of b = 1 allows the log hazard ratios to vary between -2 and 2 with probability 0.95. SNPs were ranked based on the posterior probability of inclusion.
2. BVS algorithm (BVS_perfect) assuming perfect self-reports, where ϕ_1 = 1.0 and ϕ_0 = 1.0. The values of the hyper-parameters governing the prior distributions were set to the following: b = 1.0, w_1 = 5, w_2 = 100. We note that the choice of b = 1 allows the log hazard ratios to vary between -2 and 2 with probability 0.95. SNPs were ranked based on the posterior probability of inclusion.
3. Univariate (SNP by SNP) analysis based on the likelihood in equation (3), implemented in the R package icensmis [1] (denoted icensmis). In this analysis, we set ϕ_1 = 0.61 and ϕ_0 = 0.995. The first two principal components were included as additional covariates. SNPs were ranked based on a two-sided likelihood ratio test p value.
4. Univariate (SNP by SNP) analysis based on a Cox proportional hazards (PH) model. The time origin for each participant was set to the time of entry into the WHI. The time-to-event outcome was defined as the minimum of the time to the first positive self-report (observed event) and the time to the last observation (censored event). The first two principal components were included as additional covariates. SNPs were ranked based on a two-sided likelihood ratio test p value (a minimal sketch of this analysis is given after this list).
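A minimal sketch of the univariate Cox proportional hazards analysis in method 4, assuming a hypothetical data frame dat with the follow-up time, event indicator (1 = first positive self-report observed, 0 = censored), a single SNP coded 0/1/2, and the two principal components:

```r
library(survival)

fit0 <- coxph(Surv(time, event) ~ PC1 + PC2, data = dat)        # adjustment covariates only
fit1 <- coxph(Surv(time, event) ~ snp + PC1 + PC2, data = dat)  # add the SNP of interest

# Two-sided likelihood ratio test p value for the SNP effect (1 degree of freedom)
p_lrt <- pchisq(2 * (fit1$loglik[2] - fit0$loglik[2]), df = 1, lower.tail = FALSE)
```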
The computing time for the MCMC analysis of the dataset of African American WHI participants (300K SNPs on 6704 participants) on a 2013 Mac Pro 3.5 GHz 6-Core Intel Xeon E5 computer was about 10 days. The computing time for the dataset of Hispanic American WHI participants (300K SNPs on 3169 participants) on the same computer was about 5 days. In each analysis, the MCMC algorithm was run for up to 5 million iterations and the first 1 million iterations were discarded as burn-in. The MCMC algorithm was run using R [3].

Table 2: Probability of ranking among the top five metabolites by posterior probability of inclusion, for metabolites that are associated with the outcome. CIR denotes the cumulative incidence rate in the reference group, RSF denotes Random Survival Forests, BVS denotes the proposed BVS algorithm assuming perfect self-reports, and BVS_e denotes the proposed BVS procedure. NTFP denotes a study design setting in which all self-reports following the first positive result are discarded, and NMISS denotes the setting where there are no missed visits.