 Research article
 Open access
 Published:
Bayesian variable selection for high dimensional predictors and selfreported outcomes
BMC Medical Informatics and Decision Making volume 20, Article number: 212 (2020)
Abstract
Background
The onset of silent diseases such as type 2 diabetes is often registered through selfreport in large prospective cohorts. Selfreported outcomes are costeffective; however, they are subject to error. Diagnosis of silent events may also occur through the use of imperfect laboratorybased diagnostic tests. In this paper, we describe an approach for variable selection in high dimensional datasets for settings in which the outcome is observed with error.
Methods
We adapt the spike and slab Bayesian Variable Selection approach in the context of errorprone, selfreported outcomes. The performance of the proposed approach is studied through simulation studies. An illustrative application is included using data from the Women’s Health Initiative SNP Health Association Resource, which includes extensive genotypic (>900,000 SNPs) and phenotypic data on 9,873 African American and Hispanic American women.
Results
Simulation studies show improved sensitivity of our proposed method when compared to a naive approach that ignores error in the selfreported outcomes. Application of the proposed method resulted in discovery of several single nucleotide polymorphisms (SNPs) that are associated with risk of type 2 diabetes in a dataset of 9,873 African American and Hispanic participants in the Women’s Health Initiative. There was little overlap among the top ranking SNPs associated with type 2 diabetes risk between the racial groups, adding support to previous observations in the literature of disease associated genetic loci that are often not generalizable across race/ethnicity populations. The adapted Bayesian variable selection algorithm is implemented in R. The source code for the simulations are available in the Supplement.
Conclusions
Variable selection accuracy is reduced when the outcome is ascertained by errorprone selfreports. For this setting, our proposed algorithm has improved variable selection performance when compared to approaches that neglect to account for the errorprone nature of selfreports.
Background
The time to a silent event in several clinical settings can only be assessed through sequentially administered diagnostic tests. For example, diabetes can be detected by measuring levels of fasting blood glucose or glycosylated hemoglobin levels (HbA1c). Although gold standard diagnostic tests are often available, the associated cost is prohibitive in large epidemiological studies which often recruit hundreds of thousands participants. Instead, disease incidence is often ascertained through less expensive but errorprone procedures such as selfreport. One example is the Women’s Health Initiative (WHI), which enrolled 161,808 postmenopausal women aged 5079 years at 40 clinical centers across the U.S. from 19931998 with ongoing followup [1]. Due to cost considerations, prevalent and incident type 2 diabetes is ascertained by selfreported questionnaires at each annual visit. In this paper, we propose and apply a Bayesian variable selection (BVS) approach for variable selection in high dimensional datasets while simultaneously accounting for the errorprone nature of selfreported outcomes. We apply the proposed methods to discover single nucleotide polymorphisms (SNPs) associated with type 2 diabetes risk in the WHI Clinical Trial and Observational Study SNP Health Association Resource (SHARe), which includes extensive genotypic (>900,000 SNPs) and phenotypic data on 9,873 African American and Hispanic American women.The proposed methods equally apply when a silent outcome is ascertained through laboratorybased diagnostic procedures that are subject to misclassification.
When a timetoevent outcome is ascertained by a perfect diagnostic test that is administered at prespecified time points during the course of follow up, the outcome is intervalcensored. In this context, methods to estimate the survival distribution and assess covariate effects have been developed [2, 3]. However, when an errorprone diagnostic procedure such as a selfreport is used instead, standard methods for interval censored outcomes lead to bias. Previous work in this area include methods for modeling errorprone outcomes with application to studies in HIV, HPV and STD [4–7]. A formal likelihood framework was developed to estimate the distribution of the timetoevent of interest in the presence of errorprone laboratorybased diagnostic tests, in the context of pediatric HIV clinical trials [5]. Also in the context of pediatric HIV studies, the discrete proportional hazard model was extended to incorporate mismeasured outcomes and also covariates [7]. In related work, generalized Cox models were considered in settings involving timetoevent outcomes with incomplete event adjudication [8–10]. Other related work includes that proposed in the context of HPV studies [6], where the authors accommodate misclassification by incorporating ideas of binary generalized linear models with outcomes subject to misclassification [11]. A formal likelihood framework was proposed to accommodate sequentially administered, errorprone selfreports or laboratory based diagnostic tests for modeling the association of a targeted set of covariates with the timetoevent outcome of interest [12]. While a rich literature exists to handle estimation and hypothesis testing in the presence of errorprone survival outcomes, none of these approaches can be applied directly to variable selection in highdimensional data, in which the number of features (p) far exceeds the number of subjects (n). In this setting, standard likelihood based estimation approaches are intractable.
The BVS method has been previously proposed for variable selection in high dimensional datasets [13, 14]. The Bayesian model proceeds by assigning a mixture prior distribution to the regression coefficients (β) corresponding to the high dimensional predictors, for example  a mixture of a point mass at 0 and a uniform distribution [13], a mixture of two normal distributions centered at 0 but with distinct variances [14]. The estimated posterior probabilities of the latent binary indicators for inclusion in the model is used for variable selection. Several papers have applied this approach for discovering a sparse feature set associated with an outcome in highdimensional microarray data for various settings. Previous works include models for binary outcomes [15], multicategory responses [16], and censored outcomes [17]. Notably, the use of the BVS method in largescale settings such as genomewide association studies (GWAS) was successfully demonstrated [18]. The BVS approach has also been extended to in application to clustered data to simultaneously discover group structure and identify discriminating variables [19–21]. One advantage of the BVS approach when compared to other variable selection methods is that it can be naturally extended to incorporate external information such as biological pathway membership [22, 23]. A comprehensive review of BVS algorithms can be found in the literature [24]. Improvements to and novel applications of the BVS procedure continues to be an active area of research [25–28].
In this paper, we incorporate a BVS approach into a likelihoodbased model proposed by Gu, X. et al. (2015). This allows us to conduct variable selection in high dimensional data while accounting for the imperfect observation of a timetoevent outcome. Through simulation studies, we illustrate the impact of ignoring error in the outcome on variable selection and compare the performance of the BVS spikeandslab prior with that of our proposed algorithm. We apply the BVS approach to discover SNPs associated with incident type 2 diabetes in a dataset of 9,873 African American and Hispanic American women enrolled in the WHI.
The organization of this paper is as follows: In the “Methods” section, we present notation and the form of the likelihood function that accommodates error in selfreported outcomes. We incorporate this likelihood into the BVS algorithm, to handle highdimensional datasets. We conduct simulation studies to compare the variable selection performance of different approaches for high dimensional datasets arising from GWAS and metabolomic studies. We apply our proposed methods for the discovery of single nucleotide polymorphisms (SNPs) associated with type 2 diabetes risk in the WHI Clinical Trial and Observational Study SNP Health Association Resource (SHARe), among African American and Hispanic American women. Lastly, we discuss the findings of this study and highlight future directions.
Methods
In this section, we introduce notation, present the form of the likelihood function to accommodate errorprone, selfreported outcomes that has been previously described [12] and integrate with a BVS approach for variable selection.
Notation, likelihood function
Let X refer to the random variable denoting the unobserved timetoevent for an individual, with associated survival, density and hazard functions denoted by S(x),f(x) and λ(x), for x≥0. The time origin is set to 0, corresponding to the baseline visit at which all subjects enrolled in the study are assumed to be eventfree. This implies that Pr(X>0)=1.
Without loss of generality, we set X=∞ when the event of interest does not occur. Let N denote the number of subjects and n_{i} denote the number of visits for the i^{th} subject during the followup period. At each visit, we assume that a subject would selfreport their disease status as either positive or negative. For example, at each semiannual (WHICT) or annual contact (WHIOS), all participants were asked, “Since the date given on the front of this form, has a doctor prescribed any of the following pills or treatments?” Choices included “pills for diabetes” and “insulin shots for diabetes”. Thus, incident treated diabetes was ascertained, and was defined as a selfreport of a new physician diagnosis of diabetes treated with oral drugs or insulin.
Let τ_{1},⋯,τ_{J} denote the distinct, ordered visit times in the dataset among N subjects, where 0=τ_{0}<τ_{1}<...<τ_{J}<τ_{J+1}=∞. Thus, the time axis can be divided into J+1 disjoint intervals, [0,τ_{1}),[τ_{1},τ_{2}),⋯,[τ_{J},∞). Let Z denote the P×1 vector of covariates with corresponding P×1 vector of regression coefficients denoted by β. To incorporate the effect of covariates, we assume the proportional hazards (PH) model, \(\lambda \left (x\boldsymbol {Z}=\boldsymbol {z}\right)=\lambda _{0}(x)e^{\boldsymbol {z}'\boldsymbol {\beta }}\phantom {\dot {i}\!}\), or equivalently, \(S\left (x\boldsymbol {Z}=\boldsymbol {z}\right)=S_{0}(x)^{e^{\boldsymbol {z}'\boldsymbol {\beta }}}\). Thus, the loglikelihood function for a random sample of N subjects can be expressed as:
where θ_{j}=Pr(τ_{j−1}<X≤τ_{j}) and where \(\sum _{j=1}^{J+1} \theta _{j} = 1\). The elements D_{ij} of the matrix D are functions of the observed data including the visit times, the corresponding selfreported results, and the constants φ_{0},φ_{1} that correspond to the specificity and sensitivity of selfreports, respectively. The details of the derivation were originally reported in a previous publication [12] and have been included in Section 1 of the Supplement.
When P<<N and assuming that φ_{0},φ_{1} are known, the maximum likelihood estimates of the unknown parameters β_{1},⋯,β_{P},θ_{1},⋯,θ_{J} can be obtained by numerical maximization of the loglikelihood function in Eq. (1), subject to the constraints that \(\sum _{j=1}^{J+1}{\theta _{j}}=1\) and θ_{j}>0. Statistical inference regarding the parameters of interest (β_{1},⋯,β_{P},θ_{1},⋯,θ_{J+1}) can be made by using asymptotic properties of the maximum likelihood estimators [29]. For settings in which P>N, a Bayesian approach incorporating a spike and slab variable selection procedure is described below.
Bayesian variable selection (BVS)
In this section, we adapt the spike and slab BVS approach in the context of errorprone, selfreported outcomes. We introduce a latent vector γ=(γ_{p},1≤p≤P), where each γ_{p} is an indicator variable denoting whether the p^{th} covariate is included (γ_{p}=1) or not (γ_{p}=0) in the model. The BVS analysis proceeds via MCMC methods to estimate the posterior distribution of γ. With this latent variable formulation for variable selection, the log likelihood function in Eq. (1) is a function of the parameters θ_{1},⋯,θ_{J},β,γ and is denoted l(θ,β,γ). We assume the following hierarchical structure of the prior distributions corresponding to the unknown parameters in the model:
where δ_{0} is the Dirac function corresponding to a point mass at 0 and where b,w_{1},w_{2} are treated as known hyperparameters.
By treating the interval probabilities θ as nuisance parameters, we propose the following MetropolisHasting algorithm. At iteration t, we let the indices t−1 and ^{∗} denote the current and proposed values of the parameters, respectively.

1
Initialization: Set ω^{(0)} to a randomly generated value from Beta(w_{1},w_{2}) distribution. Set γ^{(0)}=0 and β^{(0)}=0. Optimize the loglikelihood function in Eq. (1) with respect to θ by fixing β=β^{(0)}, and then set θ^{(0)} to equal the optimized value for θ.

2
Update selected variable and associated regression coefficient: Select a covariate p∈(1,⋯,P) at random and let the proposed value \(\gamma _{p}^{*}=1\gamma _{p}^{t1}\).

(a)
If \(\gamma _{p}^{*}=0\), the corresponding regression coefficient \(\beta _{p}^{*}\) is set to 0. If \(\gamma _{p}^{*}=1\), the proposed regression coefficient \(\beta _{p}^{*}\) is sampled from N(0,b^{2}) distribution.

(b)
Optimize the loglikelihood function in Eq. (1) with respect to θ by fixing β=β^{∗}, denote the optimized value to be θ^{∗}. The optimized value θ^{∗} is used as the proposed value for θ.

(c)
Accept the proposed values \(\left (\gamma _{p}^{*}, \beta _{p}^{*}, \theta ^{*}\right)\) with probability min(e^{Δ},1) where:
$$\begin{array}{@{}rcl@{}} \Delta &=& l\left(\boldsymbol \theta^{*}, \boldsymbol \beta^{*}\right)  l\left(\boldsymbol \theta^{(t1)}, \boldsymbol \beta^{(t1)}\right) \\ &&+ \left(2\gamma_{p}^{*}1\right)\log\left(\frac{\omega^{(t1)}}{1\omega^{(t1)}} \right) \end{array} $$See Section 2 of the Supplement for details regarding the derivation of Δ.

(a)

3
Update regression coefficients and interval probabilities: For each included covariate, update the coefficient with a user defined probability p_{main}, for example 0.30. If a main effect β_{p} is chosen for update,

(a)
Let the proposed value, \(\beta _{p}^{*}\), be a random sample from the distribution \(N\left (\beta _{p}^{(t1)}, b^{2}\right)\).

(b)
Optimize the loglikelihood function in Eq. (1) with respect to θ by fixing β=β^{∗}, where we denote the optimized value to be θ^{∗}. The optimized value θ^{∗} is used as the proposed value for θ.

(c)
Accept the proposed values \(\left (\beta _{p}^{*}, \boldsymbol \theta ^{*}\right)\) with probability min(e^{Δ},1) where
$$\begin{array}{@{}rcl@{}} \Delta &=& l\left(\boldsymbol \theta^{*}, \boldsymbol \beta^{*}\right)  l\left(\boldsymbol \theta^{(t1)}, \boldsymbol \beta^{(t1)}\right) \\ &&+ \frac{\left(\beta_{p}^{(t1)}\right)^{2}  \left(\beta_{p}^{*}\right)^{2} }{2b^{2}} \end{array} $$See Section 2 of the Supplement for details regarding the derivation of Δ.

(a)

4
Update ω: Using Gibbs sampling, update ω by generating a sample from Beta (w_{1}+K_{γ},w_{2}+P−K_{γ}), where K_{γ} is the number of main effects selected.
After the burnin period, the covariates are ranked from most to least important by their inclusion probabilities based on the posterior distribution of γ_{p}.
Results
Simulation studies
We report results from simulation studies to evaluate the performance of the proposed BVS algorithm in the presence of outcomes subject to error, under various parameter settings. First, we consider a high dimensional dataset in which each feature is a random variable with three levels, reflecting the two possible homozygous (AA, aa) and the single heterozygous (Aa) combination of alleles of a SNP. In Section 3.2 of the Supplement, we consider a high dimensional dataset in which each feature is a continuous random variable, scaled to have mean 0 and unit variance.
The results presented here are averages of 1,000 simulated datasets, where each dataset included n=100 subjects and P=100 covariates. To mimic real data settings, the 100×100 design matrix was obtained by random sampling of a subset of 100 covariates for 100 participants from the WHI Clinical Trial and Observational Study SHARe (described below) and a metabolomics study of cardiovascular disease (see Section 3.2. of the Supplement). The design matrix was standardized before simulation.
Results from errorprone selfreports for each subject were simulated assuming four prescheduled visit times per subject over 8 years of follow up, with no missed visits. The distribution of true event times in the reference group (i.e. Z=0) was assumed to be exponential with baseline hazard denoted by λ_{0}. The value of λ_{0} was determined by fixing the corresponding cumulative incidence rate (CIR) in the reference group to equal CIR=10% or30%, over the 8year study duration. The true event time for each subject in the study was simulated from an exponential distribution, where the hazard function (λ) was determined through the PH model (λ=λ_{0}e^{βZ}). In each dataset, five out of 100 covariates were randomly sampled as true associations with the outcome, with a corresponding coefficient β=1.0 (or, hazard ratio of e^{β}≈2.7) in the PH model. The regression coefficients for the remaining 95 covariates in the PH model were set to 0. For each subject, an errorprone selfreport (positive or negative) at each visit was simulated from a Bernoulli distribution. At each visit, the probability of a positive selfreport was governed by the sensitivity (φ_{1}) if the visit time is after the true event time or the complement of the specificity (φ_{0}) if the visit time precedes the true event time. The values of (φ_{1},φ_{0}) were varied between [(1,1),(1,0.9),(0.75,1),(0.61,0.995)]. We note that the sensitivity and specificity values (0.61,0.995) correspond to the properties of diabetes selfreports in the WHI [30].
For each parameter setting, we compared the variable selection performance of the following three strategies: (1) Random survival forests assuming no error in selfreport [31]; (2) the proposed BVS algorithm assuming no error in the selfreport (φ_{1}=φ_{0}=1); and (3) the proposed BVS accounting for the imperfect nature of selfreports. Our rationale for selecting these algorithms for comparison include evaluating the: (1) performance of the proposed BVS approach when compared to a distinct, yet multivariable, nonparametric, treebased ensemble approach such as the Random Forests; and (2) potential increase in variable selection accuracy when accounting for error in selfreports through the proposed BVS approach.

1
Random survival forests (RSF): The algorithm implemented in the randomForestSRC R package [32] was applied to each dataset by defining the timetoevent outcome as the time from baseline (origin) to the time of the first positive selfreport (observed event) or the time of last observation (censored observation). In particular, this results in no difference in the handling of the two study designs (“No missed visits” versus “NTFP”). The Random survival forests was implemented using the function rfsrc in the package with default parameter setting, e.g. there are 1,000 trees in the forest. The 100 covariates were ranked by the variable importance metric output by the algorithm. The computing time for a single 100×100 dataset on a Macbook Pro 2017 is about 2 seconds.

2
Proposed BVS assuming no error in selfreports (BVS_{perfect}): This analysis was based on the proposed BVS algorithm by setting φ_{1}=φ_{0}=1. Note that in this case, we only consider selfreports up to the first positive report since negative selfreports that follow a positive selfreport have zero probability. The MCMC algorithm was run up to 100,000 iterations and the first 20,000 iterations were discarded as burnin. We set the values of the hyperparameters governing the prior distributions to the following: b=1.0,w_{1}=5,w_{2}=100. The 100 covariates were ranked based on the posterior distribution of γ. The computing time for a single 100×100 dataset on a Macbook Pro 2017 is about 3 minutes.

3
Proposed BVS algorithm (BVS_{e}): This analysis was based on the proposed BVS algorithm by setting φ_{1},φ_{0} to equal the values assumed in the data generation process. This algorithm is implemented in two ways: (1) Selfreports collected at all prescheduled visits are included in the analysis (referred to as “NMISS”); (2) Selfreports following the first positive are discarded (referred to as “No Test after First Positive or NTFP”). The MCMC algorithm was run up to 100,000 iterations and the first 20,000 iterations were discarded as burnin. We set the values of the hyperparameters governing the prior distributions to the following: b=1.0,w_{1}=5,w_{2}=100. The computing time for a single 100×100 dataset on a Macbook Pro 2017 is about 3 minutes.
The R code for implementing the simulations described in this Section has been included in Section 4 of the Supplement. Model diagnostics of the convergence of the MCMC chain and run length control indicated that the choice of 100,000 iterations with a burnin of 20,000 was justified (See Section 3.1 in the Supplement).
The 100×100 design matrix for this simulation study was randomly selected from the existing GWAS data from the WHI Clinical Trial and Observational Study SHARe, which includes extensive genotypic (>900K SNPs) and phenotypic data for 12,007 African American and Hispanic American women. All missing genotypes were imputed to be homozygous for the major allele “AA”. Among the 100 SNPs selected in the design matrix for the simulation, 74 had minor allele frequency (MAF) between 0 and 0.35 and the remaining 26 SNPs had MAF between 0.35 and 0.5. Each SNP was incorporated into the PH model as a numeric covariate that was generated by coding genotype “AA” as 0, genotype “Aa” as 1, and “aa” as 2. In this model, the homozygous major allele (AA) category serves as the reference group and the homozygous ‘aa’ category has twice the effect on outcome as the heterozygous ‘Aa’ category [33]. We note that this implies the assumption of a linear effect across the ordered genotype categories. While relaxing this assumption is straightforward from a modeling perspective, it would result in a significant increase in computational complexity.
Figure 1 in the Supplement presents the posterior distribution of γ for the 100 SNPs from a single representative simulation. Here, the data generating mechanism for the timetoevent outcome and selfreports was based on the first five SNPs, each with corresponding regression coefficient β=0.7 in the PH model. The results were based on φ_{1}=0.61,φ_{0}=0.995, a 30% rate of cumulative incidence in the reference group, and assuming that there were no missed visits. We observed that true associations (SNPs 15) had significantly higher posterior probabilities of inclusion when compared to the average corresponding value for those SNPs that were not associated with outcome.
Table 1 shows the proportion of simulated datasets in which a SNP that is associated with the outcome was found as ranking among the top five SNPs by the posterior probability of inclusion  results are averaged over the five SNPs with true associations with the outcome. In all settings with the exception of one (CIR=0.1,φ_{1}=1,φ_{0}=0.9), the BVS and BVS_{e} algorithms perform better than RSF. When selfreports after the first positive are excluded from analysis (NTFP), both BVS and BVS_{e} have comparable performance indicating that the performance loss due to assuming incorrect sensitivity and specificity values is negligible when test results are discarded. However, when selfreports at all visit times are included in analysis (“NMISS”), and when specificity (φ_{0}) is less than perfect, BVS_{e} results in a significantly higher probability of discovering true associations when compared to BVS. For example, when CIR =0.1, φ_{1}=1.00, φ_{0}=0.90, the probability of a SNP ranking among the top five by BVS_{e} under NMISS is 0.81 (±0.01) as compared to 0.34 (±0.01) and 0.41 (±0.01) by BVS (assuming perfect selfreports) and RSF, respectively. The false positive rates were comparable across algorithms for each simulation setting.
Similar results were observed for the setting where the features were continuous measurements, representing data observed in metabolomic studies (Section 3.2 of the Supplement).
Application
The proposed methods were applied to data from the WHI Clinical Trial and Observational Study SHARe, to identify SNPs associated with risk of incident type 2 diabetes mellitus. The dataset includes extensive genotypic (909,622 SNPs) and phenotypic information on 12,008 African American and Hispanic American women. After excluding participants who selfreported diabetes at baseline, the analysis was restricted to 9,873 participants.
Prevalent diabetes at baseline and incident diabetes were assessed through selfreported questionnaires in the WHI. At baseline and at each annual visit, every participant was asked whether she had ever received a physician diagnosis of and/or treatment for diabetes when not pregnant since the time of the last selfreport/visit. Using data from a WHI substudy [30], estimates of sensitivity, specificity, and baseline negative predictive value of selfreported diabetes outcomes were obtained by comparing selfreported outcomes to fasting glucose levels and medication data. A participant was considered to be truly diabetic if she had either taken antidiabetic medication and/or had a fasting glucose level ≥ 126mg/dL. By using a subset of 5485 participants, with information at baseline on diabetes selfreports, fasting glucose levels and medication inventory, we estimated that selfreports have a sensitivity (φ_{1}) of 0.61 and a specificity (φ_{0}) of 0.995 [30]. These parameter values are used in our analysis.
After excluding participants with selfreported diabetes at baseline, the remaining subsets of 6,704 African American and 3,169 Hispanic American women were analyzed independently. The results presented here are based on follow up until 2013. The average follow up from baseline was 11.6 years, with a maximum follow up of 16 years  during this period, 21.2% of the African American and 18.5% of the Hispanic American women selfreported a new diagnosis of diabetes.
The data preprocessing procedures resulted in 300,000 SNPs being included in the statistical analysis (Section 5 of the Supplement, Supplementary Figs. 23). Results from the proposed BVS algorithm (BVS_{e}) are compared to the BVS algorithm assuming perfect selfreports (BVS_{perfect}) and to two SNPbySNP approaches, including a model based on the likelihood in Eq. (1) (icensmis) [34] and the Cox proportional hazards (PH) model. Details regarding data preprocessing and the statistical models are presented in Section 5 of the Supplement.
Figures 45 in the Supplement show bar plots of the posterior probability of inclusion from the proposed algorithm (BVS_{e}) for the 300,000 SNPs included in each analysis in the African American and Hispanic American datasets, respectively. Since the dimensionality reduction procedure was carried out within each dataset, not all SNPs entered the BVS_{e} analysis in both datasets. SNPs that were found to rank among the top 10 most important by at least one of the aforementioned analyses in the datasets of African American women and Hispanic American women are shown in Tables 2 and 3, respectively. Each SNP is annotated with its host gene (if known) and its upstream and downstream genes. In both populations, the rankings by the proposed BVS_{e} algorithm differed significantly from the rankings by BVS assuming perfect selfreports and from each of the univariate (SNP by SNP) analyses (Tables 23), underscoring the value in utilizing a broad set of analytical approaches for variable selection in high dimensional datasets. The BVS algorithm assuming perfect selfreports identified only 2 SNPs in the African American dataset and one SNP in the Hispanic American dataset, with nonzero posterior probability of inclusion. Interestingly, none of the SNPs discovered among the top 10 (by at least one approach) in the African American dataset overlapped with the corresponding set in the Hispanic American subcohort.
In the dataset of African American women, a total of 19 SNPs were identified in the top 10 by at least one of the four strategies, while simultaneously adjusting for population stratification (Table 2). SNP rs2805434 was ranked within the top 2 ranks by all four analysis approaches. The host gene RYR2 has been implicated in insulin secretion in previous studies [35]. RYR2 is host to another SNP (rs2805429) identified as second most important by both univariate approaches in this population. rs2805434 was removed from the analysis dataset due to the preprocessing procedures in the Hispanic American women.
Another finding with support in the literature is that of SNP rs10126793  its upstream gene PDK3 or pyruvate dehydrogenase kinase 3 is in the class of PDK isoenzymes that have been shown to be strong therapeutic targets for preventing and treating metabolic diseases [36]. rs10126793 had a weak association with type 2 diabetes risk in the Hispanic American dataset with a posterior probability of inclusion of 0 in both BVS analyses and pvalues of 0.07 and 0.02 in univariate icensmis and Cox PH analyses, respectively.
In the dataset of Hispanic American women, a total of 19 SNPs were identified in the top 10 by at least one of the three strategies, while simultaneously adjusting for population stratification (Table 3). SNP rs6547248 was ranked as the top candidate by the BVS_{e} algorithm and among the top five SNPs by Cox PH and icensmis. The host gene CTNNA2 (or Catenin Alpha 2) is a protein coding gene and may function as a linker between cadherin adhesion receptors and the cytoskeleton to regulate cellcell adhesion and differentiation in the nervous system [37]. The host gene CTNNA2 was found among 24 novel candidate genes associated with type 2 diabetes risk in a African American subsample of 973 participants with type 2 diabetes and 104 healthy control participants in the GENNID study [38].
SNP rs488672 within host gene TRIM29 was ranked as second most important by BVS_{e} and among the top 30 SNPs by Cox PH and icensmis. RNA sequence data from an animal study involving a mouse model of type 2 diabetes showed that TRIM29 acts as an E3 ligase that targets both insulin receptor (IR) and insulin receptor substrate 1 (IRS1) for ubiquitindependent degradation, resulting in insulin resistance [39]. In this study, the authors showed that TRIM29 levels are ≥2.5 fold higher in the kidney cortex of diabetic mice when compared to wild type mice, respectively.
SNPs rs6079637 was found among the top 10 SNPs by all models with the exception of BVS_{perfect}. Similarly, SNP rs6135332 was found among the top 100 SNPs by BVS_{e} and among the top five SNPs by Cox PH and icensmis. Both SNPs are located in the intron of gene MACROD2 and are significantly correlated (p<0.0001), with R^{2}=0.936 and \(\phantom {\dot {i}\!}D^{'} = 1.0\) [40]. In a study of 1,100 Han Chinese individuals from 398 families in the Stanford Asian Pacific Program for Hypertension and Insulin Resistance study, genetic loci within the MACROD2 gene were associated with vascular adhesion protein1 levels (VAP1) in females. VAP1 is a membranebound amine oxidase highly expressed in mature adipocytes and released into the circulation. VAP1 has been strongly implicated in several pathological processes, including diabetes, inflammation, hypertension, hepatic steatosis and renal diseases, and is an important disease marker and therapeutic target [41].
SNP rs6899814 was ranked in the top 10 by each strategy with the exception of BVS_{perfect}, flanked by upstream gene RNY4 and downstream genes SH3BGRL2 and C6orf7. Of note, SH3BGRL2 was identified as a gene that is implicated in type 1 diabetes, type 2 diabetes and gestational diabetes in a transcriptome metaanalysis of peripheral lymphomononuclear cells [42]. rs6899814 was an insignificant predictor of incident type 2 diabetes in the African American dataset, with a posterior probability of inclusion of 0 in both BVS analyses and pvalues of 0.34 and 0.16 in univariate icensmis and Cox PH analyses, respectively.
In our independent analyses of African American and Hispanic American participants in the WHI, we identified several novel SNPs associated with type 2 diabetes risk. An overlap in the findings in the two datasets were two SNPs in the same genomic region that are flanked upstream by gene LYZL1 and downstream by C10orf126  these SNPs were rs17693218 and rs519206 in the datasets of African American and Hispanic American participants, respectively. Among the other top ranking SNPs, there was little overlap between the race/ethnicity groups (Tables 23). These results add to previous observations in the literature of disease associated genetic loci that are often not all generalizable across race/ethnicity populations [43].
Discussion
In this paper, we propose a BVS procedure for variable selection in high dimensional datasets, in settings where a timetoevent outcome is observed with error. The models developed in this paper are motivated by selfreported outcomes of incident type 2 diabetes collected in the Women’s Health Initiative, that are subject to misclassification. The proposed methods apply to other settings in which the event of interest is diagnosed using an imperfect laboratorybased diagnostic test that is administered at prescheduled times during followup.
We presented results from simulations, considering different data types (GWAS, metabolomics) and a variety of settings with regard to cumulative incidence of event during the study and sensitivity/specificity of the selfreport (or imperfect diagnostic test). When silent outcomes are ascertained through imperfect selfreports with imperfect specificity, our proposed algorithm has a significantly better performance with regard to variable selection when compared to the approaches that assume no error in the outcome ascertainment. In studies where collection of selfreports or diagnostic test results ceases after the first positive result, our modified algorithm no longer performs better than other approaches that ignore the error in outcomes. We applied the proposed algorithm to data from the WHI Clinical Trial and Observational Study SHARe in separate analyses of the data from African American and Hispanic American women. We found a distinct genetic signature associated with type 2 diabetes risk among the African American and Hispanic American populations in the WHI, with little overlap in risk alleles between groups.
The computational burden of BVS approaches can be considerable. In future work, it would be useful to explore efficient alternatives to the stochastic search algorithms  for example, the expectation maximization variable selection approach [26] could result in significant improvement in computational efficiency. Other useful extensions of the BVS procedure could involve incorporating known biological relationships between predictors, as discussed in previous work such as [22, 23, 25].
Conclusion
In high dimensional data applications, variable selection can be negatively impacted when the outcome of interest is observed with error such as in selfreports. In settings where the specificity of selfreported outcomes is less than perfect, a significant degradation in variable selection accuracy was observed. For this setting, our proposed algorithm employs a Bayesian variable selection approach that incorporates a likelihood function that models the error prone nature of the selfreported outcomes. The proposed algorithm had significantly better variable selection performance when compared to similar algorithms that ignore the error in the outcome. The proposed algorithm was applied to GWAS data in the WHI Clinical Trial and Observational Study SHARe to discover novel SNPs associated with risk of incident type 2 diabetes in African American and Hispanic American populations.
Availability of data and materials
The Women’s Health Initiative  SNP Health Association Resource (WHISHARe) data (dbGaP Study Accession: phs000200.v12.p3) can be obtained by requesting access through the dbGaP website https://dbgap.ncbi.nlm.nih.gov/.
Abbreviations
 SNP:

Single Nucleotide Polymorphism
 HbA1c:

glycosylated hemoglobin
 WHI:

Women’s Health Initiative
 BVS:

Bayesian Variable Selection
 SHARe:

SNP Health Association Resource
 HIV:

Human Immunodeficiency Virus
 HPV:

Human Papilloma Virus
 STD:

Sexually Transmitted Disease
 GWAS:

Genome wide association study
 WHICT:

Women’s Health Initiative Clinical Trials
 WHIOS:

Women’s Health Initiative Observational Study
 MCMC:

Markov Chain Monte Carlo
 CIR:

Cumulative Incidence Rate
 PH:

Proportional Hazards
 RSF:

Random Survival Forests
 NTFP:

No Tests following First Positive
 NMISS:

No Missing tests
 MAF:

Minor Allele Frequency
 CTNNA2:

Catenin Alpha 2
 VAP1:

Vascular Adhesion Protein 1
References
Anderson G, Cummings S, Freedman L, et al. Design of the women’s health initiative clinical trial and observational study. Control Clin Trials. 1998; 19(1):61–109.
Turnbull B. Empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc Ser B Methodol. 1976; 38:290–5.
Finkelstein D. A proportional hazards model for intervalcensored failure time data. Biometrics. 1986; 42:845–54.
Balasubramanian R, Lagakos S. Estimation of the timing of perinatal transmission of HIV. Biometrics. 2001; 57:1048–58.
Balasubramanian R, Lagakos S. Estimation of a failure time distribution based on imperfect diagnostic tests. Biometrika. 2003; 90:171–82.
McKeown K, Jewell N. Misclassification of current status data. Lifetime Data Anal. 2010; 16:215–30.
Meier A, Richardson B, Hughes J. Discrete proportional hazards models for mismeasured outcomes. Biometrics. 2003; 59:947–54.
Snapinn S. Survival analysis with uncertain endpoints. Biometrics. 1998; 54:209–18.
Cook T. Adjusting survival analysis for the presence of unadjudicated study events. Control Clin Trials. 2000; 21:208–22.
Cook T, Kosorok M. Analysis of timetoevent data with incomplete event adjudication. J Am Stat Assoc. 2004; 99:1140–52.
Neuhaus J. Bias and efficiency loss due to misclassified responses in binary regression. Biometrika. 1999; 86:843–55.
Gu X, Ma Y, Balasubramanian R. Semiparametric time to event models in the presence of errorprone, selfreported outcomes  with application to the women’s health initiative. Ann Appl Stat. 2015; 9(2):714–30.
Mitchell T, Beauchamp J. Bayesian variable selection in linearregression. J Am Stat Assoc. 1988; 83(404):1023–32.
George E, Mcculloch R. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993; 88(423):881–9.
Lee K, Sha N, Dougherty E, Vannucci M, Mallick B. Gene selection: a Bayesian variable selection approach. Bioinformatics. 2003; 19(1):90–7.
Sha N, Vannucci M, Tadesse M, Brown P, Dragoni I, Davies N, Roberts T, Contestabile A, Salmon M, Buckley C, et al. Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics. 2004; 60(3):812–9.
Sha N, Tadesse M, Vannucci M. Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics. 2006; 22(18):2262–8.
Guan Y, Stephens M. Bayesian variable selection regression for genomewide association studies and other largescale problems. Ann Appl Stat. 2011; 5(3):1780–815.
Tadesse M, Sha N, Vannucci M. Bayesian variable selection in clustering highdimensional data. J Am Stat Assoc. 2005; 100(470):602–17.
Kim S, Tadesse M, Vannucci M. Variable selection in clustering via Dirichlet process mixture models. Biometrika. 2006; 93(4):877–93.
Dunson D, Herring A, Engel S. Bayesian selection and clustering of polymorphisms in functionally related genes. J Am Stat Assoc. 2008; 103(482):534–46.
Li F, Zhang N. Bayesian variable selection in structured highdimensional covariate spaces with applications in genomics. J Am Stat Assoc. 2010; 105(491):1202–14.
Stingo F, Chen Y, Tadesse M, et al. Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 2011; 5(3):1978–2002.
O’Hara R, Sillanpaa M. A review of Bayesian variable selection methods: What, how and which. Bayesian Anal. 2009; 4(1):85–118.
Rockova V, Lesaffre E. Incorporating grouping information in Bayesian variable selection with applications in genomics. Bayesian Anal. 2014; 9:221–58.
Rockova V, George E. EMVS: The EM approach to Bayesian variable selection. J Am Stat Assoc. 2014; 109:828–46.
Jacobs R, Lesaffre E, Teunis P, et al. Identifying the source of foodborne disease outbreaks: An application of Bayesian variable selection. Stat Methods Med Res. 2017; 28(4):1–15.
Chen S, Nunez S, Reilly M, Foulkes A. Bayesian variable selection for postanalytic interrogation of susceptibility loci. Biometrics. 2017; 73:603–14.
Cox D, Hinkley D. Theoretical Statistics. Chapman and Hall CRC. 1979. https://www.routledge.com/TheoreticalStatistics/CoxHinkley/p/book/9780412161605.
Margolis K, Qi L, Brzyski R, et al. Validity of diabetes selfreports in the Women’s Health Initiative: comparison with medication inventories and fasting glucose measurements. Clin Trials. 2008; 5:240–7.
Ishwaran H, Kogalur U, Blackstone E, et al. Random survival forests. Ann Appl Stat. 2008; 2(3):841–60.
Ishwaran H, Kogalur U. randomForestSRC: Random forests for survival, regression and classification. (RFSRC) R package version 1.6.1. 2015.
Bush W, Moore J. Genomewide association studies. PLoS Comput Biol. 2012; 8:e1X002822.
Gu X, Balasubramanian R. icensmis: Study Design and Data Analysis in the presence of errorprone diagnostic tests and selfreported outcomes. R package version 1.1. 2013.
Dixit S, Wang T, Manzano E, Yoo S, Lee J, Chiang D, Ryan N, Respress J, Yechoor V, Wehrens X. Effects of CaMKIImediated phosphorylation of ryanodine receptor type 2 on islet calcium handling, insulin secretion, and glucose tolerance. Plos ONE. 2013; 8(3):e58655.
Jeoung N. Pyruvate dehydrogenase kinases: Therapeutic targets for diabetes and cancers. Diabetes Metab J. 2015; 39:188–97.
Genecards for Gene CTNNA2. https//doi.org/www.genecards.org/cgibin/carddisp.pl?gene=CTNNA2.
Hasstedt S, Highland H, Elbein S, Hanis C, Das S. Five linkage regions each harbor multiple type 2 diabetes genes in the African American subset of the GENNID Study. J Hum Genet. 2013; 58(6):378–83.
Habib S. TRIM29 Is a New Gene That Regulates IRS1 to Induce Insulin Resistant in Diabetes. Diabetes. 2018; Supplement 1:513–P. https://diabetes.diabetesjournals.org/content/67/Supplement_1/513P.
Linkage disequilibrium statistics. https://doi.org/ldlink.nci.nih.gov/.
Chang Y, Hee S, Lee W, Li H, Chang T, Lin M, Hung Y, Lee I, Hung K, Assimes T, et al. Genomewide scan for circulating vascular adhesion protein1 levels: MACROD2 as a potential transcriptional regulator of adipogenesis. J Diabetes Inv. 2018; 9(5):1067–74.
Collares C, Evangelista A, Xavier D, Takahashi P, Almeida R, Macedo C, ManoelCaetano F, Foss MC, FossFreitas M, Rassi D, et al. Transcriptome metaanalysis of peripheral lymphomononuclear cells indicates that gestational diabetes is closer to type 1 diabetes than to type 2 diabetes mellitus. Mol Biol Rep. 2013; 40:5351–8. https://pubmed.ncbi.nlm.nih.gov/23657602/.
Hutter C, Young A, OchsBalcom H, Carty C, Wang T, Chen C, Rohan T, Kooperberg C, Peters U. Replication of breast cancer GWAS susceptibility loci in the Women’s Health Initiative African American SHARe Study. Cancer Epidemiol Biomarkers Prev. 2011; 20:1950–9.
Acknowledgements
We are grateful to BG Medicine Inc. for providing the metabolomics and proteomics data from the cardiovascular disease study.
Funding
This work was partly supported by National Institutes of Health 1R01HL12224101A1, supporting effort on this project for XG, MGT and RB. The WHI program is funded by the National Heart, Lung, and Blood Institute, National Institutes of Health, U.S. Department of Health and Human Services through contracts HHSN268201100046C, HHSN268201100001C, HHSN268201100002C, HHSN268201100003C, HHSN268201100004C, and HHSN271201100004C.
Author information
Authors and Affiliations
Contributions
RB, MGT and XG conceived the study. XG carried out simulations and data analyses. ASF and YM provided guidance on the WHI data analyses. XG, MGT and RB drafted the manuscript. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
All WHI participants provided written informed consent. This study involving secondary analysis of data from the Women’s Health Initiative was approved by the Institutional Review Board at the University of Massachusetts  Amherst.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
Supplement to Bayesian variable selection for high dimensional predictors and selfreported outcomes
Additional file 2
The online supplementary material includes the following: File name: bmc_article_supplement.pdfDescription: Supplementary methods and results
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Gu, X., Tadesse, M.G., Foulkes, A.S. et al. Bayesian variable selection for high dimensional predictors and selfreported outcomes. BMC Med Inform Decis Mak 20, 212 (2020). https://doi.org/10.1186/s1291102001223w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1291102001223w