Can multiple SNP testing in BRCA2 and BRCA1 female carriers be used to improve risk prediction models in conjunction with clinical assessment?

Background Several single nucleotide polymorphisms (SNPs) at different loci have been associated with breast cancer susceptibility, accounting for around 10% of the familial component. Recent studies have found direct associations between specific SNPs and breast cancer in BRCA1/2 mutation carriers. Our aim was to determine whether validated susceptibility SNP scores improve the predictive ability of risk models, in comparison to and in conjunction with other clinical/demographic information. Methods Female BRCA1/2 carriers were identified from the Manchester genetic database, and included in the study regardless of breast cancer status or age. DNA was extracted from blood samples provided by these women and used for gene and SNP profiling. Estimates of survival were examined with Kaplan-Meier curves. Multivariable Cox proportional hazards models were fit in the separate BRCA datasets and in menopausal stages, screening different combinations of clinical/demographic/genetic variables. Nonlinear random survival forests were also fit to identify relevant interactions. Models were compared using the complement of Harrell's concordance index (1 - c-index). Results 548 female BRCA1 mutation carriers and 523 BRCA2 carriers were identified from the database. The median Kaplan-Meier estimate of survival was 46.0 years (44.9-48.1) for BRCA1 carriers and 48.9 years (47.3-50.4) for BRCA2 carriers. By fitting Cox models and random survival forests, including both a genetic SNP score and clinical/demographic variables, average 1 - c-index values were 0.221 (st.dev. 0.019) for BRCA1 carriers and 0.215 (st.dev. 0.018) for BRCA2 carriers. Conclusions Random survival forests did not yield higher performance compared to Cox proportional hazards models. We found improvement in prediction performance when coupling the genetic SNP score with clinical/demographic markers, which warrants further investigation.

Background
BRCA1 and BRCA2 are major susceptibility genes that confer high lifetime risks of both breast and ovarian cancer. Deleterious mutations in these autosomal dominant cancer genes account for approximately 15-20% of the familial component of breast cancer [1][2][3]. The variable penetrance exhibited by these BRCA mutations suggests that other genetic factors are involved [4], and several studies have now identified a large number of breast cancer susceptibility alleles [5][6][7]. Until recently, genome-wide association studies had identified 19 common variants at 18 loci that are associated with breast cancer susceptibility [5,7], though the risk attributed to each of these single nucleotide polymorphisms (SNPs) is often modest and remains largely unexplained [6]. More recent studies of these polymorphisms have found direct associations between specific SNPs and breast cancer in BRCA1/2 mutation carriers; the TOX3, FGFR2, MAP3K, LSP1, 2q35, SLC4A7, 1p11.2, 5p12 and 6q25.1 loci have all been associated with an increased risk of breast cancer in BRCA2 mutation carriers [6,7]. Antoniou et al. [6] further determined that polymorphisms at TOX3, 2q35 and 6q25.1 increased risk for BRCA1 mutation carriers. However, a recent study by Ingham et al. [8] found that the 18 validated breast cancer susceptibility SNPs do not differentiate the risks of breast cancer in those with BRCA1 mutations.
Some genetic modifiers may influence breast cancer risk factors rather than being directly associated with the disease; one example is the genetic component associated with high mammographic density [4,9]. A recent study by Mitchell et al., comparing mammographic density in 206 BRCA1 and BRCA2 carriers with that in non-carriers, found a significant association between increasing density and increased breast cancer risk in BRCA1/2 carriers [9].
Alongside risk factors with a genetic component, there are several hormonal risk factors that are thought to be associated with breast cancer, both in the general population and among those with hereditary breast cancer [10]. Changes in breast mitotic/apoptotic activity have been correlated with alterations in hormone levels across the menstrual cycle, and reduced levels of oestrogen and progesterone have been linked to a reduced risk of breast cancer [11,12]. However, some debate surrounds the association of these factors with breast cancer among BRCA1/2 carriers, with some studies finding an association only in BRCA1 mutation carriers [13] and others finding no association [12]. Modifiable factors, such as body mass index (BMI), are also thought to influence the risk of breast cancer. Obesity has a well-documented association with breast cancer in the general population, owing to its influence on biological pathways [14], and postmenopausal weight gain has been associated with increased risk among BRCA carriers [15].
At present, several personalised risk prediction models have been developed using familial, demographic, clinical, laboratory and genetic information domains, with a few combinations thereof [8][16][17][18][19]; examples include the Gail, BOADICEA and IBIS methods [20]. More specific studies include surveys on gene expression markers [21] and the use of machine learning for predicting recurrence or redefining subtypes [22,23].
The aim of this study was to determine whether validated susceptibility SNPs improve the predictive ability of risk models, in conjunction with and in comparison to demographic and clinical information.

Study population
Patients included in this study were female carriers of pathogenic BRCA1 and BRCA2 mutations ascertained from the Genetic Medicine department, St Mary's Hospital, Manchester, UK. This clinic is one of the largest specialist genetics departments within the UK, and all families with a history of breast or ovarian cancer within the North West region are referred to it. Patients were included in this study regardless of breast cancer status or age. Dates of birth were taken from the information collected at the time of family referral to the genetics department. Cases of breast cancer were confirmed by means of hospital records or the North West Cancer Intelligence Service. The date of last follow-up was the date of breast cancer diagnosis, the date the woman was last in contact with the genetics department or another NHS service, or the date of death.

Ethics statement
This research has been performed in accordance with the Declaration of Helsinki. The NHS Health Research Authority, National Health Research Ethics Committee North West, Greater Manchester Central (Barlow House, 4 Minshull Street, Manchester, M1 3DZ), reviewed this study and gave ethical approval; the Research Ethics Committee reference number is 10/H1008/24, dated 11th July 2013. Written informed consent was obtained from all study participants (none of whom were minors at the time of enrolment).

DNA testing
DNA was extracted from blood samples provided by women attending the genetic clinics. Sanger sequencing and multiplex ligation-dependent probe amplification analysis were used for gene and SNP profiling; BRCA1 and BRCA2 mutations were identified, as well as the presence of any of the 18 tested breast cancer SNPs. An overall breast cancer SNP risk score was calculated for each woman using the methods described by Ingham et al. [8].
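The scoring method itself is described in Ingham et al. [8] and is not reproduced here. Purely as an illustration of how a multi-SNP score can be assembled, the Python sketch below uses a generic log-additive construction (risk-allele counts weighted by log odds ratios). This construction, the SNP names and the odds ratios are all assumptions made for the example; they are not the values or necessarily the formula used in the study.

```python
import math

# Hypothetical per-allele odds ratios, for illustration only.
# These are NOT the SNPs or effect sizes used in the study.
PER_ALLELE_OR = {
    "rs_example_a": 1.20,
    "rs_example_b": 0.90,
    "rs_example_c": 1.05,
}


def log_additive_score(genotype, per_allele_or=PER_ALLELE_OR):
    """Sum risk-allele counts (0, 1 or 2) weighted by log odds ratios."""
    total = 0.0
    for snp, odds_ratio in per_allele_or.items():
        count = genotype[snp]
        if count not in (0, 1, 2):
            raise ValueError(f"allele count for {snp} must be 0, 1 or 2")
        total += count * math.log(odds_ratio)
    return total


# One hypothetical woman: two risk alleles at the first SNP, one at the third.
example_genotype = {"rs_example_a": 2, "rs_example_b": 0, "rs_example_c": 1}
score = log_additive_score(example_genotype)
```

Under this assumed construction, a score of zero corresponds to carrying none of the counted alleles, and exponentiating the score gives a combined odds ratio relative to that baseline.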

Statistical models
The study population was stratified by BRCA type (1 or 2) and menopausal stage (ovulating vs. menopause). Incidence of breast cancer was calculated for the strata, as well as Kaplan-Meier [24] estimates of survival. Main-effect multivariable Cox proportional hazards (CPH) [25] models were fit in the separate BRCA data sets and then in the menopausal stages. The end-point was the time to cancer, censored at the current age (or at loss to follow-up, or at death from other causes). The proportional hazards assumption was tested via weighted residuals [26]. Variables included in the analyses were (see Table 1): year of birth; Manchester score [27] (transformed using the inverse hyperbolic sine); BMI; parity; age of menarche; age of menopause; age of first full-term pregnancy; oral contraception usage; time of diagnosis of an ovarian cancer followed up by oophorectomy (if any); time of mastectomy (if any); SNPs rs614367, rs704010, rs713588, rs889312, rs909116, rs1011970, rs1156287, rs1562430, rs2981579, rs3757318, rs3803662, rs4973768, rs8009944, rs9790879, rs10995190, rs11249433, rs13387042 and rs10931936; and the genetic predisposition score (GPS), calculated on these SNPs according to Ingham et al. [8]. Missing values were preliminarily analysed by means of univariable CPH, comparing the Akaike information criterion (AIC) [28] and coefficient p-values of models with median/mode imputation against those of models in which the variable was stratified into quartiles with an additional category for missing values.
The following CPH models were fit for each population stratum: (i) GPS; (ii) GPS + year of birth + Manchester score + BMI + parity + age of menarche + age of menopause + age of full-term pregnancy + oral contraception usage + oophorectomy + mastectomy; (iii) SNPs; (iv) SNPs + year of birth + Manchester score + BMI + parity + age of menarche + age of menopause + age of full-term pregnancy + oral contraception usage + oophorectomy + mastectomy; (v) year of birth + Manchester score + BMI + parity + age of menarche + age of menopause + age of full-term pregnancy + oral contraception usage + oophorectomy + mastectomy; (vi) all variables. CPH models (ii), (iii), (iv) and (vi) were feature-selected using a forward/backward stepwise heuristic driven by AIC [29]. Nonlinear random survival forests (RSF) [30] were also fit on all variables to identify putative variable interactions (333 trees, choosing the log-rank splitting rule). Table 1 summarises which variables were used for each model. CPH and RSF were compared using the complementary value of Harrell's concordance index (1 - c-index) [31] and the area under the receiver operating characteristic (AUROC) [32], under a bootstrap-based method of extra-sample error estimation (100 resampled sets, using the out-of-bag predictions) [33].
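The two preprocessing steps mentioned above, the inverse hyperbolic sine transform and the quartile stratification with an extra category for missing values, can be sketched in Python as follows. This is a minimal illustration rather than the study's code (the analyses were run in R); the function names and the BMI values are hypothetical.

```python
import math
import statistics


def asinh_transform(value):
    """Inverse hyperbolic sine, log(x + sqrt(x^2 + 1)); defined at zero and below."""
    return math.asinh(value)


def quartile_levels(values):
    """Map each value to 'Q1'..'Q4' by observed quartiles; None becomes 'missing'."""
    observed = [v for v in values if v is not None]
    q1, q2, q3 = statistics.quantiles(observed, n=4)

    def level(v):
        if v is None:
            return "missing"
        if v <= q1:
            return "Q1"
        if v <= q2:
            return "Q2"
        if v <= q3:
            return "Q3"
        return "Q4"

    return [level(v) for v in values]


# Hypothetical BMI values with two missing entries.
bmi = [21.5, None, 30.2, 24.8, 27.1, None, 19.9, 33.4, 25.0]
bmi_levels = quartile_levels(bmi)
```

Coding missingness as its own level keeps every woman in the model, at the cost of extra coefficients; comparing AIC against simple imputation, as described above, is one way to judge whether that cost is worthwhile.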
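For readers unfamiliar with the performance measure, the Python sketch below computes Harrell's concordance index for right-censored data by direct pair counting. It is a simplified illustration, not the implementation used in the study: pairs with tied event times are skipped, and the example data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had an event and a strictly
    shorter time than subject j. The pair is concordant when subject i also
    has the higher risk score; tied scores count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical ages at event or censoring, event indicators and model risk scores.
ages = [40, 45, 50, 55]
had_cancer = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.4, 0.2]

error = 1.0 - harrell_c_index(ages, had_cancer, risk)
```

A value of 1 - c-index near 0 indicates that the model ranks women almost perfectly by time to event, while 0.5 corresponds to random ranking.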
All analyses were carried out using the R software [34].

Results
The BRCA1 population included 548 subjects, whilst the BRCA2 population 523. Table 2 shows population characteristics stratified by BRCA type and menopausal stage.
For models (i) through (vi) and RSF applied to the whole BRCA1 population, average (st. dev.) 1 - c-index values under the out-of-bag estimator are given in Table 3. The hypothesis of a lower difference in mean with respect to model (ii) for all other models could be rejected, except for models (i) and (iii), which included only genetic variables (all p < 0.0001 for both BRCA1 and BRCA2, Student's t-test corrected for sample overlap from multiple validation). Notably, a re-calibrated SNP score, i.e. models (iii) and (iv), did not perform as well as the GPS. Consistent results were obtained by looking at the AUROC in the 1st, 2nd and 3rd quartiles of observation times. The AUROC estimation was performed on a smaller out-of-bag sample (333 out-of-bag instances) for computational reasons. Figures 2 and 3 show c-index/AUROC graphs for the BRCA1/2 sets based on the out-of-bag estimator. Similar figures were obtained when stratifying by menopausal stage (data not shown).
Tables 4 and 5 report relative hazards (RH) obtained by fitting Cox model (ii) on the BRCA1 and BRCA2 populations, overall and stratified by menopausal stage. There was a calendar year of birth effect, increasing the risk of cancer for both BRCA1/2 carrier cohorts (RH ranging from 1.06 to 1.08, p < 0.0005 across all strata). The Manchester score had a protective effect in the BRCA1 menopause stratum (RH = 0.35, p = 0.0006) and showed the same trend in the whole BRCA1 population (RH = 0.8, p = 0.1), but neither the RH directions nor the significance levels were consistent across all strata. The GPS had a protective effect in the whole BRCA1 population and in the ovulating stratum (RH 0.76/0.58, p < 0.015), and was associated with a higher hazard of breast cancer in the whole BRCA2 population (RH = 1.33, p = 0.035).
The ovulating stratum (i.e. "not yet" in the menopausal stage, as in Tables 4 and 5) had a higher hazard of breast cancer than the first age quartile of the menopausal stage stratum (i.e. women entering the menopausal stage at ~40 years old). An early age of menopause (first age quartile, ~40 years old) was associated with a higher hazard of breast cancer than an older age of menopause (yet a lower hazard than the ovulating stratum), consistently across BRCA1/2 carrier types, in the whole population and in the menopausal stage stratum. Note that menopause may occur within the same year in which chemotherapy is initiated following a breast cancer diagnosis, resulting de facto in competing events (as the age of menopause was recorded to the nearest year). Women who had an oophorectomy had a lower hazard than those who had not (mastectomy could not be properly assessed due to the low number of events).
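The text does not specify which overlap correction was applied to the t-test. One widely used option for comparing models over repeated resampling is the corrected resampled t-test attributed to Nadeau and Bengio, which inflates the variance to account for overlapping training sets. The Python sketch below shows that statistic under the assumption that it is the kind of correction intended; the per-resample differences and the sample sizes are hypothetical.

```python
import math
import statistics


def corrected_resampled_t(differences, n_train, n_test):
    """Corrected resampled t statistic for paired per-resample differences.

    The usual 1/k variance factor is replaced by (1/k + n_test/n_train),
    where k is the number of resamples. Returns (t, degrees_of_freedom).
    """
    k = len(differences)
    mean_diff = statistics.fmean(differences)
    var_diff = statistics.variance(differences)  # sample variance, k - 1 d.o.f.
    if var_diff == 0:
        raise ValueError("differences have zero variance")
    scale = (1.0 / k) + (n_test / n_train)
    return mean_diff / math.sqrt(scale * var_diff), k - 1


# Hypothetical differences in 1 - c-index between two models over 5 resamples,
# with hypothetical in-bag (training) and out-of-bag (test) sample sizes.
diffs = [0.02, 0.03, 0.01, 0.04, 0.02]
t_stat, dof = corrected_resampled_t(diffs, n_train=300, n_test=110)
```

Because the correction factor is always larger than 1/k, the corrected statistic is smaller in magnitude than the naive paired t statistic, making the test more conservative.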
Finally, when fitting model (vi), i.e. feature-selected Cox regression using a forward/backward stepwise heuristic driven by AIC, for both BRCA1/2 sets only the year of birth, all the menopausal age stages (along with ovulating stratum), and the oophorectomy variables were selected in the final model (RH were in line with those obtained from other models).

Discussion
In this study we applied a robust model selection framework composed of linear and non-linear statistical techniques for survival analysis. The objective was to test the predictive ability of existing risk scores for breast cancer in a population of BRCA1/2 carriers and to improve on the current state of the art, from models based on early genotyping and familial assessment to the most recent SNP scoring, by combining clinical/demographic information with high-resolution genetics. We also assessed the incidence and the determinants of breast cancer in the study population, and stratified the analyses by menopausal status. RSF did not yield higher performance than CPH, even though for some of the data sets the proportional hazards assumption was not met. Interestingly, the recalibration of GPS via the inclusion of SNPs in a CPH did not produce a better model fit (in terms of c-index or AUROC) than using the original GPS in a CPH. In our case, the c-index estimation through out-of-bag distributions may be a conservative choice, but it is robust to over-training.
This study further highlights the predictive ability of GPS for BRCA2, showing an increased RH of 1.33 (1.1-1.61) in the whole population, although it was not significant at the 0.05 level in the menopausal/ovulating stage strata. For BRCA1, instead, the effect of GPS was protective (RH = 0.76, p = 0.01) in the whole BRCA1 population and in the ovulating stage stratum (also protective but not significant at the 0.05 level in the menopausal stratum). Previous findings of Ingham et al. [8] already pointed out the predictive ability of the 18-SNP GPS in BRCA2 but not BRCA1 carriers. This significant association of GPS, however, was not supported when fitting the stepwise models, which retained only the year of birth, the menopausal stage and the oophorectomy variables (across all carrier types and strata). The age cohort and oophorectomy had been previously associated with increased and decreased risk of breast cancer, respectively [35,36]. We found that later ages of menopause were associated with a lower hazard of breast cancer than the first age quartile (~40 years old), which seems to contradict previous results by Tyrer et al. [18], and that being in the ovulating stratum carried a higher hazard than experiencing early menopause. This is likely a model artefact, because menopause may occur (being induced) right after the initiation of chemotherapy (i.e. competing events), and the menopause age is given to the nearest year. In any case, as women entering the menopausal stage early may be subject to treatment for preserving fertility, this warrants further investigation including a number of potential confounders. One limitation of this study is the use of the c-index as a measure of model performance, which presents a series of flaws [37][38][39], although our results were confirmed using the AUROC estimator. Alternative measures have been presented, such as prediction error curves [40], that may be employed as additional indicators.
Another limitation is that we did not fit the Cox models using time-updated covariates (as for menopausal stage or age of menarche, for instance) and this may dilute their effect across all time, instead of calculating the hazard on specific time intervals.
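To make the idea of time-updated covariates concrete, the Python sketch below splits one woman's follow-up into counting-process style (start, stop] intervals at her age of menopause, so that menopausal status can change over time instead of being fixed. This is an illustration of the data layout only, not an analysis that was performed in this study; the function name and the example ages are hypothetical.

```python
def split_at_menopause(entry_age, exit_age, event, menopause_age=None):
    """Split follow-up into (start, stop, postmenopausal, event) rows.

    The event indicator is attached only to the last interval, because the
    event (or censoring) occurs at exit_age.
    """
    if menopause_age is None or menopause_age >= exit_age:
        # Never postmenopausal during the observed follow-up.
        return [(entry_age, exit_age, 0, event)]
    if menopause_age <= entry_age:
        # Already postmenopausal when follow-up starts.
        return [(entry_age, exit_age, 1, event)]
    return [
        (entry_age, menopause_age, 0, 0),
        (menopause_age, exit_age, 1, event),
    ]


# Hypothetical woman followed from birth, menopause at 47, diagnosis at 52.
rows = split_at_menopause(entry_age=0, exit_age=52, event=1, menopause_age=47)
```

Rows in this format can be passed to Cox implementations that accept start and stop times, so that the hazard associated with menopause is estimated only over the intervals in which a woman is actually postmenopausal.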

Conclusions
We exploited model selection in machine learning towards the personalised diagnosis of breast cancer, incorporating different domains of information including genetic, clinical and demographic data. Given the improvement in prediction performance obtained by coupling a genetic predisposition score with clinical and demographic markers, further investigation to identify both genetic and non-genetic factors (along with their interactions in terms of epigenetics) is warranted.