- Open Access
Prognostic factor analysis for breast cancer using gene expression profiles
© Joe and Nam. 2016
- Published: 18 July 2016
The survival of patients with breast cancer varies widely, from a few months to more than 15 years. In recent studies, gene expression profiling of tumors has emerged as a promising means of identifying prognostic factors.
In this study, we used gene expression datasets of tumors to identify prognostic factors in breast cancer. We conducted log-rank tests and used unsupervised clustering methods to find reciprocally expressed gene sets associated with worse survival rates. Prognosis prediction scores were determined as the ratio of gene expressions.
As a result, four prognosis prediction gene set modules were constructed. The four prognostic gene sets predicted worse survival rates in three independent gene expression datasets. In addition, we found that patients with poor-prognosis disease, i.e., triple-negative, HER2-enriched, TP53-mutated, and high-grade tumors, had higher prognosis prediction scores than those with other types of breast cancer.
In conclusion, based on a gene expression analysis, we suggest that our well-defined scoring method for predicting survival outcome may be useful for developing prognostic factors in breast cancer.
- Breast Cancer
- Triple Negative Breast Cancer
- Breast Cancer Type
- Maximal Clique Algorithm
Breast cancer is one of the most common cancer types in women. In 2015, an estimated 234,190 new cases were expected to be diagnosed, and 40,730 deaths from breast cancer were expected to occur. Prognosis and therapy selection for those with breast cancer are usually guided by clinical and pathology features based on conventional histology and immunohistochemistry findings. In general, the menopausal status of the patient, the stage of the disease, the grade of the primary tumor, the estrogen (ER) and progesterone receptor (PR) status, and the level of human epidermal growth factor type 2 receptor (HER2) expression have been used for prognosis predictions. More recently, molecular profiling in breast cancer has also come to include ER and PR status testing, HER2/neu receptor status testing, and gene profile testing with, for example, MammaPrint or Oncotype DX [4, 5].
With regard to clinical intervention, it is critical to identify which patients are at risk of developing a more fatal type of breast cancer. Well-known prognostic factors such as ER and HER2 can be used to predict which patients face higher levels of risk. However, in addition to these traditional markers, novel prognostic factors are still required to predict survival for patients with ill-defined breast cancer types. Triple-negative breast cancer is one of the subtypes that currently has no such prognostic factors and no targeted drug therapies. Recently, several gene signatures have been identified to predict prognostic outcomes. Tang et al. found that a decreased level of BECN1 gene expression in human breast cancer is associated with poor prognosis. The CENPA gene was a significant independent prognostic marker for patients with ER-positive breast cancer. More recently, Al-Ejeh et al. identified eight genes (MAPT, MYB, MELK, MCM10, CENPA, EXO1, TTK and KIF2C) associated with poor survival in breast cancer patients through biological evidence pertaining to TNBC, metastases, and patient survival. In the latest studies, Liu et al. identified and validated five genes (CDK1, DLGAP5, MELK, NUSAP1, and RRM2), the expression levels of which were strongly associated with shortened survival time. Although these significant genes were identified, there remains a need for a more comprehensive and exhaustive analysis to find novel prognostic factors.
Gene expression profiles
The dataset used in this study (surviving table entries): 40 ~ 60; Illumina HT 12v3; Affymetrix HG U133A (listed four times)
Prognostic factor gene set selection
A total of 24,924 genes in the METABRIC dataset were used in this research. To identify genes whose high or low expression is associated with poor patient survival, we applied a log-rank test and computed the expression fold-change between the patients in the first quartile and those in the fourth quartile of each gene's expression level. This procedure was carried out for each gene. The hazard ratio was calculated between the first- and fourth-quartile patient groups, and the adjusted p-value cutoff was set to 0.001. If the hazard ratio was greater than one at this cutoff and the expression fold-change (first/fourth) was greater than 2, we selected the gene as a high-expressed gene in poor survival. Similarly, if the hazard ratio was less than one at this cutoff and the expression fold-change (fourth/first) was less than 0.5, we selected the gene as a low-expressed gene in poor survival (Additional file 1: Figure S1). Across the log-rank tests of all 24,924 genes, we found 413 high-expressed genes associated with poor survival and 411 low-expressed genes associated with poor survival.
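The per-gene selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`logrank_test`, `quartile_groups`, `classify_gene`) are hypothetical, the direction of risk is read from observed versus expected events rather than from a fitted hazard ratio, the fold-change is taken as highest quartile over lowest quartile (which assumes positive expression values), and no multiple-testing adjustment is applied to the p-value.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p_value, observed_a, expected_a)."""
    # Pool subjects as (time, event, group); group 0 = A, group 1 = B.
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in data if e}):
        n_a = sum(1 for tt, _, g in data if tt >= t and g == 0)
        n_b = sum(1 for tt, _, g in data if tt >= t and g == 1)
        d_a = sum(1 for tt, e, g in data if tt == t and e and g == 0)
        d_b = sum(1 for tt, e, g in data if tt == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0, observed_a, expected_a
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, observed_a, expected_a


def quartile_groups(expression):
    """Return patient indices of the highest and lowest expression quartiles."""
    if len(expression) < 4:
        raise ValueError("need at least four patients")
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    k = len(expression) // 4
    return order[-k:], order[:k]


def classify_gene(expression, times, events, p_cutoff=0.001, min_fold=2.0):
    """Label one gene 'high', 'low', or None (assumed selection rule)."""
    top, bottom = quartile_groups(expression)
    mean_top = sum(expression[i] for i in top) / len(top)
    mean_bottom = sum(expression[i] for i in bottom) / len(bottom)
    if mean_bottom <= 0 or mean_top / mean_bottom <= min_fold:
        return None
    _, p, observed, expected = logrank_test(
        [times[i] for i in top], [events[i] for i in top],
        [times[i] for i in bottom], [events[i] for i in bottom],
    )
    if p >= p_cutoff:
        return None
    # More events than expected in the high-expression quartile means that
    # high expression goes with poor survival.
    return "high" if observed > expected else "low"
```

Calling `classify_gene` once per gene and collecting the "high" and "low" labels would give two candidate lists analogous to the 413 and 411 genes reported above. In practice the p-values would first be adjusted across all genes.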
Identification of four prognostic modules
To construct the list of candidate genes for predicting patient outcome, we started from more than 20,000 genes and selected prognostic candidate genes with a survival log-rank test. However, because too many genes were significant in the log-rank test, we devised an algorithm that reduces and clusters the genes according to their significance and co-expression pattern. To cluster the two previously defined gene sets, we used the maximal clique algorithm with Pearson correlation coefficient scores. Among the 413 high-expressed genes, we connected two genes if their Pearson correlation coefficient exceeded 0.4. We then determined the maximal clique among the 413 genes, removed its genes, and found the next maximal clique. Similarly, we clustered the 411 low-expressed genes associated with poor survival with a minimum Pearson correlation coefficient of 0.4. To avoid clusters with too few genes, we used only the two major clusters. Here, we used clusters of the high- and low-expressed gene sets that contained more than 15 independent genes. After clustering, we obtained two high-expressed gene groups associated with poor survival and two low-expressed gene groups. Connections between the high- and low-expressed genes were also identified, using a Pearson correlation coefficient of -0.4, through the maximal bi-clique generation algorithm. Finally, this yielded four matched gene sets whose members are strongly connected to each other, as shown by high correlation values in the gene expression data. Each gene set contains both high-expressed and low-expressed genes associated with poor survival. We therefore defined four prognosis prediction scores, each as the ratio of the median expression level of the high-expressed genes to that of the low-expressed genes in one of the four matched gene sets.
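The iterative clique extraction described above can be sketched as follows. This is an illustration under assumptions, not the published code: "maximal clique" is taken to mean the largest clique in the current graph, found here with a Bron-Kerbosch search with pivoting, and the names `correlation_graph`, `largest_clique`, and `peel_cliques` are hypothetical. The bi-clique step that links high- and low-expressed genes is not covered.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_graph(profiles, threshold=0.4):
    """Connect two genes when their Pearson r exceeds the threshold."""
    genes = list(profiles)
    adj = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) > threshold:
                adj[g].add(h)
                adj[h].add(g)
    return adj


def largest_clique(adj):
    """Return the largest clique (sorted list) via Bron-Kerbosch with pivoting."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = sorted(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


def peel_cliques(adj, min_size=2):
    """Repeatedly extract the largest clique and remove its genes."""
    adj = {g: set(nbrs) for g, nbrs in adj.items()}  # work on a copy
    clusters = []
    while adj:
        clique = largest_clique(adj)
        if len(clique) < min_size:
            break
        clusters.append(clique)
        for g in clique:
            adj.pop(g)
        for nbrs in adj.values():
            nbrs.difference_update(clique)
    return clusters
```

Exact clique search is exponential in the worst case, so for several hundred genes a heuristic or a dedicated graph library may be preferable. When two cliques tie in size, which one is returned first is not fixed.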
We defined the module 1 score as the ratio of the 26 high-expressed genes associated with poor survival to the 17 low-expressed genes associated with poor survival. Similarly, Modules 2, 3 and 4 scores were respectively defined as the ratios between the eight, nine, and four high-expressed genes associated with poor survival to the 10, nine, and eight low-expressed genes associated with poor survival. Because we used the maximal clique algorithm to cluster each gene set, there was a strong correlation between the expression levels of each high-expressed gene and low-expressed gene associated with poor survival (Pearson’s r > 0.4). Between high-expressed genes and low-expressed genes, the maximum Pearson correlation coefficient was found to be -0.4.
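As a hedged illustration of the score definition, the sketch below computes one patient's module score as the median expression of the high-expressed genes divided by the median expression of the low-expressed genes. The function name `module_score` and the gene labels in the usage note are placeholders, and the sketch assumes strictly positive expression values so that the ratio is well defined.

```python
from statistics import median


def module_score(sample, high_genes, low_genes):
    """Ratio of median high-gene expression to median low-gene expression.

    `sample` maps gene symbol -> expression value for a single patient.
    Assumes positive expression values (an assumption of this sketch).
    """
    high = median(sample[g] for g in high_genes)
    low = median(sample[g] for g in low_genes)
    if low <= 0:
        raise ValueError("median of low-expressed genes must be positive")
    return high / low
```

For example, a patient with high-gene values 8, 10, 12 and low-gene values 2, 4, 5 would receive a score of 10 / 4 = 2.5.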
We analyzed three sets of detailed clinical data, one from each of the studies used: GSE2034, GSE25066, and GSE3494. We used the Disease-Free Survival (DFS) clinical information in GSE2034 and GSE25066, and the Distant Recurrence Free Survival (DRFS) in GSE3494. In the Kaplan-Meier survival plots, the median of each module's score was used to dichotomize the data, stratifying patients into high and low groups within each of the three individual datasets.
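The median split and the survival curves it feeds can be sketched in a few lines. This is a minimal illustration, not the analysis code used in the study: `split_by_median` and `kaplan_meier` are hypothetical names, patients whose score equals the median are assigned to the low group by assumption, and the estimator returns only the step values of the curve, without confidence intervals.

```python
from statistics import median


def split_by_median(scores):
    """Split patients into (high, low) groups at the median module score.

    `scores` maps patient id -> score; ties with the median go to `low`.
    """
    cut = median(scores.values())
    high = [p for p, s in scores.items() if s > cut]
    low = [p for p, s in scores.items() if s <= cut]
    return high, low


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as [(time, survival probability)].

    `events[i]` is 1 if the event was observed at `times[i]`, 0 if censored.
    """
    curve, survival = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for tt in times if tt >= t)
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Running `kaplan_meier` separately on the high and low groups of each dataset gives the two curves that a Kaplan-Meier plot compares.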
Genes associated with triple negative breast cancer
To investigate genes related to triple negative breast cancer (TNBC), we compared three independent expression profiles (the METABRIC, GSE2109, and GSE25066 datasets) and selected 230 up-regulated genes and 237 down-regulated genes in TNBC (cutoffs: t-test p-value < 0.05, FDR < 0.05, log fold change < 0.5).
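The text does not state which FDR procedure was used. As an assumption, the sketch below shows the widely used Benjamini-Hochberg adjustment, which converts a list of per-gene t-test p-values into FDR-adjusted values that can then be compared against the 0.05 cutoff. The function name `benjamini_hochberg` is illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose raw p-value and adjusted value both fall below 0.05 would then pass the first two criteria. The fold-change criterion is applied separately.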
Worse survival with four modules
Prognostic factor gene set in module 1
The gene list used for module 1
CHEK1 a, b: checkpoint kinase 1
forkhead box M1
CCNA2 a, b
CDC20 a, b: cell division cycle 20
TTK a, b: TTK protein kinase
CENPA a, b: centromere protein A
KIF2C a, b: kinesin family member 2C
BUB1, mitotic checkpoint serine/threonine kinase
minichromosome maintenance complex component 6
cell division cycle 45
anillin actin binding protein
minichromosome maintenance 10 replication initiation factor
CDCA8 a, b: cell division cycle associated 8
maternal embryonic leucine zipper kinase
CEP55 a, b: centrosomal protein 55 kDa
DLGAP5 a, b: discs, large (Drosophila) homolog-associated protein 5
Holliday junction recognition protein
cell division cycle associated 5
TRIP13 a, b: thyroid hormone receptor interactor 13
GTSE1 a, b: G2 and S-phase expressed 1
CDCA3 a, b: cell division cycle associated 3
proline rich 11
family with sequence similarity 83 member D
GTP binding protein 4
estrogen receptor 1
GATA binding protein 3
leucine-rich repeats and immunoglobulin-like domains 1
rabaptin, RAB GTPase binding effector protein 1
cold inducible RNA binding protein
WD repeat domain 19
signal peptide, CUB domain, EGF-like 2
kinesin family member 13B
TBC1 domain family member 9
ankyrin repeat family A member 2
dynein, light chain, roadblock-type 2
NME/NM23 family member 5
cancer susceptibility candidate 1
basal body orientation factor 1
RUN domain containing 1
The discovery of prognostic factors is crucial work in breast cancer biomarker research. In this study, using a large-scale transcriptomic dataset, we found that four types of prognostic gene sets are strongly related to poor patient outcomes. We used the expression of each of the four gene sets to evaluate three independent breast tumor datasets and found that scores based on gene expression gave generally consistent predictions of outcomes. When comparing tumor characteristics and scores, tumors with high scores were more likely to have TP53 mutations, to be HER2-enriched or to have basal-like intrinsic subtypes, triple-negative status, and worse survival rates.
The 26 genes and 17 genes used in module 1 were strongly co-expressed in the METABRIC dataset, and the ratio of the expression levels of the two DEG groups was used as a prognostic marker in this research. Among the high-expressed genes associated with poor patient survival, many are involved in the cell cycle process, including several genes that are well established as prognostic factors. Recently, Abdel-Fatah et al. showed that a high CHEK1 expression level is linked to poor prognosis and aggressive disease in breast cancer. HJURP was also recently identified as an independent biomarker of cancer outcome in luminal A patients. Breast cancer progression can involve the FOXM1-CDCA8 signature, which serves as a promising therapeutic target and potential prognostic factor. In addition, Kwok et al. showed that the knockdown of FOXM1 with thiostrepton in micelle nanoparticles reduced tumor growth rates and increased apoptosis, indicating that FOXM1 is one of the primary cellular targets of thiostrepton in breast cancer cells. Karra et al. found that high CDC20 and securin immunoexpression are correlated with unusually poor outcomes in breast cancer patients. BUB1 has important roles in the proliferation or progression of breast cancer, and nuclear BUB1 immunohistochemical status is considered an influential prognostic factor in human breast cancer patients. Liu et al. identified and validated five hub genes (CDK1, DLGAP5, MELK, NUSAP1, and RRM2), the expression levels of which were strongly associated with poor survival. High MELK expression was associated with poor survival in the luminal A/B molecular subtypes of breast cancer. Furthermore, among the low-expressed genes associated with poor survival, several well-defined genes are known prognostic factors. The role of GATA3 as a tumor suppressor in breast cancer has been established.
Interestingly, GATA3 down-regulation is required for the progestin-induced upregulation of cyclin A2 (CCNA2) and for progestin-induced in vivo and in vitro breast cancer cell growth. Thompson et al. reported that low expression of LRIG1 is a prognostic factor for breast cancer patients. Cheng et al. showed that breast cancer patients with tumors negative for SCUBE2 protein expression had a worse prognosis than those with tumors positive for SCUBE2 protein expression. Recent studies suggested the deregulation of NME5 and DNALI1 in malignant breast cancer. In addition, 30 of the 44 module 1 genes were DEGs between TNBC and non-TNBC (19 up-regulated and 11 down-regulated in TNBC). Thus, we confirmed that the DEGs of a classical poor-prognosis breast cancer type were also related to our results.
In conclusion, our findings present prognosis prediction module scores that are strongly associated with shortened survival times in breast cancer; the module score is consistently high in aggressive breast cancer types such as TNBC, ER-negative, and HER2-enriched types. In addition, we found that this score is associated with tumor grade in breast cancer. Thus, we suggest the inclusion of these enriched genes as aggressive cancer markers: the 26 co-expressed high-expressed genes and the 17 low-expressed genes can be used as new prognostic markers, and we expect that these investigations can be adapted to research on targeted therapies for ill-defined breast cancer types.
Research ethics approval was obtained from the Gwangju Institute of Science & Technology. The METABRIC study protocol was also approved by the ethics committees in a previous study. All datasets used in this study were composed of anonymized patient information.
The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets supporting the conclusions of this article are available in the Synapse repository (https://www.synapse.org/#, Synapse ID: syn1688369). The other four datasets are also available in the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/, GSE25066, GSE2034, GSE3494, GSE2109).
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR14C0005), a grant from the “Systems Biology Infrastructure Establishment Grant” provided by the Gwangju Institute of Science & Technology in 2015, a grant from the Bio-Synergy Research Project (NRF-2014M3A9C4066449) of the Ministry of Science, ICT and Future Planning through the National Research Foundation, and by a grant from the GIST Research Institute (GRI) in 2016.
Publication costs for this article were sourced from a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR14C0005), a grant from the “Systems Biology Infrastructure Establishment Grant” provided by the Gwangju Institute of Science & Technology in 2015, a grant from the Bio-Synergy Research Project (NRF-2014M3A9C4066449) of the Ministry of Science, ICT and Future Planning through the National Research Foundation, and by a grant from the GIST Research Institute (GRI) in 2016.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 16 Supplement 1, 2016: Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Howlader N, Noone AM, Krapcho M, Garshell J, Miller D, Altekruse SF, Kosary CL, Yu M, Ruhl J, Tatalovich Z, Mariotto A, Lewis DR, Chen HS, Feuer EJ, Cronin KA: SEER Cancer Statistics Review, http://seer.cancer.gov/csr/1975_2012/, based on November 2014 SEER data submission, posted to the SEER web site. 2015.
- Simpson JF, Gray R, Dressler LG, Cobau CD, Falkson CI, Gilchrist KW, Pandya KJ, Page DL, Robert NJ. Prognostic value of histologic grade and proliferative activity in axillary node-positive breast cancer: results from the Eastern Cooperative Oncology Group Companion Study, EST 4189. J Clin Oncol Off J Am Soc Clin Oncol. 2000;18(10):2059–69.
- van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415(6871):530–6.
- Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, d'Assignies MS, Bergh J, Lidereau R, Ellis P, et al. Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Natl Cancer Inst. 2006;98(17):1183–92.
- Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med. 2004;351(27):2817–26.
- Tang H, Sebti S, Titone R, Zhou Y, Isidoro C, Ross TS, Hibshoosh H, Xiao G, Packer M, Xie Y. Decreased BECN1 mRNA expression in human breast cancer is associated with estrogen receptor-negative subtypes and poor prognosis. EBioMedicine. 2015;2(3):255–63.
- McGovern SL, Qi Y, Pusztai L, Symmans WF, Buchholz TA. Centromere protein-A, an essential centromere protein, is a prognostic marker for relapse in estrogen receptor-positive breast cancer. Breast Cancer Res. 2012;14(3):R72.
- Al-Ejeh F, Simpson P, Sanus J, Klein K, Kalimutho M, Shi W, Miranda M, Kutasovic J, Raghavendra A, Madore J. Meta-analysis of the global gene expression profile of triple-negative breast cancer identifies genes for the prognostication and treatment of aggressive breast cancer. Oncogenesis. 2014;3(4):e100.
- Liu R, Guo CX, Zhou HH. Network-based approach to identify prognostic biomarkers for estrogen receptor-positive breast cancer treatment with tamoxifen. Cancer Biol Ther. 2015;16(2):317–24.
- Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012;486(7403):346–52.
- Gluck S, Ross JS, Royce M, McKenna Jr EF, Perou CM, Avisar E, Wu L. TP53 genomics predict higher clinical and pathologic tumor response in operable early-stage breast cancer treated with docetaxel-capecitabine +/- trastuzumab. Breast Cancer Res Treat. 2012;132(3):781–91.
- Al-Ejeh F, Shi W, Miranda M, Simpson PT, Vargas AC, Song S, Wiegmans AP, Swarbrick A, Welm AL, Brown MP. Treatment of triple-negative breast cancer using anti-EGFR–directed radioimmunotherapy combined with radiosensitizing chemotherapy and PARP inhibitor. J Nucl Med. 2013;54(6):913–21.
- Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A. 2005;102(38):13550–5.
- Edgar R, Barrett T. NCBI GEO standards and services for microarray data. Nat Biotechnol. 2006;24(12):1471–2.
- Palla G, Derenyi I, Farkas I, Vicsek T. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435(7043):814–8.
- Alexe G, Alexe S, Crama Y, Foldes S, Hammer PL, Simeone B. Consensus algorithms for the generation of all maximal bicliques. Discret Appl Math. 2004;145(1):11–21.
- Madden SF, Clarke C, Gaule P, Aherne ST, O’Donovan N, Clynes M, Crown J, Gallagher WM. BreastMark: an integrated approach to mining publicly available transcriptomic datasets relating to breast cancer outcome. Breast Cancer Res. 2013;15(4):R52.
- da Huang W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1–13.
- Abdel-Fatah TM, Middleton FK, Arora A, Agarwal D, Chen T, Moseley PM, Perry C, Doherty R, Chan S, Green AR, et al. Untangling the ATR-CHEK1 network for prognostication, prediction and therapeutic target validation in breast cancer. Mol Oncol. 2015;9(3):569–85.
- de Oca RM, Gurard-Levin ZA, Berger F, Rehman H, Martel E, Corpet A, de Koning L, Vassias I, Wilson LO, Meseure D. The histone chaperone HJURP is a new independent prognostic marker for luminal A breast carcinoma. Mol Oncol. 2015;9(3):657–74.
- Jiao D, Lu Z, Qiao J, Yan M, Cui S, Liu Z. Expression of CDCA8 correlates closely with FOXM1 in breast cancer: public microarray data analysis and immunohistochemical study. Neoplasma. 2014;62(3):464–9.
- Kwok JM, Myatt SS, Marson CM, Coombes RC, Constantinidou D, Lam EW. Thiostrepton selectively targets breast cancer cells through inhibition of forkhead box M1 expression. Mol Cancer Ther. 2008;7(7):2022–32.
- Karra H, Repo H, Ahonen I, Löyttyniemi E, Pitkänen R, Lintunen M, Kuopio T, Söderström M, Kronqvist P. Cdc20 and securin overexpression predict short-term breast cancer survival. Br J Cancer. 2014;110(12):2905–13.
- Takagi K, Miki Y, Shibahara Y, Nakamura Y, Ebata A, Watanabe M, Ishida T, Sasano H, Suzuki T. BUB1 immunolocalization in breast carcinoma: its nuclear localization as a potent prognostic factor of the patients. Horm Cancer. 2013;4(2):92–102.
- Izzo F, Mercogliano F, Venturutti L, Tkach M, Inurrigarro G, Schillaci R, Cerchietti L, Elizalde PV, Proietti CJ. Progesterone receptor activation downregulates GATA3 by transcriptional repression and increased protein turnover promoting breast tumor growth. Breast Cancer Res. 2014;16(6):491.
- Thompson PA, Ljuslinder I, Tsavachidis S, Brewster A, Sahin A, Hedman H, Henriksson R, Bondy ML, Melin BS. Loss of LRIG1 locus increases risk of early and late relapse of stage I/II breast cancer. Cancer Res. 2014;74(11):2928–35.
- Cheng C-J, Lin Y-C, Tsai M-T, Chen C-S, Hsieh M-C, Chen C-L, Yang R-B. SCUBE2 suppresses breast tumor cell proliferation and confers a favorable prognosis in invasive breast cancer. Cancer Res. 2009;69(8):3634–41.
- Parris TZ, Danielsson A, Nemes S, Kovacs A, Delle U, Fallenius G, Mollerstrom E, Karlsson P, Helou K. Clinical implications of gene dosage and gene expression patterns in diploid breast carcinoma. Clin Cancer Res. 2010;16(15):3860–74.