Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis
BMC Medical Informatics and Decision Making volume 20, Article number: 225 (2020)
Abstract
Background
Breast cancer is the most prevalent and among the most deadly cancers in females. Patients with breast cancer have highly variable survival lengths, indicating a need to identify prognostic biomarkers for personalized diagnosis and treatment. With the development of new technologies such as next-generation sequencing, multi-omics information is becoming available for a more thorough evaluation of a patient’s condition. In this study, we aim to improve breast cancer overall survival prediction by integrating multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)).
Methods
Motivated by multi-view learning, we propose a novel strategy to integrate multi-omics data for breast cancer survival prediction by applying complementary and consensus principles. The complementary principle assumes that each omics data type contains modality-unique information. To preserve such information, we develop a concatenation autoencoder (ConcatAE) that concatenates the hidden features learned from each modality for integration. The consensus principle assumes that the disagreements among modalities upper-bound the model errors. To remove the noise and discrepancies among modalities, we develop a cross-modality autoencoder (CrossAE) that maximizes the agreement among modalities to achieve a modality-invariant representation. We first validate the effectiveness of our proposed models on simulated MNIST data. We then apply these models to the TCGA breast cancer multi-omics data for overall survival prediction.
Results
For breast cancer overall survival prediction, the integration of DNA methylation and miRNA expression achieves the best overall performance, with a concordance index (C-index) of 0.641 ± 0.031 with ConcatAE and 0.63 ± 0.081 with CrossAE. Both strategies outperform baseline single-modality models using only DNA methylation (0.583 ± 0.058) or miRNA expression (0.616 ± 0.057).
Conclusions
In conclusion, we achieve improved overall survival prediction performance by utilizing either the complementary or consensus information among multi-omics data. The proposed ConcatAE and CrossAE models can inspire future deep representation-based multi-omics integration techniques. We believe these novel multi-omics integration models can benefit the personalized diagnosis and treatment of breast cancer patients.
Background
Breast cancer is the most common type of cancer in females worldwide. In 2018, breast cancer constituted over 25% of about 8.5 million new cancer diagnoses in female patients [1]. This prevalence pattern is found in the US as well, where women have a risk of over 12% of being diagnosed with breast cancer in their lifetimes, and breast cancer cases are expected to encompass about 30% of new cancer cases [2]. While the principal risk factor for breast cancer is age, it is known that selected gene mutations account for about 10% of all breast cancer cases [3]. Research into prognostic genomic biomarkers beyond mutational status is ongoing and may offer insights into disease mechanisms and new therapies. Breast cancer maintains the second-highest mortality rate for cancers in females at about 13% [2]. Survival rates for breast cancer are typically measured by 5-year post-diagnosis survival. The 5-year survival rate is 90% when all stages are combined [4]. If each cancer stage is considered separately, the 5-year survival rate is 99% for localized breast cancer and drops to 85% and 27% for regionally and distantly spread cancer, respectively.
Public multi-omics datasets such as The Cancer Genome Atlas (TCGA) [5] have greatly accelerated cancer research [6], including accurate cancer grading, staging, and survival prediction [7,8,9]. Cancer survival analysis can be categorized into binary classification or risk regression. In a binary classification task, the patients are typically split into a short-survival group and a long-survival group based on a predefined threshold (e.g., 5 years). In risk regression studies, by contrast, a risk score is calculated for each patient, typically with the Cox proportional hazards model [10] and its extensions.
Various models have been developed for survival prediction in large and heterogeneous cancer datasets. For example, Zhao et al. have tested various classification algorithms to predict 5-year breast cancer survival by integrating gene expression data with clinical and pathological factors [11]. The authors find that the classification methods tested (e.g., gradient boosting, random forest, artificial neural networks, and support vector machine) have similar accuracy and area under the curve (AUC), of 0.72 and 0.67, respectively. This study suggests that the choice of classification method may matter less than the quality of the data itself [11]. Goli et al. have developed a breast cancer survival prediction model with clinical and pathological data using support vector regression and find similar positive results [12]. This study has established support vector methods as a promising route for survival prediction with an imbalanced dataset. Similarly, Gevaert et al. have integrated microarray gene expression data with clinical data using Bayesian networks and achieved a maximum AUC of 0.845 [13]. This study shows that incorporating both data modalities improves predictions beyond either clinical or gene expression data alone. Sun et al. have created 5-year breast cancer survival prediction models using genomic data (e.g., gene expression, copy number alteration, methylation, and protein expression) coupled with pathological imaging data, also from TCGA. The authors utilize multiple kernel learning for feature-level integration of all data. Their multi-omics model, excluding imaging data, has an AUC of 0.802 ± 0.032. When incorporating the imaging data, the AUC goes up slightly to 0.828 ± 0.034 [14]. Ma et al. have applied a factorization autoencoder to integrate gene expression, miRNA expression, DNA methylation, and protein expression for progression-free interval event prediction and achieve an AUC of 0.74 on bladder cancer and an AUC of 0.825 on brain glioma [15].
Instead of binary classification, survival risk regression aims to predict the expected duration of time until one or more events happen by modeling time-to-event data. The proportional hazards model assumes the covariates are multiplicatively related to the hazard [16]. Provided the proportional hazards assumption holds, the Cox proportional hazards model can estimate the effect parameters without specifying the baseline hazard function [10]. Recently, the Cox proportional hazards model has been extended by deep neural networks. For example, DeepSurv [17] and Cox-Time [18] replace the linear relationship in the Cox proportional hazards model with nonlinear neural networks. In addition, L_{1} and L_{2} regularization terms have been applied to the network parameters to reduce overfitting. The survival risk regression model has also been applied to multi-omics data. For example, Huang et al. have developed a Cox proportional hazards model-based multi-omics neural network for breast cancer survival regression [19].
In our previous study [20], we built a translational pipeline for overall survival prediction of breast cancer patients by decision-level integration of multi-omics data (e.g., gene expression, DNA methylation, miRNA expression, and copy number variations (CNVs)). However, many right-censored samples were discarded to enable binary classification. In this study, we extend that work by replacing the binary survival classification with survival risk regression to make the most of the TCGA dataset. We hypothesize that there is both complementary and consensus information in the multi-omics data. To utilize the complementary and consensus information among multi-omics data, we replace the decision-level integration with deep learning-based feature-level integration. The remainder of the paper is structured as follows: in section 2, we first describe the simulated two-view data from the Modified National Institute of Standards and Technology (MNIST) database and the multi-omics breast cancer (BRCA) data from the TCGA database (referred to as TCGA-BRCA hereafter). We then present the proposed methods for multi-omics data integration by utilizing the complementary information and consensus information among modalities. In section 3, we present the results of the baseline models and proposed models on both MNIST simulated data and TCGA-BRCA multi-omics data. We discuss the results and conclude the current work in section 4 and section 5, respectively.
Methods
Simulated multi-view MNIST dataset
To validate the proposed feature-level integration network, we simulate multi-modality data from the Modified National Institute of Standards and Technology (MNIST) database. The MNIST database consists of 60,000 training samples and 10,000 testing samples. Each sample in the MNIST database is a 28 × 28 image of a single handwritten digit from 0 to 9. The goal is to train a multi-class classifier to predict the digit from the input image.
We simulate two views of each handwritten digit image from the MNIST database (Fig. 1a). The first view (X_{1}) is the original image from the MNIST database, while the second view (X_{2}) is the corresponding rotated image (90-degree counterclockwise rotation). Because the task is easy even for single-view data, we further add noise to the data. We simulate two kinds of noise and apply them to both views of the handwritten digit images: random erasing (Fig. 1b) and pixel-wise Gaussian noise (Fig. 1c). We flatten each image to a vector of length 784 as the final input to the deep neural networks.
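The two-view simulation can be illustrated with a short NumPy sketch. It is not the authors' code: the function name `make_two_views`, the erased patch size (`patch`), and the noise standard deviation (`sigma`) are assumptions chosen for illustration, and a random array stands in for an MNIST image.

```python
import numpy as np

def make_two_views(image, rng, noise="erase", patch=10, sigma=0.3):
    """Return flattened (x1, x2) views of one 28 x 28 image with values in [0, 1]."""
    x1 = image.astype(np.float64)
    x2 = np.rot90(x1, k=1)  # second view: 90-degree counterclockwise rotation
    views = []
    for view in (x1, x2):
        view = view.copy()
        if noise == "erase":
            # Random erasing: zero out one square patch at a random location.
            r = rng.integers(0, 28 - patch + 1)
            c = rng.integers(0, 28 - patch + 1)
            view[r:r + patch, c:c + patch] = 0.0
        elif noise == "gaussian":
            # Pixel-wise Gaussian noise, clipped back to the valid range.
            view = np.clip(view + rng.normal(0.0, sigma, view.shape), 0.0, 1.0)
        views.append(view.reshape(-1))  # flatten to a vector of length 784
    return views[0], views[1]

rng = np.random.default_rng(0)
digit = rng.random((28, 28))  # stand-in for a normalized MNIST image
x1, x2 = make_two_views(digit, rng, noise="erase")
```

Each view is corrupted independently here, so an erased region in one view may still be visible in the other; whether the original experiments erase the views independently is an assumption of this sketch.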
TCGA-BRCA breast cancer multi-omics dataset
The TCGA database [5] is a public database containing genomic data for over 20,000 paired cancer and normal samples from 33 cancer types. In this study, we use TCGA-BRCA, which has 1060 patients with all four types of omics data (gene expression, miRNA expression, DNA methylation, and CNVs) and survival information (see Supplementary Material Section S2). Table 1 contains information about the four omics data types. For gene expression, the number of features includes different isoforms for each gene and some non-coding RNA transcripts. The DNA methylation beta value ranges from 0 to 1, where a beta value of 0 means that no methylation is detected for that probe, while a value of 1 means that the CpG was always methylated. For CNV features, “Gain” means more copies of a gene than normal, while “Loss” means fewer copies of a gene than normal. More details of the TCGA multi-omics data can be found in Supplementary Material Section S1.
The overall pipeline for multi-omics survival analysis is presented in Fig. 2. Quality control and preprocessing are essential for making sense of multi-omics data. To remove low-quality features, we discard features with missing data. For the gene expression and miRNA expression data, we also apply a log transform log_{2}(X + 1) to the features, where X is the FPKM for gene expression and the RPM for miRNA expression. We then apply min-max normalization to scale all four data modalities to a range of 0 to 1. After the quality control and normalization, we apply a stratified four-fold split of the data into a training set (60%), a validation set (15%), and a testing set (25%) in each fold.
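As a minimal sketch of the quality control and normalization steps, the following NumPy function removes features with missing values, optionally applies log_{2}(X + 1), and min-max scales each feature. The function name `preprocess` is hypothetical, and fitting the scaling range on the training matrix only is an assumption; the text does not state which samples the range is computed from.

```python
import numpy as np

def preprocess(train, other, log_transform=False):
    """Drop features with missing data, optionally log2(X + 1), then min-max scale."""
    # Keep a feature only if it has no missing value in either matrix.
    keep = ~(np.isnan(train).any(axis=0) | np.isnan(other).any(axis=0))
    train, other = train[:, keep], other[:, keep]
    if log_transform:  # used for FPKM (gene expression) and RPM (miRNA) values
        train, other = np.log2(train + 1.0), np.log2(other + 1.0)
    # Assumption: the scaling range comes from the training samples only.
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features
    return (train - lo) / span, (other - lo) / span

train = np.array([[0.0, 3.0, np.nan], [3.0, 7.0, 2.0], [1.0, 15.0, 4.0]])
test = np.array([[1.0, 7.0, 1.0]])
train_scaled, test_scaled = preprocess(train, test, log_transform=True)
```

With training-only scaling, values in the validation or testing sets can fall slightly outside [0, 1]; clipping them is a possible additional choice.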
Multi-omics data usually suffer from the “curse of dimensionality,” where the number of features is significantly larger than the number of samples. To mitigate this challenge, we apply feature selection or dimension reduction techniques to remove unrelated or redundant features, a step that is essential for the success of downstream analysis such as classification or survival analysis. For classification, supervised univariate feature selection methods such as minimum Redundancy Maximum Relevance (mRMR) [21] and mutual information can be used. For survival analysis, various unsupervised or knowledge-guided feature selection methods can be applied. For example, Huang et al. have applied gene co-expression analysis as the dimension reduction approach [19]. In this study, with the focus on deep learning-based feature-level integration, we use both principal component analysis (PCA) and unsupervised variance-based feature selection. In PCA-based dimension reduction, we fit PCA on the training dataset and use the first 100 principal components (PCs) of the training, validation, and testing datasets for survival analysis. In unsupervised variance-based feature selection, we select the 1000 features with the highest variances in the training dataset, and then use them for survival analysis in the training, validation, and testing datasets.
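Both dimension reduction routes can be sketched in NumPy as follows. The helper names (`fit_pca`, `apply_pca`, `top_variance_idx`) and the small synthetic matrices are assumptions for illustration; the study itself uses 100 PCs and 1000 high-variance features. In both routes the transformation is estimated on the training data and then applied unchanged to the other splits.

```python
import numpy as np

def fit_pca(train, n_components):
    """Estimate the PCA mean and leading principal axes from training data."""
    mean = train.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(x, mean, components):
    """Project samples onto the principal axes learned from the training data."""
    return (x - mean) @ components.T

def top_variance_idx(train, k):
    """Indices of the k features with the highest training-set variance."""
    return np.argsort(train.var(axis=0))[::-1][:k]

rng = np.random.default_rng(0)
scale = np.linspace(0.1, 2.0, 500)        # gives features different variances
train = rng.random((60, 500)) * scale     # synthetic stand-in for omics data
test = rng.random((25, 500)) * scale

mean, comps = fit_pca(train, n_components=10)
train_pc, test_pc = apply_pca(train, mean, comps), apply_pca(test, mean, comps)

idx = top_variance_idx(train, k=50)
train_hv, test_hv = train[:, idx], test[:, idx]
```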
Single-modality network
For single-modality data, we use an autoencoder and a task-specific network for single-modality classification or survival analysis (Fig. 3). For the input data x after feature selection, we first apply an encoder q(x) to transform the input data to a hidden feature z, and then reconstruct the input data \( \hat{x} \) from the hidden feature with a decoder p(z). We then feed the hidden feature z into a task-specific network for classification or survival analysis.
Endpoint 1: multi-class classification
For the classification network c(z), we use a fully connected network whose output dimension equals the number of classes. The whole network is trained with the reconstruction loss L_{recon} and the classification loss L_{cls}. In this study, we use the mean-square error for the reconstruction loss:
\[ {L}_{recon}=\frac{1}{N}\sum_{i=1}^N{\left\Vert {x}_i-{\hat{x}}_i\right\Vert}^2 \]
where N is the batch size. We use the cross-entropy loss for the classification loss:
\[ {L}_{cls}=-\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^C{y}_{i,j}\log {\hat{y}}_{i,j} \]
where C is the number of classes, j ∈ {1, …, C}, y_{i,j} is 1 if sample i belongs to class j and 0 otherwise, and \( {\hat{y}}_{i,j} \) is the softmax probability of class j computed from c(z_{i}). For each epoch, we first train the encoder-decoder with the reconstruction loss L_{recon} and then train the encoder and classification network with the cross-entropy loss L_{cls}.
The multi-class classification performance is evaluated by accuracy, weighted precision, and weighted recall. These metrics are in the range of [0, 1], and the higher the better. We do not include AUC as a metric because we perform 10-class classification with the simulated MNIST dataset instead of binary classification.
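For reference, the three metrics can be computed as in the sketch below. The function name `weighted_metrics` is hypothetical, and "weighted" is assumed to mean that each class's precision and recall are weighted by the class's share of the true labels, which is the common convention.

```python
import numpy as np

def weighted_metrics(y_true, y_pred):
    """Accuracy plus class-frequency-weighted precision and recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    precision = recall = 0.0
    for c in np.unique(y_true):
        tp = np.sum((y_pred == c) & (y_true == c))   # true positives for class c
        n_pred, n_true = np.sum(y_pred == c), np.sum(y_true == c)
        weight = n_true / y_true.size                # share of class c in the labels
        precision += weight * (tp / n_pred if n_pred else 0.0)
        recall += weight * (tp / n_true)
    return accuracy, float(precision), float(recall)

acc, prec, rec = weighted_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Under this weighting, the weighted recall always equals the accuracy, so the two columns carry the same information.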
Endpoint 2: survival analysis
For the survival analysis, we use a fully connected neural network, s(z), to replace the Cox proportional hazards model. The output of the survival network s(z) is the hazard h of the patient. Based on the Cox proportional hazards model, the survival network is trained with the negative log partial likelihood loss L_{sur}:
\[ {L}_{sur}=-\frac{1}{N_{ob}}\sum_{i:{C}_i=1}\left({h}_i-\log \sum_{j:{T}_j\ge {T}_i}\exp \left({h}_j\right)\right) \]
where C_{i} = 1 indicates the occurrence of the event for patient i, N_{ob} is the total number of events in the batch, h_{i} is the predicted hazard for patient i, and T_{i} and T_{j} are the survival times for patient i and patient j, respectively.
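A minimal NumPy sketch of the negative log partial likelihood follows, assuming the standard form in which the risk set of patient i contains every patient j in the batch with T_{j} ≥ T_{i}. The function name is hypothetical, ties in survival time are not treated specially, and a production version would use a numerically stable log-sum-exp.

```python
import numpy as np

def neg_log_partial_likelihood(hazard, time, event):
    """Cox negative log partial likelihood, averaged over observed events."""
    hazard = np.asarray(hazard, dtype=float)
    time, event = np.asarray(time), np.asarray(event)
    # at_risk[i, j] is True when patient j is still at risk at time T_i.
    at_risk = time[None, :] >= time[:, None]
    log_risk = np.log(np.sum(np.exp(hazard)[None, :] * at_risk, axis=1))
    observed = event == 1  # censored patients only appear inside risk sets
    return float(-np.sum((hazard - log_risk)[observed]) / observed.sum())

times = [1.0, 2.0, 3.0]
loss_good = neg_log_partial_likelihood([2.0, 1.0, 0.0], times, [1, 1, 1])
loss_bad = neg_log_partial_likelihood([0.0, 1.0, 2.0], times, [1, 1, 1])
```

In this toy batch, assigning higher hazards to patients with shorter survival (`loss_good`) gives a lower loss than the reversed ordering (`loss_bad`), which is the behavior the training objective relies on.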
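To evaluate the risk scores predicted by survival models, various metrics have been developed to measure the concordance between the predicted risk scores and the actual survival time. Following previous studies in deep learning-based survival analysis [19], we evaluate the overall survival analysis performance with the concordance index (C-index) [22]. The C-index evaluates how well the predicted survival risk aligns with the actual survival time over all comparable pairs of patients:
\[ C\text{-}index=\frac{\sum_{i,j}{1}_{T_j<{T}_i}\cdot {1}_{h_j>{h}_i}\cdot {C}_j}{\sum_{i,j}{1}_{T_j<{T}_i}\cdot {C}_j} \]
where 1_{a<b} equals 1 if a < b and 0 otherwise. A C-index of 1 indicates perfect concordance, while 0.5 corresponds to random prediction.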
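The C-index computation can be made concrete with a small pure-Python sketch. It assumes Harrell's convention: a pair (i, j) is comparable when patient j has an observed event and a shorter survival time than patient i, and tied risk scores count as half concordant. The tie handling and the function name are assumptions, not details stated in the text.

```python
def concordance_index(hazard, time, event):
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # Comparable pair: patient j had an observed event before time T_i.
            if event[j] == 1 and time[j] < time[i]:
                comparable += 1
                if hazard[j] > hazard[i]:
                    concordant += 1.0   # shorter survival, higher predicted risk
                elif hazard[j] == hazard[i]:
                    concordant += 0.5   # tied prediction counts as half
    return concordant / comparable

time = [1.0, 2.0, 3.0, 4.0]
event = [1, 1, 0, 1]                    # third patient is right-censored
c_perfect = concordance_index([4.0, 3.0, 2.0, 1.0], time, event)
c_reversed = concordance_index([1.0, 2.0, 3.0, 4.0], time, event)
c_constant = concordance_index([0.0, 0.0, 0.0, 0.0], time, event)
```

The censored patient never starts a comparable pair but can still be the longer-surviving member of one, which is how right-censored samples contribute to the evaluation.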
Novel multi-modality integration network
We develop novel multi-omics integration networks based on two principles in multi-view machine learning: 1) the complementary principle assumes that each view contains information that other views do not have, so we should extract the differences from each view while preserving the common information; and 2) the consensus principle assumes that the disagreements between views upper-bound the classification errors; thus, we should aim to maximize the agreement between views. Based on these principles, we learn meaningful representations by integrating data from multiple modalities.
Integrating the complementary information: concatenation autoencoder (ConcatAE)
We use the concatenation autoencoder (ConcatAE) to integrate the complementary information from each data modality (Fig. 4). For each modality, we train an independent autoencoder and transform the input features into a hidden space. We then concatenate the hidden features from each modality and feed the concatenated hidden feature into the task-specific network. Compared to the single-modality network, we have a separate reconstruction loss for each data modality. Thus, the reconstruction loss is the summation of these separate reconstruction losses. For example, when integrating two modalities, the new reconstruction loss would be:
\[ {L}_{recon}^{\prime }={L}_{recon}\left({x}_1,{\hat{x}}_1\right)+{L}_{recon}\left({x}_2,{\hat{x}}_2\right) \]
The task-specific network training procedure remains the same, with the input becoming the concatenation of the hidden features learned from each modality.
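The data flow of ConcatAE can be sketched with a forward pass in NumPy. This is only an illustration of shapes and losses: the single-layer encoders and decoders with random weights, the tanh activation, and the dimensions are all assumptions, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_in, n_out):
    """Random weight matrix standing in for a trained layer (assumption)."""
    return rng.normal(0.0, 0.1, (n_in, n_out))

def mse(x, x_hat):
    """Mean over the batch of the squared reconstruction error."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

n, d1, d2, k = 8, 100, 100, 10           # batch size, input sizes, hidden size
x1, x2 = rng.random((n, d1)), rng.random((n, d2))
enc1, dec1 = linear(d1, k), linear(k, d1)  # autoencoder for modality 1
enc2, dec2 = linear(d2, k), linear(k, d2)  # autoencoder for modality 2

z1, z2 = np.tanh(x1 @ enc1), np.tanh(x2 @ enc2)       # per-modality hidden features
recon_loss = mse(x1, z1 @ dec1) + mse(x2, z2 @ dec2)  # sum of the two losses
z_concat = np.concatenate([z1, z2], axis=1)           # input to the task network
```

Because the hidden features are concatenated, the task-specific network receives a vector whose length is the sum of the hidden sizes, so each modality keeps its own coordinates.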
Integrating the consensus information: cross-modality autoencoder (CrossAE)
We use the cross-modality autoencoder (CrossAE) to integrate the consensus information from each data modality (Fig. 5) through cross-modality translation. To encourage a consensus representation among modalities, it uses the hidden features extracted from one modality to reconstruct the input features of the other modalities.
We train the framework in three steps. In the first step, we train an autoencoder for each modality independently, as we have done in the ConcatAE model with \( {L}_{recon}^{\prime } \). In the second step, we train these encoders and decoders again with cross-modality reconstruction. For example, the modality 1 encoder q_{1}(x) is used to transform input data x_{1} to hidden feature z_{1} = q_{1}(x_{1}). We then use the modality 2 decoder p_{2}(z) to reconstruct the modality 2 input data x_{2} from z_{1}, which is denoted as \( {\hat{x}}_{21}={p}_2\left({z}_1\right) \). We can perform a similar cross-modality reconstruction from the modality 2 hidden features z_{2} to the modality 1 input data x_{1}, denoted as \( {\hat{x}}_{12}={p}_1\left({z}_2\right) \). Thus, the cross-modality reconstruction loss L_{cross _ recon} for step 2 with two modalities is
\[ {L}_{cross\_ recon}={L}_{recon}\left({x}_1,{\hat{x}}_{12}\right)+{L}_{recon}\left({x}_2,{\hat{x}}_{21}\right) \]
In the third step, we combine the hidden features from each modality with the element-wise average and then train the encoders and task-specific network with the task-specific loss (e.g., the cross-entropy loss for classification or the negative partial log-likelihood loss for survival regression). We implemented and tested the proposed integration models on two data modalities. These frameworks can be naturally extended to the integration of more than two data modalities.
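Steps 2 and 3 of CrossAE can be sketched in the same forward-pass style. As before, the single-layer random-weight encoders and decoders, the tanh activation, and the dimensions are assumptions for illustration only. Note that both modalities must share the same hidden size, because each decoder has to accept the other modality's hidden features.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(n_in, n_out):
    """Random weight matrix standing in for a trained layer (assumption)."""
    return rng.normal(0.0, 0.1, (n_in, n_out))

def mse(x, x_hat):
    """Mean over the batch of the squared reconstruction error."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=1)))

n, d1, d2, k = 8, 100, 80, 10            # the shared hidden size k is required
x1, x2 = rng.random((n, d1)), rng.random((n, d2))
enc1, dec1 = linear(d1, k), linear(k, d1)
enc2, dec2 = linear(d2, k), linear(k, d2)

z1, z2 = np.tanh(x1 @ enc1), np.tanh(x2 @ enc2)

# Step 2: each decoder reconstructs its modality from the OTHER modality's features.
x_hat_21 = z1 @ dec2                      # modality 2 input from modality 1 features
x_hat_12 = z2 @ dec1                      # modality 1 input from modality 2 features
cross_loss = mse(x1, x_hat_12) + mse(x2, x_hat_21)

# Step 3: element-wise average of hidden features feeds the task-specific network.
z_avg = (z1 + z2) / 2.0
```

In contrast to concatenation, the averaged feature keeps the hidden size k, so information is only preserved to the extent that the two modalities agree in the shared hidden space.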
Implementation and experiments
The train-test split for cross-validation and the classification metrics are implemented with [23]. The neural networks are designed and implemented with PyTorch 1.1.0. For cancer type classification, we use a batch size of 32, the Adam optimizer with a learning rate of 0.001, and 200 training epochs. For survival analysis, we use a batch size of 128, the Adam optimizer with a learning rate of 0.001, and 200 training epochs. More details of the model implementation and training can be found in the GitHub repository (https://github.com/tongli1210/BreastCancerSurvivalIntegration).
Results
Multi-modality integration simulation
We first test the proposed single-modality and multi-modality integration networks on the simulated MNIST datasets (S_{1} and S_{2}). The results are presented in Table 2. From the results, we observe significant classification performance improvements after multi-modality data integration for both the random erasing dataset S_{1} and the Gaussian noise dataset S_{2}. For dataset S_{1}, we assume the model should take the complementary information from X_{1} and X_{2} to get better performance. In the experimental results, the integration model ConcatAE does perform slightly better than the integration model CrossAE. For dataset S_{2}, because of the global noise in both views, we assume the model should take the consensus information from X_{1} and X_{2} to get better performance. In the experimental results, CrossAE achieves better performance than ConcatAE, as expected.
Multi-modality integration for breast cancer survival analysis
The performance of the single-omics survival analysis model is presented in Table 3. We observe that the model achieves better performance with PCA features than with high variance features for all modalities except CNVs. Among the four omics data types, miRNA expression is the most predictive for overall survival, followed by DNA methylation and gene expression. Moreover, CNVs are the least predictive for breast cancer overall survival, which is consistent with our previous findings [20]. The best single-omics survival analysis performance is a C-index of 0.616 ± 0.057, achieved by miRNA data with PCA features.
The performance of the novel multi-omics integration survival analysis model is presented in Table 4. Based on the results, we observe that integration is not always beneficial for performance. For example, the integration of gene expression and DNA methylation high variance features can lead to a lower C-index (0.507 ± 0.036) than either gene expression (0.529 ± 0.033) or DNA methylation (0.581 ± 0.066) alone. Among the six combinations of two-omics data integration, we find that the integration of DNA methylation and miRNA expression consistently achieves good performance. Comparing the two integration strategies, we find that ConcatAE outperforms CrossAE in most experiments. Comparing the two feature selection strategies, we observe that the PCA features outperform the high variance features in most experiments except for those involving CNV data. We believe the PCA dimension reduction approach may not be suitable for the discrete CNV data. Among all multi-omics integration models, the best performance (0.641 ± 0.031) is achieved by integrating DNA methylation and miRNA expression using PCA features and the ConcatAE model.
To evaluate the consensus among hidden features, we measure the similarity of paired hidden features with the Euclidean distance, and visualize their distributions with grouped violin plots in Fig. 6. The violin plots are grouped by the multi-omics modalities under integration (e.g., GeneExp+miRNA) and compared for the two integration methods, ConcatAE and CrossAE. For the hidden features (dimension of 10) learned from PCA features, we observe higher similarities (i.e., lower Euclidean distances) for integration using CrossAE than for integration using ConcatAE (Fig. 6a). However, for the hidden features (dimension of 100) learned from high variance features, the CrossAE method does not necessarily lead to higher similarities (Fig. 6b). The observation is further confirmed by grouped bar plots of the average Euclidean distances in Fig. 6c and d. The results indicate that the consensus constraints imposed by CrossAE work well for PCA features but are less effective for the high variance features, which have a much higher dimension.
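The similarity measure used here is simple to state in code. The sketch below computes, for each patient, the Euclidean distance between the hidden features of two modalities; the function name and the toy values are illustrative assumptions.

```python
import numpy as np

def paired_distances(z1, z2):
    """Euclidean distance between each patient's two hidden feature vectors."""
    return np.linalg.norm(np.asarray(z1) - np.asarray(z2), axis=1)

z_a = np.array([[0.0, 0.0], [1.0, 1.0]])   # hidden features from modality 1
z_b = np.array([[3.0, 4.0], [1.0, 1.0]])   # hidden features from modality 2
dist = paired_distances(z_a, z_b)
mean_dist = float(dist.mean())
```

The per-patient distances correspond to what a violin plot would display, and their mean corresponds to a bar-plot summary. Distances computed in hidden spaces of different dimension (10 versus 100) are not directly comparable in magnitude.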
To further understand the similarity between paired hidden features, we use t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the hidden features from the first fold of our four-fold cross-validation in Supplementary Material Section S3. When using 100 PCA features as the input data, we observe more overlap among the CrossAE hidden features (green and yellow) than among the ConcatAE hidden features (red and blue) (see Fig. S2). This indicates that the multi-omics data representation learned by CrossAE complies better with the consensus constraints. However, when using 1000 high variance features as the input data, we observe that the distribution patterns of the ConcatAE hidden features (red and blue) are similar to those of the CrossAE hidden features (green and yellow) (see Fig. S3). This implies that the effect of the consensus constraints imposed by CrossAE is not as significant.
Discussions
In this study, we have developed two novel multi-modal data integration strategies: integrating the complementary information among modalities with ConcatAE, and integrating the consensus information with CrossAE. We have tested the two new models on the simulated MNIST data and validated their effectiveness. We then apply the two new models to the multi-omics breast cancer survival data. The ConcatAE model integrating DNA methylation and miRNA expression PCA features achieves the best performance with a C-index of 0.641 ± 0.031 and outperforms the CrossAE model (0.63 ± 0.081). Both integration approaches outperform the corresponding single-modality models, which use DNA methylation or miRNA expression alone. The results indicate that these two modalities should have both complementary and consensus information for survival prediction.
Although ConcatAE outperforms CrossAE, we believe this does not necessarily indicate that the complementary information is more important than the consensus information. As we have seen in the MNIST simulated data with Gaussian noise, if the multi-modality data are noisy and equally predictive, consensus learning can achieve higher prediction performance than complementary learning. Moreover, the ConcatAE model should include both the modality-invariant and modality-unique information, although neither has been specifically maximized.
The best survival prediction performance is achieved by integrating DNA methylation and miRNA expression PCA features. However, the results are insufficient to conclude that DNA methylation or miRNA expression is more informative than the other modalities. Due to the lack of biological ground truth, model interpretation and wet-lab validation are needed to understand the model. Because the integration network is a black-box model, we cannot currently identify which biomarkers (e.g., specific genes or methylation sites) are picked by the network and contribute more to the final survival prediction. Thus, as a future direction, we propose to apply model interpretation methods to the deep network and to validate the biomarkers against the literature or by wet-lab experiments. Such validation can provide insight into why some integration models outperform others and is critical for translation to clinical practice.
Although we have demonstrated the effectiveness of ConcatAE and CrossAE for multi-omics data integration in this study, future improvements can be made in the following three areas: 1) training data, 2) model validation, and 3) model improvements.
The first area is the training data, which dictates the survival prediction performance. For example, in the TCGA-BRCA dataset, the CNV features are the least predictive for breast cancer survival. One potential cause is that the CNV features from the TCGA database are categorical (i.e., “gain”, “loss”, or “normal”), which might constrain the predictive capability of this modality. In addition, the gene expression data are normalized with FPKM and the miRNA expression data are normalized with RPM. FPKM and RPM normalization are potentially biased when comparing between samples. The survival prediction performance may be further improved for gene expression and miRNA expression by replacing the normalization method with more sophisticated bioinformatics techniques such as transcripts per million (TPM). Another essential limitation of the current training dataset is the relatively small sample size of the TCGA-BRCA dataset, with around 1000 patients. For a data-driven approach, the performance of deep learning is significantly influenced by the amount of training data. One future direction is to improve our model by using a larger breast cancer survival dataset or by combining multi-source breast cancer survival datasets. Another future direction is to make the most of the TCGA database by multi-task learning, such as applying the integration methods to cancer staging, subtyping, and grading in addition to survival analysis.
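As a small worked example of the normalization point, FPKM values can be converted to TPM by rescaling each sample so that its values sum to one million, using the standard identity TPM_{i} = FPKM_{i} / Σ_{j} FPKM_{j} × 10^{6}. The function name and the toy matrix are assumptions; rows are taken to be samples and columns transcripts.

```python
import numpy as np

def fpkm_to_tpm(fpkm):
    """Convert an FPKM matrix (rows: samples, columns: transcripts) to TPM."""
    fpkm = np.asarray(fpkm, dtype=float)
    # Rescale every sample so that its values sum to one million.
    return fpkm / fpkm.sum(axis=1, keepdims=True) * 1e6

fpkm = np.array([[1.0, 1.0, 2.0],
                 [10.0, 30.0, 60.0]])
tpm = fpkm_to_tpm(fpkm)
```

After conversion every sample has the same total, which is what makes TPM values easier to compare between samples than FPKM values.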
The second limitation is model validation. In this study, we validate the effectiveness of the ConcatAE and CrossAE networks with the simulated two-view imaging data from the MNIST database, in which we have controlled and visualized the consensus and complementary information. Ideally, a cancer genomics dataset with ground truth would be preferred to validate the proposed integration networks. However, to the best of our knowledge, no such gold-standard multi-omics dataset has been developed yet because many complex interactions among multi-omics data remain unknown. If the ground truth of multi-omics interactions were known, it would be straightforward to validate the consensus and complementary principles for multi-omics data integration methods. Until then, a more realistic approach is to collect data for the known cross-modality pathways (e.g., DNA methylation and gene expression pathways) to validate the consensus principle. Another way is to use multi-omics data simulation with ground truth to validate the proposed models. Although some multi-omics data simulation methods have been recently developed [24, 25], they are not specifically designed to validate the interactions across modalities with 1) consensus information (e.g., co-regulation pathways), 2) complementary information (e.g., modality-specific pathways/biomarkers), and 3) endpoint-irrelevant information. Thus, one promising future step is to simulate multi-omics data to validate the integration principles and methods in follow-up studies.
The third limitation lies in the multi-modality integration network. First, we have shown that the feature-selection or dimension-reduction steps impact multi-modality integration performance. Our current feature selection step contains unsupervised feature selection by variance ranking and unsupervised dimension reduction by PCA. One immediate direction for future work is to utilize more sophisticated knowledge-guided feature selection. Another is to integrate feature selection with multi-omics feature representation into the multi-modality deep network to improve model performance. Second, combining consensus learning and complementary learning may further improve multi-omics integration. We propose to extend the current ConcatAE framework by using two encoders or an encoder with branches to represent both the modality-unique hidden feature and the modality-consensus feature. The modality-unique hidden features can be learned by maximizing the divergence among modalities, while the modality-consensus hidden features can be learned by minimizing the divergence among modalities. Instead of the cross-modality reconstruction in CrossAE, the consensus constraints and the complementary constraints would both be realized by divergence optimization for better performance. Third, another future direction is to improve the survival model. In this study, we have implemented a simple deep learning-based survival network using the negative partial log-likelihood loss. One direction for future work is to improve the survival network with regularization, such as an L_{1} loss on the network weights. A robust survival network will further improve the multi-omics integrated survival network.
Conclusions
In this study, we have investigated two novel multi-modal data integration strategies: ConcatAE and CrossAE. We first tested the proposed models on the simulated MNIST data and validated the effectiveness of ConcatAE in integrating complementary information and CrossAE in integrating consensus information among multi-modality data. We then applied the proposed models to the multi-omics breast cancer survival data obtained from the TCGA-BRCA dataset. For the single-omics model, miRNA expression is the most predictive for breast cancer survival analysis (0.616 ± 0.057), followed by DNA methylation and gene expression. CNV data are the least predictive for breast cancer overall survival analysis. For the multi-omics model, the ConcatAE model integrating DNA methylation and miRNA expression PCA features achieves the best performance with a C-index of 0.641 ± 0.031. The CrossAE model integrating DNA methylation and miRNA expression PCA features achieves a C-index of 0.63 ± 0.081, which also outperforms either DNA methylation or miRNA expression alone. We conclude that the DNA methylation data and miRNA expression data contain both complementary and consensus information, and using such information can improve survival analysis performance. As a future direction, we can develop a more sophisticated learning framework that utilizes both consensus and complementary information simultaneously to further improve survival prediction for personalized breast cancer diagnosis and treatment.
Availability of data and materials
The MNIST database can be accessed at http://yann.lecun.com/exdb/mnist/
The TCGA-BRCA breast cancer multi-omics data can be downloaded from https://portal.gdc.cancer.gov/
Abbreviations
AUC: Area under the curve
BRCA: BReast CAncer
C-index: Concordance index
CNVs: Copy number variations
ConcatAE: Concatenation autoencoder
CrossAE: Cross-modality autoencoder
FPKM: Fragments per kilobase of transcript per million mapped reads
MNIST: Modified National Institute of Standards and Technology
mRMR: minimum Redundancy Maximum Relevance
PCA: Principal component analysis
PCs: Principal components
RPM: Reads per million mapped reads
TCGA: The Cancer Genome Atlas
TPM: Transcripts per million
Acknowledgments
The authors would like to thank Mr. Hang Wu for his kind suggestions on the experiment design and analysis.
Funding
This work was supported in part by grants from the National Institutes of Health (NIH) under Award R01CA163256, the Giglio Breast Cancer Research Fund, the Petit Institute Faculty Fellow and Carol Ann and David D. Flanagan Faculty Fellow Research Fund, and the Georgia Cancer Coalition Distinguished Cancer Scholar award. This work was also supported in part by a scholarship from the China Scholarship Council (CSC) under Grant CSC NO. 201406010343. The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
The results published here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.
Author information
Contributions
L.T. and M.D.W. conceived of and organized the study. L.T. developed the theory and performed the experiments. L.T., J.M., K.C., and M.D.W. contributed to the analysis. L.T., J.M., K.C., and M.D.W. wrote the manuscript and made the figures. All authors discussed the results and contributed to the final manuscript.
Ethics declarations
Ethics approval and consent to participate
The study involves only simulated data and de-identified data from the TCGA database, which is publicly available. The study team has no way to re-identify the TCGA data. Ethics approval by the Georgia Institute of Technology (Georgia Tech) Institutional Review Board (IRB) was therefore waived, and informed consent to participate is not applicable to this study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Supplementary information
Additional file 1.
Supplementary file and supplementary figures.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tong, L., Mitchel, J., Chatlin, K. et al. Deep learning based feature-level integration of multi-omics data for breast cancer patients survival analysis. BMC Med Inform Decis Mak 20, 225 (2020). https://doi.org/10.1186/s12911-020-01225-8
DOI: https://doi.org/10.1186/s12911-020-01225-8