In-silico interaction-resolution pathway activity quantification and application to identifying cancer subtypes
- Sungwon Jung^{1}
https://doi.org/10.1186/s12911-016-0295-2
© Jung. 2016
Published: 18 July 2016
Abstract
Background
Identifying subtypes of complex diseases such as cancer is the first step toward developing highly customized therapeutics for such diseases, as their origins vary significantly even when their physiological characteristics are similar. Many studies have sought to recognize subtypes of various cancers based on genomic signatures, and most of them rely on signatures or features derived from individual genes. However, the idea of network-driven activities of biological functions has gained a lot of interest, as more evidence is found that biological systems can show highly diverse activity patterns because genes can interact differentially across specific molecular contexts.
Methods
In this study, we proposed an in-silico method to quantify pathway activities with a resolution of genetic interactions for individual samples, and developed a method to compute the discrepancy between samples based on the quantified pathway activities.
Results
By using the proposed discrepancy measure between sample pathway activities to cluster melanoma gene expression data, we identified two potential subtypes of melanoma with distinct pathway activities, and the two groups of patients showed significantly different survival patterns. We also investigated selected pathways whose activity patterns differed between the two groups, and the results suggest hypotheses on the mechanisms driving the two potential subtypes.
Conclusions
By using the proposed approach of modeling pathway activities with a resolution of genetic interactions, potential novel subtypes of disease were proposed with accompanying hypotheses on subtype-specific genetic interaction information.
Background
Since the emergence of high-throughput genomic profiling techniques, genomic profile data have become a primary source of information for recognizing the various statuses of complex diseases. Cancer is one such complex disease, where even tumors from the same tissue location can have strikingly diverse molecular mechanisms at their origins. This high heterogeneity is one of the main obstacles in cancer treatment, as different driving mechanisms may require different therapeutic approaches to repair the abnormality. For this reason, identifying subtypes of cancer with different functional mechanisms is very important for improving diagnosis and treatment.
One of the popular approaches to recognizing the subtypes of cancer is clustering the gene expression data of patient samples (for example, [1–7]), as expression data can give a comprehensive snapshot of transcription activities for all genes. Many clustering studies consider each gene as a feature for clustering, assuming that the expression levels of individual genes are the factors that discriminate the different subtypes of cancer. However, the main drawback of such approaches is that they focus on individual genes, while a set of interacting genes constitutes a functional module in many real biological systems. For this reason, using individual genes as features often suffers from low reproducibility, which indicates that the expression levels of individual genes reflect only part of the discrepancy between different subtypes.
To overcome this limitation, utilizing known pathway information together with the expression data is a promising approach. Because a joint probability distribution over a set of variables gives a comprehensive picture of their pattern, an ideal approach would model the joint probability distribution that describes the combinatorial gene expression levels within a pathway. However, this is not practical, owing to the complexity of the model needed to represent the joint probability distribution and the lack of available data to infer such complex models with sufficient reliability. Hence, most methods that utilize pathway information focus on specific features of pathways rather than on the complete joint probability distributions. Characterizing individual samples with pathway information and applying the result to clustering has achieved limited success. A recent study proposed a method called PARADIGM [8], which infers patient-specific gene activities from multi-dimensional genomic data using known genetic interactions from pathways. PARADIGM converts the multiple genomic measurements of a gene in a sample into a single aggregated value called IPA, which represents the summarized activity level of the gene for the sample and is evaluated in consideration of genetic interactions from pathway information. The computed IPA values of genes can be used for clustering instead of their raw expression values, but they still represent the activity levels of individual genes rather than the activity levels of pathways.
In this study, we propose a method to compute the dissimilarity between two gene expression samples based on features that represent pathway activity patterns. Unlike conventional methods, the proposed method converts a gene-level matrix (for example, a gene expression matrix) to a pathway-level matrix, where each cell in the matrix represents a pathway activity pattern for a sample. We applied the proposed sample dissimilarity measure to the clustering of cancer samples, where the RNA-Seq data of 267 melanoma patients from The Cancer Genome Atlas (TCGA) were clustered based on their pathway activities. Two patient groups representing potential subtypes were identified with a clear difference in their survival patterns, and they were associated with different stages of melanoma. Investigation of selected pathway activity patterns across the two patient groups suggested hypotheses on the different functional mechanisms driving the two potential subtypes.
Methods
Our approach is based on an assumption that the activity pattern of a pathway for a sample can be represented with the probability distribution of the genetic network likelihoods from the pathway, which is computed from the given gene expression data. A sample pathway activity vector (PAV), which represents the comprehensive picture of all pathway activities of a single sample, is represented as a collection of pathway activities for all pathways for the sample. A pathway activity vector distance (PAVd) is proposed as a discrepancy measure between two sample pathway activity vectors, which represents the dissimilarity between the two samples from the perspective of pathways. As PAVd is a distance metric (this will be discussed in the following subsections), arbitrary clustering methods and cluster validation indexes can be used for clustering and quality evaluation. Details of this formulation will be given in the following subsections.
Pathway activity distribution
We compute the activity of a pathway for a sample by approximating the probability distribution of genetic networks from the pathway. Specifically, the pathway activity distribution Pr(PA _{ i }, s _{ j }) of a pathway PA _{ i } for a sample s _{ j } is computed from the following steps:
Step 1) Consider PA _{ i } as a discrete random variable that has a finite set of N genetic network structures g _{ 1 }, g _{ 2 }, …, g _{ N } as its possible values.
Step 2) Compute the likelihood L _{ k } = P(g _{ k }|s _{ j }) for each genetic network g _{ k } for sample s _{ j }. The collection of likelihoods [L _{ 1 }, L _{ 2 } … L _{ N-1 }, L _{ N }] for N genetic network structures constitutes the pathway activity distribution Pr(PA _{ i }, s _{ j }) of pathway PA _{ i } for a sample s _{ j }.
Compared with computing a single scalar-valued activity for a pathway, computing the pathway activity as a probability distribution is a generalization of that idea. By considering multiple genetic networks, we expect to measure pathway activities more reliably than with a single scalar-valued activity.
where D represents the collection of all samples. Even though we use the Bayesian network model assuming discrete random variables, our formulation is independent of model choices. Thus, other network and random variable models can also be used as long as the likelihood of a network structure can be computed based on the model of preference.
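The two steps above can be illustrated with a minimal Python sketch. It assumes that the candidate networks of a pathway are the full network plus every network with exactly one interaction removed (the setting described in the Conclusions), that the per-network scores are available as log-likelihoods, and that they are normalized to sum to 1 so that they form a discrete distribution. The normalization, the use of logarithms, and the names `candidate_networks` and `activity_distribution` are assumptions made for illustration, and the scores shown are toy values rather than real data.

```python
import math
from typing import FrozenSet, List, Sequence, Tuple

Edge = Tuple[str, str]


def candidate_networks(edges: Sequence[Edge]) -> List[FrozenSet[Edge]]:
    """Return the full network followed by every network missing exactly one interaction."""
    full = frozenset(edges)
    # sorted() only fixes the order of the candidates so results are reproducible.
    return [full] + [full - {edge} for edge in sorted(full)]


def activity_distribution(log_likelihoods: Sequence[float]) -> List[float]:
    """Normalize per-network log-likelihoods so that they sum to 1 (assumed normalization)."""
    peak = max(log_likelihoods)
    # Subtracting the maximum before exponentiating avoids floating-point underflow.
    weights = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(weights)
    return [weight / total for weight in weights]


# Toy pathway with two interactions, which gives N = 3 candidate networks.
networks = candidate_networks([("A", "B"), ("B", "C")])
# Hypothetical log-likelihoods, one per candidate network (not real data).
distribution = activity_distribution([-10.0, -12.0, -11.0])
```

Any scoring model that yields a likelihood per network structure could supply the input to `activity_distribution`; the sketch does not implement the Bayesian network scoring itself.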
Sample pathway activity vector
For A pathways and S samples, the pathway activity distribution matrix R is defined as an A × S matrix, where a cell R(i, j) corresponds to the pathway activity distribution Pr(PA _{ i }, s _{ j }) of a pathway PA _{ i } for a sample s _{ j }. In other words, R is a collection of column vectors PAV(s _{ j }, PA) for S samples, where PAV(s _{ j }, PA) = [Pr(PA _{ 1 }, s _{ j }), Pr(PA _{ 2 }, s _{ j }), …, Pr(PA _{ A }, s _{ j })].
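As a small illustration of this data structure, the following Python sketch stores R as a nested list in which `R[i][j]` is the activity distribution of pathway i for sample j, and extracts the pathway activity vector of one sample as a column. The helper name `sample_pav` and all numbers are hypothetical toy values.

```python
from typing import List

Distribution = List[float]

# R[i][j] is the activity distribution of pathway i for sample j (toy values).
# Distributions of different pathways may have different lengths N.
R: List[List[Distribution]] = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],  # pathway 0, N = 3 candidate networks
    [[0.5, 0.5], [0.9, 0.1]],            # pathway 1, N = 2 candidate networks
]


def sample_pav(matrix: List[List[Distribution]], j: int) -> List[Distribution]:
    """Return column j of the matrix: one distribution per pathway for sample j."""
    return [row[j] for row in matrix]


pav_of_second_sample = sample_pav(R, 1)
```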
Discrepancy measure between two sample pathway activity vectors
If a column vector in a pathway activity distribution matrix R is a scalar-valued vector with each pathway activity represented with a scalar value, conventional distance measures (such as Euclidean distance) assuming ordinary scalar-valued vectors can be used to evaluate the discrepancy between two samples. In our approach of representing the pathway activity with a discrete probability distribution Pr(PA _{ i }, s _{ j }), the representation of sample pathway activity PAV(s _{ j }, PA) of a sample s _{ j } for a set of A pathways is a vector of probability distributions as shown in Eq. (2). As each element of a pathway activity vector PAV(s _{ j }, PA) is a probability distribution rather than a scalar value, a new method is necessary to compute the distance between two vectors of probability distributions PAV(s _{ l }, PA) and PAV(s _{ m }, PA) from two samples s _{ l } and s _{ m }.
where JS is the Jensen-Shannon divergence. The Jensen-Shannon divergence is a symmetrized version of the Kullback-Leibler divergence, and a popular method of measuring the similarity between two probability distributions. Note that PAVd is a metric, as it satisfies the four required properties – non-negativity, identity of indiscernibles, symmetry and triangle inequality.
Corollary 1. PAVd satisfies non-negativity.
Corollary 2. PAVd satisfies the identity of indiscernibles.
Corollary 3. PAVd satisfies symmetry.
Corollary 4. PAVd satisfies the triangle inequality.
Theorem 1. PAVd is a distance metric.
Proof. By Corollaries 1 to 4, PAVd satisfies the four properties of a metric.
By using this distance metric PAVd with conventional clustering algorithms, we can group samples based on the sample pathway activities.
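As an illustration of the distance computation, the following Python sketch evaluates the Jensen-Shannon divergence between two discrete distributions and aggregates it over pathways. The aggregation used here, a sum over pathways of the square root of the per-pathway divergence, is an assumption made for this sketch and may differ from the definition of PAVd. It is chosen because the square root of the Jensen-Shannon divergence is known to be a metric [10, 11], and a sum of per-pathway metrics is again a metric. The function names `kl_divergence`, `js_divergence` and `pavd_sketch` and the toy values are hypothetical.

```python
import math
from typing import List, Sequence

Distribution = Sequence[float]


def kl_divergence(p: Distribution, q: Distribution) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Distribution, q: Distribution) -> float:
    """Jensen-Shannon divergence: average KL divergence to the midpoint distribution."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def pavd_sketch(pav_l: List[Distribution], pav_m: List[Distribution]) -> float:
    """Assumed aggregation: sum over pathways of sqrt(JS) between matching distributions."""
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    return sum(math.sqrt(max(js_divergence(p, q), 0.0)) for p, q in zip(pav_l, pav_m))


# Toy pathway activity vectors for two samples over two pathways (not real data).
pav_a = [[0.7, 0.2, 0.1], [0.5, 0.5]]
pav_b = [[0.1, 0.8, 0.1], [0.9, 0.1]]
distance_ab = pavd_sketch(pav_a, pav_b)
```

With base-2 logarithms, each per-pathway divergence lies between 0 and 1, so under this assumed aggregation the distance between two samples is bounded by the number of pathways.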
Utilizing pathway information
We collected 1932 filtered gene sets of canonical pathways, Gene Ontology (GO) biological process and molecular functions from MSigDB [12], where each gene set has up to 50 genes, and used them as pathways in our study. The gene sets from MSigDB do not include genetic interaction information. For genetic interaction information, 854,464 human genetic interactions were obtained from Pathway Commons [13], and genes in each pathway were interconnected based on the obtained genetic interactions.
Analysis of TCGA melanoma RNA-Seq data
We obtained the RNA-Seq data of 267 melanoma patients from TCGA. The normalized gene-level transcript counts were used for the analysis. The normalized counts of each gene were discretized into two values, 0 (not expressed) and 1 (expressed), using SIBER [14]. A sample pathway activity vector was computed for each of the 267 patient samples, and a pathway activity distribution matrix R was built as a result. Using PAVd as the distance measure between sample pathway activity vectors, which correspond to the columns of R, hierarchical clustering with complete linkage was applied to R to find groups of patients.
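The clustering step can be sketched in plain Python as follows. The sketch takes a precomputed symmetric matrix of pairwise sample distances and repeatedly merges the two clusters with the smallest complete-linkage distance, that is, the smallest value of the largest pairwise distance between their members. The function name `complete_linkage` and the parameter `n_clusters` are hypothetical; the text does not state how the dendrogram was cut into groups, so stopping at a fixed number of clusters is an assumption. The distance matrix below is a toy example, not the melanoma data.

```python
from typing import List, Sequence


def complete_linkage(dist: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Naive agglomerative clustering with complete linkage on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy distances between four samples: samples 0 and 1 are close, as are 2 and 3.
toy_dist = [
    [0.0, 1.0, 10.0, 11.0],
    [1.0, 0.0, 9.0, 10.0],
    [10.0, 9.0, 0.0, 1.0],
    [11.0, 10.0, 1.0, 0.0],
]
groups = complete_linkage(toy_dist, n_clusters=2)
```

This naive loop recomputes cluster distances at every merge and is meant only to make the linkage rule explicit; for several hundred samples an established hierarchical clustering library would be a more practical choice.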
Results
Identification of two patient groups from the clustering result
Identified two groups of patients and their melanoma stages
| | Group I | Group II |
---|---|---|
Number of patients | 90 | 85 |
Stage I | 18 (p = 0.0362) | 7 (p = 0.9764) |
Stage II | 11 (p = 0.9779) | 24 (p = 0.0049) |
Stage III | 16 (p = 0.4976) | 18 (p = 0.1623) |
Stage IV | 2 (p = 0.3258) | 2 (p = 0.2889) |
Survival analysis of identified patient groups
Survival statistics of identified patient groups
Patient group | Number of patients | Median survival (days) | Hazard ratio (versus the rest of the patients) | p-value (log-rank test, versus the rest of the patients) |
---|---|---|---|---|
Group I | 90 | 4,254 | 0.6298 | 0.0044 |
Group II | 85 | 1,625.5 | 2.3281 | 0.03294 |
All – (Group I ∪ Group II) | 92 | 1,982 | 0.7150 | 0.46775 |
Discussion
We compared the pathway activity patterns between Group I and II, and two pathways with distinguished activity patterns were selected for further investigation. The first pathway was a GO gene set Regulation of Cell-Cell Adhesion with one genetic interaction from Pathway Commons, and the second pathway was a DNA Fragmentation pathway with eight genetic interactions.
Difference between Group I and II, based on regulation of cell-cell adhesion
Difference between Group I and II, based on DNA Fragmentation
Conclusions
Functional difference of the proposed PAV compared to previous pathway evaluation methods
Method | Interaction-resolution | Multi-type data integration | Single sample evaluation |
---|---|---|---|
GSEA | No | No | No |
PARADIGM | No | Yes | Yes |
PAV | Yes | No | Yes |
We are considering several directions for future studies. In the current study, we modeled the pathway activity patterns with a genetic network of normal status and genetic networks with only one missing interaction. Even though we showed a successful application of our formulation, it would be beneficial to consider genetic networks with more than one missing interaction, which leads to a generalization of the formulation to up to K missing interactions. We can also develop methods to quantitatively evaluate the ability of individual pathways to discern different clusters. As our distance measure satisfies the properties of a metric, we can incorporate the ideas of many conventional cluster validation indexes to evaluate the quality of clusters based on individual pathways. Lastly, we considered only gene expression data in our study, but integrating multiple types of genomic data in the evaluation of pathway activity patterns, and computing the effect of latent environment variables on pathways, are promising directions for extending our current models.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
All the data used in this article were obtained from public sources as cited.
Declarations
The publication cost of this article was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number : HI15C1593).
This article has been published as part of BMC Medical Informatics and Decision Making Volume 16 Supplement 1, 2016: Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-16-supplement-1.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics. 2008;9:497.
- Getz G, Gal H, Kela I, Notterman DA, Domany E. Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data. Bioinformatics. 2003;19(9):1079–89.
- Liu W, Yuan K, Ye D. On alpha-divergence based nonnegative matrix factorization for clustering cancer gene expression data. Artif Intell Med. 2008;44(1):1–5.
- Mukhopadhyay A, Bandyopadhyay S, Maulik U. Multi-class clustering of cancer subtypes through SVM based ensemble of pareto-optimal solutions for gene marker identification. PLoS One. 2010;5(11):e13803.
- Pal NR, Aguan K, Sharma A, Amari S. Discovering biomarkers from gene expression data for predicting cancer subgroups using neural networks and relational fuzzy clustering. BMC Bioinformatics. 2007;8:5.
- Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003;100(14):8418–23.
- Zhiwen Y, Le L, Jane Y, Hau-San W, Guoqiang H. SC(3): triple spectral clustering-based consensus clustering framework for class discovery from cancer gene expression profiles. IEEE/ACM Trans Comput Biol Bioinform. 2012;9(6):1751–65.
- Vaske CJ, Benz SC, Sanborn JZ, Earl D, Szeto C, Zhu J, Haussler D, Stuart JM. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26(12):i237–45.
- Buntine W. Theory refinement on Bayesian networks. In: The 7th Conference on Uncertainty in Artificial Intelligence. Burlington: Morgan Kaufmann Publishers; 1991. p. 52–60.
- Endres DM, Schindelin JE. A new metric for probability distributions. IEEE Trans Inf Theory. 2003;49(7):1858–60.
- Österreicher F, Vajda I. A new class of metric divergences on probability spaces and its applicability in statistics. Ann Inst Stat Math. 2003;55(3):639–53.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–50.
- Cerami EG, Gross BE, Demir E, Rodchenkov I, Babur O, Anwar N, Schultz N, Bader GD, Sander C. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res. 2011;39(Database issue):D685–90.
- Tong P, Chen Y, Su X, Coombes KR. SIBER: systematic identification of bimodally expressed genes using RNAseq data. Bioinformatics. 2013;29(5):605–13.