Our approach is based on an assumption that the activity pattern of a pathway for a sample can be represented with the probability distribution of the genetic network likelihoods from the pathway, which is computed from the given gene expression data. A sample pathway activity vector (PAV), which represents the comprehensive picture of all pathway activities of a single sample, is represented as a collection of pathway activities for all pathways for the sample. A pathway activity vector distance (PAVd) is proposed as a discrepancy measure between two sample pathway activity vectors, which represents the dissimilarity between the two samples from the perspective of pathways. As PAVd is a distance metric (this will be discussed in the following subsections), arbitrary clustering methods and cluster validation indexes can be used for clustering and quality evaluation. Details of this formulation will be given in the following subsections.
Pathway activity distribution
We compute the activity of a pathway for a sample by approximating the probability distribution of genetic networks from the pathway. Specifically, the pathway activity distribution Pr(PA
i
, s
j
) of a pathway PA
i
for a sample s
j
is computed from the following steps:
Step 1) Consider PA
i
as a discrete random variable that has a finite set of N genetic network structures g
1
, g
2
, …, g
N
as its possible values.
Step 2) Compute the likelihood L
k
= P(g
k
|s
j
) for each genetic network g
k
for sample s
j
. The collection of likelihoods [L
1
, L
2
… L
N-1
, L
N
] for N genetic network structures constitutes the pathway activity distribution Pr(PA
i
, s
j
) of pathway PA
i
for a sample s
j
.
Compared to the idea of computing a single scalar-valued activity for a pathway, our approach of computing the pathway activity as a probability distribution is a generalized version of such idea. From this generalization of considering multiple genetic networks, it is expected to achieve more reliable measurement of pathway activities than the idea of computing a single scalar-valued activity.
In computing the pathway activity distribution Pr(PA
i
, s
j
), we utilize existing knowledge base on pathways and gene regulatory interactions. Instead of enumerating all possible genetic network structures for PA
i
, a selected list of candidate genetic networks are considered in reference with a knowledge base of choice, to compute Pr(PA
i
, s
j
). The list of candidate networks will include one genetic network of normal status, where we assume that a normally acting pathway has all known genetic interactions functioning. From this normal network with all known genetic interactions, we assume that eliminating an interaction can represent a disturbed status of the pathway. This assumption is based on an idea that disruption of a pathway by external variables (for example, regulation by miRNA, epigenetic changes, gene copy number variation) can be represented with silencing certain genetic interactions within the pathway. As a result, a normal network and disturbed networks with one silenced interaction are considered as possible values of PA
i
. The schematic outline of this approach is illustrated in Fig. 1.
To compute the likelihood L
k
= P(g
k
|s
j
) for each genetic network g
k
for sample s
j
, we model a genetic network with a Bayesian network structure assuming discrete random variables as its nodes. The computation of likelihood is done using the Bayesian Dirichlet equivalence uniform (BDeu) scoring method [9]. However, the direct computation of likelihoods with only single sample yields uniform likelihoods for all samples as the BDeu scoring method considers each instance (sample) of variables with the same preference, especially with the uniform prior assumption. For this reason, we take an indirect approach to compute likelihoods, with the following formulation:
$$ P\left({g}_k\Big|{s}_j\right)=\frac{BDeu\left({g}_k\Big|D\right)}{BDeu\left({g}_k\Big|D-\left\{{s}_j\right\}\right)} $$
(1)
where D represents the collection of all samples. Even though we use the Bayesian network model assuming discrete random variables, our formulation is independent of model choices. Thus other network and random variable models can be also used as long as the likelihood of a network structure can be computed based on the model of preference.
Sample pathway activity vector
The sample pathway activity vector PAV(s
j
, PA) of sample s
j
for a set of A pathways PA is defined with a vector of pathway activity distributions as follows:
$$ PAV\left({s}_i,PA\right)=< \Pr \left(P{A}_1,{s}_i\right), \Pr \left(P{A}_2,{s}_i\right),\dots, \Pr \left(P{A}_A,{s}_i\right)> $$
(2)
For A pathways and S samples, the pathway activity distribution matrix R is defined as a A × S matrix, where a cell R(i, j) corresponds to a pathway activity distribution Pr(PA
i
, s
j
) of a pathway PA
i
for a sample s
j
. In other words, R is a collection of column vectors PAV(s
j
, PA) for S samples.
Discrepancy measure between two sample pathway activity vectors
If a column vector in a pathway activity distribution matrix R is a scalar-valued vector with each pathway activity represented with a scalar value, conventional distance measures (such as Euclidean distance) assuming ordinary scalar-valued vectors can be used to evaluate the discrepancy between two samples. In our approach of representing the pathway activity with a discrete probability distribution Pr(PA
i
, s
j
), the representation of sample pathway activity PAV(s
j
, PA) of a sample s
j
for a set of A pathways is a vector of probability distributions as shown in Eq. (2). As each element of a pathway activity vector PAV(s
j
, PA) is a probability distribution rather than a scalar value, a new method is necessary to compute the distance between two vectors of probability distributions PAV(s
l
, PA) and PAV(s
m
, PA) from two samples s
l
and s
m
.
We designed a new distance measure pathway activity vector distance (PAVd) to compute the distance between two vectors of discrete probability distributions PAV(s
l
, PA) and PAV(s
m
, PA), which is defined as follows:
$$ PAVd\left(\left(PAV\left({s}_l,PA\right)\right)\left|\right|PAV\left({s}_m,PA\right)\right)={\displaystyle {\sum}_{i=1}^A\sqrt{JS\left( \Pr \left(P{A}_i\Big|{s}_l\right)\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_m\right)}} $$
(3)
where JS is the Jensen-Shannon divergence. The Jensen-Shannon divergence is a symmetrized version of the Kullback-Leibler divergence, and a popular method of measuring the similarity between two probability distributions. Note that PAVd is a metric, as it satisfies the four required properties – non-negativity, identity of indiscernibles, symmetry and triangle inequality.
Corollary 1.
PAVd satisfies a property, the non-negativity.
Proof. The Jensen-Shannon divergence JS of two probability distributions is a non-negative value.
$$ \therefore PAVd\left(PAV\left({s}_l,PA\right)\left|\right|PAV\left({s}_m,PA\right)\right)={\displaystyle {\sum}_{i=1}^A\sqrt{JS\left( \Pr \left(P{A}_i\Big|{s}_l\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_m\right)\right)}}\ge 0 $$
(4)
Corollary 2.
PAVd satisfies a property, the identify of indiscernibles.
Proof. PAVd(PAV(s
l
, PA) || PAV(s
m
, PA)) is a sum of non-negative values from Eq. (3). Thus, PAVd(PAV(s
l
, PA) || PAV(s
m
, PA)) = 0 requires JS(Pr(PA
i
| s
l
) || Pr(PA
i
| s
m
)) to be 0 for all i. As the square root of the Jensen-Shannon divergence is a metric [10, 11], JS(Pr(PA
i
| s
l
) || Pr(PA
i
| s
m
)) = 0 if and only if Pr(PA
i
| s
l
) = Pr(PA
i
| s
m
). If Pr(PA
i
| s
l
) = Pr(PA
i
| s
m
) for all i, then PAV(s
l
, PA) = PAV(s
m
, PA).
$$ \therefore PAVd\left(PAV\left({s}_l,PA\right)\left|\right|PAV\left({s}_m,PA\right)\right)=0\;\mathrm{if}\;\mathrm{and}\;\mathrm{only}\;\mathrm{if}\;PAV\left({s}_l,PA\right)=PAV\left({s}_m,PA\right). $$
(5)
Corollary 3.
PAVd satisfies a property, symmetry.
Proof. The Jensen-Shannon divergence JS is a symmetrized version of the Kullback-Leibler divergence.
$$ \therefore PAVd\left(PAV\left({s}_l,PA\right)\left|\right|PAV\left({s}_m,PA\right)\right)= PAVd\left(PAV\left({s}_m,PA\right)\left|\right|PAV\left({s}_l,PA\right)\right) $$
(6)
Corollary 4.
PAVd satisfies a property, triangle inequality.
Proof. Consider three sample pathway activity vector PAV(s
l
, PA), PAV(s
m
, PA) and PAV(s
n
, PA). As the square root of the Jensen-Shannon divergence JS is a metric, the following is true for all i:
$$ \begin{array}{l}\kern14.36em \sqrt{JS\left( \Pr \left(P{A}_i\Big|{s}_l\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_n\right)\right)}\\ {}\le \sqrt{JS\left( \Pr \left(P{A}_i\Big|{S}_l\right)\left|\right| \Pr \left(P{A}_l\Big|{s}_m\right)\right)}+\sqrt{JS\left( \Pr \left(P{A}_i\Big|{s}_m\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_n\right)\right)}\end{array} $$
(7)
Thus, the following is also true:
$$ \begin{array}{l}{\displaystyle {\sum}_{i=1}^A\sqrt{JS\left( \Pr \left(P{A}_i\Big|{S}_l\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_n\right)\right)}}\\ {}\le {\displaystyle {\sum}_{i=1}^A\sqrt{JS\left( \Pr \left(P{A}_i\Big|{S}_l\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_m\right)\right)}}+{\displaystyle {\sum}_{i=1}^A\sqrt{JS\left( \Pr \left(P{A}_i\Big|{S}_m\right)\left|\right| \Pr \left(P{A}_i\Big|{s}_n\right)\right)}}\end{array} $$
(8)
$$ \begin{array}{l}\kern1.56em \therefore PAVd\left(PAV\left({s}_l,PA\right)\left|\right|PAV\left({s}_n,PA\right)\right)\\ {}\le PAVd\left(PAV\left({s}_l,PA\right)\left|\right|PAV\left({s}_m,PA\right)\right)+ PAVd\left(PAV\left({s}_m,PA\right)\left|\right|PAV\left({s}_n,PA\right)\right)\end{array} $$
(9)
Theorem 1.
PAVd is a distance metric.
Proof. From Corollary 1 to 4, PAVd satisfies the four properties of metric.
By using this distance metric PAVd with conventional clustering algorithms, we can group samples based on the sample pathway activities.
Utilizing pathway information
We collected 1932 filtered gene sets of canonical pathways, Gene Ontology (GO) biological process and molecular functions from MSigDB [12], where each gene set has up to 50 genes, and used them as pathways in our study. The gene sets from MSigDB do not include genetic interaction information. For genetic interaction information, 854,464 human genetic interactions were obtained from Pathway Commons [13], and genes in each pathway were interconnected based on the obtained genetic interactions.
Analysis of TCGA melanoma RNA-Seq data
We obtained the RNA-Seq data of 267 melanoma patients from TCGA. The normalized gene-level transcript counts were used for the analysis. The normalized counts of each gene were discretized into two values of 0 (not expressed) and 1 (expressed) using SIBER [14]. A sample pathway activity vector has been computed for each of the 267 patient samples, and a pathway activity distribution matrix R was built as a result. Using PAVd as a distance measure between sample pathway activity vectors that correspond to the columns of R, hierarchical clustering with complete linkage was applied to R to find groups of patients.