MiRNA-disease association prediction via hypergraph learning based on high-dimensionality features

Abstract

Background

MicroRNAs (miRNAs) have been confirmed to have a close relationship with various complex human diseases. Identifying disease-related miRNAs provides valuable insight into the underlying pathogenesis of diseases. However, determining which miRNAs are related to which diseases remains a major challenge. Because experimental methods are generally expensive and time‐consuming, it is important to develop efficient computational models to discover potential miRNA-disease associations.

Methods

This study presents a novel prediction method called HFHLMDA, which is based on high-dimensionality features and hypergraph learning, to reveal associations between diseases and miRNAs. First, the miRNA functional similarity and the disease semantic similarity are integrated to form an informative high-dimensionality feature vector. Then, a hypergraph is constructed by the K-Nearest-Neighbor (KNN) method, in which each miRNA-disease pair and its k most relevant neighbors are linked as one hyperedge to represent the complex relationships among miRNA-disease pairs. Finally, a hypergraph learning model is designed to learn the projection matrix, which is used to calculate the association scores of unconfirmed miRNA-disease pairs.

Result

Compared with four state-of-the-art computational models, HFHLMDA achieved the best AUC values of 92.09% and 91.87% in leave-one-out cross validation and fivefold cross validation, respectively. Moreover, in case studies on esophageal neoplasms, hepatocellular carcinoma, and breast neoplasms, 90%, 98%, and 96% of the top 50 predictions, respectively, were manually confirmed by previous experimental studies.

Conclusion

MiRNAs have complex connections with many human diseases. In this study, we proposed a novel computational model to predict the underlying miRNA-disease associations. All results show that the proposed method is effective for miRNA–disease association prediction.

Background

MicroRNAs (miRNAs) are endogenous non-coding single-stranded RNA molecules that play important roles in eukaryotic gene expression through posttranscriptional regulation [1,2,3]. Functional studies indicate that miRNA plays a significant role in manifold biological processes, such as cell proliferation, stem cell maintenance, immune responses and so on [4,5,6]. Dysregulation of miRNA expression and function is reported in various diseases including cancer, metabolic disorders as well as neurological disorders [7]. Therefore, identifying disease-related miRNAs is important to treat, diagnose, and prevent human complex diseases [8, 9].

Generally, researchers use biological experimental methods such as quantitative reverse transcription, microarray analysis, or deep sequencing of small RNAs to explore miRNAs that are differentially expressed in a disease state. For example, Pan et al. used microarray analysis and found that the levels of miR-130a-3p, miR-424-5p, miR-574-5p, and miR-146a differed significantly between tuberculous meningitis patients and healthy controls [10]. However, experimental identification of disease-related miRNAs with existing techniques is expensive and time-consuming. Therefore, drawing on the vast amount of biological data about miRNAs, researchers have developed computational methods for predicting miRNA-disease associations [11,12,13,14,15,16,17,18,19,20,21], which can select the most promising miRNAs for further analysis and hence reduce the number of experiments.

For predicting disease-related miRNAs, many methods rely on the assumption that functionally similar miRNAs tend to be associated with phenotypically similar diseases, and vice versa. Xiao et al. proposed a method called GRNMF, which is based on graph regularized non-negative matrix factorization and discovers potential associations from the similarity and association perspectives of miRNAs and diseases [22]. Liu et al. proposed a method for predicting miRNA–disease associations by performing random walks on heterogeneous omics data [23]. You et al. presented the PBMDA prediction model, which constructs a heterogeneous graph consisting of three interlinked sub-graphs and performs a depth-first search on the heterogeneous network to infer disease-related miRNAs [24]. PBMDA integrates different types of heterogeneous biological datasets, so it can be applied to new diseases or miRNAs without known associated miRNAs or diseases. Subsequently, Chen et al. proposed a Hybrid Approach for MiRNA-Disease Association prediction (HAMDA) [25]. They considered network structure, information propagation, and node attribution, and used a hybrid graph-based recommendation algorithm to uncover disease-related miRNAs. In addition, Chen et al. devised a Graphlet Interaction-based approach to predict disease-related miRNAs (GIMDA) [26], in which graphlet interaction is used to analyze the complex relationships between two nodes in a graph. However, HAMDA and GIMDA are not applicable to predicting an association between a new miRNA and a new disease. Furthermore, Chen et al. developed Graph Regression for MiRNA-Disease Association prediction (GRMDA) [27]. Graph regression was performed synchronously in three latent spaces, using Singular Value Decomposition (SVD) and Partial Least-Squares (PLS) to extract important related attributes and filter out noise. However, choosing the parameters for SVD and PLS is difficult. More recently, Jiang et al. implemented an improved collaborative filtering-based method to infer miRNA-disease associations (ICFMDA) [28]. They improved the collaborative filtering algorithm by combining the similarity matrices and defined a significance measure, SIG, between pairs of diseases or miRNAs, so that disease-related miRNAs can be predicted even for new diseases without known associations.

In addition, several computational models use machine learning to uncover associations between miRNAs and diseases. Xu et al. introduced an approach based on the miRNA target–dysregulated network (MTDN) to prioritize novel disease miRNAs [29]. They applied a support vector machine classifier to miRNAs in the MTDN. However, the negative samples required by the classifier are difficult to obtain. To overcome this limitation, Chen et al. introduced a semi-supervised method named RLSMDA [30]. It was developed under the framework of regularized least squares and can predict miRNAs for diseases that have no known related miRNAs. Similarly, Luo et al. developed another semi-supervised method, KRLSM, based on Kronecker regularized least squares [31]. KRLSM integrates different omics data, combines the disease and miRNA spaces, and uses a semi-supervised regularized least squares classifier to predict disease-related miRNAs. However, this approach involves multiple parameters, and establishing their optimal values remains challenging. Chen et al. designed a method based on the restricted Boltzmann machine for predicting miRNA-disease associations [32]. This approach can also predict the association types of miRNA-disease pairs, but it is not applicable to a new disease with no known associated miRNAs. Furthermore, Chen et al. developed an effective method called HGIMDA [33], which calculates the probability of a disease-miRNA association by investigating all paths of length three in the constructed heterogeneous graph. Recently, Chen et al. used an Extreme Gradient Boosting Machine to uncover disease-related miRNAs in a method named EGBMMDA [34]. Using statistical measures, graph-theoretical measures, and matrix factorization, they constructed an informative feature vector for each miRNA-disease pair and used a decision tree model to predict disease-related miRNAs.

Although existing methods have made great contributions to uncovering disease-related miRNAs, they still have limitations. For example, many methods have difficulty extracting deep feature representations from multiple kinds of data. In this study, we propose a novel prediction method via hypergraph learning based on high-dimensionality features, which we refer to as HFHLMDA. Hypergraph learning, which can capture high-order relationships among samples, has been widely used in clustering, classification, and information retrieval tasks. In a hypergraph, an edge can connect more than two vertices, so it can encode relationships among more than two vertices. We construct high-dimensionality feature vectors for all miRNA-disease pairs and use the K-Nearest-Neighbor (KNN) method to form a hypergraph for predicting potential miRNA-disease associations. To demonstrate the effectiveness of our method, we apply leave-one-out cross validation (LOOCV) and fivefold cross validation to measure prediction performance. We compare our method with four state‐of‐the‐art methods, and the results indicate that it achieves better performance. In addition, case studies of three common diseases are carried out to further verify the reliability and robustness of HFHLMDA.

Methods

Human MiRNA-disease associations network

The human miRNA-disease associations used in this work come from HMDDv2.0 [35], which contains 5430 experimentally confirmed associations between 495 miRNAs and 383 diseases. We use an adjacency matrix A with 495 (nm) rows and 383 (nd) columns to describe the relation of each miRNA-disease pair. The element A(m(i), d(j)) is equal to 1 if miRNA m(i) is verified to be associated with disease d(j), and 0 otherwise. Thus, 5430 entries of matrix A are assigned 1 and the remaining entries are assigned 0. Our goal is to predict which of the unconfirmed miRNA-disease pairs are truly associated.
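As a minimal sketch of this data structure, the snippet below builds a binary adjacency matrix from a list of known pairs. The miRNA and disease names are hypothetical toy values, not entries from HMDD, and the helper name `build_adjacency` is an assumption made for illustration.

```python
import numpy as np

def build_adjacency(pairs, mirnas, diseases):
    """Return the nm x nd binary matrix A with A[i, j] = 1 for each known pair."""
    row = {m: i for i, m in enumerate(mirnas)}
    col = {d: j for j, d in enumerate(diseases)}
    A = np.zeros((len(mirnas), len(diseases)), dtype=int)
    for m, d in pairs:
        A[row[m], col[d]] = 1
    return A

# Toy data (hypothetical names, not taken from HMDD).
mirnas = ["miR-a", "miR-b", "miR-c"]
diseases = ["disease-x", "disease-y"]
pairs = [("miR-a", "disease-x"), ("miR-b", "disease-x"), ("miR-c", "disease-y")]
A = build_adjacency(pairs, mirnas, diseases)
```

With the real data, the same construction would give a 495 × 383 matrix containing 5430 ones.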

MiRNA similarity matrix

Wang et al. developed a method named MISIM for calculating the function similarity scores of miRNA [36]. Here, we directly downloaded the miRNA functional similarity scores from http://www.cuilab.cn/files/images/cuilab/misim.zip. Then, an adjacency matrix SM with 495 rows and 495 columns is built to denote the similarity of miRNAs, in which the larger the SM(m(i), m(j)) is, the more similar m(i) and m(j) are.

However, SM is sparse. A sparse matrix provides little effective information, which can seriously degrade the prediction performance of a computational model. We therefore also calculate the Gaussian interaction profile kernel similarity of miRNAs [37]. Specifically, a binary vector BV(m(i)), i.e. the ith row of matrix A, is taken as the interaction profile of miRNA m(i), representing the associations between m(i) and each disease. All known miRNA-disease associations in matrix A are used to calculate the similarity: two miRNAs are likely to be more similar if they share more disease associations. Thus, the Gaussian interaction profile kernel similarity GKM(m(i), m(j)) of miRNA m(i) and miRNA m(j) is defined as

$$GKM\left( {m\left( i \right),m\left( j \right)} \right) \, = {\exp}( - \gamma_{m} ||BV\left( {m\left( i \right)} \right) \, - BV\left( {m\left( j \right)} \right)||^{{2}} )$$
(1)

where γm is a parameter used to control the kernel bandwidth, which is set as

$$\gamma_{m} = \frac{1}{{\frac{1}{nm}\mathop \sum \nolimits_{i = 1}^{nm} ||BV\left( {m\left( i \right)} \right)||^{2} }}$$
(2)
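Eqs. (1) and (2) can be sketched in NumPy as follows. This is an illustrative implementation on a toy association matrix, not the authors' code; the function name `gaussian_profile_kernel` is an assumption, and the sketch assumes at least one profile is non-zero so that the bandwidth in Eq. (2) is defined.

```python
import numpy as np

def gaussian_profile_kernel(profiles):
    """Gaussian interaction profile kernel; one interaction profile per row."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = 1.0 / sq_norms.mean()                  # Eq. (2)
    # Squared Euclidean distance between every pair of rows.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)             # guard against round-off
    return np.exp(-gamma * sq_dist)                # Eq. (1)

# Toy association matrix: 3 miRNAs (rows) x 3 diseases (columns).
A_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
GKM = gaussian_profile_kernel(A_toy)
```

Here the squared norms of the rows are 2, 1 and 1, so γm = 0.75, and the first two miRNAs (which differ in one disease) get a similarity of exp(−0.75).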

By integrating SM and GKM, a new complete miRNA similarity matrix SM can be obtained as

$$SM\left( {m\left( i \right),m\left( j \right)} \right) \, = \left\{ {\begin{array}{*{20}l} {GKM\left( {m\left( i \right),{ }m\left( j \right)} \right) } \hfill & {if SM\left( {m\left( i \right),{ }m\left( j \right)} \right) = 0} \hfill \\ {\frac{{SM\left( {m\left( i \right),{ }m\left( j \right)} \right) + GKM\left( {m\left( i \right),{ }m\left( j \right)} \right)}}{2}} \hfill & {otherwise} \hfill \\ \end{array} } \right.$$
(3)
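The piecewise rule in Eq. (3) maps directly onto an element-wise selection. The sketch below is illustrative only; the function name `integrate_similarity` and the toy matrices are assumptions.

```python
import numpy as np

def integrate_similarity(S, GK):
    """Use GK where S is zero, otherwise the average of S and GK (Eq. 3)."""
    S = np.asarray(S, dtype=float)
    GK = np.asarray(GK, dtype=float)
    return np.where(S == 0, GK, (S + GK) / 2.0)

# Toy matrices: the functional similarity is missing (0) for the pair (0, 2).
SM_raw = np.array([[1.0, 0.6, 0.0],
                   [0.6, 1.0, 0.2],
                   [0.0, 0.2, 1.0]])
GKM_toy = np.array([[1.0, 0.4, 0.3],
                    [0.4, 1.0, 0.8],
                    [0.3, 0.8, 1.0]])
SM_full = integrate_similarity(SM_raw, GKM_toy)
```

The missing entry is filled with the kernel value 0.3, while the entry for the pair (0, 1) becomes (0.6 + 0.4) / 2 = 0.5.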

Disease similarity matrix

The association between different diseases can be represented by a directed acyclic graph (DAG), which consists of some nodes and links. Each node represents a disease while a link represents the association of two diseases. For a given disease D, DAG = (D, TD, ED), where TD represents its ancestor nodes and itself while ED is the set of corresponding edges. The contribution values of disease d(t) to the semantic value of disease d(i) can be calculated as follows:

$$D_{d\left( i \right)} (d(t)) \, = - log\left( {\frac{{the\, number\, of\, DAGs\, including d\left( t \right){ }}}{the\, number \,of\, diseases}} \right)$$
(4)
$$DV\left( {d\left( i \right)} \right) = \mathop \sum \limits_{{d\left( t \right) \in D\left( {d\left( i \right)} \right) }} D_{d\left( i \right)} \left( {d\left( t \right)} \right)$$
(5)

where D(d(i)) is the node set in DAG(d(i)) including node d(i) itself. Therefore, the semantic similarity between disease d(i) and d(j) can be defined as follows:

$$SD\left( {d\left( i \right),d\left( j \right)} \right) = \frac{{\mathop \sum \nolimits_{{d\left( t \right) \in D\left( {d\left( i \right)} \right) \cap D\left( {d\left( j \right)} \right)}} \left( {D_{d\left( i \right)} \left( {d\left( t \right)} \right) + D_{d\left( j \right)} \left( {d(t} \right))} \right)}}{{DV\left( {d\left( i \right)} \right) + DV\left( {d\left( j \right)} \right)}}$$
(6)
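To make Eqs. (4) to (6) concrete, the following sketch computes the semantic similarity on a tiny hypothetical DAG. The disease labels and the function name `semantic_similarity` are assumptions, every node is treated as a disease with its own DAG, and a similarity of 0 is returned when both semantic values are zero (a convention chosen here, not stated in the text).

```python
import math

def semantic_similarity(dag_nodes):
    """dag_nodes maps each disease to the node set of its DAG (ancestors plus itself)."""
    n = len(dag_nodes)
    # Eq. (4): a term that appears in many DAGs is less specific and contributes less.
    count = {}
    for nodes in dag_nodes.values():
        for t in nodes:
            count[t] = count.get(t, 0) + 1
    contrib = {t: -math.log(c / n) for t, c in count.items()}
    # Eq. (5): semantic value of each disease.
    dv = {d: sum(contrib[t] for t in nodes) for d, nodes in dag_nodes.items()}
    # Eq. (6): shared terms contribute once for each of the two diseases.
    sd = {}
    for di, ni in dag_nodes.items():
        for dj, nj in dag_nodes.items():
            denom = dv[di] + dv[dj]
            shared = sum(2.0 * contrib[t] for t in ni & nj)
            sd[di, dj] = shared / denom if denom > 0 else 0.0
    return sd

# Toy hierarchy: R is the root, A and B are children of R, C is a child of A.
dags = {"R": {"R"}, "A": {"A", "R"}, "B": {"B", "R"}, "C": {"C", "A", "R"}}
SD_toy = semantic_similarity(dags)
```

In this toy hierarchy A and C share the specific term A and obtain a similarity of 0.5, whereas A and B share only the root, which appears in every DAG and therefore contributes nothing.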

Similarly, we also calculate the Gaussian interaction profile kernel similarity GKD for diseases by the following formulas

$$GKD\left( {d\left( i \right),d\left( j \right)} \right) \, = {\exp}( - \gamma_{d} ||BV\left( {d\left( i \right)} \right) \, - BV\left( {d\left( j \right)} \right)||^{{2}} )$$
(7)
$$\gamma_{d} = \frac{1}{{\frac{1}{nd}\mathop \sum \nolimits_{i = 1}^{nd} ||BV\left( {d\left( i \right)} \right)||^{2} }}$$
(8)

where BV(d(i)) and BV(d(j)) denote the ith column and the j-th column of A. At last, the disease similarity matrix SD is obtained by

$$SD\left( {d\left( i \right),d\left( j \right)} \right) \, = \left\{ {\begin{array}{*{20}l} {GKD\left( {d\left( i \right),{ }d\left( j \right)} \right)} \hfill & {if\,SD\left( {d\left( i \right),{ }d\left( j \right)} \right) = 0} \hfill \\ {\frac{{GKD\left( {d\left( i \right),{ }d\left( j \right)} \right) + SD\left( {d\left( i \right),{ }d\left( j \right)} \right)}}{2}} \hfill & {otherwise} \hfill \\ \end{array} } \right.$$
(9)

HFHLMDA

The HFHLMDA model can be separated into three steps (see Fig. 1). The first is feature factor construction, in which a feature factor x is built for each miRNA-disease pair from the corresponding rows of SM and SD. The second is hypergraph construction, in which a hypergraph G is constructed to model the relationships among these feature vectors. The third is hypergraph learning, which learns the projection matrix P that maps the original feature x to the relevance score S = xP, so that it can be used to predict the association of an unknown miRNA-disease pair xunk.

Fig. 1 Flowchart of potential miRNA–disease association prediction based on HFHLMDA

Feature factor construction

Based on the biological observation that functionally similar miRNAs tend to be associated with similar diseases, and vice versa, the topological information of the miRNA and disease similarity networks can be used directly to construct the feature factor.

For each miRNA, there are 495 similarity scores. We use similarity scores as features to represent each miRNA by a 495-dimensional feature vector. For example, we represent miRNA m(i) by a feature vector, SM(m(i)) = (m1, m2, …, m495), where SM(m(i)) is the ith row vector of SM and represents the similarities between m(i) and all the miRNAs.

For each disease, we obtain a 383-dimensional feature vector in the same way, SD(d(j)) = (d1, d2, …, d383), where SD(d(j)) is the jth row of matrix SD. Therefore, each miRNA-disease pair can be described by an 878-dimensional vector x = (SM(m(i)), SD(d(j))). We treat (SM(m(i)), SD(d(j))) as a positive sample if miRNA m(i) is associated with disease d(j), and as a negative sample otherwise. To construct a balanced dataset, the training set contains the 5430 positive samples and an equal number of negative samples randomly selected from the pool of unknown associations. Some unconfirmed miRNA-disease pairs may in fact be associated, but from a probabilistic perspective it is acceptable to use them as negative samples, because the pairs selected as negative samples account for only 5430 ÷ (495 × 383) ≈ 2.86% of all miRNA-disease pairs, which is negligible [38].
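The construction of the balanced training set can be sketched as below. This is an illustration under stated assumptions: the function name `build_training_set`, the fixed random seed, and the toy matrices are hypothetical, and the sketch assumes there are at least as many unknown pairs as positive pairs.

```python
import numpy as np

def build_training_set(A, SM, SD, seed=0):
    """Concatenate similarity rows into pair features and sample balanced negatives."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(A == 1)
    unknown = np.argwhere(A == 0)
    # Draw as many unknown pairs as there are positives, without replacement.
    neg = unknown[rng.choice(len(unknown), size=len(pos), replace=False)]
    pairs = np.vstack([pos, neg])
    # x = (SM(m(i)), SD(d(j))): row i of SM followed by row j of SD.
    X = np.hstack([SM[pairs[:, 0]], SD[pairs[:, 1]]])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    return X, y, pairs

# Toy example: 3 miRNAs, 2 diseases, 2 known associations.
A_toy = np.array([[1, 0],
                  [0, 0],
                  [0, 1]])
X, y, pairs = build_training_set(A_toy, np.eye(3), np.eye(2))

# Share of all pairs that is drawn as negatives in the setting described above.
negative_fraction = 5430 / (495 * 383)
```

With the full matrices, each row of `X` would have 495 + 383 = 878 entries, and `negative_fraction` evaluates to roughly 0.0286.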

Hypergraph construction

We first briefly introduce hypergraph learning theory. As a generalization of a graph, a hypergraph represents the structure of data by measuring the similarity between groups of points. Unlike an edge in a simple graph, an edge in a hypergraph (a hyperedge) can connect three or more vertices, so it can model high-order relations among vertices, and the influence of each hyperedge can be assessed by properly estimating its weight. Modeling the high-order relationships among objects can improve prediction performance. Moreover, the quality of the hypergraph structure plays an important role in data modeling: a well-constructed hypergraph represents the data correlation accurately and leads to better performance.

A hypergraph is defined as G = (V, E, w), where V is a set of vertices, E is a set of hyperedges and each hyperedge e is given a positive weight w(e). The hypergraph G can be denoted by a |V| ×|E| incidence matrix H, in which each entry is defined by

$$h(v,e) = \left\{ {\begin{array}{*{20}l} 1 \hfill & {if\, v \in e} \hfill \\ 0 \hfill & { if\, v \notin e} \hfill \\ \end{array} } \right.$$
(10)

The degrees of a vertex v ∈ V and a hyperedge e ∈ E are respectively defined as:

$$d\left( v \right) \, = \mathop \sum \limits_{e \in E} w\left( e \right)h\left( {v,e} \right)$$
(11)
$$\delta \left( e \right) \, = \mathop \sum \limits_{{{ }v \in V}} h\left( {v,e} \right)$$
(12)

Accordingly, denote Dv and De as two diagonal matrices of the vertex degrees and the hyperedge degrees, respectively.
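As a small worked example of Eqs. (10) to (12), consider a toy hypergraph with four vertices and two hyperedges. The incidence matrix and the weights below are made up for illustration.

```python
import numpy as np

# Toy hypergraph: 4 vertices, 2 hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)   # Eq. (10): h(v, e) = 1 if v is in e
w = np.array([0.5, 0.5])              # hyperedge weights w(e)

d_v = H @ w                           # Eq. (11): vertex degrees
delta_e = H.sum(axis=0)               # Eq. (12): hyperedge degrees
Dv = np.diag(d_v)
De = np.diag(delta_e)
```

Vertex 2 belongs to both hyperedges, so its degree is 0.5 + 0.5 = 1.0, while the hyperedge degrees are simply the sizes 3 and 2.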

Zhou et al. proposed a regularization framework on hypergraph [39], which is defined as

$${\text{arg min}}_{f} \{ \lambda R_{{{\text{emp}}}} \left( f \right) + \Omega \left( f \right)\}$$
(13)

where f is the to-be-learned function, Ω(f) is a regularizer on the hypergraph, Remp(f) is an empirical loss, and λ > 0 is the tradeoff parameter. Usually, the empirical loss Remp(f) is defined as

$$R_{{{\text{emp}}}} \left( f \right) = \, ||f - Y||^{{2}}$$
(14)

where Y is the label matrix of samples. The regularizer on the hypergraph is defined by

$$\varOmega \left( f \right) = \frac{1}{2}\mathop \sum \limits_{e \in E} \mathop \sum \limits_{u,v \in V} \frac{{w\left( e \right)h\left( {u,e} \right)h\left( {v,e} \right)}}{{\delta \left( e \right)}}\left( {\frac{f\left( u \right)}{{\sqrt {d\left( u \right)} }} - \frac{f\left( v \right)}{{\sqrt {d\left( v \right)} }}} \right)^{2}$$
(15)

Let \(\Theta = D_{v}^{-1/2} H W D_{e}^{-1} H^{\text{T}} D_{v}^{-1/2}\), where W is the diagonal matrix of hyperedge weights; the normalized cost function can then be written as

$$\varOmega \left( f \right) \, = f^{{\text{T}}} \varDelta f$$
(16)

where Δ = I − Θ, which is a positive semi-definite matrix.
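A compact way to form Δ from an incidence matrix and hyperedge weights is sketched below. The function name `hypergraph_laplacian` is an assumption, and the sketch assumes every vertex lies in at least one hyperedge and no hyperedge is empty, so that all degrees are positive.

```python
import numpy as np

def hypergraph_laplacian(H, w):
    """Return Delta = I - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    d_v = H @ w                       # vertex degrees
    delta_e = H.sum(axis=0)           # hyperedge degrees
    dv_inv_sqrt = 1.0 / np.sqrt(d_v)
    theta = (dv_inv_sqrt[:, None] * H) @ np.diag(w / delta_e) @ (H.T * dv_inv_sqrt[None, :])
    return np.eye(H.shape[0]) - theta

# Toy hypergraph: hyperedges {0, 1, 2} and {2, 3} with equal weights.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
w = np.array([0.5, 0.5])
L = hypergraph_laplacian(H, w)
```

One quick sanity check is that the vector of square-rooted vertex degrees lies in the null space of Δ, and that all eigenvalues are non-negative.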

In this study, given a set of training samples \(\{ x_{i} \mid i = 1, \ldots, n\} \subset {\mathbb{R}}^{878}\), the data matrix \(X = [x_{1}, \ldots, x_{i}, \ldots, x_{n}]^{\text{T}} \in {\mathbb{R}}^{n \times 878}\) contains the n samples in its rows, and the corresponding label matrix is \(Y = [y_{1}, y_{2}, \ldots, y_{l}] \in {\mathbb{R}}^{n \times l}\), where yi is the label vector of the i-th class. A hypergraph of miRNA-disease pairs \({\mathcal{G}} = \left( {{\mathcal{V}},{ \mathcal{E}},{ \mathcal{W}}} \right)\) is constructed, and its hyperedges are generated with the KNN algorithm. Concretely, for each vertex v, we search for its k nearest neighbors and use them, together with v, to form a hyperedge e(v). We empirically initialize k as 15. An illustration of the hyperedge generation process is shown in Fig. 2. Moreover, the diagonal matrix \({\mathcal{W}}\) denotes the weights of the hyperedges. All hyperedges are initialized with an equal weight, i.e., w(e) = 1/ne, where ne is the number of hyperedges.
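The KNN hyperedge generation can be sketched as follows. Two points are assumptions rather than details given in the text: Euclidean distance is used as the neighbor metric, and each hyperedge contains the vertex itself plus its k nearest neighbors. The function name `knn_hypergraph` is also hypothetical, and k must be smaller than the number of samples.

```python
import numpy as np

def knn_hypergraph(X, k):
    """Build one hyperedge per vertex from the vertex and its k nearest neighbors."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(dist, -np.inf)                    # each vertex sorts first in its own row
    nearest = np.argsort(dist, axis=1)[:, :k + 1]      # the vertex plus its k neighbors
    H = np.zeros((n, n))
    for e, members in enumerate(nearest):
        H[members, e] = 1.0                            # column e is the hyperedge e(v_e)
    w = np.full(n, 1.0 / n)                            # equal initial weights, w(e) = 1/ne
    return H, w

# Toy 1-D samples forming two loose groups.
X_toy = np.array([[0.0], [1.0], [2.5], [10.0], [11.0]])
H, w = knn_hypergraph(X_toy, k=1)
```

With k = 1 every hyperedge in this toy example has exactly two members; for instance, the hyperedge of vertex 3 is {3, 4}.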

Fig. 2 Intuitive illustration of KNN hyperedge generation

Hypergraph learning

Hypergraph learning aims to learn a regularized projection that discriminates between categories. Following Zhang et al. [40], the cost function F for learning the projection matrix P can be formulated as:

$$F = \, \{ \varOmega (P) + \, \lambda R_{emp} (P) + \mu \varPhi (P)\}$$
(17)

where λ and μ are positive parameters, which we empirically set to \(10^{1}\) and \(10^{0}\), respectively, as these values gave the best performance. Specifically, the hypergraph Laplacian regularizer Ω(P) is calculated as

$$\begin{aligned} \varOmega (P) & = \frac{1}{2}\mathop \sum \limits_{k = 1}^{l} \mathop \sum \limits_{e \in E} \mathop \sum \limits_{u,v \in V} \frac{{W\left( e \right)H\left( {u,e} \right)H\left( {v,e} \right)}}{\delta \left( e \right)}\left( {\frac{{\left( {XP} \right)\left( {u,k} \right)}}{{\sqrt {d\left( u \right)} }} - \frac{{\left( {XP} \right)\left( {v,k} \right)}}{{\sqrt {d\left( v \right)} }}} \right)^{2} \\ & = tr(P^{T} X^{T} \varDelta XP) \\ \end{aligned}$$
(18)

where function tr(·) returns the trace of matrix. The empirical loss term Remp (P) is defined as

$$R_{emp} (P) \, = \left| {\left| {XP - Y} \right|} \right|^{{2}}$$
(19)

Φ(P) is an ℓ2-norm regularizer on P that helps to avoid over-fitting, defined as:

$$\varPhi (P) = \, \left| {\left| P \right|} \right|^{{2}}$$
(20)

Consequently, Eq. (17) can be reformed as:

$${\text{arg min}}_{P} \left\{ {tr\left( {P^{{\text{T}}} X^{{\text{T}}} \varDelta XP} \right) + \lambda ||XP - Y||^{2} + \mu ||P||^{2} } \right\}$$
(21)
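As a brief intermediate step, and assuming the norms in Eq. (21) are Frobenius norms and Δ is symmetric, setting the gradient of the objective with respect to P to zero gives the normal equations:

$$\begin{aligned} \frac{\partial F}{\partial P} & = 2X^{{\text{T}}} \varDelta XP + 2\lambda X^{{\text{T}}} \left( {XP - Y} \right) + 2\mu P = 0 \\ \Rightarrow \quad \left( {X^{{\text{T}}} \varDelta X + \lambda X^{{\text{T}}} X + \mu I} \right)P & = \lambda X^{{\text{T}}} Y \\ \end{aligned}$$

Because Δ is positive semi-definite and μ > 0, the matrix on the left-hand side is positive definite and therefore invertible, so the system has a unique solution.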

This is a standard least-squares problem that can be solved efficiently in closed form; its solution is:

$$P = \lambda (X^{T} \varDelta X + \lambda X^{{\text{T}}} X + \mu I)^{{ - {1}}} X^{{\text{T}}} Y$$
(22)

where I is an identity matrix. Based on the learned P, the relevance score of the unknown miRNA-disease pair xunk can be obtained by

$${\text{S}}\left( {x^{unk} } \right) = x^{unk} P$$
(23)
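Eqs. (22) and (23) can be sketched in a few lines of NumPy. This is an illustration on random toy data, not the authors' implementation: the function names, the single-column 0/1 label matrix, and the toy Laplacian (that of a hypergraph with one hyperedge covering all vertices) are assumptions, while the defaults λ = 10 and μ = 1 follow the values stated above.

```python
import numpy as np

def learn_projection(X, Y, laplacian, lam=10.0, mu=1.0):
    """Closed-form minimiser of Eq. (21), i.e. Eq. (22)."""
    d = X.shape[1]
    lhs = X.T @ laplacian @ X + lam * (X.T @ X) + mu * np.eye(d)
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(lhs, lam * (X.T @ Y))

def relevance_score(x_unk, P):
    """Eq. (23): project the feature vector of an unknown pair."""
    return x_unk @ P

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.random((n, d))                            # toy features (878-dimensional in the paper)
Y = np.array([[1.0], [1.0], [1.0], [0.0], [0.0], [0.0]])
L = np.eye(n) - np.full((n, n), 1.0 / n)          # toy positive semi-definite Laplacian
P = learn_projection(X, Y, L)
score = relevance_score(rng.random(d), P)
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical design choice; for 878-dimensional features the system matrix is only 878 × 878, so the solve is inexpensive.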

Results

Effect of parameters on the performance of HFHLMDA

In this work, we used the KNN algorithm to generate hyperedges, which involves one parameter, k, the number of nearest neighbors of each vertex (miRNA-disease pair). In the hypergraph learning section of the Methods, we defined two parameters, λ and μ, to balance the terms in Eq. (17); the values of λ and μ were taken from \(10^{-2}\), \(10^{-1}\), \(10^{0}\), \(10^{1}\), and \(10^{2}\). We conducted a series of experiments to assess the effects of these parameters. The experimental results are shown in Figs. 3 and 4. Figure 3 shows that regardless of how k changes, the AUC of fivefold cross validation stays around 0.9187. Thus, for efficiency, we set k = 15. Furthermore, Fig. 4 shows the prediction performance of HFHLMDA with different values of λ and μ. HFHLMDA obtains the best prediction performance when λ is set to \(10^{1}\) and μ is set to \(10^{0}\).

Fig. 3 ROC curve and AUC with different values for parameter k

Fig. 4 AUC with different values for parameters λ and μ

Performance evaluation

Based on the known miRNA–disease associations in the HMDDv2.0 database, two validation schemes were used to evaluate the performance of HFHLMDA: LOOCV and fivefold cross validation. We selected four classical computational methods, EGBMMDA [34], ICFMDA [28], RLSMDA [30], and SACMDA [41], to compare with HFHLMDA in cross validation. Specifically, LOOCV selected each known miRNA-disease association in turn as the test sample, and the remaining associations were used as training samples. All unknown associations were used as candidate samples. Because the Gaussian interaction profile kernel similarity depends on known miRNA-disease associations, the entry of matrix A corresponding to the test sample was set to 0. The predicted score of the test sample was ranked relative to the scores of the candidate samples. Each rank was used in turn as a threshold; if the rank of the test sample was above a given threshold, the prediction was counted as successful. By changing the threshold, we could calculate the corresponding true positive rate (TPR) and false positive rate (FPR). The receiver-operating characteristics (ROC) curve could then be drawn by plotting TPR against FPR. The area under the ROC curve (AUC) was used to evaluate the overall prediction performance. Figure 5 shows the global LOOCV ROC curves for HFHLMDA and the other methods. HFHLMDA, EGBMMDA, ICFMDA, RLSMDA and SACMDA obtained AUCs of 0.9209, 0.9123, 0.9067, 0.8426 and 0.8770, respectively. HFHLMDA achieved the best prediction performance.
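The threshold sweep that produces the ROC curve and its AUC can be sketched in plain Python as below. This is a simplified illustration on a single pooled list of scores with made-up values; it is not the exact LOOCV protocol, the function name `roc_auc` is an assumption, and tied scores are not treated specially.

```python
def roc_auc(scores, labels):
    """Sweep a threshold over the ranked scores and integrate TPR against FPR."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))   # one (FPR, TPR) point per threshold
    # Trapezoidal rule over consecutive (FPR, TPR) points.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy scores: label 1 marks held-out known associations, 0 marks candidates.
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

In this toy case five of the six positive-candidate pairs are ordered correctly, so the AUC is 5/6.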

Fig. 5 Performance comparisons between HFHLMDA and four classical models in terms of ROC curve and AUC based on LOOCV

As for fivefold cross validation, in order to make the validation more accurate, we repeated fivefold cross validation procedure 100 times. The average AUC values of the five methods (HFHLMDA, EGBMMDA, ICFMDA, RLSMDA, SACMDA) were 0.9187(± 0.0009), 0.9048(± 0.0012), 0.9045(± 0.0008), 0.8569(± 0.0020) and 0.8767(± 0.0011), respectively (see Fig. 6). In summary, under the same dataset, our model outperformed other competitive methods.

Fig. 6 Performance comparisons between HFHLMDA and four classical models in terms of ROC curve and AUC based on fivefold cross validation

Case studies

Case studies were conducted to further verify the capability of HFHLMDA to predict miRNA-disease associations. We carried out three different kinds of case studies. In the first, we used HFHLMDA to predict potential disease-miRNA associations based on the known associations in the HMDD v2.0 database. The top 50 miRNAs for the investigated disease, ranked by predicted score, were then verified against two other well-known miRNA-disease association databases, dbDEMC [42] and miR2Disease [43]. In the second, we simulated the situation in which HFHLMDA is applied to a disease without known miRNA associations. More concretely, we removed the known miRNA associations of the disease of interest and then ran HFHLMDA on the resulting association records. The prediction results were again verified with other databases. The final case study investigated the robustness of HFHLMDA's prediction performance: we evaluated the model on HMDDv1.0, a smaller and earlier version of the database [44].

Esophageal cancer (EC) is one of the most common cancers worldwide, and its 5-year survival rate is about 20% [45]. One study indicates that miR-130b plays an oncogenic role in esophageal squamous cell carcinoma cells by repressing phosphatase and tensin homolog expression and Akt phosphorylation [46]. Specific and sensitive biomarkers for the diagnosis and targeted therapy of EC are therefore urgently needed. In the first kind of case study, 10 of the top 10, 28 of the top 30, and 45 of the top 50 predicted esophageal neoplasm-related miRNAs were confirmed by dbDEMC (see Table 1).

Table 1 The top 50 predicted miRNAs associated with esophageal cancer

Hepatocellular carcinoma (HC) is a complex polygenetic disease ascribed to the interactions between genetic predisposition and environmental factors [47]. The discovery of vital targets for genetic therapy is of great clinical significance for improving the overall treatment of HC. For example, miR-122, the let-7 family, and miR-101 are down-regulated in HC, suggesting that they are potential tumor suppressors of HC. miR-221 and miR-222 are up-regulated in HC and may act as oncogenic miRNAs in hepatocarcinogenesis [48]. We took hepatocellular carcinoma as the second kind of case study. In the end, 49 of the top 50 miRNAs were experimentally confirmed by HMDD v2.0, dbDEMC and miR2Disease (see Table 2).

Table 2 The top 50 predicted miRNAs associated with hepatocellular carcinoma

Breast neoplasms are the most common malignancy in women, accounting for more than 40,000 deaths each year [49]. Data show that the number of affected people is climbing, and one forecast estimates that there will be nearly 3.2 million new patients per year by 2050 [50]. In breast cancer, approximately one-fifth of metastatic patients survive 5 years [51]. Clinical experiments have shown that many miRNAs are associated with breast neoplasms, such as mir‐155 and mir‐21, both of which can lead to breast neoplasm tumorigenesis or metastasis [52]. We took breast neoplasms as the last kind of case study, in which predictions were made with HFHLMDA using the HMDDv1.0 database. We then verified the predicted breast neoplasm-related miRNAs in other databases. In the end, 48 of the top 50 miRNAs were experimentally confirmed by HMDD v2.0, dbDEMC and miR2Disease (see Table 3).

Table 3 The top 50 predicted miRNAs associated with breast neoplasms

The aforementioned case studies indicate that HFHLMDA has good prediction performance. HFHLMDA can efficiently predict disease-related miRNAs based on known miRNA-disease associations, disease semantic similarity and miRNA functional similarity, and it can also make predictions for a disease without known associations.

Discussion

In this work, we developed a new computational model based on hypergraph learning to predict potential miRNA‐disease associations. Several important factors contribute to the performance of our model. The first is the use of high-dimensionality features. Based on the assumption that functionally similar miRNAs tend to be associated with phenotypically similar diseases, we use the miRNA and disease similarity scores directly as a feature factor with a dimension of up to 878, which contains all the similarity information about the miRNA and the disease. The second is that a hypergraph is well suited to representing local group information and high-order relationships in data, and can therefore represent the complex relationships among miRNA-disease pairs. Simple-graph learning methods consider only the pair-wise relationship between two samples and ignore higher-order relationships, whereas hypergraph learning aims to capture the relationships among several samples at once. Hypergraph learning is a kind of graph clustering algorithm, and the process of graph clustering is in effect the optimization of a graph partition. The purpose of the optimization is to reduce the similarity between sub-graphs and increase the similarity within sub-graphs. Hypergraph-based models have proven beneficial for a variety of classification and clustering tasks, and we think they can also be applied to other fields of bioinformatics, such as drug-disease associations [53] and miRNA–drug interactions [54].

Despite the practicability and efficiency of HFHLMDA, it still has some limitations. Since our method is based on machine learning techniques, negative samples are required during the training process. However, experimentally confirmed negative samples are difficult to obtain. To address this issue, we randomly selected a subset of unknown miRNA–disease associations as negative instances. In addition, once the hypergraph has been constructed, it never changes during the learning process, which results in a static hypergraph structure learning mechanism. It is not easy to guarantee that the generated hypergraph structure is optimal and suitable for all applications. In future work, it will be necessary to investigate hypergraph structure optimization, leading to a dynamic hypergraph structure learning scheme.

Conclusion

Increasing evidence indicates that aberrant expression of miRNAs is closely related to the occurrence and development of complex human diseases. Understanding the underlying mechanisms of miRNAs in diseases is becoming an urgent problem worldwide. Compared with traditional experimental methods, computational models developed for processing heterogeneous biological big data are more efficient and convenient. To predict potentially disease-related miRNAs, we proposed a hypergraph learning method called HFHLMDA. Both cross-validation and case studies demonstrated the effectiveness of HFHLMDA in predicting potential miRNA-disease associations.

Availability of data and materials

The datasets used in this study are provided by Li et al. [35]. The data can be downloaded from http://www.cuilab.cn/hmdd/ or requested from the authors.

Abbreviations

KNN: K-Nearest-Neighbor

HMDD: Human microRNA disease database

dbDEMC: Database of differentially expressed miRNAs in human cancers

ROC: Receiver operating characteristics

AUC: The area under the ROC curve

LOOCV: Leave-one-out cross validation

References

1. Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215–33.
2. Kye MJ, Gonçalves ICG. The role of miRNA in motor neuron disease. Front Cell Neurosci. 2014;8:15.
3. Adams BD, Kasinski AL, Slack FJ. Aberrant regulation and function of microRNAs in cancer. Curr Biol. 2014;24(16):R762–76.
4. Cheng AM, Byrom MW, Shelton J, Ford LP. Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res. 2005;33(4):1290–7.
5. Karp X, Ambros V. Encountering microRNAs in cell fate signaling. Science. 2005;310(5752):1288–9.
6. Shivdasani RA. MicroRNAs: regulators of gene expression and cell differentiation. Blood. 2006;108(12):3646–53.
7. Sayed D, Abdellatif M. MicroRNAs in development and disease. Physiol Rev. 2011;91(3):827–87.
8. Tricoli JV, Jacobson JW. MicroRNA: potential for cancer detection, diagnosis, and prognosis. Cancer Res. 2007;67(10):4553–5.
9. Cho WCS. MicroRNAs: potential biomarkers for cancer diagnosis, prognosis and targets for therapy. Int J Biochem Cell Biol. 2010;42(8):1273–81.
10. Pan LP, Liu F, Zhang JL, et al. Genome-wide miRNA analysis identifies potential biomarkers in distinguishing tuberculous and viral meningitis. Front Cell Infect Microbiol. 2019;9:323.
11. Jiang QH, Hao YY, Wang GH, Juan LR, Zhang TJ, et al. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol. 2010;4(Suppl 1):S2.
12. Chen X, Liu MX, Yan GY. RWRMDA: predicting novel human microRNA-disease associations. Mol BioSyst. 2012;8(10):2792–8.
13. Shi HB, Xu J, Zhang GD, Xu LD, Li CQ, Wang L, et al. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol. 2013;7(1):101.
14. Zhao XM, Liu KQ, Zhu G, et al. Identifying cancer-related microRNAs based on gene expression data. Bioinformatics. 2015;31(8):1226–34.
15. Qin GM, Li RY, Zhao XM. Identifying disease associated miRNAs based on protein domains. IEEE/ACM Trans Comput Biol Bioinform. 2016;13(6):1027–35.
16. Chen X, Huang L. LRSSLMDA: Laplacian regularized sparse subspace learning for MiRNA-disease association prediction. PLoS Comput Biol. 2017;13(12):e1005912.
17. Chen X, Xie D, Wang L, Zhao Q, You ZH, Liu H. BNPMDA: bipartite network projection for MiRNA-disease association prediction. Bioinformatics. 2018;34(18):3178–86.
18. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256–65.
19. Chen X, Yin J, Qu J, Huang L. MDHGI: matrix decomposition and heterogeneous graph inference for miRNA-disease association prediction. PLoS Comput Biol. 2018;14(8):e1006418.
20. Chen X, Zhu CC, Yin J. Ensemble of decision tree reveals potential miRNA-disease associations. PLoS Comput Biol. 2019;15(7):e1007209.
21. Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2019;20(2):515–39.
22. Xiao Q, Luo JW, Liang C, Cai J, Ding PJ. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–48.
23. Liu Y, Zeng X, He Z, Zou Q. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(4):905–15.
24. You ZH, Huang ZA, Zhu Z, et al. PBMDA: a novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(3):e1005455.
25. Chen X, Niu YW, Wang GH, et al. HAMDA: hybrid approach for MiRNA-disease association prediction. J Biomed Inform. 2017;76:50–8.
26. Chen X, Guan NN, Li JQ, et al. GIMDA: graphlet interaction-based MiRNA-disease association prediction. J Cell Mol Med. 2018;22(3):1548–61.
27. Chen X, Yang JR, Guan NN, Li JQ. GRMDA: graph regression for MiRNA-disease association prediction. Front Physiol. 2018;9:92.
28. Jiang YD, Liu BT, Yu LH, et al. Predict MiRNA-disease association with collaborative filtering. Neuroinformatics. 2018;16:363–72.
29. Xu J, Li CX, Lv JY, et al. Prioritizing candidate disease miRNAs by topological features in the miRNA target-dysregulated network: case study of prostate cancer. Mol Cancer Ther. 2011;10(10):1857–66.
30. Chen X, Yan GY. Semi-supervised learning for potential human microRNA-disease associations inference. Sci Rep. 2014;4:5501.
31. Luo JW, Xiao Q, Liang C, Ding PJ. Predicting microRNA-disease associations using Kronecker regularized least squares based on heterogeneous omics data. IEEE Access. 2017;5:2503–13.
32. Chen X, Yan CC, Zhang X, Li Z, Deng L, et al. RBMMMDA: predicting multiple types of disease-microRNA associations. Sci Rep. 2015;5:13877.
33. Chen X, Yan CC, Zhang X, You ZH, et al. HGIMDA: heterogeneous graph inference for miRNA-disease association prediction. Oncotarget. 2016;7(40):65257–69.
34. Chen X, Huang L, Xie D, Zhao Q. EGBMMDA: extreme gradient boosting machine for MiRNA-disease association prediction. Cell Death Dis. 2018;9(1):3.
35. Li Y, Qiu CX, Tu J, Geng B, Yang JC, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2014;42:D1070–4.
36. Wang D, Wang J, Lu M, Song F, Cui QH. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–50.
37. Van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27(21):3036–43.
38. Wang L, You ZH, Huang YA, Huang DS, Chan KCC. An efficient approach based on multi-sources information to predict CircRNA-disease associations using deep convolutional neural network. Bioinformatics. 2020;36(13):4038–46.
39. Zhou DY, Huang JY, Schölkopf B. Learning with hypergraphs: clustering, classification, and embedding. Adv Neural Inf Process Syst. 2006;19:1601–8.
40. Zhang ZZ, Liu HJ, Zhao XB, Ji RR, Gao Y. Inductive multi-hypergraph learning and its application on view-based 3D object classification. IEEE Trans Image Process. 2018;27(12):5957–68.
41. Shao BY, Liu BT, Yan CG. SACMDA: MiRNA-disease association prediction with short acyclic connections in heterogeneous graph. Neuroinformatics. 2018;16:373–82.
42. Yang Z, Ren F, Liu CN, He SM, Sun G, Gao Q, et al. dbDEMC: a database of differentially expressed miRNAs in human cancers. BMC Genomics. 2010;11(Suppl 4):S5.
43. Jiang QH, Wang YD, Hao YY, Juan LR, Teng MX, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2009;37(1):D98–104.
44. Lu M, Zhang QP, Deng M, Miao J, Guo YH, et al. An analysis of human microRNA and disease associations. PLoS ONE. 2008;3(10):e3420.
45. Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin. 2015;65:87–108.
46. Yu T, Cao R, Li S, et al. MiR-130b plays an oncogenic role by repressing PTEN expression in esophageal squamous cell carcinoma cells. BMC Cancer. 2015;15:29.
47. Nie X, Liu Y, Chen WD, Wang YD. Interplay of miRNAs and canonical Wnt signaling pathway in hepatocellular carcinoma. Front Pharmacol. 2018;9:657.
48. Saito Y, Suzuki H, Matsuura M, Sato A, Kasai Y, et al. MicroRNAs in hepatobiliary and pancreatic cancers. Front Genet. 2011;2:66.
49. Desantis CE, Fedewa SA, et al. Breast cancer statistics, 2015: convergence of incidence rates between black and white women. CA Cancer J Clin. 2016;66(1):31–42.
50. Gomella LG. Prostate cancer statistics: anything you want them to be. Can J Urol. 2017;24(1):8603–4.
51. Lee JH, Zhao XM, Yoon I, et al. Integrative analysis of mutational and transcriptional profiles reveals driver mutations of metastatic breast cancers. Cell Discov. 2016;2:16025.
52. Feber A, Xi L, Luketich JD, et al. MicroRNA expression profiles of esophageal cancer. J Thorac Cardiovasc Surg. 2008;135(2):255–60.
53. Yang K, Zhao X, Waxman D, Zhao XM. Predicting drug-disease associations with heterogeneous network embedding. Chaos. 2019;29(12):123109.
54. Xie WB, Yan H, Zhao XM. EmDL: extracting miRNA–drug interactions from literature. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(5):1722–8.

Acknowledgments

We would like to thank the editor and referees for their thoughtful and insightful comments.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 21 Supplement 1, 2021: Proceedings of the 2019 International Conference on Intelligent Computing (ICIC 2019): medical informatics and decision making. The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-21-supplement-1.

Funding

Publication costs are funded by the National Natural Science Foundation of China (Nos. 61873001, U19A2064, 61872220, 61672037, 61861146002, 11701318 and 61732012), the Key Project of Anhui Provincial Education Department (No. KJ2017ZD01), and the Xinjiang Autonomous Region University Research Program (XJEDU2019Y002). The funding bodies did not play any role in the design of the study or collection, analysis and interpretation of data or in writing the manuscript.

Author information

Contributions

JCN and CHZ supervised the entire project. YTW and QWW conceptualized and designed the study. YTW and ZG undertook data collection. YTW, QWW and ZG performed the data analysis. QWW drafted the initial version. JCN and CHZ revised the manuscript iteratively for important intellectual content. All authors edited the paper and gave final approval for the version to be published. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Jian-Cheng Ni or Chun-Hou Zheng.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Wang, YT., Wu, QW., Gao, Z. et al. MiRNA-disease association prediction via hypergraph learning based on high-dimensionality features. BMC Med Inform Decis Mak 21, 133 (2021). https://doi.org/10.1186/s12911-020-01320-w

Keywords

  • MicroRNA
  • Disease
  • MiRNA-disease association
  • K-nearest-neighbor
  • Hypergraph learning