Exploring potential circRNA biomarkers for cancers based on double-line heterogeneous graph representation learning
BMC Medical Informatics and Decision Making volume 24, Article number: 159 (2024)
Abstract
Background
Compared with time-consuming and labor-intensive biological validation in vitro or in vivo, computational models can rapidly provide high-quality, targeted candidates. Existing computational models are limited in how effectively they use sparse local structural information to predict circRNA-disease associations accurately. This study addresses this challenge with a proposed method, CDADGRL (Prediction of CircRNA-Disease Association based on Double-line Graph Representation Learning), which employs a deep learning framework leveraging graph networks and a dual-line representation model integrating graph node features.
Method
CDADGRL comprises several key steps. First, diverse biological information is integrated to compute integrated similarities among circRNAs and diseases, leading to the construction of a heterogeneous network specific to circRNA-disease associations. Second, circRNA and disease node features are derived using sparse autoencoders. Third, a graph convolutional neural network captures the local graph network structure from the circRNA-disease heterogeneous network and the node features. Fourth, node2vec performs depth-first sampling of the circRNA-disease heterogeneous network to capture the global graph network structure, addressing issues associated with sparse raw data. Finally, the fused local and global graph network structures are input into an extra trees classifier to identify potential circRNA-disease associations.
Results
The results, obtained through five-fold cross-validation on the circR2Disease dataset, demonstrate the superiority of CDADGRL, with an AUC of 0.9866 and an AUPR of 0.9897, compared with existing state-of-the-art models. Notably, the extra trees classifier employed in this model outperforms other machine learning classifiers.
Conclusion
Thus, CDADGRL is a promising method for reliably identifying circRNA-disease associations, potentially reducing the need for extensive traditional biological experiments. The source code and data for this study are available at https://github.com/zywait/CDADGRL.
Introduction
Circular RNAs (circRNAs) are a new type of non-coding RNA involved in the development of certain diseases, and they play an important role in gene expression and signaling pathways [1]. Compared with other non-coding RNAs, circRNAs have demonstrated better stability and integrity as disease biomarkers, thus offering great potential in tumor diagnosis [2, 3]. Gene expression and protein synthesis in cancer cells are also regulated by circRNAs [4]. Traditional biological validation for identifying associations between circRNAs and diseases is time-consuming and usually lacks specificity, despite its high accuracy [5]. Meanwhile, biological databases derived from traditional biological experiments and related literature increasingly provide a convenient basis for computational methods to identify circRNA-disease associations more efficiently and economically [6]. Existing computational methods for predicting circRNA-disease associations fall broadly into two categories: network computing-based models and machine learning-based models.
Network computing-based models
These models leverage circRNA (disease) similarity networks and known circRNA-disease associations to construct a heterogeneous network. Algorithms tailored to this network are then employed to predict potential associations. Lei et al. [7] proposed RWRKNN, which integrates random walk with restart (RWR) and k-nearest neighbors (KNN) to predict circRNA-disease associations. However, because RWRKNN relies heavily on prior information about circRNAs and diseases, it is less able to reveal relationships involving isolated diseases and new circRNAs. Li et al. [8] proposed DWNCPCDA, a method based on DeepWalk and Network Consistency Projection. Its key innovation was the adoption of DeepWalk, a network embedding method, to learn node embeddings in the network of known circRNA-disease associations. Zhang et al. [9] proposed a linear neighborhood label propagation method, CDLNLP, to predict circRNA-disease associations. The good performance of CDLNLP was attributed mainly to two factors: the use of linear neighbor similarity (LNS), which ensured basic effectiveness, and the use of only known and reliable circRNA-disease associations as prior information. However, CDLNLP also could not be applied to associations involving new circRNAs or isolated diseases.
Machine learning-based models
These models use circRNA (disease) similarity networks and known circRNA-disease associations to train supervised or unsupervised learning algorithms, which iteratively optimize their internal parameters to extract latent features from the circRNA and disease data. Lan et al. [10] proposed KGANCDA, a computational method that predicts circRNA-disease associations based on a knowledge graph attention network. CircRNA-disease knowledge graphs were constructed by collecting multiple kinds of relationship data between different types of nodes (circRNAs, diseases, miRNAs and lncRNAs). Embeddings of each entity in the knowledge graphs were obtained with an attention network that distinguishes the importance of information from different neighbors. Besides low-order neighbor information, KGANCDA could also capture high-order neighbor information from multi-source associations to alleviate the sparsity of the raw data. Ma et al. [11] proposed CRPGCN, which predicts circRNA-disease associations using a Graph Convolutional Network (GCN) combined with Random Walk with Restart (RWR) and Principal Component Analysis (PCA). RWR was used to calculate the similarity between nodes, after which PCA was used to reduce dimensionality and extract features, strengthening the association between circRNAs and diseases. However, CRPGCN produced biased results because some data were isolated during data fusion. Zheng et al. [12] introduced iCDACGR, which identifies circRNA-disease associations by leveraging Chaos Game Representation (CGR). By incorporating sequence information and quantifying nonlinear relationships, iCDACGR addressed the limitation of model coverage; nevertheless, its predictive accuracy leaves room for improvement. Li et al. [13] proposed SIMCCDA, which uses inductive matrix completion to impute the missing values in the known circRNA-disease association matrix.
This approach reformulates the association prediction task as a recommendation system problem, achieving good performance with reduced memory requirements and training time. However, SIMCCDA cannot be applied to the prediction of new diseases without any associations or isolated circRNAs. Zuo et al. [14] proposed DMCCDA, an association prediction method based on double matrix completion. DMCCDA employs matrix completion methods to reconstruct the known association matrix. Subsequently, it utilizes the reconstructed matrix alongside a corresponding Gaussian similarity matrix to create a combined matrix, which is again reconstructed using matrix completion. The final prediction score integrates the results from these steps. Despite its methodological novelty, DMCCDA exhibits limitations in performance compared to alternative methods.
In recent years, deep learning-based models have emerged as a powerful tool in bioinformatics [5]. These models represent biological systems as graphs, where nodes represent biological entities and edges represent interactions between them [15]. Graph representation learning, a technique within deep learning, extracts features from graph networks and learns low-dimensional representations of nodes, links, and subgraphs while preserving the graph's topology and intrinsic properties [16]. Several studies have employed graph representation learning for biological association prediction tasks. Zhang et al. [17] proposed iGRLCDA, a graph representation learning model composed of GCN and graph factorization (GF), to identify circRNA–disease associations. Peng et al. [18] proposed EEGDTI, an end-to-end heterogeneous graph representation learning-based model, to identify drug–target interactions. Zhao et al. [19] proposed HINGRL to predict drug-disease associations with graph representation learning on a heterogeneous information network. Jiang et al. [20] presented SAEROF, a computational model combining a sparse autoencoder and rotation forest, to predict drug-disease associations. Ha et al. [21] proposed NCMD, a node2vec-based neural collaborative filter, to predict miRNA-disease associations. Zhao et al. [22] proposed a method to predict drug-target interactions based on large-scale graph representation learning. Zhao et al. [23] proposed MotifMDA, a motif-aware model that integrates high- and low-order structural information for miRNA-disease association prediction.
Extra-tree classifiers have also proven effective in bioinformatics tasks because the randomization they introduce yields good flexibility and accuracy [24, 25]. They have been successfully applied in leukocyte classification [26], lncRNA-protein interaction identification [27], and cardiovascular disease prediction [28].
Although several computational methods have been proposed, they exhibit shortcomings such as reliance on prior information, inability to accommodate new circRNAs or isolated diseases, biased results, and limited prediction accuracy [7, 9,10,11,12, 15]. Furthermore, the inherent complexity of extracting relevant features from heterogeneous graphs poses a substantial challenge to the development of robust models for circRNA-disease association prediction [20,21,22, 24, 25, 29, 30]. To overcome these challenges, we propose a novel approach termed CDADGRL (CircRNA-Disease Association Prediction via Double-Line Graph Representation Learning). This model integrates diverse biological data sources, employs sparse-autoencoder feature extraction, and analyzes both local and global graph structures to improve the identification of circRNA-disease associations. In doing so, CDADGRL aims to provide a more accurate and efficient means of predicting circRNA-disease associations, thereby facilitating advances in disease diagnosis and treatment.

Step 1, diverse biological information encompassing circRNA functional similarity, disease semantic similarity, circRNA (disease) Gaussian interaction profile kernel similarity, and known circRNA-disease associations was integrated to form integrated circRNA (disease) similarity. These integrated similarities were then utilized to construct the circRNA-disease heterogeneous network (CDHN).

Step 2, the integrated circRNA (disease) similarity metric from step 1 was then fed into a sparse autoencoder to extract node features for both circRNAs and diseases within the CDHN.

Step 3, local graph networks were built by inputting the node features of CDHN into a GCN, enabling the capture of local graph structures.

Step 4, global graph networks were constructed using node2vec, employing depth-first sampling within CDHN to capture the broader network structure.

Step 5, the combination of local and global graph networks was inputted into an extra-tree classifier to identify potential circRNA-disease associations.
CDADGRL represents a novel approach that leverages the strengths of both local and global graph structures. By integrating diverse biological data sources, employing a sparse autoencoder for feature extraction, and comprehensively analyzing both the fine-grained relationships (local structures) and the broader network context (global structures) within the circRNA-disease heterogeneous network, CDADGRL effectively identifies circRNA-disease associations.
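The depth-first sampling mentioned in Step 4 can be illustrated with a second-order biased random walk in the style of node2vec. The sketch below is only an illustration under stated assumptions: the function name `node2vec_walk`, the toy graph, and the parameter values `p=1.0` and `q=0.5` are hypothetical and are not taken from the CDADGRL source code. A value of `q` below 1 favours moves away from the previous node, which is the depth-first-like behaviour.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=0.5, rng=None):
    """Second-order biased random walk (node2vec-style).

    adj    : dict mapping each node to the set of its neighbours (undirected).
    p      : return parameter (weight 1/p for stepping back to the previous node).
    q      : in-out parameter (weight 1/q for nodes not adjacent to the previous
             node; q < 1 pushes the walk outward, i.e. depth-first-like).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # return to where we came from
            elif nxt in adj[prev]:
                weights.append(1.0)      # stays close to the previous node
            else:
                weights.append(1.0 / q)  # moves further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy circRNA-disease graph (c* = circRNAs, d* = diseases).
adj = {
    "c1": {"d1", "d2"},
    "c2": {"d2"},
    "c3": {"d1"},
    "d1": {"c1", "c3"},
    "d2": {"c1", "c2"},
}
walk = node2vec_walk(adj, "c1", length=6)
```

In a full pipeline, many such walks per node would typically be fed to a skip-gram model to obtain node embeddings; that step is omitted here.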
Results
Experiment dataset
From the circR2Disease database [31], we assembled a dataset comprising 739 experimentally validated associations involving 661 circRNAs and 100 diseases. After removing redundant entries, we retained 650 non-repetitive associations linked specifically to human complex diseases as the known circRNA-disease associations. This refined benchmark dataset involved 585 distinct circRNAs and 88 unique complex diseases.
Evaluation metric and method
A circRNA-disease node pair is classified as a positive sample when its prediction score exceeds a predefined threshold and as a negative sample otherwise. The true positive rate (TPR) and false positive rate (FPR) were computed at various thresholds, and the resulting (FPR, TPR) pairs were used to construct receiver operating characteristic (ROC) curves plotting TPR against FPR. Common evaluation metrics, including the area under the ROC curve (AUROC), the area under the precision-recall (PR) curve (AUPR), accuracy, sensitivity, precision, specificity, and Matthews correlation coefficient (MCC), were employed to evaluate the predictive performance of the compared models. To reduce the impact of result variance, five-fold cross-validation was repeated 10 times, and the values were averaged over these repetitions to yield the final evaluation results.
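The threshold-based metrics listed above follow from the four confusion-matrix counts, and AUROC can be read as the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair. The following sketch uses the standard textbook definitions; the function names and the toy scores are hypothetical, and the pairwise AUROC is quadratic in the number of samples, so it is meant for illustration rather than large-scale use.

```python
import math


def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN; a pair is predicted positive if score > threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score > threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (TPR), specificity, precision, and MCC."""
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical prediction scores and ground-truth labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(scores, labels, threshold=0.5)
result = classification_metrics(tp, fp, tn, fn)
auc = auroc(scores, labels)
```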
Evaluation results and analysis
Five-fold cross-validation
The results for each evaluation metric obtained by CDADGRL under five-fold cross-validation are presented in Table 1.
As detailed in Table 1, CDADGRL exhibited notable predictive performance in every fold of the five-fold cross-validation. The consistency of the results across folds indicates that the model is both accurate and stable.
Ablation experiment
To better assess the impact and significance of incorporating different network structures on addressing data sparsity within the biological network, we conducted ablation experiments employing three distinct experimental schemes: ① local graph structure only; ② global graph structure only; ③ both local and global graph structures. The detailed outcomes of five-fold cross-validation are presented in Table 2.
The outcomes in Table 2 illustrate that the third experimental scheme (ours) achieved the best predictive performance across all evaluation metrics. The first scheme only utilizes the local network structure, focusing on the immediate relationships between circRNAs and diseases. While this approach can capture fine-grained details about these relationships, it may miss broader network context that could be informative for prediction. The second scheme solely leverages the global network structure, analyzing the overall connectivity patterns within the network. This can capture the broader context of circRNA and disease interactions but may lack the specificity of local relationships. For instance, it might identify circRNAs with similar disease associations even if they lack direct functional similarity. The third experimental scheme (ours) integrates both local and global network structures. This allows the model to capture both fine-grained relationships between circRNAs and diseases and the broader network context. The superior performance of our scheme supports the theoretical notion that combining local and global network structures allows the model to extract more comprehensive features, leading to more accurate circRNA-disease association prediction.
Classifier comparison
To comprehensively validate our model, we evaluated several classifiers: random forest (RF) [17], logistic regression (LR) [32], K-nearest neighbor (KNN) [7], Gaussian naive Bayes (Gaussian NB) [17], and the extra-tree classifier (ET). Each classifier was individually incorporated into our model to assess its contribution to predictive performance, and each was evaluated using five-fold cross-validation with default parameters. Detailed evaluation results are presented in Table 3.
Table 3 shows that the extra-tree classifier (ET) achieved better performance than the other classifiers. Specifically, ET improved AUROC by 0.65%, 22.49%, 5.97%, and 24.07% over the alternative classifiers. ET also achieved the highest AUPR, with improvements of 0.55%, 27.43%, 5.98%, and 22.87% over the other classifiers, respectively.
Model comparison
To assess the effectiveness of our CDADGRL model, we conducted a comparative analysis against three related state-of-the-art models: SIMCCDA [13], CRPGCN [11] and DMCCDA [14]. This comparison was conducted using the refined benchmark dataset outlined in Sect. "Experiment Dataset". Hyperparameters for all models were selected according to the relevant literature. The results of five-fold cross-validation are presented in Table 4 and Fig. 1.
As shown in Table 4, CDADGRL performs strongly across most key metrics. Although it is slightly inferior to DMCCDA and CRPGCN on certain individual metrics, its overall performance is more balanced: it excels in accuracy, sensitivity, MCC, and AUROC, and it also performs well in precision and AUPR, indicating a potential advantage on imbalanced datasets and in practical applications. As depicted in Fig. 1, CDADGRL achieves high AUROC and AUPR values. DMCCDA achieves a marginally higher AUROC (by 0.25%) than CDADGRL, but its AUPR is notably lower, by 10.97%. DMCCDA incorporates multi-source information but may not explicitly capture fine-grained relationships between circRNAs and diseases, which could explain its lower AUPR. SIMCCDA relies solely on network similarity for prediction, whereas CDADGRL integrates diverse biological data sources and leverages both local and global network structures; this likely contributes to CDADGRL's advantage in capturing complex relationships between circRNAs and diseases. Compared with CRPGCN, which uses GCNs to learn features from the local network structure, CDADGRL additionally analyzes the broader network context, which in theory allows it to capture more informative features.
Taken together, CDADGRL exhibits the most comprehensive performance across AUROC and AUPR, highlighting the effectiveness of our proposed double-line graph representation learning approach for circRNA-disease association prediction.
Robustness verification
Additional experiments were conducted to verify the robustness of our model across three domains: circRNA-disease association prediction, miRNA-disease association prediction, and drug-target interaction prediction. The circRNA-disease association data were taken from the previously described benchmark dataset. The miRNA-disease association and drug-target interaction datasets were acquired and processed following the methods described in [33] and [22], respectively. The miRNA-disease association dataset contains 5430 established associations involving 495 distinct miRNAs and 383 diseases, and the drug-target interaction dataset consists of 11,396 known associations involving 984 drugs and 635 proteins. Using five-fold cross-validation, ROC and PR plots were generated for the three datasets, as depicted in Fig. 2. These experiments assess the model's predictive performance and robustness across different molecular interaction domains.
As depicted in Fig. 2, CDADGRL attained AUC values of 0.9437, 0.9668, and 0.9866, along with AUPR values of 0.9429, 0.9658, and 0.9897 for circRNA-disease association data, miRNA-disease association data, and drug-target interaction data, respectively. These experimental outcomes substantiate the model's applicability across datasets characterized by distinct scales and content compositions. Furthermore, the results underscore its robustness and notable generalization capacity.
Case study
Many researchers are working to reduce the incidence of cancers. Global cancer statistics [34] report that breast cancer is the most prevalent type of cancer in women worldwide and ranks second in terms of deaths. For gastric cancer, the five-year survival rate is generally 5–25%, making it one of the deadlier cancers [35]. To validate the predictive capabilities of CDADGRL in real-world scenarios, this study conducted case studies focusing on breast cancer and gastric cancer. The model was used to identify circRNAs associated with these two cancers. After sorting the association prediction scores in descending order, the top 10 ranked circRNAs for each cancer were validated by cross-referencing relevant literature and reports in the PubMed database. The detailed results are presented in Tables 5 and 6.
In each of Tables 5 and 6, only two of the ten predicted circRNAs lack supporting evidence in the PubMed literature. Although no direct description of an association between "hsa_circ_0001649" and breast cancer has been reported so far, [36] studied the relationship between hsa_circ_0001649 and miR20a and the underlying molecular mechanisms, and [37] demonstrated a role for miR20a in the regulation of breast cancer angiogenesis. An accompanying file on the Royal Society of Chemistry's website describes the association between "hsa_circ_0000064" and breast cancer, although no direct, explicit description of this association appears in the available literature. In Table 5, no currently available literature directly associates "hsa_circ_0007534" with gastric cancer. However, several studies demonstrate a direct association between "hsa_circ_0007534" and colorectal cancer as well as pancreatic cancer, both of which affect parts of the digestive system [38,39,40]. We believe that forthcoming research will unveil evidence linking "hsa_circ_0007534" to gastric cancer, which is also a digestive system cancer. As for "circMCTP1", another circRNA lacking direct evidence, it has been shown to be associated with multiple system atrophy (MSA) [41]. Furthermore, it is noteworthy that all patients diagnosed with MSA exhibit gastrointestinal abnormalities [42]. The potential for discovering evidence linking "hsa_circ_0007534" to gastric cancer remains open for future exploration.
Discussion
The precise identification of associations between circRNAs and diseases holds significant promise for expediting drug development, personalized diagnostics, and treatment for a spectrum of human diseases. In this study, we introduce a novel deep learning framework termed CDADGRL, which leverages a graph network structure and employs a dual-line representation based on graph node features. This framework can capture both local and global structural information inherent in heterogeneous networks. By doing so, it mitigates the poor prediction accuracy that stems from the inherent sparsity of biological data. Notably, the model exhibits robustness and applicability across datasets with varying scales and contents. In future work, we plan to integrate more diverse biological information, encompassing miRNA, lncRNA, and other pertinent elements, to construct an expansive circRNA-disease heterogeneous network. This holistic approach aims to enrich the pool of circRNA and disease-related information, facilitating more precise predictions of the association between circRNAs and diseases. As the deep sea of circRNAs is unraveled and interpreted, circRNAs may serve as prognostic, diagnostic, and even therapeutic tools, or as molecular targets for biomedical research and clinical applications. While CDADGRL demonstrates promising performance, there is an opportunity to enhance the effectiveness of local network structure representation. Inspired by the work presented in [43], we will explore how alternative attribute graph network construction methods might improve the model's capability to capture intricate rel.
Materials and methods
Network construction
CircRNA-Disease Heterogeneous Network (CDHN)
Utilizing the previously referenced benchmark dataset, a circRNA-disease association network was constructed and denoted as \({\mathbf{A}} \in {\mathbb{R}}^{n \times m}\), where the variables \(n\) and \(m\) represent the number of circRNAs and diseases involved, respectively. In this network, if a circRNA \(c_{i}\) has a known association with disease \(d_{j}\), the matrix element \({\mathbf{A}}(c_{i} ,d_{j} ) = 1\); otherwise, \({\mathbf{A}}(c_{i} ,d_{j} ) = 0\). Subsequently, a heterogeneous network CDHN, represented by an adjacency matrix \({\mathbf{X}} \in {\mathbb{R}}^{(n + m) \times (n + m)}\), was constructed using the association information as follows:
where \({\mathbf{A}}^{T}\) represents the corresponding transpose matrix of \({\mathbf{A}}\). This construction results in a comprehensive heterogeneous network capturing both circRNA-disease associations and their interrelations.
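One way to assemble an \((n + m) \times (n + m)\) adjacency matrix from \({\mathbf{A}}\) and \({\mathbf{A}}^{T}\) is the usual bipartite block layout, with \({\mathbf{A}}\) in the upper-right block and \({\mathbf{A}}^{T}\) in the lower-left block. The sketch below assumes that the two diagonal blocks are zero; this is an assumption made for illustration, and the diagonal blocks could instead hold similarity matrices. The function name `build_cdhn` and the toy matrix are hypothetical.

```python
def build_cdhn(A):
    """Build a symmetric (n+m) x (n+m) adjacency matrix from an n x m matrix A.

    Rows/columns 0..n-1 index circRNAs and n..n+m-1 index diseases.
    Assumption: the circRNA-circRNA and disease-disease blocks are left at zero.
    """
    n, m = len(A), len(A[0])
    size = n + m
    X = [[0] * size for _ in range(size)]
    for i in range(n):
        for j in range(m):
            X[i][n + j] = A[i][j]  # upper-right block: A
            X[n + j][i] = A[i][j]  # lower-left block: A transposed
    return X


# Hypothetical toy association matrix: 3 circRNAs x 2 diseases.
A = [
    [1, 0],
    [0, 1],
    [1, 1],
]
X = build_cdhn(A)
```

With this layout, `X[i][n + j]` is 1 exactly when circRNA `i` is associated with disease `j`, and `X` is symmetric by construction.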
Disease semantic similarity network
Semantic information regarding diseases was obtained from the U.S. National Library of Medicine database (https://www.nlm.nih.gov/mesh/), with which semantic similarities for diseases were calculated by using directed acyclic graphs (DAG) [44]. Within this framework, a disease node \(d\) is represented by \(DAG_{d} = \left( {d,T_{d} ,E_{d} } \right)\), where \(T_{d}\) denotes the set encompassing all ancestors of disease \(d\) (including \(d\) itself), and \(E_{d}\) signifies the set of edges connecting those diseases in the set. Consequently, the semantic contribution value of any disease \(d\) to disease \(d_{i}\) was defined with \(SC_{{d_{i} }} \left( d \right)\):
where \(\gamma\) represents the semantic contribution factor, empirically set to 0.5 in accordance with literature [44]. This formulation aims to quantify the semantic relationship between diseases based on their shared ancestry within the DAG framework.
The semantic value of disease \(d_{i}\) is represented by \(SV\left( {d_{i} } \right)\), with definition as:
The matrix element within the disease semantic similarity network (denoted as \({\mathbf{DS}} \in {\mathbb{R}}^{m \times m}\)) that represents the semantic similarity between disease \(d_{i}\) and disease \(d_{j}\) is denoted by \({\mathbf{DS}}\left( {d_{i} ,d_{j} } \right)\), with calculation as:
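The quantities defined in this subsection can be sketched in code. The version below assumes the formulation commonly used with this DAG model, which may differ in detail from the authors' implementation: a disease contributes 1 to itself, each ancestor receives \(\gamma\) times the largest contribution among its children, the semantic value \(SV\) is the sum of all contributions, and the similarity of two diseases is the summed contributions of their shared ancestors divided by the sum of their semantic values. The function names and the toy hierarchy are hypothetical.

```python
def semantic_contributions(disease, parents, gamma=0.5):
    """Return {t: SC_disease(t)} for every t in T_disease (disease included).

    parents maps a term to a list of its direct parent terms in the DAG.
    Each ancestor keeps the largest contribution reachable along any path.
    """
    sc = {disease: 1.0}
    frontier = [disease]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            value = gamma * sc[child]
            if value > sc.get(parent, 0.0):
                sc[parent] = value
                frontier.append(parent)  # re-propagate the improved value
    return sc


def semantic_value(disease, parents, gamma=0.5):
    """SV(disease): sum of the contributions of all terms in its DAG."""
    return sum(semantic_contributions(disease, parents, gamma).values())


def semantic_similarity(d_i, d_j, parents, gamma=0.5):
    """DS(d_i, d_j) from shared ancestors of the two DAGs."""
    sc_i = semantic_contributions(d_i, parents, gamma)
    sc_j = semantic_contributions(d_j, parents, gamma)
    shared = set(sc_i) & set(sc_j)
    numerator = sum(sc_i[t] + sc_j[t] for t in shared)
    return numerator / (sum(sc_i.values()) + sum(sc_j.values()))


# Hypothetical hierarchy: d1 and d2 share the parent "p", whose parent is "root".
parents = {"d1": ["p"], "d2": ["p"], "p": ["root"]}
ds_12 = semantic_similarity("d1", "d2", parents)
ds_11 = semantic_similarity("d1", "d1", parents)
```

In the toy hierarchy each disease has contributions 1, 0.5 and 0.25 (semantic value 1.75), so the two siblings score 1.5 / 3.5, and any disease compared with itself scores 1.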
CircRNA functional similarity network
In accordance with the hypothesis that similar circRNAs tend to be associated with similar diseases and vice versa [45], circRNA functional similarity was calculated by integrating disease semantic similarity and experimentally validated circRNA-disease associations. First, the maximum semantic similarity between any disease \(d\) and the disease set \(T = \left\{ {d_{1} ,d_{2} , \cdots ,d_{m} } \right\}\) was calculated as:
Matrix \({\mathbf{FS}} \in {\mathbb{R}}^{n \times n}\) denotes the circRNA functional similarity network whose element \({\mathbf{FS}}\left( {c_{i} ,c_{j} } \right)\) represents the circRNA functional similarity between circRNA \(c_{i}\) and circRNA \(c_{j}\):
where \(T_{i}\) represents the set of diseases associated with circRNA \(c_{i}\), \(T_{j}\) represents the set of diseases associated with circRNA \(c_{j}\), \(r\) and \(l\) denote the number of diseases in sets \(T_{i}\) and \(T_{j}\), respectively.
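A minimal sketch of this calculation follows, assuming the widely used best-match-average form: each disease in one set is matched to its most similar disease in the other set, and the matched similarities from both directions are summed and divided by \(r + l\). Whether CDADGRL uses exactly this form is an assumption; the function names and the toy similarity matrix are hypothetical.

```python
def best_match(d, disease_set, DS):
    """Maximum semantic similarity between disease d and any disease in the set."""
    return max(DS[d][t] for t in disease_set)


def functional_similarity(T_i, T_j, DS):
    """Best-match-average similarity between two circRNAs' disease sets."""
    if not T_i or not T_j:
        return 0.0  # a circRNA with no known disease has no functional profile
    total = sum(best_match(d, T_j, DS) for d in T_i)
    total += sum(best_match(d, T_i, DS) for d in T_j)
    return total / (len(T_i) + len(T_j))


# Hypothetical disease semantic similarity matrix for three diseases (0, 1, 2).
DS = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
T_i = {0, 1}  # diseases associated with circRNA c_i (r = 2)
T_j = {2}     # diseases associated with circRNA c_j (l = 1)
fs = functional_similarity(T_i, T_j, DS)
```

Here the matched similarities are 0.2 and 0.4 from `T_i` to `T_j` and 0.4 from `T_j` to `T_i`, giving 1.0 / 3.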
Gaussian interaction profile kernel similarity network
The sparsity inherent in the original circRNA-disease association network significantly impacts prediction accuracy. To address this limitation, we introduced the Gaussian interaction profile kernel similarity to fill the missing values within the original circRNA-disease association network [45]. Matrix \({\mathbf{CK}} \in {\mathbb{R}}^{n \times n}\) represents the Gaussian interaction profile kernel similarity for circRNAs, where the matrix element \({\mathbf{CK}}\left( {c_{i} ,c_{j} } \right)\) denotes the Gaussian interaction profile kernel similarity between circRNA \(c_{i}\) and circRNA \(c_{j}\):
where the parameter \(\lambda_{{\text{c}}}\) represents the control kernel bandwidth, employed to regulate the size of \({\mathbf{CK}}\left( {c_{i} ,c_{j} } \right)\):
Similarly, the Gaussian interaction profile kernel similarity for diseases is denoted by \({\mathbf{DK}} \in {\mathbb{R}}^{m \times m}\), wherein the matrix element \({\mathbf{DK}}\left( {d_{i} ,d_{j} } \right)\) is calculated in the same way as above.
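The kernel can be sketched as follows, assuming the standard Gaussian interaction profile form: the similarity of two entities is the exponential of the negative squared Euclidean distance between their rows of the association matrix, scaled by a bandwidth that is normalized by the mean squared norm of all rows. The function name `gip_kernel` and the unnormalized bandwidth `bandwidth_prime=1.0` are assumptions made for illustration.

```python
import math


def gip_kernel(profiles, bandwidth_prime=1.0):
    """Gaussian interaction profile kernel over the rows of `profiles`.

    profiles        : list of equal-length 0/1 rows (one interaction profile each).
    bandwidth_prime : unnormalized bandwidth (assumed value; 1.0 is a common choice).
    """
    count = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / count
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    lam = bandwidth_prime / mean_sq_norm  # normalized kernel bandwidth
    K = [[0.0] * count for _ in range(count)]
    for i in range(count):
        for j in range(count):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            K[i][j] = math.exp(-lam * sq_dist)
    return K


# Hypothetical association matrix: 3 circRNAs x 2 diseases.
A = [
    [1, 0],
    [0, 1],
    [1, 1],
]
CK = gip_kernel(A)                                 # circRNA kernel uses rows of A
DK = gip_kernel([list(col) for col in zip(*A)])    # disease kernel uses columns of A
```

For the toy matrix the mean squared row norm is 4/3, so the normalized bandwidth is 0.75 and, for example, the first two circRNAs (squared distance 2) receive a similarity of exp(-1.5).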
Integrated similarity network
To improve the relatively low accuracy caused by sparsity within the circRNA (disease) semantic similarity network, we combined circRNA (disease) Gaussian interaction profile kernel similarity with circRNA functional similarity (disease semantic similarity). This combination resulted in the formation of the integrated circRNA similarity network (\({\mathbf{X}}_{c} \in {\mathbb{R}}^{n \times n}\)) and the integrated disease similarity network (\({\mathbf{X}}_{d} \in {\mathbb{R}}^{m \times m}\)), respectively:
Feature extraction
The relationships among nodes within CDHN are complex, and individual node features typically encompass multiple attributes. To capture these relationships precisely, node features must be extracted from various perspectives and dimensions so that they reflect the network's complexity.
Dimensionality reduction
The sparse autoencoder could not only fix the redundancy and sparsity problems existing in the original benchmark dataset, but also enhance the model's generalization ability, mitigating overfitting during the training phase [20]. To reduce the dimensionality of the integrated circRNA (disease) similarity and obtain a more concise representation, a novel sparse autoencoder based on a three-layer neural network structure was designed.
Integrated circRNA similarity network (\({\mathbf{X}}_{c}\)) as input was fed into the sparse autoencoder. The optimal number of neurons in the hidden layer, minimizing data loss during the transformation from the original space (input layer) to the new feature space (output layer), was denoted by \(k\), with a value set to 64 [22]. The input was compressed within the hidden layer, calculated as:
where \({\vec{\mathbf{y}}}_{c} \in {\mathbb{R}}^{1 \times k}\), a vector within matrix \({\mathbf{Y}}_{c} \in {\mathbb{R}}^{n \times k}\), represents the encoded mapping outcome derived from the hidden layer. Matrix \({\mathbf{W}}_{1} \in {\mathbb{R}}^{n \times k}\) denotes the weight matrix from the input layer to the hidden layer, while \({\vec{\mathbf{x}}}_{c} \in {\mathbb{R}}^{1 \times n}\) denotes a vector within matrix \({\mathbf{X}}_{c}\). Vector \({\vec{\mathbf{b}}}_{1} \in {\mathbb{R}}^{1 \times k}\) represents the bias, and \(\sigma (\cdot)\) denotes the activation function of the neurons.
Subsequently, within the output layer, \({\mathbf{Y}}_{c}\) was decompressed to reconstruct the integrated circRNA similarity (\({\mathbf{X}}_{c}\)), calculated as:
\[{\vec{\mathbf{z}}}_{c} = \sigma \left( {{\vec{\mathbf{y}}}_{c} {\mathbf{W}}_{2} + {\vec{\mathbf{b}}}_{2} } \right)\]
where \({\vec{\mathbf{z}}}_{c} \in {\mathbb{R}}^{1 \times k}\), a vector within matrix \({\mathbf{Z}}_{c} \in {\mathbb{R}}^{n \times k}\), represents the reconstructed outcome after decompression. Matrix \({\mathbf{W}}_{2} \in {\mathbb{R}}^{k \times k}\) denotes the weight matrix from the hidden layer to the output layer, and vector \({\vec{\mathbf{b}}}_{2} \in {\mathbb{R}}^{1 \times k}\) represents the bias.
Throughout the aforementioned calculation processes, the dimensionality of the integrated circRNA similarity was reduced, potentially resulting in the loss of circRNA-related information. To mitigate this loss, the sparse autoencoder was trained by iteratively minimizing the loss between \({\mathbf{W}}_{1}\) and \({\mathbf{W}}_{2}\), employing the gradient descent algorithm [19] to alternately optimize the weight matrices and biases. Consequently, the loss function characterizing CDADGRL is defined as:
Similarly, the reconstruction of the integrated disease similarity network \({\mathbf{X}}_{d}\) (denoted as \({\mathbf{Z}}_{d} \in {\mathbb{R}}^{m \times k}\)) followed the same calculation steps. Subsequently, by concatenating \({\mathbf{Z}}_{c} \in {\mathbb{R}}^{n \times k}\) and \({\mathbf{Z}}_{d} \in {\mathbb{R}}^{m \times k}\), the final circRNA-disease feature matrix \({\mathbf{Q}} = \left[ {{\mathbf{Z}}_{c} ,{\mathbf{Z}}_{d} } \right]^{T} \in {\mathbb{R}}^{{\left( {n + m} \right) \times k}}\) was derived.
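The dimensionality reduction step can be sketched as follows. This is a hedged illustration, not the CDADGRL code. It assumes a conventional sparse autoencoder: a sigmoid encoder to \(k\) units, a sigmoid decoder back to the input width, and a loss made of mean squared reconstruction error plus a KL-divergence sparsity penalty. The \(k\)-dimensional hidden codes are taken as the reduced features, so the layer shapes differ from those stated above. The function names and the hyperparameters `rho`, `beta`, `lr` and `epochs` are hypothetical, and the toy sizes replace \(k = 64\).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sparse_autoencoder(X, k, rho=0.05, beta=0.1, lr=0.1, epochs=300, seed=0):
    """Train by plain gradient descent; return (hidden codes, loss history)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    W1 = rng.normal(scale=0.1, size=(n_features, k))
    b1 = np.zeros(k)
    W2 = rng.normal(scale=0.1, size=(k, n_features))
    b2 = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        Y = sigmoid(X @ W1 + b1)   # encode into k dimensions
        Z = sigmoid(Y @ W2 + b2)   # decode back to the input width
        rho_hat = np.clip(Y.mean(axis=0), 1e-6, 1 - 1e-6)  # mean activation per unit
        mse = 0.5 * np.mean(np.sum((Z - X) ** 2, axis=1))
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
        losses.append(mse + beta * kl)

        # Backpropagation of reconstruction error and sparsity penalty.
        dZ = (Z - X) * Z * (1 - Z) / n_samples
        dW2 = Y.T @ dZ
        db2 = dZ.sum(axis=0)
        sparsity = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat)) / n_samples
        dY = (dZ @ W2.T + sparsity) * Y * (1 - Y)
        dW1 = X.T @ dY
        db1 = dY.sum(axis=0)

        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return sigmoid(X @ W1 + b1), losses

def toy_similarity(size, rng):
    """Random symmetric similarity matrix with a unit diagonal (toy data)."""
    S = rng.random((size, size))
    S = (S + S.T) / 2.0
    np.fill_diagonal(S, 1.0)
    return S

rng = np.random.default_rng(1)
n, m, k = 6, 4, 3
feat_c, losses_c = train_sparse_autoencoder(toy_similarity(n, rng), k)
feat_d, losses_d = train_sparse_autoencoder(toy_similarity(m, rng), k)

# Stack circRNA rows on top of disease rows: shape (n + m, k).
Q = np.vstack([feat_c, feat_d])
```

Stacking the two reduced matrices row-wise gives one feature row per node of the heterogeneous network, circRNAs first and diseases second.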
Local graph network structure
GCN is a semi-supervised technique that encodes the topological relationships within a graph [22]. Through convolutional operations, GCN can acquire the embedding representation of nodes in the graph, enabling the direct extraction of structural information and node attributes. A spatial methodology employing a two-layer GCN configuration was used to capture the local structural details within the heterogeneous network HCDN:
where \({\mathbf{I}} \in {\mathbb{R}}^{{\left( {\text{n + m}} \right) \times \left( {\text{n + m}} \right)}}\) represents the identity matrix of the same size as matrix \({\mathbf{X}} \in {\mathbb{R}}^{(n + m) \times (n + m)}\), \({\tilde{\mathbf{D}}}\) signifies the degree matrix of \({\tilde{\mathbf{X}}}\), \({\mathbf{W}} \in {\mathbb{R}}^{{\left( {n + m} \right) \times \left( {n + m} \right)}}\) denotes the randomly initialized weight matrix of the network, \({\text{ReLU}} \left( \cdot \right)\) denotes the activation function, and \({\mathbf{H}}_{l} \in {\mathbb{R}}^{{\left( {n + m} \right) \times k}}\) denotes the captured local graph network structure.
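For readers who want to see the propagation rule in executable form, the following is a minimal forward-pass sketch of a two-layer GCN. It assumes the standard symmetric normalization \({\tilde{\mathbf{D}}}^{-1/2} {\tilde{\mathbf{X}}} {\tilde{\mathbf{D}}}^{-1/2}\) with \({\tilde{\mathbf{X}}} = {\mathbf{X}} + {\mathbf{I}}\), and it uses \(k \times k\) weight matrices so that the output has \(k\) columns. The weight shapes, the toy adjacency and all function names are assumptions made for illustration, and no training is performed.

```python
import numpy as np

def normalized_adjacency(X: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (X + I) D^{-1/2} for a non-negative adjacency matrix X."""
    X_tilde = X + np.eye(X.shape[0])   # add self-loops
    deg = X_tilde.sum(axis=1)          # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)    # safe: self-loops make every degree >= 1
    return X_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_two_layer(X, Q, W0, W1):
    """Forward pass of a two-layer GCN with ReLU after each layer."""
    A_hat = normalized_adjacency(X)
    H1 = np.maximum(A_hat @ Q @ W0, 0.0)
    return np.maximum(A_hat @ H1 @ W1, 0.0)

# Toy heterogeneous network: 3 circRNAs and 2 diseases linked by known associations.
n, m, k = 3, 2, 4
assoc = np.array([[1, 0],
                  [0, 1],
                  [1, 1]], dtype=float)
X = np.block([[np.zeros((n, n)), assoc],
              [assoc.T, np.zeros((m, m))]])

rng = np.random.default_rng(0)
Q = rng.random((n + m, k))      # node features, one row per node
W0 = rng.normal(size=(k, k))    # randomly initialised layer weights
W1 = rng.normal(size=(k, k))

H_l = gcn_two_layer(X, Q, W0, W1)   # local structure embedding, shape (n + m, k)
```

Each layer mixes a node's features with those of its direct neighbours, so two layers let information travel at most two hops, which is why this branch is described as capturing local structure.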
Global graph network structure
Node2vec is a graph representation method built on a flexible biased random walk technique. Node2vec generates traversal paths by integrating breadth-first (BF) sampling and depth-first (DF) sampling, introducing two hyperparameters, \(p\) and \(q\), to smoothly transition between these two sampling methodologies [15, 46]. The adaptable biased random walk technique employed in Node2vec aims to preserve the high-order node proximities, thereby maximizing the network coverage while mapping nodes into a lower-dimensional feature space for learning node embeddings. For example, if \(v\) denotes the current node, the probability of visiting the subsequent node \(x\) can be calculated as:
where \(Z\) represents a normalizing constant and \(\left( {v,x} \right) \in E\) denotes the existence of an edge connecting node \(v\) and node \(x\). When the current walk reaches node \(v\) through the edge connecting node \(t\) and node \(v\), \(\pi_{vx}\) denotes the unnormalized transition probability:
where \(w_{vx}\) represents the weight of the edge connecting node \(v\) and node \(x\), while \(d_{tx}\) represents the shortest distance from node \(t\) to node \(x\). Utilizing formula (18), the global graph network structure of the heterogeneous network (\({\mathbf{X}}\)) was captured and is denoted by \({\mathbf{H}}_{g} \in {\mathbb{R}}^{{\left( {n + m} \right) \times k}}\). Following multiple rounds of experimentation, the optimal values for the hyperparameters \(p\) and \(q\) were set to 1.0 and 0.25, respectively.
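The biased walk can be illustrated with a short pure-Python sketch. It follows the standard node2vec rule, in which the edge weight \(w_{vx}\) is multiplied by \(1/p\), \(1\) or \(1/q\) when the distance \(d_{tx}\) is 0, 1 or 2, and the result is normalized by \(Z\). The toy graph, the function names and the fixed random seed are hypothetical; this is not the implementation used in the study.

```python
import random

def transition_probs(graph, t, v, p, q):
    """Normalised probabilities of stepping from v to each neighbour x, having come from t."""
    unnormalized = {}
    for x, w_vx in graph[v].items():
        if x == t:              # d_tx = 0: step back to the previous node
            alpha = 1.0 / p
        elif x in graph[t]:     # d_tx = 1: x is also a neighbour of t
            alpha = 1.0
        else:                   # d_tx = 2: step away from t
            alpha = 1.0 / q
        unnormalized[x] = alpha * w_vx
    Z = sum(unnormalized.values())
    return {x: pi / Z for x, pi in unnormalized.items()}

def biased_walk(graph, start, length, p, q, rng):
    """Generate one walk of up to `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        neighbours = list(graph[v])
        if not neighbours:
            break
        if len(walk) == 1:
            # First step has no previous node: sample by edge weight only.
            weights = [graph[v][x] for x in neighbours]
        else:
            probs = transition_probs(graph, walk[-2], v, p, q)
            weights = [probs[x] for x in neighbours]
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

# Toy undirected weighted graph stored as adjacency dictionaries.
graph = {
    "t": {"v": 1.0, "a": 1.0},
    "v": {"t": 1.0, "a": 1.0, "b": 1.0},
    "a": {"t": 1.0, "v": 1.0},
    "b": {"v": 1.0},
}

probs = transition_probs(graph, t="t", v="v", p=1.0, q=0.25)
walk = biased_walk(graph, "t", 5, p=1.0, q=0.25, rng=random.Random(0))
```

With \(p = 1.0\) and \(q = 0.25\), the factor \(1/q = 4\) makes the outward move to node `b` four times as likely as returning to `t` or moving to the shared neighbour `a`, which is the depth-first bias that these hyperparameter values imply.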
Extra-tree classifier prediction
The local graph network structure \({\mathbf{H}}_{l} \in {\mathbb{R}}^{{\left( {n + m} \right) \times k}}\) and the global graph network structure \({\mathbf{H}}_{g} \in {\mathbb{R}}^{{\left( {n + m} \right) \times k}}\) were concatenated to derive an integrated network structure \({\mathbf{H}} \in {\mathbb{R}}^{{\left( {n + m} \right) \times 2k}}\):
Finally, matrix \({\mathbf{H}}\) was fed into the extra-tree classifier [24, 25], which was trained using default parameters. This process yielded prediction scores representing circRNA-disease associations as the outputs. The comprehensive workflow of our model, CDADGRL, is illustrated in Fig. 3.
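The fusion step can be sketched with NumPy as below. The column-wise concatenation follows the shapes given above. How node embeddings are turned into one sample per circRNA-disease pair is not specified here, so the sketch assumes the pair feature is the circRNA row of \({\mathbf{H}}\) followed by the disease row, with circRNAs occupying the first \(n\) rows. The function `pair_features`, the random embeddings and the example pairs are hypothetical.

```python
import numpy as np

def pair_features(H, n, circ_idx, dis_idx):
    """Concatenate the circRNA row and the disease row of H for each pair."""
    circ_idx = np.asarray(circ_idx)
    dis_idx = np.asarray(dis_idx)
    # Disease j is stored at row n + j because circRNAs occupy rows 0..n-1 (assumed).
    return np.hstack([H[circ_idx], H[n + dis_idx]])

n, m, k = 3, 2, 4
rng = np.random.default_rng(0)
H_l = rng.random((n + m, k))   # stand-in for the local (GCN) embedding
H_g = rng.random((n + m, k))   # stand-in for the global (node2vec) embedding
H = np.hstack([H_l, H_g])      # integrated structure, shape (n + m, 2k)

pairs = [(0, 0), (1, 1), (2, 0)]   # (circRNA index, disease index)
samples = pair_features(H, n, [c for c, _ in pairs], [d for _, d in pairs])
```

A matrix such as `samples`, together with binary association labels, is the kind of input an extra-trees implementation (for example scikit-learn's `ExtraTreesClassifier`) would accept for training and for producing association scores.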
Availability of data and materials
Data is provided within the manuscript.
References
Meng S, Zhou H, Feng Z, Xu Z, Tang Y, Li P, Wu M. CircRNA: functions and properties of a novel potential biomarker for cancer. Mol Cancer. 2017;16:94. https://doi.org/10.1186/s1294301706632.
Li P, Chen S, Chen H, Mo X, Li T, Shao Y, Xiao B, Guo J. Using circular RNA as a novel type of biomarker in the screening of gastric cancer. Clin Chim Acta. 2015;444:132–6. https://doi.org/10.1016/j.cca.2015.02.018.
Verduci L, Strano S, Yarden Y, Blandino G. The circRNAmicroRNA code: emerging implications for cancer diagnosis and treatment. Mol Oncol. 2019;13:669–80. https://doi.org/10.1002/18780261.12468.
Borran S, Ahmadi G, Rezaei S, Anari MM, Modabberi M, Azarash Z, Razaviyan J, Derakhshan M, Akhbari M, Mirzaei H. Circular RNAs: New players in thyroid cancer. Pathology - Research and Practice. 2020;216:153217. https://doi.org/10.1016/j.prp.2020.153217.
Xiao Q, Dai J, Luo J. A survey of circular RNAs in complex diseases: databases, tools and computational methods. Brief Bioinform. 2022;23. https://doi.org/10.1093/bib/bbab444.
Wang CC, Han CD, Zhao Q, Chen X. Circular RNAs and complex diseases: from experimental results to computational models. Brief Bioinform. 2021;22. https://doi.org/10.1093/bib/bbab286.
Lei X, Bian C. Integrating random walk with restart and k-Nearest Neighbor to identify novel circRNA-disease association. Sci Rep. 2020;10:1943. https://doi.org/10.1038/s41598020590400.
Li G, Luo J, Wang D, Liang C, Xiao Q, Ding P, Chen H. Potential circRNAdisease association prediction using DeepWalk and network consistency projection. J Biomed Inform. 2020;112:103624. https://doi.org/10.1016/j.jbi.2020.103624.
Zhang W, Yu C, Wang X, Liu F. Predicting CircRNADisease Associations Through Linear Neighborhood Label Propagation Method. IEEE Access. 2019;7:83474–83. https://doi.org/10.1109/access.2019.2920942.
Lan W, Dong Y, Chen Q, Zheng R, Liu J, Pan Y, Chen YPP. KGANCDA: predicting circRNA-disease associations based on knowledge graph attention network. Brief Bioinform. 2021;23. https://doi.org/10.1093/bib/bbab494.
Ma Z, Kuang Z, Deng L. CRPGCN: predicting circRNAdisease associations using graph convolutional network based on heterogeneous network. BMC Bioinformatics. 2021;22:551. https://doi.org/10.1186/s1285902104467z.
Zheng K, You ZH, Li JQ, Wang L, Guo ZH, Huang YA. iCDACGR: Identification of circRNAdisease associations based on Chaos Game Representation. PLoS Comput Biol. 2020;16:e1007872. https://doi.org/10.1371/journal.pcbi.1007872.
Li M, Liu M, Bin Y, Xia J. Prediction of circRNAdisease associations based on inductive matrix completion. BMC Med Genomics. 2020;13:42. https://doi.org/10.1186/s1292002006790.
Zuo ZL, Cao RF, Wei PJ, Xia JF, Zheng CH. Double matrix completion for circRNAdisease association prediction. BMC Bioinformatics. 2021;22:307. https://doi.org/10.1186/s12859021042313.
Yi HC, You ZH, Huang DS, Kwoh CK. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform. 2021;23. https://doi.org/10.1093/bib/bbab340.
Zhang D, Yin J, Zhu X, Zhang C. Network Representation Learning: A Survey. IEEE Transactions on Big Data. 2020;6:3–28. https://doi.org/10.1109/tbdata.2018.2850013.
Zhang HY, Wang L, You ZH, Hu L, Zhao BW, Li ZW, Li YM. iGRLCDA: identifying circRNA–disease association based on graph representation learning. Brief Bioinform. 2022. https://doi.org/10.1093/bib/bbac083.
Peng J, Wang Y, Guan J, Li J, Han R, Hao J, Wei Z, Shang X. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief Bioinform. 2021;22. https://doi.org/10.1093/bib/bbaa430.
Zhao BW, Hu L, You ZH, Wang L, Su XR. HINGRL: predicting drug–disease associations with graph representation learning on heterogeneous information networks. Brief Bioinform. 2021;23. https://doi.org/10.1093/bib/bbab515.
Jiang HJ, Huang YA, You ZH. SAEROF: an ensemble approach for largescale drugdisease association prediction by incorporating rotation forest and sparse autoencoder deep neural network. Sci Rep. 2020;10:4972. https://doi.org/10.1038/s41598020616169.
Ha J, Park S. NCMD: Node2vec-based neural collaborative filtering for predicting miRNA-disease association. IEEE/ACM Trans Comput Biol Bioinform. 2022;PP. https://doi.org/10.1109/TCBB.2022.3191972.
Zhao BW, You ZH, Hu L, Guo ZH, Wang L, Chen ZH, Wong L. A Novel Method to Predict DrugTarget Interactions Based on LargeScale Graph Representation Learning. Cancers. 2021;13:2111. https://doi.org/10.3390/cancers13092111.
Zhao BW, He YZ, Su XR, Yang Y, Li GD, Huang YA, Hu PW, You ZH, Hu L. Motif-Aware miRNA-Disease Association Prediction Via Hierarchical Attention Network. IEEE Journal of Biomedical and Health Informatics. 2024:1–14. https://doi.org/10.1109/JBHI.2024.3383591.
Abhishek L. Optical character recognition using ensemble of SVM, MLP and extra trees classifier. In: Proceedings of the 2020 International Conference for Emerging Technology (INCET); 2020. p. 1–4.
Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63:3–42. https://doi.org/10.1007/s1099400662261.
Baby D, Devaraj SJ, Hemanth J, M ARM. Leukocyte classification based on feature selection using extra trees classifier: a transfer learning approach. Turkish Journal of Electrical Engineering & Computer Sciences. 2021;29:2742–57. https://doi.org/10.3906/elk2104183.
Peng L, Yuan R, Shen L, Gao P, Zhou L. LPIEnEDT: an ensemble framework with extra tree and decision tree classifiers for imbalanced lncRNAprotein interaction data classification. BioData Min. 2021;14:50. https://doi.org/10.1186/s13040021002774.
Deepika SS, Geetha TV. A metalearning framework using representation learning to predict drugdrug interaction. J Biomed Inform. 2018;84:136–47. https://doi.org/10.1016/j.jbi.2018.06.015.
Zhao BW, You ZH, Wong L, Zhang P, Li HY, Wang L. MGRL: Predicting DrugDisease Associations Based on MultiGraph Representation Learning. Front Genet. 2021;12:657182. https://doi.org/10.3389/fgene.2021.657182.
Battaglia PW, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. 2018. https://doi.org/10.48550/arXiv.1806.01261.
Fan C, Lei X, Fang Z, Jiang Q, Wu FX. CircR2Disease: a manually curated database for experimentally supported circular RNAs associated with various diseases. Database (Oxford). 2018;2018. https://doi.org/10.1093/database/bay044.
Ding Y, Chen B, Lei X, Liao B, Wu FX. Predicting novel CircRNAdisease associations based on random walk and logistic regression model. Comput Biol Chem. 2020;87:107287. https://doi.org/10.1016/j.compbiolchem.2020.107287.
Zhou S, Wang S, Wu Q, Azim R, Li W. Predicting potential miRNAdisease associations by combining gradient boosting decision tree with logistic regression. Comput Biol Chem. 2020;85:107200. https://doi.org/10.1016/j.compbiolchem.2020.107200.
Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, Bray F. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71:209–49. https://doi.org/10.3322/caac.21660.
Tabari A, Chan SM, Omar OMF, Iqbal SI, Gee MS, Daye D. Role of machine learning in precision oncology: Applications in gastrointestinal cancers. Cancers. 2022;15:63.
Sun H, Wang Q, Yuan G, Quan J, Dong D, Lun Y, Sun B. Hsa_circ_0001649 restrains gastric carcinoma growth and metastasis by downregulation of miR20a. J Clin Lab Anal. 2020;34:e23235. https://doi.org/10.1002/jcla.23235.
Luengo-Gil G, Gonzalez-Billalabeitia E, Perez-Henarejos SA, Navarro Manzano E, Chaves-Benito A, Garcia-Martinez E, Garcia-Garre E, Vicente V, Ayala de la Peña F. Angiogenic role of miR-20a in breast cancer. PLoS One. 2018;13:e0194638. https://doi.org/10.1371/journal.pone.0194638.
Li XW, Yang WH, Xu J. Circular RNA in gastric cancer. Chin Med J. 2020;133:1868–77. https://doi.org/10.1097/cm9.0000000000000908.
Yuan X, Yuan Y, He Z, Li D, Zeng B, Ni Q, Yang M, Yang D. The Regulatory Functions of Circular RNAs in Digestive System Cancers. Cancers. 2020;12:770.
Zhao R, Han Z, Zhou H, Xue Y, Chen X, Cao X. Diagnostic and prognostic role of circRNAs in pancreatic cancer: a metaanalysis. Front Oncol. 2023;13:1174577. https://doi.org/10.3389/fonc.2023.1174577.
Chen BJ, Mills JD, Takenaka K, Bliim N, Halliday GM, Janitz M. Characterization of circular RNAs landscape in multiple system atrophy brain. J Neurochem. 2016;139:485–96. https://doi.org/10.1111/jnc.13752.
Palma JA, NorcliffeKaufmann L, Kaufmann H. Diagnosis of multiple system atrophy. Auton Neurosci. 2018;211:15–25.
Yang Y, Su X, Zhao B, Li G, Hu P, Zhang J, Hu L. Fuzzy-Based Deep Attributed Graph Clustering. IEEE Transactions on Fuzzy Systems. 2023;PP:1–14. https://doi.org/10.1109/TFUZZ.2023.3338565.
Wang D, Wang J, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNAassociated diseases. Bioinformatics. 2010;26:1644–50. https://doi.org/10.1093/bioinformatics/btq241.
Deepthi K, Jereesh AS. Inferring Potential CircRNADisease Associations via Deep AutoencoderBased Classification. Mol Diagn Ther. 2021;25:87–97. https://doi.org/10.1007/s4029102000499y.
Zhou J, Liu L, Wei W, Fan J. Network Representation Learning: From Preprocessing, Feature Extraction to Node Embedding. ACM Comput Surv. 2023;55:1–35. https://doi.org/10.1145/3491206.
Acknowledgements
The authors thank the anonymous reviewers for suggestions that helped improve the paper substantially.
Funding
This research was funded by the National Natural Science Foundation of China, grant numbers 62166014 and 62162019, and the Natural Science Foundation of Guangxi Zhuang Autonomous Region, grant number 2020GXNSFAA297255.
Author information
Contributions
Conceptualization, Y.Z.; Data curation, Z.W.; Formal analysis, Z.W.; Funding acquisition, Y.Z.; Methodology, Y.Z., Z.W.; Software, Z.W.; Validation, H.W.; Writing—original draft, Y.Z.; Writing—review and editing, Y.Z. and M.C.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhang, Y., Wang, Z., Wei, H. et al. Exploring potential circRNA biomarkers for cancers based on doubleline heterogeneous graph representation learning. BMC Med Inform Decis Mak 24, 159 (2024). https://doi.org/10.1186/s12911024025646