Improved method for protein complex detection using bottleneck proteins
© Ahn et al.; licensee BioMed Central Ltd. 2013
Published: 5 April 2013
Detecting protein complexes is an essential and fundamental task in understanding various biological functions and processes, so accurate identification of protein complexes is indispensable.
For more accurate detection of protein complexes, we propose an algorithm that detects dense protein sub-networks whose proteins share closely located bottleneck proteins. The proposed algorithm can find protein complexes that overlap with each other.
We applied our algorithm to several PPI (Protein-Protein Interaction) networks of Saccharomyces cerevisiae and Homo sapiens, and validated our results using public databases of protein complexes. The prediction accuracy improved further over our previous work, which also used bottleneck information of the PPI network but showed limitations when predicting small-sized protein complexes.
Our algorithm produced overlapping protein complexes with a significantly improved F1 score over existing algorithms. This result comes from high recall due to effective network search, as well as high precision due to proper use of bottleneck information during the network search.
Most proteins are known to be involved in complex biological processes or functions in a cell by forming protein complexes with other proteins. Therefore, detecting protein complexes is an essential and fundamental task in understanding various biological functions and processes. A protein complex can be modelled as an undirected graph in which each node is a protein and each edge is a physical interaction between two proteins. Such a physical interaction between two proteins is called a PPI (Protein-Protein Interaction). Representative methods for finding these interactions are the two-hybrid system and mass spectrometry. Recent development of these high-throughput methods has produced abundant PPI network data.
A protein complex is a set of proteins that interact with each other, so it is frequently assumed that the distances between its member proteins are short and that its members tend to form a clique-like structure in the PPI network. Accordingly, a protein complex is often assumed to be a dense sub-graph of the PPI network. There has been active research on algorithms for detecting protein complexes, and many of them are based on searching for dense sub-graphs in the PPI network. MCODE gives high weight to nodes with high degree, and searches the network using those nodes as seeds. It performs a local search on the network and finds sub-networks whose nodes are highly interconnected. CMC weights PPIs using an iterative scoring method that assesses the reliability of each PPI, finds maximal cliques in the weighted PPI network, and then removes or merges overlapping maximal cliques based on their interconnectivity. MCL detects clusters by distinguishing strong and weak connections in the network and partitioning it, based on manipulation of transition probabilities or stochastic flows between the vertices of the graph. MCL has been reported to perform well, and many variations of it have been proposed [7–9]. However, they are known to suffer from imbalance in the sizes of the resulting clusters.
These network clustering algorithms generally do not allow overlap between identified protein complexes. In other words, a protein can be involved in only one protein complex. Recently, algorithms that allow overlap have been extensively studied. DPClus detects initial protein complexes by starting from seeds and then including neighbours so as to keep the edge density of the sub-network above a threshold. It then finds overlapping protein complexes by extending the initial protein complexes. CFinder is based on the Clique Percolation Method (CPM), which defines a protein complex as a union of k-cliques that share (k-1) vertices. The result of CFinder is sensitive to the value of k: as k increases, it tends to find smaller but denser sub-networks. Link Cluster first replaces each edge with a virtual node, and then connects those virtual nodes (edges) that share a node in the original network. Virtual nodes of the transformed network become closer as their connectivity increases. Hierarchical clustering of these virtual nodes yields clusters of edges, and as a result, the clusters can share nodes. Allowing overlap between the resulting protein complexes is expected to lead to higher recall and precision, because a protein is frequently involved in several protein complexes. Becker et al. proposed the Overlapping Cluster Generator (OCG), which decomposes a network into overlapping clusters for correct assignment of multifunctional proteins. OCG builds initial overlapping classes that are iteratively fused into a hierarchy according to an extension of Newman's modularity function.
Precise prediction of protein complexes is important since they are likely to be fundamental units of various biological functions or processes. In addition, the cost of validating predicted protein complexes is high. For more precise detection of protein complexes, we used the characteristics of bottlenecks in the network. A bottleneck of a network is a node where the information flow of the network is concentrated. The bottleneckness of a node can be calculated using betweenness centrality, a measure of a node's centrality in a network that counts the shortest paths passing through the node. Yu et al. revealed that bottleneck proteins tend to be essential proteins and correspond to the dynamic component of the PPI network. Moreover, they can be global connectors between functional modules of the PPI network. Therefore, sub-graphs whose boundary proteins are bottleneck proteins have a higher chance of being functional modules. We expected that finding these sub-graphs as candidate protein complexes would efficiently filter out possible false predictions.
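As an illustration of how betweenness centrality can be computed, the following is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph. It is not the implementation used in this study. The function name `betweenness_centrality` and the adjacency-dictionary representation are assumptions made for the example, and when several shortest paths connect a node pair, each path contributes a fractional count.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours (hypothetical input format).
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back towards s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered node pair was visited from both of its endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Toy network: two triangles joined by the single edge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
toy_bc = betweenness_centrality(toy)
```

In the toy network, the two nodes joining the triangles (`c` and `d`) receive the highest values, which matches the intuition of bottlenecks as connectors between dense modules.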
Previously, based on this expectation, we proposed a protein complex prediction algorithm that uses bottleneck proteins as partitioning points for detecting protein complexes. It iteratively constructs directed acyclic graphs whose starting nodes are bottlenecks of the PPI network. The search ends at nodes where flows from the starting node are concentrated. This graph is called a DG (Distance Graph), and the terminal nodes of a DG tend to be bottlenecks of the PPI network. The established DGs are used to identify sub-graphs that may overlap with each other. Sub-graphs with sufficient edge density are reported as protein complexes.
Even though our previous algorithm showed an improved F1 score over earlier work, it showed limited results when predicting small-sized protein complexes. To address this problem, we propose a new network search algorithm that searches for dense protein sub-networks whose proteins share closely located bottleneck proteins.
We applied our algorithm to several PPI networks of Saccharomyces cerevisiae and Homo sapiens, and validated our results using public databases of protein complexes. Our algorithm achieved a significantly improved F1 score over existing algorithms, including our previous work. This result comes from high recall due to effective network search, as well as high precision due to proper use of bottleneck information during the network search.
The protein complex detection method proposed in this study is composed of two parts. First, the betweenness centralities of all nodes and the shortest distances between all node pairs in the PPI network are calculated. Second, we search for dense protein sub-networks whose proteins share closely located bottleneck proteins.
The network search starts by sorting the nodes by betweenness centrality in descending order and putting them in the starting node set. Among them, the nodes in the top BC threshold percent (a user parameter) are called bottleneck nodes. In addition, each node keeps its "close bottlenecks", the set of bottleneck nodes whose distance from the node is ≤ 2.
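The selection of bottleneck nodes and close bottlenecks can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the number of bottlenecks is rounded up to at least one node, and a bottleneck node counts itself among its close bottlenecks because its distance to itself is 0.

```python
import math
from collections import deque


def select_bottlenecks(bc, bc_threshold_pct):
    """Return the top bc_threshold_pct percent of nodes by betweenness centrality.

    Rounding up, and keeping at least one node, is an assumption of this sketch.
    """
    ranked = sorted(bc, key=bc.get, reverse=True)
    n_top = max(1, math.ceil(len(ranked) * bc_threshold_pct / 100))
    return set(ranked[:n_top])


def close_bottlenecks(adj, bottlenecks, max_dist=2):
    """For every node, collect the bottlenecks at most max_dist hops away."""
    close = {}
    for start in adj:
        # Breadth-first search that stops expanding at depth max_dist.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            if dist[v] == max_dist:
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        close[start] = {v for v in dist if v in bottlenecks}
    return close


# Toy path network a-b-c-d-e-f-g with precomputed betweenness values.
path = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"},
    "e": {"d", "f"}, "f": {"e", "g"}, "g": {"f"},
}
path_bc = {"a": 0, "b": 5, "c": 8, "d": 9, "e": 8, "f": 5, "g": 0}
bottlenecks = select_bottlenecks(path_bc, bc_threshold_pct=10)
close = close_bottlenecks(path, bottlenecks)
```

A bounded breadth-first search per node is used here instead of the full all-pairs distance table, since only distances up to 2 are needed for the close bottlenecks.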
"shared_bottleneck" denotes the intersection of the shared bottlenecks of the cluster and the close bottlenecks of a candidate node n. Edge density is measured by the clustering coefficient, as in our previous work.
Except for the initial cluster, we look for neighbouring nodes only from the non-bottleneck proteins in the cluster. In other words, bottlenecks are the nodes where the search ends. For each neighbouring node that keeps the clustering coefficient ≥ the CC threshold, we calculate its score, and include the top k% of scored nodes in the next cluster. Throughout the rest of the paper, we used k = 5. We used a priority queue to implement this mechanism. Using the top k% of scored nodes, rather than only the single best-scoring node, is essential for efficient network traversal. A higher k enables faster clustering, and we confirmed through iterative experiments that a higher k (~ 10%) does not lower the prediction accuracy.
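One expansion step of this search might look like the sketch below. Several parts are assumptions rather than the method of this study: edge density is computed as the fraction of possible edges present among the cluster nodes, as a stand-in for the clustering coefficient; the scoring function is passed in as the hypothetical parameter `score`, because the score definition is not reproduced here; and `heapq.nlargest` plays the role of the priority queue.

```python
import heapq
import math


def edge_density(adj, nodes):
    """Fraction of possible edges present among nodes (assumed density measure)."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & nodes) for v in nodes) / 2
    return edges / (n * (n - 1) / 2)


def expand_once(adj, cluster, bottlenecks, cc_threshold, k_pct, score, initial=False):
    """Add the top k_pct percent of admissible neighbours to the cluster."""
    cluster = set(cluster)
    # After the initial cluster, bottlenecks end the search: only
    # non-bottleneck members contribute neighbours.
    frontier = cluster if initial else cluster - set(bottlenecks)
    candidates = set().union(*(adj[v] for v in frontier)) - cluster
    # Keep candidates that leave the cluster dense enough when added alone.
    dense = [n for n in candidates
             if edge_density(adj, cluster | {n}) >= cc_threshold]
    if not dense:
        return cluster
    n_top = max(1, math.ceil(len(dense) * k_pct / 100))
    best = heapq.nlargest(n_top, dense, key=lambda n: score(cluster, n))
    return cluster | set(best)


# Toy network: a 4-clique {a, b, c, d} plus a pendant node e attached to a.
toy = {
    "a": {"b", "c", "d", "e"}, "b": {"a", "c", "d"},
    "c": {"a", "b", "d"}, "d": {"a", "b", "c"}, "e": {"a"},
}


def links_into_cluster(cluster, n):
    # Hypothetical stand-in score: number of neighbours already in the cluster.
    return len(toy[n] & cluster)


grown = expand_once(toy, {"a", "b"}, set(), cc_threshold=0.9, k_pct=5,
                    score=links_into_cluster, initial=True)
```

In the toy run, the pendant node `e` is rejected by the density filter, and one of the two remaining clique members is added. Repeating the step until the cluster stops changing gives the full search loop.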
Table: PPI network datasets (columns: number of proteins; number of PPIs).
Table: reference protein complex datasets (columns: number of protein complexes; number of proteins; avg. number of proteins in protein complexes).
The search is considered successful if an identified protein complex has an affinity score ≥ 0.2 with any protein complex in a reference dataset. If this threshold is too high or too low, the affinity score loses its discriminating function. Through iterative experiments, we set the affinity score threshold to 0.2, which distinguishes the results of the various algorithms.
In the definitions of recall, precision, and F1 score, C is the set of protein complexes found by a clustering algorithm, and R is the set of protein complexes in a reference dataset. Recall is the fraction of protein complexes in the reference dataset that were successfully found, precision is the fraction of protein complexes identified by an algorithm that match protein complexes in the reference dataset, and the F1 score, which combines the two, reflects the overall accuracy of the test.
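The evaluation described above can be sketched as follows. The affinity score is assumed here to be the commonly used neighbourhood affinity, |A ∩ B|² / (|A| · |B|), and the F1 score is taken as the harmonic mean of precision and recall. Both are assumptions of this sketch, as are the function names, since the original formulas are not reproduced in this text.

```python
def affinity(a, b):
    """Neighbourhood affinity of two protein sets (assumed definition)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))


def evaluate(predicted, reference, threshold=0.2):
    """Return (recall, precision, F1) for predicted set C and reference set R."""
    # Reference complexes matched by at least one predicted complex.
    found = sum(any(affinity(r, c) >= threshold for c in predicted)
                for r in reference)
    # Predicted complexes matched by at least one reference complex.
    matched = sum(any(affinity(c, r) >= threshold for r in reference)
                  for c in predicted)
    recall = found / len(reference) if reference else 0.0
    precision = matched / len(predicted) if predicted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, precision, f1


# Toy example: one of two predictions matches one of two reference complexes.
C = [{"P1", "P2", "P3"}, {"P7", "P8"}]
R = [{"P1", "P2", "P3", "P4"}, {"P5", "P6"}]
recall, precision, f1 = evaluate(C, R)
```

In the toy example, the first predicted complex shares three proteins with the four-protein reference complex, giving an affinity of 9 / 12 = 0.75, which is above the 0.2 threshold.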
For all the experiments, tests using bottleneck information produced more accurate results. In particular, prediction accuracy clearly increased when using bottlenecks in the two cases using BioGRID. This suggests that bottleneck information was effective in a dense network that may include many false interactions. At the same time, tests using only the clustering coefficient showed comparable prediction accuracy, which suggests that the proposed network search algorithm is itself effective for detecting protein complexes.
We then measured the prediction performance of the proposed algorithm, and compared the results with those of representative network clustering algorithms: MCODE, MCL, Link Cluster, and our previous work. We applied each algorithm, including the proposed algorithm, to the PPI networks and two reference datasets. For each algorithm, we found the optimal parameters that result in the best F1 score.
Result of comparison test (optimal parameters, one row per PPI network dataset; columns in order: proposed algorithm, our previous work, Link Cluster, MCL, MCODE)
CC = 0.9, BC = 1% | CC = 0.51, BC = 20% | Partition_density = 0.30 | Granularity = 2.00 | Node_score = 0.10
CC = 0.6, BC = 1% | CC = 0.38, BC = 20% | Partition_density = 0.29 | Granularity = 2.40 | Node_score = 0.10
CC = 1.0, BC = 1% | CC = 0.54, BC = 20% | Partition_density = 0.30 | Granularity = 3.60 | Node_score = 0.10
CC = 1.0, BC = 15% | CC = 0.43, BC = 30% | Partition_density = 0.28 | Granularity = 3.00 | Node_score = 0.10
CC = 0.8, BC = 5% | CC = 0.41, BC = 20% | Partition_density = 0.21 | Granularity = 1.60 | Node_score = 0.10
We can see that the optimal BC thresholds are generally smaller, and the optimal CC thresholds higher, than those of our previous work. This indicates that the proposed algorithm detects denser sub-networks. However, this does not mean that the proposed algorithm uses less bottleneck information, because prediction accuracy was also good for higher BC thresholds. Because our algorithm uses bottlenecks as the boundary of a protein complex, the detected sub-networks are basically similar to the DG. However, the division procedure of the DG has limitations in detecting dense sub-networks. Therefore, we can say that the proposed network search algorithm overcomes this limitation when detecting dense sub-networks.
We proposed a novel network clustering algorithm that detects dense protein sub-networks whose proteins share closely located bottleneck proteins. The proposed algorithm showed an improved F1 score, which comes from high recall due to effective network search, as well as high precision due to proper use of bottleneck information during the network search.
In future work, we plan to extend our algorithm to detect hierarchical relationships between the identified sub-networks. Such an algorithm would help us to elucidate the hierarchical structure of various protein complexes or functional modules in a cell, and to infer their functions in conjunction with biological databases such as the Gene Ontology database.
This work was supported by National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2012010775).
This work is based on an earlier work: “Protein complex prediction via bottleneck-based graph partitioning”, in Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics, 2012 © ACM, 2012. http://doi.acm.org/10.1145/2390068.2390079
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 13 Supplement 1, 2013: Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/13/S1.
- Kumar A, Snyder M: Protein complexes take the bait. Nature. 2002, 415: 123-124. 10.1038/415123a.
- Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature. 1989, 340: 245-246. 10.1038/340245a0.
- Ho Y, Gruhler A, Bader GD, Moore L, Adams SL, Miller A, et al: Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002, 415: 180-183. 10.1038/415180a.
- Bader GD, Hogue CW: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003, 4: 2. 10.1186/1471-2105-4-2.
- Liu G, Wong L, Chua HN: Complex discovery from weighted PPI networks. Bioinformatics. 2009, 25 (15): 1891-1897. 10.1093/bioinformatics/btp311.
- Dongen SV: Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht. 2000.
- Brohee S, van Helden J: Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics. 2006, 7: 488. 10.1186/1471-2105-7-488.
- Vlasblom J, Wodak S: Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics. 2009, 10: 99. 10.1186/1471-2105-10-99.
- Satuluri V, Parthasarathy S, Ucar D: Markov Clustering of Protein Interaction Networks with Improved Balance and Scalability. ACM-BCB. 2010, 247-256.
- Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S: Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics. 2006, 7: 207. 10.1186/1471-2105-7-207.
- Adamcsek B, Palla G, Farkas I, Derenyi I, Vicsek T: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006, 22 (8): 1021-1023. 10.1093/bioinformatics/btl039.
- Palla G, Derenyi I, Farkas I, Vicsek T: Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005, 435: 814-818. 10.1038/nature03607.
- Ahn Y, Bagrow JP, Lehmann S: Link communities reveal multiscale complexity in networks. Nature. 2010, 466: 761-765. 10.1038/nature09182.
- Becker E, Robisson B, Chapple CE, Guenoche A, Brun C: Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics. 2012, 28 (1): 84-90. 10.1093/bioinformatics/btr621.
- Yu H, Kim PM, Sprecher E, Trifonov V, Gerstein M: The Importance of Bottlenecks in Protein Networks: Correlation with Gene Essentiality and Expression Dynamics. PLoS Comput Biol. 2007, 3 (4): e59. 10.1371/journal.pcbi.0030059.
- Ahn J, Lee DH, Yoon Y, Yeu Y, Park S: Protein complex prediction via bottleneck-based graph partitioning. Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics. 2012, New York: ACM, 49-56. 10.1145/2390068.2390079.
- Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The database of interacting proteins: 2004 update. Nucleic Acids Research. 2004, 32 (Database): D449-D451.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Research. 2006, 34 (Database): D535-D539.
- Güldener U, Münsterkötter M, Kastenmüller G, Strack N, van Helden J, Lemer C, et al: CYGD: the comprehensive yeast genome database. Nucleic Acids Research. 2005, 33 (Database): D364-D368.
- Pu S, Wong J, Turner B, Cho E, Wodak S: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Research. 2009, 37 (3): 825-831. 10.1093/nar/gkn1005.
- Brown KR, Jurisica I: Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol. 2007, 8: R95. 10.1186/gb-2007-8-5-r95.
- Ruepp A, Waegele B, Lechner M, Brauner B, Dunger-Kaltenbach I, Fobo G, et al: CORUM: the comprehensive resource of mammalian protein complexes-2009. Nucleic Acids Research. 2010, 38 (Database): D497-501. 10.1093/nar/gkp914.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.