 Research article
 Open Access
 Published:
Identifying essential proteins in dynamic protein networks based on an improved hindex algorithm
BMC Medical Informatics and Decision Making volume 20, Article number: 110 (2020)
Abstract
Background
The essential proteins in protein networks play an important role in complex cellular functions and in protein evolution. Therefore, the identification of essential proteins in a network can help to explain the structure, function, and dynamics of basic cellular networks. The existing dynamic protein networks regard the protein components as the same at all time points; however, the role of proteins can vary over time.
Methods
To improve the accuracy of identifying essential proteins, an improved hindex algorithm based on the attenuation coefficient method is proposed in this paper. This method incorporates previously neglected node information to improve the accuracy of the essential protein search. Based on choosing the appropriate attenuation coefficient, the values, such as monotonicity, SN, SP, PPV and NPV of different essential protein search algorithms are tested.
Results
The experimental results show that, the algorithm proposed in this paper can ensure the accuracy of the found proteins while identifying more essential proteins.
Conclusions
The described experiments show that this method is more effective than other similar methods in identifying essential proteins in dynamic protein networks. This study can better explain the mechanism of life activities and provide theoretical basis for the research and development of targeted drugs.
Background
With the increasing amount of available medical information, the identification of key proteins has become an area of interest for many researchers [1,2,3].
In recent years, methods using different perspectives have been developed to mine essential nodes in complex networks. Wang et al [4]. proposed an effective method to identify vertices in dynamic networks using local detection and update strategies. This method locally detects change vertices in a dynamic network and locally updates the influence measure of these change vertices, without globally calculating the influence of all vertices. Essential proteins are those that play an important role in protein evolution and are similar to the definition of essential GO terms presented by Wan et al. [5] Li et al. [6] proposed a new method for identifying essential proteins by combining information on protein complexes and protein–protein interaction (PPI) network topological features. By analyzing the relationship between protein complexes and essential proteins, it was found that proteins in multiple complexes were more likely to be essential than those in only one complex. Based on a statistical analysis of proteins and protein complexes, Luo et al. [7] proposed a method for predicting essential proteins in PPI networks based on the local interaction density and protein complexes. Hu et al. [8] proposed a new method, the EBurt method, which can be applied to weighted networks. This method fully considers the total connection strength, the number of connection edges, and the distribution of the total connection strength on the connection edge in the local range. Wang et al. [9] used the iterative information of kshell decomposition to distinguish the influence ability of nodes with the same kshell. Lei et al. [10] proposed an essential protein exploration method named RWEP using a random walk algorithm that integrates topological and biological properties to determine protein essentiality in PPI networks. Many of the key factors to measure nodes in complex networks are based on graph theory to quantify the topological structure and attributes of each node, and comparisons of the centrality of each node are made through different centrality calculation methods, such as the degree center, median center, proximity center, and edge clustering coefficient center. Quantitative methods can also be used to find the essential nodes in networks [11,12,13,14]
The identification of essential proteins in dynamic protein networks reveals those that play the most important role in the evolution of proteins. In the search for essential proteins, the above methods consider only the importance of the nodes themselves to illustrate their centrality, ignoring structural information of network graphs. When modeling the essential proteins, some algorithms treat the protein situation at different time points as the same. However, in the process of protein evolution, the role of proteins can vary over time. Therefore, adding the attenuation coefficient can help to find proteins that are essential in the protein evolution process. The importance of the node itself and other structure information of the network can be combined by the attenuation coefficient to examine the importance of a particular node as it relates to the whole network.
Methods
Essential proteins in a protein network are usually located at the center of the entire network. The appearance or disappearance of these proteins has a crucial impact on the whole protein network [15,16,17] Accurately identifying essential proteins in a dynamic protein network is helpful for understanding various biological processes from a systematic point of view, and this information can be widely used to explore the pathogenesis of diseases and to predict and evaluate corresponding treatments. This information can also be used to find new drug targets and open new avenues for drug research and development. Although effective methods have been applied to identify essential proteins in protein interaction networks based on data mining, machine learning, and artificial neural networks, it is still necessary to carry out indepth research on algorithms to improve the accurate identification of such proteins.

(1)
Time series on dynamic protein networks
When modeling dynamic protein networks, gene expression data and largescale static protein networks are usually considered together. The gene expression arrays of M genes at T time points can be divided into T sets. Each set represents the state of M genes at the same time point and can be combined into a dynamic protein network based on a time series.

(2)
Evolution of proteins
Different protein interaction networks are present at different time points. Figure 1 shows a simple protein evolution process, where A, B, and C represent different proteins that appear at different times in protein evolution.
The ultimate goal of the model is to facilitate subsequent research by identifying essential proteins that play a crucial role in protein evolution or by predicting the linkages between proteins in subsequent points in evolution. This requires recording the evolution of the protein itself. In the past, link relationships between proteins in a network at different time points were recorded as 1 and relationships without links were recorded as 0. This approach is not amenable to a time series, because over time the historic protein data becomes less prominent and recently generated links between proteins play a larger role.
Dynamic protein networks represent the implementation of the entire evolutionary process over time. If a link relationship between proteins is considered to be constant in the evolutionary process over time, it will obviously affect subsequent studies that are based on dynamic protein models. Therefore, it is worth exploring how to incorporate protein link relationships that change over time into the scope of a model so as to correctly identify key proteins in dynamic protein networks and predict future protein link relationships.

(3)
The hindex method
The hindex [18] is a new method for evaluating academic achievements. The h stands for “high citation times”. The hindex means that a scientist has at most h papers cited at least h times. The hindex was originally used to accurately reflect a person’s academic achievements. A higher hindex indicates a greater academic influence. In the study of dynamic protein networks, the hindex can be used to find essential proteins, and formula (1) can be used to calculate the hindex value of nodes.
where d_{i} denotes the degree of node v_{i}, formula h (x_{1}, x_{2},... x_{n}) returns the maximum value of y, and at least y items from x_{1}, x_{2},... x_{n} are greater than or equal to y.
However, the hindex considers only the importance of the node itself to illustrate its centrality, ignoring some network structure information, which will reduce the accuracy of the node expansion.
For example, in Fig. 2, the centrality of node 2 is relatively large. However, considering the information of nodes in this graph, the expansion ability of node 3 is greater than that of node 2. The reason for this is that when defining the centrality of the hindex, some information about the node neighbors is ignored. For example, nodes with a degree less than y are completely ignored, resulting in reduced specification accuracy when the node is expanded. Therefore, this method is not accurate in calculating the expansion capacity of nodes.
Therefore, it is necessary to improve the existing hindex algorithm on the basis of the established dynamic protein network model to accurately identify key proteins in the network by combining information of the nodes themselves and structural information that has been neglected in previous algorithms.

(4)
Monotonicity
The ability to distinguish nodes with different scalabilities and nodes with uniform distribution at different levels is one criterion for evaluating the ranking methods of influential nodes in social networks [8] Monotonicity is used to test the recognition ability of this method for nodes with different extensibility. Formula (2) is used to calculate the m value of ranking Table R. In this equation, n is the number of column groups in list R and the number of nodes in column group R. The value of M is always in the range 0–1. Large numbers indicate that nodes have high recognition ability.
After establishing a dynamic protein network model based on attenuation coefficients, the essential protein recognition methods can be investigated. The methods to be adopted are as follows.

(1)
Construct the whole protein evolution process network based on the attenuation coefficient.
In this network, the weights of each side at the corresponding time points should be added together to obtain the final weights; the formulas are stated below.
In the process of construction, the same edge appears at different times. At the current time point, the corresponding weight calculation will vary. The earlier the edge appears, the more its role in the protein evolution process will change over time. The calculation method of weight corresponding to the edges at each time is as follows:
For each edge (u, v) in the protein network at time t, its weight D (u, v, t) varies with time t and is defined as:
in which
where a (t) is a set of all the edges appearing at time t; a (t) is a constant and is called the attenuation coefficient.
The weight of each vertex neighbor is calculated by the weight of the edge. If the vertex v is a neighbor of u, the weight w (u, v) of v for u is defined as:
It can be seen from the above definition that \( \sum \limits_{v\in N(u)}\mathrm{w}\left(u,v,t\right)=1 \) and w (u, v) and w (v, u) are not necessarily equal.
In this way, the weight of each edge in the protein network at different times is will vary.
The process of constructing a dynamic protein network based on attenuation coefficients is shown in Fig. 3.
Considering that proteins will change with time in the process of protein evolution, the protein network model is more objective and conforms to the process of biological evolution.

(2)
Calculating the cumulative centrality of the node neighborhood
Although the hindex measure attempts to determine the centrality of nodes based on the importance of adjacent nodes, some information about the adjacent nodes is still ignored. The centrality of a node can be standardized by using all the information of its adjacent nodes. For this purpose, the cumulative function in definition 1 is used.
Definition 1.
The cumulative function c_{k} (v_{i}) is defined as the number of nodes whose v_{i} neighbors are moderately larger than or equal to k, expressed as follows:
The hindex function is improved to the cumulative function defined in eq. (7):
This function takes the maximum value of k, where the degree of k neighbors is greater than or equal to k when other cumulative values are ignored by the hindex function. Therefore, in the proposed metric, the cumulative value of neighboring nodes is determined for all different values of k and is used to standardize node centrality.
In the traditional hindex, the weight of each neighbor’s link is considered as 1. In our case, the weight of each neighbor’s link w(u,v) is different. Therefore, when extending the traditional hindex, the extended cumulative function c_{k} (v_{i}) is defined as follows:
Definition 1.
Extended cumulative function c_{k} (v_{i}):
Here, N_{k} (v_{i}) is the set of neighbor vertices whose degree is k.
Definition 2, the cumulative function vector S (v_{i}), contains the cumulative values of nodes adjacent to node v_{i} at different degrees. The calculation is shown in formula (9):
Among them, h is the largest degree on the graph, and h = max_{j = 1, …, n}{d_{j}}.
To reduce the time complexity of calculating vector S (v_{i}) elements, the recursive eq. (10) can be used:
Here, S_{k} (v_{i}) is the kth index value of vector S (v_{i}), and N_{k} (v_{i}) is the set of neighbor vertices whose degree is k.
Given the cumulative function vector of node v_{i}, its cumulative centrality is expressed as eq. (11):
In this formula, p and r are two adjustable parameters, and the value of p is between 0 and 1. Because there is a larger cumulative value in the lower degree than in the higher degree and in the higherorder cumulative value of many nodes, equation (11) uses the parameter \( {\mathrm{p}}^{1+\mathrm{k}\ast \frac{\mathrm{p}}{\mathrm{r}}} \) to multiply the lowerorder cumulative value by a larger number. This ensures that the lowerorder cumulative value is more effective and has a stronger expansion and recognition ability in the regulation of node centrality.

(3)
The extended hindex centrality EHC(v) of a node is determined according to the cumulative centrality of its neighborhood.
Formula (12) can be used to determine the extended hindex centrality EHC(v) of nodes by iteration:

(4)
Calculate the centrality of all nodes and arrange them in order. N nodes with larger centralities are the essential nodes.
The process of our proposed IHindex algorithm is shown in Fig. 4.
Results
Experimental data
The following data were used in the experiment:

(1)
Gene expression data GSE3431 [19]; the corresponding matrix contained 6470 lines, and each line represented the corresponding expression data of a different gene.

(2)
Yeast protein network in DIP [20], which includes 5093 proteins and 24,743 edges. We processed the network and extracted a portion of the nodes in the evolution process, as shown in the composition diagram in Fig. 5.

(3)
1285 essential proteins obtained from the datasets MIPS [21], SGD [22], DEG [23] and SGDP [24].
Experimental results
Parameter selection experiments
First, the attenuation coefficient was tested. The dynamic protein network was divided into 36 moments, and the attenuation coefficients were compared with different values. The proposed algorithm is abbreviated as the IHindex.
The SIR extension model [25] was used to evaluate the accuracy of this method in determining the node expansion capability and sorting the nodes. For this reason, the diffusion process was simulated by SIR, and the real ranking table σ was generated. In the SIR process, each node can be in one of three states: susceptibility (S), infection (I) or recovery (Re). After applying necessary changes to the node states, the node state Re was considered as the extension capability of node v_{i}. The scalability of each node was calculated through repeated processing, and the ranking table σ was obtained.
After calculating the values in table σ, the sorting Table R can be generated by using various methods. The higher the correlation between the two ranking tables, the higher the accuracy of the corresponding methods in specifying the node expansion capability. For this reason, the Kendall correlation coefficient τ (0 ≤ τ ≤ 1) is adopted:
where n_{c} and n_{d} denote the number of consistent and inconsistent pairs of nodes in the two sorting tables, respectively, and n denotes the size of the sorting vector. The larger the Kendall correlation coefficient τ value, the closer the relationship between the two tables σ and R, and the more accurate the proposed algorithm for calculating the essential degree of the dynamic proteins.
Figures 6 and 7 show that the number of identified essential proteins and the accuracy of the identification change when the attenuation coefficient is altered. By synthesizing the two experimental results, we found that the attenuation coefficient ranged from 0.9 to 0.95, and the number of identified essential proteins found was optimized. Therefore, the attenuation coefficient was set to 0.92.
Next, the effect of parameters p and r on the results of the search algorithm was investigated using the yeast protein dataset by applying the Kendall coefficient.
Figure 8 shows that the value of the Kendall correlation coefficient changed slightly with the change of parameters p and r. The maximum value of τ was obtained when p = 0.9 and r = 100. Thus, the following experiments were carried out for the case of p = 0.9 and r = 100.
Experimental results of dynamic protein network models based on attenuation coefficients for different algorithms
To verify the performance of dynamic protein networks based on the attenuation coefficient, different algorithms were used to identify essential proteins in the established networks, and the results were compared.
The four essential node search methods were: Cnc+, [23] hindex, [26] IGC, [27] TEO, [11] RWEP [10] and IHindex, and they were run on the constructed attenuation coefficientbased protein network.

(1)
Monotonicity values
We verified the monotony of the different algorithms based on the dynamic protein network to identify essential proteins. The results are shown in Fig. 9.
Figure 9 shows that the monotonicity value of the IHindex algorithm was higher than that of the other algorithms. The value was close to 1, which indicates that this algorithm has a stronger ability to recognize essential proteins.

(2)
Correctness
Figure 10 shows the Kendall coefficients of two sorting tables corresponding to different algorithms. The accuracy of the IHindex algorithm in finding essential proteins was slightly higher than that of Cnc+, IGC, TEO and RWEP, and was significantly higher than that of the hindex.
Combined with the above two experimental results, the accuracy of the improved algorithm was also verified because the number of essential proteins found by the algorithm was the largest.

(3)
The values of SN, SP, PPV, and NPV of six algorithms
To further verify the performance of the algorithm, we compared the sensitivity (SN), specificity (SP), positive predictive value (PPV), and negative predictive value (NPV) of the six different algorithms.
Table 1 presents the values of SN, SP, PPV, and NPV of the six algorithms.
Discussion
Protein is an important component of all cells and tissues in the human body. The cell itself undergoes dynamic evolution in the body, such as growth, proliferation, differentiation, aging, and apoptosis. Therefore, when searching for proteins that are essential to the process of protein evolution, considering the changes in proteins at different times is consistent with the development of actual life activities. Few algorithms have considered this.
Because the algorithm experiments proposed by the predecessors are based on the protein which has been confirmed to be correct, there is no cross validation [28,29,30] in this paper. Next, we consider using some methods to do relevant tests. And some proteins like CDC53 appear more frequently in the whole process of biological evolution, but it is not classified as essential protein in the dataset used. Next, we will compare whether the dataset itself is overlooked, but it’s really an essential protein.
Future research should also consider that an edge plays a very small role in the network because the weight of an edge decreases with time and reaches a minimum threshold. It can be directly subtracted to save time and space. In this way, a minimum threshold is set for the weight of edges. Future research could apply this algorithm to the study of dynamic protein sequence data.
Conclusions
To consider the influence of historical data on current protein evolution data, a dynamic protein network model based on the attenuation coefficient is proposed. In this model, rather than simply generalizing the presence or absence of proteins at each time point, a dynamic protein network modeling method based on the attenuation coefficient is used to record the changes of proteins in the process of biological evolution according to their corresponding occurrences. In the proposed model, the traditional key node search method, the hindex algorithm, which neglects neighbor attributes, is improved. The cumulative function is used to account for the varying degrees of the attributes of neighboring nodes, which improves the accuracy of the search for essential proteins. To verify the validity of the method, different key node search methods were applied to a dynamic protein network. The experimental results show that the model established by the IHindex method is more convenient for accurately identifying essential proteins.
Availability of data and materials
The datasets used during the current study are available in:
DIP:http://dip.deombi.ucla.edu/dip/Stat.cgi,
MIPS:http://mips.helmholtzmuenchen.de/proj/ppi.
References
 1.
Qiao S, Yan B, Li J. Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features. Appl Intell. 2018;48(7):1813–24.
 2.
Zhao B, Wang J, Li M, Wu FX, Pan Y. Prediction of essential proteins based on overlapping essential modules. IEEE Transact Nano Biosci. 2014;13(4):415–24.
 3.
Peng W, Wang JX, Wang W, et al. Iteration method for predicting essential proteins based on orthology and proteinprotein interaction networks. BMC Syst Biol. 2012;6(1):87.
 4.
Wang S, Cuomo S, Mei S, Cheng W, Xu N. Efficient method for identifying influential vertices in dynamic networks using the strategy of local detection and updating. Futur Gener Comput Syst. 2019;91:10–24.
 5.
Wan S, Mak MW, Kung SY. Sparse regressions for predicting and interpreting subcellular localization of multilabel proteins. BMC Bioinform. 2016;17(1):97.
 6.
Li M, Lu Y, Xiang N, Pan W. Identification of essential proteins by using complexes and interaction network. Bioinform Res Appl. 2014:255–65.
 7.
Luo J, Qi Y. Identification of essential proteins based on a new combination of local interaction density and protein complexes. PLoS One. 2015;10(6):e0131418.
 8.
Hu P, Mei T. Ranking influential nodes in complex networks with structural holes. Phys A: Stat Mech Appl. 2018;490:624–31.
 9.
Wang Z, Zhao Y, Xia J, Du C. Fast ranking influential nodes in complex networks using a kshell iteration factor. Phys A: Stat Mech Appl. 2016;461:171–81.
 10.
Lei X, Yang X, Fujita H. Random walk based method to identify essential proteins by integrating network topology and biological characteristics. KnowlBased Syst. 1671;2019:53–67.
 11.
Zhang W, Xu J, Li Y, Zou X. Detecting Essential Proteins Based on Network Topology, Gene Expression Data, and Gene Ontology Information. IEEE/ACM Trans Comput Biol Bioinform. 2018;15(1):109–16.
 12.
Lei X, Wang S, Wu FX. Identification of essential proteins based on improved HITS algorithm. IEEE/ACM Trans Comput Biol Bioinform Genes. 2019;10(2):177.
 13.
Mistry D, Wise R, Dickerson J. DiffSLC : a graph centrality method to detect essential proteins of a proteinprotein interaction network. PloS One. 2017.
 14.
Zaki N, Berengueres J, Efimov D. Detection of protein complexes using a protein ranking algorithm. Proteins: Structure, Function, and Bioinformatics. 2012;80(10):2459–68.
 15.
Jensen LJ, Gupta R, Staerfeldt HH, Brunak S. Prediction of human protein function according to gene ontology categories. Bioinformatics. 2003;19(5):635–42.
 16.
Wan S, Mak MW, Kung SY gLEN. Interpretable prediction of subcellular multilocalization of grampositive and gramnegative bacterial proteins. Chemom Intell Lab Syst. 2017;162:1–9.
 17.
Schlicker A, Domingues FS, Rahnenführer J, Lengauer T. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinform. 2006;7(1):302.
 18.
Masic I, Begic E. Scientometric dilemma: is Hindex adequate for scientific validity of Academic’s work? ACTA INFORM MED. 2016;24(4):228–32.
 19.
Xiao Y, Lv Y, Zhao H, Gong Y. Predicting the Functions of Long Noncoding RNAs Using RNASeq Based on Bayesian Network. BioMed Res Int. 2015;3:1–14.
 20.
Lo YS, Chen YC, Yang JM. 3Dinterologs: an evolution database of physical proteinprotein interactions across multiple genomes. 2010;11(Suppl 3):S7.
 21.
Fleur J, Claire J, Andreas H. Integrated web visualizations for proteinprotein interaction databases. BMC Bioinformatics. 2015;195.
 22.
Luo J, Li G, Dan S, Cheng L. Integrating Functional and Topological Properties to Identify Biological Network Motif in Protein Interaction Networks. J Comput Theor Nanosci. 2014;744–50.
 23.
Zhang R, Lin Y. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes[J]. Nucleic Acids Res. 2009;7:D455–D458.
 24.
Zhang X, Xiao W, Hu X. Predicting essential proteins by integrating orthology, gene expressions, and PPI networks. Plos One. 2018;13(4):e0195410.
 25.
Bae J, Kim S. Identifying and ranking influential spreaders in complex networks by neighborhood coreness. Phys A Stat Mech Appl. 2014;395:549–59.
 26.
Lü L, Zhou T, Zhang QM, Stanley HE. The Hindex of a network node and its relation to degree and coreness. Nat Commun. 2016;7:10168.
 27.
Wang J, Li C, Xia C. Improved centrality indicators to characterize the nodal spreading capability in complex networks. Appl Math Comput. 2018;334:388–400.
 28.
Zhong J, Wang J, Peng W, Zhang Z, Pan Y. Prediction of essential proteins based on gene expression programming. BMC Genomics. 2013;14(4):S7.
 29.
Wan S, Mak M, Kung SY. Ensemble linear neighborhood propagation for predicting subchloroplast localization of multilocation proteins. J Proteome Res. 2016;15(12):4755–62.
 30.
Chou KC, Wu ZC, Xiao X. iLocEuk: a multilabel classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PloS One. 2011;6(3):e18258.
Acknowledgements
We thank Katherine Thieltges from Liwen Bianji, Edanz Editing China (www.liwenbianji.cn/ac), for editing the English text of a draft of this manuscript.
Funding
This study was supported in part by grants from Jiangsu Province Science Foundation for Youths (No.BK20180822), Natural Science Research Projects in Jiangsu Higher Education Institution (No. 18KJB520040) and National Natural Science Foundation of China (No.61906100). The funder Caiyan Dai, is responsible for the algorithm design and model establishment of this study.
Author information
Affiliations
Contributions
DCY analyzed and established the evolution model. HJ made a comparison with other experiments and made to a conclusion. DYW was a major contributor in writing the manuscript. HKF was responsible for guiding the implementation of the work and verifying the algorithm. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Dai, C., He, J., Hu, K. et al. Identifying essential proteins in dynamic protein networks based on an improved hindex algorithm. BMC Med Inform Decis Mak 20, 110 (2020). https://doi.org/10.1186/s1291102001141x
Received:
Accepted:
Published: