Efficient protein structure search using indexing methods
© Kim et al.; licensee BioMed Central Ltd. 2013
Published: 5 April 2013
Understanding the functions of proteins is one of the most important challenges in many studies of biological processes. The function of a protein can be predicted by analyzing the functions of structurally similar proteins, so finding structurally similar proteins accurately and efficiently in a large set of proteins is crucial. A protein structure can be represented as a vector by the 3D-Zernike Descriptor (3DZD), which compactly represents the surface shape of the protein tertiary structure. This simplified representation accelerates the search process. However, computing the similarity of two protein structures is still computationally expensive, so it is hard to efficiently process many simultaneous requests for structurally similar protein search. This paper proposes indexing techniques that substantially reduce the time needed to find structurally similar proteins. In particular, we first apply two indexing techniques, iDistance and iKernel, to the 3DZDs. We then extend the techniques to further improve the search speed for protein structures. The extended indexing techniques build and use a reduced index constructed from the first few attributes of the 3DZDs of protein structures. To retrieve the top-k similar structures, the top-10 × k similar structures are first found using the reduced index, and the top-k structures are then selected among them. We also modify the indexing techniques to support θ-based nearest neighbor search, which returns the data points whose distance to the query point is less than θ. The results show that both iDistance and iKernel significantly enhance the search speed. In top-k nearest neighbor search, the search time is reduced by 69.6%, 77%, 77.4% and 87.9% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively. In θ-based nearest neighbor search, the search time is reduced by 80%, 81%, 95.6% and 95.6% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively.
The size of protein structure databases such as the Protein Data Bank (PDB) continues to grow. PDB had around 1000 structures in 1992, but it now stores over 150,000 structures. In addition, the number of proteins with unknown functions is increasing due to efforts in structural genomics projects. Knowing the functions of proteins is crucial to many studies of biological processes. In particular, researchers need to identify the key proteins that play an important role in severe diseases, which is directly related to human life. Therefore, assigning functions to novel proteins is one of the most significant problems in proteomic study, and several methods have been developed to assign functions to an unknown protein. Basically, the function of a protein can be identified by searching an amino acid sequence database for similar sequences whose functions are already known. However, the 3D structures of proteins are more conserved than their sequences, so using structural information provides more reliable similarity measures.
Many methods have been introduced for pair-wise protein structural search. They align the two structures and compute the Root Mean Square Deviation (RMSD) between the core atomic positions, e.g., alpha carbon coordinates, of the aligned proteins. However, most methods based on structural alignment cannot be used to search structures against a large database because of their high computational complexity. Sael et al. introduced a new approach for fast protein surface similarity search using 3DZDs. This approach does not consider individual residue/atom positions or the arrangement of the secondary structure segments. 3DZD has three advantages: 1) fast k-nearest neighbor search, 2) rotational invariance, and 3) easy adjustment of the resolution of the structural representation. In particular, using 3DZDs, it is possible to retrieve similar protein structures in seconds among 150k protein structures. However, a few seconds is still too long for a real-time search system, since response time increases further when multiple search requests are processed simultaneously. Indexing techniques are a promising way to enhance the search speed [2, 3]. We apply indexing techniques to 3DZDs in order to speed up protein structure search. Specifically, we apply two indexing techniques, iDistance and iKernel, to the 3D-Surfer data set, and extend them for further speedup. To take full advantage of the indexing techniques, we also provide θ-based nearest neighbor search, which returns the data points whose distance to the query point is less than θ. The experimental results show that both indexing techniques decrease the search time, and that our nearest neighbor search algorithms further speed up the protein structure search. Specifically, in top-k nearest neighbor search, the search time is reduced by 69.6%, 77%, 77.4% and 87.9% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively. In θ-based nearest neighbor search, the search time is reduced by 80%, 81%, 95.6% and 95.6% using iDistance, iKernel, the extended iDistance, and the extended iKernel, respectively.
This paper is organized as follows. We briefly introduce related work on protein structure search and top-k query search. Then, we explain iDistance, iKernel, and the extended top-k query search method in combination with iDistance and iKernel. Finally, we provide experimental results to verify the efficiency of our approaches, and conclude with future work.
A protein consists of a sequence of amino acid (AA) residues. A sequence of AA residues folds into a 3-dimensional (3D) structure in space and forms a functional protein. A 3D structure of a protein is recorded in a pdb file format as a set of Cartesian coordinates of all the atoms in the protein. The 3D structure contains rich information relating to function and evolution of the protein.
Protein structure search
Earlier structural similarity measurements were designed for pair-wise analysis, where the user only needed to compare a handful of protein structures [5–7]. However, as the number of known structures increased, more methods were proposed for similarity search in protein databases [8, 9]. One of the most intuitive approaches is to compare the coordinates of corresponding residues or atoms of proteins after structural alignment [10, 11]. Root Mean Square Deviation (RMSD) is often used as the similarity measure. Due to its high computational complexity, structure alignment is done using Dynamic Programming (DP) or its extensions [6, 12, 13].
There are major structure databases such as PDB, CATH, and SCOP, which provide only keyword search and browsing of pre-computed classifications. Database systems that accept a query structure for search include the Distance matrix ALIgnment (DALI) server, Vector Alignment Search Tool (VAST) search, and the eF-site database. Given a query protein structure, they need around an hour to finish searching their databases. Zeyar et al. suggest an indexing method called ProtDex for fast search in a 3D protein structure database. Although it performs faster than DaliLite, one of the most popular protein structure search algorithms, a ProtDex search still takes more than a few minutes, which is not practical for online database searches.
3D-Surfer is a new and efficient protein structural search system which represents protein structures using the 3D-Zernike Descriptor (3DZD). The major advantage of 3DZD is that it allows a fast k-nearest neighbor (k-nn) search of protein structures. It has been verified that the k-nn proteins retrieved by 3D-Surfer have similar functional and evolutionary information in terms of SCOP classification. The 3DZD is rotationally invariant, the resolution of the representation of protein structures is easily adjusted by changing the order, and descriptors of a lower order are contained in the descriptors of a higher order.
Nearest neighbor search algorithm
There is a long line of research on the nearest neighbor search problem, an optimization problem of finding the closest points in metric spaces. The simplest method is to compute the distance from the query point to every other point in the database. It has O(Nd) complexity, where N is the number of data points and d is the dimensionality of the data; 3D-Surfer also used this approach. For efficient top-k search, various methods based on space partitioning have been proposed, including X-tree, TV-tree, and SR-tree [21–23]. iDistance, which we use here, is also a space partitioning method. Another method is iKernel, an indexing technique designed for efficient computation with support vector machines (SVM). The details of iDistance and iKernel are described in the methods. Note that these methods cannot be applied directly to protein structure data; in this work we therefore use the 3DZDs of protein structures and apply the indexing techniques to the 3DZDs.
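As a point of reference for the O(Nd) baseline described above, the following minimal Python sketch performs a linear scan. The function names and the toy 3-dimensional vectors are illustrative assumptions, not part of 3D-Surfer; real 3DZDs would be 121-dimensional.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linear_scan_topk(query, database, k):
    """Return the k (distance, index) pairs closest to `query`.

    Every vector is visited exactly once, so the cost is O(N * d).
    """
    scored = ((euclidean(query, vec), idx) for idx, vec in enumerate(database))
    return heapq.nsmallest(k, scored)


# Toy 3-dimensional vectors standing in for 121-dimensional 3DZDs.
database = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]
nearest = linear_scan_topk([0.9, 0.1, 0.0], database, k=2)
print([idx for _, idx in nearest])  # [1, 0]
```

Because every vector is touched, the fraction of the database accessed by this baseline is always 1; the indexing techniques aim to lower that fraction.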
In this section, we first introduce the protein structure dataset and its 3D-Zernike Descriptors (3DZDs). We then describe the iDistance and iKernel methods and the proposed efficient top-k query search method based on the characteristics of 3DZD.
Protein structural dataset and 3D-Zernike Descriptor
3DZDs are compact and rotationally invariant representations of 3D structures. 3DZD has been successfully used for protein and ligand structure analyses as well. We provide a brief description of 3DZD for the reader's convenience. A detailed description can be found in [25, 26].
The 3DZDs for a protein structural dataset of 158,781 protein chain structures were obtained from the 3D-Surfer database. All structures in PDB were collected and processed in 2009. For each pdb file, which contains one to several protein chains, the chains were separated, and the surface of each chain was obtained with the molecular surface calculation program MSROLL version 3.9.3 and then voxelized. Each voxelized protein surface was used as input to the 3DZD conversion program, and a vector of 121 numbers, called invariants, was computed.
where n is the order of the 3DZD, which determines the resolution of the descriptor. Taking the norms makes the descriptor rotationally invariant. For each pair of n and l, the 3DZD has a series of invariants, which are the numbers in the 3DZD vector, where n ranges from 0 to the predefined order (20 in this case).
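The figure of 121 invariants can be checked with a short count. The sketch below assumes the index constraint commonly stated in the 3DZD literature [25, 26], which is not spelled out in the text above: one invariant per pair (n, l) with 0 ≤ l ≤ n and n − l even. The function name is hypothetical.

```python
def zernike_index_pairs(max_order):
    """List the (n, l) pairs with 0 <= l <= n <= max_order and (n - l) even."""
    return [
        (n, l)
        for n in range(max_order + 1)
        for l in range(n + 1)
        if (n - l) % 2 == 0
    ]


pairs = zernike_index_pairs(20)
print(len(pairs))  # 121
```

Under this assumption, order 20 yields exactly the 121 numbers mentioned above, and truncating the list at a lower order gives a prefix of the same vector, which matches the remark that lower-order descriptors are contained in higher-order ones.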
In this work, we exploit two indexing techniques: iDistance and iKernel. Both techniques partition the given data points into clusters and use the clusters to find the k-nn. Note that both techniques retrieve the exact k nearest neighbors for a given query. The details of the two indexing methods follow.
iDistance is an efficient indexing technique for k-nearest neighbor search in a high-dimensional metric space. Its performance depends on how the data are partitioned and how the reference point of each partition is defined (we will henceforth refer to a partition as a cluster, for consistent terminology between iDistance and iKernel). After clustering and reference point selection, each data point is indexed according to the distance to the reference point of its cluster:
y = i × C + dist(Oi, p)
where y is the iDistance of point p in the i-th cluster, Oi is the reference point of that cluster, and C is a constant used to stretch the data ranges of the indexes.
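To make the indexing idea concrete, here is a simplified Python sketch of an iDistance-style index. It is an illustration under stated assumptions, not the implementation used in the experiments: a sorted list with binary search stands in for the B+-tree, the key is taken to be i × C + dist(Oi, p), and the class name, the value of C, and the radius increment `delta_r` are hypothetical.

```python
import bisect
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IDistanceIndex:
    """Toy iDistance index: a sorted key list stands in for the B+-tree."""

    def __init__(self, points, refs, C=1000.0):
        # C must exceed every cluster radius so key ranges do not overlap.
        self.points, self.refs, self.C = points, refs, C
        self.max_radius = [0.0] * len(refs)
        entries = []
        for pid, p in enumerate(points):
            # Assign each point to its closest reference point.
            i = min(range(len(refs)), key=lambda j: dist(p, refs[j]))
            d = dist(p, refs[i])
            self.max_radius[i] = max(self.max_radius[i], d)
            entries.append((i * C + d, pid))  # one-dimensional key
        entries.sort()
        self.keys = [key for key, _ in entries]
        self.ids = [pid for _, pid in entries]

    def range_search(self, q, r):
        """Return sorted (distance, id) pairs with distance <= r."""
        hits = []
        for i, ref in enumerate(self.refs):
            dq = dist(q, ref)
            if dq - r > self.max_radius[i]:
                continue  # the query sphere cannot reach this cluster
            lo = i * self.C + max(0.0, dq - r)
            hi = i * self.C + min(self.max_radius[i], dq + r)
            start = bisect.bisect_left(self.keys, lo)
            stop = bisect.bisect_right(self.keys, hi)
            for pid in self.ids[start:stop]:
                d = dist(q, self.points[pid])
                if d <= r:
                    hits.append((d, pid))
        return sorted(hits)

    def knn(self, q, k, delta_r=0.2):
        """Grow the search radius by delta_r until k neighbours lie inside."""
        r = delta_r
        while True:
            hits = self.range_search(q, r)
            if len(hits) >= k or len(hits) == len(self.points):
                return hits[:k]
            r += delta_r


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
          [10.0, 10.0], [11.0, 10.0], [10.0, 11.0]]
index = IDistanceIndex(points, refs=[[0.0, 0.0], [10.0, 10.0]])
print([pid for _, pid in index.knn([0.9, 0.2], k=2)])  # [1, 0]
```

By the triangle inequality, any point within distance r of the query has a key inside the scanned interval of its cluster, so the range search misses nothing; once k points lie within the current radius, they are the exact k nearest neighbors.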
iKernel was originally designed for the efficient learning of support vector machines (SVM). However, it is also applicable to top-k search under Euclidean distance. Similar to iDistance, it first divides the given data points into clusters, where each cluster has a set of rings, the data structure defined for iKernel. Given a query, it searches for the k-nn by visiting each cluster and its rings.
Note that each ring has g data points. The parameter g is user adjustable and needs to be determined prior to index construction.
To process a k-nn query for a query structure, we exploit the index and the Minimal Possible Distance (MPD). The MPD is the minimal possible distance between a query q and a ring structure Ci,j. With this notion of the MPD, k-nn search works as follows. Given a query point q, we first initialize a priority queue Q with a pair ⟨Ci,j, MPD⟩ for each cluster, ordered by ascending MPD to q, where only the outermost ring of each cluster is considered at first. Then, at each iteration, the top entry of Q is popped. If a ring is popped, the data points in the ring, together with the next inner ring of the same cluster, are inserted into Q with their distances from q. If the popped item is a data point, it is simply added to the top-k result, since the priority queue ensures that all remaining data points in the queue have larger distances from q and all remaining rings have larger MPDs to q.
Top-2 query processing: updated Q after each iteration (new items in bold; [p] marks a data point entry)
C5,4, C2,3, C3,2, C1,3, C4,4
[p], C5,3, C2,3, [p], [p], C3,2, C1,3, C4,4
C5,3, C2,3, [p], [p], C3,2, C1,3, C4,4
C2,3, C5,2, [p], [p], [p], [p], C3,2, C1,3, C4,4
[p], C5,2, [p], C2,2, [p], [p], [p], [p], [p], [p], C3,2, C1,3, C4,4
C5,2, [p], C2,2, [p], [p], [p], [p], [p], [p], C3,2, C1,3, C4,4
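The queue-driven search can be sketched in Python as follows. This is a simplified illustration, not the iKernel implementation: the MPD is assumed here to be the triangle-inequality lower bound max(0, dist(q, center) − outer radius of the ring), rings are built by sorting each cluster's points by distance to an assumed cluster center, and all function names are hypothetical.

```python
import heapq
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_rings(points, centers, g):
    """Per cluster, list rings from innermost to outermost.

    Each ring is (outer_radius, [point ids]) and holds at most g points.
    """
    clusters = [[] for _ in centers]
    for pid, p in enumerate(points):
        i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
        clusters[i].append((dist(p, centers[i]), pid))
    rings = []
    for members in clusters:
        members.sort()
        chunks = [members[s:s + g] for s in range(0, len(members), g)]
        rings.append([(c[-1][0], [pid for _, pid in c]) for c in chunks])
    return rings


def mpd(q, center, outer_radius):
    """Assumed MPD: lower bound on dist(q, p) for any p within outer_radius."""
    return max(0.0, dist(q, center) - outer_radius)


def ring_knn(q, points, centers, rings, k):
    """Best-first k-nn over rings and points; returns (result, visited)."""
    heap = []  # entries: (priority, kind, cluster, ring index or point id)
    for i, cluster_rings in enumerate(rings):
        if cluster_rings:
            j = len(cluster_rings) - 1  # outermost ring first
            heapq.heappush(heap, (mpd(q, centers[i], cluster_rings[j][0]), 0, i, j))
    result, visited = [], 0
    while heap and len(result) < k:
        prio, kind, i, j = heapq.heappop(heap)
        if kind == 1:  # a data point at the top of the queue is final
            result.append((prio, j))
            continue
        for pid in rings[i][j][1]:  # a ring: insert its data points
            visited += 1
            heapq.heappush(heap, (dist(q, points[pid]), 1, i, pid))
        if j > 0:  # expose the next inner ring of the same cluster
            heapq.heappush(heap, (mpd(q, centers[i], rings[i][j - 1][0]), 0, i, j - 1))
    return result, visited


points = [[float(x), float(y)] for x in range(6) for y in range(6)]
centers = [[1.0, 1.0], [4.0, 4.0]]
rings = build_rings(points, centers, g=4)
result, visited = ring_knn([2.2, 3.1], points, centers, rings, k=2)
print(result, visited / len(points))
```

The ratio `visited / len(points)` corresponds to the fraction of data points whose distance was actually computed; rings whose bound exceeds the k-th distance are never expanded.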
For both indexing techniques, the data set must be divided into clusters, assigning a similar number of data points to each cluster. The first method randomly selects reference points and assigns the remaining data points to their closest reference point. K-means clustering, one of the most widely used clustering methods, is also tested. The goal of the K-means clustering algorithm is to divide a set of points into k clusters so that the within-cluster sum of squares is minimized. The K-means algorithm is easy to apply, and its performance is often satisfactory. However, K-means is a local search procedure, and it suffers from the serious drawback that its performance depends on the initial starting conditions. Therefore, in this work, we repeatedly cluster the data points, conduct experiments, and select the best result.
Extended top-k search based on 3DZD
Based on the observation that the leading dimensions of the 3DZD indicate the global shape of the protein structure, we introduce a new approach for top-k search as follows.
1. Given a query protein Q, retrieve the top-k × 10 results using the index built on the first 60 dimensions (about half of the full 121 dimensions).
2. Among the top-k × 10 results, find the exact top-k results using all dimensions.
Note that k × 10 is a very small number compared to the size of the database (around 160,000 structures).
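The two steps above can be sketched in Python as a filter-and-refine search. In this illustration a linear scan over the leading dimensions stands in for the reduced iDistance or iKernel index, and the function and parameter names (`extended_topk`, `reduced_dims`, `factor`) are hypothetical.

```python
import heapq
import math


def dist(a, b, dims=None):
    """Euclidean distance over the first `dims` coordinates (all if None)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:dims], b[:dims])))


def extended_topk(query, database, k, reduced_dims=60, factor=10):
    """Two-step search: filter on a prefix of the vector, then re-rank."""
    # Step 1: k * factor candidates ranked on the leading dimensions only.
    candidates = heapq.nsmallest(
        k * factor,
        range(len(database)),
        key=lambda i: dist(query, database[i], reduced_dims),
    )
    # Step 2: re-rank the candidates with the full-dimensional distance.
    return heapq.nsmallest(k, candidates, key=lambda i: dist(query, database[i]))


# Toy 4-dimensional vectors; the first 2 dimensions play the reduced role.
database = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 4.0, 4.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.2, 0.1, 9.0, 9.0],
    [5.0, 5.0, 5.0, 5.0],
]
print(extended_topk([0.2, 0.2, 0.5, 0.5], database, k=1, reduced_dims=2, factor=2))  # [0]
```

The result is approximate in general: a true neighbor is missed whenever it falls outside the k × 10 candidates selected on the reduced dimensions, which is why the fraction of true top-k results recovered is worth reporting alongside the timing.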
Threshold-based nearest neighbor search
The nearest neighbor search can be driven by one of two user parameters: the number of nearest neighbors, k, or a threshold on the distance between the nearest neighbors and the query, θ (from now on, we call the second approach θ-based nearest neighbor search). We therefore also apply the indexing techniques to θ-based nn search. With θ, a data point is a nearest neighbor only if its distance to the query is shorter than θ. In the linear scan, we check whether the distance between each protein structure and the query is less than θ; we do not need to track the number of nearest neighbors, k. In iDistance, we set the query range r to θ. The number of visited clusters, and hence the number of distance computations between their data points and the query, then decreases when fewer than k neighbors lie within distance θ. In iKernel, the search process terminates as soon as the Minimal Possible Distance (MPD) of the popped instance is no longer smaller than θ. This also reduces the cost of visiting clusters and computing the distances between the query and their data points. Note that, using θ, the extended approaches always return the exact nearest neighbors.
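One way to see why θ-based search with a reduced index stays exact is that the Euclidean distance over a prefix of the dimensions never exceeds the distance over all dimensions. The Python sketch below illustrates this with a plain scan standing in for the reduced index; the function and parameter names are hypothetical.

```python
import math


def dist(a, b, dims=None):
    """Euclidean distance over the first `dims` coordinates (all if None)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a[:dims], b[:dims])))


def theta_search(query, database, theta, reduced_dims=60):
    """Return sorted (distance, index) pairs with full distance < theta."""
    hits = []
    for idx, vec in enumerate(database):
        # The prefix distance is a lower bound on the full distance, so a
        # vector rejected here cannot be within theta of the query.
        if dist(query, vec, reduced_dims) >= theta:
            continue
        d = dist(query, vec)
        if d < theta:
            hits.append((d, idx))
    return sorted(hits)


database = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 4.0, 4.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.2, 0.1, 9.0, 9.0],
    [5.0, 5.0, 5.0, 5.0],
]
hits = theta_search([0.2, 0.2, 0.5, 0.5], database, theta=3.0, reduced_dims=2)
print([idx for _, idx in hits])  # [0, 2]
```

Unlike the top-k × 10 filter, the threshold filter discards only vectors that are provably too far away, so the refinement step sees every true result.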
In this section, we verify the effectiveness of the indexing techniques for top-k search of protein structures. Sael et al. showed that 3DZD works well for finding similar proteins in terms of functional and evolutionary characteristics based on SCOP classification. SCOP provides an ordering of all proteins of known structure according to their evolutionary and structural relationships. In addition, neither iDistance nor iKernel is an approximate technique: both find the exact top-k nn in the database according to the structural similarity described by 3DZD. Therefore, we only measure efficiency, in terms of processing time and evaluation ratio. The evaluation ratio is the number of accessed data points divided by the number of data points in the database (1 for the linear scan, since it accesses all data points in the data set). The processing time can be affected by various factors, including the performance of the machine, the number of users, and the network environment. In contrast, the evaluation ratio is a consistent measure.
The experiments were conducted on a machine with an Intel Core(TM) i7 CPU (3.40GHz) and 16 GB of memory. In all experiments, we used 100 data points randomly selected from the data set and averaged the processing time and evaluation ratio over them.
The user parameters
There are a few parameters that need to be optimized in the iDistance and iKernel methods. In this section, we observe how the results vary with the user parameters in order to select the best values. We also observe how the number of clusters affects the top-k search. We partition the data points using different numbers of clusters: 121, 242, 498, and 866. 121 is the dimensionality of the data set, and 242 is twice the dimensionality (a choice reported to work well for iDistance). The other two numbers follow the SCOP classification hierarchy: 498 is the number of families, and 866 is the number of protein domains. We assume that numbers defined by domain experts provide good evidence for the number of clusters. The following describes the user parameters and their experimental results.
For the optimized values, we decided to use 0.2 as Δr and 866 as the number of clusters, since this setting shows the best result in terms of the evaluation ratio. Although it does not give the best result in terms of processing time, the difference among the values of Δr is small compared to the difference among the dimensions.
For the optimized values, we decided to use 50 as g and 866 as the number of clusters, since this setting shows the best result in terms of both the evaluation ratio and the processing time.
The comparison of clustering techniques
The effectiveness of clustering (Processing time (sec.)/Evaluation ratio)
The number of nearest neighbor, k
The filtering threshold, θ
The stability of processing time/evaluation ratio
The enhancement with the extended nearest neighbor search
In this section, the results of the extended approaches are compared with the best results of the basic approaches discussed in the previous section, using the optimal parameters found in the previous experiments. First, we provide the results of the proposed approaches for the top-k nearest neighbor problem (Table 3). In the table, the number in brackets is the fraction of the actual top-25 results contained in the top-25 results approximately obtained by the extended approach (the same as the preliminary result in Figure 6). While the improvement of iDistance and iKernel with the basic top-k search is not that large, the extended approaches work much faster than the basic approaches. In addition, iKernel always works faster than iDistance. Among the iKernel results, the basic approach may appear to work faster than the extension, but this is not the case. When a data instance is accessed to compute inner products of vectors, the basic approach operates on 121-dimensional vectors, whereas the extended approach operates on 60-dimensional vectors. This indicates that, when the measured difference between the two approaches is small, the extended approach may in practice work better than the basic approach. In addition, when the number of queries is small, the quality is comparable. However, as the number of queries becomes large, the difference in processing time becomes larger as well.
The comparison of the proposed approaches in top-k nearest neighbor search (Proc. is the processing time measured in seconds and Eval. is the evaluation ratio)
The comparison of the proposed approaches in θ-based nearest neighbor search (Proc. is the processing time measured in seconds and Eval. is the evaluation ratio)
In this paper, we introduce efficient indexing for protein structure search, where protein structures are represented as vectors by the 3D-Zernike Descriptor (3DZD). When retrieving the top-k nearest neighbors using indexing techniques alone, we were able to reduce the search time by 77% compared to the previous version of 3D-Surfer, which uses a linear scan of Euclidean distances between the 3DZDs in the database. We also proposed an extended version of the protein structure search based on the key observation that the leading dimensions of the descriptor indicate the global shape of the protein structure. Using the extended techniques, the reduction improves to up to 87.9%. When retrieving the nearest neighbors whose distance to the query is shorter than θ using indexing techniques alone, we were able to reduce the search time by 81% compared to the linear scan. Using the extended techniques, the reduction improves to up to 95.6%. For future work, we will improve the nearest neighbor search with indexing techniques by utilizing the characteristics of the query prior to searching. In addition, we will apply indexing techniques to protein binding site similarity search with other data sets represented by 3DZD.
This work was partially supported by the Mid-career Researcher Program through an NRF grant funded by the MEST (No. KRF-2011-0016029). This work was also supported by the IT Consilience Creative Program of MKE and NIPA (C1515-1121-0003).
This work is based on an earlier work: “Indexing methods for efficient protein 3D surface search”, in Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics, 2012 © ACM, 2012. http://doi.acm.org/10.1145/2390068.2390078
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Medical Informatics and Decision Making Volume 13 Supplement 1, 2013: Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBio 2012). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedinformdecismak/supplements/13/S1.
1. Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein tertiary structure retrieval based on global surface shape similarity. Proteins. 2008, 72: 1259-1273. 10.1002/prot.22030.
2. Deng K, Zhou X, Shen HT, Liu Q, Xu K, Lin X: A multi-resolution surface distance model for k-NN query processing. VLDB. 2008, 1101-1119.
3. Shen HT, Huang Z, Cao J, Zhou X: High-dimensional indexing with oriented cluster representation for multimedia databases. VLDB. 2009, ICME, 1628-1631.
4. Kim S, Lee S, Yu H: Indexing methods for efficient protein 3D surface search. Proceedings of the ACM Sixth International Workshop on Data and Text Mining in Biomedical Informatics. 2012, New York: ACM, 41-48. 10.1145/2390068.2390078.
5. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6: 377-385. 10.1016/S0959-440X(96)80058-3.
6. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11: 739-747. 10.1093/protein/11.9.739.
7. Singh AP, Brutlag DL: Hierarchical protein structure superposition using both secondary structure and atomic representations. Intl Syst for Mol Biol (ISMB). 2008, 1013-1022.
8. Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993, 233: 123-138. 10.1006/jmbi.1993.1489.
9. Martin A: The ups and downs of protein topology: rapid comparison of protein structure. Protein Eng. 2000, 13: 829-837. 10.1093/protein/13.12.829.
10. Mizuguchi K, Go N: Seeking significance in three-dimensional protein structure comparisons. Curr Opin Struct Biol. 1995, 5: 377-382. 10.1016/0959-440X(95)80100-6.
11. Kolodny R, Petrey D, Honig B: Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol. 2006, 16: 393-398. 10.1016/j.sbi.2006.04.007.
12. Kihara D, Skolnick J: The PDB is a covering set of small protein structures. J Mol Biol. 2003, 334: 793-802. 10.1016/j.jmb.2003.10.027.
13. Gerstein M, Levitt M: Using iterative dynamic programming to obtain accurate pairwise and multiple alignments of protein structures. Proc Int Conf Intell Syst Mol Biol. 1996, 4: 59-67.
14. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH-a hierarchic classification of protein domain structures. Structure. 1997, 5: 1093-1108. 10.1016/S0969-2126(97)00260-8.
15. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247: 536-540.
16. Holm L, Sander C: Touring protein fold space with DALI/FSSP. Nucleic Acids Res. 1998, 26: 316-319. 10.1093/nar/26.1.316.
17. Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins. 1995, 23: 356-369. 10.1002/prot.340230309.
18. Kinoshita K, Nakamura H: Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Sci. 2003, 12: 1589-1595. 10.1110/ps.0368703.
19. Aung Z, Fu W, lee Tan K: An efficient index-based protein structure database searching method. DASFAA. 2003.
20. Conte LL, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 2002, 30: 264-267. 10.1093/nar/30.1.264.
21. Ciaccia P, Patella M, Zezula P: M-tree: an efficient access method for similarity search in metric spaces. VLDB. 1997.
22. Keim D: Tutorial on high-dimensional index structures: Database support for next decades applications. ICDE. 2000.
23. Bruno N, Gravano L, Marian A: Evaluating top-k queries over web-accessible databases. ICDE. 2002.
24. Venkatraman V, Chakravarthy PR, Kihara D: Application of 3D Zernike descriptors to shape-based ligand similarity searching. J Cheminform. 2009, 1: 19. 10.1186/1758-2946-1-19.
25. Canterakis N: 3D Zernike Moments and Zernike Affine Invariants for 3D Image Analysis and Recognition. Image Analysis. 1999, 85-93.
26. Novotni M, Klein R: 3D zernike descriptors for content based shape retrieval. SM. 2003.
27. La D, Esquivel-Rodríguez J, Venkatraman V, Li B, Sael L, Ueng S, Ahrendt S, Kihara D: 3D-SURFER: software for high-throughput protein surface comparison and analysis. Bioinformatics. 2009, 25: 2843-2844. 10.1093/bioinformatics/btp542.
28. Connolly ML: The molecular surface package. J Mol Graph. 1993, 11 (2): 139-141. 10.1016/0263-7855(93)87010-3.
29. Jagadish HV, Ooi BC, Tan KL, Yu C, Zhang R: iDistance: An adaptive B-tree based indexing method for nearest neighbor search. Database Syst. 2005, 364-397.
30. Yu H, Ko I, Kim Y, Hwang S, Han WS: Exact indexing for support vector machines. Management of data.
31. MacQueen JB: Some methods for classification and analysis of multivariate observations. Math statist and Prob. 1967.
32. nad J Lozano JP, Larranaga P: An empirical comparison of four initialization methods for the k-means algorithm. Management of data. 1999.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.