Document not found! Please try again

Integrated search and alignment of protein structures - Bilkent CS.

Report 2 Downloads 61 Views
BIOINFORMATICS

ORIGINAL PAPER

Vol. 24 no. 24 2008, pages 2872–2879 doi:10.1093/bioinformatics/btn545

Structural bioinformatics

Integrated search and alignment of protein structures Ahmet Sacan1,2,∗ , I. Hakki Toroslu1 and Hakan Ferhatosmanoglu2,3 of Computer Engineering, Middle East Technical University, Ankara, Turkey, 2 Department of Computer Science and Engineering and 3 Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA

1 Department

Received on July 8, 2008; revised on September 26, 2008; accepted on October 19, 2008 Advance Access publication October 22, 2008 Associate Editor: Anna Tramontano

ABSTRACT Motivation: Identification and comparison of similar threedimensional (3D) protein structures has become an even greater challenge in the face of the rapidly growing structure databases. Here, we introduce Vorometric, a new method that provides efficient search and alignment of a query protein against a database of protein structures. Voronoi contacts of the protein residues are enriched with the secondary structure information and a metric substitution matrix is developed to allow efficient indexing. The contact hits obtained from a distance-based indexing method are extended to obtain highscoring segment pairs, which are then used to generate structural alignments. Results: Vorometric is the first to address both search and alignment problems in the protein structure databases. The experimental results show that Vorometric is simultaneously effective in retrieving similar protein structures, producing high-quality structure alignments, and identifying cross-fold similarities. Vorometric outperforms current structure retrieval methods in search accuracy, while requiring comparable running times. Furthermore, the structural superpositions produced are shown to have better quality and coverage, when compared with those of the popular structure alignment tools. Availability: Vorometric is available as a web service at http://bio. cse.ohio-state.edu/Vorometric Contact: [email protected]

1

INTRODUCTION

A tremendous amount of sequence and structure data is being produced with the motivation of deriving biological insights through analysis of similarities, differences and interactions among biological macromolecules. Whereas the sequence comparison methods are generally sufficient for comparing proteins that share a high level of similarity, structure comparison becomes essential in discerning more distant evolutionary relationships. Moreover, the spatial organization of the protein residues provides stronger clues into the biochemical function of the proteins than can be derived from the sequence information alone. Pairwise structure alignment is the basic step for comparing protein structures. Finding the optimal alignment has been proven to be NP-hard (Lathrop, 1994), and several heuristic approaches have

∗ To

whom correspondence should be addressed.

2872

been developed in DALI (Holm and Sander, 1993), CE (Shindyalov and Bourne, 1998) and MAMMOTH (Ortiz et al., 2002). The rapidly increasing size of the protein databases, however, has rendered exhaustive pairwise structure alignment infeasible. To overcome the difficulties presented by the database size, several strategies that aim to quickly identify relevant protein structures have recently been proposed. These strategies can best be summarized in terms of the choice of protein structure representation and the indexing method utilized for fast searching. ProGreSS (Bhattacharya et al., 2004) maps windows of protein backbone to a feature vector space using the curvature and torsion angles and the amino acid type information, and performs spatial indexing in this feature space. Protdex2 (Aung and Tan, 2004) represents the protein as a set of feature vectors of the contact regions among secondary structure elements (SSE) and uses an invertedfile index for searching. Yakusa (Carpentier et al., 2005) describes the protein structure as a sequence of its backbone dihedral α angles and uses a method analogous to BLAST for searching blocks of this sequence. 3D-BLAST (Tung et al., 2007) clusters the κ and α angles to reduce the description to an alphabet and constructs a BLOSUM-like substitution matrix for this backbone angle alphabet, so that BLAST algorithm can be used without any modifications. Currently available protein structure search methods provide database filtering, but defer a detailed structural alignment to further analysis by external alignment methods. More importantly, they focus on finding proteins that share similar overall topology or secondary structure composition, and are not sensitive to detect residue-level non-local interactions. Such non-local interactions are especially important in detecting functionally or evolutionarily significant similarities among proteins that span multiple structural folds (Brown et al., 2006; Friedberg and Godzik, 2005). In this study, Vorometric is proposed as an integrated approach to both search and alignment tasks. We collect residue interactions from the protein structures using Voronoi tesselation and build a database of these residue environments. For a query protein, similar residue environments are retrieved from the database and extended to obtain high-scoring segment pairs (HSPs), which are then used for structural superposition. We have developed a sensitive metric substitution matrix for accurate comparison of both amino acid and secondary structure information of related residue environments. Whereas an exhaustive search of similar residue environments in the database is prohibitive, our metric matrix has made distance-based indexing possible so that similar environments can be retrieved very efficiently. To the best of our knowledge, Vorometric is the first study

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Search and alignment of protein structures

employing distance-based indexing to protein structure data. The main benefits of our approach can be summarized as follows: • The correspondences obtained from search and extension of residue environments endorse integrated and accurate structural superpositions, so that further structural alignment by external programs is no longer necessary. • Unlike other protein structure search methods that at best capture the inter-SSE contacts, Vorometric provides contact sensitivity at the residue level. • The hit and extend methodology inherently detects local, flexible structure alignments, a feature not commonly available in pairwise structure alignment methods. We demonstrate the advantages and limitations of Vorometric using both quantitative performance evaluation on large-scale datasets and on several detailed case studies. The experimental results show that Vorometric outperforms other structure search tools, and at the same time, yields high-quality structural alignments that are comparable to or better than those produced by other structure alignment tools.

Fig. 1. Delaunay tessellation (dashed lines) of a set of points in 2D and 3D. The Voronoi diagram is shown for only 2D (solid lines). The 2D curve represents a projection of the 3D backbone segment from β2-microglobulin domain (3hla). The residue names are shown next to the Cα atoms. The length of the contact strings, generated from a large structure dataset, has an average of 11.6 and a maximum of 23 residues. We encode the amino acid type and the secondary structure assignment of the residues in each contact string. For instance, the contact string for the second, Histidine residue in Figure 1 (3D) is: RC HC∗ PC AC EC KC , where the secondary structure state is given in subscript notation, and the central residue is marked with an asterisk ‘∗ ’. We use DSSP (Kabsch and Sander, 1983) to obtain the secondary structure assignment for each residue and consider only the three primary states: α-helix (H), β-sheet (E), and turns (C).

2.2 2

METHODS

Vorometric exploits the fact that related protein structures share similar residue–residue interactions, in identification and alignment of related proteins. The residue interactions are captured using Voronoi tesselation and represented as a sequential string of residues. We incorporate both the amino acid type and secondary structure information into this representation. The residue contacts from all the proteins are then compiled in a database and metric indexing is used for fast similarity search in this database. For a query protein, the contacts that are similar to those formed by its residues are searched in the database, and hits are extended for structural alignment. In the next few sections, we describe each of these steps in detail.

2.1

Representing the residue environments

Voronoi tesselation has been proposed as an effective method for extracting multi-body contacts from protein structures (Krishnamoorthy and Tropsha, 2003), and successfully utilized in packing analysis (Richards, 1974), protein folding (Gan et al., 2001), structure alignment and structural motif mining (Sacan et al., 2007). Ilyin et al. (2004) observed that structurally related proteins share common Voronoi contacts and used this observation to systematically match compatible tetrahedrons by shape, volume and backbone topology in order to obtain candidate seeds for structure alignment. Roach et al. (2005) and Birzele et al. (2007) use a different representation of the Voronoi contacts to obtain a sequential representation which allows direct use of dynamic programming. Roach et al. (2005) measure the compatibility of the contacts through discretization of the Voronoi edge lengths, whereas Birzele et al. (2007) use another level of dynamic programming to compare the residue contacts. We acknowledge that a sequential representation of the residue environments is very effective for their comparison, and utilize a similar representation in this study. We use the location of Cα atoms to represent the amino acids of a protein structure as a set of points in 3D space. The region of space around each point closer to the enclosed point than any other point defines a Voronoi polyhedron (Fig. 1). Delaunay tessellation is obtained by connecting the points that share a Voronoi boundary. For each residue, we define the set of all of its Delaunay neighbors, ordered by their sequence number along the backbone, as its environment (also denoted as contact string). This definition of contacts encodes much of the geometric proximity information and provides an abstract description of the underlying geometry.

Comparison of the contact strings

An SSE-enriched distance matrix (described below) is used to compare individual elements of two contact strings (e.g. comparing a helix forming Arginine, RH , with a β-sheet forming asparagine, NE ). The optimal alignment that minimizes the edit distance between two SSE-enriched strings with respect to the distance matrix can be obtained using the classical dynamic programming method by Needleman and Wunsch (1970). Since the contact strings are relatively short, global alignment with linear gap penalty provides sufficient accuracy in comparing contact strings. Note that, even though most of the related proteins share similar Voronoi contacts, slight differences in the backbone configurations, such as caused by the inherent noise in the crystallography experiment, or insertion and deletion of backbone segments, may induce significantly different Voronoi contacts. In order to increase the robustness of the comparison measure, we consider the contacts and the central residues separately when aligning two contact strings. The piecewise definition of the distance function is especially useful in recovering from the differences in Voronoi contacts of the residues flanking highly conserved structural cores. Furthermore, it ensures that the central residues, whose compatibility is being measured in the first place, are aligned together. Based on these considerations, the distance between two contact strings E and F is defined as follows: D(E,F) = d(E − ,F − )+η d(E 0 ,F 0 )+d(E + ,F + )

(1)

where d is the edit distance between two SSE-enriched sequences and η is a parameter used to adjust the importance of the similarity of the central residues (0 ) compared with that of the contact residues preceding (− ) and following (+ ) the central residues. A small value of η can stress the similarity of contacts better, whereas a large value would work better in recovering from differences in the contacts due to structural divergence. A very large value of η would disregard the similarity of the contacts, and would not be informative in identifying similar structures, generating many false positives that share the same central residue type. We determine a single value of η that is optimized for structure retrieval and alignment performance as described in Section 2.7. Note that the edit distance between the central residues is simply a lookup in the distance matrix and does not require dynamic programming.

2.3

Metric SSE-enriched distance matrix

Sellers (1974) has proved that if a metric distance matrix is used in the global alignment, then the resulting edit distance also forms a metric.

2873

A.Sacan et al.

There has been a number of efforts to construct metric amino acid distance matrices (Zintzaras, 1999). On the other hand, a metric matrix that captures both the amino acid and SSE information is not available. Using an identity matrix is an obvious solution; however, the identity matrix is not sensitive to detect similarities between different types of amino acids. We construct a 60 × 60 SSE-enriched distance matrix (M) using a weighted combination of a metric amino acid distance matrix (N) that we have previously derived from four-body Delaunay contact profiles of amino acids (Sacan and Toroslu, 2007) and a metric SSE exchange matrix (K) derived from an SSE similarity matrix (Wallqvist et al., 2000) using the inter-row distance method (Zintzaras, 1999). The elements of M are defined as follows: M(a,s,b,t) = w1 N(a,b)+w2 K(s,t) (2) where a and b are types of amino acids, s and t are the SSE states and w1 ,w2 are positive weighing parameters to adjust the contributions of amino acid types and SSE states. In general, one would expect a smaller w1 and a larger w2 for proteins that are remote homologous, because the remote homologs usually maintain a well-conserved secondary structure despite their low sequence similarity. An analysis of the dependence of the w1 and w2 parameters on the sequence and structure similarity levels can be found in the Supplementary Material. A distance matrix (or function) f is metric if the following properties are satisfied for any three elements x, y and z: (1) Positivity: f (x,y) ≥ 0 (2) Identity: f (x,y) = 0 iff x = y (3) Symmetry: f (x,y) = f (y,x) (4) Triangle Inequality: f (x,z) ≤ f (x,y)+f (y,z) Now, we show that M, which is a weighted combination of the metric matrices N and K is also metric. (1) The weights and matrices in Equation (2) are all positive, which makes M to be positive. (2) If M(a,s,b,t) = 0, then N(a,b) = 0 and K(s,t) = 0 from Equation (2). Moreover, a = b, s = t because N and K satisfy identity. Then, it follows that a,s = b,t. The reverse condition is also true using the same premises. (3) w1 N(a,b)+w2 K(s,t) = w1 N(b,a)+w2 K(t,s) because both N and K are symmetric, therefore M is also symmetric. (4) M(a,s,b,t)+M(b,t,c,u) = w1 (N(a,b)+N(b,c))+w2 (K(s,t)+K(t,u)) ≥ w1 N(a,c)+w2 K(s,u) = M(a,s,c,u), therefore M also satisfies triangle inequality. Note that, the distance function D defined for the contact strings is similar to M, in that it is also composed of a weighted combination of functions that are metric. According to the properties shown above, both M and D are metric.

2.4

Indexing and searching contact strings

Having a metric distance function D to compare contact strings allows us to utilize distance-based indexing for efficient retrieval. The main idea in distance-based indexing is to organize and partition the data into a hierarchical structure based on distances to representative elements of the partitions at each level. A partition whose representative entry is too dissimilar to a query can then be pruned using the triangle inequality, without having to examine the rest of the entries in that partition. This allows an efficient and focused search over the data for entries similar to the query. [Please refer to Taskin and Ozsoyoglu (2004) for a survey of distance-based indexing methods.] While any metric indexing method can be used to index and search the contact strings, we have implemented the Slim-tree method (Traina et al., 2000) which achieves sufficient time and memory performance for the large datasets used in this study.

2874

Fig. 2. Extension of the hits from a database protein A. The seeds being extended are marked with ‘o’, and those that are pruned are marked with ‘x’. The gray area represents the cells that are explored by the dynamic programming and the black cells form the alignment paths of the HSPs. We extract the contact strings from all of the protein structures in a dataset, and index them with respect to the distance function D. For a given query protein structure, we extract its contact strings, and search for similar entries in the database that are within the range δ from the query contact strings. The parameter δ provides a threshold on the similarity of the contact strings being searched. A loose threshold would capture the contact strings of all protein structures that are similar to the query but may also result in many false positive hits. Whereas, a tight threshold would seek only the proteins that share highly conserved structural cores with the query.

2.5

Generating HSPs

The pseudocode for generating HSPs from the contact string hits is outlined in Algorithm 1. The hits obtained for the individual residues of the query protein are first grouped based on which database proteins they belong to. These hits (also called seeds) correspond to a pair of residues, one from the query and one from the database protein, and are represented by the individual cells of the dynamic programming table, as illustrated in Figure 2. Please note that the substitution score of each residue pair is defined by the similarity of their contact strings, so the hit extension phase is, in fact, a 2nd level of dynamic programming.

Algorithm 1 Generate HSPs from contact string hits Input: the contact string hits from the database Output: HSPs: high scoring segment pairs HSPs ← []; foreach protein A that has contact string hits do H ← sort hits to A by their distance to query strings; foreach hit h ∈ H do if h is already explored in dynamic programming table then continue; hsp ← ExtendHit (h); if hsp can be merged into a previous hsp ∈ HSPs then hsp ← MergeHSP(hsp ,hsp); else if Score(hsp) ≥ γ then add hsp to HSPs; The extension heuristic employed for each seed is similar to that of BLAST sequence search tool (Altschul et al., 1997) in that we also construct gapped local alignments in both forward and backward directions and only consider the cells in the dynamic programming table whose score falls not more than a fraction of the best score yet found. However, we introduce several notable enhancements over the basic method that increase the efficiency while maintaining the same level of sensitivity.

Search and alignment of protein structures

The hits to a database protein A are sorted based on their distances to the query contact strings, such that the more similar hits, which are more likely to be part of the final HSPs, are explored first. Naturally, we would expect many seeds on the alignment path of an HSP; extending each of these seeds would be redundant because they would find the same HSP path. We therefore skip the seeds whose residue pairings have already been explored by the extension of the previous seeds (Fig. 2). This heuristic effectively eliminates about 42% of the seeds from consideration (based on randomized searches on the ASTRAL-25 database). Furthermore, to overcome the problem of generating many short HSPs, we merge a new HSP if its alignment path intersects with that of a previously generated HSP and if the merging produces a higher score than the individual HSP scores. This strategy results in merging 7% of the HSPs, which otherwise would have been generated as separate, shorter alignment segments.

2.6

Structure superposition

The residue correspondences defined by the HSP alignments are used to obtain a structural superposition of the query and the related database proteins. Finding the optimal transformation that minimizes the root mean squared deviation (RMSD) between two structural alignments can be computed very fast: linear in the size of the proteins (Kabsch, 1978). Following the iterative optimization procedure commonly employed by the structure alignment tools, we derive a new set of correspondences from the superposition by finding the local alignment that minimizes the total distance of the aligned residues, and then repeat the iteration. A single iteration improves the structural alignment by readjusting the underlying residue correspondences to better reflect the spatial proximity of the residues in the structural superposition. The procedure is repeated until the transformation matrix no longer changes. Because the initial correspondences defined by the HSP alignments already optimize the structural compatibility of the aligned residues, the algorithm converges fast; in only a few iterations.

2.7

Parameter optimization

Parameters used in Vorometric are optimized on an independent training set using the Nelder–Mead simplex method (Lagarias et al., 1998). The objective function used for the optimization was the geometric mean of the precision and recall values of the results returned by Vorometric and the TM-score (Zhang and Skolnick, 2004) of the structural alignments between the queries and the resulting proteins. The geometric mean penalizes imbalance in the individual quality values more than the arithmetic mean, and is considered a better metric for combining different quality measures. The training set used for optimization was taken from the representative ASTRAL v1.73 database with 25% sequence identity (Chandonia et al., 2004). We removed all the families that were used as queries in the evaluation of the Vorometric reported below, and kept remaining families that had at least 10 domains. From 13 such families, we randomly selected 10 members and assigned one of them to be the query and compiled the rest into a dataset. The training data are available from the supplementary web site.

3

EXPERIMENTAL RESULTS

Since Vorometric is proposed as a protein structure database search tool that at the same time produces high-quality structure alignments, we compare its performance with that of both pairwise structure alignment and database search tools. In the next few sections, we first demonstrate that the structural alignments produced by Vorometric are in fact comparable to or better than those of other pairwise structural alignment tools. We then show on large-scale experiments, that the structures in the database that are similar to a query protein are retrieved correctly, using the SCOP classifications (Murzin et al., 1995) as the gold standard.

Table 1. Comparison of alignment quality on 10 difficult pairs Method

RMSD (Å)

%N (query)

Quality (TM)

CE SSAP DaliLite Voroligna Vorometric

3.17 4.37 2.82 2.28 3.02

83.4 88.1 80.0 51.7 84.8

0.60 0.59 0.61 0.56 0.65

a Vorolign reports alignments for multiple substitution matrices; here we use the SM-

THREADER matrix (Dosztanyi and Torda, 2001), which gives the best results. The best values in each column are shown in bold.

The experiments are performed on a Pentium 2.6 GHz personal computer with 2 GB of main memory. For each protein structure used in this study, the contact strings are extracted only once, requiring an average of 21 s per protein. In each experiment described below, a new database index is built for the contact strings of the protein structures used in the experiment. The insertion of a single protein structure into the database index takes 22 ms on the average.

3.1

Quality of the structural alignments

In order to demonstrate the quality of the structural alignments generated by Vorometric, we used the 10 difficult pairs of protein structures that have previously been used to evaluate structural alignment methods (Fischer et al., 1996). A difficult pair is defined as a structurally similar pair that has a low sequence similarity and that had proven difficult to align with the available methods. For each pair, we use one of the proteins as query to search against the database composed solely of the other protein, and report the top-scoring HSP alignment. We compare the structural alignments produced by Vorometric with those by other popular structural alignment tools (Table 1). Note that this small and specific set of proteins is used only to demonstrate the alignment quality of Vorometric for very difficult cases. A larger and more general evaluation of the alignment quality can be found in the next section. The comparison is made using the RMSD between the superimposed structures, the percentage of the query protein aligned (%N), and the TM-score (Zhang and Skolnick, 2004). The RMSD of a superposition is defined as:  i di RMSD = N where di is the Euclidean distance between the i-th corresponding residues of the aligned proteins and N is the length of alignment. Note that a tradeoff exists between the coverage of the alignment (%N) and the RMSD error. It is generally possible to produce short structural alignments with very low RMSD error (e.g. aligning only a single residue from each protein would trivially achieve zero error). And naturally, a higher RMSD error is incurred for longer alignments. In response to the difficulty of interpreting both RMSD and %N simultaneously, TM-score attempts to summarize the quality of the alignment in a single scoring function and has been shown to agree with the results of human expert visual assessments. TM-score is defined as: TM-score =

1 Ltarget

N 

1 di 2 1+( i d0 (Ltarget ) )

2875

A.Sacan et al.

Fig. 3. Structural alignment produced by Vorometric for 1ten (dark) and 3hhrb (light). Aligned regions are shown thicker.

where Ltarget is the length of the query protein, di is same as above, and d0 (Ltarget ) is a normalizing factor so that the average TM-score is not dependent on the size of the protein. TM-score ranges from 0 to 1, with 1 being a perfect superposition. For the 10 difficult pairs, CE and DaliLite give comparable coverage and RMSD values, whereas SSAP produces slightly longer alignments with significantly worse RMSD values. For instance, the alignment produced by SSAP for the 1ede–1crl pair had the worst RMSD (9.25 Å) among all the alignments. 1ede and 1crl belong to the α/β Hydrolases superfamily and are relatively large proteins (310 and 534 residues, respectively), having eight β-strands wrapped around by 11 α-helices. SSAP relies on aligning residues that share similar inter-residue distances; the high number of contacts formed by the residues at the core of these proteins makes their alignment difficult by SSAP. Vorometric produces better alignments than any other method as measured by the TM-score. The coverage of the alignments by Vorometric are as long as those of SSAP’s, while Vorometric at the same time achieves the best average RMSD, when compared with CE and SSAP. Vorolign (Birzele et al., 2007), which is a pairwise alignment method also based on Voronoi contacts, generates the shortest alignments (about 30% smaller than Vorometric) and consequently, achieves better RMSD. However, the average alignment quality evaluated by TM-score is poorer than the other methods. Furthermore, Vorolign fails to generate an alignment for the 1ten– 3hhrb pair. Both 1ten and 3hhrb are in the Fibronectin type III family; 1ten is composed of only one domain of the immunoglobulin-like β-sandwich fold, whereas 3hhrb contains two such domains, one of which aligns well with 1ten (Fig. 3). We attribute Vorolign’s failure to its sensitivity to differences in residue contacts introduced by the additional domain in 3hhrb.

3.2

Database search for similar proteins

Large-scale comparison of different structure alignment or search methods is in itself a serious undertaking which is neither straightforward, nor completely fair, because each such method uses different databases and accuracy measures [see Kolodny et al. (2005) for a comprehensive evaluation]. Furthermore, some methods are made available only as a web service, which makes largescale experimentation with newly crafted datasets impossible, if not prohibitive. For these reasons, we use the same dataset used by Aung and Tan (2004) and Tung et al. (2007), and compare our results with those reported by them.

2876

Fig. 4. Average precision–recall curves for 108 queries on the database of 34 055 proteins. Table 2. Average precision and running times on the database of 34 055 proteins

Vorometric-TM Vorometric-raw CE MAMMOTH 3D-BLAST PSI-BLAST

Avg. precision (%)

Time per query

Superposition

82.9 79.7 80.9 80.8 76.2 61.8

51 s 44 s 14 h 1.6 h 14 s 8s

yes no yes yes no no

Average precision is calculated as the mean of precision values for different recall levels. The time results for Vorometric are based on returning top 100 hits, performed on a Pentium 2.6 GHz personal computer. Vorometric-raw does not include the time spent for optimization of structural superposition, whereas Vorometric-TM does. The times for CE, MAMMOTH, 3D-BLAST and PSI-BLAST are approximate values interpolated from Tung et al. (2007) using the running times of CE as basis of comparison. The best values in each column are shown in bold.

The dataset consists of 34 055 proteins which cover about 90% of the ASTRAL database. From medium-size families, 108 queries are selected and have