Analysis of Protein-Protein Dimeric Interfaces
Feihong Wu, Fadi Towfic, Drena Dobbs, Vasant Honavar
Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, IA 50011-1040, USA
{wuflyh, ftowfic, ddobbs, honavar}@iastate.edu

Abstract
We analyzed the structural properties and the local surface environment of surface amino acid residues of proteins using a large, non-redundant dataset of 2383 protein chains in dimeric complexes from the PDB. We compared interface residues and non-interface residues based on six properties: side chain orientation, surface roughness, solid angle, cx value, hydrophobicity and interface cluster size. The results of our analysis show that interface residues have side chains pointing inward; interfaces are rougher, tend to be flat, moderately convex or moderately concave, and protrude more relative to non-interface surface residues. Interface residues tend to be surrounded by hydrophobic neighbors and tend to form clusters consisting of three or more interface residues. These findings are consistent with previously published studies using much smaller datasets, while allowing for more qualitative conclusions due to our larger dataset. Preliminary results suggest the possibility of using the six properties to identify putative interface residues.
1. Introduction Protein-protein interactions play a pivotal role in cellular processes such as DNA replication and transcription, RNA splicing, signal transduction and metabolic networks. Hence, understanding the sequence and structural determinants of protein-protein interactions is critical for our understanding of biological processes, including those that play a role in diseases, and for our efforts to design therapeutic drugs. Many studies of protein-protein interface residues have been carried out to identify specific physicochemical characteristics that contribute to protein-molecule recognition. These studies have covered a wide scope and analyzed a variety of interface types (homo vs hetero dimer, transient
vs permanent interface, etc.), different amino acid characteristics and representations, and used different definitions of interfaces (measured at the level of residues or surface patches). At the residue level, interfaces differ from non-interfaces in amino acid composition, inter-residue contact preferences, and the degree of conservation across orthologous proteins. At the surface patch level, interfaces are more hydrophobic, planar and protruding relative to non-interfaces [1, 2, 3, 4]. Previous analyses of protein surfaces [5, 6, 7, 8, 9] examined several surface descriptors associated with surface residues. Most of these studies were performed using relatively small datasets. However, the Protein Data Bank (PDB) [10] now contains over 43,873 structures. Recent structural genomics efforts are likely to further accelerate the rate of increase in the number of structures in the PDB. It is therefore natural to ask whether earlier conclusions are borne out by analysis of much larger datasets.

Against this background, this paper presents a residue-based analysis of a large dataset consisting of 2383 protein chains from dimeric complexes. Our analysis of six properties (side chain orientation, surface roughness, solid angle, cx value, hydrophobicity and interface cluster size) not only corroborates and extends previous findings made on small datasets, but also provides a basis for training new machine learning classifiers to identify putative interface residues, as illustrated in a case study.

The rest of the paper is organized as follows: Section 2 describes the dataset and each of the six properties of surface residues examined in this study. Section 3 presents the results of our analysis, comparing interface and non-interface residues based on these six properties. Section 4 illustrates how the results of our analysis can be used to design a simple strategy for identifying putative protein-protein interface residues from a protein structure. Section 5 summarizes the results. Section 6 concludes with a discussion of related work.
2 Material and methods

2.1 Protein interface dataset
All protein entries in the Protein Data Bank (PDB, October 2006 release) [10] were examined to collect the protein-protein interface residues. The protein entries with resolution ≤ 3 Å were then checked with the Protein Quaternary structure file Server (PQS) [11] to regenerate quaternary structures, from which protein dimers were kept, while crystal packing and protein multimers were filtered out. Next, protein dimers having at least one chain with length ≤ 20 were removed. We selected chains out of the protein dimer complexes such that no two chains share sequence identity ≥ 30%. The protein sequence identity information was obtained from the PDB (ftp://ftp.rcsb.org/pub/pdb/derived_data/NR/). The final dataset (PPI2383) includes 2383 protein chains derived from 2316 protein dimers. The dataset consists of 452 heterodimeric and 1931 homodimeric interfaces. (Interfaces between chains with ≥ 90% sequence identity are defined as homodimeric interfaces. All others are defined as heterodimeric interfaces.)
2.2 Surface versus non-surface residues
Surface residues are defined according to Miller et al. [12] as those residues having a relative accessible surface area (ASA) of ≥ 5%. The accessible surface area is calculated using the Naccess program [13].

2.3 Interface versus non-interface residues
We follow Ofran and Rost's [14] definition of interface residues: two residues are considered to be in contact if the closest distance between any two atoms, one from each residue, is ≤ 6 Å. In this paper, we only consider surface residues; thus a surface residue having at least one contact residue in the interacting partner chain is considered to be an interface residue, otherwise it is a non-interface residue. Based on these definitions, the PPI2383 dataset contains 104,789 interface residues and 323,270 non-interface residues.

2.4 Interface propensity calculation
Consider a residue-based property (such as residue roughness) with k discrete values: ($v_1, v_2, \ldots, v_k$). Each residue is assigned to one of k disjoint subsets $S_1, S_2, \ldots, S_k$ based on the value of the residue property. Let $R_i$ and $r_i$ respectively be the number of surface residues and interface residues in the set $S_i$. Then:

$$P_i = \frac{R_i}{\sum_{j=1}^{k} R_j}$$

$$p_i = \frac{r_i}{\sum_{j=1}^{k} r_j}$$

$$IP_i = \log_2\left(\frac{p_i}{P_i}\right)$$

The interface propensity $IP_i$ of the property at value $v_i$ is a measure of the preference for the value $v_i$ among the interface residues (relative to the set of surface residues). $IP_i > 0$ denotes that the specified property value $v_i$ tends to be more preferred among the interface residues relative to the surface residues. Similarly, $IP_i < 0$ denotes that the specified property value $v_i$ tends to be less preferred among the interface residues relative to the surface residues.

2.5 Side chain orientation
The side chain orientation of a residue is defined as the angle between two vectors. The first vector connects the geometrical center of the side chain of the residue with its Cα atom. The second vector connects the geometrical center of the protein chain with the Cα atom of the residue. The angle is confined to the range from 0 to π, within which angles in (0, π/2) and (π/2, π) correspond to side chains pointing inward and outward, respectively.

2.6 Surface roughness
Using Richards' [5] method, a molecular surface ($A_s$) is produced by rolling a solvent sphere with radius R against the target protein. Lewis [9] defined surface roughness as follows:

$$D = 2 - \frac{\partial \log A_s}{\partial \log R}$$

It denotes the degree of irregularity of a surface. Here, each surface residue is assumed to have its own molecular surface and roughness. Roughness is calculated by varying the radius R from 0.2 Å to 4.0 Å, in steps of 0.1 Å. The molecular surface area $A_s$ is calculated using the Molecular Surface Package (MSP) [15].
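As an illustration of the roughness definition, the derivative of log A_s with respect to log R can be approximated by the least-squares slope of log A_s against log R over the sampled probe radii. The sketch below assumes that the per-residue surface areas have already been computed by an external program such as MSP; the function name `fractal_roughness` and the synthetic areas are hypothetical and are not part of the original analysis.

```python
import math

def fractal_roughness(radii, areas):
    """Estimate D = 2 - d(log A_s)/d(log R) from sampled (R, A_s) pairs.

    The derivative is approximated by the ordinary least-squares slope
    of log(A_s) versus log(R). This fitting choice is an assumption.
    """
    if len(radii) != len(areas) or len(radii) < 2:
        raise ValueError("need at least two (R, A_s) pairs of equal length")
    xs = [math.log(r) for r in radii]
    ys = [math.log(a) for a in areas]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return 2.0 - slope

# Probe radii from 0.2 to 4.0 in steps of 0.1, as described in the text.
radii = [round(0.2 + 0.1 * i, 1) for i in range(39)]

# Synthetic areas following A_s = c * R^(2 - D) with D = 2.4 (made-up data,
# used only to check that the estimator recovers the exponent).
synthetic_areas = [500.0 * r ** (2 - 2.4) for r in radii]
d_est = fractal_roughness(radii, synthetic_areas)
```

On real surfaces the log-log relationship is only approximately linear, so the fitted slope depends on the range of radii used.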
2.7 Solid angle
Solid angle, first proposed by Connolly [7] as a measure of the shape of local regions of protein surfaces, is calculated as the fraction of a sphere intersecting the protein when the sphere is centered at a point on the protein surface. The range of a solid angle is (0, 4π). A point with solid angle < 2π lies on a surface that is locally convex. A point with solid angle > 2π lies on a surface that is locally concave. The MSP software package implemented by Connolly [15] uses discrete dots to represent the molecular surface and generates a solid angle for each dot. The solid angle of a surface residue is calculated as the average of the solid angles of all the surface dots that belong to the residue. The sphere radius is set to 6 Å by default in the computation.

2.8 Protrusion-cx value
Pintar [8] devised a metric called the cx value to estimate the protrusion of protein atoms. The basic idea, similar to that of the solid angle, is to center a sphere at an atom and calculate the ratio of the volume occupied by the protein and the volume left free by the protein. The cx value is a real number between 0 and 15. High cx values correspond to protruding atoms. Here, protrusion is defined over surface residues instead of atoms. A surface residue's protrusion is represented by the cx value of its Cα atom. The cx values are computed using the C++ program provided by Pintar with default parameters [8].
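The idea behind the cx value can be sketched as follows. This is not Pintar's program. It is a minimal illustration that counts the atoms inside a sphere, converts the count into an occupied volume, and divides the free volume by the occupied volume, so that protruding atoms score higher. The sphere radius of 10 Å, the mean atomic volume of 20.1 Å^3 and the function name `cx_value` are assumptions made for this example.

```python
import math

def cx_value(center, atom_coords, radius=10.0, atom_volume=20.1):
    """Ratio of free to occupied volume in a sphere around `center`.

    radius (in angstroms) and atom_volume (in cubic angstroms) are
    assumed values chosen for illustration.
    """
    n_inside = sum(1 for a in atom_coords if math.dist(center, a) <= radius)
    if n_inside == 0:
        raise ValueError("no atoms inside the sphere")
    v_sphere = 4.0 / 3.0 * math.pi * radius ** 3
    v_occupied = n_inside * atom_volume
    v_free = max(0.0, v_sphere - v_occupied)  # guard against overfilled spheres
    return v_free / v_occupied

# Toy coordinates (hypothetical): an exposed atom with one close neighbour
# versus the same atom surrounded by several neighbours.
origin = (0.0, 0.0, 0.0)
exposed = [origin, (1.5, 0.0, 0.0), (20.0, 0.0, 0.0)]
crowded = exposed + [(0.0, 1.5, 0.0), (0.0, 0.0, 1.5), (-1.5, 0.0, 0.0)]
cx_exposed = cx_value(origin, exposed)
cx_crowded = cx_value(origin, crowded)
```

In the setting of this paper, `center` would be the Cα atom of a surface residue and `atom_coords` the atoms of the whole chain.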
2.9 Surface micro-environment: hydrophobicity and interface cluster size
Although some interface residues (dubbed "hot spots") contribute more to the binding affinity than other residues [16], most interface residues are not solitary. Interface residues have a tendency to form clusters on the surface. This tendency is the basis of the analysis of interfaces using surface patches or spatial clusters [17, 18, 19]. Here, we define a surface micro-environment for each surface residue to examine whether residue preferences of interfaces are sensitive to the micro-environment or context in which the residue resides. Given a target residue, its surface micro-environment is defined as the set of surface residues whose Cα atom is < 7 Å away from the Cα of the target residue. By this definition, each residue is included in its own surface micro-environment. Two surface micro-environment parameters are of interest here: the hydrophobicity and the interface cluster size. The hydrophobicity of a target residue is defined as the average hydrophobicity of all the residues in its surface micro-environment, where the hydrophobicity of each residue type Ri is denoted by an energy value ei, which is derived from residue contact energies [18] (see Footnote 1). The residue contact energies represent the degree of hydrophobic force between residue pairs. Hence, ei can be regarded as an estimate of hydrophobicity: the lower the ei value, the more hydrophobic the residue. As a result, the average energy ei denotes the hydrophobicity of the surface micro-environment of the target residue. The interface cluster size is the count of interface residues within a target residue's micro-environment. We anticipate a larger cluster size for interface residues.

Footnote 1. The ei values of the 20 residues are: F -5.12, M -4.91, I -4.88, L -4.65, W -4.36, V -4.17, C -4.00, Y -3.24, A -2.82, H -2.75, G -2.34, T -2.30, P -2.22, R -2.18, S -2.07, Q -1.98, E -1.94, N -1.90, D -1.81, K -1.50.

Figure 1. Interface propensities of side chain orientation. [x-axis: side chain orientation in units of π, binned from [0-0.1) to [0.9-1.0], from pointing inward to pointing outward; y-axis: interface propensity]
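The two micro-environment parameters can be sketched directly from the definitions above. The contact energies are the ei values listed in the footnote. The residue representation (a dict with keys `aa`, `ca` and `is_interface`), the function names and the toy coordinates are hypothetical choices for this example. Following the text literally, the target residue is counted as a member of its own micro-environment.

```python
import math

# e_i values from the footnote (one-letter amino acid codes).
CONTACT_ENERGY = {
    "F": -5.12, "M": -4.91, "I": -4.88, "L": -4.65, "W": -4.36,
    "V": -4.17, "C": -4.00, "Y": -3.24, "A": -2.82, "H": -2.75,
    "G": -2.34, "T": -2.30, "P": -2.22, "R": -2.18, "S": -2.07,
    "Q": -1.98, "E": -1.94, "N": -1.90, "D": -1.81, "K": -1.50,
}

def micro_environment(target, residues, cutoff=7.0):
    """Indices of surface residues whose CA lies < cutoff from the target CA."""
    ca = residues[target]["ca"]
    return [i for i, r in enumerate(residues) if math.dist(r["ca"], ca) < cutoff]

def environment_hydrophobicity(target, residues):
    """Average contact energy over the micro-environment (lower = more hydrophobic)."""
    env = micro_environment(target, residues)
    return sum(CONTACT_ENERGY[residues[i]["aa"]] for i in env) / len(env)

def interface_cluster_size(target, residues):
    """Number of interface residues in the micro-environment, target included."""
    env = micro_environment(target, residues)
    return sum(1 for i in env if residues[i]["is_interface"])

# Toy surface residues (made-up coordinates and labels).
toy = [
    {"aa": "L", "ca": (0.0, 0.0, 0.0), "is_interface": True},
    {"aa": "K", "ca": (5.0, 0.0, 0.0), "is_interface": True},
    {"aa": "F", "ca": (0.0, 6.0, 0.0), "is_interface": False},
    {"aa": "D", "ca": (30.0, 0.0, 0.0), "is_interface": True},
]
env0 = micro_environment(0, toy)
hyd0 = environment_hydrophobicity(0, toy)
size0 = interface_cluster_size(0, toy)
```

For the first toy residue, the micro-environment contains the residues L, K and F, so the hydrophobicity is the mean of their three contact energies and the cluster size counts L and K.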
3 Analysis and results

3.1 Side chain orientation
Figure 1 shows the side chain orientation propensity of interface residues relative to non-interface surface residues. Side chain orientation values that lie between 0 and π are binned into 10 equal-sized bins (along the x-axis). The interface propensity is plotted along the y-axis. Interface residues with side chain orientation < π/2 are overrepresented, implying that the side chains of interface residues tend to point inward. Although this tendency is clear in direction, the small propensity values (between -0.1 and 0.1) imply that the preference is weak.
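The propensity plotted on the y-axis compares, for each bin, the share of interface residues in that bin with the share of all surface residues in that bin, on a log2 scale. A minimal sketch of this calculation for binned orientation angles is given below; the function names and the toy data are hypothetical, and bins with no surface or no interface residues are reported as `None` because the logarithm is undefined there (a choice made for this example).

```python
import math
from collections import Counter

def orientation_bin(angle, k=10):
    """Map an angle in [0, pi] to one of k equal-width bins (0 .. k-1)."""
    return min(int(angle / math.pi * k), k - 1)

def interface_propensities(bins, is_interface, k):
    """Return {bin: log2(p_i / P_i)} over all surface residues.

    bins         -- bin index of each surface residue
    is_interface -- True if the corresponding residue is an interface residue
    """
    surface_counts = Counter(bins)                                      # R_i
    interface_counts = Counter(b for b, f in zip(bins, is_interface) if f)  # r_i
    total_surface = sum(surface_counts.values())
    total_interface = sum(interface_counts.values())
    propensities = {}
    for i in range(k):
        big_r, small_r = surface_counts[i], interface_counts[i]
        if big_r == 0 or small_r == 0:
            propensities[i] = None  # undefined: log of zero or empty bin
            continue
        p_i = small_r / total_interface
        cap_p_i = big_r / total_surface
        propensities[i] = math.log2(p_i / cap_p_i)
    return propensities

# Toy example with two bins: bin 0 holds 3 of the 4 interface residues.
toy_bins = [0, 0, 0, 0, 1, 1, 1, 1]
toy_flags = [True, True, True, False, True, False, False, False]
ip = interface_propensities(toy_bins, toy_flags, k=2)
```

In the toy example, bin 0 contains half of the surface residues but three quarters of the interface residues, so its propensity is positive, while bin 1 has a negative propensity.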
3.2 Surface roughness
The difference in surface roughness between interface residues and non-interface surface residues is shown in Figure 2. Larger surface roughness values denote a smoother surface of protein residues. The histogram shows that interface residues tend to lie in rougher regions of the surface. The smoother a surface residue, the less likely it is to be an interface residue.
3.3 Solid angle
The difference in solid angle values between the interface and non-interface surface residues is highlighted in Figure 3. Solid angles of surface residues mostly lie between 1.8π and 2.5π. Note that the solid angle 2π denotes a "flat" local region, whereas solid angles < 2π and > 2π denote "convex" and "concave" local regions respectively (see Section 2.7). Figure 3 shows that interface residues favor moderately convex (1.8π to 2.0π), flat (2.0π) or moderately concave (2.0π to 2.2π) local regions, but not highly concave regions (2.2π to 2.5π) or highly convex regions.

Figure 2. Interface propensities of surface roughness. [x-axis: surface roughness, binned from [2-2.1) to [2.9-3.0], from rough to smooth; y-axis: interface propensity]

Figure 3. Interface propensities of solid angle. [x-axis: solid angle in units of π, binned from [1.8-1.9) to [2.4-2.5), from convex to concave; y-axis: interface propensity]

Figure 4. Interface propensities of protrusion (cx value). [x-axis: cx value, binned from [0-1) to [4-5), from depressing to protruding; y-axis: interface propensity]
3.4 Protrusion-cx value
Figure 4 compares the protrusion in interface and non-interface surface regions. Although the cx values range from 0 to 15, the cx values of surface residues (taken at their Cα atoms) are concentrated in the range 0 to 5. Large cx values correspond to protruding atoms. The fact that the propensities increase as the cx values increase suggests that interface residues tend to protrude.

Figure 5. Interface propensities of hydrophobicity (average contact energy) and size of interface cluster. [upper panel x-axis: average contact energy, binned from (-5,-4.5] to [-2.0,-1.5], from hydrophobic to hydrophilic; lower panel x-axis: interface cluster size, from 1 to 10; y-axis in both panels: interface propensity]

3.5 Surface micro-environment: hydrophobicity and interface cluster size
Figure 5 shows the propensities of the two parameters related to the surface micro-environment: the hydrophobicity and the interface cluster size. The upper panel, in which hydrophobicity is estimated through the average contact energy, shows that interface residues reside in more hydrophobic environments. The lower panel shows that an interface residue tends to be clustered with three or more interface residues on the protein surface.
4 Application: a case study
The distinct characteristics of residues with high versus low interface propensities, with respect to the six characteristics analyzed above, suggest that an "interfacial signal" based on these characteristics could be used to enhance the performance of classifiers for predicting interface residues in proteins. Although physicochemical properties of amino acids have been widely used, only a few studies have attempted to exploit geometric features of protein interfaces in building classifiers for predicting protein-protein interface residues [20, 21, 22, 3, 4].

To illustrate the potential utility of such an approach, we examined the transcriptional regulatory protein SlyA (PDB entry 1lj9), combining the five structural properties with a simple voting method to identify the interface residues of chain B. For each surface residue, we calculated its side chain orientation, surface roughness, solid angle, cx value and hydrophobicity. If the value of a property lies in a range that is preferred in interfaces (based on the propensity estimates), the surface residue receives a vote as an interface residue for that property. A surface residue that receives at least 3 of the 5 votes is predicted to be an interface residue; otherwise, it is predicted to be a non-interface residue.

We then use the clustering tendency of the interface residues to refine the predictions of the voting method as follows: if a surface residue that is predicted to be an interface residue by the voting method has ≤ 2 neighbors in its surface micro-environment that are also predicted to be interface residues, it is reclassified as a non-interface residue; if a residue predicted to be a non-interface residue by the voting method has ≥ 4 neighbors in its surface micro-environment that are predicted to be interface residues, it is reclassified as an interface residue.
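The voting and refinement steps described above can be sketched as follows. The per-property decisions (whether a value falls in an interface-preferred range) are taken as given booleans, since the exact ranges are not listed in the text. The sketch assumes that neighbor counts exclude the residue itself and are computed from the voting predictions in a single pass; both choices, like the function names and toy data, are assumptions of this example.

```python
def vote(property_flags, min_votes=3):
    """Predict 'interface' if at least min_votes of the property flags are True."""
    return sum(property_flags) >= min_votes

def refine(predicted, neighbors, drop_at_most=2, add_at_least=4):
    """Refine voting predictions using the clustering tendency.

    predicted -- {residue: bool} from the voting step
    neighbors -- {residue: list of residues in its surface micro-environment}
    """
    refined = {}
    for res, is_interface in predicted.items():
        # Count neighbours (excluding the residue itself) predicted as interface.
        n_pred = sum(1 for nb in neighbors[res] if nb != res and predicted[nb])
        if is_interface and n_pred <= drop_at_most:
            refined[res] = False
        elif not is_interface and n_pred >= add_at_least:
            refined[res] = True
        else:
            refined[res] = is_interface
    return refined

# Toy data (hypothetical): five mutually adjacent residues and one isolated residue.
flags = {
    "A": [True, True, True, False, False],
    "B": [True, True, True, True, False],
    "C": [True, True, True, True, True],
    "D": [True, True, True, False, False],
    "E": [False, False, True, False, False],
    "F": [True, True, True, True, False],
}
voted = {res: vote(f) for res, f in flags.items()}
patch = ["A", "B", "C", "D", "E"]
neighbors = {res: patch for res in patch}
neighbors["F"] = ["F"]
refined = refine(voted, neighbors)
```

In the toy data, residue E is rescued because four of its neighbors are predicted interface residues, while the isolated residue F is dropped despite receiving four votes.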
Let TP be the number of true positives (residues predicted to be interface residues that are actually interface residues); FP the number of false positives (residues predicted to be interface residues that are actually non-interface residues); TN the number of true negatives; FN the number of false negatives. The numerical performance measures ac (accuracy), re (recall), pr (precision) and cc (correlation coefficient) are defined as follows:

$$ac = \frac{TP + TN}{TP + FP + TN + FN}$$

$$re = \frac{TP}{TP + FN}$$

$$pr = \frac{TP}{TP + FP}$$

$$cc = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
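The four measures can be computed from the confusion counts as in the sketch below. The correlation coefficient is taken to be the standard Matthews correlation coefficient, with a square root in the denominator; this is an assumption wherever the printed formula is ambiguous. The function name and the example counts are hypothetical and do not correspond to the results in Table 1.

```python
import math

def interface_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision and correlation coefficient from counts.

    No guard against zero denominators is included; callers are assumed
    to supply counts for which every denominator is positive.
    """
    ac = (tp + tn) / (tp + fp + tn + fn)
    re = tp / (tp + fn)
    pr = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return {"ac": ac, "re": re, "pr": pr, "cc": cc}

# Made-up counts for illustration only.
example = interface_metrics(tp=40, fp=10, tn=40, fn=10)
```

With these made-up counts, accuracy, recall and precision are all 0.8 and the correlation coefficient is 0.6.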
The results of prediction with the voting method and the refinement strategy are summarized in Table 1 and Figure 6. We see that the use of the five structural properties results in fairly accurate prediction of the interface residues. The results also suggest that refining the predictions based on the clustering tendency of the interface residues further improves the quality of the predictions in terms of both precision and recall. It is worth noting that the results are significantly better than those obtained based on analysis of sequence neighbors of the target residues (precision=55%, recall=53%). These results suggest the possibility of using structural properties of interfaces to reliably identify protein-protein interface residues when only the structure of a protein (but not that of the protein-protein complex(es) in which it participates) is available.
Table 1. Prediction results: chain B of protein 1lj9

classifier          ac    re    pr    cc
voting              76%   66%   82%   53%
voting+refinement   82%   77%   86%   64%
Figure 6. Interaction site recognition for chain B of protein 1lj9 under two approaches: the voting method (left) and the voting method plus refinement strategy (right). Chain B is shown in green, with the residues of interest shown in space fill and color coded as follows: red, interface residues identified as such by the classifier (TPs); yellow, interface residues missed by the classifier (FNs); and blue, residues incorrectly classified as interface residues (FPs). For clarity, interface residues of chain A (gray wireframe) are not shown. The structure diagrams were generated with RasMol [23].

5 Summary
We have analyzed surface residues from a large set of dimeric protein-protein interfaces based on five structural properties and a simple characterization of the local surface environment. Our analysis has shown that:
• The side chains of interface residues prefer to point inward.
• Interfaces tend to be rougher than the rest of the protein surface.
• Interfaces tend to be moderately concave, flat or moderately convex but not highly convex or concave (as measured by the solid angle).
• The Cα atoms of interface residues tend to be more protruding in terms of cx value.
• Interface residues tend to have a hydrophobic micro-environment.
• Interface residues tend to be clustered on the surface.
Based on these observations, we devised a simple voting scheme for identifying interface residues on the protein surface using the five structural properties. The results suggest that refining the predictions generated by the voting scheme based on the clustering tendency of the interface residues further improves the quality of predictions. These results suggest the possibility of using structural properties of interfaces to improve the quality of protein-protein interface residue prediction beyond that of sequence-based prediction methods [24, 25].

6 Related work
Rackovsky and Scheraga [26] first proposed the side chain orientation as a metric to estimate hydrophobic forces. They studied the residue orientations in 13 native proteins and found that polar and non-polar residues have different orientation preferences. Yan and Jernigan [27] extended the studies to 48 proteins concerning exposed, interfacial and buried residues. They also concluded that side chain orientation highly correlates with hydrophobicity. Our work examines the correlation between side chain orientation and interface residues.

In 1976, Richards [5] defined the solvent accessible surface and molecular surface by rolling a probe sphere tangent to the atoms of the target protein. Connolly [6, 15] implemented a suite of programs, the Molecular Surface Package (MSP), to calculate the molecular surface, and evaluated the solid angle of 3 proteins, anticipating its future applications in the analysis of protein interface shapes. Lewis [9] used fractal surfaces to characterize the roughness or irregularity of protein surfaces. Pettit and Bowie [28] showed that high-affinity protein binding requires a rough surface patch. Our analysis builds on this work to examine the solid angle and roughness properties at the residue level for a large dataset of dimeric protein-protein interfaces.

Pintar [8] suggested the usefulness of atom protrusion in the analysis of protein-protein interactions. Young et al. [18] studied the hydrophobicity of residue clusters (namely, micro-environments or surface patches) defined using a lattice model on a small dataset of 38 proteins. Jones and Thornton [17] also explored the hydrophobicity of surface patches through a different scale of hydrophobicity assignment to residues in a small dataset of 54 protein complexes. Our analysis is a simple adaptation of the micro-environment analysis of Young et al. [18] using a different definition of the micro-environment, on a much larger dataset.

References
[1] András Szilágyi, Vera Grimm, Adrián K. Arakaki, and Jeffrey Skolnick. Prediction of physical protein-protein interactions. Phys Biol., 2(2):S1–16, 2005.
[2] Benjamin A. Shoemaker and Anna R. Panchenko. Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol., 3(3):e42, 2007.
[3] Benjamin A. Shoemaker and Anna R. Panchenko. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol., 3(4):e43, 2007.
[4] Huan-Xiang Zhou and Sanbo Qin. Interaction-site prediction for protein complexes: a critical assessment. Bioinformatics, 2007.
[5] B. Lee and F. M. Richards. The interpretation of protein structures: Estimation of static accessibility. J Mol Biol, 55:379–400, 1971.
[6] Michael L. Connolly. Solvent-accessible surfaces of proteins and nucleic acids. Science, 221(4612):709–713, 1983.
[7] Michael L. Connolly. Measurement of protein surface shape by solid angles. Journal of Molecular Graphics, 4(1):3–6, 1986.
[8] Alessandro Pintar, Oliviero Carugo, and Sandor Pongor. Cx, an algorithm that identifies protruding atoms in proteins. Bioinformatics, 18(7):980–4, 2002.
[9] Mitchell Lewis and D.C. Rees. Fractal surfaces of proteins. Science, 230(4730):1163–1165, 1985.
[10] H.M. Berman, J. Westbrook, Z. Feng, et al. The protein data bank. Nucleic Acids Res, 28:235–242, 2000.
[11] K. Henrick and J.M. Thornton. PQS: a protein quaternary structure file server. Trends Biochem Sci, 23(9):358–61, 1998.
[12] Susan Miller, Joel Janin, Arthur M. Lesk, and Cyrus Chothia. Interior and surface of monomeric proteins. J Mol Biol, 196(3):641–656, 1987.
[13] S. Hubbard and J. Thornton. Naccess atomic solvent accessible area calculations. http://wolf.bi.umist.ac.uk/naccess, 1996.
[14] Y. Ofran and B. Rost. Analysing six types of protein-protein interfaces. J Mol Biol, 325(2):377–87, 2003.
[15] Michael L. Connolly. The molecular surface package. J Mol Graph, 11(2):139–41, 1993.
[16] Andrew A. Bogan and Kurt S. Thorn. Anatomy of hot spots in protein interfaces. J Mol Biol, 280(1):1–9, 1998.
[17] S. Jones and J.M. Thornton. Analysis of protein-protein interaction sites using surface patches. J Mol Biol, 272(1):121–32, 1997.
[18] L. Young, R.L. Jernigan, and D.G. Covell. A role for surface hydrophobicity in protein-protein recognition. Proteins, 3(5):717–29, 1994.
[19] R. Landgraf, I. Xenarios, and D. Eisenberg. Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J. Mol. Biol., 307(5):1487–502, 2001.
[20] James R. Bradford, Chris J. Needham, Andrew J. Bulpitt, and David R. Westhead. Insights into protein-protein interfaces using a Bayesian network prediction method. J Mol Biol., 362(2):365–86, 2006.
[21] Hani Neuvirth, Uri Heinemann, David Birnbaum, Naftali Tishby, and Gideon Schreiber. ProMateus: an open research approach to protein-binding sites analysis. Nucleic Acids Res., 35:W543–W548, 2007.
[22] Yoichi Murakami and Susan Jones. SHARP2: protein-protein interaction predictions using patch analysis. Bioinformatics, 22(14):1794–5, 2006.
[23] Roger Sayle, Arne Mueller, Gary Grossman, Marco Molinaro, Herbert J. Bernstein, Clarice Chigbo, Ricky Chachra, and Mamoru Yamanishi. OpenRasMol: Molecular graphics visualisation tool. http://www.openrasmol.org, 2005.
[24] Changhui Yan, Drena Dobbs, and Vasant Honavar. A two-stage classifier for identification of protein-protein interface residues. Bioinformatics, Suppl 1:I371–I378, 2004.
[25] Changhui Yan, Feihong Wu, Robert L. Jernigan, Drena Dobbs, and Vasant Honavar. Characterization of protein-protein interfaces. The Protein Journal, in press, 2007.
[26] S. Rackovsky and H.A. Scheraga. Hydrophobicity, hydrophilicity, and the radial and orientational distributions of residues in native proteins. Proc Natl Acad Sci U S A, 74(12):5248–5251, 1977.
[27] Aimin Yan and Robert L. Jernigan. How do side chains orient globally in protein structures? Proteins, 61(3):513–22, 2005.
[28] Frank K. Pettit and James U. Bowie. Protein surface roughness and small molecular binding sites. J Mol Biol, 285(4):1377–82, 1999.