Bioinformatics Advance Access published March 7, 2006 © The Author (2006). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Predicting Protein Interaction Sites: Binding Hot-Spots in Protein-Protein and Protein-Ligand interfaces

Running Header: Binding Hot-Spots in Protein Interfaces

Nicholas J. Burgoyne, Richard M. Jackson*

Institute of Molecular and Cellular Biology, Faculty of Biological Sciences, University of Leeds, Leeds, LS2 9JT, UK

Associate Editor: Dmitrij Frishman

* To whom correspondence should be addressed. Tel: +44 (0)113 343 2592, Fax: +44 (0)113 343 3167, Email address: [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ by guest on January 8, 2016

ABSTRACT

MOTIVATION: Protein assemblies are currently poorly represented in structural databases, and their structural elucidation is a key goal in biology. Here we analyse clefts in protein surfaces that are likely to correspond to binding “hot-spots”, and rank them according to sequence conservation and simple measures of physical properties, including hydrophobicity, desolvation, electrostatics and van der Waals energies, to predict which are involved in binding in the native complex.

RESULTS: The resulting differences between predicting binding-sites at protein-protein and protein-ligand interfaces are striking. There is a high level of prediction accuracy (≤93%) for protein-ligand interactions, based on the following attributes: van der Waals potential, electrostatic potential, desolvation and surface conservation. Generally, the prediction accuracy for protein-protein interactions is lower, with the exception of enzymes. Our results show that the ease of cleft desolvation or “dewetting” is strongly predictive of interfaces and strongly maintained across all classes of protein binding interface.

INTRODUCTION

Protein interactions are critical for all aspects of cellular function, and determining how proteins interact has become a major goal in the post-genomic era (Sali et al., 2003; Rhodes et al., 2005). An emerging approach is to take advantage of structural information to predict physical binding. In particular, prediction of protein binding-sites can guide the structural elucidation of protein complexes, allowing function prediction for numerous unannotated structural genomics targets and the design of molecules that can modulate biological function at a systems-level (Russell et al., 2004).

Protein-protein interfaces are generally considered to be circular and relatively flat (Jones and Thornton, 1997). Studies of their properties have also shown that hydrophobic area prevails while electrostatic residues and hydrogen-bonding groups are evenly spread across the surface (Xu et al., 1997; Lo Conte et al., 1999), suggesting a uniform distribution of properties across the binding interfaces. However, alanine-scanning mutagenesis has shown that the stability of a complex is determined by only a fraction of the interface residues. This was first shown in the complex between human growth hormone and the human growth hormone binding protein (Clackson and Wells, 1995), where a far greater loss in affinity was seen when two tryptophan residues were mutated to alanine than with other mutations made in the interface. Similar results have been collated for a number of other protein complexes (Bogan and Thorn, 1998; Thorn and Bogan, 2001). Analysis of these data shows that these “hot-spot” residues are usually found in the centre of the interface, and are surrounded by residues that have a lesser effect on stability (Bogan and Thorn, 1998). Structural analysis has shown that the clusters of conserved hot-spot residues form clefts in the surface of the protein (Li et al., 2004). Enrichment of hot-spot areas with tryptophan, tyrosine and arginine residues has been shown based on protein family conservation (Hu et al., 2000; Ma et al., 2003; Keskin et al., 2005). However, other residues, including polar residues, may also be highly sequence conserved (Hu et al., 2000). Solvent exclusion around polar or charged interactions lowers the effective dielectric constant, thus strengthening the interaction. It has been suggested that the hydrophobic effect of hot-spot residues is almost double that of other residues in an antibody/antigen interface (Li et al., 2005). Multiple hot-spot regions can be found in a single interface (Ma et al., 2003); these tend to cluster at the centre of the interface and interact with equivalent hot-spot residues in the binding partner protein (Halperin et al., 2004; Keskin et al., 2005). The clustering of conserved residues has been used to predict possible interacting partners based on alignment to a known pair of interacting proteins (Aytuna et al., 2005; Espadaler et al., 2005). Clefts in the protein surface are also important for the prediction of protein-ligand interactions. Indeed, ligand binding-sites can often be predicted by identifying the largest exposed cleft (Laskowski et al., 1996). Ligand binding-sites may also be detected as cleft regions with high predicted binding affinity, based on energetic contouring with interacting molecular probes (Laurie and Jackson, 2005).

Here we analyse the ability of different key physicochemical attributes and other binding surface properties, such as surface conservation, to predict the binding interface in protein-protein and protein-ligand complexes. Our approach attempts to define binding hot-spots on the protein surface. These are defined by clefts on the protein surface that are further described by their key attributes. We predict clefts on the protein surface using Q-SiteFinder (Laurie and Jackson, 2005), an energy-based method for the prediction of protein binding clefts. The properties of these predicted clefts are then investigated and compared. Given that the key attributes we use to describe clefts are the same for both protein-protein and protein-ligand interactions, the resulting differences we find between their binding-sites are striking. This has important implications for understanding the nature of protein interactions, for identifying the suitability of sites as drug targets, and for identifying critical regions for docking and structure-based drug design.

METHODS

Interaction Datasets

Enzyme-inhibitor and antibody-antigen complexes of the protein-protein docking benchmark 1.0 (Chen et al., 2003) were supplemented with a selection of other available protein complexes to create a protein-protein interaction dataset of 97 pairwise non-obligate hetero-complexes. These additional complexes satisfied the following criteria: 1) No two complexes had a sequence-identity of more than 60% in an alignment over more than 80% of the sequence of both proteins involved in the interaction. 2) In keeping with the other complexes of benchmark 1.0, no complex had a resolution poorer than 3.25 Å. 3) No complex had large disordered regions in the interface. Oligomeric interfaces for all obligate protein-protein interactions of the dataset were obscured by using the appropriate multimeric complex as the interacting monomer. This was achieved by analysis of the protein quaternary structure database (Henrick and Thornton, 1998). As such, monomers mentioned in the text may refer to protein structures of more than one peptide chain. Occasionally this left several independent interfaces between interacting proteins. Where more than one independent interface to identical proteins existed, all interfaces of the protein were considered. Where proteins had only one interface, only that interface was assessed. Cognate ligands required for the function of the protein were retained, while those not indicated as such in the literature were discarded. The final dataset consisted of 22 enzyme-inhibitor complexes, 19 antibody-antigen complexes, and 56 protein complexes that could not be included in either classification (Other complexes). Details of the proteins used can be found in the supplementary material.

134 ligand-protein complexes from the Gold dataset were used (Nissink et al., 2002), and prepared as described by Laurie and Jackson (2005). These constitute a subset of the full Gold dataset from which proteins of high structural similarity have been removed. The dataset was further classified into protein-ligand interactions in enzymes (95 complexes) and non-enzymes (39 complexes). The binding-site of a single ligand was assessed for each protein. Additional ligands covalently bound to the protein were treated as protein, while other solvent molecules were removed. As with the protein-protein interaction dataset, the interacting proteins were assessed in the absence of the binding molecule.

Pocket Detection and Occupancy using Q-SiteFinder

Clefts were identified in the protein surface using Q-SiteFinder (Laurie and Jackson, 2005). The method is described briefly here. A grid is built around each protein, such that the spacing between grid points is 0.9 Å in all orthogonal directions and the entire protein is covered. The non-bonded interaction energy is then calculated using the GRID forcefield (Goodford, 1985) parameters for every grid-point that is sufficiently far from the protein to allow a methyl (-CH3) group to be positioned without steric overlap with any protein atom (Jackson, 2002). Probes with calculated interaction energies of less than -1.3 kcal/mol are retained and clustered according to spatial proximity, whereby no probe in any cluster has a centre that is more than 1.0 Å from the centre of another probe in the same cluster. The total non-bonded interaction energy is then calculated across all probes in the cluster and serves as the means by which the clusters are ranked. The highest ranking site is the cluster with the highest cumulative interaction energy. Occupancy of a cleft is defined as the percentage fill of the cleft by atoms of the binding protein or ligand; a threshold of 25% is applied to define a successful cleft. Proteins with no occupied clefts were excluded from further analysis. These consisted of the larger proteins of two protein-protein interactions (1HE8 and 1DE4) and six protein-ligand complexes (1CGY, 1DR1, 1HDY, 1LDM, 1PBD and 2PDM).

Pocket Ranking

Hydrophobicity

The atomistic Solvent Accessible Surface Area (SASA) covered by each cleft was calculated by NACCESS (Hubbard and Thornton, 1993). The solvent accessibilities of all protein atoms were calculated in both the presence and absence of the probes of each cluster. The difference between these values represents the covered area of the cleft. Hydrophobic area was defined as the area exposed by atoms that are either carbon or sulphur (as in NACCESS). The cleft with the highest proportion of hydrophobic accessible atom area was ranked first.

Desolvation

The entropy associated with the removal of water from the surfaces covered by the probe clusters was estimated using coefficients for the transfer of N-acetyl derivative amino-acids between water and octanol. The total transfer energies have previously been converted to atomistic-based values by linear fitting (Fauchere and Pliska, 1983), and optimized for protein-protein docking (Fernandez-Recio et al., 2004). The coverage of each atom was calculated as described above, allowing the desolvation energy of the cleft to be calculated by multiplying the covered area by the appropriate solvation parameter taken from Fernandez-Recio et al. (2004). Clefts were ranked such that the cleft most easily desolvated was ranked first.

Electrostatics

The peak, average, and total electrostatic potential for each cleft were calculated using the DelPhi v.4 package (Rocchia et al., 2001; Rocchia et al., 2002). Protons were added to the protein as described for Q-SiteFinder (Laurie and Jackson, 2005). A grid of 101 points in all dimensions was built around each protein such that the molecule occupied 50% of the cubic grid’s volume. Amber charges and radii were used to describe the protein and associated cognate ligands (protein-protein interface analysis only) (Giammona, 1984; Weiner et al., 1984; Schneider and Suhnel, 1999; Antony et al., 2000; Meagher et al., 2003). A dielectric boundary condition was used, with dielectric constants of 4 within and 80 outside the molecular surfaces defined by a water probe radius of 1.4 Å. Salt concentration in the solvent was set at 0.15 M and a 2 Å exclusion radius was applied around the protein. The finite difference Poisson-Boltzmann calculations were performed until the potentials at the grid points converged to 0.0001 kT/e, after which the potentials were extrapolated to the centre of the individual probes. The total cleft potential was defined as the sum of the modulus of the individual probe potentials that form the cluster, while the average potential was this total value divided by the number of probes in the cluster. The peak value for each cleft was defined as the highest modulus of the individual probe potentials of each cluster. The same calculations were repeated for all proteins with a unit charge applied to all atoms. The same Amber atom radii were used as described, as were all other variables except the convergence criterion. Unit charge calculations were run until they converged to within 0.01 kT/e.
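The three per-cleft summaries defined above can be illustrated with a short sketch. It assumes the potentials have already been extrapolated to the probe centres by some external step; no DelPhi call is made, and the function name and example values are hypothetical.

```python
def summarise_cleft_potential(probe_potentials):
    """Return (total, average, peak) of |potential| over the probes of one cleft.

    probe_potentials: potentials (kT/e) already extrapolated to each probe
    centre of the cluster; obtaining them is outside this sketch.
    """
    if not probe_potentials:
        raise ValueError("a cleft must contain at least one probe")
    moduli = [abs(p) for p in probe_potentials]
    total = sum(moduli)                  # sum of the modulus of each probe potential
    average = total / len(moduli)        # total divided by the number of probes
    peak = max(moduli)                   # highest-modulus individual probe potential
    return total, average, peak


# Hypothetical potentials at three probe centres of one cleft.
total, average, peak = summarise_cleft_potential([-2.0, 1.0, -3.0])
```

Because the total is a sum over probes, it grows with the number of probes in a cleft, whereas the average and peak do not.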

Conservation

The surface conservation for each cleft was calculated by extracting the protein sequences of each chain in the protein from the associated ATOM records of their Protein Data Bank coordinates. For each chain of the protein, close homologues were found by performing a PSI-BLAST search over the non-redundant Swiss-Prot database release 47.6 (Altschul et al., 1997; Bairoch et al., 2005). Three iterations were performed and the search was refined using sequences with similarity to the query sequence defined at an e-value of less than 0.001. Redundant sequences were removed and the remaining full-length sequences were then aligned by Muscle (Edgar, 2004). The conservation of each position in the alignment relative to the initial protein sequence was calculated by Scorecons (Valdar, 2002). Conservation scores are based on the sum of the weighted pairwise exchange of residues, for each pair of sequences, at each position in the alignment. Values lie between 0 (no conservation) and 1 (completely conserved). A solvent accessible surface was calculated for the protein structure of the same chain by MSMS, with vertices spread evenly across the protein surface at a density of 1 vertex per Å² (Sanner et al., 1996). The conservation scores for each residue were then mapped to the appropriate vertices. Cleft conservation was defined as the average conservation score of vertices within 2.0 Å of the centre of a probe belonging to the cluster.
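The final averaging step can be sketched as follows. This is a minimal illustration, assuming the surface vertices already carry the conservation score of their residue; the function name, the data layout and the inclusive treatment of the 2.0 Å cutoff are assumptions, not details taken from the method.

```python
import math


def cleft_conservation(probe_centres, vertices, cutoff=2.0):
    """Mean conservation score of surface vertices near a cleft.

    probe_centres: list of (x, y, z) probe centres of one cluster, in Å.
    vertices: list of ((x, y, z), score) pairs, score in [0, 1].
    cutoff: distance in Å; a vertex counts if it lies within this
            distance of at least one probe centre (assumed inclusive).
    Returns None when no vertex is close enough.
    """
    selected = [
        score
        for xyz, score in vertices
        if any(math.dist(xyz, centre) <= cutoff for centre in probe_centres)
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Hypothetical cleft with one probe and three surface vertices;
# the third vertex is 5 Å away and is ignored.
probes = [(0.0, 0.0, 0.0)]
surface = [((1.0, 0.0, 0.0), 0.5), ((0.0, 1.5, 0.0), 1.0), ((5.0, 0.0, 0.0), 0.1)]
score = cleft_conservation(probes, surface)
```

Each vertex is counted once even if several probes lie within the cutoff, so densely sampled clefts are not over-weighted.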

Calculation of True and False Positive Rates

The True Positive Rate (TPR), or sensitivity, was calculated as the number of genuine interface clefts that are ranked in the top k clefts (true positives, TP) divided by the total number of interface clefts identified for this monomer (TP plus false negatives, FN), where k increases by one each time. The False Positive Rate (FPR), one minus specificity, was the number of clefts ranked in the top k clefts that are not in the interface (FP) divided by the total number of non-interface clefts (FP plus true negatives, TN). Equal TPR and FPR values at all values of k indicate no discrimination by the ranking method.

The sensitivity (TPR=TP/(TP+FN)) and false positive rates (FPR=FP/(FP+TN)) were calculated for the clefts ranked by the above attributes. An attribute that is a successful predictor of interface clefts will show high sensitivity and a low error rate. Plotting TPR against the FPR gives a Receiver Operating Characteristics (ROC) curve.

In order to give a single measure of prediction accuracy, which is independent of any decision threshold, we have calculated the ROC integral or Area Under the Curve (AUC) (Hanley and Mcneil, 1982). A value of 0.5 indicates no correlation, a value less than 0.5 indicates negative correlation and a value approaching 1.0 indicates the theoretical maximum or perfect prediction.
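The calculation of TPR and FPR at each rank cut-off k, and of the AUC, can be sketched as below. This is an illustration for a single ranked list of clefts; how results were pooled across monomers is not stated here, and the trapezoidal integration and function names are assumptions.

```python
def roc_points(ranked_labels):
    """(FPR, TPR) after each rank cut-off k = 0, 1, ..., n.

    ranked_labels: list of bools, best-ranked cleft first;
    True marks a genuine interface cleft.
    """
    positives = sum(ranked_labels)               # TP + FN
    negatives = len(ranked_labels) - positives   # FP + TN
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one interface and one non-interface cleft")
    points = [(0.0, 0.0)]
    tp = fp = 0
    for is_interface in ranked_labels:           # k increases by one each time
        if is_interface:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical ranking of five clefts: interface clefts at ranks 1, 2 and 4.
area = auc(roc_points([True, True, False, True, False]))
```

In this example five of the six interface/non-interface pairs are ordered correctly, so the area is 5/6, between the no-discrimination value of 0.5 and the perfect value of 1.0.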

RESULTS

Analysis of Protein Clefts

Q-SiteFinder was used to define the clefts found on the protein surface (see Methods). Q-SiteFinder was run on both bound monomers of each hetero-protein complex in the non-obligate protein-protein interaction data sets and on all proteins in the protein-ligand data set. Following evaluation at different probe interaction thresholds, a value of -1.3 kcal/mol was chosen, based on the success of the method in protein-ligand complexes around this value (Laurie and Jackson, 2005) and also on visualisation of the clefts involved in protein-protein interactions, since it also best describes the clefts of the protein that were filled by the sidechains of the opposing binding protein. We found that altering the threshold did not greatly affect the predictive power for protein-protein interactions.

Q-SiteFinder was used to calculate 99 clefts on the protein surface. Occupied or true cleft predictions for each protein are defined as those occupied to more than 25% of their volume by atoms of the interacting molecule. Figure 1a shows that for all 194 monomers of the protein-protein dataset there is a slightly skewed distribution of the number of true interface clefts per protein monomer, with a peak around six clefts. The separate distributions of smaller and larger monomers show differences. The peak number of successful interface clefts for the smaller (ligand) monomers, at six, is higher than the peak for the larger (receptor) monomers, at 2-4 clefts; however, the latter has a broad peak spanning 2-9 clefts. Figure 1b shows there are approximately equal percentages of interface areas covered in both sets of monomers. Protein-ligand receptors show far fewer occupied clefts, with a clear peak at one. Also, true interface clefts have a greater coverage than do the non-interface regions, in contrast to protein-protein interfaces. However, given the large standard deviations it is not possible to conclude that this observation is significant. Below we attempt to separate those clefts that are occupied from those that are not, based on properties that have been suggested to be important for the prediction and stability of intermolecular interactions.
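One way to make the 25% occupancy criterion concrete is sketched below. It is an illustration only, not the Q-SiteFinder implementation: it assumes that a probe counts as filled when any atom of the binding partner lies within a hypothetical `contact_radius` of the probe centre, a definition the text does not specify. The function names and the 1.6 Å default are likewise assumptions.

```python
import math


def cleft_occupancy(probe_centres, partner_atoms, contact_radius=1.6):
    """Percentage of probes in a cleft that are filled by the binding partner.

    probe_centres: list of (x, y, z) probe centres of one cluster, in Å.
    partner_atoms: list of (x, y, z) atom positions of the bound protein or ligand.
    contact_radius: assumed distance (Å) within which an atom fills a probe.
    """
    if not probe_centres:
        raise ValueError("a cleft must contain at least one probe")
    filled = sum(
        1
        for probe in probe_centres
        if any(math.dist(probe, atom) <= contact_radius for atom in partner_atoms)
    )
    return 100.0 * filled / len(probe_centres)


def is_occupied(occupancy_percent, threshold=25.0):
    """A cleft is a true (occupied) cleft when filled to more than the threshold."""
    return occupancy_percent > threshold


# Hypothetical cleft of four probes on a 0.9 Å grid and one partner atom.
probes = [(0.0, 0.0, 0.0), (0.9, 0.0, 0.0), (1.8, 0.0, 0.0), (2.7, 0.0, 0.0)]
occupancy = cleft_occupancy(probes, [(0.0, 0.0, 1.0)])
```

Here the partner atom fills the first two probes only, giving an occupancy of 50%, which exceeds the 25% threshold.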

[Figure 1a: four histograms. Panels: All 194 Protein-Protein Monomers; 97 Larger Protein-Protein Monomers; 97 Smaller Protein-Protein Monomers; 134 Protein-Ligand Monomers. x-axis: Number of Clefts with Occupancies Greater than 25% (0 to >15); y-axis: Percentage of Monomers.]

Figure 1a. Distribution of the number of occupied interface clefts across the 97 protein-protein complexes and 134 protein-ligand complexes, as generated by Q-SiteFinder.

[Figure 1b: bar chart. Series: Interface, Non-Interface, Total Surface. Groups: All Protein-Protein; Smaller Protein-Protein; Larger Protein-Protein; Protein-Ligand. y-axis: Relevant Surface Area Covered (%).]

Figure 1b. Average coverage of the protein surface area by the predicted clefts generated by Q-SiteFinder in 97 protein-protein complexes and 134 protein-ligand complexes. Error bars represent the standard deviations of each distribution.

Analysis of Cleft Properties

Intermolecular interactions are stabilised by a number of factors; these include the burial of areas of hydrophobicity, the formation of hydrogen-bonds and electrostatic complementarity (Chothia and Janin, 1975). We have found clefts in the protein surfaces using Q-SiteFinder and re-ranked them according to simple measures of the cleft properties (see Methods). All the different properties were used to rank the clefts to assess their ability to predict which ones are involved in binding in the native complex. Receiver Operating Characteristics (ROC) curves are used for this purpose (see Methods). Whilst most of the properties used (van der Waals, electrostatic, hydrophobicity, desolvation, and conservation) are readily interpretable, the unit electrostatic properties involve the calculation of electrostatic potential from a uniform charge density on all protein atoms, as opposed to the standard atom-specific charges. The method typically generates large potentials in larger or enclosed clefts (Bate and Warwicker, 2004), and does not reflect atom-specific electrostatic properties of the site.


Figure 2. Receiver Operating Characteristics graphs for the prediction of interface clefts in protein-protein (a-e) and protein-ligand (f) complexes. See text for a description.

Enzyme-Inhibitor Complexes

Results of re-ranking the clefts for the enzyme-inhibitor monomers are shown in figure 2a and 2b. The results show that surface residue conservation is the best predictor of true protein interface regions both for the enzymes (receptors) and for their protein inhibitors (ligands). However, it is a much better predictor for enzymes, in agreement with the study of Bradford and Westhead (2003). The unit electrostatic properties (unit peak and average) are also good predictors of interface regions in the enzymes but show no predictive power (or even slight anti-correlation) in the inhibitors. These results can be rationalized in terms of known protein function, in that the substrate/inhibitor binding cleft of the enzymes is conserved in sequence due to the conserved nature of the catalytic mechanism, and also active sites are often enclosed, pre-organised clefts, a property that may be important for enzyme catalysis. For example, the serine-protease family is well represented in the enzyme class. The sequence conserved catalytic oxy-anion hole in serine proteases is a deep cleft which accommodates the conserved binding conformation of the protein inhibitor “canonical” loop (Jackson and Russell, 2000). Therefore it is unsurprising that unit electrostatic properties capture this characteristic for the enzymes only. Both of these cleft characteristics may capture the characteristics of the binding clefts of enzyme protein families very well but are not expected to generalize well to other protein-protein interactions. The only other strong predictor of interface regions is ranking clefts by desolvation energy. This is discussed further below.

Antibody-Antigen Complexes

The ROC curves for the predictions of the interface clefts of the 38 bound monomers of the 19 antibody-antigen complexes can be seen in figure 2c and 2d. Although the results for the antigens (ligands) do not show any strong trends, those for the prediction of interface clefts in the antibodies (receptors) do. The highly significant anti-correlation of residue conservation in the antibodies is striking, yet entirely logical given the biological role of antibodies. This reflects antibody structure, in which highly sequence diverse loops form the complementarity-determining regions (CDRs) and antigen binding interface. The CDRs determine specificity on what is otherwise a highly conserved protein framework. Both Q-SiteFinder and total unit electrostatics scores show some weak predictive capacity that may reflect their ability to define the size and position of clefts between the loops that form the CDRs of the antibodies. However, the results for the antibodies show that cleft desolvation energy is by far the best predictor of protein interface regions. It is interesting to see that it is also the most prominent of all attributes in the antigen proteins, albeit to a much lesser degree.

Other Protein-Protein Complexes

The ROC curves for the 112 bound monomers of the 56 ‘other’ complexes can be seen in figure 2e. Unlike the antibody and enzyme results, there are few strong correlations to be seen in either the ligand or receptor sets of proteins (see supplementary data). This is probably because this is a highly non-homogeneous set of protein-protein complexes. This diversity almost certainly neutralizes protein subclass specific characteristics such as those seen in the enzyme and antibody sets. The strongest correlation between interface cleft and high rank in both the receptor and ligand proteins is the desolvation energy. Conservation is only weakly predictive, and fails to re-rank the interface clefts highly for a more diverse set of protein complexes, consistent with the results of Caffrey et al. (2004). By changing the area over which conservation is assessed (from a large patch in Caffrey et al.’s investigation to a small cleft in ours) it was hoped to improve these results in line with the findings of Li et al. (2004). Li et al. showed that the conserved residues of interface regions cluster around clefts in the protein surface. By re-ranking the clefts according to conservation score, we anticipated an improved relationship for the prediction of interfaces. However, this was not the case. Consistent with the results for all the protein-protein interfaces analyzed in this work, desolvation energy is the most effective common factor in the identification of interface clefts.

Protein-Ligand Interactions

The ROC curves for the 134 protein-ligand complexes can be found in figure 2f. The difference between these results and those of the protein-protein interface clefts is striking. Unlike the protein-protein interfaces, several properties are excellent predictors of protein interface regions. The ranking of clefts based on Q-SiteFinder supports both the relative merits of using this method for ligand binding-site prediction and the previous observation that this method has a 90% success rate in the top three predicted clusters when tested on the same dataset (Laurie and Jackson, 2005). Not so surprisingly, the electrostatic total (based on Amber charges) and unit electrostatic total show a very similar pattern, with the latter being slightly more successful than Q-SiteFinder. These two measures may be most dependent on the number of probe centres that make up the cleft, as defined by Q-SiteFinder, rather than the nature of the properties that define the cleft. Of all the attributes, the unit electrostatics (average and peak) stand out and rank slightly above Q-SiteFinder overall; however, their initial prioritisation of clefts is in fact weaker. The finding that using unit charges (Laskowski et al., 1996) rather than the more physicochemically realistic Amber charges improves results (Bate and Warwicker, 2004) is also corroborated by our results. Other strong predictors are cleft desolvation energy and surface residue conservation. The latter is not unexpected since active/ligand binding sites have been shown to be sequence conserved across many different protein families (Bartlett et al., 2002), and consequently this property is useful in the identification of enzyme active sites (Greaves and Warwicker, 2005). This is further corroborated here by the difference in predictive power of conservation for enzymes versus non-enzyme protein-ligand complexes (see supplementary data).

With the exception of the enzymes, protein-protein interaction clefts have a poor correlation with each of the three measures for the unit electrostatics calculations, implying that clefts in protein-protein interfaces (defined by Q-SiteFinder) are fundamentally different to those in protein-ligand complexes. The observation that high electrostatic potential is indicative of a ligand binding-site seems at odds with the finding that the ease of desolvation of the cleft also correlates well. The ease of desolvation used in this analysis would be expected to favour non-polar surface with a high aliphatic or aromatic content rather than a charged surface. However, it is primarily the unit electrostatic properties that are highly predictive, and these may be most dependent on the shape and depth of the cavity rather than true atom-based electrostatic properties. Furthermore, the poor predictive power of hydrophobicity in protein-ligand and most protein-protein interactions indicates fundamental differences between clefts defined by hydrophobicity and desolvation. This challenges the classical view that hydrophobicity, as defined by non-polar surface area, is a useful predictor of interaction interfaces (Chothia and Janin, 1975).

DISCUSSION

This study has assessed the use of protein surface clefts for the prediction of protein-protein and protein-ligand interfaces. There are a number of multi-attribute methods for the prediction of protein-protein interfaces that use averages of properties across a patch of protein surface in order to determine whether that patch lies in an interface (Jones and Thornton, 1997; Zhou and Shan, 2001; Fariselli et al., 2002; Neuvirth et al., 2004; Bordner and Abagyan, 2005; Bradford and Westhead, 2005). Combinations of properties allow the prediction of protein-protein interactions with successes of up to 70%, despite treating the protein interface as an average of its properties. There is some evidence to suggest that the stability of a protein complex is not spread across the interface but instead localised around clefts in the protein surface. By identifying clefts in the protein surface we investigated whether any single property could discriminate those in the protein interface from those elsewhere on the protein surface.

An overall ROC integral, or Area Under the Curve (AUC), gives a single measure of prediction accuracy (Hanley and McNeil, 1982) and is used in many scientific studies. A table of AUC results for all the studies presented is given in the supplementary data. For all protein-protein interactions, conservation scores appear to be less effective than anticipated (AUC: 58%), confirming the results of Caffrey et al. (2004), who concluded that protein-protein interfaces are not significantly more conserved than the rest of the protein surface. Only in the enzyme (AUC: 79%) and enzyme protein-ligand (AUC: 78%) interfaces is conservation of significant predictive value. Electrostatics also failed to show any general trends in the prediction of protein-protein interfaces on monomers (AUC: 45-54% for all interactions) other than enzymes (AUC: 65%, electrostatic total), consistent with the theory that long-range electrostatics do not play an important role in the funnel concept of protein-protein interactions (Schlosshauer and Baker, 2004), but are rather used for orientation (Schreiber and Fersht, 1996). Of the properties studied in this investigation, only the desolvation energy of the clefts had any general predictive power across all protein-protein interaction types (AUC: 68%). Fernández-Recio et al. (2005) showed that desolvation scores over larger regions of protein surface area, as defined by circular patches, correlated with the interface in 58% of proteins. It appears that the cleft-based approach employed here is more general. Charged interfaces, such as those between enzymes and inhibitors, and those of the antigens form the majority of interfaces that Fernández-Recio et al. failed to predict. Desolvation scores, as implemented by our cleft-based analysis, correlate nearly as well with charged and antigen interfaces as they do with other types of interface, with the enzyme (AUC: 74%), inhibitor (AUC: 61%) and antigen (AUC: 62%) interfaces being predicted well. A similar high level of success is also seen in protein-ligand interfaces (AUC: 72%), which also have charged interfaces with a high electrostatic potential. The level of success may be due to the fact that in defining the protein surface in terms of clefts we are only sampling a portion of the protein interface, as opposed to the whole surface of a circular patch. It appears that sites of high electrostatic potential and favourable desolvation coexist independently in the same interface. This is supported by the fact that the correlation between sites ranked by these two attributes for both protein-protein and protein-ligand complexes is insignificant (R² = 0.02-0.1) and slightly negative (results not shown).
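The AUC values quoted in this discussion summarise how well a single attribute ranks interface clefts above non-interface clefts. As a minimal sketch of that idea, and not the authors' implementation, the Python function below computes the AUC in the rank-based form described by Hanley and McNeil (1982): the probability that a randomly chosen interface cleft scores higher than a randomly chosen non-interface cleft, with ties counted as one half. The function name `cleft_auc`, the example scores and the labels are hypothetical, and the sketch assumes that a higher score means "more likely to be in the interface" (an energy where lower is more favourable would be negated first).

```python
def cleft_auc(scores, in_interface):
    """Area under the ROC curve for one attribute.

    scores       -- one number per cleft (assumed: higher = more interface-like)
    in_interface -- one boolean per cleft, True if the cleft lies in the interface

    Uses the pairwise (Mann-Whitney) form of the AUC: the fraction of
    (interface, non-interface) cleft pairs in which the interface cleft
    has the higher score, counting ties as half a pair.
    """
    positives = [s for s, flag in zip(scores, in_interface) if flag]
    negatives = [s for s, flag in zip(scores, in_interface) if not flag]
    if not positives or not negatives:
        raise ValueError("need at least one interface and one non-interface cleft")

    favourable = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                favourable += 1.0
            elif p == n:
                favourable += 0.5
    return favourable / (len(positives) * len(negatives))


# Hypothetical example: five clefts on one protein, two of them in the interface.
example_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
example_labels = [True, False, True, False, False]
print(cleft_auc(example_scores, example_labels))  # 5 of 6 pairs are ordered correctly
```

An AUC of 0.5 (50%) corresponds to a ranking no better than random, which is why the electrostatic results of 45-54% are read as having no general predictive value, whereas values approaching 1.0 (100%) indicate that interface clefts are consistently ranked first.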

Protein-protein and protein-ligand interface clefts show several differences, as well as some similarities. The most obvious difference is the failure of electrostatic descriptors overall to identify protein-protein interface clefts. However, identification of enzyme active sites is possible by electrostatic methods, and the results were improved when using a unit charge applied to each atom in the protein (Bate and Warwicker, 2004). Our results confirm these observations. The power of electrostatic descriptors in the protein-ligand interactions, particularly those of the enzyme sub-set, where the ligands bind at active sites, is only maintained in the enzyme class of protein-protein interactions. In fact, the protein-protein enzyme class shows a high degree of correlation with protein-ligand interactions for other attributes, and for both classes Q-SiteFinder and residue conservation also have significant predictive power, in contrast to all other protein-protein interfaces. Therefore, it would appear that the binding-sites of enzymes are much more predictable than those of other protein-protein complexes. The very high predictive power of Q-SiteFinder (AUC: 88%) and all the unit electrostatic properties (AUC: 93%) for protein-ligand interactions is very encouraging and will allow the targeting of functionally important ligand binding-sites in functional genomics.

Overall, protein-protein and protein-ligand interface clefts do have similarities. Our results show that the ease of cleft desolvation or “de-wetting” is strikingly similar across all classes of interface and is more indicative of interface regions than electrostatics, Q-SiteFinder, or residue conservation, which, although highly predictive in some cases, do not generalize well to all classes of protein interaction. Recent studies of the role of water in protein-protein association of the melittin tetramer (Liu et al., 2005) and in protein-ligand binding in mouse major urinary protein (MUP) (Barratt et al., 2005) by Molecular Dynamics simulation illustrate the importance of the phenomenon of de-wetting in molecular interactions. In the former, the strongly hydrophobic surfaces induce the evaporation (drying) of water in the interface as the melittin protein subunits approach one another. The authors conclude that sufficiently hydrophobic protein surfaces can induce a liquid–vapour transition providing the driving force towards protein association. In the latter study, the MUP ligand binding-site cleft is pre-organised, hydrophobic and poorly solvated in the unbound form, which may explain the largely enthalpy (as opposed to entropy) driven thermodynamics of ligand binding. We have found that hydrophobicity of the surface does not correlate strongly with interface clefts in either protein-protein or protein-ligand interfaces. Hydrophobic, non-polar surfaces will also correlate closely with areas of low desolvation energy; however, in proteins that form non-obligate protein/ligand complexes (and hence must be independently stable in water) extended hydrophobic solvent-exposed surfaces are unlikely to exist. Therefore, a more subtle phenomenon exists: a relationship between ease of desolvation (de-wetting) and interface propensity, which may be an intrinsic feature of all protein binding surfaces.

ACKNOWLEDGEMENTS

The authors would like to thank Alasdair Laurie for help with Q-SiteFinder and the protein-small molecule dataset. NJB is funded by the Medical Research Council.

REFERENCES

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-3402.
Antony, J., Medvedev, D. and Stuchebrukhov, A. (2000) Theoretical study of electron transfer between the photolyase catalytic cofactor FADH(-) and DNA thymine dimer. J Am Chem Soc, 122, 1057-1065.
Aytuna, A. S., Gursoy, A. and Keskin, O. (2005) Prediction of protein-protein interactions by combining structure and sequence conservation in protein interfaces. Bioinformatics, 21, 2850-2855.
Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O'Donovan, C., Redaschi, N. and Yeh, L. S. (2005) The Universal Protein Resource (UniProt). Nucleic Acids Res, 33, D154-159.
Barratt, E., Bingham, R. J., Warner, D. J., Laughton, C. A., Phillips, S. E. and Homans, S. W. (2005) Van der Waals interactions dominate ligand-protein association in a protein binding site occluded from solvent water. J Am Chem Soc, 127, 11827-11834.
Bartlett, G. J., Porter, C. T., Borkakoti, N. and Thornton, J. M. (2002) Analysis of catalytic residues in enzyme active sites. J Mol Biol, 324, 105-121.
Bate, P. and Warwicker, J. (2004) Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol, 340, 263-276.
Bogan, A. A. and Thorn, K. S. (1998) Anatomy of hot spots in protein interfaces. J Mol Biol, 280, 1-9.
Bordner, A. J. and Abagyan, R. (2005) Statistical analysis and prediction of protein-protein interfaces. Proteins, 60, 353-366.
Bradford, J. R. and Westhead, D. R. (2003) Asymmetric mutation rates at enzyme-inhibitor interfaces: implications for the protein-protein docking problem. Protein Sci, 12, 2099-2103.
Bradford, J. R. and Westhead, D. R. (2005) Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics, 21, 1487-1494.
Caffrey, D. R., Somaroo, S., Hughes, J. D., Mintseris, J. and Huang, E. S. (2004) Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci, 13, 190-202.
Chen, R., Mintseris, J., Janin, J. and Weng, Z. (2003) A protein-protein docking benchmark. Proteins, 52, 88-91.
Chothia, C. and Janin, J. (1975) Principles of protein-protein recognition. Nature, 256, 705-708.
Clackson, T. and Wells, J. A. (1995) A hot spot of binding energy in a hormone-receptor interface. Science, 267, 383-386.
Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res, 32, 1792-1797.
Espadaler, J., Romero-Isart, O., Jackson, R. M. and Oliva, B. (2005) Prediction of protein-protein interactions using distant conservation of sequence patterns and structure relationships. Bioinformatics, 21, 3360-3368.

Fariselli, P., Pazos, F., Valencia, A. and Casadio, R. (2002) Prediction of protein-protein interaction sites in heterocomplexes with neural networks. Eur J Biochem, 269, 1356-1361.
Fauchere, J. L. and Pliska, V. (1983) Hydrophobic parameters-pi of amino-acid side-chains from the partitioning of N-acetyl-amino-acid amides. Eur J Med Chem, 18, 369-375.
Fernández-Recio, J., Totrov, M. and Abagyan, R. (2004) Identification of protein-protein interaction sites from docking energy landscapes. J Mol Biol, 335, 843-865.
Fernández-Recio, J., Totrov, M., Skorodumov, C. and Abagyan, R. (2005) Optimal docking area: a new method for predicting protein-protein interaction sites. Proteins, 58, 134-143.
Giammona, D. A. (1984), Vol PhD. Davis: University of California.
Goodford, P. J. (1985) A computational procedure for determining energetically favorable binding sites on biologically important macromolecules. J Med Chem, 28, 849-857.
Greaves, R. and Warwicker, J. (2005) Active site identification through geometry-based and sequence profile-based calculations: burial of catalytic clefts. J Mol Biol, 349, 547-557.
Halperin, I., Wolfson, H. and Nussinov, R. (2004) Protein-protein interactions; coupling of structurally conserved residues and of hot spots across interfaces. Implications for docking. Structure (Camb), 12, 1027-1038.
Hanley, J. A. and McNeil, B. J. (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143, 29-36.
Henrick, K. and Thornton, J. M. (1998) PQS: a protein quaternary structure file server. Trends Biochem Sci, 23, 358-361.
Hu, Z., Ma, B., Wolfson, H. and Nussinov, R. (2000) Conservation of polar residues as hot spots at protein interfaces. Proteins, 39, 331-342.
Hubbard, S. J. and Thornton, J. M. (1993) NACCESS. Manchester University.
Jackson, R. M. (2002) Q-fit: a probabilistic method for docking molecular fragments by sampling low energy conformational space. J Comput Aided Mol Des, 16, 43-57.
Jackson, R. M. and Russell, R. B. (2000) The serine protease inhibitor canonical loop conformation: examples found in extracellular hydrolases, toxins, cytokines and viral proteins. J Mol Biol, 296, 325-334.
Jones, S. and Thornton, J. M. (1997) Analysis of protein-protein interaction sites using surface patches. J Mol Biol, 272, 121-132.
Jones, S. and Thornton, J. M. (1997) Prediction of protein-protein interaction sites using patch analysis. J Mol Biol, 272, 133-143.
Keskin, O., Ma, B. and Nussinov, R. (2005) Hot regions in protein-protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol, 345, 1281-1294.
Laskowski, R. A., Luscombe, N. M., Swindells, M. B. and Thornton, J. M. (1996) Protein clefts in molecular recognition and function. Protein Sci, 5, 2438-2452.
Laurie, A. T. and Jackson, R. M. (2005) Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics, 21, 1908-1916.
Li, X., Keskin, O., Ma, B., Nussinov, R. and Liang, J. (2004) Protein-protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre-organized in the unbound states: implications for docking. J Mol Biol, 344, 781-795.
Li, Y., Huang, Y., Swaminathan, C. P., Smith-Gill, S. J. and Mariuzza, R. A. (2005) Magnitude of the hydrophobic effect at central versus peripheral sites in protein-protein interfaces. Structure (Camb), 13, 297-307.
Liu, P., Huang, X., Zhou, R. and Berne, B. J. (2005) Observation of a dewetting transition in the collapse of the melittin tetramer. Nature, 437, 159-162.
Lo Conte, L., Chothia, C. and Janin, J. (1999) The atomic structure of protein-protein recognition sites. J Mol Biol, 285, 2177-2198.
Ma, B., Elkayam, T., Wolfson, H. and Nussinov, R. (2003) Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci U S A, 100, 5772-5777.
Meagher, K., Redman, L. and Carlson, H. (2003) Development of polyphosphate parameters for use with the AMBER force field. J Comput Chem, 24, 1016-1025.
Neuvirth, H., Raz, R. and Schreiber, G. (2004) ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol, 338, 181-199.
Nissink, J. W., Murray, C., Hartshorn, M., Verdonk, M. L., Cole, J. C. and Taylor, R. (2002) A new test set for validating predictions of protein-ligand interaction. Proteins, 49, 457-471.
Rhodes, D. R., Tomlins, S. A., Varambally, S., Mahavisno, V., Barrette, T., Kalyana-Sundaram, S., Ghosh, D., Pandey, A. and Chinnaiyan, A. M. (2005) Probabilistic model of the human protein-protein interaction network. Nat Biotechnol, 23, 951-959.
Rocchia, W., Alexov, E. and Honig, B. (2001) Extending the applicability of the nonlinear Poisson-Boltzmann equation: Multiple dielectric constants and multivalent ions. J Phys Chem B, 105, 6507-6514.
Rocchia, W., Sridharan, S., Nicholls, A., Alexov, E., Chiabrera, A. and Honig, B. (2002) Rapid grid-based construction of the molecular surface and the use of induced surface charge to calculate reaction field energies: applications to the molecular systems and geometric objects. J Comput Chem, 23, 128-137.
Russell, R. B., Alber, F., Aloy, P., Davis, F. P., Korkin, D., Pichaud, M., Topf, M. and Sali, A. (2004) A structural perspective on protein-protein interactions. Curr Opin Struct Biol, 14, 313-324.
Sali, A., Glaeser, R., Earnest, T. and Baumeister, W. (2003) From words to literature in structural proteomics. Nature, 422, 216-225.
Sanner, M. F., Olson, A. J. and Spehner, J. C. (1996) Reduced surface: an efficient way to compute molecular surfaces. Biopolymers, 38, 305-320.
Schlosshauer, M. and Baker, D. (2004) Realistic protein-protein association rates from a simple diffusional model neglecting long-range interactions, free energy barriers, and landscape ruggedness. Protein Sci, 13, 1660-1669.
Schneider, C. and Suhnel, J. (1999) A molecular dynamics simulation of the flavin mononucleotide-RNA aptamer complex. Biopolymers, 50, 287-302.
Schreiber, G. and Fersht, A. R. (1996) Rapid, electrostatically assisted association of proteins. Nat Struct Biol, 3, 427-431.
Thorn, K. S. and Bogan, A. A. (2001) ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics, 17, 284-285.
Valdar, W. S. (2002) Scoring residue conservation. Proteins, 48, 227-241.

Weiner, S. J., Kollman, P. A., Case, D. A., Singh, U. C., Ghio, C., Alagona, G., Profeta, S. and Weiner, P. (1984) A new force-field for molecular mechanical simulation of nucleic-acids and proteins. J Am Chem Soc, 106, 765-784.
Xu, D., Tsai, C. J. and Nussinov, R. (1997) Hydrogen bonds and salt bridges across protein-protein interfaces. Protein Eng, 10, 999-1012.
Zhou, H. X. and Shan, Y. (2001) Prediction of protein interaction sites from sequence profile and residue neighbor list. Proteins, 44, 336-343.
