Flexible Protein-Protein Docking Based on Best-First Search Algorithm EFRAT NOY, AMIRAM GOLDBLUM
Department of Medicinal Chemistry and the David R. Bloom center for Pharmacy, School of Pharmacy, The Hebrew University of Jerusalem, Israel 91120 Received 19 June 2009; Revised 4 November 2009; Accepted 4 November 2009 DOI 10.1002/jcc.21480 Published online 19 January 2010 in Wiley InterScience (www.interscience.wiley.com).
Abstract: We developed a new high resolution protein-protein docking method based on Best-First search algorithm that loosely imitates protein-protein associations. The method operates in two stages: first, we perform a rigid search on the unbound proteins. Second, we search alternately on rigid and flexible degrees of freedom starting from multiple configurations from the rigid search. Both stages use heuristics added to the energy function, which causes the proteins to rapidly approach each other and remain adjacent, while optimizing on the energy. The method deals with backbone flexibility explicitly by searching over ensembles of conformations generated before docking. We ran the rigid docking stage on 66 complexes and grouped the results into four classes according to evaluation criteria used in Critical Assessment of Predicted Interactions (CAPRI; ‘‘high,’’ ‘‘medium,’’ ‘‘acceptable,’’ and ‘‘incorrect’’). Our method found medium binding conformations for 26% of the complexes and acceptable for additional 44% among the top 10 configurations. Considering all the configurations, we found medium binding conformations for 55% of the complexes and acceptable for additional 39% of the complexes. Introducing side-chains flexibility in the second stage improves the best found binding conformation but harms the ranking. However, introducing side-chains and backbone flexibility improve both the best found binding conformation and the best found conformation in the top 10. Our approach is a basis for incorporating multiple flexible motions into protein-protein docking and is of interest even with the current use of a simple energy function. q 2010 Wiley Periodicals, Inc.
J Comput Chem 31: 1929–1943, 2010
Key words: flexible docking; Best-First search; loop modeling; backbone flexibility; conformational change
Introduction Predicting the structures of protein-protein complexes has advanced greatly over the last few years. Current docking methods are effective in many cases of small conformational changes on binding, especially if biological information is available.1,2 However, in general, it is still discouragingly difficult to predict near-correct complex structures, because there is still no general solution for handling flexibility and energy functions are not accurate enough to discriminate consistently between near experimental and other conformations.3,4 The most complicated part of developing an efficient sampling method is dealing with backbone flexibility because the search space is not only extremely large but also cannot be easily separated into independent degrees of freedom (DOFs). Unlike side-chains’ DOFs, which are relatively independent of each other, changing a backbone dihedral angle can result in a displacement of large parts of the molecule. Essentially, a thorough treatment of backbone flexibility is equivalent to the folding problem.
Consequently, existing docking methods limit backbone flexibility to certain types of motion, identify flexible regions, and predict possible conformational changes.4–6 Types of motions that have been observed on the binding of proteins include loop rearrangements, hinge movements, and other more complex motions.7 Detection of flexible regions and suggestion of possible conformational changes can be derived by analyzing different experimental protein structures,8,9 by using graph theory10 or elastic network model11 or by using molecular dynamics (MD) simulations, normal modes,12 or essential dynamics.13 Various attempts were made to deal with backbone flexibility in protein-protein docking.5 Some groups perform multiple rigid-body dockings of ensembles of conformations from nuclear magnetic resonance14 and MD simulations.15,16 Ensemble docking not only improves the performance in terms of the number
Additional Supporting Information may be found in the online version of this article. Correspondence to: A. Goldblum; e-mail:
[email protected] q 2010 Wiley Periodicals, Inc.
1930
Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
of near-correct solutions but also leads to increased numbers of false positives. HADDOCK17 explicitly allows backbone flexibility during a MD simulated annealing refinement stage in torsion angle space. Deformations along principal components are treated as additional DOFs in ATTRACT, allowing the structures to deform along soft harmonic modes to facilitate the binding process.18 A multicopy, mean-field approach has been proposed to deal with loop rearrangements.19 The statistical weight of each loop copy is adapted during docking according to Boltzmann criterion and the receptor experiences the mean field of the ensemble of conformations. RosettaDock20 integrates backbone flexibility based on a ‘‘fold-tree’’ representation. Backbone flexibility is confined to loop regions or applied to the complete backbone in a minimization stage. FlexDock21–23 and MolFit24 handle hinge bending motions. When a protein is known or suspected to undergo such a motion it is subdivided into domains, the domains are docked independently and the docked domains are then assembled. In this article, we describe a new flexible docking method which is based on a variant of the Best-First search algorithm. This algorithm was applied previously to deal with the folding problem25 and to enhance the traditional dynamic programming method of sequence alignment.26 Search methods differ in the way they direct the search trajectory. For example, Steepest Descent prefers to move toward the minimal energy, Monte Carlo moves randomly while simulated annealing moves toward the minimal energy with a certain probability to move in a different direction. Our method uses heuristics which loosely imitates protein-protein associations. The addition of this heuristics to the energy function affects the search trajectory by preferring to move in certain directions so as to converge to the minimal energy faster. It follows the methodology of working in phases27 to reduce the search space. At the broad phase we keep the proteins rigid, with some surface softness, searching the comparatively small space of relative protein orientations (translational and rotational) and identifying an initial set of candidate configurations. After completing the broad phase, we cluster the best models produced in that phase and select the top 10 cluster’s representatives as starting points for the second phase, where we allow movement of side-chains and optionally of the backbone. Backbone flexibility is introduced using ensembles of flexible fragments (FFs) that were shown to contain conformations similar to those observed in bound complexes.28 These ensembles are generated with a loop prediction program based on the iterative stochastic elimination (ISE) algorithm prior to docking. Finding conformations close to those observed in bound complexes resembles results obtained in an earlier work that dealt with side chains.29
Methods Docking Algorithm
Proetin-protein docking is extremely sensitive to small conformational changes due to the large contact area between the proteins.30 For that reason, we use a soft van der Waals (VDW) potential, which allows the proteins to penetrate into one another, to some extent, without being heavily penalized by
energy. Such potentials are less discriminative and, therefore, make the recognition of good candidate conformations highly problematic. Although this problem is already apparent in rigid searches, it is much more pronounced when dealing with flexible searches. In these searches, the number of possibilities is much larger and, therefore, problematic candidate ranking is much more likely to result in bad decisions while choosing which conformation to try next. To circumvent this problem, we first search only rigid DOFs, where, presumably, the soft energy function harms decisions less. Our docking procedure is therefore divided into the following three stages (as shown in Fig. 1A): i. Broad phase: rigid docking, supposed to locate good starting configurations for a refinement stage that includes also flexible DOFs. ii. Minimization and rescoring: the top few thousand configurations are clustered, and cluster representatives (CRs) are then minimized using a more exact energy function with a better estimation for solvent effects. This stage is expected to rank high near-correct docked configurations. iii. Narrow phase: flexible refinement of the 10 lowest energy CR (CR10) configurations found in the last stage. The refinement stage is composed of interleaved steps of search on rigid DOFs and search on flexible DOFs. Broad Phase: Rigid Docking
Our docking method is based on a variant of the Best-First search algorithm. Best-First search is a graph search algorithm which starts from a set of initial nodes and searches for a given goal node or one passing a given goal test. It operates by evaluating the nodes using a heuristic function and expanding the most promising node first.31,32 The heuristic function encapsulates knowledge about the specific problem to help guide the search to more promising nodes. At start the initial nodes are inserted into an ‘‘open’’ list. In each iteration, the most promising node in the open list is expanded, and successors of this node are evaluated by the heuristic function and added to the open list. Expanding the most promising node in the open list continues as long as the open list is not empty, and the goal has not been reached. All the nodes that were visited are recorded in a ‘‘closed’’ list to avoid repetition and cycles. We shall describe rigid protein-protein docking in terms of graph searching (specifically, a grid) problem to apply Best-First search to it. The receptor (larger protein) is fixed in space, whereas the ligand (smaller protein) is free to translate and rotate in a grid about the receptor. Each node in the grid represents a possible configuration of the ligand with respect to the receptor. Successor nodes are adjacent nodes which differ only in one axis, i.e., neighboring ligand configurations, which differ only in one DOF. The search starts with the ligand located in one of the corners of the grid (a configuration with lowest or highest values available to each DOF is in the open list). In each iteration, the ligand’s configuration with the most promising heuristics is pulled out of open and its neighboring configurations (successor nodes) which were not visited yet are evaluated by the heuristic function. In addition, to overcome local minima better, we add a
Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
semi-random configuration to the neighboring configurations, formed by choosing a random configuration in the range between 25 steps to 15 steps in each DOF. We insert only the two neighboring configurations with best heuristics into the open list, effectively pruning the search. After 250 consecutive configurations are pulled out of open without improvement of the minimal energy found, the ligand molecule is repositioned in a random orientation inside the allowed search space and the search starts from this new configuration. The search ends only after 5000 consecutive configurations are pulled out of open without improvement of the minimal energy found, meaning there are at least 20 starting configurations of the ligand during the search. Figure 1B shows a flowchart of the rigid docking stage. We use a first-order linear heuristic function [eq. (1)], which is a weighted sum of the energy of the configuration and the distance between the centers of mass of the two proteins in the complex. Including the distance term in the heuristic function causes the ligand to approach the receptor rapidly and stay adjacent to the receptor along the search. The constant used in the weighted sum is c 5 1000, which we found empirically to balance well between the protein-protein conformational energy and the distance between the centers of mass (i.e., locating the ligand close to the receptor, without too many penetrations). At large distances, the distance term is dominant and the search algorithm will try to minimize the cost by bringing the molecules closer. At small distances, when penetrations occur, the energy term provides the main contribution so the search will move the molecules away. Energies of the configurations are calculated as described below. Heuristic ¼ Energy þ c 3 distance
(1)
In principle, one could consider a non-linear heuristic function. However, because we use an energy function with a linear VDW term at small distance, as explained later, a linear heuristic function works well. Figure 2 illustrates the search pattern of our Best-First search algorithm using the heuristic function described earlier. In the figure, one can see that the ligand rapidly approaches the receptor and remains adjacent to it for a significant part of the search. Minimization and Rescoring
After clustering the rigid docking results, we are left with a few hundred docked configurations. These configurations may be poorly ranked because we use a fairly simple scoring function and a soft VDW potential during the rigid search. To remove clashes and improve the ranking of near-correct docked configurations, cluster-representative (CR) configurations generated by the rigid docking stage were minimized in molecular operating environment.33 Each CR was minimized gradually on larger parts of the proteins: (1) only hydrogens, (2) sidechains, and (3) all atoms. To keep the runtimes reasonable we decided to limit the number of minimization steps applied. Each gradual minimization contains 20 steps of steepest descent and 30 steps of conjugate gradient. The convergence criterion used between consecutive minimization steps is root mean square
1931
˚ . In practice, this limitation is almost deviation (RMSD) 0.1A never met. We improved the precision of the energy function in this stage by using longer cutoffs for non-bonding interactions ˚ ), a normal VDW term and the GB/SA continuum (18.0–20.0A model for describing better solvent effects. Narrow Phase: Flexible Refinement
The flexible refinement is applied to the CR10 configurations from the former stage. This stage is composed of interleaved rigid and flexible cycles. The rigid cycles explore rigid DOFs (translation and rotation) using the Best-First search algorithm. The flexible cycles explore flexible DOFs (side-chains’ rotamers and conformational ensembles of FFs, see later) using an algorithm inspired by a method developed in Wodak’s laboratory.34 Searching rigid DOFs and flexible DOFs alternately allows the proteins to adapt their conformations to fit each other while still enabling fast spatial movement. This is in contrast to searching over all the DOFs simultaneously, which tends to linger too long at incorrect regions of the conformational search space. The flexible refinement uses an open list for both the rigid cycle and the flexible cycle. The rigid cycle of the refinement is similar to the rigid docking. A starting configuration from the CR10 of the former stage is inserted to the open list, and conformations are pulled out of the open list until there is no improvement in the minimal energy for 200 consecutive pulls. Then a flexible cycle starts and repeats the following ‘‘Wodak-like’’ procedure: i. One of the 10 best ranked conformations of the ligand sampled until this point is picked randomly. ii. One of the flexible DOFs is selected randomly. iii. The conformational space of the selected flexible DOF is explored and the best conformations (those with the lowest energy) are retained and inserted back to the open list. For side-chains, the rotameric space is explored, whereas for FFs, the conformational ensemble is explored. The Wodak-like cycle (searching one DOF) is repeated until the energy does not change for NFD steps, where NFD equals the number of flexible DOFs. When this is reached, another rigid cycle starts and so forth. The entire flexible refinement finishes only after 500 consecutive docked conformations are pulled out of the open list with no improvement of the minimal energy. The minimal number of flexible cycles between rigid cycles is therefore two but usually many more flexible cycles are conducted. A schematic description of the flexible refinement procedure is presented in Figure 3. Search Space
A global docking simulation of all translational, rotational, side chains, and backbone conformational space is computationally demanding. We, therefore, limited the number of DOFs which were explicitly searched and their allowed values. Rigid-body DOFs
We limit the search space to a discrete subset of values for each of the 6 DOFs. The sampling resolution must be sufficient to guarantee a near hit. Assuming a sampling resolution of Dx in
Journal of Computational Chemistry
DOI 10.1002/jcc
1932
Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
Figure 1. (A) Schematic presentation of the flexible docking procedure stages: (1) rigid docking, (2) minimization and rescoring, (3) flexible refinement. The rigid docking stage is described in detail in (B) and the flexible refinement stage is illustrated in Figure 3. AMBER forcefield is used as a scoring function along the search. The search is performed only on a restricted search space defined on the basis of experimental information. (B) Flowchart of the rigid docking stage. The numbers in the ‘‘failed improve’’ cells count the iterations with no improvements in the minimal energy through that cell (see accompanying text). config 5 configuration.
Figure 2. Illustration of the search pattern produced by our BestFirst docking algorithm using the heuristic function described in the text. The set of possible positions of the ligand center (the ‘‘search space’’) is given by white grid points. The search starts at the yellow ball and each point it visits is colored in orange. The green ball is drawn at the nearest point on the grid to the experimental docking position. The receptor is drawn in blue.
Figure 3. Schematic description of the flexible refinement stage. This stage is composed of interleaved rigid and flexible cycles. The numbers in the ‘‘failed improve’’ cells count the iterations with no improvements in the minimal energy through that cell (see accompanying text).
Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
each of the translation axes and Da in each of the rotation axes, the overall expected error will be: (3Dx2 1 3r2Da2), where r is the radius of the molecule. To obtain results, which are accurate ˚ , it is sufficient to take translation steps of size 0.8A ˚ up to 1.0A and rotation steps of size 2.38 (equivalent to translation steps of ˚ , assuming average ligand diameter of 40A ˚ ). All along this 0.8A ˚ in study, we used steps with slightly better resolution: 0.7A translational space and 28 in rotational space. The discrete grid is placed, so that the exact bound state position of the ligand lies exactly between grid points and cannot be sampled by the ˚ in each translational DOF search (i.e., the bound state lies 0.35A and 18 in each rotational DOF from the nearest grid points). At this stage of development, we further limit the size of the search space to a 6 dimensional box around the binding site, relying on an external application that will identify the approximated binding interface in real cases, when the structure of a complex is not known. Other programs restricted the search to regions around the binding site in their development stage (e.g., ICM-DISCO,30 Internal Coordinate Mechanics - Docking and Interface Side Chain Optimization). A binding site may be identified using methods based on structure,35–37 sequence conservation,38,39 and interaction energies.40,41 Other docking methods can also be used to allocate promising starting points (e.g. low resolution docking42). In fact, experimental data continues to play a very important role in guiding docking calculations, narrowing the search space, and filtering docking solutions in the CAPRI experiment.2 A typical example is HADDOCK,17 one of the currently most successful docking programs, that requires substantial external information about the binding site. The translational search space is restricted to a box around the ˚ (610.15A ˚ from known receptor binding site of side length 20.3A each side). The rotational search space is limited to 608 (from 2308 to 308) around each Cartesian coordinate from the expected binding interface. Given the step sizes earlier, each rigid DOF has 30 possible values and the total number of possible configurations for locating the ligand relative to the receptor when the proteins are treated as rigid entities is 306 5 7.29 3 108. In principle, our method could be applied to the entire relevant search space and not constrained to the box described earlier. This restriction was imposed mainly because the energy function we use is not accurate enough for docking unbound subunits and causes a ranking problem. In larger search space, the algorithm samples much more states that should be ranked. This in turn decreases the chance of ranking good states high enough. To check the earlier assumptions, we ran the docking algorithm on a larger search space for several complexes (with both bound and unbound subunits). The search space was enlarged by doubling ˚ for each translational axis the range of each DOF to a total of 40.6A and 1208 (from 2608 to 608) for each rotational axis. We also checked the effect of doubling the number of random starting points (from 20 to 40). Note that the search space itself is larger by a factor of 2.6 On future development, we plan on integrating our method with a better energy function that will allow global search. Side-Chains DOFs
An explicit search of all interface side chains, both in the receptor and in the ligand, is computationally expensive. In addition,
1933
most side-chain conformations do not change considerably on complex formation.43 We, therefore, limited the conformational ˚ of any search to side chains in the interface which are within 5A atom in the other protein of the complex (measured when the proteins are bound). This is a simplification: in real cases the bound structure of the complex is unknown and the choice of interface residues can only be based on estimated binding sites or computed in each iteration. This simplification can cause near-correct models to be artificially preferred over more distant solutions, because they have a better chance to be refined in the flexible refinement ˚ interface is relatively large considering stage. In practice, a 5A the size of our search space, so this problem is of a lesser influence. To quantify the effect, we checked the interfaces of the molecules at the worst starting points used in the flexible search phase. We found that on average 79% of the residues that should be considered, were actually considered in the search. Another restriction we imposed is to search only over four types of amino acids (ARG, SER, GLN, and LYS), that were found to change conformation upon binding more than other types of amino acids.43 This restriction reduced the number of searched side chains by 70%, from 37 to 11 on average. Each interface side chain that is explored, in the ligand or in the receptor, adds a DOF to the search space. The allowed values of this DOF are the different rotameric conformations from Dunbrack’s backbone-dependent rotamer library.44 A backbonedependent rotamer library takes into account population variations that depend on the local backbone conformation and is, therefore, preferred over a backbone-independent rotamer library.45 As only a small part of the side chains change their conformation between the unbound and the bound states,43 we added the conformations of the side chains in the unbound monomers to the set of possible side chain’s conformations. Backbone DOFs
Partial backbone flexibility in protein-protein interfaces was introduced in this study by conformational search of FFs prior to docking using ISE.28 For clarity, we provide a brief description of the FFs selection and search procedure detailed there. Selection of FFs begins by superimposing backbone atoms of bound and unbound structures of a protein. We examine a subset of fragments for which the distance between backbone atoms of ˚ in the bound structure and the unbound structure is at least 1A four to eight consecutive residues. We define potential FF as fragments in the subset for which at least one of the residues is in the complex interface. To find actual FFs, we superimpose the bound and unbound structures over backbone atoms of three stem residues on each side of the potential FF. Any potential FF with RMSD larger ˚ is defined as a FF. In the refinement stage of this than 1.0A docking study, we addressed the backbone flexibility in a limited fashion by treating in each complex only one FF. FFs search in the unbound structure starts with a conformational search over the FF dihedral angles based on the ISE algorithm.46,47 Then side chains are added to the lowest energy conformations of the fragment using the program SCAP,48 and these conformations are minimized in Sybyl.49 Eventually, the minimized conformations are clustered.
Journal of Computational Chemistry
DOI 10.1002/jcc
Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
1934
The selection of FFs is a simplification: in real cases, the bound structure of the complex is unknown and the detection of FFs can be derived by analyzing different experimental protein structures or by computational methods as described in the Introduction. A new DOF is added to the search space for each FF, whether in the ligand or in the receptor. The allowed values of this DOF are the 10 best ranked conformations resulting from the FF search in addition to the unbound conformation. Scoring Function
All atoms are represented explicitly during the docking experiments. AMBER forcefield50 has been used with the all atoms model. Bonding interactions are not considered in the rigid docking. Intermolecular nonbonding interactions account for VDW and electrostatic interactions. VDW interactions are basically represented by a Lennard-Jones 6–12 potential, whereas electrostatics is computed by a Coulombic term, with a distance dependent dielectric of 4 3 r. Nonbonded interactions are turned ˚ for VDW interactions and off gradually between 8 and 9A ˚ for electrostatic interactions. between 12 and 13A Because of the extreme sensitivity of VDW potential to small conformational changes, we use a soft VDW potential, which allows the proteins to penetrate into one another, to some extent, without being heavily punished. Soft potential is achieved by a linear extrapolation of the short-range repulsive score of the Lennard-Jones 6-12 potential. This soft VDW potential, adopted from Baker’s laboratory,51,52 is described in eqs. (2) and (3), where the short range repulsive score is extrapolated linearly below rij 5 0.6 3 rij and rij is the sum of the atomic radii of atoms i and j. The energy well depth, eij, and the atomic radii sum, rij, are taken from the AMBER parameters set. " 6 # N1 X N X rij 12 rij 4eij 2 3 (2) if rij > 0:6 3 rij r rij ij i¼1 j¼iþ1 rij eij 8759:2 3 þ 5672:0 else rij j¼iþ1
N1 X N X i¼1
(3)
In the flexible refinement stage, intramolecular interactions are also calculated. These interactions include deformations of bond lengths, bond angles, and torsion angles in addition to VDW and electrostatic interactions. The functional form of the Amber forcefield used in this study is presented in eq. (4). We use soft VDW potential along the search and reevaluate the energies with a normal VDW term after the rigid docking and the flexible refinement are completed. X1
þ
2
N1 X N X 1 X Vn ½1 þ cosðnx cÞ þ 2 i¼1 j¼iþ1 dihedrals ( " ) 6 # 12 rij rij qi 3 qj þ 3 4eij 2 3 rij rij 4pe0 rij
The rigid search produces many candidate states in which the ranking of near-correct docked configurations is usually not high enough. This is due to the inaccuracy of the energy function and the fact that we disregard proteins’ flexibility. We cluster binding configurations to improve the rank of near-correct docked configurations, resulting from the rigid search. We used partitional agglomerative (‘‘bottom-up’’) clustering on ligand configurations, using a backbone RMSD criterion, on a sorted (by energy) list of ligand configurations. Ligand configurations with a structure similar to the lowest energy configura˚ RMSD, are discarded first. The next remaining tion, within 1A lowest energy configuration is selected as the representative for a second cluster of configurations and configurations that are close to it are discarded. This process is continued iteratively until no more configurations are discarded. The number of representative configurations that were retained after the rigid search for each of the ensembles varied with protein and was between 144 and 985. The flexible search described earlier is performed for each of the CR10 configurations from the rigid stage. For technical reasons, the results of each of the flexible searches were clustered separately. Later, these clustered conformations were combined and clustered again in a similar manner. The number of representative configurations that were retained after the flexible refinement for each of the ensembles varied between the complexes and ranged from 149 to 1940. Dataset
The docking method was evaluated on a recently published nonredundant protein-protein docking benchmark.53 From this benchmark, targets were selected if both the bound complex and the unbound components of the target were available. We used a set of 66 unbound-unbound complexes to evaluate the results of the rigid docking, of which 23 are enzyme-inhibitor (EI) complexes, 10 are antibody-antigen (AA) complexes, and 33 are ‘‘other’’ complexes. For six of these targets, we also examined the rigid docking on bound-bound complexes. We used a set of 12 complexes to evaluate the results of flexible docking with side-chains DOFs. This set covers all the different categories of protein-protein complexes: four EI, four AA and four others. For six of these complexes, the flexible search was conducted also on a flexible backbone fragment. We examined our flexible docking approach on a limited set of complexes due to the long run time. Evaluation of Model Accuracy
X1 Vðr Þ ¼ kb ðl lo Þ þ ka ðh h0 Þ2 2 2 bonds angles N
Clustering
ð4Þ
To evaluate the quality of the models generated by our docking method, two metrics were used, fraction of native contacts (Fnat) and RMSD of the ligand (L-RMS). Both these metrics have been implemented as standard evaluation criteria in the CAPRI experiment.54 In detail, two residues are considered to be in contact in the Fnat calculation if any of their atoms are found within ˚ . L-RMS is calculated over ligand backthe cutoff distance of 5A bone atoms after superposition of the receptor backbone atoms.
Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
In addition, a CAPRI measure is used to combine these metrics to determine prediction accuracy of docking models as described in Table 1. To be precise, the CAPRI combined evaluation criteria also takes into account interface RMSD. Our combined evaluation criteria might therefore be considered stricter than CAPRIs evaluation criteria. L-RMS evaluates the overall geometric fit between the structures of the predicted and observed complexes but may not always provide a good picture of the fit at the interface. This is especially true when the ligand is large, and there is a slight difference in orientation between the predicted and the bound complexes. Conversely, Fnat provides a good assessment of the fit at the interface, but it is sensitive to the number of atomic clashes and depends on the chosen distance cutoff parameter. We expect the combination of these two complementary measures, L-RMS and Fnat, to provide a good picture of the fit.
Results Rigid Docking
1935
Table 1. Definition of Combined Evaluation Criteria,
Adopted From CAPRI. Combined
Conditions Fnat 0.5 and L-RMS 1.0 (0.3 Fnat \ 0.5 and L-RMS 5.0) or (Fnat 0.5 and L-RMS [ 1.0) (0.1 Fnat \ 0.3 and L-RMS 10.0) or (Fnat 0.3 and L-RMS [ 5.0) Fnat \ 0.1 or L-RMS [ 10.0
High Medium Acceptable Incorrect
L-RMS is RMSD of the ligand calculated as described in the Methods section (in evaluation of model accuracy). Fnat is fraction of native contacts calculated as described in the Methods section (in evaluation of model accuracy).
which poses a much greater challenge on the scoring function to differentiate between correct and incorrect complex configurations.55,56 In general, the results demonstrate that the search method, and the scoring function used are successful in both sampling near-correct solutions and ranking these solutions high.
Redocking Bound Subunits
Our rigid-body docking procedure was first applied to rebuilding structures of complexes from their constituents, where the constituents are already in their bound form. Although these initial conformations are unrealistic, it is an important first test of the rigid search method and the scoring function separately from the ability to cope with inherent flexibility and conformational changes that occur upon binding. We tested a small set of six protein-protein complexes, which belong to several different categories: two complexes of EI, two complexes of AA, and two complexes which do not belong to these two categories. The assessment criteria described in the Methods section were calculated for all the resulting CR configurations and compared with the same measures calculated for the CR10 configurations. The former illustrates the capability of the search method, whereas the latter illustrates the performance of the energy function and its success in high ranking of near-correct solutions. Although this separation is a bit simplistic (e.g., the energy function is also used to guide the search method), it provides insight into whether the scoring function or the sampling method affect the docking results more. The results are presented in Table 2. Best L-RMS values are presented in columns 3 and 6 and best Fnat values in columns 4 and 7. Best combined evaluation criteria are presented in columns 5 and 8. From the table, it can be seen that for five of the six complexes our docking procedure found a solution with ˚ from the experimental strucL-RMS deviation lower than 1.0A ture. These complexes also have Fnat higher than 0.5 and an overall high combined evaluation criteria. These best solutions are ranked among the CR10 configurations. The only complex with a worse result is 1GHQ. A configuration with a reasonable ˚ ) was sampled for this complex but was ranked L-RMS (3.87A below the top 10 cluster representatives (ranked 35th). This ˚ 2), complex has an exceptionally small interface area (800A
Docking Unbound Subunits
The rigid-body docking procedure was also applied to unbound subunits of 66 complexes from several different categories: 23 complexes of EI, 10 complexes of AA, and 33 complexes, which do not belong to these 2 categories. We calculated measures of accuracy for the set of all the CR configurations and compared them with the same measures calculated for the CR10 configurations. The results are presented in Table 3 for six of the complexes for which redocking was performed. For the complete list of results of all 66 complexes, refer to Table I in Supporting Information. Values of best RMSD of the ligand (L-RMS) are presented in columns 3 and 6 and values of best fraction of native contacts (Fnat) are given in columns 4 and 7. Best combined evaluation criteria are presented in columns 5 and 8. ˚ with The average best L-RMS of the 66 complexes is 4.89A ˚ a standard deviation of 2.52A, but the average best L-RMS of ˚ , and the the CR10 configurations is considerably higher, 7.60A ˚ . This is standard deviation is also somewhat wider, 3.55A expected considering the simple energy function used. Similarly, the average best Fnat is 0.45 with a standard deviation of 0.18, and the average best Fnat of the CR10 configurations is 0.32 with a standard deviation of 0.20. The sampling procedure found solutions which are at least acceptable for 62 of the 66 docking cases (94%) and for 46 cases (70%) when only the CR10 configurations are considered. The results show that our rigid docking method performs well on unbound constituents. The inferior performance of the CR10 configurations compared with all the CR configurations demonstrates the limitation of the scoring function to consistently discriminate between near-correct and other docked configurations. We, therefore, expect that a better energy function will lead to sampling of more promising regions of the search space. In turn, this should improve the results of all
Journal of Computational Chemistry
DOI 10.1002/jcc
Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
1936
Table 2. Redocking Bound Subunits.
All collected configurationsa Complex PDB code
10 lowest energy configurationsb
Complex category
Best ˚) L-RMSc (A
Best Fnatd
Best combined criteriae
Best ˚) L-RMSc (A
Best Fnatd
Best combined criteriae
EI EI AA AA O O
0.78 0.82 0.69 0.66 0.67 3.87
0.92 0.96 0.92 0.92 0.98 0.83
High High High High High Medium
0.78 0.82 0.69 0.66 0.67 10.08
0,92 0.96 0.92 0.92 0.98 0.06
High High High High High Incorrect
1AY7 1MAH 1VFB 1WEJ 1GCQ 1GHQ a
Assessment criteria calculated for all the resulting cluster representative configurations. Assessment criteria calculated for the 10 lowest energy cluster representative configurations. c Best RMSD of the ligand. d Best fraction of native residue-residue contacts. e Best combination of L-RMS and Fnat as defined in Table 1. EI, enzyme-inhibitor; AA, antibody-antigen; O, others. b
results in both CR10 and CR. However, doubling the number of starting points resulted in sampling better configurations (Table IIIb compared with Table IIIa in Supporting Information) despite the much larger box. These configurations did not find their way to the CR10 because of the poor ranking of the energy function. As was also shown earlier, a realistic simulation, with unbound subunits, requires much better accuracy of the energy terms. We, therefore, conclude that searching on a larger space using our method is possible, but a larger number of random starting points and a better energy function are required.
CR configurations, and, in particular, the results of the CR10 configurations. Docking Given a Larger Search Box
Results of redocking bound subunits on the larger search space (Table IIa in Supporting Information) show that increasing the box size has no effect: Near-correct solutions were found for all the 5 examined complexes and were ranked in the top 10. The same results were obtained when the search box increase was matched by doubling the number of random starting points (Table IIb in Supporting Information); the bound docking results demonstrate that the search method and the scoring function used work well when bound subunits are redocked. Table IIIa in Supporting Information presents the results for docking unbound subunits. Comparing the L-RMS and Fnat values with those of Table 3 show that in the absolute majority of the cases searching on larger search space leads to worse
Minimization and Rescoring
Clustering the configurations found by the rigid-body docking stage results in a few hundred clusters (between 144 and 985). From these, we picked CRs of 6 complexes and minimized them with a more exact energy function. This action is expected to
Table 3. Docking Unbound Subunits.
All collected configurationsa Complex PDB code
10 lowest energy configurationsb
Complex category
Best ˚) L-RMSc (A
Best Fnatd
Best combined criteriae
Best ˚) L-RMSc (A
Best Fnatd
Best combined criteriae
EI EI AA AA O O
2.84 2.88 2.27 1.94 3.21 2.47
0.59 0.74 0.57 0.63 0.73 0.71
Medium Medium Medium Medium Medium Medium
5.05 5.08 3.69 7.36 5.01 10.35
0.58 0.38 0.57 0.44 0.68 0.17
Medium Acceptable Medium Acceptable Medium Incorrect
1AY7 1MAH 1VFB 1WEJ 1GCQ 1GHQ a
Assessment criteria calculated for all the resulting cluster representative configurations. Assessment criteria calculated for the 10 lowest energy cluster representative configurations. c Best RMSD of the ligand. d Best fraction of native residue-residue contacts. e Best combination of L-RMS and Fnat as defined in Table 1. EI, enzyme-inhibitor; AA, antibody-antigen; O, others. b
Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
remove clashes and to rank near-correct docked configurations high before selecting the 10 configurations that will be used in the next stage. For two of the six complexes, minimization improved the best L-RMS and Fnat among the CR10, whereas for three other complexes the minimization caused the best L-RMS and Fnat among the CR10 to get worse (Table IV in Supporting Information). The sixth complex (1GHQ) does not have an acceptable binding conformation among the CR10 both before and after minimization. This is probably due to 1GHQ having an extremely small interface area, which makes differentiation between correct and incorrect binding configurations much harder. Because the minimization and rescoring did not prove effective in improving ranking, we decided to forgo this step completely and move directly from the rigid stage to the flexible refinement stage. Note that this conclusion was derived based on the short minimization that we ran. We expect that running a longer minimization (until convergence) or using a different re-ranking procedure or using a better discriminating energy function will result in better ranking. Flexible Refinement
The conformational changes that proteins undergo on binding are important for association, and therefore, good binding conformations might not be sampled at all without considering flexibility. The refinement stage is able to capture atomic flexibility by sampling discrete flexible states (side chains’ rotamers and backbone fragments’ conformations). This stage is computationally intensive and, therefore, was performed only on a small set of molecules (12 with flexible side-chains and 6 with both flexible side-chains and backbone). Results Including Side-Chains Flexibility
In this section, we present the results of adding a flexible refinement stage to cope with side-chains’ flexibility. Side-chain refinement (SC-refinement) was applied to 12 different complexes: four complexes of EI, four complexes of AA, and four complexes which do not belong to these two categories. For each complex, we applied refinement to the CR10 configurations from the rigid-body stage. Results are presented in Table 4. Improvement of the results compared with the rigid-body search (difference [0.05) is shown in bold. We found better binding conformations in 7 of 12 complexes, though the ranking was harmed (compared with the rigid docking, worse conformations are ranked among the top 10 in 8 of 12 complexes and better in only 2 of 12). This is probably the result of the many more conformations that had to be ranked when flexible DOFs were also explored. The results of the SC-refinement depend on the quality of the starting configurations from the rigid-body stage. SC-refinement was carried out on six complexes for which the rigid docking supplied good starting configurations (at least one configuration ˚ ). For five of these complexes, the with L-RMS 5.1A best L-RMS values of the entire CR set improved considerably from the results of the rigid search (Table 4, column 9 vs. 3).
1937
Moreover, in three cases (1AY7, 1MLC, and 1GCQ), we found ˚ , whereas in the rigid phase, conformations with L-RMS \2A there was only one complex with such L-RMS. Moving under ˚ L-RMS, threshold has practical meaning as solutions the 2A below this threshold are considered to be good enough for accurate structural studies.27 The sixth complex (1VFB) was moved away from the correct conformation by the refinement for most of the starting configurations. The reason for this might be insufficient sampling, as the rigid docking stage tried many more options and found better results. For the other six complexes the rigid docking did not supply good starting configurations for the refinement stage. Still, the SC-refinement improved the best L-RMS of two of these complexes (1UDI and more so for 1BVN) among all the CR conformations compared to the rigid search. We restricted the number of flexible DOFs by sampling only ˚ from any atom of the side chains in the interface (within 5A other protein in the complex) of 4 amino acids types: ARG, LYS, SER, and GLN based on a work published by Koch et al.43 To evaluate the influence of restricting the search to these four amino acid types, we compared the differences between side-chains’ conformations of all types of interface amino acids in the bound and unbound structures of the 12 complexes used. There are a total of 337 interface residues: 103 are of the 4 amino acids included in the search and 234 other residues. Fiftyone percent of the first group’s residues changed by more than 408 in one of the dihedral angles upon binding,45 whereas only 32% of the second group’s residues changed conformation on binding. This indicates that restricting the flexible search space to four amino acid types to decrease the number of flexible DOFs is reasonable. We found that, at least in the dataset we used, ARG, LYS, GLN, and GLU have a high tendency to change conformation on binding. This is in agreement with the larger proportion of rotamers of these amino acid types compared with other amino acids in rotamer libraries.45 The number of GLU residues that changed conformation (15) was 2.5-fold the number of residues that did not (6), whereas the number of SER residues that changed conformation (9) was 0.4-fold the number of residues that did not (24). This means that at least in the small set of complexes we examined, searching over GLU residues instead of over SER residues is expected to provide better results. The complexes examined undergo different extents of backbone conformational changes, as can be seen in Figure 4. These backbone changes are not handled here and can be a major reason for failing to find better binding conformations. Results Including Backbone Fragments Flexibility
This section details the results of incorporating backbone flexibility in addition to side chains’ flexibility (SCBB-refinement). The SCBB-refinement was applied to six different complexes: two complexes of EI, two complexes of AA, and two complexes, which do not belong to these two categories. For each complex, we applied the refinement procedure to the CR10 configurations from the rigid-body stage using the 10 lowest energy conformations from the FF ensemble in addition to the unbound
Journal of Computational Chemistry
DOI 10.1002/jcc
Journal of Computational Chemistry
EI EI EI EI AA AA AA AA O O O O
1AY7 1BVN 1MAH 1UDI 1DQJ 1MLC 1VFB 1WEJ 1E96 1GCQ 1GHQ 1KAC
2.84 5.72 2.88 6.45 5.13 1.94 2.27 1.94 3.58 3.21 2.47 3.18
L-RMSe ˚) (A 0.59 0.29 0.74 0.35 0.49 0.66 0.57 0.63 0.64 0.73 0.71 0.49
Fnat
f
Medium Acceptable Medium Acceptable Acceptable Medium Medium Medium Medium Medium Medium Medium
Combined criteriag
Best of all conformationc
5.05 6.98 5.08 7.48 7.44 2.58 3.69 7.36 5.31 5.01 10.35 3.18
L-RMSe ˚) (A 0.58 0.14 0.38 0.20 0.48 0.66 0.57 0.44 0.64 0.68 0.17 0.45
Fnatf
Combined criteriag Medium Acceptable Acceptable Acceptable Acceptable Medium Medium Acceptable Medium Medium Incorrect Medium
Best in top 10d
b
Results of rigid-body docking. Results of flexible refinement allowing side chains flexibility. c Assessment criteria calculated for all the resulting cluster representative configurations. d Assessment criteria calculated for the 10 lowest energy cluster representative configurations. e Best RMSD of the ligand. f Best fraction of native residue-residue contacts. g Best combination of L-RMS and Fnat as defined in Table 1. EI, enzyme inhibitor; AA, antibody-antigen; O, others.
a
Category
PDB code
Complex
Rigid-body docking stagea
Table 4. Docking Unbound Subunits With Flexible Side Chains.
1.36 2.97 2.51 5.70 6.49 1.74 3.53 5.98 3.97 1.79 8.27 2.87
L-RMSe ˚) (A 0.64 0.45 0.71 0.27 0.49 0.76 0.57 0.48 0.69 0.8 0.26 0.49
Fnatf Medium Medium Medium Acceptable Acceptable Medium Medium Acceptable Medium Medium Acceptable Medium
Combined criteriag
Best of all conformationc
1.90 6.78 5.15 7.95 7.44 2.63 6.65 7.58 10.44 4.83 10.19 3.30
L-RMSe ˚) (A
0.61 0.23 0.38 0.19 0.45 0.59 0.46 0.37 0.23 0.58 0.14 0.43
Fnatf
Combined criteriag Medium Acceptable Acceptable Acceptable Acceptable Medium Acceptable Acceptable Incorrect Medium Incorrect Medium
Best in top 10d
Flexible refinement stage (only side chains)b
1938 Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
1939
Figure 4. Backbone conformational changes. Orange and light blue indicate the bound subunits in the co-crystal. Red and blue indicate the unbound subunits. Proteins treated as the ligand along the search are colored in orange or in red. Proteins treated as the receptor along the search are colored in light blue or in blue. The callouts indicate the specific residue numbers of FFs searched in this study. All the examples in this figure are classified as ‘‘rigid-docking’’ according to the protein-protein benchmark we used.53 However, we got ‘‘medium’’ predictions for nine of the complexes and ‘‘acceptable’’ predictions for three (1UDI, 1BVN, and 1DQJ) using our rigid docking method. PyMOL59 was used to produce protein models’ figures.
conformation of the FF. To restrict the search space, we sampled only one FF per complex with a length of four to eight residues. Table 5 presents the improvement of the SCBB-refinement compared with the rigid-body search (difference [0.05 is shown in bold). Table 6 gives the list of FF residues (column 3), with their RMSD between the bound and the unbound structures (column 5). We further list the lowest RMSD of each FF among the 10 conformations that are actually used, together with the unbound conformation, in the refinement stage (column 6).
We found better conformations, in terms of L-RMS, in four of six complexes compared with the rigid docking (Table 5, column 9 vs. 3). The ranking was improved in five of six complexes and was not worse for the sixth complex (Table 5, column 12 vs. 6). The results of the SCBB-refinement also depend on the quality of the starting configurations from the rigid-body stage. The SCBB-refinement was carried out on two complexes for which the rigid docking supplied potentially good quality can˚ ). For didates (at least one configuration with L-RMS 5.1A both these complexes the best L-RMS values improved over
Journal of Computational Chemistry
DOI 10.1002/jcc
0.33 0.41 0.54 0.68 0.62 0.43 6.29 4.96 7.08 1.75 4.51 3.15 Acceptable Medium Medium Medium Medium Medium 0.33 0.41 0.54 0.68 0.66 0.65
b
a
EI EI AA AA O O 1BVN 1UDI 1DQJ 1MLC 1E96 1KAC
Results of rigid-body docking. Results of flexible refinement allowing side chains and backbone flexibility. c Assessment criteria calculated for all the resulting cluster representative configurations. d Assessment criteria calculated for the 10 lowest energy cluster representative configurations. e Best RMSD of the ligand. f Best fraction of native residue-residue contacts. g Best combination of L-RMS and Fnat as defined in Table 1. EI, enzyme inhibitor; AA, antibody-antigen; O, others.
6.15 4.73 7.08 1.75 2.85 2.87 Acceptable Acceptable Acceptable Medium Medium Medium 0.14 0.20 0.48 0.66 0.64 0.45 6.98 7.48 7.44 2.58 5.31 3.18 Acceptable Acceptable Acceptable Medium Medium Medium 0.29 0.35 0.49 0.66 0.64 0.49
Category
5.72 6.45 5.13 1.94 3.58 3.18
Combined criteriag Fnatf Fnat
f
Combined criteriag Fnatf L-RMSe ˚) (A PDB code
Complex
Best of all conformationc
L-RMSe ˚) (A
Best in top 10d
Combined criteriag
L-RMSe ˚) (A
Fnatf
Combined criteriag
L-RMSe ˚) (A
Best in top 10d Best of all conformationc
Flexible refinement stage (side chains and backbone)b Rigid-body docking stagea
Table 5. Docking Unbound Subunits With Flexible Side Chains and Backbone.
Acceptable Medium Medium Medium Medium Medium
Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
1940
all the resulting CR conformations (Table 5, column 9 vs. 3). Moreover, in one case (1MLC), we found a conformation with ˚ that was ranked among the CR10 conformations L-RMS \2A (Table 5, column 12 vs. 6). For this complex, a conformation with similar L-RMS was also found in the SC-refinement but was not ranked among the top 10. For the second complex (1KAC), the best among the CR10 conformations has the same quality as the best among the CR10 of the rigid search and is only slightly better than the best among the CR10 of the SCrefinement. This is reasonable as this complex has relatively small backbone deformation between the bound and the unbound structures. The refinement stage was applied to four complexes for which the rigid docking did not supply good candidates. Still, for two of these complexes (1UDI and 1E96), the SCBB-refinement phase found conformations with better L-RMS over all the CR conformations compared with the rigid search, and there was also improvement in best L-RMS of the CR10 conformations compared with the best L-RMS of the CR10 configurations from the rigid search. Better conformations were found for 1DQJ and more noteworthy for 1BVN in the SC-refinement. Worse results in the SCBB-refinement might be explained by insufficient sampling or insufficient collection of binding conformations. The case of 1DQJ might be suffering from a bad set of FF conformations (see Table 6, column 6). ˚ L-RMS) from the Good starting configurations (within 5A rigid-body search guarantee, to large extent, finding near-correct docked conformations; otherwise, a major part of the simulation time is spent exploring uninteresting regions of the conformational space. Therefore, we expect that starting the flexible search with better configurations from the rigid stage will result in better binding conformations in the flexible stage. This can be achieved by using a better rescoring procedure, between the rigid docking and the flexible refinement stages, or by performing the same process on a larger set of starting configurations. Docking success depends also on additional backbone changes. Three complexes (1UDI, 1E96, and 1KAC) have only minor backbone conformational changes in addition to the sampled FF. In 2 of them (1UDI and 1E96), introducing backbone flexibility improved the best found binding conformation and the quality of the conformations that were ranked among the top 10. These two complexes have in their FF precalculated set a conformation with a distance to the bound structure that is shorter than the distance between the bound and the unbound ˚ for 1UDI and 1.29 vs. 2.21A ˚ for structures28 (1.38 vs. 1.77A 1E96, see columns 5 and 6 in Table 6). The docking result of 1KAC showed only a slight improvement when backbone flexibility was introduced. Presumably, this is because its FF precalculated set did not contain conformations with a distance to the bound structure that is shorter than the distance of the unbound ˚ instead of 1.05A ˚ , see structure to the bound structure (1.22A columns 5 and 6 in Table 6). The three other complexes (1BVN, 1DQJ, and 1MLC) have other backbone conformational changes between the bound and unbound states in addition to the sampled FF. For the 1MLC complex, the SCBB-refinement does not improve the best collected conformation compared with the SC-refinement but
Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
1941
Table 6. Flexible Fragments Data.
Complex PDB code
Ligand or receptora
Residuesb
Lengthc
˚) RMSD B-Ud (A
˚) Best RMSD in sete (A
Receptor Ligand Ligand Receptor Ligand Ligand
P303-P307 I19-I22 C99-C103 B54-B57 A29-A32 B83-B90
5 4 5 4 4 8
2.69 1.77 2.63 1.84 2.21 1.05
0.74 1.38 2.37 1.68 1.29 1.22
1BVN 1UDI 1DQJ 1MLC 1E96 1KAC a
Indicate whether the flexible fragment is in the ligand or in the receptor. IDs and chain IDs of the flexible fragment. c Length of the flexible fragment. d RMSD of FF between the bound and the unbound X-ray structures. RMSD calculated for backbone atoms after superimposing the structures over backbone atoms of three residues from each side of the FF. e Lowest RMSD of FF between the bound and the 10 conformations that are used (together with the unbound conformation) in the refinement stage; RMSD calculated for backbone atoms after superimposing the structures over backbone atoms of 3 residues from each side of the FF. b
does improve the ranking. The best conformation among the top ˚ thresh10 of this complex actually has L-RMS less than the 2A old. The other two complexes show a negligible improvement in ranking compared with the rigid stage. In fact, the best collected conformation of 1BVN got worse compared to the best collected conformation in the SC-refinement, even though the FF set of ˚ to this molecule has a conformation with RMSD of only 0.74A ˚ between the bound structure (compared with RMSD of 2.69A the bound and the unbound structures). The deterioration compared to the SC-refinement might be the result of the sampling trajectory spending too much time in wrong positions due to problematic conformations in the FF set. We restricted the number of flexible DOFs by sampling only side chains in the interface of four amino acids types. In the SCBB-refinement, we used only one FF per complex with a length of four to eight residues. Even with all these restrictions, we got encouraging results. Improvement in ranking can be the result of better geometric complementarity and less steric hindrance in binding conformations attained when backbone DOFs are also considered. The improvement gained compared with both the rigid search and the SC-refinement looks promising and may suggest that a similar but more comprehensive treatment of backbone flexibility will lead to further improvement. The quality of the 10 conformations used to represent backbone flexibility (with a loop prediction program based on the ISE algorithm) is important but does not guarantee better docking results. Time Performance
In principal, the runtime (T) of docking algorithms can be approximated by the number of configurations evaluated by the search algorithm (Nc) multiplied by the average time it takes to evaluate the energy of a single configuration (tc): T 5 Nc 3 tc. The two variables are not independent as both the search pattern and the number of configurations checked can vary considerably with the accuracy of the energy function. However, Nc provides a convenient measure for evaluating the time efficiency of the sampling method independent of the energy function used and the available computational power.
Having d DOFs and n options for each DOF, a brute force search will have exponential complexity, Nc 5 nd. The complexity of the Best-First search reduces to opening neighbor nodes at each step and moving to the next conformation: Ncf(n) 2d, where f(n) depends on the heuristic function. For a perfect heuristic function, the Best-First search reduces to moving in a straight line to the solution Nc 5 O(n 2d). In our runs, we examined on the average 82800 configurations out of 306 (approximately checking 1 in every 104) and the experimental complexity is therefore roughly NcO(n2 26). The calculations were performed on Linux workstations with clock speed near 2GHz with two single core processors. The expected runtime of a parallelized version is roughly the single core runtime divided by the number of central processing unit cores. The average run time per target was 375000 seconds (52 hr) on a single core, with a standard deviation of 96000 (27 hr). The run time is similar to that of other programs with atomic resolution17,52 and is expected to last less than about an hour on a cluster of 50 processor cores, which is common in the field (our method achieves near-linear behavior, at least for a few dozens of processors). However, one should note that our search space is smaller. Enlarging the search space, for example by doubling the number of options per axis, approximately quadruples the run time. Conversely, improving the energy function is expected to reduce the run time considerably. The inclusion of side chains and backbone flexibility adds many more DOFs. The total number of DOFs ranges between 12 and 28 (6 rigid DOFs 10–1 backbone DOF 16–21 side chains DOFs). Empirically. the computational complexity of docking with SC-refinement is nearly linear in that range with approximately 11000 additional tests per DOF. For docking with SCBB-refinement, the dependence on the number of DOFs is less apparent but still in the same order of magnitude.
Discussion In this work, we described and evaluated a protein-protein docking method based on the Best-First search algorithm. This search
Journal of Computational Chemistry
DOI 10.1002/jcc
1942
Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry
method loosely imitates protein-protein association. Association of proteins comprises multiple orientational nonspecific encounters. Each encounter comprises multiple collisions, during which the protein can reorient57 due to an entrapment effect in which a protein pair is surrounded and trapped by water molecules.58 The multiple collisions greatly enhance the probability of finding the correct binding orientation. The search algorithm we use brings the proteins to a nonspecific contact from a few initial random positions of the ligand protein. After approaching, we sample nearby multiple promising configurations. The heuristic function guarantees that a fast meeting between the proteins will take place in a nonspecific orientation from each of the multiple starting positions of the ligand. The heuristic function also imitates the entrapment effect by keeping the proteins close to each other after the initial approach. There are few key elements contributing to the success of our method: (1) using Best-First search algorithm with a heuristic function that minimizes the distance between the centers of mass of the two proteins in addition to minimizing the energy, (2) the ensembles of conformations representing backbone flexibility, and (3) the interleaved optimization of flexible DOFs (side-chains and backbone conformations) and rigid-body displacements in the refinement stage. All these elements essentially reflect attempts to imitate the biological process of protein-protein association. Best-First search algorithm has several features which make it a good choice as a basis for a protein-protein docking method: It does not necessarily search the whole search space, but it focuses on promising areas; with good heuristics, based on extra knowledge of the problem, the search converges rapidly to the solution; The search does not get stuck in local minima, but returns back and opens promising nodes found in the past, in cases where a direct step forward is not possible; It allows to adjust and incorporate other external algorithms and data. Failures may be caused by a variety of reasons. The most likely reason is inaccuracy of the energy function. Electrostatics is treated in a basic fashion by coulomb law with a distance-dependent dielectric. Entropy contributions, and in particular the hydrophobic desolvation effect, are not considered at all. Too small cutoffs, especially for the electrostatic term, may also lower the accuracy of the energy function. The sampling of the search space may also be too restrictive, especially in the flexible refinement stage as a result of: additional backbone conformational changes that are not sampled, too few sampling steps of rigid and/or flexible DOFs, insufficient rotameric representation or use of a set of FF conformations which is not adequate. Another possible cause for docking failures is more subtle. In both the rigid and the refinement stages, we keep a list of the 3000 lowest energy conformations sampled (in the refinement stage we keep a list of 3000 lowest energy sampled conformations for each starting configuration). Clustering is carried out only when the search ends; sometimes, especially in the refinement stage, all the 3000 collected conformations are quite similar and after clustering we are left with a very small set of clustered conformations even though many more conformations were sampled along the search. Our docking method is still in development stages, thus several restrictions and approximations were used as described ear-
lier. The usefulness of this method for predictive docking is, therefore, restricted at present.
Conclusion We present a new docking method that mimics the physical process of protein-protein association, in that it brings the proteins close to each other from multiple nonspecific orientations and forces multiple collisions that enhance the probability of finding the correct binding orientation. After the rigid stage, our method simultaneously optimizes rigid-body displacement and sidechains and backbone movements. Our method presents a new approach for dealing with backbone conformational changes by using ensembles of FF conformations that were proven to be relevant for protein-protein interfaces. It also can be used as a basis for incorporating further realistic motions of the proteins. Still, despite progress in mimicking protein-protein association and in handling backbone flexibility, protein-protein docking continues to be a significant scientific challenge.
Acknowledgment We thank Ramon Axelrod for helpful suggestions and discussions.
References 1. Mendez, R.; Leplae, R.; Lensink, M. F.; Wodak, S. J. Proteins 2005, 60, 150. 2. Lensink, M. F.; Mendez, R.; Wodak, S. J. Proteins 2007, 69, 704. 3. Ritchie, D. W. Curr Protein Pept Sci 2008, 9, 1. 4. Bonvin, A. M. Curr Opin Struc Biol 2006, 16, 194. 5. Andrusier, N.; Mashiach, E.; Nussinov, R.; Wolfson, H. J. Proteins 2008, 73, 271. 6. Gray, J. J. Curr Opin Struct Biol 2006, 16, 183. 7. Goh, C. S.; Milburn, D.; Gerstein, M. Curr Opin Struct Biol 2004, 14, 104. 8. O’Hearn, S. D.; Kusalik, A. J.; Angel, J. F. Protein Eng 2003, 16, 169. 9. Boutonnet, N. S.; Rooman, M. J.; Wodak, S. J. J Mol Biol 1995, 253, 633. 10. Jacobs, D. J.; Rader, A. J.; Kuhn, L. A.; Thorpe, M. F. Proteins 2001, 44, 150. 11. Emekli, U.; Schneidman-Duhovny, D.; Wolfson, H. J.; Nussinov, R.; Haliloglu, T. Proteins 2008, 70, 1219. 12. Hinsen, K. Proteins 1998, 33, 417. 13. Amadei, A.; Linssen, A. B.; Berendsen, H. J. Proteins 1993, 17, 412. 14. Dominguez, C.; Bonvin, A. M.; Winkler, G. S.; van Schaik, F. M.; Timmers, H. T.; Boelens, R. Structure 2004, 12, 633. 15. Grunberg, R.; Leckner, J.; Nilges, M. Structure 2004, 12, 2125. 16. Smith, G. R.; Sternberg, M. J.; Bates, P. A. J Mol Biol 2005, 347, 1077. 17. Dominguez, C.; Boelens, R.; Bonvin, A. M. J Am Chem Soc 2003, 125, 1731. 18. Zacharias, M. Proteins 2004, 54, 759. 19. Bastard, K.; Prevost, C.; Zacharias, M. Proteins 2006, 62, 956. 20. Wang, C.; Bradley, P.; Baker, D. J Mol Biol 2007, 373, 503.
Journal of Computational Chemistry
DOI 10.1002/jcc
Flexible Protein-Protein Docking
21. Sandak, B.; Nussinov, R.; Wolfson, H. J Comput Appl Biosci 1995, 11, 87. 22. Sandak, B.; Wolfson, H. J.; Nussinov, R. Proteins 1998, 32, 159. 23. Schneidman-Duhovny, D.; Inbar, Y.; Nussinov, R.; Wolfson, H. J. Proteins 2005, 60, 224. 24. Ben-Zeev, E.; Kowalsman, N.; Ben-Shimon, A.; Segal, D.; Atarot, T.; Noivirt, O.; Shay, T.; Eisenstein, M. Proteins 2005, 60, 195. 25. Uk, B.; Taufer, M.; Stricker, T.; Settanni, G.; Cavalli, A.; Caflisch, A. Cluster Computing and the Grid, IEEE International Symposium, 2003. 26. Kelly, K.; Labute, P. Available at http://www.chemcomp.com/ Journal_of_CCG/articles/astar.htm 27. Smith, G. R.; Sternberg, M. J. Curr Opin Struct Biol 2002, 12, 28. 28. Noy, E.; Tabakman, T.; Goldblum, A. Proteins 2007, 68, 702. 29. Rajamani, D.; Thiel, S.; Vajda, S.; Camacho, C. J. Proc Natl Acad Sci USA 2004, 101, 11287. 30. Fernandez-Recio, J.; Totrov, M.; Abagyan, R. Protein Sci 2002, 11, 280. 31. Pearl, J. Heuristics: Intelligent Search Strategies for Computer Problem Solving; Addison-Wesley: Boston, MA, 1984. 32. Russell, S. J.; Norvig, P. Artifical intelligence: a modern approach; Prentice-Hall: New Jersey, NJ, 2003. 33. MOE. CCGroup Inc.: Montreal, Canada. 34. Wernisch, L.; Hery, S.; Wodak, S. J. J Mol Biol 2000, 301, 713. 35. Laskowski, R. A.; Luscombe, N. M.; Swindells, M. B.; Thornton, J. M. Protein Sci 1996, 5, 2438. 36. Wangikar, P. P.; Tendulkar, A. V.; Ramya, S.; Mali, D. N.; Sarawagi, S. J Mol Biol 2003, 326, 955. 37. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Pietrokovski, S. J Mol Biol 2004, 344, 1135. 38. Lichtarge, O.; Bourne, H. R.; Cohen, F. E. Proc Natl Acad Sci USA 1996, 93, 7507. 39. Gouldson, P. R.; Dean, M. K.; Snell, C. R.; Bywater, R. P.; Gkoutos, G.; Reynolds, C. A. Protein Eng 2001, 14, 759.
1943
40. Glick, M.; Robinson, D. D.; Grant, G. H.; Richards, W. G. J Am Chem Soc 2002, 124, 2337. 41. Laurie, A. T.; Jackson, R. M. Bioinformatics 2005, 21, 1908. 42. Vakser, I. A. Biopolymers 1996, 39, 455. 43. Koch, K.; Zollner, F.; Neumann, S.; Kummert, F.; Sagerer, G. Silico Biol 2002, 2, 351. 44. Dunbrack, R. L., Jr.; Karplus, M. Nat Struct Biol 1994, 1, 334. 45. Dunbrack, R. L., Jr. Curr Opin Struct Biol 2002, 12, 431. 46. Rayan, A.; Noy, E.; Chema, D.; Levitzki, A.; Goldblum, A. Curr Med Chem 2004, 11, 675. 47. Rayan, A.; Senderowitz, H.; Goldblum, A. J Mol Graph Model 2004, 22, 319. 48. Xiang, Z.; Honig, B. J Mol Biol 2001, 311, 421. 49. Sybyl. Tripos Inc.: St. Louis, MI. 50. Wang, J.; Cieplak, P.; Kollman, P. A. J Comput Chem 2000, 21, 1049. 51. Rohl, C. A.; Strauss, C. E.; Misura, K. M.; Baker, D. Methods Enzymol 2004, 383, 66. 52. Gray, J. J.; Moughon, S.; Wang, C.; Schueler-Furman, O.; Kuhlman, B.; Rohl, C. A.; Baker, D. J Mol Biol 2003, 331, 281. 53. Mintseris, J.; Wiehe, K.; Pierce, B.; Anderson, R.; Chen, R.; Janin, J.; Weng, Z. Proteins 2005, 60, 214. 54. Janin, J.; Henrick, K.; Moult, J.; Eyck, L. T.; Sternberg, M. J.; Vajda, S.; Vakser, I.; Wodak, S. J. Proteins 2003, 52, 2. 55. Vajda, S. Proteins 2005, 60, 176. 56. Kowalsman, N.; Eisenstein, M. Bioinformatics 2006, 23, 421. 57. Selzer, T.; Schreiber, G. Proteins 2001, 45, 190. 58. Northrup, S. H.; Erickson, H. P. Proc Natl Acad Sci USA 1992, 89, 3338. 59. DeLano, W. L. The PyMOL molecular graphics system. DeLano Scientific: San Carlos, CA, 2002.
Journal of Computational Chemistry
DOI 10.1002/jcc