Flexible protein?protein docking based on Best ... - Semantic Scholar

Comment

Report 1 Downloads 111 Views

Flexible Protein-Protein Docking Based on Best-First Search Algorithm EFRAT NOY, AMIRAM GOLDBLUM

Department of Medicinal Chemistry and the David R. Bloom center for Pharmacy, School of Pharmacy, The Hebrew University of Jerusalem, Israel 91120 Received 19 June 2009; Revised 4 November 2009; Accepted 4 November 2009 DOI 10.1002/jcc.21480 Published online 19 January 2010 in Wiley InterScience (www.interscience.wiley.com).

Abstract: We developed a new high resolution protein-protein docking method based on Best-First search algorithm that loosely imitates protein-protein associations. The method operates in two stages: ﬁrst, we perform a rigid search on the unbound proteins. Second, we search alternately on rigid and ﬂexible degrees of freedom starting from multiple conﬁgurations from the rigid search. Both stages use heuristics added to the energy function, which causes the proteins to rapidly approach each other and remain adjacent, while optimizing on the energy. The method deals with backbone ﬂexibility explicitly by searching over ensembles of conformations generated before docking. We ran the rigid docking stage on 66 complexes and grouped the results into four classes according to evaluation criteria used in Critical Assessment of Predicted Interactions (CAPRI; ‘‘high,’’ ‘‘medium,’’ ‘‘acceptable,’’ and ‘‘incorrect’’). Our method found medium binding conformations for 26% of the complexes and acceptable for additional 44% among the top 10 conﬁgurations. Considering all the conﬁgurations, we found medium binding conformations for 55% of the complexes and acceptable for additional 39% of the complexes. Introducing side-chains ﬂexibility in the second stage improves the best found binding conformation but harms the ranking. However, introducing side-chains and backbone ﬂexibility improve both the best found binding conformation and the best found conformation in the top 10. Our approach is a basis for incorporating multiple ﬂexible motions into protein-protein docking and is of interest even with the current use of a simple energy function. q 2010 Wiley Periodicals, Inc.

J Comput Chem 31: 1929–1943, 2010

Key words: ﬂexible docking; Best-First search; loop modeling; backbone ﬂexibility; conformational change

Introduction Predicting the structures of protein-protein complexes has advanced greatly over the last few years. Current docking methods are effective in many cases of small conformational changes on binding, especially if biological information is available.1,2 However, in general, it is still discouragingly difﬁcult to predict near-correct complex structures, because there is still no general solution for handling ﬂexibility and energy functions are not accurate enough to discriminate consistently between near experimental and other conformations.3,4 The most complicated part of developing an efﬁcient sampling method is dealing with backbone ﬂexibility because the search space is not only extremely large but also cannot be easily separated into independent degrees of freedom (DOFs). Unlike side-chains’ DOFs, which are relatively independent of each other, changing a backbone dihedral angle can result in a displacement of large parts of the molecule. Essentially, a thorough treatment of backbone ﬂexibility is equivalent to the folding problem.

Consequently, existing docking methods limit backbone ﬂexibility to certain types of motion, identify ﬂexible regions, and predict possible conformational changes.4–6 Types of motions that have been observed on the binding of proteins include loop rearrangements, hinge movements, and other more complex motions.7 Detection of ﬂexible regions and suggestion of possible conformational changes can be derived by analyzing different experimental protein structures,8,9 by using graph theory10 or elastic network model11 or by using molecular dynamics (MD) simulations, normal modes,12 or essential dynamics.13 Various attempts were made to deal with backbone ﬂexibility in protein-protein docking.5 Some groups perform multiple rigid-body dockings of ensembles of conformations from nuclear magnetic resonance14 and MD simulations.15,16 Ensemble docking not only improves the performance in terms of the number

Additional Supporting Information may be found in the online version of this article. Correspondence to: A. Goldblum; e-mail: [email protected]

q 2010 Wiley Periodicals, Inc.

1930

Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

of near-correct solutions but also leads to increased numbers of false positives. HADDOCK17 explicitly allows backbone ﬂexibility during a MD simulated annealing reﬁnement stage in torsion angle space. Deformations along principal components are treated as additional DOFs in ATTRACT, allowing the structures to deform along soft harmonic modes to facilitate the binding process.18 A multicopy, mean-ﬁeld approach has been proposed to deal with loop rearrangements.19 The statistical weight of each loop copy is adapted during docking according to Boltzmann criterion and the receptor experiences the mean ﬁeld of the ensemble of conformations. RosettaDock20 integrates backbone ﬂexibility based on a ‘‘fold-tree’’ representation. Backbone ﬂexibility is conﬁned to loop regions or applied to the complete backbone in a minimization stage. FlexDock21–23 and MolFit24 handle hinge bending motions. When a protein is known or suspected to undergo such a motion it is subdivided into domains, the domains are docked independently and the docked domains are then assembled. In this article, we describe a new ﬂexible docking method which is based on a variant of the Best-First search algorithm. This algorithm was applied previously to deal with the folding problem25 and to enhance the traditional dynamic programming method of sequence alignment.26 Search methods differ in the way they direct the search trajectory. For example, Steepest Descent prefers to move toward the minimal energy, Monte Carlo moves randomly while simulated annealing moves toward the minimal energy with a certain probability to move in a different direction. Our method uses heuristics which loosely imitates protein-protein associations. The addition of this heuristics to the energy function affects the search trajectory by preferring to move in certain directions so as to converge to the minimal energy faster. It follows the methodology of working in phases27 to reduce the search space. At the broad phase we keep the proteins rigid, with some surface softness, searching the comparatively small space of relative protein orientations (translational and rotational) and identifying an initial set of candidate conﬁgurations. After completing the broad phase, we cluster the best models produced in that phase and select the top 10 cluster’s representatives as starting points for the second phase, where we allow movement of side-chains and optionally of the backbone. Backbone ﬂexibility is introduced using ensembles of ﬂexible fragments (FFs) that were shown to contain conformations similar to those observed in bound complexes.28 These ensembles are generated with a loop prediction program based on the iterative stochastic elimination (ISE) algorithm prior to docking. Finding conformations close to those observed in bound complexes resembles results obtained in an earlier work that dealt with side chains.29

Methods Docking Algorithm

Proetin-protein docking is extremely sensitive to small conformational changes due to the large contact area between the proteins.30 For that reason, we use a soft van der Waals (VDW) potential, which allows the proteins to penetrate into one another, to some extent, without being heavily penalized by

energy. Such potentials are less discriminative and, therefore, make the recognition of good candidate conformations highly problematic. Although this problem is already apparent in rigid searches, it is much more pronounced when dealing with ﬂexible searches. In these searches, the number of possibilities is much larger and, therefore, problematic candidate ranking is much more likely to result in bad decisions while choosing which conformation to try next. To circumvent this problem, we ﬁrst search only rigid DOFs, where, presumably, the soft energy function harms decisions less. Our docking procedure is therefore divided into the following three stages (as shown in Fig. 1A): i. Broad phase: rigid docking, supposed to locate good starting conﬁgurations for a reﬁnement stage that includes also ﬂexible DOFs. ii. Minimization and rescoring: the top few thousand conﬁgurations are clustered, and cluster representatives (CRs) are then minimized using a more exact energy function with a better estimation for solvent effects. This stage is expected to rank high near-correct docked conﬁgurations. iii. Narrow phase: ﬂexible reﬁnement of the 10 lowest energy CR (CR10) conﬁgurations found in the last stage. The reﬁnement stage is composed of interleaved steps of search on rigid DOFs and search on ﬂexible DOFs. Broad Phase: Rigid Docking

Our docking method is based on a variant of the Best-First search algorithm. Best-First search is a graph search algorithm which starts from a set of initial nodes and searches for a given goal node or one passing a given goal test. It operates by evaluating the nodes using a heuristic function and expanding the most promising node ﬁrst.31,32 The heuristic function encapsulates knowledge about the speciﬁc problem to help guide the search to more promising nodes. At start the initial nodes are inserted into an ‘‘open’’ list. In each iteration, the most promising node in the open list is expanded, and successors of this node are evaluated by the heuristic function and added to the open list. Expanding the most promising node in the open list continues as long as the open list is not empty, and the goal has not been reached. All the nodes that were visited are recorded in a ‘‘closed’’ list to avoid repetition and cycles. We shall describe rigid protein-protein docking in terms of graph searching (speciﬁcally, a grid) problem to apply Best-First search to it. The receptor (larger protein) is ﬁxed in space, whereas the ligand (smaller protein) is free to translate and rotate in a grid about the receptor. Each node in the grid represents a possible conﬁguration of the ligand with respect to the receptor. Successor nodes are adjacent nodes which differ only in one axis, i.e., neighboring ligand conﬁgurations, which differ only in one DOF. The search starts with the ligand located in one of the corners of the grid (a conﬁguration with lowest or highest values available to each DOF is in the open list). In each iteration, the ligand’s conﬁguration with the most promising heuristics is pulled out of open and its neighboring conﬁgurations (successor nodes) which were not visited yet are evaluated by the heuristic function. In addition, to overcome local minima better, we add a

Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

semi-random conﬁguration to the neighboring conﬁgurations, formed by choosing a random conﬁguration in the range between 25 steps to 15 steps in each DOF. We insert only the two neighboring conﬁgurations with best heuristics into the open list, effectively pruning the search. After 250 consecutive conﬁgurations are pulled out of open without improvement of the minimal energy found, the ligand molecule is repositioned in a random orientation inside the allowed search space and the search starts from this new conﬁguration. The search ends only after 5000 consecutive conﬁgurations are pulled out of open without improvement of the minimal energy found, meaning there are at least 20 starting conﬁgurations of the ligand during the search. Figure 1B shows a ﬂowchart of the rigid docking stage. We use a ﬁrst-order linear heuristic function [eq. (1)], which is a weighted sum of the energy of the conﬁguration and the distance between the centers of mass of the two proteins in the complex. Including the distance term in the heuristic function causes the ligand to approach the receptor rapidly and stay adjacent to the receptor along the search. The constant used in the weighted sum is c 5 1000, which we found empirically to balance well between the protein-protein conformational energy and the distance between the centers of mass (i.e., locating the ligand close to the receptor, without too many penetrations). At large distances, the distance term is dominant and the search algorithm will try to minimize the cost by bringing the molecules closer. At small distances, when penetrations occur, the energy term provides the main contribution so the search will move the molecules away. Energies of the conﬁgurations are calculated as described below. Heuristic ¼ Energy þ c 3 distance

(1)

In principle, one could consider a non-linear heuristic function. However, because we use an energy function with a linear VDW term at small distance, as explained later, a linear heuristic function works well. Figure 2 illustrates the search pattern of our Best-First search algorithm using the heuristic function described earlier. In the ﬁgure, one can see that the ligand rapidly approaches the receptor and remains adjacent to it for a signiﬁcant part of the search. Minimization and Rescoring

After clustering the rigid docking results, we are left with a few hundred docked conﬁgurations. These conﬁgurations may be poorly ranked because we use a fairly simple scoring function and a soft VDW potential during the rigid search. To remove clashes and improve the ranking of near-correct docked conﬁgurations, cluster-representative (CR) conﬁgurations generated by the rigid docking stage were minimized in molecular operating environment.33 Each CR was minimized gradually on larger parts of the proteins: (1) only hydrogens, (2) sidechains, and (3) all atoms. To keep the runtimes reasonable we decided to limit the number of minimization steps applied. Each gradual minimization contains 20 steps of steepest descent and 30 steps of conjugate gradient. The convergence criterion used between consecutive minimization steps is root mean square

1931

˚ . In practice, this limitation is almost deviation (RMSD) 0.1A never met. We improved the precision of the energy function in this stage by using longer cutoffs for non-bonding interactions ˚ ), a normal VDW term and the GB/SA continuum (18.0–20.0A model for describing better solvent effects. Narrow Phase: Flexible Reﬁnement

The ﬂexible reﬁnement is applied to the CR10 conﬁgurations from the former stage. This stage is composed of interleaved rigid and ﬂexible cycles. The rigid cycles explore rigid DOFs (translation and rotation) using the Best-First search algorithm. The ﬂexible cycles explore ﬂexible DOFs (side-chains’ rotamers and conformational ensembles of FFs, see later) using an algorithm inspired by a method developed in Wodak’s laboratory.34 Searching rigid DOFs and ﬂexible DOFs alternately allows the proteins to adapt their conformations to ﬁt each other while still enabling fast spatial movement. This is in contrast to searching over all the DOFs simultaneously, which tends to linger too long at incorrect regions of the conformational search space. The ﬂexible reﬁnement uses an open list for both the rigid cycle and the ﬂexible cycle. The rigid cycle of the reﬁnement is similar to the rigid docking. A starting conﬁguration from the CR10 of the former stage is inserted to the open list, and conformations are pulled out of the open list until there is no improvement in the minimal energy for 200 consecutive pulls. Then a ﬂexible cycle starts and repeats the following ‘‘Wodak-like’’ procedure: i. One of the 10 best ranked conformations of the ligand sampled until this point is picked randomly. ii. One of the ﬂexible DOFs is selected randomly. iii. The conformational space of the selected ﬂexible DOF is explored and the best conformations (those with the lowest energy) are retained and inserted back to the open list. For side-chains, the rotameric space is explored, whereas for FFs, the conformational ensemble is explored. The Wodak-like cycle (searching one DOF) is repeated until the energy does not change for NFD steps, where NFD equals the number of ﬂexible DOFs. When this is reached, another rigid cycle starts and so forth. The entire ﬂexible reﬁnement ﬁnishes only after 500 consecutive docked conformations are pulled out of the open list with no improvement of the minimal energy. The minimal number of ﬂexible cycles between rigid cycles is therefore two but usually many more ﬂexible cycles are conducted. A schematic description of the ﬂexible reﬁnement procedure is presented in Figure 3. Search Space

A global docking simulation of all translational, rotational, side chains, and backbone conformational space is computationally demanding. We, therefore, limited the number of DOFs which were explicitly searched and their allowed values. Rigid-body DOFs

We limit the search space to a discrete subset of values for each of the 6 DOFs. The sampling resolution must be sufﬁcient to guarantee a near hit. Assuming a sampling resolution of Dx in

Journal of Computational Chemistry

DOI 10.1002/jcc

1932

Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

Figure 1. (A) Schematic presentation of the ﬂexible docking procedure stages: (1) rigid docking, (2) minimization and rescoring, (3) ﬂexible reﬁnement. The rigid docking stage is described in detail in (B) and the ﬂexible reﬁnement stage is illustrated in Figure 3. AMBER forceﬁeld is used as a scoring function along the search. The search is performed only on a restricted search space deﬁned on the basis of experimental information. (B) Flowchart of the rigid docking stage. The numbers in the ‘‘failed improve’’ cells count the iterations with no improvements in the minimal energy through that cell (see accompanying text). conﬁg 5 conﬁguration.

Figure 2. Illustration of the search pattern produced by our BestFirst docking algorithm using the heuristic function described in the text. The set of possible positions of the ligand center (the ‘‘search space’’) is given by white grid points. The search starts at the yellow ball and each point it visits is colored in orange. The green ball is drawn at the nearest point on the grid to the experimental docking position. The receptor is drawn in blue.

Figure 3. Schematic description of the ﬂexible reﬁnement stage. This stage is composed of interleaved rigid and ﬂexible cycles. The numbers in the ‘‘failed improve’’ cells count the iterations with no improvements in the minimal energy through that cell (see accompanying text).

Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

each of the translation axes and Da in each of the rotation axes, the overall expected error will be: (3Dx2 1 3r2Da2), where r is the radius of the molecule. To obtain results, which are accurate ˚ , it is sufﬁcient to take translation steps of size 0.8A ˚ up to 1.0A and rotation steps of size 2.38 (equivalent to translation steps of ˚ , assuming average ligand diameter of 40A ˚ ). All along this 0.8A ˚ in study, we used steps with slightly better resolution: 0.7A translational space and 28 in rotational space. The discrete grid is placed, so that the exact bound state position of the ligand lies exactly between grid points and cannot be sampled by the ˚ in each translational DOF search (i.e., the bound state lies 0.35A and 18 in each rotational DOF from the nearest grid points). At this stage of development, we further limit the size of the search space to a 6 dimensional box around the binding site, relying on an external application that will identify the approximated binding interface in real cases, when the structure of a complex is not known. Other programs restricted the search to regions around the binding site in their development stage (e.g., ICM-DISCO,30 Internal Coordinate Mechanics - Docking and Interface Side Chain Optimization). A binding site may be identiﬁed using methods based on structure,35–37 sequence conservation,38,39 and interaction energies.40,41 Other docking methods can also be used to allocate promising starting points (e.g. low resolution docking42). In fact, experimental data continues to play a very important role in guiding docking calculations, narrowing the search space, and ﬁltering docking solutions in the CAPRI experiment.2 A typical example is HADDOCK,17 one of the currently most successful docking programs, that requires substantial external information about the binding site. The translational search space is restricted to a box around the ˚ (610.15A ˚ from known receptor binding site of side length 20.3A each side). The rotational search space is limited to 608 (from 2308 to 308) around each Cartesian coordinate from the expected binding interface. Given the step sizes earlier, each rigid DOF has 30 possible values and the total number of possible conﬁgurations for locating the ligand relative to the receptor when the proteins are treated as rigid entities is 306 5 7.29 3 108. In principle, our method could be applied to the entire relevant search space and not constrained to the box described earlier. This restriction was imposed mainly because the energy function we use is not accurate enough for docking unbound subunits and causes a ranking problem. In larger search space, the algorithm samples much more states that should be ranked. This in turn decreases the chance of ranking good states high enough. To check the earlier assumptions, we ran the docking algorithm on a larger search space for several complexes (with both bound and unbound subunits). The search space was enlarged by doubling ˚ for each translational axis the range of each DOF to a total of 40.6A and 1208 (from 2608 to 608) for each rotational axis. We also checked the effect of doubling the number of random starting points (from 20 to 40). Note that the search space itself is larger by a factor of 2.6 On future development, we plan on integrating our method with a better energy function that will allow global search. Side-Chains DOFs

An explicit search of all interface side chains, both in the receptor and in the ligand, is computationally expensive. In addition,

1933

most side-chain conformations do not change considerably on complex formation.43 We, therefore, limited the conformational ˚ of any search to side chains in the interface which are within 5A atom in the other protein of the complex (measured when the proteins are bound). This is a simpliﬁcation: in real cases the bound structure of the complex is unknown and the choice of interface residues can only be based on estimated binding sites or computed in each iteration. This simpliﬁcation can cause near-correct models to be artiﬁcially preferred over more distant solutions, because they have a better chance to be reﬁned in the ﬂexible reﬁnement ˚ interface is relatively large considering stage. In practice, a 5A the size of our search space, so this problem is of a lesser inﬂuence. To quantify the effect, we checked the interfaces of the molecules at the worst starting points used in the ﬂexible search phase. We found that on average 79% of the residues that should be considered, were actually considered in the search. Another restriction we imposed is to search only over four types of amino acids (ARG, SER, GLN, and LYS), that were found to change conformation upon binding more than other types of amino acids.43 This restriction reduced the number of searched side chains by 70%, from 37 to 11 on average. Each interface side chain that is explored, in the ligand or in the receptor, adds a DOF to the search space. The allowed values of this DOF are the different rotameric conformations from Dunbrack’s backbone-dependent rotamer library.44 A backbonedependent rotamer library takes into account population variations that depend on the local backbone conformation and is, therefore, preferred over a backbone-independent rotamer library.45 As only a small part of the side chains change their conformation between the unbound and the bound states,43 we added the conformations of the side chains in the unbound monomers to the set of possible side chain’s conformations. Backbone DOFs

Partial backbone ﬂexibility in protein-protein interfaces was introduced in this study by conformational search of FFs prior to docking using ISE.28 For clarity, we provide a brief description of the FFs selection and search procedure detailed there. Selection of FFs begins by superimposing backbone atoms of bound and unbound structures of a protein. We examine a subset of fragments for which the distance between backbone atoms of ˚ in the bound structure and the unbound structure is at least 1A four to eight consecutive residues. We deﬁne potential FF as fragments in the subset for which at least one of the residues is in the complex interface. To ﬁnd actual FFs, we superimpose the bound and unbound structures over backbone atoms of three stem residues on each side of the potential FF. Any potential FF with RMSD larger ˚ is deﬁned as a FF. In the reﬁnement stage of this than 1.0A docking study, we addressed the backbone ﬂexibility in a limited fashion by treating in each complex only one FF. FFs search in the unbound structure starts with a conformational search over the FF dihedral angles based on the ISE algorithm.46,47 Then side chains are added to the lowest energy conformations of the fragment using the program SCAP,48 and these conformations are minimized in Sybyl.49 Eventually, the minimized conformations are clustered.

Journal of Computational Chemistry

DOI 10.1002/jcc

Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

1934

The selection of FFs is a simpliﬁcation: in real cases, the bound structure of the complex is unknown and the detection of FFs can be derived by analyzing different experimental protein structures or by computational methods as described in the Introduction. A new DOF is added to the search space for each FF, whether in the ligand or in the receptor. The allowed values of this DOF are the 10 best ranked conformations resulting from the FF search in addition to the unbound conformation. Scoring Function

All atoms are represented explicitly during the docking experiments. AMBER forceﬁeld50 has been used with the all atoms model. Bonding interactions are not considered in the rigid docking. Intermolecular nonbonding interactions account for VDW and electrostatic interactions. VDW interactions are basically represented by a Lennard-Jones 6–12 potential, whereas electrostatics is computed by a Coulombic term, with a distance dependent dielectric of 4 3 r. Nonbonded interactions are turned ˚ for VDW interactions and off gradually between 8 and 9A ˚ for electrostatic interactions. between 12 and 13A Because of the extreme sensitivity of VDW potential to small conformational changes, we use a soft VDW potential, which allows the proteins to penetrate into one another, to some extent, without being heavily punished. Soft potential is achieved by a linear extrapolation of the short-range repulsive score of the Lennard-Jones 6-12 potential. This soft VDW potential, adopted from Baker’s laboratory,51,52 is described in eqs. (2) and (3), where the short range repulsive score is extrapolated linearly below rij 5 0.6 3 rij and rij is the sum of the atomic radii of atoms i and j. The energy well depth, eij, and the atomic radii sum, rij, are taken from the AMBER parameters set. " 6 # N1 X N X rij 12 rij 4eij 2 3 (2) if rij > 0:6 3 rij r rij ij i¼1 j¼iþ1 rij eij 8759:2 3 þ 5672:0 else rij j¼iþ1

N1 X N X i¼1

(3)

In the ﬂexible reﬁnement stage, intramolecular interactions are also calculated. These interactions include deformations of bond lengths, bond angles, and torsion angles in addition to VDW and electrostatic interactions. The functional form of the Amber forceﬁeld used in this study is presented in eq. (4). We use soft VDW potential along the search and reevaluate the energies with a normal VDW term after the rigid docking and the ﬂexible reﬁnement are completed. X1

þ

2

N1 X N X 1 X Vn ½1 þ cosðnx cÞ þ 2 i¼1 j¼iþ1 dihedrals ( " ) 6 # 12 rij rij qi 3 qj þ 3 4eij 2 3 rij rij 4pe0 rij

The rigid search produces many candidate states in which the ranking of near-correct docked conﬁgurations is usually not high enough. This is due to the inaccuracy of the energy function and the fact that we disregard proteins’ ﬂexibility. We cluster binding conﬁgurations to improve the rank of near-correct docked conﬁgurations, resulting from the rigid search. We used partitional agglomerative (‘‘bottom-up’’) clustering on ligand conﬁgurations, using a backbone RMSD criterion, on a sorted (by energy) list of ligand conﬁgurations. Ligand conﬁgurations with a structure similar to the lowest energy conﬁgura˚ RMSD, are discarded ﬁrst. The next remaining tion, within 1A lowest energy conﬁguration is selected as the representative for a second cluster of conﬁgurations and conﬁgurations that are close to it are discarded. This process is continued iteratively until no more conﬁgurations are discarded. The number of representative conﬁgurations that were retained after the rigid search for each of the ensembles varied with protein and was between 144 and 985. The ﬂexible search described earlier is performed for each of the CR10 conﬁgurations from the rigid stage. For technical reasons, the results of each of the ﬂexible searches were clustered separately. Later, these clustered conformations were combined and clustered again in a similar manner. The number of representative conﬁgurations that were retained after the ﬂexible reﬁnement for each of the ensembles varied between the complexes and ranged from 149 to 1940. Dataset

The docking method was evaluated on a recently published nonredundant protein-protein docking benchmark.53 From this benchmark, targets were selected if both the bound complex and the unbound components of the target were available. We used a set of 66 unbound-unbound complexes to evaluate the results of the rigid docking, of which 23 are enzyme-inhibitor (EI) complexes, 10 are antibody-antigen (AA) complexes, and 33 are ‘‘other’’ complexes. For six of these targets, we also examined the rigid docking on bound-bound complexes. We used a set of 12 complexes to evaluate the results of ﬂexible docking with side-chains DOFs. This set covers all the different categories of protein-protein complexes: four EI, four AA and four others. For six of these complexes, the ﬂexible search was conducted also on a ﬂexible backbone fragment. We examined our ﬂexible docking approach on a limited set of complexes due to the long run time. Evaluation of Model Accuracy

X1 Vðr Þ ¼ kb ðl lo Þ þ ka ðh h0 Þ2 2 2 bonds angles N

Clustering

ð4Þ

To evaluate the quality of the models generated by our docking method, two metrics were used, fraction of native contacts (Fnat) and RMSD of the ligand (L-RMS). Both these metrics have been implemented as standard evaluation criteria in the CAPRI experiment.54 In detail, two residues are considered to be in contact in the Fnat calculation if any of their atoms are found within ˚ . L-RMS is calculated over ligand backthe cutoff distance of 5A bone atoms after superposition of the receptor backbone atoms.

Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

In addition, a CAPRI measure is used to combine these metrics to determine prediction accuracy of docking models as described in Table 1. To be precise, the CAPRI combined evaluation criteria also takes into account interface RMSD. Our combined evaluation criteria might therefore be considered stricter than CAPRIs evaluation criteria. L-RMS evaluates the overall geometric ﬁt between the structures of the predicted and observed complexes but may not always provide a good picture of the ﬁt at the interface. This is especially true when the ligand is large, and there is a slight difference in orientation between the predicted and the bound complexes. Conversely, Fnat provides a good assessment of the ﬁt at the interface, but it is sensitive to the number of atomic clashes and depends on the chosen distance cutoff parameter. We expect the combination of these two complementary measures, L-RMS and Fnat, to provide a good picture of the ﬁt.

Results Rigid Docking

1935

Table 1. Deﬁnition of Combined Evaluation Criteria,

Adopted From CAPRI. Combined

Conditions Fnat 0.5 and L-RMS 1.0 (0.3 Fnat \ 0.5 and L-RMS 5.0) or (Fnat 0.5 and L-RMS [ 1.0) (0.1 Fnat \ 0.3 and L-RMS 10.0) or (Fnat 0.3 and L-RMS [ 5.0) Fnat \ 0.1 or L-RMS [ 10.0

High Medium Acceptable Incorrect

L-RMS is RMSD of the ligand calculated as described in the Methods section (in evaluation of model accuracy). Fnat is fraction of native contacts calculated as described in the Methods section (in evaluation of model accuracy).

which poses a much greater challenge on the scoring function to differentiate between correct and incorrect complex conﬁgurations.55,56 In general, the results demonstrate that the search method, and the scoring function used are successful in both sampling near-correct solutions and ranking these solutions high.

Redocking Bound Subunits

Our rigid-body docking procedure was ﬁrst applied to rebuilding structures of complexes from their constituents, where the constituents are already in their bound form. Although these initial conformations are unrealistic, it is an important ﬁrst test of the rigid search method and the scoring function separately from the ability to cope with inherent ﬂexibility and conformational changes that occur upon binding. We tested a small set of six protein-protein complexes, which belong to several different categories: two complexes of EI, two complexes of AA, and two complexes which do not belong to these two categories. The assessment criteria described in the Methods section were calculated for all the resulting CR conﬁgurations and compared with the same measures calculated for the CR10 conﬁgurations. The former illustrates the capability of the search method, whereas the latter illustrates the performance of the energy function and its success in high ranking of near-correct solutions. Although this separation is a bit simplistic (e.g., the energy function is also used to guide the search method), it provides insight into whether the scoring function or the sampling method affect the docking results more. The results are presented in Table 2. Best L-RMS values are presented in columns 3 and 6 and best Fnat values in columns 4 and 7. Best combined evaluation criteria are presented in columns 5 and 8. From the table, it can be seen that for ﬁve of the six complexes our docking procedure found a solution with ˚ from the experimental strucL-RMS deviation lower than 1.0A ture. These complexes also have Fnat higher than 0.5 and an overall high combined evaluation criteria. These best solutions are ranked among the CR10 conﬁgurations. The only complex with a worse result is 1GHQ. A conﬁguration with a reasonable ˚ ) was sampled for this complex but was ranked L-RMS (3.87A below the top 10 cluster representatives (ranked 35th). This ˚ 2), complex has an exceptionally small interface area (800A

Docking Unbound Subunits

The rigid-body docking procedure was also applied to unbound subunits of 66 complexes from several different categories: 23 complexes of EI, 10 complexes of AA, and 33 complexes, which do not belong to these 2 categories. We calculated measures of accuracy for the set of all the CR conﬁgurations and compared them with the same measures calculated for the CR10 conﬁgurations. The results are presented in Table 3 for six of the complexes for which redocking was performed. For the complete list of results of all 66 complexes, refer to Table I in Supporting Information. Values of best RMSD of the ligand (L-RMS) are presented in columns 3 and 6 and values of best fraction of native contacts (Fnat) are given in columns 4 and 7. Best combined evaluation criteria are presented in columns 5 and 8. ˚ with The average best L-RMS of the 66 complexes is 4.89A ˚ a standard deviation of 2.52A, but the average best L-RMS of ˚ , and the the CR10 conﬁgurations is considerably higher, 7.60A ˚ . This is standard deviation is also somewhat wider, 3.55A expected considering the simple energy function used. Similarly, the average best Fnat is 0.45 with a standard deviation of 0.18, and the average best Fnat of the CR10 conﬁgurations is 0.32 with a standard deviation of 0.20. The sampling procedure found solutions which are at least acceptable for 62 of the 66 docking cases (94%) and for 46 cases (70%) when only the CR10 conﬁgurations are considered. The results show that our rigid docking method performs well on unbound constituents. The inferior performance of the CR10 conﬁgurations compared with all the CR conﬁgurations demonstrates the limitation of the scoring function to consistently discriminate between near-correct and other docked conﬁgurations. We, therefore, expect that a better energy function will lead to sampling of more promising regions of the search space. In turn, this should improve the results of all

Journal of Computational Chemistry

DOI 10.1002/jcc

Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

1936

Table 2. Redocking Bound Subunits.

All collected conﬁgurationsa Complex PDB code

10 lowest energy conﬁgurationsb

Complex category

Best ˚) L-RMSc (A

Best Fnatd

Best combined criteriae

Best ˚) L-RMSc (A

Best Fnatd

Best combined criteriae

EI EI AA AA O O

0.78 0.82 0.69 0.66 0.67 3.87

0.92 0.96 0.92 0.92 0.98 0.83

High High High High High Medium

0.78 0.82 0.69 0.66 0.67 10.08

0,92 0.96 0.92 0.92 0.98 0.06

High High High High High Incorrect

1AY7 1MAH 1VFB 1WEJ 1GCQ 1GHQ a

Assessment criteria calculated for all the resulting cluster representative conﬁgurations. Assessment criteria calculated for the 10 lowest energy cluster representative conﬁgurations. c Best RMSD of the ligand. d Best fraction of native residue-residue contacts. e Best combination of L-RMS and Fnat as deﬁned in Table 1. EI, enzyme-inhibitor; AA, antibody-antigen; O, others. b

results in both CR10 and CR. However, doubling the number of starting points resulted in sampling better conﬁgurations (Table IIIb compared with Table IIIa in Supporting Information) despite the much larger box. These conﬁgurations did not ﬁnd their way to the CR10 because of the poor ranking of the energy function. As was also shown earlier, a realistic simulation, with unbound subunits, requires much better accuracy of the energy terms. We, therefore, conclude that searching on a larger space using our method is possible, but a larger number of random starting points and a better energy function are required.

CR conﬁgurations, and, in particular, the results of the CR10 conﬁgurations. Docking Given a Larger Search Box

Results of redocking bound subunits on the larger search space (Table IIa in Supporting Information) show that increasing the box size has no effect: Near-correct solutions were found for all the 5 examined complexes and were ranked in the top 10. The same results were obtained when the search box increase was matched by doubling the number of random starting points (Table IIb in Supporting Information); the bound docking results demonstrate that the search method and the scoring function used work well when bound subunits are redocked. Table IIIa in Supporting Information presents the results for docking unbound subunits. Comparing the L-RMS and Fnat values with those of Table 3 show that in the absolute majority of the cases searching on larger search space leads to worse

Minimization and Rescoring

Clustering the conﬁgurations found by the rigid-body docking stage results in a few hundred clusters (between 144 and 985). From these, we picked CRs of 6 complexes and minimized them with a more exact energy function. This action is expected to

Table 3. Docking Unbound Subunits.

All collected conﬁgurationsa Complex PDB code

10 lowest energy conﬁgurationsb

Complex category

Best ˚) L-RMSc (A

Best Fnatd

Best combined criteriae

Best ˚) L-RMSc (A

Best Fnatd

Best combined criteriae

EI EI AA AA O O

2.84 2.88 2.27 1.94 3.21 2.47

0.59 0.74 0.57 0.63 0.73 0.71

Medium Medium Medium Medium Medium Medium

5.05 5.08 3.69 7.36 5.01 10.35

0.58 0.38 0.57 0.44 0.68 0.17

Medium Acceptable Medium Acceptable Medium Incorrect

1AY7 1MAH 1VFB 1WEJ 1GCQ 1GHQ a

Assessment criteria calculated for all the resulting cluster representative conﬁgurations. Assessment criteria calculated for the 10 lowest energy cluster representative conﬁgurations. c Best RMSD of the ligand. d Best fraction of native residue-residue contacts. e Best combination of L-RMS and Fnat as deﬁned in Table 1. EI, enzyme-inhibitor; AA, antibody-antigen; O, others. b

Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

remove clashes and to rank near-correct docked conﬁgurations high before selecting the 10 conﬁgurations that will be used in the next stage. For two of the six complexes, minimization improved the best L-RMS and Fnat among the CR10, whereas for three other complexes the minimization caused the best L-RMS and Fnat among the CR10 to get worse (Table IV in Supporting Information). The sixth complex (1GHQ) does not have an acceptable binding conformation among the CR10 both before and after minimization. This is probably due to 1GHQ having an extremely small interface area, which makes differentiation between correct and incorrect binding conﬁgurations much harder. Because the minimization and rescoring did not prove effective in improving ranking, we decided to forgo this step completely and move directly from the rigid stage to the ﬂexible reﬁnement stage. Note that this conclusion was derived based on the short minimization that we ran. We expect that running a longer minimization (until convergence) or using a different re-ranking procedure or using a better discriminating energy function will result in better ranking. Flexible Reﬁnement

The conformational changes that proteins undergo on binding are important for association, and therefore, good binding conformations might not be sampled at all without considering ﬂexibility. The reﬁnement stage is able to capture atomic ﬂexibility by sampling discrete ﬂexible states (side chains’ rotamers and backbone fragments’ conformations). This stage is computationally intensive and, therefore, was performed only on a small set of molecules (12 with ﬂexible side-chains and 6 with both ﬂexible side-chains and backbone). Results Including Side-Chains Flexibility

In this section, we present the results of adding a ﬂexible reﬁnement stage to cope with side-chains’ ﬂexibility. Side-chain reﬁnement (SC-reﬁnement) was applied to 12 different complexes: four complexes of EI, four complexes of AA, and four complexes which do not belong to these two categories. For each complex, we applied reﬁnement to the CR10 conﬁgurations from the rigid-body stage. Results are presented in Table 4. Improvement of the results compared with the rigid-body search (difference [0.05) is shown in bold. We found better binding conformations in 7 of 12 complexes, though the ranking was harmed (compared with the rigid docking, worse conformations are ranked among the top 10 in 8 of 12 complexes and better in only 2 of 12). This is probably the result of the many more conformations that had to be ranked when ﬂexible DOFs were also explored. The results of the SC-reﬁnement depend on the quality of the starting conﬁgurations from the rigid-body stage. SC-reﬁnement was carried out on six complexes for which the rigid docking supplied good starting conﬁgurations (at least one conﬁguration ˚ ). For ﬁve of these complexes, the with L-RMS 5.1A best L-RMS values of the entire CR set improved considerably from the results of the rigid search (Table 4, column 9 vs. 3).

1937

Moreover, in three cases (1AY7, 1MLC, and 1GCQ), we found ˚ , whereas in the rigid phase, conformations with L-RMS \2A there was only one complex with such L-RMS. Moving under ˚ L-RMS, threshold has practical meaning as solutions the 2A below this threshold are considered to be good enough for accurate structural studies.27 The sixth complex (1VFB) was moved away from the correct conformation by the reﬁnement for most of the starting conﬁgurations. The reason for this might be insufﬁcient sampling, as the rigid docking stage tried many more options and found better results. For the other six complexes the rigid docking did not supply good starting conﬁgurations for the reﬁnement stage. Still, the SC-reﬁnement improved the best L-RMS of two of these complexes (1UDI and more so for 1BVN) among all the CR conformations compared to the rigid search. We restricted the number of ﬂexible DOFs by sampling only ˚ from any atom of the side chains in the interface (within 5A other protein in the complex) of 4 amino acids types: ARG, LYS, SER, and GLN based on a work published by Koch et al.43 To evaluate the inﬂuence of restricting the search to these four amino acid types, we compared the differences between side-chains’ conformations of all types of interface amino acids in the bound and unbound structures of the 12 complexes used. There are a total of 337 interface residues: 103 are of the 4 amino acids included in the search and 234 other residues. Fiftyone percent of the ﬁrst group’s residues changed by more than 408 in one of the dihedral angles upon binding,45 whereas only 32% of the second group’s residues changed conformation on binding. This indicates that restricting the ﬂexible search space to four amino acid types to decrease the number of ﬂexible DOFs is reasonable. We found that, at least in the dataset we used, ARG, LYS, GLN, and GLU have a high tendency to change conformation on binding. This is in agreement with the larger proportion of rotamers of these amino acid types compared with other amino acids in rotamer libraries.45 The number of GLU residues that changed conformation (15) was 2.5-fold the number of residues that did not (6), whereas the number of SER residues that changed conformation (9) was 0.4-fold the number of residues that did not (24). This means that at least in the small set of complexes we examined, searching over GLU residues instead of over SER residues is expected to provide better results. The complexes examined undergo different extents of backbone conformational changes, as can be seen in Figure 4. These backbone changes are not handled here and can be a major reason for failing to ﬁnd better binding conformations. Results Including Backbone Fragments Flexibility

This section details the results of incorporating backbone ﬂexibility in addition to side chains’ ﬂexibility (SCBB-reﬁnement). The SCBB-reﬁnement was applied to six different complexes: two complexes of EI, two complexes of AA, and two complexes, which do not belong to these two categories. For each complex, we applied the reﬁnement procedure to the CR10 conﬁgurations from the rigid-body stage using the 10 lowest energy conformations from the FF ensemble in addition to the unbound

Journal of Computational Chemistry

DOI 10.1002/jcc

Journal of Computational Chemistry

EI EI EI EI AA AA AA AA O O O O

1AY7 1BVN 1MAH 1UDI 1DQJ 1MLC 1VFB 1WEJ 1E96 1GCQ 1GHQ 1KAC

2.84 5.72 2.88 6.45 5.13 1.94 2.27 1.94 3.58 3.21 2.47 3.18

L-RMSe ˚) (A 0.59 0.29 0.74 0.35 0.49 0.66 0.57 0.63 0.64 0.73 0.71 0.49

Fnat

f

Medium Acceptable Medium Acceptable Acceptable Medium Medium Medium Medium Medium Medium Medium

Combined criteriag

Best of all conformationc

5.05 6.98 5.08 7.48 7.44 2.58 3.69 7.36 5.31 5.01 10.35 3.18

L-RMSe ˚) (A 0.58 0.14 0.38 0.20 0.48 0.66 0.57 0.44 0.64 0.68 0.17 0.45

Fnatf

Combined criteriag Medium Acceptable Acceptable Acceptable Acceptable Medium Medium Acceptable Medium Medium Incorrect Medium

Best in top 10d

b

Results of rigid-body docking. Results of ﬂexible reﬁnement allowing side chains ﬂexibility. c Assessment criteria calculated for all the resulting cluster representative conﬁgurations. d Assessment criteria calculated for the 10 lowest energy cluster representative conﬁgurations. e Best RMSD of the ligand. f Best fraction of native residue-residue contacts. g Best combination of L-RMS and Fnat as deﬁned in Table 1. EI, enzyme inhibitor; AA, antibody-antigen; O, others.

a

Category

PDB code

Complex

Rigid-body docking stagea

Table 4. Docking Unbound Subunits With Flexible Side Chains.

1.36 2.97 2.51 5.70 6.49 1.74 3.53 5.98 3.97 1.79 8.27 2.87

L-RMSe ˚) (A 0.64 0.45 0.71 0.27 0.49 0.76 0.57 0.48 0.69 0.8 0.26 0.49

Fnatf Medium Medium Medium Acceptable Acceptable Medium Medium Acceptable Medium Medium Acceptable Medium

Combined criteriag

Best of all conformationc

1.90 6.78 5.15 7.95 7.44 2.63 6.65 7.58 10.44 4.83 10.19 3.30

L-RMSe ˚) (A

0.61 0.23 0.38 0.19 0.45 0.59 0.46 0.37 0.23 0.58 0.14 0.43

Fnatf

Combined criteriag Medium Acceptable Acceptable Acceptable Acceptable Medium Acceptable Acceptable Incorrect Medium Incorrect Medium

Best in top 10d

Flexible reﬁnement stage (only side chains)b

1938 Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

1939

Figure 4. Backbone conformational changes. Orange and light blue indicate the bound subunits in the co-crystal. Red and blue indicate the unbound subunits. Proteins treated as the ligand along the search are colored in orange or in red. Proteins treated as the receptor along the search are colored in light blue or in blue. The callouts indicate the speciﬁc residue numbers of FFs searched in this study. All the examples in this ﬁgure are classiﬁed as ‘‘rigid-docking’’ according to the protein-protein benchmark we used.53 However, we got ‘‘medium’’ predictions for nine of the complexes and ‘‘acceptable’’ predictions for three (1UDI, 1BVN, and 1DQJ) using our rigid docking method. PyMOL59 was used to produce protein models’ ﬁgures.

conformation of the FF. To restrict the search space, we sampled only one FF per complex with a length of four to eight residues. Table 5 presents the improvement of the SCBB-reﬁnement compared with the rigid-body search (difference [0.05 is shown in bold). Table 6 gives the list of FF residues (column 3), with their RMSD between the bound and the unbound structures (column 5). We further list the lowest RMSD of each FF among the 10 conformations that are actually used, together with the unbound conformation, in the reﬁnement stage (column 6).

We found better conformations, in terms of L-RMS, in four of six complexes compared with the rigid docking (Table 5, column 9 vs. 3). The ranking was improved in ﬁve of six complexes and was not worse for the sixth complex (Table 5, column 12 vs. 6). The results of the SCBB-reﬁnement also depend on the quality of the starting conﬁgurations from the rigid-body stage. The SCBB-reﬁnement was carried out on two complexes for which the rigid docking supplied potentially good quality can˚ ). For didates (at least one conﬁguration with L-RMS 5.1A both these complexes the best L-RMS values improved over

Journal of Computational Chemistry

DOI 10.1002/jcc

0.33 0.41 0.54 0.68 0.62 0.43 6.29 4.96 7.08 1.75 4.51 3.15 Acceptable Medium Medium Medium Medium Medium 0.33 0.41 0.54 0.68 0.66 0.65

b

a

EI EI AA AA O O 1BVN 1UDI 1DQJ 1MLC 1E96 1KAC

Results of rigid-body docking. Results of ﬂexible reﬁnement allowing side chains and backbone ﬂexibility. c Assessment criteria calculated for all the resulting cluster representative conﬁgurations. d Assessment criteria calculated for the 10 lowest energy cluster representative conﬁgurations. e Best RMSD of the ligand. f Best fraction of native residue-residue contacts. g Best combination of L-RMS and Fnat as deﬁned in Table 1. EI, enzyme inhibitor; AA, antibody-antigen; O, others.

6.15 4.73 7.08 1.75 2.85 2.87 Acceptable Acceptable Acceptable Medium Medium Medium 0.14 0.20 0.48 0.66 0.64 0.45 6.98 7.48 7.44 2.58 5.31 3.18 Acceptable Acceptable Acceptable Medium Medium Medium 0.29 0.35 0.49 0.66 0.64 0.49

Category

5.72 6.45 5.13 1.94 3.58 3.18

Combined criteriag Fnatf Fnat

f

Combined criteriag Fnatf L-RMSe ˚) (A PDB code

Complex

Best of all conformationc

L-RMSe ˚) (A

Best in top 10d

Combined criteriag

L-RMSe ˚) (A

Fnatf

Combined criteriag

L-RMSe ˚) (A

Best in top 10d Best of all conformationc

Flexible reﬁnement stage (side chains and backbone)b Rigid-body docking stagea

Table 5. Docking Unbound Subunits With Flexible Side Chains and Backbone.

Acceptable Medium Medium Medium Medium Medium

Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

1940

all the resulting CR conformations (Table 5, column 9 vs. 3). Moreover, in one case (1MLC), we found a conformation with ˚ that was ranked among the CR10 conformations L-RMS \2A (Table 5, column 12 vs. 6). For this complex, a conformation with similar L-RMS was also found in the SC-reﬁnement but was not ranked among the top 10. For the second complex (1KAC), the best among the CR10 conformations has the same quality as the best among the CR10 of the rigid search and is only slightly better than the best among the CR10 of the SCreﬁnement. This is reasonable as this complex has relatively small backbone deformation between the bound and the unbound structures. The reﬁnement stage was applied to four complexes for which the rigid docking did not supply good candidates. Still, for two of these complexes (1UDI and 1E96), the SCBB-reﬁnement phase found conformations with better L-RMS over all the CR conformations compared with the rigid search, and there was also improvement in best L-RMS of the CR10 conformations compared with the best L-RMS of the CR10 conﬁgurations from the rigid search. Better conformations were found for 1DQJ and more noteworthy for 1BVN in the SC-reﬁnement. Worse results in the SCBB-reﬁnement might be explained by insufﬁcient sampling or insufﬁcient collection of binding conformations. The case of 1DQJ might be suffering from a bad set of FF conformations (see Table 6, column 6). ˚ L-RMS) from the Good starting conﬁgurations (within 5A rigid-body search guarantee, to large extent, ﬁnding near-correct docked conformations; otherwise, a major part of the simulation time is spent exploring uninteresting regions of the conformational space. Therefore, we expect that starting the ﬂexible search with better conﬁgurations from the rigid stage will result in better binding conformations in the ﬂexible stage. This can be achieved by using a better rescoring procedure, between the rigid docking and the ﬂexible reﬁnement stages, or by performing the same process on a larger set of starting conﬁgurations. Docking success depends also on additional backbone changes. Three complexes (1UDI, 1E96, and 1KAC) have only minor backbone conformational changes in addition to the sampled FF. In 2 of them (1UDI and 1E96), introducing backbone ﬂexibility improved the best found binding conformation and the quality of the conformations that were ranked among the top 10. These two complexes have in their FF precalculated set a conformation with a distance to the bound structure that is shorter than the distance between the bound and the unbound ˚ for 1UDI and 1.29 vs. 2.21A ˚ for structures28 (1.38 vs. 1.77A 1E96, see columns 5 and 6 in Table 6). The docking result of 1KAC showed only a slight improvement when backbone ﬂexibility was introduced. Presumably, this is because its FF precalculated set did not contain conformations with a distance to the bound structure that is shorter than the distance of the unbound ˚ instead of 1.05A ˚ , see structure to the bound structure (1.22A columns 5 and 6 in Table 6). The three other complexes (1BVN, 1DQJ, and 1MLC) have other backbone conformational changes between the bound and unbound states in addition to the sampled FF. For the 1MLC complex, the SCBB-reﬁnement does not improve the best collected conformation compared with the SC-reﬁnement but

Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

1941

Table 6. Flexible Fragments Data.

Complex PDB code

Ligand or receptora

Residuesb

Lengthc

˚) RMSD B-Ud (A

˚) Best RMSD in sete (A

Receptor Ligand Ligand Receptor Ligand Ligand

P303-P307 I19-I22 C99-C103 B54-B57 A29-A32 B83-B90

5 4 5 4 4 8

2.69 1.77 2.63 1.84 2.21 1.05

0.74 1.38 2.37 1.68 1.29 1.22

1BVN 1UDI 1DQJ 1MLC 1E96 1KAC a

Indicate whether the ﬂexible fragment is in the ligand or in the receptor. IDs and chain IDs of the ﬂexible fragment. c Length of the ﬂexible fragment. d RMSD of FF between the bound and the unbound X-ray structures. RMSD calculated for backbone atoms after superimposing the structures over backbone atoms of three residues from each side of the FF. e Lowest RMSD of FF between the bound and the 10 conformations that are used (together with the unbound conformation) in the reﬁnement stage; RMSD calculated for backbone atoms after superimposing the structures over backbone atoms of 3 residues from each side of the FF. b

does improve the ranking. The best conformation among the top ˚ thresh10 of this complex actually has L-RMS less than the 2A old. The other two complexes show a negligible improvement in ranking compared with the rigid stage. In fact, the best collected conformation of 1BVN got worse compared to the best collected conformation in the SC-reﬁnement, even though the FF set of ˚ to this molecule has a conformation with RMSD of only 0.74A ˚ between the bound structure (compared with RMSD of 2.69A the bound and the unbound structures). The deterioration compared to the SC-reﬁnement might be the result of the sampling trajectory spending too much time in wrong positions due to problematic conformations in the FF set. We restricted the number of ﬂexible DOFs by sampling only side chains in the interface of four amino acids types. In the SCBB-reﬁnement, we used only one FF per complex with a length of four to eight residues. Even with all these restrictions, we got encouraging results. Improvement in ranking can be the result of better geometric complementarity and less steric hindrance in binding conformations attained when backbone DOFs are also considered. The improvement gained compared with both the rigid search and the SC-reﬁnement looks promising and may suggest that a similar but more comprehensive treatment of backbone ﬂexibility will lead to further improvement. The quality of the 10 conformations used to represent backbone ﬂexibility (with a loop prediction program based on the ISE algorithm) is important but does not guarantee better docking results. Time Performance

In principal, the runtime (T) of docking algorithms can be approximated by the number of conﬁgurations evaluated by the search algorithm (Nc) multiplied by the average time it takes to evaluate the energy of a single conﬁguration (tc): T 5 Nc 3 tc. The two variables are not independent as both the search pattern and the number of conﬁgurations checked can vary considerably with the accuracy of the energy function. However, Nc provides a convenient measure for evaluating the time efﬁciency of the sampling method independent of the energy function used and the available computational power.

Having d DOFs and n options for each DOF, a brute force search will have exponential complexity, Nc 5 nd. The complexity of the Best-First search reduces to opening neighbor nodes at each step and moving to the next conformation: Ncf(n) 2d, where f(n) depends on the heuristic function. For a perfect heuristic function, the Best-First search reduces to moving in a straight line to the solution Nc 5 O(n 2d). In our runs, we examined on the average 82800 conﬁgurations out of 306 (approximately checking 1 in every 104) and the experimental complexity is therefore roughly NcO(n2 26). The calculations were performed on Linux workstations with clock speed near 2GHz with two single core processors. The expected runtime of a parallelized version is roughly the single core runtime divided by the number of central processing unit cores. The average run time per target was 375000 seconds (52 hr) on a single core, with a standard deviation of 96000 (27 hr). The run time is similar to that of other programs with atomic resolution17,52 and is expected to last less than about an hour on a cluster of 50 processor cores, which is common in the ﬁeld (our method achieves near-linear behavior, at least for a few dozens of processors). However, one should note that our search space is smaller. Enlarging the search space, for example by doubling the number of options per axis, approximately quadruples the run time. Conversely, improving the energy function is expected to reduce the run time considerably. The inclusion of side chains and backbone ﬂexibility adds many more DOFs. The total number of DOFs ranges between 12 and 28 (6 rigid DOFs 10–1 backbone DOF 16–21 side chains DOFs). Empirically. the computational complexity of docking with SC-reﬁnement is nearly linear in that range with approximately 11000 additional tests per DOF. For docking with SCBB-reﬁnement, the dependence on the number of DOFs is less apparent but still in the same order of magnitude.

Discussion In this work, we described and evaluated a protein-protein docking method based on the Best-First search algorithm. This search

Journal of Computational Chemistry

DOI 10.1002/jcc

1942

Noy and Goldblum • Vol. 31, No. 9 • Journal of Computational Chemistry

method loosely imitates protein-protein association. Association of proteins comprises multiple orientational nonspeciﬁc encounters. Each encounter comprises multiple collisions, during which the protein can reorient57 due to an entrapment effect in which a protein pair is surrounded and trapped by water molecules.58 The multiple collisions greatly enhance the probability of ﬁnding the correct binding orientation. The search algorithm we use brings the proteins to a nonspeciﬁc contact from a few initial random positions of the ligand protein. After approaching, we sample nearby multiple promising conﬁgurations. The heuristic function guarantees that a fast meeting between the proteins will take place in a nonspeciﬁc orientation from each of the multiple starting positions of the ligand. The heuristic function also imitates the entrapment effect by keeping the proteins close to each other after the initial approach. There are few key elements contributing to the success of our method: (1) using Best-First search algorithm with a heuristic function that minimizes the distance between the centers of mass of the two proteins in addition to minimizing the energy, (2) the ensembles of conformations representing backbone ﬂexibility, and (3) the interleaved optimization of ﬂexible DOFs (side-chains and backbone conformations) and rigid-body displacements in the reﬁnement stage. All these elements essentially reﬂect attempts to imitate the biological process of protein-protein association. Best-First search algorithm has several features which make it a good choice as a basis for a protein-protein docking method: It does not necessarily search the whole search space, but it focuses on promising areas; with good heuristics, based on extra knowledge of the problem, the search converges rapidly to the solution; The search does not get stuck in local minima, but returns back and opens promising nodes found in the past, in cases where a direct step forward is not possible; It allows to adjust and incorporate other external algorithms and data. Failures may be caused by a variety of reasons. The most likely reason is inaccuracy of the energy function. Electrostatics is treated in a basic fashion by coulomb law with a distance-dependent dielectric. Entropy contributions, and in particular the hydrophobic desolvation effect, are not considered at all. Too small cutoffs, especially for the electrostatic term, may also lower the accuracy of the energy function. The sampling of the search space may also be too restrictive, especially in the ﬂexible reﬁnement stage as a result of: additional backbone conformational changes that are not sampled, too few sampling steps of rigid and/or ﬂexible DOFs, insufﬁcient rotameric representation or use of a set of FF conformations which is not adequate. Another possible cause for docking failures is more subtle. In both the rigid and the reﬁnement stages, we keep a list of the 3000 lowest energy conformations sampled (in the reﬁnement stage we keep a list of 3000 lowest energy sampled conformations for each starting conﬁguration). Clustering is carried out only when the search ends; sometimes, especially in the reﬁnement stage, all the 3000 collected conformations are quite similar and after clustering we are left with a very small set of clustered conformations even though many more conformations were sampled along the search. Our docking method is still in development stages, thus several restrictions and approximations were used as described ear-

lier. The usefulness of this method for predictive docking is, therefore, restricted at present.

Conclusion We present a new docking method that mimics the physical process of protein-protein association, in that it brings the proteins close to each other from multiple nonspeciﬁc orientations and forces multiple collisions that enhance the probability of ﬁnding the correct binding orientation. After the rigid stage, our method simultaneously optimizes rigid-body displacement and sidechains and backbone movements. Our method presents a new approach for dealing with backbone conformational changes by using ensembles of FF conformations that were proven to be relevant for protein-protein interfaces. It also can be used as a basis for incorporating further realistic motions of the proteins. Still, despite progress in mimicking protein-protein association and in handling backbone ﬂexibility, protein-protein docking continues to be a signiﬁcant scientiﬁc challenge.

Acknowledgment We thank Ramon Axelrod for helpful suggestions and discussions.

References 1. Mendez, R.; Leplae, R.; Lensink, M. F.; Wodak, S. J. Proteins 2005, 60, 150. 2. Lensink, M. F.; Mendez, R.; Wodak, S. J. Proteins 2007, 69, 704. 3. Ritchie, D. W. Curr Protein Pept Sci 2008, 9, 1. 4. Bonvin, A. M. Curr Opin Struc Biol 2006, 16, 194. 5. Andrusier, N.; Mashiach, E.; Nussinov, R.; Wolfson, H. J. Proteins 2008, 73, 271. 6. Gray, J. J. Curr Opin Struct Biol 2006, 16, 183. 7. Goh, C. S.; Milburn, D.; Gerstein, M. Curr Opin Struct Biol 2004, 14, 104. 8. O’Hearn, S. D.; Kusalik, A. J.; Angel, J. F. Protein Eng 2003, 16, 169. 9. Boutonnet, N. S.; Rooman, M. J.; Wodak, S. J. J Mol Biol 1995, 253, 633. 10. Jacobs, D. J.; Rader, A. J.; Kuhn, L. A.; Thorpe, M. F. Proteins 2001, 44, 150. 11. Emekli, U.; Schneidman-Duhovny, D.; Wolfson, H. J.; Nussinov, R.; Haliloglu, T. Proteins 2008, 70, 1219. 12. Hinsen, K. Proteins 1998, 33, 417. 13. Amadei, A.; Linssen, A. B.; Berendsen, H. J. Proteins 1993, 17, 412. 14. Dominguez, C.; Bonvin, A. M.; Winkler, G. S.; van Schaik, F. M.; Timmers, H. T.; Boelens, R. Structure 2004, 12, 633. 15. Grunberg, R.; Leckner, J.; Nilges, M. Structure 2004, 12, 2125. 16. Smith, G. R.; Sternberg, M. J.; Bates, P. A. J Mol Biol 2005, 347, 1077. 17. Dominguez, C.; Boelens, R.; Bonvin, A. M. J Am Chem Soc 2003, 125, 1731. 18. Zacharias, M. Proteins 2004, 54, 759. 19. Bastard, K.; Prevost, C.; Zacharias, M. Proteins 2006, 62, 956. 20. Wang, C.; Bradley, P.; Baker, D. J Mol Biol 2007, 373, 503.

Journal of Computational Chemistry

DOI 10.1002/jcc

Flexible Protein-Protein Docking

21. Sandak, B.; Nussinov, R.; Wolfson, H. J Comput Appl Biosci 1995, 11, 87. 22. Sandak, B.; Wolfson, H. J.; Nussinov, R. Proteins 1998, 32, 159. 23. Schneidman-Duhovny, D.; Inbar, Y.; Nussinov, R.; Wolfson, H. J. Proteins 2005, 60, 224. 24. Ben-Zeev, E.; Kowalsman, N.; Ben-Shimon, A.; Segal, D.; Atarot, T.; Noivirt, O.; Shay, T.; Eisenstein, M. Proteins 2005, 60, 195. 25. Uk, B.; Taufer, M.; Stricker, T.; Settanni, G.; Cavalli, A.; Caﬂisch, A. Cluster Computing and the Grid, IEEE International Symposium, 2003. 26. Kelly, K.; Labute, P. Available at http://www.chemcomp.com/ Journal_of_CCG/articles/astar.htm 27. Smith, G. R.; Sternberg, M. J. Curr Opin Struct Biol 2002, 12, 28. 28. Noy, E.; Tabakman, T.; Goldblum, A. Proteins 2007, 68, 702. 29. Rajamani, D.; Thiel, S.; Vajda, S.; Camacho, C. J. Proc Natl Acad Sci USA 2004, 101, 11287. 30. Fernandez-Recio, J.; Totrov, M.; Abagyan, R. Protein Sci 2002, 11, 280. 31. Pearl, J. Heuristics: Intelligent Search Strategies for Computer Problem Solving; Addison-Wesley: Boston, MA, 1984. 32. Russell, S. J.; Norvig, P. Artiﬁcal intelligence: a modern approach; Prentice-Hall: New Jersey, NJ, 2003. 33. MOE. CCGroup Inc.: Montreal, Canada. 34. Wernisch, L.; Hery, S.; Wodak, S. J. J Mol Biol 2000, 301, 713. 35. Laskowski, R. A.; Luscombe, N. M.; Swindells, M. B.; Thornton, J. M. Protein Sci 1996, 5, 2438. 36. Wangikar, P. P.; Tendulkar, A. V.; Ramya, S.; Mali, D. N.; Sarawagi, S. J Mol Biol 2003, 326, 955. 37. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Pietrokovski, S. J Mol Biol 2004, 344, 1135. 38. Lichtarge, O.; Bourne, H. R.; Cohen, F. E. Proc Natl Acad Sci USA 1996, 93, 7507. 39. Gouldson, P. R.; Dean, M. K.; Snell, C. R.; Bywater, R. P.; Gkoutos, G.; Reynolds, C. A. Protein Eng 2001, 14, 759.

1943

40. Glick, M.; Robinson, D. D.; Grant, G. H.; Richards, W. G. J Am Chem Soc 2002, 124, 2337. 41. Laurie, A. T.; Jackson, R. M. Bioinformatics 2005, 21, 1908. 42. Vakser, I. A. Biopolymers 1996, 39, 455. 43. Koch, K.; Zollner, F.; Neumann, S.; Kummert, F.; Sagerer, G. Silico Biol 2002, 2, 351. 44. Dunbrack, R. L., Jr.; Karplus, M. Nat Struct Biol 1994, 1, 334. 45. Dunbrack, R. L., Jr. Curr Opin Struct Biol 2002, 12, 431. 46. Rayan, A.; Noy, E.; Chema, D.; Levitzki, A.; Goldblum, A. Curr Med Chem 2004, 11, 675. 47. Rayan, A.; Senderowitz, H.; Goldblum, A. J Mol Graph Model 2004, 22, 319. 48. Xiang, Z.; Honig, B. J Mol Biol 2001, 311, 421. 49. Sybyl. Tripos Inc.: St. Louis, MI. 50. Wang, J.; Cieplak, P.; Kollman, P. A. J Comput Chem 2000, 21, 1049. 51. Rohl, C. A.; Strauss, C. E.; Misura, K. M.; Baker, D. Methods Enzymol 2004, 383, 66. 52. Gray, J. J.; Moughon, S.; Wang, C.; Schueler-Furman, O.; Kuhlman, B.; Rohl, C. A.; Baker, D. J Mol Biol 2003, 331, 281. 53. Mintseris, J.; Wiehe, K.; Pierce, B.; Anderson, R.; Chen, R.; Janin, J.; Weng, Z. Proteins 2005, 60, 214. 54. Janin, J.; Henrick, K.; Moult, J.; Eyck, L. T.; Sternberg, M. J.; Vajda, S.; Vakser, I.; Wodak, S. J. Proteins 2003, 52, 2. 55. Vajda, S. Proteins 2005, 60, 176. 56. Kowalsman, N.; Eisenstein, M. Bioinformatics 2006, 23, 421. 57. Selzer, T.; Schreiber, G. Proteins 2001, 45, 190. 58. Northrup, S. H.; Erickson, H. P. Proc Natl Acad Sci USA 1992, 89, 3338. 59. DeLano, W. L. The PyMOL molecular graphics system. DeLano Scientiﬁc: San Carlos, CA, 2002.

Journal of Computational Chemistry

DOI 10.1002/jcc

Recommend Documents

Assessing energy functions for flexible docking - Semantic Scholar

Mapping Monomeric Threading to ProteinProtein ... - Semantic Scholar

Effect of interprotein polarization on proteinprotein ... - Semantic Scholar