Scoring Docked Protein Complexes with Hydrogen Bonds

Patrick Day
April 17, 2013

1 Introduction

1.1 Area

Finding the structure of proteins and protein-protein complexes is an important problem in computational biology. The structures of individual proteins can be determined successfully with various experimental methods such as X-ray crystallography, but determining the structures of protein-protein complexes experimentally is more difficult. Instead, these structures are frequently predicted computationally. One approach to predicting these complexes is to create a function that scores candidate complexes on their viability, ideally giving the correct structure the best score.

1.2 Importance

Since the function of a protein (or protein-protein complex) is determined by its structure, understanding protein structure is key to understanding protein function. Discovering the structure of a protein can provide a wealth of information about its function, but many proteins must bind another protein or a small molecule to perform that function. To understand these proteins completely, the structures of the bound complexes must also be determined. Beyond revealing a protein's function within biological pathways, understanding the structures of protein-protein complexes also assists with drug design. Learning how proteins form complexes allows drug designers to shape their drugs so that they bind to desired receptors and do not bind to undesired ones.

1.3 Types of Protein Structure

Protein structure is defined at four distinct levels, from primary to quaternary [3]. Primary structure is the sequence of amino acids in the polypeptide chain. There are 20 different amino acids, and they are chained together by peptide bonds between the carboxyl group of one residue and the amino group of the next. Secondary structure is largely defined by hydrogen bonding interactions between the amino acid residues. These interactions create localized structures such as alpha helices and beta sheets. Tertiary structure is the 3D shape of the protein, as determined by the folding of the localized secondary structures. These structures fold in a way that hides the hydrophobic regions inside the protein but exposes the hydrophilic regions to the solvent. Quaternary structure is the 3D structure of a complex made of multiple subunits (discrete polypeptide chains). It arises from proteins' need to bind other proteins in order to function.

Figure 1: Levels of Protein Structure [2]

1.4 Ease of Determination

The structure of unbound proteins is relatively easy to determine. The primary structure can be determined experimentally in a number of ways. This primary structure is then used to predict the secondary and tertiary structure of the protein computationally. When a protein binds (or docks with) another protein, the new protein-protein complex takes on a quaternary structure that is not as easily determined. These complexes are often so large, and the interaction between the member proteins so weak, that the new structure cannot be determined experimentally, even if the structures of both proteins are already known. In addition, these complexes are often transient and difficult to isolate. Fortunately, computational determination of protein complexes is promising, since the search space is more limited than in protein folding.

In order to find the most likely docking conformation of a complex, a large number of possible structures must first be generated. These structures are generated under the assumption of rigid-body docking, which means that the proteins undergo little to no conformational change upon docking. With this assumption, possible dockings are generated by fixing the position of one of the proteins while giving the other protein six degrees of freedom: three for rotation and three for translation. If a protein complex consists of proteins X_a and X_b, we seek a transformation T(X_b) such that the energy (or potential) of the complex, E(X_a, T(X_b)), is at a minimum. This energy function can take into account many different factors, including electrostatic forces, hydrophobic/hydrophilic interactions, and specific amino acid interactions.
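As a minimal sketch of the rigid-body transformation T described above (not the code used in this work), the following applies three rotation angles and a translation vector to a list of atom coordinates. The function names `rotation_matrix` and `transform` are hypothetical, and the choice of rotating about x, then y, then z is an assumption; any parameterization of the three rotational degrees of freedom would serve.

```python
import math


def rotation_matrix(alpha, beta, gamma):
    """Return R = Rz(gamma) * Ry(beta) * Rx(alpha); angles are in radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def transform(coords, angles, shift):
    """Rotate every (x, y, z) point by the three angles, then translate by shift.

    This is the six-degree-of-freedom move applied to the ligand X_b while
    the receptor X_a stays fixed. Distances within the ligand are unchanged.
    """
    rot = rotation_matrix(*angles)
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + t
            for row, t in zip(rot, shift)
        ))
    return moved
```

A docking search would call `transform` on the ligand coordinates for many sampled angle and shift values and evaluate the energy E(X_a, T(X_b)) for each candidate.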

1.5 Hydrogen Bonding

My research focuses on the role of hydrogen bonds in complex formation. When hydrogen atoms are attached to electronegative atoms such as nitrogen or oxygen (and, to a lesser extent, sulfur), they become polarized with a slight positive charge. These partially positive hydrogens are attracted to other electronegative atoms. It is important to note that they do not bind covalently, but are held by a strong dipole-dipole force. When these bonds form between atoms on different proteins, the protein-protein complex is stabilized. Given the role of hydrogen bonds in stabilizing these complexes, I used a statistical learning approach to find a scoring function based on the quantity of hydrogen bonds found in the structures of known complexes provided by the Protein Data Bank (www.rcsb.org).


2 Background

2.1 Complex Structure

The protein complexes considered here consist of two proteins, the larger of which is usually called the receptor while the other is called the ligand. Between these two proteins lies an interface, or region of contact. The atoms in this region are close enough to interact and thus contribute to the binding of the two proteins. This region is also protected from the solvent upon complex formation, which is important because hydrogen bonds are less likely to form between proteins when solvent is present.

Figure 2: Protein-protein interface [7]

2.2 Types of Docking

At this point, it is important to distinguish between the two theoretical models of docking. The first, induced fit, posits that the two proteins undergo a significant conformational change in the process of binding. This would mean that the bound versions of proteins bear little resemblance to their unbound counterparts. As our transformations only allow the six degrees of freedom found in rotation and translation, this model is less useful to our research. The second is known as the lock-and-key model and posits that there is little conformational change in the process of binding [11]. This would mean that the correct binding of two proteins could be found from the six degrees of freedom in our transformations.

The lock-and-key model is especially useful in the case of bound docking, in which the ligand and receptor proteins are pulled apart from a complex whose structure is already known. This guarantees that the proteins are already in the correct conformation to be bound. Bound docking is used to benchmark the performance of docking algorithms, since it allows us to compare an algorithm's predicted docking with the actual one. A more scientifically useful case is unbound docking. In this case, not only is the structure of the bound complex unknown, but the structures of the receptor and the ligand may not be known either. When this is so, the structures of those proteins must first be approximated through homology modeling, which infers a structure from that of a homologous protein. Once the structures of the receptor and ligand are known, the process of transforming the proteins proceeds in the same way as in the bound case.

Figure 3: Bound and unbound docking [4]

2.3 Scoring Functions

Once the transformations are complete, the complexes need to be scored and ranked so that the most feasible conformation can be found. Scoring functions (or potentials) can be either physics-based or knowledge-based. Physics-based potentials are rooted in the ideal interactions between atoms.


Knowledge-based potentials are rooted in statistical analysis of already known structures.

Physics-based potentials (or molecular mechanics potentials) are generally based on knowledge of bonding and electrostatics. They first find energies for the structure in terms of bonded terms and non-bonded terms, and then sum these energies. Bonded terms typically measure deviations from ideal values such as bond length and angle and then square these deviations. Non-bonded terms take into account electrostatics and van der Waals interactions. These potentials are used to minimize the given structures to create a more feasible docked complex [5].

Two common kinds of knowledge-based potential are statistical potentials and mathematical-programming-based potentials. Instead of analyzing protein structures based on the physics of protein interactions, these potentials are based wholly on statistical analysis of known structures. Statistical potentials are based on equation 1, which is known as the inverse Boltzmann distribution. This equation states that the difference in energy between the current structure (or state) and some reference state is a function of the log-odds ratio of the probability of the current state to that of the reference state. The probability of the current structure, P(r), is determined by the presence of certain interactions within the structure. The reference state, P_ref(r), is created to represent a system in which all of the measured interactions are absent [6].

\[ E = -k_B T \ln \frac{P(r)}{P_{\mathrm{ref}}(r)} \tag{1} \]
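As a rough worked example of equation 1, suppose an interaction is observed twice as often as the reference state predicts, so that P(r) = 2 P_ref(r). The temperature of 298.15 K and the use of kcal/mol units are assumptions made only for illustration.

```latex
% Assumed values: T = 298.15 K, k_B = 1.987e-3 kcal/(mol K), P(r) = 2 P_ref(r)
\begin{align*}
E &= -k_B T \ln \frac{P(r)}{P_{\mathrm{ref}}(r)} \\
  &= -(1.987 \times 10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}})(298.15\ \mathrm{K}) \ln 2 \\
  &\approx -(0.592)(0.693)\ \mathrm{kcal\,mol^{-1}} \approx -0.41\ \mathrm{kcal\,mol^{-1}}
\end{align*}
```

An interaction seen more often than expected therefore receives a negative (favorable) energy, while one seen less often than expected receives a positive energy.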

Potentials can also be based on mathematical programming. These potentials rest on the idea that correctly folded protein structures have a low energy, and that the energy of a structure rises the further that structure is from being correctly folded. This logic is exemplified by the funnel-shaped protein folding landscape shown in Figure 4. It gives rise to equation 2, which describes the relationship between correct and misfolded structures. Collecting many such inequalities, one per misfolded structure, yields the constraints of a linear program. An objective function is then minimized subject to these constraints to produce the parameters of the potential [9].

\[ E(X_{\mathrm{misfolded}}) - E(X_{\mathrm{correct}}) > 0 \quad \forall X_{\mathrm{misfolded}} \tag{2} \]
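One common way to turn inequalities of this kind into a linear program is sketched below. It assumes that the energy is linear in its parameters, E(X) = w · f(X), where f(X) is a hypothetical vector of interaction counts and w is the parameter vector to be learned. The slack variables ξ_i and the margin of 1 are also assumptions of this sketch and are not necessarily the formulation used in [9].

```latex
% Assumption: E(X) = w . f(X), with f(X) a vector of interaction counts (hypothetical features)
\begin{align*}
\min_{w,\,\xi}\quad & \sum_{i} \xi_i \\
\text{subject to}\quad & w \cdot \bigl(f(X^{(i)}_{\mathrm{misfolded}}) - f(X_{\mathrm{correct}})\bigr) \ge 1 - \xi_i, \\
& \xi_i \ge 0 \quad \forall i
\end{align*}
```

The margin of 1 replaces the strict inequality and fixes the overall scale of w, while each slack ξ_i allows an individual misfolded structure to violate its constraint at a cost.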


Figure 4: Protein Landscape [8]

3 Method

3.1 Data Set

My work involves scoring these protein-protein complexes based on the number of hydrogen bonds present in the interface of the complex. This work was done on the basis that hydrogen bonds provide at least specificity to protein-protein complexes, if not stability. The code I wrote analyzes protein structures for the proximity of polar hydrogens to electronegative atoms, given variable constraints on bond length and angle. This code was run on a data set of 640 complexes stored in PDB (Protein Data Bank) files. Of these complexes, 462 were bound and 178 were unbound. Unfortunately, these PDB files did not explicitly include hydrogens, because X-ray crystallography generally cannot resolve the small hydrogen atoms. Fortunately, the positions of hydrogen atoms can be modeled accurately from chemical geometry. To add the hydrogen atom coordinates, I ran my entire data set through Reduce, a program developed at Duke that inserts hydrogens into PDB files. Reduce adds hydrogens so that they are staggered about the atom to which they are bound, rotationally optimized, and given appropriate bond lengths [10].


3.2 Interface

With the hydrogens in place, I then needed to find the subset of atoms in the proteins found at the interface of the two proteins. I defined this interface as the subset of atoms in each protein that are within ten angstroms of an atom in the opposite protein.
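The interface definition above can be sketched as a brute-force distance check. This is an illustration rather than the code used in this work: the function name `interface_atoms` is hypothetical, and atoms are assumed to be given as plain (x, y, z) tuples in angstroms.

```python
import math


def interface_atoms(receptor, ligand, cutoff=10.0):
    """Return indices of atoms within `cutoff` angstroms of the other protein.

    receptor, ligand: lists of (x, y, z) coordinates in angstroms.
    Returns (receptor_indices, ligand_indices), each sorted ascending.
    """
    rec_hits, lig_hits = set(), set()
    for i, a in enumerate(receptor):
        for j, b in enumerate(ligand):
            # An atom is in the interface if at least one atom of the
            # opposite protein lies within the cutoff distance.
            if math.dist(a, b) <= cutoff:
                rec_hits.add(i)
                lig_hits.add(j)
    return sorted(rec_hits), sorted(lig_hits)
```

The double loop costs time proportional to the product of the two atom counts. For large complexes, a spatial grid or k-d tree would reduce this cost, at the price of more code.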

3.3 Hydrogen Bond Geometry

The next step was determining the appropriate geometry for a hydrogen bond. Research suggests that the typical distance (d) between a polarized hydrogen and the electronegative atom to which it is hydrogen bonded is around 1.9 angstroms. The angle (θ) between the covalent bond and the hydrogen bond, with the hydrogen at the vertex, is less well defined, but tends toward 180°.
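The two geometric quantities d and θ can be computed from three atom positions, as in the sketch below. This is not the code used in this work. The names `hbond_geometry` and `is_hydrogen_bond` are hypothetical, and the default thresholds `max_d` and `min_theta` are placeholder values chosen only for illustration, since the appropriate values are what the method sets out to find.

```python
import math


def hbond_geometry(donor, hydrogen, acceptor):
    """Return (d, theta) for a candidate hydrogen bond.

    d is the hydrogen-to-acceptor distance in angstroms; theta is the angle
    in degrees at the hydrogen between the covalent donor-H bond and the
    H...acceptor contact.
    """
    d = math.dist(hydrogen, acceptor)
    to_donor = [p - h for p, h in zip(donor, hydrogen)]
    to_acceptor = [a - h for a, h in zip(acceptor, hydrogen)]
    dot = sum(x * y for x, y in zip(to_donor, to_acceptor))
    cos_theta = dot / (math.hypot(*to_donor) * math.hypot(*to_acceptor))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return d, math.degrees(math.acos(cos_theta))


def is_hydrogen_bond(donor, hydrogen, acceptor, max_d=2.5, min_theta=120.0):
    """Apply distance and angle cutoffs (placeholder defaults, not fitted)."""
    d, theta = hbond_geometry(donor, hydrogen, acceptor)
    return d <= max_d and theta >= min_theta
```

A perfectly linear arrangement of donor, hydrogen, and acceptor gives θ = 180°, while an acceptor placed at a right angle to the covalent bond gives θ = 90° and is rejected by the angle cutoff.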

Figure 5: Geometry of a hydrogen bond

3.4 Code

Given these constraints, I ran my code on these interfaces with various values of d and θ in order to find the most appropriate values. My score function, W, is defined by a log-odds ratio,

\[ W = -k_B T \ln \frac{P_{\mathrm{obs}}}{P_{\mathrm{ref}}} \]

where P_obs is the observed frequency of a hydrogen bond length (or angle) in the data set and P_ref is the reference (or null) distribution. This reference distribution represents the expected distribution if there is no hydrogen bonding in the complex. These quantities are defined as

\[ P_{\mathrm{obs}} = p(P, H, d, \theta) \]
\[ P_{\mathrm{ref}} = p(H) \, p(P) \, p(R, d, \theta \mid H) \]

where H represents polar hydrogens, P represents any polarizing atom, and R represents any atom. After determining the score ratio, I tested it against an independent data set (one consisting of purposefully misdocked complexes) to assess its recognition capacity.
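The score W can be evaluated per geometry bin once the probabilities have been estimated, as in the sketch below. This is an illustration, not the code used in this work. The function names are hypothetical, the probabilities are assumed to have been estimated elsewhere from atom counts, and `kt` stands for the product k_B T, set to 1 by default so that scores are in units of k_B T.

```python
import math


def hbond_score(p_obs, p_h, p_p, p_r_given_h, kt=1.0):
    """W = -kT * ln(P_obs / P_ref), with P_ref = p(H) * p(P) * p(R, d, theta | H)."""
    p_ref = p_h * p_p * p_r_given_h
    if p_obs <= 0 or p_ref <= 0:
        raise ValueError("probabilities must be positive")
    return -kt * math.log(p_obs / p_ref)


def score_by_bin(obs_by_bin, geom_by_bin, p_h, p_p, kt=1.0):
    """Score each (d, theta) bin.

    obs_by_bin maps a bin label to p(P, H, d, theta);
    geom_by_bin maps the same label to p(R, d, theta | H).
    """
    return {
        b: hbond_score(obs_by_bin[b], p_h, p_p, geom_by_bin[b], kt)
        for b in obs_by_bin
    }
```

A bin in which polar hydrogens and polarizing atoms co-occur more often than the reference predicts receives a negative score, so the most favorable geometry appears as the minimum of the resulting table.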

4 Results

Existing research defines hydrogen bonds as having a length (d) of around 2 angstroms and an angle (θ) tending toward 180° [1]. Figure 6 shows the breakdown of scores across all of the proteins in my data set for several different bond lengths. A few different θ values were used as well, but this information was combined for each bond length. The trough at 2.0 angstroms lends credence to the validity of my scoring function, as it shows a preference for the correct bond length.

Figure 6: Breakdown of protein scores by bond length

In addition, Figure 7 shows scoring improving as d approaches 2.0 angstroms and θ approaches 180°. This is consistent with Coulomb's law, which states that the electrostatic force between charges decreases as the distance between them increases. The closer θ is to 180°, the larger the distance between the two negatively charged atoms on either side of the hydrogen. This increased distance weakens the repulsion between them and results in a more stable bond.
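The dependence of this distance on θ follows from the law of cosines applied to the triangle formed by the donor atom D, the hydrogen, and the acceptor atom A. The covalent donor-hydrogen bond length of 1.0 angstrom used below is an assumed, illustrative value.

```latex
% Assumed: donor-hydrogen bond length r_{DH} = 1.0 A (illustrative), d = 2.0 A
\begin{align*}
|DA|^2 &= r_{DH}^2 + d^2 - 2\, r_{DH}\, d \cos\theta \\
\theta = 180^\circ:\quad |DA| &= r_{DH} + d = 3.0\ \text{\AA} \\
\theta = 120^\circ:\quad |DA| &= \sqrt{1.0 + 4.0 + 2.0}\ \text{\AA} \approx 2.65\ \text{\AA}
\end{align*}
```

Since cos θ reaches its minimum of -1 at 180°, the donor-acceptor separation is greatest for a linear bond, and the Coulomb repulsion between the two electronegative atoms is correspondingly smallest.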

Figure 7: Contour plot of protein scores

After parameterizing the energy function with a d value of 2.0 and a preference for higher values of θ, correct structures were scored against 'decoy' structures. These decoy structures are purposefully misdocked versions of the same receptor and ligand, which provide a comparison against the correct structures. Figure 8 shows the scores for six different complexes against several decoy complexes. Many of the decoy structures had no possible hydrogen bonds and so returned scores of zero. Although these are the results from just six complexes, the general trend across all complexes was in favor of the correct complex.
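One simple way to summarize such a comparison is the rank of the correct structure among its decoys. The sketch below is an illustration only: the name `native_rank` is hypothetical, lower scores are assumed to be better, and ties are counted against the correct structure, which is a conservative choice rather than something specified in this work.

```python
def native_rank(native_score, decoy_scores):
    """Return the 1-based rank of the correct structure among its decoys.

    Lower scores are assumed to be better. A decoy that ties with the
    correct structure is counted as ranking ahead of it.
    """
    return 1 + sum(1 for s in decoy_scores if s <= native_score)
```

A rank of 1 means that the scoring function prefers the correct complex over every decoy in the set.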

Figure 8: Correct vs Decoy Complexes

5 Summary

Using hydrogen bonding to score dockings of protein-protein complexes is feasible. My energy function is relatively simple, yet it can still reveal a good deal about the viability of a protein-protein complex. In cases where the correct structure was not chosen, the number of hydrogen bonds present was generally higher in the correct structure, but was outweighed by the reference distribution. This suggests that I should perhaps weight hydrogen bonds more heavily or change how the reference distribution is calculated.

References

[1] Kortemme T, Morozov AV, Baker D. An Orientation-dependent Hydrogen Bonding Potential Improves Prediction of Specificity and Structure for Proteins and Protein-Protein Complexes. J Mol Biol 2003;326(4):1239-1259. http://dx.doi.org/10.1016/S0022-2836(03)00021-4. Accessed 16 Apr. 2013.
[2] Fischer T. Protein structure levels, http://faculty.irsc.edu/FACULTY/T Fischer/bio%201%20files/bio%201%20resources.htm. 2013.
[3] Janin J, Bahadur RP, Chakrabarti P. Protein-protein interaction and quaternary structure. Q Rev Biophys 2008;41(2):133-180.
[4] Funkhouser T. COS 597A Lecture Notes in Structural Bioinformatics. In: Archives PUC, editor; 2005.
[5] Ravikant D. Learning to Dock Proteins. Ithaca NY: Cornell University; 2011.
[6] Schwede T, Peitsch MC. Computational structural biology: methods and applications. N.J.: World Scientific; 2008. x, 779 p.
[7] Keskin O. Protein-Protein Interface, http://home.ku.edu.tr/ okeskin/interface. 2013.
[8] Chaplin M. Protein Folding and Denaturation, http://www.lsbu.ac.uk/water/protein2.html. 2013.
[9] Maiorov VN, Crippen GM. Contact Potential That Recognizes the Correct Folding of Globular-Proteins. J Mol Biol 1992;227(3):876-888.
[10] Word JM, Lovell SC, Richardson JS, Richardson DC. Asparagine and Glutamine: Using Hydrogen Atom Contacts in the Choice of Side-chain Amide Orientation. J Mol Biol 1999;285:1735-1747.
[11] Fischer E. Einfluss der Configuration auf die Wirkung der Enzyme. Ber Dt Chem Ges 1894;27(3):9.
