Incorporating Protein Dynamics Through Ensemble Docking in Machine Learning Models to Predict Drug Binding
S18: Oral Presentations - 'Omics Innovations in Methods and Applications
Sally R. Ellingson
Cancer Research Informatics, Markey Cancer Center
Biomedical Informatics, University of Kentucky
Twitter: #AMIA2017
Disclosure I have no relevant relationships with commercial interests to disclose.
AMIA 2017 | amia.org
Learning Objectives
After participating in this session the learner should be better able to:
• Describe ensemble docking
• Demonstrate how ensemble docking results can be used in machine learning models to predict drug binding
Motivation
Drug discovery is expensive and hard! In 2010, the estimated cost of bringing a new drug to market was $1.8 billion and steadily rising. This economic burden has not been relieved: a late-2014 estimate put the cost at $2.6 billion.
The cost to bring a drug to market must compensate for failed drugs
Motivation
Drug discovery is expensive and hard! First step - discovery of a therapeutic method. • Biochemical systems are not completely understood.
Biochemical pathway chart
Computational Solutions With the number of different proteins in humans and the genetic variations observable in the population, a full understanding of all possible interactions through experiments and clinical testing alone is infeasible, making computational investigations particularly useful and relevant.
Introduction to Ensemble Docking
Proteins are not static structures
• Proteins are dynamic and may exist in many conformations, each of which may be druggable
• Different binding sites may exist in different conformations
Molecular Dynamics
Trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between particles and the potential energy are defined by molecular mechanics force fields.
The equations must be solved for every frame in the trajectory, and time steps must be very small to keep the system stable, which means lots of calculations!
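To make the time-step point concrete, here is a minimal sketch of numerically integrating Newton's equations with the velocity Verlet scheme. It is an illustration only: the single 1-D harmonic spring stands in for a real molecular mechanics force field, and the function names, mass, and step sizes are assumptions, not values from the talk.

```python
def harmonic_force(x, k=1.0):
    """Toy force field (assumption): a single harmonic spring, F = -k * x."""
    return -k * x


def velocity_verlet(x0, v0, mass, dt, n_steps, force=harmonic_force):
    """Integrate Newton's equations of motion for one particle in 1-D.

    Returns the list of positions (the trajectory), one entry per frame.
    """
    x, v = x0, v0
    f = force(x)
    trajectory = [x]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # forces at the new position
        v = v_half + 0.5 * dt * f / mass   # finish the velocity update
        trajectory.append(x)
    return trajectory


# A small time step keeps the oscillation bounded; a large one diverges.
stable = velocity_verlet(x0=1.0, v0=0.0, mass=1.0, dt=0.01, n_steps=1000)
unstable = velocity_verlet(x0=1.0, v0=0.0, mass=1.0, dt=2.5, n_steps=50)
print("max |x| with dt=0.01:", max(abs(x) for x in stable))
print("max |x| with dt=2.5: ", max(abs(x) for x in unstable))
```

Every frame requires a fresh force evaluation, so a long trajectory at a small time step multiplies into a very large number of calculations.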
Molecular Dynamics
Simulating protein movements using Anton could aid drug design. SCIENCE/AAAS
Binding of cancer drug dasatinib to target, Src kinase How does a drug molecule find its target binding site? Shan et al. (2011) JACS. (Anton)
Takes days of computer time on special-purpose hardware.
Ensemble Docking Pipeline
Molecular dynamics • generation of protein dynamics
Clustering • gromos pairwise RMSD-based clustering of each frame in the trajectory to extract significantly different conformations
Molecular docking • conformational search of chemical compound with simplified scoring function to predict binding
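The clustering step can be sketched as follows. This is a minimal illustration of the gromos-style scheme as it is commonly described (count each frame's neighbors within an RMSD cutoff, take the frame with the most neighbors as a cluster center, remove that cluster, repeat). The toy RMSD matrix and the cutoff are made-up assumptions; a real pipeline would compute pairwise RMSD from the trajectory frames.

```python
def gromos_cluster(rmsd, cutoff):
    """Cluster frames from a symmetric pairwise RMSD matrix.

    Returns a list of (center, members) tuples in the order they are found.
    """
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        # Neighbors of each remaining frame: remaining frames within the cutoff.
        neighbors = {
            i: [j for j in remaining if rmsd[i][j] <= cutoff]
            for i in remaining
        }
        # The frame with the most neighbors becomes the next cluster center
        # (ties are broken by the lowest frame index for reproducibility).
        center = max(sorted(remaining), key=lambda i: len(neighbors[i]))
        members = sorted(neighbors[center])
        clusters.append((center, members))
        remaining -= set(members)
    return clusters


# Toy data (assumption): six "frames" on a line, RMSD = absolute difference.
coords = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
rmsd = [[abs(a - b) for b in coords] for a in coords]
clusters = gromos_cluster(rmsd, cutoff=0.15)
for center, members in clusters:
    print("center frame", center, "members", members)
```

Each cluster center is then a candidate protein conformation to dock against.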
Molecular Docking
• Predicts conformation of a protein-ligand complex
• Predicts binding affinity of the ligand to the protein
(+) Reproduce correct bound conformation
(+) Assign better scores to high-affinity ligands than to decoys (enrichment)
(-) Generate scores that correlate with measured binding affinities
Diller, D. J. and Merz, K. M. (2001), High throughput docking for library design and library prioritization. Proteins, 43: 113–124.
Docking engines such as AutoDock4 and AutoDock Vina allow for virtual binding experiments with atomic-level detail on a desktop computer.
Ensemble Docking
How do you reduce the degrees of freedom of protein dynamics?
2000 snapshots from a 200ns molecular dynamics simulation
40 distinct states identified with clustering
red indicates the binding site region
Enrichment
Enrichment plots (src): percent of known ligands found (y-axis, 0 to 25) vs. percent of ranked database (x-axis, 0 to 5); legend entry: xtal
We want to do even better, and do so reliably.
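The quantity on the enrichment plot's axes can be computed as in the sketch below. It assumes that a lower docking score means a better predicted binder (the usual convention for energy-like scores); the function name and the ten example compounds are hypothetical.

```python
def percent_actives_found(scores, is_active, top_percent):
    """Percent of known ligands recovered in the top `top_percent` of the
    database when compounds are ranked by docking score (lower = better)."""
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0])
    n_top = max(1, round(len(ranked) * top_percent / 100.0))
    found = sum(active for _, active in ranked[:n_top])
    total = sum(is_active)
    return 100.0 * found / total


# Hypothetical docking scores and labels (1 = known ligand, 0 = decoy).
scores = [-10.0, -9.0, -8.5, -8.0, -7.5, -7.0, -6.5, -6.0, -5.5, -5.0]
is_active = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]

top20 = percent_actives_found(scores, is_active, top_percent=20)
top50 = percent_actives_found(scores, is_active, top_percent=50)
print(top20, top50)
```

Sweeping `top_percent` over a range of values traces out one enrichment curve per protein conformation.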
A machine learning approach
• Protein kinases have over 900 protein products
• Protein kinases regulate the majority of cellular pathways and signal transduction
• The deregulation of kinases has been implicated in many disease states, especially cancers
• Because kinases are highly similar in sequence and structure, selectivity is a huge challenge for drug design
• This leads to off-target effects that may be extremely toxic if a drug interacts with kinases that are normally expressed and not implicated in the disease the drug is intended to treat
Many kinases are both drug targets and causes of ADRs
Kinase Data
• Tyrosine-protein kinase Lck (LCK) is implicated as a drug target in many cancers and is also known to have toxic effects when unintentionally targeted.
• DUD-E is designed to help benchmark molecular docking programs by providing challenging decoys. It contains:
  • 22,886 active compounds and their affinities against 102 targets, an average of 224 ligands per target
  • 50 decoys for each active, with similar physico-chemical properties but dissimilar 2-D topology
• DUD-E is provided by the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF)
What are we trying to do? Binary classification
• Protein-drug non-interaction: decoy non-binders
• Protein-drug interaction: active binders
LCK: 7 conformations from molecular dynamics
Inputs: drug1:protein1,1 ... drugn:protein1,7, each labeled 0 or 1
Machine learning: PREDICT protein + active drug complex vs. protein + decoy drug complex
Output: new binding prediction
LCK Conformations
Data Collection
Binding Features: ensemble docking to 7 conformations of LCK (score-only)
Drug Features: Dragon features for DUD-E active and decoy sets
Protein Features: features on 3-D structure (Coach and 2struc)
Binding Features Scoring of generated conformations
Huey, R., Morris, G. M., Olson, A. J. and Goodsell, D. S. (2007), A semiempirical free energy force field with charge-based desolvation. J. Comput. Chem., 28: 1145–1152.
Drug Features
Dragon features for DUD-E active and decoy sets • Dragon calculates over 5,000 molecular descriptors, including the simplest atom types, functional groups and fragment counts, topological and geometrical descriptors, and three-dimensional descriptors. It also includes several property estimations such as logP and drug-like alerts such as Lipinski's alert.
Protein Features
Features on 3-D structure • Coach: generates ligand binding site predictions using two comparative methods, TM-SITE and S-SITE, which recognize ligand-binding templates • 2struc: secondary structure from 3D model (not predicted secondary structure content from primary sequence)
Full Dataset Labels • Active compounds = 1 and decoy compounds = 0, for all conformations
Random Forest Regressor for Feature Selection
There are 28 protein features that have a weight greater than zero when selected for classification of the protein conformation
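One way to carry out this kind of selection is sketched below, assuming scikit-learn's `RandomForestRegressor` (the slides do not name the library). The data are synthetic stand-ins, not the LCK protein features; a constant column is included so that at least one feature ends up with zero weight.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 samples, 10 features.
# Only features 0 and 3 drive the target; feature 9 is constant.
X = rng.normal(size=(200, 10))
X[:, 9] = 0.0
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Keep every feature whose importance weight is greater than zero.
importances = forest.feature_importances_
selected = [i for i, w in enumerate(importances) if w > 0.0]
top_two = sorted(np.argsort(importances)[-2:].tolist())
print("selected features:", selected)
print("two most important:", top_two)
```

A feature the forest never splits on, such as the constant column here, gets a weight of exactly zero and is dropped.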
Machine Learning using kNN
• k = 1, 2, 3, 4, 5 (neighbors)
• n = 2, 5, 10 (n-fold cross-validation)
• test size = 10%, 20%, and 50%
• several distance measures
• metrics are averages of 10 runs
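The k and n grid can be sketched in plain Python as below. This is an illustrative k-nearest-neighbors classifier with n-fold cross-validation, not the code used in the study: the Euclidean distance, the tie-breaking rule, and the two-cluster toy data are all assumptions, and the separate test-size splits and repeated runs are not shown.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def knn_predict(train_X, train_y, query, k, distance=euclidean):
    """Majority vote among the k nearest training points (binary labels)."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: distance(train_X[i], query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0  # ties go to class 0 (assumption)


def cross_validate(X, y, k, n_folds, seed=0):
    """Mean accuracy of k-NN over n_folds shuffled folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds


# Toy data (assumption): two well-separated clusters of ten points each.
X = [(0.1 * i, 0.0) for i in range(10)] + [(10.0 + 0.1 * i, 0.0) for i in range(10)]
y = [0] * 10 + [1] * 10

results = {(k, n): cross_validate(X, y, k, n)
           for k in (1, 2, 3, 4, 5) for n in (2, 5, 10)}
for (k, n), accuracy in sorted(results.items()):
    print(f"k={k} n={n} accuracy={accuracy:.2f}")
```

Swapping in a different `distance` function is how the "several distance measures" would enter this sketch.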
Tested Models
Model 1: each protein conformation separate
Model 2: all entries for every drug combined with each conformation
Model 3: one entry per drug, the one with the best overall docking score
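The three dataset layouts might be built as in the sketch below. The slides do not spell out the table layout, so this is one possible reading: in particular, Model 2 is assumed here to hold one row per drug with the docking scores from every conformation, and "best" is assumed to mean the lowest score. The drug names and scores are hypothetical.

```python
from collections import defaultdict

# Hypothetical docking results: (drug, conformation index, docking score, label).
rows = [
    ("drugA", 1, -9.1, 1), ("drugA", 2, -7.4, 1),
    ("drugB", 1, -6.0, 0), ("drugB", 2, -6.8, 0),
]

# Model 1: a separate dataset for each protein conformation.
model1 = defaultdict(list)
for drug, conf, score, label in rows:
    model1[conf].append((drug, score, label))

# Collect every drug's scores, keyed by conformation index.
per_drug = defaultdict(dict)
labels = {}
for drug, conf, score, label in rows:
    per_drug[drug][conf] = score
    labels[drug] = label

# Model 2 (assumed layout): one row per drug holding the scores from
# every conformation, ordered by conformation index.
model2 = [(drug, [scores[c] for c in sorted(scores)], labels[drug])
          for drug, scores in per_drug.items()]

# Model 3: one row per drug, keeping only the conformation with the
# best (assumed lowest) docking score.
model3 = []
for drug, scores in per_drug.items():
    best_conf = min(scores, key=scores.get)
    model3.append((drug, best_conf, scores[best_conf], labels[drug]))

print(dict(model1))
print(model2)
print(model3)
```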
Reported Metrics
Model 1
MD = molecular docking ML = machine learning
Binding Site Differences
Conformation 1 Conformation 2
Model 2 (all)
• Adding protein features does not help in any metric • The best conformation in Model 1 is better than Model 2, but no prior information is needed on a “good” conformation here • All ML metrics are better than MD metrics MD = molecular docking ML = machine learning
Model 2 (all)/Model 3 (best by docking)
Model 2 has 7 X the binding features
Model 3 (best by docking)
• Forcing all binding features into the model decreases performance
• Even when using only the binding features, ML models do better
• Basically making a custom LCK scoring function
• MD recall with binding features alone is the only metric better than its ML counterpart, but it comes with a larger drop in precision
• That is, MD recovers a larger number of active compounds but with more false positives
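The recall versus precision trade-off in the last two bullets can be illustrated with a small sketch. The labels and predictions below are made up for illustration and are not results from the study.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = active, 0 = decoy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: four actives followed by six decoys.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cautious = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # few predictions, all correct
liberal = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # every active found, many false positives

print("cautious:", precision_recall(y_true, cautious))
print("liberal: ", precision_recall(y_true, liberal))
```

The liberal predictor recovers more actives (higher recall) at the cost of more false positives (lower precision), which is the pattern the bullets describe for docking alone.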
Conclusions
• Usually drug features are most predictive
• Best results come from having one conformation whose binding features correlate with the prediction
• Keeping all binding features does not improve results, though
• All models do better than docking
See our results developing a single model for a family of related proteins, which includes features based on the protein primary sequence.
Acknowledgements
UKY Students: William Jones, Fatemah Alghamedy, Jeevith Bopaiah
UKY Collaborators: Heidi Weiss, PhD; Nathan Jacobs, PhD
MCC Cancer Research Informatics SRF
CCTS KL2TR000116 and 1KL2TR001996-01
This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the BLUFF.
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This research used resources provided through the UK Center for Computational Sciences.
Thank you! Email me at:
[email protected]