Incorporating Protein Dynamics Through Ensemble

Incorporating Protein Dynamics Through Ensemble Docking in Machine Learning Models to Predict Drug Binding S18: Oral Presentations - 'Omics Innovations in Methods and Applications

Sally R. Ellingson Cancer Research Informatics, Markey Cancer Center Biomedical Informatics, University of Kentucky Twitter: #AMIA2017

Disclosure I have no relevant relationships with commercial interests to disclose.

AMIA 2017 | amia.org


Learning Objectives
After participating in this session the learner should be better able to:
• Learn about ensemble docking
• Demonstrate how ensemble docking results can be used in machine learning models to predict drug binding


Motivation

Drug discovery is expensive and hard! In 2010, the estimated cost of bringing a new drug to market was $1.8 billion and steadily rising. This economic burden has not been relieved: a late 2014 estimate rose to $2.6 billion.

The cost to bring a drug to market must compensate for failed drugs

Motivation

Drug discovery is expensive and hard! First step - discovery of a therapeutic method. • Biochemical systems are not completely understood.

Biochemical pathway chart

Computational Solutions With the number of different proteins in humans and the genetic variations observable in the population, a full understanding of all possible interactions through experiments and clinical testing alone is infeasible, making computational investigations particularly useful and relevant.


Introduction to Ensemble Docking
• Proteins are not static structures
• Proteins are dynamic and may exist in many conformations, each of which may be druggable
• Different binding sites may exist in different conformations

Molecular Dynamics
Trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between particles and the potential energy are defined by molecular mechanics force fields.

Equations must be calculated for every frame in the trajectory and time steps must be very small to keep the system stable = lots of calculations!!!
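To make the "small time steps, many calculations" point concrete, here is a minimal sketch of numerical integration of Newton's equations using the velocity Verlet scheme. It is an illustration only: the single particle, the harmonic force, and all names (`velocity_verlet`, `harmonic_force`) are assumptions for this example, not the force field or MD engine used in the talk.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equation of motion for one particle in 1-D.

    Returns the list of positions (one per frame) and the final velocity.
    """
    trajectory = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force must be recomputed every frame
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append(x)
    return trajectory, v


def harmonic_force(x, k=1.0):
    """Toy stand-in for a force field: a spring pulling toward x = 0."""
    return -k * x


# 1000 small steps are needed just to cover 10 time units of a single particle.
traj, v_final = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                                mass=1.0, dt=0.01, n_steps=1000)
print(len(traj), traj[-1], math.cos(10.0))
```

A real protein simulation repeats the force evaluation over every interacting atom pair at each step, which is why long trajectories are so costly.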

Molecular Dynamics

Simulating protein movements using Anton could aid drug design. SCIENCE/AAAS

Binding of cancer drug dasatinib to target, Src kinase How does a drug molecule find its target binding site? Shan et al. (2011) JACS. (Anton)

Molecular Dynamics

Takes days of computer time on special-purpose hardware.

Ensemble Docking Pipeline Molecular dynamics • generation of protein dynamics

Clustering • gromos pairwise RMSD-based clustering of each frame in the trajectory to extract significantly different conformations

Molecular Docking • conformational search of chemical compound with simplified scoring function to predict binding
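The clustering step can be sketched as follows. This is a simplified, pure-Python version of the GROMOS-style procedure as it is usually described: count each frame's neighbors within an RMSD cutoff, take the frame with the most neighbors as a cluster center, remove that cluster, and repeat. The function name, the toy distance matrix, and the cutoff are assumptions for illustration; in practice the pairwise RMSD matrix would come from the trajectory.

```python
def gromos_cluster(rmsd, cutoff):
    """Greedy RMSD-based clustering.

    rmsd   : square matrix (list of lists) of pairwise RMSD values
    cutoff : frames within this RMSD of a center join its cluster
    Returns a list of (center_frame, member_frames) tuples.
    """
    remaining = set(range(len(rmsd)))
    clusters = []
    while remaining:
        best_center, best_members = None, []
        for i in sorted(remaining):
            # Neighbors of frame i among frames not yet assigned (includes i itself).
            members = [j for j in sorted(remaining) if rmsd[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        remaining -= set(best_members)
    return clusters


# Toy example: six "frames" placed on a line, distance standing in for RMSD.
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
rmsd = [[abs(a - b) for b in positions] for a in positions]
clusters = gromos_cluster(rmsd, cutoff=0.15)
print(clusters)
```

Each cluster center is then a candidate protein conformation to dock against.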

Molecular Docking
• Predicts conformation of a protein-ligand complex
• Predicts binding affinity of the ligand to the protein
(+) Reproduce correct bound conformation
(+) Assign better scores to high-affinity ligands than to decoys (enrichment)
(-) Generate scores that correlate with measured binding affinities

Docking engines such as AutoDock4 and AutoDock Vina allow for virtual binding experiments with atomic-level detail on a desktop computer.

Diller, D. J. and Merz, K. M. (2001), High throughput docking for library design and library prioritization. Proteins, 43: 113–124.

Ensemble Docking

How do you reduce the degrees of freedom of protein dynamics?

2000 snapshots from a 200ns molecular dynamics simulation

40 distinct states identified with clustering

red indicates the binding site region

Enrichment

[Figure: Enrichment plots (Src). Y-axis: Percent of Known Ligands Found (0 to 25). X-axis: Percent of Ranked Database (0 to 5). Labeled series: xtal.]
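The quantity on an enrichment plot can be computed as in the sketch below: rank the whole database by docking score, take the top fraction, and report what percentage of the known ligands falls inside it. The function name and the toy scores are hypothetical, and the code assumes that a lower (more negative) docking score means a better predicted binder.

```python
import math


def percent_actives_found(scores, labels, top_percent):
    """Percent of known ligands recovered in the top `top_percent` of the ranking.

    scores : docking scores (assumed: lower is better)
    labels : 1 for a known ligand, 0 for a decoy
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0])
    n_top = max(1, math.ceil(len(ranked) * top_percent / 100))
    found = sum(label for _, label in ranked[:n_top])
    return 100.0 * found / sum(labels)


# Toy database of 20 compounds with 4 known ligands (made-up values).
scores = [-10.0 + 0.5 * i for i in range(20)]
labels = [1 if i in (0, 2, 7, 15) else 0 for i in range(20)]

top10 = percent_actives_found(scores, labels, top_percent=10)
top50 = percent_actives_found(scores, labels, top_percent=50)
print(top10, top50)
```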

We want to do even better and reliably

A machine learning approach
• Protein kinases have over 900 protein products
• Protein kinases regulate the majority of cellular pathways and signal transduction
• The deregulation of kinases has been implicated in many disease states, especially in cancers
• Due to the high similarity in sequence and structure between kinases, selectivity is a huge challenge for drug design
• This leads to off-target effects that may be extremely toxic if drugs interact with kinases that are normally expressed and not implicated in the disease the drug is intended to treat

Many kinases are both drug targets and causes of ADRs

Kinase Data
• Tyrosine-protein kinase Lck (LCK) is implicated as a drug target in many cancers and is also known to have toxic effects when unintentionally targeted.

DUD-E is designed to help benchmark molecular docking programs by providing challenging decoys. It contains:
• 22,886 active compounds and their affinities against 102 targets, an average of 224 ligands per target
• 50 decoys for each active, having similar physico-chemical properties but dissimilar 2-D topology

DUD-E is provided by the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF).

What are we trying to do? Binary classification Protein-Drug non-Interaction: Decoy Non-Binders

Protein-Drug Interaction: Active Binders

LCK: 7 conformations from molecular dynamics

Each drug-conformation pair, from drug 1 with conformation 1 through drug n with conformation 7, carries a binary label (0 or 1).

Machine Learning

PREDICT protein + active drug complex vs protein + decoy drug complex

**New binding prediction

LCK Conformations

Data Collection

• Binding Features: ensemble docking to 7 conformations of LCK (score-only)
• Drug Features: Dragon features for DUD-E active and decoy sets
• Protein Features: features on 3-D structure (Coach and 2struc)

Binding Features Scoring of generated conformations

Huey, R., Morris, G. M., Olson, A. J. and Goodsell, D. S. (2007), A semiempirical free energy force field with charge-based desolvation. J. Comput. Chem., 28: 1145–1152.

Drug Features

Dragon features for DUD-E active and decoy sets
• Dragon calculates over 5,000 molecular descriptors, including the simplest atom types, functional groups and fragment counts, topological and geometrical descriptors, and three-dimensional descriptors. It also includes several property estimations, such as logP, and drug-like alerts, such as Lipinski’s alert.

Protein Features

Features on 3-D structure
• Coach: generates ligand binding site predictions using two comparative methods, TM-SITE and S-SITE, which recognize ligand-binding templates
• 2struc: secondary structure from 3D model (not predicted secondary structure content from primary sequence)

Full Dataset Labels • Active compounds = 1 and decoy compounds = 0, for all conformations

Random Forest Regressor for Feature Selection

There are 28 protein features that have a weight greater than zero when selected for classification of the protein conformation
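A minimal sketch of this kind of feature selection, assuming scikit-learn's `RandomForestRegressor` and keeping every feature whose importance is greater than zero. The toy data, the hyperparameters, and the variable names are illustrative assumptions; the slides do not give the actual settings or target encoding.

```python
from sklearn.ensemble import RandomForestRegressor

# Toy data with three hypothetical features per row:
#   feature 0 determines the target, feature 1 is unrelated, feature 2 is constant.
X = [[i % 2, (i * 7) % 5, 1.0] for i in range(40)]
y = [i % 2 for i in range(40)]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Keep the features whose importance ("weight") is greater than zero.
importances = forest.feature_importances_
selected = [idx for idx, weight in enumerate(importances) if weight > 0]
print(selected)
```

A constant feature can never be used for a split, so its importance is exactly zero and it is dropped.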

Machine Learning using kNN
• k = 1, 2, 3, 4, 5 (neighbors)
• n = 2, 5, 10 (n-fold cross-validation)
• Test size = 10%, 20%, and 50%
• Several distance measures
• Metrics are averages of 10 runs
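The k-nearest-neighbors setup with n-fold cross-validation can be sketched in plain Python as below. This is an illustration under stated assumptions: Euclidean distance is only one of the possible distance measures, the folds are assigned by index rather than shuffled or stratified, and the two-cluster toy data and function names are made up.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cross_validate(X, y, k, n_folds):
    """Mean accuracy over n_folds folds (row i goes to fold i % n_folds)."""
    accuracies = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / n_folds


# Toy data: decoys (label 0) near the origin, actives (label 1) far away.
X, y = [], []
for i in range(10):
    X.append([0.1 * i, 0.0])
    y.append(0)
    X.append([10.0 + 0.1 * i, 0.0])
    y.append(1)

results = {k: cross_validate(X, y, k=k, n_folds=5) for k in (1, 2, 3, 4, 5)}
print(results)
```

With real, imbalanced data (50 decoys per active) the folds would need to be shuffled and ideally stratified so that each fold contains both classes.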

Tested Models

• Model 1: each protein conformation separate
• Model 2: all entries for every drug combined with each conformation
• Model 3: one entry per drug, the one with the best overall docking score
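One possible reading of the three models, sketched as dataset construction. This is an interpretation, not the authors' code: it treats each (drug, conformation) pair as one entry, assumes that the lowest docking score is the "best", and uses made-up drug names and scores with three conformations instead of seven.

```python
# Hypothetical docking scores: drug -> one score per protein conformation.
scores = {
    "drug_a": [-7.1, -8.3, -6.0],
    "drug_b": [-5.2, -4.9, -5.5],
}
labels = {"drug_a": 1, "drug_b": 0}  # 1 = active, 0 = decoy


def model1_entries(scores, labels, conformation):
    """Model 1: a separate dataset for a single conformation."""
    return [(drug, conformation, s[conformation], labels[drug])
            for drug, s in scores.items()]


def model2_entries(scores, labels):
    """Model 2: every drug paired with every conformation, pooled together."""
    return [(drug, conf, score, labels[drug])
            for drug, s in scores.items()
            for conf, score in enumerate(s)]


def model3_entries(scores, labels):
    """Model 3: one entry per drug, the conformation with the best (lowest) score."""
    entries = []
    for drug, s in scores.items():
        best_conf = min(range(len(s)), key=lambda c: s[c])
        entries.append((drug, best_conf, s[best_conf], labels[drug]))
    return entries


m1 = model1_entries(scores, labels, conformation=0)
m2 = model2_entries(scores, labels)
m3 = model3_entries(scores, labels)
print(len(m1), len(m2), len(m3))
```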

Reported Metrics

Model 1

MD = molecular docking ML = machine learning

Binding Site Differences

Conformation 1 Conformation 2

Model 2 (all)

• Adding protein features does not help in any metric
• The best conformation in Model 1 is better than Model 2, but no prior information is needed on a “good” conformation here
• All ML metrics are better than MD metrics

MD = molecular docking, ML = machine learning

Model 2 (all)/Model 3 (best by docking)

Model 2 has 7 X the binding features

Model 3 (best by docking)

MD = molecular docking ML = machine learning

• Forcing all binding features in the model decreases performance
• Even when using only the binding features, ML models do better
• Basically making a custom LCK scoring function
• MD recall with binding features alone is the only metric better than its ML counterpart, but it comes with a larger drop in precision
• Docking recovers a larger number of active compounds but with more false positives
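The trade-off between recall and precision can be illustrated with a small sketch. The predictions below are invented to show the pattern (more actives recovered, but more false positives); they are not results from the talk, and the function name is hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1 = active binder)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 4 actives followed by 6 decoys (made-up labels and predictions).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
permissive = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # finds every active, 3 false positives
strict = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]      # misses one active, no false positives

permissive_pr = precision_recall(y_true, permissive)
strict_pr = precision_recall(y_true, strict)
print(permissive_pr, strict_pr)
```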

Conclusions
• Usually drug features are most predictive
• Best results come from having one conformation whose binding features correlate with the prediction
• Keeping all binding features does not improve results, though
• All models do better than docking

See our results developing a single model for a family of related proteins, which includes features based on the protein primary sequence.

Acknowledgements
UKY Students: William Jones, Fatemah Alghamedy, Jeevith Bopaiah
UKY Collaborators: Heidi Weiss, PhD; Nathan Jacobs, PhD
MCC Cancer Research Informatics SRF
CCTS KL2TR000116 and 1KL2TR001996-01
This work was supported in part by the U.S. Department of Energy, Office of Science, Office of Workforce Development for Teachers and Scientists (WDTS) under the BLUFF.
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
This research used resources provided through the UK Center for Computational Sciences.


Thank you! Email me at: [email protected]