Probabilistic Divergence Measures for Detecting Interspecies Recombination

Dirk Husmeier & Frank Wright
Biomathematics and Statistics Scotland at the Scottish Crop Research Institute
Invergowrie, Dundee DD2 5DA, UK
Email: [email protected]
http://www.bioss.ac.uk/~dirk

→ Phylogenetics
→ Sporadic recombination
→ Statistical detection methods

slide-1

Phylogenetics

Frog  Chicken  Human  Rabbit  Mouse  Opossum
G     G        G      G       G      G
C     C        C      C       C      C
T     G        G      G       G      G
T     T        T      T       T      T
G     A        C      C       C      C
A     A        A      A       A      A
C     C        C      C       C      C
T     T        T      T       T      T
T     T        T      T       T      T
C     C        G      G       G      G
T     A        A      A       A      A
G     C        G      G       C      G
A     A        A      A       A      A
G     T        C      C       G      C
G     G        G      G       G      G
T     A        C      C       C      C
T     T        T      T       T      T

[Figure: phylogenetic tree with leaves Rabbit, Mouse, Opossum, Human, Chicken, Frog]

→ Topology
→ Branch lengths

slide-2

A Probabilistic Model of Evolution 1

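Transition-transversion ratio = 2

[Figure: left, nucleotide A with arrows to A, G, C, T labelled P(A|A,w), P(G|A,w), P(C|A,w), P(T|A,w); right, plot of P(A|A,w), P(G|A,w), and P(C|A,w) = P(T|A,w) against branch length w (0 to 4), vertical axis 0 to 1]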
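The curves on this slide can be reproduced with a short sketch. The slide does not name the substitution model, so the code below assumes Kimura's two-parameter model, with the transition-transversion ratio defined as alpha / (2 * beta) and w measured in expected substitutions per site. The function name `k2p_probabilities` is illustrative only.

```python
import math


def k2p_probabilities(w, ratio=2.0):
    """Substitution probabilities under Kimura's two-parameter model (assumed).

    w     : branch length in expected substitutions per site
    ratio : transition-transversion ratio, alpha / (2 * beta)

    Returns (P_same, P_transition, P_each_transversion), e.g. for a parent A:
    P(A|A,w), P(G|A,w), and P(C|A,w) = P(T|A,w).
    """
    alpha_t = w * ratio / (ratio + 1.0)   # transition rate times time
    beta_t = w / (2.0 * (ratio + 1.0))    # rate of each transversion times time
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1
    return p_same, p_transition, p_transversion


for w in (0.0, 0.5, 1.0, 2.0, 4.0):
    p_aa, p_ga, p_ca = k2p_probabilities(w)
    print(f"w={w:3.1f}  P(A|A)={p_aa:.3f}  P(G|A)={p_ga:.3f}  P(C|A)=P(T|A)={p_ca:.3f}")
```

Under these assumptions all four probabilities approach 0.25 as w grows, and the transition probability stays above the transversion probability for every w > 0.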

slide-3

A Probabilistic Model of Evolution 2

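[Figure: four-taxon tree with leaves y1, y2, y3, y4, hidden nodes z1, z2, and branch lengths w1 to w5; y1 and y2 attach to z1 via w2 and w1, z2 attaches to z1 via w3, y3 and y4 attach to z2 via w4 and w5]

$$P(y_1, y_2, y_3, y_4, z_1, z_2 | w) = P(y_1|z_1, w_2)\, P(y_2|z_1, w_1)\, P(z_2|z_1, w_3)\, P(y_3|z_2, w_4)\, P(y_4|z_2, w_5)$$

$$P(y_1, y_2, y_3, y_4 | w) = \sum_{z_1} \sum_{z_2} P(y_1, y_2, y_3, y_4, z_1, z_2 | w)$$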
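The marginalisation over the hidden nodes z1 and z2 can be sketched as a double loop. This is an illustration under stated assumptions: Kimura's two-parameter model stands in for the unnamed substitution model, and a uniform prior of 0.25 over z1 is added so that the result is a normalised probability (the factorisation on the slide does not show a prior). The names `transition_prob` and `site_likelihood` are hypothetical.

```python
import math

NUCLEOTIDES = "AGCT"
PURINES = {"A", "G"}


def transition_prob(child, parent, w, ratio=2.0):
    """P(child | parent, w) under Kimura's two-parameter model (assumed)."""
    alpha_t = w * ratio / (ratio + 1.0)
    beta_t = w / (2.0 * (ratio + 1.0))
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    if child == parent:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (child in PURINES) == (parent in PURINES):
        return 0.25 + 0.25 * e1 - 0.5 * e2   # transition
    return 0.25 - 0.25 * e1                  # transversion


def site_likelihood(y, w, root_prior=0.25):
    """P(y1, y2, y3, y4 | w), summing the joint over hidden nodes z1 and z2.

    y = (y1, y2, y3, y4) are the observed nucleotides at one site,
    w = (w1, w2, w3, w4, w5) are the branch lengths, indexed as on the slide.
    root_prior is an added assumption (uniform prior over z1).
    """
    y1, y2, y3, y4 = y
    w1, w2, w3, w4, w5 = w
    total = 0.0
    for z1 in NUCLEOTIDES:
        for z2 in NUCLEOTIDES:
            total += (root_prior
                      * transition_prob(y1, z1, w2)
                      * transition_prob(y2, z1, w1)
                      * transition_prob(z2, z1, w3)
                      * transition_prob(y3, z2, w4)
                      * transition_prob(y4, z2, w5))
    return total


print(site_likelihood("AAGG", (0.1, 0.1, 0.3, 0.1, 0.1)))
```

For an alignment, the per-site likelihoods would be multiplied (or their logarithms summed) over columns, assuming sites evolve independently.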

slide-4

Statistical Approach to Phylogenetics

[Figure: the same six-species alignment (Frog, Chicken, Human, Rabbit, Mouse, Opossum), with the column G A C C C C mapped onto the leaves of the tree: Frog G, Chicken A, Human C, Rabbit C, Mouse C, Opossum C]

→ Likelihood: topology, branch lengths

slide-5

Recombination

slide-6

Recombination in HIV

[Figure: trees for the env and gag genes with leaves labelled A, B, C, D, E, F, G, H; the recombinant strain ZR-VI 191 is marked]

slide-7

PLATO (Grassly, Holmes)

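$$Q = \frac{\text{average } L \text{ (window)}}{\text{average } L \text{ (remainder)}}$$

→ Find maximum Q values
→ Test significance with parametric bootstrapping

[Figure: window sliding along the alignment, with the four-taxon tree topologies (taxa 1 to 4) that hold in different regions]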
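The Q statistic can be sketched as follows. This is a minimal illustration and not the PLATO program itself: it assumes the per-site log-likelihoods L under the reference tree are already available as a list, scans a single window size, and omits the parametric bootstrap. The function names are hypothetical.

```python
def plato_q(site_loglik, start, end):
    """Ratio of the average per-site L inside [start, end) to the average outside."""
    values = list(site_loglik)
    window = values[start:end]
    remainder = values[:start] + values[end:]
    if not window or not remainder:
        raise ValueError("window and remainder must both be non-empty")
    return (sum(window) / len(window)) / (sum(remainder) / len(remainder))


def max_q_window(site_loglik, window_size):
    """Return (Q, start) for the window position with the largest Q."""
    values = list(site_loglik)
    best = None
    for start in range(len(values) - window_size + 1):
        q = plato_q(values, start, start + window_size)
        if best is None or q > best[0]:
            best = (q, start)
    return best


# Toy data: sites 10 to 14 fit the reference tree poorly (lower log-likelihood).
toy = [-1.0] * 10 + [-3.0] * 5 + [-1.0] * 10
print(max_q_window(toy, window_size=5))
```

Because log-likelihoods are negative, a window that fits the reference tree worse than the rest of the alignment yields Q greater than 1 in this sketch.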

slide-8

Shortcoming of PLATO

→ Need a reference tree
→ Obtained with global maximum likelihood

[Figure: the reference tree shown against the four-taxon tree topologies (taxa 1 to 4) that hold in different regions of the alignment]

slide-9

TOPAL (McGuire, Wright)

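[Figure: a window sliding along the alignment; window positions are labelled "DSS small" or "DSS large"]

$$\mathrm{SoS} = \sum_i \sum_k (d_{ik} - \hat{d}_{ik})^2$$

$$\mathrm{DSS} = |\mathrm{SoS}_{\mathrm{left}} - \mathrm{SoS}_{\mathrm{right}}|$$

i, k: labels for taxa
$\hat{d}_{ik}$: fitted distances (Fitch or Neighbour Joining)
$d_{ik}$: true distances

slide-10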
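The SoS and DSS definitions can be illustrated for four taxa. This is a sketch under several assumptions and not the TOPAL implementation: the topology ((0,1),(2,3)) is fixed in advance instead of being estimated with Fitch or Neighbour Joining, branch lengths are fitted by unconstrained least squares, and the sum runs over pairs with i < k. The names `PATHS`, `sum_of_squares`, and `dss` are illustrative.

```python
import numpy as np

# Rows: taxon pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
# Columns: pendant branches of taxa 0..3, then the internal branch,
# for the fixed topology ((0,1),(2,3)). A 1 means the branch lies on the path.
PATHS = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)


def sum_of_squares(distances):
    """SoS between pairwise distances d_ik and the least-squares tree fit."""
    d = np.asarray(distances, dtype=float)
    branches, *_ = np.linalg.lstsq(PATHS, d, rcond=None)
    fitted = PATHS @ branches
    return float(np.sum((d - fitted) ** 2))


def dss(left_distances, right_distances):
    """DSS = |SoS_left - SoS_right| for the two halves of a window."""
    return abs(sum_of_squares(left_distances) - sum_of_squares(right_distances))


# Left half: distances from the tree ((0,1),(2,3)) with unit branch lengths.
left = PATHS @ np.ones(5)
# Right half: distances from the tree ((0,2),(1,3)) with unit branch lengths.
right = np.array([3.0, 2.0, 3.0, 3.0, 2.0, 3.0])
print(dss(left, right))
```

When both halves fit the same topology the two SoS values are similar and DSS is small; when the right half follows a different topology, its SoS grows and DSS becomes large.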

[Figure: four-taxon trees over taxa 1 to 4 along the alignment, with the three possible topologies numbered 1), 2), 3)]

slide-11

Detection of Recombination with MCMC

[Figure: five panels along the alignment, each showing a posterior distribution P(S|D) over tree topologies S]

slide-12

Marginal Posterior Distribution over Tree Topologies with MCMC

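[Figure: window D_t of the alignment and the posterior distribution P(S|D_t) over tree topologies S]

$$P(S|D_t) := \int P(S, w|D_t)\, dw$$

MCMC → Sample: $\{S_{ti}, w_{ti}\}_{i=1}^{N}$

$$P(S, w|D_t) \approx \frac{1}{N} \sum_{i=1}^{N} \delta_{S,S_{ti}}\, \delta(w - w_{ti})$$

$$P(S|D_t) = \frac{1}{N} \sum_{i=1}^{N} \delta_{S,S_{ti}} = \frac{N_S(t)}{N}$$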
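The last equation says that the marginal posterior of a topology is estimated by its relative frequency in the MCMC sample. A minimal sketch, assuming the sampler has already been run for window D_t and that the sampled topologies are given as hashable labels (the Newick-like strings below are illustrative):

```python
from collections import Counter


def topology_posterior(sampled_topologies):
    """Estimate P(S | D_t) as N_S(t) / N from the sampled topologies.

    Branch lengths are simply ignored, which corresponds to integrating
    them out in the sample-based approximation.
    """
    n = len(sampled_topologies)
    if n == 0:
        raise ValueError("need at least one MCMC sample")
    counts = Counter(sampled_topologies)
    return {topology: count / n for topology, count in counts.items()}


samples = ["((1,2),(3,4))"] * 7 + ["((1,3),(2,4))"] * 2 + ["((1,4),(2,3))"]
for topology, prob in topology_posterior(samples).items():
    print(topology, prob)
```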

slide-13

Divergence between Distributions

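[Figure: windows D_t and D_t' of the alignment with their distributions P(S|D_t) and Q(S|D_t') over topologies S, and the difference between them]

Divergence measure in probability space: Kullback-Leibler divergence

$$KL(P, Q) = \sum_S P_S \ln\left(\frac{P_S}{Q_S}\right)$$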

slide-14
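A minimal sketch of the Kullback-Leibler divergence between two distributions over topologies, each stored as a dictionary from topology label to probability. The zero-handling conventions (terms with P_S = 0 contribute nothing, and Q_S = 0 with P_S > 0 gives infinity) are standard but are not stated on the slide.

```python
import math


def kl_divergence(p, q):
    """KL(P, Q) = sum over S of P_S * ln(P_S / Q_S)."""
    total = 0.0
    for topology, p_s in p.items():
        if p_s == 0.0:
            continue                      # 0 * ln(0 / q) is taken as 0
        q_s = q.get(topology, 0.0)
        if q_s == 0.0:
            return math.inf               # P puts mass where Q has none
        total += p_s * math.log(p_s / q_s)
    return total


p = {"S1": 0.5, "S2": 0.5}
q = {"S1": 0.25, "S2": 0.75}
print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which shows that the divergence is not symmetric in its arguments.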

Local and Global Divergence Measures

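[Figure: global divergence compares the distribution P(S|D_t) for window D_t with the distribution over the whole alignment; local divergence compares P(S|D_t) with P(S|D_t') for a neighbouring window D_t']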

slide-15

Divergence Measures and Statistical Significance

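Divergence between the distribution over the window, $P_S(t)$, and the average distribution, $\bar{P} = \frac{1}{W} \sum_{t=1}^{W} P_S(t)$:

$$d[P_S(t), \bar{P}] = \sum_S P_S(t) \ln\left(\frac{P_S(t)}{\bar{P}_S}\right)$$

Divergence between the distributions over two adjacent windows, $P_S(t)$ and $P_S(t')$, where $\tilde{P}_S = \frac{P_S(t) + P_S(t')}{2}$ (Sibson):

$$d[P_S(t), P_S(t')] = \frac{1}{2} \sum_S \left[ P_S(t) \ln\left(\frac{P_S(t)}{\tilde{P}_S}\right) + P_S(t') \ln\left(\frac{P_S(t')}{\tilde{P}_S}\right) \right]$$

Null hypotheses: $P_S(t) = \bar{P}_S$ and $P_S(t) = P_S(t')$

$$2N\, d[P_S(t), \bar{P}] \to \chi^2(\nu - 1), \qquad \nu = |\mathrm{Support}(\bar{P})|$$

$$2N\, d[P_S(t), P_S(t')] \to \chi^2(\tilde{\nu} - 1), \qquad \tilde{\nu} = |\mathrm{Support}(\tilde{P})|$$

slide-16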
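Both divergence measures and the test statistics can be sketched in a few lines. The code assumes that each window's distribution is a dictionary from topology label to probability and that N is the number of MCMC samples per window (the slide does not define N). It returns the statistic 2Nd and the degrees of freedom; looking up the chi-square tail probability is left out. All function names are illustrative.

```python
import math


def _kl(p, q):
    """Sum of P_S * ln(P_S / Q_S); q must be positive wherever p is."""
    return sum(p_s * math.log(p_s / q[s]) for s, p_s in p.items() if p_s > 0.0)


def average_distribution(window_dists):
    """P-bar: mean of the window distributions P_S(t), t = 1..W."""
    topologies = set().union(*window_dists)
    w = len(window_dists)
    return {s: sum(p.get(s, 0.0) for p in window_dists) / w for s in topologies}


def global_divergence(p_t, p_bar):
    """d[P_S(t), P-bar]: divergence of one window from the average."""
    return _kl(p_t, p_bar)


def local_divergence(p_t, p_t2):
    """Sibson divergence between two windows; also returns the midpoint P-tilde."""
    mid = {s: 0.5 * (p_t.get(s, 0.0) + p_t2.get(s, 0.0)) for s in set(p_t) | set(p_t2)}
    return 0.5 * (_kl(p_t, mid) + _kl(p_t2, mid)), mid


def chi2_statistic(d, n_samples, reference):
    """Return (2 * N * d, degrees of freedom), with dof = |Support(reference)| - 1."""
    support = sum(1 for prob in reference.values() if prob > 0.0)
    return 2.0 * n_samples * d, support - 1


windows = [{"S1": 0.9, "S2": 0.1}, {"S1": 0.8, "S2": 0.2}, {"S1": 0.1, "S2": 0.9}]
p_bar = average_distribution(windows)
print(chi2_statistic(global_divergence(windows[2], p_bar), 1000, p_bar))
d_local, p_mid = local_divergence(windows[1], windows[2])
print(chi2_statistic(d_local, 1000, p_mid))
```

The local measure is always finite because the midpoint distribution is positive wherever either window distribution is positive.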

Simulation Experiment A

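[Figure: simulated phylogeny with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A0, A1, B0, B1, A, B; two recombination events, r1 and r2, are marked on the tree and along the alignment]

5000 nucleotides, window size = 500 nucleotides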

slide-17

Simulation Experiment A

[Figure: simulated phylogeny (leaves A00 to A11 and B00 to B11) with recombination events r1 and r2, alongside plots of KL and AS against alignment position (0 to 5000)]

5000 nucleotides, window size = 500 nucleotides

slide-18

Simulation Experiment A

[Figure: simulated phylogeny (leaves A00 to A11 and B00 to B11) with recombination events r1 and r2, alongside four panels labelled MCMC Global, MCMC Local, TOPAL, and PLATO]

5000 nucleotides, window size = 500 nucleotides

slide-19

Results - Simulation Experiment A

[Figure: panels of KL and AS plotted against alignment position (0 to 5000)]

slide-20

Results - Simulation Experiment A

[Figure: panels of KL, AS, DSS (Topal2), and Plato plotted against alignment position (0 to 5000)]

slide-21

Simulation Experiment B

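[Figure: simulated phylogeny with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A0, A1, B0, B1, A, B; two recombination events, r1 and r2, are marked on the tree and along the alignment]

5500 nucleotides, window size = 500 nucleotides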

slide-22

Simulation Experiment B

[Figure: simulated phylogeny (leaves A00 to A11 and B00 to B11) with recombination events r1 and r2, alongside plots of KL and AS against alignment position (0 to 6000)]

5500 nucleotides, window size = 500 nucleotides

slide-23

Simulation Experiment B: Smaller Window

[Figure: simulated phylogeny (leaves A00 to A11 and B00 to B11) with recombination events r1 and r2, alongside plots of KL and AS against alignment position (0 to 6000)]

5500 nucleotides, window size = 250 nucleotides

slide-24

Simulation Experiment B

[Figure: simulated phylogeny (leaves A00 to A11 and B00 to B11) with recombination events r1 and r2, alongside four panels labelled MCMC Global, MCMC Local, TOPAL, and PLATO]

5500 nucleotides, window size = 500 nucleotides

slide-25

Results - Simulation Experiment B

[Figure: panels of KL and AS plotted against alignment position (0 to 6000)]

slide-26

Results - Simulation Experiment B

[Figure: panels of KL, AS, DSS (Topal2), and Plato plotted against alignment position (0 to 6000)]

slide-27

Potato Virus Y

Four strains, 9700 bases, window size = 500 bases.

[Figure: four panels labelled MCMC global, MCMC local, TOPAL, and PLATO, with vertical axes KL, AS, DSS (Topal2), and Plato, plotted against position (0 to 10000)]

slide-28

Potato Virus Y: RecPars (J. Hein)

Transition cost = 2, transversion cost = 5, recombination cost: 100, 50, 10, 5

[Figure: four panels, one per recombination cost (100, 50, 10, 5), each showing the topology (1 to 3) against position (2000 to 8000)]

slide-29

Potato Virus Y

[Figure: KL and AS plotted against position (0 to 10000); trees over the strains Baulcombe, Singh, Robaglia, and Hungarian; posterior probability over topologies 1 to 3 for State 1, State 2, and State 3]

slide-30

Hepatitis B Virus

Five strains, 3050 bases, window size = 500 bases.

[Figure: four panels labelled MCMC global, MCMC local, TOPAL, and PLATO, with vertical axes KL, AS, DSS (Topal2), and Plato, plotted against position (0 to 3000)]

slide-31

Conclusions

• Sliding window: marginal posterior distribution over tree topologies, conditional on the selected subset of the alignment.
• Global divergence measure: Kullback-Leibler divergence between a local distribution and the global distribution.
• Local divergence measure: Modified Kullback-Leibler divergence between adjacent local distributions.
• Comparison with TOPAL and PLATO on several synthetic benchmark problems.
• Distinguishes between recombination and rate variation.
• Detects all recombination events.
• Hepatitis B virus: New method detects breakpoints predicted with TOPAL plus two additional breakpoints.

slide-32

Hepatitis B Virus, 10 Strains

[Figure: KL (0 to 5) and AS (0 to 8) plotted against position (0 to 3000)]

slide-33

Hepatitis B Virus: Spectrum

[Figure: spectrum plot, horizontal axis 0 to 20, vertical axis 0 to 0.25]

slide-34

Hepatitis B Virus: Pruning, K = 5

[Figure: KL (0 to 2) and AS (0 to 8) plotted against position (0 to 3000)]

slide-35

Hepatitis B Virus, Pruning:

[Figure: KL and AS plotted against position (0 to 3000) for pruning with K = 3, K = 4, K = 5, and K = ∞]

slide-36

Average Divergence Measure

• Divergence measure d[P(t), P(t + ∆t)]
• How to choose ∆t?
• Average over different degrees of overlap:

$$\bar{d} = \frac{1}{M} \sum_{m=1}^{M} d[P(t), P(t + m\Delta t)]$$

[Figure: panels plotted against alignment position (0 to 5000), vertical axis 0 to 1. From top to bottom: 0%, 50%, 90% overlap, averaging between 50% and 90%.]

slide-37
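The averaged measure can be sketched as below. The code assumes a list `dists` whose entry k is the topology distribution for the window starting at position t0 + k∆t, and it uses the symmetrised (Sibson) divergence as d; both choices are assumptions for illustration, as are the function names.

```python
import math


def sibson(p, q):
    """Symmetrised divergence between two distributions given as dicts."""
    total = 0.0
    for s in set(p) | set(q):
        p_s, q_s = p.get(s, 0.0), q.get(s, 0.0)
        mid = 0.5 * (p_s + q_s)
        if p_s > 0.0:
            total += 0.5 * p_s * math.log(p_s / mid)
        if q_s > 0.0:
            total += 0.5 * q_s * math.log(q_s / mid)
    return total


def average_divergence(dists, index, max_offset):
    """Mean of d[P(t), P(t + m * dt)] for m = 1..M, with t given by index."""
    if max_offset < 1 or index + max_offset >= len(dists):
        raise IndexError("not enough windows after this index")
    return sum(sibson(dists[index], dists[index + m])
               for m in range(1, max_offset + 1)) / max_offset


# Toy example: the supported topology switches between windows 2 and 3.
dists = [{"S1": 1.0}] * 3 + [{"S2": 1.0}] * 3
print([average_divergence(dists, k, 2) for k in range(4)])
```

Averaging over several offsets m removes the need to commit to a single ∆t: a breakpoint contributes to the average as soon as any of the M shifted windows lies beyond it.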