Probabilistic Divergence Measures for Detecting Interspecies Recombination
Dirk Husmeier & Frank Wright
Biomathematics and Statistics Scotland at the Scottish Crop Research Institute
Invergowrie, Dundee DD2 5DA, UK
Email: [email protected]
http://www.bioss.ac.uk/~dirk
→ Phylogenetics
→ Sporadic recombination
→ Statistical detection methods
slide-1
Phylogenetics
Frog  Chicken  Human  Rabbit  Mouse  Opossum
G     G        G      G       G      G
C     C        C      C       C      C
T     G        G      G       G      G
T     T        T      T       T      T
G     A        C      C       C      C
A     A        A      A       A      A
C     C        C      C       C      C
T     T        T      T       T      T
T     T        T      T       T      T
C     C        G      G       G      G
T     A        A      A       A      A
G     C        G      G       C      G
A     A        A      A       A      A
G     T        C      C       G      C
G     G        G      G       G      G
T     A        C      C       C      C
T     T        T      T       T      T
→ Topology
→ Branch lengths
[Figure: phylogenetic tree with leaves Frog, Chicken, Human, Rabbit, Mouse, Opossum]
slide-2
A Probabilistic Model of Evolution 1
Transition-transversion ratio = 2
[Figure: P(A|A,w), P(G|A,w) and P(C|A,w) = P(T|A,w) plotted against w from 0 to 4; vertical axis from 0 to 1, with ticks at 0.25, 0.5, 0.75; inset diagram of the substitutions from A to A, G, C, T]
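The curves on this slide can be reproduced with a short sketch. The slide does not name the substitution model, so the code below assumes the Kimura two-parameter model, reads "transition-transversion ratio" as kappa = alpha / beta, and treats w as a branch length measured in expected substitutions per site. The function name `k80_probs` is hypothetical.

```python
import math


def k80_probs(w, kappa=2.0):
    """Probabilities of finding A, G, C, T after branch length w, starting from A.

    Assumption: Kimura two-parameter model, with kappa = alpha / beta, where
    alpha is the transition rate (A <-> G) and beta the rate of each of the
    two transversions (A <-> C, A <-> T).
    """
    # Scale the rates so that alpha + 2 * beta = 1, i.e. w counts expected
    # substitutions per site.
    beta = 1.0 / (kappa + 2.0)
    alpha = kappa * beta
    e1 = math.exp(-4.0 * beta * w)
    e2 = math.exp(-2.0 * (alpha + beta) * w)
    p_same = 0.25 + 0.25 * e1 + 0.5 * e2
    p_transition = 0.25 + 0.25 * e1 - 0.5 * e2
    p_transversion = 0.25 - 0.25 * e1
    return {"A": p_same, "G": p_transition, "C": p_transversion, "T": p_transversion}


if __name__ == "__main__":
    for w in (0.0, 1.0, 2.0, 3.0, 4.0):
        print(w, k80_probs(w))
```

Under these assumptions P(A|A,w) starts at 1 and all four probabilities approach 0.25 as w grows, which matches the shape of the plotted curves.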
slide-3
A Probabilistic Model of Evolution 2
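[Figure: tree with leaves y1, y2, y3, y4, hidden nodes z1, z2, and branch lengths w1, ..., w5]
\[ P(y_1, y_2, y_3, y_4, z_1, z_2 \mid w) = P(y_1 \mid z_1, w_2)\, P(y_2 \mid z_1, w_1)\, P(z_2 \mid z_1, w_3)\, P(y_3 \mid z_2, w_4)\, P(y_4 \mid z_2, w_5) \]
\[ P(y_1, y_2, y_3, y_4 \mid w) = \sum_{z_1} \sum_{z_2} P(y_1, y_2, y_3, y_4, z_1, z_2 \mid w) \]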
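The marginalisation over the hidden nodes z1 and z2 can be sketched as a brute-force sum. This is an illustration under stated assumptions, not the authors' implementation: it uses the Jukes-Cantor model as a stand-in for the substitution probabilities, and it multiplies by a uniform prior of 1/4 on z1, which the factorisation on the slide leaves implicit. The names `jc_prob` and `site_likelihood` are hypothetical.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def jc_prob(child, parent, w):
    """Jukes-Cantor probability of `child` given `parent` (assumed stand-in model)."""
    e = math.exp(-4.0 * w / 3.0)
    return 0.25 + 0.75 * e if child == parent else 0.25 - 0.25 * e


def site_likelihood(y, w, root_prior=0.25):
    """P(y1, y2, y3, y4 | w) for one alignment column.

    y = (y1, y2, y3, y4) are the observed leaves and w = (w1, ..., w5) the
    branch lengths, indexed as in the factorisation above. root_prior is an
    assumed uniform prior on the hidden node z1.
    """
    y1, y2, y3, y4 = y
    w1, w2, w3, w4, w5 = w
    total = 0.0
    # Sum the joint probability over all 16 configurations of (z1, z2).
    for z1, z2 in product(NUCLEOTIDES, repeat=2):
        total += (root_prior
                  * jc_prob(y1, z1, w2) * jc_prob(y2, z1, w1)
                  * jc_prob(z2, z1, w3)
                  * jc_prob(y3, z2, w4) * jc_prob(y4, z2, w5))
    return total


if __name__ == "__main__":
    print(site_likelihood("GACC", (0.1, 0.2, 0.3, 0.1, 0.2)))
```

Summing `site_likelihood` over all 256 possible leaf patterns gives 1, a quick check that the factorisation defines a proper distribution.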
slide-4
Statistical Approach to Phylogenetics
Frog  Chicken  Human  Rabbit  Mouse  Opossum
G     G        G      G       G      G
C     C        C      C       C      C
T     G        G      G       G      G
T     T        T      T       T      T
G     A        C      C       C      C
A     A        A      A       A      A
C     C        C      C       C      C
T     T        T      T       T      T
T     T        T      T       T      T
C     C        G      G       G      G
T     A        A      A       A      A
G     C        G      G       C      G
A     A        A      A       A      A
G     T        C      C       G      C
G     G        G      G       G      G
T     A        C      C       C      C
T     T        T      T       T      T
→ Likelihood (topology, branch lengths)
[Figure: tree whose leaves carry the nucleotides of one alignment column: Frog G, Chicken A, Human C, Rabbit C, Mouse C, Opossum C]
slide-5
Recombination
slide-6
Recombination in HIV
[Figure: HIV subtypes A, B, C, D, E, F, G, H; genes env and gag; recombinant strain ZR-VI 191]
slide-7
PLATO (Grassly, Holmes)
\[ Q = \frac{\text{average } L \text{ (window)}}{\text{average } L \text{ (remainder)}} \]
→ Find maximum Q values
→ Test significance with parametric bootstrapping
[Figure: window sliding along the alignment; tree topologies over taxa 1 to 4]
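The Q statistic can be sketched as follows. The slide only writes "L"; the code assumes that L is a per-site log-likelihood under the reference tree and that the values are already available as a list. The names `plato_q` and `max_q_window` are hypothetical, and the parametric bootstrap used for significance testing is not shown.

```python
def plato_q(site_values, start, end):
    """Ratio of the average L inside the window [start, end) to the average L outside it."""
    window = site_values[start:end]
    remainder = site_values[:start] + site_values[end:]
    if not window or not remainder:
        raise ValueError("window and remainder must both be non-empty")
    return (sum(window) / len(window)) / (sum(remainder) / len(remainder))


def max_q_window(site_values, width):
    """Slide a window of fixed width along the sites; return (max Q, window start)."""
    candidates = [
        (plato_q(site_values, start, start + width), start)
        for start in range(len(site_values) - width + 1)
    ]
    return max(candidates)


if __name__ == "__main__":
    # Toy per-site values (assumed log-likelihoods): sites 2 and 3 fit the tree poorly.
    values = [-1.0, -1.0, -3.0, -3.0, -1.0, -1.0]
    print(max_q_window(values, 2))
```

With negative log-likelihoods, a poorly fitting window has a more negative average than the remainder, so its Q exceeds 1.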
slide-8
Shortcoming of PLATO
→ Need a reference tree
→ Obtained with global maximum likelihood
[Figure: reference tree and tree topologies over taxa 1 to 4]
slide-9
TOPAL (McGuire, Wright)
[Figure: window positions along the alignment, labelled "DSS small" and "DSS large"]
\[ SoS = \sum_i \sum_k (d_{ik} - \hat{d}_{ik})^2 \]
\[ DSS = |SoS_{\text{left}} - SoS_{\text{right}}| \]
i, k: labels for taxa
\hat{d}_{ik}: fitted distances (Fitch or Neighbour Joining)
d_{ik}: true distances
slide-10
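The two formulas for SoS and DSS can be sketched directly, given distance matrices for the left and right halves of a window. Fitting the distances with Fitch or Neighbour Joining is outside the scope of the sketch, so the fitted matrices are taken as inputs. The function names are hypothetical, and the sum runs over pairs i < k, which for symmetric matrices differs from the full double sum only by a factor of two.

```python
def sum_of_squares(d, d_hat):
    """Sum of squared differences between distances d and fitted distances d_hat."""
    n = len(d)
    return sum(
        (d[i][k] - d_hat[i][k]) ** 2
        for i in range(n)
        for k in range(i + 1, n)  # each unordered pair of taxa once
    )


def dss(d_left, d_hat_left, d_right, d_hat_right):
    """Difference in sums of squares between the left and right half-windows."""
    return abs(sum_of_squares(d_left, d_hat_left) - sum_of_squares(d_right, d_hat_right))


if __name__ == "__main__":
    # Toy 3-taxon matrices: the left half is fitted imperfectly, the right half exactly.
    d = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    d_hat_left = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
    print(dss(d, d_hat_left, d, d))
```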
[Figure: tree topologies over taxa 1 to 4, numbered 1), 2), 3); horizontal axes labelled 1, 2, 3]
slide-11
Detection of Recombination with MCMC
[Figure: five panels showing P(S|D) plotted against tree topology S]
slide-12
Marginal Posterior Distribution over Tree Topologies with MCMC
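[Figure: window D_t on the alignment and the resulting distribution P(S|D_t) over topologies S]
\[ P(S \mid D_t) := \int P(S, w \mid D_t)\, dw \]
MCMC → Sample: \{S_{ti}, w_{ti}\}_{i=1}^{N}
\[ P(S, w \mid D_t) \approx \frac{1}{N} \sum_{i=1}^{N} \delta_{S, S_{ti}}\, \delta(w - w_{ti}) \]
\[ P(S \mid D_t) = \frac{1}{N} \sum_{i=1}^{N} \delta_{S, S_{ti}} = \frac{N_S(t)}{N} \]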
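The last equation says that the marginal posterior of a topology is simply the fraction of MCMC samples in which it appears. A minimal sketch, assuming the sampled topologies for one window are already available as a list of hashable labels (the MCMC sampler itself is not shown, and the function name is hypothetical):

```python
from collections import Counter


def topology_posterior(sampled_topologies):
    """Estimate P(S | D_t) as N_S(t) / N from the topologies sampled for window t."""
    n = len(sampled_topologies)
    if n == 0:
        raise ValueError("need at least one MCMC sample")
    counts = Counter(sampled_topologies)  # N_S(t) for every sampled topology S
    return {topology: count / n for topology, count in counts.items()}


if __name__ == "__main__":
    # Toy sample of N = 4 topologies, labelled 1 to 3 as on the slides.
    print(topology_posterior([1, 1, 2, 1]))
```

The branch lengths w drop out because counting topologies while ignoring the sampled w values is exactly the integration over w.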
slide-13
Divergence between Distributions
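[Figure: windows D_t and D_t' on the alignment, with the distributions P(S|D_t) and Q(S|D_t') over topologies S and their difference]
Divergence measure in probability space: Kullback-Leibler divergence
\[ KL(P, Q) = \sum_S P_S \ln \frac{P_S}{Q_S} \]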
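A minimal sketch of the Kullback-Leibler divergence for distributions over topologies, stored as dictionaries from topology label to probability. The representation and the function name are assumptions made for illustration. Terms with P_S = 0 contribute nothing, and the divergence is infinite when Q_S = 0 for a topology with P_S > 0.

```python
import math


def kl_divergence(p, q):
    """KL(P, Q) = sum over S of P_S * ln(P_S / Q_S)."""
    total = 0.0
    for topology, p_s in p.items():
        if p_s == 0.0:
            continue  # 0 * ln(0 / q) is taken as 0
        q_s = q.get(topology, 0.0)
        if q_s == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_s * math.log(p_s / q_s)
    return total


if __name__ == "__main__":
    p = {1: 0.5, 2: 0.5}
    q = {1: 0.25, 2: 0.75}
    print(kl_divergence(p, q), kl_divergence(q, p))
```

The two printed values differ, which illustrates that the divergence is not symmetric in P and Q.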
slide-14
Local and Global Divergence Measures
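[Figure: global divergence, computed from the window distribution P(S|D_t); local divergence, computed from P(S|D_t) and the distribution P(S|D_t') of another window D_t']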
slide-15
Divergence Measures and Statistical Significance
Divergence between the distribution over the window, P_S(t), and the average distribution, \bar{P}_S = \frac{1}{W} \sum_{t=1}^{W} P_S(t):
\[ d[P_S(t), \bar{P}] = \sum_S P_S(t) \ln \frac{P_S(t)}{\bar{P}_S} \]
Divergence between the distributions over two adjacent windows, P_S(t) and P_S(t'), where \tilde{P}_S = \frac{P_S(t) + P_S(t')}{2} (Sibson):
\[ d[P_S(t), P_S(t')] = \frac{1}{2} \sum_S \left[ P_S(t) \ln \frac{P_S(t)}{\tilde{P}_S} + P_S(t') \ln \frac{P_S(t')}{\tilde{P}_S} \right] \]
Null hypotheses: P_S(t) = \bar{P}_S and P_S(t) = P_S(t')
\[ 2N\, d[P_S(t), \bar{P}] \to \chi^2(\nu - 1), \qquad 2N\, d[P_S(t), P_S(t')] \to \chi^2(\tilde{\nu} - 1) \]
\[ \nu = |\mathrm{Support}(\bar{P})|, \qquad \tilde{\nu} = |\mathrm{Support}(\tilde{P})| \]
slide-16
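The global and local divergence measures and the scaled statistic 2N d can be sketched as below. Distributions are stored as dictionaries from topology label to probability, N is taken to be the number of MCMC samples per window, and all function names are hypothetical. The sketch returns the statistic and its degrees of freedom only; turning them into a p-value needs a chi-square distribution function, which is left out here.

```python
import math


def _kl(p, q):
    """sum_S p_S ln(p_S / q_S); q must be positive wherever p is."""
    return sum(v * math.log(v / q[s]) for s, v in p.items() if v > 0.0)


def average_distribution(dists):
    """Average distribution over all W windows."""
    topologies = set().union(*dists)
    return {s: sum(d.get(s, 0.0) for d in dists) / len(dists) for s in topologies}


def midpoint(p, q):
    """Midpoint distribution (P + Q) / 2 of two adjacent windows."""
    return {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in set(p) | set(q)}


def global_divergence(p_t, p_bar):
    """d[P(t), average]: KL divergence from the window to the average distribution."""
    return _kl(p_t, p_bar)


def local_divergence(p_t, p_t2):
    """d[P(t), P(t')]: symmetrised divergence via the midpoint distribution."""
    mid = midpoint(p_t, p_t2)
    return 0.5 * (_kl(p_t, mid) + _kl(p_t2, mid))


def chi2_statistic(d, n_samples, reference):
    """Return (2 N d, degrees of freedom), where dof = |Support(reference)| - 1."""
    support = sum(1 for v in reference.values() if v > 0.0)
    return 2.0 * n_samples * d, support - 1


if __name__ == "__main__":
    windows = [{1: 0.9, 2: 0.1}, {1: 0.8, 2: 0.2}, {2: 0.3, 3: 0.7}]
    p_bar = average_distribution(windows)
    d_global = global_divergence(windows[2], p_bar)
    print(chi2_statistic(d_global, 1000, p_bar))
    d_local = local_divergence(windows[1], windows[2])
    print(chi2_statistic(d_local, 1000, midpoint(windows[1], windows[2])))
```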
Simulation Experiment A
[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
5000 nucleotides, window size = 500 nucleotides
slide-17
Simulation Experiment A
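[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
[Figure: KL (vertical axis 0 to 6) and AS (vertical axis 0 to 8) plotted against alignment position, 0 to 5000]
5000 nucleotides, window size = 500 nucleotides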
slide-18
Simulation Experiment A
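[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
Methods compared: MCMC Global, MCMC Local, TOPAL, PLATO
5000 nucleotides, window size = 500 nucleotides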
slide-19
Results - Simulation Experiment A
[Figure: panels of KL and AS plotted against alignment position, 0 to 5000]
slide-20
Results - Simulation Experiment A
[Figure: panels of KL, AS, DSS (Topal2) and Plato plotted against alignment position, 0 to 5000]
slide-21
Simulation Experiment B
[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
5500 nucleotides, window size = 500 nucleotides
slide-22
Simulation Experiment B
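[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
[Figure: KL (vertical axis 0 to 5) and AS (vertical axis 0 to 8) plotted against alignment position, 0 to 6000]
5500 nucleotides, window size = 500 nucleotides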
slide-23
Simulation Experiment B: Smaller Window
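[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
[Figure: KL (vertical axis 0 to 5) and AS (vertical axis 0 to 5) plotted against alignment position, 0 to 6000]
5500 nucleotides, window size = 250 nucleotides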
slide-24
Simulation Experiment B
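[Figure: tree with leaves A00, A01, A10, A11, B00, B01, B10, B11 and internal nodes A, A0, A1, B, B0, B1; r1 and r2 marked on the tree and on the alignment]
Methods compared: MCMC Global, MCMC Local, TOPAL, PLATO
5500 nucleotides, window size = 500 nucleotides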
slide-25
Results - Simulation Experiment B
[Figure: panels of KL and AS plotted against alignment position, 0 to 6000]
slide-26
Results - Simulation Experiment B
[Figure: panels of KL, AS, DSS (Topal2) and Plato plotted against alignment position, 0 to 6000]
slide-27
Potato Virus Y
Four strains, 9700 bases, window size = 500 bases.
[Figure: four panels plotted against alignment position, 0 to 10000: MCMC, global (KL); MCMC, local (AS); TOPAL (DSS: Topal2); PLATO (Plato)]
slide-28
Potato Virus Y: RecPars (J. Hein)
Transition cost = 2, transversion cost = 5, recombination cost: 100, 50, 10, 5
[Figure: four panels, one per recombination cost (100, 50, 10, 5), showing topology (1 to 3) against alignment position, 2000 to 8000]
slide-29
Potato Virus Y
[Figure: KL and AS plotted against alignment position, 0 to 10000; tree topologies over the strains Baulcombe, Singh, Robaglia, Hungarian; posterior probability (0 to 1) over topologies 1 to 3 for State 1, State 2, State 3]
slide-30
Hepatitis B Virus
Five strains, 3050 bases, window size = 500 bases.
[Figure: four panels plotted against alignment position, 0 to 3000: MCMC, global (KL); MCMC, local (AS); TOPAL (DSS: Topal2); PLATO (Plato)]
slide-31
Conclusions
• Sliding window: marginal posterior distribution over tree topologies, conditional on the selected subset of the alignment.
• Global divergence measure: Kullback-Leibler divergence between a local distribution and the global distribution.
• Local divergence measure: modified Kullback-Leibler divergence between adjacent local distributions.
• Comparison with TOPAL and PLATO on several synthetic benchmark problems.
• Distinguishes between recombination and rate variation.
• Detects all recombination events.
• Hepatitis B virus: new method detects breakpoints predicted with TOPAL plus two additional breakpoints.
slide-32
Hepatitis B Virus, 10 Strains
[Figure: KL (vertical axis 0 to 5) and AS (vertical axis 0 to 8) plotted against alignment position, 0 to 3000]
slide-33
Hepatitis B Virus: Spectrum
[Figure: spectrum; horizontal axis 0 to 20, vertical axis 0 to 0.25]
slide-34
Hepatitis B Virus: Pruning, K = 5
[Figure: KL (vertical axis 0 to 2) and AS (vertical axis 0 to 8) plotted against alignment position, 0 to 3000]
slide-35
Hepatitis B Virus, Pruning: K = 3, K = 4, K = 5, K = ∞
[Figure: for each value of K, KL and AS plotted against alignment position, 0 to 3000]
slide-36
Average Divergence Measure
• Divergence measure d[P(t), P(t + ∆t)]
• How to choose ∆t?
• Average over different degrees of overlap:
\[ \bar{d} = \frac{1}{M} \sum_{m=1}^{M} d[P(t), P(t + m \Delta t)] \]
[Figure: panels plotted against alignment position, 0 to 5000, vertical axis 0 to 1. From top to bottom: 0%, 50%, 90% overlap, averaging between 50% and 90%]
slide-37
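The averaged divergence can be sketched as below. The sketch assumes that the window posteriors are stored in a list indexed in steps of ∆t, so that entry t + m holds P(t + m∆t), and that the divergence d is passed in as a function. The Sibson-style helper is included only to make the example runnable; all names are hypothetical.

```python
import math


def sibson(p, q):
    """Symmetrised divergence of two distributions via their midpoint."""
    mid = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in set(p) | set(q)}

    def kl_to_mid(r):
        return sum(v * math.log(v / mid[s]) for s, v in r.items() if v > 0.0)

    return 0.5 * (kl_to_mid(p) + kl_to_mid(q))


def average_divergence(posteriors, t, max_shift, divergence=sibson):
    """Average of d[P(t), P(t + m * dt)] over the shifts m = 1, ..., M."""
    if max_shift < 1 or t + max_shift >= len(posteriors):
        raise ValueError("need posteriors for every shift up to max_shift")
    total = sum(
        divergence(posteriors[t], posteriors[t + m])
        for m in range(1, max_shift + 1)
    )
    return total / max_shift


if __name__ == "__main__":
    # Toy posteriors: the topology changes between the second and third window.
    posteriors = [{1: 1.0}, {1: 1.0}, {2: 1.0}]
    print(average_divergence(posteriors, 0, 2))
```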