The relation between indel length and functional divergence: a formal study Raheleh Salari1 3 , Alexander Sch¨onhuth1 3 , Fereydoun Hormozdiari1, Artem Cherkasov2, and S. Cenk Sahinalp1 1
2
School of Computing Science, Simon Fraser University, 8888 University Drive Burnaby, BC, V5A 1S6, Canada Division of Infectious Diseases, Faculty of Medicine, University of British Columbia 2733 Heather Street, Vancouver, BC, V5Z 3J5, Canada 3 The authors contributed equally.
Abstract. Although insertions and deletions (indels) are a common type of evolutionary sequence variation, their origins and their functional consequences have not been comprehensively understood. There is evidence that, on one hand, classical alignment procedures only roughly reflect the evolutionary processes and, on the other hand, that they cause structural changes in the proteins’ surfaces. We first demonstrate how to identify alignment gaps that have been introduced by evolution to a statistical significant degree, by means of a novel, sound statistical framework, based on pair hidden Markov models (HMMs). Second, we examine paralogous protein pairs in E. coli, obtained by computation of classical global alignments. Distinguishing between indel and non-indel pairs, according to our novel statistics, revealed that, despite having the same sequence identity, indel pairs are significantly less functionally similar than non-indel pairs, as measured by recently suggested GO based functional distances. This suggests that indels cause more severe functional changes than other types of sequence variation and that indel statistics should be taken into additional account to assess functional similarity between paralogous protein pairs. Keywords. Alignment statistics, Deletions, Insertions, GO, Pair Hidden Markov Models
1 Introduction Assessment of functional similarity of two proteins based on their amino acid composition is routinely done by alignment procedures. The reasoning behind this is that homology can be reliably inferred from alignment scores, if significantly high. Homology, in turn, is correlated to functional similarity [12]. However, it is well known that in the twilight zone, i.e. protein alignments between roughly 20% to 40% sequence identity, structural hence functional similarity cannot be reliably inferred [30]. Further criteria are needed to distinguish functionally similar proteins from functionally different proteins. A couple of findings point out that indels, not substitutions, are the predominant evolutionary factor when it comes to functional changes. For example, indels happen to
2
R. Salari et. al.
occur predominantly in loop regions [11] which strongly indicates that their occurrence is related to structural changes on the proteins’ surfaces, hence functional changes. Moreover, recent studies show that indels can be of particular interest with regard to disease. Indels are substantially involved in disease-causing mutational hot spots in the human genome [17]. Moreover, promising approaches to the design of novel antibacterial drugs based on the identification of indels have been recently described [6, 5, 21]. A more formal investigation of the relationship between indels and the functional changes they imply may employ the following general approach. 1. Compute all paralogous protein pairs in an organism. 2. Collect “Indel (I)” and “Non-Indel (NI)” pairs. 3. Compare functional similarity of “Indel” and “Non-Indel”. Obviously, there are several issues that need to be resolved in the above approach. (i) First of all one needs to use a good definition of paralogous proteins (paralogs); this is a relatively easy task as most studies define two proteins as paralogs if their sequence identity and similarity, as given by a classical alignment, are above a threshold value and are statistically significant. (ii) Then the organism on which the study will be based must be chosen. In this paper we have opted to examine paralogous protein pairs in E. coli K12. This choice of organism is motivated by our particular interest in studying bacterial organisms. 4 (iii) It is then of key importance to provide a formal definition of an indel. Note that, while amino acid substitution is an evolutionary process that is reliably covered by the classical dynamic programming alignment procedures, realistic computational models for insertions and deletions (indels) have remained an insufficiently solved problem. There are sound indel studies both for DNA, [34, 14, 20] and proteins [2, 28, 4, 23]. However, models have usually been inferred for highly specific datasets and slightly contradict each other such that they are not generally applicable. Given the unclear actual situation, we have employed the classical Needleman-Wunsch alignment procedure with affine gap penalties to compute paralogous protein pairs, in order to account both for computational efficiency and soundness. The main problem we deal with in this context was to reliably distinguish between alignment gaps that have been introduced by point mutations and those that have not. This is analogous to that of identifying alignment scores that indicate alignments that reflect truly evolutionary relationships. The corresponding statistical frameworks [16, 8] allow for reliable statistics even if empirical statistics, due to insufficient amounts of data, are not applicable. In analogy to this, we have developed a statistical framework, based on pair HMMs, with which to identify alignment gaps that are statistically significant, hence more likely refer to insertions and deletions, introduced by evolution. In order to measure the functional impact of indels compared to that of substitutions, 4
Note that, in prokaryotes, in addition to the classical transfer of genetic material from parent to offspring, other phenomena can be responsible for mutational changes. For example, horizontal gene transfer, that is, transfer of genetic material between arbitrary prokaryotic cells, sometimes of different species, has been made responsible for the rapid development of drug resistance [18, 15]. Therefore, to study the paralogous proteins of prokaryotes is also of biomedical interest.
Relation between indel length and functional divergence
3
“Indel” (I) and “Non-Indel” (NI) pairs that are compared with each other will refer to the same levels of sequence identity. Hence, the same amount of non-identical positions either contain significant indels (I) or a significant amounts of substitutions (NI). (iv) The final challenge is the definition of functional similarity between protein pairs. One possible approach is to measure functional similarity by the use of GO based distances, as recently suggested [19, 31] (see section 2.2 for further details). According to the suggested strategy from above, we found that “Indel” protein pairs are significantly less functionally similar than “Non-Indel” pairs, although being of similar alignment scores. Perhaps this finding is quite intuitive. However, the main goal of this paper is a sound, formal framework for quantifying what an indel is with respect to sequence similarity and length and what particular indel lengths imply changes in functional similarity measured in terms of GO based distances. Our results do not only provide a rigorous framework for understanding the relationship between indels and functional similarity but also aim to lay some of the computational foundations of a novel large-scale approach to drug target search by extending the previous works of [6, 5, 21]. In sum, our major goals in this study are: (1): To provide, based on pair HMMs, reliable formulas to assess the resulting from alignment methods with affine gap penalties. (2):To carry out a large-scale study on the correlation of indel size and functional similarity for paralogous proteins (made in the context of E. coli). Note that, thanks to our arrangements, differences in functional similarity cannot be due to differences in alignment scores. Therefore, this study strongly suggests that indels, as evolutionary sequence variation, contribute significantly more to functional changes in the affected proteins than substitutions only. To the best of our knowledge, these points have not been addressed in a formal sense before. In general, this study provides novel ideas on how to use classical alignment procedures more reliably in order to account for functional similarity, by incorporating indel statistics.
2 Methods 2.1 Indel Length Statistics Problem Description The driving problem is described by the following scenario. Let T = {(xt , y t ) | t = 1, ..., |T |} be a set of pairs of sequences over a common alphabet Σ. The alphabet Σ will later be identified with the set of amino acids and sets T will contain pairs of proteins of interest. Let A be an alignment procedure and LA (x, y) resp.
IA (x, y)
be the length of the alignment of x = x1 ...xm , y = y1 ...yn resp. the length of the largest indel that can be found in the alignment, as computed by A. If k is an integer, we are interested in the probabilities Pn,T (IA(x,y)≥k ) := P (IA (x, y) ≥ k | LA (x, y) = n, (x, y) ∈ T )
(1)
4
R. Salari et. al. q
X
p
qx
1-2p
i
2p
1-q Begin
M
End
Px y
i j
p 1-q
1-2p
Y
qyj
Match
Indel
q
q 1-q
Fig. 1. Pair HMM (left) and the Markov chain resulting from it (right)
that the largest indel in the alignment of x and y is greater than k or, equivalently, that the alignment contains an indel of length at least k, given that x and y have been drawn from T such that the alignment of x and y is of length n. Clearly, k depends on n and also on T if, for example, T contains all protein pairs that yield a similar alignment score. An analytical treatment of the problem is highly desirable, as there usually are insufficient amounts of data for statistically reliable empirical distributions. Note that, in our case, we have only on the order of tens of proteins, sampled from the pools of interest, which is by far too little to infer reliable statistical estimates. This usually is also the justification of an analytical treatment of score statistics. Note that the score statistics problem refers to replacing IA (x, y) by SA (x, y), the alignment score attributed to an alignment of x and y computed by A. In the case of database searches, pools T then may refer to all alignments where x = x0 in all pairs (x, y) , for a single protein xo . We will refer to the problem of computing probabilities of the type (1 as the indel length probability (ILP) problem.
Pair HMMs We consider the ILP problem for A being the Needleman-Wunsch procedure for global alignments with affine gap penalties [22, 13], as we are interested in statistics for paralogs which have been inferred by globally aligning all proteins in E. coli. Pair HMMs provide an approach to computing global alignments with affine gap penalties which is equivalent to the dynamic programming approach of NeedlemanWunsch. As pair HMMs are standard, we only give a brief description here and refer the reader to [10] for details. Consider the pair HMM in Fig. 1. It generates alignments by traversing the hidden states according to state transition probabilities (parameterized by p and q) and emitting pairs of symbols according to emission probability distributions attached to the “hidden” states, i. e. matches/mismatches from the ’M’ state by producing a pair of symbols xi , yj with probability pxi yj and symbols paired with gaps from states ’X’ and ’Y’. Upon termination of the run through the hidden states, one has obtained an alignment of two sequences x = x1 ...xm , y = y1 ...yn , according to the sequence of symbol/gap pairs that had been generated along the run. As is well known [10], a NeedlemanWunsch alignment with affine gap penalties of two sequences x = x1 ...xm , y = y1 ...yn can be obtained by computing the most likely sequence of hidden states (the Viterbi path) that yields an alignment of the two sequences by emitting suitable combinations
Relation between indel length and functional divergence
5
of symbols along the run. Note that we can neglect transition probabilities referring to start and end states as they do not affect our further considerations. Owing to the formulation of the ILP problem, we are only interested in statistics on sequences of hidden states, namely in probabilities referring to lengths of consecutive runs with either the ’X’ or the ’Y’ state. Therefore, we can direct our attention to the Markov chain of Fig. 1 which has been obtained from the pair HMM by collapsing the hidden states ’X’ and ’Y’ of the pair HMM into the ’Indel’ state of the Markov chain. The Markov chain generates sequences over the alphabet ˜§ := {′ M ′ ,′ I ′ } where ′ M ′ and ′ I ′ are shorthand for ′ M atch′ and ′ Indel′ . Consider now Cn,k , the set of sequences over the alphabet ˜§ of length n that contain a consecutive ′ I ′ stretch of length at least k. We suggest the following procedure in order to provide good approximations of Pn,T (IA(x,y)≥k ) from (1). C OMPUTATION OF Pn,T (IA (x, y) ≥ k): Align all sequence pairs from T . Infer p and q by training the pair HMM with the alignments. n ← length of the alignment of x and y Compute P (Cn,k ), the probability that the Markov chain generates a sequence from Cn,k . 5: Output P (Cn,k ) as an approximation for Pn,T (IA (x, y) ≥ k). 1: 2: 3: 4:
The idea of step 1 and 2 is to design a pair HMM that generates alignments from the pool T . This is straightforwardly achieved by a standard Baum-Welch training with the Needleman-Wunsch alignments of the protein pairs from T . In our setting, this reduces to simply counting ’Match’-to-’Match’ and ’Indel’-to-’Indel’ transitions in the alignments under consideration to provide maximum likelihood estimates for the derived Markov chain, as the pair HMM is only of virtual interest. Clearly, any statistics on the pair HMM, thanks to the relationship with the Needleman-Wunsch procedure, will yield good approximations to alignment statistics. Note that this basic strategy has recently been successfully employed to model certain alignment features in a comparative genomics study on human-mouse DNA alignments [20]. Therefore, computing Pn,T (IA (x, y) ≥ k) has been translated to computing the probability that the Markov chain generates a sequence from Cn,k . However, surprisingly, efficient computation and/or closed formulas for such probabilities had been an unsolved mathematical problem so far. Related work only refers to respective formulas for i.i.d. processes on a two-letter alphabet [27]. Efficient Computation of PT (Cn,k ) Imagine that, as usual, we have collected the Markov chain parameters, as given by Fig. 1, into a state transition probability matrix and an initial probability distribution q 2p A = (aij )i,j=1,2 = and π = e2 = (0.0, 1.0)T . 1 − q 1 − 2p That is, state 1 corresponds to the ’Indel’ state and state 2 corresponds to the ’Match’ state. The initial distribution reflects that we assume that an alignment always starts
6
R. Salari et. al.
from a ’Match’ state (by adding a match to the alignment at the artificial position 0), in order to take into account that starting a global alignment with a gap is scored with a gap opening penalty just as if coming from a ’Match’ state. For example, according to the laws that govern a Markov chain, the probability of being in state 1 at position t in a sequence generated by the Markov chain is P (Xt = 1) = eT1 At π = eT1 At e2 = (1.0, 0.0)At (0.0, 1.0)T . We first would like to point out that naive approaches to computing P (Cn,k ) fail. For example, let Bt,k be the set of sequences that contain a consecutive ’Indel’ run of length k, stretching over positions t to t + k − 1. Realizing that n−k+1 Cn,k = ∪t=1 Bt,k and proceeding by inclusion-exclusion for computing probabiliPn−k+1 P ties (i.e. P (Cn,k ) = m=1 (−1)m+1 1≤t1