BIOINFORMATICS
ORIGINAL PAPER
Vol. 21 no. 7 2005, pages 902–911 doi:10.1093/bioinformatics/bti070
Sequence analysis
The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions Yi-Kuo Yu and Stephen F. Altschul∗ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA Received on June 14, 2004; revised on August 6, 2004; accepted on September 13, 2004 Advance Access publication October 27, 2004
ABSTRACT Motivation: Amino acid substitution matrices play a central role in protein alignment methods. Standard log-odds matrices, such as those of the PAM and BLOSUM series, are constructed from large sets of protein alignments having implicit background amino acid frequencies. However, these matrices frequently are used to compare proteins with markedly different amino acid compositions, such as transmembrane proteins or proteins from organisms with strongly biased nucleotide compositions. It has been argued elsewhere that standard matrices are not ideal for such comparisons and, furthermore, a rationale has been presented for transforming a standard matrix for use in a non-standard compositional context. Results: This paper presents the mathematical details underlying the compositional adjustment of amino acid or DNA substitution matrices. Availability: Programs implementing the methods described are available from the authors upon request. Contact:
[email protected] 1
INTRODUCTION
Most methods for comparing protein sequences, including database search programs such as FASTA (Pearson and Lipman, 1988) or BLAST (Altschul et al., 1990, 1997), make use of amino acid substitution matrices for assessing sequence similarity. In the context of local alignments without gaps, a statistical theory of substitution matrices has been developed (Karlin and Altschul, 1990; Dembo et al., 1994). Assuming a random protein model in which the amino acids occur independently with background probabilities p, any matrix with at least one positive score and a negative expected score can be written uniquely in the form qij 1 sij = ln , (1) λ p i pj where the ‘target frequencies’ q are positive numbers that sum to 1 and λ is a positive scale factor. The implicit values of λ and q are easily calculated. A substitution matrix of this form can be shown to be optimal for distinguishing from chance those ungapped local alignments whose aligned amino acid pair frequencies are characterized by the target frequencies. ∗ To
whom correspondence should be addressed.
902
Although any suitable matrix implicitly is of the form of Equation (1), the most popular substitution matrices are constructed explicitly as ‘log-odds’ matrices, using this equation. Careful curatorial work yields collections of alignments representing true biological relationships, from which empirical target frequencies q are derived. The PAM (Dayhoff et al., 1978; Schwartz and Dayhoff, 1978) and BLOSUM (Henikoff and Henikoff, 1992) matrices are constructed using different methods for estimating these frequencies, but both approaches yield symmetric target frequencies qij = qj i . For consistency, background amino acid frequencies are calculated from the target frequencies by pi = j qij = j qj i . Different degrees of evolutionary divergence imply different target frequencies, and both approaches therefore are able to construct q tailored to differing evolutionary distances, giving rise to a series of PAM and BLOSUM substitution matrices. Although standard matrices are constructed from sequence data with specific background frequencies p, they often are used to compare sequences that have quite different amino acid compositions. Some specialized proteins, such as hydrophobic or cysteine-rich proteins, have non-standard compositions due to functional constraints. In addition, proteins encoded by heavily AT- or CG-rich genomes or isochores tend to have non-standard compositions due to pressures on codon usage (Sueoka, 1988; Wan and Wootton, 2000; Knight et al., 2001). Specialized substitution matrices have been constructed for certain classes of proteins, and have been shown to be more effective than standard matrices at recognizing biological relationships among these proteins (Ng et al., 2000; Muller et al., 2001). Recently, Yu et al. (2003) have argued that it is, in general, inappropriate to use standard matrices to compare proteins with non-standard compositions. Specifically, when a substitution matrix with scores given by Equation (1) is used to compare two sequences with amino acid compositions P and P that differ from p, the implicit target frequencies of the matrix in the new compositional context are, in general, inconsistent with the new background frequencies. As we argue in detail below, if one demands consistency between target and background frequencies, then a substitution matrix can have at most one valid set of target and background frequencies, except in certain special cases unlikely to arise in practice. Further, we show how to determine effectively whether an arbitrary substitution matrix can correspond to any consistent set of target and background frequencies and, if so, how to calculate these frequencies.
Published by Oxford University Press 2004.
Substitution matrices for non-standard compositions
Ideally, in a compositional context defined by background frequencies P and P , one should use a substitution matrix with consistent target frequencies Q. In general, this means that P = i j Qij and Pj = i Qij , and that Q will be asymmetric. Is there a sensible way to derive such target frequencies from the q implicit in, e.g. a BLOSUM matrix, so that each new case does not require laborious curatorial work? Yu et al. (2003) propose choosing those Q that satisfy the consistency constraints, and that are a minimum distance from the original target frequencies q by the measure of relative entropy Qij D(Q, q) = Qij ln . (2) qij ij
We present below the mathematical details of a Newtonian procedure for minimizing D, which were elided in the paper of Yu et al. (2003). The compositional adjustment of amino acid substitution matrices, in general, tends to increase the statistical significance of alignments of proteins with non-standard compositions, and often improves the accuracy of these alignments as well (Yu et al., 2003). The general method described here is fully automatic, sidestepping the substantial curatorial effort required by earlier methods for the construction of special purpose matrices. In addition, it can be applied consistently to the comparison of proteins with compositions that differ markedly from one another, as well as from a standard composition.
2
THE VALIDITY OF SUBSTITUTION MATRICES
Given a matrix q of positive, not necessarily symmetric target frequencies that sum to 1, and an arbitrary positive scale factor λ, one may construct consistent ‘background frequencies’ using the equations qij = pi ; qij = p j , (3) j
i
and then a corresponding log-odds substitution matrix with the equation qij 1 sij = ln . (4) λ pi pj We call a matrix s whose elements sij can be expressed in the form of Equations (3) and (4) ‘valid’ in the context of the background frequencies p and p , and target frequencies q. For brevity, we will sometimes say below that a substitution matrix is valid when we mean that it is valid in some context. Yu et al. (2003) have argued that a substitution matrix can be valid in at most one context. This is true for almost all substitution matrices, but there are degenerate cases (specifically, when the row or column vectors of implicit target frequencies are linearly dependent) where more than one set of target and associated background frequencies can give rise to the same matrix. Although it is still possible to characterize these degenerate cases, we will not study them in detail here because they rarely arise in real applications. Instead, we will limit ourselves to recognizing and characterizing fully the non-degenerate cases. As discussed in the Introduction section, most amino acid substitution matrices in common use are explicitly constructed using Equations (3) and (4) (although generally with symmetric q and therefore with p = p ), so such matrices are by definition valid. For completeness, however, we ask how one may determine whether an arbitrary substitution matrix is valid.
Yu et al. (2003) have outlined a method for finding the λ and q corresponding to a valid substitution matrix s. To review, define the matrix M(τ ), dependent upon the parameter τ , to have matrix elements Mij (τ ) ≡ eτ sij . (5) The requirement that s satisfy Equations (3) and (4) implies that, for the unknown p, p and λ, pi Mij (λ) = 1; Mij (λ)pj = 1. (6) i
j
Equation (6) can be solved for p and p by inverting M(λ) to Y(λ): Yj i (λ); pj = Yj i (λ), (7) pi = j
i
and the condition that the pi and the pj sum to 1 implies simply that
Yij (λ) = 1.
(8)
ij
It would thus appear that our problem reduces merely to finding a positive solution to Equation (8), after which the p and p may be constructed using Equation (7), and the qij defined as pi pj eλsij . However, several potential difficulties must be addressed. First, the matrix M(τ ) may have determinant zero either uniformly or for isolated values of τ . As we discuss in the Appendix section, it is these zeros of det M(τ ) that can give rise to multiple valid sets of target and background frequencies. In practice, this is unlikely to arise except by construction, and we will assume in the main text that M(τ ) is invertible. Second, there may be no or multiple solutions to Equation (8). Third, a solution to Equation (8) may not be ‘sensible’, in that it may yield a nonsense solution to Equation (7), with one or more of the pi or pj negative. It has at least been established (Yu et al., 2003) that there is at most one sensible solution to Equation (8). We describe here an effective procedure to find this unique sensible solution if it exists, or else to report that no such solution exists.
3
CALCULATING λ
It can be shown that a valid substitution matrix s with a row or column uniformly zero implies that det M(τ = λ) = 0, i.e. a degenerate case. Thus we will consider such matrices only in the Appendix section, and will assume here that no row or column of s is uniformly zero. We argue that if s is valid, each of its rows and columns then must contain at least one positive and one negative score. Suppose, on the contrary, that s has a row I with no positive scores. Then, ln
qIj qIj ≤0 ⇒ ≤ pj pI pj pI
∀j ,
(9)
with strict inequality holding in at least one case, because no row is uniformly zero. Summing over j and using Equation (3), we get the contradiction 1 < 1. The argument generalizes of course to no negative scores, and to columns. Assuming s is valid, we first seek an upper bound on its corresponding λ. Let cj be the largest score in column j and let c be the smallest of these cj . By our argument above, all cj must be positive, as well as c. Let pJ be the largest of the background frequencies p corresponding to s so that pJ ≥ 1/N, where N is the size of the
903
Y.-K.Yu and S.F.Altschul
alphabet. Let the largest score in column J of the substitution matrix, i.e. cJ , be located in row I . Then by Equation (6), = 1. eλsI 1 p1 + · · · + eλcJ pJ + · · · + eλsI N pN
(10)
Because no term on the left-hand side of Equation (10) is negative, cJ ≥ c, and pJ ≥ 1/N, we have eλc ≤ 1, N
(11)
which immediately provides an upper bound for λ, i.e. λ≤
ln N . c
(12)
The argument above applies with rows and columns switched, so if we let ri be the largest score in row i, and r be the smallest of the ri , then (ln N )/r is also an upper bound for λ. Note that in most amino acid or DNA scoring matrices of interest, the largest score in each row and column is on the main diagonal. In this case, c and r are identical, and equal to the smallest score on the main diagonal. We require only some upper bound on λ, but a program to calculate λ is rendered more efficient by tighter upper bounds. One may show that if z is the largest score in s, then eλc − eλ(c−z) ≤ N − 1
(13)
provides a bound for λ at least as good as (12) (proof omitted). The bound of course applies with r replacing c, and the structure of some matrices allows one to derive even better bounds on λ. Given that λ for a valid matrix must lie between 0 and our upper bound, we may use numeric methods to find all solutions to Equation (8). Although multiple solutions may exist, as soon as one that implies non-negative target frequencies has been found, we know this is the unique λ that can do so (Yu et al., 2003), and the search for further solutions may be abandoned. Although this procedure presents no difficulties in practice, it may be objected that we have not described an effective procedure for knowing when all solutions to (8) have in fact been found. For mathematical rigor we require only the assumption that the scores of s are rational. Equation (8) may then be recast as a problem in finding the roots of a polynomial, and lower bounds on the separation of such roots are available (Rump, 1979). This degree of rigor, however, is too computationally costly to be of practical use.
4
Changing variables to Ai = eαi −1 and Bj = eβj (with BN = 1), yields Qij = Ai Bj qij .
j
i
but that minimize D of Equation (2).
(16)
Our constraints on Q can now be written as 2N − 1 constraints on A and B: ≡ Ai Bj qij = Pi ∀ 1 ≤ i ≤ N ; (17) fi (Ai , B) j
Bj ) ≡ Bj gj (A,
i
Ai qij = Pj
∀ 1 ≤ j < N.
(18)
To use the multidimensional Newton’s method, assume the solution When A and B are not to Equations (17) and (18) occurs at a and b. too far from this solution, we can expand the functions fi and gj of Equations (17) and (18) as = Pi + (Ai − ai ) fi (Ai , B) Bj qij j
+ Ai (Bj − bj )qij + higher order terms; j
Bj ) = Pj + (Bj − bj ) gj (A,
(19)
Ai qij
i
SUBSTITUTION MATRIX ADJUSTMENT
Our central question is how best to transform a substitution matrix that is valid in one background frequency context into a matrix that is valid in a different context. We formulate the problem by assuming we are given a matrix with implicit target frequencies q, and also a set of background frequencies P and P that are inconsistent with q. (Note that for the usual application, the frequencies q correspond to a symmetric substitution matrix, and are therefore themselves symmetric, but our development does not require this.) We then seek target frequencies Q, that are consistent with P and P , i.e. that satisfy Qij = Pi ; Qij = Pj , (14)
904
If our alphabet has N characters, consistency with P and P imposes 2N − 1 independent linear constraints on the N 2 dimensional target frequency matrix Q. [One may choose to think of Q as being (N 2 − 1)-dimensional, because of the requirement that the Qij sum to 1. Then, beyond this original constraint, consistency with P and P imposes 2N − 2 additional independent constraints on Q.] Note that because the second derivative matrix of D is diagonal, with its (i, i)-th element equal to 1/Qii , which is a positive number in the domain of interest, D will have a unique minimum when subject to any linear constraints. To minimize D we introduce 2N − 1 Lagrange multipliers, α1 through αN and β1 through βN −1 , one for each constraint. To simplify notation below, we define βN (not one of the Lagrange multipliers) to be the constant 0. Then, setting the partial derivative of D with respect to each of the Qij equal to the sum of the Lagrange multipliers times the partial derivatives of their respective constraints, we have Qij + 1 = αi + βj . (15) ln qij
+ Bj
(Ai − ai )qij + higher order terms.
(20)
i
Then, defining Ai = ai −Ai , Bj = bj −Bj , fi = Pi −fi and gj = Pj − gj , Newton’s procedure is to ignore the higher order terms and to solve for the Ai and Bj in terms of the fi and gj . Grouping the Ai (for 1 ≤ i ≤ N ) and Bj (for 1 ≤ j < N) −→
into a column vector AB, and the fi and gj into a column −→
vector fg, we obtain from Equations (19) and (20) the matrix equation −→
−→
AB = W−1 fg,
(21)
Substitution matrices for non-standard compositions
this is almost identical to the procedure described above for minimizing D(Q, q) subject to these same constraints. However, because
where W=
j
Bj q1,j 0 .. . 0
j
B1 q1,1 B2 q1,2 .. . BN −1 q1,N −1
0 Bj q2,j .. . 0
B1 q2,1 B2 q2,2 .. . BN −1 q2,N −1
... ... .. . ...
0 0 .. . j Bj qN ,j
... ... .. . ...
B1 qN ,1 B2 qN ,2 .. . BN −1 qN ,N −1
Starting with A and B that are not a solution to Equations (17) and (18), we can calculate from these equations the desired fi and gj . We then calculate new values of A and B using Equations (21) Because D is and (22), and iterate until convergence to a and b. convex, convergence is guaranteed. In addition, simpler but slower methods for finding a and b are available. It is worth noting that Equation (16) implies that the new target frequencies Q can be associated with log-odds scores S related to the original scores s by the equation Sij = sij + δi + j ,
(23)
and having the same-scale parameter λ as s. In other words, the new scores may be derived from the old ones by adding a certain (but variable) quantity to each row and to each column. However, the δi and j are not unique, because Equation (23) still holds if one adds a quantity to all the δi , and subtracts that quantity from all the j . As described in the Introduction section, different substitution matrices are appropriate for different evolutionary distances. A convenient and widely used measure for characterizing the degree of divergence for which a matrix is tailored is the relative entropy H of its target and background frequencies (Altschul, 1991): H (q; p, p , ) =
qij ln
ij
qij . pi pj
(24)
A1 q1,1 A2 q2,1 .. . AN qN ,1 i Ai qi,1 0 .. . 0
A1 q1,2 A2 q2,2 .. . AN qN ,2
... ... .. . ...
0 i Ai qi,2 .. . 0
A1 q1,N −1 A2 q2,N −1 .. . AN qN ,N −1
... 0 ... 0 .. .. . . ... i Ai qi,N −1
.
(22)
the maxima of convex functions occur at boundary points, where numeric stability can be problematic, we do not attempt to provide a numeric upper bound for h. Next, an additional Lagrange multiplier must be introduced, for the constraint of Equation (25). This leads to a generalization of Equation (16): Qij = Ai Bj qijC (Pi Pj )1−C (26) B and C. Equations (17) and (18) are then for non-negative A, B and C: replaced by 2N constraints on A, 1−C C) ≡ Ai P −C fi (Ai , B, Bj qijC Pj i j
= 1 ∀ 1 ≤ i ≤ N; Bj , C) ≡ Bj Pj −C gj (A, Ai qijC Pi 1−C
(27)
i
= 1 ∀ 1 ≤ j < N; B, C) ≡ k(A, Ai Bj qijC (Pi Pj )1−C ij
× ln Ai Bj
qij Pi Pj
(28)
C = h.
(29)
However, when a substitution matrix is transformed for a new compositional context, H can change substantially. Because the effectiveness of a matrix depends strongly upon its relative entropy (Altschul, 1991, 1993; States et al., 1991; Henikoff and Henikoff, 1992), it may be desirable to control H , when constructing new target frequencies Q, by adding the extra constraint
A multidimensional Newtonian optimization procedure may be applied to these constraints, as it was above to Equations (17) and (18). However the resulting equations are quite complex, and will be omitted here. For feasible h, the convexity of D(Q, q) and that H (Q; P , P ) differs from D only by terms linear in the Qij imply the convergence of the Newtonian method to a unique optimal Q. We note that the resulting substitution scores S take the form
H (Q; P , P ) = h,
Sij = γ sij + δi + j ,
(25)
for an appropriately chosen positive constant h (Yu et al., 2003). [Note that although D of Equation (2) is similar in form to H of Equation (25), the former is a quantity we seek to minimize, whereas the latter is constrained.] When Equation (25) is added to Equation (14) to constrain Q, the details of our optimization procedure must be modified accordingly, as we outline below. First, if the h specified is too large, no Q can satisfy Equation (25). To find the maximum feasible h one needs merely to maximize H (Q; P , P ) subject to the constraints of Equation (14). Formally,
(30)
when they are expressed using the same-scale parameter λ as s. Once again, the δi and j are not unique, although γ is, and is equal to the variable C of Equations (26)–(29). Finally, we observe that the ungapped scale parameter λ of a standard substitution matrix (Karlin and Altschul, 1990) depends upon the compositional context in which it is used. Accordingly, to estimate accurate statistics for gapped local alignments, it has been argued that standard matrices should be scaled appropriately for use with a standard set of gap costs (Schäffer et al., 2001), and this is now the default
905
Y.-K.Yu and S.F.Altschul
Fig. 1. Examples of alignment extensions yielded by compositional adjustment of the scoring system. The sequences compared are fructose-bisphosphate aldolases from P.falciparum (NCBI gi #3319034) (top lines) and F.nucleatum (NCBI gi #19703667) (bottom lines). In the central lines, aligned identical residues are echoed, and aligned residues with positive substitution score are indicated by ‘+’ symbols. Following the current default behavior of BLAST (Altschul et al., 1990, 1997), alignments and normalized bit scores (Altschul, 1993) for all comparisons are calculated using experimentally determined gapped statistical parameters (Altschul et al., 2001) and composition-based statistics (Schäffer et al., 2001). Substitution matrices were scaled to have ungapped λ = 0.006352, rounded to the nearest integer, and then used in conjunction with gap scores of −550–50k for a gap of length k. Amino acid compositions were calculated by adding 20 pseudocounts proportional to the amino acid frequencies implicit in BLOSUM-62 (Table 2) to the actual amino acid counts from the proteins in question (Yu et al., 2003). (a) The alignment yielded by the standard BLOSUM-62 substitution matrix. (The matrix of Table 1 was multiplied by 0.965/0.006352 = 151.9.) The alignment has a score of 36.6 bits. The ungapped relative entropy of the matrix in the new compositional context is 0.566 bits. (b) The alignment yielded by a composition-adjusted matrix derived from BLOSUM-62 with unconstrained relative entropy. (The corresponding components of δu and u from Table 2 were added to the matrix of Table 1, and the resulting matrix was multiplied by 1.0/0.006352 = 157.4.) The alignment has a score of 39.7 bits. The ungapped relative entropy of the matrix in the new compositional context is 0.601 bits. (c) The alignment yielded by a composition-adjusted matrix derived from BLOSUM-62 with relative entropy constrained to equal 0.566 bits. (The corresponding components of δc and c from Table 2 were added to γ = 0.969 times the matrix of Table 1, and the resulting matrix was then multiplied by 1.0/0.006352 = 157.4.) The alignment has a score of 39.9 bits.
behavior of the BLAST protein database search programs (Altschul et al., 1990, 1997). In other words, when such ‘composition-based’ statistics are employed, a standard matrix s is multiplied by a constant factor. Thus, when written in terms of such a scaled matrix, even a compositionally adjusted matrix S with unconstrained relative entropy takes the form of Equation (30), with γ not equal to 1.
5
EXAMPLE
To illustrate the use of our methods, we consider the comparison of fructose-bisphosphate aldolases from Plasmodium falciparum (Kim et al., 1998) (NCBI gi #3319034) and Fusobacterium nucleatum (Kapatral et al., 2002) (NCBI gi #19703667). An optimal local alignment (Fig. 1a) was produced (Smith and Waterman, 1981) using a
906
scaled version of the standard BLOSUM-62 amino acid substitution matrix (Table 1) (Henikoff and Henikoff, 1992), and gap costs proportional to 11 + k for a gap of length k. The alignment has a normalized score (Altschul, 1993) of 36.6 bits, within the ‘twilight zone’ of detection in a database search. Both P.falciparum and F.nucleatum have AT-rich genomes, which tend to favor amino acids with AT-rich codon sets (FLINKYM), and disfavor those with CG-rich codon sets (PRAWG) relative to standard amino acid frequencies (Wan and Wootton, 2000), such as those implicit (Yu et al., 2003) in the BLOSUM-62 matrix. However, of the two proteins considered here, only that from F.nucleatum has an amino acid composition (Table 2) largely typical for AT-rich organisms. We used the methods described above to transform the standard BLOSUM-62 matrix s of Table 1 to matrices consistent with the composition of the sequences being compared.
Table 1. The standard BLOSUM-62 amino acid substitution matrix
R
N
D
C
Q
E
G
H
I
L
K
M
F
P
S
T
W
Y
V
1.362 −0.490 −0.531 −0.608 −0.142 −0.279 −0.299 0.055 −0.563 −0.458 −0.508 −0.254 −0.324 −0.766 −0.282 0.387 −0.016 −0.876 −0.611 −0.066
−0.490 1.897 −0.152 −0.557 −1.175 0.341 −0.040 −0.799 −0.087 −1.036 −0.747 0.731 −0.474 −0.966 −0.731 −0.265 −0.389 −0.929 −0.587 −0.867
−0.531 −0.152 1.959 0.441 −0.922 0.001 −0.093 −0.147 0.200 −1.115 −1.171 −0.062 −0.745 −1.038 −0.693 0.208 −0.016 −1.281 −0.722 −0.997
−0.608 −0.557 0.441 2.001 −1.199 −0.109 0.523 −0.455 −0.388 −1.082 −1.250 −0.243 −1.060 −1.207 −0.513 −0.090 −0.364 −1.461 −1.062 −1.089
−0.142 −1.175 −0.922 −1.199 2.974 −1.006 −1.252 −0.867 −1.036 −0.425 −0.443 −1.052 −0.492 −0.823 −0.969 −0.303 −0.300 −0.799 −0.834 −0.280
−0.279 0.341 0.001 −0.109 −1.006 1.832 0.643 −0.619 0.155 −0.960 −0.740 0.441 −0.146 −1.097 −0.444 −0.035 −0.234 −0.675 −0.492 −0.762
−0.299 −0.040 −0.093 0.523 −1.252 0.643 1.699 −0.731 −0.041 −1.107 −0.986 0.269 −0.693 −1.107 −0.387 −0.051 −0.299 −0.983 −0.700 −0.847
0.055 −0.799 −0.147 −0.455 −0.867 −0.619 −0.731 1.928 −0.707 −1.291 −1.257 −0.529 −0.928 −1.077 −0.739 −0.101 −0.546 −0.863 −1.054 −1.088
−0.563 −0.087 0.200 −0.388 −1.036 0.155 −0.041 −0.707 2.603 −1.120 −0.966 −0.250 −0.538 −0.428 −0.749 −0.306 −0.584 −0.812 0.587 −1.081
−0.458 −1.036 −1.115 −1.082 −0.425 −0.960 −1.107 −1.291 −1.120 1.386 0.527 −0.925 0.390 −0.056 −0.955 −0.814 −0.249 −0.894 −0.461 0.883
−0.508 −0.747 −1.171 −1.250 −0.443 −0.740 −0.986 −1.257 −0.966 0.527 1.334 −0.848 0.690 0.144 −0.991 −0.847 −0.415 −0.566 −0.368 0.273
−0.254 0.731 −0.062 −0.243 −1.052 0.441 0.269 −0.529 −0.250 −0.925 −0.848 1.561 −0.470 −1.067 −0.351 −0.071 −0.232 −1.025 −0.631 −0.784
−0.324 −0.474 −0.745 −1.060 −0.492 −0.146 −0.693 −0.928 −0.538 0.390 0.690 −0.470 1.869 0.004 −0.858 −0.513 −0.231 −0.494 −0.345 0.238
−0.766 −0.966 −1.038 −1.207 −0.823 −1.097 −1.107 −1.077 −0.428 −0.056 0.144 −1.067 0.004 2.095 −1.247 −0.821 −0.730 0.318 1.019 −0.294
−0.282 −0.731 −0.693 −0.513 −0.969 −0.444 −0.387 −0.739 −0.749 −0.955 −0.991 −0.351 −0.858 −1.247 2.552 −0.280 −0.373 −1.267 −1.012 −0.814
0.387 −0.265 0.208 −0.090 −0.303 −0.035 −0.051 −0.101 −0.306 −0.814 −0.847 −0.071 −0.513 −0.821 −0.280 1.346 0.479 −0.954 −0.584 −0.571
−0.016 −0.389 −0.016 −0.364 −0.300 −0.234 −0.299 −0.546 −0.584 −0.249 −0.415 −0.232 −0.231 −0.730 −0.373 0.479 1.575 −0.842 −0.557 −0.019
−0.876 −0.929 −1.281 −1.461 −0.799 −0.675 −0.983 −0.863 −0.812 −0.894 −0.566 −1.025 −0.494 0.318 −1.267 −0.954 −0.842 3.640 0.747 −0.982
−0.611 −0.587 −0.722 −1.062 −0.834 −0.492 −0.700 −1.054 0.587 −0.461 −0.368 −0.631 −0.345 1.019 −1.012 −0.584 −0.557 0.747 2.286 −0.419
−0.066 −0.867 −0.997 −1.089 −0.280 −0.762 −0.847 −1.088 −1.081 0.883 0.273 −0.784 0.238 −0.294 −0.814 −0.571 −0.019 −0.982 −0.419 1.306
The standard BLOSUM-62 matrix is here scaled so that its ungapped scale parameter λ equals 1.0 (Karlin and Altschul, 1990) in the context of its implicit background frequencies (Yu et al., 2003). In the compositional context of the two sequences under consideration (NCBI gi #3319034 and #19703667), this matrix has ungapped λ = 0.965, and relative entropy h = 0.566 bits.
907
Substitution matrices for non-standard compositions
A R N D C Q E G H I L K M F P S T W Y V
A
Y.-K.Yu and S.F.Altschul
Table 2. Compositional adjustment of the BLOSUM-62 substitution matrix
Amino acid
Amino acid frequencies (%)a BLOSUM-62 P.falciparum background gi #3319034
F.nucleatum gi #19703667
A R N D C Q E G H I L K M F P S T W Y V
7.4 5.2 4.5 5.3 2.5 3.4 5.4 7.4 2.6 6.8 9.9 5.8 2.5 4.7 3.9 5.7 5.1 1.3 3.2 7.3
8.5 4.4 5.8 5.8 0.3 3.1 8.8 5.8 1.4 7.1 9.5 9.8 2.7 3.7 3.7 5.8 4.7 0.3 2.0 6.8
a
10.9 3.8 5.2 3.0 1.9 3.8 7.9 8.4 1.4 4.6 10.6 7.9 0.8 2.7 4.1 4.3 7.1 1.1 3.3 7.3
BLOSUM-62 score adjustment parameters δu u δc
c
−0.016 −0.070 −0.124 −0.170 0.491 −0.078 −0.194 0.129 0.092 −0.039 −0.019 −0.179 −0.049 0.049 −0.004 −0.038 −0.011 0.480 0.125 −0.020
−0.091 0.094 0.028 0.190 −0.085 0.008 −0.010 −0.114 0.160 0.099 0.051 −0.006 0.119 0.161 −0.010 0.004 −0.052 −0.096 0.027 0.044
−0.013 −0.057 −0.109 −0.148 0.479 −0.070 −0.178 0.133 0.101 −0.028 −0.010 −0.165 −0.040 0.063 0.012 −0.034 −0.008 0.473 0.131 −0.012
−0.096 0.094 0.025 0.195 −0.128 0.008 −0.012 −0.135 0.157 0.099 0.048 −0.006 0.124 0.158 −0.028 0.005 −0.056 −0.152 0.013 0.043
Frequencies may not sum to 100% due to rounding errors.
In the first instance, we left unconstrained the relative entropy of the new matrix Su . The multidimensional Newtonian optimization procedure converged after six iterations. The resulting sets of δs and s, δu and u , are shown in Table 2, corresponding to substitution scores involving amino acids from, respectively, the P.falciparum and F.nucleatum sequences. Because one may add an arbitrary constant to one set and subtract it from the other, we have chosen δu and u so that the components of each have the same mean. The matrix Su is then constructed from s (Table 1) simply by using Equation (23). The relative entropy of Su is 0.601 bits in the new compositional context. In the second instance, we constrained the relative entropy of the new matrix Sc to be h = 0.566 bits, which equals that of the original matrix s in the new context. The Newtonian procedure converged after 10 iterations, with a resulting γ = 0.969, and δc and c given in Table 2. The matrix Sc is constructed from s using Equation (30). When Su and Sc are used in place of s to compare the two sequences in question, the resulting optimal local alignments, which differ only slightly from one another, are shown in Figures 1b and c, respectively. Both extend the standard BLOSUM-62 alignment to the right by a mean of 79.5 residues, and are broadly consistent with threedimensional structural evidence (Yu et al., 2003). The normalized alignment score is 39.7 bits in the first instance, and 39.9 bits in the second, improving in both cases the standard BLOSUM-62 score by over 3 bits, or a factor of 8 in statistical significance. Compositionally adjusted matrices do not always produce better alignments and alignment scores, but for pairs of related, compositionally biased sequences, they do on average yield improved results (Yu et al., 2003). In addition, constraining the relative entropy of the new matrix as we have done does not always yield better alignment
908
scores than does leaving the relative entropy unconstrained. In the example considered here, the constrained and unconstrained matrices improve the alignment score with respect to the standard matrix by 3.3 and 3.1 bits, respectively, but in the example illustrated by Yu et al. (2003), they improve the standard score by 2.1 and 3.4 bits, respectively. We are investigating empirically whether rules dependent on sequence length or composition can be formulated for deciding when it is productive to constrain the relative entropy of compositionally adjusted matrices, as well as how best to specify the value for h. These questions, however, do not bear on the details of the methods this paper presents.
6
CONCLUSION
Standard amino acid substitution matrices are generally constructed as log-odds matrices from protein alignment data characterized by specific background amino acid frequencies. It has been argued that if one demands consistency between target and background frequencies, a substitution matrix is in general appropriate to only one compositional context (Yu et al., 2003). It is therefore not appropriate to use a standard matrix to align proteins with amino acid compositions that differ substantially from those used for the matrix’s construction. Furthermore, a rationale has been presented for transforming a standard matrix into one appropriate for any new compositional context (Yu et al., 2003). In this paper, we have presented the mathematical details underlying this approach. Specifically, we have described an effective procedure for determining whether an arbitrary substitution matrix can be valid and non-degenerate in some compositional context and, if so, for calculating the implicit scale, and background and target frequencies that characterize the
Substitution matrices for non-standard compositions
matrix. In addition, we have described an efficient Newtonian procedure for transforming a substitution matrix that is valid in one compositional context into a close matrix that is valid in a different context, controlling the relative entropy of the new matrix if desired. Programs implementing the algorithms described are available from the authors upon request.
ACKNOWLEDGEMENTS We thank Dr John Spouge and Dr John Wootton for many stimulating discussions and suggestions.
REFERENCES Altschul,S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol., 219, 555–565. Altschul,S.F. (1993) A protein alignment scoring system sensitive at all evolutionary distances. J. Mol. Evol., 36, 290–300. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Altschul,S.F., Bundschuh,R., Olsen,R. and Hwa,T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res., 29, 351–361. Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, vol. 5, Suppl. 3, pp. 345–352. Dembo,A., Karlin,S. and Zeitouni,O. (1994) Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Prob., 22, 2022–2039. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci., USA, 89, 10915–10919. Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci., USA, 87, 2264–2268. Kapatral,V., Anderson,I., Ivanova,N., Reznik,G., Los,T., Lykidis,A., Bhattacharyya,A., Bartman,A., Gardner,W., Grechkin,G. et al. (2002) Genome sequence and analysis of the oral bacterium Fusobacterium nucleatum strain ATCC 25586. J. Bacteriol., 184, 2005–2018. Kim,H., Certa,U., Dobeli,H., Jakob,P. and Hol,W.G. (1998) Crystal structure of fructose1,6-bisphosphate aldolase from the human malaria parasite Plasmodium falciparum. Biochemistry, 37, 4388–4396. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, research0010.1–research0010.13. Muller,T., Rahmann,S. and Rehmsmeier,M. (2001) Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics, 17 (Suppl. 1), S182–S189. Ng,P.C., Henikoff,J.G. and Henikoff,S. (2000) PHAT: a transmembrane-specific substitution matrix. Bioinformatics, 16, 760–766. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci., USA, 85, 2444–2448. Rump,S.M. (1979) Polynomial minimum root separation. Math. Comput., 33, 327–336. Schäffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005. Schwartz,R.M. and Dayhoff,M.O. (1978) Matrices for detecting distant relationships. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. National Biomedical Research Foundation, Washington, DC, vol. 5, Suppl. 3, pp. 353–358. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197. States,D.J., Gish,W. and Altschul,S.F. (1991) Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. Methods, 3, 66–70. Sueoka,N. (1988) Directional mutation pressure and neutral molecular evolution. Proc. Natl Acad. Sci., USA, 85, 2653–2657. Wan,H. and Wootton,J.C. (2000) A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem., 24, 71–94.
Yu,Y.-K., Wootton,J.C. and Altschul,S.F. (2003) The compositional adjustment of amino acid substitution matrices. Proc. Natl Acad. Sci., USA, 100, 15688–15693.
7
APPENDIX: THE NON-INVERTIBILITY OF M(λ) AND DEGENERATE MATRICES
In the main text, we have assumed that for valid substitution matrices, M(τ ) of Equation (5) is invertible at τ = λ, where λ is the scale corresponding to the matrix. This is not necessarily the case, and when det M(λ) = 0 one may not solve for p and p using Equation (7). As we will discuss, in this case there exists a space of target frequencies all giving rise to s, and the matrix is said to be degenerate. In general, it is possible to distinguish degenerate but valid matrices from invalid matrices, and then to characterize fully the target frequencies corresponding to the former. However, the mathematical details of this complete anatomy of substitution matrices are quite involved, and we will limit ourselves here to recognizing and fully characterizing only those valid matrices that are non-degenerate. As we will show below, any valid matrix with a row or column identically zero must have det M(λ) = 0, and therefore be degenerate. Thus for non-degenerate matrices, the upper bound λmax on λ of (12) remains valid. (However, we will show that when a row or column of s is identically zero, an ever better bound on λ may be derived.) To apply the approach of the main text to non-degenerate s, we may merely exclude from consideration those τ in (0, λmax ] with det M(τ ) = 0. It is possible that det M(τ ) is uniformly zero. This arises trivially when two rows or two columns of s are identical or offset by a constant, but more complex cases may also be constructed, which may be recognized numerically. [For complete rigor, again assume rational scores, conceptually convert det M(τ ) into a polynomial, and evaluate det M(τ ) at sufficiently many distinct points to guarantee it is uniformly zero.] When det M(τ ) is uniformly zero, a valid s must be degenerate, so we consider this case no further. However, as stated above, further analysis allows one to distinguish valid from invalid s, and to describe the space of target frequencies corresponding to a valid s. When det M(τ ) is not uniformly zero, we find all the roots (x1 , . . . , xn ) of det M(τ ) = 0 within the interval (0, λmax ], if any exist. [As before, for complete rigor in determining that all roots have been found, we require rational scores, and may then use results on minimum root separation (Rump, 1979).] The procedure in the main text may then be applied to the intervals (0, x1 ), (x1 , x2 ), . . . , (xn , λmax ]. If no λ yielding a sensible solution to Equation (7) is found within these intervals, then s may be valid only if its λ equals one of the xi , implying s is degenerate. As before, further analysis would allow one to distinguish valid from invalid s, and to describe the space of target frequencies corresponding to a valid s. We will next show that a valid matrix with a row or column of scores identically zero must be degenerate and, finally, we will discuss briefly what the degeneracy of a matrix implies about its target frequencies.
7.1
Valid matrices with a row or column identically zero
For a valid matrix s with a row or column identically zero and scale parameter λ, one can show that det M(τ = λ) = 0. Assume, without
909
Y.-K.Yu and S.F.Altschul
loss of generality, that the first column of s is identically zero, and write M(λ) in terms of a set of corresponding target and background frequencies:
q 11 p1 p1 q21 p p 2 1 . .. qN 1 pN p1
q12 p1 p2 q22 p2 p2 .. . qN 2 pN p2
... ... ..
.
...
1 p 1 1 det p1 . .. 1 p1
q12 p1 p2 q22 p2 p2 .. . qN 2 pN p2
... ... ..
.
...
e
pL+1
pJ
+ ··· + e + ··· + e = 1 − p1 + · · · + pL . λcJ
λsI ,N
pN
Because no term on the left-hand side of this equation is negative, cJ ≥ c and pJ ≥ (1 − Lj=1 pj )/(N − L), we have eλc ≤ 1, N −L 910
q12 p1 p2 q22 p2 p2 .. . qN 2 pN p2
q1N p 1 pN q2N p 2 pN .. . qNN p N pN
... ... ..
ln(N − L) . c
.
...
The same argument of course applies with rows and columns switched.
7.2
The target frequencies of degenerate matrices
What type of target frequencies give rise to degenerate matrices? For any valid matrix with scale parameter λ and a set of target
q1N 1 p 1 pN q2N 1 p 2 pN = det . .. .. . qNN 1 p N pN
Because the determinant equals 1/p1 times itself, the determinant must be zero, proving our assertion. Although we do not seek here to determine the possible target and background frequencies of degenerate matrices, it is worth observing that the argument of the main text deriving an upper bound on λ may be extended to matrices with one or more rows or columns identically zero. Recall that cj is the largest score in column j . If L columns of s are uniformly zero, with L < N , simply redefine c to be smallest of the positive cj . Then [ln(N − L)]/c provides an even tighter upper bound on λ. (Note that λ is unbounded only if the entire matrix s is zero.) Without loss of generality, assume that the first L columns of s have only zero score. Let pJ be the largest among {pL+1 , . . . , pN } so that L pJ ≥ (1 − j =1 pj )/(N − L). Let the largest score in column J of the substitution matrix be located in row I . Then by Equation (6) λsI ,L+1
λ≤
q1N 1 p 1 pN q2N 1 p 2 pN = .. .. . . qNN 1 p N pN
Now, multiply the second through N-th column vectors of the matrix on the left-hand side by, respectively, p2 /p1 through pN /p1 , and add these all to the first column of the matrix. This operation does not change the value of the determinant, and since j qij = pi , we obtain
which immediately provides an even tighter upper bound for λ, i.e.
q12 p1 p2 q22 p2 p2 .. . qN 2 pN p2
... ... ..
.
...
q1N p 1 pN q2N p2 pN . .. . qN N p N pN
frequencies q, we may write det M(λ) as q11 q12
p1 p 1 q21 p2 p 1 det M(λ) = det . . . qN 1 pN p1
p1 p2 q22 p2 p2
...
q1N p1 pN
...
q2N p2 p N
.. .
..
.. .
qN 2 pN p2
...
=
N 1 pi pi i=1
q11
q21 · det . . . qN 1
.
qN N pN p N
q12
...
q22
...
.. .
..
qN 2
.
q1N
q2N . .. .
. . . qN N
Since det M(λ) vanishes if and only if det q does, degenerate matrices arise when some row (or column) vector of target frequencies may be written as a linear combination of the other row (or column) vectors. Note that the ‘letters’ corresponding to rows and columns may always be considered distinct, and it is not necessarily the case that if row vector i is a linear combination of the other row
Substitution matrices for non-standard compositions
vectors, then column vector i is a linear combination of the other column vectors. However, if the row vectors as a whole are linearly dependent, then so are the column vectors. For degenerate matrices, the alphabet being employed is, in a certain sense, too large. In essence, one may always view some row (column) ‘letter’ as equivalent to a linear combination of the other
row (column) letters. This leads to a linear space of valid target frequencies, as one may replace certain letters by others, in given fixed proportions. A degenerate matrix may, however, always be considered equivalent to a non-degenerate matrix of lower dimension, with certain rules for translating the ‘extra’ letters into linear mixtures of the remaining ones.
911