JOURNAL OF COMPUTATIONAL BIOLOGY Volume 17, Number 9, 2010 # Mary Ann Liebert, Inc. Pp. 1113–1128 DOI: 10.1089/cmb.2010.0099
Natural Parameter Values for Generalized Gene Adjacency ZHENYU YANG and DAVID SANKOFF
ABSTRACT Given the gene orders in two modern genomes, it may be difficult to decide if some genes are close enough in both genomes to infer some ancestral proximity or some functional relationship. Current methods all depend on arbitrary parameters. We explore a class of gene proximity criteria and find two kinds of natural values for their parameters. One kind has to do with the parameter value where the expected information contained in two genomes about each other is maximized. The other kind of natural value has to do with parameter values beyond which all genes are clustered. We analyze these using combinatorial and probabilistic arguments as well as simulations. Key words: algorithms, genomic rearrangements, molecular evolution, sequence analysis, statistics.
1. INTRODUCTION
A
s genomes of related species diverge through rearrangement mutations, groups of genes once tightly clustered on a chromosome will tend to disperse to remote locations on this chromosome or even onto other chromosomes. Even if most rearrangements are local, for examples, small inversions or transpositions, after a long enough period of time their chromosomal locations may reflect little or none of their original proximity. Given the gene orders in two modern genomes, then, it may be difficult to decide if a group of genes are close enough to each other in both genomes to infer some ancestral proximity or some functional relationship. There are a number of formal criteria for gene clustering in two organisms, giving rise to cluster detection algorithms and statistical tests for the significance of clusters. These methods, comprehensively reviewed by Hoberman and Durand (2005), all depend on one or more arbitrary parameters as well as n, the number of genes in common in the two genomes. The various parameters control, in different ways, the proximity of the genes on the chromosome in order to be considered a cluster. Change the parameters and the number of clusters may change, as may the content of each cluster. In this article, we define a class of gene proximity criteria, where two genes are said to be (i, j)-adjacent if they are separated by i 1 other genes on a chromosome in either one of the genomes and j 1 genes in the other genome. We also call this gene pair a (i, j) adjacency, and we say that the gene separation is i 1 in one genome and j 1 in the other. We define a (y, c)-generalized adjacency cluster in terms of a graph where the genes are vertices and an edge connects a pair of genes if they are (i, j)-adjacent where min(i, j) Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada.
1113
1114
YANG AND SANKOFF
min(y, c) and max(i, j) max(y, c). Then the connected components of the graph are the (y, c) clusters. These definitions extend our two previous notions, k adjacency and y cluster in Zhu et al. (2009); Xu and Sankoff (2008), which dealt with all the gene pairs that are (i, j)-adjacent in our definition of twoparameter generalized adjacency, where i k and j k and (y, y) clusters in the present terminology. An innovative property of generalized adjacency clusters is that they embody gene order considerations within the cluster. Of particular interest are (i, 1) adjacencies and (y, 1) clusters. These are special in that they pertain to how far apart are genes in either one of the genomes that are strictly adjacent in the other genome. As with other criteria, the quantities y and c would seem arbitrary parameters in our definition of a cluster. The main goal of this article is to remove some of this arbitrariness, by finding ‘‘natural values’’ for y and c as a function of n, the total number of genes in the genomes. We find two such functions; the first trades off the expected number, across all pairs of genomes, of generalized adjacencies against the parameters y and c, with lower parameter values considered more desirable, i.e., it is good to find a large number of generalized adjacencies, but not at the cost of including unreasonably remote adjacencies. To do this we first define a wide class of similarities (or equivalently, distances) between two genomes in terms of weights on the (i, j) adjacencies, namely any system of fixed-sum, symmetric, non-negative weights o non-increasing in i and j. This is the most general way of representing decreasing weight with increasing separation of the genes on the chromosome. In any pair of genomes, we then wish to maximize the sum of the weights, which essentially maximizes the sensitivity of the criterion. Our main result is a theorem showing that the solution reduces to a uniform weight on gene separations up to a certain value of both y 1 and c 1, and zero weight on larger separations. 2 2 Moreover, the theorem specifies that the optimum value of h2 is the record time of a series of n2 random variables, where n is the length of the genomes. These are not the i.i.d. random variables studied in the theory of records, however, being highly dependent and, more important, of decreasing mean and variance. We use simulations to investigate the expected value of the time under a uniform measure on the set pffiffirecord ffi 2 of genomes, finding that these increase approximately as n, in contrast with the value n4 to be expected if 3 these were i.i.d. variables. Thus, it turns out that with genomes lengths of order 10 or 104, the optimal value of y is in the range of 8–15. We extend all these results to generalized adjacency in three or more genomes. If we are willing to accept the loss of sensitivity, and prefer to search for clusters more widely dispersed on chromosomes, there is a second set of ‘‘natural’’ parameter values that serve as an upper bound on the meaningful choices y and c. These values are the percolation thresholds of the (y, c) clusters. Beyond these values, tests of significance are no longer meaningful because all clusters rapidly coalesce together; it is no longer surprising, revealing or significant to find large groups of genes clustering together, even in pairs of random genomes. Percolation has been studied for max-gap clusters (Hoberman et al., 2005), but the main analytical results on percolation pertain to completely random (Erdo¨s-Re´nyi) graphs. The graphs associated with (y, c) clusters manifest delayed percolation, so the use of Erdo¨s-Re´nyi percolation values would be a ‘‘safe’’ but conservative way of avoiding dangerously high values of the parameters. We show how to translate known results on Erdo¨s-Re´nyi percolation back to generalized adjacency clusters. We also introduce random bandwidth-limited graphs and use simulations to compare the delays of generalized adjacency and bandwidth-limited percolation with respect to Erdo¨s-Re´nyi percolation in order to understand what structural properties of generalized adjacency are responsible for the delay.
2. DEFINITIONS Let S and T be two genomes with gene set V ¼ f1, . . . , ng linearly disposed in some order. Two genes g i and h are i-adjacent in S, written x~y in S, if there are i 1 genes between them in S. E.g., 1 and 4 are 2-adjacent in the genome 2134. We also call the gene pair (g, h), an i adjacency in S, if g and h are i-adjacent. We define genes g and h are (i, j)-adjacent in two genomes S and T, if they are i-adjacent in either one of the genomes and j-adjacent in the other. The gene pair (g, h) is called an (i, j) adjacency if they are (i, j)-adjacent in two genomes. We denote by EMH the set of all i adjacencies in a genome M, where 1 i Y. For two genomes S and T with the same gene set V ¼ f1, . . . , ng linearly disposed in some order, we define a subset of C V to be a
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
FIG. 1.
1115
Determination of (1, 3) clusters (or (3, 1) clusters).
a generalized adjacency cluster, or (y, c) cluster, if it consists of the vertices of a connected component w w h h of the graph Ghw ST ¼ (V, (ES \ ET ) [ (ES \ ET )). Figure 1 illustrates how genomes S ¼ 123456789 and T ¼ 215783649 determine the (1, 3) clusters {1, 2} and {3, 4, 5, 6, 7, 8}.
3. A CLASS OF GENOME DISTANCES Rather than looking directly for natural values of y and c, we first remark that counting all (i, j) adjacencies (with i y and j c) on the same footing when defining clusters may be giving undue weight to pairs of genes that are remote in one or both genomes, relative to genes that are directly adjacent on both. Thus, we are led to consider a general system of (i, j)-dependent weights and to try to optimize this weighting, instead of just the ‘‘cut-off’’ values y and c. Given two genomes S and T with the same genes. Let oij be the weight on two genes that are (i, j)adjacent, i.e., i-adjacent in one of the genomes and j-adjacent in the other, such that 1. P 0 oijP ¼ oji, i, j 2 f1, 2, . . . , n 1g n1 n1 2. i¼1 j¼1 xij ¼ 1 3. oi,j ok,l if (a) max(i, j) < max(k, l) (b) max(i, j) ¼ max(k, l) and min(i, j) < min(k, l) We define the distance between two genomes S and T as d(S, T) ¼ 2(n 1)
nX 1 i¼1
nii wii þ
nX 1
! nij xij :
(1)
j¼1
where nij is the total number of gene pairs (x, y) that are i-adjacent in S and j-adjacent in T. n is genome size of two genomes S and T.
4. THE OPTIMUM WEIGHT SYSTEM IS UNIFORM We wish to find a system of weights o that tends to allocate, inasmuch as possible, higher weight to (i, j) adjacencies with small i and j, thus emphasizing the local similarities of the two genomes, but not excluding moderate values of i and j, to allow some degree of genome shuffling. Our strategy is to examine individual pairs of genomes (S, T) first, and that is the topic of this section. In the next section, we will study the consequences of introducing a uniform measure on the space of genomes with length n.
1116
YANG AND SANKOFF
Before stating our main results on (i, j) adjacencies, we prove a special case. We require that every adjacency with non-zero weight be a (1, j) adjacency. Then the definitions of weight and genome distance in Section 2 can be rephrased as following: Let the weight oi be any non-negative, non-increasing, function P on the positive integers such that ni ¼11 xi ¼ 1. The weight o induces a distance between genomes S and T with the same genes as follows: d(S, T) ¼ 2(n 1)
nX 1
(nSi þ nTi )xi
(2)
i¼1
where nXj is the number of pairs of genes that are j-adjacent in genome X and 1-adjacent in the other genome. Theorem 1.
For genomes S and T, the weight o that minimizes the distance (2) has 1 , if 1 i k xi ¼ k 0, otherwise,
(3)
where k* is an integer maximizing the function Pk
i¼1
f (k) ¼
(nSi þ nTi ) : k
(4)
Proof. Let ni ¼ nSi þ nTi . Based on Equation (2), minimizing d(S, T) is equivalent to maximizing the summation R¼
nX 1
ni x i
(5)
i¼1
We first note that a uniform upper bound on oi is 1i . i.e. 0 xi 1i . This follows from the non-increasing condition on o and its sum over all i being 1. Moreover, if xi ¼ 1i for some value of i, then x1 ¼ x2 ¼ ¼ xi 1 ¼ xi ¼ 1i and xi þ 1 ¼ ¼ xn 1 ¼ 0, by the same argument. Now we show that for any solution, i.e., an x ¼ (x1 , x2 , . . . , xn 1 ) that maximizes equation (5), there must be one weight in o which attains this upper bound. To prove this, let weights x1 , x2 , . . . , xn 1 maximize the equation (5) for given values of n1 , n2 , . . . , nn 1 , such that z ¼ max R. If all the ni’s are equal or all the oi’s are equal, the theorem holds trivially. For all other cases, assume that there is no weight in o that attains its upper bound. We define the set C ¼ f i j xi 4 xi þ 1 , 1 i n 2g 6¼ ;. Let n ¼ minC (min (xi xi þ 1 , 1i xi )) 4 0, by assumption. We select two weights oi and oj where ni = nj. Without loss of generality, we fix i < j and ni < nj. Then we define f0 ¼
i1 X
nk xk þ ni (xi n) þ
k¼1
j1 X k¼i þ 1
nk xk þ nj (xj þ n) þ
nX 1
nk xk
(6)
k¼j þ 1
¼ f þ (nj ni )n
(7)
4 f:
(8)
Then z is not the maximal value, contradicting the assumption about o. Hence, there must exist a weight oi in o attaining its upper-bound 1i . Then the optimal weight is x1 ¼ x2 ¼ ¼ xi 1 ¼ xi ¼ 1i and xi þ 1 ¼ ¼ xn 1 ¼ 0. Substituting this o in (5), produces the expression of form (4). So maximizing (5) is the same as maximizing (4). & Thus if we set y ¼ k*, we should find a large number of generalized adjacencies, but without paying the price of including a disproportionately large number of potential adjacencies. It is somewhat surprising that we find uniform weights up to the cut-off values since it was to avoid choosing this simplest of all weight systems by default that we were originally motivated to search within a much broader gamut of weight systems. The cut-off k* differs widely of course according to the pair of genomes S and T being compared. Nevertheless, we can use E[k*], as function of n, to find the natural value for the cut-off parameters in the uniform weight-based distance.
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
1117
Having obtained these results for the special case of (1, j) adjacencies, we may ask whether they hold true for (i, j) adjacencies more generally. Indeed we can prove: pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Theorem 2. Let ak ¼ b 1 þ 8(k2 1) þ 1c. The weight o that minimizes the distance (1) has 8 < k1 , if i 5 ak , j i, (9) xij ¼ or i ¼ ak , j k i(i 2 1) : 0, otherwise where k* is an integer maximizing the function 1 f (k) ¼ k
"
aX i k 1 X
(nij þ nji ) þ
k 12aX k (ak 1)
i¼1 j¼1
# (nak j þ njak ) ,
(10)
j¼1
where nij is the number of gene pairs i-adjacent on S and j-adjacent on T. Proof.
Since oij ¼ oji, Equation (1) is equivalent to d(S, T) ¼ 2(n 1)
nX 1 X i
(nij þ nji )xij :
(11)
i¼1 j¼1
We can use the lower-triangle matrix to represent the bivariate weight, i.e. 0 x11 B x21 x22 B B x31 x32 x33 x¼B B B @ x(n 2)1 x(n 2)2 x(n 2)3 x(n 2)(n 3) x(n 2)(n 2) x(n 1)1 x(n 1)2 x(n 1)3 x(n 1)(n 3) x(n 1)(n 2)
1 C C C C C C A
(12)
x(n 1)(n 1)
and for each oij (1 j i n 1), the corresponding coefficient is nij þ nji. Thus, we transform the twodimensional weight matrix to one-dimensional weight sequence x ¼ fx11 , x21 , x22 , , x(n 1)(n 2) , x(n 1)(n 1) g
(13)
Obviously, the os0 in Equation (13) satisfy the weight definition in Section 3. Therefore we have that oij satisfies a uniform distribution with a cut-off value k* based on Theorem 1. Let oij be the last nonzero o. Then the two-dimensional weight matrix is 1
0
x11 B x21 x22 B B B x x¼B x i1 i2 B B B @0 0 0 0
xij 0 0 0 0 0 0 0 0
k ¼
i(i 1) þj 2
0 0 0
C C C C C C C C A
(14)
So we have (15)
Since 1 j i n 1, then i(i 1) i(i þ 1) þ 1k 2 2 Solving (16), we obtain the bounds for i,
(16)
1118
YANG AND SANKOFF
FIG. 2. k is augmented from left to right, starting at the top row, in the lower triangle including the diagonal. Values of oij in the upper triangle determined by symmetry.
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 þ 8(k 1) þ 1 1 þ 8k 1 (17) i 2 2 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Because i, j and k* are all integers, i must be b 1 þ 8(k2 1) þ 1c and j ¼ k i(i 2 1). Therefore, the theorem holds. & pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffi In analogy to the (1, y) clusters mentioned above, we can set h ¼ w ¼ b 1 þ 8(k2 1) þ 1c 2k and use E[k*], as function of n, to find the natural value for the cut-off parameters in the uniform weight-based distance. The schema in Figure 2 indicates the order in which the cells of the two dimensional array of (i, j) adjacencies become non-zero as k* increases.
5. FINDING E[k*] The parameter k* in the previous section minimizes d(S, T) over all weighting systems for a given pair of genomes S and T. A larger k might increase the number of adjacencies with non-zero weight, but this would decrease the weight o disproportionately, so that the distance would no longer be minimal. A smaller k, on the other hand, would result in a greater weight o, but applied to far fewer adjacencies, so that the distance would depart from minimality as well. This motivates us to consider k* as a ‘‘natural’’ parameter for the weight system for the pair (S, T). If we want to see whether one pair of experimental genomes are more similar than another pair, however, we would like to have a single, standard value of k*. Consider the set of all pairs of random genomes, modeled by the set of all pairs of permutations on (1, , n), and consider the distribution of the natural value k* over this set. We argue that the mean of this distribution defines the single most reasonable single weight system for studying pairs (S, T) of experimental genomes. The search for k* in Theorem 2 over all pairs of random genomes, or even a sample, may seem prohibitively costly, but it can be substantially simplified. First we can show the nij, the number of pairs of genes that are i-adjacent in one genome and j-adjacent in the other, is Poisson distributed, with easily calculated parameters. Let N(n, y, c) be the number of gene pairs whose separation is less than y in either of the genomes and less than c in the other. Then, correcting for counting some terms more than once,
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
N(n, h, w) ¼
w h X X
(nij þ nji )
i¼1 j¼1
min(h, j1 Xw) X j¼2
(nij þ nji )
i¼1
min(h, Xw)
1119
nii :
(18)
i¼1
Theorem 3. For two random genomes S and T with n genes, nij converges in distribution to the Poisson with parameter E(nij ) ¼
2(n i)(n j) n(n 1)
(19)
Moreover, E[N(n, h, w)] ¼ (4wh 2h2 )
2h(w2 h2 þ wh) h(h 1)(2w2 2w h2 þ h) þ , where h w: n 2n(n 1)
(20)
Proof. We create a new genome R, by relabeling the genes of the genome T using their positions in the genome S. So the comparison of two genomes S and T with n genes is equivalent to the comparison of two i genomes I and R, where I is a identity genome, i.e., I ¼ (1, 2, . . . , n). Thus the event fx~y in genome S, j i (i, j) x~y in genome Tg is equivalent to the event fk~k þ i in the genome Rg. We define yk as: 8 < 1, if k~j k þ i in R (i, j) yk ¼ (21) : 0, otherwise: Clearly, the number of gene pairs in a genome with n genes is n(n 1). Now we are going to calculate the j number of gene pairs in the random genome R that satisfy k~k þ i, i.e., the two genes are i-adjacent in the genome R and j-adjacent in the genome I. For two genes x and y which are i-adjacent in the genome R, if the location of one gene, say x, is in the interval [j, n j] of I, then there are two choices for the position of the other gene y in I; while if the location of the gene x is not in the interval [j, n j] of I, there is only one choice for the position of gene y in I. So the number of gene pairs in the random genome R that satisfy j k~k þ i is 2(n j). Thus, we obtain 8 2(n j) < n(n k ¼ 1, 2, . . . , n i 1) , j (22) Pn (k~k þ i) ¼ : 0, otherwise, So j
Pn (yk(i, j) ¼ 1) ¼ Pn (k~k þ i, 1 k n i) j
¼ Pn (k~k þ ij1 k n i)Pn (1 k n i) 2(n i)(n j) ¼ n(n 1)2
(23)
Considering the tth factorial moment of nij,
h i n1 t!E yk(i,1 j) yk(i,2 j) . . . yk(i,t j) E[nij (nij 1) . . . (nij t þ 1)] ¼ t
(24)
Equation (24) holds because it represent the expectation of the number of ways to choose t both sidesj) of non-zero elements from y1(i, j) , . . . , y(i, n1 . Since all yk(i, j) can only take the value 1 or 0, we have j) (i, j) (i, j) (i, j) E[yk(i,1 j) y(i, k2 . . . ykt ] ¼ Pn (yk1 ¼ 1, . . . , ykt ¼ 1) t Y 1 Pn (yk(i,r j) ¼ 1) þ O t þ 1 ¼ n r¼1 " #t ! 2(n i)(n j) 1 ¼ þ O tþ1 n n(n 1)2
(25)
(26)
1120
YANG AND SANKOFF
Therefore, lim E[y(y 1) . . . (y t þ 1)] n1 j) (i, j) (i, j) ¼ lim t!E[y(i, k1 y k 2 . . . y kt ] n!1 t t 2(n i)(n j) ¼ n(n 1) n!1
(27)
Based on Theorem 2 in Xu and Sankoff (2008), we conclude nij converges in distribution to the Poisson with parameter 2(n i)(n j) : (28) E(nij ) ¼ n(n 1) This proves the first statement of the theorem. Substituting Equation (28) in Equation (18) gives us (20), completing the proof of the theorem. & The n0ij ¼ nij þ nji (1jin 1) are asymptotically independent, so that we may profit from: Corollary 1. Let S and T be two random genomes containing the same n genes and nij be the number of gene pairs i-adjacent in genome S and j-adjacent in genome T, then f (k) in Theorem 2 satisfies
ak 2 E[f (k)] ! 2 n 8 2 2 Var[f (k)] ! 2 1 þ 2 (29) ak ak ak jpffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi k as n ? ? , where ak ¼ 1 þ 8(k2 1) þ 1 . Proof. We have ! k 12ak (ak 1) i k 1 X X 1 aX 0 0 n þ na k j f (k) ¼ (30) k i¼1 j¼1 ij j¼1 From Theorem 3, we know E[n0ij ] ¼ 1 E[f (k)] ¼ k
4(n i)(n j) n(n 1) .
aX i k 1 X i¼1 j¼1
Then the expectation of f (k) is
4(n i)(n j) þ n(n 1)
k 12aX k (ak 1) j¼1
4(n ak )(n j) n(n 1)
!
4kn2 [2k2 2(a2k 3ak 1)k 12 ak (a2k 1)(ak 2)]n kn(n 1) 2 2 2ak k 2ak (ak ak 1)k þ 12 ak (a2k 1)(a2k ak 23)n þ kn(n 1) 1 2 ak (ak þ 1)(ak þ 2)(3ak þ 1) þ6 kn(n 1)
1 ak 2 1 ¼ 2 þO where ak n 1 and k ¼ a2k : n 2 n Similarly, the variance of f (k) is
(31)
¼
1
k 2aX k (ak 1) aX k 1 4(n i)(n j) 2(n i)2 4(n ak )(n j) þ4 þ n(n 1) n(n 1) n(n 1) i¼1 j¼1 i¼1 j¼1 4(k þ ak 1) 1 ¼ þO k2 n
1 8 2 2 1 ¼ 2 1þ 2 þO where ak n 1 and k ¼ a2k : ak n 2 ak ak
1 Var[f (k)] ¼ 2 k
aX i1 k 1 X
(32) !
(33) &
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
1121
Because it determines a maximum, looking for k* is similar to the upper record problem, i.e., for a series of random variables X1 , X2 , . . ., we consider the new sequence L(m), (m ¼ 1, 2, . . . ), such that L(1) ¼ 1 and for any m 2, L(m) ¼ min{j : Xj > XL(m1)}, where L(m) is the index of the mth upper record (or mth record time), while the corresponding random variable XL(m) is the value of the mth record (or mth record value). Readily derived properties of record times for i.i.d. random variables include:
The probability that the i-th random variable attains a record is 1i . The expected number of records up to the i-th random variable is log i. The average time at the record for n random variables is n2.
The quantity k* in Theorem 2 is a record time over 12 n(n 1) values of f (k), though these are clearly neither identical nor independent random variables. That both the mean and variance of f (k) are decreasing functions of n means that records become increasingly harder to attain. This is illustrated in Figure 3, which compares the proportion of record values at each (i, j) in 50,000 pairs of random genomes of size n ¼ 10, 30, 100, 300, 1000, and 3000 and the accumulated number of record values up to this point, with the values of i.i.d. random variables. Note that the pffifficorresponding ffi horizontal axis is k, which maps to i ¼ 2k as a position in the genome. More important for our purposes is that the average record time is nowhere near half p theffiffiffi number of 2 random variables (n4 in our case). Figure 4 shows that k* depends approximately linearly on n, so that the pffiffiffiffiffi cut-off position in the genome of the maximizing weight system will be O( 4 4n). For genomes of size n ¼ 12,000, the expected value of k* is around 110, so that the cut-off y for generalized adjacency need not be greater than 15.
6. EXTENSIONS TO THE COMPARISON OF THREE GENOMES In this section, we will extend our model to the comparison of three genomes. Let S, T, and U be three genomes with gene set V ¼ f1, . . . , ng linearly disposed in some order. We define two genes g and h are (i, j, k)-adjacent in three genomes S, T and U, if they are i-adjacent in either one of the genomes, j-adjacent in another genome and k-adjacent in the last genome. The gene pair (g, h) is called an (i, j, k) adjacency if they are (i, j, k)-adjacent in three genomes. Let EMH the set of all i adjacencies in a genome M, where 1 i Y. We define a subset of C V to be an (y, f, c) generalized adjacency gene cluster, or (y, f, c) cluster, if it consists of vertices of a connected w w w w / / / / h h h component of the graph, Gh/w STU ¼ (V, (ES \ ET \ EU ) [ (ES \ ET \ EU ) [ (ES \ ET \ EU ) [ (ES \ ET \ w w / / h h h EU ) [ (ES \ ET \ EU ) [ (ES \ ET \ EU )). Figure 5 illustrates how genomes S ¼ 1 2 3 4 5 6 7 8 9, T ¼ 2 1 5 7 8 3 6 4 9 and U ¼ 1 2 6 7 8 4 9 5 3 determine the (2, 3, 4) clusters {1, 2} and {3, 4, 5, 6, 7, 8, 9}. Given three genomes S, T, U with the same genes. Let oijk be the weight on two genes that are i-adjacent in S, j-adjacent in T and k-adjacent in U, such that 1. P 0 oijkP ¼ oikjP ¼ ojik ¼ ojki ¼ okij ¼ okji, i, j, k 2 f1, 2, . . . , n 1g n1 n1 n1 2. i¼1 j¼1 k ¼ 1 xijk ¼ 1 3. oijk ostu if (a) max(i, j, k) < max(s, t, u) (b) max(i, j, k) ¼ max(s, t, u) and mid(i, j, k) < mid(s, t, u) (c) max(i, j, k) ¼ max(s, t, u), mid(i, j, k) ¼ mid(s, t, u) and min(i, j, k) ¼ min(s, t, u) where mid(x, y, z) is the middle number of x, y, z, i.e., the second largest of the three numbers x, y, z. We define the dissimilarity function of three genomes S, T, and U to be " !# nX 1 nX 1 nX 1 d0 (S, T, U) ¼ 6(n 1) nijk xijk 2niii xiii þ (nijj þ njij þ njji )xijj þ (34) i¼1
j¼1
k¼1
where oxyz is the weight of an (x, y, z) adjacency; nxyz is the number of gene pairs (g, h) that are x-adjacent on S, y-adjacent on T and z-adjacent on U. Similar to the results of two-genome comparison, we have the following theorems. We omit the proofs, which are parallel to the preceding theorems, but lengthier. j k pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 pffiffiffiffiffi Theorem 4. Let ak ¼ 13 (81k 1 9 81k2 2k)3 þ 13 (81k 1 9 81k2 2k) 3 þ 23 b 3 6kc. The weight o that minimizes the dissimilarity function (34) is
1122
YANG AND SANKOFF
a
1 0
50
100
150
200
250
300
350
400
450
500
frequency of records
0.1
0.01 1/k n=3000 n=1000
0.001 n=30
n=100 n=300
0.0001 n=10
0.00001
k
b
8
n=3000
7 log k
n=1000
number of records
6
5
n=300
4 n=100
3 n=30 2
1
0 0
200
400
600
800
1000
k
FIG. 3. Comparison of mean optimal k values, over 50,000 pairs of random genomes, with the record behavior of i.i.d. random variables. Proportion of cases where k is optimal (a) and number of records attained (b), for (i, j) adjacencies as a function of genome size n. As n ? ? , for any k0 , all curves approach the record time curves for all k < k0 , but even at n ¼ 3000, there is an eventual drop off, due to the declining mean expectations and variances of the nij.
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
a
1123
90 80 70 60
E(k*)
50 40 30 20 10 0 0
1000
2000
3000
4000
5000
n
b
140
120
100 y = 1.3021x
E(k*)
80
60
40
20
0 0
10
20
30
40
50
60
70
80
90
100
÷n
FIG. 4. Average record time as a function of genome length n (a) and as a function of
xstu ¼
8 1 > < k ,
if 1uts 5 ak ,
> :
or 1ts ¼ ak , 1ut otherwise
0,
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi t2 2k þ 13 s(s 1)2
pffiffiffi n (b).
(35)
where k* is an integer maximizing the function 1 f (k) ¼ k
aX s X t k 1 X s¼1 t¼1 u¼1
1
n0stu
þ
ak t 3 X t¼1
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi2 3t2 6k þ ak (ak 1) X
! n0ak tu
(36)
u¼1
where n0 xyz ¼ nxyz þ nxzy þ nyxz þ nyzx þ nzxy þ nzyx and nxyz is the number of gene pairs x-adjacent in S, yadjacent in T and z-adjacent in U. The schema in Figure 6 indicates the order in which the cells of the three dimensional array of (s, t, u) adjacencies become non-zero as k* increases.
1124
YANG AND SANKOFF
FIG. 5.
Determination of (2 3 4) clusters of genomes S, T, U.
Theorem 5. For three random genomes, S, T and U, with n genes, let nstu be the total number of (s, t, u)-adjacent gene pairs (x, y) that are s-adjacent in genome S, t-adjacent in genome T and u-adjacent in t)(n u) genome U. Then nstu , converges to a Poisson distribution with parameter 4(n ns)(n 2 (n 1)2 In contrast to the two-genome case in Theorem 3, where nij ? 2, for the three-genome case in Theorem 5, nstu ? 0.
7. PERCOLATION Cluster-finding procedures based on parameterized adjacency criteria (Hoberman et al., 2005; Xu and Sankoff, 2008), can have pathological behaviour as the criteria become less restrictive. At some point, called a percolation threshold, instead of large clusters being rare, they suddenly start to predominate and it becomes unusual not to find a large cluster. Beyond this point, it becomes meaningless to test that the numbers or sizes of clusters exceed those predicted by the null hypothesis of random genomes! It was established by Erdo¨s and Re´nyi (1959, 1960, 1961) that, for random graphs, where edges are independently present between pairs of the n vertices with probability p, the percolation threshold is p ¼ 1n. When simulating the size of the largest (y, c) cluster as a function of y and c, we obtain graphs like that in Figure 7.
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
1125
FIG. 6. k is augmented starting at the origin, in the triangular pyramid 0A0 B0 C0 . Values of ostu in other parts determined by symmetry.
1 0.9 0.8 0.7 0.6
40 35
0.5
30
0.4
25 0.3 20 0.2 15
y
0.1 10 0 0
5
5 10
15
20
q
25
30
0 35
40
FIG. 7. Proportion of simulations where size of the largest cluster is larger than the half of genome size n ¼ 100, based on a sample of 50,000 random permutations for each combination of y, w ¼ 1, 2, . . . , 99. For all values of y or c greater than 40 (not shown), the proportion is 1.0.
1126
YANG AND SANKOFF
We note that the percolation of the generalized adjacency graph is delayed considerably compared to unconstrained Erdo¨s-Re´nyi graphs with the same number of edges, as may be seen in Figure 8. To explore what aspect of the generalized adjacency graphs may be responsible for this delay, we also simulated random graphs of limited bandwidth (Chinn et al., 2006). The bandwidth of a graph is the minimum value, over all labelings of the vertices with distinct integers, of the maximum difference across an edge. Thus we generated graphs with bandwidth y, since this constraint is an important property of generalized adjacency. It can be seen in Figure 8 that the limited bandwidth graphs also show delayed percolation, but less than half that of generalized adjacency graphs.
a 1
P(max cluster > n/2)
0.8 Erdös- Rényi 0.6
bandwidth £ q 0.4
(q,q)-adjacency 0.2
0 0
10
20
30
40
q
b 30
Percolation(q )
25 q = 0.73÷n 20
bandwidth £ q
(q,q)-adjacency q = 1.01÷n
15 Erdös- Rényi
10
q = 0.61÷n 5 0 0
5
10
15
20
25
30
÷n
FIG. 8. (a) Simulation with genome length n ¼ 1000, with 2y2 edges in each graph, showing delayed percolation of generalized adjacency graphs with respect to Erdo¨s-Re´nyi graphs. Bandwidth-limited graphs are also delayed but much pffiffiffi pffiffiffi less so. (b) Percolation point as a function of n, again with 2y2 edges per graph. Delay measured by coefficient of n in equation for trend line.
NATURAL PARAMETERS FOR GENERALIZED GENE ADJACENCY
1127
As a control on our simulations, it is known (D’Souza et al., 2009) that Erdo¨s-Re´nyi graphs with rn edges, as n ? ? and r & 12 the probability they have a component of size (4r 2)n tends to 1. Our percolation criterion is that one cluster must have at least n2 vertices. Solving (4r 2)n ¼
n 2
(37)
2 we get r ¼ 0.625. This suggests that the pffiffiffi 2y edges we use in each pofffiffiffi our simulated graphs should be approximately 0.625n, so that h ¼ 0:56 n. This compares to the 0:61 n we found in our simulations, the difference being attributable to the discrepancy between r ¼ 0.625 and values of r closer to 0.5 where the 4(r 2)n size prediction for the largest cluster holds more exactly. The percolation results inpffiffithis ffi section suggest another kind of ‘‘natural’’ value for the generalized adjacency criterion, namely n, since abovepthis ffiffiffi value it is no longer statistically surprising to find a large cluster. Indeed, it would be prudent to use k n for some small fraction l, since for finite n percolation may pffiffiffi occur for y somewhat smaller than n.
8. CONCLUSION We have defined and explored the notions of generalized gene adjacency and (y, c)-clusters. We have made use not only simulations, but of analytical methods. The asymmetry of the criteria in two (or more) genomes allows a flexibility not available in previous models. Of crucial importance is that the natural values of the cut-off we found in Section 5 are less than the percolation points in Section 7. In Section 6, we extend the two-parameter generalized adjacency model to three-genome comparison and found properties similar to two-genome comparison, except that the expected number of (i, j)adjacencies decreases to zero as n increases, rather than approaching a limiting value of 2. The same methods should allow extension to larger numbers of genomes. Future work may help locate k* analytically as a function of n. Another direction is to pin down other structural properties, beside bandwidth constraints, responsible for the delayed percolation of generalized adjacency graphs.
ACKNOWLEDGMENTS Research supported in part by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC). D.S. holds the Canada Research Chair in Mathematical Genomics. We would like to thank Wei Xu, Ximing Xu, Chunfang Zheng, and Qian Zhu for their constant support and help in this work.
DISCLOSURE STATEMENT No competing financial interests exist.
REFERENCES Chinn, P.Z., Chva´talova´, J., Dewdney, A.K., et al. 2006. The bandwidth problem for graphs and matrices—a survey. J. Graph Theory 6, 223–254. D’Souza, R., Achlioptas, D., and Spencer, J. 2009. Explosive percolation in random networks. Science 323, 1453–1455. Erdo¨s, P., and Re´nyi, A. 1959. On random graphs. Publ. Math. 6, 290–297. Erdo¨s, P., and Re´nyi, A. 1960. On the evolution of random graphs. Publ. Math. Instit. Hung. Acad. Sci. 5, 17–61. Erdo¨s, P., and Re´nyi, A. 1961. On the strength of connectedness of a random graphs. Acta Math. Sci. Hung. 12, 261– 267. Hoberman, R., and Durand, D. 2005. The incompatible desiderata of gene cluster properties. Lect. Notes Bioinform. 3678, 73–87.
1128
YANG AND SANKOFF
Hoberman, R., Sankoff, D., and Durand, D. 2005. The statistical analysis of spatially clustered genes under the maximum gap criterion. J. Comput. Biol. 12, 1081–1100. Xu, X., and Sankoff, D. 2008. Tests for gene clusters satisfying the generalized adjacency criterion. Lect. Notes Bioinform. 5167, 152–160. Zhu, Q., Adam, Z., Choi, V., et al. 2009. Generalized gene adjacencies, graph bandwidth, and clusters in yeast evolution. Trans. Comput. Biol. Bioinform. 6, 213–220.
Address correspondence to: Dr. David Sankoff Department of Mathematics and Statistics University of Ottawa 585 King Edward Avenue Ottawa, Ontario, Canada K1N 6N5 E-mail:
[email protected]