Expected Gene Order Distances and Model Selection in Bacteria

Daniel Dalevi∗, Niklas Eriksen†

May 21, 2007

Abstract

The most parsimonious distances calculated in pairwise gene order comparisons cannot accurately reflect the true number of events separating two species unless the number of changes is small. It is better to use expected distances. In this study we recapitulate previous results and derive new expected distances for models that have gained support in other studies, such as symmetric reversals and short reversals. Further, we investigate the patterns of dotplots between species of bacteria for the purpose of model selection in gene order problems. We find several categories of data which can be explained by carefully weighing the contributions of reversals, transpositions, symmetric reversals, single gene transpositions, and single gene reversals.

1 Introduction

Almost 40 years ago Kimura [11] proposed, in an influential paper, that the majority of all mutant substitutions occurring at the molecular level are selectively neutral. A consequence of his neutral theory [12] is that sites in DNA sequences that evolve without constraints will have rates of change identical to the actual mutation rate of the organism. If this rate is constant, mutations will accumulate at the pace of an evolutionary clock, which can be used for establishing the evolutionary distance between organisms. This distance is highly significant for our knowledge of how organisms relate to each other, both at present and in the past. The early models of evolution treat all types of substitutions equally and assume that all sites in a DNA or protein sequence evolve at the same rate, which would be the case if they were all neutral and under the same mutational pressure. However, there turn out to be many selective constraints even at positions that are expected to be neutral, such as synonymous codon sites, resulting in a whole spectrum of different rates. At present there exist large families of nested models for maximum likelihood based phylogeny, and there are statistical hypothesis tests for selecting among those models (e.g. [16, 17]).

A process that would be expected to be neutral, and which has been less subject to model selection, is the rearrangement of the genes within the genome. Of course, there are clusters of co-regulated genes, such as operons, that progress through evolution as a unit, but apart from that the gene order should be more or less uniformly changeable. However, several selective constraints seem to act on the gene order. Experiments on E. coli are reported in [18], where segments were reversed using a system for in vivo selection of genomic rearrangements. Several phenotypic constraints on reversals were found.
One type of constraint is that reversals need to preserve a gene’s distance to the origin of replication (see also [15]). Therefore reversals preserving this distance, so-called symmetric reversals occurring around the origin (ori) and terminus (ter) of replication, ought to be much more frequent than others. This has been observed using dotplots in many pairs of closely related bacteria. Figure 1.1 illustrates dotplots of two bacteria with identical gene order and with a symmetric reversal. The resulting pattern looks like an ’X’, and such plots are often referred to as X-plots or X-files [6, 23, 14]. Another type of gene rearrangement that has been shown to be overrepresented is the single gene reversal (e.g. [20, 5, 13]). These occur much more frequently than would be expected under a uniform model. In [22] it was found that the distribution of reversal lengths could be approximated by a gamma distribution with shape parameter α = 0.60 and β = 1200. Attempts have been made to compute distances with length-weighted reversals (e.g. [2]). A consequence of a single gene reversal, and of other reversals that do not preserve the direction of transcription relative to replication, is that the gene will swap from the leading to the lagging strand (or vice versa). This may increase the nucleotide mutation rates because of a mechanism called GC skew (e.g. [9]).

∗ Department of Computing Science, Chalmers University, SE-412 96 Göteborg, Sweden, [email protected]
† Mathematical Sciences, Göteborg University, SE-412 96 Göteborg, Sweden, [email protected]

[Figure 1.1. Upper panels: two dotplots. Lower panels: circular chromosomes with ori and ter marked and two genes, A and B, shown before and after an inversion.]

Figure 1.1: The distance to ori and the direction of transcription (compared to the direction of replication) are preserved under symmetric reversals. Upper left: dotplot of two bacteria with identical gene orders. Upper right: after a symmetric reversal. The lower two figures show the two strands of a circular bacterial chromosome. Two genes are drawn with the direction of transcription. The direction of replication is shown with the arrows on the circles.

In this article we attempt to motivate the use of different models of evolution in gene order problems by studying the shape and appearance of dotplots obtained in pairwise comparisons of bacteria. We also derive expected distances for a couple of novel models and evaluate their performance by examining different cases of real biological data. It appears, as in the case of mutations at the nucleotide and protein level, that it is hard for a single model to explain the evolution of gene orders in bacteria.

2 Data analysis

All pairs of bacteria used in this study were downloaded from the NCBI ftp site of bacteria (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). Orthologous gene pairs were identified as the best bi-directional hits using BLASTP with an E-value cutoff of 0.01. Duplicated genes were removed entirely from the datasets. All data are available in the supplementary material together with a series of Perl scripts for producing permutations directly from raw data.
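To make the notion of best bi-directional hits concrete, the following Python sketch shows one way the pairing step could be done once the best BLASTP hit of every gene has been tabulated. It is not the Perl code from the supplementary material: the function name, the dictionary layout and the reading of the cutoff as "E-value no larger than 0.01" are assumptions made for illustration, and neither the parsing of BLASTP output nor the removal of duplicated genes is shown.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: for each query gene, its single best hit in the
# other genome as (subject gene, E-value).
BestHits = Dict[str, Tuple[str, float]]


def reciprocal_best_hits(a_to_b: BestHits, b_to_a: BestHits,
                         max_evalue: float = 0.01) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) that are each other's best hit.

    Assumption: a hit qualifies when its E-value is at most max_evalue.
    """
    pairs = []
    for gene_a, (gene_b, evalue_ab) in a_to_b.items():
        back = b_to_a.get(gene_b)
        if back is None:
            continue  # gene_b has no hit back into genome A
        gene_back, evalue_ba = back
        if (gene_back == gene_a
                and evalue_ab <= max_evalue
                and evalue_ba <= max_evalue):
            pairs.append((gene_a, gene_b))
    return sorted(pairs)


if __name__ == "__main__":
    a_to_b = {"a1": ("b1", 1e-30), "a2": ("b2", 1e-5), "a3": ("b1", 1e-3)}
    b_to_a = {"b1": ("a1", 1e-28), "b2": ("a3", 1e-4)}
    print(reciprocal_best_hits(a_to_b, b_to_a))
```

In the toy input only a1 and b1 point at each other, so they form the single orthologous pair; a2 and a3 are dropped because their best hits are not reciprocated.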

3 Diversity of models

There are many ways to measure the distance between two gene orders. Most simply, we may look at differences between the gene orders, for instance how many pairs of adjacent genes in one gene order are not adjacent in the other. This is known as the number of breakpoints and is easily computed. We may also define the distance as the minimal number of operations needed to transform one gene order into the other. We refer to this as a minimal distance. The computability of these distances depends heavily on which operations we allow. A third option comes from combining the two approaches above. Starting with two identical gene orders and applying random operations to one of them, the expected number of breakpoints between them generally increases in a nonlinear fashion with the number of operations we apply. Inverting the function that gives the expected number of breakpoints after t operations, we get an estimate of the number of operations applied by simply computing the number of breakpoints ([21]). We call this estimate the expected distance. For closely related gene orders, both minimal and expected distances give good estimates of the true number of operations, but for distant gene orders, minimal distances tend to underestimate the true number of operations. This contrasts with expected distances, which should give relevant information over longer time spans ([8]). Below we state old and new results concerning minimal and expected distances between signed genomes in various models. These models are usually too simple to explain the full evolutionary scenario, but we show in Section 4 that they, either on their own or in combination, can provide information on true data.
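The number of breakpoints can be illustrated with a short Python sketch. It assumes that a genome is given as a signed circular permutation of 1, ..., n and is compared with the identity order; the function names are ours and chosen for illustration only.

```python
from typing import List


def successor(gene: int, n: int) -> int:
    """Gene that follows `gene` in the circular identity order 1..n.

    A positive gene i is followed by i+1 (n is followed by 1). A negative
    gene -i is read on the opposite strand, so it is followed by -(i-1)
    (and -1 is followed by -n).
    """
    if gene > 0:
        return gene + 1 if gene < n else 1
    return gene + 1 if gene < -1 else -n


def breakpoints(genome: List[int]) -> int:
    """Count adjacent pairs in a signed circular genome that are not
    adjacent, in the same relative orientation, in the identity."""
    n = len(genome)
    count = 0
    for i in range(n):
        if genome[(i + 1) % n] != successor(genome[i], n):
            count += 1
    return count


if __name__ == "__main__":
    print(breakpoints([1, 2, 3, 4, 5]))    # identity
    print(breakpoints([1, -3, -2, 4, 5]))  # one reversal of the segment 2 3
    print(breakpoints([1, -2, 3, 4, 5]))   # one single gene reversal
```

A single reversal cuts the genome in two places, so it can create at most two breakpoints, which is what the second and third examples show.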

3.1 Uniform reversal distribution

Among the classical results in this area of research we count the (minimal) reversal distance of [10]. A reversal is an operation that takes out a segment of genes and reinserts it backwards at the same position. For most genomes, this distance is given by the number of genes minus the number of cycles in the associated breakpoint graph. This computation can be made in linear time, and an up-to-date summary of all relevant aspects is [3]. Turning to expected distances, this area was pioneered by Wang and Warnow and their contributions are summarised in [24]. Building on their results, [7] presented an approximation of the expected evolutionary reversal distance given the measured number of breakpoints,

$$t(b) = \frac{\log\left(1 - \frac{b}{n\left(1 - \frac{1}{2n-2}\right)}\right)}{\log\left(1 - \frac{2}{n}\right)},$$

where n is the number of genes and b the number of breakpoints. In addition, there is a (more complicated) formula for the expected reversal distance after t reversals, whose inverse (which has to be computed numerically) gives the expected number of reversals given the reversal distance [8]. These two methods give comparable results within the model, but may differ if the data do not adhere perfectly to this model.
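The approximation t(b) from [7] is straightforward to evaluate. The Python sketch below is a direct transcription of the formula above; the function name and the error handling for values of b at or above the asymptote n(1 − 1/(2n − 2)) are our own choices.

```python
import math


def expected_reversal_distance(b: float, n: int) -> float:
    """Approximate expected number of uniform reversals given b breakpoints
    in a genome of n genes: log(1 - b/limit) / log(1 - 2/n)."""
    limit = n * (1 - 1 / (2 * n - 2))  # asymptotic number of breakpoints
    if not 0 <= b < limit:
        raise ValueError("b must satisfy 0 <= b < n(1 - 1/(2n - 2))")
    return math.log(1 - b / limit) / math.log(1 - 2 / n)


if __name__ == "__main__":
    n = 1000
    for b in (10, 100, 500, 900):
        print(b, round(expected_reversal_distance(b, n), 1))
```

For few breakpoints the estimate is close to b/2, as each reversal then creates about two new breakpoints; closer to the asymptote the estimate grows much faster than b/2.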

3.2 Single gene reversals

In this model, we only allow reversals of length one, that is, those that change the sign of a single gene. Most gene orders cannot be sorted using this restricted set of reversals, but for those created using this model, the minimal number of single gene reversals needed to sort them equals the number of negative elements. In computing the expected number of breakpoints after t reversals, we start with the identity order 1 2 3 . . . n. By symmetry of the circular gene order, we need only keep track of gene 2 and see if it ends up next to gene 1. In general this is described by a Markov chain with 2n − 2 states, one for each available position and orientation of gene 2. Restricting ourselves to single gene reversals, we have only four states, corresponding to all combinations of positive or negative orientations of genes 1 and 2, their positions being fixed. It is easy to verify that the transition matrix becomes

$$M = \begin{pmatrix} n-2 & 1 & 1 & 0 \\ 1 & n-2 & 0 & 1 \\ 1 & 0 & n-2 & 1 \\ 0 & 1 & 1 & n-2 \end{pmatrix}.$$

What we need to compute is [7]

$$b(t) = n\left(1 - \frac{\sum_{j=1}^{2n-2} v_j^2 \lambda_j^t}{n^t}\right),$$

where $\lambda_j$ are the eigenvalues of $M$, $v_j$ the first entries of the corresponding normed eigenvectors, and the $n$ in the denominator is the common row sum of the matrix, that is, the number of available operations. In this case, the eigenvalues are $n$, $n-2$ (twice) and $n-4$, and computing the eigenvectors as well, we get

$$b(t) = \frac{n}{4}\left(3 - 2\left(1 - \frac{2}{n}\right)^t - \left(1 - \frac{4}{n}\right)^t\right).$$

We have not found an analytical inverse of this function.
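As a sanity check of the closed formula for single gene reversals, the following Python sketch computes b(t) in two ways: from powers of the 4 × 4 transition matrix M (the top-left entry of M^t divided by n^t is the probability that genes 1 and 2 are both still positively oriented) and from the closed formula. Since no analytical inverse is available, it also inverts b(t) by bisection. All function names are ours, and bisection is only one of several possible numerical methods.

```python
from typing import List

Matrix = List[List[int]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product with exact integer arithmetic."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size))
             for j in range(size)] for i in range(size)]


def breakpoints_from_matrix(n: int, t: int) -> float:
    """b(t) = n * (1 - (M^t)[0][0] / n^t) for the four-state chain."""
    m = [[n - 2, 1, 1, 0],
         [1, n - 2, 0, 1],
         [1, 0, n - 2, 1],
         [0, 1, 1, n - 2]]
    power = [[int(i == j) for j in range(4)] for i in range(4)]
    for _ in range(t):
        power = mat_mul(power, m)
    return n * (1 - power[0][0] / n ** t)


def breakpoints_closed_form(n: int, t: float) -> float:
    """b(t) = n/4 * (3 - 2(1 - 2/n)^t - (1 - 4/n)^t)."""
    return n / 4 * (3 - 2 * (1 - 2 / n) ** t - (1 - 4 / n) ** t)


def expected_single_gene_reversals(b: float, n: int, tol: float = 1e-9) -> float:
    """Invert the closed form numerically by bisection (b(t) is increasing)."""
    if not 0 <= b < 3 * n / 4:
        raise ValueError("b must lie in [0, 3n/4)")
    lo, hi = 0.0, 1.0
    while breakpoints_closed_form(n, hi) < b:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if breakpoints_closed_form(n, mid) < b:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    n = 40
    for t in (1, 5, 50):
        print(t, breakpoints_from_matrix(n, t), breakpoints_closed_form(n, t))
    print(expected_single_gene_reversals(breakpoints_closed_form(n, 17), n))
```

The two computations of b(t) should agree up to rounding, and inverting b(17) should return a value very close to 17.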

3.3 Symmetric reversals: 1 axis

In this model, we assume an axis of replication that divides the genome into two equally long halves, and allow only reversals that are symmetric about this axis, that is, half of the reversed genes lie on each side of the axis. Again, in this model we cannot sort every permutation, but those that we can sort are easy to handle. The number of reversals needed is just half the number of breakpoints. This simple relationship between reversal distance and breakpoints suggests that the expected distance is equally simple to compute. Let the total number of possible symmetric reversals be m, equalling (n − 1)/2 for odd n and either n/2 or n/2 − 1 for even n, depending on whether the axis of replication goes through or between genes. Then only one reversal can divide or unite a pair of originally neighbouring genes, giving a Markov chain with only two states and transition matrix

$$M = \begin{pmatrix} m-1 & 1 \\ 1 & m-1 \end{pmatrix}.$$

With eigenvalues $m$ and $m-2$, we plug $m = n/2$ into our formula (with $m$ as the common row sum) to get

$$b(t) = \frac{n}{2}\left(1 - \left(1 - \frac{4}{n}\right)^t\right).$$

Inverting gives the expected distance

$$t(b) = \frac{\log\left(1 - \frac{2b}{n}\right)}{\log\left(1 - \frac{4}{n}\right)}.$$
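For the one-axis model both directions are available in closed form, so a round trip is easy to check. The Python sketch below transcribes the two formulas above with m = n/2; the function names are assumptions made for illustration.

```python
import math


def breakpoints_one_axis(t: float, n: int) -> float:
    """Expected breakpoints after t one-axis symmetric reversals:
    b(t) = n/2 * (1 - (1 - 4/n)^t)."""
    return n / 2 * (1 - (1 - 4 / n) ** t)


def expected_distance_one_axis(b: float, n: int) -> float:
    """Inverse of b(t): t(b) = log(1 - 2b/n) / log(1 - 4/n)."""
    if not 0 <= b < n / 2:
        raise ValueError("b must lie in [0, n/2)")
    return math.log(1 - 2 * b / n) / math.log(1 - 4 / n)


if __name__ == "__main__":
    n = 40
    for t in (1, 3, 10):
        b = breakpoints_one_axis(t, n)
        print(t, b, expected_distance_one_axis(b, n))
```

The first reversal creates exactly two breakpoints, in agreement with the minimal distance being half the number of breakpoints, and the curve then levels off towards n/2.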

3.4 Symmetric reversals: 2 axes

Modelling reversals symmetric about one sharp axis seems a bit idealistic. In fact, when [1] investigated the asymmetric coefficient of applied reversals, they found that while for some genomes these reversals had a very low asymmetric coefficient (thus being very symmetric), it was usually strictly greater than zero. A reversal with a positive asymmetric coefficient can be modelled using an axis that differs from the axis of replication. A situation with small asymmetric coefficients would thus be modelled by allowing several axes, which are close to each other.

[Figure 3.2, three panels. Left: Breakpoints against Reversals, with curves All, Short, 1 axis, 2 axes, 3 axes, 1+1 axes, 1+2 axes. Middle: Breakpoints against Reversals, with curves for p = 0, 0.1, 0.3, 0.5, 0.7, 0.9, 1. Right: Breakpoints against Operations, with curves Reversals, Single gene trps, Transpositions.]

Figure 3.2: Left figure: Expected number of breakpoints after t reversals in different models, using a genome of 40 genes. One should note that the behaviour of these functions does not depend on the number of genes: except for small numbers, the graphs are very similar if the abscissa is scaled proportionally. The models include all reversals, single gene reversals, reversals symmetric about one axis, two and three axes through adjacent genes, and finally one axis between two genes plus one or two axes through one or both of these adjacent genes. Note how the asymptotic number of breakpoints varies between the models. Middle figure: Expected number of breakpoints after t reversals in mixed models, using a genome of 40 genes. The black line is the single gene reversals model, and below and above it this model has been combined with the proportion 1 − p of one-axis and two-axes symmetric reversals, respectively. While changing p moves b(t) directly between the single gene reversal curve and the one-axis symmetric reversal curve, the combination of single gene reversals and two-axes symmetric reversals has more breakpoints than the respective pure models for most p. Right figure: Expected number of breakpoints after t operations in different models, using a genome of 40 genes. We have used the uniform reversal model, the single gene transposition model and the uniform transposition model. We find that the single gene transposition model resembles the uniform transposition model much more than the uniform reversal model.

Looking at the eigenvalues and eigenvectors of the corresponding transition matrices, no clear patterns emerge to help us derive a closed formula, but we can still plot b(t). For comparison, we have done this for different sets of axes (Figure 3.2, left): two axes through neighbouring genes, three axes through neighbouring genes, one axis between genes and another axis through one of these genes, and finally one axis between genes and two axes through both these genes. In addition, we have plotted b(t) for a single axis, for single gene reversals and for uniform reversals, using the formulae above. It is apparent from this figure that the one-axis and the single gene reversal models differ severely from the plain reversal model. The number of breakpoints grows more slowly and does not extend above n/2 and 3n/4, respectively, as is apparent from their formulae. This should be compared to the asymptote for the all reversals model, which is n(1 − 1/(2n − 2)). On the other hand, using two axes, while the rate of growth is still slower than in the all reversals model, we still approach the same limit. Also, with three axes the number of breakpoints grows almost as fast as for all reversals. There seems to be nothing gained by introducing more axes; instead we can use the plain reversal model.

3.5 Reversal combinations

Considering the clear division between the breakpoint growth in different models shown in the left of Figure 3.2, we should also ask what happens when we mix two models. For instance, if we apply the proportion p of single gene reversals and 1 − p of symmetric reversals (one axis), will the combined model behave as single gene or symmetric reversals? If we combine these two models, we get a Markov chain with eight states and eigenvalues n, n − 2p, n − 4 + 2p and n − 4, all with coefficients 1/4. Thus, the eigenvalues progress linearly from those in the symmetric model to those in the single gene reversals model as p increases, and the formula becomes

$$b(t) = \frac{n}{4}\left(3 - \left(1 - \frac{2p}{n}\right)^t - \left(1 - \frac{4-2p}{n}\right)^t - \left(1 - \frac{4}{n}\right)^t\right).$$

The effect of changing the eigenvalues is most important for the larger ones, so for p = 1/2, this is close to the formula for single gene reversals. Interestingly, if we pose the same question for single gene reversals and symmetric reversals about two axes, the answer is ’neither’. Combining these two models gives in general faster growth than either taken separately. In Figure 3.2 (middle), we have plotted the expected number of breakpoints for different values of p, namely 0, 0.1, 0.3, 0.5, 0.7, 0.9 and 1.
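The mixed formula can be checked against its two limiting cases with a few lines of Python. The sketch below evaluates b(t) for a given proportion p of single gene reversals; the function name is ours, and the values n = 40 and t = 25 are arbitrary illustration values.

```python
def breakpoints_mixed(t: float, n: int, p: float) -> float:
    """Expected breakpoints after t operations, a proportion p of which are
    single gene reversals and 1 - p one-axis symmetric reversals:
    b(t) = n/4 * (3 - (1 - 2p/n)^t - (1 - (4 - 2p)/n)^t - (1 - 4/n)^t)."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    return n / 4 * (3
                    - (1 - 2 * p / n) ** t
                    - (1 - (4 - 2 * p) / n) ** t
                    - (1 - 4 / n) ** t)


if __name__ == "__main__":
    n, t = 40, 25
    for p in (0, 0.1, 0.3, 0.5, 0.7, 0.9, 1):
        print(p, round(breakpoints_mixed(t, n, p), 3))
```

At p = 0 the expression reduces to the one-axis formula of Section 3.3 and at p = 1 to the single gene reversal formula of Section 3.2, with intermediate values of p lying between the two.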

3.6 Single gene transpositions

Along with reversals, plenty of attention has been paid to transpositions, operations that take out a segment and reinsert it at another position, and to reversed transpositions or transversals, in which the moved segment is also reversed. For neither of these, nor their combination, is a polynomial time algorithm for computing the minimal distance known. Single gene transpositions, which move a single gene to any other position in the genome and possibly change its orientation (that is, including reversed transpositions of one gene), are sufficient to sort a genome. In fact, it is also quite easy to compute the distance to the identity, at least if we view single gene reversals as a special case of single gene transpositions. What we need is some classical combinatorics.

Definition 1. An increasing subsequence in a permutation π is a sequence $i_1 < i_2 < \ldots < i_k$ such that $\pi_{i_1} < \pi_{i_2} < \ldots < \pi_{i_k}$.

Theorem 2. The minimal number of single gene transpositions needed to transform a signed permutation π to the identity is the number of genes minus the size of a longest increasing subsequence in π. For circular genomes, the starting point of the longest increasing subsequence is arbitrary.

Proof. It is easy to see that if we take an element that is not part of an increasing subsequence and insert it at an appropriate position, we can make this increasing subsequence one element longer. On the other hand, inserting this element somewhere else will not make the subsequence longer, and moving an element that belongs to the subsequence will only make it shorter. Thus, each single gene transposition can prolong the subsequence by at most one. Conversely, since each element that is not part of a longest increasing subsequence can be inserted at its proper position relative to the subsequence, we find that we do not need more transpositions than we have elements outside this subsequence. Also, for circular genomes, any increasing subsequence will do, regardless of its starting point. □

Importantly, the longest increasing subsequence can be computed in polynomial time. The Robinson-Schensted-Knuth algorithm (see for instance [19]) transforms any permutation into a pair of Young tableaux, whose top row gives the length of the longest increasing subsequences. Also, one such subsequence can be reconstructed from the algorithm. Since this algorithm moves through the permutation once, making at most O(n) operations for each element, we have at most quadratic time. For circular genomes, to find an optimal starting point it is enough to iteratively try the position of the minimal number not found in any longest increasing subsequence starting elsewhere. The time needed is $O(n^3)$.

Turning to expected distances, it seems hard to compute the eigenvalues of the transition matrices of single gene transpositions as well as of transpositions and reversed transpositions. However, numerical computations of b(t) for single gene transpositions show that it is very close to the uniform transpositions case (Figure 3.2, right). Thus, genomes scrambled with single gene transpositions have significantly more breakpoints than those scrambled with a comparably sized set of reversals, whether symmetric, single gene or neither.
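Theorem 2 translates into a short program. The Python sketch below does not implement the Robinson-Schensted-Knuth algorithm; it swaps in the simpler patience sorting technique with binary search, which yields the same length of a longest increasing subsequence. Two assumptions are made for illustration: negatively oriented genes are excluded from the subsequence (each of them needs one operation to be restored), and circular genomes are handled by simply trying every starting point while keeping the reading direction fixed. The function names are ours.

```python
from bisect import bisect_left
from typing import List


def lis_length(values: List[int]) -> int:
    """Length of a longest strictly increasing subsequence (patience sorting)."""
    tails: List[int] = []  # tails[k] = smallest tail of an increasing run of length k+1
    for v in values:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(tails)


def single_gene_transposition_distance(genome: List[int],
                                       circular: bool = False) -> int:
    """Number of genes minus the longest increasing subsequence of the
    positively oriented genes, maximised over starting points if circular."""
    n = len(genome)
    starts = range(n) if circular else range(1)
    best = 0
    for s in starts:
        rotated = genome[s:] + genome[:s]
        positives = [g for g in rotated if g > 0]
        best = max(best, lis_length(positives))
    return n - best


if __name__ == "__main__":
    print(single_gene_transposition_distance([1, 3, 2, 4, 5]))
    print(single_gene_transposition_distance([1, -2, 3, 4, 5]))
    print(single_gene_transposition_distance([3, 4, 5, 1, 2], circular=True))
```

Trying all n starting points costs O(n² log n) in total, which is within the cubic bound stated above; the smarter choice of starting points described in the text is not attempted here.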

4 Classification of models

The data analysis described above has given us a number of dotplots to work with. One soon discovers that these plots can be partitioned into only a few categories based on appearance. We have done this and tried to explain how these appearances have come about. To support our claims, we have run simulations in specific models to check whether these models really produce similar dotplots. In Figure 4.3, we have gathered dotplots from true data comparisons and compared them to simulations in various models. The number of operations within each model has been chosen to accentuate the similarity between the true dotplots and the simulated ones. The motivation for this is that we want to show that these patterns have natural explanations.

4.1 The whirl

This is the common reversal model. Its name stems from the dotplot obtained by performing a few reversals, leaving quite a few segments in an unordered fashion, some ascending and some descending (see Figure 4.3 (A)). As described, this model is very well explored, both in terms of minimal and expected distances. One problem, though, is that it does not take many reversals to make the dotplot look quite fuzzy, in which case we could not recognise this scenario by inspection. This could possibly be remedied by counting the number of hurdles in the breakpoint graph, which are very rare in the reversal model [4], but likely to turn up if transpositions have been used.

4.2 The X-model

The X-model has emerged quite strongly during the last couple of years, as described in the Introduction. The dotplot is shaped like an ’X’ if the origin or terminus of replication is placed at the origin of the graph. As seen in Figure 4.3 (B), such a graph can easily be obtained by performing reversals that are symmetric about these two points. A good thing about this model is that the ’X’ is visible no matter how many symmetric reversals have been applied. This, of course, is one of the reasons that it occurs so often in real data comparisons. Though the ’X’ may be quite sharp, the reversals are rarely perfectly symmetric. Thus, the expected distance can usually be computed in the uniform reversal model, since even a few axes render the graph of b(t) very similar to that of the uniform case.
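The persistence of the ’X’ is easy to reproduce in a simulation. The Python sketch below is not the code used for Figure 4.3; it is a simplified illustration that assumes a genome with an even number of genes stored as a list, a single axis placed exactly between the two middle genes, and perfectly symmetric reversals of random width. All names are ours.

```python
import random
from typing import List


def symmetric_reversal(genome: List[int], half_width: int) -> List[int]:
    """Reverse (and flip the signs of) the 2 * half_width genes centred on
    the axis between positions n/2 - 1 and n/2."""
    mid = len(genome) // 2
    lo, hi = mid - half_width, mid + half_width
    segment = [-g for g in reversed(genome[lo:hi])]
    return genome[:lo] + segment + genome[hi:]


def simulate_x_model(n: int, num_reversals: int, seed: int = 0) -> List[int]:
    """Apply random symmetric reversals to the identity genome 1..n (n even)."""
    rng = random.Random(seed)
    genome = list(range(1, n + 1))
    for _ in range(num_reversals):
        genome = symmetric_reversal(genome, rng.randint(1, n // 2))
    return genome


def lies_on_x(genome: List[int]) -> bool:
    """True if every dot (position, gene) is on the diagonal with positive
    sign or on the antidiagonal with negative sign."""
    n = len(genome)
    return all(g == i + 1 or g == -(n - i) for i, g in enumerate(genome))


if __name__ == "__main__":
    genome = simulate_x_model(800, 7)
    print(lies_on_x(genome))
    print(sum(1 for g in genome if g < 0), "genes on the antidiagonal")
```

Each such reversal sends the gene at position i to the mirrored position n − 1 − i and flips its sign, so however many reversals are applied, every dot stays on one of the two diagonals. The parameters 800 and 7 mirror the simulation described for Figure 4.3 (B).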

4.3 The fat X-model

A perfect X-model has a sharp ’X’ in the dotplot. But many dotplots show a wider ’X’. This could to some extent be explained by the difference in size between genes, our removal of genes that cannot be paired perfectly with the other genome, and the fact that symmetric reversals are probably not always perfectly symmetric. However, the width found in Figure 4.3 (C) is hard to explain from these sources. We have found that the same phenomenon occurs if the symmetric reversals are accompanied by short transpositions, that is, transpositions that take out a fairly short segment and reinsert it at any place. These segments should not be longer than some 5% of the genome length. As time goes on, the ’X’ will look more like a cloud than an ’X’. For distant genomes, it will be hard to discover that this model has been used to generate them. On the other hand, for expected distance computations, we should be able to use the uniform transpositions model, since even the single gene transpositions model showed comparable behaviour, and allowing short transpositions as well should bring us much closer to the uniform transpositions case.

4.4 The zipper

The zipper appears both with and without symmetric reversals. Figure 4.3 (D) gives an example with symmetric reversals. It is fairly obvious that it is created using short reversals, again up to some 5% of the genome length, equally distributed along the genome.

[Figure 4.3, panels (A) to (E): dotplots of real data (left column) and of simulations on a genome of 800 genes (right column).]

Figure 4.3: Real data for different pairs of species and simulations on a genome of 800 genes. The graphs show which gene occupies each genome position (on the abscissa), red indicating negative orientation. The models depicted are, to the right: (A) the whirl, (B) the X-model, (C) the fat X-model, (D) the zipper and (E) the cloud. The real data are, to the left, comparisons between: (A) Bordetella bronchiseptica/Bordetella parapertussis, (B) Chlamydia trachomatis/Chlamydophila pneumoniae AR39, (C) Mycobacterium bovis/Mycobacterium leprae, (D) Escherichia coli CFT073/Shigella dysenteriae and (E) Bacillus halodurans/Bacillus subtilis. For the simulations, we have used (A) five random reversals of lengths between 30 and 100 in the middle 40 percent of the genome, (B) seven symmetric reversals, (C) 20 operations, with equal probability between symmetric reversals of any length and transpositions of blocks of length 10 to 40 genes, (D) five single gene reversals and one symmetric reversal, and (E) 200 operations with 98% chance for single gene transpositions and otherwise symmetric reversals.

Since the reversals are quite short, the pattern is visible for a longer time than the whirl. After a while, it appears as small whirls along the ’X’, but it will take quite some time before the whole dotplot is covered in a cloud. Again, we should not need to elaborate on this for computing the expected distance. The short reversals are sufficiently random for showing a behaviour similar to the uniform reversals model.

4.5 The cloud

While the structure of the dotplot in almost all models disappears in a cloud as time progresses, we may also have a cloud superposed on a structure, for instance an ’X’ as in Figure 4.3 (E). While the ’X’ is a bit tortuous, it is clearly visible. What, then, is the origin of the cloud? Our educated guess is that it is created using a few symmetric reversals and a lot of single gene transpositions, primarily sending single genes to other locations in the genome. This creates a cloudy background without really disturbing the primary pattern. In our example, it is quite clear that some parts of the genome lose and receive genes more frequently than other parts. While we cannot give a full explanation of why this happens, or of which parts are more prone to these mutations, we suspect that this is a key to explaining the nature of gene order mutations. It is important to realise that the cloud can be superposed on virtually any other pattern. In fact, single gene transpositions are present in most other examples presented in Figure 4.3, but to a much lesser extent. It seems that these are just as frequent as single gene reversals, which have received much more attention previously. As long as only a few of the operations used are not single gene transpositions, the minimal distance can be computed by separating the two groups of operations, performing the other operations and then computing the single gene transposition distance. For expected distances, we recommend using the uniform transpositions model, at least if the proportion of single gene transpositions is fairly large.

5 Are these models valid on real data?

A common assumption among gene order mathematicians and computational biologists is that reversals are much more frequent than transpositions. This is one of two reasons why the literature on sorting by reversals is much larger than the one on sorting by transpositions, the other being that sorting by reversals seems a more tractable problem. Even in this study, we have mainly focused on pure reversal models. A simple way to confirm that a model is relevant to the data is to compare the minimal distance to the expected distance. If the expected distance is shorter than the minimal distance, this is a strong indication that operations outside the model have been used. In particular, transpositions increase the minimal reversal distance twice as fast as reversals, whereas the ratio for the expected distance is closer to 1.5. Thus, the expected reversal distance of a genome generated using transpositions is most certainly lower than the minimal reversal distance. In the upcoming subsections we show results from a few cases of pairwise comparisons of bacteria. The dotplots are found in the supplementary material.

5.1 Case 1 (single gene operations): Campylobacter jejuni vs Campylobacter jejuni RM1221

In sorting Campylobacter jejuni relative to Campylobacter jejuni RM1221, we find that they differ only by six single gene transpositions of non-adjacent genes (namely 218, 358, 857, 1124, 1141, 1209 in our annotation) and two single gene reversals (genes 975 and 1307). To compute the distance, we need only consider the single gene transposition model in Section 3.6: the longest increasing subsequence has length 1456 out of a maximal 1464, and their difference gives the number of operations, which is 8. This is an example of an early stage of the cloud. While such early stages can be sorted manually, the same procedure can be applied as long as the genomes develop through single gene operations. Note that in this case the data adhere perfectly to the single gene operations model.

5.2 Case 2 (symmetric reversals): Bartonella henselae Houston vs Bartonella quintana Toulouse

Exploring the dotplot of Bartonella henselae Houston vs Bartonella quintana Toulouse in Figure 5.4 reveals symmetric and single gene reversals. However, a comparison between the minimal reversal distance (21) and the expected reversal distance (14) indicates that some transpositions are present. To illustrate a sorting scenario, we perform the operations one by one and show the corresponding dotplots in Figure 5.4. In addition to three single gene transpositions and one double gene transposition not visible in the figure, we use one single gene and one longer transposition as well as two short reversals (one single gene and one double gene) and three symmetric reversals.

[Figure 5.4: sequence of dotplots, both axes running from 0 to 1200.]

Figure 5.4: A sorting scenario for Bartonella henselae Houston and Bartonella quintana Toulouse using symmetric and short reversals, and transpositions. The vertical arrows indicate reversals and the horizontal arrows transpositions.

6 Conclusions and future work

It appears to be difficult to quantify evolution using a single hypothesis. Therefore it is important to incorporate proper assumptions in the models. We propose that these assumptions can partly be derived by investigating the shape of dotplots obtained from pairwise comparisons of bacteria. As noted, it may be hard to distinguish between different models from the dotplot alone. We are in the process of elaborating more sensitive tools to establish which model is the correct one, as well as indicating whether this choice of model is certain or uncertain.

Acknowledgments

Daniel Dalevi was supported by the Swedish National Research School in Genomics and Bioinformatics, funded by the Swedish Foundation for Strategic Research (SSF).

References

[1] Y. Ajana, J. Lefebvre, E. Tillier, and N. El-Mabrouk, Exploring the Set of All Minimal Sequences of Reversals - An application to Test the Replication-Directed Reversal Hypothesis, LNCS 2452 (2002), 300–315.
[2] M. Bender, D. Ge, S. He, H. Hu, R. Pinter, S. Skiena, and F. Swidan, Improved bounds on sorting with length-weighted reversals, SODA ’04: Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms (Philadelphia, PA, USA), Society for Industrial and Applied Mathematics, 2004, pp. 919–928.
[3] A. Bergeron, J. Mixtacki, and J. Stoye, The inversions distance problem, Mathematics of Evolution and Phylogeny (O. Gascuel, ed.), Oxford University Press, New York, 2005, pp. 262–290.
[4] A. Caprara, On the tightness of the alternating-cycle lower bound for sorting by reversals, J Comb Opt 3 (1999), 149–182.
[5] D. Dalevi, N. Eriksen, K. Eriksson, and S. Andersson, Measuring genome divergence in bacteria: a case study using chlamydian data, J Mol Evol 55 (2002), 24–36.
[6] J. Eisen, J. Heidelberg, O. White, and S. Salzberg, Evidence for symmetric chromosomal inversions around the replication origin in bacteria, Genome Biol 1 (2000), RESEARCH0011.
[7] N. Eriksen, Approximating the expected number of inversions given the number of breakpoints, LNCS 2452 (2002), 316–330.
[8] N. Eriksen and A. Hultman, Estimating the expected reversal distance after a fixed number of reversals, Adv Appl Math 32 (2004), 439–453.
[9] A. Frank and J. Lobry, Asymmetric substitution patterns: a review of possible underlying or selective mechanisms, Gene 238 (1999), 65–77.
[10] S. Hannenhalli and P. Pevzner, Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations with reversals), J ACM 46 (1999), 1–27.
[11] M. Kimura, Evolutionary rate at the molecular level, Nature 217 (1968), 624–6.
[12] M. Kimura, The neutral theory of molecular evolution, Cambridge University Press, Cambridge, 1983.
[13] J. Lefebvre, N. El-Mabrouk, E. Tillier, and D. Sankoff, Detection and validation of single gene inversions, Bioinformatics 19 (2003), i190–196.
[14] P. Mackiewicz, D. Mackiewicz, M. Kowalczuk, and S. Cebrat, Flip-flop around the origin and terminus of replication in prokaryotic genomes, Genome Biol 2 (2001), INTERACTIONS1004.
[15] H. Niki, Y. Yamaichi, and S. Hiraga, Dynamic organization of chromosomal DNA in Escherichia coli, Genes Dev 14 (2000), 212–23.
[16] D. Posada and T. Buckley, Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests, Syst Biol 53 (2004), 793–808.
[17] D. Posada and K. Crandall, MODELTEST: testing the model of DNA substitution, Bioinformatics 14 (1998), 817–818.
[18] J. Rebollo, V. Francois, and J. Louarn, Detection and possible role of two large nondivisible zones on the Escherichia coli chromosome, Proc Natl Acad Sci U S A 85 (1988), 9391–5.
[19] B. Sagan, The symmetric group: representations, combinatorial algorithms, and symmetric functions, Springer Verlag, New York, 2001.
[20] D. Sankoff, Short inversions and conserved gene clusters, Bioinformatics 18 (2002), 1305–1308.
[21] D. Sankoff and M. Blanchette, Probability models for genome rearrangements and linear invariants for phylogenetic inference, RECOMB ’99: Proceedings of the third annual international conference on computational molecular biology, ACM Press, 1999, pp. 302–309.
[22] D. Sankoff, J. Lefebvre, E. Tillier, A. Maler, and N. El-Mabrouk, The distribution of inversion lengths in bacteria, RECOMB (2004).
[23] E. Tillier and R. Collins, Genome rearrangement by replication-directed translocation, Nature Genetics 26 (2000), 195–197.
[24] L.-S. Wang and T. Warnow, Distance-based genome rearrangement phylogeny, Mathematics of Evolution and Phylogeny (O. Gascuel, ed.), Oxford University Press, New York, 2005, pp. 353–383.
