Genomic Signatures from DNA Word Graphs

Lenwood S. Heath and Amrita Pati

Department of Computer Science, Virginia Tech, Blacksburg, VA 24061-0106
{heath,apati}@vt.edu

Abstract. Genomes have both deterministic and random aspects, with the underlying DNA sequences exhibiting features at numerous scales, from codons and cis-elements through genes and on to regions of conserved or divergent gene order. The DNA Words program aims to identify mathematical structures that characterize genomes at multiple scales. The focus of this work is the fine structure of genomic sequences, the manner in which short nucleotide sequences fit together to comprise the genome as an abstract sequence, within a graph-theoretic setting. A DNA word graph is a generalization of a de Bruijn graph that records the occurrence counts of nodes and edges in a genomic sequence. A DNA word graph can be derived from a genomic sequence generated by a finite Markov chain or from a subsequence of a sequenced genome. Both theoretically and empirically, DNA word graphs give rise to genomic signatures. Several genomic signatures are derived from the structure of a DNA word graph, including an information-rich and visually appealing genomic bar code. Application of genomic signatures to several genomes demonstrates their practical value in identifying and distinguishing genomic sequences.

1 Introduction

The genome G of an organism is a set of long nucleotide sequences modeled, within a formal language framework, as strings over $\Sigma_{DNA} = \{A, C, G, T\}$, the DNA alphabet. While G itself is a unique mathematical structure for the organism, a genome is typically quite large (e.g., billions of bases) and differs slightly from one individual of a species to another. Fix a genomic sequence H that is a substring of some string in G. Intuitively, a genomic signature for an organism is a mathematical structure θ(H) derived from H, which, ideally, can be efficiently computed, is significantly smaller to represent than H, and, if H is sufficiently representative of G, can uniquely identify the original organism. The intent is that the signatures of other large substrings from G be highly similar to θ(H) and distinguishable from signatures of other organisms. A genomic signature is judged along two, typically antagonistic, dimensions: (1) the amount of compression achieved by θ(H), and (2) its effectiveness in identifying the genome. Karlin and Burge [1] were among the first to use the term genomic signature. They define the dinucleotide odds ratio or relative abundance, a set of 16 functions defined for dinucleotides XY by

\[ \rho_{XY}(H) = \frac{f_{XY}(H)}{f_X(H)\, f_Y(H)}, \]

I. Măndoiu and A. Zelikovsky (Eds.): ISBRA 2007, LNBI 4463, pp. 317–328, 2007. © Springer-Verlag Berlin Heidelberg 2007


where $f_x(H)$ is the frequency of string $x$ as a substring in $H$. They observe that ρ values are similar throughout a genome and may reflect the net response of the genome to selection pressures. They compare dinucleotide odds ratios for a number of organisms and demonstrate their capability of distinguishing organisms. Karlin et al. [2] observe that dinucleotide odds ratios typically range from 0.78 to 1.23. They define the delta distance between strings $x$ and $y$ to be

\[ \delta(x, y) = \frac{1000}{16} \sum_{\text{dinucleotide } XY} \left| \rho_{XY}(x) - \rho_{XY}(y) \right|. \]
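The two definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`frequencies`, `odds_ratios`, `delta_distance`) are our own, frequencies are taken from overlapping occurrences on a single strand exactly as the formulas read, and a ratio with a zero denominator is set to 0 by assumption.

```python
from itertools import product

ALPHABET = "ACGT"

def frequencies(seq, k):
    """Relative frequency of every length-k word, counting overlapping occurrences."""
    total = len(seq) - k + 1
    counts = {"".join(p): 0 for p in product(ALPHABET, repeat=k)}
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # ignore words containing symbols outside A, C, G, T
            counts[word] += 1
    return {word: c / total for word, c in counts.items()}

def odds_ratios(seq):
    """rho_XY = f_XY / (f_X * f_Y) for all 16 dinucleotides XY."""
    f1 = frequencies(seq, 1)
    f2 = frequencies(seq, 2)
    rho = {}
    for xy, f in f2.items():
        denom = f1[xy[0]] * f1[xy[1]]
        # Assumption: define the ratio as 0 when a nucleotide never occurs.
        rho[xy] = f / denom if denom > 0 else 0.0
    return rho

def delta_distance(x, y):
    """delta(x, y) = (1000 / 16) * sum over dinucleotides of |rho_XY(x) - rho_XY(y)|."""
    rx, ry = odds_ratios(x), odds_ratios(y)
    return 1000.0 / 16.0 * sum(abs(rx[d] - ry[d]) for d in rx)
```

For example, `delta_distance(a, b)` compares two sequences held as Python strings; identical sequences give a distance of 0.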

Jernigan and Baran [3] demonstrate that the delta distance between strings sampled from a genome is preserved over a wide range of string lengths |x| and |y|. For bacterial species, Coenye and Vandamme [4] correlate delta distance with 16S rDNA sequence similarity and DNA-DNA hybridization values. They find a strong negative correlation between δ and 16S rDNA similarity among groups of species with low δ and high 16S rDNA similarity. They also demonstrate an overall high negative correlation between δ and DNA-DNA hybridization values. Deschavanne et al. [5] construct images from oligonucleotide frequencies. For 57 prokaryotic genomes, Sandberg et al. [6] compare G+C content, oligonucleotide frequency, and codon bias. Dufraigne et al. [7] and van Passel et al. [8] employ oligonucleotide frequencies to identify regions of horizontal gene transfer (HGT) in prokaryotes. Carbone et al. [9] correlate the ecological niches of 80 Eubacteria and 16 Archaea to codon bias used as a genomic signature. As part of our DNA Words program investigating mathematical invariants derived from genomes, we examine the finest scale in graph-theoretic terms, employing a generalization of de Bruijn graphs. One frequently exploited observation is that a string over ΣDNA defines a walk in a suitably defined de Bruijn graph. Closely related is the correspondence of such a string to an Eulerian tour in a suitably defined multigraph. Applications include DNA physical mapping, DNA sequence assembly, and multiple sequence alignment problems [10,11,12,13,14]. In Section 2, we formalize the mathematical basis for graph-theoretic genomic signatures and, in Section 3, prove that, under reasonable probabilistic assumptions, these signatures characterize a genomic sequence with high probability and distinguish it from those of other organisms. We present empirical studies that support the theoretical results in Section 4 and conclude in Section 5.

2 Preliminaries

An alphabet is a finite, non-empty set of symbols; the DNA alphabet is $\Sigma_{DNA} = \{A, C, G, T\}$. A string or word $x$ over $\Sigma_{DNA}$ is a finite sequence $x = \sigma_1 \sigma_2 \cdots \sigma_w$ of symbols from $\Sigma_{DNA}$; its length $|x|$ is $w$. A single chromosome in a genome is typically written as the string of nucleotides on one DNA strand. A genomic sequence is a chromosomal sequence or any substring of it. G is the set of all chromosomal sequences from an organism. Nucleotide frequencies vary among organisms, while, as Fickett et al. [15] observe, the frequencies of A's and T's (and hence of G's and C's) are approximately constant within a single genome. If $x$ and $y$ are strings, then $occ(x, y)$ is the count of occurrences of $x$ in $y$.

Fix a word length $w \ge 1$. Let $l = 4^w$. The order-$w$ state space is $S^w = \Sigma_{DNA}^w$, the set consisting of the $l$ words of length $w$. The order-$w$ de Bruijn graph $DB^w = (S^w, E)$ is the directed graph, where $(x_i, x_j) \in E$ when $x_i \sigma = \iota x_j$, for some $\sigma, \iota \in \Sigma_{DNA}$; such an edge is labeled $\sigma$ [16]. Fig. 1 illustrates the order-2 de Bruijn graph.

Let $H \in \Sigma_{DNA}^*$ have length $|H| = n$; we think of $H$ as a long genomic sequence that traces a walk in $DB^w$. The vertex count of $x_i$ in $H$ is $vc(x_i, H) = occ(x_i, H)$, while the edge count of edge $(x_i, x_j) \in E$ in $H$, where $x_i \sigma = \gamma x_j$, is $ec((x_i, x_j), H) = occ(x_i \sigma, H)$. The order-$w$ DNA word graph $DNA^w(H)$ is $DB^w$ together with labels $vc(x_i, H)$ for each $x_i \in S^w$ and $ec((x_i, x_j), H)$ for each $(x_i, x_j) \in E$. For $x_i, x_j \in S^w$, the frequency of $x_j$ after $x_i$ in $H$ is

\[ \mathrm{Freq}((x_i, x_j), H) = \begin{cases} 0 & \text{if } (x_i, x_j) \notin E \text{ or } vc(x_i, H) = 0; \\ \dfrac{ec((x_i, x_j), H)}{vc(x_i, H)} & \text{otherwise.} \end{cases} \]

For $1 \le i \le l$, let $x_i$ be the $i$th element of $S^w$ in lexicographic order. The order-$w$ word count vector $\chi^w_H$ of $H$ is the $l$-vector having components $occ(x_i, H)$, in lexicographic order.

We consider Markov chains with state space $S^w$ and having nonzero transition probabilities only for edges in $DB^w$; such a Markov chain is called an order-$w$ de Bruijn chain (DBC). Let $DC$ be an order-$w$ DBC with $l \times l$ transition probability matrix $P = (p_{ij})$; here, $p_{ij}$ is the probability of a one-step transition from state $x_i$ to state $x_j$ [17]. $P$ is sparse, with at most 4 nonzero entries per row. The order-$w$ DBC $DC^w(H)$ for genomic sequence $H$ has transition probabilities $p_{ij} = \mathrm{Freq}((x_i, x_j), H)$.

Experimental results suggest that genomic sequences are sufficiently large and diverse in their composition to sample all words in $S^w$ for reasonably small $w \in [1, 5]$. Hence, the DBCs generating such sequences are irreducible. Genomic sequences are also adequately random to assume that the DBCs generating them are aperiodic and recurrent non-null. Throughout, we assume that all DBCs are ergodic and hence that there is a unique stationary distribution $\pi = (\pi_i)$ on $S^w$ satisfying $\pi P = \pi$ [17]. This assumption does not hold for a genomic sequence that consists of systematic repeats of a small subset of words from $S^w$.

For a genome G and a genomic sequence $H$ taken from G, a genomic signature for $H$ is a function θ that maps $H$ to a mathematical structure θ(H). Ideally, θ(H) is able to identify sufficiently large substrings that come from G and to distinguish $H$ from genomic sequences of other genomes. To be useful, θ(H) must be efficiently computable. Of course, a representation of G itself satisfies the requirements, but offers no advantage in space. Fixing word length $w \ge 1$, we obtain $DNA^w(H)$, with associated $vc(x_i, H)$ and $ec((x_i, x_j), H)$. We define several candidate signatures. The simplest is the vertex count vector $\theta^{cv}(w) = (vc(x_i, H))_{i=1}^{l}$, requiring space $\Theta(4^w \lg n)$.
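As a minimal sketch of these definitions, the following Python code builds the vertex and edge counts of an order-w DNA word graph and the transition matrix of the corresponding DBC. The function names (`dna_word_graph`, `freq`, `transition_matrix`) and the dictionary-based representation are assumptions made for illustration, not part of the paper. Because the last length-w word of H has no successor, the row of P for that word can sum to slightly less than 1.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def dna_word_graph(H, w):
    """Return (words, vc, ec) for the order-w DNA word graph of H.

    words: the 4**w words of length w in lexicographic order.
    vc:    word -> occ(word, H).
    ec:    (x_i, x_j) -> occ(x_i + sigma, H), where x_j = x_i[1:] + sigma.
    """
    words = ["".join(p) for p in product(ALPHABET, repeat=w)]
    word_occ = Counter(H[i:i + w] for i in range(len(H) - w + 1))
    ext_occ = Counter(H[i:i + w + 1] for i in range(len(H) - w))
    vc = {x: word_occ.get(x, 0) for x in words}
    ec = {(x, x[1:] + s): ext_occ.get(x + s, 0) for x in words for s in ALPHABET}
    return words, vc, ec

def freq(xi, xj, vc, ec):
    """Freq((x_i, x_j), H): 0 for non-edges or unseen x_i, else ec / vc."""
    if (xi, xj) not in ec or vc[xi] == 0:
        return 0.0
    return ec[(xi, xj)] / vc[xi]

def transition_matrix(words, vc, ec):
    """Dense l x l matrix P with p_ij = Freq((x_i, x_j), H)."""
    return [[freq(xi, xj, vc, ec) for xj in words] for xi in words]
```

A dense matrix is used only for readability; since each row has at most 4 nonzero entries, a sparse representation would be the natural choice for larger w.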


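[Figure 1: the sixteen nodes of the order-2 de Bruijn graph grouped into Supernode A (AA, AC, AG, AT), Supernode C (CA, CC, CG, CT), Supernode G (GA, GC, GG, GT), and Supernode T (TA, TC, TG, TT).]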

Fig. 1. Representation of the de Bruijn graph $DB^2$ in terms of supernodes and superedges. Each supernode consists of the 4 nodes with the same 1-symbol prefix in their labels and is enclosed by a dotted boundary. An edge from a node to a supernode represents a set of edges from the node to all nodes in the supernode. For example, the edge from node AC to supernode C represents the set of edges {(AC, CA), (AC, CC), (AC, CG), (AC, CT)}.

Additional signatures come from the interplay between the graph structure $DB^w$ and the count vectors. Let $\psi \ge 0$ be an integer threshold. Let $E^{\le\psi} = \{(i, j) \in E \mid ec((i, j), H) \le \psi\}$ be the set of edges with counts at most $\psi$. Then edge deletion is the process of deleting the edges in $E^{\le\psi}$ from $DB^w$ while varying $\psi$ from 0 to $\Xi = \max\{ec((i, j), H) \mid (i, j) \in E\}$, deleting edges with tied counts in arbitrary order. The $\psi$-edge deletion of $DB^w$ is $DB^w(\psi) = (S^w, E - E^{\le\psi})$. As $\psi$ increases from 0 to $\Xi$, the number of connected components in $DB^w(\psi)$ increases from 1 to $l$, while the number of isolated vertices increases from 0 to $l$.

The vertex deletion order $\theta^{vdo}$ is the permutation of $S^w$ giving the order in which vertices become isolated during edge deletion. Let $\psi_i$ be the smallest integer such that $DB^w(\psi_i)$ has precisely $i$ connected components. The component-based edge deletion vector $\theta^{ced}$ is the $l$-vector whose $i$th component is the number of edge deletions required to go from $i - 1$ to $i$ components. The vertex-based edge deletion vector $\theta^{ved}$ is the $l$-vector whose $i$th component is the number of edge deletions required to go from $i - 1$ to $i$ isolated vertices. The ordered vertex-based edge deletion vector $\theta^{oed}$ is the $l$-vector whose $i$th component is the total number of edge deletions required to isolate the vertex $x_i$, where $x_i$ is the $i$th element of $S^w$ in lexicographic order. For two vector-based signatures $\theta_1$ and $\theta_2$, let $d(\theta_1, \theta_2)$ be their distance under the $L_1$ metric in $l$-dimensional real space.
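The edge deletion process and the four signatures derived from it can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `edge_deletion_signatures` is hypothetical, ties are broken lexicographically (the text allows any order), a vertex is treated as isolated once all incident edges including self-loops are deleted, connectivity ignores edge direction, and the first entry of the component vector is taken to be 0 because the graph starts with one component. Component counts are obtained by replaying the deletions in reverse with a union-find structure.

```python
def edge_deletion_signatures(words, ec):
    """Compute the vdo, ved, ced, and oed signatures from edge counts.

    words: vertex labels in lexicographic order.
    ec:    dict mapping each edge (u, v) to its count.
    """
    # Delete edges in increasing order of count (ties broken lexicographically).
    edges = sorted(ec, key=lambda e: (ec[e], e))
    m, l = len(edges), len(words)

    # Number of incident edges per vertex; a self-loop counts once.
    degree = {x: 0 for x in words}
    for u, v in edges:
        degree[u] += 1
        if v != u:
            degree[v] += 1

    # isolated_at[x] = number of deletions performed when x becomes isolated.
    isolated_at = {}
    for step, (u, v) in enumerate(edges, start=1):
        for x in {u, v}:
            degree[x] -= 1
            if degree[x] == 0:
                isolated_at[x] = step

    oed = [isolated_at[x] for x in words]
    vdo = sorted(words, key=lambda x: (isolated_at[x], x))
    steps = [isolated_at[x] for x in vdo]
    ved = [steps[0]] + [b - a for a, b in zip(steps, steps[1:])]

    # comps[k] = number of components after the first k deletions,
    # computed by adding edges back in reverse order with union-find.
    parent = {x: x for x in words}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    comps = [0] * (m + 1)
    comps[m] = l
    count = l
    for k in range(m - 1, -1, -1):
        ru, rv = find(edges[k][0]), find(edges[k][1])
        if ru != rv:
            parent[ru] = rv
            count -= 1
        comps[k] = count

    # first[c] = fewest deletions after which the graph has c components.
    first = {}
    for k, c in enumerate(comps):
        first.setdefault(c, k)
    ced = [first.get(i, 0) - first.get(i - 1, 0) for i in range(1, l + 1)]

    return {"vdo": vdo, "ved": ved, "ced": ced, "oed": oed}
```

After the initial sort of the 4^(w+1) edges, the remaining work is near-linear in the number of edges, which is in line with the sorting-dominated running time reported for the ordered signature later in the paper.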

3 Theory and Methods

We imagine every biological sequence to be generated by a formal model that can be approximated by a DBC. In this section, we build a theoretical framework to analyze distances between genomic signatures in terms of the parameters of the DBC generating them.


Let $DC$ be an ergodic, order-$w$ DBC. Let $H$ be a sequence generated by $DC$, where $|H| = n$. If $x_i, x_j \in S^w$, the probability of transition from state $x_i$ to state $x_j$ is given by $p_{i,j}$, and the stationary probability for $x_i$ is $\pi_i$.

Let $x = \sigma_1 \sigma_2 \ldots \sigma_w \in \Sigma_{DNA}^w$. A period of $x$ is an integer $i$, where $1 \le i \le w$, such that $x[1 \ldots i] = x[w - i + 1 \ldots w]$. Two occurrences $H[i \ldots i + w - 1]$ and $H[j \ldots j + w - 1]$ of $x$ in $H$ overlap if $i \le j \le i + w - 1$ or $j \le i \le j + w - 1$. An $x$-clump in $H$ is a maximal subsequence of one or more consecutive overlapping occurrences of $x$. For example, 2 is a period of $x$ = AACAA, and AACAACAACAACAA is a clump with 4 occurrences of $x$. Waterman [18] notes that the count of a rare DNA word in $H$ is a function of the number of $x$-clumps in $H$, which approximately follows a Poisson distribution with parameter $\lambda_x$ (derived below).

Let $x$ be a DNA word with shortest period $d$. Then a declumping event with respect to $x$ is defined as the event of not observing the string $x' = x[1 \ldots d]$. Suppose the probability of occurrence of $x'$ is $p_x$. Then the probability of a declumping event is given by $q_x = 1 - p_x$. The number of occurrences of $x$ within a clump is approximately geometric with mean $1/p_x$ [18].

Lemma 1. Let $X_x$ be the random variable that is the number of occurrences of word $x$ in genomic sequence $H$. Then the probability generating function of $X_x$ is

\[ f_{X_x}(t) = \exp\left( \frac{\lambda_x (t - 1)(1 - p_x)}{1 - q_x t} \right). \]

Proof. Let $Z$ be the random variable that is the number of $x$-clumps in $H$, and let $C_i$ be the number of occurrences of $x$ in the $i$th clump. Hence,

\[ X_x = \sum_{i=1}^{Z} C_i. \]

Since $Z$ has (approximately) a Poisson distribution with parameter $\lambda_x$, the probability generating function for $Z$ is

\[ f_Z(t) = \sum_{k=0}^{\infty} e^{-\lambda_x} \frac{(\lambda_x t)^k}{k!} = e^{\lambda_x (t - 1)}. \]

The probability generating function for each $C_i$ is

\[ f_C(t) = p_x \sum_{k=0}^{\infty} (q_x t)^k = \frac{p_x}{1 - q_x t}. \]

Hence the probability generating function for $X_x$ is

\[ f_{X_x}(t) = f_Z(f_C(t)) = \exp\left( \lambda_x \left( \frac{p_x}{1 - q_x t} - 1 \right) \right) = \exp\left( \frac{\lambda_x (t - 1)(1 - p_x)}{1 - q_x t} \right). \]

Lemma 2. $E[X_x] = \dfrac{\lambda_x q_x}{p_x}$ and $\mathrm{Var}[X_x] = \dfrac{\lambda_x q_x}{p_x} \left( 2 + \dfrac{q_x}{p_x} \right)$.

Proof. By results in [17], $E[X_x] = f'_{X_x}(1)$ and $\mathrm{Var}[X_x] = f''_{X_x}(1) + f'_{X_x}(1) - (f'_{X_x}(1))^2$. The lemma follows by calculation.
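The composition step in the proof of Lemma 1 and the expectation in Lemma 2 can be checked numerically. The sketch below is illustrative only: the function names are our own, the parameter values passed in are arbitrary, and the derivative at t = 1 is approximated by a central difference rather than computed symbolically.

```python
import math

def pgf_clumps(t, lam):
    """f_Z(t) = exp(lam * (t - 1)), the Poisson generating function."""
    return math.exp(lam * (t - 1.0))

def pgf_clump_size(t, p):
    """f_C(t) = p / (1 - q * t) with q = 1 - p, as in the proof of Lemma 1."""
    q = 1.0 - p
    return p / (1.0 - q * t)

def pgf_count_closed_form(t, lam, p):
    """Closed form of f_X(t) stated in Lemma 1."""
    q = 1.0 - p
    return math.exp(lam * (t - 1.0) * (1.0 - p) / (1.0 - q * t))

def mean_from_pgf(lam, p, h=1e-6):
    """Approximate E[X] = f_X'(1) with a central difference of step h."""
    hi = pgf_count_closed_form(1.0 + h, lam, p)
    lo = pgf_count_closed_form(1.0 - h, lam, p)
    return (hi - lo) / (2.0 * h)
```

For instance, with lam = 3.0 and p = 0.7, `pgf_clumps(pgf_clump_size(t, 0.7), 3.0)` agrees with `pgf_count_closed_form(t, 3.0, 0.7)` for t in [0, 1], and `mean_from_pgf(3.0, 0.7)` is close to lam * q / p.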

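Lemma 3. Let $H$ be a genomic sequence of length $n$, and let $\chi = \chi^w_H$ be its word count vector. Fix threshold $\tau > 0$. Then

\[ \Pr\left[ d(\chi, E[\chi]) \ge l\tau \right] \le \prod_{x \in S^w} \frac{n \pi_x}{\tau^2} \left( 2 + \frac{q_x}{p_x} \right). \]

Proof. Let $\chi^w_H = (X_1, X_2, \ldots, X_l)$. Since $E[X_x] = n \pi_x = (\lambda_x q_x)/p_x$, we have $\lambda_x = (n \pi_x p_x)/q_x$. The distance between $\chi$ and $E[\chi]$ is

\[ d(\chi, E[\chi]) = \sum_{x \in S^w} \left| X_x - E[X_x] \right|. \]

By Chebyshev's bound and Lemma 2, we obtain

\[ \Pr\left[ |X_x - E[X_x]| \ge \tau \right] \le \frac{\mathrm{Var}[X_x]}{\tau^2} = \frac{\lambda_x q_x}{\tau^2 p_x} \left( 2 + \frac{q_x}{p_x} \right) = \frac{n \pi_x}{\tau^2} \left( 2 + \frac{q_x}{p_x} \right). \]

The lemma follows from the resulting inequality:

\[ \Pr\left[ d(\chi, E[\chi]) \ge l\tau \right] \le \prod_{x \in S^w} \Pr\left[ |X_x - E[X_x]| \ge \tau \right]. \]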

Theorems 1 and 2 address the ability of word count vectors to identify and distinguish DBCs.

Theorem 1. Let $DC$ be an order-$s$ DBC. Let $H_1$ and $H_2$ be two genomic sequences of length $n$ generated independently by $DC$. Let $\chi_1$ and $\chi_2$ be their respective order-$w$ word count vectors. Then,

\[ \Pr\left[ d(\chi_1, \chi_2) \ge 2 l \tau \sqrt{n} \right] \le \frac{2e}{\tau^{2l}}. \]

Proof. The component-wise expected values in $\chi_1$ and $\chi_2$ are the same. Their expected difference is therefore the 0 vector. Therefore,

\[ d(\chi_1 - E[\chi_1], \chi_2 - E[\chi_2]) = d(\chi_1, \chi_2). \]

Furthermore, using $T = \sqrt{n}\,\tau$ we obtain

\[ \Pr\left[ d(\chi_1, E[\chi_1]) \ge lT \right] = \Pr\left[ d(\chi_2, E[\chi_2]) \ge lT \right]. \]

Using the above equations and Lemma 3, we obtain

\[ \Pr\left[ d(\chi_1 - E[\chi_1], \chi_2 - E[\chi_2]) \ge 2lT \right] = \Pr\left[ d(\chi_1, \chi_2) \ge 2lT \right], \]

\[ \Pr\left[ d(\chi_1, \chi_2) \ge 2lT \right] \le \Pr\left[ d(\chi_1, E[\chi_1]) \ge lT \right] + \Pr\left[ d(\chi_2, E[\chi_2]) \ge lT \right] = 2 \prod_{x \in S^w} \frac{n \pi_x}{T^2} \left( 2 + \frac{q_x}{p_x} \right). \]

If $x' = x[1 \ldots d]$, where $d$ is the smallest period of $x$, then $|x| \ge |x'|$. Therefore, $p_x \ge \pi_x$ and $\dfrac{q_x}{p_x} \le \dfrac{1 - \pi_x}{\pi_x}$, which yields

\[ \Pr\left[ d(\chi_1, \chi_2) \ge 2lT \right] \le \frac{2}{\tau^{2l}} \prod_{x \in S^w} (1 + \pi_x). \]

From the Arithmetic-Geometric Inequality [19], we obtain

\[ \prod_{x \in S^w} (1 + \pi_x) \le \left( \frac{\sum_{x \in S^w} (1 + \pi_x)}{l} \right)^{l} = \left( 1 + \frac{1}{l} \right)^{l}. \]

From the above results we have

\[ \Pr\left[ d(\chi_1, \chi_2) \ge 2 l \tau \sqrt{n} \right] \le \frac{2}{\tau^{2l}} \left( 1 + \frac{1}{l} \right)^{l}. \]

Since $(1 + 1/l)^l$ increases to $e$ as $l \to \infty$, the theorem follows.
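The arithmetic-geometric step in this proof is easy to check numerically. The sketch below is only an illustration: `am_gm_terms` is a hypothetical helper name, and it accepts any probability vector standing in for the stationary distribution π. It returns the product of (1 + π_x) together with the bound (1 + 1/l)^l, which is itself below e.

```python
import math

def am_gm_terms(pi):
    """Return (prod_x (1 + pi_x), (1 + 1/l)**l) for a probability vector pi."""
    l = len(pi)
    prod_term = math.prod(1.0 + p for p in pi)
    bound = (1.0 + 1.0 / l) ** l
    return prod_term, bound
```

With a uniform distribution the two values coincide up to rounding; any other distribution on the same number of states gives a strictly smaller product.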



Let $H_1$ and $H_2$ be genomic sequences of length $n$, generated independently by DBCs $DC_1$ and $DC_2$ of orders $s_1$ and $s_2$, respectively. Let $\chi_1 = \chi^w_{H_1}$ and $\chi_2 = \chi^w_{H_2}$ be their order-$w$ word count vectors. The following assumption formalizes the separation of genomic sequences obtained from different organisms.

Assumption 1. There exists a real number $\gamma \in (0, 1]$ such that

\[ \Pr\left[ d(E[\chi_1], E[\chi_2]) \ge 3 l \tau \sqrt{n} \right] \ge \gamma. \]

Then, the distance $d(\chi_1, \chi_2)$ can distinguish $DC_1$ and $DC_2$.

Theorem 2. Let $X_{x,1}$ and $X_{x,2}$ denote the counts of $x$ in $H_1$ and $H_2$, respectively. Assuming that $H_1$ and $H_2$ are generated by Markov chains $DC_1$ and $DC_2$ of order $w$, let $\pi_{x,1}$ and $\pi_{x,2}$ denote the stationary probabilities of state $x$ in $DC_1$ and $DC_2$, respectively. If there exists a constant $\gamma$ as in Assumption 1, then

\[ \Pr\left[ d(\chi_1, \chi_2) \ge l \tau \sqrt{n} \right] \ge \gamma - \frac{1}{\tau^2} \prod_{x \in S^w} (2\pi_{x,1} + 1) - \frac{1}{\tau^2} \prod_{x \in S^w} (2\pi_{x,2} + 1). \]

Proof. Treating $d(\chi_1, \chi_2)$, $d(\chi_1, E[\chi_1])$, $d(\chi_2, E[\chi_2])$, and $d(E[\chi_1], E[\chi_2])$ as distances $d$, $d_1$, $d_2$, and $d_3$, respectively, in 1-dimensional space and using $T = \tau\sqrt{n}$ we obtain

\[ d_3 \le d + d_1 + d_2, \]

\[ \Pr[d_3 \ge 3lT] \le \Pr[d \ge lT] + \Pr[d_1 \ge lT] + \Pr[d_2 \ge lT]. \]

From Assumption 1, Lemma 3, and $\pi_x \le p_x$ we obtain

\[ \gamma \le \Pr\left[ d(\chi_1, \chi_2) \ge lT \right] + \prod_{x \in S^w} \frac{n \pi_{x,1}}{T^2} \left( 2 + \frac{q_{x,1}}{p_{x,1}} \right) + \prod_{x \in S^w} \frac{n \pi_{x,2}}{T^2} \left( 2 + \frac{q_{x,2}}{p_{x,2}} \right), \]

\[ \Pr\left[ d(\chi_1, \chi_2) \ge l \tau \sqrt{n} \right] \ge \gamma - \frac{1}{\tau^2} \prod_{x \in S^w} (2\pi_{x,1} + 1) - \frac{1}{\tau^2} \prod_{x \in S^w} (2\pi_{x,2} + 1). \]

The theorem follows.




By Theorem 2, the lower bound on the probability that the distance between the word count vectors of sequences generated by different DBCs exceeds $l \tau \sqrt{n}$ increases with $\tau$ (for a fixed $\gamma$). Sequences assumed to be generated by two different DBCs with sufficiently different stationary distributions would have a high probability of being separated by a large distance.
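The qualitative content of Theorems 1 and 2 can be explored with a small simulation: sample sequences from de Bruijn chains, form their word count vectors, and compare L1 distances within and between chains. Everything in this sketch is an assumption made for illustration, including the helper names and the two toy chains `at_rich` and `gc_rich`, whose transition rows do not depend on the current state. It is not the experimental setup used in the paper.

```python
import random
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def sample_dbc(next_probs, w, n, rng):
    """Generate a length-n sequence from an order-w de Bruijn chain.

    next_probs(state) returns the probabilities of appending A, C, G, T
    to the current length-w state; the start state is chosen uniformly.
    """
    seq = [rng.choice(ALPHABET) for _ in range(w)]
    while len(seq) < n:
        state = "".join(seq[-w:])
        seq.append(rng.choices(ALPHABET, weights=next_probs(state))[0])
    return "".join(seq)

def word_count_vector(H, w):
    """Order-w word count vector of H, components in lexicographic order."""
    counts = Counter(H[i:i + w] for i in range(len(H) - w + 1))
    return [counts.get("".join(p), 0) for p in product(ALPHABET, repeat=w)]

def l1_distance(u, v):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def at_rich(state):
    """Hypothetical toy chain favouring A and T."""
    return [0.4, 0.1, 0.1, 0.4]

def gc_rich(state):
    """Hypothetical toy chain favouring C and G."""
    return [0.1, 0.4, 0.4, 0.1]
```

Sampling two sequences from `at_rich` and one from `gc_rich` with a seeded `random.Random` and comparing `l1_distance` of their order-2 word count vectors shows the expected pattern: the within-chain distance is much smaller than the between-chain distance.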

4 Results and Discussion

To evaluate our genomic signatures, we employed chromosomal sequences from the following organisms: Arabidopsis thaliana (AT), Borrelia burgdorferi (BB), Caenorhabditis elegans (CE), Chlamydophila pneumoniae (CP), Chlamydia muridarum (CM), Escherichia coli (EC), Homo sapiens (HS), and Saccharomyces cerevisiae (SC). We computed the vertex count vector $\theta^{cv}(3)$, the vertex deletion order vector $\theta^{vdo}(3)$, the vertex-based edge deletion vector $\theta^{ved}(3)$, the component-based edge deletion vector $\theta^{ced}(3)$, and the ordered vertex-based edge deletion vector $\theta^{oed}(3)$ signatures for SC chromosomes 4, 5, and 8, CE chromosomes 1, 3, and 4, and AT chromosomes 1, 2, and 3, followed by all pairwise Pearson correlation coefficients between signature vectors for each type of signature. Figure 2(a) illustrates that, despite Theorem 2, even the nucleotide compositional differences between diverse species such as SC, CE, and AT are not captured by the traditional method of using multinucleotide (here, trinucleotide) frequencies. Correlation coefficients between $\theta^{cv}(3)$ signatures of chromosomes of the same species are very close to those between $\theta^{cv}(3)$ signatures of chromosomes of different species. Nucleotide frequencies for dimers, tetramers, and pentamers display similar characteristics (results not shown). For the same set of genomic sequences, the $\theta^{ved}(3)$ and $\theta^{ced}(3)$ signatures display much greater discriminatory power (Fig. 2) than the $\theta^{vdo}(3)$ and $\theta^{cv}(3)$ signatures.

Fig. 2. Correlation graphs of Pearson correlation coefficients between order-3 (a) $\theta^{cv}$s, (b) $\theta^{vdo}$s, (c) $\theta^{ved}$s, (d) $\theta^{ced}$s. Text in the circles indicates chromosome numbers. Edges between any two enclosed subgraphs are labeled with the range of correlations on all 9 edges between the subgraphs. p-values of all the correlation coefficients given are below $10^{-3}$. Lengths of the chromosomes used above are as follows: AT I: 30 M, AT II: 19 M, AT III: 15 M, CE I: 15 M, CE III: 14 M, CE IV: 17 M, SC IV: 1.5 M, SC V: 0.5 M, SC VIII: 0.5 M.

To test the effectiveness of the $\theta^{ced}$ and $\theta^{ved}$ signatures in identifying target genomes from smaller unknown genomic sequences, we randomly sampled 1 Mb sequences from the existing genomic sequences. We computed signatures from these sequences and matched them to existing genomic signatures. In most cases, the signature derived from a random sequence sampled from a genomic sequence of an organism O displayed the highest positive correlation with genomic signatures derived from chromosomes of O. However, this behavior was not conserved across all samples. Varying the word length at which the signatures were computed did not alter this behavior significantly.

To identify the organism from which a sequence originates more accurately, we combined the properties of $\theta^{vdo}$ and $\theta^{ved}$ to compute the ordered vertex-based edge deletion vector $\theta^{oed}$. The $\theta^{oed}$ signature correctly predicts the organism corresponding to a short DNA sequence (1 Mb) using a database of previously computed $\theta^{oed}$s for various organisms. Empirical results suggest that $\theta^{oed}$ performs best at order 5 for bacterial genomes. To demonstrate the effectiveness of the $\theta^{oed}$ signature, we tested its performance using 5 chromosomal sequences from 4 species of the Rhizobiaceae family. The species with their Entrez RefSeq numbers are: Agrobacterium tumefaciens str. C58 (2 chromosomes, NC_003062 and NC_003063), Rhizobium etli CFN 42 (NC_007761), Rhizobium leguminosarum bv. viciae 3841 (NC_008380), and Sinorhizobium meliloti 1021 (NC_003047). 1 Mb segments were randomly sampled from each of these sequences, their $\theta^{oed}$s were computed, and these were matched to the database of $\theta^{oed}$s. At order 5, all random segments correctly matched their respective target genomes, demonstrating that the $\theta^{oed}$ signature is sensitive enough to differentiate between species within a family. The proportions of segments matched to the correct genome were 60%, 80%, and 100% at orders 3, 4, and 5, respectively. The $\theta^{oed}$ signature is computable in $O(n + 4^{w+1} \log(4^{w+1}))$ time, where $n$ is the input sequence length and $w$ is the order at which the signature is computed.

Figure 3 illustrates $\theta^{ved}(3)$ and $\theta^{ced}(3)$ signatures for the four eukaryotes and the four prokaryotes. Although $\theta^{ved}(3)$ and $\theta^{ced}(3)$ appear similar, one cannot conclude that an increase of one in the number of isolated vertices precisely coincides with an increase of one in the number of connected components during edge deletion. As the DNA word graph fragments, in the early stage it is natural that the number of components grows at a rate greater than or equal to the number of isolated vertices. However, in the later stages, the graph is sufficiently fragmented so that an increase in the number of connected components of the graph coincides with the isolation of a vertex.

Fig. 3. $\theta^{ved}(3)$s for (a) four prokaryotes and (b) four eukaryotes. $\theta^{ced}(3)$s for (c) four prokaryotes and (d) four eukaryotes. $\theta^{oed}(3)$s for (e) four prokaryotes and (f) four eukaryotes. Numbers denote chromosomes. The above is a grayscale representation. Colors play a vital role in these signatures. Each shaded bar represents a specific component in the signature. Figures (e) and (f) illustrate that the $\theta^{oed}(3)$ bar code of each species is unique and sufficiently different from the $\theta^{oed}(3)$ bar codes of other species.
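The matching procedure used in this section, computing a signature for a sampled segment and selecting the database entry with the highest Pearson correlation, can be sketched as follows. The names `pearson` and `best_match` and the dictionary-based database are assumptions for illustration; treating a constant vector as having correlation 0 is likewise an assumption.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: correlation with a constant vector is taken as 0
    return cov / (norm_u * norm_v)

def best_match(query, database):
    """Return (label, r) for the database signature most correlated with query.

    database maps an organism or chromosome label to its signature vector.
    """
    scored = ((label, pearson(query, sig)) for label, sig in database.items())
    return max(scored, key=lambda item: item[1])
```

Here `query` would be the signature vector of a sampled segment and `database` would hold previously computed signatures of the same type and order.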

5 Conclusion

The genomic signatures introduced in this paper are systematically derived from the structure of DNA word graphs obtained from genomic sequences. Moreover, distances between such signatures can be characterized within a probabilistic framework in terms of the parameters of the underlying DBC assumed to generate the sequences. For each organism, both eukaryotic and prokaryotic, it is possible to derive a $\theta^{oed}$ bar code from a sufficiently long genomic sequence of that organism that uniquely identifies the organism among competing genomic sequences. When sufficient sequence for an organism is present in a biological sample, the target organism for the sample can be retrieved by querying an already existing database of signatures. All order-$w$ signatures discussed in this paper are compact and computable in $\Theta(4^w \lg n)$ time and space. The amount of sequence needed to create an order-$w$ signature representative of its species is exponential in $w$. In practice, DNA sequences that need to be matched to a target organism can be much smaller than 1 Mb. We have found that each genomic sequence has a separate set of specifications for order and sample sequence size for best results (work in progress). We continue to investigate bounds on the minimum amount of sequence required to achieve an effective $\theta^{oed}$ signature and on alternate signatures for high order $w$ that sample counts for a few length-$w$ strings instead of requiring counts for all $4^w$ strings.

Acknowledgments

We thank two anonymous reviewers for their useful comments and suggestions.

References

1. Karlin, S., Burge, C.: Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics 11(7) (1995) 283–290
2. Karlin, S., Mrazek, J., Campbell, A.M.: Compositional biases of bacterial genomes and evolutionary implications. Journal of Bacteriology 179(12) (1997) 3899–3913
3. Jernigan, R.W., Baran, R.H.: Pervasive properties of the genomic signature. BMC Genomics 3 (2002) 9 pages
4. Coenye, T., Vandamme, P.: Use of the genomic signature in bacterial classification and identification. Systematic and Applied Microbiology 27(2) (2004) 175–185
5. Deschavanne, P.J., Giron, A., Vilain, J., Fagot, G., Fertil, B.: Genomic signature: Characterization and classification of species assessed by chaos game representation of sequences. Molecular Biology and Evolution 16(10) (1999) 1391–1399
6. Sandberg, R., Branden, C.I., Ernberg, I., Coster, J.: Quantifying the species-specificity in genomics signatures, synonymous codon choice, amino acid usage, and G+C content. Gene 311 (2003) 35–42
7. Dufraigne, C., Fertil, B., Lespinats, S., Giron, A., Deschavanne, P.: Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Research 33(1) (2005) 12 pages
8. van Passel, M.W.J., Bart, A., Thygesen, H.H., Luyf, A.C.M., van Kampen, A.H.C., van der Ende, A.: An acquisition account of genomic islands based on genome signature comparisons. BMC Genomics 6 (2005) 10 pages
9. Carbone, A., Kepes, F., Zinovyev, A.: Codon bias signatures, organization of micro-organisms in codon space, and lifestyle. Molecular Biology and Evolution 22(3) (2005) 547–561
10. Pevzner, P.A.: DNA physical mapping and alternating Eulerian cycles in colored graphs. Algorithmica 13(1-2) (1995) 77–105
11. Pevzner, P.A., Tang, H.X., Waterman, M.S.: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America 98(17) (2001) 9748–9753
12. Zhang, Y., Waterman, M.S.: An Eulerian path approach to global multiple alignment for DNA sequences. Journal of Computational Biology 10(6) (2003) 803–819
13. Raphael, B., Zhi, D.G., Tang, H.X., Pevzner, P.: A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Research 14(11) (2004) 2336–2346
14. Zhang, Y., Waterman, M.S.: An Eulerian path approach to local multiple alignment for DNA sequences. Proceedings of the National Academy of Sciences of the United States of America 102(5) (2005) 1285–1290
15. Fickett, J.W., Torney, D.C., Wolf, D.R.: Base compositional structure of genomes. Genomics 13(4) (1992) 1056–1064
16. Rosenberg, A.L., Heath, L.S.: Graph separators, with applications. Frontiers of Computer Science. Kluwer Academic/Plenum Publishers (2000)
17. Feller, W.: An Introduction to Probability Theory and Its Applications. Third edn. Volume I. John Wiley & Sons Inc., New York (1968)
18. Waterman, M.: Introduction to Computational Biology. First edn. Academic Press Inc., Boston, MA (1995)
19. Cauchy, A.L.: Cours d'analyse de l'École Royale Polytechnique. Première partie. Instrumenta Rationis. Sources for the History of Logic in the Modern Age, VII. Cooperativa Libraria Universitaria Editrice Bologna, Bologna (1992) Analyse algébrique. [Algebraic analysis], Reprint of the 1821 edition, Edited and with an introduction by Umberto Bottazzini.