BIOINFORMATICS
ORIGINAL PAPER
Vol. 22 no. 5 2006, pages 517–522 doi:10.1093/bioinformatics/btk029
Genome analysis
Bayesian classifiers for detecting HGT using fixed and variable order markov models of genomic signatures Daniel Dalevi1, , Devdatt Dubhashi1 and Malte Hermansson2 1 2
Department of Computing Science, Chalmers University, SE 412 96 Go¨teborg, Sweden and Department of Cell and Molecular Biology, Microbiology, Go¨teborg University, 405 30 Go¨teborg, Sweden
Received on June 21, 2005; revised on December 8, 2005; accepted on December 27, 2005 Advance Access publication January 10, 2006 Associate Editor: Christos Ouzounis
ABSTRACT Motivation: Analyses of genomic signatures are gaining attention as they allow studies of species-specific relationships without involving alignments of homologous sequences. A naı¨ve Bayesian classifier was built to discriminate between different bacterial compositions of short oligomers, also known as DNA words. The classifier has proven successful in identifying foreign genes in Neisseria meningitis. In this study we extend the classifier approach using either a fixed higher order Markov model (Mk) or a variable length Markov model (VLMk). Results: We propose a simple algorithm to lock a variable length Markov model to a certain number of parameters and show that the use of Markov models greatly increases the flexibility and accuracy in prediction to that of a naı¨ve model. We also test the integrity of classifiers in terms of false-negatives and give estimates of the minimal sizes of training data. We end the report by proposing a method to reject a false hypothesis of horizontal gene transfer. Availability: Software and Supplementary information available at www.cs.chalmers.se/~dalevi/genetic_sign_classifiers/ Contact:
[email protected] 1
INTRODUCTION
Mutational and selectional forces acting at the cellular level result in species-specific DNA characteristics that manifest in, for example, G + C content (e.g. Muto and Osawa, 1987), nucleotide dependencies (e.g. Hooper and Berg, 2002), codon bias (e.g. Sharp and Matassi, 1994) and oligomer frequencies (e.g. Burge et al., 1992; Sandberg et al., 2001). These characteristics can be used to identify foreign genes in bacterial genomes. Introduction of genes by horizontal gene transfer (HGT), rather than vertical inheritance of genes from parents by the offspring, is an important mechanism of bacterial adaptation, and has profound implications for phylogenetic analyses (Woese, 1987; Doolittle, 1999). Existence of highly homologous genes in evolutionarily distantly related organisms has been taken as an indication of HGT. These can be detected using, for example, whole-genome phylogenies (e.g. Sicheritz-Ponten and Andersson, 2001) or by comparing gene trees with a species tree (e.g. Hallet and Lagergren, 2000). These approaches obviously require knowledge and sequence data of homologous genes from many organisms in order to work. Analyses
To whom correspondence should be addressed.
of DNA characters on the other hand, can identify foreign genes directly with access to only one genome (e.g. Dufraigne et al., 2005). In the case of G + C content much work has been done based on the assumption that organisms occupy a certain ‘G + C color’ shaped by different mutational and selectional forces (e.g. Muto and Osawa, 1987; Lawrence and Ochman, 1997; Lee et al., 2004; Sandberg et al., 2003; Forsdyke and Mortimer, 2000). However, these have proven to capture only about 40% of the speciesspecificity (Sandberg et al., 2003). This is not surprising since reliance on the G + C content alone places a lot of faith on a single number. This was also noted by Koski et al. (2001) who announced that G + C content on its own is a poor indicator of HGT. They also found that a high Codon Adaption Index (CAI) was a good indicator of positional conservation, meaning no HGT. But on the other hand, a low CAI was not as good an indicator of HGT as a atypical G + C content in the first and third codon positions. Their main conclusion was that any estimates of HGT ought to be confirmed using phylogenies whenever such can be made. Furthermore, when comparing the third position G + C content of the set of positionally conserved genes with the set of unique genes in a pair of Chlamydia species, the distributions were not significantly different indicating that either no HGT had occurred since time of divergence or G + C content was a poor indicator of such (Dalevi et al., 2002). A better indicator of species-specificity is to study the genomic composition of oligomers of different lengths, or, so called genomic signatures (Karlin and Burge, 1995; Dufraigne et al., 2005; Sandberg et al., 2003; Wang et al., 2005). Burge et al. (1992) showed that the composition of dimers is highly specific and conserved through the genome for bacterial chromosomes and plasmids. Pride et al. (2003) declared similar results for tetramers. The use of overlapping k-mers can capture many more characteristics than any individual estimates of for example codon usage, restriction sites and transcription factor binding sites—all of which may be host-specific. Recently, Teeling et al. (2004) demonstrated that the discriminatory power of correlations in a tetramer derived Z-score was far superior to that of differences in G + C content using a dataset of 118 bacteria. More generally, in the studies of HGT, Sandberg et al. (2001), built a naı¨ve classifier using models with arbitrary word size. The classifier was applied to different data verifying that it was indeed capable of identifying HGT that were already well-documented, such as HGT from Haemophilus influenzae to Neisseria meningitis (Kroll et al., 1998). The approach based on oligomer profiles and the technique of Sandberg et al.
The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email:
[email protected] 517
D.Dalevi et al.
(2001) for classifying unknown DNA deserves further attention. One obvious direction in which to extend this work is to use more flexible and mathematically sound models that can take into account statistical dependencies between words.This is the approach we pursue here using fixed and variable length Markov models. Markov models are very natural for modeling DNA and have been used successfully in many applications, e.g. identifying transcription binding sites (Ellrott et al., 2002; Zhao et al., 2004), modeling oligomers (Reinert et al., 2000; Scherer et al., 1994), gene finding (Borodovsky and McIninch, 1993; Salzberg et al., 1998) and identifying horizontally transferred genes (Nakamura et al., 2004). DNA sequences have the following properties: (1) A finite alphabet G 2 {A, C, G, T}, (2) There is a ‘time’ index taken as the direction of replication and transcription (50 !30 ) and (3) The dependencies on earlier nucleotides vanish with time, a sort of amnesia, also known as the Markov property. All (1)–(3) are typical characteristics of Markov chains. In this report, we expand previous work (Karlin and Burge, 1995; Dufraigne et al., 2005; Sandberg et al., 2003; Wang et al., 2005) by proposing the framework of (fixed and variable order) Markov chains to capture dependencies in genomic signatures. In particular, we demonstrate that variable order Markov chains (Bu¨hlmann and Wyner, 1999; Ron et al., 1996) are very well suited when dependencies are long in specific directions, and that the sparsity of their parameter space renders them very suitable for model inference. We demonstrate improvements over previous work and propose a novel method for locking the number of parameters in a variable length Markov model together with giving estimates of minimal sizes of training data. We confirm the integrity of the classifiers by estimating the rate of false negatives and propose a novel method for assigning confidence to candidates of HGT. The contribution of this study is a novel classifier with high flexibility, accuracy and mathematical soundness.
2
METHODS
This section is divided into several subsections. First the basic common principle behind all the Bayesian classifiers is presented. Next we discuss the different implementations of the likelihood function.
2.1
The Bayesian classifiers
The general classification problem we address may be formulated as follows. We are given a test sequence S and a collection of possible source genomes of origin G. We would like to classify the sequence S as arising from a particular G 2 G. The criterion for classification we adopt here is a standard simple Bayesian rule of maximum a posteriori. By Bayes’ rule, PðG j SÞ ¼ P
PðS j GÞPðGÞ : G2G PðS j GÞPðGÞ
ð1Þ
Note, when using a flat prior, P(G) ¼ const for all G 2 G, Equation (1) simplifies to the likelihood. To account for strand-specificity we simply take the maximum of the two possible probability values for the sequences, S and Sc, where Sc is the reverse complement of S, LðS‚ GÞ :¼ arg max max fPðS j GÞ‚PðSc j GÞg: G2G
ð2Þ
The genome, G 2 G, that maximizes L(S,G) is the most likely source of S.
2.2
The naı¨ve classifier, Nk
The naı¨ve classifier Nk of order k, earlier implemented in Sandberg et al. (2001), considers words M of length k + 1 and estimates each P(M j G) by the
518
Table 1. The corresponding number of free parameters O(k) in a Markov model of order k when the alphabet is DNA
k
0
1
2
3
4
5
O(k)
3
12
48
192
768
3072
relative frequency with which M occurs in G (as a window of size k + 1 is slid from left to right across G). It then estimates P(S j G) by PðS j GÞ ¼
Y
PðM j GÞ:
ð3Þ
M2M
Here M is the set of words of length k + 1 that occur in S (as we slide a window of size k across S from left to right). The sequence S is then classified among a set of genomes G by a simple maximum likelihood rule: assign S to that G 2 G which maximizes P(S j G). The drawback of this classifier is that (3) assumes that the probabilities P(M j G) are all independent for M 2 M, an assumption which is clearly invalid given that the words M 2 M are overlapping. This was pointed out by the authors themselves (Sandberg et al., 2001): ‘The naı¨ve Bayesian classifier assumes each motif to be independent of the other motifs, which is clearly false’. To remedy this drawback, we consider Markov models.
2.3
The fixed order Markov classifier, Mk
A Markov chain is a discrete stochastic process (X0, X1, X2, X3, . . . , Xi, . . .), with Xi taking values in a finite set G called the alphabet, such that there is a k 0 satisfying for all n k, n1 PðXn ¼ xn j Xn1 ¼ x0n1 Þ ¼ PðXn ¼ xn j Xn1 0 nk ¼ xnk Þ:
ð4Þ
Here and in the sequel, xnm denotes the sequence xm, xm+1, . . . , xn. The order of the Markov chain is the smallest integer k ¼ k0 0 such that (4) holds. A Markov chain of order k is completely specified by giving the initial sequence x0, . . . , xk1 and then the transition probabilities pu‚ a :¼ Pða j uÞ for u 2 Gk, a 2 G. Thus specifying a Markov chain of order Pk involves jGjk(jGj 1) degrees of freedom (since for each u 2 Gk, we have a pu,a ¼ 1). For example, when the alphabet is DNA, G ¼ {A, C, G, T} and the order k ¼ 1, we have 12 degrees of freedom. Table 1 lists the number of free parameters for the first ks. The transition probabilities are estimated by maximum likelihood using the empirical distribution, p~ kx ðu‚aÞ :¼ ðN x ðuaÞ=N x ðuÞ) estimated from frequency counts Nx(w) of words w in the data x.
2.4
The variable length markov classifier, VLMk
The basic idea behind variable length Markov models (VLMMs, Bu¨hlmann and Wyner, 1999; Ma¨chler and Bu¨hlmann, 2004; Ron et al., 1996) is that the memory of the stationary Markov chain should be allowed to depend on the context, i.e. the position in the data. Depending on the context, one may have to look back a variable number of symbols in order to determine the probabilities of the next transition. This can be made precise via the notion of a Prediction Suffix Tree (PST). This is a rooted tree where each vertex v is labeled with a string sv 2 G and each edge is labeled with a letter a 2 G. The root is labeled with the empty string. If a vertex v is labeled sv the child w is labeled sva, where a is the label of the edge (v, w). A terminal node v with label sv is also associated with transition probabilities gs,a, a 2 G. The transitions in a VLMM are determined as follows: given x1, . . . , xn, follow the unique path from the root given by the edges labeled by xn,xn1, . . . , (i.e. in reverse order) until we
Bayesian classifiers for detecting HGT
(a)
(b)
80 70 60 50 n1 m1 vlm1
40 30
0
500
1000
(c)
100
100
90
90
80
80
70
70
60
60
50
n2 m2 vlm2
40 30
0
(d)
500
1000
50
n3 m3 vlm3
40 30
0
500
(e)
100
1000
(f)
100
90
90
80
80 80
70 70
60 vlm1 vlm2 vlm3 vlm4 vlm5
60 n4 m4 vlm4
50 40
0
500
1000
40
20
0
500
1000
60 n4 m4 vlm4
50 40
500
1000
1500
2000
Fig. 1. Comparing performance. The y-axis shows the efficiency measured in percentage of correctly classified samples and the x-axis indicates the number of nucleotides. A total of 100 sequences were drawn randomly from the 10% test set and classified to the profiles constructed from the remaining 90%. In (a)–(e) 28 taxa were used. In (a) variable of 12–15 free parameters (vlm1), k ¼ 1 for naı¨ve (n1) and fixed order (m1), (b) k ¼ 2, (c) k ¼ 3 (d) k ¼ 4, (e) improvement for the VLMk for k ¼ 1, . . . , 5 (f) k ¼ 4, same as (d) but with 157 taxa.
arrive at a leaf node. Use the transition probabilities associated with the leaf node to generate the next transition. VLMMs are a structurally much richer class of models allowing vastly increased flexibility in choosing the number of parameters to fit a given set of data.They may be seen as a broad class of Markov models that adapt to the environment. The optimal model can be found using standard model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) by generating all possible sub-models. However, in practice, it is not possible to perform an exhaustive search. Therefore other approaches have to be considered.
2.5
Locking the model
Instead of finding the optimal model we suggest that one should use the model with the most number of parameters that the data allows when classifying genes. Sandberg et al. (2001) found that the longer the oligomers they used the better the classification. This is true as long as there is enough data. This simply reflects the fact that a higher order model can always approximate a lower order model. However, when data get too sparse the approximation will not work because of unreliable estimates of the transition probabilities. In this section we propose a simple locking algorithm, based on the algorithm in Bu¨hlmann and Wyner (1999), to generate a PST with a specified number of parameters m. This locking procedure is also necessary in order to make a sort of fair comparison with the fixed order model when comparing performance. The algorithm first builds a maximal context tree having a minimum number of counts of each element and a maximal depth. For nodes with next-symbol probabilities equal to zero we apply the correction of pseudocounts with Laplace’s rule (e.g. Durbin et al., 1998). Then the algorithm assigns the node with string v the score, Dw,v, where w is the j v j 1 suffix of
v, that is based on a Kullback–Liebler distance of next-symbol probabilities (Bu¨hlmann and Wyner, 1999). X p½a j v Dw‚ v ¼ NðvÞ: ð5Þ p½a j vlog p½a j w a2G The tree is then pruned, by removing terminal nodes with lower scores, until the PST has approximately m parameters (see also the Supplementary Material).
3 3.1
RESULTS AND DISCUSSION Comparing performance
Two datasets, one with 28 taxa (same as used in, Sandberg et al., 2001) and one with 157 taxa, were used for testing the performance of the classifiers. Both datasets are available in the Supplementary Material. Of the data 90% was used for training and the remaining part for testing. The same k was used for the naı¨ve and the fixed order Markov model meaning that we compare performance for oligomers of the same length (Fig. 1). The variable length Markov model was constrained to the same number of free parameters but with no limit to the length of the oligomers. The only exception was made for k ¼ 1 where the VLMk got the opportunity to choose a model with 15 parameters if more appropriate. Otherwise, there would be no difference between M1 and VLM1 since there are very few ways of selecting a model of 12 parameters that is different from M1.
519
D.Dalevi et al.
100 sequences of different lengths were randomly chosen from each bacterium’s test data and classified to the different profiles. The procedure was repeated for sequences with lengths of 35, 60, 100, 200, 400, 1000, 1500 and 2000. Figure 1 displays the results for different orders and different models. The improvement of using words longer than five (Fig. 1e, k ¼ 4) is insignificant but in these simulations there is an improvement using words longer than the popular tetramers (e.g. Pride et al., 2003; Teeling et al., 2004; Dufraigne et al., 2005). Figure 1f shows the result for the bigger dataset of 157 taxa for k ¼ 4. The sequences need to be twice as long to obtain approximately the same accuracy as with the lower number of taxa, but the signal is still well defined. The major improvement is observed when going from the naı¨ve model to a fixed order Markov model (Fig. 1). There is not a single case where the naı¨ve performs better than either of the Markov models. This is more pronounced for the bigger dataset (Fig. 1f). Further, the VLMk shows a slight improvement compared with the Mk for all k. The improvement in favor of VLMk over Mk is consistent for the different runs but it is not a large improvement and it may be hard to argue for using VLMk only based on this result. However, what is important with the VLMk is that more or less any number of parameters can be used and not just those identified in Table 1. There are models in between that can fit almost any number of parameters. Figure 1 indicates that a proportion of the sequences in bacteria is not species-specific in genomic signatures. This fraction contains candidates of horizontally transferred genes. These candidates must be further processed in a second step where biological constraints are considered, see e.g. Nakamura et al. (2004). This is true for all classifiers of genomic signatures since there are atypical genes that are not HGTs.
3.2
The number of data in training
A very important issue is to know when to use a certain model: How many parameters can we estimate accurately when training on a certain number of nucleotides? Deriving an exact formula is not trivial owing to the overlapping nature of the words. Reinert et al. (2000) discuss the issue and present various approximations using the Law of Large Numbers and a Poisson distribution when the probability of a word is low. However, one can do a simple heuristic estimate as follows: Assume data are generated under a multinomial model where each nucleotide occurs with probability pa where a 2 G ¼ {A, C, G, T} (this corresponds to an M0 model). Under this model the standard deviation is given by sa ¼ p ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pa ð1 pa Þ=N where N is the number of nucleotides. By fixing the standard deviation we can derive an N for this independent model. This means that 300 nt are required for having sa 0.025 when all nucleotides have the same probability. Extending this to counts of all dimers that start in nucleotide b, approximately 300 counts of ba will be required for a similar accuracy in pba for each a 2 G. Since jGj ¼ 4 this results in N > 300 · 4l1 for an arbitrary l-mer. In the variable model, with a further assumption that the counts are more or less balanced in the tree, we need to keep the number of parameters less than N/100. The rules are of course rough approximations since the words are over-lapping, however, using simulations we can confirm that they serve as approximations at a specified accuracy. A standard statistical approach to these kinds of question is via parametric bootstrapping (Efron, 1985). The technique relies on the
520
Table 2. Listing the results of the parametric bootstrap
N[nt]
e (M0)
e ( M1)
e (M2)
e (M3)
300 1200 4800 19200
4.5 5.4 6.5 8.4
4.4 5.4 6.9 8.4
4.5 5.3 7.2 9.6
4.4 5.4 7.0 9.7
± ± ± ±
0.18 0.85 1.9 3.9
± ± ± ±
0.18 0.69 1.5 2.7
± ± ± ±
0.13 0.60 1.9 3.6
± ± ± ±
0.22 0.88 2.1 4.0
e (Mk) represents the error when data are simulated under an Mk model and are expressed in units of 102. The rows represent the estimated model of order k ¼ 0, 1, 2, 3, meaning that 300 nt resulted in the error e(Mk) for a model of k ¼ 0, 1200 nt for k ¼ 1, 4800 nt for k ¼ 2 and 19200 nt for k ¼ 3.
assumption that the error measured between the true distribution and the observed sample is approximately the same as between a distribution based on the sample and new samples derived from this. Selecting 10 different bacteria we ran 1000 bootstrap replications on sequences of length N and calculated the error between the sample probabilities and the re-sampled probabilities. The result, giving estimates of the data needed for a given accuracy in various models is summarized in Table 2. Observe that the error increases when dependencies get longer.
3.3
False negatives I: the IncP-b dataset
Plasmids are important in the study of HGT since they are often the carriers of genes between bacteria. The members of the IncP-b group are known to be promiscuous and often isolated from Pseudomonas species (e.g. Adamczyk and Jagura-Burdzy, 2003). Since plasmids may exist in different hosts they are affected by different selective forces. However, a plasmid group (like the IncP) often has a common set of conserved backbone genes that are essential for survival (i.e. maintenance, replication, transfer and stability). It is well known that this backbone is enriched in certain restriction sites and oligomers (such as GGCC and GCCG, Wilkins et al., 1996). The PST of VLMMs can be used for visualizing such dependencies in data. The algorithm proposed in Section 2.5 was used for this purpose with a constraint of 60 parameters. Many of the pB10 specific oligomers show up as long branches in the tree, e.g. the GGCC patterns (Fig. 2). It is interesting to test if the classifiers can identify the backbone structure as plasmid-specific and by that capture the plasmid identity. We assume that these genes have not been acquired by HGT and define the error rate e as the probability of classifying a backbone gene to a non-IncP-b genome. This should be seen as a test of integrity that can be applied to any classifier of HGT. For the evaluation of the VLMk we ran the backbone genes when removing the genes to classify from the profiles, one-by-one, from all of the six IncP-b. This is done to remove the signal of the gene itself. Using a VLMk with 768 parameters, corresponding to pentamers, we manage to correctly classify all but two genes, these being kleB and traJ (Table 3). This results in an error e 5%.
3.4
False negatives II: Escherichia coli and Salmonella typhi
The gene order between pairwise closely related bacteria holds useful information for studying evolutionary processes (e.g. Dalevi et al., 2002). Positional conserved genes have a preserved gene-order meaning that their corresponding permutation is not
Bayesian classifiers for detecting HGT
Supplementary Material). This corresponds to an error e 5%. This indicates that a proportion of the atypical genes in Figure 1f will be naturally deviating in genomic signatures and hence not because of HGT.
3.5
Simulated data
Molecular data are hard to model. None of the above models is a ‘true’ representation of the mechanisms in any bacterium. However, simulations can be used to study the effect on classification of certain statistical properties of a bacterium (B) in a controlled manner. To produce data reflecting the forces that capture some of the important rules, we specified three characteristics the simulated data should have. These are listed below together with how they are modeled. (1) Codon usage: Simulated using a fixed order Markov model, k ¼ 2, trained on real data (coding regions only) in bacterium (B). (2) Longer oligomers: These are over-represented words in a bacterium and inserted at a site with probability poligo > 0. Fig. 2. A PST of the backbone genes in plasmid pB10 obtained after pruning with locking of 60 parameters. The PST captures important features of the underlying data. The patterns within the square brackets all occur with high counts, however, not necessary the highest of them all. The backbone of IncP-b plasmids are known to be enriched in the GGCC oligomer and this dependency is captured by the PST. The numbers within the brackets are next-symbol counts for A, C, G, T followed by the total number of nextsymbols for the context in []. T denotes terminal node (i.e. leaf). Table 3. Percentage of highly conserved IncP-b backbone genes that correctly classify to this group of plasmids (R751, pB4, pB10, pADP-1, pUO1, pJP4)
Gene(s)
VLM5
din, incC, kfrA klcA,B kleA-G korA-C ssb traC-traO trbA-P trfA.A1,A2
100 100 80 100 100 97 100 100
Comments
kleB genes matches AT
traJ genes matches CC
The comments describe cases where there are matches outside the IncP-b group. Abbreviations: AT ¼ Agrobacterium tumefaciens, CC ¼ Caulobacter cresentus.
disrupted other than by possible inversion of the whole segment. These genes are very unlikely to result from HGT, at least not after the two species diverged. If introduced earlier they would have been exposed to cellular forces, resulting in amelioration, over a long time and hence would have a composition similar to the host. Koski et al. (2001) identified a list consisting of 2951 positionally conserved genes between E.coli K-12 and S.typhi. A list of these genes was obtained (G.B. Golding, personal communication). When running all genes >1000 nt with the VLMk classifier of 768 parameters, in the dataset of 157 bacteria, the classifier identifies 95% of the genes to be of E.coli origin (see
(3) Random noise: Transforming a site in the sequence with probability pnoise > 0. using a zero order Markov model reflecting the composition of the bacterium (B). Using the schema we constructed a set of 15 artificial genomes of a million base pairs. The three classifiers were then tested on the data. The VLMk outperformed the others for all k and for all lengths of data except for longer test sequences. Having 2000 bp, all classifiers were 100% correct. The results are shown in the Supplementary Material.
3.6
Assigning confidence: when the group is not in the sample
The problem with all classifiers of the types we present is that there will always be a most likely candidate, an answer. However, what if the organism is not in the sample? This is a very likely scenario since there are millions of bacterial species and only a tiny fraction is available in databases. The probability that the correct organism is in the sample is hence quite small and there is a need for methods of rejecting hypotheses of HGT. For a target sequence s, suppose the maximum-likelihood genome pinpointed by our classifier is G. We wish to test the following null-hypothesis, (1) H0: The unknown sequence s has been generated from source G We adopt the following approach inspired by the technique of parametric bootstrapping (Efron, 1985): Calculate the likelihood of sequence L(s) of s under H0. Generate n samples of the same length as s from G at random. For each sample sequence si calculate L(si). Count the number of L(si) that are above L(s). If this value is above a, say 0.05, we reject H0 at the level a ¼ 0.05. (Note that the same method can be applied for whatever model— fixed or variable order—we use and will be reflected in the way the likelihood is computed.) Table 4 shows P-values that result for some of the backbone genes of the R751 backbone-genes against various
521
D.Dalevi et al.
Table 4. P-values for the 12 first [according to annotation] backbone genes of R751 against different profiles using an M4 model
Gene
R751
BS
CT
EC
PA
TM
kfrA korA korB incC1 incC2 kleA kleE korC klcA trfA2 ssb trbA
no no no no no no no no no no no no
no
no
no
no
no no no no no
¼ strongly reject at the 0.025 level, = reject at the 0.05 level, ¼ weak reject at the 0.1 level. no ¼ could not reject null-hypothesis at any of the above levels. Abbreviations: BS ¼ Bacillus subtilis, CT ¼ Chlamydia trachomatis, EC ¼ E.coli K12, PA ¼ Pseudomonas aeruginosa, TM ¼ Thermotoga maritima The number of samples n ¼ 500.
profiles. As a basic check, we note that the test does not reject any of the genes from the same profile. Many of the distant bacteria get clear rejections. However, it is interesting to observe that for especially Pseudomonas aeruginosa, which is a natural host for R751, and similar in genomic signatures, not so many rejections are made.
4
CONCLUSIONS
In this report, we expand previous work (Karlin and Burge, 1995; Dufraigne et al., 2005; Sandberg et al., 2003; Wang et al., 2005) by proposing the framework of fixed and variable length Markov chains to capture dependencies in genomic signatures. The contribution is a novel classifier with high flexibility, accuracy and mathematical soundness. We proposed a novel method for locking the model to a certain number of parameters so that it can naturally adjust to the size of the data. We showed that the integrity of the classifier is high in terms of the number of false negatives and proposed a novel method for rejecting false candidates of HGT.
ACKNOWLEDGEMENTS The authors wish to thank Marianne Ma˚nsson and Olle Nerman for useful discussions and Hsien Toh for critically reading the manuscript. This work was supported by the Swedish National Research School in Genomics and Bioinformatics, funded by the Swedish Foundation for Strategic Research (SSF). Conflict of Interest: none declared.
REFERENCES Adamczyk,M. and Jagura-Burdzy,G. (2003) Spread and survival of promiscuous IncP-1 plasmids. Acta Biochim. Pol., 50, 425–453. Borodovsky,M. and McIninch,J. (1993) Recognition of genes in DNA sequence with ambiguities. Biosystems, 30, 161–171.
522
Bu¨hlmann,P. and Wyner,A. (1999) Variable length Markov chains. Ann. Statist., 27, 480–513. Burge,C. et al. (1992) Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl Acad. Sci. USA, 89, 1358–1362. Dalevi,D. et al. (2002) Measuring genome divergence in bacteria: a case study using chlamydian. J. Mol. Evol., 55, 24–36. Doolittle,W.F. (1999) Phylogenetic classification and the universal tree. Science, 284, 2124–2129. Dufraigne,C. et al. (2005) Detection and characterization of horizontal transfers in prokaryotes genomic signature. Nucleic Acids Res., 33, e6. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge. Efron,B. (1985) Bootstrap confidence intervals for a class of parametric problems. Biometrika, 72, 45–58. Ellrott,K. et al. (2002) Identifying transcription factor binding sites through Markov chain. Bioinformatics, 18 (Suppl. 2), s100–s109. Forsdyke,D.R. and Mortimer,J.R. (2000) Chargaff’s legacy. Gene, 261, 127–137. Hallet,H. and Lagergren,J. (2000) New Algorithms for the Duplication-Loss Model. Proceedings of the Research on Computational Molecular Biology. ACM Press, New York, pp. 138–146. Hooper,S.D. and Berg,O.G. (2002) Detection of genes with atypical nucleotide sequence in microbial genomes. J. Mol. Evol., 54, 365–375. Karlin,S. and Burge,C. (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet., 11, 283–290. Koski,L.B. et al. (2001) Codon bias and base composition are poor indicators of horizontally transferred genes. Mol. Biol. Evol., 18, 404–412. Kroll,J.S. et al. (1998) Natural genetic exchange between Haemophilus and Neisseria: intergeneric of chromosomal genes between major human pathogens. Proc. Natl Acad. Sci. USA, 95, 12381–12385. Lawrence,J.G. and Ochman,H. (1997) Amelioration of bacterial genomes: rates of change and exchange. J. Mol. Evol., 44, 383–397. Lee,S.J. et al. (2004) Genomic conflict settled in favour of the species rather than the gene at GC percentage values. Appl. Bioinformatics, 3, 219–228. Ma¨chler,M. and Bu¨hlmann,P. (2004) Variable Length Markov chains: methodology, computing, and software. J. Comp. Graph. Stat., 13, 435–455. Muto,A. and Osawa,S. (1987) The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl Acad. Sci. USA, 84, 166–169. Nakamura,Y. et al. (2004) Biased biological functions of horizontally transferred genes in genomes [Erratum (2004) Nat. Genet., 36 1126.]. Nat. Genet., 36, 760–766. Pride,D.T. (2003) Evolutionary implications of microbial genome tetranucleotide frequency. Genome Res., 13, 145–158. Reinert,G. et al. (2000) Probabilistic and statistical properties of words: an overview. J. Comput. Biol., 7, 1–46. Ron,D. et al. (1996) The power of amnesia: learning probabilistic automata with variable memory length. Mach. Learn., 25, 117–149. Salzberg,S.L. et al. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res., 26, 544–548. Sandberg,R. et al. (2001) Capturing whole-genome characteristics in short sequences using a naive classifier. Genome Res., 11, 1404–1409. Sandberg,R. et al. (2003) Quantifying the species-specificity in genomic signatures, synonymous codon choice, amino acid usage and G+C content. Gene, 311, 35–42. Scherer,S. et al. (1994) Atypical regions in large genomic DNA sequences. Proc. Natl Acad. Sci. USA, 91, 7134–7138. Sharp,P.M. and Matassi,G. (1994) Codon usage and genome evolution. Curr. Opin. Genet. Dev., 4, 851–860. Sicheritz-Ponten,T. and Andersson,S.G. (2001) A phylogenomic approach to microbial evolution. Nucleic Acids Res., 29, 545–552. Teeling,H. et al. (2004) Application of tetranucleotide frequencies for the assignment of genomic. Environ. Microbiol., 6, 938–947. Wang,Y. et al. (2005) The spectrum of genomic signatures: from dinucleotides to chaos game. Gene, 346, 173–185. Wilkins,B.M. et al. (1996) Distribution of restriction enzyme recognition sequences on broad host range plasmid RP4: molecular and evolutionary implications. J. Mol. Biol., 258, 447–456. Woese,C.R. (1987) Bacterial evolution. Microbiol. Rev., 51, 221–271. Zhao,X., Huang,H. and Speed,T. (2004) Finding short DNA motifs using permuted Markov models. Proceedings of the Research on Computational molecular Biology. ACM Press, New York, pp. 68–75.