Reconstruction of highly heterogeneous gene-content evolution ...

Report 1 Downloads 62 Views
BIOINFORMATICS

Vol. 23 ISMB/ECCB 2007, pages i230–i239 doi:10.1093/bioinformatics/btm165

Reconstruction of highly heterogeneous gene-content evolution across the three domains of life Wataru Iwasaki* and Toshihisa Takagi Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8568, Japan

ABSTRACT Motivation: Reconstruction of gene-content evolutionary history is fundamental in studying the evolution of genomes and biological systems. To reconstruct plausible evolutionary history, rates of gene gain/loss should be estimated by considering the high level of heterogeneity: e.g. genome duplication and parasitization, respectively, result in high rates of gene gain and loss. Gene-content evolution reconstruction methods that consider this heterogeneity and that are both effective in estimating the rates of gene gain and loss and sufficiently efficient to analyze abundant genomic data had not been developed. Results: An effective and efficient method for reconstructing heterogeneous gene-content evolution was developed. This method comprises analytically integrable modeling of gene-content evolution, analytical formulation of expectation-maximization and efficient calculation of marginal likelihood using an inside-outsidelike algorithm. Simulation tests on the scale of hundreds of genomes showed that both the gene gain/loss rates and evolutionary history were effectively estimated within a few days of computational time. Subsequently, this algorithm was applied to an actual data set of nearly 200 genomes to reconstruct the heterogeneous gene-content evolution across the three domains of life. The reconstructed history, which contained several features consistent with biological observations, showed that the trends of gene-content evolution were not only drastically different between prokaryotes and eukaryotes, but were highly variable within each form of life. The results suggest that heterogeneity should be considered in studies of the evolution of gene content, genomes and biological systems. Availability: An R script that implements the algorithm is available upon request. Contact: [email protected]

1

INTRODUCTION

Genome projects have determined the sequence of more than 500 genomes and more than 1800 projects are still ongoing, reporting more than one genome sequence per week on average. The abundant genome sequences provide us with opportunities to study genome evolution at the universal-tree scale. Since the most fundamental information representing the genome sequences is gene content (the set of genes that each organism possesses), the reconstruction of gene-content evolutionary history is essential for studying the evolution of genomes, *To whom correspondence should be addressed.

as well as the evolution of biological systems encoded in these genomes, such as metabolic pathways and signaling cascades. A serious problem in gene-content evolution reconstruction is that the rates of gene gain and loss are not known in advance, in contrast to the well-established transition matrices available for nucleotide and amino acid evolution. To make matters more difficult, biological observations suggest that the rates of gene gain and loss should not be the same across a phylogenetic tree. For example, parasitization frequently results in massive gene loss (Nakabachi et al., 2006) and genome duplication results in massive gene gain (Dehal and Boore, 2005). Thus, the heterogeneity that exists in the rates of gene-content evolution must be considered. Hereafter, we use the term heterogeneity to represent the non-uniformity of gene gain/loss rates among different edges of a phylogenetic tree. To date, various methods for reconstructing gene-content evolution have been proposed (Csu¡¡ro¨s and Miklo´s, 2006; Hahn et al., 2005; Kunin and Ouzounis, 2003; Mirkin et al., 2003; Snel et al., 2002), in parallel with elaborate studies on the mathematical aspects of gene-content evolution (Dutih et al., 2004; Gu and Zhang, 2004; Huson and Steel, 2004; Karev et al., 2002, 2003, 2004). In these methods, the absence/presence or the number of genes are formulated as states of each ortholog group, and gene gain and loss are treated as transitions among these states, so that the maximum-parsimony or the maximumlikelihood estimation of ancestral states is applied to the reconstruction of the gene-content evolution. The orthologbased approach is selected because it is practically impossible to make reliable molecular phylogenetic trees for all ortholog groups, both because of computational costs and for statistical reasons: multiple alignments of many sequences often result in small numbers of unambiguously aligned positions that are insufficient to give significant trees. To the best of our knowledge, methods for reconstructing gene-content evolution that account for heterogeneity, effectively estimate the evolutionary rates, and are applicable to hundreds of genomes have not been developed. The pioneering maximum-parsimony-based methods (Kunin and Ouzounis, 2003; Mirkin et al., 2003; Snel et al., 2002) not only assume homogeneous evolution, but also require arbitrary parameters or criteria to estimate evolutionary history. Although the subsequent maximum-likelihood-based method (Hahn et al., 2005) introduced the concept of time, this method also assumes homogeneity. Recently, a highly sophisticated three-parameter model, which can encompass heterogeneity, has been proposed (Csu¡¡ro¨s and Miklo´s, 2006). The elaborate mathematics of the model enables efficient calculation of likelihood if the

ß 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

W.Iwasaki and T.Takagi

heterogeneous parameters are obtained in advance. However, partly due to the mathematical complexity of the model, an efficient and effective algorithm for parameter estimation has not been proposed. In the present article, we present a novel method for reconstructing gene-content evolution, which is the first method to allow effective and efficient reconstruction while taking heterogeneity into account. Simulation tests were used to demonstrate the effectiveness and efficiency of the method, which was subsequently applied to the data from about 200 genomes. The reconstructed history shows that heterogeneity occurs widely across the three domains of life and should be considered in studies on the evolution of gene content, genomes and biological systems.

Let Mn ðtÞ be the 4  4 transition probability matrix, where the (i, j)-th element MðnÞ ij ðtÞ represents the transition probability from the state i to j in time t along Edgen . Then, Mn ðtÞ follows a differential equation: dMn ðtÞ ¼ Mn ðtÞRn , dt

ð1Þ

where 2

n 6 n Rn ¼ 6 4 0 0

0 n n  n n

n n  n n 0

3 0 0 7 7: n 5 n

This equation has a solution of Mn ðtÞ ¼ expðRn tÞ, where expðÞ is the matrix exponential. Since the above definition enables the diagonalization of Rn , Mn ðtÞ has an analytical solution of Mn ðtÞ ¼ Un Vn ðtÞWn ,

2 2.1

METHODS Input data and problem definition

Suppose that genome sequences from N organisms (Org1 , . . . , OrgN ) are given. The proposed method requires two types of data precomputed from these N genomes, i.e. ortholog groups and phylogenetic topology (note that the term ortholog is inappropriate in the strict sense; however, we use the term by convention, to describe genes with similar functions). For the first input, existing well-studied methods are used for creating ortholog groups, such as COG (Tatusov et al., 1997) and MBGD (Uchiyama, 2003). Let K be the number of ortholog groups calculated by some of those methods for genes in the N genomes. Then, the gene content of Orgn can be represented by a K-length vector ðnÞ > ðnÞ gn ¼ ½gðnÞ 1 , . . . , gK  , where gk is the number of genes belonging to the k-th ortholog group on the genome of Orgn . In the present study, fourth or more gene copies are not counted, so that gkðnÞ ¼ 0, 1, 2 or 3 for 8n, k. This is because high-copy number genes often have special characteristics (e.g. transposons) that would cause deviation from the general evolutionary mode of the whole gene content. The second input is phylogenetic topology, which is represented by a binary-rooted tree T that has N leaves and N  1 internal nodes. These internal nodes correspond to the ancestral species, denoted by OrgNþ1 , . . . , Org2N1 , so that paðnÞ > n if OrgpaðnÞ is the parent of Orgn , which means that the root of T is Org2N1 . To estimate T, both molecular phylogenetic trees for concatenated highly conserved genes, such as ribosomal RNAs and proteins, and so-called genome trees calculated by sophisticated methods (Dutilh et al., 2004) can be used. Since the main goal is not the estimation of phylogenetic relationships, T is supposed to be given based on existing methods. With these input data, the reconstruction of the gene-content evolution problem is defined as the estimation of likely H  fgn jn ¼ N þ 1, . . . , 2N  1g, given T and V  fgn jn ¼ 1, . . . , Ng.

2.2

Gene-content evolution model

Gene-content evolution is formulated as a finite-state, continuous-time Markov process. In the process, gene gain/losses are represented as transitions among the four states. Consider an ortholog group that evolves along an edge Edgen i.e. from OrgpaðnÞ to Orgn . A twoparameter birth-and-death model is utilized in which gene gain and gene loss occur with probabilities n t and n t, respectively, in a short time t. This means that, e.g. a high value for n indicates massive gene gain on Edgen by gene genesis, horizontal gene transfers, gene duplications and/or other mechanisms. The values n and n are different among the edges, which implies heterogeneity of the genecontent evolution.

ð2Þ

ð3Þ

where by omitting subscripts n of n and n for clarity, 2 3 2 2 1 2 =2  =2 ffi  =2 ffi p ffiffiffiffiffiffiffiffiffiffi p ffiffiffiffiffiffiffiffiffiffi 6 1 = ð1 þ 2=Þ= ð1  2=Þ= 7 7, ffi pffiffiffiffiffiffiffiffiffiffiffi Un ¼ 6 4 1 = pffiffiffiffiffiffiffiffiffiffi 2=  = 2=  = 5 1 1 1 1 2

1 60 6 Vn ðtÞ ¼ 4 0 0

0 eðþÞt 0 0

0 0pffiffiffiffiffiffi

eðþþ 0

2Þt

3 0 7 0 7, 0 pffiffiffiffiffiffi 5

eðþ

ð4Þ

ð5Þ

2Þt

Wn ¼ =ð4ð2 þ 2 ÞÞ: 2 42 =ð þ Þ 42 =ð þ Þ 6 2ð2 þ 2 Þ=ð þ Þ 2ð2 þ 2 Þ=ð þ Þ 6 pffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffi 6 4 ð þ   2Þ ð  2   2Þ pffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffi 2 ð þ  þ 2Þ ð   þ  2Þ 3 3 4 4 =ðð þ ÞÞ 4 =ð þ Þ 2 2 2 2 2ð þ  Þ=ð þ Þ 2ð þ  Þ=ð þ Þ 7 7 pffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffi 7: ð þ   2Þ 5   2   2 pffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffi ð þ  þ 2Þ   2 þ  2

ð6Þ

Ortholog groups were assumed to evolve independently, as is the case in the typical point-mutation models used to analyze sequence evolution. In addition, to focus on the general evolutionary mode of the whole gene content and to avoid overfitting of the parameters, we hypothesized that along Edgen , any ortholog group follows the same parameters n and n; however, if the ortholog groups are classified, different parameters can easily be assigned to each class. Hereafter, t is defined as a relative time scale so that the length of any edge is 1. That is, the overall transition matrix along Edgen is Mn ð1Þ (abbreviated as Mn ) and n and n are the gene gain and loss rates, respectively, per proportion of the edge. This definition was used because of the mathematical redundancy and the ambiguity associated with estimating evolutionary time. Alternatively, if given the evolutionary time tn for Edgen , dividing n and n by tn simply gives the evolutionary rate in time units.

2.3

Maximum-likelihood estimation of ancestral states using an expectation-maximization algorithm

If the model parameters  (i.e. 1 , . . . , 2N2 , 1 , . . . , 2N2 and the prior probability distribution at the last common ancestor A0 , A1 , A2 , A3 ) are given in advance, the most likely evolutionary history of gene

i231

W.Iwasaki and T.Takagi

content across the entire tree can be calculated. However, these parameters are not known in practice, so they must be estimated from the observed data V. In the present article, an efficient and effective algorithm based on the expectation-maximization algorithm (Dempster et al., 1977) is presented for estimating these parameters. For the detailed formulation, we referred to the related method for training amino acid substitution matrices while considering the structural context of proteins (Holmes and Rubin, 2002). The expectation-maximization algorithm iteratively estimates more likely parameters with tentative parameters (denoted by primes hereafter) by maximizing X Qðj0 Þ ¼ PðHjV, 0 Þ log PðV, HjÞ: ð7Þ

ðnÞ0 where by defining Cixyj to be

Qðj 0 Þ ¼

3 X

a0i log Ai þ

i¼0

2N2 3 X 3 XX

mijðnÞ0 ZðnÞ ij ,

ðnÞ0 Cixyj Z ¼

¼

1

3 X

t¼0

p¼0

3 X

Z ¼

1

t¼0 x¼0 y¼0

e

ð12Þ

where ð13Þ

Here, logðn dtÞ and logðn dtÞ are ðlog n þ log dtÞ and ðlog n þ log dtÞ, respectively. The infinite negative log dt means that any particular history has the probability of zero. Since this term can be ignored for the comparison of parameters, we ignore log dt hereinafter. In addition, the Taylor expansion of logð1  zdtÞ gives zdt. Thus, we get

i232

MðnÞ0 ij

,

3 X

! ðnÞ0 ðnÞ0 Uyq Vqq ð1  tÞWðnÞ0 dt qj

ð19Þ

0

0

1eðn þn Þ 0n þ0n

1

2

0

ðn þn Þ

e

0

eðn þn Þ pffiffiffiffiffiffiffi 0 0  2n n 1e pffiffiffiffiffiffiffi 0 0 0 0 2n n

1þe

ð20Þ

2n n

3 X

a0i log Ai þ

2N2 3 X 3 XX

mðnÞ0 ij

n¼1 i¼0 j¼0

MðnÞ0 ij

ðnÞ0

0 0 þBðnÞ0 i22j ðtÞð1  ðn þ n ÞdtÞ logð1  ðn þ n ÞdtÞ . 0 MðnÞ0 þBðnÞ0 ij , i33j ðtÞð1  n dtÞ logð1  n dtÞ

ðnÞ0 0 ðnÞ0 ðnÞ0 0 ðnÞ0 ij n log n þ ij n log n  ij n  ij n

ð18Þ

ðnÞ0

!

ð21Þ

ðnÞ0

ðij 0n log n þ ijðnÞ0 0n log n  ij n  ij n Þ :

0 0 þBðnÞ0 i11j ðtÞð1  ðn þ n ÞdtÞ logð1  ðn þ n ÞdtÞ

ZijðnÞ ¼

þ

ðnÞ0 ðnÞ0 UðnÞ0 yq Wqj Dpq ,

i¼0

t¼0

ðnÞ0 ðnÞ0 Bixyj ðtÞ ¼ Mix ðtÞMðnÞ0 yj ð1  tÞ:

þ

ðnÞ0 ðnÞ0 ðnÞ0 0 Therefore, we can directly calculate ðnÞ0 ij , ij , ij and ij with n and 0 n , utilizing equations (4), (6) and (15–20). Finally, we have the Q-function in the following form:

Qðj0 Þ ¼

ðnÞ0 ðnÞ0 0 þðBi10j ðtÞ þ BðnÞ0 i21j ðtÞ þ Bi32j ðtÞÞn dt logðn dtÞ 0 þBðnÞ0 i00j ðtÞð1  n dtÞ logð1  n dtÞ

ðnÞ0 Ci22j , ðnÞ0 Ci33j ,

pffiffiffiffiffiffiffi 20n 0n pffiffiffiffiffiffiffi ð0n þ0n Þ 20n 0n n þn  2n n e 3 pffiffiffiffiffiffiffi pffiffiffiffiffiffiffi ð0n þ0n þ 20n 0n Þ ð0n þ0n  20n 0n Þ 1e 1e pffiffiffiffiffiffiffiffi ffi pffiffiffiffiffiffiffiffi ffi 7 0 0 0 0 0 0 0 0 n þn þ 2n n n þn  2n n 7 pffiffiffiffiffiffiffi pffiffiffiffiffiffiffi 7  20n 0n 20n 0n 7 1e pffiffiffiffiffiffiffiffi 1þe pffiffiffiffiffiffiffi ffi 7 0 0 ð0n þ0n Þ 20n 0n 7 eðn þn Þ 20n 0n e 7: p ffiffiffiffiffiffiffi p ffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffi ffi 0 0 0 0 7 2  2 0 0 n n e n n 7 ðn þn þ 20n 0n Þ e pffiffiffiffiffiffiffiffiffi 7 e 0 0 eðn þn Þ 2 20n 0n 7 pffiffiffiffiffiffiffi pffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffi 7 0 0 0 0 5 2n n  2n n ð0n þ0n  20n 0n Þ e e p ffiffiffiffiffiffiffiffi ffi e 0 0 ðn þn Þ 0 0

ð11Þ This equation is transformed as follows: Z1  ðnÞ0 ðnÞ0 0 ðBi01j ðtÞ þ BðnÞ0 ZijðnÞ ¼ i12j ðtÞ þ Bi23j ðtÞÞn dt logðn dtÞ

þ

6 6 1eð0n þ0n Þ 6 0n þ0n 6 pffiffiffiffiffiffiffi 6 D0n ¼ 6 1eð0n þ0n þ 20n 0n Þ ffi 6 0 0 pffiffiffiffiffiffiffiffi 6 n þn þ 20n 0n 6 p ffiffiffiffiffiffiffi 4 ð0n þ0n  20n 0n Þ 1e pffiffiffiffiffiffiffiffi ffi 0 0 0 0

ð10Þ

3 X 3   X t dt 1t dt Pði!x!y!jji, j, 0n , 0n Þ log Pðx!yjn , n Þ :

3 X

2

by changing the order of the integral and the summation, we derive ZðnÞ ij

þ

ðnÞ0 Ci11j ðnÞ0 Ci22j

where

Since the logarithm of the probability for the continuous Markov process is given by an integral of

t¼0

ð17Þ

¼

ðnÞ0 Ci00j ðnÞ0 Ci11j

q¼0

ð8Þ

    log P HðnÞ ij jfrom t to tþdt n , n ,

ð16Þ

ðnÞ0 ij

q¼0

ðnÞ0 UðnÞ0 ip Wpx

HijðnÞ

1

ð15Þ

ðnÞ0 ðnÞ0 ðnÞ0 ðnÞ0 ij ¼ Ci10j þ Ci21j þ Ci32j ,

ðnÞ0 ðnÞ0 Uip Vpp ðtÞWðnÞ0 px

p¼0

n¼1 i¼0 j¼0

Z

BðnÞ0 ixyj ðtÞdt,

then,

where by letting HðnÞ ij be a particular evolutionary history that starts from the state i, evolves along Edgen , and ends with j, X ZðnÞ PðHijðnÞ ji, j, 0n , 0n Þ log PðHðnÞ ð9Þ ij ¼ ij jn , n Þ:

log PðHijðnÞ jn , n Þ ¼

t¼0

ðnÞ0 ðnÞ0 ðnÞ0 ðnÞ0 ij ¼ Ci01j þ Ci12j þ Ci23j ,

ijðnÞ0 ¼

H

This function gives the expectation value of log PðH, VjÞ with respect to H under 0 . Thus, if given mðnÞ0 and a0i , which are the expected ij numbers of ortholog groups of gkðpaðnÞÞ ¼ i ^ gkðnÞ ¼ j and gkð2N1Þ ¼ i, respectively, we can write

R1

ð14Þ

Next, we solve  that maximize Qðj0 Þ, with the condition of P3 partial differentiation of the Lagrangian i¼0 Ai ¼ 1. By solving theP form FðÞ ¼ Qðj0 Þ þ ð1  3i¼0 Ai Þ, we get P P 0n 3i¼0 3j¼0 mijðnÞ0 ijðnÞ0 =MðnÞ0 ij , ð22Þ n ¼ P3 P3 ðnÞ0 ðnÞ0 ðnÞ0 i¼0 j¼0 mij ij =Mij P P 0n 3i¼0 3j¼0 mijðnÞ0 ijðnÞ0 =MðnÞ0 ij n ¼ P3 P3 , ð23Þ ðnÞ0 ðnÞ0 ðnÞ0 i¼0 j¼0 mij ij =Mij a0 Ai ¼ P3 i

i¼0

a0i

:

ð24Þ

Therefore, through iterative updates of the parameters following the above equations, the most likely parameters ^ can be estimated. To iterate the computation to convergence, we must efficiently calculate and a0i . The iteration can be carried out efficiently with the both mðnÞ0 ij time complexity of O(nk), using an algorithm similar to the insideoutside algorithm for stochastic context-free grammars (Lari and Young, 1990), which is a special example of the belief propagation algorithm (Pearl, 1986).

W.Iwasaki and T.Takagi

ðnÞ Let IðnÞ k ðiÞ be the conditional probability that gk is i, given the inside of Orgn (i.e. Orgn and its descendants). Then, we have the recursive formulae in the ascending order (n ¼ 1, . . . , 2N  1): 8 if n ¼ 1, . . . , N gðnÞ i > > k > > ! ! > < X 3 3 X ðrcðnÞÞ0 ðrcðnÞÞ ðlcðnÞÞ0 ðrcðnÞÞ ð25Þ ðiÞ ¼ , IðnÞ Mip Ik ðpÞ Miq Ik ðqÞ k > > p¼0 q¼0 > > > : if n ¼ N þ 1, . . . , 2N  1

where ij is the Kronecker delta and OrglcðnÞ and OrgrcðnÞ are the left and right children of Orgn , respectively. ðpaðnÞÞ Next, let OðnÞ is i, given k ðiÞ be the conditional probability that gk the outside of Orgn (i.e. T except for Orgn and its descendants). Again, we have the recursive formulae in the descending order (n ¼ 2N  2, . . . , 1): ! 8 3 X ðsiðnÞÞ0 ðsiðnÞÞ > > > M I ðpÞ Ai 0 > ip k > > > p¼0 > > > < if paðnÞ ¼ 2N  1 !, ! ð26Þ ðiÞ ¼ OðnÞ k 3 3 > X X > ðsiðnÞÞ0 ðsiðnÞÞ ðpaðnÞÞ ðpaðnÞÞ > > M I ðpÞ O ðqÞM > ip qi k k > > > p¼0 q¼0 > > : otherwise where OrgsiðnÞ is the sibling of Orgn . Finally, we have 0

0

mðnÞ ij ¼ a0i ¼

K OðnÞ ðiÞMðnÞ IðnÞ ð jÞ X ij k k , P3 ð2N1Þ ðpÞ k¼1 p¼0 Ap Ik K X k¼1

P3

Ai Ið2N1Þ ðiÞ k

p¼0

Ap Ikð2N1Þ ðpÞ

ð27Þ

:

ð28Þ

with the time complexity of Since IkðnÞ ðiÞ and OðnÞ k ðiÞ can be calculated 0 0 O(n), the time complexity for mðnÞ ij ai and argmax Qðj Þ becomes O(nk). For the initial parameters, the uniform rates were adopted. Alternatively, any parameter set can be used if some previous knowledge is available. ^ we can calculate the Consequently, using the estimated parameters , most likely gene content of ancestral organisms: ^  arg max PðH, VjÞ ^ H

ð29Þ

H

¼ arg max H

K X

A^ gð2N1Þ k

k¼1

2N2 Y

! ^ ðnÞðpaðnÞÞ M gk

n¼1

gkðnÞ

:

ð30Þ

This calculation can also be performed with the time complexity of O(nk), using the dynamic programming that resembles the Viterbi

algorithm for the hidden Markov models (for the implementation, refer to the similar algorithm used to reconstruct ancestral biological sequences (Pupko et al., 2000)).

3

RESULTS AND DISCUSSION

3.1

Tests on simulation data sets

We first tested the effectiveness of the proposed algorithm on simulation data sets. Gene content that comprised 20 000 ortholog groups evolved along equilibrated binary phylogenetic trees, according to the Markov process. We tested a wide range of tree sizes with 32, 64, 128, 256, 512 and 1024 leaves. For the evolutionary processes, different gene gain/loss rates in the range of 0.1–0.0001 were assigned to different edges. This corresponds to 2–2000 gene gain/losses per edge, which appears to be biologically reasonable. The parameters were randomly determined so that common logarithms of the parameters were equally distributed between 4 and 1. This variability among the parameters should reflect the heterogeneity of the actual gene-content evolution, during which massive gene gains and losses sometimes occur, whereas the edges between close organisms usually accompany little genecontent evolution. Then, by using only the gene content at the leaves of the tree, we estimated the evolutionary history of the gene content and compared it with the ‘actual’ history. The algorithm successfully estimated the parameters with mean relative errors that ranged from 20 to 30%, with somewhat large values of the SDs (Table 1, the two columns farthest to the left) despite the 103 orders of the parameter distribution and the limitation on the observable data. In addition, since the parameters for older periods were expected to be more difficult to predict, the mean relative errors for each level of the tree were examined (Table 1). Remarkably, the estimations of the older edges showed accuracies comparable to those of the most recent ones, which means that the method also allows estimations of the gene gain/loss rates of older periods. A notable exception was the oldest edge, which was probably due to difficulties in predicting the states of the last common ancestor. These difficulties include the necessity of predicting the prior distribution and ambiguousness in specifying the position of the last common ancestor between its two children. However, considering again the 103 orders of the parameter distribution, the accuracy of the estimated

Table 1. Mean relative errors of estimated parameters Number of leaves

32 64 128 256 512 1024

All edges

0.23(0.87) 0.22(0.75) 0.22(0.52) 0.22(0.63) 0.22(0.46) 0.22(0.44)

Distance from leaves 1

2

3

4

5

6

7

8

9

10

0.20(0.31) 0.20(0.35) 0.20(0.34) 0.20(0.35) 0.20(0.35) 0.20(0.35)

0.22(0.48) 0.21(0.38) 0.23(0.44) 0.22(0.42) 0.23(0.47) 0.23(0.53)

0.22(0.41) 0.23(0.42) 0.22(0.41) 0.23(0.43) 0.23(0.45) 0.23(0.48)

0.20(0.28) 0.23(0.51) 0.23(0.46) 0.23(0.50) 0.23(0.49) 0.23(0.48)

0.99(4.34) 0.27(0.89) 0.21(0.33) 0.23(0.42) 0.22(0.47) 0.24(0.50)

— 1.15(4.95) 0.24(0.58) 0.22(0.36) 0.22(0.37) 0.23(0.44)

— — 1.04(3.74) 0.26(0.48) 0.21(0.35) 0.24(0.41)

— — — 1.79(7.71) 0.26(0.53) 0.26(0.45)

— — — — 1.26(4.52) 0.21(0.39)

— — — — — 0.93(2.87)

P Mean relative error is ( edges ðjestimated  true j=true Þ þ ðjestimated  true j=true ÞÞ=2jedgesj. Standard deviations are shown in parentheses. Each of the trees with 32, 64, 128 and 256 leaves was generated 100 times. Owing to computational time constraints, trees with 512 and 1024 leaves were generated 61 and 30 times, respectively.

i233

W.Iwasaki and T.Takagi

Percent correctly reconstructed

100 80 Proposed method 60 40 Maximum-parsimony

20 0 0

200

400

600

800

1000

1200

# Leaves Fig. 1. Percentage of correctly estimated gene-content evolutionary history. The proportions of the ortholog groups whose evolutionary history were perfectly reconstructed are shown. Error bars indicate SDs. Each of the trees with 32, 64, 128 and 256 leaves was generated 100 times. Owing to computational time constraints, trees with 512 and 1024 leaves were generated 61 and 30 times, respectively. For comparison, the results of the maximum-parsimony reconstruction are shown.

parameters would be sufficient to reconstruct most of the genecontent evolution, as described subsequently. The percentage of correctly reconstructed fractions of the gene-content evolutionary history was examined using the estimated parameter sets (Fig. 1). The proposed method reconstructed the complete evolutionary history of over 80% of the ortholog groups for the trees with 256 leaves. Moreover, this method succeeded in reconstructing the history of almost half of the ortholog groups for the tree with 1024 leaves. Notably, both results exceeded those obtained by the traditional maximum-parsimony method, which is also applicable to large data sets but assumes homogeneous evolution.

3.2

Reconstruction of heterogeneous gene-content evolution across the three domains of life

One of the major advantages of the proposed algorithm is its efficiency. Then, the algorithm was applied to universal treescale genomic data. The expanded COG ortholog table from the STRING database (von Mering et al., 2005), which contains 26 200 ortholog groups from 179 genomes covering the three domains of life, was utilized. For the phylogenetic topology, the highly resolved tree of life created from concatenated well-conserved genes (Ciccarelli et al., 2006) was used. With implementation in the R language on a Linux machine with a 3 GHz Intel Pentium 4 processor, parameter estimation and subsequent reconstruction of the evolutionary history occupied memory space of 300 MB and required 19.3 h of computation. The Akaike information criterion was 1:21  106 and smaller than that of the uniform rate model (1:45  106 ), which indicates that the present model better reflects the gene-content evolution. The reconstructed gene-content evolutionary history shows that heterogeneity is common throughout the universal tree of

i234

life (Fig. 2). In particular, eukaryotes and prokaryotes exhibit drastically different trends in terms of gene-content evolution: the former shows excessive gene gain whereas the latter shows superiority of gene loss. As a simple interpretation, these trends seem to reflect differences in general survival tactics between the two forms of life; i.e. prokaryotes with small genomes have evolved to proliferate rapidly, whereas eukaryotes have increased their genome sizes to acquire new gene sets to evolve various survival strategies (e.g. cell communication genes required for multicellularization). Not only are eukaryotes and prokaryotes characterized by different trends in gene-content evolution, significant variations are also observed within each form of life. The gene gain trend in eukaryotic genome evolution is prominent in the large-scale duplications. Genomic studies have revealed that multiple rounds of genome-scale duplications have occurred in the ancient species of vertebrates (Dehal and Boore, 2005, two rounds) and plants (Blanc et al., 2003, at least two rounds). Both events correspond to high rates of gene gain in the estimation (Fig. 2, circled a and b, respectively). On the other hand, the edge that leads to the last common ancestor of animals and fungi is remarkably short (Fig. 2, circled c), possibly because the diversification of animals, fungi and plants occurred within a short period of time (Douzery et al., 2004). In contrast to that of eukaryotes, prokaryotic evolution is characterized by the general trend of gene loss. Among the prokaryotes, -, - and -Proteobacteria are estimated to have experienced significant gene gains (Fig. 2, circled d-g). This observation is consistent with a large-scale molecular phylogenetic study suggesting that frequent horizontal gene transfers have increased the gene repertories of these three proteobacterial subclasses (Beiko et al., 2005). Another striking exception is the ancestor of the genus Streptomyces (Fig. 2, circled h). Streptomyces species, which produce natural antibiotics for medical treatments, are known for their large genome sizes, with more than 7500 ORFs (Bentley et al., 2002; Ikeda et al., 2003), which are larger than those of some eukaryotic species. In fact, a Streptomyces mutant species with an almost duplicated genome, such as seen in eukaryotic genomes, has been identified (Wenner et al., 2003). Thus, Streptomyces species are expected to be highly exceptional with respect to prokaryotic gene-content evolution. Note that one major difficulty with the present method is also illustrated in Fig. 2. The difficulty is that the estimated history must inherit errors in preparing input data. For instance, the massive gene loss toward Gallus gallus is likely due to gene annotation problems. Likewise, the present method cannot overcome errors in detecting similar sequences, splitting multi-domain genes or inferring phylogenetic trees. Since these problems have been rigorously examined by the experts on each technique, we do not discuss them later; however, these possibilities should be noted when evaluating estimated results of the present method.

3.3

An example of reconstructed evolutionary history

To show an example of reconstructed history, we examined the evolution of gene families that function in RNA-mediated gene silencing across the eukaryotic domain. This process is widely

W.Iwasaki and T.Takagi

Fig. 2. Reconstructed heterogeneous gene-content evolutionary history across the three domains of life. Edge length represents the numbers of gained or lost genes on each edge. Edge colors indicate the ratio between gain and loss: red edges denote active gene gains and blue edges represent gene losses. Higher taxonomic classifications are taken from NCBI Taxonomy. Circled letters are provided for the convenience of the reader.

i235

W.Iwasaki and T.Takagi

conserved among eukaryotes, whereby double-stranded RNA induces the inactivation of the cognate sequences (Zamore and Haley 2005). Genes assigned to the keyword RNA-mediated gene silencing in the UniProt database (Wu et al., 2006) were collected, and the reconstructed history of the sixteen ortholog groups that include these genes was investigated (Fig. 3). We used the phylogenetic tree by Ciccarelli et al. (Ciccarelli et al., 2006). Note, however, that the tree of eukaryotes is controversial, especially concerning the positions of nematodes (Aguinaldo et al., 1997; Blair et al., 2002; Halanych, 2004) and protists (Keeling et al., 2005). The reconstructed history illustrates some general features of the evolution of this biological process. As estimated here, the key components of the process, RNA-dependent RNAP and Argonaute, are believed to have already existed in the last common ancestor of eukaryotes (LCAeuk ) in defensive roles against transposons and viruses (Cerutti and Casas-Mollano, 2006). In parallel with later acquisitions of the accessory genes, as shown in Figure 3, each gene is believed to have multiplicated so as to subfunctionalize; thus, this machinery fulfills various roles, such as miRNA-mediated gene regulation, especially along the lineages of animals and plants (e.g. Yigit et al., 2006). In addition, this process has disappeared along the lineages of the Saccharomycetaceae and Plasmodium (Cerutti and Casas-Mollano, 2006; Nakayashiki et al., 2006), reflecting the loss of the key genes Dicer and Argonaute/Piwi (Fig. 3; light-orange and yellow-green circles respectively) in the process. Some differences were found between the history estimated by the proposed algorithm and the traditional maximumparsimony method (Fig. 3; superscripts). The most notable difference in the estimation by the maximum-parsimony method was the absence of the key components, RNAdependent RNAP and Argonaute, at LCAeuk . This difference arose because, by assuming the homogeneous evolution model, the maximum-parsimony estimation was strongly influenced by the gene content of Plasmodium falciparum, which has the distance of one edge to LCAeuk . Instead, the proposed algorithm considered the gene loss mode toward Plasmodium and estimated the presence of both genes. In fact, these two genes are believed to have been present at LCAeuk (Cerutti and Casas-Mollano, 2006), which demonstrates the advantage of the new method. Another notable difference is seen in the evolutionary history of the exonuclease Mut-7. The new method estimated the presence of this gene from LCAeuk to the last common ancestor of Coelomata, whereas the maximum-parsimony method could estimate independent acquisitions of this gene along several lineages. Of course, the correct scenario cannot be determined because the actual evolutionary history is unknown. However, since Mut-7 family genes are known to have significant similarities in terms of both sequence and function across the eukaryotic domain (Glazov et al., 2003), the common origin scenario carries some credibility. The apparent error in the estimation performed by the present method (as well as by the maximum-parsimony method) was the absence of Dicer at LCAeuk . This error would also be caused by the absence of this gene at Plasmodium. This genus is exceptional among the protists, since other protists have this gene and should have the

i236

RNA-mediated gene silencing system (Cerutti and CasasMallano, 2006), and this is exactly the reason that LCAeuk is believed to have possessed this gene. Therefore, the main reason for this error is the bias in the utilized genomic information. More genome sequences from the ongoing genome projects are needed to correct these errors.

4

CONCLUSION AND PERSPECTIVES

The proposed algorithm is the first gene-content evolution reconstruction method that accounts for heterogeneity with effective and efficient parameter estimations. The application of this method to an actual data set across the three domains of life showed that the modes of gene-content evolution are actually never uniform, reflecting the highly variable sizes (160 kb to 3 Gb) of the sequenced genomes. This result suggests that heterogeneity should be considered in studies of gene-content evolution. Since the rates and history of gene-content evolution estimated in the present study have a mathematical foundation with maximum likelihood, they should provide objective bases for studying the heterogeneous evolution of gene content across the universal tree of life. Four major future directions are envisioned based on the present study. The first direction is the refinement of the evolution model. We adopted a two-parameter model, which is simpler than a three-parameter model of duplication, loss and horizontal transfer (Csu¡¡ro¨s and Miklo´s, 2006), for one biological and two mathematical reasons. First, estimating the causes of gene birth (gene genesis, duplication, horizontal gene transfer or other processes) often requires detailed sequence information and individual analyses. Notably, new ortholog groups can emerge by duplication, in addition to gene genesis or horizontal gene transfer (e.g. different aminoacyltRNA synthetases that have significant sequence similarity are usually classified into different ortholog groups (Tatusov et al., 1997)). Hence, changing the gene gain/loss rates depending on current gene counts does not necessarily reflect actual evolution processes. Second, as Csu¡¡ro¨s and Miklo´s have pointed out, models with too many parameters lead to parameter overfitting. The heterogeneity consideration already requires several parameters, the number of which is proportional to the tree size. Thus, more complex models could result in a smaller than expected improvement in the estimation. Third, the simpler definition enables diagonalization of the transition rate matrix and effective computation of the algorithm. We anticipate extending the model to include additional parameters once the above obstacles have been overcome. The second direction is to utilize the information available on gene numbers greater than three. This information was ignored for the biological reason outlined in the Methods section. However, when concentrated analyses of the eukaryotic genecontent evolution are considered, in which gene multiplications have great significance, this type of extension may be preferable. In terms of mathematics, loosening this limitation is not a difficult task. For example, a 10  10 version of Rn can also be diagonalized analytically, so that M n ðtÞ has an analytical solution in much more complex equations, and a similar argument holds true for the expectation-maximization algorithm. Thus, the proposed algorithm can easily be modified

W.Iwasaki and T.Takagi

Fig. 3. Most likely evolutionary history of the gene families that function in RNA-mediated gene silencing across the eukaryotic domain. Superscript numbers denote possible alternative states estimated by at least one maximum-parsimony history that minimizes the number of transitions.

i237

W.Iwasaki and T.Takagi

to utilize highly multiplicated genes. In fact, as far as Rn is defined to enable its diagonalization, the new algorithm can utilize any evolution model to optimize the parameters and to estimate the most likely ancestral states. Moreover, using a numerical diagonalization approach, every evolution model, including the sophisticated duplication, loss and horizontal transfer model (Csu¡¡ro¨s and Miklo´s, 2006), can be utilized; however, computational efficiency would be impaired. These derivative methods should be tested and compared in individual cases, since the best method should depend on both input data and goals of analyses. The third direction is to apply the present method to the evolution of phenotypes. Reconstruction of ancestral states of phenotypic characters has a long history. In particular, some methods (Pagel, 1999; Schluter et al., 1997) also utilize maximum-likelihood-based estimation, but neither the edgespecific rates nor the ordered states. This is partly because much less data are available on the evolution of phenotypes than are available for gene content; however, by limiting the numbers of freely variable rates or focusing on specific phenotypes (e.g. ones that have many states or ones that have many related phenotypes whose evolutionary modes can be assumed to be the same), the present method could be applied to these estimations. The fourth and final direction is the most challenging, in that it involves the expansion of the evolution model to consider the interactions among genes. We hypothesized that different ortholog groups would evolve independently; however, this is apparently not the case. Thus, the interactions among different gene families in the course of heterogeneous gene-content evolution should be taken into account. To date, various mathematical analyses have been performed to clarify the evolution of biological systems by assuming the homogeneity of gene-content evolution, e.g. the preferential attachment model for scale-free, power-law networks (Barabasi and Albert, 1999). The directions for future research will deepen our understanding of the effects of heterogeneity on the evolution of intricately interacting gene sets, such as gene networks, metabolic pathways and biological modules.

ACKNOWLEDGEMENTS We are grateful to Drs Y. Yamamoto, S. Dohkan, T. Kato and three anonymous reviewers for valuable comments on the manuscript. This work was supported by KAKENHI (Grantin-Aid for Scientific Research) on Priority Areas ‘Systems Genomics’ from the Ministry of Education, Culture, Sports, Science and Technology of Japan. W. I. was supported by Research Fellowships from the Japan Society for the Promotion of Science for Young Scientists. Conflict of Interest: none declared.

REFERENCES Aguinaldo,A.M. et al. (1997) Evidence for a clade of nematodes, arthropods and other moulting animals. Nature, 387, 489–493. Barabasi,A.L. and Albert,R. (1999) Emergence of scaling in random networks. Science, 286, 509–512. Beiko,R.G. et al. (2005) Highways of gene sharing in prokaryotes. Proc. Natl. Acad. Sci. U. S. A., 102, 14332–14337.

i238

Bentley,S.D. et al. (2002) Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2) Nature, 417, 141–147. Blair,J.E. et al. (2002) The evolutionary position of nematodes. BMC Evol. Biol., 2, 7. Blanc,G. et al. (2003) A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res., 13, 137–144. Cerutti,H. and Casas-Mollano,J.A. (2006) On the origin and functions of RNAmediated silencing: from protists to man. Curr. Genet., 50, 81–99. Ciccarelli,F.D. et al. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science, 311, 1283–1287. Csu¡¡ro¨s, M. and Miklo´s, I. (2006) A probabilistic model for gene content evolution with duplication, loss, and horizontal transfer. RECOMB 2006, LNBI 3909, 206–220. Dehal,P. and Boore,J.L. (2005) Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol., 3, e314. Dempster,A.P. et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B-Stat. Methodol., 39, 1–38. Douzery,E.J. et al. (2004) The timing of eukaryotic evolution: does a relaxed molecular clock reconcile proteins and fossils? Proc. Natl. Acad. Sci. U. S. A., 101, 15386–15391. Dutilh,B.E. et al. (2004) The consistent phylogenetic signal in genome trees revealed by reducing the impact of noise. J. Mol. Evol., 58, 527–539. Glazov,E. et al. (2003) A gene encoding an RNase D exonuclease-like protein is required for post-transcriptional silencing in Arabidopsis. Plant J., 35, 342–349. Gu,X. and Zhang,H. (2004) Genome phylogenetic analysis based on extended gene contents. Mol. Biol. Evol., 21, 1401–1408. Hahn,M.W. et al. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res., 15, 1153–1160. Halanych,K.M. (2004) The new view of animal phylogeny. Annu. Rev. Ecol. Evol. Syst., 35, 229–256. Holmes,I. and Rubin,G.M. (2002) An expectation maximization algorithm for training hidden substitution models. J. Mol. Biol., 317, 753–764. Huson,D.H. and Steel,M. (2004) Phylogenetic trees based on gene content. Bioinformatics, 20, 2044–2049. Ikeda,H. et al. (2003) Complete genome sequence and comparative analysis of the industrial microorganism Streptomyces avermitilis. Nat. Biotechnol., 21, 526–531. Karev,G.P. et al. (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol., 2, 18. Karev,G.P. et al. (2003) Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve? Bioinformatics, 19, 1889–1900. Karev,G.P. et al. (2004) Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth-death-innovation models. BMC Evol. Biol., 4, 32. Keeling,P.J. et al. (2005) The tree of eukaryotes. Trends Ecol. Evol., 20, 670–676. Kunin,V. and Ouzounis,C.A. (2003) GeneTRACE-reconstruction of gene content of ancestral species. Bioinformatics, 19, 1412–1416. Lari,K. and Young,S.J. (1990) The estimation of stochastic context-free grammars using the inside-outside algorithm. Comput. Speech Lang., 4, 35–56. Mirkin,B.G. et al. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol., 3, 2. Nakabachi,A. et al. (2006) The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science, 314, 267. Nakayashiki,H. et al. (2006) Evolution and diversification of RNA silencing proteins in fungi. J. Mol. Evol., 63, 127–135. Pagel,M. (1999) The maximum likelihood approach to reconstructing ancestral character states of discrete characters on phylogenies. Syst. Biol., 48, 612–622. Pearl,J. (1986) Fusion, propagation, and structuring in belief networks. Artif. Intell., 29, 241–288. Pupko,T. et al. (2000) A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol., 17, 890–896. Schluter,D. et al. (1997) Likelihood of ancestor states in adaptive radiation. Evolution, 51, 1699–1711. Snel,B. et al. (2002) Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res., 12, 17–25. Tatusov,R.L. et al. (1997) A genomic perspective on protein families. Science, 278, 631–637. Uchiyama,I. (2003) MBGD: microbial genome database for comparative analysis. Nucleic Acids Res., 31, 58–62.

W.Iwasaki and T.Takagi

von Mering,C. et al. (2005) STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res., 33, D433–D437. Wenner,T. et al. (2003) End-to-end fusion of linear deleted chromosomes initiates a cycle of genome instability in Streptomyces ambofaciens. Mol. Microbiol., 50, 411–425.

Wu,C.H. et al. (2006) The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res., 34, D187–D191. Yigit,E. et al. (2006) Analysis of the C. elegans Argonaute family reveals that distinct Argonautes act sequentially during RNAi. Cell, 127, 747–757. Zamore,P.D. and Haley,B. (2005) Ribo-gnome: the big world of small RNAs. Science, 309, 1519–1524.

i239