Syst. Biol. 55(3):426–440, 2006. Copyright © Society of Systematic Biologists. ISSN: 1063-5157 print / 1076-836X online. DOI: 10.1080/10635150500541722

Supertree Bootstrapping Methods for Assessing Phylogenetic Variation among Genes in Genome-Scale Data Sets

J. GORDON BURLEIGH, AMY C. DRISKELL, AND MICHAEL J. SANDERSON

Section of Evolution and Ecology, University of California, Davis, CA 95616, USA; E-mail: [email protected] (J.G.B.)

Abstract.—Nonparametric bootstrapping methods may be useful for assessing confidence in a supertree inference. We examined the performance of two supertree bootstrapping methods on four published data sets that each include sequence data from more than 100 genes. In “input tree bootstrapping,” input gene trees are sampled with replacement and then combined in replicate supertree analyses; in “stratified bootstrapping,” trees from each gene’s separate (conventional) bootstrap tree set are sampled randomly with replacement and then combined. Generally, support values from both supertree bootstrap methods were similar to or slightly lower than corresponding bootstrap values from a total evidence, or supermatrix, analysis. Yet, supertree bootstrap support also exceeded supermatrix bootstrap support for a number of clades. There was little overall difference in support scores between the input tree and stratified bootstrapping methods. Results from supertree bootstrapping methods, when compared to results from corresponding supermatrix bootstrapping, may provide insights into patterns of variation among genes in genome-scale data sets. [Nonparametric bootstrapping; phylogenetics; supermatrix; supertree; supertree bootstrapping.]

Large data sets derived from whole genomes or from sequence databases are becoming more commonplace in phylogenetic studies. Numerous phylogenetic analyses have used data sets that include sequences from more than 100 loci (e.g., Daubin et al., 2001; Bapteste et al., 2002; Blair et al., 2002, 2005; Lee, 2002; Lerat et al., 2003; Rokas et al., 2003; Driskell et al., 2004; Philippe et al., 2004, 2005; Wolf et al., 2004; Dopazo and Dopazo, 2005; Philip et al., 2005). Perhaps the greatest challenge in phylogenetic analysis of data sets this large is heterogeneity among loci. Questions regarding the treatment of heterogeneous loci are not new (see Bull et al., 1993; de Queiroz et al., 1995; Huelsenbeck et al., 1996; Cunningham, 1997), but these questions are especially relevant when analyzing genome-scale data sets that often exhibit extensive gene-specific phylogenetic variation (e.g., Rokas et al., 2003; Driskell et al., 2004). Interestingly, incongruence among genes appears rampant in genome-scale data sets whether the total evidence bootstrap results are uniformly high (e.g., Rokas et al., 2003) or reveal mixed levels of support (e.g., Bapteste et al., 2002; Driskell et al., 2004). The presence of such variation among genes may greatly affect or even mislead results of a total evidence phylogenetic analysis. For example, total evidence phylogenetic inference may be particularly influenced by loci with longer sequences or faster rates of evolution (Seo et al., 2005). Thus, when analyzing genome-scale data sets it is critical to understand not only the variation among characters but also the patterns of variation among genes, and to assess how this variation may affect phylogenetic inference.

Supertree methods (Bininda-Emonds, 2004) may be useful in genome-scale phylogenetic analyses. Supertree methods combine input trees with partially overlapping sets of taxa to make comprehensive phylogenetic hypotheses incorporating all taxa present in the input.
They are increasingly popular for combining data from disparate sources and building large phylogenies (e.g., Sanderson et al., 1998; Bininda-Emonds et al., 2002; Bininda-Emonds, 2004). Recently, supertree methods also have been used to infer phylogenies from large, multigene data sets (Daubin et al., 2001; Cotton and Page, 2002; Philip et al., 2005). Whereas a total evidence, or supermatrix, approach, which concatenates sequences from all genes, uses nucleotide or amino acid sites as the basic unit of data, supertree approaches treat each input tree as the basic unit of data. If each input tree is built from sequences of a single gene in a genome-scale data set, supertree methods may be more sensitive to variation in the phylogenetic inference among genes than supermatrix methods. Similarly, supertree methods also may minimize the effects in phylogenetic inference of anomalous loci, such as loci with histories of horizontal transfer (Escobar-Páramo et al., 2004). Furthermore, in genome-scale data sets, gene sequences may be missing due to gene gains or losses across taxa (e.g., Daubin, 2001) or simply because they have not been sampled (e.g., Driskell et al., 2004; Philippe et al., 2004; Yan et al., 2005). Because supertree methods explicitly build phylogenetic hypotheses from data sets without complete taxonomic overlap, genome-scale phylogenetics is a natural application of supertree approaches.

This study explores the utility of two methods of supertree bootstrapping for assessing clade support in supertrees and the sampling variance associated with individual genes in genome-scale data sets. Though numerous supertree methods have been developed (see Bininda-Emonds, 2004), methods for generating support values for supertree inferences have received less attention (but, see Purvis, 1995; Cotton and Page, 2002; Bininda-Emonds, 2003; Creevey et al., 2004; Ronquist et al., 2004; Philip et al., 2005). We used four published genome-scale data sets to examine two previously proposed nonparametric bootstrapping methods for assessing clade support in supertrees.
We also compare the differences in clade support measures from supertree bootstrapping with those from conventional supermatrix bootstrapping in these same data sets.


2006

BURLEIGH ET AL.—SUPERTREE BOOTSTRAPPING

METHODS

Data Sets

We analyzed four published data sets, each containing sequences from more than 100 genes and varying widely in the amount of missing data. Two of the supermatrices have extreme levels of missing data (Driskell et al., 2004). Each taxon in these two supermatrices has amino acid sequences from at least 10 genes, and each gene in the supermatrix has sequences from at least four taxa (Driskell et al., 2004). The metazoan supermatrix is composed of sequences from 1131 genes for 70 taxa. Each taxon has sequences from an average of 95 of the 1131 genes. Therefore, approximately 92% of the total gene sequences are missing from the supermatrix (Driskell et al., 2004). The green plant supermatrix contains sequences from 254 genes for 69 green plant taxa. Each taxon has sequences from an average of 40 genes, and thus, approximately 84% of the gene sequences are missing (Driskell et al., 2004).

The bilaterian supermatrix of Philippe et al. (2005) is missing a smaller proportion of the data (∼35%) and is comprised of amino acid sequences from 49 taxonomic units and 146 genes. Many of the taxon labels in the bilaterian supermatrix as published by Philippe et al. (2005) represent chimeric taxa, in which sequences from a larger clade have been combined into a single taxonomic unit. In some cases we used taxon labels that differ from those of Philippe et al. (2005). Generally, the new taxon names represent higher taxonomic units that incorporate all taxa that compose each chimeric sequence. However, the sequences were not modified. A table that translates between the taxon labels used by Philippe et al. (2005) and those used in this study is available at http://systematicbiology.org.

The fourth data set, the yeast supermatrix of Rokas et al. (2003), contains DNA sequences from 106 genes in eight yeast taxa with no missing gene sequences. All data sets used in this study are available at http://systematicbiology.org.
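The missing-data percentages quoted for the two Driskell et al. (2004) supermatrices follow directly from the average number of genes sampled per taxon. The short check below reproduces that arithmetic; the function name `missing_fraction` is ours and purely illustrative.

```python
def missing_fraction(mean_genes_per_taxon: float, total_genes: int) -> float:
    """Fraction of gene-by-taxon cells that contain no sequence."""
    return 1.0 - mean_genes_per_taxon / total_genes


# Values taken from the text above.
metazoan = missing_fraction(95, 1131)     # 70 taxa, 1131 genes
green_plant = missing_fraction(40, 254)   # 69 taxa, 254 genes

print(f"metazoan missing:    {metazoan:.0%}")
print(f"green plant missing: {green_plant:.0%}")
```

Rounded to whole percentages, the two values agree with the approximately 92% and 84% reported above.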
Gene Tree Construction

Phylogenetic trees were obtained using maximum parsimony (MP) for each gene in each supermatrix, and these gene trees were used as input trees for supertree analyses. All phylogenetic searches were implemented in PAUP* (Swofford, 2003). The metazoan, green plant, and bilaterian MP gene trees were constructed using a heuristic search strategy: each search used TBR branch swapping on four random addition sequence replicates and saved a maximum of 2500 trees per replicate. Branch and bound MP searches (Hendy and Penny, 1982) were used to identify MP gene topologies from the yeast data set.

Supertree Construction and Bootstrapping

All supertree analyses used the matrix representation with parsimony (MRP) supertree method (Baum, 1992; Ragan, 1992; Baum and Ragan, 2004), the most widely used supertree method (Bininda-Emonds, 2004). For each data set, the input trees consisted of a single unrooted MP tree from each gene. The supertrees were later rooted using the outgroup taxa specified in the original supermatrix studies (see Rokas et al., 2003; Driskell et al., 2004; Philippe et al., 2005). If a gene tree had multiple optimal MP trees, the first MP tree in the tree file was used as the input tree. We also ran the analyses using a strict consensus of all MP trees, and the results were very similar (not shown). Input trees were transformed into their binary matrix representation (e.g., Farris et al., 1970; Baum, 1992; Ragan, 1992), called the MRP matrix, using the r8s program (Sanderson, 2003).

Two variations of supertree bootstrapping were performed, which we term input tree bootstrapping and stratified bootstrapping. Input tree bootstrapping samples with replacement from among the original MP input trees to create new replicate data sets containing the same number of input trees as in the original data set (e.g., Daubin et al., 2001; Creevey and McInerney, 2004; Creevey et al., 2004; Philip et al., 2005). This version of input tree bootstrapping is analogous to the conventional method of bootstrapping used in phylogenetics (Felsenstein, 1985). However, the sampled characters are input trees, not single columns in a character matrix, and the resulting replicate matrices likely will not contain all input trees. The nonparametric bootstrap assumes that the resampled characters, in this case input gene trees, are independent and identically distributed (Felsenstein, 1985). Whereas this assumption seems at least as valid for different genes as for neighboring nucleotides, it is more problematic if input trees are built from shared data sets.

For the green plant, metazoan, and bilaterian data sets, 100 replicates of input tree bootstrapping were performed, each consisting of four random addition sequence replicates with a maximum of 6 hours of TBR branch swapping and saving up to 10,000 trees.
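The two steps just described, MRP coding and resampling of input trees, can be sketched as follows. This is a minimal illustration, not the r8s or PAUP* implementation: the `InputTree` type, the function names, and the representation of each tree as a taxon set plus a list of clades (one side of each internal split) are assumptions made here for clarity. The parsimony search on each replicate matrix is not shown.

```python
import random
from typing import Dict, FrozenSet, List, NamedTuple


class InputTree(NamedTuple):
    """Hypothetical container: the taxa in a gene tree and its clades."""
    taxa: FrozenSet[str]
    clades: List[FrozenSet[str]]


def mrp_matrix(trees: List[InputTree]) -> Dict[str, str]:
    """Code every clade of every input tree as one binary character.

    '1' = taxon is in the clade, '0' = taxon is in the tree but outside
    the clade, '?' = taxon is absent from that input tree.
    """
    all_taxa = sorted(set().union(*(t.taxa for t in trees)))
    rows: Dict[str, List[str]] = {name: [] for name in all_taxa}
    for tree in trees:
        for clade in tree.clades:
            for name in all_taxa:
                if name not in tree.taxa:
                    rows[name].append("?")
                elif name in clade:
                    rows[name].append("1")
                else:
                    rows[name].append("0")
    return {name: "".join(states) for name, states in rows.items()}


def input_tree_bootstrap(trees: List[InputTree], n_replicates: int,
                         seed: int = 0) -> List[List[InputTree]]:
    """Sample input trees with replacement; each replicate keeps the
    original number of input trees."""
    rng = random.Random(seed)
    return [rng.choices(trees, k=len(trees)) for _ in range(n_replicates)]


# Toy example with two gene trees on partially overlapping taxa.
gene1 = InputTree(frozenset("ABCD"), [frozenset("AB")])
gene2 = InputTree(frozenset("BCDE"), [frozenset("DE")])
matrix = mrp_matrix([gene1, gene2])
replicates = input_tree_bootstrap([gene1, gene2], n_replicates=100)
replicate_matrices = [mrp_matrix(rep) for rep in replicates]
```

Each replicate matrix would then be analyzed with the same parsimony search as the original MRP matrix, and clade frequencies across replicates give the bootstrap support.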
We chose this search heuristic because it was computationally feasible and identical to that used in the original MP supermatrix bootstrapping of the green plant and metazoan supermatrices by Driskell et al. (2004). We performed an identical MP supermatrix bootstrap analysis on the bilaterian data set of Philippe et al. (2005). Thus, the comparison of supertree and supermatrix bootstrap scores should not be affected by different heuristics. Longer tree searches may have yielded higher bootstrap scores (e.g., DeBry and Olmstead, 2000; Mort et al., 2000; Sanderson and Wojciechowski, 2000); however, they would likely not affect the overall comparison of supertree and supermatrix bootstrap scores. For the yeast data set, 100 replicates of input tree bootstrapping were performed, each using a branch and bound MP search. An MP supermatrix bootstrap analysis also was performed using PAUP∗ with branch and bound searches on 100 replicates. Stratified bootstrapping was previously proposed as a method to incorporate uncertainty within individual input trees into a supertree analysis (Cotton and Page, 2002; Page, 2004). First, each gene comprising the original supermatrix is bootstrapped. Confidence in each gene tree also was assessed based on 100 nonparametric bootstrap replicates (Felsenstein, 1985). For each gene sequence in the metazoan, green plant, and bilaterian supermatrices,


SYSTEMATIC BIOLOGY

each bootstrap replicate used a maximum of 1 hour of TBR branch swapping from a simple addition sequence starting tree and saved up to 1000 trees. For the yeast supermatrix, 100 bootstrap MP searches were completed using branch and bound searches. For each of the stratified bootstrap replicates, a tree from a single bootstrap replicate for each gene is selected randomly and used as an input tree in a subsequent supertree search. Thus, in every replicate of the stratified bootstrap, each gene is represented with a tree selected from one of its bootstrap replicates. We performed 100 replicates of stratified bootstrapping on all four data sets, using the same MRP tree heuristic search strategies as used in the input tree bootstrap analysis described above.

We note that a similar approach to stratified bootstrapping would be to weight individual data sets based on their bootstrap support (e.g., Ronquist, 1996; Bininda-Emonds and Sanderson, 2001) and resample these weighted data sets. Although these approaches both account for clade support in input trees, the stratified bootstrapping approach may provide a more complete picture of the tree support because it incorporates all clades that receive any support even if they are not represented in a majority rule bootstrap tree. Alternatively, bootstrapping the weighted trees would be more feasible when the original data sets are unavailable.

RESULTS

Supertree Bootstrap

In the metazoan, green plant, and bilaterian data sets, support values from both methods of supertree bootstrapping are often low. Although in each analysis some clades have high support, the majority of clades are poorly supported. Overall, the scores from input tree bootstrapping are similar to those from stratified bootstrapping in these three data sets. In the yeast data set, the supertree bootstrap support from both methods is at or near 100% for all clades in the optimal MP tree.
In the metazoan data set, Mammalia and Amniota have at least 91% support for both supertree bootstrapping methods (Fig. 1a, b). Vertebrata and Metazoa have 83% and 84% support, respectively, in input tree bootstrap and 97% and 98% support with stratified bootstrapping. Support for Tetrapoda is never higher than 57% (Fig. 1a, b). The primates and numerous lower level clades are supported with near 100% bootstrap values in the supertree bootstraps, but many other clades show very low supertree bootstrap support (Fig. 1a, b). In the green plant data set, the land plants, seed plants, and flowering plants all have at least 92% support from supertree bootstrapping (Fig. 2a, b). Vascular plants have 82% and 51% support from input tree bootstrapping and stratified bootstrapping, respectively (Fig. 2a, b). Many angiosperm clades, including monocots and eudicots, have very low support from supertree bootstrapping (Fig. 2a, b). Though many clades representing major taxonomic groups are strongly supported in the bilaterian supertree bootstrap analyses, the relationships among these large clades generally have low bootstrap values

VOL. 55

(Fig. 3a, b). For example, Deuterostomia, Nematoda, Platyhelminthes, and Arthropoda are supported by at least 86% of the replicates in the input tree bootstrap (Fig. 3a) and 96% in stratified bootstrap (Fig. 3b). The fungi have 92% and 90% values for input tree bootstrapping and stratified bootstrapping, respectively, and other fungal clades all have at least 89% bootstrap scores (Fig. 3a, b).

Nearly all bootstrap values from both supertree bootstrapping methods are 100% for all clades in the MP yeast topology (Fig. 4). The only exception is the single clade of Saccharomyces kudriavzevii, S. mikatae, S. paradoxus, and S. cerevisiae, which has a stratified bootstrap value of 96% (Fig. 4).

Comparing Supertree and Supermatrix Bootstrap Scores

We compared bootstrap support from both supertree bootstrapping methods to supermatrix bootstrapping for all clades that received at least 5% support from at least one of the bootstrapping methods being compared. The bootstrap values from the supermatrix analyses generally exceed those from the supertree analyses in the green plant data set, though there also are a number of clades in which the supertree bootstrap values exceed the corresponding supermatrix bootstrap values (Fig. 5a, b). There is less difference in support values resulting from the supermatrix and both methods of supertree bootstrapping for the metazoan, bilaterian, and yeast data sets (Figs. 4, 5a, b).

Supermatrix versus input tree bootstrap scores.—On average, the supermatrix bootstrap support exceeded the input tree bootstrap support by 5.5% (median = 4.5%), 1.3% (median = 0.7%), and 0.6% (median = 0%) in the green plant, metazoan, and bilaterian data sets, respectively. For the green plant data, the supermatrix bootstrap support exceeded the input tree bootstrap support in 151 clades, and the input tree bootstrap values were higher in 97 clades.
For the metazoan data, supermatrix bootstrap values were higher in 104 clades and input tree bootstrap support was higher in 84 clades. In the bilaterian data, supermatrix support was higher in 28 clades and input tree bootstrap support was higher in 29 clades. The supermatrix bootstrap scores exceeded input tree bootstrap scores by a maximum of 80.7%, 78.2%, and 56.0% in the metazoan, green plant, and bilaterian data sets, respectively, and the input tree bootstrap scores exceeded the supermatrix bootstrap scores by a maximum of 59.5%, 51.5%, and 45.5% in the three data sets.

Supermatrix versus stratified bootstrap.—On average, the supermatrix bootstrap support exceeded the stratified bootstrap support by 5.2% (median = 3.4%) and 0.8% (median = 0%) in the green plant and metazoan data sets, respectively. For the green plant data, supermatrix bootstrap support was higher in 142 clades and stratified bootstrap support was higher in 90 clades. In metazoans, supermatrix support was higher in 78 clades and stratified bootstrap support was higher in 86 clades. However, on average in the bilaterian data set the stratified bootstrap values exceed the supermatrix bootstrap values by


0.1% (median = 0%). In 35 clades from the bilaterian data set, the stratified bootstrap support is higher than the supermatrix support, and the supermatrix bootstrap support is higher in 28 clades. Supermatrix bootstrap scores exceeded stratified bootstrap scores by a maximum of 55.0%, 61.1%, and 40.0% in the metazoan, green plant, and bilaterian data sets, respectively, and the stratified bootstrap scores exceeded the supermatrix bootstrap scores by a maximum of 36.8%, 47.5%, and 25.5% in the three data sets.

Input tree versus stratified bootstrap.—On the whole, the input tree bootstrap scores are slightly lower than the

FIGURE 1. Supertree bootstrap consensus trees from the metazoan data set (Driskell et al., 2004). The taxon labels include the GenBank taxon ID numbers and the taxonomic name associated with that number. Bootstrap percentages are above each branch. (a) Results from input tree bootstrapping; (b) results from stratified bootstrapping.


stratified bootstrap scores in the metazoan and green plant data sets (Fig. 5c). On average among all observed clades, the stratified bootstrap score exceeded the input tree bootstrap score by 0.6% (median = 0.8%) for the metazoan data set and by 0.4% (median = 1.1%) for the green plant data set. In the metazoan data set, the stratified bootstrap support exceeded the input tree bootstrap support in 109 clades, whereas the input tree bootstrap support was higher in 87 clades. In the green plant data set, the stratified bootstrap support exceeded the input tree bootstrap score in 128 clades, while the input tree bootstrap support was higher in 113 clades. The difference between supertree bootstrap methods was even smaller in the bilaterian data set. On average, the input tree bootstrap support exceeded that of stratified bootstrapping by only 0.1% (median = 0%). In the bilaterian data set, the support from stratified bootstrapping exceeded that of input tree bootstrapping in 28 clades, and support from input tree bootstrapping was higher in 35 clades. Again, in each data set there were several cases in which clade support was very different among supertree bootstrapping methods. Input tree bootstrap support exceeded stratified bootstrap support by a maximum of 51.2%, 49.2%, and 25.5% in the metazoan, green plant, and bilaterian data sets, respectively, and stratified bootstrapping exceeded input

FIGURE 2. Supertree bootstrap consensus trees from the green plant data set (Driskell et al., 2004). The taxon labels include the GenBank taxon ID numbers and the taxonomic name associated with that number. Bootstrap percentages are above each branch. (a) Results from input tree bootstrapping; (b) results from stratified bootstrapping.


tree bootstrapping scores by a maximum of 37.2%, 48.6%, and 40.0% in the three data sets.

DISCUSSION

Despite the growing interest in supertrees, often supertree studies do not explicitly discuss or use methods to assess confidence in supertree topologies (but, see Bininda-Emonds, 2003; Creevey et al., 2004; Ronquist et al., 2004; Philip et al., 2005). Yet, in many cases it is important to understand not only the optimal supertree topology but also its confidence limits. For example, supertrees are useful for studies in comparative biology, and bootstrapping supertrees allows one to incorporate phylogenetic uncertainty in these comparative analyses. This study demonstrates implementations of two simple supertree bootstrapping methods. The results provide a comparison of supertree and supermatrix phylogenetic methods and are the first to compare support values from supertree and supermatrix methods.

There has been much interest in comparing the performance of supertree and supermatrix techniques (Bininda-Emonds and Sanderson, 2001; Kennedy and Page, 2002; Gatesy et al., 2004; Hughes and Vogler, 2004).


Although supertree methods sometimes perform as well or nearly as well as total evidence methods at recovering the true tree in simulation (Bininda-Emonds and Sanderson, 2001), in empirical data sets supertree methods generally find a greater number of equally optimal topologies and unambiguously resolve fewer phylogenetic relationships than corresponding total evidence approaches (Kennedy and Page, 2002; Gatesy et al., 2004; Hughes and Vogler, 2004). This result may be due to MRP matrices containing far fewer

FIGURE 3. Supertree bootstrap consensus trees from the bilaterian data set (Philippe et al., 2005). Bootstrap percentages are above each branch. (a) Results from input tree bootstrapping; (b) results from stratified bootstrapping.


characters than the corresponding total evidence supermatrices. Furthermore, conventional MRP analyses contain no information about the number of characters supporting phylogenetic hypotheses represented by input trees, though this may be incorporated into a supertree analysis for example by weighting characters in an MRP matrix in proportion to support for a clade (Ronquist, 1996; Bininda-Emonds and Sanderson, 2001).

If the supertree analyses do not resolve clades as well as the supermatrix analyses, one also might predict that the bootstrap values generally would be lower in supertree analyses. Indeed, in the majority of clades, support from input tree or stratified bootstrapping is lower than support from supermatrix analyses in the metazoan and green plant data sets, but overall the supertree and supermatrix support is similar in the bilaterian and yeast data sets (Fig. 5a, b). Yet, this characterization of the relationship of supertree and supermatrix bootstrap support is incomplete. There are numerous cases in which the supertree bootstrap scores for clades are higher than supermatrix bootstrap scores (Fig. 5a, b). Therefore, it is worthwhile to examine specific reasons for large


FIGURE 4. Supertree bootstrap consensus tree from the yeast data set (Rokas et al., 2003). The supertree bootstrap percentages are above each branch. The input tree bootstrap score is on the left separated by a slash from the stratified bootstrap score on the right. Below each branch are the number of genes that support/conflict with the quartet implied by each branch (as in Driskell et al., 2004).
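The support/conflict counts shown below each branch can be related to bootstrap support with a rough simulation. The sketch below deliberately simplifies the analysis: it treats each gene as a single two-state character that either supports the quartet or does not, resamples the genes with replacement, and counts the replicates in which supporting genes outnumber the rest. It does not run a parsimony search on an MRP matrix, and the function name `resampled_support` and the default of 1000 replicates are assumptions made here for illustration.

```python
import random


def resampled_support(n_support: int, n_conflict: int,
                      n_replicates: int = 1000, seed: int = 0) -> float:
    """Fraction of gene-level bootstrap replicates in which the
    supporting genes outnumber the non-supporting genes."""
    rng = random.Random(seed)
    genes = [1] * n_support + [0] * n_conflict   # 1 = supports the quartet
    wins = 0
    for _ in range(n_replicates):
        sample = rng.choices(genes, k=len(genes))  # resample with replacement
        support = sum(sample)
        if support > len(sample) - support:
            wins += 1
    return wins / n_replicates


# Most-conflicted branch in the yeast tree: 69 genes support, 37 do not.
yeast_branch = resampled_support(69, 37)
# For contrast, an evenly split branch.
even_split = resampled_support(53, 53)
print(yeast_branch, even_split)
```

Under these simplifying assumptions, a 69-to-37 split is recovered in nearly every replicate, whereas an even split is recovered in roughly half or fewer.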

differences in support scores from supertree and supermatrix analyses.

Both supertree bootstrapping methods implicitly impose a different weighting scheme on the data than supermatrix analyses. The supertree bootstrapping on the genome-scale data sets treats all genes, no matter how large or how variable, as single characters (e.g., Doyle, 1992). Thus, in this case, supertree bootstrapping down-weights data from genes with more informative sites and up-weights genes with fewer informative sites. In this respect, the supertree bootstrapping may resemble a supermatrix analysis that corrects the optimality score of each gene based on its length (Seo et al., 2005). Though the overall bootstrap values from supertree bootstrapping may be lower than those from supermatrix bootstrapping, this trend may not persist or may be diminished when the proportion of genes that support a phylogenetic hypothesis differs from the proportion of sites that support the same hypothesis. In these cases, the sampling variance among sites differs from the variance among genes. We note that in many cases down-weighting longer genes and up-weighting shorter genes may not benefit a phylogenetic analysis. However, it still may be informative to compare the effects from the different


FIGURE 5. Relationships between bootstrap scores on clades. Each point on the graph represents a single clade. Closed squares represent clades from the metazoan data set, open circles represent clades from the green plant data set, and triangles represent clades from the bilaterian data set. The line represents equal values for both bootstrapping methods. Comparison of (a) input tree bootstrapping and supermatrix bootstrapping; (b) stratified bootstrapping and supermatrix bootstrapping; and (c) input tree bootstrapping and stratified bootstrapping. These comparisons only include clades that have at least 5% bootstrap support from at least one of the bootstrapping methods.
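The clade-by-clade comparison summarized in this figure can be sketched as a small summary routine. The version below is illustrative only: it assumes each analysis is supplied as a mapping from a clade (a frozenset of taxon names) to its bootstrap percentage, and it assumes that a clade absent from one analysis counts as 0% support there, which the text does not state explicitly. The function name `compare_support` and the toy values are hypothetical.

```python
from statistics import mean, median
from typing import Dict, FrozenSet

Clade = FrozenSet[str]


def compare_support(a: Dict[Clade, float], b: Dict[Clade, float],
                    threshold: float = 5.0) -> Dict[str, float]:
    """Summarize support differences (a minus b) over clades that reach
    the threshold in at least one of the two analyses.

    Raises statistics.StatisticsError if no clade reaches the threshold.
    """
    clades = [c for c in set(a) | set(b)
              if max(a.get(c, 0.0), b.get(c, 0.0)) >= threshold]
    diffs = [a.get(c, 0.0) - b.get(c, 0.0) for c in clades]
    return {
        "n_clades": len(clades),
        "mean_diff": mean(diffs),
        "median_diff": median(diffs),
        "a_higher": sum(d > 0 for d in diffs),
        "b_higher": sum(d < 0 for d in diffs),
        "max_a_over_b": max(diffs),
        "max_b_over_a": -min(diffs),
    }


# Toy support values, not taken from the study.
supermatrix = {frozenset("AB"): 100.0, frozenset("CD"): 60.0,
               frozenset("EF"): 3.0}
supertree = {frozenset("AB"): 90.0, frozenset("CD"): 75.0,
             frozenset("EF"): 2.0, frozenset("GH"): 10.0}
summary = compare_support(supermatrix, supertree)
print(summary)
```

In the toy data the clade EF is excluded because neither analysis gives it 5% support, which mirrors the filtering rule stated in the caption.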


approaches to utilizing data from different genes in a phylogenetic inference. Furthermore, supertree analyses need not weight trees equally; for example, the input trees could be weighted based on length of genes or number of variable characters.

Such situations can be demonstrated with an imaginary data set of 100 genes of equal length in which each gene alone is free of homoplasy but different genes support two competing clades, Clade A and Clade B. In the first example, 50 genes support Clade A over Clade B by a difference in tree length of 3 steps contributed by each gene and 50 genes support Clade B over Clade A by 30 steps each. With three or more uncontradicted characters, bootstrap support for the clade will be at least 95% for each gene (Felsenstein, 1985). In this case, both input tree and stratified supertree bootstrap scores would support both trees equally because an equal number of genes support each hypothesis. However, the supermatrix bootstrap would strongly support Clade B over Clade A, because 10 times as many characters support Clade B over Clade A (50 × 30 = 1500 steps versus 50 × 3 = 150 steps). Conversely, suppose 99 genes support Clade A over Clade B by 3 steps and 1 gene supports Clade B over Clade A by 297 steps. Here the input tree and stratified supertree bootstrap scores would overwhelmingly support Clade A over Clade B because many more gene trees support Clade A over Clade B. However, because an equal number of character steps support Clade A and Clade B (297), the supermatrix bootstrap

would support both trees equally. In both examples, the supertree bootstrap more accurately represents the variation among genes within the data set, and the supermatrix bootstrap represents the variation among characters in the data sets. Cases in which supertree and supermatrix bootstrap scores greatly diverge could indicate clades that are influenced by a few anomalous loci or clades in which there are strongly competing hypotheses among genes. We suggest it may be informative to use supertree and supermatrix bootstrapping as complementary methods for assessing different levels of variance in large, multilocus data sets. One unexpected result from supertree bootstrapping is that even though there appears to be much incongruence among the genes that compose the yeast data set (Rokas et al., 2003), the supertree bootstrap values are 100% for nearly all clades (Fig. 4). To examine this result further, we determined the number of the yeast gene trees that support and do not support the quartet implied by each inner branch of the tree (see Driskell et al., 2004). The branch with the most apparent conflict has 69 genes supporting the quartet and 37 genes failing to support it (Fig. 4). Although this may appear to be a large number of conflicting genes, a clade that is supported by 69 characters with only 37 conflicting characters will have a high parsimony bootstrap score. Thus, the 100% supertree bootstrap scores are consistent with the observed level of incongruence. Yet this demonstrates that the supertree bootstrap percentage may differ greatly


from the actual percentage of genes that conflict with a specific node.

Though input tree bootstrapping and stratified bootstrapping examine different aspects of variation among gene trees, they provide similar results in the four data sets from this study (Fig. 5c). Input tree bootstrapping represents the sampling variance among genes and can be interpreted as reflecting the robustness of supertree inference to the choice of loci. On the other hand, stratified bootstrapping measures how uncertainty in the phylogenetic inference for the input genes affects supertree inference. In this respect, stratified bootstrapping incorporates the strength of the phylogenetic hypotheses from each gene into supertree analysis (Page, 2004). Furthermore, by incorporating information about secondary phylogenetic signals for each gene, stratified bootstrapping may help reveal or enhance hidden support (e.g., Barrett et al., 1991; Gatesy and Baker, 2005).

Although there is no apparent a priori reason to expect that values from the two supertree bootstrapping methods will be similar, they might be similar in cases when the variation in the phylogenetic signal among genes is a manifestation of uncertainty in the phylogenetic signal within genes. For example, if the apparent variation in the phylogenetic signal among genes is due to sampling error within genes, then the two supertree bootstrapping values should be similar. If the genes strongly support highly divergent topologies, then we expect low values from input tree bootstrapping and higher values for stratified bootstrapping. Conversely, if the genes all weakly support the same topology, we expect higher values for input tree bootstrapping than stratified bootstrapping.

One obvious additional supertree bootstrapping method would be to combine both supertree bootstrapping methods, resampling genes and then resampling bootstrap trees from the resampled genes.
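The stratified resampling scheme and the combined scheme just mentioned differ only in whether the genes themselves are resampled first. The sketch below makes that difference concrete under simplifying assumptions: each gene is represented by a list holding one tree per conventional bootstrap replicate (here as placeholder Newick strings), and the supertree search on each assembled set of input trees is omitted. The function names and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Sequence

Tree = str  # placeholder: one Newick string per bootstrap replicate


def stratified_replicate(boot_trees: Dict[str, Sequence[Tree]],
                         rng: random.Random) -> List[Tree]:
    """Every gene contributes exactly one tree, drawn at random from that
    gene's own set of bootstrap trees."""
    return [rng.choice(trees) for trees in boot_trees.values()]


def combined_replicate(boot_trees: Dict[str, Sequence[Tree]],
                       rng: random.Random) -> List[Tree]:
    """Two-tiered variant: resample genes with replacement, then draw one
    bootstrap tree from each resampled gene."""
    genes = list(boot_trees)
    sampled_genes = rng.choices(genes, k=len(genes))
    return [rng.choice(boot_trees[g]) for g in sampled_genes]


# Toy bootstrap tree sets for three genes (illustrative only).
boot_trees = {
    "gene1": ["((A,B),C,D);", "((A,C),B,D);"],
    "gene2": ["((A,B),C,D);"],
    "gene3": ["((A,D),B,C);", "((A,B),C,D);", "((A,C),B,D);"],
}
rng = random.Random(1)
stratified = [stratified_replicate(boot_trees, rng) for _ in range(100)]
combined = [combined_replicate(boot_trees, rng) for _ in range(100)]
```

In the stratified replicates every gene is always present exactly once, so only within-gene uncertainty varies; in the combined replicates a gene may appear several times or not at all, so both sources of variation contribute.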
This combined supertree bootstrap would be conceptually similar to the two-tiered supermatrix bootstrapping proposed by Seo et al. (2005), in which one resamples genes and then resamples characters within the resampled genes. However, it may be informative to compare the two supertree bootstrapping methods individually as described here to examine differences in sampling variance among characters and among genes.

Supertree methods have been criticized for a variety of reasons (e.g., Purvis, 1995; Novacek, 2001; Springer and de Jong, 2001; Wilkinson et al., 2001, 2005; Gatesy et al., 2002, 2004; Pisani and Wilkinson, 2002; Gatesy and Springer, 2004), and it is worth examining the implications of some of these criticisms for supertree bootstrapping in genome-scale data sets. First, the MRP method has been particularly criticized, perhaps because it is the most commonly used supertree method (e.g., Purvis, 1995; Pisani and Wilkinson, 2002; Gatesy and Springer, 2004). Supertree bootstrapping methods do not solve the problems associated with MRP, but input tree bootstrapping can be applied to any modification of MRP or any other supertree method, as can stratified bootstrapping if the source data are amenable to bootstrapping. Thus, supertree bootstrapping need not be limited by biases associated with MRP. Second, supertree studies have been criticized for using nonindependent input trees, or input trees constructed from overlapping data sets (Springer and de Jong, 2001; Gatesy et al., 2002; Gatesy and Springer, 2004). This criticism is particularly leveled at input tree sampling from previous supertree studies and is not necessarily an inherent limitation of supertree methods (Bininda-Emonds et al., 2004). The supertree methods proposed here utilize gene trees from genome-scale data sets and have input trees built from nonoverlapping sequence data sets. The input trees in supertree bootstrapping of genome-scale data sets may be considered independent (excepting the perennial caveat regarding recombinational distance between loci). A third criticism of supertree methods is that they do not utilize the underlying data from the gene trees, and therefore they fail to account for the strength of support within genes. In total-evidence analyses that analyze the primary phylogenetic data, a strong phylogenetic signal that is not observed in analyses of individual data sets may emerge from analyses of the combined data sets (e.g., Barrett et al., 1991; Gatesy et al., 1999; Gatesy and Baker, 2005). Yet, stratified bootstrapping incorporates some measure of the strength of support in the underlying data into the supertree analysis by sampling from the bootstrap distributions of the input trees (Page, 2004).

Supertree bootstrapping has several limitations. First, like any bootstrapping method, supertree bootstrapping is computationally intensive. Stratified bootstrapping requires time not only to bootstrap the underlying data for the input trees but also to run the actual supertree bootstrapping replicates. Thus, supertree bootstrapping may not be a fast alternative to supermatrix bootstrapping. Also, bootstrap methods in general may not accurately estimate variance when data sets are small.
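As a rough illustration of how the number of sampled units affects a bootstrap proportion, the yeast branch discussed earlier (69 genes supporting the quartet, 37 not) can be treated as a set of binary votes that are resampled with replacement. This is a deliberately simplified stand-in, not a parsimony analysis of an MRP matrix: the majority criterion, the function name, and the 13-versus-7 comparison (about the same proportion with far fewer genes) are assumptions made only for this sketch.

```python
import random


def majority_support(n_support, n_conflict, replicates=2000, seed=0):
    """Fraction of bootstrap replicates in which supporting votes form a strict majority.

    Each gene is reduced to a single vote (1 = supports the quartet, 0 = does not).
    This is a simplifying assumption, not an actual parsimony bootstrap.
    """
    rng = random.Random(seed)
    votes = [1] * n_support + [0] * n_conflict
    n = len(votes)
    wins = 0
    for _ in range(replicates):
        # Resample n votes with replacement and count the supporting ones.
        supporting = sum(rng.choice(votes) for _ in range(n))
        if 2 * supporting > n:
            wins += 1
    return wins / replicates


# Same approximate proportion of supporting genes, different numbers of genes.
many_genes = majority_support(69, 37)
few_genes = majority_support(13, 7)
```

Under these assumptions, `many_genes` should be close to 1, whereas `few_genes` should be noticeably lower even though the proportion of supporting genes is nearly the same.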
Though many supertree analyses include hundreds of input trees (e.g., Daubin et al., 2001; Liu et al., 2001), the supertree bootstrapping scores may not be reliable when there are few input trees. Furthermore, because input trees from genome-scale data sets likely contain different sets of taxa (e.g., Driskell et al., 2004), it is possible that some replicates in input tree bootstrapping will not contain all taxa represented in the total set of input trees. In addition, some replicates of input tree bootstrapping may not contain the necessary taxonomic overlap to construct a resolved supertree. These sampling problems will be most common if some taxa are present in only one or a few input trees. If all taxa are represented in numerous input trees, as is the case in each of the four data sets from this study, then there likely will be few if any sampling problems associated with input tree bootstrapping. All replicates of input tree bootstrapping in this study contained all taxa. If some replicates do not include all taxa, then the representation of the bootstrap trees becomes a supertree rather than a consensus problem, and one cannot summarize the set of bootstrap trees with a majority rule consensus.

Supertree and supermatrix data sets may contain much missing data, and conventional nonparametric bootstrapping does not explicitly account for this missing data. Efron (1994) describes several alternate bootstrap procedures for assessing confidence in the face of missing data. Two variants are particularly relevant here. In the first, a protocol must exist for estimating the missing values in the data matrix. This might be undertaken by fitting the missing cells to a probability distribution based on the cells with data present. Then the actual data matrix is resampled and each sample has its missing data estimated based on this protocol, followed by construction of the tree for each of these matrices. In another variant, a model for the concealment mechanism, which is the process by which data “go missing,” must be postulated. In supertree analyses this might have to do with the distribution of taxon sampling among trees. In this bootstrap method, an estimate of the data matrix is constructed such that no data are missing (perhaps using the fitting method of the first variant), and then this matrix is resampled, each time subsequently subjected to the concealment process, reestimation of the full data matrix, and, finally, tree construction. In each variant of bootstrapping with missing data, variation in trees could be summarized in the conventional manner by majority rule consensus. Note that the structure of the supertree method used here (cells in an MRP matrix) rules out many methods that Efron (1994) proposed for estimating missing data or modeling concealment. However, in principle, either of these missing data bootstrapping procedures could be incorporated into supertree bootstrapping.

This study examines only bootstrapping methods for assessing support in supermatrices and supertrees.
However, there are numerous other methods, including jackknifing (e.g., Farris et al., 1996), Bremer support or decay indices (Bremer, 1988), partition Bremer support (Baker and DeSalle, 1997), and ILD tests (Farris et al., 1994), for assessing support and incongruence in large data sets that may be affected differently by missing data.

The supertree bootstrapping methods described in this article are relatively easy to implement with existing software, and they provide useful information in the context of supertree phylogenetic inference. The increasing prevalence of genome-scale data sets necessitates methods for understanding the patterns of conflicting phylogenetic variation within the genome, and supertree bootstrapping can be useful for assessing variation among gene trees that may be masked by total evidence bootstrapping. Our study used a relatively simple MP analysis on each of the input gene sets, but gene-specific heterogeneity also could be addressed in a supertree bootstrap by an explicit model-based approach, such as by assigning different models or substitution parameters for each gene. In the future it will be interesting to compare these supertree approaches to new total-evidence methods for incorporating heterogeneity among genes and assessing gene-specific variation from genome-scale data sets. Finally, although this article emphasizes the utility of supertree bootstrapping in phylogenetic analyses of genome-scale data sets, the methods may be extended for estimating uncertainty in other supertree analyses.
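One practical check for the sampling problem noted earlier, that some input tree bootstrap replicates may lack taxa, is to estimate how often a replicate retains every taxon. The sketch below is a hypothetical illustration: each input tree is reduced to its set of taxon labels, and the function name and toy data are assumptions rather than part of any existing tool.

```python
import random


def fraction_with_all_taxa(tree_taxa, replicates=1000, seed=0):
    """Estimate the fraction of input tree bootstrap replicates that contain all taxa.

    tree_taxa: list of sets, one per input tree, giving the taxa in that tree.
    """
    rng = random.Random(seed)
    all_taxa = set().union(*tree_taxa)
    complete = 0
    for _ in range(replicates):
        # Resample input trees with replacement (same number as the original set).
        sample = [rng.choice(tree_taxa) for _ in tree_taxa]
        if set().union(*sample) == all_taxa:
            complete += 1
    return complete / replicates


# Toy data: in `dense`, every tree has every taxon; in `sparse`, taxon E
# occurs in only one of five input trees.
dense = [frozenset("ABCDE")] * 5
sparse = [frozenset("ABCD")] * 4 + [frozenset("ABCDE")]

dense_rate = fraction_with_all_taxa(dense)
sparse_rate = fraction_with_all_taxa(sparse)
```

In the sparse toy case, a replicate is complete only if the single tree containing taxon E is drawn at least once, which has probability 1 - (4/5)^5, or roughly 0.67, so about a third of replicates would be missing a taxon.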

A Perl script implementing input tree bootstrapping is available at http://ginger.ucdavis.edu. Input tree bootstrapping is also implemented in Clann (Creevey and McInerney, 2004) and in Treeboot by Brian Moore. All data from this study are available at both http://systematicbiology.org and http://ginger.ucdavis.edu.

ACKNOWLEDGEMENTS

We thank Antonis Rokas for providing the yeast supermatrix and Hervé Philippe for providing the bilaterian supermatrix. Olaf Bininda-Emonds, John Gatesy, Brian Moore, and Rod Page provided helpful comments on this manuscript. This work was funded by NSF grants 0431154 and 03346963.

REFERENCES

Baker, R. H., and R. DeSalle. 1997. Multiple sources of character information and the phylogeny of Hawaiian Drosophilids. Syst. Biol. 46:654–673.
Bapteste, E., H. Brinkmann, J. A. Lee, D. V. Moore, C. W. Sensen, P. Gordon, L. Duruflé, T. Gaasterland, P. Lopez, M. Müller, and H. Philippe. 2002. The analysis of 100 genes supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba, and Mastigamoeba. Proc. Natl. Acad. Sci. USA 99:1414–1419.
Barrett, M., M. J. Donoghue, and E. Sober. 1991. Against consensus. Syst. Zool. 40:486–493.
Baum, B. R. 1992. Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 42:637–640.
Baum, B. R., and M. A. Ragan. 2004. The MRP method. Pages 17–34 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Bininda-Emonds, O. R. P. 2003. Novel versus unsupported clades: Assessing the qualitative support for clades in MRP supertrees. Syst. Biol. 52:839–848.
Bininda-Emonds, O. R. P. 2004. The evolution of supertrees. Trends Ecol. Evol. 19:315–322.
Bininda-Emonds, O. R. P., J. L. Gittleman, and M. A. Steel. 2002. The (super)tree of life: Procedures, problems, and prospects. Annu. Rev. Ecol. Syst. 33:265–289.
Bininda-Emonds, O. R. P., K. E. Jones, S. A. Price, M. Cardillo, R. Grenyer, and A. Purvis. 2004. Garbage in, garbage out: Data issues in supertree construction. Pages 267–280 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Bininda-Emonds, O. R. P., and M. J. Sanderson. 2001. Assessment of the accuracy of matrix representation with parsimony supertree construction. Syst. Biol. 50:565–579.
Blair, J. E., K. Ikeo, T. Gojobori, and S. B. Hedges. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2:7.
Blair, J. E., P. Shah, and S. B. Hedges. 2005. Evolutionary sequence analysis of complete eukaryote genomes. BMC Bioinformatics 6:53.
Bremer, K. 1988. The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42:795–803.
Bull, J. J., J. P. Huelsenbeck, C. W. Cunningham, D. L. Swofford, and P. J. Waddell. 1993. Partitioning and combining data in phylogenetic analysis. Syst. Biol. 42:384–397.
Cotton, J. A., and R. D. M. Page. 2002. Going nuclear: Vertebrate phylogeny and gene family evolution reconciled. P. Roy. Soc. Lond. B Bio. 269:1555–1561.
Creevey, C. J., D. A. Fitzpatrick, G. K. Philip, R. J. Kinsella, M. J. O’Connell, M. M. Pentony, S. A. Travers, M. Wilkinson, and J. O. McInerney. 2004. Does a tree-like phylogeny only exist at the tips in the prokaryotes? P. Roy. Soc. Lond. B Bio. 271:2551–2558.
Creevey, C. J., and J. O. McInerney. 2004. Clann: Investigating phylogenetic information through supertree analyses. Bioinformatics 21:390–392.
Cunningham, C. W. 1997. Is congruence between data partitions a reliable predictor of phylogenetic accuracy? Empirically testing an iterative procedure for choosing among phylogenetic methods. Syst. Biol. 46:464–478.
Daubin, V., M. Gouy, and G. Perrière. 2001. Bacterial molecular phylogeny using supertree approach. Genome Informatics 12:155–164.
DeBry, R. W., and R. G. Olmstead. 2000. A simulation study of reduced tree-search effort in bootstrap resampling analysis. Syst. Biol. 49:171–179.
de Queiroz, A., M. J. Donoghue, and J. Kim. 1995. Separate versus combined analysis of phylogenetic evidence. Annu. Rev. Ecol. Syst. 26:657–681.
Dopazo, H., and J. Dopazo. 2005. Genome-scale evidence of the nematode-arthropod clade. Genome Biol. 6:R41.
Doyle, J. J. 1992. Gene trees and species trees: Molecular systematics as one-character taxonomy. Syst. Bot. 17:144–163.
Driskell, A. C., C. Ané, J. G. Burleigh, M. M. McMahon, B. C. O’Meara, and M. J. Sanderson. 2004. Prospects for building the tree of life from large sequence databases. Science 306:1172–1174.
Efron, B. 1994. Missing data, imputation, and the bootstrap. J. Am. Stat. Assoc. 89:463–475.
Escobar-Páramo, P., A. Sabbagh, P. Darlu, O. Pradillon, C. Vaury, E. Denamur, and G. Lecointre. 2004. Decreasing the effects of horizontal gene transfer on bacterial phylogeny: The Escherichia coli case study. Mol. Phylogenet. Evol. 30:243–250.
Farris, J. S., V. A. Albert, M. Källersjö, D. Lipscomb, and A. G. Kluge. 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics 12:99–124.
Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1994. Testing significance of incongruence. Cladistics 10:315–319.
Farris, J. S., A. G. Kluge, and M. J. Eckhardt. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19:172–191.
Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783–791.
Gatesy, J., and R. H. Baker. 2005. Hidden likelihood support in genomic data: Can forty-five wrongs make a right? Syst. Biol. 54:483–492.
Gatesy, J., R. H. Baker, and C. Hayashi. 2004. Inconsistencies in arguments for the supertree approach: Supermatrices versus supertrees of Crocodylia. Syst. Biol. 53:342–355.
Gatesy, J., C. Matthee, R. DeSalle, and C. Hayashi. 2002. Resolution of a supertree/supermatrix paradox. Syst. Biol. 51:652–664.
Gatesy, J., P. O’Grady, and R. H. Baker. 1999. Corroboration among data sets in simultaneous analysis: Hidden support for phylogenetic relationships among higher level artiodactyl taxa. Cladistics 15:271–313.
Gatesy, J., and M. S. Springer. 2004. A critique of matrix representation with parsimony supertrees. Pages 369–388 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Hendy, M. D., and D. Penny. 1982. Branch and bound algorithms for determining minimal evolutionary trees. Math. Biosci. 59:277–290.
Huelsenbeck, J. P., J. J. Bull, and C. W. Cunningham. 1996. Combining data in phylogenetic analyses. Trends Ecol. Evol. 11:152–158.
Hughes, J., and A. P. Vogler. 2004. The phylogeny of acorn weevils (genus Curculio) from mitochondrial and nuclear DNA sequences: The problem of incomplete data. Mol. Phylogenet. Evol. 32:601–615.
Kennedy, M., and R. D. M. Page. 2002. Seabird supertrees: Combining partial estimates of procellariiform phylogeny. Auk 119:88–108.
Lee, Y., R. Sultana, G. Pertea, J. Cho, S. Karamycheva, J. Tsai, B. Parvizi, F. Cheung, V. Antonescu, J. White, I. Holt, F. Liang, and J. Quackenbush. 2002. Cross-referencing eukaryotic genomes: TIGR orthologous gene alignments (TOGA). Genome Res. 12:493–502.
Lerat, E., V. Daubin, and N. A. Moran. 2003. From gene trees to organismal phylogeny in prokaryotes: The case of the γ-proteobacteria. PLoS Biol. 1:101–109.
Liu, F.-G. R., M. M. Miyamoto, N. P. Freire, P. Q. Ong, M. R. Tennant, T. S. Young, and K. F. Gugel. 2001. Molecular and morphological supertrees for eutherian (placental) mammals. Science 291:1786–1789.
Mort, M. E., P. S. Soltis, D. E. Soltis, and M. L. Mabry. 2000. Comparison of three methods for estimating internal support on phylogenetic trees. Syst. Biol. 49:160–171.
Novacek, M. J. 2001. Mammalian phylogeny: Genes and supertrees. Current Biology 11:R573–R575.
Page, R. D. M. 2004. Taxonomy, supertrees, and the tree of life. Pages 247–266 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Philip, G. K., C. J. Creevey, and J. O. McInerney. 2005. The Opisthokonta and Ecdysozoa may not be clades: Stronger support for the grouping of plant and animal than for animal and fungi and stronger support for the Coelomata than Ecdysozoa. Mol. Biol. Evol. 22:1175–1184.
Philippe, H., N. Lartillot, and H. Brinkmann. 2005. Multigene analysis of bilaterians corroborate the monophyly of Ecdysozoa, Lophotrochozoa, and Protostomia. Mol. Biol. Evol. 22:1246–1253.
Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. H. Holland, and D. Casane. 2004. Phylogenomics of eukaryotes: Impact of missing data on large alignments. Mol. Biol. Evol. 21:1740–1752.
Pisani, D., and M. Wilkinson. 2002. MRP, taxonomic congruence and total evidence. Syst. Biol. 51:151–155.
Purvis, A. 1995. A modification to Baum and Ragan’s method for combining phylogenetic trees. Syst. Biol. 44:251–255.
Ragan, M. A. 1992. Phylogenetic inference based on matrix representation of trees. Mol. Phylogenet. Evol. 1:53–58.
Rokas, A., B. L. Williams, N. King, and S. B. Carroll. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425:798–804.
Ronquist, F. 1996. Matrix representation of trees, redundancy, and weighting. Syst. Biol. 45:247–253.
Ronquist, F., J. Huelsenbeck, and T. Britton. 2004. Bayesian supertrees. Pages 193–224 in Phylogenetic supertrees: Combining information to reveal the tree of life (O. R. P. Bininda-Emonds, ed.). Kluwer Academic, Dordrecht, the Netherlands.
Sanderson, M. J. 2003. r8s: Inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19:301–302.
Sanderson, M. J., A. Purvis, and C. Henze. 1998. Phylogenetic supertrees: Assembling the trees of life. Trends Ecol. Evol. 13:105–109.
Sanderson, M. J., and M. F. Wojciechowski. 2000. Improved bootstrap confidence limits in large-scale phylogenies, with an example from Neo-Astragalus (Leguminosae). Syst. Biol. 49:671–685.
Seo, T.-K., H. Kishino, and J. L. Thorne. 2005. Incorporating gene-specific variation when inferring and evaluating optimal evolutionary tree topologies from multilocus sequence data. Proc. Natl. Acad. Sci. USA 102:4436–4441.
Springer, M. S., and W. W. de Jong. 2001. Which mammalian supertree to bark up? Science 291:1709–1711.
Swofford, D. L. 2003. PAUP*: Phylogenetic analysis using parsimony (*and other methods), version 4.0b10. Sinauer Associates, Sunderland, Massachusetts.
Wilkinson, M., J. A. Cotton, C. Creevey, O. Eulenstein, S. R. Harris, F.-J. Lapointe, C. Levasseur, J. O. McInerney, D. Pisani, and J. L. Thorley. 2005. The shape of supertrees to come: Tree shape related properties of fourteen supertree methods. Syst. Biol. 54:419–431.
Wilkinson, M., J. L. Thorley, D. T. J. Littlewood, and R. Bray. 2001. Towards a phylogenetic supertree of Platyhelminthes? Pages 292–301 in Interrelationships of the Platyhelminthes (D. Littlewood and R. Bray, eds.). Chapman Hall, London.
Wolf, Y. I., I. B. Rogozin, and E. V. Koonin. 2004. Coelomata and not ecdysozoa: Evidence from genome-wide phylogenetic analysis. Genome Res. 14:29–36.
Yan, C., J. G. Burleigh, and O. Eulenstein. 2005. Identifying optimal incomplete phylogenetic data sets from sequence databases. Mol. Phylogenet. Evol. 35:528–535.

First submitted 11 July 2005; reviews returned 30 August 2005; final acceptance 4 November 2005
Associate Editor: Olaf Bininda-Emonds