2005 ACM Symposium on Applied Computing
A New Approach for Gene Prediction Using Comparative Sequence Analysis
Rong Chen and Hesham H. Ali
Department of Computer Science, College of Information Science and Technology
University of Nebraska at Omaha, Omaha, NE 68182-0116
[email protected]
ABSTRACT
Two main approaches to gene recognition exist. The intrinsic approach uses statistical algorithms to find the optimal parse of a genome fragment into protein-coding and non-coding parts, based on global and local compositional features of a specific species. The other approach is similarity-based: it finds regions similar to proteins, cDNAs or ESTs that are already known [11]. Both approaches are useful and are already implemented in some gene-finding programs, but each has its limitations. The specificity and sensitivity of predictions by statistical methods are still insufficient, especially for long genomic fragments containing multiple genes. Moreover, statistical methods rely on information derived from known genes of a specific species, so they predict poorly for newly sequenced organisms. The advantage of similarity-based approaches is that they rely on accumulated pre-existing biological data and thus produce biologically relevant predictions [16]. However, similarity-based methods tend to be biased towards finding genes that are similar to known genes and cannot find genes encoding new proteins [2, 8].
The availability of large fragments of genomic DNA makes it possible to apply comparative genomics to the identification of protein-coding regions. In this work, a comparative analysis is conducted on homologous genomic sequences of organisms at different evolutionary distances, and conservation of the noncoding regions is found between closely related organisms. In contrast, more distant organisms show much less intron similarity but also less conservation of exon structure. This study seeks to illuminate the impact of evolutionary distance on the performance of the proposed gene-finding program, which is based on cross-species sequence comparison. Based on the findings of the comparative study and on training data sets, we propose a model by which coding sequence can be identified by comparing sequences of multiple species, both close and approximately distant. The reliability of the proposed method is evaluated in terms of sensitivity and specificity, and the results are compared to those obtained by other popular gene prediction programs. Provided that sequences can be found from other species at appropriate evolutionary distances, this approach could be applied to newly sequenced organisms for which no species-dependent statistical models are available.
Keywords Gene Prediction, comparative genomics, coding and noncoding regions, multiple species.
1. INTRODUCTION
The sequencing of the genomes of many organisms has been completed. However, the biological interpretation of these genetic sequences, i.e. annotation, is not keeping pace with this avalanche of raw sequence data. There is still a real need for accurate and fast tools to analyze these sequences and, especially, to find genes and determine their functions [11]. Many gene prediction programs are currently publicly available. In general, these approaches consist of finding evidence for a gene and combining that evidence to predict the structure of the gene, for either prokaryotes or eukaryotes.
SAC’05, March 13-17, 2005, Santa Fe, New Mexico, USA. Copyright 2005 ACM 1-58113-964-0/05/0003…$5.00.
Because large fragments of genomic DNA, and even complete chromosomes of some eukaryotic organisms, are now available, it is possible to apply comparative genomics to the identification of protein-coding regions [8]. This approach is based on the theory of evolution: during evolution, the functional parts of a genome are subject to stronger selective pressure, so they tend to be more conserved than non-functional parts. Thus, when the genomes of different organisms are compared, local sequence conservation is usually an indication of biological functionality [2, 8]. In gene-finding problems, it can therefore be assumed that protein-coding regions, as functional parts of sequences, are more conserved than non-coding regions. In our work, we conducted a comparative analysis of homologous genomic sequences of organisms at different evolutionary distances to assess the conservation of the coding regions as well as the non-coding regions. We sought to determine the impact of evolutionary distance on the performance of a gene-finding program based on two-species sequence comparison. We then propose a model that incorporates a third species at a closer evolutionary distance, and by which coding sequence can be identified by comparing sequences of multiple species.
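The assumption that local conservation signals function can be illustrated with a minimal sketch. It is not the model proposed in this paper: it assumes the two sequences have already been aligned to equal length (with `-` for gaps), and the function names, window size, step and identity threshold are hypothetical values chosen only for illustration.

```python
def window_identity(aligned_a, aligned_b, window=30, step=10):
    """Return (start, fraction_identical) for sliding windows over a pairwise alignment.

    Both inputs must be the same length; '-' marks a gap and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, matches / window))
    return results


def conserved_windows(aligned_a, aligned_b, window=30, step=10, threshold=0.75):
    """Keep only windows whose identity reaches the (illustrative) threshold."""
    return [(s, f) for s, f in window_identity(aligned_a, aligned_b, window, step)
            if f >= threshold]


if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "ACGTTTTTAC"
    print(conserved_windows(a, b, window=4, step=2, threshold=0.75))
```

Windows that pass the threshold are only candidates for functional regions; as discussed in the text, conservation between close species can also extend into introns and other non-coding sequence.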
In Section 2, a summary of previous work is presented. In Section 3, the conservation of genomic structures of organisms from five species is examined. In Section 4, a novel gene-recognition approach by which protein-coding regions can be identified by comparison of multiple genomic sequences is proposed. The experimental results are summarized in Section 5.
finding methods based on sequence comparison of organisms. Mus musculus (mouse), Gallus gallus (chicken), Xenopus laevis (frog) and Drosophila melanogaster (fruit fly) are chosen for analysis of their relationship with Homo sapiens (human). The selection of species is based on each organism's evolutionary distance from Homo sapiens, the availability of its genome sequence, and the results of previous work.
2. PREVIOUS WORK
One hundred and seventeen orthologous human-mouse gene pairs are available from Batzoglou et al. [2]. According to the authors, these sequences are carefully annotated, so they can be considered a standard of truth [2, 8]. This data set is expanded to include the homologous sequences from the other species by using TBLASTX, which compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. GenBank release 137.0 was searched to obtain the complete data set. A dynamic programming algorithm was implemented to determine the similarity among the sequences. In the following sections, the comparison of exon structures, intron structures and the overall gene structures is described separately.
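The text does not specify which dynamic programming algorithm was used. As an illustration only, the sketch below uses standard Needleman-Wunsch global alignment with a linear gap penalty and reports percent identity over the alignment; the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function names are assumptions, not the authors' parameters.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same non-gap character."""
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    s, x, y = global_alignment("ACGT", "AGT")
    print(s, x, y, percent_identity(x, y))
```

The full score matrix needs memory proportional to the product of the two sequence lengths, so for sequences of genomic length a banded or linear-space variant would be the more practical choice.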
Recently, several algorithms have been developed for automated gene recognition by genomic comparison. ROSETTA [2] is the first program for gene recognition based on cross-species (human-mouse) comparison of genomic DNA from two closely related organisms. In particular, it assumes that the exon-intron structure is conserved in the two sequences and, further, that the corresponding exons in the two genes have roughly the same length [2, 11]. Unlike ROSETTA, and AGenDA [8] depend little on species-specific properties such as codon usage or the nucleotide distribution. Pro-Gen [9], on the other hand, does not assume conservation of the exon-intron structure, and sequences are compared at the protein level instead of the nucleotide level. Theoretically, the advantage of this approach is that it is not species-specific. In practice, however, performance depends on the evolutionary distance between the compared sequences [11]. Previous results show that the relationship is not straightforward. Moreover, the similarity might not cover entire coding exons but might be limited to their most conserved parts. Alternatively, it may sometimes extend to introns and/or to the UTRs and promoter elements, especially when the genomes are evolutionarily close [11]. This makes it difficult to distinguish functional from non-functional sequence with this approach. It is therefore necessary to take more information into account to identify conserved gene structures in syntenic genome sequences [8, 11].
3.1 The comparison of exon structures
In the human-mouse comparison, the number and length of exons are well conserved. Among all of the human-mouse gene pairs, 96% have the same number of exons, and 75% have identical exons; these results are mostly in accordance with Batzoglou et al. [2]. For human-chicken gene pairs, 77% are identical in exon number, and among the aligned exons. Compared to the strong conservation of exon-intron structure in human-mouse and human-chicken gene pairs, much less conservation of exon structure is seen at greater evolutionary distances. The number and length of exons are more divergent in human-frog and human-fruit fly gene pairs. 45% of human exons have no aligned region in the frog sequences; another 20% exons, only a small portion (