An Algorithm for Generating Representative Functional Annotations Based on Gene Ontology
In-Yee Lee1, Jan-Ming Ho1, Wen-Chang Lin2
1 Institute of Information Science, Academia Sinica, Nankang, Taipei, Taiwan 115, R.O.C.
2 Institute of Biomedical Sciences, Academia Sinica, Nankang, Taipei, Taiwan 115, R.O.C.
1 {iylee, [email protected]},
[email protected]
Abstract
We address the problem of providing highly representative descriptions in automated functional annotation. For an uncharacterized sequence, a common strategy is to infer annotations from those of its well-characterized homologues. In many circumstances, however, this strategy fails to produce meaningful annotations. Using the information revealed by the structured vocabularies of Gene Ontology (GO), we propose a quantitative algorithm for assigning representative annotations. We established a confidence function that reflects both the precision and the coverage of a candidate annotation, and derived the function's parameters from analyses of significant forms of candidate distributions on the GO graph. We tested the algorithm with BIO101 (http://BIO101.iis.sinica.edu.tw), our automated annotation system that supports functional annotation workflows for expressed sequence tags (ESTs). In our experiments, the algorithm produced representative and meaningful functional annotations.
1. Introduction
High-throughput methods for genomic sequencing have produced large repositories of public-domain data [17]. To make knowledge sharing and utilization more efficient, large databases are being established [11] [4] [14], but their data gains value only after annotation, that is, after the functional or structural properties of the sequences are determined. In this paper we propose an automated approach to the labor-intensive job of annotating uncharacterized sequences (each denoted here as Sequ). Using a sequence similarity algorithm [1], data from different organisms are analyzed to identify a set of well-characterized homologues (Shom) for each Sequ. Based on the functional
annotations associated with Shom, it is possible to construct an informative and representative annotation for Sequ.
Because databases use heterogeneous vocabularies and formats, functional information is difficult to manage electronically. For this reason, many designers and researchers are promoting functional annotations based on Gene Ontology (GO) [15] [5] [7] [8] [11] [12] [13] [14] [16] [17] [18]. For instance, the Gene Ontology Consortium [15] is creating sets of domain-specific vocabularies for describing molecular characteristics across various organisms. The GO term hierarchy consists of three ontologies: molecular functions, biological processes, and cellular components. GO terms and their associated “is-a” and “part-of” relations form directed acyclic graphs (DAGs). Since GO provides vocabularies that are not only controlled but also structured, we believe that GO annotations yield information that can assist automated annotation.
In automated annotation, if the function of a Sequ is identical to those of the sequences in Shom, then the annotations inferred from Shom will be meaningful. However, there are many cases where this is not true. One example is Rosetta stone proteins [10], which have homologues associated with two different proteins yet are fused into a single polypeptide chain. Such proteins are better annotated using functions common to both homologues than with a specific annotation inferred from a single homologue within Shom, even if that homologue is highly similar to Sequ. In this case, the structured GO graph is ideal for identifying more representative annotations. The set of GO terms associated with the homologues in a specific Shom (denoted STermh) forms a unique distribution on the DAG. The information revealed by such distributions can be analyzed to determine which terms (candidates) best represent the functional properties of a Sequ. In a GO DAG, a parent node describes a function exhibited by all of its child nodes.
Terms that are lower in height (i.e., closer to the root) describe more general functions; the greater the height, the more specific the function.
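As a minimal illustration of the parent/child structure and the notion of height described above, the following Python sketch stores a toy DAG as a map from each term to its parent terms. The term names are hypothetical placeholders rather than real GO terms, and the sketch assumes that a term's height is the smallest number of edges separating it from a root; the text does not state how height is measured when a term has several parents.

```python
# Hypothetical toy DAG (not real GO terms): each term maps to its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # unlike a tree node, a DAG node may have several parents
    "D": ["C"],
}


def height(term, parents):
    """Assumed definition: fewest edges from `term` up to a root term."""
    direct_parents = parents[term]
    if not direct_parents:
        return 0  # a root has height 0
    return 1 + min(height(p, parents) for p in direct_parents)


# Heights for every term: smaller values mean more general terms.
heights = {term: height(term, PARENTS) for term in PARENTS}
```

Under this assumed definition, "D" is the most specific term of the toy graph and "root" the most general.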
Evaluating how well a term describes Sequ requires two criteria: a) coverage, meaning whether the term covers all aspects of Sequ’s functions; and b) precision, meaning the term’s specificity. The most representative term (MRT) of an STermh must balance coverage and precision. To achieve this, we propose a quantitative model that assigns each candidate term a confidence value reflecting both criteria. Combinatorial analyses of different distributions are performed to establish an accurate confidence function; by classifying and studying various STermh distributions and their proper MRTs, we can derive the upper and lower bounds of the confidence function’s parameters. For any Sequ, the terms assigned the highest confidence values are considered MRTs and can serve as automated functional annotations.
2. Description of the Algorithm
We use the following scenario to explain our proposed model. To annotate Sequ, similarity matching is performed against well-characterized sequences. Two identified homologues, Seq1 and Seq2, are annotated with the GO terms C1 and C2, respectively. On the associated DAG, C1 and C2 share a parent (P) whose functional properties are common to both C1 and C2. Even though term P is not associated with any matching homologue, it serves as the best description of Sequ. The three propositions that result from this observation are:
Proposition 1: The set of candidate terms (STermc) includes all STermh terms and their parent terms; that is, STermc ≠ STermh, STermh ⊆ STermc.
Proposition 2: Each candidate term is given a confidence value CV(Termi). The candidate with the highest value is designated as the most representative term (MRT). Formally, MRT(Sequ) = {Termi | Termi ∈ STermc, CV(Termi) ≥ CV(Termj) for all Termj ∈ STermc}.
Proposition 3: The confidence values of child candidates propagate to and accumulate at their parent candidates.
Proposition 3 allows the confidence values to reflect a term’s coverage; however, simply propagating child term values sacrifices precision in favor of coverage. To adequately reflect precision, we add Proposition 4.
Proposition 4: The weight of a child term decays exponentially with each propagation to a lower level.
Proposition 5: The distance between two nodes on the GO DAG is defined as the minimum number of edges between them.
Based on these propositions, we suggest the following confidence function for the candidate terms. From Proposition 1, a unique STermc is inferred from each STermh.
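Propositions 1 and 5 can be sketched in Python for the C1/C2/P scenario. This is only an illustrative sketch under stated assumptions: the toy graph, the extra "root" node, and the function names are hypothetical; "parent terms" is read here as all ancestors up to the root; and the distance is computed only along child-to-ancestor edges, which is the direction in which confidence values propagate.

```python
from collections import deque

# Hypothetical toy DAG for the scenario: child term -> list of parent terms.
PARENTS = {
    "C1": ["P"],
    "C2": ["P"],
    "P": ["root"],
    "root": [],
}


def ancestor_distances(term, parents):
    """Map `term` and each of its ancestors to the minimum edge count from `term`."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in dist:  # breadth-first: first visit is the shortest path
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def candidate_terms(homologue_terms, parents):
    """Proposition 1 (assumed reading): STermh plus every ancestor of its terms."""
    candidates = set()
    for term in homologue_terms:
        candidates.update(ancestor_distances(term, parents))
    return candidates


s_term_h = {"C1", "C2"}                        # terms annotating Seq1 and Seq2
s_term_c = candidate_terms(s_term_h, PARENTS)  # candidate set STermc
```

In this toy graph the candidate set contains P even though no homologue is annotated with it, which is what allows P to be selected as the MRT.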
The scores of the terms Termh ∈ STermh are used to determine scores for each Termc ∈ STermc, Termc ∉ STermh. The confidence value of such a Termc is initially set to 0, whereas each Termh is assigned an initial confidence value of 1*Score(Termh), where Score(Termh) denotes the number of sequences in Shom that are annotated with Termh. Formally,
CVself(Termh) = Score(Termh),
CVself(Termc) = 0 for Termc ∉ STermh.
Let Schi_c = {Termchi_c | Termchi_c is a descendant of Termc}. From Proposition 3, a portion of Termc’s confidence value is contributed by the scores of the terms Termchi_c ∈ Schi_c. The portion contributed by a Termchi_c is defined as CVcon(Termchi_c):
CVcon(Termchi_c) = CV(Termchi_c) * α^d, 0