gene ontology similarity measures based on linear order statistics ...

Report 3 Downloads 78 Views
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems © World Scientific Publishing Company

GENE ONTOLOGY SIMILARITY MEASURES BASED ON LINEAR ORDER STATISTICS JAMES M. KELLER Electrical and Computer Engineering Dept. University of Missouri-Columbia Columbia, MO 65211

JAMES C. BEZDEK Computer Science Dept. University of West Florida Pensacola, FL 32514

MIHAIL POPESCU Health Management and Informatics Dept. University of Missouri-Columbia

NIKHIL R. PAL Electronics and Communication Sciences Unit Indian Statistical Institute Calcutta - 700 108, INDIA

JOYCE A. MITCHELL Department of Biomedical Informatics School of Medicine University of Utah Salt Lake City, Utah 84112-5750

JACALYN M. HUBAND Computer Science Dept. University of West Florida Pensacola, FL 32514 Received (received date) Revised (revised date) Accepted (accepted date)

The standard method for comparing gene products (proteins or RNA) is to compare their DNA or amino acid sequences. Additional information about some gene products may come from multiple sources, including the set of Gene Ontology (GO) annotations and the set of journal abstracts related to each gene product. Gene product similarity measures can be based on evaluating sets of descriptor terms found in the GO taxonomy, and/or the index term sets of the related documents (MeSH annotations). While our techniques can be applied to term sets from any taxonomy, we restrict our examples in this article to GO annotations. We investigate the use of linear order statistics (LOS) to build similarity relations on pairs of terms that are used in the GO as linguistic descriptors of genes and gene products. One of our objectives is to investigate the construction and utility of visual

1

2

Keller et al. assessments of relational data (in this case, dissimilarity matrices) for discovering tendencies of groups of gene products to "cluster together". We use gene product data derived from a group of 194 gene products representing three protein families extracted from ENSEMBL. Our examples suggest that LOS similarity measures are more effective than traditional sequence-based similarity measures at capturing relationships between pairs of gene products in ENSEMBL families when annotation information is available. We show examples of how these similarity measures can assist in knowledge discovery and gene product family validation. Keywords: Gene similarity, Gene Ontology (GO), Linear Combination of Order Statistics (LOS), Visualization of gene families, Knowledge Discovery, ENSEMBL, SWISS-PROT

1. Introduction The Human Genome Project was a multibillion-dollar global research project that resulted in the knowledge of the full sequence of DNA for the human species and associated model organisms. Much of the research community is now concentrating on the analysis of gene structure, function, expression and regulation. Many experimental situations involve the investigation of sets of genes to determine their function and whether they are similar to each other. In order to pursue these experiments, the research community needs methods to measure the similarity (or dissimilarity) among genes and gene products. Many traditional methods are available, but new methods are needed for these new types of investigations. The traditional methods consider their DNA or amino acid sequences by using algorithms including BLAST1, Smith-Waterman20, Needleman-Wunch15, and TCoffee16, among others. Further analyses often explore the similarity between gene product groups that are regulated together11,12,17,24. Another set of investigations considers whether sets of protein products are produced by the same sequence of DNA but arise from alternate splicing of a RNA transcript, producing gene isoforms6,9. It is also useful to consider groups of closely related genes as part of a “gene family”, defined as a group of genes that make similar products (www.ornl.gov/TechResources/ Human_Genome/glossary/glossary_g.html). These families of genes are believed to have evolved from a single ancestral gene by a series of duplications and later divergence so that the locations of the genes may be dispersed within a single organism or even found in separate species. Algorithms to cluster genes and gene products into gene families are also part of the standard analysis packages and available through web sites like ENSEMBL3 . The problem with analysis of a group of proteins that one would find in a microarray experiment (and also those contributed to GenBank) is that the set consists of several types of amino acids sequences. These can be snippets (pieces) of gene products that have been artificially broken during the experiment but that appear to be regulated in an interesting fashion. They can be proteins that come from the same gene family but are active simultaneously (such as the alpha and beta chain of hemoglobin coming together to produce an active hemoglobin protein). They can also be genetic isoforms, coming from the same RNA transcript but alternatively spliced and having somewhat different functions.

Gene Ontology Similarity Measures Based on Linear Order Statistics

3

Our goal is to extend the set of tools available to biological researchers. We report here on the investigation of Gene Ontology (GO) based measures of (dis)similarity between gene products. In Ref. 9 we developed fuzzy measures on sets of terms from the GO to build similarity values between pairs of gene products. In this paper, we investigate new (dis)similarity measures based on order statistics7; in particular, on Linear Sums of Order Statistics (LOS) on sets of GO terms. We use these measures both to search for and assess clusters within gene sequence databases and to assist in knowledge discovery with respect to the results obtained through sequence comparison. Visual examination of images resulting from these measures reveals genetically significant relationships between sets of genes and gene products. With a pilot set of data, we show that the LOS measures computed from GO terms can be used to identify gene product families from taxonomybased information and assist in knowledge discovery and gene product family validation. 2. The Gene Ontology as a source for (dis)similarity measures The Gene Ontology (GO) [http://www.geneontology.org] - a representation of terms associated with genes and gene products - is an attempt to impose a controlled vocabulary and parent-child structure on terms used in the research literature. A major goal of the GO is to ensure consistent descriptions of gene products in different databases22. The GO project began in 1998, and now is used to annotate gene products from many species. The GO consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their cellular components (CC), biological processes (BP) and molecular functions (MF) in a species-independent manner. The three subtrees of the Gene Ontology (CC, BP and MF) are connected only at the root node and can be considered independently, if desired. Terms in the GO subtrees are organized as directed acyclic graphs (DAGs), which differ from true hierarchies in that a given 'child' node in a DAG can have many 'parents' (less specialized terms). The important point is that as you go down a path in the GO from its root (which is "GO"), terms become more specific, and hence, more important to the description of the gene product. Species distinctions in annotation are handled by the “sensu” terms that refer to a specific species in the manner in which a function is handled. This preserves the “true path rule” within the ontology: there is always a strict truth to the path from the annotation used for a gene product all the way to the top node of the branch. In addressing nodes in the GO, we can either use the GO identifier (ID), for example, g = GO:004722 or the term name itself, T = “protein serine/threonine phosphatase activity”. There is a one-to-one correspondence between GO IDs and the corresponding terms. So, we will refer to the term as T(g) or just as T if no confusion exists. In this paper, we will abbreviate the GO ID down to just the numeric part minus the leading zeros, for example, T(4722) = “protein serine/threonine phosphatase activity”. In some cases, it will be easier for us to list the set of GO IDs instead of the list of corresponding terms, for example, we say that G1 (representing gene MTMR1) has GenBank ID AAC79117, and is represented in ENSEMBLE by the term set G1 = {4722, 4725, 16787, 6470, 8372}.

4

Keller et al.

3. Construction of GPD194.12.10.03 The basis of our illustrative computations is a set of 194 human gene products that were clustered into three protein families using the Markov clustering algorithm3,23. The gene products (and their) families were retrieved on December 10, 2003 using the Ensemble browser [http://www.ensembl.org/]. Table 1 itemizes several characteristics of these clusters, which we call the GPD19412.10.03 data. Since this data is dynamic, we include the date on which it was extracted in the name. We represent GPD as a set F of c protein families (clusters); that is, F = {F1, F2, ...,Fc}, where the i-th family, Fi, contains individual gene products. The number of gene products, Ni, within the i-th family represents the cardinality of the i-th family: |Fi|=Ni. Table 1. Characteristics of the GPD19412.10.03 data set Ensembl ID ENSF00000000339 ENSF00000000073 ENSF00000000042

Ni = Number of Individual Human Gene Products 21 87 86

Fi = Protein Family

No. of genes

myotubularin receptor precursor collagen alpha chain

7 7 13

All myotubularins have protein tyrosine phosphatase enzymatic activity and are involved in dephosphorylation. Receptor precursor proteins are integral to the plasma membrane and are involved in the fibroblast growth factor signaling pathway influencing cell division and cell differentiation. Collagen alpha chain genes are involved in producing the alpha chain of type 1 collagen that adds strength and structure to connective tissue found in ligaments, bones and cartilage. The 194 human gene products are mapped to 27 genes. The myotubularin family had 7 genes: {MTMR1-4, MTMR6-8}, the receptor precursor had 7 genes: {FGFR1-4, TIE1, RET, TEK} and the collagen alpha family had 13 genes: {COL9A2, COL24A1, COL3A1, COL4A3, COL9A1, COL21A1, COL1A2, COL2A1, COL4A2, COL4A1, COL5A3, COL4A6, COL27A1}. 4. Construction of sequence similarities for GPD19412.10.03 Each of the 194 gene products in GPD19412.10.03 is used to retrieve its protein sequence from the ENSEMBL database. The 194 amino acid sequences are then submitted to both the Smith-Waterman routine20 and the BLAST procedure1 to obtain a sets of pairwise numerical similarities {sij : sij ∈ [0, 1] ; 1 ≤ i, j≤ 194} and {bij : bij ∈ [0, 1] ; 1 ≤ i, j≤ 194}. To obtain sij ∈ [0, 1] in the Smith-Waterman technique, we used the normalized alignment length, that is s ij =

alignment _ length (gene _ product i , gene _ product j ) min{length (gene _ product i ), length (gene _ product j )}

For the BLAST similarity, we used the truncated E-score3 , that is:

.

Gene Ontology Similarity Measures Based on Linear Order Statistics

5

if E - score < 0 ⎧0 ⎪ b ij = ⎨1 if E - score > 100 ⎪E - score(i, j)/100 else ⎩

We array these similarities into a 194x194 matrices S and B which have as the i,j-th elements [S]ij = s ij and [B]ij = bij , respectively. To get dissimilarities (the usual basis for relational pattern recognition algorithms), we compute dsij=1-sij and dbij=1-bij for all i and j. These are also arrayed as 194x194 matrices DS and DB with i,j-th element denoted by [DS]ij = ds ij and [DB]ij = db ij . We let DSσ, DBσ denote any permutation of the rows and columns of DS and DB, and we will refer to them as a scrambled versions of the dissimilarity matrices. GPD19412.10.03 is a set of gene products; DS and DSσ (respectively, DB and DBσ) are numerical representations of dissimilarities between pairs of gene products in GPD19412.10.03. Although our discussion is restricted to the 194 gene products we use here, these constructions apply to any finite number of gene products. We rely heavily on visual assessment of GPD19412.10.03 and transformations of it using image representations of the matrices of dissimilarity values between gene products. Let I(DS) and I(DB) be the images obtained by submitting DS and DB to the MATLAB command IMAGESC. Specifically, IS194 = IMAGESC(DS) and ISσ,194 = IMAGESC(DSσ). In these images, black corresponds to dsij = 0 (minimum dissimilarity = maximum similarity); white corresponds to dsij = 1 (maximum dissimilarity = minimum similarity); and grays (in-between black and white) are scaled linearly. Similar definitions apply to DB.

20

40

60

80

100

120

140

160

180

20

40

60

80

(1a) IS194

100

120

140

160

180

(1b) ISσ,194

6

Keller et al.

(1c) IB194

(1d) IBσ,194

Figure 1. Ordered and scrambled sequence dissimilarity images for GPD19412.10.03 Figure 1(a) displays IS194, the image corresponding to dissimilarity data DS. The 194 gene products are ordered so that, as in Table 1, the first 21 come from the myotubularin family, the next 87 from the receptor precursor family, and the last 86 from the collagen alpha chain family. This ordering should impose some block structure on IS194, and some diagonal block structure is evident - the three large, dark blocks roughly correspond to proteins in the three families. Of course, it is easier to "see this" in Figure 1(a) once we know that that there are three families, and that the sequences from the same family appear in "family order". Still, our eyes tell us that there is a lack of strong similarity within families in Figure 1(a), as evidenced by lighter streaks within the large darker blocks. Figure 1(b) displays an image ISσ,194 of one scrambled version, say DSσ,194, of DS, obtained by randomly permuting the indices of the rows and columns of D194. Please compare Figures 1(a) and 1(b). This should give you a better sense of how useful – and disparate – visual representations of dissimilarity data can be. If you saw only Figure 1(b), you would hardly guess that there was a "family relationship" among the entities underlying this image. You would be tempted to conclude that no more than a few of the 194 gene products represented by the rows and columns of DSσ,194 had much similarity to each other. And certainly, you would not guess that there were biologically informative groups (or clusters) in the source data, and that Figures 1(a) and 1(b) were related to each other in any significant way. But they are. Figures 1(c) and 1(d) display the corresponding images for BLAST dissimilarity matrices on the GPD19412.10.03 sequence data. One observation is that the BLAST scores tend to be closer to binary values. This is good if two gene products from the same family have a very low dissimilarity (as many from family 1 seem to have), but not so good if the low scores are obtained across families and if the very high values are computed within a family (as in family 3). We believe that this observation favors the use of the SmithWaterman algorithm for sequence comparisons because we can exploit the partial matching in subsequent processing. In this article we want to investigate the construction and utility of visual assessments of relational data (in this case, dissimilarity matrices) for

Gene Ontology Similarity Measures Based on Linear Order Statistics

7

discovering tendencies of groups of gene products to "cluster together". The genetic significance of any clusters found (which is the ultimate goal of our study) will be addressed in a future paper. How much faith should we place in the apparent family groups seen in Figure 1(a)? Recall that members of the three families were “defined” using the Markov clustering algorithm (MCL)3,23. However, all clustering algorithms, including MCL impose - via the models that they optimize - assumptions about the nature and structure of the data they process. Clustering outputs are ALWAYS at the mercy of these mathematical and computational assumptions. These assumptions, explicit or implicit, may or may not be realized by the physical objects themselves - in this case, gene products - represented by the numerical data that are clustered. We want to emphasize that EVERY clustering algorithm will produce clusters upon request - that is their job. The usefulness and validity of algorithmically suggested clusters must be established by domain experts. Saying that "the computer produced these groups" is not enough. The results we display in this paper may reinforce those of previous studies, e.g., in Ref. 3, or they may differ slightly, or even greatly. This does not suggest that any particular set of clustering outputs are "right" or "wrong". When the outputs of different clustering algorithms on the same or similar data sets agree, our confidence in algorithmic clusters increases. But when results from different algorithms are markedly different, this serves notice to all involved: "look out our algorithms can't find a common ground - and there may or may not be one". In this latter case, further computational experiments are not optional - they are mandatory, if we intend to pass judgment on physical processes (in this case, meaningful groupings of gene products into families) based on computer outputs. 5. Construction of GO similarities for GPD In this section we develop our method for constructing a (dis)similarity measure on a generic gene product data set (GPD) using sets of terms from the GO. Discussion of each step in the algorithm follows the pseudo-code. The (%) symbol signifies a comment line. Part A of the algorithm is the data collection step. First, note that we have specified the (calendar) date as an input to the algorithm. This is necessary since the databases used to generate GPD sets are updated almost daily. This means that repeatable results are possible only when the date of data acquisition is recorded, and you can get the same data as used in some reported study (such as this one). If you want to test our methods on GPD19412.10.03, this data can be found at our gene ontology working group web site: (http://www.missouri.edu/~popescum/). We also preface our detailed discussion of the algorithm with the remark that equations (A-E) are the ones we chose to use in this paper. But they are hardly sacred; users may prefer other choices for each of the functions shown there, and the algorithm should still yield useful results. Algorithm GO.DISSIM.GPD Part A: Extract Gene data from a database, such as ENSEMBL Inputs:

A set of gene families to be studied.

8

Keller et al.

Outputs:

Text files containing a list of genes IDs, gene products IDs, and associated GO terms IDs; The date of extraction.

Steps: 1. Using a database, such as ENSEMBL, extract a list of information for each gene family, including Gene IDs, Protein IDs, and GO IDs. % The GO IDs represent the GO node identifiers (equivalently, the annotations (terms) in the GO). % In general, we should use Gene Product IDs instead of Protein IDs, but ENSEMBL exports % Protein IDs and hence, we will use that term here.

2. Sort through the lists and generate text files for the following information: a. A composite list of distinct GO IDs: GO_Set = {g 1 ,..., g N G } b. A composite list of Protein IDs. c. For each protein ID, create a list containing the Protein ID and its corresponding GO terms: {( Protein IDi, GO_termsi ): 1 ≤ i ≤ N}, where GO_Termsi = {Ti1 ,..., Ti, n i } . % Each Tip is a GO term, equivalent to the corresponding GO ID, associated with Protein IDi.

Table 2. Members of Family F1 of GPD19412.10.03 Gene ID MTMR1 MTMR1 MTMR1 MTMR2 MTMR2 MTMR2 MTMR2 MTMR2 MTMR3 MTMR3 MTMR3 MTMR3 MTMR3 MTMR4 MTMR6 MTMR6 MTMR6 MTMR6 MTMR7 MTMR8 MTMR8

Protein ID AAC79117 AAD40368 CAA12271 AAC12865 AAC51682 AAC79118 AAH30779 BAA83025 AAC79119 AAF40203 AAF40204 AAF40205 BAA20826 AAH35609 AAC78841 AAH40012 AAL01037 CAD89918 AAC77820 AAH12399 BAA90964

GO_TERMS 4722 4725 16787 6470 8372 4722 4725 16787 6470 8372 4722 4725 16787 6470 8372 4722 4727 16787 6470 7517 8151 8372 8138 4722 4727 16787 6470 7517 8151 8372 8138 4722 4727 16787 6470 7517 8151 8372 8138 4722 4727 16787 6470 7517 8151 8372 8138 4722 4727 16787 6470 7517 8151 8372 8138 4727 8138 16787 6470 5737 16020 4727 8138 16787 6470 5737 16020 4727 8138 16787 6470 5737 16020 4727 8138 16787 6470 5737 16020 4727 8138 16787 6470 5737 16020 4721 8270 6470 4722 4725 16787 6470 8372 4722 4725 16787 6470 8372 4722 4725 16787 6470 8372 4722 4725 16787 6470 8372 4725 16787 6118 6470 4721 16787 6470 4721 16787 6470

Gene Ontology Similarity Measures Based on Linear Order Statistics

9

Part A creates the data for Part B and Part C. The data set GPD extracted from a database such as ENSEMBL is a set of N gene products clustered (by MCL for ENSEMBL) into c families, GPD = {F1, F2, ..., Fc}. For each family, GO annotations are obtained from ENSEMBL by clicking “Export a list of genes for this family” and then selecting three attributes to export: “Gene ID”, “Protein ID”, “GO ID”. The resulting text file contains the list of GO terms related to each gene product (“Protein ID”), listed as "GO ID". This file is processed to create the GO_TERM sets for each gene product, the composite set of distinct GO IDs, and the list of Protein IDs. Thus, the i-th gene product in the family is represented by the ni terms {Ti1, Ti2, ... , Ti,ni}. For the i-th and j-th gene products we have G i = {Ti1 ,..., Tik ,..., Tin i } and G j = {Tj1 ,..., Tjk ,..., Tjn j } . As an example, the GPD19412.10.03 data represented in Table 1 come to us in c=3 families, GPD= {F1, F2, F3}. Table 2 lists the N1 = 21 triples (Gene ID1, Protein ID1, GO_TERMS1 ) retrieved at ENSEMBL ID = ENSF00000000339 in Family F1, members of the protein family myotubularin. To save some space, the third column in Table 2 shows GO IDi (the term set for protein i) symbolically as G i = {Ti1,..., Tik , ..., Ti, n i } . Part B: Generate the Coefficients of Association for the selected data Inputs: Outputs:

The list of distinct GO IDs, obtained in Part A; A list of all terms in the GO. A matrix of coefficients describing the association between pairs of GO ID terms.

Steps: 1. Choose a CORPUS (a database such as SWISS-PROT) to use for counting all GO terms. 2. For each term, Tk, in the GO, count the number of occurrences in the CORPUS of the term or any of its children and convert to a probability. ⎛ count (Tk + children of Tk in CORPUS) ⎞ ⎟⎟ 1 ≤ k ≤ |GO| p(Tk ) = ⎜⎜ count (all GO terms in CORPUS) ⎠ ⎝

(A)

% We call p(Tk) a probability, and treat it as such, even though p is not a probability distribution % over the GO since

∑ p(Tk ) ≠ 1 .

Tk ∈GO

3. For each pair of GO IDs in the GO_Set, identify the nearest common ancestor (NCAN): Tij = T(gij ) = T(NCAN(gi , gj)), 1 ≤ i, j ≤ NG. % NG is the number of GO IDs in GO_Set (number of distinct GO_IDs in GPD).

4. Convert the probabilities to coefficients of association (coa) using the node information content (importance), as in Ref. 18: c ij = coa (g i , g j ) = ic(Tij ) = − log 2 (p(Tij ) 1 ≤ i, j ≤ NG (B) and save as a matrix, C, of size NG ¯ NG .

10

Keller et al.

In Part B, our goal is to define a measure of similarity between Gi and Gj, denoted as s(G i , G j ) . Several previous approaches to constructing the similarities {s(G i , G j)} use the sets {Gi} directly8,9,13. Most of these measures use either a crisp or soft count of the number of elements (terms) in the intersection G i ∩G j, relative to some normalizing factor related to the total number of terms in both sets. While easy to define and interpret, measures of this kind provide only a “coarse” snapshot of the commonality between the two sets. A major problem is that when G i ∩G j = ∅ (i.e., they have no common term), most of these measures return a value of zero. At first glance, this seems intuitively reasonable. However, the GO is a directed graph. Thus, a pair of disjoint term sets Gi and Gj may contain terms, say S ∈ G i and T ∈ G j that have a common parent. Certainly S and T are related in this case - they are siblings - so the similarity of Gi to Gj should be greater than zero. We want measures of association that recognize relationships between non-intersecting term sets due to the directed, acyclic nature of the three GO subtrees. Recognition of ancestry in the GO is achieved two ways. First, the numerator of Formula (A), count(Tk + children of Tk ) in CORPUS , recognizes the potential for similarity due to ancestry by counting Tk and all of its descendants in the source database (CORPUS). In the information-content approach adopted here (discussed in detail in Lord et al. 2003), p(Tk) is the relative frequency of occurrence of this event in the CORPUS. The GO annotation count in SWISS-PROT in September 2003 was 83468, so this number was used as the normalizing factor in the denominator of (A) on that date. For each pair of GO IDs in the GO_SET, we define the coefficient of association (COA) in equation (B), c ij = coa (g i , g j ) = ic(Tij ) = − log 2 (p(Tij )) using the same Perl scripts that were used by Lord et al13. This involves finding the Nearest Common Ancestor Node (NCAN) in the GO and assigning the information content of the ancestor to the pair in question, and it is the same approach as used by Resnik18. Figure 2 depicts the computation of the COA between the pair of terms Tip = T(5007) = “Fibroblast growth factor receptor activity”, one descriptor of G22, Protein ID = FGFR1, and Tjq = T(4674) = “protein serine/threonine kinase activity”, one descriptor of G76, Protein ID = FGFR2, both are gene products from family 2: receptor precursor. When ip = jq, the NCAN is also this same node (i.e., we do not have to "climb up" the GO tree to find a node above the pair as in Figure 2). Thus, the “self coefficient of association” for T(5007) is 12.890 (-log2(11/83468)). Here, 11 is the total number of times that the term “Fibroblast growth factor receptor activity” or any of its descendents appeared in the corpus in September 2003, with the total number of counts within the GO being 83468. Similarly, “self coefficient of association” for T(4674), ic(T(4674)) = 7.115. To scale the numbers between 0 and 1, we divided the COA terms by max T∈GO {− log 2 (p(T))} = 15.35 and displayed the results in parentheses. While these values are not similarities, this transformation seems to make them more "appealing". (By definition, similarities must be between 0 and 1, but this is not a requirement for COAs.) Scaling is really unnecessary because, in equation (D), the numerator and denominator both have the same scaling factor, and thus, the factor will always cancel.

Gene Ontology Similarity Measures Based on Linear Order Statistics

11

GO ID=3673 T(3673)=”Gene Ontology”

… NCAN(5007, 4674) GO ID = 4672 T(4672) = " protein kinase activity " ic(T(4672)) =6.409 (0.418)

GO ID=4713

GO ID=19199

GO ID=4714

GO ID = 5007 T(5077) = "fibroblast growth factor receptor activity" ic(T(5007)) = 12.890 (0.840)

coa(5007,4674) = ic(T(4672)) = 6.409 (0.418)

GO ID = 4674 T(4674) = " protein serine/threonine kinase activity ic(T(4674)) = 7.115 (0.464)

Figure 2. Illustration of the NCAN concept. The term with GO ID=5007 and the term with GO ID=4674 have the term with GO ID=4672 as the nearest common ancestor (NCAN). The information content (ic) of the term 4672 is used to calculate the similarity (coa) between the term 5007 and the term 4774. If Tip and Tjq are in different subtrees of the GO (CC, BP, MF), then NCAN(Tip , Tjq ) is the root node "GO". By the method of counting in the corpus, the root node has probability 1, and hence, information content zero. Thus, the COA between pairs of terms in separate subtrees is zero, and so, the similarity between the terms is also zero. Conversely, if the NCAN of Tip and Tjq is “deep” within one branch of the GO, their pairwise similarity will be high, because term specificity increases with depth in the GO. In Figure 2, NCAN(T(5007), T(4674)) = T(4672), and so, the coa(T(5007), T(4674)) = ic(T(4672)) = 6.409 (scaled to 0.418). We note that the NCAN term, “Protein kinase activity” is not in the GO_Set extracted in part A of the algorithm. It is retrieved in part B

12

Keller et al.

only to compute the COA between T(5007) and T(4674). At the end of this step, we have in the matrix C, the COAs between all the pairs of terms that occur in the GO annotations for the distinct gene products in the GO_Set. But formulas (A) and (B) are not the only choices. Building a COA from the length of the shortest path between the two nodes (or as a function of that length) is another approach8. Moreover, the function –log2(p(T)) used here is one of many possible choices discussed in the pioneering work of Sneath and Sokal on numerical taxonomy21. Many of the functions listed in Ref. 21 might probably serve equally well for this preliminary step in building similarity measures on sets of GO terms. A systematic study of "good" measures of association would be an interesting and useful extension of this research.

Part C: Compute the dissimilarity measure The text files generated in Part A and the matrix of coefficients of association generated in Part B. Outputs: A matrix of (dis)similarity measures between pairs of Protein IDs. Steps: For each pair of gene proteins, (say Protein IDi and Protein IDj, 1 ≤ i, j ≤ N), 1. extract values from the matrix of coefficients (C) to create three sets of coefficients ICi,j, ICi,i , ICj,j, where ICi,j = { The coefficients of association for all pairs of terms found by combining each term from Protein IDi with each term from Protein IDj } = {c1 , c 2 , L , c k , L , c n i n j } , renumbered linearly.

Inputs:

% ni is the number of terms associated with the ith protein. % ICij has ninj values when i ≠ j; ICii has nini values when i = j.

2.

Sort each set ICi,j, ICi,i , ICj,j in decreasing order. As an illustration, the sorted list for ICi,j can be written as : IC (ij) = {c (1) ≥ c ( 2) ≥ L ≥ c ( k ) ≥ L ≥ c (nin j ) } . % The sorting is done in decreasing order so that pairs with stronger association can be given % higher weights while combining the coa’s for computing the similarity values.

3.

Choose one of three weight vectors w: i) LOS1: w M = (1.0, 0, ..., 0) ; “Maximum”; ii)

LOS2: w 2 =

1 (1.0, 1.0, 0.5, 0,..., 0) ; “At Least Two”; or m2

iii)

LOS4: w 4 =

1 (1.0, 1.0, 1.0, 1.0, 0.5, 0,..., 0) . “At Least Four”. m4

Gene Ontology Similarity Measures Based on Linear Order Statistics

13

nin j

% The weight vector is normalized so that

∑ wk =1 .

k =1

Most of the time this implies

% that m2 = 2.5 and m4 = 4.5 (see subsequent text for discussion). Note that, LOS1 is a % very conservative one which computes the similarity based on the pair of terms most % strongly associated. Hence LOS1 is likely to produce a crisper relation than, say, % LOS4, which gives equal importance to the top four pairs of terms.

4.

For each set of ordered coefficients, compute the weighted sum of coefficients: s ij (w (*) ) , s ii (w (*) ) , and s jj (w (*) ) , where (∗) = M, 2 or 4 and nin j

s ij ( w (*) ) = ∑ w k ⋅ c ( k ) . k =1

(C) %Adjust the length of

w (*)

as needed to match the number of coefficients in the sets.

5. Compute the normalized similarity between Protein IDi and Protein IDj, for 1 ≤ i, j ≤ N s ij (w (*) ) sn ij (w (*) ) = , where (∗) = M, 2 or 4. max{s ii ( w (*) ), s jj (w (*) )} (D) 6.

Compute dissimilarities from the normalized similarities: dij = 1 - snij ; 1 ≤ i, j ≤ N, and save as a matrix of size N ¯ N.

After the matrix, C, of COAs is built for the pairs in GO_Set, we must combine subsets of them to get a true (dis)similarity relation {s ij = s(G i , G j) : 1 ≤ i, j ≤ N} on GPD before relational clustering methods can be used for knowledge discovery. This step is really the heart of our new methods. As before, let G i = {Ti1 ,..., Ti,n i } and G j = {Tj1 ,..., Tj,n j } be the term sets that represent gene products i and j. To interpret the calculations in equations (C) and (D), remember that there are ninj COAs from the matrix C associated to pairs of terms in Gi and Gj. The nini COA values are ICij = {ic( NCAN(Tip , Tjq )) : 1 ≤ p ≤ n i ;1 ≤ q ≤ n j } . For convenience, and by convention, we have eliminated the double index, and replaced it with an index that simply runs from 1 to ninj. We sort the elements of ICij in decreasing order, resulting in IC (ij) = {c (1) ≥ c ( 2) ≥ L ≥ c ( k ) ≥ L ≥ c ( n i n j ) } , where the subscripts in parenthesis indicate the sorted order. Since the sort will be exactly the same for ICji, we see that IC(ij) = IC(ji). Now we choose a set of ninj weights w = ( w 1 ,..., w n i n j ) according to the LOS operator desired. Then, the (un-normalized) LOS-based similarity between Gi and Gj is a function of the weight vector w: nin j

s ij (w ) = ∑ w k ⋅ c ( k ) . While this sum is a linear combination of the COAs, the operator k =1

itself is highly non-linear because the data are sorted prior to multiplying by the weights

(E)

14

Keller et al. nin j

and adding. Normalization is done first so that ∑ w k = 1 . Note that, for a given pair of k =1

gene product terms sets, we may have n i n j < 4 . For example, Gi may have two terms

while Gj has only 1, resulting in a truncated weight vector (two elements). But, since the nin j

normalization is done over the ninj weights, i.e., so that ∑ w k = 1 , we always get a valid k =1

LOS function. We remark that linear combinations of order statistics are known variously as Ordered Weight Average (OWA) operators25, or generalized order filters4. The advantage of viewing them as OWA operators is that the weight vectors can be designed to correspond to linguistic quantifiers. For our choices, the weights are arranged in descending order to roughly correspond to the linguistic concepts of “Maximum”, “At Least Two” and “At Least Four”. In fact, we picked these because intuitively, we wanted the similarity between gene products to be high if “at least one (or 2 or 4)” of the term pairs were similar. Moreover, LOS operators are a special case of the Choquet integral5,10,14, which was used to develop some GO-based fuzzy dissimilarity measures in9. We note that Speer et al.22 and Lord et al.13 both used the maximum COA computed from pairs of annotating terms as a measure of gene product similarity, but did not perform normalization as indicated below. As shown in equation (C), the first weight (w1) multiplies the largest COA (c(1) ). The second weight (w2) multiplies the second largest COA (c(2) ); and so on. More explicitly, assuming that the order set IC(ij) contains at least 5 elements, the normalized weight vectors and the three corresponding LOS operators look like: LOS1

w M = (1.0, 0, ..., 0) s ij (w M ) = w 1c (1) = c (1)

LOS2

w 2 = (0.4, 0.4, 0.2, 0, ..., 0) s ij (w 2 ) = 0.4 ⋅ (c (1) + ⋅c ( 2) ) + 0.2 ⋅ c (3)

LOS4

w 4 = (0.22, 0.22, 0.22, 0.22, 0.11, 0,..., 0) s ij (w 4 ) = 0.22 ⋅ (c (1) + c ( 2) + c (3) + c ( 4) ) + 0.11 ⋅ c (5) Choosing wM corresponds to using the "least amount of information" (just the largest COA value) we possess about associations between the terms describing each pair of gene products. We seem to use more information when we choose w2, because the largest three COA values are used, but the weights that multiply the three values still sum to 1, so the weights essentially "discount" the importance of the increasingly smaller COA values used. And by the same argument, we expect LOS4 to yield "even better" results than LOS2, because LOS4 uses the five highest COAs between pairs of terms, but again, the five values are heavily discounted by the weights, that always must sum to 1. The phrase “using more information” is a seductive one - we will be alert to collecting evidence that corroborates (or not) our intuitive expectations.

Gene Ontology Similarity Measures Based on Linear Order Statistics

15

To guarantee that equation (C) produces a true similarity measure, we need to normalize it so that s(Gi, Gi) = 1 for all i; and s(G i , G j ) ∈ [0,1] for all i ≠ j. This is accomplished by equation (D), sn ij (w ) =

s ij (w ) max{s ii ( w ), s jj (w )}

.

(D)

When i = j, (D) clearly takes the value 1. Since the self COAs used to construct sii(w) and sjj(w) will always contain non-zero values, the denominator of (D) is well-defined. And since the normalization is done over all ninj weights, we always get a valid similarity measure, even when the weight vector w is truncated. There are an infinite number of weight vectors that could be used to compute LOS operators on data. For example, we might want to use a “trimmed mean” to ignore both the most optimistic and most pessimistic values (most likely candidates for outliers), averaging the “center” numbers. However, the creation of a true similarity measure requires that some normalization be applied so that s(Gi, Gi) = 1 for all i; and s(G i , G j ) ∈ [0,1] for all i ≠ j. We have shown that for three choices of weight vectors, we meet these conditions. Investigating other weight vector choices requires a careful derivation of the normalization procedure. Finally, we convert the normalized similarities S=[snij] into dissimilarities D=[dij]. This is the easiest step to understand, and is a necessary one for most of the work that follows. Our discussion of how to define, compute and interpret GO-based (dis)similarities is now complete; we turn to some examples that illustrate the application of our ideas. 6. Examples of GO.DISSIM.GPD We shall discuss three examples. Example 1 demonstrates how our algorithm can be used to compute the similarity values between a pair of gene products and how the choice of the LOS operator influences the final measure of similarity. In particular, we shall consider gene products from the same family. The second example investigates how the LOS similarities will look like when we consider gene products from different families. In this case, for any reasonable measure of similarity one would expect a lower level of similarity than that between a pair of products in the same family. The purpose of the last example is to investigate the effectiveness of the proposed measures of similarity in revealing family relations between gene products described by terms. In particular, we run algorithm GO.DISSIM.GPD on GPD19412.10.03 and visually assess the quality of the similarity measures. We also investigate how closely our ontology based measures of similarity agree with sequence based similarity measures. Example 1. Table 3 shows the GO term sets, GO IDs, and information content values (ic(T)) of G1 = AAC79117 and G14 = AAH35609 in the myotubularin family F1. These are the sets of terms shown in column 3, rows 1 and 14 of Table 2. G1 (representing gene MTMR1) has GenBank ID AAC79117, and is represented in ENSEMBLE by the term set G1 = {4722, 4725, 16787, 6470, 8372} in the GO. Similarly, G14 (representing gene MTMR4), has GenBank ID AAH35609, is represented by G14 = {4721, 8270, 6470}.

16

Keller et al.

Since the numbers of terms in these two term sets are n1 = 5, n14 = 3, there are 8 total ic(T) values and 15 COAs. GO ID 6470 with term T1,4 = T14,3 = " protein amino acid dephosphorylation " is a common descriptor of both proteins. The ic() values of different terms are included in columns 3 and 6 of Table 3. Table 3. Terms, GO IDs and ic for G1 and G14 in F1 of GPD19412.10.03 G1 = AAC79117 (MTMR1) T1,1= protein serine/threonine phosphatase activity

GO ID

ic(T)

G14 = AAH35609 (MTMR4) T14,1 = protein phosphatase activity

GO ID

ic(T)

4722

0.653

4721

0.548

4725

0.627

T14,2 = zinc ion binding

8270

0.578 0.608

0.348

T14,3 = protein amino acid dephosphorylation

6470

16787 T1,3 = hydrolase activity

6470

0.608

T1,4= protein amino acid dephosphorylation

8372

0.378

T1,2= protein tyrosine phosphatase activity

T1,5= cellular_component unknown COAs between pairs of terms in Table 3 are shown in Table 4. As described in the explanation of Part C of GO.DISSIM.PROC, the values in Table 4 have been scaled by dividing them by the largest ic(T) in the GO (i.e., 1/ max T∈GO {− log 2 ( p(T))} =15.35) to make them more "appealing" (which to us, means, lying between 0 and 1). Recall that this scaling is unnecessary because in equation (D), the numerator and denominator both have the same scaling factor, and thus, they will always cancel. Table 4. (Scaled) COAs between G1=AAC79117 and G14=AAH35609. T11(4722) T12(4725) T13(16787) T14(6470) T15(8372)

T14,1(4721) 0.549 0.549 0.349 0 0

T14,2 (8270) 0.105 0.105 0.105 0 0

T14,3(6470) 0 0 0 0.608 0

Eight of the fifteen pairwise COAs in Table 4 are 0. For example, the COA between terms T11(4722) and T14,3(6470) is 0, indicating that their nearest common ancestor is the root

Gene Ontology Similarity Measures Based on Linear Order Statistics

17

node "GO", i.e., they are in different subtrees of the GO. It is important to see that the pairwise COA between terms T14(6470) and T14,3(6470) is not 1. − log 2 (p( NCAN(6470,6470))) − log 2 (p(6470)) Here, c(6470,6470) = = = 0.608 ≠ 1 max T∈GO {− log 2 (p(T))} 15.350 because the function chosen for use in equation (B) is not a similarity measure. However, analogous to similarity measures, this value is the largest number that can be obtained by comparing term 6470 to any other term in the GO. Finally, the normalized GO similarity values are computed with equation (D) using each of our weight vectors. The unnormalized LOS values for Table 4 are s1,14(wM) = 0.608; s1,14(w2) = 0.573; s1,14(w4) = 0.464. The self association tables for G1 and G14 are needed to complete the example. They are found in the appendix (Tables 7 and 8). Finally, the normalized similarity measures for this example are: LOS1(G1, G14) = sn1,14(wM) = 0.931 ; LOS2(G1, G14) = sn1,14(w2) = 0.904 ; LOS4(G1, G14) = sn1,14(w4) = 0.777 . Since the top coa and the next coa are not significantly different, LOS1 and LOS2 produce very comparable measures of similarity. On the other hand, since the top four coa’s include a value that significantly differs from the other values, the LOS4 similarity value is noticeably lower than both LOS1 and LOS2 values. This trend is visible in the unnormalized similarity values also. Example 2. Example 1 shows values for the similarity between a pair of proteins in the same family. How will these values look if we compare proteins from different families? To find out, we compute the similarity between G1 = {4722, 4725, 16787, 6470, 8372} in Example 1, and G25 = {4872, 5007, 16049, 5624, 16021}, the terms that represent the 25th gene in Family F2, GenBank ID AAA35838 (representing the human gene FGFR1). Table 5 contains the 25 pairwise COAs for the terms representing this pair of genes, scaled as in Table 4. Table 5. (Scaled) COAs between G1=AAC79117 and G25=AAA35838. T11(4722)

T25,1(4872) 0.105

T25,2(5007) 0.254

T25,3(16049) 0

T12(4725)

0.105

0.254

0

0

0

T13(16787)

0.105

0.254

0

0

0

T14(6470)

0

0

0.125

0

0

T15(8372)

0

0

0

0.130

0.130

The un-normalized LOS values for Table 5 are s1,25(wM) = 0.254; s1,25(w2) = 0.254; s1,25(w4) = 0.211.

T25,4(5624) 0

T25,5(16021) 0

18

Keller et al.

The normalized GO similarity values computed with equation (D) using each of our weight vectors are listed in Table 6. For ease of comparison, we have re-listed in Table 6 the values for the within-family similarities gotten in Example 1. Note that the self-association table for G25 is also found in the appendix (Table 9).

Table 6. Term set similarities for term pairs within and between families Intra family similarities

Inter family similarities

G1 and G14 in F1

G1 in F1, G25 in F2

LOS1(G1, G14) = sn1,14(wM) = 0.931 LOS2(G1, G14) = sn1,14(w2) = 0.904 LOS4(G1, G14) = sn1,14(w4) = 0.777

LOS1(G1, G25) = sn1,25(wM) = 0.303 LOS2(G1, G25) = sn1,25(w2) = 0.383 LOS4(G1, G25) = sn1,25(w4) = 0.353

Since G1 and G25 are from different families, the COAs between term pairs in Table 5 are generally lower than those found in Table 4. We say “generally lower” because the families under consideration are outcome of an algorithm and therefore two products from different families may be biologically more similar than a pair of products in the same family. Moreover, since our measures of similarity are computed based on terms used to describe a product, there are is no guarantee that always between class similarities will be lower than within class similarities. When we aggregate the COAs to get the similarities shown in Table 6, we see an even greater disparity. This observation motivates our use of LOS operators to derive dissimilarity data that will be used for cluster analysis in a separate study. Finally, as seen in column 2 of Table 6, it is possible, because of the normalization in equation (D) to create true similarity values, for LOS2 to be greater than LOS1 (this is impossible for the unnormalized similarities in equation (C) with our weight vectors).

Example 3. Step 5 of Part C of our algorithm converts the normalized similarities to dissimilarities using equation (E). Figure 3 displays the dissimilarity matrices obtained by running algorithm GO.DISSIM.GPD on GPD19412.10.03, and then converting the matrices to images using the method described in Section 2. The rows and columns of the images in Figure 3 are ordered as given by their ENSEMBL families listed in Table 1. For ease of comparison, view 3(d) in Figure 3 is the image that appeared in view (a) of Figure 1. Visual comparison of the three GO-based images to the sequence-based image in view 3(d) suggests that family relationships between pairs of gene products in F1, F2 and F3 of GPD19412.10.03 are captured much more vividly by GO similarities than by sequence similarities. Don't forget, though, that the family groupings are themselves suggested by an algorithm (MCL clusters); it remains to be seen whether the structure seen in GO-based images such as those in Figure 3 provides a realistic representation of genetically important relationships between the gene products represented by sets of terms extracted by our algorithm. Finally, the images in this example suggest that terms in the GO itself have been entered in a meaningful way - i.e., these visual results provide a kind of "seat of the pants" check of the GO for consistency and coherence. We will explore this more fully in a subsequent article. However, even from this visual standpoint, we can perform some knowledge discovery. Since most sequences that belong to the same gene have similar

Gene Ontology Similarity Measures Based on Linear Order Statistics

19

annotation, we expect to see dark squares on the diagonal of the similarity matrix. Inspecting Figures 3(a), 3(b) and 3(c), we see that the solid blocks are interrupted by several “streaks” (three of them marked by circles in Figure 3(c)). The two “streaks” in family 2 correspond to the sequences index 30 (GenBank ID AAA35960) and 107 (GenBank ID BAK45250). Verification in March 2004 of the GO annotation for these two sequences revealed that their annotation had been updated since we extracted the 12.10.03 data. Now they are similar to one of their neighbors, 29 and 106, respectively. When updated, the two above mentioned “streaks” disappeared. Hence, even the visual representation of the GO similarity can be used to identify the sequences from a family with questionable annotation, making this a tool for knowledge discovery. The third “streak” from Figure 3(c) (in family 3) has another explanation. The related sequences, 120 and 121, both represent the gene COL21A1. Although a collagen gene, it is obvious that COL21A1 does not seem to cluster in the “collagen alpha chain” family ENSF 42 as classified by the MCL algorithm in December 2003. Returning to the ENSEMBL site in July 2004, we found that the “collagen alpha chain” family (ENSF 42) was split in 5 collagen families with COL21A1 moved in with other gene products. This again highlights the potential of new similarity measures to assist humans (by visual analysis of outliers) in knowledge discovery. In fact, authors in Ref. 3 used domain annotations to validate the MCL-derived clusters and find outliers. Instead, as suggested by this example, we believe that we can also use GO-based similarities for this purpose. Note that, if the 194 objects are permuted, the same LOS dis-similarity relations will not reveal such nice family structure. Since our objective is to assess the quality of the similarity relations, we used the family structure to order the data. In general, such cluster/family structure will not be known. Since the quality of visualization using IMAGESC depends on the input sequence, we were careful to use the same ordering for our comparisons. A better visualization scheme would follow from applying a reordering technique (such as VAT2) to all images before visualization.

(a) LOS1: MAX

(b) LOS2: At Least Two

20

Keller et al.

20

40

60

80

100

120

140

160

180

20

(c) LOS4: “At Least Four”

40

60

80

100

120

140

160

180

(d) IS194 (sequence similarities)

Figure 3. Dissimilarity images for GPD19412.10.03 ; (a)-(c) are GO-based similarity measures; (d) is a Smith-Waterman sequence based similarity measure (see Figure 1(a)) 7. Conclusions We have introduced a new way to combine the information content at nodes in the gene ontology to compute a coefficient of association (COA) between pairs of terms that annotate gene products. The heart of our method is the use of linear sums of order statistics to convert sets of COAs into similarity measures between the gene products themselves. Our method is both flexible and general. We have shown how to "customize" un-normalized similarities through the choice of different ordered weight vectors, and we have pointed out that other functions can be used in our algorithm GO.DISSIM.GPD. For our choices of “linguistically correlated” weight vectors, we have demonstrated how to produce true similarity measures from the LOS operators. We have shown image displays of dissimilarity matrices built from the conventional Smith-Waterman and BLAST sequence similarities, as well as those built using our GO-based similarity measures. The images suggest that GO-based similarities may capture clusters (here, protein families) in a more realistic way than BLAST scores. We also demonstrated the potential for knowledge discovery even from the analysis of the visual display of this data. However, much remains to be done. For example, if we use a relational clustering algorithm on one of the matrices D produced by our method, will it discover the same clusters as MCL? If the clusters are different, which ones are more interesting (genetically)? Will we be able to explain through clustering those "small visual anomalies" highlighted in Figure 3(c) from an analysis of clustering results? If so, this opens up the potential for knowledge discovery on much larger data sets where visualization is more difficult.

Acknowledgment Mihail Popescu is supported by the National Library of Medicine Biomedical and Health Informatics Research Training grant 2-T15-LM07089-11.

Gene Ontology Similarity Measures Based on Linear Order Statistics

21

Appendix A. Self Association Tables for G1, G14, and G25 for examples 1 and 2. In this appendix, we tabulate the self association values for the three gene products in examples 1 and 2 along with the unnormalized LOS calculations necessary to compute equation (D) in the GO.DISSIM.GPD.

Table 7. (Scaled) Self COAs for G1=AAC79117 T11(4722)

T11(4722) 0.653

T12(4725) 0.549

T13(16787) 0.349

T14(6470) 0

T15(8372) 0

T12(4725)

0.549

0.627

0.349

0

0

T13(16787)

0.349

0.349

0.349

0

0

T14(6470)

0

0

0

0.608

0

T15(8372)

0

0

0

0

0.378

s11(wM) = 0.653; s11(w2) = 0.634; s11(w4) = 0.597.

Table 8. (Scaled) Self COAs for G14=AAH35609 T14,1(4721) 0.549 0.105 0

T14,1(4721) T14,2 (8270) T14,3(6470)

T14,2 (8270) 0.105 0.578 0

T14,3(6470) 0 0 0.608

s14,14(wM) = 0.608; s14,14 (w2) = 0.584; s14,14 (w4) = 0.416. Table 9. (Scaled) Self COAs for G25=AAA35838 T25,1(4872))

T25,1(4872) 0.368

T25,2(5007)

0.368

T25,3(16049)

T25,2(5007) 0.368

T25,3(16049) 0

T25,4(5624) 0

T25,5(16021) 0

0.839

0

0

0

0

0

0.596

0

0

T25,4(5624)

0

0

0

0.448

0.147

T25,5(16021)

0

0

0

0.147

0.331

22

Keller et al.

S25,25(wM) = 0.839; s25,25 (w2) = 0.664; s25,25 (w4) = 0.536. References 1.

2. 3. 4. 5. 6.

7. 8. 9.

10. 11. 12. 13. 14. 15. 16. 17. 18.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. Bezdek, J.C., & Hathaway, R.J. 2002. VAT: A Tool for Visual Assessment of (Cluster) Tendency, Proc. IJCNN 2002, pp. 2225-2230. Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research. 30(7), 1575-1584. Grabisch, M. 1994. Fuzzy integrals as a generalized class of order filters. Proc. of SPIE. 2315, 128-136. Grabisch, M., Murofushi, T., and Sugeno M. (eds.). 2000. Fuzzy Measures and Integrals: Theory and Applications, Physica-Verlag, Heidelberg, Germany. Grasso, C., Modrek, B., Xing, Y., and Lee, C. 2004. Genome-wide detection of alternative splicing in expressed sequences using partial order multiple sequence alignment graphs. Proc. Pacific Symposium on Biocomputing. 9, 29-41. Huber, P. 1981. Robust Statistics. John Wiley and Sons, New York, N.Y. Jiang, J.J., and Conrath, D.W. 1997. Semantic Similarity Based on Corpus Statistics and Lexical Ontology. Proc. Int. Conf. Research on Comp. Linguistics X, Taiwan, 1-15. Keller, J, Popescu, M., and Mitchell, J. 2004. Taxonomy-based soft similarity measures in bioinformatics, Proc., IEEE International Conference on Fuzzy Systems, Budapest, Hungary, 23-30. Keller, J., Gader, P., Tahani, H., Chiang, J-H., and Mohamed, M. 1994. Advances in fuzzy integration for pattern recognition. Fuzzy Sets and Systems. 65, 273-283. Khatri, P., Draghici, S., Ostermeier, G.C., and Krawetz, SA. 2002. Profiling gene expression using Onto-Express. Genomics. 79(2), 266-270. Lee, S.I., and Bazoglu, S. 2003. Application of independent component analysis to microarrays. Genome Biology. 4(11):R76, 1-21. Lord, P.W., Stevens, R.D., Brass, A., and Goble, C.A. 2003. Semantic similarity measure as a tool for exploring the gene ontology. Proc., Pacific Symposium on Biocomputing, 601-612. Murofushi, T. and Sugeno, M. 1991. A theory of fuzzy measures: representations, the Choquet integral, and null sets. J. Math Analysis and Applications. 159, 532-549. Needleman, S., and Wunsch, C. 1970. A general method applicable to the search for similarities in the amino acid sequences. J. Mol. Biol. 48, 444-453. Notredame, C., Higgins, D., and Heringa, J. 2000. T-Coffee: A novel method for multiple sequence alignments. Journal of Molecular Biology, 302, 205-217. Raychaudhuri, S., Sutphin, P.D., Chang, J.T, and Altman, R.B. 2001. Basic microarray analysis: grouping and feature reduction. Trends in Biotechnology, 19 (5), 189-193. Resnik, P. 1995. Using information content to evaluate semantic similarity in a taxonomy. Proc., IJCAI. 448-453.

Gene Ontology Similarity Measures Based on Linear Order Statistics

23

19. Sakai, H., and Maruyama, O. 2004. Extensive search for discriminative features of alternative splicing. Proc. Pacific Symposium on Biocomputing. 9, 54-65. 20. Smith, T.F., and Waterman, M.S. 1981. Identification of common molecular substances. J. Mol. Biol. 147, 195-197. 21. Sneath, P. H. A., and Sokal, R. R. 1973. Numerical Taxonomy, W. H. Freeman and Company, San Francisco, CA. 22. Speer N., Spieth C., and Zell A. 2004. A Memetic Clustering Algorithm for the Functional Partition of Genes Based on the Gene Ontology. Proc. of the 2004 IEEE Symposium on Comp. Intell. in Bioinf. and Comp. Biology (CIBCB 2004), San Diego, CA, 252 – 259. 23. The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nature Genet. 25, 25–29. 24. Van Dongen, S. 2000. Graph clustering by flow simulation, Ph.D. Thesis. University of Utrecht, The Netherlands. 25. Xu, Y., Olman, V., and Xu, D. 2002. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics. 18(4), 536-545. 26. Yager, R. 1988. On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Trans. on Systems, Man, and Cybernetics. 18(1), 183-190.