Computers in Biology and Medicine 35 (2005) 627 – 643 http://www.intl.elsevierhealth.com/journals/cobm
Multi-criterial coding sequence prediction. Combination of GeneMark with two novel, coding-character speci,c quantities Yannis Almirantis∗ , Christoforos Nikolaou Institute of Biology, National Research Center for Physical Sciences “Demokritos”, Athens 15310, Greece Received 23 December 2003; accepted 7 April 2004
Abstract This work applies two recently formulated quantities, strongly correlated with the coding character of a sequence, as an additional “module” on GeneMark, in a three-criterial method. The di6erence in the statistical approaches implicated by the methods combined here, is expected to contribute to an e8cient assignment of functionality to unannotated genomic sequences. The developed combined algorithm is used to fractionalize a collection of GeneMark-predicted exons into sub-collections of di6erent expectation to be coding. A further modi,cation of the algorithm allows for the assignment of an improved estimation of the probability to be coding, to GeneMark-predicted exons. This is on the basis of a suitable training set of GeneMark-predicted exons of known functionality. ? 2004 Elsevier Ltd. All rights reserved. Keywords: Coding potential; GeneMark; Gene ,nding; Multi-criterial methods
1. Introduction The structure of the genome of most eukaryotic organisms is highly complex, with the majority of genes being interrupted by introns and long non-coding intervening sequences. Thus, in view of the on-going genome projects, there is a constant need for continuous improvement of the existing gene ,nding tools and especially for ab initio (not based on sequence-similarity with known genes) annotation methods. Methods for determining the functionality of genomic sequences are based on three di6erent approaches. Content-methods rely on the overall, statistical properties of a sequence. Signal-based methods focus on the presence or absence of speci,c sequence patterns. Finally, comparative methods ∗
Corresponding author. Tel.: +30-1-650-3619; fax: +30-1-651-1767. E-mail address:
[email protected] (Y. Almirantis).
0010-4825/$ - see front matter ? 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiomed.2004.04.002
628
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
make determinations based on sequence homology [1]. Up to now a number of works have pointed out the inadequacy of a simple method to achieve such a complex goal, as is the determination of the structure of a coding sequence, including promoter region, exon-intron boundaries, termination and untranslated sequences [2–9]. All of them also point out, the need of combining di6erent methods to the task of coding sequence prediction. Recently, there have been attempts in this direction, which live up to such expectations, giving rise to improved combined algorithms [10–12]. The presented work is an attempt to improve a well-established method, previously assessed by works sited above, namely GeneMark [13], through its combination with two recently developed algorithms. These algorithms are able to distinguish coding from non-coding sequences, based on two di6erent approaches operating at di6erent length scales: Codon Asymmetry Measure (CAM) quanti,es di6erences in the use of mirror symmetric triplets (like AGT and TGA) in a DNA sequence [14]. The CAM value for an examined sequence is taken to be the maximal of three, each computed separately for one of the three reading frames. Coding sequences are characterized by high CAM values due to the codon and amino acid skews found in almost all proteins, while non-coding sequences at this scale are presented as near-random. Obviously, CAM characterizes sequences at the level of the “short length scale” of a few nucleotides, which is drastically a6ected by the grammar and syntax of the triplet code. Modi,ed Standard Deviation (MSD) is a measure of the degree of clustering of similar nucleotides in a given sequence [15,16]. The computed quantity depends on a chosen “block length” m, thus, characterizing the examined sequence in a given length scale. m may take values from a few tenths to some hundreds of nucleotides, a length region which we could characterize as “middle length scale”. It is found that, when studied at this length scale, coding sequences are mostly near-random, while the non-coding genomic component clearly presents a tendency for high clustering of similar nucleotides [17]. The algorithms for the computation of CAM and MSD are brieIy presented in the corresponding appendices, while more details and related discussion may be found in the cited references. Purpose of the present paper is not to compare or review, even brieIy, the very extended recent literature of works aiming to develop reliable tools for the genome annotation. The choice of GeneMark is just indicative and this work could be done using several other gene ,nding tools assigning to every predicted exon a probability estimation to be true-coding. Moreover, at this stage of our work, it is not attempted to reconstruct the complete gene structure of the studied sequence. We just assess the performance of the three-criterial method we have developed, to better specify the predictions of a given gene ,nding program. It is obvious that the implementation of CAM and MSD in the search of coding regions in nonannotated genomic sequences will generate methods, which are “content-based”, as they do not allow for the precise positioning of the limits of individual coding segments in the examined sequence. They can therefore, be used only as additional “patches”, meaning modules aiming to improve the performance of a standard gene-,nding tool. On the other hand, they retain certain advantages, which can be brieIy discussed herein: (i) Both MSD and CAM are generally “naive” methods, not demanding an exhausting training. The boundary values of the corresponding coding potential measures, able to achieve discrimination of functionality states, can be determined very quickly. Moreover, the concepts behind the development of both quantities are quite di6erent from the ones, on which most of the
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
629
gene-,nding tools are based. It is obvious that the work presented here does not intent to substitute HMM and related algorithms, which have been proven quite suitable for the general task of gene structure determination in higher eukaryotic sequences. We believe however, that combinations of di6erent approaches may result into improved methodologies to be used for the determination of the role of new sequences. (ii) Both methods are species-independent, as they refer to general, fundamental properties of a coding sequence. In this way, their application on sequences from a large variety of organisms is carried out without problems. This becomes important in cases of horizontal gene transfer, where species–speci,c statistical properties vary among the coding sequences of the same organism [18]. In addition, MSD and CAM both incorporate speci,c features that “,lter” the background nucleotide composition, becoming in this way independent of intra-genomic compositional Iuctuations, such as varying G+C content, isochore organization, etc. (iii) Both MSD and CAM have been shown sensitive in speci,cally determining regions that are to be translated [14,16]. Hence they may serve in the distinction of untranslated regions (UTRs) from regularly expressed exons. (iv) Another relative advantage is the methods’ simplicity, as their key-parameters are ,xed in a few steps, while the algorithms themselves may be easily parsed and can incorporate suitable modi,cations without considerable di8culty. On the other hand, main weakness is that their power of identifying the coding potential of a given segment, while very high for segments long enough, drops considerably when going to the lengths of a few hundreds of nucleotides, where the length of most exons lies. The material in this article is organized into two, mostly independent, Sections 1 and 2. Section 1 is devoted to the formulation of an algorithm based on the combination of the quantities CAM, MSD and PGM (the probability assigned by GeneMark to a predicted exon to be coding) to fractionate a collection of predicted exons into sub-populations with di6erentiated probabilities to be correctly predicted as true-coding, and to the assessment of the e8ciency of this algorithm. In Section 2, a method is formulated, allowing the assignment to an individual GeneMark-predicted exon of an estimation of the probability to be true-coding, on the basis of its CAM, MSD and PGM values. A large collection of sequences of known intron-exon partitions is used as “training set”. This method is assessed by its application on collections of sequences of known intron-exon partitions used as “test sets” and the results are compared to the PGM values of the exons of each test set. 2. Fractionalization of a collection of predicted exons into subsets of dierentiated expectation to be true-coding 2.1. Methods and sequences An appropriate computer program is run, computing CAM of a given collection of sequences and for a given range of are characterized as “Truly” and “Falsely” Predicted (Exons) rived annotation information provided by the databank entry.
and MSD for each predicted exon exon lengths. The predicted exons on the basis of experimentally deA predicted exon is considered as
630
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643 800
number of predicted exons
700 600 500 400 300 200 100 0 0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
coding fraction of predicted exons
Fig. 1. Coding-fraction distribution of the GeneMark-predicted exons of the Hum col.aUb.
(approximately) “true-predicted exon” if, at least, half of its length is veri,ed by the entry annotation to be coding. Any predicted exon with coding content lower than 50%, will be named in the sequel “falsely predicted exon”. This approximation is justi,ed by data presented in Fig. 1. It is obvious that the curve of the true-coding fraction of predicted exons is bi-modal, thus justifying this two-valued approximation. Exon populations for the whole collection are denoted in the sequel as TPETOT and FPETOT for true- and false-predicted exons, respectively. Then, the program scans the three-parameter space, de,ned by CAM, MSD and PGM (the GeneMark estimated probability of a predicted exon to be coding, provided by this gene-,nding tool) and registers, for a given triad of threshold values {CAMthr ; MSDthr ; PGMthr }, the numbers TPE∗∗∗ and FPE∗∗∗ , belonging into each of the eight possible combinations of parameter “regions” (the indice * may be + or −). With ‘+’ is denoted the “coding favoring” region of a parameter, i.e. the values higher than the thresholds for CAM and for PGM , and the values lower than the threshold for MSD [13,14,16]. For each parameter, the region complementary to the one denoted by ‘+’ (i.e. the region of low expectation for coding potential), is denoted by ‘−’. Thus, e.g. (+ + −) means the region of the three-parameter space de,ned by: CAM ¿ CAMthr ; MSD ¡ MSDthr and PGM ¡ PGMthr . With TPE++− and FPE++− are denoted the truly and falsely predicted exon populations, with their CAM, MSD and PGM values lying inside the (+ + −) three-parameter region, which is used here as example. ++− Next, we introduce the quantities suitable for the subsequent analysis: The ratio PTC = ++− ++− ++− ++− ++− =(TPE + FPE ) is a measure of specicity, while the ratio RTC = TPE =TPETOT TPE is a measure of sensitivity. Both these quantities are de,ned for a given region of the three-parameter space (in this example the one denoted as ‘+ + −’). One additional quantity of interest in our work is the ratio W , of the number of true-predicted exons over the total number of predicted exons of a collection: collection name) W((exon length width) =
TPETOT ; where PETOT = TPETOT + FPETOT : PETOT
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
631
Note, that W characterizes overall the e8ciency of GeneMark as a gene ,nding tool for a speci,c sequence collection. It does not relate to the combined method we propose herein, as by construction sums-up contributions from all eight sub-regions of the three-parameter space. Collections of sequences have been formed using the “extended query form” of the SRS7 data-mining tool of the EMBL databank. Firstly, 990 human sequences with lengths ranging from 4 to 8 thousands of nucleotides (knts) were collected. From this initial set, 342 sequences, each hosting at least one coding sequence (CDS), including three or more exons, and for which the corresponding mRNA is experimentally identi,ed, have been retrieved. Two hundred and ,fty one sequences were retained, after elimination of redundant entries. This ,nal collection named Hum coll.aUb is divided into two sub-collections named Hum col.a (123 sequences) and Hum col.b (128 sequences). The other sequence collections mentioned in the sequel, have been formed using similar procedures with the one described above. Note that, the length of the data bank entries used is not crucial, as we do not attempt to decipher the complete gene structure but to assess the coding potential of individual predicted exons. The sequences used in this work have been submitted individually to the EMBL/GeneMark webinterface (http://www.ebi.ac.uk/GeneMark/) using the default parameter values. The GeneMark matrix corresponding to the GC content of the examined sequence was always used. One additional technical detail is that for the whole procedure, only the part of the databank entry hosting the ,rst coding sequence meeting the requirements described above, i.e. only the predicted exons hosted in the sequence space occupied by this coding sequence, are considered. It has to be noticed that the predicted exons considered here are strictly coding segments not including UTRs. 2.2. Results and discussion 2.2.1. Introductory results At a ,rst stage, results are derived taking into account only GeneMark-predicted exons with length (L) exceeding the 120 nucleotides (nts). This is done because of the expected low e8ciency of the applied methods for very short exons. In Table 1 the results of the above analysis for Hum col.a exons, the triad of parameter thresholds being: {CAM = 2:8, MSD = 0:038, PGM = 0:7} are depicted. The last line of Table 1a indicates that more than half of the true-coding predicted exons lie in the (+++) parameter region. A practical application of these results is that (providing that this collection is typically behaving), for any GeneMark-predicted human exon, of length higher than 120, with parameter values lying in the parameter region de,ned above, we may estimate that its probability to be coding equals to ∼ 0:98, instead of ∼ 0:82 (which is the W value for the above collection for L ¿ 120). More generally, we have (for the chosen threshold values) the complete set of predicted exons “fractionalized” into sub-populations depending on the values of their CAM, MSD and PGM . This leads to the generation of a partition of the parameter space into regions with associated expectations of predicted exons to be coding, which are equal to the corresponding values ∗∗∗ depicted in Table 1a. These expectations however, have to be considered taking into account of PTC the numbers of predicted exons, which in some cases are very small to guarantee a good statistics. In order to have a ,rst estimation of the applicability of the results of the presented analysis on new sequences, i.e. how much the ,ndings based on Hum col.a are typical for, at least human, sequences, the same analysis for Hum col.b is depicted in Table 1b, while in Table 1c are presented the results for a mouse collection named Mus col., formed by the procedure described above, which
632
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
Table 1 Length of the considered predicted exons: L ¿ 120 (Hum col.a; Hum col.b; Mus col.) True-pred. exons (TPE∗∗∗ )
Parameter region CAM
MSD
False-pred. exons (FPE∗∗∗ )
∗∗∗ PTC
R∗∗∗ TC
0.0000 1.0000 0.5259 0.8333 0.5000 1.0000 0.7593 0.9826
0.0000 0.0097 0.1187 0.1459 0.0039 0.0117 0.1595 0.5506
0.0000 0.6667 0.6181 0.8778 0.5000 0.8333 0.7475 0.9785
0.0000 0.0038 0.1698 0.1508 0.0038 0.0095 0.1412 0.5210
0.3333 1.0000 0.5537 0.8763 0.2500 0.6667 0.6909 0.9806
0.0050 0.0100 0.1671 0.2120 0.0025 0.0050 0.0948 0.5037
PGM
(a) Hum col.a—length of the considered predicted exons: L ¿ 120 FPETOT = 112; TPETOT = 514; W = 0:821; CAMthr = 2:8; MSDthr = 0:038; PGMthr = 0:7 — — — 0 9 — — + 5 0 — + — 61 55 — + + 75 15 + — — 2 2 + — + 6 0 + + — 82 26 + + + 283 5 (b) Hum col.b—length of the considered predicted exons: L ¿ 120 FPETOT = 108; TPETOT = 524; W = 0:829; CAMthr = 2:8; MSDthr = 0:038; PGMthr = 0:7 — — — 0 7 — — + 2 1 — + — 89 55 — + + 79 11 + — — 2 2 + — + 5 1 + + — 74 25 + + + 273 6 (c) Mus col.—length of the considered predicted exons: L ¿ 120 FPETOT = 95; TPETOT = 401; W = 0:808; CAMthr = 2:8; MSDthr = 0:038; PGMthr = 0:7 — — — 2 4 — — + 4 0 — + — 67 54 — + + 85 12 + — — 1 3 + — + 2 1 + + — 38 17 + + + 202 4
includes, after removing redundancy, 107 sequences (EMBL entries). We observe that the W values for both human collections are very close, as expected due to their size and the absence of any bias in their formation. Moreover, the mouse collection has also a W value close to the human, allowing ∗∗∗ and R∗∗∗ values are very similar for every direct comparison of the results. The corresponding PTC TC one of the eight sub-regions of the three-parameter space. In several cases the mouse collection is closer to one of the two human collections than the two human ones are to each other. This is a hint that inter-special di6erences are not signi,cant, at least when the W values of the collections are close enough. Notice that signi,cant di6erences in the W values between species imply di6erent aptitudes of the GeneMark tool for these genomes.
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
633
Table 2 Length of the considered predicted exons: L ¿ 120 (a. Hum col.a; b. Hum col.b; c. Mus col.)
Lines 1a, 2a, 3a report quantities of the Hum col.a. Lines in bold (1b, 2b, 3b) report quantities of the Hum col.b. Shaded lines (1c, 2c, 3c) report quantities of the Mus col.
2.2.2. Thresholds reliability and organism dependence The parameter thresholds set, used above, are just typical values taken as example. In order to fully explore the possibilities of the presented method we have scanned the three-parameter space using a variety of threshold triads, considering separately Hum col.a, Hum col.b and Mus col., and retaining only the quantities associated to the (+++) sub-region of the parameter space. We have +++ found that several values of the “probability of an exon to be coding”, PTC may be attained +++ for varying fractions RTC of the complete set of truly coding exons. In order to have a unique estimator of the e8ciency of a combination of any three-parameter threshold choice, we also register +++ the product PTC ∗ R+++ TC . Typical cases are presented in Table 2 for the three collections mentioned above. Hum col.a quantities are presented in lines 1a, 2a, 3a accompanied with the corresponding values of the parameter thresholds, while, for comparison, in the following lines, the same quantities of the Hum col.b (lines 1b, 2b, 3b) and Mus col. (lines 1c, 2c, 3c) are given (for the same sets of parameter thresholds). The ,rst case of parameter threshold set, is chosen to be presented, because +++ it corresponds to the overall maximum of the product PTC ∗ R+++ TC . The next lines correspond to speci,cities equal to ∼ 0:95 (line 2) and ∼ 0:98 (line 3) for the Hum col.a collection. A ,rst observation is that the three collections behave approximately in the same way. One principal reason for the deviations between collections is that the quantity, FPE+++ remains always a rather small integer, even for relatively big sequence sets as the ones studied here. This is yet another indication of the three-criteria method’s high speci,city, as the number of falsely predicted exons remains relatively small in all cases, regarding the (+++) three-parameter subregion. In +++ addition, the proximity of the corresponding PTC and R+++ values for the three collections seems TC to allow us for an extrapolation of the determined parameter thresholds in new sequences, at least in cases where GeneMark performance remains close enough.
634
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
Table 3 Length of the considered predicted exons: L ¿ 120
2.2.3. Comparing the proposed method with GeneMark Simple inspection of the preceding tables reveals that the combined use of GeneMark with CAM and MSD leads to the formation of sub-populations of the predicted exons with di6erentiated incidences to be true-coding, depending on their {CAM, MSD, PGM } threshold values. However, this combined method will be worthwhile only if it is proven to be superior than the corresponding one-parameter method based only on one threshold imposed to the GeneMark parameter PGM . In ∗∗∗ and R∗∗∗ we de,ne P GM :∗ and RGM :∗ to be the speci,city and sensitivity complete analogy to PTC TC TC TC measures respectively, with + or − in the place of * for higher and lower PGM values than a chosen PGM threshold, respectively. In Table 3 and for Hum col.a, (i) results of the predicted exons sub-populations associated to the (+++) sub-space of the three-parameters method and (ii) results of predicted exons classi,cation method based on the GeneMark associated quantity PGM (shaded lines), are put together. In order to assess the three-parameter method in comparison to the PGM -based evaluation of conjectured exons, we ,rstly have to select a threshold triad (for the new method) and a PGM +++ GM :+ threshold (for GeneMark alone), which both give the same PTC and PTC (speci,city) value. +++ GM :+ Then, we carry on in the comparison of the corresponding RTC and RTC (sensitivity) values. In the three double lines of this table are depicted the ,ndings corresponding to the “probability GM :+ +++ of an exon to be true-coding” (speci,city: PTC or PTC ) equal, respectively to ∼ 0:91, ∼ 0:95 and ∼ 0:98. The comparison of the two methods will be based on the values of the sensitivity GM :+ +++ associated to the above probabilities. Simple inspection reveals that for PTC or PTC equal to 0.91, the three-parameter method is practically equivalent with the PGM -based method. For 0.95 the :+ three-parameter method results to a ∼ 8:4% improved value of R+++ if compared to RGM TC TC . When we go to the level of the 0.98, then, the improvement would reach ∼ 49:4%. These results indicate a clear advantage of the application of the three-criteria method over the PGM -based one-parameter method, especially if a high speci,city threshold is imposed to the “high probability sub-population” (+++) of predicted exons. Since R+++ is indicative of the sensitivity TC of the method used, to “discover” true-coding exons from a pool of GeneMark-predicted exons, the advantage of the method proposed herein to retrieve the true-coding exons by classifying them in the “favorable” (+++)-associated sub-population is obvious. The three-criteria method has the
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
635
Table 4 Hum col.a—length of the considered predicted exons: L ¿ 120 (FPETOT = 112; TPETOT = 514; W = 0:821; CAMthr = 5:0; MSDthr = 0:021; PGMthr = 0:6) Parameter region CAM
MSD
PGM
— — — — + + + +
— — + + — — + +
— + — + — + — +
True-pred. exons (TPE∗∗∗ )
False-pred. exons (FPE∗∗∗ )
∗∗∗ PFC
R∗∗∗ FC
58 279 0 0 20 156 1 0
70 36 0 0 3 3 0 0
0.5469 0.1143 — — 0.1304 0.0189 0.0000 —
0.6250 0.3214 0.0000 0.0000 0.0268 0.0268 0.0000 0.0000
inherent property to divide the parameter space into eight instead of two sub-regions. However, in most cases, either the speci,cities corresponding to the six ‘intermediate’ exon sub-collections are close to the ratio W , or the absolute size of these sub-collections is too small. Thus, the predictive use of the related results is not possible at this stage.
2.2.4. Formation of a sub-collection of “negatively diagnosed” predicted exons A further application of the fractionalization of a given set of predicted exons may be done when we scan the three-parameter space looking for optimal triads of thresholds, which provide a negatively predicted exon sub-population, belonging to the (- - -) sub-space, hosting an important number of (falsely predicted) exons. This may drive to the re-evaluation of the functionality of these exons in the framework of a massive annotation project for new sequences. We have performed a search similar to the one presented in Table 2, retaining quantities related to the negatively predicted sub-region of the three-parameter space (data not shown). In Table 4 the results for one such typical triad of threshold values selected for the high population of negatively predicted exons are depicted. Quantities associated to the eight sub-regions of the three-parameter space are presented. In the last ∗∗∗ and R∗∗∗ , we choose to present the quantities P ∗∗∗ =FPE∗∗∗ =(TPE∗∗∗ + two columns, instead of PTC TC FC ∗∗∗ ∗∗∗ ∗∗∗ FPE ) and RFC = FPE =FPETOT , which represent the speci,city and sensitivity, respectively, to address exons falsely predicted by GeneMark. We verify that a suitable choice of the three-parameter thresholds may lead systematically to the isolation of most of the falsely predicted exons (here the 62.5%) in the (- - -) parameter space sub-region with associated probability for a predicted exon to be non-coding (here) ∼ 0:55. Notice that, this probability, for the whole sequence collection under examination, i.e. its (1 − W ) value, is ∼ 0:18. The Hum col.b and the Mus col. give concordant results for the same thresholds set (data not shown). As we already have noticed, re-evaluation of the functionality of exons belonging to the (- - -) sub-population is inferred by this result.
636
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
Table 5 AraUDro col.—length of the considered predicted exons: L ¿ 120 (FPETOT = 12; TPETOT = 354; W = 0:967; CAMthr = 5:0; MSDthr = 0:021; PGMthr = 0:6) Parameter region CAM
MSD
PGM
— — — — + + + +
— — + + — — + +
— + — + — + — +
True-pred. exons (TPE∗∗∗ )
False-pred. exons (FPE∗∗∗ )
∗∗∗ PFC
R∗∗∗ FC
40 256 0 1 3 54 0 0
9 2 0 0 0 1 0 0
0.1837 0.0078 — 0.0000 0.0000 0.0182 — —
0.7500 0.1667 0.0000 0.0000 0.0000 0.0833 0.0000 0.0000
2.2.5. Extension of the method to highly successfully GeneMark-predicted sequence collections We have formed a sequence collection named AraUDro col., putting together 36 Drosophila and Droso = 0:98 and W Arabid = 0:95). Obviously, direct comparison with the 15 Arabidopsis sequences (WL¿120 L¿120 results of previous sequence sets is not possible because the W value of this collection (which equals to 0.967) is much higher than in human and mouse collections. This means that the GeneMark tool itself is ‘,tter’ or better standardized for these organisms. Moreover, an application of threshold triads generating a true-coding rich (+++) sub-population (like in Tables 1a–c) is not of interest here, as the whole set of predicted exons is very rich in true-coding ones (only 12 out of 366 are falsely predicted). On the other hand, performing an analysis similar to the one of Table 4 for AraUDro col., would be potentially useful since the exons of the (- - -) sub-population, expected to include most of the falsely predicted exons, could be reevaluated for their coding character. In Table 5 this analysis for the parameter triad used in Table 4 is presented. We ,nd that the 75% of all the falsely predicted coding exons belongs to the (- - -) sub-region of the three-parameter space (i.e. the sensitivity −− measure is R− = 0:75). The corresponding probability of an exon of this parameter sub-region to FC −−− be non-coding (i.e. the speci,city measure) is PFC = 0:18. Notice that this has to be contrasted to the quantity (1 − W ) characterizing the whole collection which here is 0.03. This means that we have con,ned the 75% of the falsely predicted exons in one sub-set of the initial collection with a six-fold incidence. 2.2.6. Overview of the application of the method on short predicted exons Next, we test the e8ciency of our method of fractionalization of a given set of GeneMark predicted exons, of L ¡ 120. We use the Hum col.aUb for the exon length width: 80 ¡ L ¡ 120. In Table 6, we present the results for the parameter thresholds’ triads which have given for L ¿ 120 +++ Hum col:aUb = 0:66, a speci,city value of PTC = 0:98 (see Table 2, line 5). We observe that while W80¡L¡120 +++ +++ the obtained PTC equals to 0.88 for RTC = 0:48, thus indicating that the parameter thresholds determined for middle and high exon lengths are e8cient for the fractionalization of a low-length exon collection too. When we scan the three-parameter space, we ,nd that the threshold triad
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
637
Table 6 Hum col.aUb—length of the considered predicted exons: 80 ¡ L ¡ 120 (FPETOT =302; TPETOT =586; W =0:660; CAMthr = 2:7; MSDthr = 0:038; PGMthr = 0:74) Parameter region CAM
MSD
PGM
— — — — + + + +
— — + + — — + +
— + — + — + — +
True-pred. exons (TPE∗∗∗ )
False-pred. exons (FPE∗∗∗ )
∗∗∗ PTC
R∗∗∗ TC
3 1 40 55 8 2 193 284
10 2 104 20 8 1 120 37
0.2308 0.3333 0.2778 0.7333 0.5000 0.6667 0.6166 0.8847
0.0051 0.0017 0.0683 0.0939 0.0137 0.0034 0.3294 0.4846
+++ {2:7; 0:037; 0:80} leads to PTC = 0:92 with R+++ TC = 0:40. Using the PGM based one-parameter GM :+ :+ method, we obtain for PTC = 0:92 the value RGM = 0:29. This shows that the three-parameter TC fractionalization method is clearly better than the one based only on one threshold, for the PGM probability estimation, even in this length-range of predicted exons. Performing an analysis of the negatively predicted exons, again with the thresholds used in Table 4, we ,nd that 53% of the falsely predicted coding exons belongs to the (- - -) sub-region −− of the three-parameter space (i.e. the sensitivity measure is R− = 0:53). The corresponding probFC ability of an exon of this parameter sub-region to be non-coding (i.e. the speci,city measure) is −−− PFC = 0:68. Notice that this has to be contrasted to the quantity (1 − W ) characterizing the whole collection which here is 0.34.
3. A tool assigning to predicted exons an improved probability to be true-coding 3.1. Methods and sequences It would be of interest, based on the results of the analysis presented in Section 1, to assign to individual GeneMark-predicted exons, an improved probability to be really coding. To this end, the following algorithm is developed: 1. All the truly and falsely predicted exons of a (training) large collection of sequences of known intron-exon partition, (ideally belonging to the same organism), are represented as points into a 3-D space, spanned by CAM, MSD and PGM , by means of their corresponding parameter values. While the training collection may include predicted exons of a large width of lengths, the described algorithm may isolate and use as active training set, for the examination of each individual studied GeneMark-predicted exon, the training collection exons belonging to a speci,ed length range. 2. Let the coordinates of the point (CAMe ; MSDe and PGMe ) in this 3-D space be the parameter values proper to the predicted exon we shall examine. The populations TPEs and FPEs (s stands for “sample”), of the truly and falsely predicted exons lying inside a box centered at the point
638
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
(CAMe ; MSDe and PGMe ), are counted. Their sum PEs = TPEs + FPEs is then de,ned. The number box PEsmin and a set of starting dimensions CAM0box: , MSDbox 0 , PGM0 of the sampling box, are provided to s the algorithm as external parameters. If the condition: PE ¿ PEsmin does not hold, then, the algorithm will adjust the volume of the sampling box duplicating, triplicating, etc. the sizes of its edges, until the condition gets ful,lled. This readjustment preserves a good statistic despite the di6erences in the local point density prevailing in various regions of the parameter 3-D space. 3. We consider, that the ratio PEST = TPEs =PEs represents the estimation of the probability of the examined predicted exon to be coding, based on our combined CAM–MSD–PGM method. This new estimator of coding probability has to be assessed in comparison to the PGM value provided by GeneMark. In order to perform this assessment, we form an additional (test) collection of sequences of known intron-exon partition. For all the GeneMark-predicted exons of this collection, PEST is computed. Then, we de,ne as QEST and QPGM , respectively, the sums of the absolute values of the deviations of PEST and of PGM from the fraction of true coding space in each predicted exon, for which we will use the symbol P% , i.e.: P% = (true coding nucleotides of a GeneMark-predicted exon)/(its total length). Thus, we have QEST =
N
i
|PEST −
i P% |=N
i=1
and
QPGM =
N
i i |PGM − P% |=N;
i=1
where i is a indice scanning in arbitrary order all the predicted exons of the test collection and N their total number. Ideally, a tool assigning coding probability estimate to predicted exons, should give value 1 if they are coding in all their length, while for a predicted exon, hosting some noncoding length, should tend to give, as coding probability estimate, the exon’s true-coding fraction. This form of prediction accuracy in the nucleotide level is implicit in the above construction. Hence, comparison of the QEST and QGM reIects the relative advantage of these two codingcharacter prediction tools: the higher the value of the ratio QPGM =QEST is, the higher the e8ciency of the three-parameter method will be. Following the procedure described in Section 1, 359 human entries with sequence lengths ranging from 8 to 16 knts, including one CDS with at least three exons and experimentally found mRNA, have been extracted from the EMBL databank. After cleaning for redundancy, the 242 annotated sequences left have been divided into two groups. The ,rst 160, put together with the 251 sequences of the Hum col.aUb collection form the “training collection”, while the remaining 82 form the “human test collection”. The mus col. and the AraUDro col. described in Section 1 are also used as “test collections”. 3.2. Results and discussion First, we have applied the above-described algorithm to the human “test collection”, using the data taken from the “training collection” of the preceding paragraph, for a variety of values for the starting edge-sizes of the sampling box and for the accepted minimum for the number of test points (exons) belonging to it. A ,rst conclusion derived is that the method is stable (robust) enough for these parameters, i.e. considerable variations of these free parameters of the algorithm do not alter very much the obtained values of PEST and PPGM . The results included in Table 7 remain
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
639
Table 7 Min exon length
Max exon length
CAMbox: 0
MSDbox 0
box PGM0
PEsmin
QEST
QPGM
I
Results 1a 1b 1c
from “Human test collection” 100 300 0.002 300 500 0.002 500 ∞ 0.002
0.02 0.02 0.02
0.02 0.02 0.02
40 40 40
0.224 0.123 0.039
0.244 0.216 0.095
21 140 161
Results 2a 2b 2c
from “Mouse test collection” 100 300 300 500 500 ∞
0.02 0.02 0.02
0.02 0.02 0.02
40 40 40
0.244 0.107 0.032
0.245 0.182 0.173
10 65 148
Results 3a 3b 3c
from “Arabidopsis–Drosophila test collection” 100 300 0.002 0.02 300 500 0.002 0.02 500 ∞ 0.002 0.02
0.02 0.02 0.02
40 40 40
0.188 0.117 0.005
0.205 0.205 0.115
20 58 151
0.002 0.002 0.002
qualitatively unaltered for varying the dimensions of each of the sample box edges by a factor of ∼ 0:3–3. and the PEsmin by a factor of ∼ 0:5–2. Table 7 includes the results of the application of the algorithm, combined with the de,ned “training set” (411 sequences of human origin); ,rstly (three ,rst lines), with the human 82-sequences test collection, secondly (three next lines), with the 107 sequences of the mouse Mus col. test collection, and thirdly (three last lines), with the 51 Arabidopsis and Drosophila sequences of the AraUDro col. test collection. In each group, the ,rst line represents the application of the proposed method on exons of the test collection in the length band of 100–300 nts and the second and third lines, in the length bands of 300–500 and 500–∞, respectively. Simple comparison of the values of QEST and QPGM reveals that in lines 1a and 2a the advantage of using the coding-probability estimate given by the proposed algorithm instead of PGM is rather negligible, while a slight improvement is present in the case of the AraUDro col. test collection. For lengthier exons however, (in the length width 300–500 nts, lines 1, 2 and 3b) the use of PEST instead of PGM results in a 1.76-fold, 1.7-fold and 1.75-fold improvement for human, mouse and AraUDro col. test sets, respectively. For very lengthy exons (exon lengths ¿ 500 nts, lines 1 and 2c) this improvement gets 2.44 and 5.4 for human and mouse, respectively. In the Arabidopsis / Drosophila test collection a considerable improvement is also reached, even if the extremely high QEST =QPGM ratio of line 3c seems to be due to insu8cient statistics. Notice that, in agreement to the results of Section 1 (see also [14,16]), the human training collection seems to work equally well for human and mouse test collections. Moreover, it also works very well for collections originating from organisms like Arabidopsis and Drosophila, although marked by di6erent W values (see Section 1) than the training set (here of human origin). 3.2.1. Technical note In this simple assessment of the proposed algorithm, for the sake of simplicity, several options are kept as simple as possible. Thus, when examining a test-collection exon, we use as active “training
640
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
set”, the training set exons just belonging into the same length band (100–300, 300–500, 500–∞). This simpli,cation, somehow reduces the obtained values of improvement of the new estimator. When using the algorithm with a new, unknown-functionality GeneMark-predicted exon of length, say, Le , it is better to use, in general, an “active test collection” belonging to a length range centered at Le . A thump rule may be to take into account as training set exons, the ones belonging to the length width (Le − 100)–(Le + 100) provided that Le is less than ∼ 400 nts. For 400 ¡ Le ¡ 600 the length width (Le − 100)–∞ may be taken as training set exons, while for higher length exons, a good choice would be to use the length width 500–∞, unless a very large training set is available. The starting dimensions of the sampling box always have to be taken su8ciently small, as was always the case in the presented tests. In the last column of Table 7, number I of steps of the box’s increase is given. Notice that for sets of lengthy exons, the sampling box has to considerably increase in volume (high I values are reached by the algorithm’s auto-adjustment feature), in order to host a su8cient number of exon-points.
4. Concluding remarks In the preceding Sections 1 and 2 of this work, two closely related algorithms are presented, aiming to improve the result of existing gene ,nding tools, by implementing two measures of the coding potential of sequence segments recently developed (CAM and MSD, see [14,16]). The ,rst algorithm may be applied on a set of predicted exons and fractionalizes it into subsets with di6erent mean expectation to be true-coding, while the second, assigns to an individual predicted exon, an estimation of its probability to be coding, based on a “training set” comprising predicted exons of known functionality. In the present work, GeneMark was taken as a standard gene-,nding tool combined with CAM and MSD. Our results show that the three-criterial method has an advantage in comparison to GeneMark alone. It is reasonable to believe, that the proposed algorithm bene,ts on the fact that MSD and CAM address statistical and linguistic properties that di6er from the ones addressed by GeneMark. In this way, the combination of the three approaches is advantageous in both the discovery of exons and the veri,cation of their prediction with a smaller error rate. One advantage of the method described in Section 2, is that it seems to work equally well, computing coding probabilityestimations, for exons of the same or di6erent origin with the training set. We ,nd that this holds, not only if thresholds or training collections of human origin are applied to mouse sequences (whose Markov chain matrices are any way close enough), but also, if human-based training is applied on Drosophila or Arabidopsis sequences. This result is expected to be further corroborated, given the marked species independence found for CAM and MSD values attributed to coding and non-coding sequences [14,16]. Other widely used gene-,nding tools may be considered instead of GeneMark. In another direction of further development of the presented algorithms, other measures of the coding potential of DNA segments may be used alternatively to MSD and CAM or in combination with them in multi-criterial methods. Further re,nement of this approach, especially, combination of CAM with other tools which are also sensitive to the reading frame [19], may lead to modi,cations of the presented method allowing eventual correction of the reading frame initially proposed by the gene ,nding tool. All programs and data sets used throughout are available upon request.
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
641
Appendix A. The CAM algorithm The CAM algorithm may be described as follows: Consider the examined sequence as a juxtaposition of triplets in the three possible reading frames (RF = 1–3). For each reading frame the following steps are executed: 1. The 64 frequencies of occurrence of triplets are computed, denoted each as Fijk (subscripts i; j; k stand for the ,rst, second and third nucleotide of the triplet). 2. We compute, in each reading frame separately, the frequencies of occurrence for the four nucleotides (A; G; C; T ) for each position in the triplet: ,rst (f), second (s) and third (t): FA; f , FG; f , FC; f , FT; f , FA; s , FG; s , etc. Then, for each triplet frequency we compute in all three RFs a “correcting ratio” Rijk , which has implicit in its structure the RNY motif of the most preferred codons following the formula: Rijk =
Fijk : Fk; f Fj; s Fi; t
3. We compute the Codon Asymmetry Measure for each RF (CAM1; 2 CAMRF =
A;G;C;T A;G;C;T A;G;C;T i
j
k
RF |RRF ijk − Rkji | ; 48
or 3 )
as follows:
RF = 1; 2; 3:
The denominator 48 stands for the number of the non-zero contributing mutually (per couple) symmetric triplets. Finally, we de,ne CAM as the maximum between CAM1 , CAM2 and CAM3 : CAM = max{CAM1 ; CAM2 ; CAM3 }: Appendix B. The MSD algorithm The MSD algorithm may be described as follows: 1. We divide our sequence of length L, in L−m+1 partially superposed blocks of length m and we compute the four nucleotide frequencies of occurrence for every segment: P(m)nu; j (nu =A; G; C; T and j = 1; 2; 3; : : : ; L − m + 1). 2. We associate with each of these segments of length m, a longer “halo” of length h, which is always centered at the centre of the jth segment, and used for the computation of local means. We de,ne the quantity: Q2 (m; h)nu; j = (P(m)nu; j − P(h)nu; j )2 ; where P(h)nu; j is the same-nucleotide-frequency of the “halo” associated to the jth segment. 3. In the next step, we average over all L − m + 1 segments, de,ning in this way the “Modi,ed Standard Deviation” of juxtaposed blocks MSD(m; h)nu for given m; h and for the nucleotide nu = A; G; C; T : 2
MSD (m; h)nu =
L− m+1
Q2 (m; h)nu; j =(L − m + 1):
j=1
642
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
4. Finally, we average over the four nucleotides in order to obtain MSD(m; h) used in the following as a quanti,cation of the nucleotide clustering of sequences. We use weighed averages with weight for every nucleotide its frequency of occurrence in the whole sequence fnu ; nu = A; G; C; T . fnu MSD(m; h)nu MSD(m; h) = nu=A;G;C;T
The parameter values used herein are: m = 15; h = 500. It is shown [16] that for sequences shorter than 500 nucleotides, small values of m (around 15) give the best coding/non-coding distinction scores. References [1] A.D. Baxevanis, B.F. Ouellette, Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley, New York, 2001. [2] J.W. Fickett, C.S. Tung, Assessment of protein coding measures, Nucl. Acids Res. 20 (1992) 6441–6450. [3] G.B. Singh, S.A. Krawetz, Computer-based exon detection: an evaluation metric for comparison, Int. J. Genome Res. 1 (1994) 321–338. [4] R. Lopez, F. Larsen, H. Prydz, Evaluation of the exon prediction of the GRAIL software, Genomics 24 (1994) 133–136. [5] E.E. Snyder, G.D. Stormo, Identi,cation of protein coding regions in genomic DNA, J. Mol. Biol. 248 (1995) 1–18. [6] M. Burset, R. Guigo, Evaluation of gene structure prediction methods, Genomics 34 (1996) 353–367. [7] R. Guigo, P. Agarwal, J.F. Abril, M. Burset, J.W. Fickett, An assessment of gene prediction accuracy in large DNA sequences, Genome Res. 10 (2000) 1631–1642. [8] S. Rogic, A.K. Mackworth, F.B. Ouellette, Evaluation of gene-,nding programs on mammalian sequences, Genome Res. 11 (2001) 817–832. [9] E. Kraemer, J. Wang, J. Guo, S. Hopkins, J. Arnold, An analysis of gene,nding programs for Neurospora crassa, Bioinformatics 17 (2001) 901–912. [10] K. Murakami, T. Takagi, Gene recognition by combination of several gene,nding programs, Bioinformatics 14 (1998) 665–675. [11] S. Rogic, B.F. Ouellette, A.K. Mackworth, Improving gene recognition accuracy by combining predictions from two gene-,nding programs, Bioinformatics 18 (2002) 1034–1045. [12] V. Pavlovic, A. Garg, S. Kasif, A Bayesian framework for combining gene predictions, Bioinformatics 18 (2002) 19–27. [13] M. Borodovsky, J. McIninch, GeneMark: parallel gene recognition for both DNA strands, Comput. Chem. 17 (1993) 123–133. [14] C. Nikolaou, Y. Almirantis, Mutually symmetric and complementary triplets: di6erences in their use distinguish systematically between coding and non-coding genomic sequences, J. Theor. Biol. 223 (2003) 477–487. [15] Y. Almirantis, A standard deviation based quanti,cation di6erentiates coding from non-coding DNA sequences and gives insight to their evolutionary history, J. Theor. Biol. 196 (1999) 297–308. [16] C. Nikolaou, Y. Almirantis, A study of the middle-scale nucleotide clustering in DNA sequences of various origin and functionality by means of a method based on a modi,ed standard deviation, J. Theor. Biol. 217 (2002) 479–492. [17] B.E. Blaisdell, A prevalent persistent global non-randomness that distinguishes coding from non-coding eukaryotic nuclear DNA sequences, J. Mol. Evol. 19 (1983) 122–133. [18] A.V. Lukashin, M. Borodovsky, GeneMark.hmm: new solutions for gene ,nding, Nucl. Acids Res. 26 (1998) 1107–1115. [19] C. Nikolaou, Y. Almirantis, Uneven patterns in the frequencies of occurrence of n-tuplets in genomic sequences may reveal their coding potential, J. Mol. Evol. (2004) in press.
Y. Almirantis, C. Nikolaou / Computers in Biology and Medicine 35 (2005) 627 – 643
643
Yannis Almirantis has received his bachelor degree in Chemistry from the University of Thessaloniki, Greece in 1981, and his DEA in Organic Chemistry from the University of Paris XI at Orsay, France, in 1983. He has ,nished his Ph.D. Thesis at the Free University of Brussels, Belgium, entitled: ”Chiral symmetry breaking” and pattern selection in chemical and biological systems”, in 1988. He has worked as research associate (1989–1995) and as researcher (1996–now) at the National Research Center for Physical Sciences “Demokritos”, Athens, Greece, on the modeling of developmental events and the study of statistical and probabilistic aspects of genome organization. Christoforos Nikolaou received a degree in Chemistry from the University of Patras, Greece in 1999. In 2000 he was awarded a 4-year scholarship from the National Research Center for Physical Sciences “Demokritos”. He is currently a member of the Theoretical Biology and Computational Genomics group at the Institute of Biology, working on his Ph.D. Thesis, which is dealing with scale-dependent non-randomness in genomic sequences, to be presented at the University of Athens in the following year.