gene and exon prediction using time domain ... - Semantic Scholar

Report 2 Downloads 89 Views
GENE AND EXON PREDICTION USING TIME DOMAIN ALGORITHMS Eliathamby Ambikairajah, Julien Epps*, and Mahmood Akhtar The University of New South Wales, Sydney 2052, Australia, *National ICT Australia, Eveleigh 1430, Australia [email protected], [email protected], [email protected] (triplet of three base pairs) usage frequency. Many techniques for gene and exon prediction based on period-3 periodicity have been used and proved successful, some of which are discussed below. In order to apply digital signal processing techniques to periodicity detection, the character sequences of DNA can be first converted into four binary indicator numeric sequences [12]. For example, for a DNA sequence x[n] = ATGCTATT…., the binary indicator sequence for each base type would look like: xA[n] = {1, 0, 0, 0, 0, 1, 0, 0, ….}, xC[n] = {0, 0, 0, 1, 0, 0, 0, 0, ….}, xG[n] = {0, 0, 1, 0, 0, 0, 0, 0, ….}, xT[n] = {0, 1, 0, 0, 1, 0, 1, 1, ….}, where xA[n] + xC[n] + xG[n] + xT[n] = 1

ABSTRACT The protein-coding regions of DNA sequences normally exhibit a period-3 behaviour that can be used to identify gene locations. Various methods have been used to automatically identify the coding regions, however these have been predominantly ‘frequency’ domain techniques. Numerous ‘time’ domain techniques are available from the signal processing literature, and this work examines their applicability to gene and exon prediction in DNA sequences. Two techniques new to this application are introduced: the Time Domain Periodogram (TDP) and the Average Magnitude Difference Function (AMDF). We also present an indicative comparison of time domain and existing frequency domain techniques, from which the AMDF appears to be the most promising technique. Autoregressive modelling methods are further investigated in this work.

and where n represents the base index. Different types of numerical representations sequences (e.g. binary, complex, weighted complex, quaternionic) can be employed, however in this work we confine ourselves to the commonly used binary indicator sequences [12]. Anastassiou [2] presented an optimized spectral content measure based on a sliding DFT window for exon detection. Vaidyanathan and Yoon [11] proposed the use of an IIR anti-notch filter. Recently, Rao and Shepherd [7] proposed the use of an auto-regressive (AR) model for detection of 3-periodicity for small DNA sequences. However, all of these techniques are frequency-domain techniques. The notion of ‘frequency’ is relevant to DNA sequences in the sense that portions of the sequence can occur with periodic repetition (particularly during coding regions). The notion of ‘time’ is similarly relevant from the perspective that the periodicity of repetition can be estimated using time domain techniques similar to pitch determination algorithms in speech processing. In this paper, we propose the use of two computationally efficient time-domain techniques that are new in this application domain: the Time Domain Periodogram (TDP) and the Average Magnitude Difference Function (AMDF). We also present a comparison between these proposed ‘time’ domain and the ‘frequency’ domain DFT and AR techniques. The DFT has been used widely for sequence processing [2], whereas AR modelling has recently emerged as a useful alternative for identifying short coding and non-coding regions as discussed above.

1. INTRODUCTION The gene segments of the DNA molecule are known to carry useful information and are responsible for protein synthesis [6]. The understanding of the nature of this information and its role in determining the particular biological functions encoded by the genes is an ongoing and increasingly multidisciplinary research problem. A key step towards the solution of this problem is to find the protein coding regions in eukaryotes. This is because in eukaryotes, the protein coding regions (exons) are separated by non-coding regions (introns), whereas in procaryotes these regions are continuous. Tsonis et al. [10] employed Fourier analysis of DNA coding and non-coding sequences in an attempt to identify possible patterns in gene sequences. They found that while intronic sequences show a rather random pattern, coding sequences show periodicities and in particular a periodicity of 3. In order to completely understand this periodicity of 3, consider the periodic sequence A-- A-- A-- A-- …, where the blanks ‘-’ can be filled randomly by A, T, C or G (the four types of bases in a DNA sequence). This sequence shows a periodicity of three because of the repetition of the base A. Trifonov [9] suggested that one of the initial causes of periodicity could be the universal (RNY)n pattern (R=A or G, Y=C or T, N=any base). Eskesen et al. [3] concluded that DNA periodicity in exons is determined by codon

0-7803-9243-4/05/$20.00 ©2005 IEEE

199

2. EXISTING APPROACHES TO EXON PREDICTION

N −1

− j

2 π nk N

,

(1)

where k = 0, … , N-1. The DFTs XA[k], XT[k], XC[k] and XG[k] for the above binary indicator sequences can thus be calculated using equation (1). The periodicity of 3 in protein coding regions of a DNA sequence suggests that the DFT coefficient corresponding to k = N/3 (where N is chosen to be a multiple of 3) in each DFT sequence should be large. Thus if we take the window size N to be sufficiently large, but a multiple of 3 (in the order of a few hundred, e.g. 351), a peak is observed at the sample value k = N/3 [2]. The strength of the peak depends markedly on the gene, as it is sometimes highly pronounced, but at times quite weak. Note that the calculation of the DFT at a single frequency (k = N/3) is sufficient. The window is then moved by one sample or more in order to obtain a plot of S[k] where

S [k ] =



X m [k ] ,

3. NEW TIME DOMAIN APPROACHES TO EXON PREDICTION 3.1. Overview

(2)

The approach we have taken to applying time-domain algorithms to DNA sequences is shown in the block diagram of Figure 1. The sequence is first converted to four binary indicator sequences.

m

and m ∈ {A, T, C, G}, and this has been used as a measure of the total spectral information of DNA character string by Tiwari et al. [8], and later, with optimization, by Anastassiou [2]. A longer window implies longer computation time and compromises the base-domain resolution in predicting the exon location. 2.2. IIR Anti-notch Filter

DNA sequence

An IIR digital filter with a magnitude response showing a sharp peak at θ = 2π/3, known as an antinotch filter, can be used to detect the periodicity of 3 in a DNA sequence [11]. The binary indicator sequences xA[n], xT[n], xC[n] and xG[n] are applied to the input of such an anti-notch filter, producing outputs yA[n], yT[n], yC[n] and yG[n] respectively. The expression Y [n ] =



y m [n ] , 2

k

Pole positions can thus be estimated using linear prediction, and from the all-pole model, the magnitude at frequency 2π/3 can be estimated [7]. This technique has particular value for short DNA sequences, where a DFT would require a large window length to obtain the required frequency resolution The power spectrum can then be estimated from the expression in (4) at the frequency θ = 2π/3. The choice of model order is of particular interest, since choosing p too low or too high will result in unnecessarily smoothed or spurious modelling of spectral peaks respectively. In [7], relatively large model orders of 30 and 50 are employed. A particular advantage of the AR technique is that it requires relatively few sample values (i.e. base pairs) to calculate the AR model, which is convenient if the exon regions are short and/or closely spaced.

n=0

2

∑a z

(4)

, −k

k =1

The discrete Fourier transform (DFT) of a sequence x[n] of length N is defined as

∑ x [n ] e

p

1+

2.1. DFT of Short Segments

X [k ] =

1

H ( z) =

Conversion to binary indicator sequence

uA[n]

Bandpass filtering

bA[n]

Time domain algorithm

dA[n]

Conversion to binary indicator sequence

uC[n]

Bandpass filtering

bC[n]

Time domain algorithm

dC[n]

Conversion to binary indicator sequence

uG[n]

Bandpass filtering

bG[n]

Time domain algorithm

dG[n]

Conversion to binary indicator sequence

uT[n]

Bandpass filtering

bT[n]

Time domain algorithm

dT[n]

Linear combination

coding/noncoding decision

Figure 1. Overall scheme of exon prediction using time domain algorithms.

(3)

m

The binary indicator sequences cannot be processed by the full range of traditional digital signal processing techniques until they are converted into non-binary numerical values. We achieve this by filtering the binary indicator sequences with a second-order resonant filter with centre frequency set to 2π/3 (similar to [11]), which emphasizes the period-three behaviour of the DNA sequence, but de-emphasizes the other components, as shown in Figure 2.

where m ∈ {A, T, C, G}, can then be used as a feature to predict the coding regions in a DNA sequence. 2.2. AR Modelling Autoregressive modelling has previously been used with considerable success in speech processing. AR models assume that the data has been generated from a process containing p poles only, as follows:

200

containing sub-sequences of length equal to the period (k) being tested. The columns are then summed, and the maximum and minimum of the resulting vector are then used to derive the final estimate of the degree of periodicity at period k, as follows: N /k ⎡ ⎫⎪⎤ ⎧⎪ N / k ⎢TDPvector [k ] = ⎨ b[kn − (k − 1)],..., b[kn]⎬⎥ (8) ⎢⎣ ⎪⎭⎥⎦ ⎪⎩ n =1 n =1





TDP [k] = max(TDPvector[k]) − min(TDPvector[k])

Practically, TDP[k] will produce a peak if correlation exists at period k = 3. It can be shown [1] that for large N, TDP[k] has a very sharp peak, enabling accurate detection of periodicity. For efficient implementation, equation (9) can be simplified to:

Figure 2. Second-order resonant filter with centre frequency 2π/3.

TDP [k] = max(TDPvector[k]) .

The transfer function of the resonant filter is given by

H ( z) = 2π θ= 3

1 − z −2 1 − 2 R cos θz

−1

2 −2

+R z

, with

and R = 0.992 .

Following the convention of several recent authors (e.g. Vaidyanathan and Yoon [11], Guan and Tuqan [4]), and for the purposes of comparison with their previously reported results, we show our results for the gene F56F11.4 in C-elegans (base number 7021 – 15080 in chromosome III; accession number AF099922), containing five exons. The proposed AMDF and TDP time-domain algorithms, along with the two existing techniques described in section 2 were applied to F56F11.4. Figure 3 shows the four individual band pass filtered binary indicator sequences bA[n], bT[n], bC[n], bG[n] after application of the AMDF, and then the combined sequence, together with the true coding regions indicated. From this it is evident that some bases contribute to the period-3 behaviour more than others, however in combination, the gene regions are very clear.

(6)

The Average Magnitude Difference Function (AMDF) has long been used in speech processing for pitch determination, and can be applied to any periodic or near-periodic sequence of length greater than the period to derive an estimate of the periodicity. The AMDF for a bandpass-filtered binary indicator sequence b[n] as a function of the period k is defined as 1 N

N

∑ b[n] − b[n − k ] ,

(10)

4. EXPERIMENTAL RESULTS

(5)

3.1. Average Magnitude Difference Function

AMDF [k ] =

(9)

(7)

n =1

where N is the window length. In our case, k = 3. Practically, the AMDF function will produce a deep null if significant correlation exists at period k = 3. Because we have band pass filtered the signal before applying the AMDF, small values are found throughout the non-coding and coding regions. By contrast, if a different period k ≠ 3 is tested, it will produce relatively low AMDF values during the non-coding regions (where there is relatively more non-period 3 behaviour) and extremely large AMDF values during the coding regions (where the behaviour is entirely period 3). AMDF can be used for small or long DNA sequences when attempting to detect periodicity of 3. 3.2. Time Domain Periodogram

Figure 3. Sample-by-sample AMDF values of the four bandpass-filtered binary indicator sequences for Celegans, and their sum. The binary sequence in the final plot represents the hand-labelled locations of the intronic (low value) and exonic (high value) regions.

The Time Domain Periodogram (TDP) is an algorithm used for periodicity detection in sunspots and pitch detection for speech processing. In contrast to the AMDF, the data are summed rather than taking differences. The data are arranged in a matrix, with rows

201

Figure 4 shows an example comparison of four techniques: ƒ Discrete Fourier Transform (DFT), with window length N = 351 base pairs, as in [11]. ƒ Autoregressive modelling (AR), with 20 poles and window length N = 81 base pairs. ƒ Average Magnitude Difference Function (AMDF), window length N = 117 base pairs. ƒ Time Domain Periodogram (TDP), window length N = 117 base pairs.

5. CONCLUSION

This paper has reviewed three existing methods and proposed two new time domain techniques for exon region prediction in DNA sequences using period-3 behaviour. The proposed average magnitude difference function and time domain periodogram techniques produce equivalent gene and exon prediction accuracy to existing techniques, but achieve this for very small frame lengths, allowing better exon discrimination where the coding regions are short and/or closely spaced. Future work will include performing a larger scale comparison of all methods with the GENSCAN software over a wide range of different types and lengths of DNA sequences, examining singular value decomposition techniques for accurate segmentation of very short and/or closely spaced coding regions, and investigating accurate endpoint detection and robust classification.

In recent experimental work, we have tested the above techniques on 39 different sequences of different coding region lengths and spacings. We have found that each of the methods above were sensitive to the window length N. It was found that the AR method was well-suited to the detection of shorter coding regions, whereas the DFT required a larger window length, and the proposed AMDF and TDP techniques required only slightly longer window lengths. From figure 4, it is evident that the two time domain algorithms perform with high accuracy compared to the frequency domain techniques, and have the additional advantages of computational efficiency and accurate estimation at smaller frame sizes. While the exact parameter values are expected to vary from one database to the next, the characteristics observed in these experiments are conjectured to hold for other databases. It is also clear from Figure 4 that the two time domain techniques proposed show a very smooth contour compared to the existing frequency domain techniques, thus facilitating the accurate detection of end points (exon regions).

6. ACKNOWLEDGEMENTS

This work is supported by a UNSW Goldstar Award, 2005 for genomic signal processing research. 7. REFERENCES [1] Ambikairajah, E., and Carey, M. J., “The Time-Domain Periodogram Algorithm”, Signal Processing, vol. 5, 1983, pp. 491-513. [2] Anastassiou, D., “Frequency-domain analysis of biomolecular sequences”, Bioinformatics, vol. 16, no. 12, 2000, pp. 1073-1082. [3] Eskesen S.T., Eskesen, F.N., Kinghom, B.. and Ruvinsky, A., “Periodicity of DNA in exons”, BMC Molecular Biology, vol. 5, no. 1, August 2004, p. 12. [4] Guan, R., and Tuqan, J., “IIR Filter Design for Gene Identification”, in Proc. IEEE Workshop on Gen. Sig. Proc and Stat., 2004. [5] Gutierrez, G., Oliver, J. L. and Martin, A., “On the origin of the periodicity of three in protein coding DNA sequences”, J Theor Biol, vol. 167, no. 4, 1994, pp. 413-414. [6] Mount, D. W., Bioinformatics Sequence and Genome Analysis, Cold Spring Harbor Lab. Press, New York, 2001. [7] Rao, N. and Shepherd, S.J., “Detection of 3-periodicity for small genomic sequences based on AR technique”, Int. Conf. on Comm., Circ. and Sys., vol. 2, 2004, pp. 1032-1036. [8] Tiwari, S., Ramachandran, S., Bhattacharya, A. and Ramaswamy, R., “Prediction of probable genes by Fourier analysis of genomic sequences”, CABIOS, vol. 113, 1997, pp. 263-270. [9] Trifonov, E. N., “Elucidating sequence codes: three codes for evolution”, Ann NY Acad Sci, vol. 870, 1999, pp. 330-338. [10] Tsonis, A. A., Elsner, J. B. and Tsonis, P. A., “Periodicity in DNA coding sequences: Implications in Gene Evolution”, Journal of Theor. Biol., vol. 151, no. 3, 1991, pp. 323-331. [11] Vaidyanathan, P. P., and Yoon, B.-J., “Gene and exon prediction using allpass-based filters”, in Proc. IEEE Workshop on Gen. Sig. Proc and Stat., 2002. [12] Voss, R. F., “Evolution of long-range fractal correlations and 1/f noise in DNA base sequence”, Physical Review Letters, vol. 68, 1992, pp. 3805-3808.

Figure 4. Comparison of four techniques for exon prediction in the gene F56F11.4 in C-elegans (base no. 7021–15080, chromosome III, accession no. AF099922), using period-3 properties. The binary sequences represent the hand-labelled locations of the intronic (value 0) and exonic (value 1) regions.

202