A profile-based protein sequence alignment algorithm ... - IEEE Xplore

A profile-based protein sequence alignment algorithm for a domain clustering database

Lin Xu1,2, Fa Zhang1 and Zhiyong Liu3
1, Key Laboratory of Computer System and Architecture, the Institute of Computing Technology, Chinese Academy of Sciences
2, Graduate School of Chinese Academy of Sciences
3, National Natural Science Foundation of China

Abstract- Aiming at the two main shortcomings in homology modeling, we have designed and established a domain clustering database. Searching the database is a fundamental operation on it. However, current alignment algorithms are mainly based on sequences and ignore the structure conservation within a domain. This paper proposes a profile-based alignment that incorporates structure information into the profile, based on the characteristics of our domain database. We designed an experiment within the database. The results show that both the quality and the sensitivity of our scheme are better than those of the pure Smith-Waterman and sequence-based profile algorithms. We strongly believe that this work can help to improve protein structure prediction.

I. INTRODUCTION

Sequence alignment is a fundamental tool in Computational Biology and Bioinformatics. With this tool we can obtain a lot of useful information, such as which genes have the same function, which RNAs belong to the same class and which proteins have the same structure topology. Moreover, in the area of protein structure prediction, obtaining the alignment between a protein sequence of unknown structure (the query) and its homologs of known structure (the templates) is the most fundamental step in the modeling process, and the quality of the alignment greatly affects the prediction result.

Generally speaking, there are three categories of methods to create an alignment: single sequence based, multiple sequence alignments and profile based. Single ... Smith-Waterman algorithm [2]. Since this method only utilizes the sequence information, the quality of the alignment drops greatly when the sequence identity is less than 30%.

Multiple sequence alignments create an alignment between more than three sequences. Since the simultaneous alignment of several sequences is an NP-hard computational problem, most of the methods use a heuristic algorithm, such as ClustalW [3], DIALIGN [4] and T-COFFEE [5]. However, the alignment quality and the computational cost are two critical problems in this kind of method.

Profile-based methods have greatly accelerated with the development of the PSI-BLAST program by Altschul et al [6]. These methods improve the alignment quality by using a profile to describe the characteristics of the similar sequences and aligning a sequence or a profile with another profile. Because the profile accurately records the most relevant information from the multiple sequence alignment, the quality of this method is better than that of the others. Several groups have published profile-to-profile alignment methods, such as PSI-BLAST [6] and HMMER [7]. Most profile-based methods use the standard Smith-Waterman local alignment method, but they vary significantly in a number of important respects, such as scoring functions, gap penalties, weighting schemes and whether a secondary-structure substitution matrix is added. Although all these methods use different information and different methodologies, an accurate alignment still remains a major challenge, especially when the sequence similarity falls into the twilight zone (3%

Fig 2. The distribution of different types in one coordinate cluster.

B. Building Profile from Sequence and Structure Information

Based on the sequence and structure information, we build a profile for each domain cluster. The profile is defined as a position-specific scoring matrix M(p,a) composed of 21 columns and m rows (m = length of the alignment). The first 20 columns of each row specify the scores of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position. At position p of alignment A (N structures), AA(a) is the class of amino acid type a, SS(i) is the class of alpha-carbon coordinate cluster i (introduced in the last section), and W(p,a) is the weight for the appearance of amino acid a at position p. For the sequence information, the weight of each amino acid type is determined as follows: suppose that there are n(a) items in residue class AA(a); then the average weight for class AA(a) is W1(p,a) = n(a)/N. For the structure information, the weight for each class is determined as follows: suppose that there are n(si) items in class SS(i); then the weight is W2(p,si) = n(si)/N. W(p,a) is then calculated from W1 and W2:

$$W(p,a) = \Big[\, W_1(p,a) \cdot \sum_{\text{all } SS(i)} n(a,i)\, W_2(p,s_i) \Big] \cdot \sigma$$

Here, σ is a normalization factor which ensures that

$$\sum_{a \in \{\text{amino acid types}\}} W(p,a) = 1.$$

The position-dependent penalties for insertions and deletions in the profile can be set to a high value to prevent insertions at positions where no gaps occur, and to a low value to allow insertions in regions where insertions are observed in the alignment. The penalty applied, gap(L), for creating a gap during the match of the profile to the query is given by gap(L) = gap'[gap_open + gap_ext*L], in which gap' is the penalty given in the last column of the profile, L is the number of residue positions in the gap, and gap_open and gap_ext are the penalties for gap opening and gap extension, respectively.

C. Profile-based alignment

Since our profile accurately records both sequence and structure properties, we can use the Smith-Waterman local alignment algorithm with the profile of each domain cluster to find which domain the query sequence most likely belongs to. The major difference between our profile-based alignment and both the raw dynamic programming algorithm and other profile-based alignment algorithms lies in the scoring scheme. Our profile-based alignment uses not only the sequence information derived from the domain cluster but also the structure information extracted from superimposed structure ensembles. In contrast, the raw dynamic programming algorithm bases its score on the comparison of amino acids at corresponding positions in two sequences, and other profile-based alignment algorithms mostly use sequence information derived from family sequences.

IV. RESULTS AND DISCUSSIONS

To evaluate the performance of the alignment scheme described in this paper, we tested it within the whole database.
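The profile weighting and the gap penalty defined in the subsection on building the profile can be illustrated with a short Python sketch. It is only one possible reading of the formulas, not the authors' implementation: the text does not define n(a,i), so the sketch assumes it is the number of structures that show amino acid a together with coordinate cluster i at the position, and the function names and the input layout (one `(amino_acid, cluster_id)` pair per structure) are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_weights(column):
    """Combined weights W(p, a) for one alignment position p.

    `column` holds one (amino_acid, coord_cluster) pair per structure,
    so len(column) plays the role of N in the text.
    """
    n_total = len(column)                       # N
    aa_counts = Counter(a for a, _ in column)   # n(a)
    ss_counts = Counter(s for _, s in column)   # n(s_i)
    pair_counts = Counter(column)               # n(a, i): ASSUMED co-occurrence count
    raw = {}
    for a in AMINO_ACIDS:
        w1 = aa_counts[a] / n_total             # W1(p, a) = n(a) / N
        # sum over all SS(i) of n(a, i) * W2(p, s_i), with W2 = n(s_i) / N
        structure_term = sum(
            pair_counts[(a, s)] * ss_counts[s] / n_total for s in ss_counts
        )
        raw[a] = w1 * structure_term
    total = sum(raw.values())
    sigma = 1.0 / total if total else 0.0       # normalization: weights sum to 1
    return {a: w * sigma for a, w in raw.items()}


def gap_penalty(gap_prime, length, gap_open, gap_ext):
    """gap(L) = gap' * (gap_open + gap_ext * L) for a gap of `length` residues."""
    return gap_prime * (gap_open + gap_ext * length)


# Toy position with N = 4 structures and two coordinate clusters (0 and 1).
weights = profile_weights([("A", 0), ("A", 0), ("G", 1), ("A", 1)])
print(round(weights["A"], 3), round(weights["G"], 3))
```

In this toy column, alanine dominates both the sequence counts and the larger structure co-occurrences, so it receives most of the normalized weight. How the weights are turned into the scores stored in the first 20 profile columns is not specified in the text and is left open here.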

In each domain cluster there is a reference sequence whose structural distance to the other members is the smallest. We also selected the sequence whose structural distance to the reference is the largest as the benchmark. The benchmark sequence was then searched against the whole database with our profile-based alignment algorithm.

Using the statistics obtained from the database, we classified the domain clusters into four types: sequence and structure conserved; structure conserved; sequence conserved; and mixed. Conservation is defined as the number of amino acid types or 3D coordinate clusters being less than half of the total number at each position in the alignment. The sequence and structure conserved type contains the domain clusters whose amino acid types and 3D coordinates are both conserved; the structure conserved type contains the domains in which only the 3D coordinates are conserved; in the third type only the amino acid types meet the conservation condition; the last type consists of both amino acid conserved parts and 3D coordinate conserved parts. The number of domain clusters in each type is listed in Table II.

TABLE II. THE NUMBER OF DOMAIN CLUSTERS IN EACH TYPE

Class type                            Number of domain clusters
Sequential and structural conserved   1051
Structural conserved                  28
Sequential conserved                  784
Mixed type                            974

We picked some domain clusters from each type to evaluate the scoring scheme of our alignment algorithm. In each domain cluster, we selected a query and then aligned it against the whole database.

A. Sequence and Structure Conserved Domain

Domain clusters of this type are the most conserved. Within this type, our scoring scheme can reflect the conserved features. To evaluate the alignment significance, we compared our profile-based alignment with the pure Smith-Waterman sequence alignment and sequence-based profile alignment algorithms. The score matrix in the Smith-Waterman algorithm is BLOSUM62. The gap opening and gap extension penalties are 12 and 2, respectively. The sequence-based profile alignment is an ordinary profile-based alignment: it builds the profile by counting the occurrences of each amino acid type and using that count as the weight of the type. Using these 3 alignment methods, we compared the queries (1,051 datasets in total) with the entire database; Table III shows the number of hits and false results for each method. From the table we can see that the profile-based methods improve the alignment significance. Using the consensus sequences aligned by the Smith-Waterman algorithm, we obtain only 788 (~75%) hits in this cluster type, which is a high false rate. The sequence-based profile brings the sequence information into the profile and the scoring scheme and raises the hit rate to 88%, but 122 false hits still remain. Our profile contains not only the sequence information but also the structure information, so it raises the hit rate to 91%.

TABLE III. THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                             Hits        False
Smith-Waterman                     788 (75%)   263 (25%)
Sequence-based profile alignment   929 (88%)   122 (12%)
Combined profile alignment         952 (91%)   99 (9%)

Fig 2. (A) A segment of the multiple structure alignment in cluster IPR000291. (B) The relevant structure superposition. (C) The alignment between the query and cluster IPR005905 by the Smith-Waterman algorithm.

Fig 3. Distribution of alignment scores for comparing a query from IPR000291 with the whole database (x-axis: Score; y-axis: Number of Entries).

Fig.2 and Fig.3 demonstrate another example in which our method is more sensitive than the other two methods. Here, we chose a query, labeled 2dln_248_276, from the domain cluster IPR000291. Fig. 2A and 2B show a segment of the multiple structure alignment and the relevant structure superposition in the cluster. These domains are very similar at the structure level but differ somewhat at the sequence level. We also note that a domain in cluster IPR005905 has a segment that is identical in sequence to the query, as shown in Fig. 2C. As a result, both the Smith-Waterman and the sequence-based profile alignment assigned the query to cluster IPR005905. However, our profile-based method can distinguish the query from other clusters. Fig.3 shows the alignment scores for comparing the query with the whole database. In this figure, the highest score is 160, which is the alignment score between the query and the profile of cluster IPR000291, the right domain cluster. So using our profile, we can significantly improve the alignment sensitivity.

B. Structure Conserved Domain

The domains of this type are only structurally conserved. The structure topology within one domain cluster takes on the same shape, but the amino acid type at some positions is variable. In biology the amino acid type can mutate while the structure and function stay the same. This phenomenon is difficult to handle with sequence alignment schemes, such as local, global or sequence-based profile alignment.

In this cluster type, we tested the 3 kinds of alignment methods; the hits and false results are shown in Table IV. Although there are only 28 domain clusters of this type, the results still show that our profile-based method can improve the hit rate. Because the profile reflects the characteristics of a family, the sequence-based profile method improves the hit rate a little. Furthermore, the combined profile method improves the results with the structure information.

TABLE IV. THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                             Hits       False
Smith-Waterman                     20 (71%)   8 (29%)
Sequence-based profile alignment   23 (82%)   5 (18%)
Combined profile alignment         27 (96%)   1 (4%)

We chose a query, labeled 1blbA1_175, from the domain cluster IPR001064. Fig.4A shows the structure superposition in the domain cluster. Since the amino acid type at some positions is variable, both the Smith-Waterman and the sequence-based profile alignment methods gave the highest score to the consensus of cluster IPR011024. Although some segments in the alignment were matched well, as shown in Fig.4B, the result was wrong. However, our profile-based alignment method gives the highest score to the consensus of the right domain cluster, as shown in Fig.5. The highest score is 119, which is the alignment score between the query and the profile of cluster IPR001064.

Fig. 4. (A) The structure superposition in cluster IPR001064. (B) The alignment between the query and the consensus of cluster IPR011024.

Fig. 5. The distribution of alignment scores for comparing the query from cluster IPR001064 with the whole database (x-axis: Score; y-axis: Number of Entries).

C. Sequential Conserved Domain

TABLE V. THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                             Hits        False
Smith-Waterman                     665 (85%)   119 (15%)
Sequence-based profile alignment   735 (94%)   49 (6%)
Combined profile alignment         745 (95%)   39 (5%)
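The percentages in the hit/false tables can be re-derived from the raw counts. The following minimal Python check recomputes the hit rates of Tables III, IV and V; the dictionary layout and the function name are illustrative choices, not part of the original work.

```python
# Hit and false counts exactly as reported in Tables III, IV and V.
TABLES = {
    "III": {
        "Smith-Waterman": (788, 263),
        "Sequence-based profile alignment": (929, 122),
        "Combined profile alignment": (952, 99),
    },
    "IV": {
        "Smith-Waterman": (20, 8),
        "Sequence-based profile alignment": (23, 5),
        "Combined profile alignment": (27, 1),
    },
    "V": {
        "Smith-Waterman": (665, 119),
        "Sequence-based profile alignment": (735, 49),
        "Combined profile alignment": (745, 39),
    },
}


def hit_rate(hits, false):
    """Percentage of queries assigned to the right domain cluster."""
    return 100.0 * hits / (hits + false)


for table, rows in TABLES.items():
    for method, (hits, false) in rows.items():
        print(f"Table {table:<3} {method:<34} {hit_rate(hits, false):5.1f}%")
```

Within each table, hits plus false results add up to the number of domain clusters of that type (1051, 28 and 784), which is a quick consistency check on the reported figures.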

There are 784 domain clusters of this type in our database. Table V compares the results of the 3 kinds of alignment methods. Here we selected a query from domain cluster IPR001356. Fig.6A shows that there are some variable regions in these domains, and Fig.6B shows that the sequences are more conserved. Fig.6C shows that the highest alignment score is 57, between the query and the profile from cluster IPR001356, whereas the other two methods implied that the query belongs to cluster IPR007107. Therefore, our scheme again proved to improve the alignment results and sensitivity.

Fig 6. (A) The structure superposition and (B) the multiple structure alignment in domain cluster IPR001356. (C) The distribution of alignment scores for comparing the query from cluster IPR001356 with the whole database (x-axis: Score; y-axis: Number of Entries).

D. Mixed Type

Table VI compares the results of the three methods on the mixed type datasets. Because there are some variable regions at the sequence level in this kind of domain cluster, the Smith-Waterman algorithm behaves much worse than the others. Our profile-based method can improve the accuracy by combining the sequence and structure information, although the hit rate is only about one percent higher than that of the sequence-based profile alignment.

TABLE VI. THE NUMBER OF HITS AND FALSE FOR EACH METHOD

Method                             Hits   False
Smith-Waterman                     428    546
Sequence-based profile alignment   936    38
Combined profile alignment         941    33

We randomly selected a query from cluster IPR004227 in this type. Fig. 7 shows the distribution of alignment scores for comparing the query with the whole database. The highest score implies that the query belongs to cluster IPR004227. This figure shows again that our method is more sensitive than the other 2 methods.

Fig 7. The distribution of alignment scores for comparing the query from cluster IPR004227 with the whole database (x-axis: Score; y-axis: Number of Entries).

V. CONCLUSION AND FUTURE WORKS

In this paper we proposed a profile-based alignment algorithm for use with our domain-based template database. The statistical analysis shows that most of the domain clusters in our database are conserved at both the structural and the sequence level, so each element in our profile combines the structural clustering information and the sequence information. With this profile, we developed a profile-based query-template alignment method. To validate whether our method is more accurate and sensitive than other query-template alignment methods, we divided our database into four types, based on sequence and structure conservation, and ran experiments on each type. The results from each type show that our profile accurately describes the features of the domain cluster and that our profile-based method aligns the query to the right template with a low error rate. They show that our method is more sensitive than other query-template alignment methods.
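To make the query-to-profile search summarized in this paper more concrete, the following Python sketch scores a query against one profile with a Smith-Waterman style local alignment and position-dependent affine gaps. It is a simplified illustration under stated assumptions, not the authors' implementation: the profile is represented as a hypothetical list of `(scores, gap_prime)` rows, residues missing from a row receive an assumed `mismatch` score, and the multiplier gap' of the current profile row is applied at every gap step, which reproduces gap(L) = gap'[gap_open + gap_ext*L] exactly only when gap' is constant along the gap.

```python
def local_profile_score(query, profile, gap_open, gap_ext, mismatch=-1.0):
    """Best local alignment score of `query` against `profile`.

    `profile` is a list of (scores, gap_prime) rows: `scores` maps residues
    to scores and `gap_prime` is the position-specific gap multiplier.
    """
    n, m = len(query), len(profile)
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # gap that skips profile rows
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # gap that skips query residues
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            scores, gap_prime = profile[j - 1]
            open_cost = gap_prime * (gap_open + gap_ext)  # gap(1)
            ext_cost = gap_prime * gap_ext                # gap(L + 1) - gap(L)
            E[i][j] = max(H[i][j - 1] - open_cost, E[i][j - 1] - ext_cost)
            F[i][j] = max(H[i - 1][j] - open_cost, F[i - 1][j] - ext_cost)
            match = H[i - 1][j - 1] + scores.get(query[i - 1], mismatch)
            H[i][j] = max(0.0, match, E[i][j], F[i][j])   # local: never below 0
            best = max(best, H[i][j])
    return best


# Toy profile with three positions that favour A, C and D respectively.
toy_profile = [({"A": 2.0}, 1.0), ({"C": 2.0}, 1.0), ({"D": 2.0}, 1.0)]
print(local_profile_score("ACD", toy_profile, gap_open=12, gap_ext=2))
```

Searching the database would then amount to calling this function once per domain cluster profile and assigning the query to the cluster with the highest score, as done in the experiments above.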

As described above, our final goal is protein structure prediction. How to use our domain-based template database and our profile-based query-template alignment method to improve the prediction of protein structure will therefore be investigated in our future work.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China project under 60503060 and key project under 90612019.

REFERENCES

[1] Needleman S, Wunsch C, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J Mol Biol, 1970, vol.48, p443-453.
[2] Smith T, Waterman M, “Identification of common molecular subsequences,” J Mol Biol, 1981, vol.147, p195-197.
[3] J. Thompson, D. Higgins, and T. Gibson, “CLUSTALW: improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting Position Specific Gap Penalties and Weight Matrix Choice,” Nucleic Acids Res, 1994, vol. 22, p4673-4680.
[4] Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou and Burkhard Morgenstern, “Fast and sensitive multiple alignment of large genomic sequences,” Bioinformatics 2003, vol.4, p 66-78.
[5] C. Notredame, D. Higgins, J. Heringa, “T-Coffee: A novel method for multiple sequence alignments,” J Mol Biol, 2000, vol.302, p205-217.
[6] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z. and et al, “Gapped BLAST and PSI-BLAST: A new generation of database programs,” Nucleic Acids Res, 1997, vol.25, p3389-3402.
[7] SR Eddy, “Profile hidden Markov models,” Bioinformatics, 1998, vol.14, p755-763.
[8] Fa Zhang, Jingchun Chen, Zhiyong Liu and Bo Yuan, “The construction of Structural Templates for the Modeling of Conserved Protein Domains,” International Conference on Bioinformatics and its Applications (ICBA’04), Fort Lauderdale, Florida, USA.
[9] Mulder, N.J., et al., “InterPro, progress and status in 2005,” Nucleic Acids Res, 2005, 33(Database Issue), p. D201-205.
[10] Berman, H.M., et al., “The Protein Data Bank,” Acta Crystallogr D Biol Crystallogr, 2002, 58(Pt 6 No 1), p. 899-907.
[11] Zdobnov, E.M. and R. Apweiler, “InterProScan--an integration platform for the signature-recognition methods in InterPro,” Bioinformatics, 2001, vol.17(9), p. 847-848.
[12] Bateman, A., et al., “The Pfam protein families database,” Nucleic Acids Res, 2004, 32(Database issue), p. D138-141.
[13] Murzin, A.G., et al., “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” J Mol Biol, 1995, 247(4), p. 536-540.
[14] Letunic, I., et al., “SMART 4.0: towards genomic data integration,” Nucleic Acids Res, 2004, 32(Database issue), p. D142-144.
[15] Haft, D.H., J.D. Selengut, and O. White, “The TIGRFAMs database of protein families,” Nucleic Acids Res, 2003, 31(1), p. 371-373.
[16] Holm, L. and C. Sander, “Protein structure comparison by alignment of distance matrices,” J Mol Biol, 1993, 233(1), p. 123-138.
[17] Shindyalov IN, Bourne PE, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, 1998, vol. 11(9), p739-747.
[18] Pearson, W.R, “Rapid and sensitive sequence comparison with FASTP and FASTA,” Methods Enzymol, 1990, vol.183, p. 63-98.
[19] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, “Basic Local Alignment Search Tool,” J. Mol. Biol., 1990, vol.215, p403-410.
[20] Gribskov, M., McLachlan, A.D., and Eisenberg, D, “Profile analysis: Detection of distantly related proteins,” Proc. Natl. Acad. Sci, 1987, vol.84, p4355-4358.