A profile-based protein sequence alignment algorithm for a domain clustering database

Lin Xu1,2, Fa Zhang1 and Zhiyong Liu3
1, Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
2, Graduate School of Chinese Academy of Sciences
3, National Natural Science Foundation of China

Abstract- To address two main shortcomings of homology modeling, we have designed and established a domain clustering database. Searching the database is a fundamental operation on it. However, current alignment algorithms are mainly based on sequences and ignore the structure conservation within a domain. This paper proposes a profile-based alignment that incorporates structure information into the profile, based on the characteristics of our domain database. We designed an experiment within the database. The results show that both the quality and the sensitivity of our scheme are better than those of the pure Smith-Waterman and sequence-based profile algorithms. We believe that this work can help to improve protein structure prediction.

I. INTRODUCTION
Sequence alignment is a fundamental tool in computational biology and bioinformatics. With this tool we can obtain a lot of useful information, such as which genes have the same function, which RNAs belong to the same class and which proteins have the same structure topology. Moreover, in the area of protein structure prediction, obtaining the alignment between a protein sequence of unknown structure (the query) and its homologs of known structure (the templates) is the most fundamental step in the modeling process, and the quality of the alignment greatly affects the prediction result.

Generally speaking, there are three categories of methods to create an alignment: single sequence based, multiple sequence alignment and profile based. Single … Smith-Waterman algorithm [2]. Since this method only utilizes the sequence information, the quality of the alignment drops greatly when the sequence identity is less than 30%.

Multiple sequence alignment methods align three or more sequences. Since the simultaneous alignment of several sequences is an NP-hard computational problem, most of these methods use a heuristic algorithm, such as ClustalW [3], DIALIGN [4] and T-COFFEE [5]. However, alignment quality and computational cost are two critical problems for this kind of method.

Profile-based methods have advanced greatly with the development of the PSI-BLAST program by Altschul et al. [6]. These methods improve the alignment quality by using a profile to describe the characteristics of similar sequences and aligning a sequence or a profile with another profile. Because the profile accurately records the most relevant information from the multiple sequence alignment, the quality of this method is better than that of the others. Several groups have published profile-to-profile alignment methods, such as PSI-BLAST [6] and HMMER [7]. Most profile-based methods use the standard Smith-Waterman local alignment method, but they vary significantly in a number of important respects, such as scoring functions, gap penalties, weighting schemes and whether a secondary-structure substitution matrix is added. Although all these methods use different information and different methodologies, an accurate alignment still remains a major challenge, especially when the sequence similarity falls into the twilight zone (3% …
Fig 2. The distribution of different types in one coordinate cluster.

B. Building Profile from Sequence and Structure Information
Based on the sequence and structure information, we build a profile for each domain cluster. The profile is defined as a position-specific scoring matrix M(p,a) composed of 21 columns and m rows (m = length of the alignment). The first 20 columns of each row specify the scores of the 20 amino acid residues. An additional column contains a penalty for insertions or deletions at that position.

At position p of alignment A (N structures), AA(a) is defined as the class of amino acid type a, SS(i) is the class of C-alpha coordinate cluster i (described in the previous section) and W(p,a) is the weight for the appearance of amino acid a at position p. For the sequence information, the weight of each amino acid type is determined as follows: suppose that there are n(a) items in residue class AA(a); then the average weight for class AA(a) is W1(p,a) = n(a)/N. For the structure information, the weight for each class is determined as follows: suppose that there are n(s_i) items in class SS(i); then the weight is W2(p,s_i) = n(s_i)/N. W(p,a) is then calculated from W1 and W2:

$$ W(p,a) = \Big[ W_1(p,a) \cdot \sum_{\text{all } SS(i)} n(a,i) \, W_2(p,s_i) \Big] \cdot \sigma $$

Here, σ is a normalization factor which ensures that

$$ \sum_{a \in \{\text{amino acid types}\}} W(p,a) = 1 . $$

The position-specific penalties for insertions and deletions in the profile can be set to a high value to prevent insertions at positions where no gaps occur, and to a low value to allow insertions in regions where insertions are observed in the alignment. The penalty applied for creating a gap during the match of profile to query, gap(L), is given by gap(L) = gap'[gap_open + gap_ext*L], in which gap' is the penalty given in the last column of the profile, L is the number of residue positions in the gap, and gap_open and gap_ext are the penalties for gap opening and gap extension, respectively.

C. Profile-based Alignment
Since our profile accurately records both sequence and structure properties, we can use the Smith-Waterman local alignment algorithm with the profile of each domain cluster to find which domain the query sequence most likely belongs to. The major difference between our profile-based alignment and both the plain dynamic programming algorithm and other profile-based alignment algorithms lies in the scoring scheme. Our profile-based alignment uses not only the sequence information derived from the domain cluster but also the structure information extracted from superimposed structure ensembles. In contrast, in the raw dynamic programming algorithm the score is based on the comparison of amino acids at corresponding positions in two sequences, and other profile-based alignment algorithms mostly use the sequence information derived from family sequences.
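As a rough illustration of Sections B and C, the following sketch computes the column weights W(p,a), the gap penalty gap(L), and a Smith-Waterman local score of a query against a profile. It is a minimal sketch under stated assumptions, not the authors' implementation: n(a,i) is read as the number of structures at position p that have amino acid a and fall in coordinate cluster i; the dictionary-based profile layout, the mismatch score of -4, the defaults gap_open = 12 and gap_ext = 2, and the choice to apply the gap' value of the current profile position to a gap are all assumptions; and every function name is hypothetical.

```python
from collections import Counter


def column_weights(residues, ss_classes):
    """Normalized weights W(p, a) for one alignment position p.

    residues[k] is the amino acid of structure k at this position and
    ss_classes[k] is its C-alpha coordinate cluster label (hypothetical inputs).
    """
    n_total = len(residues)                      # N structures
    n_a = Counter(residues)                      # n(a)
    n_s = Counter(ss_classes)                    # n(s_i)
    n_ai = Counter(zip(residues, ss_classes))    # n(a, i): assumed joint count
    raw = {}
    for a in n_a:
        w1 = n_a[a] / n_total                    # W1(p, a) = n(a) / N
        # Sum over all SS(i) of n(a, i) * W2(p, s_i), with W2 = n(s_i) / N.
        struct_term = sum(n_ai[(a, i)] * n_s[i] / n_total for i in n_s)
        raw[a] = w1 * struct_term
    sigma = 1.0 / sum(raw.values())              # makes the weights sum to 1
    return {a: w * sigma for a, w in raw.items()}


def gap_penalty(gap_prime, length, gap_open=12.0, gap_ext=2.0):
    """gap(L) = gap' * (gap_open + gap_ext * L)."""
    return gap_prime * (gap_open + gap_ext * length)


def profile_local_score(scores, gap_primes, query, mismatch=-4.0):
    """Best Smith-Waterman local score of `query` against a profile.

    scores[p] maps amino acid -> score at profile position p; residues that
    are not listed get `mismatch` (assumption). gap_primes[p] is the gap'
    column. Every gap length L is tried explicitly, so this is cubic time.
    """
    m, n = len(scores), len(query)
    H = [[0.0] * (n + 1) for _ in range(m + 1)]
    best = 0.0
    for p in range(1, m + 1):
        gp = gap_primes[p - 1]                   # assumed: gap' of current position
        for q in range(1, n + 1):
            cell = H[p - 1][q - 1] + scores[p - 1].get(query[q - 1], mismatch)
            for length in range(1, p + 1):       # skip `length` profile positions
                cell = max(cell, H[p - length][q] - gap_penalty(gp, length))
            for length in range(1, q + 1):       # skip `length` query residues
                cell = max(cell, H[p][q - length] - gap_penalty(gp, length))
            H[p][q] = max(0.0, cell)             # local alignment: never below 0
            best = max(best, H[p][q])
    return best


# Toy usage with made-up data: four structures at one position, and a
# three-position profile scored against a short query.
weights = column_weights(["A", "A", "G", "A"], [0, 0, 0, 1])
toy_profile = [{"A": 5.0}, {"C": 5.0}, {"D": 5.0}]
toy_score = profile_local_score(toy_profile, [0.1, 0.1, 0.1], "GADG")
```

In this toy setting, a low gap' lets the alignment bridge a skipped profile position, while a high gap' makes the same gap too expensive, which is the behavior the position-specific penalty column is meant to provide. How the scores M(p,a) are derived from the weights W(p,a) is not specified here, so the toy profile scores are placeholders.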
IV. RESULTS AND DISCUSSIONS
To evaluate the performance of the alignment scheme described in this paper, we tested it within the whole database. In each domain cluster there is a reference sequence whose structural distance to the other members of the cluster is the smallest. We also selected the sequence whose structural distance to the reference is the largest as the benchmark. The benchmark sequence was then searched against the whole database with our profile-based alignment algorithm.

With the statistics obtained from the database, we classified the domain clusters into four types: sequence and structure conserved; structure conserved; sequence conserved; and mixed. Conservation is defined as the number of amino acid types, or of 3D coordinate clusters, being less than half of the total number at each position in the alignment. The sequence and structure conserved type consists of domain clusters whose amino acid types and 3D coordinates are both conserved; the structure conserved type consists of domain clusters in which only the 3D coordinates are conserved; in the sequence conserved type only the amino acid types satisfy the conservation condition; the mixed type consists of both amino-acid-conserved parts and 3D-coordinate-conserved parts. The number of domain clusters in each type is shown in Table II.

TABLE II
THE NUMBER OF DOMAIN CLUSTERS IN EACH TYPE
Class type                             Number of domain clusters
sequential and structural conserved    1051
structural conserved                   28
sequential conserved                   784
mixed type                             974

We picked some domain clusters from each type to evaluate the scoring scheme of our alignment algorithm. In each domain cluster, we selected a query and then aligned it to the whole database.

A. Sequence and Structure Conserved Domain
The domain clusters of this type are the most conserved ones. Within this type, our scoring scheme can reflect the conserved features. To evaluate the alignment significance, we compared our profile-based alignment with the pure Smith-Waterman sequence alignment and sequence-based profile alignment algorithms. The score matrix in the Smith-Waterman algorithm is BLOSUM62. The gap open and gap extension penalties are 12 and 2, respectively. The sequence-based profile alignment is an ordinary profile-based alignment: it builds the profile by counting the number of occurrences of each amino acid type and using that count to weight the type.

Using these three alignment methods, we compared the queries (1,051 datasets in total) with the entire database; Table III shows the number of hits and false results for each method. From the table we can see that the profile-based methods improve the alignment significance. Using the consensus sequences aligned by the Smith-Waterman algorithm, we obtained only 788 (~75%) hits in this cluster type, which is a high false rate. The sequence-based profile brings the sequence information into the profile and scoring scheme and improves the hit rate to 88%, but 122 false hits remain. Our profile contains not only the sequence information but also the structure information, so it improves the hit rate to 91%.

TABLE III
THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits         false
Smith-Waterman                      788 (75%)    263 (25%)
Sequence-based profile alignment    929 (88%)    122 (12%)
Combined profile alignment          952 (91%)    99 (9%)

Fig 2. (A), a segment of the multiple structure alignment in cluster IPR000291. (B), the relevant structure superposition. (C), the alignment between query and cluster IPR005905 by Smith-Waterman algorithm.

Fig 3. Distribution of alignment scores for comparing a query from IPR000291 with the whole database. (Axes: Score; Number of Entries.)

Fig. 2 and Fig. 3 give another example showing that our method is more sensitive than the other two methods. Here, we chose a query, labeled 2dln_248_276, from the domain cluster IPR000291. Fig. 2A and 2B show a segment of the multiple structure alignment and the corresponding structure superposition in the cluster. These domains are very similar at the structure level but differ somewhat at the sequence level. We also note that a domain in cluster IPR005905 has a segment that shares sequence identity with the query, as shown in Fig. 2C. As a result, both Smith-Waterman and sequence-based profile alignment assigned the query to cluster IPR005905. However, our profile-based method can distinguish the query from other clusters. Fig. 3 shows the alignment scores for comparing the query with the whole database. In this figure, the highest score is 160, which is the alignment score between the query and the profile of cluster IPR000291, the right domain cluster. So using our profile, we can significantly improve the alignment sensitivity.

B. Structure Conserved Domain
The domains of this type are only structurally conserved. The structure topology in one domain cluster takes on the same shape, but the amino acid type at some positions is variable. In biology, the amino acid type can mutate while structure and function stay the same. This phenomenon is difficult to handle with sequence alignment schemes, such as local, global or sequence-based profile alignment.

In this cluster type, we tested the three kinds of alignment methods; the hits and false results are shown in Table IV. Although there are only 28 domain clusters of this type, the results still show that our profile-based method can improve the hit rate. Because the profile reflects the characteristics of a family, the sequence-based profile method improves the hit rate a little. Furthermore, the combined profile method improves the results with the structure information.

TABLE IV
THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits        false
Smith-Waterman                      20 (71%)    8 (29%)
Sequence-based profile alignment    23 (82%)    5 (18%)
Combined profile alignment          27 (96%)    1 (4%)

We chose a query, labeled 1blbA1_175, from the domain cluster IPR001064. Fig. 4A shows the structure superposition in the domain cluster. Since the amino acid type at some positions is variable, both the Smith-Waterman and the sequence-based profile alignment methods gave the highest score to the consensus of cluster IPR011024. Although some segments in the alignment were matched well, as shown in Fig. 4B, the result was wrong. However, our profile-based alignment method gives the highest score to the consensus of the right domain cluster, as shown in Fig. 5. The highest score is 119, which is the alignment score between the query and the profile of cluster IPR001064.

Fig. 4. (A), the structure superposition in cluster IPR001064. (B), the alignment between query and consensus of cluster IPR011024.

Fig. 5. The distribution of alignment scores for comparing the query from cluster IPR001064 with the whole database. (Axes: Score; Number of Entries.)

C. Sequential Conserved Domain
There are 784 domain clusters of this type in our database. Table V compares the three kinds of alignment methods.

TABLE V
THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits         false
Smith-Waterman                      665 (85%)    119 (15%)
Sequence-based profile alignment    735 (94%)    49 (6%)
Combined profile alignment          745 (95%)    39 (5%)

Here we selected a query from domain cluster IPR001356. Fig. 6A shows that there are some variable regions in these domains, and Fig. 6B shows that the sequences are more conserved. Fig. 6C shows that the highest alignment score is 57, between the query and the profile from cluster IPR001356, whereas the other two methods implied that the query belongs to cluster IPR007107. Therefore, our scheme again proved to improve the alignment results and sensitivity.

Fig 6. (A), the structure superposition and (B), the multiple structure alignment in domain cluster IPR001356. (C), the distribution of alignment scores for comparing the query from cluster IPR001356 with the whole database. (Axes: Score; Number of Entries.)

D. Mixed Type
Table VI shows the comparison of the three methods using the mixed type datasets. Because there are some variable regions at the sequence level in this kind of domain cluster, the Smith-Waterman algorithm behaves much worse than the others. Our profile-based method can improve the accuracy by combining the sequence and structure information, although there is only a one percent improvement in hit rate over sequence-based profile alignment.

TABLE VI
THE NUMBER OF HITS AND FALSE FOR EACH METHOD
Method                              hits    false
Smith-Waterman                      428     546
Sequence-based profile alignment    936     38
Combined profile alignment          941     33

We randomly selected a query from cluster IPR004227 in this type. Fig. 7 shows the distribution of alignment scores for comparing the query with the whole database. The highest score implies that the query belongs to cluster IPR004227. This figure again shows that our method is more sensitive than the other two methods.

Fig 7. The distribution of alignment scores for comparing the query from cluster IPR004227 with the whole database. (Axes: Score; Number of Entries.)

V. CONCLUSION AND FUTURE WORKS
In this paper we proposed a profile-based alignment algorithm and applied it to our domain-based template database. The statistical analysis shows that most of the domain clusters in our database are conserved at both the structure and the sequence level, so each element in our profile combines the structural clustering information and the sequence information. With this profile, we developed a profile-based query-template alignment method. To validate whether our method is more accurate and sensitive than other query-template alignment methods, we divided our database into four types, based on sequence and structure conservation. In each type, we carried out experiments. The results from each type show that our profile can accurately describe the features of the domain cluster, and that our profile-based method can align the query to the right template with few errors. They also show that our method is more sensitive than other query-template alignment methods.
As described above, our final goal is protein structure prediction. How to use our domain-based template database and our profile-based query-template alignment method to improve the prediction of protein structure will be investigated in our future work.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China project under 60503060 and key project under 90612019.

REFERENCES
[1] Needleman S, Wunsch C, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J Mol Biol, 1970, vol.48, p443-453.
[2] Smith T, Waterman M, “Identification of common molecular subsequences,” J Mol Biol, 1981, vol.147, p195-197.
[3] J. Thompson, D. Higgins, and T. Gibson, “CLUSTALW: improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting Position Specific Gap Penalties and Weight Matrix Choice,” Nucleic Acids Res, 1994, vol. 22, p.4673-4680.
[4] Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou and Burkhard Morgenstern, “Fast and sensitive multiple alignment of large genomic sequences,” Bioinformatics 2003, vol.4, p 66-78.
[5] C. Notredame, D. Higgins, J. Heringa, “T-Coffee: A novel method for multiple sequence alignments,” J Mol Biol, 2000, vol.302, p205-217.
[6] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z. and et al, “Gapped BLAST and PSI-BLAST: A new generation of database programs,” Nucleic Acids Res, 1997, vol.25, p3389-3402.
[7] SR Eddy, “Profile hidden Markov models,” Bioinformatics, 1998, Vol 14, p755-763.
[8] Fa Zhang, Jingchun Chen, Zhiyong Liu and Bo Yuan, “The construction of Structural Templates for the Modeling of Conserved Protein Domains,” International Conference on Bioinformatics and its Applications (ICBA’04), Fort Lauderdale, Florida, USA.
[9] Mulder, N.J., et al., InterPro, progress and status in 2005. Nucleic Acids Res, 2005. 33(Database Issue): p. D201-205.
[10] Berman, H.M., et al., The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 2002. 58(Pt 6 No 1): p. 899-907.
[11] Zdobnov, E.M. and R. Apweiler, “InterProScan--an integration platform for the signature-recognition methods in InterPro,” Bioinformatics, 2001. vol.17(9), p. 847-848.
[12] Bateman, A., et al., The Pfam protein families database. Nucleic Acids Res, 2004. 32(Database issue): p. D138-141.
[13] Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995. 247(4): p. 536-540.
[14] Letunic, I., et al., SMART 4.0: towards genomic data integration. Nucleic Acids Res, 2004. 32(Database issue): p. D142-144.
[15] Haft, D.H., J.D. Selengut, and O. White, “The TIGRFAMs database of protein families,” Nucleic Acids Res, 2003, 31(1), p.371-373.
[16] Holm, L. and C. Sander, “Protein structure comparison by alignment of distance matrices,” J Mol Biol, 1993, 233(1), p. 123-138.
[17] Shindyalov IN, Bourne PE, “Protein structure alignment by incremental combinatorial extension (CE) of the optimal path,” Protein Engineering, 1998, vol. 11(9), p739-747.
[18] Pearson, W.R, “Rapid and sensitive sequence comparison with FASTP and FASTA,” Methods Enzymol, 1990, vol.183, p. 63-98.
[19] S. F. Altschul, W. Gish, W. Miller, E. W. Myers and D. J. Lipman, “Basic Local Alignment Search Tool,” J. Mol. Biol., 1990, 215, p403-410.
[20] Gribskov, M., McLachlan, A.D., and Eisenberg, D, “Profile analysis: Detection of distantly related proteins,” Proc. Natl. Acad. Sci, 1987, vol.84, p4355-4358.