CABIOS
Vol. 13 no. 3 1997 Pages 231-234
'TransMem': a neural network implemented in Excel spreadsheets for predicting transmembrane domains of proteins

Patrick Aloy, Juan Cedano, Baldomero Oliva, Francesc X.Aviles and Enrique Querol¹

Abstract

Motivation: Genomic sequences from different organisms, even prokaryotic ones, contain many orphan ORFs, which makes methods for predicting protein structure and function necessary. Predicting the presence of hydrophobic transmembrane (HTM) stretches provides a valuable clue for this.

Results: We report TransMem, a program based on a neural network and running on personal computers (either Apple Macintosh or PC, using Excel worksheets), which predicts the amino acid residues located in transmembrane segments of integral membrane proteins. The per-residue prediction accuracy for the set of proteins tested is 93%, ranging from 99.9% for the best prediction to 71.7% for the worst. The segment-based accuracy is 93.6%; for 63.6% of the proteins in the set, all predicted and observed segment locations match.

Availability: TransMem is available upon request or by anonymous ftp from luz.uab.es, directory /pub/TransMem. It is also available on the EMBL file server (ftp://ftp.ebi.ac.uk/pub/software/mac/TransMem).

Contact: E-mail:
[email protected]

Institut de Biologia Fonamental and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain

¹To whom correspondence should be addressed

© Oxford University Press

The increasing number of DNA and protein sequences entering databases makes it necessary to use algorithms to predict protein structure and function (Rost and Sander, 1993; Bork et al., 1994; Clotet et al., 1994; Eisenhaber et al., 1995). Membrane proteins represent a particularly demanding challenge because their three-dimensional (3D) structures are very difficult to solve by X-ray crystallography, and because they are usually involved in important biological functions. Predicting the presence of anchored or transmembrane hydrophobic stretches in a reading frame yields valuable information and clues in the search for its function. The first successful empirical algorithm for transmembrane helix prediction was that of Kyte and Doolittle (1982). More accurate procedures, implemented on platforms ranging from mainframes to personal computers, have recently been reported (Argos and Rao, 1986; von Heijne, 1992; Persson and Argos, 1994, 1996), some of them using neural networks (Rost et al., 1995; Fariselli and Casadio, 1996). This paper describes a neural network-based program for personal computers, running in Excel worksheets, that predicts with 93% accuracy the presence of hydrophobic transmembrane (HTM) stretches in a protein sequence.

The minimal hardware and software requirements of the program are an Apple Macintosh or PC with 2 Mbytes of RAM, and Excel 4.0.

The program is user friendly. On opening the program (a file named 'TransMem'), the user finds a dialogue window with buttons labelled 'Enter Sequence', 'Clear Screen', 'Save Results', 'Run' and 'Quit', indicating the functions that can be performed. The protein sequence must be entered or imported (via 'Enter Sequence') in one-letter amino acid code into the cells of a standard Excel spreadsheet. For users not familiar with Excel, it should be mentioned that a spreadsheet cell holds up to 255 characters; if the sequence exceeds this number, additional cells have to be used. Nevertheless, to make the whole sequence easier to view in the worksheet, it is advisable to enter it as 50-character strings. The running time is 4.5 s per protein residue on a Power Macintosh, and less on a Pentium PC. The results are output as a list of amino acid stretches or as a profile.

The training set consisted of the following five protein sequences, containing 39 transmembrane helices (names in SWISS-PROT code): anion transport protein (b3at_human), dopamine receptor (dadr_human), glucose transporter type 1 (gtr1_human), haemagglutinin-neuraminidase (hema_ndvu) and rhodopsin (opsd_human). From a set of 69 protein sequences from the SWISS-PROT database previously used to design the program of Rost et al. (1995), 55 were chosen as a testing set. It should be noted that the exact location of transmembrane helices is often controversial, because very few membrane protein 3D structures have been solved and deposited in the data banks.

The neural network used in this method is based on the Perceptron model (Rosenblatt, 1958), in the form known as the MLP (Multilayer Perceptron). Each of the different amino acids has been
P.Aloy et al.
Table 1. Observed and predicted transmembrane segments in a set of 56 proteins (in SWISS-PROT code) Protein code
Observed HTM
Predicted HTM
Protein code
bacr_halha
23-42 57-76 95-114 121-140 148-167 191-210 217-236
23-41 59-76 98-115 120-143 149-169 192- . . -236
cek2_chick
lprc_H
12-35
12-28
lprc_M
52-76 111-137 143-166 198-223 260-284
48-77
lprc_L
1-21
1-16
4f2_human
82-104
82-101
5ht3_mouse
246-272 278-296 306-324 465-484
248-267 283-302 305-329 465-483
34-54 68-88 93-113
148-168 185-205 219-239 280-300 321-341 349-369 380-400 439-469 466-486 bach_halhm
1-15 27-50 74-92 103-126 130-154 163-186 198-206
365-389
2-20 368-389
cyoa_ecoli
51-69 93-111
46-67 91-108
cyob_ecoli
17-35 58-76 102-121 144-162 195-213 232-250 277-296 320-339 348-366 382-401 410-429 457-476 494-513 588-607 614-634
16-39 56-75 104-127 139-163 197-215 230-252 286-303 314-333 337-368 375- . . -440 458-477 494-516 591- . . -627
cyoc_ecoli
32-50 67-85 102-120 143-161 185-203
28-56 67-91 99-120 138-162 181-204
cyod_ecoli
18-36 46-64 81-99
21- . . -65 78-103
cyoe_ecoli
10-28 38-56 79-97 108-126
14-33 37-55 79-106 109-128 142-178 193- . . -247 264-285
-
117-142 175-194 233-250
adt_ricpr
Predicted HTM
145-166 203-220 271-288
33-53 84-111 116-139 171-198 226-249
2mlt
Observed HTM
25-52 -
-
67-85 93-112 139-165 186-205 217-235 273-299 325-342 350- . . -403 449- . . -484 _ 27-52
82-100 102-124 136-153 167. -204
_
198-216 229-247 269-287
3-18 646-664
fce2_human
22-47
24-45
glp_pig
63-85
65-84
-
92-114
1-25 93-112
58-81
59-80
-
1-23 530-557
glpa_human glpc_human
cb21_pea
-
62-81 114-134 182-198
115-133 183-199
glra_rat
614-632 806-826
614-637 803-825
malf_ecoli
gmcr_human
321-346
324-347
gp1b_human
148-172
145-169
gtp_crilo
7-32 58-79 95-114 126-145
11-32 61-80 95-114
(cont)
-
646-668
egfr_human
motb_ecoli
539-558 585-603 17-35 40-58 73-91 277-295 319-337 371-389 418-436 486-504
. -55 70-93 277-305 321-357 372-388 428-446 484-506
28-49
25-50
17- .
Neural network prediction of transmembrane domains
Table 1. Continued
Observed HTM
Predicted HTM
Protein code
Observed HTM
Predicted HTM
165-184 195-211 222-240 253-269 275-294 379-397
158- . . -210 223-240 250-272
mprd_human
186-210
183-208
myp0_human
154-179
156-178
ngfr_human
251-272
254-272
nep_human
28-50
27-50
35-55 _
39-58 160-176 193-209
oppb_salty
hema_measi
35-55
36-58
10-30 100-121 138-158 173-190 227-250 272-293
10-27 91-120 133-156 167-189 233-253 273-300
hg2a_human
16-72
49-67
oppc_salty
_
16-34 76-107 423-442
31-59 102-122 140-160 164-180 216-236 268-290
38-58 97-132 144- . . -177 217-243 270-289
ops3_drome
58-82 95-119 134-152 172-196 221-248 285-308 317-341
62-88 93-115 131-151 171-188 220-239 284-309 324-341
pigr_human
621-643
625-642
pt2m_ecoli
25-44 51-69
19-40
—
79-107 130- . . -196 272- . .-335 381-396
Protein code
hema_cdvo
iggb_strsp
423-443 il2a_human
—
241-259 il2b_human
_
-
1-16 240-258
241-265
11-29 247-265
ita5_mouse
356-381
356-382
lacy_ecoli
11-33 47-67 75-99 103-125 145-163 168-187 212-234 260-281 291-310 315-334 347-366 380-399
11-36 46-67 76- . . -126 147-165 168-188 —
135-164 166-184 274-291 314-333
270_
. -321 349-372 381-402
-
sece_ecoli
19-36 45-63 93-111
17-35 43-63 97-122
suis_human
13-32
11-34
292-313
284-313
-
25-42
1-12 178-195 204-219 25-43
19-40
20-44
lech_human
40-60
41-58
leci_mouse
40-60
60-79
lep_ecoli
4-22 58-76
1-22 62-80
_
1-18 514-532
trbm_human
517-536 516-539
509-535
vmt2_iaann
63-88
66-86 415-430 446-462 540-556
vnb_inbbe
magl_mouse (cont) trsr_human
_ -
—
tcbl_rabit
Accuracy = (21 615/23 252) × 100 = 93%.

encoded as a 21-component vector (Qian and Sejnowski, 1988). The network can therefore be regarded as a window of 21 positions moving along the protein sequence. This encoding defines the network architecture: 441 neurons in the input layer (21 × 21), 20 in the hidden layer and a single one in the output layer. In the training procedure, a modification of standard Backpropagation (Rumelhart et al., 1986) was used as a
learning algorithm, known as Backpropagation with Momentum. This algorithm allows the learning coefficient to increase when the input patterns are similar and to decrease when an unusual pattern appears. A hyperbolic tangent was used as the activation function. This is a C1 (continuously differentiable) non-decreasing function whose values lie in the interval [-1, +1]. This activation range was used because learning is faster than with the typical [0, 1] encoding (Fausett, 1994). The initial weights were assigned randomly, with values uniformly distributed over the range [-0.3, +0.3], so that the units start in the region where the hyperbolic tangent has its largest slope and learning is therefore fastest. The net was trained by assigning a binary output value (+1 if the residue is located in a transmembrane stretch, -1 otherwise) to the residue in the middle of the window. This strategy allows a specific residue to be related to its neighbours along the sequence.

The program was tested on 172 transmembrane segments belonging to 55 different proteins, involving ~25 000 amino acid residues. These proteins correspond to those used by Rost et al. (1995) for their neural network, after removing all sequences with significant sequence similarity (>25% over 80 or more residues) to the training set. The accuracy of the transmembrane prediction of the net was evaluated by comparing the predicted helices with the transmembrane helix assignments of the SWISS-PROT database, and was calculated as follows:

Accuracy = (Number of residues correctly predicted / Total number of residues) × 100
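As an illustration only, the window encoding, the 441-20-1 network shape, the weight initialization and the accuracy measure described above can be sketched in Python. This is not the TransMem implementation, which runs in Excel worksheets. The use of a padding symbol as the 21st vector component, the treatment of unknown residues as padding, the absence of bias terms, the zero threshold on the output and all function names are assumptions made for this sketch; training by Backpropagation with Momentum is omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # 20 standard residues, one-letter code
PAD = "-"                                  # ASSUMED 21st symbol: positions beyond the termini
ALPHABET = AMINO_ACIDS + PAD               # 21 symbols -> 21-component vector per position
WINDOW = 21                                # window positions along the sequence
HALF = WINDOW // 2
N_IN = WINDOW * len(ALPHABET)              # 21 x 21 = 441 input neurons
N_HID = 20                                 # hidden neurons


def encode_window(seq, centre):
    """Encode the 21-residue window centred on index `centre` as 441 values."""
    vec = []
    for pos in range(centre - HALF, centre + HALF + 1):
        symbol = seq[pos] if 0 <= pos < len(seq) else PAD
        if symbol not in ALPHABET:         # ASSUMPTION: unknown residues count as padding
            symbol = PAD
        vec.extend(1.0 if symbol == a else 0.0 for a in ALPHABET)
    return vec


def init_weights(n_rows, n_cols, rng):
    """Weights drawn uniformly from [-0.3, +0.3], as stated in the text."""
    return [[rng.uniform(-0.3, 0.3) for _ in range(n_cols)] for _ in range(n_rows)]


def forward(x, w_hidden, w_out):
    """One forward pass with tanh activations (no bias terms: an assumption)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return math.tanh(sum(w * h for w, h in zip(w_out, hidden)))


def predict(seq, w_hidden, w_out):
    """Label every residue +1 (transmembrane) or -1, thresholding the output at 0."""
    return [1 if forward(encode_window(seq, i), w_hidden, w_out) > 0 else -1
            for i in range(len(seq))]


def residue_accuracy(predicted, observed):
    """Accuracy = (residues correctly predicted / total residues) x 100."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    rng = random.Random(0)
    w_hidden = init_weights(N_HID, N_IN, rng)    # 20 x 441
    w_out = init_weights(1, N_HID, rng)[0]       # 1 x 20
    labels = predict("MKTLLVAGALLLAVSG", w_hidden, w_out)  # arbitrary example string
    print(len(labels), "residues labelled by an untrained net")
```

With untrained random weights the labels are meaningless; the sketch only shows how a sequence of N residues yields N windows of 441 inputs each and one +1/-1 decision per central residue, which is the quantity counted in the accuracy formula.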
According to this definition, the percentage accuracy obtained for the set of proteins tested is ~93%. The best prediction (nep_human) was 99.9% and the worst (bach_halhm) was 71.7%. On a segment basis, 93.6% of the segments (161 out of 172) were correctly predicted and 6.4% (11 out of 172) were predicted incorrectly or not found. In addition, the program finds 18 false HTMs. Finally, for 63.6% of the proteins (35 out of 55), all predicted and observed segment locations match.

Several net topologies were tested, decreasing or increasing the window size and hence the number of neurons in the input layer; the results were worse when the window size was decreased. This would indicate that the information contained in the 20 residues around the central one is necessary. On the other hand, results did not improve significantly when the window size was increased, while the net took longer to learn. When the number of hidden layers was modified, the results were similar.

The advantage of such a simple program is its user-friendliness on a personal computer. The price to be paid for simplicity and mobility is a loss of prediction accuracy.

Acknowledgements

This research was supported by grants BIO94-0912-CO2-01 and IN94-0347 from the CICYT (Ministerio de Educación y Ciencia, Spain), and by the Centre de Referència de Biotecnologia de la Generalitat de Catalunya. Support from the F.Roviralta Foundation is also acknowledged. J.C. is a PFPI fellowship recipient of the Ministerio de Educación y Ciencia (Spain).
References

Argos,P. and Rao,J.K. (1986) A conformational preference parameter to predict helices in integral membrane proteins. Biochim. Biophys. Acta, 869, 197-214.
Bork,P., Ouzounis,C. and Sander,C. (1994) From genome sequences to protein function. Curr. Opin. Struct. Biol., 4, 393-403.
Clotet,J., Cedano,J. and Querol,E. (1994) A spreadsheet computer program combining algorithms for prediction of protein structural characteristics. Comput. Applic. Biosci., 10, 495-500.
Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1-94.
Fariselli,P. and Casadio,R. (1996) HTP: a neural network-based method for predicting the topology of helical transmembrane domains in proteins. Comput. Applic. Biosci., 12, 41-48.
Fausett,L. (1994) Fundamentals of Neural Networks: Architectures, Algorithms and Applications. Prentice-Hall, New Jersey.
Kyte,J. and Doolittle,R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-132.
Persson,B. and Argos,P. (1994) Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J. Mol. Biol., 237, 182-192.
Persson,B. and Argos,P. (1996) Topology prediction of membrane proteins. Protein Sci., 5, 363-371.
Qian,N. and Sejnowski,T.J. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865-884.
Rosenblatt,F. (1958) The Perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65, 386-408.
Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.
Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1995) Transmembrane helices predicted at 95% accuracy. Protein Sci., 4, 521-533.
Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536.
von Heijne,G. (1992) Membrane protein structure prediction. J. Mol. Biol., 225, 487-494.

Received on August 19, 1996; revised on December 23, 1996; accepted on January 2, 1997