Excel spreadsheets for predicting transmembrane domains of proteins

CABIOS

Vol. 13 no. 3 1997 Pages 231-234

'TransMem': a neural network implemented in Excel spreadsheets for predicting transmembrane domains of proteins

Patrick Aloy, Juan Cedano, Baldomero Oliva, Francesc X.Aviles and Enrique Querol¹

Abstract

Motivation: Genomic sequences from different organisms, even prokaryotic ones, contain many orphan ORFs, which makes methods for predicting protein structure and function necessary. Predicting the presence of hydrophobic transmembrane (HTM) stretches provides a valuable clue for this.

Results: The program TransMem, based on a neural network and running on personal computers (either Apple Macintosh or PC, using Excel worksheets), for the prediction and distribution of amino acid residues in transmembrane segments of integral membrane proteins is reported. The per-residue prediction accuracy obtained for the set of proteins tested is 93%, ranging from 99.9% for the best prediction to 71.7% for the worst. The segment-based accuracy is 93.6%; 63.6% of the protein set match any of the predicted and observed segment locations.

Availability: TransMem is available upon request or by anonymous ftp: IP address: luz.uab.es, directory /pub/TransMem. It is also placed on the EMBL file server (ftp://ftp.ebi.ac.uk/pub/software/mac/TransMem).

Contact: E-mail: [email protected]

Institut de Biologia Fonamental and Departament de Bioquímica i Biologia Molecular, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain

¹To whom correspondence should be addressed

© Oxford University Press

The increasing number of DNA and protein sequences entering databases makes it necessary to use algorithms to predict protein structure and function (Rost and Sander, 1993; Bork et al., 1994; Clotet et al., 1994; Eisenhaber et al., 1995). Among proteins, membrane proteins represent a more demanding challenge because their three-dimensional (3D) structures are very difficult to solve by X-ray crystallography, and they are usually involved in important biological functions. Predicting the presence of anchored or transmembrane hydrophobic stretches in a reading frame yields valuable information and clues in the search for its function. The first successful empirical algorithm for transmembrane helix prediction was that of Kyte and Doolittle (1982). More accurate procedures, implemented on machines ranging from mainframes to personal computers, have recently been reported (Argos and Rao, 1986; von Heijne, 1992; Persson and Argos, 1994, 1996), some of them using neural networks (Rost et al., 1995; Fariselli and Casadio, 1996).

In this paper, a neural network-based program for personal computers is described. It runs in Excel worksheets and predicts with 93% accuracy the presence of hydrophobic transmembrane (HTM) stretches from a protein sequence. The minimal hardware and software requirements are an Apple Macintosh or PC with 2 Mbytes of RAM and the Excel 4.0 spreadsheet.

The program is user friendly. On opening the program (a file named 'TransMem'), the user finds a dialogue window with buttons labelled 'Enter Sequence', 'Clear Screen', 'Save Results', 'Run' and 'Quit', each indicating a function that can be performed. The protein sequence must be entered or imported (via 'Enter Sequence') in one-letter amino acid code into the cells of a typical Excel spreadsheet. For users not familiar with Excel, it should be mentioned that a spreadsheet cell holds up to 255 characters; thus, if the sequence exceeds this number, additional cells have to be used. Nevertheless, to visualize the whole sequence better in the cell charts, it is advisable to enter the sequence as 50-character strings. The running time is 4.5 s per protein residue on a Power Macintosh, and less on a Pentium PC. The results are output as a list of amino acid stretches or as a profile.

As a training set, the following five protein sequences, containing 39 transmembrane helices, were used (in SWISS-PROT code): anion transport protein (b3at_human); dopamine receptor (dadr_human); glucose transporter type I (gtr1_human); haemagglutinin-neuraminidase (hema_ndvu) and rhodopsin (opsd_human). From a set of 69 protein sequences from the SWISS-PROT database previously used to design the program of Rost et al. (1995), 55 were chosen as a testing set. It should be noted that the exact location of transmembrane helices is often controversial, owing to the very small number of membrane protein 3D structures solved and deposited in the data banks.

The neural network used in this method is based on the Perceptron model (Rosenblatt, 1958), known as MLP (Multilayer Perceptron). Each of the different amino acids has been


P.Aloy et al.

Table 1. Observed and predicted transmembrane segments in a set of 56 proteins (in SWISS-PROT code)

Protein code

Observed HTM

Predicted HTM

Protein code

bacr_halha

23-42 57-76 95-114 121-140 148-167 191-210 217-236

23-41 59-76 98-115 120-143 149-169 192- . . -236

cek2_chick

1prc_H

12-35

12-28

1prc_M

52-76 111-137 143-166 198-223 260-284

48-77

1prc_L

1-21

1-16

4f2_human

82-104

82-101

5ht3_mouse

246-272 278-296 306-324 465-484

248-267 283-302 305-329 465-483

34-54 68-88 93-113

148-168 185-205 219-239 280-300 321-341 349-369 380-400 439-469 466-486 bach_halhm

1-15 27-50 74-92 103-126 130-154 163-186 198-206

365-389

2-20 368-389

cyoa_ecoli

51-69 93-111

46-67 91-108

cyob_ecoli

17-35 58-76 102-121 144-162 195-213 232-250 277-296 320-339 348-366 382-401 410-429 457-476 494-513 588-607 614-634

16-39 56-75 104-127 139-163 197-215 230-252 286-303 314-333 337-368 375- . . -440 458-477 494-516 591- . . -627

cyoc_ecoli

32-50 67-85 102-120 143-161 185-203

28-56 67-91 99-120 138-162 181-204

cyod_ecoli

18-36 46-64 81-99

21- . . -65 78-103

cyoe_ecoli

10-28 38-56 79-97 108-126

14-33 37-55 79-106 109-128 142-178 193- . . -247 264-285

-

117-142 175-194 233-250

adt_ricpr

Predicted HTM

145-166 203-220 271-288

33-53 84-111 116-139 171-198 226-249

2mlt

Observed HTM

25-52 -

-

67-85 93-112 139-165 186-205 217-235 273-299 325-342 350- . . -403 449- . . -484 _ 27-52

82-100 102-124 136-153 167- . . -204

_

198-216 229-247 269-287

3-18 646-664

fce2_human

22-47

24-45

glp_pig

63-85

65-84

-

92-114

1-25 93-112

58-81

59-80

-

1-23 530-557

glpa_human glpc_human

cb21_pea

-

62-81 114-134 182-198

115-133 183-199

glra_rat

614-632 806-826

614-637 803-825

malf_ecoli

gmcr_human

321-346

324-347

gp1b_human

148-172

145-169

gtp_crilo

7-32 58-79 95-114 126-145

11-32 61-80 95-114

(cont)


-

646-668

egfr_human

motb_ecoli

539-558 585-603 17-35 40-58 73-91 277-295 319-337 371-389 418-436 486-504

. -55 70-93 277-305 321-357 372-388 428-446 484-506

28-49

25-50

17- .

Neural network prediction of transmembrane domains

Table 1. Continued

Observed HTM

Predicted HTM

Protein code

Observed HTM

Predicted HTM

165-184 195-211 222-240 253-269 275-294 379-397

158- . . -210 223-240 250-272

mprd_human

186-210

183-208

myp0_human

154-179

156-178

ngfr_human

251-272

254-272

nep_human

28-50

27-50

35-55 _

39-58 160-176 193-209

oppb_salty

hema_measi

35-55

36-58

10-30 100-121 138-158 173-190 227-250 272-293

10-27 91-120 133-156 167-189 233-253 273-300

hg2a_human

16-72

49-67

oppc_salty

_

16-34 76-107 423-442

31-59 102-122 140-160 164-180 216-236 268-290

38-58 97-132 144- . . -177 217-243 270-289

ops3_drome

58-82 95-119 134-152 172-196 221-248 285-308 317-341

62-88 93-115 131-151 171-188 220-239 284-309 324-341

pigr_human

621-643

625-642

pt2m_ecoli

25-44 51-69

19-40



79-107 130- . . -196 272- . .-335 381-396

Protein code

hema_cdvo

iggb_strsp

423-443 il2a_human



241-259 il2b_human

_

-

1-16 240-258

241-265

11-29 247-265

ita5_mouse

356-381

356-382

lacy_ecoli

11-33 47-67 75-99 103-125 145-163 168-187 212-234 260-281 291-310 315-334 347-366 380-399

11-36 46-67 76- . . -126 147-165 168-188 —

135-164 166-184 274-291 314-333

270_

. -321 349-372 381-402

-

sece_ecoli

19-36 45-63 93-111

17-35 43-63 97-122

suis_human

13-32

11-34

292-313

284-313

-

25-42

1-12 178-195 204-219 25-43

19-40

20-44

lech_human

40-60

41-58

leci_mouse

40-60

60-79

lep_ecoli

4-22 58-76

1-22 62-80

_

1-18 514-532

trbm_human

517-536 516-539

509-535

vmt2_iaann

63-88

66-86 415-430 446-462 540-556

vnb_inbbe

magl_mouse (cont) trsr_human

_ -



tcbl_rabit

Accuracy = (21 615/23 252) × 100 = 93%.

encoded as a 21-component vector (Qian and Sejnowski, 1988). Therefore, the network can be considered as a window with 21 positions moving along the protein sequence. This kind of encoding defines the network architecture: 441 neurons in the input layer (21 × 21), 20 in the hidden layer and only one in the output layer.

In the training procedure, a modification of standard Backpropagation (Rumelhart et al., 1986), known as Backpropagation of Momentum, was used as the learning algorithm. This algorithm allows the learning coefficient to increase when the input patterns are similar and to decrease when an unusual pattern appears. A hyperbolic tangent was used as the activation function. This is a C¹ non-decreasing function whose values lie in the interval [-1, +1]. This activation range was used because the learning process is faster than with a typical [0, 1] coding (Fausett, 1994). The initial weights were randomly assigned, with values uniformly distributed over the range [-0.3, +0.3], so that the network starts where the slope of the hyperbolic tangent function is largest and the learning rate is therefore maximal. The net was trained by placing a binary output value (+1 if the residue is located in a transmembrane stretch, -1 otherwise) in the middle of the window. This strategy allows a specific residue to be related to its neighbours along the sequence.

The program was tested on 172 transmembrane segments belonging to 55 different proteins, involving ~25 000 amino acid residues. These proteins correspond to those used by Rost et al. (1995) in their neural network, after removing all sequences presenting significant sequence similarity (>25% over 80 or more residues) to the training set. The accuracy of the transmembrane prediction of the net was evaluated by comparing the predicted helices with the transmembrane helix assignments of the SWISS-PROT database, and calculated as follows:

Accuracy = (Number of residues correctly predicted / Total number of residues) × 100
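The encoding, network shape and accuracy measure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the TransMem spreadsheet itself: the names `encode_window`, `init_layer`, `forward`, `predict` and `residue_accuracy` are hypothetical, the 21st symbol is assumed to be a padding marker for window positions beyond the sequence ends, the inputs are assumed to use a bipolar (+1/-1) coding, and one bias weight per neuron is assumed. Training (the momentum variant of backpropagation) is deliberately left out, so the weights below are just the random initial values.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SYMBOLS = AMINO_ACIDS + "-"          # assumption: 21st symbol pads the window ends
WINDOW = 21                          # window positions (from the text)
HIDDEN = 20                          # hidden neurons (from the text)
N_INPUTS = WINDOW * len(SYMBOLS)     # 21 x 21 = 441 input neurons


def encode_window(seq, centre):
    """Encode the 21-residue window centred on `centre` as 441 values.

    Assumption: bipolar coding (+1 for the symbol present, -1 elsewhere).
    """
    half = WINDOW // 2
    values = []
    for pos in range(centre - half, centre + half + 1):
        symbol = seq[pos] if 0 <= pos < len(seq) else "-"
        if symbol not in SYMBOLS:
            symbol = "-"             # unknown letters are treated like padding
        values.extend(1.0 if s == symbol else -1.0 for s in SYMBOLS)
    return values


def init_layer(n_out, n_in, rng):
    """Weights (plus one assumed bias per neuron) uniform in [-0.3, +0.3]."""
    return [[rng.uniform(-0.3, 0.3) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward(layer, inputs):
    """tanh activation of each neuron; the last weight in a row is the bias."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + row[-1])
            for row in layer]


def predict(seq, hidden, output):
    """Label every residue +1 (transmembrane) or -1 (not transmembrane)."""
    labels = []
    for i in range(len(seq)):
        score = forward(output, forward(hidden, encode_window(seq, i)))[0]
        labels.append(1 if score > 0 else -1)
    return labels


def residue_accuracy(predicted, observed):
    """Residues correctly predicted / total number of residues x 100."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    rng = random.Random(0)
    hidden = init_layer(HIDDEN, N_INPUTS, rng)
    output = init_layer(1, HIDDEN, rng)
    seq = "MKTLLVAGAVILLGAAWLLT"      # made-up sequence, for illustration only
    print(predict(seq, hidden, output))   # untrained net: labels are arbitrary
    print(residue_accuracy([1, 1, -1, -1], [1, -1, -1, -1]))  # 75.0
```

With an untrained network the predicted labels carry no meaning; the point is only to show how a 21-position window maps onto 441 inputs, 20 hidden neurons and a single tanh output whose sign is read as the label of the central residue.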

According to this definition, the percentage accuracy obtained for the set of proteins tested is ~93%. The best prediction (nep_human) was 99.9% and the worst (bach_halhm) was 71.7%. The segment-based scores are 93.6% (161 out of 172) correctly predicted and 6.4% (11 out of 172) incorrectly predicted or not found. In addition, the program finds 18 false HTMs. Finally, 63.6% (35 out of 55) match any of the predicted and observed segment locations.

Several net topologies were tried, either decreasing or increasing the window size, and hence the number of neurons in the input layer; the results were worse when the window size was decreased. This would indicate that the information contained in the 20 residues around the central one is necessary. On the other hand, results did not improve significantly when the window size was increased, and the net spent longer in the learning process. When the number of hidden layers was modified, the results were similar.

The advantage of such a simple program is its user-friendliness on a personal computer. The price to be paid for simplicity and mobility is a loss of prediction accuracy.

Acknowledgements

This research was supported by grants BIO94-0912-CO2-01 and IN94-0347 from the CICYT (Ministerio de Educación y Ciencia, Spain), and by the Centre de Referència de Biotecnologia de la Generalitat de Catalunya. Support from the F.Roviralta Foundation is also acknowledged. J.C. is a PFPI fellowship recipient of the Ministerio de Educación y Ciencia (Spain).

References

Argos,P. and Rao,J.K. (1986) A conformational preference parameter to predict helices in integral membrane proteins. Biochim. Biophys. Acta, 869, 197-214.

Bork,P., Ouzounis,C. and Sander,C. (1994) From genome sequences to protein function. Curr. Opin. Struct. Biol., 4, 393-403.

Clotet,J., Cedano,J. and Querol,E. (1994) A spreadsheet computer program combining algorithms for prediction of protein structural characteristics. Comput. Applic. Biosci., 10, 495-500.

Eisenhaber,F., Persson,B. and Argos,P. (1995) Crit. Rev. Biochem. Mol. Biol., 30, 1-94.

Fariselli,P. and Casadio,R. (1996) HTP: a neural network-based method for predicting the topology of helical transmembrane domains in proteins. Comput. Applic. Biosci., 12, 41-48.

Fausett,L. (1994) Fundamentals of Neural Networks: Architectures, Algorithms and Applications. Prentice-Hall, New Jersey.

Kyte,J. and Doolittle,R.F. (1982) A simple method for displaying the hydropathic character of a protein. J. Mol. Biol., 157, 105-132.

Persson,B. and Argos,P. (1994) Prediction of transmembrane segments in proteins utilising multiple sequence alignments. J. Mol. Biol., 237, 182-192.

Persson,B. and Argos,P. (1996) Topology prediction of membrane proteins. Protein Sci., 5, 363-371.

Qian,N. and Sejnowski,T.J. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865-884.

Rosenblatt,F. (1958) The Perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev., 65, 386-408.

Rost,B. and Sander,C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232, 584-599.

Rost,B., Casadio,R., Fariselli,P. and Sander,C. (1995) Transmembrane helices predicted at 95% accuracy. Protein Sci., 4, 521-533.

Rumelhart,D.E., Hinton,G.E. and Williams,R.J. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536.

von Heijne,G. (1992) Membrane protein structure prediction. J. Mol. Biol., 225, 487-494.

Received on August 19, 1996; revised on December 23, 1996; accepted on January 2, 1997