Predictive Vaccinology: Optimisation of Predictions Using Support Vector Machine Classifiers

Ivana Bozic (1,2), Guang Lan Zhang (2,3), and Vladimir Brusic (2,4)

(1) Faculty of Mathematics, University of Belgrade, Belgrade, Serbia
    [email protected]
(2) Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
    {guanglan,vladimir}@i2r.a-star.edu.sg
(3) School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798
(4) School of Land and Food Sciences and the Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia

Abstract. Promiscuous human leukocyte antigen (HLA) binding peptides are ideal targets for vaccine development. Existing computational models for the prediction of promiscuous peptides use hidden Markov models and artificial neural networks as prediction algorithms. We report a system based on support vector machines that outperforms previously published methods. Preliminary testing showed that it can predict peptides binding to HLA-A2 and -A3 supertype molecules with excellent accuracy, even for molecules for which no binding data are currently available.

1 Introduction

Computational predictions of peptides (short sequences of amino acids) that bind human leukocyte antigen (HLA) molecules of the immune system are essential for designing vaccines and immunotherapies against cancer, infectious disease and autoimmunity [1]. More than 1800 different HLA molecules have been characterised to date, and many of them have unique peptide-binding preferences. An HLA supertype is a group of HLA molecules that share similar molecular structure and similar binding preferences – they bind largely overlapping sets of peptides [2]. Some dozen class I HLA supertypes have been described, of which four major supertypes (HLA-A2, -A3, -B7, and -B44) are present in approximately 90% of the human population [2]. Prediction of peptide binding to a single HLA molecule is of limited relevance for the design of vaccines applicable to a large proportion of the human population. Predictions of promiscuous peptides, i.e. those that bind multiple HLA molecules within a supertype, are therefore important in the development of vaccines that are relevant to a broader population. Identification of peptides that bind HLA molecules is a combinatorial problem, so high accuracy in such predictions is of great importance: it makes identification of vaccine targets more cost- and time-effective. A large number of prediction methods have been used for the identification of HLA-binding peptides. They include [reviewed in 3] binding motifs, quantitative matrices, decision trees, artificial neural networks, hidden Markov models, and molecular modelling. In addition, models using support vector machines (SVM) were reported to perform better than other prediction methods when applied to a single HLA molecule [4,5].

M. Gallagher, J. Hogan, and F. Maire (Eds.): IDEAL 2005, LNCS 3578, pp. 375–381, 2005. © Springer-Verlag Berlin Heidelberg 2005

Predictions of peptide binding to multiple HLA molecules of a supertype were


performed using hidden Markov models (HMM) [6] and artificial neural networks (ANN) [7]. These reports also showed that accurate predictions of promiscuous peptides within a supertype can be performed for HLA variants for which no experimental data are available. The accuracy of predictions of peptide binding to multiple HLA variants within the A2 and A3 supertypes was measured by the area under the receiver operating characteristic curve (AROC) [8]. The reported values were 0.85 < AROC [...]

Given a training set of m examples x_j ∈ R^n with class labels y_j ∈ {−1, +1}, an SVM seeks a decision function f such that f(x) > 0 ⇔ y = 1 and f(x) < 0 ⇔ y = −1, which would accurately predict the classes of unseen data points. The value

ρ := min_{1 ≤ j ≤ m} f(x_j) · y_j        (1)

is called the margin and represents the "worst" classification over the whole training set. Training examples that lie right on the margin are called support vectors. If the training data are linearly separable, we seek a linear function f(x) that has the maximum margin. This is equivalent to constructing a maximum-margin hyperplane that separates the two classes (binders and non-binders) in the n-dimensional space of our training data. In the case of linearly non-separable data, the idea is to map the training data into a higher-dimensional feature space F (a Hilbert space) via a non-linear map Φ : R^n → F, dim F > n, and to construct a separating maximum-margin hyperplane there. This mapping is simplified by introducing a kernel function k:

k(x, x') = (Φ(x), Φ(x')) .        (2)
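As a concrete illustration of equation (2), the degree-2 homogeneous polynomial kernel on R^2 corresponds to an explicit three-dimensional feature map, so evaluating the kernel in input space gives the same value as an inner product in feature space. This is a minimal sketch (the specific kernel and feature map are standard textbook choices, not taken from the paper):

```python
import math

# Sketch of the "kernel trick": for the degree-2 polynomial kernel
# k(x, x') = (<x, x'>)^2 on R^2, the explicit feature map is
# Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and k computes the inner
# product in feature space without ever forming Phi explicitly.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, xp):
    """k(x, x') = (<x, x'>)^2, evaluated in input space."""
    return dot(x, xp) ** 2

def phi(x):
    """Explicit feature map into R^3 for the degree-2 kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, xp = (1.0, 2.0), (3.0, -1.0)
lhs = poly_kernel(x, xp)        # kernel evaluated in input space
rhs = dot(phi(x), phi(xp))      # inner product in feature space
print(lhs, rhs)                 # the two values agree: 1.0 1.0
```

The same identity is what allows the Gaussian kernel, whose feature space is infinite-dimensional, to be used at the cost of a single input-space evaluation.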

Because real-life data are usually non-separable (even in the feature space) due to noise, we allow some misclassified training examples. This is enabled by the introduction of a new parameter C, which controls the trade-off between the margin and the training error. Constructing a maximum-margin hyperplane is a constrained convex optimisation problem. Thus, solving the SVM problem is equivalent to finding a solution (see [12]) to the Karush-Kuhn-Tucker (KKT) conditions and, equivalently, to the Wolfe dual problem: find the α_j which maximise W(α),

W(α) = Σ_{j=1}^{m} α_j − (1/2) Σ_{j,s=1}^{m} α_j α_s y_j y_s k(x_j, x_s) ,        (3)

subject to 0 ≤ α_j ≤ C and Σ_{j=1}^{m} α_j y_j = 0.

This is a linearly constrained convex quadratic program, which is solved numerically. The decision function is

f(x) = Σ_{j=1}^{m} y_j α_j k(x, x_j) + b .        (4)

The α_j are the solutions of the corresponding quadratic program, and b is easily found using the KKT complementarity condition [12]. Commonly used kernels include the linear, Gaussian and polynomial kernels.

3.2 Implementation

We used the SVMlight package with a quadratic programming tool for solving the small intermediate quadratic programming problems, based on the method of Hildreth and D'Espo [13]. Prior to training, every peptide from our dataset was transformed into a


virtual peptide. These data were then transformed into a format compatible with the package used, and every virtual peptide (a1, a2, …, al) was translated into a set of n = 20×l binary indicators (bits). Each amino acid of a virtual peptide is represented by 20 indicators. In the complete sequence of indicators, l indicators are set to 1, representing the specific residues present at each position, and the remaining 19×l indicators are set to 0. In our case, x_j = (i_1, i_2, …, i_n), i_s ∈ {0,1}, s ∈ {1,…,n}, j ∈ {1,…,m} represents a virtual peptide. The values y_j = ±1 indicate whether a peptide binds (+1) or does not bind (−1) to an HLA molecule.

Blind testing was performed to assess the performance of the SVM for prediction of promiscuous peptides. To test the predictive accuracy of peptide binding to each of the HLA-A2 and -A3 variants, we used all peptides (binders and non-binders) related to that variant as the testing data, and all peptides related to the other variants from the same supertype as the training data. For example, the training set for HLA-A*0201 contained all peptides related to all HLA-A2 molecules except HLA-A*0201. Testing of peptide binding to each HLA variant was thus performed without inclusion of experimental data for that particular variant in the training set. The testing results are therefore likely to underestimate the actual performance, since the final model contains the available data for all HLA variants. We performed blind testing on five HLA-A2 and seven HLA-A3 molecules for which sufficient data were available for valid testing (Table 1). Other variants of HLA-A2 and -A3 molecules were excluded from testing, since there were insufficient data for generating adequate test sets. Throughout blind testing, we examined three kernels (linear, Gaussian and polynomial) and various combinations of the SVM parameters (trade-off c, σ for the Gaussian and d for the polynomial kernel).
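The sparse binary encoding described above can be sketched as follows. This is an illustration only: the alphabet ordering, function name, and example peptide are assumptions, since the paper does not specify them.

```python
# Sketch of the 20-bits-per-residue peptide encoding described above:
# each amino acid maps to a 20-dimensional indicator vector containing
# a single 1, so a peptide of length l becomes a vector of n = 20*l
# bits with exactly l bits set. The alphabetical residue ordering is an
# assumption, not taken from the paper.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def encode_peptide(peptide):
    """Translate a peptide into 20*len(peptide) binary indicators."""
    bits = []
    for residue in peptide.upper():
        indicator = [0] * 20
        indicator[AMINO_ACIDS.index(residue)] = 1  # one bit per residue
        bits.extend(indicator)
    return bits

x = encode_peptide("GILGFVFTL")  # an illustrative 9-mer -> 180 bits
print(len(x), sum(x))            # 180 bits, exactly 9 set to 1
```

With this representation, each x_j is a point in {0,1}^n, matching the formulation in Section 3.1.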
We trained 50 different SVM models for each supertype, with c varying from 0.01 to 20, d from 1 to 10, and σ from 0.001 to 1. The models with the best prediction performance (highest average AROC) used a Gaussian kernel with σ=0.1, c=0.5 for HLA-A2 and a Gaussian kernel with σ=0.1, c=2 for HLA-A3.

Table 1. Blind testing: number of peptides in the training and test sets for the A2 and A3 supertypes. Each training data set contained the peptides related to the other *14 HLA-A2 or **7 HLA-A3 molecule variants

HLA-A2*          Training data           Test data
molecule     Binders  Nonbinders     Binders  Nonbinders
*0201           224        378          440       1999
*0202           619       2361           45         25
*0204           641       2162           23        224
*0205           648       2346           16         40
*0206           621       2349           43         37

HLA-A3**         Training data           Test data
molecule     Binders  Nonbinders     Binders  Nonbinders
*0301           573       1447          107         89
*0302           534       1277          146        259
*1101           538       1313          142        223
*1102           538       1325          142        211
*3101           636       1482           44         54
*3301           645       1474           35         62
*6801           621        898           59        638
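The leave-one-variant-out protocol behind Table 1 can be sketched as follows. The dataset structure and the toy peptides below are illustrative, not the paper's actual data.

```python
# Sketch of the blind-testing protocol described above: for each HLA
# variant, all peptides related to that variant form the test set, and
# all peptides related to the other variants of the same supertype form
# the training set. The toy dataset is illustrative only.

def blind_test_splits(dataset):
    """dataset: dict mapping variant name -> list of (peptide, label)."""
    for held_out in dataset:
        test = list(dataset[held_out])
        train = [ex for variant, examples in dataset.items()
                 if variant != held_out
                 for ex in examples]
        yield held_out, train, test

# Toy A2-supertype data (peptides and labels are made up).
a2 = {
    "A*0201": [("KLNEPVLLL", 1), ("GILGFVFTL", 1)],
    "A*0202": [("AAAWYLWEV", -1)],
    "A*0204": [("SLYNTVATL", 1)],
}

for variant, train, test in blind_test_splits(a2):
    print(variant, len(train), len(test))
```

Each iteration trains a model without any data for the held-out variant, which is why the reported results likely underestimate the performance of the final model trained on all data.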

4 Experimental Results

AROC values of the SVM models that showed the best prediction performance are shown in Table 2, along with the corresponding values for the optimised HMM [6] and ANN [7] predictions. The average SVM prediction accuracy is excellent for both the A2 (AROC=0.89) and A3 (AROC=0.92) supertype predictions. SVMs performed marginally better than the HMM and ANN prediction methods on A2 supertype molecules, and significantly better on A3 supertype molecules (p
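For reference, the AROC metric used throughout the paper can be computed directly from classifier decision values as a rank statistic: it equals the probability that a randomly chosen binder is scored higher than a randomly chosen non-binder. A minimal sketch (function name and example scores are illustrative):

```python
# Sketch: AROC as the Mann-Whitney statistic over (binder, non-binder)
# score pairs -- the probability that a random binder outscores a random
# non-binder, with ties counted as 1/2. Equivalent to the area under the
# ROC curve obtained by sweeping a threshold over the decision values.

def aroc(scores, labels):
    """scores: decision values f(x); labels: +1 binder, -1 non-binder."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [2.1, 1.3, 0.4, -0.2, -1.5]
labels = [1, 1, -1, 1, -1]
print(aroc(scores, labels))  # 1 misranked pair out of 6 -> 0.8333...
```

An AROC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, which is the scale on which the reported values of 0.89 (A2) and 0.92 (A3) should be read.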