Soft Nearest Prototype Classification
Sambu Seo, Mathias Bode, and Klaus Obermayer

Abstract— We propose a new method for the construction of nearest prototype classifiers which is based on a Gaussian mixture ansatz and which can be interpreted as an annealed version of Learning Vector Quantization. The algorithm performs a gradient descent on a cost function that minimizes the classification error on the training set. We investigate the properties of the algorithm and assess its performance for several toy data sets and for an optical letter classification task. Results show (i) that annealing in the dispersion parameter of the Gaussian kernels improves classification accuracy, (ii) that classification results are better than those obtained with standard Learning Vector Quantization (LVQ 2.1, LVQ 3) for equal numbers of prototypes, and (iii) that annealing of the width parameter improves the classification capability. Additionally, the principled approach provides an explanation of a number of features of the (heuristic) LVQ methods.

Index Terms— learning vector quantization, soft nearest prototype classification, multiclass classification, Gaussian mixture model.

I. INTRODUCTION

Learning Vector Quantization (LVQ) ([1], [2], [3]) is a class of learning algorithms for nearest prototype classification (NPC). LVQ was introduced by T. Kohonen ([4]) almost 20 years ago and has since been widely used (see [5] for an extensive bibliography). Like the K-nearest neighbor method ([6]), NPC is a local classification method in the sense that classification boundaries are approximated locally. Instead of making use of all the data points of a training set, however, NPC relies on a set of appropriately chosen prototype vectors. This makes the method computationally more efficient, because the number of items which must be stored, and to which a new data point must be compared for classification, is considerably smaller.

NPCs have been motivated in the literature in two ways. One motivation comes from Bayesian decision theory, which is a fundamental approach to classification problems. Construction of the classifier involves two steps: (i) the construction of models for the probability densities of the different classes, and (ii) the construction of the classification boundaries using the criterion of maximum a posteriori (MAP) probability. If, for example, probability densities are well approximated by Gaussian mixtures [7] and if all components are assumed to be of equal strength and variance, the MAP classifier reduces to an NPC based on Euclidean distances between the data points and the centers of the components. Approaches of this kind can work well [7], but they have to cope with the disadvantage that a more difficult problem (estimation of probability densities) has to be solved than is necessary (estimation of the classification boundary), and more training data are needed.

The second motivation comes from the idea of directly estimating the discriminant functions for multiclass classification problems.
In NPC, the discriminant functions are parametrized using a set of prototype vectors for each class, and a data point is assigned to the class of its closest prototype. Often a Euclidean distance measure is used. One of the most common methods for the construction of prototype-based discriminant functions is Learning Vector Quantization (LVQ) [1], [2], for which several variants have been developed in the past. The constructed classifiers work well in many classification tasks [5], but the selection ("learning") rules have the disadvantage of being heuristic and may not be optimal.

In order to improve classification performance, cost functions have been proposed for model selection. Katagiri et al. [8], [9], [10], [11] have investigated cost functions which were derived in a two-step procedure. First a family of parametrized discriminant functions was constructed, and then a measure of performance was defined which is related (but not equal) to the rate of misclassification, leading to an individual loss for every data point. Model selection is performed by (stochastic) gradient descent on the total loss over the training set. A continuous loss function was used in order to apply gradient-based optimization, and several hyperparameters were introduced for this purpose. Discriminant functions whose parametrization is based on prototypes, and cost functions whose minimization is related to LVQ learning procedures, have been proposed by McDermott [11] and Komori et al. [10]. These approaches provide good classification results for various applications [10], [11], and the derivation of LVQ-like learning procedures from a cost function is valuable when it comes to the analysis of convergence properties. But a disadvantage remains, because the choice of the discriminant and cost functions is still heuristic. Both were constructed with the goal of deriving LVQ as an optimization procedure, but it may be hard to judge whether their particular form is the best choice for the data at hand.

In our contribution we try to overcome some of the above-mentioned difficulties by combining an explicit ansatz for the probability densities of the classes (cf. the Bayesian approach) with a criterion for model selection which directly minimizes the rate of misclassification (cf. the discriminant function approach). On the one hand, this ansatz helps to make the assumptions underlying the choice of discriminant functions and model selection more explicit, because the underlying generative model is made explicit. On the other hand, we expect the method to make more efficient use of the information which is contained in the class labels of the training set, because the discriminant function is optimized directly. Using a Gaussian mixture ansatz as an example, we derive an LVQ learning procedure and we show that: (i) classification results indeed improve compared to previously proposed LVQ methods; (ii) it is possible to define an annealing schedule for optimization which leads to classifiers with improved performance; (iii) because the underlying model assumptions are made explicit, the method can be adapted (different distance measures, different parametrization of the discriminant function, etc.) more easily to different kinds of data.

II. NEAREST PROTOTYPE CLASSIFICATION AND LEARNING VECTOR QUANTIZATION

In this section we briefly review the standard LVQ approach to the construction of nearest prototype classifiers, which will serve as our standard benchmark method in the following sections. An NPC consists of a set T = {(θ_j, c_j)}, j = 1, ..., M, of labelled prototype vectors. The prototype vectors θ_j ∈ X ⊆ R^D are vectors in data space, and the c_j ∈ I = {1, ..., N_y} are their corresponding class labels (typically M > N_y). The class of a new data point x is determined: 1) by selecting the prototype θ_q which is closest to x,

q = \arg\min_j d(x, \theta_j),    (1)

where d(x, θ_j) is the distance measure, and 2) by assigning the label c_q of this prototype to the data point x. A popular choice for d is the Euclidean distance

d(x, \theta_j) = \| x - \theta_j \|_{L_2},    (2)

but other distance measures may be chosen depending on the problem at hand.
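As a minimal illustration of eqs. (1) and (2), the following sketch (our own illustration, not the authors' code; variable names are ours) classifies a point by the label of its nearest prototype under the Euclidean distance:

```python
import numpy as np

def npc_classify(x, prototypes, labels):
    """Nearest prototype classification, eqs. (1)-(2):
    assign x the label of the prototype with minimal Euclidean distance."""
    dists = np.linalg.norm(prototypes - x, axis=1)  # d(x, theta_j)
    q = np.argmin(dists)                            # eq. (1)
    return labels[q]

# toy usage: three prototypes from two classes
prototypes = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
labels = np.array([1, 2, 1])
print(npc_classify(np.array([3.5, 0.5]), prototypes, labels))  # -> 1
```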

The most robust LVQ methods for the construction of NPCs are LVQ 2.1 and LVQ 3 [2], [3]. For every data point x from the training set, LVQ 2.1 first selects the two nearest prototypes θ_l, θ_m according to the distance measure eq. (2). If the labels c_l and c_m are different, the prototypes are adjusted according to

\theta_l(t+1) = \theta_l(t) + \alpha(t)\,(x - \theta_l), \quad c_l = y,
\theta_m(t+1) = \theta_m(t) - \alpha(t)\,(x - \theta_m), \quad c_m \neq y.

If the labels c_l and c_m are equal, no parameter update is performed. LVQ 3 differs from LVQ 2.1 by an additional update rule for the prototype vectors θ_l, θ_m if both class labels c_l, c_m are equal to the label y of the actual data point. In this case, the prototype vectors are changed according to

\theta_{l,m}(t+1) = \theta_{l,m}(t) + \epsilon\,\alpha(t)\,(x - \theta_{l,m}), \quad c_l = c_m = y,

where ε < 1 scales the learning rate. For both algorithms, weight vectors are updated only if the data point x is close to the classification boundary, i.e. if it falls into a window

\min\!\left( \frac{d(x,\theta_l)}{d(x,\theta_m)}, \frac{d(x,\theta_m)}{d(x,\theta_l)} \right) > s, \qquad s = \frac{1-w}{1+w},

of relative width 0 < w ≪ 1 [2]. This "window rule" was introduced because otherwise the prototype vectors may diverge.
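The following sketch (our own illustration, not the authors' code; the window width w = 0.2 is arbitrary) shows a single LVQ 2.1 update with the window rule above:

```python
import numpy as np

def lvq21_step(x, y, prototypes, labels, alpha, w=0.2):
    """One LVQ 2.1 step: adjust the two nearest prototypes if their labels
    differ, and only if x falls into the window of relative width w."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    l, m = np.argsort(dists)[:2]                      # two nearest prototypes
    if labels[l] == labels[m]:
        return prototypes                             # no update in LVQ 2.1
    s = (1.0 - w) / (1.0 + w)
    if min(dists[l] / dists[m], dists[m] / dists[l]) <= s:
        return prototypes                             # outside the window
    if labels[l] != y:                                # let l be the correct-label prototype
        l, m = m, l
    if labels[l] != y:
        return prototypes                             # neither label matches y
    prototypes[l] += alpha * (x - prototypes[l])      # attract correct class
    prototypes[m] -= alpha * (x - prototypes[m])      # repel incorrect class
    return prototypes
```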

III. SOFT NEAREST PROTOTYPE CLASSIFICATION

A. Cost function and learning rule: General case

Let us consider a data set S = {(x_i, y_i)}, i = 1, ..., N, x_i ∈ X ⊆ R^D, y_i ∈ I, where N is the number of (training) data points and I is the set of class labels. Our goal is to find the NPC which optimizes classification performance. Therefore, we select a set T = {(θ_j, c_j)}, j = 1, ..., M, of labelled prototype vectors whose parameters are determined by minimizing the rate E(S; T) of misclassification with respect to the θ_j for the training set, i.e.

E(S;T) = \frac{1}{N} \sum_{k=1}^{N} \sum_{j=1}^{M} P(j|x_k)\,\bigl(1 - \delta(y_k = c_j)\bigr) \;\overset{!}{=}\; \min,    (3)
P(j|x_k) = \delta(j = q_k), \qquad q_k = \arg\min_r \| x_k - \theta_r \|.

δ(·) is 1 if · is true, and 0 else. P(j|x_k) is the assignment probability, i.e. the probability that a data point x_k is assigned to prototype θ_j. If P(j|x_k) is 1 for θ_j being the nearest prototype to x_k, and 0 else, the NPC operates in a "winner-takes-all" mode. In order to minimize E(S;T) with respect to the parameters θ, we introduce fuzzy assignment probabilities P(j|x). Replacing hard by soft assignments allows for a gradient-based optimization procedure, as we will see soon. Also, classification errors which result from misclassifications of data points near the class boundaries are weighted less, which avoids oscillations in the values of θ during learning and leads to a faster convergence rate. If the assignment probabilities are of the normalized exponential form

P(j|x) = \frac{\exp(-d(x,\theta_j))}{\sum_{k=1}^{M} \exp(-d(x,\theta_k))},    (4)

where d(x, θ) is the distance measure between data point x and prototype θ, we can rewrite the cost function, eq. (3), as the sum

E_s(\{(x,y)\}; T) = \frac{1}{N} \sum_{k=1}^{N} ls((x_k, y_k); T)    (5)

of the individual costs

ls_k = \sum_{j=1}^{M} P(j|x_k)\,\bigl(1 - \delta(y_k = c_j)\bigr) = \sum_{\{j:\, c_j \neq y_k\}} P(j|x_k).    (6)

ls_k is an abbreviation of ls((x_k, y_k); T).
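A small sketch of eqs. (4)-(6) (our own illustration; a squared Euclidean distance is assumed for d):

```python
import numpy as np

def soft_assignments(x, prototypes):
    """Normalized-exponential assignment probabilities P(j|x), eq. (4),
    with d(x, theta_j) = ||x - theta_j||^2 assumed as the distance measure."""
    d = np.sum((prototypes - x) ** 2, axis=1)
    e = np.exp(-(d - d.min()))          # subtract the minimum for numerical stability
    return e / e.sum()

def individual_loss(x, y, prototypes, labels):
    """Individual cost ls_k, eq. (6): total assignment probability of x
    to all prototypes carrying an incorrect class label."""
    p = soft_assignments(x, prototypes)
    return p[labels != y].sum()
```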

Equation (6) shows that the individual loss of a data point x_k is the sum of the assignment probabilities of x_k to all the prototypes of the incorrect classes. Because the individual costs are continuous and bounded, 0 ≤ ls((x, y); T) ≤ 1, with respect to θ, the cost function, eq. (5), can be minimized by stochastic gradient descent [12], [13],

\theta_l(t+1) = \theta_l(t) - \alpha(t)\, \frac{\partial\, ls_t}{\partial \theta_l},    (7)

where t is the iteration number and α(t),

\sum_{t=0}^{\infty} \alpha(t) = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha(t)^2 < \infty,    (8)

is the learning rate. Using

\frac{\partial\, ls_t}{\partial \theta_l} = -P(l|x_t)\,\bigl(\delta(c_l \neq y_t) - ls_t\bigr)\, \frac{\partial\, d(x_t,\theta_l)}{\partial \theta_l},    (9)

we then obtain the learning rule

\theta_l(t+1) = \theta_l(t) - \alpha(t)\,(\Delta\theta_l)(t),    (10)
(\Delta\theta_l)(t) = \begin{cases} P(l|x_t)\, ls_t\, \dfrac{\partial\, d(x_t,\theta_l)}{\partial \theta_l}, & \text{if } c_l = y, \\[4pt] -P(l|x_t)\,(1 - ls_t)\, \dfrac{\partial\, d(x_t,\theta_l)}{\partial \theta_l}, & \text{if } c_l \neq y. \end{cases}

Because of the "soft" assignment probabilities, eq. (4), the prototypes are modified for a given data point according to the data-dependent learning rate P(l|x_t)·|δ(c_l ≠ y_t) − ls_t|·α(t). Once the prototypes are determined, new data points x can be classified using

\hat{c} = \arg\max_{c} \sum_{\{j:\, c_j = c\}} P(j|x).    (11)

If the distance measure emphasizes the local neighborhood, then eqs. (11) and (4) reduce to an NPC (eq. (1) and below).
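A sketch of the soft classification rule, eq. (11) (our own illustration; the squared Euclidean distance is assumed in eq. (4)):

```python
import numpy as np

def snpc_classify(x, prototypes, labels):
    """Classify x by the class with the largest summed assignment
    probability, eq. (11)."""
    d = np.sum((prototypes - x) ** 2, axis=1)
    p = np.exp(-(d - d.min()))
    p /= p.sum()                               # P(j|x), eq. (4)
    classes = np.unique(labels)
    scores = [p[labels == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]
```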

B. The Gaussian Mixture Ansatz

Let us assume that the probability density p(x) of the data points x can be described by a mixture model, and that every component j of the mixture is homogeneous in the sense that it generates data points which belong only to one class c_j. The probability density of the data is then given by

p(x|T) = \sum_{c=1}^{N_y} \sum_{\{j:\, c_j = c\}} p(x|j)\, p(j),    (12)

where N_y is the number of classes and c_j is the class label of the data points generated by component j. p(j) is the probability that data points are generated by a particular component j, and p(x|j) is the conditional probability that this component j generates a particular data point x. Let us now consider a data point x and its true class label y, and let us define the restricted probability densities

p(x, y|T) = \sum_{\{j:\, c_j = y\}} p(x|j)\, p(j),    (13)
p(x, \bar{y}|T) = \sum_{\{j:\, c_j \neq y\}} p(x|j)\, p(j).    (14)

p(x, y|T) is the probability density that a data point x is generated with the correct class label y, and p(x, ȳ|T) is the probability density that a data point x is generated with another class label c ≠ y. Then we can define the cost function of the classification problem via the rate

E_s(\{(x,y)\}; T) = \frac{1}{N} \sum_{k=1}^{N} ls((x_k, y_k); T), \qquad ls((x_t, y_t); T) = \frac{p(x_t, \bar{y}_t|T)}{p(x_t|T)},    (15)

of misclassification which should be minimized (i.e. the probability that a data point is assigned to the "correct" class is implicitly maximized). Because classification using NPCs depends only on the relative distances between data points and prototypes, we assume that every component has the same width and strength, i.e. σ_j² = σ² and p(j) = 1/M, ∀ j = 1, ..., M. For a mixture ansatz with D-dimensional Gaussian components of similar width and strength,

p(x|j) = (2\pi\sigma^2)^{-D/2} \exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right),    (16)

we obtain

ls((x,y);T) = \frac{p(x_t,\bar{y}_t|T)}{p(x_t|T)} = \frac{\sum_{\{j:\, c_j \neq y\}} \exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right)}{\sum_{k} \exp\!\left( -\frac{(x-\theta_k)^2}{2\sigma^2} \right)} = \sum_{\{j:\, c_j \neq y\}} P(j|x),    (17)

where

P(j|x) = \frac{p(x|j)\,P(j)}{p(x)} = \frac{\exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right)}{\sum_{k} \exp\!\left( -\frac{(x-\theta_k)^2}{2\sigma^2} \right)}    (18)

is the posterior probability that the data point x was generated by the component j. The individual cost given by eq. (17) is a special case of the individual cost given by eq. (6) with the distance measure d(x, θ_j) = (x − θ_j)²/(2σ²). Stochastic gradient descent then leads to the learning rule

\theta_l(t+1) = \theta_l(t) - \tilde{\alpha}(t)\,(\Delta\theta_l)(t),    (19)
(\Delta\theta_l)(t) = P(l|x_t)\,\bigl(\delta(c_l \neq y_t) - ls_t\bigr)\,(x_t - \theta_l) = \begin{cases} -P(l|x_t)\, ls_t\,(x_t - \theta_l), & \text{if } c_l = y, \\ P(l|x_t)\,(1 - ls_t)\,(x_t - \theta_l), & \text{if } c_l \neq y, \end{cases}

with α̃(t) = α(t)/σ². In contrast to the algorithms of the LVQ family, which are based on hard assignments (only the two winner prototypes from the correct and the incorrect class are affected), all of the prototypes with the correct label are attracted towards the data point x proportional to their distance and weighted by the factor P(l|x) ls, whereas all of the prototypes with incorrect labels are repelled from the data point x proportional to their distance and weighted by the factor P(l|x)(1 − ls). σ² is a hyperparameter of the learning rule (19). In the spirit of deterministic annealing, which is a useful optimization procedure for clustering problems (cf. [14], [15]), σ² is set to a large value initially and is decreased during optimization ("annealing") until an optimal value σ_f² is reached. Figure 1 summarizes the learning algorithm, which we will call soft nearest prototype classification (SNPC) in the following.

Fig. 1. Summary of the SNPC method:
1) Set the initial, σ_i², and final, σ_f², values of the dispersion parameter, the annealing factor 0 < γ < 1, the number of prototypes for each class, and the schedule for the learning rate α(t).
2) Initialize the prototypes {(θ_k, c_k)}, k = 1, ..., M.
3) While σ² > σ_f²:
   a) set the initial learning rate α_0,
   b) repeat
      i) calculate the assignment probabilities (eq. (18)),
      ii) update the prototypes (eq. (19)),
      iii) update the learning rate,
      until a stopping criterion is fulfilled,
   c) σ² ← γ σ².
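A compact sketch of SNPC training with annealing, following eqs. (18)-(19) and the loop of Fig. 1 (our own illustration; the schedule constants and stopping criterion are placeholders, not the values used in the paper):

```python
import numpy as np

def snpc_train(X, Y, prototypes, labels, sigma2_init=1.0, sigma2_final=0.01,
               gamma=0.95, alpha0=0.05, sweeps_per_sigma=5):
    """SNPC with annealing: soft assignments (eq. 18), prototype update
    (eq. 19), and a geometric annealing schedule for sigma^2 (Fig. 1)."""
    sigma2 = sigma2_init
    while sigma2 > sigma2_final:
        alpha = alpha0
        for sweep in range(sweeps_per_sigma):            # stand-in stopping criterion
            for t, (x, y) in enumerate(zip(X, Y), start=1):
                d = np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
                p = np.exp(-(d - d.min()))
                p /= p.sum()                             # P(j|x), eq. (18)
                ls = p[labels != y].sum()                # individual loss, eq. (17)
                # eq. (19): attract correct-label prototypes, repel the others
                sign = np.where(labels == y, ls, -(1.0 - ls))
                prototypes += (alpha / sigma2) * (sign * p)[:, None] * (x - prototypes)
                alpha = alpha0 / (1.0 + 0.01 * t)        # placeholder learning-rate decay
        sigma2 *= gamma                                  # anneal the dispersion parameter
    return prototypes
```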

C. The “Window Rule”

Equations (19) show that SNPC, like the other algorithms of the LVQ family, performs an update of the prototype vectors which depends on whether the label y of the data point x and the label c_l of the prototype θ_l are similar or different. SNPC, however, differs from LVQ by a data-dependent learning rate

P(l|x)\,\bigl(\delta(c_l \neq y) - ls\bigr) = \begin{cases} -ls\,(1 - ls)\, P_y(l|x), & \text{if } c_l = y, \\ ls\,(1 - ls)\, P_{\bar{y}}(l|x), & \text{if } c_l \neq y, \end{cases}    (20)

where P_y(l|x) and P_ȳ(l|x),

P_y(l|x) = \frac{\exp\!\left( -\frac{(x-\theta_l)^2}{2\sigma^2} \right)}{\sum_{\{j:\, c_j = y\}} \exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right)}, \qquad P_{\bar{y}}(l|x) = \frac{\exp\!\left( -\frac{(x-\theta_l)^2}{2\sigma^2} \right)}{\sum_{\{j:\, c_j \neq y\}} \exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right)},

denote the posterior probabilities that the data point x belongs to a component l with label y or with a label c ≠ y, respectively (see the appendix for a brief derivation). The common factor ls(1 − ls), 0 ≤ ls(1 − ls) ≤ 0.25, is data-dependent and gives rise to the fact that only data points which fall into a particular area of input space contribute to the update of the prototypes. This area is characterized by ls being sufficiently larger than zero and sufficiently smaller than one. It covers all data points close to the current classification boundary of the SNPC. This "active area" corresponds to the window of the LVQ 2.1 and LVQ 3 methods and performs a similar function (making the methods more robust), as we will see below.

Figure 2 shows the size of the active region for a two-dimensional toy example with four prototypes (white symbols) from four classes, for different values of the dispersion parameter σ². Gray values indicate the strength of the factor ls(1 − ls) for data points which have the same label as their closest prototype. The figure shows that the width of the active area grows with increasing σ² and that the value of ls(1 − ls) decreases with increasing distance from the current class boundaries. To accelerate the learning process one can define a threshold value 0 < η ≤ 0.25 for the update and change a prototype only if ls(1 − ls) > η. The emergence of a window zone for the update of prototypes can also be exploited by active learning strategies [16], [17], [18]: only data points which fall into the window region must be labelled.
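A small sketch of the data-dependent window factor and the threshold test described above (our own illustration; the threshold value is arbitrary):

```python
import numpy as np

def window_factor(x, y, prototypes, labels, sigma2):
    """Window factor ls*(1 - ls) of eq. (20); it vanishes far away from the
    current class boundaries and is at most 0.25 on them."""
    d = np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
    p = np.exp(-(d - d.min()))
    p /= p.sum()
    ls = p[labels != y].sum()
    return ls * (1.0 - ls)

def in_active_area(x, y, prototypes, labels, sigma2, eta=0.05):
    """Threshold rule: update (or request a label for) x only if its
    window factor exceeds eta, 0 < eta <= 0.25."""
    return window_factor(x, y, prototypes, labels, sigma2) > eta
```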

IV. NUMERICAL EXPERIMENTS FOR TOY EXAMPLES

We first consider two simple one-dimensional classification problems (Fig. 3). In problem no. 1 (Fig. 3 a), data points are drawn independently and identically distributed (iid) from two Gaussian distributions, which are located at μ_1 = −1.5 (class 1) and μ_2 = 1.5 (class 2) and whose widths are σ_d = 1. The classifier consists of two prototypes θ_{1,2}, one for each class. In problem no. 2 (Fig. 3 b), data points are drawn iid from three Gaussian distributions, which are located at μ_1 = −3, μ_3 = 3 (class 1), and μ_2 = 0 (class 2) and whose widths are σ_d = 1. The classifier consists of three prototypes, two (θ_{1,3}) for class 1 and one (θ_2) for class 2. Figures 3 c,d show the log of the test error E_s^T, measured on a test data set with N_t data points,

E_s^T = \frac{1}{N_t} \sum_{k=1}^{N_t} ls((x_k, y_k); T),    (21)

as a function of the positions of the prototypes for a value σ² = 1 of the dispersion parameter. N_t denotes the number of test data points. For problem no. 2 there is a pronounced minimum at finite values of the θ_i, and SNPC converges to these optimal values (Fig. 3 f). For problem no. 1 the minimum of E_s^T is assumed for |θ_{1,2}| → ∞; hence the values of the prototypes diverge (Fig. 3 e). For the case |θ_{1,2}| → ∞, the class boundary x_1 = 0 is fixed and therefore the rate of misclassification,

E_h^T = \frac{1}{N_t} \sum_{k=1}^{N_t} \bigl(1 - \delta(y_k, c_{q_k})\bigr), \qquad q_k = \arg\min_r \| x_k - \theta_r \|^2,    (22)

remains constant. The "soft" loss function, however, decreases if the prototypes diverge. The divergence of prototypes in certain situations is a known feature of the algorithms in the LVQ family and arises because a ratio of probability densities, eq. (15), is minimized during learning. Note, however, that this divergence problem does not affect the usefulness of the resulting classifier, because (i) the decrease in test error, eq. (21), is marginal for large |θ_i|; (ii) it rarely occurs in practical problems; and (iii) the classification boundary for hard classification, eqs. (1) and (2), does not change (crosses in Figs. 3 e,f). Learning may thus be terminated as soon as |θ_i| crosses a given threshold.

Next we consider a two-dimensional classification problem (Fig. 4). 300 data points are drawn iid from each of four Gaussian components for training (Figs. 4 a,b); the classifier consists of two prototypes for every class. The final prototypes are determined using SNPC with annealing in the dispersion parameter. The schedule of annealing was σ²(t) = 0.02 · 0.9^(t−1) with σ_f² = 0.0005. The learning rate was changed according to α(t) = 0.1 · 6000/(6000 + t).
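The two schedules just described can be written as small helper functions (our own sketch of the stated formulas):

```python
def sigma2_schedule(t, sigma2_0=0.02, gamma=0.9):
    """Dispersion annealing used in the 2D toy example: sigma^2(t) = 0.02 * 0.9**(t-1)."""
    return sigma2_0 * gamma ** (t - 1)

def alpha_schedule(t, alpha0=0.1, tau=6000.0):
    """Learning-rate decay used in the 2D toy example: alpha(t) = 0.1 * 6000 / (6000 + t)."""
    return alpha0 * tau / (tau + t)
```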

Figure 4 c shows the log of the "soft" test error, eq. (21), and the log of the "hard" test error, eq. (22), as a function of the log of the inverse of the dispersion parameter σ². The test set contained N_t = 4000 data points (1000 data points drawn iid from each class). Figure 4 shows (i) that annealing leads to a better generalization performance of the classifier (cf. Fig. 4 c), (ii) that annealing optimizes the structure of the model (the number of prototypes is reduced from 8 to 5, cf. Fig. 4 b), and (iii) that there is an optimal value for the dispersion parameter σ² (cf. arrow in Fig. 4 c). The latter can be understood as follows: for large values of σ², prototypes are located outside the data distribution (in order to minimize E_s(S; T)). Therefore the classification boundary is quite different from the optimal boundary of a Bayes classifier, and the values of E_h^T remain large. If, however, σ² becomes too small, only very few data points are located within the window zone, and overfitting due to noise increases the generalization error again.
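For completeness, a sketch of the two test-error measures, eqs. (21) and (22) (our own illustration; squared Euclidean distances are used throughout):

```python
import numpy as np

def test_errors(X, Y, prototypes, labels, sigma2):
    """Soft test error E_s^T (eq. 21) and hard test error E_h^T (eq. 22)."""
    soft = 0.0
    hard = 0.0
    for x, y in zip(X, Y):
        d2 = np.sum((prototypes - x) ** 2, axis=1)
        dd = d2 / (2.0 * sigma2)
        p = np.exp(-(dd - dd.min()))
        p /= p.sum()                               # P(j|x), eq. (18)
        soft += p[labels != y].sum()               # ls((x, y); T)
        hard += float(labels[np.argmin(d2)] != y)  # nearest-prototype error
    n = len(X)
    return soft / n, hard / n
```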

V. BENCHMARKS WITH REAL WORLD DATA

We investigated the performance of the algorithms LVQ 2.1, LVQ 3, SNPC with a fixed value of the dispersion parameter, and SNPC with annealing using the data set letter from the UCI Machine Learning Repository¹. The data set is generated from a large number of black-and-white rectangular pixel displays of the 26 capital letters of the English alphabet. The letters were taken from 20 different fonts, and each symbol was randomly distorted. Each pixel image was afterwards converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values between 0 and 15. The data set contains a total of 20,000 items.

¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/letter-recognition/

For the investigation of the performance of the methods LVQ 2.1, LVQ 3, and SNPC with a fixed value of the dispersion parameter, we split the data set into two subsets with the same number of data points for each label. The first subset, the training set, was used to find the optimal values of the hyperparameters (window width w for LVQ 2.1 and LVQ 3, and parameter σ for SNPC without annealing). 10-fold cross-validation was performed on the training set for several values of the hyperparameters, and the values which gave rise to the minimal error were selected. The generalization error for the optimal hyperparameters was estimated by 10-fold cross-validation on the second (test) set. For the investigation of the performance of the annealed variant of SNPC, SNPC-AN, we split the data set into two subsets which contained 80% (for training) and 20% (for the determination of the test error) of the total amount of data. Using the training set, the optimal parameters were selected by 10-fold cross-validation, and the generalization error for the optimal parameters was estimated on the test set.

The prototypes of each class l were initialized by adding random vectors to the centroid μ_l ∈ R^D of all the data points which belong to this class, i.e. θ_l = μ_l + 0.02 · r · σ_l, where σ_l is the vector of standard deviations of the data points with label l along the coordinate axes, κ is the number of prototypes per class, and r ∈ [−1, 1] is a number drawn randomly from a uniform distribution. ε was set to 0.1 for LVQ 3. The learning rate was annealed using α(t+1) = 0.1 · 12000/(12000 + t) in the case of LVQ 2.1 and LVQ 3, and α(t) = α_0 · 52000/(52000 + t) for SNPC with and without annealing. For SNPC-AN, we used the schedule σ²(t) = γ σ²(t−1), where γ = 0.99 and σ²(0) = κ_1 σ_opt², with κ_1 chosen depending on the number of prototypes per class. Annealing was terminated if σ²(t) < κ_2 σ_opt², where κ_2 was chosen from [0.5, 0.8] depending on the number of prototypes per class. Learning was terminated after all training data points had been used 30 times.

Fig. 5 shows the average rate of misclassification, eq. (22), and the standard deviation obtained on the test set. With all parameters optimized, LVQ 2.1 performs better than LVQ 3, despite the heuristic correction rule for the divergence of prototypes. SNPC, however, consistently performs better than LVQ, and the annealed version provides better results than the version with a constant value of σ², even for optimized values of the hyperparameters.
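The prototype initialization described above can be sketched as follows (our own reading of the initialization rule; the scaling constant 0.02 is from the text, everything else, including drawing one random number per prototype, is an assumption made for illustration):

```python
import numpy as np

def init_prototypes(X, Y, per_class, scale=0.02, rng=np.random.default_rng(0)):
    """Place `per_class` prototypes per class near the class centroid,
    perturbed by a small fraction of the per-dimension standard deviation."""
    prototypes, labels = [], []
    for c in np.unique(Y):
        Xc = X[Y == c]
        mu, sd = Xc.mean(axis=0), Xc.std(axis=0)
        for _ in range(per_class):
            r = rng.uniform(-1.0, 1.0)           # one random number per prototype
            prototypes.append(mu + scale * r * sd)
            labels.append(c)
    return np.array(prototypes), np.array(labels)
```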

VI. RELATION TO OTHER WORK

In this section we compare the performance of the SNPC algorithms with the performance obtained with the Bayes classifier approach of [7] and with the LVQ algorithm derived from the cost function approach of [9], [11]. Kambhatla and Leen [7] constructed a Bayes classifier (Gaussian mixture Bayes classifier, GMB), which assigns a data point x to a class y_q according to

y_q = \arg\max_k \; p(y_k)\, p(x|y_k),    (23)

where p(y_k) denotes the prior probability of the k-th class and p(x|y_k) denotes the corresponding class-conditional density function. They approximated the density functions by mixtures

p(x|y_k) = \sum_{j=1}^{N_k} p(j)\, p_k(x|j)

of D-dimensional Gaussian components,

p_k(x|j) = \frac{\exp\!\left( -\tfrac{1}{2} (x - \theta_{kj})^{\mathsf{T}} \Sigma_{kj}^{-1} (x - \theta_{kj}) \right)}{(2\pi)^{D/2} \sqrt{|\Sigma_{kj}|}}.

N_k denotes the number of components of the k-th class, and θ_{kj} and Σ_{kj} are the mean and the covariance matrix of the j-th component of the k-th class. The centroids and covariance matrices are estimated separately for each class using the expectation maximization method [19]. In order to obtain comparable benchmark results we assume that all components of the GMB mixtures have similar strengths and widths, i.e. Σ_{kj} = σ² I and p(j) = 1/M, ∀ j = 1, ..., M, and we treat σ as a hyperparameter which has to be optimized. In this case, the Bayes classifier reduces to the NPC

y_q = \arg\min_k \; \Bigl( \min_{j=1,\ldots,N_k} \| x - \theta_{kj} \|^2 \Bigr).    (24)
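Under the equal-width, equal-strength assumption, eq. (24) is again a nearest-prototype rule; a minimal sketch (our own illustration):

```python
import numpy as np

def gmb_classify(x, class_prototypes):
    """Eq. (24): assign x to the class whose closest component centroid
    is nearest.  `class_prototypes` maps class label -> array of centroids."""
    best_class, best_dist = None, np.inf
    for c, theta in class_prototypes.items():
        d = np.min(np.sum((theta - x) ** 2, axis=1))
        if d < best_dist:
            best_class, best_dist = c, d
    return best_class
```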

Figure 6 shows the rate of misclassification, eq. (22), as a function of the number of prototypes for NPCs, eq. (1), which were constructed using the GMB learning procedure for the data set letter. The dispersion parameter σ² was optimized using 10-fold cross-validation on the training set (50% of the data). The optimization of the centroids θ_{kj} was performed separately for each class k, and learning was terminated when the average Euclidean distance between the "new" and the "old" parameter vectors was less than 10^{-7}. The worse classification performance is related to the fact that GMB solves the more difficult estimation problem and makes less efficient use of the information contained in the labels of the training data with respect to the classification boundary than SNPC. If the model complexity of GMB is increased, i.e. if full covariance matrices are estimated from the data, the performance of the GMB classifier improves. However, the classification results are then no longer directly comparable with SNPC because of the different complexity of the model classes which underlie GMB and SNPC.

Katagiri et al. [8], [9], [10], [11] constructed their classifiers using a two-step procedure which they called the minimum classification error (MCE) and generalized probabilistic descent (GPD) methods. First a parametrized set of discriminant functions g(x; k|T) is constructed for each class k, where T denotes the set of parameters. Then an individual loss ls_k(x|T) is constructed for each data point x, based on its distance d_k(x|T) from the discriminant function, such that (i) the loss is continuous and differentiable with respect to the parameters in T and (ii) the loss is monotonically related to the rate of misclassification obtained with hard classification using the final classifiers. For the following benchmarks we used the method described in [11]. Let T = {(θ_j, c_j)}, j = 1, ..., M, be a set of labelled prototypes. Then

g(x; k|T) = \Bigl[ \sum_{\{j:\, c_j = k\}} \| x - \theta_j \|^{-\xi} \Bigr]^{-1/\xi},    (25)

d_k(x|T) = g_k - \Bigl[ \frac{1}{N_y - 1} \sum_{j \neq k} g_j^{-\nu} \Bigr]^{-1/\nu},    (26)

ls_k(x|T) = \begin{cases} \dfrac{1}{1 + e^{-\eta\, d_k(x|T)}}, & \text{if } k = y, \\ 0, & \text{else}, \end{cases}    (27)

where N_y is the number of classes, g_k is an abbreviation of g(x; k|T), and ξ, ν, η are positive constants (the hyperparameters of the method). The average loss over the training set is then minimized using gradient descent methods. For large values of ξ and ν, the resulting classifier is equivalent to an NPC based on a Euclidean distance measure, and the following learning rule is obtained:

\theta_l(t+1) = \theta_l(t) + \tilde{\alpha}(t)\, \Delta\theta_l(t),    (28)
\Delta\theta_l(t) = \begin{cases} ls_y (1 - ls_y)\,(x - \theta_l), & \text{if } l = q_y, \\ -ls_y (1 - ls_y)\,(x - \theta_l), & \text{if } l = q_{\bar{y}}, \\ 0, & \text{else}, \end{cases}

where α̃(t) = η α(t), ls_y is an abbreviation of ls_y(x|T), q_y is the nearest prototype of the correct class, and q_ȳ is the nearest prototype of the incorrect classes. Equation (28) describes an LVQ learning rule; the window rule of LVQ 2.1 is implemented via the prefactor ls_y(1 − ls_y).

Figure 6 shows the rate of misclassification, eq. (22), as a function of the number of prototypes for NPCs constructed using the MCE/GPD learning procedure, eq. (28), for the data set letter. Results were obtained in the limit of large values of ξ and ν, and the hyperparameter η was optimized using 10-fold cross-validation on the training set (50% of the data). The figure shows that the MCE/GPD method performs as well as SNPC for fixed values of the hyperparameters (see Fig. 5), but that SNPC provides better classification results if annealing is used. Both methods have been compared using Euclidean distance measures. Because SNPC is related to a generative model approach, however, the distance measure can be related to the properties of the probability distribution of the data, and the classifier and its selection procedure can be adapted if prior knowledge about the density functions is available.
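A sketch of the MCE/GPD quantities as given in eqs. (25)-(27) above (our own illustration based on this reconstruction of the equations; the hyperparameter values are arbitrary):

```python
import numpy as np

def mce_loss(x, y, prototypes, labels, xi=4.0, nu=4.0, eta=1.0):
    """Discriminant g_k (eq. 25), misclassification measure d_y (eq. 26)
    and sigmoidal loss ls_y (eq. 27) for a single labelled data point."""
    dist = np.linalg.norm(prototypes - x, axis=1)
    classes = np.unique(labels)
    g = np.array([np.sum(dist[labels == c] ** (-xi)) ** (-1.0 / xi)
                  for c in classes])                         # eq. (25)
    k = int(np.where(classes == y)[0][0])
    others = np.delete(g, k)
    d_y = g[k] - (np.mean(others ** (-nu))) ** (-1.0 / nu)   # eq. (26)
    return 1.0 / (1.0 + np.exp(-eta * d_y))                  # eq. (27)
```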

VII. SUMMARY

In this paper we investigated a principled approach to Learning Vector Quantization. Starting from a cost function which can be interpreted as the rate of misclassification for the case of fuzzy assignments of data points to prototypes, we derived a learning algorithm for the prototypes using stochastic gradient descent. On the one hand, the principled approach provides an explanation of several features of the (heuristic) LVQ methods, including the divergence of prototypes and the role of the window region. On the other hand, the learning algorithm generates nearest prototype classifiers with reduced test error for the same number of prototypes and may therefore be beneficial in applications. Because of the emergence of a window region in SNPC, active learning strategies can be used to minimize the number of labelled data points necessary for learning.

APPENDIX

Derivation of eq. (20):

P(l|x)\,\bigl(\delta(c_l \neq y) - ls\bigr)
 = \delta(c_l \neq y)\, P(l|x)\,(1 - ls) - \delta(c_l = y)\, P(l|x)\, ls
 = \delta(c_l \neq y)\,(1 - ls)\, \frac{Z_{\bar{y}}(x)}{Z(x)}\, \frac{\exp\!\left( -\frac{(x-\theta_l)^2}{2\sigma^2} \right)}{Z_{\bar{y}}(x)} - \delta(c_l = y)\, ls\, \frac{Z_y(x)}{Z(x)}\, \frac{\exp\!\left( -\frac{(x-\theta_l)^2}{2\sigma^2} \right)}{Z_y(x)}
 = \begin{cases} -ls\,(1 - ls)\, P_y(l|x), & \text{if } c_l = y, \\ ls\,(1 - ls)\, P_{\bar{y}}(l|x), & \text{if } c_l \neq y. \end{cases}

Z(x), Z_y(x), and Z_ȳ(x) are the normalization factors of the assignment probabilities, i.e.

\frac{Z_y(x)}{Z(x)} = \frac{\sum_{\{j:\, c_j = y\}} \exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right)}{\sum_{r} \exp\!\left( -\frac{(x-\theta_r)^2}{2\sigma^2} \right)} = 1 - ls, \qquad \frac{Z_{\bar{y}}(x)}{Z(x)} = \frac{\sum_{\{j:\, c_j \neq y\}} \exp\!\left( -\frac{(x-\theta_j)^2}{2\sigma^2} \right)}{\sum_{r} \exp\!\left( -\frac{(x-\theta_r)^2}{2\sigma^2} \right)} = ls,

and

P_y(l|x) = \frac{\exp\!\left( -\frac{(x-\theta_l)^2}{2\sigma^2} \right)}{Z_y(x)}, \qquad P_{\bar{y}}(l|x) = \frac{\exp\!\left( -\frac{(x-\theta_l)^2}{2\sigma^2} \right)}{Z_{\bar{y}}(x)}.
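A quick numerical sanity check of the identity in eq. (20) (our own sketch; it compares the left- and right-hand sides for random prototypes):

```python
import numpy as np

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(6, 2))
labels = np.array([1, 1, 2, 2, 3, 3])
x, y, sigma2 = rng.normal(size=2), 2, 0.5

d = np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
e = np.exp(-d)
P = e / e.sum()                                  # P(l|x)
ls = P[labels != y].sum()
Py = e / e[labels == y].sum()                    # P_y(l|x)
Pybar = e / e[labels != y].sum()                 # P_ybar(l|x)

lhs = P * ((labels != y).astype(float) - ls)
rhs = np.where(labels == y, -ls * (1 - ls) * Py, ls * (1 - ls) * Pybar)
assert np.allclose(lhs, rhs)                     # eq. (20) holds
```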

REFERENCES

[1] T. Kohonen, "Improved versions of learning vector quantization," in Proc. IJCNN International Joint Conference on Neural Networks, vol. 1, pp. 545-550, 1990.
[2] T. Kohonen, Self-Organizing Maps. Springer-Verlag, 2001.
[3] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, and K. Torkkola, "LVQ_PAK: The Learning Vector Quantization Program Package," Laboratory of Computer and Information Science, Helsinki University of Technology, 1995.
[4] T. Kohonen, "Learning Vector Quantization," Technical report, Helsinki Univ. of Tech., 1986.
[5] Neural Networks Research Centre, Helsinki Univ. of Tech., "Bibliography on the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ)," http://liinwww.ira.uka.de/bibliography/Neural/SOM.LVQ.html, 2002.
[6] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York, 2000.
[7] N. Kambhatla and T. K. Leen, "Classifying with Gaussian mixtures and clusters," in Advances in Neural Information Processing Systems, vol. 7, pp. 681-688, 1995.
[8] B.-H. Juang and S. Katagiri, "Discriminative learning for minimum error classification (pattern recognition)," IEEE Transactions on Signal Processing, vol. 40, pp. 3043-3054, Dec. 1992.
[9] S. Katagiri, C.-H. Lee, and B.-H. Juang, "New discriminative training algorithms based on the generalized probabilistic descent method," in Neural Networks for Signal Processing, Proceedings of the IEEE Workshop, pp. 299-308, 1991.
[10] T. Komori and S. Katagiri, "Application of a generalized probabilistic descent method to dynamic time warping-based speech recognition," in Proc. International Conference on Acoustics, Speech, and Signal Processing, pp. 497-500, 1992.
[11] E. McDermott and S. Katagiri, "Prototype-based minimum classification error/generalized probabilistic descent training for various speech units," Computer Speech and Language, vol. 8, pp. 351-368, 1994.
[12] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Stat., vol. 22, pp. 400-407, 1951.
[13] L. Bottou, "Online learning and stochastic approximations," in On-line Learning in Neural Networks, 1998.
[14] T. Graepel, M. Burger, and K. Obermayer, "Phase transitions in stochastic self-organizing maps," Physical Review E, vol. 56, pp. 3876-3890, 1997.
[15] D. Miller, A. V. Rao, K. Rose, and A. Gersho, "A global optimization technique for statistical classifier design," IEEE Transactions on Signal Processing, vol. 44, pp. 3108-3122, 1996.
[16] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, "Selective sampling using the query by committee algorithm," Machine Learning, vol. 28, pp. 133-168, 1997.
[17] D. MacKay, "Information-based objective functions for active data selection," Neural Computation, vol. 4, pp. 590-604, 1992.
[18] S. Seo, M. Wallat, T. Graepel, and K. Obermayer, "Gaussian process regression: Active data selection and test point rejection," in Proc. International Joint Conference on Neural Networks (IJCNN), vol. 3, pp. 241-246, 2000.
[19] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, pp. 1-38, 1977.

[Figure 2 appears here. Panels: (a) σ² = 0.1, (b) σ² = 0.3, (c) σ² = 0.05, (d) σ² = 0.1; axes x_1, x_2.]

Fig. 2. Active areas for different values of the dispersion parameter σ² and for two different prototype configurations. White symbols indicate the positions of the prototypes in the 2D data space; their class is indicated by the shape of the symbol. The strength of the factor ls(1 − ls) is indicated by brightness (0 → black, 0.25 → white), for each location in data space and for data points which have the same label as their nearest prototype.

[Figure 3 appears here. Panels: (a) 2 classes, 2 clusters; (b) 2 classes, 3 clusters; (c), (d) test error; (e), (f) position of prototypes vs. iteration.]

Fig. 3. SNPC applied to two 1D toy problems. (a) Problem no. 1: data points are drawn iid from two Gaussian distributions, one for each class (dashed and solid lines for classes 1 and 2). Parameters: locations μ_1 = −1.5 (class 1), μ_2 = 1.5 (class 2), width σ_d = 1. (b) Problem no. 2: data points are drawn iid from three Gaussian distributions, two for class 1 (dashed line) and one for class 2 (solid line). Parameters: locations μ_1 = −3, μ_3 = 3 (class 1), μ_2 = 0 (class 2), width σ_d = 1. The classifier consists of three prototypes, two for class 1 (θ_1, θ_3) and one for class 2 (θ_2). (c) Test error, eq. (21), as a function of the locations θ_1 and θ_2 of the prototypes of problem no. 1. Contour lines indicate steps of 20. The minimum is located towards the lower right corner (see arrow). (d) Test error, eq. (21), as a function of the locations θ_1 and θ_3 of the prototypes θ_1 and θ_3 of problem no. 2. θ_2 was set to 0. Contour lines indicate steps of 20. The minimum is indicated by the dot. (e) Trajectories of the prototypes θ_1 and θ_2 during learning with SNPC for a constant value of the dispersion parameter. The thin line and the crosses indicate the classification boundary for hard classification, eq. (1). Parameters: initialization (θ_1^ini, c_1) = (−2, 1), (θ_2^ini, c_2) = (2, 2), σ² = 1; 10000 and 1000 data points per class were drawn iid from the distributions shown in (a) for the training and test sets, respectively. The learning rate was set to α = 0.01. (f) Trajectories of the prototypes θ_1, θ_2, θ_3 during learning with SNPC for a constant value of the dispersion parameter and for two different initializations of the prototypes. The thin line and the crosses indicate the classification boundary for hard classification, eq. (1). Parameters: initializations (θ_1^ini, c_1) = (−4, 1), (θ_2^ini, c_2) = (0, 2), (θ_3^ini, c_3) = (4, 1), σ² = 1, and (θ_1^ini, c_1) = (−2, 1), (θ_2^ini, c_2) = (0, 2), (θ_3^ini, c_3) = (2, 1), σ² = 1; 10000 and 1000 data points per class were drawn iid from the distributions shown in (b) for the training and test sets, respectively. The learning rate was set to α = 0.01.

[Figure 4 appears here. Panels: (a), (b) data space (x_1, x_2); (c) log of test error ("soft" and "hard") vs. log(1/σ²).]

Fig. 4. SNPC with annealing applied to a 2D toy problem. 300 data points are drawn iid from each of four Gaussian components for training (locations: μ_1 = (0.95, 1), μ_2 = (1.2, 1.32), μ_3 = (1.6, 1.1), μ_4 = (1.2, 1.75); covariance matrix rows: σ_11 = (0.01, 0.005), σ_12 = (0.005, 0.015), σ_21 = (0.01, 0), σ_22 = (0, 0.011), σ_31 = (0.01, −0.005), σ_32 = (−0.005, 0.015), σ_41 = (0.02, 0), σ_42 = (0, 0.01)). The classifier consists of 8 prototypes, 2 for each class. (a) Initial values of the prototypes of the classifier (open symbols; the class is indicated by the symbol type). Crosses indicate the centers of the four Gaussian components of the data distribution; the ellipses indicate the widths in every direction as given by the covariance matrix C. (b) Final values of the prototypes determined by SNPC with annealing and for an optimal σ_f² (cf. arrow in Fig. 4 c). Note that the two prototypes of some of the classes lie on top of each other. (c) Log of the test errors E_s^T and E_h^T as a function of log(1/σ²) for SNPC with annealing. Parameters: for training, 300 data points per class are drawn iid from the Gaussian distributions; schedule for the learning rate: α(t+1) = 0.1 · 1200/(1200 + t); schedule for the σ²-annealing: σ²(t) = 0.02 · 0.9^(t−1); σ_f² = 0.0005.

[Figure 5 appears here. Misclassification rate (roughly 0.08 to 0.26) vs. number of prototypes per class (1 to 10); curves: LVQ 3, LVQ 2.1, SNPC, SNPC-AN.]

Fig. 5. Performance of LVQ 2.1, LVQ 3, and SNPC with ("SNPC-AN") and without ("SNPC") annealing. The average test error (rate of misclassification) and its standard deviation were determined by 10-fold cross-validation and are plotted as a function of the number of prototypes per class. For details see text.

[Figure 6 appears here. Misclassification rate (roughly 0.1 to 0.6) vs. number of prototypes per class (2 to 10); curves: GMB, MCE/GPD, SNPC-AN.]

Fig. 6. Performance of GMB, MCE/GPD, and SNPC with annealing ("SNPC-AN"). The average test error (rate of misclassification) and its standard deviation were determined by 10-fold cross-validation and are plotted as a function of the number of prototypes per class. For details see text.