Learning Vector Quantization for Multimodal Data

Barbara Hammer¹, Marc Strickert¹, and Thomas Villmann²

¹ Department of Mathematics/Computer Science, University of Osnabrück, D-49069 Osnabrück, Germany
² Clinic for Psychotherapy and Psychosomatic Medicine, University of Leipzig, Karl-Tauchnitz-Straße 25, D-04107 Leipzig, Germany

Abstract. Learning vector quantization (LVQ), as proposed by Kohonen, is a simple, intuitive, and very successful prototype-based clustering algorithm. Generalized relevance LVQ (GRLVQ) constitutes a modification which obeys the dynamics of a gradient descent and allows an adaptive metric utilizing relevance factors for the input dimensions. As iterative algorithms with local learning rules, LVQ and its modifications crucially depend on the initialization of the prototypes; they often fail for multimodal data. We propose a variant of GRLVQ which introduces ideas of the neural gas algorithm, incorporating a global neighborhood coordination of the prototypes. The resulting learning algorithm, supervised relevance neural gas, is capable of learning highly multimodal data, while sharing the benefits of a gradient dynamics and an adaptive metric with GRLVQ.

1 Introduction

LVQ has been proposed by Kohonen as an intuitive prototype-based clustering algorithm [3]. It combines the simplicity of self-organizing learning with the accuracy of supervised training algorithms. Successful applications can be found in widespread areas such as data mining, robotics, and linguistics [3]. Numerous modifications of the original LVQ learning rule exist, including an adaptive learning rate (OLVQ) and an adaptation inspired by optimum Bayesian decision (LVQ2.1), to name just a few [3]. Here we focus on generalized relevance learning vector quantization (GRLVQ), which has been introduced in [2] based on GLVQ as proposed in [6]. GRLVQ minimizes a cost function via a stochastic gradient descent and hence shows very stable behavior in comparison to simple LVQ or LVQ2.1. In addition, it introduces an adaptive metric for the training procedure: weighting factors for the input dimensions are trained with Hebbian learning. This modification is particularly successful for high-dimensional data: the curse of dimensionality is automatically avoided because the weighting factors for noisy or useless dimensions become small during training.

All of the above methods crucially depend on the initialization of the prototypes. The algorithms use iterative local learning rules which can easily get stuck in local optima. Commonly, the prototypes are initialized with the center points of the classes or with random points from the training set. Therefore, the final prototypes will only represent those regions of the data where at least one representative has been initialized. Thus LVQ often fails for multimodal class distributions. Kohonen offers a possible solution to this problem in his book 'Self-Organizing Maps' [3]. He proposes a combination of LVQ with the unsupervised self-organizing feature map (SOM). The SOM allows one to obtain a topology-preserving representation of a data set through a lattice of prototypes. Its iterative learning rule uses global information, namely the degree of neighborhood of the prototypes. Therefore, almost arbitrarily initialized prototypes spread faithfully over the whole data set. Kohonen proposes to initialize the prototypes of LVQ with the prototypes obtained from a SOM with posteriorly assigned class labels. As an alternative, Kohonen modifies the learning rule of LVQ2.1 directly to so-called LVQ-SOM: in each recursive step, not only the closest correct and wrong prototypes are updated, but also the whole neighborhood of the respective prototypes.

The above methods partially solve the problem of local optima in LVQ training. Nevertheless, they have some drawbacks: SOM is based on a fixed topology of the prototypes which has to fit the internal topology of the data. Often a two-dimensional lattice is chosen. Usually, the topological structure of the data is not known, and methods which identify and repair topological defects are costly [7]. A posteriori choice of labels for the prototypes of the unsupervised SOM causes problems: the borders of the classes are not accounted for in SOM, which might yield different classes sharing the same prototype. Small classes may not be represented at all. The combination LVQ-SOM avoids this latter problem; nevertheless, the topology of the chosen lattice has to fit. Moreover, updates of all neighboring neurons in each recursive step magnify the inherent instabilities of LVQ2.1. Kohonen advises to use LVQ-SOM only with a small neighborhood size after initial unsupervised SOM training. Alternative approaches substitute the local updates of LVQ with global mechanisms like annealing techniques [1], utility counters [5], or greedy approaches [8]. We propose a combination of GRLVQ with the neural gas algorithm (NG) [4] as an alternative. NG constitutes an unsupervised training method whose objective is a faithful representation of a given data distribution. Unlike SOM, NG does not assume a fixed prior topology of the prototypes but develops an optimal topology according to the given training data.
We combine both methods in such a way that a parameterized cost function is minimized through the learning rule. The cost function leads to a training process which is either similar to NG or to simple GRLVQ, depending on the choice of parameters. During training, parameters are varied: at the beginning, neighborhood cooperation assures a distribution of the prototypes among the data set; at the end of training, a good separation of the classes is accounted for. This method yields very good results and a stable behavior even on highly multimodal classification problems.

2 GRLVQ and NG

Assume a finite set $X = \{(x^i, y^i) \in \mathbb{R}^n \times \{1, \ldots, C\} \mid i = 1, \ldots, m\}$ of training data is given and a clustering of the data into $C$ classes is to be learned. We denote the components of a vector $x \in \mathbb{R}^n$ by $(x_1, \ldots, x_n)$ in the following. A set of prototypes $W = \{w^1, \ldots, w^K\}$ in $\mathbb{R}^n$ for the classes is chosen. The label $c_i$ is assigned to $w^i$ iff $w^i$ belongs to the $c_i$-th class. $R^i = \{x \mid (x, y) \in X,\ \forall w^j\ (j \neq i \rightarrow |x - w^i|_\lambda \leq |x - w^j|_\lambda)\}$ denotes the receptive field of $w^i$. Here, $\lambda_i \geq 0$ are scaling factors with $\sum_i \lambda_i = 1$, and $|x - y|_\lambda = \left(\sum_i \lambda_i (x_i - y_i)^2\right)^{1/2}$ denotes the weighted Euclidean metric. GRLVQ adapts the prototypes $w^i$ and the factors $\lambda_i$ such that the difference of the points belonging to the $c$-th class, $S_c = \{x \mid (x, y) \in X,\ y = c\}$, and the receptive fields of the corresponding prototypes, $\bigcup_{c_i = c} R^i$, is as small as possible. This is achieved by a stochastic gradient descent on the cost function

$$C_{\mathrm{GRLVQ}} = \sum_{i=1}^{m} \mathrm{sgd}(\mu(x^i)) \quad \text{where} \quad \mu(x^i) = \frac{d^+(x^i) - d^-(x^i)}{d^+(x^i) + d^-(x^i)}.$$

$\mathrm{sgd}(x) = (1 + \exp(-x))^{-1}$ denotes the logistic function. $d^+(x^i) = |x^i - w^{i+}|_\lambda^2$ is the squared weighted Euclidean distance of $x^i$ to the nearest prototype $w^{i+}$ of the same class as $x^i$, and $d^-(x^i) = |x^i - w^{i-}|_\lambda^2$ is the squared weighted Euclidean distance of $x^i$ to the nearest prototype $w^{i-}$ of a different class than $x^i$. The learning rule of GRLVQ is obtained by taking the derivatives of the above cost function [2]:

$$\Delta w^{i+} = \epsilon^+ \, \frac{4\,\mathrm{sgd}'(\mu(x^i))\, d^-(x^i)}{\left(d^+(x^i) + d^-(x^i)\right)^2} \, \Lambda \, (x^i - w^{i+})$$

$$\Delta w^{i-} = -\epsilon^- \, \frac{4\,\mathrm{sgd}'(\mu(x^i))\, d^+(x^i)}{\left(d^+(x^i) + d^-(x^i)\right)^2} \, \Lambda \, (x^i - w^{i-})$$

$$\Delta \lambda_j = -\epsilon \, \frac{2\,\mathrm{sgd}'(\mu(x^i))}{\left(d^+(x^i) + d^-(x^i)\right)^2} \left( d^-(x^i)\,(x^i_j - w^{i+}_j)^2 - d^+(x^i)\,(x^i_j - w^{i-}_j)^2 \right)$$

where $\Lambda$ is the diagonal matrix with entries $\lambda_1, \ldots, \lambda_n$, and $\epsilon, \epsilon^-, \epsilon^+ > 0$ are learning rates. GLVQ only updates the prototypes $w^i$, keeping the weighting factors $\lambda_i$ fixed to $1/n$.

NG adapts a set of unlabeled prototypes $W = \{w^1, \ldots, w^K\}$ in $\mathbb{R}^n$ such that they represent a given set of unlabeled data $X = \{x^1, \ldots, x^m\}$ in $\mathbb{R}^n$. The cost function of NG,

$$C_{\mathrm{NG}} = \sum_{i=1}^{m} \sum_{j=1}^{K} h_\sigma(k_j(x^i, W))\,(x^i - w^j)^2 \,/\, h(K),$$

is minimized with a stochastic gradient descent, where $h(K) = \sum_{i=0}^{K-1} h_\sigma(i)$ and $h_\sigma(x) = \exp(-x/\sigma)$. $\sigma > 0$ denotes the degree of neighborhood cooperation and is decreased to $0$ during training. $k_j(x^i, W)$ denotes the rank of $w^j$ in $\{0, \ldots, K-1\}$ if the prototypes are ranked according to the distances $|x^i - w^k|^2$. The corresponding learning rule becomes

$$\Delta w^j = \epsilon \cdot 2\, h_\sigma(k_j(x^i, W))\,(x^i - w^j)$$

where $\epsilon > 0$ is the learning rate. Note that this update constitutes an unsupervised learning rule which minimizes the quantization error. The initialization of the prototypes is not crucial because of the neighborhood cooperation involved through the ranking $k_j$.
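The GRLVQ update rule above can be read directly as code. The following NumPy sketch performs one stochastic GRLVQ step; it is an illustrative reading of the formulas, not the authors' implementation. The function names, the learning-rate defaults, and the clipping/renormalization that keeps $\lambda$ non-negative and summing to one are our assumptions.

```python
import numpy as np

def sgd(x):
    """Logistic function sgd(x) = (1 + exp(-x))^(-1)."""
    return 1.0 / (1.0 + np.exp(-x))

def grlvq_step(x, y, W, c, lam, eps_plus=0.1, eps_minus=0.05, eps_lam=0.0001):
    """One stochastic GRLVQ update for a single sample (x, y).

    W   : (K, n) prototype matrix        c   : (K,) prototype labels
    lam : (n,) relevance factors, lam >= 0, sum(lam) == 1
    Returns updated copies of W and lam.
    """
    W, lam = W.copy(), lam.copy()
    d = ((x - W) ** 2 * lam).sum(axis=1)   # |x - w|_lambda^2 for all prototypes
    same, other = np.where(c == y)[0], np.where(c != y)[0]
    ip = same[np.argmin(d[same])]          # nearest prototype of the correct class
    im = other[np.argmin(d[other])]        # nearest prototype of a wrong class
    dp, dm = d[ip], d[im]
    mu = (dp - dm) / (dp + dm)
    sp = sgd(mu) * (1.0 - sgd(mu))         # sgd'(mu)
    denom = (dp + dm) ** 2
    # relevance gradient, computed before the prototypes move
    dlam = -eps_lam * 2.0 * sp / denom * (
        dm * (x - W[ip]) ** 2 - dp * (x - W[im]) ** 2)
    # attract the correct prototype, repel the wrong one (Lambda = diag(lam))
    W[ip] += eps_plus * 4.0 * sp * dm / denom * lam * (x - W[ip])
    W[im] -= eps_minus * 4.0 * sp * dp / denom * lam * (x - W[im])
    lam = np.clip(lam + dlam, 0.0, None)
    lam /= lam.sum()                       # keep sum(lam) == 1
    return W, lam
```

Iterating this step over randomly drawn training samples with small learning rates approximates the stochastic gradient descent on $C_{\mathrm{GRLVQ}}$.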

3 Supervised Relevance Neural Gas Algorithm

The idea of supervised relevance neural gas (SRNG) is to incorporate neighborhood cooperation into the cost function of GRLVQ. This neighborhood cooperation helps to spread the prototypes of a class over the possibly multimodal data distribution of the respective class. We use the same notation as for GRLVQ. Denote by $W(y^i)$ the set of prototypes labeled with $y^i$ and by $K_i$ its cardinality. Then we define the cost function

$$C_{\mathrm{SRNG}} = \sum_{i=1}^{m} \sum_{w^j \in W(y^i)} h_\sigma(k_j(x^i, W(y^i))) \cdot \mathrm{sgd}(\mu(x^i, w^j)) \,/\, h(K_i)$$

with $\mu(x^i, w^j) = \dfrac{|x^i - w^j|_\lambda^2 - d^-(x^i)}{|x^i - w^j|_\lambda^2 + d^-(x^i)}$. As above, $h(K_i) = \sum_{j=0}^{K_i - 1} h_\sigma(j)$. $k_j(x^i, W(y^i)) \in \{0, \ldots, K_i - 1\}$ denotes the rank of $w^j$ if the prototypes in $W(y^i)$ are ranked according to the distances $|w^k - x^i|_\lambda^2$. $d^-(x^i)$ is the distance to the closest wrong prototype $w^{i-}$. Note that $\lim_{\sigma \to 0} C_{\mathrm{SRNG}} = C_{\mathrm{GRLVQ}}$.

If $\sigma$ is large, typically at the beginning of training, the prototypes of one class share their responsibility for a given data point. Hence neighborhood cooperation is taken into account such that the initialization of the prototypes is no longer crucial. The learning rule is obtained by taking the derivatives. Given a training example $(x^i, y^i)$, all $w^j \in$

Fig. 1. Artificial multimodal data set used for training: training set 1 and prototypes found by SRNG (□, spread over the clusters) and GRLVQ (×, only within two clusters in the middle).

$W(y^i)$, the closest wrong prototype $w^{i-}$, and the factors $\lambda_k$ are adapted:

$$\Delta w^j = \epsilon^+ \, \frac{4\,\mathrm{sgd}'(\mu(x^i, w^j))\, h_\sigma(k_j(x^i, W(y^i)))\, d^-(x^i)}{\left(|x^i - w^j|_\lambda^2 + d^-(x^i)\right)^2 h(K_i)} \, \Lambda \, (x^i - w^j)$$

$$\Delta w^{i-} = -\epsilon^- \sum_{w^j \in W(y^i)} \frac{4\,\mathrm{sgd}'(\mu(x^i, w^j))\, h_\sigma(k_j(x^i, W(y^i)))\, |x^i - w^j|_\lambda^2}{\left(|x^i - w^j|_\lambda^2 + d^-(x^i)\right)^2 h(K_i)} \, \Lambda \, (x^i - w^{i-})$$

$$\Delta \lambda_k = -\epsilon \sum_{w^j \in W(y^i)} \frac{2\,\mathrm{sgd}'(\mu(x^i, w^j))\, h_\sigma(k_j(x^i, W(y^i)))}{\left(|x^i - w^j|_\lambda^2 + d^-(x^i)\right)^2 h(K_i)} \left( d^-(x^i)\,(x^i_k - w^j_k)^2 - |x^i - w^j|_\lambda^2\,(x^i_k - w^{i-}_k)^2 \right)$$

In comparison to the GRLVQ learning rule, all prototypes responsible for the class of $x^i$ are taken into account according to their neighborhood ranking. For $\sigma \to 0$, GRLVQ is obtained. With respect to LVQ-SOM, this update allows an adaptive metric like GRLVQ and shows more stable behavior due to the choice of the cost function and the fact that neighborhood cooperation is only included within the prototypes of one class. Note that adaptation of the prototypes without adapting the metric, i.e. the factors $\lambda_i$, is possible, too. We refer to this modification as supervised neural gas (SNG).
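Read as code, the SRNG update adapts every prototype of the correct class, weighted by its neighborhood rank, and repels the closest wrong prototype with the accumulated strength. The following NumPy sketch of one such step follows the same caveats as the GRLVQ sketch in the previous section: names, learning-rate defaults, and the $\lambda$ renormalization are our assumptions, not the authors' code.

```python
import numpy as np

def sgd(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def srng_step(x, y, W, c, lam, sigma, eps_plus=0.1, eps_minus=0.05, eps_lam=0.0001):
    """One stochastic SRNG update: every prototype of class y is attracted
    according to its neighborhood rank; the closest wrong prototype is repelled."""
    W, lam = W.copy(), lam.copy()
    d = ((x - W) ** 2 * lam).sum(axis=1)              # |x - w|_lambda^2
    same, other = np.where(c == y)[0], np.where(c != y)[0]
    im = other[np.argmin(d[other])]                   # closest wrong prototype
    dm = d[im]
    # ranks k_j inside the correct class, h_sigma(k) = exp(-k / sigma)
    ranks = np.empty(len(same), dtype=int)
    ranks[np.argsort(d[same])] = np.arange(len(same))
    hK = np.exp(-np.arange(len(same)) / sigma).sum()  # normalization h(K_i)
    dW, dlam = np.zeros_like(W), np.zeros_like(lam)
    for j, k in zip(same, ranks):
        dj, h = d[j], np.exp(-k / sigma)
        mu = (dj - dm) / (dj + dm)
        sp = sgd(mu) * (1.0 - sgd(mu))                # sgd'(mu)
        denom = (dj + dm) ** 2 * hK
        dW[j] += eps_plus * 4.0 * sp * h * dm / denom * lam * (x - W[j])
        dW[im] -= eps_minus * 4.0 * sp * h * dj / denom * lam * (x - W[im])
        dlam -= eps_lam * 2.0 * sp * h / denom * (
            dm * (x - W[j]) ** 2 - dj * (x - W[im]) ** 2)
    W += dW                                           # apply all contributions at once
    lam = np.clip(lam + dlam, 0.0, None)
    lam /= lam.sum()
    return W, lam
```

For small $\sigma$ only the rank-0 prototype contributes noticeably, so the step degenerates to the GRLVQ update, mirroring $\lim_{\sigma \to 0} C_{\mathrm{SRNG}} = C_{\mathrm{GRLVQ}}$.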

4 Experiments

We tested the algorithm on artificial multimodal data sets. Data sets 1 to 6 consist of two classes with 50 clusters per class and about 30 points for each cluster, located on a two-dimensional checkerboard. Data sets 1, 3, and 5 differ with respect to the overlap of the classes (see Figs. 1, 2). Data sets 2, 4, and 6 are copies of 1, 3, and 5, respectively, where 6 dimensions have been added: a point $(x_1, x_2)$ is embedded as $(x_1, x_2, x_1 + \eta_1, x_1 + \eta_2, x_1 + \eta_3, \eta_4, \eta_5, \eta_6)$, where $\eta_i$ is uniform noise with $|\eta_1| \leq 0.05$,

Fig. 2. Artificial multimodal data sets: training set 3 (left) and training set 5 (right).

$|\eta_2| \leq 0.1$, $|\eta_3| \leq 0.2$, $|\eta_4| \leq 0.1$, $|\eta_5| \leq 0.2$, and $|\eta_6| \leq 0.5$. Hence dimensions 6 to 8 contain pure noise, while dimensions 3 to 5 carry some information. Data set 7 comprised 3 classes of different cardinality in two dimensions (see Fig. 3). Data set 8 consists of an embedding of set 7 in 8 dimensions as above. The sets are randomly divided into training and test sets. All prototypes are randomly initialized around the origin. Training has been done for 3000 cycles with 50 prototypes for each class for sets 1 to 6 and 5 or 6 prototypes for each class for sets 7 and 8. The algorithms converged after about 2000 cycles in all runs. Learning rates are $\epsilon^+ = 0.1$, $\epsilon^- = 0.05$, $\epsilon = 0.0001$. The initial neighborhood size $\sigma = 100$ is decreased by the factor 0.995 after each epoch. The reported results

have been validated in several runs. For comparison, a classification with prototypes set by hand in the cluster centers (opt) and a nearest neighbor classifier (NN) are provided. Simple GLVQ and GRLVQ without neighborhood cooperation are not capable of learning data sets 1 to 6. The prototypes only represent the clusters closest to the origin (see Fig. 1) and classification is nearly random. Nevertheless, the weighting factors obtained from GRLVQ for data set 2 indicate the importance of the first two dimensions with $\lambda_1 \approx 0.23$, $\lambda_2 \approx 0.33$. For data sets 4, 6, and 8, some less important dimensions are emphasized by GRLVQ as well. SNG produces good results for data sets 1 and 3, and slightly worse results for their 8-dimensional counterparts. SRNG shows very stable behavior and a good classification accuracy in all runs. The relevance terms clearly emphasize the important first two dimensions with vectors $\lambda \approx (0.34, 0.4, 0.18, 0, 0, 0, 0, 0)$. The prototypes spread over the clusters, and mostly only 2 or 3 out of 100 clusters are missed in the runs for data sets 1 to 4. Since data sets 5 and 6 are nearly randomly distributed (see Fig. 2), no generalization can be expected in these runs. Nevertheless, SNG and SRNG show comparably small test set errors in these runs, too. A similar behavior can be observed for the smaller sets 7 and 8, where SRNG achieves close to 100% accuracy, SNG performs slightly worse if additional noisy dimensions are added, and GRLVQ and GLVQ are not capable of representing all clusters of the data (see Fig. 3).

        data1      data2    data3      data4    data5   data6   data7   data8
opt     100        100      97.3       97.3     58      58      100     100
NN      100        95.4     93.1       77.1     67.3    50.1    96.3    84.7
GLVQ    49.4/48.9  50/49    51.5/49.5  50/49.5  50/50   51/50   62/65   62/61
GRLVQ   50.2/50    50/50    55/49      50.5/50  51/50   51/49   62/65   65/65
SNG     99/98      80/72    90/90      75/64    61/50   70/50   98/97   88/85
SRNG    99/99      95/94    92/91      92/92    66/54   67/55   98/97   99/98

Table 1. Training/test set accuracy (%) of the obtained clustering for multimodal data sets.

Fig. 3. Training set 7 and prototypes found by SRNG (□) and GRLVQ (×).

Experiments with real-life data for GRLVQ in comparison to GLVQ can be found, e.g., in [2]. Thereby, runs of SRNG on these data sets provide only a further slight improvement, since these data, unlike the above artificial data sets, are not highly multimodal.
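The checkerboard construction behind these experiments can be sketched as follows. The grid size, points per cluster, and jitter below are our illustrative assumptions (the paper's exact parameters and class overlaps differ), and the nearest-prototype classifier plays the role of the hand-placed 'opt' baseline.

```python
import numpy as np

def checkerboard(grid=4, points_per_cluster=30, jitter=0.05, seed=0):
    """Two-class checkerboard data: clusters sit on an integer grid and the
    class label alternates like the colors of a checkerboard."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for i in range(grid):
        for j in range(grid):
            center = np.array([i, j], dtype=float)
            X.append(center + rng.uniform(-jitter, jitter, size=(points_per_cluster, 2)))
            y.append(np.full(points_per_cluster, (i + j) % 2))
    return np.vstack(X), np.concatenate(y)

def nearest_prototype_accuracy(X, y, W, c):
    """Accuracy of the plain nearest-prototype classifier."""
    d = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return float((c[d.argmin(axis=1)] == y).mean())

X, y = checkerboard()
# 'opt' baseline: one prototype placed by hand in every cluster center
W = np.array([[i, j] for i in range(4) for j in range(4)], dtype=float)
c = np.array([(i + j) % 2 for i in range(4) for j in range(4)])
acc = nearest_prototype_accuracy(X, y, W, c)  # 1.0 here: the clusters do not overlap
```

Initializing all prototypes near the origin instead, as in the experiments above, is exactly the situation in which plain GLVQ/GRLVQ leave distant clusters unrepresented, while the neighborhood cooperation of SNG/SRNG spreads them over the board.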

5 Conclusions

We have proposed a modification of GRLVQ which includes neighborhood cooperation within the prototypes of each class. Hence the initialization of the prototypes is no longer crucial for successful training, as has been demonstrated for several multimodal data distributions. In addition, the algorithm SRNG shows very stable behavior due to the chosen cost function. In parallel with the prototypes, weighting factors for the different input dimensions are adapted, thus providing a metric fitted to the data distribution. This additional feature is particularly beneficial for high-dimensional data.

References

1. T. Graepel, M. Burger, and K. Obermayer. Self-organizing maps: generalizations and new optimization techniques. Neurocomputing, 20:173–190, 1998.
2. B. Hammer and T. Villmann. Estimating relevant input dimensions for self-organizing algorithms. In: N. Allinson, H. Yin, L. Allinson, J. Slack (eds.), Advances in Self-Organizing Maps, pp. 173–180, Springer, 2001.
3. T. Kohonen. Self-Organizing Maps. Springer, 1997.
4. T. Martinetz, S. Berkovich, and K. Schulten. 'Neural-gas' network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558–569, 1993.
5. G. Patanè and M. Russo. The enhanced LBG algorithm. Neural Networks, 14:1219–1237, 2001.
6. A. S. Sato and K. Yamada. Generalized learning vector quantization. In: G. Tesauro, D. Touretzky, and T. Leen (eds.), Advances in Neural Information Processing Systems 7, pp. 423–429. MIT Press, 1995.
7. T. Villmann, R. Der, M. Herrmann, and T. M. Martinetz. Topology preservation in self-organizing feature maps: exact definition and precise measurement. IEEE Transactions on Neural Networks, 8(2):256–266, 1997.
8. N. Vlassis and A. Likas. A greedy EM algorithm for Gaussian mixture learning. Neural Processing Letters, 15(1):77–87, 2002.