Dynamic Hyperparameter Scaling Method for LVQ Algorithms

Sambu Seo and Klaus Obermayer

Abstract— We propose a new annealing method for the hyperparameters of several recent Learning Vector Quantization (LVQ) algorithms. We first analyze the relationship between the values assigned to the hyperparameters, the on-line learning process, and the structure of the resulting classifier. Motivated by the results, we then suggest an annealing method in which each hyperparameter is initially set to a large value and is slowly decreased during learning. We apply the annealing method to the LVQ 2.1, SLVQ-LR, and RSLVQ methods, and we compare the generalization performance achieved with the new annealing method and with standard hyperparameter selection using 10-fold cross-validation. Benchmark results are provided for the data sets letter and pendigits from the UCI Machine Learning Repository. The new selection method provides equally good or, for some data sets, even superior results when compared to standard selection methods. More importantly, however, the number of learning trials for different values of the hyperparameters is drastically reduced. The results are insensitive to the form and parameters of the annealing schedule.

I. INTRODUCTION

Learning Vector Quantization (LVQ) ([1], [2]) is a class of learning algorithms for nearest prototype classification (NPC). Similar to the K-nearest neighbor method [3], NPC approximates the classification boundaries locally. Instead of making use of all the data points in a training set, however, NPC relies on a set of appropriately chosen prototype vectors only. This makes the method computationally more efficient, because the number of items which must be stored, and to which a new data point must be compared during classification, is considerably smaller. A nearest prototype classifier consists of a set T = {(θ_j, c_j)}, j = 1, ..., M, of labeled prototype vectors. The parameters θ_j ∈ X ≡ R^D are vectors in data space, and the c_j ∈ {1, ..., N_y} are their corresponding class labels. Typically, the number of prototypes is larger than the number of classes, such that every classification boundary is determined by more than one prototype. The class y of a new data point x is determined by the class of the prototype nearest to x. The goal of learning algorithms for NPC is to find a set of suitable prototypes, such that the resulting classifier is able to predict the class of unseen data points. Because of its computational efficiency, NPC has been a popular method for real-time applications and for the analysis of genetic data. Examples where it has been successfully used include speech recognition [4], [5], the analysis of gene expression profiles of primary breast tumor cells [6], and cancer class prediction from gene expression profiling [7].

Sambu Seo and Klaus Obermayer are with the Department of Electrical Engineering and Computer Science, Berlin University of Technology, Berlin, Germany (phone: +49-30-31473628, +49-30-31473120; fax: +49-30-31473120; email: [email protected], [email protected]).
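To make the NPC decision rule concrete, here is a minimal sketch (not the authors' implementation; the function name npc_predict and the array layout are our own choices):

import numpy as np

def npc_predict(x, prototypes, labels):
    # prototypes: (M, D) array of prototype vectors theta_j
    # labels:     (M,)  array of the corresponding class labels c_j
    # Return the label of the prototype nearest to x (Euclidean distance).
    distances = np.linalg.norm(prototypes - x, axis=1)
    return labels[np.argmin(distances)]

Classification thus requires only M distance computations per data point rather than one per training example, which is the source of the efficiency gain mentioned above.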

Learning Vector Quantization was introduced by T. Kohonen ([8]) almost 20 years ago and has since been widely used (see [9] for an extensive bibliography). In order to improve classification performance, several variants of the LVQ procedure have been proposed in which learning is related to the minimization of an appropriate cost function ([5], [11], [12]). The derivation of LVQ-like learning procedures using a cost-function approach is particularly valuable when it comes to the analysis of convergence properties. Recently, several soft learning algorithms for NPC were proposed [13], [14] which combine an explicit ansatz for the probability densities of the classes with a criterion for model selection which directly minimizes the rate of misclassification. This ansatz helps to make the assumptions underlying the choice of discriminant functions and model selection more explicit, because the underlying generative model is made explicit. In contrast to the other LVQ algorithms, all of the prototypes are adjusted by a given data point during the learning process. The amount of change of the prototypes is determined by learning factors which depend on one or two hyperparameters.

The common property of all these new LVQ methods [5], [11], [12], [13], [14] is that they have one or two hyperparameters whose values have to be properly chosen. In machine learning, hyperparameters are usually kept fixed during training, because they control certain aspects of the model class (prior knowledge) and of the learning procedure, e.g. the window width for data selection, the smoothness of the assignment probabilities, the complexity of the resulting function, etc. Usually an optimal hyperparameter is selected from a set of candidates as the one giving the best performance according to a model selection criterion, such as the hold-out test set method, k-fold cross-validation (CV), or the leave-one-out method (for more details on model selection methods, see [3]). In the hold-out test set method, a test set which is independent of the training set is used to estimate the generalization error of the optimized model. The disadvantage of this method is that not all of the available data can be used for model construction. The k-fold CV method makes use of all available data by splitting the data set into k disjoint subsets (folds) of approximately the same size N/k. The algorithm is trained k times, each time using a different fold as hold-out test set and the remaining k − 1 subsets as training set. If k is equal to the number of data points, this method is called leave-one-out CV. Because of the repeated training, CV is a computationally intensive method. If a model has two hyperparameters, for each of which 10 candidates should be evaluated, then all 100 resulting combinations of candidates have to be evaluated 10 times. This results in a total of 1000 training and test processes.
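The cost can be illustrated with a schematic grid search over two hyperparameters with k-fold CV; this is a sketch, not the authors' code, and fit_lvq and test_error are hypothetical stand-ins for an LVQ training routine and an error measure:

import itertools
import numpy as np

def grid_search_cv(X, Y, fit_lvq, test_error, sigma2_grid, omega_grid, k=10):
    # k * len(sigma2_grid) * len(omega_grid) training runs in total:
    # with 10 candidates per hyperparameter and k = 10, that is 1000 runs.
    folds = np.array_split(np.random.permutation(len(X)), k)
    best_params, best_err = None, np.inf
    for sigma2, omega in itertools.product(sigma2_grid, omega_grid):
        errors = []
        for i in range(k):
            train_idx = np.hstack([folds[j] for j in range(k) if j != i])
            model = fit_lvq(X[train_idx], Y[train_idx], sigma2=sigma2, omega=omega)
            errors.append(test_error(model, X[folds[i]], Y[folds[i]]))
        if np.mean(errors) < best_err:
            best_params, best_err = (sigma2, omega), np.mean(errors)
    return best_params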

In order to address this computational complexity, Bengio ([15]) introduced a method for the simultaneous optimization of many hyperparameters for the special case in which the objective function is a quadratic polynomial of the model parameters and, furthermore, continuous and differentiable w.r.t. the hyperparameters. The hyperparameters are then optimized using gradient descent on the objective function. For many objective functions, however, these conditions do not hold. For example, the width parameter of the window rule of LVQ algorithms is not even part of the cost function (see section II).

In this work we propose a new heuristic method for the treatment of the hyperparameters which determine the active region used during learning. The proposed method can be used as an alternative to model selection methods. This approach is motivated by the fact that the problem of finding the optimal hyperparameters suffers from high computational complexity and from the occurrence of 'bad' local optima caused by the high non-convexity of the cost function. The proposed method consists of a dynamic scaling of the hyperparameters during the learning process which, for more than one hyperparameter, is conducted simultaneously. The following four learning vector quantization algorithms are briefly reviewed in section II: LVQ 2.1, soft LVQ based on a likelihood ratio (SLVQ-LR), its hard version LVQ-LR, and robust soft LVQ (RSLVQ). The dynamic hyperparameter scaling method is described in section III, and in section IV this approach is applied to the UCI data sets letter and pendigits. We compare the performance of the classifiers when optimized via either a 10-fold CV method or the dynamic hyperparameter scaling method.

II. LVQ ALGORITHMS

In this section we briefly review the four LVQ algorithms and illustrate the influence of the hyperparameters on the learning of the prototypes.

A. LVQ 2.1

LVQ 2.1 was suggested by Kohonen [2] and has been shown to provide good NPC classifiers ([10], [9]). For each data point (x, y) from the training set S = {(x_i, y_i)}, i = 1, ..., N, LVQ 2.1 first selects the two nearest prototypes θ_l, θ_m according to the Euclidean distance. If the labels c_l and c_m are different and if one of them is equal to the label y of the data point, then the two nearest prototypes are adjusted according to

\[
\theta_l(t+1) = \theta_l(t) + \alpha(t)(x - \theta_l), \quad c_l = y, \qquad
\theta_m(t+1) = \theta_m(t) - \alpha(t)(x - \theta_m), \quad c_m \neq y. \tag{1}
\]

If the labels c_l and c_m are equal, or if both labels differ from the label y of the data point, no parameter update is done. The prototypes, however, are changed only if the data point x is close to the classification boundary, i.e. if it falls into a window of relative width 0 < ω ≤ 1 ([1]):

\[
\min\left(\frac{d(x, \theta_l)}{d(x, \theta_m)}, \frac{d(x, \theta_m)}{d(x, \theta_l)}\right) > s, \quad \text{where } s = \frac{1 - \omega}{1 + \omega}. \tag{2}
\]
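One LVQ 2.1 update, eqs. (1) and (2), can be sketched as follows (a minimal sketch assuming numpy arrays; the names are ours, not taken from the LVQ_PAK package [2]):

import numpy as np

def lvq21_step(x, y, prototypes, labels, alpha, omega):
    # Find the two nearest prototypes theta_l and theta_m.
    d = np.linalg.norm(prototypes - x, axis=1)
    l, m = np.argsort(d)[:2]
    # Update only if the two labels differ and one of them equals y.
    if labels[l] == labels[m]:
        return
    if labels[m] == y:           # let index l refer to the correct prototype
        l, m = m, l
    if labels[l] != y:
        return
    # Window rule, eq. (2): use only data points close to the boundary.
    s = (1.0 - omega) / (1.0 + omega)
    if min(d[l] / d[m], d[m] / d[l]) > s:
        prototypes[l] += alpha * (x - prototypes[l])   # attract correct prototype
        prototypes[m] -= alpha * (x - prototypes[m])   # repel incorrect prototype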


Fig. 1: Active region of LVQ 2.1 for different width parameters ω: (a) ω = 0.05, (b) ω = 0.15, (c) ω = 0.4, (d) ω = 0.7. The figure depicts a 2D problem with two prototypes of different classes. It shows how the active region, i.e. the region of data space where the data points fall into the window, changes with the window width parameter ω. The active region is narrow and close to the class boundary for small ω, whereas it spreads out with increasing ω. For fixed ω, the active region depends on the distance between the data point and the class boundary as well as on the distance between the data point and the prototypes. The influence of the latter distance on the active region decreases with decreasing ω.

This 'window rule' had to be introduced because otherwise the prototype vectors may diverge. Figure 1 shows, for different width parameters ω, the active region of LVQ 2.1, i.e. the region in which data points fall into the window and are used for the adjustment of the prototypes. The active region (shown as the black area) lies in the vicinity of the class boundary for small ω and is enlarged by increasing ω. For fixed ω, the active region depends on the distance between a data point and the class boundary and on the distance between the data point and the prototypes; the influence of the latter decreases with decreasing ω. For ω = 1, every data point falls into the window, i.e. the prototypes are changed by every data point according to the rule, eq. (1). For large ω and for reasonable prototypes, the average distance between the data points and the nearest incorrect prototypes is larger than the average distance between the data points and the correct prototypes. Therefore, during the learning process the average amount of repulsion from the data set is larger than the average amount of attraction to the data set, and this phenomenon leads to the divergence of the prototypes. For small ω, the average amount of change of the two nearest prototypes is almost the same, because (i) the prototypes are adjusted only by the data points near the class boundary and (ii) the class distribution near the boundary is usually random (for non-separable class problems). Therefore, using an appropriate ω, the divergence problem can be avoided. LVQ 2.1 provides good classification performance, but it is heuristic. Hence it may not be optimal and, moreover, it

is not clear (i) what is optimized by the learning rule, eq. (1), (ii) why the learning rule leads to the divergence of the prototypes, and (iii) whether the chosen window rule is the best choice. These questions were answered in [14], where an LVQ learning algorithm was derived by maximizing a likelihood ratio based on a Gaussian mixture model.

B. Soft LVQ Based on a Likelihood Ratio (SLVQ-LR)

The learning algorithm SLVQ-LR was derived by maximizing a likelihood ratio function constructed from a generative Gaussian mixture model. For the construction of the learning algorithms, it is assumed that the probability density p(x) of the data points x can be described by a mixture model and that every component l of the mixture is homogeneous in the sense that it generates data points which belong to only one class C_l. The probability density of the data is then given by:

\[
p(x|T) = \sum_{c=1}^{N_y} \sum_{\{l:\, c_l = c\}} p(x|l)\, P(l), \tag{3}
\]

where N_y is the number of classes and c_l is the class label of the data points generated by component l. P(l) is the probability that data points are generated by a particular component l, and p(x|l) is the conditional probability that this component l generates a particular data point x. The restricted probability densities of a data point x and its true class label y are defined as:

\[
p(x, y|T) = \sum_{\{l:\, c_l = y\}} p(x|l)\, P(l), \tag{4}
\]

\[
p(x, \bar{y}|T) = \sum_{\{l:\, c_l \neq y\}} p(x|l)\, P(l). \tag{5}
\]

p(x, y|T) is the probability density that a data point x is generated with the correct class label y, and p(x, ȳ|T) is the probability density that x is generated with a class label different from y. In order to obtain the mixture model which maximizes the likelihood of eq. (4) and at the same time minimizes the likelihood of eq. (5), the following likelihood ratio was suggested as an objective function to be maximized w.r.t. the parameters of the model densities:

\[
L_r(S|T) = \prod_{k=1}^{N} \frac{p(x_k, y_k|T)}{p(x_k, \bar{y}_k|T)} \overset{!}{=} \max. \tag{6}
\]

Maximizing this likelihood ratio leads to a classifier which maximizes the rate of correct classification and at the same time minimizes the rate of misclassification. For computational simplicity we maximize log L_r,

\[
\log L_r(S|T) = \sum_{k=1}^{N} \log \frac{p(x_k, y_k|T)}{p(x_k, \bar{y}_k|T)} \overset{!}{=} \max. \tag{7}
\]

Using stochastic gradient ascent ([16]) we obtain the learning rule

\[
\theta_l(t+1) = \theta_l(t) + \alpha(t) \frac{\partial}{\partial \theta_l} \log \frac{p(x, y|T)}{p(x, \bar{y}|T)}, \tag{8}
\]

where t is the iteration number and α(t) is the learning rate. In order to ensure convergence, α(t) must fulfill the conditions \(\sum_{t=0}^{\infty} \alpha(t) = \infty\) and \(\sum_{t=0}^{\infty} \alpha^2(t) < \infty\) ([16]). Because classification using NP classifiers depends only on the relative distances between data points and prototypes, it is assumed that every Gaussian component has the same width and strength, i.e. σ_l² = σ², P(l) = 1/M, for all l = 1, ..., M. For a mixture ansatz with D-dimensional Gaussian components of the same width and strength,

\[
p(x|l) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{(x - \theta_l)^2}{2\sigma^2}\right), \tag{9}
\]

we obtain the following learning rule:

\[
\theta_l(t+1) = \theta_l(t) + \Delta\theta_l(t), \qquad
\Delta\theta_l(t) = \alpha^*(t) \begin{cases} P_y(l|x)\,(x - \theta_l), & \text{if } c_l = y, \\ -P_{\bar{y}}(l|x)\,(x - \theta_l), & \text{if } c_l \neq y, \end{cases} \tag{10}
\]

with α*(t) = α(t)/σ² and

\[
P_y(l|x) = \frac{\exp\left(-\frac{(x - \theta_l)^2}{2\sigma^2}\right)}{\sum_{\{m:\, c_m = y\}} \exp\left(-\frac{(x - \theta_m)^2}{2\sigma^2}\right)}, \qquad
P_{\bar{y}}(l|x) = \frac{\exp\left(-\frac{(x - \theta_l)^2}{2\sigma^2}\right)}{\sum_{\{m:\, c_m \neq y\}} \exp\left(-\frac{(x - \theta_m)^2}{2\sigma^2}\right)}. \tag{11}
\]
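A sketch of one SLVQ-LR update, eqs. (10) and (11) (our own naming; the window rule that [14] adds to prevent divergence, introduced below as eq. (13), is omitted for brevity):

import numpy as np

def slvq_lr_step(x, y, prototypes, labels, alpha, sigma2):
    # Gaussian activations of all M components, eq. (9) up to the constant factor.
    g = np.exp(-np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2))
    correct = (labels == y)
    p_y = g * correct / np.sum(g[correct])        # P_y(l|x), zero for c_l != y
    p_ybar = g * ~correct / np.sum(g[~correct])   # P_ybar(l|x), zero for c_l == y
    # Correct prototypes are attracted, incorrect ones repelled, eq. (10).
    factor = np.where(correct, p_y, -p_ybar)
    prototypes += (alpha / sigma2) * factor[:, None] * (x - prototypes)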

The learning rule, eq. (10), shows that the prototypes with 'correct' labels (c_l = y) are attracted towards a data point x proportionally to their distance to x and weighted by a factor P_y(l|x). 'Incorrect' prototypes (c_l ≠ y) are repelled from the data point x; the strength of the repulsion is again proportional to their distance to x and is weighted by a factor P_ȳ(l|x). The width σ of the Gaussian components is a hyperparameter of the learning rule, eq. (10), and its choice is critical for good performance. One can select an optimal value from a set of candidates using a model selection method, like the hold-out test set method or the k-fold cross-validation method. Alternatively, one can use deterministic annealing of σ², which is a useful optimization procedure for clustering problems (cf. [17], [18]): initially σ² is set to a large value and is decreased during optimization ('annealing') until an optimal value σ_f is reached. Both the CV and the annealing method are computationally intensive. If the width σ of the Gaussian components goes to zero, both assignment probabilities, eq. (11), become hard assignments, i.e.

\[
P_y(l|x) = \delta(l, q_y), \quad q_y = \arg\min_{\{k:\, c_k = y\}} \|x - \theta_k\|, \qquad
P_{\bar{y}}(l|x) = \delta(l, q_{\bar{y}}), \quad q_{\bar{y}} = \arg\min_{\{k:\, c_k \neq y\}} \|x - \theta_k\|,
\]

where δ is the Kronecker delta, and the learning rule, eq. (10), turns into the following hard learning rule:

\[
\theta_l(t+1) = \theta_l(t) + \alpha(t) \begin{cases} (x - \theta_l), & \text{if } l = q_y, \\ -(x - \theta_l), & \text{if } l = q_{\bar{y}}, \\ 0, & \text{else.} \end{cases} \tag{12}
\]
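The winner-takes-all limit can be checked numerically; the toy numbers below are our own illustration:

import numpy as np

x = np.array([0.0, 0.0])
prototypes = np.array([[1.0, 0.0], [2.0, 0.0]])   # two prototypes with label y
for sigma2 in (2.0, 0.5, 0.05):
    g = np.exp(-np.sum((prototypes - x) ** 2, axis=1) / (2 * sigma2))
    print(sigma2, g / g.sum())
# sigma2 = 2.0  -> approx. [0.68, 0.32]  (soft assignment)
# sigma2 = 0.05 -> approx. [1.00, 0.00]  (hard, winner-takes-all)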


Fig. 2: The figure shows the update factors P_y(l|x) and P_ȳ(l|x), eq. (11), of the nearest prototypes with correct label (panels (a,b)) and with incorrect label (panels (c,d)), for different width parameters, σ² = 0.05 (left column) and σ² = 2 (right column). A three-class problem in 2D space with an NP classifier having two prototypes per class is constructed. The x and y axes denote the location of the data points, and the symbols ◦, / and □ denote the positions of prototypes belonging to different classes: θ◦ ∈ {(−3, −3), (−1, −1.5)}, θ/ ∈ {(0, 2), (0, 4)}, θ□ ∈ {(1, −1.5), (3, −3)}. It is assumed that every data point is classified correctly by the current prototypes, i.e. all data points have the same label as their nearest prototype. For details see the text.

The hard learning rule, eq. (12), is identical to the learning rule of LVQ 2.1, eq. (1). This implies that the learning rule of LVQ 2.1 optimizes the likelihood ratio based on a Gaussian mixture model, eq. (6). Fig. 2 illustrates the factors for the update of the prototypes, eqs. (10) and (11), for the nearest correct and incorrect prototypes, respectively, for different width parameters, σ² = 0.05 and σ² = 2, as a function of the location of the data points. The left column of the figure illustrates the learning rule of the hard version of the algorithm for small σ², eq. (12). In this case, each data point adjusts its nearest correct and its nearest incorrect prototype with factor 1: only the winners among the prototypes with the correct label and among the prototypes with incorrect labels are adjusted. For data points which lie midway between the two correct prototypes or between the two nearest incorrect prototypes, however, the factor is 0.5. The right column shows the case of a large width parameter and illustrates the typical dynamics of the SLVQ-LR algorithm. Given a data point, the update factor for the correct (incorrect) prototypes, fig. 2b(d), results from a competition among the prototypes with correct (incorrect) labels, as can be seen from eqs. (11). For fixed σ², this factor approaches winner-takes-all behavior as the distance among the prototypes with correct (incorrect) labels increases. However, both algorithms, eqs. (10) and (12), diverge, because the objective function is not bounded from above. The

expectation of the cost function, eq. (7), w.r.t. the true distribution p(x, y) of the data is given by the difference of two Kullback-Leibler divergences, \(D_{KL}(p(x,y)\,\|\,p(x,\bar{y}|T)) - D_{KL}(p(x,y)\,\|\,p(x,y|T))\) (for details see [14]). Maximization of the cost function in the limit of a large number of data points is therefore equivalent to (i) the maximization of the Kullback-Leibler divergence between the true distribution p(x, y) and the data distribution p(x, ȳ|T) generated by the 'incorrect' classes, and (ii) the minimization of the Kullback-Leibler divergence between the true distribution p(x, y) and the data distribution p(x, y|T) generated by the 'correct' classes. Minimizing the second term forces the model to optimally represent the true data distribution, while maximizing the first term ensures an optimal class boundary. Maximizing the first term, however, may lead to the divergence of the prototype vectors θ_l, because (i) the Kullback-Leibler divergence has no upper bound and (ii) the Kullback-Leibler divergence between two Gaussian distributions increases with increasing distance between their centers. This analysis also answers the question why the learning rule of LVQ 2.1 leads to the divergence of the prototypes. In order to prevent the divergence of the prototypes, a window rule similar to the window rule, eq. (2), of LVQ 2.1 is introduced [14]:

\[
\min\left(\frac{\|x - \theta_l\|^2}{\|x - \theta_m\|^2}, \frac{\|x - \theta_m\|^2}{\|x - \theta_l\|^2}\right) > \frac{1 - \omega}{1 + \omega}. \tag{13}
\]

θ_l and θ_m are the two prototypes which are closest to the data point x and which have different labels, c_l ≠ c_m; 0 < ω ≤ 1 is the width parameter which determines the window size. According to this window rule, for small ω the prototypes are only modified if a data point lies near the current class boundary. In contrast to the window rule of LVQ 2.1, the constraint that exactly one of the two nearest prototypes must match the label of the data point is dropped; only inequality of the class labels of the two nearest prototypes is required. The learning rules, eqs. (10) and (11), together with the window rule, eq. (13), implement Learning Vector Quantization based on Likelihood Ratios (LVQ-LR) and the Soft LVQ-LR (SLVQ-LR) algorithm [14]. LVQ-LR is identical to LVQ 2.1 except for the differently chosen heuristic window rule. The experimental results in section IV, however, show that LVQ-LR performs much better than LVQ 2.1, which implies that the heuristically chosen window rule of LVQ 2.1 is not the best one. Finding an optimal rule for the selection of the active region is still an open problem.

C. Robust Soft LVQ (RSLVQ)

The objective function of SLVQ-LR is not bounded from above, and the maximization of the likelihood ratio leads to the divergence of the model parameters, the prototypes. To ensure the convergence of the algorithm, a window rule was introduced. Robust Soft LVQ is an algorithm derived by maximizing an objective function which is bounded from above and based on a Gaussian mixture model:

\[
L_r = \prod_{k=1}^{N} \frac{p(x_k, y_k|T)}{p(x_k|T)}. \tag{14}
\]


Because the ratio is continuous and bounded, 0 ≤ p(x, y|T)/p(x|T) ≤ 1, this objective function can be maximized with respect to θ_l by stochastic gradient ascent [16], [19]. By maximizing the logarithm of the ratio L_r for a Gaussian mixture model, eq. (9), the following learning rule is obtained:


\[
\theta_l(t+1) = \theta_l(t) + \Delta\theta_l(t), \qquad
\Delta\theta_l(t) = \alpha^*(t) \begin{cases} \left(P_y(l|x) - P(l|x)\right)(x - \theta_l), & c_l = y, \\ -P(l|x)\,(x - \theta_l), & c_l \neq y, \end{cases} \tag{15}
\]


where α*(t) = α(t)/σ² and

\[
P_y(l|x) = \frac{\exp\left(-\frac{(x - \theta_l)^2}{2\sigma^2}\right)}{\sum_{\{m:\, c_m = y\}} \exp\left(-\frac{(x - \theta_m)^2}{2\sigma^2}\right)}, \qquad
P(l|x) = \frac{\exp\left(-\frac{(x - \theta_l)^2}{2\sigma^2}\right)}{\sum_{m=1}^{M} \exp\left(-\frac{(x - \theta_m)^2}{2\sigma^2}\right)}. \tag{16}
\]

The prototypes whose labels match the label of the data point x are attracted to it proportionally to the (always positive) factor P_y(l|x) − P(l|x), while the prototypes whose labels differ from the label of x are repelled from it proportionally to the factor P(l|x). Eqs. (15) and (16) implement the Robust Soft Learning Vector Quantization (RSLVQ) algorithm.
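A sketch of one RSLVQ update, eqs. (15) and (16), in the same style as the SLVQ-LR sketch above (the names are again our own):

import numpy as np

def rslvq_step(x, y, prototypes, labels, alpha, sigma2):
    g = np.exp(-np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2))
    correct = (labels == y)
    p = g / g.sum()                         # P(l|x), normalized over all prototypes
    p_y = g * correct / np.sum(g[correct])  # P_y(l|x), normalized over correct ones
    # Attraction factor P_y - P is always >= 0 for correct prototypes.
    factor = np.where(correct, p_y - p, -p)
    prototypes += (alpha / sigma2) * factor[:, None] * (x - prototypes)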

Fig. 3 shows the factors for the update of the prototypes, eqs. (15) and (16), for two different width parameters, σ² = 0.5 (left) and σ² = 4 (right), on a 2D toy example. Panels (a,b) show the factor for the update of the nearest correct prototype as a function of the position of the data points, and panels (c,d) show the factor for the update of the nearest prototype with incorrect class label. The figure demonstrates: (i) with increasing width parameter σ², the update factors get smoother and the active region becomes larger; (ii) the amount of adjustment of the correct prototype is larger than that of the incorrect prototype; (iii) the RSLVQ algorithm changes the prototypes using the data points near the class boundary, and, in contrast to SLVQ-LR, outliers are not used for the adjustment of the prototypes (compare to fig. 2). This property shows that the RSLVQ algorithm implicitly selects informative data points which lie near the class boundary. Properties (ii) and (iii) lead to the convergence of the prototypes without a heuristic window rule.

Fig. 3: The figure shows the update factors of the nearest prototypes with correct label (panels (a,b), P_y(l|x) − P(l|x)) and with incorrect label (panels (c,d), P(l|x)), eqs. (15), (16), as a function of the position of the data points, for different widths of the components, σ² = 0.5 and σ² = 4. The active area for the update of the prototypes lies around the class boundary and expands with increasing σ². The RSLVQ algorithm adjusts the prototypes using only the data points near the class boundary. For a description of the problem see the caption of fig. 2.

III. DYNAMIC HYPERPARAMETER SCALING METHOD FOR LVQ ALGORITHMS

In the previous section we considered four different LVQ learning algorithms and the properties of their hyperparameters w.r.t. the selection of active data and the update factors for the prototypes. The value of the window width ω determines the active region of training data for the algorithms LVQ 2.1, LVQ-LR and SLVQ-LR. For the extreme value ω = 1, all of the training data points are used for the learning of the prototypes; decreasing ω narrows the window around the class boundary. The width σ² of the Gaussian components has a similar influence on the selection of active data for the RSLVQ algorithm: for small values of σ², the prototypes are adjusted only by the data points near the current class boundary, while with increasing σ² the active region is enlarged. For the SLVQ-LR algorithm, σ² regulates the competition among the prototypes with correct labels and among the prototypes with incorrect labels: for small σ², the winner-takes-all principle dominates, while with increasing σ² the learning factors become smoother. Exploiting this monotonic relation between the hyperparameters and the active region, and the influence of the hyperparameters on the competition among prototypes, we introduce a simple heuristic learning method for LVQ algorithms. Initially, the hyperparameters are set to a large value and are then decreased very slowly at each learning step. At the beginning of learning, the current prototypes are poor representatives and not very informative about the data distribution; hence, we allow more data points to contribute to the learning of the prototypes. During learning, the prototypes become more representative of the data set, which results in a better class boundary. Thus we can place increasing trust in the current set of prototypes and use only those data points which contain information relevant for the current prototypes. We call this method 'dynamic hyperparameter scaling'. Algorithms based on this method use the hyperparameter to set

a dynamic active region during the learning process. Hence, there is no need to select an optimal hyperparameter from a set of candidates. This leads to a remarkable speed-up, since the computational complexity of hyperparameter selection using k-fold CV is of the order \(O(k \times \prod_{i=1}^{N_\eta} N_i)\), where N_η denotes the number of hyperparameters and N_i the number of candidates for the i-th hyperparameter. For dynamic hyperparameter scaling we propose the following annealing schedule, which depends on the number of training data points:

\[
\eta(t+1) = \eta_{\mathrm{ini}} \times \frac{T_\eta \times N}{T_\eta \times N + t},
\]

where η ∈ {σ², ω}, η_ini is the initial value, T_η is a positive integer, N is the number of training data points, and t is the number of learning steps. The performance does not depend critically on the values of η_ini and T_η. As a rule of thumb, we can set T_η to 10, the initial σ² to the mean of the variances of the classes, and the initial ω to 0.04 if there are 1-3 prototypes per class and to 0.1 otherwise. The difference between this method and deterministic annealing is that this method changes the hyperparameters at each learning step, while deterministic annealing changes them only after the convergence of the prototypes (at a fixed value of the hyperparameters). With our heuristic method, the hyperparameters do not need to be optimized, and the parameters are learned only once. If the cost function has many local minima with respect to the hyperparameters, dynamic hyperparameter scaling might be superior to an algorithm which chooses a fixed σ. The main advantage, however, lies in the lower computational complexity. In this paper, the dynamic hyperparameter scaling method is applied only to the four algorithms LVQ 2.1, SLVQ-LR, LVQ-LR and RSLVQ. However, the method can also be applied to other algorithms which contain hyperparameters that select the active region of the data space during training.
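The schedule and its use in an on-line training loop can be sketched as follows; this is a hypothetical example which reuses the rslvq_step sketch from section II-C, and the initialization of σ² follows the rule of thumb given above:

import numpy as np

def anneal(eta_ini, T_eta, n_train, t):
    # eta(t+1) = eta_ini * (T_eta * N) / (T_eta * N + t)
    return eta_ini * (T_eta * n_train) / (T_eta * n_train + t)

def train_rslvq_annealed(X, Y, prototypes, labels, alpha, T_sigma=10, n_epochs=50):
    # Rule of thumb: start sigma^2 at the mean of the per-class variances.
    sigma2_ini = np.mean([X[Y == c].var() for c in np.unique(Y)])
    t = 0
    for _ in range(n_epochs):
        for i in np.random.permutation(len(X)):
            sigma2 = anneal(sigma2_ini, T_sigma, len(X), t)
            rslvq_step(X[i], Y[i], prototypes, labels, alpha, sigma2)
            t += 1
    return prototypes

No inner model-selection loop over hyperparameter candidates is required; the prototypes are learned in a single training run.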

IV. NUMERICAL EXPERIMENTS

The generalization performance using dynamic hyperparameter scaling is compared to the adjustment of the hyperparameters via the cross-validation method on the data sets letter and pendigits from the UCI Machine Learning Repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases).

A. Design of the Experiment

1) Measure of Generalization Error: letter is generated from a large number of black-and-white rectangular pixel displays of the 26 capital letters of the English alphabet. For the selection of optimal hyperparameters via cross-validation, the data set was split into two subsets, each of which contained the same number of data points for each label. The first subset, the training set, was used to find the optimal values of the hyperparameters. The generalization error for the optimal hyperparameters was then estimated

by 10-fold cross-validation on the second (test) set. For the dynamic hyperparameter scaling method, the process of finding an optimal hyperparameter is not required; hence the generalization error is measured via 10-fold CV using the whole data set. The data set pendigits is a database of handwritten digits created by 44 writers. The samples written by 30 writers are used for training (7494 samples), and the remaining digits written by the other 14 writers are used for writer-independent testing (3498 samples). For the selection of an optimal hyperparameter using 10-fold cross-validation, the training data set is used to find optimal hyperparameters for the algorithms. After that, the classifier is optimized on the training set using the selected hyperparameters, and the generalization error of the optimal classifier is measured via the hold-out test set method. For the dynamic hyperparameter scaling method, the prototypes are optimized using the training set, and the generalization error is likewise measured via the hold-out test set method. In order to obtain more representative results, in both cases the training and test procedures (with a fixed selected hyperparameter and with the annealing method) are repeated 10 times, and the average over the 10 runs is calculated.

2) Schedule of Hyperparameter Scaling: For the dynamic hyperparameter scaling method, the hyperparameter σ² (for SLVQ-LR and RSLVQ) and the hyperparameter ω (for LVQ 2.1, SLVQ-LR and LVQ-LR) are scaled down at each learning step. For the experiments we set the initial σ² to 0.5-1.5 times the mean of the variances of the classes. T_σ is set to 4 (SLVQ-LR) and 6 (RSLVQ) for the data set letter and to 10 for the data set pendigits. For dynamic ω scaling, we start with a large value, 0.04 ≤ ω ≤ 0.19, and scale it down very slowly at each learning step. T_ω is set to 8 for the data set letter and to 15 for the data set pendigits.¹

B. Experimental Results

In this section, the classification results of the algorithms with the proposed method and with hyperparameter selection via 10-fold CV are compared on the data sets letter and pendigits.

1) Data Set Letter: Dynamic σ² Scaling: Figure 4 shows the generalization error of the soft LVQ algorithms on the data set letter for the two different treatments of the hyperparameters. The error bars denote the standard deviation of the 10-fold CV errors. The red line denotes the average test error of an NP classifier trained with decreasing σ², while the window parameter ω of SLVQ-LR is set to the value selected via the 10-fold CV method. The blue line denotes the average test error of an NP classifier whose optimization is performed with a fixed optimal σ² selected from a set of 10 candidates via the 10-fold CV method. For the selection of an optimal σ², 10 × 10 CV runs were performed. This process is not needed with the dynamic scaling method, which therefore leads to a reduction of computational complexity.

¹ For details on the parameters used for the dynamic scaling schedule and the optimal hyperparameters chosen via CV, see appendix A.5-A.8 of [20].


Fig. 4: The figure shows the average 10-fold CV errors of the SLVQ-LR and RSLVQ algorithms with the dynamic σ² scaling method (red) and with the selection of an optimal σ² from a set of 10 candidates via 10-fold CV (blue) on the data set letter. For the SLVQ-LR algorithm, the hyperparameter ω is the optimal one selected via the 10-fold CV method.


Fig. 5: The figure shows the average 10-fold CV errors of LVQ 2.1, SLVQ-LR and LVQ-LR on the data set letter. The blue line shows the performance of an NP classifier whose optimal ω was selected from a set of 10 candidates via 10-fold CV. The red line shows the performance of the NP classifier trained with the ω scaling method. For the SLVQ-LR algorithm, the optimal hyperparameter σ² selected via the 10-fold cross-validation method was used.

For the RSLVQ algorithm, the NP classifier resulting from the dynamic scaling method performs almost as well as the one using 10-fold CV, and its error bars are smaller than those of the classifier selected via CV. The SLVQ-LR algorithm with the dynamic scaling method performs consistently better than with the selection method, and slightly better than the RSLVQ algorithm with a selected optimal σ².

2) Data Set Letter: Dynamic ω Scaling: Figure 5 shows the generalization error of LVQ 2.1, SLVQ-LR and LVQ-LR, measured via 10-fold CV. LVQ 2.1 and SLVQ-LR with dynamic ω scaling outperform the same algorithms with the selected optimal ω. LVQ-LR with dynamic ω scaling performs worse than with the selected optimal ω, but its classification performance is much more robust. Note that the performance of SLVQ-LR and LVQ-LR with the dynamic scaling method is almost the same, and that they perform slightly better than RSLVQ with either the selection or the annealing method.

Fig. 6: The figure shows the average 10-fold CV errors of SLVQ-LR with four different treatments of the hyperparameters ω and σ². SLVQ-LR with dynamic σ² and ω scaling performs better than with fixed hyperparameters selected via the 10-fold cross-validation method. Moreover, the performance of SLVQ-LR with simultaneous dynamic scaling of σ² and ω is superior to that of the other variants.

3) Data Set Letter: Simultaneous Dynamic σ² and ω Scaling: It is interesting to consider how SLVQ-LR performs with simultaneous dynamic scaling of σ² and ω. Figure 6 shows the average 10-fold CV error of SLVQ-LR with four different treatments of the hyperparameters: SLVQ-LR with simultaneous dynamic scaling of the two hyperparameters σ² and ω outperforms the other variants. For the selection method, an optimal σ² and an optimal ω were selected from a set of 10 candidates for each hyperparameter, requiring 10 × 10 × 10 CV runs. With simultaneous dynamic scaling of the two hyperparameters, the computational cost of these 1000 training and test processes is avoided.

4) Data Set Pendigits: The same experiment was conducted on the data set pendigits; the results are shown in figure 7. For the selection method we used on average 10 candidates for σ² and 7 candidates for ω. The figure shows that SLVQ-LR with the dynamic scaling method and with the selected optimal hyperparameters perform almost equally. RSLVQ with the selection method is superior to RSLVQ with the dynamic scaling method; however, SLVQ-LR with the dynamic scaling method performs almost as well as RSLVQ with a selected optimal σ². LVQ 2.1 with the dynamic scaling method performs well for a large number of prototypes per class (≥ 7), in which case it performs almost as well as LVQ-LR, and its performance is much more robust than that of LVQ 2.1 with the selection method. LVQ-LR with dynamic scaling and with the selection method perform almost equally.

Fig. 7: The figure shows the average hold-out test set error and standard deviation over 10 runs for the four LVQ algorithms on the data set pendigits. The blue line denotes the performance of the LVQ algorithms with the optimal hyperparameters selected via 10-fold CV. The red line denotes the performance of the LVQ algorithms with the dynamic hyperparameter scaling method: SLVQ-LR (simultaneous σ², ω scaling), RSLVQ (σ²), LVQ 2.1 (ω), LVQ-LR (ω).

V. SUMMARY AND DISCUSSION

In this paper we proposed a new annealing method for the hyperparameters of learning vector quantization algorithms, in order to reduce the computation time spent on the search for their optimal values. An analysis of the learning process for the methods LVQ 2.1, SLVQ-LR, and LVQ-LR showed that the active region in input space, i.e. the region from which training data is selected, shrinks monotonically around the current class boundary with decreasing value of the window width (LVQ 2.1, SLVQ-LR and LVQ-LR) and of the width of the Gaussian components (RSLVQ).


Based on these findings we suggested the new 'dynamic hyperparameter scaling' method, which is based on an annealing process for these particular hyperparameters. The performance of the algorithms using the proposed dynamic scaling method was compared, on two UCI data sets, to the performance of methods that select the optimal hyperparameters via 10-fold CV. In the experiments, LVQ 2.1 and SLVQ-LR using the dynamic scaling method performed better than the same algorithms using the selection method. For these two algorithms the proposed method therefore gives a double advantage: not only is the computational complexity reduced remarkably, but the generalization performance of the resulting classifier is also improved. RSLVQ and LVQ-LR with the selection method perform slightly better than the versions using the proposed method on one data set and almost equally on the other. However, SLVQ-LR with the dynamic scaling method gives almost the same results as RSLVQ using the selection method; due to its computational simplicity, the variant with dynamic scaling should be preferred. While in this paper we applied the method only to the four LVQ algorithms, it could also prove useful for other algorithms which contain hyperparameters that control the active region for learning. It would be interesting to see whether the LVQ algorithms of [5], [11], [12] also benefit from its use.

ACKNOWLEDGMENT

This work was supported in part by the Anna-Geisler Stiftung under grant no. VH P 293/33 and the Monika-Kutzner Stiftung under grant no. 10025517.

REFERENCES

[1] T. Kohonen, Self-Organizing Maps. New York: Springer-Verlag, 2001.
[2] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, and K. Torkkola, LVQ_PAK: The Learning Vector Quantization Program Package, 1995.
[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, 2001.
[4] T. Komori and S. Katagiri, "Application of a generalized probabilistic descent method to dynamic time warping-based speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992, pp. 497-500.
[5] E. McDermott and S. Katagiri, "Prototype-based minimum classification error/generalized probabilistic descent training for various speech units," Computer Speech and Language, vol. 8, no. 4, pp. 351-368, 1994.
[6] B. Weigelt, A. Glas, L. Wessels, A. Witteveen, J. Peterse, and L. van 't Veer, "Gene expression profiles of primary breast tumors maintained in distant metastases," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, pp. 15901-15905, 2003.
[7] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, "Diagnosis of multiple cancer types by shrunken centroids of gene expression," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, pp. 6567-6572, 2002.
[8] T. Kohonen, "Learning vector quantization," Helsinki Univ. of Tech., Otaniemi, Tech. Rep., 1986.
[9] Neural Networks Research Centre, Bibliography on the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ), Helsinki Univ. of Tech., 2003, http://liinwww.ira.uka.de/bibliography/Neural/SOM.LVQ.html.
[10] T. Kohonen, "Improved versions of learning vector quantization," in International Joint Conference on Neural Networks, vol. 1, 1990, pp. 545-550.
[11] A. Sato and K. Yamada, "Generalized learning vector quantization," in Advances in Neural Information Processing Systems 8, D. Touretzky, M. Mozer, and M. Hasselmo, Eds. MIT Press, 1996, pp. 423-429.
[12] K. Crammer, R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin analysis of the LVQ algorithm," in Advances in Neural Information Processing Systems 15, S. Becker, S. Thrun, and K. Obermayer, Eds. MIT Press, 2002.
[13] S. Seo, M. Bode, and K. Obermayer, "Soft nearest prototype classification," IEEE Transactions on Neural Networks, vol. 14, pp. 390-398, 2003.
[14] S. Seo and K. Obermayer, "Soft learning vector quantization," Neural Computation, vol. 15, pp. 1589-1604, 2003.
[15] Y. Bengio, "Gradient-based optimization of hyperparameters," Neural Computation, vol. 12, no. 8, pp. 1889-1900, 2000.
[16] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Stat., vol. 22, pp. 400-407, 1951.
[17] T. Graepel, M. Burger, and K. Obermayer, "Phase transitions in stochastic self-organizing maps," Physical Review E, vol. 56, pp. 3876-3890, 1997.
[18] D. Miller, A. V. Rao, K. Rose, and A. Gersho, "A global optimization technique for statistical classifier design," IEEE Transactions on Signal Processing, vol. 44, no. 12, pp. 3108-3122, 1996.
[19] L. Bottou, "Online learning and stochastic approximations," in Online Learning and Neural Networks, D. Saad, Ed. Cambridge, UK: Cambridge University Press, 1998, pp. 9-42.
[20] S. Seo, "Clustering and prototype based classification," Ph.D. dissertation, Berlin University of Technology, 2005.