A Class-Dependent Weighted Dissimilarity Measure for Nearest Neighbor Classification Problems

Roberto Paredes and Enrique Vidal
Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Spain.
[email protected], [email protected]

Pattern Recognition Letters.

Abstract

A class-dependent weighted (CDW) dissimilarity measure in vector spaces is proposed to improve the performance of the nearest neighbor classifier. In order to optimize the required weights, an approach based on Fractional Programming is presented. Experiments with several standard benchmark data sets show the effectiveness of the proposed technique.


Keywords: Nearest Neighbour Classification, Weighted Dissimilarity Measures, Iterative Optimization, Fractional Programming.

1 Introduction

Let $P$ be a finite set of prototypes, which are class-labelled points in a vector space $E$, and let $d(\cdot,\cdot)$ be a dissimilarity measure defined in $E$. For any given point $x \in E$, the Nearest Neighbor (NN) classification rule assigns to $x$ the label of a prototype $p \in P$ such that $d(p, x)$ is minimum. The NN rule can be extended to the $k$-NN rule by classifying $x$ in the class most heavily represented among the labels of its $k$ nearest neighbours. The great effectiveness of these rules as the number of prototypes grows to infinity is well known [Cover (1967)]. However, in most real situations the number of available prototypes is usually very small, which often leads to dramatic degradations of ($k$-)NN classification accuracy.

Consider the following general statistical statement of a two-class Pattern Recognition classification problem:

Let $D_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ be a training data set of independent, identically distributed random variable pairs, where $Y_i \in \{0, 1\}$, $1 \le i \le n$, are classification labels, and let $X$ be an observation from the same distribution. Let $Y$ be the true label of $X$ and $g_n(\cdot)$ a classification rule based on $D_n$. The probability of error is $R_n = P\{Y \neq g_n(X)\}$. Devroye et al. show that, for any integer $n$ and any classification rule $g_n$, there exists a distribution of $(X, Y)$ with Bayes risk $R^* = 0$ such that the expectation of $R_n$ is $E(R_n) \ge \frac{1}{2} - \varepsilon$, where $\varepsilon > 0$ is an arbitrarily small number [Devroye (1996)]. This theorem states that even though we have rules, such as the $k$-NN rule, that are universally consistent (that is, they asymptotically provide optimal performance for any distribution), their finite-sample performance can be extremely bad for some distributions.

This explains the increasing interest in finding variants of the NN rule and adequate distance measures that help improve NN classification performance in small data set situations [Tomek (1976), Fukunaga (1985), Luk (1986), Urahama (1995), Short (1980), Short (1981), Fukunaga (1982), Fukunaga (1984), Myles (1990)]. Here we propose a weighted measure which can be seen as a generalization of the simple weighted $L_2$ dissimilarity in a $d$-dimensional space:

$$d(y, x) = \sqrt{\sum_{j=1}^{d} w_j^2\,(x_j - y_j)^2} \qquad (1)$$

where $w_j$ is the weight of the $j$-th dimension. Assuming an $m$-class classification problem, our proposed generalization is just a natural extension of (1):

$$d(y, x) = \sqrt{\sum_{j=1}^{d} w_{cj}^2\,(x_j - y_j)^2}, \qquad c = \text{class}(x) \qquad (2)$$

We will refer to this extension as the "Class-Dependent Weighted (CDW)" measure. If $w_{ij} = 1$, $1 \le i \le m$, $1 \le j \le d$, the weighted measure is just the $L_2$ metric. On the other hand, if the squared weights are the inverse of the variances in each dimension (i.e., each weight is the inverse of the corresponding standard deviation), the Mahalanobis distance (MD) is obtained. Weights can also be computed in the same way from class-dependent variances, leading to a measure that will be referred to as the class-dependent Mahalanobis (CDM) dissimilarity.

In the general case, (2) is not a metric, since $d(x, y)$ can differ from $d(y, x)$ whenever $\text{class}(x) \neq \text{class}(y)$; that is, the symmetry property is not satisfied.

In this most general setting, we are interested in finding an $m \times d$ weight matrix, $M$, which optimizes the CDW-based NN classification performance:

$$M = \begin{pmatrix} w_{11} & \cdots & w_{1d} \\ \vdots & & \vdots \\ w_{m1} & \cdots & w_{md} \end{pmatrix} \qquad (3)$$
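To make the definition concrete, the following minimal sketch (Python with NumPy; the function name and calling convention are ours, not the paper's) evaluates the CDW dissimilarity (2) for a given $m \times d$ weight matrix $M$, applying the weights of the prototype's class.

    import numpy as np

    def cdw(y_point, x_proto, x_class, M):
        # CDW dissimilarity (2): the weights of the prototype's class, c = class(x),
        # multiply the squared feature differences before summing and taking the root.
        w = M[x_class]
        return np.sqrt(np.sum((w ** 2) * (x_proto - y_point) ** 2))

    # Example: m = 2 classes, d = 3 features; unit weights reduce (2) to the plain L2 metric.
    M = np.ones((2, 3))
    print(cdw(np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.0, 0.0]), 1, M))  # sqrt(1 + 0 + 4)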

2 Approach

In order to find a matrix $M$ that results in a low error rate of the NN classifier with the CDW dissimilarity measure, we propose the minimization of a specific criterion index. Under the proposed framework, we expect NN accuracy to improve by using a dissimilarity measure such that distances between points belonging to the same class are small while interclass distances are large. This simple idea suggests the following criterion index:

$$J(M) = \frac{\sum_{x \in S} d(x, x^{=}_{nn})}{\sum_{x \in S} d(x, x^{\neq}_{nn})} \qquad (4)$$

where $x^{=}_{nn}$ is the nearest neighbor of $x$ in the same class ($\text{class}(x) = \text{class}(x^{=}_{nn})$) and $x^{\neq}_{nn}$ is the nearest neighbor of $x$ in a different class ($\text{class}(x) \neq \text{class}(x^{\neq}_{nn})$). In the sequel, $\sum_{x \in S} d(x, x^{=}_{nn})$ will be denoted as $f(M)$ and $\sum_{x \in S} d(x, x^{\neq}_{nn})$ as $g(M)$. That is:

$$J(M) = \frac{f(M)}{g(M)}$$

Minimizing this index amounts to minimizing a ratio between sums of distances, a problem which is difficult to solve by conventional gradient descent. In fact, the gradient with respect to a weight $w_{ij}$ takes the form

$$\frac{\partial J(M)}{\partial w_{ij}} = \frac{\big(\partial f(M)/\partial w_{ij}\big)\, g(M) \;-\; f(M)\, \big(\partial g(M)/\partial w_{ij}\big)}{g(M)^2}$$

Taking into account that $f(M) = \sum_{x \in S} d(x, x^{=}_{nn})$ and $g(M) = \sum_{x \in S} d(x, x^{\neq}_{nn})$, this leads to an exceedingly complex expression. Clearly, an alternative technique for minimizing (4) is needed.
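For reference, the criterion (4) itself is straightforward to evaluate on a labelled training set; the following minimal sketch (Python with NumPy; names are ours) accumulates $f(M)$ and $g(M)$ from the same-class and different-class nearest neighbours under the CDW measure.

    import numpy as np

    def criterion_index(X, y, M):
        # J(M) = f(M) / g(M) of Eq. (4): f sums same-class NN distances and
        # g sums different-class NN distances, both under the CDW measure (2).
        n = len(X)
        f_val = g_val = 0.0
        for t in range(n):
            # distance from X[t] to every prototype, weighted by each prototype's class
            d = np.sqrt(((M[y] ** 2) * (X - X[t]) ** 2).sum(axis=1))
            d[t] = np.inf                    # exclude the point itself
            f_val += d[y == y[t]].min()      # nearest same-class prototype
            g_val += d[y != y[t]].min()      # nearest different-class prototype
        return f_val / g_val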

2.1 Fractional Programming

In order to find a matrix $M$ that minimizes (4), a Fractional Programming procedure [Sniedovich (1992)] is proposed. Fractional Programming aims at solving problems of the following type:¹

$$\text{Problem } Q: \qquad q = \min_{z \in Z} \frac{v(z)}{w(z)}$$

¹As in [Vidal (1995)], where another application of Fractional Programming in Pattern Recognition is described, here we consider minimization problems rather than maximization problems as in [Sniedovich (1992)]. It can easily be verified that the same results of [Sniedovich (1992)] also hold in our formulation.


where $v$ and $w$ are real-valued functions on some set $Z$, and $w(z) > 0$ for all $z \in Z$. Let $Z^{*}$ denote the set of optimal solutions to this problem. An optimal solution can be obtained via the solution of a parametric problem of the following type:

$$\text{Problem } Q(\lambda): \qquad q(\lambda) = \min_{z \in Z}\,\big(v(z) - \lambda\, w(z)\big), \qquad \lambda \in \mathbb{R}$$

A standard Fractional Programming result is that an optimal solution of Problem $Q$ can be obtained by iteratively solving instances of Problem $Q(\lambda)$: given the current solution $z$, the parameter is updated as $\lambda = v(z)/w(z)$ and $Q(\lambda)$ is solved again, until $\lambda$ stabilizes [Sniedovich (1992)]. In our case $v$ and $w$ are $f(M)$ and $g(M)$, and each instance of $Q(\lambda)$, i.e. the minimization of $f(M) - \lambda\, g(M)$, is approached by gradient descent on the weights. For each training point $x \in S$, with $i = \text{class}(x)$ and $k = \text{class}(x^{\neq}_{nn})$, the weights of classes $i$ and $k$ are updated for $j = 1, \ldots, d$ as

$$w_{ij} \leftarrow w_{ij} - \rho_{ij}\, \frac{w_{ij}\,\big((x^{=}_{nn})_j - x_j\big)^2}{d(x, x^{=}_{nn})} \qquad (6)$$

$$w_{kj} \leftarrow w_{kj} + \rho_{kj}\, \lambda\, \frac{w_{kj}\,\big((x^{\neq}_{nn})_j - x_j\big)^2}{d(x, x^{\neq}_{nn})} \qquad (7)$$

where the $\rho_{ij}$ are small step factors. The resulting procedure, which we call Fractional Programming Gradient Descent (FPGD), is summarized in Figure 1.

    λ = f(M)/g(M);  λ'' = ∞;  iterations = 0;
    while (λ'' − λ > ε) {
        λ'' = λ;  M' = M;
        for all x ∈ S {
            i = class(x);  k = class(x≠nn);
            for j = 1 … d {
                w'_{ij} = w'_{ij} − ρ_{ij} · w_{ij} · ((x=nn)_j − x_j)² / d(x, x=nn);
                w'_{kj} = w'_{kj} + ρ_{kj} · λ · w_{kj} · ((x≠nn)_j − x_j)² / d(x, x≠nn);
            }
        }
        iterations = iterations + 1;
        M = M';  λ = f(M)/g(M);
    }

Figure 1: The Fractional Programming Gradient Descent (FPGD) algorithm.

It is interesting to note that the computations involved in (6) and (7) implicitly entail computing the NN of each $x \in S$, according to the CDW dissimilarity corresponding to the current values of the weights $w_{ij}$ and the prototype set $S - \{x\}$. Therefore, as a byproduct, a Leave-One-Out (LOO) estimate of the error rate of the NN classifier with the weighted measure can readily be obtained. This issue will be further explored in the next section. Figure 2 shows a typical evolution of this algorithm, as applied to the so-called "Monkey Problem" data set, which is described in Section 3.

Figure 2: Behaviour of the FPGD algorithm as applied to the "Monkey Problem" data set: the LOO classification error estimate and the criterion index are plotted against the number of iterations.


2.2 Finding adequate solutions in adverse situations

A negative side effect of the fact that only locally optimal solutions can be obtained in each step of the Fractional Programming procedure is that, if the additive factor in (7) is not sufficiently large, the algorithm may tend to drive the weights to zero. As an example of this kind of divergent behaviour, consider the following two-class problem, with each class having 500 two-dimensional points (Figure 3). Class A is a mixture of two Gaussian distributions, both centered at (0,0): the first has a standard deviation of $\sqrt{10}$ in the $x_1$ dimension and a unit standard deviation in the $x_2$ dimension, while the second has a unit standard deviation in the $x_1$ dimension and a standard deviation of $\sqrt{10}$ in the $x_2$ dimension. Class B is a Gaussian distribution with unit standard deviation in the $x_1$ dimension and a standard deviation of $\sqrt{10}$ in the $x_2$ dimension, centered at (6,0). Note the relatively large interclass overlapping in the $x_1$ dimension.

Figure 3: Two-class problem (Class A vs. Class B) with the Gaussian mixture distributions and interclass overlapping.
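For reference, a data set matching this description can be generated along the following lines (a sketch with NumPy; the random seed and variable names are ours).

    import numpy as np

    rng = np.random.default_rng(0)

    # Class A: equal mixture of two Gaussians centered at (0, 0),
    # with standard deviations (sqrt(10), 1) and (1, sqrt(10)) respectively.
    n_a = 500
    pick = rng.integers(0, 2, size=n_a)
    stds = np.where(pick[:, None] == 0, [np.sqrt(10), 1.0], [1.0, np.sqrt(10)])
    class_a = rng.normal(0.0, 1.0, size=(n_a, 2)) * stds

    # Class B: a single Gaussian centered at (6, 0) with standard deviations (1, sqrt(10)).
    class_b = rng.normal(0.0, 1.0, size=(500, 2)) * [1.0, np.sqrt(10)] + [6.0, 0.0]

    X = np.vstack([class_a, class_b])
    y = np.array([0] * n_a + [1] * 500)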

As shown in Figure 4, with this data set (and using just unit initialization weights and a constant value for the step factor $\rho$), the estimated error rate tends to worsen, while the proposed criterion index (4) effectively decreases through successive iterations. This undesirable effect is actually due to the fact that all the $w_{ij}$ tend to zero until the algorithm stops. It is interesting to note that, despite this "divergent" behaviour, a minimum error estimate is achieved at a certain step of the procedure, as can be seen in Figure 4. In other words, a low value of $J(M)$ does not necessarily mean a low value of the NN classifier error rate; this connection was only an assumption, as mentioned in Section 2. Nevertheless, it is possible to find a minimum of the estimated error somewhere along the path towards the minimum index value. This suggests that, rather than supplying the weight values obtained at the end of the FPGD procedure, a better choice for $M$ in general is to supply the weights that led to the minimum estimated error rate. In typical cases, such as that shown in Figure 2, this minimum is achieved at the convergence point of the FPGD procedure, while in adverse situations, such as that in Figure 4, the minimum-error weights will hopefully be a better choice than the standard ($L_2$ or Mahalanobis) distance.

Figure 4: "Divergent" evolution of the FPGD algorithm with the "adverse" synthetic data shown in Figure 3 (error estimate and criterion index versus iterations). The CDW index converges as expected, but the error rate tends to increase. Nevertheless, there is a step in which the error is minimum.

It is worth noting that this simple heuristic guarantees a LOO error estimate for the resulting weights which is never larger than the one obtained with the initial weights. Consequently, if the weights are initialized with values corresponding to a certain conventional (adequate) metric, the final weights are expected to behave at least as well as this metric would.

2.3 Asymptotic behaviour

The previous section introduced an essential feature of our approach, namely the LOO estimation of the classifier error rate using the weights available at each step of the process; at the end of the process, the weights with the best estimate are selected. Let $n$ be the size of the training set. If $M$ is initialized to the unit matrix, in the first step of the process a LOO error estimate of the standard Nearest Neighbor classifier, $\hat{\epsilon}^{\,nn}_n$, is obtained. At the end of the process the weight matrix with the best error estimate, $\hat{\epsilon}^{\,w}_n$, is selected. Therefore $\hat{\epsilon}^{\,w}_n \le \hat{\epsilon}^{\,nn}_n$. It is well known that, under suitable conditions [Devroye (1996)], when $n$ tends to infinity the LOO error estimate of a NN classifier tends to the error rate of this classifier. Therefore:

$$\left.\begin{array}{l} \hat{\epsilon}^{\,w}_n \le \hat{\epsilon}^{\,nn}_n \\[2pt] \lim_{n\to\infty} \hat{\epsilon}^{\,nn}_n = \epsilon^{nn} \\[2pt] \lim_{n\to\infty} \hat{\epsilon}^{\,w}_n = \epsilon^{w} \end{array}\right\} \;\Rightarrow\; \epsilon^{w} \le \epsilon^{nn} \qquad (8)$$

In conclusion, in this asymptotic case the classifier using the optimal weight matrix is guaranteed to yield an error rate less than or equal to that of the standard Nearest Neighbor classifier.

3 Experiments

Several standard benchmark corpora from the UCI Repository of Machine Learning Databases and Domain Theories [UCI] and the Statlog Project [Statlog] have been used. A short description of these corpora is given below:

- Statlog Australian Credit Approval (Australian): 690 prototypes, 14 features, 2 classes. Divided into 10 sets for cross-validation.
- UCI Balance (Balance): 625 prototypes, 4 features, 3 classes. Divided into 10 sets for cross-validation. A different design of the experiment was made in [Shultz (1994)].
- Statlog Pima Indians Diabetes (Diabetes): 768 prototypes, 8 features, 2 classes. Divided into 11 sets for cross-validation.
- Statlog DNA (DNA): Training set of 2000 prototypes, test set of 1186 vectors, 180 features, 3 classes.
- Statlog German Credit Data (German): 1000 prototypes, 20 features, 2 classes. Divided into 10 sets for cross-validation.
- Statlog Heart (Heart): 270 prototypes, 13 features, 2 classes. Divided into 9 sets for cross-validation.
- UCI Ionosphere (Ionosphere): Training set of 200 prototypes (the first 200, as in [Sigilito (1989)]), test set of 151 vectors, 34 features, 2 classes.
- Statlog Letter Image Recognition (Letter): Training set of 15000 prototypes, test set of 5000 vectors, 16 features, 26 classes.
- UCI Monkey-Problem-1 (Monkey): Training set of 124 prototypes, test set of 432 vectors, 6 features, 2 classes.
- Statlog Satellite Image (Satimage): Training set of 4435 prototypes, test set of 2000 prototypes, 36 features, 6 classes.
- Statlog Image Segmentation (Segmen): 2310 prototypes, 19 features, 7 classes. Divided into 10 sets for cross-validation.
- Statlog Shuttle (Shuttle): Training set of 43,500 prototypes, test set of 14,500 vectors, 9 features, 7 classes.
- Statlog Vehicle (Vehicle): 846 prototypes, 18 features, 4 classes. Divided into 9 sets for cross-validation.

Most of these data sets involve both numeric and categorical features. In our experiments, each categorical feature has been replaced by $n$ binary features, where $n$ is the number of different values allowed for the categorical feature. For example, in a hypothetical data set with two features, Age (continuous) and Sex (categorical: M, F), the categorical feature would be replaced by two binary features; i.e., Sex=M would be represented as (1,0) and Sex=F as (0,1). The continuous feature does not undergo any change, leading to an overall three-dimensional representation.

Many UCI and Statlog data sets are small. In these cases, N-fold cross-validation [Raudys (1991)] has been applied to obtain the classification results: each corpus is divided into N blocks, N − 1 blocks are used as a training set and the remaining block as a test set, so that each block is used exactly once as a test set. The number of cross-validation blocks, N, is specified for each corpus in the UCI and Statlog documentation. For DNA, Letter, Monkey, Satimage and Shuttle, which are relatively larger corpora, a single specific partition into training and test sets was provided by Statlog and, in these cases, no cross-validation was carried out. Finally, it should be mentioned that, although classification-cost penalties are available in a few cases, for the sake of presentation homogeneity we have decided not to make use of them, neither for training nor for classification.
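For concreteness, the categorical-to-binary expansion described above can be sketched as follows for the hypothetical Age/Sex example (plain Python; the helper name is ours).

    def expand_categorical(value, categories):
        # Replace one categorical value by len(categories) binary features.
        return [1.0 if value == c else 0.0 for c in categories]

    # Hypothetical record with a continuous Age and a categorical Sex in {M, F}:
    age, sex = 37.0, "F"
    vector = [age] + expand_categorical(sex, ["M", "F"])   # -> [37.0, 0.0, 1.0]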

4 Results

Experiments with both the NN and the k-NN rules were carried out using the $L_2$ metric, the Mahalanobis distance (MD), the "class-dependent" Mahalanobis (CDM), and our CDW dissimilarity measures. As mentioned in Section 1, CDM consists in weighting each dimension by the inverse of the variance of this dimension in each class. In the case of the CDM dissimilarity, computational singularities can appear when dealing with categorical features, which often exhibit null class-dependent variances. This problem was solved by using the overall variance as a "back-off" for smoothing the null values.

Initialization values for training the CDW weights were selected according to the following simple rule, which is based on the LOO NN performance of the conventional methods on the training data: if raw $L_2$ outperforms CDM, then all initial weights are set to $w_{ij} = 1$; otherwise, they are set to the inverse of the corresponding training-data standard deviations. Similarly, the step factors $\rho_{ij}$ are set to a small constant (0.001) in the former case and to the inverse of the standard deviation in the latter.

Tables 1 and 2 summarize the results for NN and k-NN classification, respectively. In the case of k-NN, only the results for the optimal value of $k$, $1 \le k \le 21$, observed for each method are reported. For the NN classification rule (Table 1), CDW outperforms the conventional methods on most of the corpora. The greatest improvement (+13%) was obtained in the Monkey Problem, a categorical corpus with a small number of features and only two classes. Similarly, a good improvement (+9.2%) was obtained for the DNA corpus, which is also a corpus with categorical data, but with far more features (180) and 3 classes. CDW has only been slightly outperformed (by less than 1.6%) by other methods in a few cases: Australian, Ionosphere and Shuttle. For the k-NN classification rule (Table 2), CDW outperforms the conventional methods on many corpora: DNA, Ionosphere, Letter, Monkey, Segmen and Vehicle; again, Monkey and DNA yielded the most significant improvements (+12.7% and +7.7%, respectively). Also, in this k-NN case, in the corpora where CDW is outperformed by some other method, the difference in accuracy is generally small.

Error-estimation 95% confidence intervals³ [Duda (1973)] for the best method are also shown in Tables 1 and 2. It is interesting to note that, in the few cases where CDW is outperformed by other methods, the difference is generally well within the corresponding confidence intervals.

³Computed by numerically solving the equations $\sum_{k \le K} P(k; n, p_1) = \frac{1-A}{2}$ and $\sum_{k \ge K} P(k; n, p_0) = \frac{1-A}{2}$, where $P(k; n, p)$ is the binomial distribution, $A = 0.95$ is the confidence value and $[p_0, p_1]$ is the confidence interval.
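The interval of footnote 3 can be obtained numerically, for instance as in the sketch below (Python with SciPy assumed; the function name is ours), which solves the two binomial equations for the interval limits.

    from scipy.optimize import brentq
    from scipy.stats import binom

    def binomial_ci(k, n, A=0.95):
        # Solve sum_{j<=k} P(j; n, p1) = (1-A)/2 and sum_{j>=k} P(j; n, p0) = (1-A)/2
        # for the confidence interval [p0, p1] of an error rate with k errors in n trials.
        alpha = (1.0 - A) / 2.0
        p1 = 1.0 if k == n else brentq(lambda p: binom.cdf(k, n, p) - alpha, 0.0, 1.0)
        p0 = 0.0 if k == 0 else brentq(lambda p: (1.0 - binom.cdf(k - 1, n, p)) - alpha, 0.0, 1.0)
        return p0, p1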


Table 1: Classification accuracy (in %) of different methods, using the NN rule on several data sets. Results in boldface correspond to the best accuracy. The last column is the 95% confidence interval of the best method.

                L2       MD       CDM      CDW      CI
  Australian    65.73    81.03    82.94    81.37    +2.7, -3.0
  Balance       78.83    80.16    68.0     82.63    +2.9, -3.2
  Diabetes      69.94    70.62    68.3     71.72    +3.2, -3.3
  DNA           76.55    74.28    84.99    94.18    +1.3, -1.5
  German        66.3     66.9     67.6     70.7     +2.8, -2.9
  Heart         59.72    76.21    76.14    77.31    +4.8, -5.5
  Ionosphere    92.05    85.22    82.95    91.39    +3.8, -5.5
  Letter        95.8     95.26    92.98    96.6     +0.5, -0.5
  Monkey        78.7     86.34    87.04    100      +0.0, -0.8
  Satimage      89.45    89.35    85.3     90.15    +1.3, -1.4
  Segmen        96.32    96.27    95.97    96.92    +0.7, -0.8
  Shuttle       99.88    99.91    99.93    99.86    +0.04, -0.05
  Vehicle       65.3     68.51    66.79    69.5     +3.1, -3.2

Table 2: Classification accuracy (in %) of different methods using the k-NN rule on several data sets. Results in boldface correspond to the best accuracy. The last column is the 95% confidence interval of the best method.

                L2       MD       CDM      CDW      CI
  Australian    69.26    85.44    85.29    84.8     +2.5, -2.8
  Balance       91.16    91.66    91.16    90.83    +2.0, -2.4
  Diabetes      76.5     77.32    73.77    75.13    +2.9, -3.1
  DNA           86.76    83.64    85.16    94.43    +1.2, -1.4
  German        71.2     73.2     74.5     71.8     +2.7, -2.8
  Heart         67.89    85.13    82.14    80.6     +4.0, -4.8
  Ionosphere    94.7     85.22    90.34    97.35    +1.9, -4.0
  Letter        96.1     95.56    92.98    96.6     +0.5, -0.5
  Monkey        83.33    86.34    87.33    100      +0.0, -0.8
  Satimage      90.75    90.65    87.25    90.75    +1.2, -1.3
  Segmen        96.32    96.27    95.97    96.92    +0.7, -0.8
  Shuttle       99.88    99.92    99.93    99.86    +0.04, -0.05
  Vehicle       66.54    71.72    70.25    71.85    +3.0, -3.2

On the other hand, in many cases where CDW was the best method, the confidence intervals were small (notably DNA, Monkey and Letter), thus indicating a statistically significant advantage of CDW. Comparisons with the best method known for each corpus [UCI, Statlog, Sigilito (1989)] are summarized in Table 3, while Table 4 shows the results achieved by several methods on a few corpora⁴. From these comparisons and the previously discussed results (Tables 1 and 2), it can be seen that CDW exhibits a uniformly good behaviour across all the corpora, while other procedures may work very well for some corpora (usually only one) but typically tend to worsen (dramatically in many cases) for the rest.

⁴Corpora that make use of classification-cost penalties (Section 3), such as Heart and German, as well as other corpora which are not comparable because of other differences in experiment design, are excluded. Only methods for which results are available on many corpora, and corpora for which results with many methods are available, have been chosen for the comparisons in Table 4.


Table 3: Comparing CDW classification accuracy (in %) with the best accuracy achieved by other methods.

                CDW       Other (Method)
  Australian    84.80     86.9 (Cal5)
  Diabetes      75.13     77.7 (LogDisc)
  DNA           94.43     95.9 (Radial)
  Ionosphere    97.35     96.7 (IB3)
  Letter        96.60     93.6 (Alloc80)
  Monkey        100.00    100.0 (AQ17-DCI)⁵
  Satimage      90.75     90.75 (KNN)
  Segmen        96.92     97.0 (Alloc80)
  Shuttle       99.86     99.0 (NewId)
  Vehicle       71.85     85.0 (QuaDisc)

⁵Many other algorithms also achieve 100% accuracy.

Table 4: Comparing classification error rate (in %) achieved by several methods. Results in boldface correspond to the best method for each corpus.

                Alloc80   CART    C4.5    Discrim   NBayes   QDisc   Cal5    Radial   CDW
  Australian    20.1      14.5    15.5    14.1      15.1     20.7    13.1    14.5     15.2
  DNA           5.7       8.5     7.6     5.9       6.8      5.9     13.1    4.1      5.5
  Letter        6.4       ——      13.2    30.2      52.9     11.3    25.3    23.3     3.4
  Satimage      13.2      13.8    15      17.1      ——       15.5    15.1    12.1     9.2
  Segmen        3         4       4       11.6      26.5     15.7    6.2     6.9      3.1
  Vehicle       17.3      23.5    26.6    21.6      55.8     15      27.9    30.7     28.1

5 Concluding remarks

A weighted dissimilarity measure for NN classification has been presented. The required matrix of weights is obtained through Fractional-Programming-based minimization of an appropriate criterion index. Results obtained for several standard benchmark data sets are promising: current results using the CDW index and the FPGD algorithm are uniformly better than those achieved by other, more traditional methods. This also applies to comparing FPGD with the direct Gradient Descent technique previously proposed in [Paredes (1998)] to minimize a simpler criterion index.

Other, more sophisticated optimization methods can be devised to minimize the proposed index (4), and new indexes can be proposed which would probably lead to improved performance. In this sense, an index which computes the ratio between the k-NN distances to the prototypes of the same class and the k-NN distances to the prototypes of the nearest different class (rather than the plain NN distances as in (4)) would be expected to improve the current CDW k-NN results. Another new weighting scheme that deserves to be studied is one in which weights are assigned to each prototype, rather than (or in addition to) each class. This "Prototype-Dependent Weighted (PDW)" measure would involve a more "local" configuration of the dissimilarity function and is expected to lead to an overall behaviour of the corresponding k-NN classifiers which is even more data-independent.

Local prototype weighting can also be made feature-independent; i.e., a single scalar weight is assigned to each prototype. The weight of each prototype is intended to measure the value of this prototype for improving classification accuracy. Such a prototype weighting scheme can be seen from the viewpoint of prototype editing. This kind of weight can be learned using techniques similar to those introduced in this paper, leading to a recently studied and very successful editing-oriented weighting method which we call WP-Edit [Paredes (2000)].

References

[Cover (1967)] T.M. Cover and P.E. Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
[Devroye (1996)] L. Devroye, L. Györfi and G. Lugosi. 1996. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
[Tomek (1976)] I. Tomek. 1976. A generalization of the k-NN rule. IEEE Transactions on Systems, Man, and Cybernetics, 6(2), 121–126.
[Fukunaga (1985)] K. Fukunaga and T.E. Flick. 1985. The 2-NN rule for more accurate NN risk estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(1), 107–112.
[Luk (1986)] A. Luk and J.E. Macleod. 1986. An alternative nearest neighbour classification scheme. Pattern Recognition Letters, 4, 375–381.
[Urahama (1995)] K. Urahama and Y. Furukawa. 1995. Gradient descent learning of nearest neighbor classifiers with outlier rejection. Pattern Recognition, 28(5), 761–768.
[Short (1980)] R.D. Short and K. Fukunaga. 1980. A new nearest neighbor distance measure. In Proc. 5th IEEE Int. Conf. on Pattern Recognition, Miami Beach, FL.
[Short (1981)] R.D. Short and K. Fukunaga. 1981. An optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27, 622–627.
[Fukunaga (1982)] K. Fukunaga and T.E. Flick. 1982. A parametrically defined nearest neighbour measure. Pattern Recognition Letters, 1, 3–5.
[Fukunaga (1984)] K. Fukunaga and T.E. Flick. 1984. An optimal global nearest neighbour metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6, 314–318.
[Myles (1990)] J.P. Myles and D.J. Hand. 1990. The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition, 23(11), 1291–1297.
[Paredes (1998)] R. Paredes and E. Vidal. 1998. A nearest neighbor weighted measure in classification problems. In Proc. VIII Simposium Nacional de Reconocimiento de Formas y Análisis de Imágenes, Bilbao, Spain, July 1998.
[Paredes (2000)] R. Paredes and E. Vidal. 2000. Weighting prototypes: a new editing approach. In Proc. 15th International Conference on Pattern Recognition (ICPR 2000), Barcelona, Spain, September 2000.
[Sniedovich (1992)] M. Sniedovich. 1992. Dynamic Programming. Marcel Dekker Inc.
[Vidal (1995)] E. Vidal, A. Marzal and P. Aibar. 1995. Fast computation of normalized edit distances. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(9), 899–902.
[UCI] C. Blake, E. Keogh and C.J. Merz. UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/mlearn/MLRepository.html. University of California, Irvine, Dept. of Information and Computer Sciences.
[Statlog] Statlog Corpora. Dept. of Statistics and Modelling Science (Stams), Strathclyde University. ftp.strath.ac.uk
[Sigilito (1989)] V.G. Sigilito, S.P. Wing, L.V. Hutton and K.B. Baker. 1989. Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10, 262–266.
[Shultz (1994)] T.R. Shultz, D. Mareschal and W.C. Schmidt. 1994. Modeling cognitive development on balance scale phenomena. Machine Learning, 16, 57–86.
[Raudys (1991)] S.J. Raudys and A.K. Jain. 1991. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(3), 252–264.
[Duda (1973)] R. Duda and P. Hart. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
