int. j. remote sensing, 1997, vol. 18, no. 6, 1259–1275

A comparative study of some non-parametric spectral classifiers. Applications to problems with high-overlapping training sets

F. J. CORTIJO and N. PÉREZ DE LA BLANCA
Depto. Ciencias de la Computación e I.A. (DECSAI), E.T.S. Ingeniería Informática, Universidad de Granada, 18071 Granada, Spain

(Received 21 February 1996; in final form 15 July 1996)

Abstract. In this paper we present some alternatives to the widely used maximum likelihood (ML) classifier for obtaining high-accuracy classifications. The ML classifier does not provide high-accuracy classifications when the training sets are high-overlapping in the representation space, owing to the shape of the decision boundaries it imposes. In these cases it is preferable to adopt another classifier that can adjust the decision boundaries in a better fashion. This objective may be achieved with several non-parametric classifiers and with the regularized discriminant classifier, as shown in this paper.

1. Introduction

Supervised classifiers assume the existence of a training set T composed of n labelled training samples, where the labels represent informational classes. Let Ω be the set of informational classes, Ω = {ω_1, ω_2, …, ω_J}. Then

    T = \{(X_1, l_1), (X_2, l_2), \ldots, (X_n, l_n)\}

where the X_i are d-dimensional random variables (the training samples) and l_i ∈ Ω for i = 1, 2, …, n. Let n_i be the number of training samples in class i; then Σ_{i=1}^{J} n_i = n = |T|. When using supervised classifiers, learning is based on the available training set.

Spectral classifiers use only the spectral information of the pixel to be classified. Spectral classifiers are split into two main categories: (a) parametric classifiers, if they assume the existence of an underlying probability distribution of the data, and (b) non-parametric classifiers, if they do not assume anything about the probability distribution.

When classifying a multi-spectral image with a spectral classifier such as the classical maximum likelihood (ML) classifier, the resulting thematic map is composed of many isolated labelled pixels inside patches of other classes, giving the overall impression of a 'noisy' classification. This negative effect is more evident when there is overlapping between the training sets in the spectral space (Cortijo et al. 1995). When this occurs, the traditional quadratic ML classifier and the linear classifiers are not able to obtain a high-accuracy classification because of the fixed-form decision boundaries they impose. We shall show that alternative spectral classifiers may improve the accuracy of the classifications, as they are able to define a wider set of decision boundaries between the clusters in the representation space, thus reducing the classification error rate. In this work we perform a comparative study of different spectral classifiers when the available training samples are highly overlapping in the representation space.


2. Parametric classifiers

The structure of the Bayes classifier is determined, basically, by the probability density functions (pdfs) p(X | ω_i). The objective in the construction of a supervised parametric classification rule is to characterize the pattern of each class in terms of its pdf. It is assumed that the form of that function is known, so only some parameters need to be estimated from the training set in order to characterize each class. The pdfs are usually multivariate Gaussian functions, so it is only necessary to estimate two sets of parameters for each class: the mean vector, m_i, and the covariance matrix, Σ_i. The maximum likelihood (ML) classification rule is a particular case of the Bayes rule in which the a priori probabilities verify p_i = 1/J for i = 1, 2, …, J. When the pdfs are multivariate Gaussian functions, the likelihood of an observation X belonging to class ω_i, denoted L_X^i, is defined by

    L_X^i = -\frac{1}{2} \ln(|\Sigma_i|) - \frac{1}{2} (X - m_i)^T \Sigma_i^{-1} (X - m_i)        (1)

The maximum likelihood classification rule can be written as follows:

    d(X) = \omega_c \quad \text{if} \quad L_X^c = \max_{i=1,\ldots,J} \{ L_X^i \}        (2)

which assigns class ω_c to observation X. The ML classifier imposes quadratic decision boundaries between the clusters of samples in the representation space. If a common covariance matrix is assumed, Σ_i = Σ for i = 1, 2, …, J, the quadratic classifier reduces to a linear classifier in which the decision boundaries have a linear form. The quadratic classifier is more sensitive to violation of the Gaussian assumption on the pdfs than the linear classifier (Lachenbruch 1979), and the training set size required by a quadratic classifier is larger than that required by a linear classifier. It is well known that the Hughes effect (Hughes 1968) arises in high-dimensionality data when the training set size is not large enough to estimate the covariance matrices properly. When adopting a quadratic or a linear classifier we are imposing an extreme degree of adjustment of the decision boundaries on the training samples. When the training samples are highly overlapped in the representation space these are not good choices, and it is plausible to allow a wider degree of adjustment, that is, a wider set of possible decision boundaries. This may be achieved using the regularized discriminant analysis (RDA) classifier proposed by Friedman (1989). RDA allows a wide family of parametric classifiers, including the quadratic ML classifier and linear classifiers as particular cases. The estimation of the covariance matrices is performed by a regularization process determined by two parameters, the values of which are customized to individual situations by jointly minimizing a sample-based estimate of future misclassification risk. The joint optimization of these parameters minimizes the cross-validation error, so this technique is very accurate for problems in which the training set size is low. As Friedman (1989) points out, a quadratic classification problem is ill-posed if n_i is not considerably larger than d, the dimensionality of the data. A possible solution is to adopt the common covariance matrix (linear classification) in order to reduce the number of parameters to be estimated. Friedman proposes a less limited set of alternatives, including both quadratic and linear classification, by means of the two-step regularization process described below.
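For concreteness, a minimal NumPy sketch of the quadratic ML rule of equations (1) and (2), the baseline that the regularized classifier generalizes; the array layout (pixels in rows, one mean vector and covariance matrix per class) is our own assumption and the code is an illustration, not part of the original study.

```python
import numpy as np

def ml_classify(X, means, covs):
    """Quadratic ML rule of equations (1)-(2).

    X     : (n, d) array of pixel vectors to classify.
    means : list of J mean vectors m_i, each of shape (d,).
    covs  : list of J covariance matrices Sigma_i, each of shape (d, d).
    Returns, for every pixel, the index of the class maximizing L_X^i.
    """
    n, J = X.shape[0], len(means)
    L = np.empty((n, J))
    for i, (m, S) in enumerate(zip(means, covs)):
        diff = X - m                                   # (n, d)
        S_inv = np.linalg.inv(S)
        # Mahalanobis term (X - m_i)^T Sigma_i^{-1} (X - m_i), one value per pixel
        maha = np.einsum('nd,de,ne->n', diff, S_inv, diff)
        L[:, i] = -0.5 * np.log(np.linalg.det(S)) - 0.5 * maha   # equation (1)
    return np.argmax(L, axis=1)                        # equation (2)

# The class statistics would be estimated from the training set, e.g.
#   m_i = X_train[labels == i].mean(axis=0)
#   S_i = np.cov(X_train[labels == i], rowvar=False)
```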


First it is possible to parametrize the sample-based estimate of the covariance matrix for class i, \hat{\Sigma}_i(\lambda), where

    \hat{\Sigma}_i(\lambda) = \frac{S_i(\lambda)}{n_i(\lambda)} = \frac{(1 - \lambda) S_i + \lambda S}{(1 - \lambda) n_i + \lambda N}        (3)

where S_i is the scatter matrix of the training samples in class i, S is the pooled scatter matrix and N is the total number of training samples. The regularization parameter λ takes on values 0 ≤ λ ≤ 1 and might be interpreted as the degree of shrinkage of the individual class covariance matrix estimates toward the pooled estimate (Friedman 1989). It can easily be seen that if λ = 0 we obtain a quadratic classifier and if λ = 1 we obtain a linear classifier. Values 0 < λ < 1 represent less severe degrees of regularization than linear classification. The regularization provided by equation (3) is still limited for two reasons: (a) it might not provide enough regularization, and (b) biasing the sample class covariance matrices toward the common covariance matrix may not be the most effective way to shrink them (Friedman 1989). To solve this problem Friedman adds a second step to the regularization process by means of a new parameter, γ, which also takes on values 0 ≤ γ ≤ 1, in such a way that

    \hat{\Sigma}_i(\lambda, \gamma) = (1 - \gamma)\, \hat{\Sigma}_i(\lambda) + \gamma\, c_i I        (4)

where

    c_i = \mathrm{tr}[\hat{\Sigma}_i(\lambda)] / d        (5)

and \hat{\Sigma}_i(\lambda) is given in equation (3). For a given λ, the additional parameter γ controls shrinkage toward a multiple of the identity matrix. In this way it is possible to counteract the biasing inherent in sample-based estimation of eigenvalues (Friedman 1989). Assuming equal a priori probabilities, let D_i(X) be the regularized discriminant value for sample X in class i. Its expression is defined in equation (6):

    D_i(X) = \ln(|\hat{\Sigma}_i(\lambda, \gamma)|) + (X - m_i)^T \hat{\Sigma}_i(\lambda, \gamma)^{-1} (X - m_i)        (6)

Now we can write the regularized discriminant classification rule as follows:

    d(X) = \omega_c \quad \text{if} \quad D_c(X) = \min_{i=1,\ldots,J} \{ D_i(X) \}        (7)

which assigns class ω_c to observation X. Friedman refers to this approach as regularized discriminant analysis (RDA). This approach provides a rich class of regularization alternatives. It is easily seen that the four corners of the plane defined by λ and γ represent well-known classifiers: (λ = 0, γ = 0) represents the quadratic classifier, (λ = 1, γ = 0) represents the linear classifier, (λ = 1, γ = 1) corresponds to the Euclidean classifier and (λ = 0, γ = 1) represents a weighted nearest-means classifier with the class weights being inversely proportional to the average variance of the measurement variables within the class. The optimum values for λ and γ are not likely to be known in advance and they must be estimated from the training set. Friedman proposes selecting the pair which jointly minimizes the cross-validation estimate of future misclassification risk. This gives rise to a two-parameter numerical minimization problem. The straightforward strategy is to define a grid of 25 to 50 points on the plane defined by λ and γ (0 ≤ λ, γ ≤ 1) (Friedman 1989).
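A minimal sketch of this regularization and grid search, again under assumed array layouts; cv_error is a stand-in for whatever cross-validated misclassification estimate is used and is not defined here.

```python
import numpy as np

def rda_covariances(class_scatters, class_counts, lam, gamma):
    """Regularized covariance estimates of equations (3)-(5).

    class_scatters[i] : S_i, the scatter matrix of the class-i training samples.
    class_counts[i]   : n_i, the number of class-i training samples.
    """
    S_pool = sum(class_scatters)                       # pooled scatter S
    N = sum(class_counts)
    d = S_pool.shape[0]
    covs = []
    for S_i, n_i in zip(class_scatters, class_counts):
        # equation (3): shrink the class scatter toward the pooled scatter
        Sigma = ((1 - lam) * S_i + lam * S_pool) / ((1 - lam) * n_i + lam * N)
        c_i = np.trace(Sigma) / d                      # equation (5)
        # equation (4): shrink further toward a multiple of the identity
        covs.append((1 - gamma) * Sigma + gamma * c_i * np.eye(d))
    return covs

# Grid search over 36 (lambda, gamma) pairs, as described below; cv_error() is an
# assumed helper returning the cross-validated misclassification estimate.
# grid = [(l, g) for l in np.linspace(0, 1, 6) for g in np.linspace(0, 1, 6)]
# best_pair = min(grid, key=lambda pair: cv_error(*pair))
```

In practice the covariances returned by rda_covariances would be plugged into the discriminant values of equation (6), exactly as in the ML case but with \hat{\Sigma}_i(\lambda, \gamma) in place of \Sigma_i.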


RDA has been shown to be useful for classifying both low-dimensional and high-dimensional images (Cortijo 1995), but its computational effort may sometimes be prohibitive for high-dimensional data (Cortijo 1995). To circumvent this problem Friedman proposes a computationally fast method to estimate the optimum pair.

3. Non-parametric classifiers

In most classification applications the assumption that the forms of the underlying density functions are known is unrealistic. The common parametric forms rarely fit the densities actually encountered in practice. For instance, the parametric models handle unimodal densities, whereas many practical problems present multi-modal densities (Duda and Hart 1973). The only available information is the training set, and the classification rules must be built from it alone, with no additional assumptions.

We can find a wide variety of spectral non-parametric classifiers. They may be summarized in three main categories. The first consists of non-parametric statistical techniques to estimate p(X | ω_i) (nearest neighbour estimation techniques (Duda and Hart 1973) and kernel estimation techniques, e.g., Parzen (1962) and Devijver and Kittler (1982)). The second consists of directly estimating the a posteriori probability p(ω_i | X) (nearest neighbour classification rules (Devijver and Kittler 1982)), and the third consists of recursively splitting the representation space by means of binary questions related to the values of the variables involved (classification trees (Breiman et al. 1984)).

Among classification-tree techniques we have tested CART (Breiman et al. 1984) because of its wide use. Learning consists, basically, of splitting the representation space hierarchically by means of the training samples, in such a way that the terminal nodes are labelled and each remaining node has an associated boolean question in which, in the simplest case, only one variable is involved. This splitting strategy yields a binary tree. Classification of a new and independent sample is performed by following the path determined by the answers to the questions associated with the non-terminal nodes, from the root node until a labelled node is reached: if the answer is true the process is repeated in the left descendent node, and if it is false it is repeated in the right descendent node. When the node reached is a labelled node, the sample is labelled with the label associated with that node. This process is repeated for each sample to be classified.

3.1. Nearest neighbour classifiers
One of the most popular and widely used non-parametric classification rules is the k nearest neighbour (k-NN) rule. In its simplest formulation, given a training set T, a metric d and a sample X to classify, the k-NN rule searches for the k nearest neighbours of X in T and assigns to X the most populated class among the selected neighbours. If k = 1, the k-NN rule is known as the nearest-neighbour rule or 1-NN rule. In this work we denote by 1-NN the 1-NN classifier that uses the complete (as given by experts) training set to search for the nearest neighbour.

The requirement of a large training set to ensure the convergence of the k-NN rule (Devijver and Kittler 1982) is the main drawback of the nearest neighbour rules in practical problems. Moreover, there are two additional drawbacks in the application of these rules: firstly, they are negatively influenced by incorrectly labelled training samples (noisy samples or outliers) and secondly, the computational complexity associated with the search for the nearest neighbour(s) in T can be O(n²) or higher.
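A brute-force sketch of the k-NN rule defined above (Euclidean metric, NumPy arrays assumed); the fast search schemes and reduced reference sets discussed next exist precisely to avoid this exhaustive scan.

```python
import numpy as np
from collections import Counter

def knn_classify(x, ref_samples, ref_labels, k=1):
    """Assign x to the most populated class among its k nearest neighbours
    in the reference set (k = 1 gives the 1-NN rule).

    ref_samples : (n, d) NumPy array; ref_labels : length-n NumPy array.
    """
    dists = np.linalg.norm(ref_samples - x, axis=1)    # Euclidean metric
    nearest = np.argsort(dists)[:k]
    return Counter(ref_labels[nearest]).most_common(1)[0][0]
```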


To circumvent these problems many different solutions have been proposed in the literature, such as fast searching algorithms (Vidal 1986), modified metrics (Short and Fukunaga 1981) and techniques to obtain a reduced and representative reference set, R, from T. The objective of the last approach is to search for the nearest neighbour(s) in R with an acceptable trade-off between the accuracy of the classification and the required computational effort (Devijver and Kittler 1982, Kohonen 1990, Geva and Sitte 1991, among others). This can be done in two ways: (a) by editing-condensing techniques (§3.1.1), in which case R ⊂ T, or (b) by adaptive learning techniques (§3.1.2), in which case there is no explicit relation between both sets.

3.1.1. Editing and condensing techniques
The aim of editing-condensing techniques is two-fold: improving the accuracy of the classification by removing samples located in overlapping acceptance regions (editing techniques) and decreasing the computational effort required to find the nearest neighbour(s) (condensing techniques). Editing techniques take as input the original training set and give as output a subset of it. Condensing techniques usually take as input the edited training set and give as output a subset of the previously edited set. R is a reduced (sometimes drastically) and representative version of T. The joint application of these techniques improves the trade-off between the accuracy of the 1-NN classification and the computational effort required for that classification.

The multiedit algorithm is the most widely used editing algorithm due to its convergence properties (Devijver 1986). If the training set size is not large enough it is preferable to use the cross-validation editing algorithm proposed by Ferri (Ferri and Vidal 1992b). The multiedit algorithm is based on the twin paradigms of diffusion and confusion, which are known to effectively remove unwanted statistical dependence. We will denote by T_M the multiedited training set.

The multiedit algorithm may be written as follows (Devijver and Kittler 1982):

Multiedit algorithm
1. Initialization. Let the initial multiedited set T_M be the training set T, that is, T_M ← T.
2. Diffusion. Make a random partition of the available data, T_M, into N subsets (N ≥ 3): T_M^1, T_M^2, …, T_M^N.
3. Classification. Classify the samples in T_M^i using 1-NN with T_M^{(i mod N)+1} as reference set, i = 1, …, N.
4. Editing. Discard all the samples that were misclassified at step 3.
5. Confusion. Pool all the remaining data to constitute a new set T_M.
6. Termination. If the last I iterations produced no editing, then exit with the final set T_M; else go to step 2.

The rotation of indices in step 3 (classification) is intended to avoid two-way interaction between any two subsets. It is important to explain the meaning of step 6 (termination). Due to the randomization in step 2 (diffusion), the fact that one iteration does not edit any sample is not evidence that the current sample set T_M is forever immune to further editing. That is the reason why I must be greater than or equal to 2. This algorithm discards samples located inside or close to clusters of different classes and, as a consequence, gives homogeneous clusters of samples as output. Each cluster spans the Bayes acceptance region corresponding to its class. It is not possible to fix |T_M| prior to editing.
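A compact sketch of the multiedit loop, written in the most direct (unoptimized) way and assuming NumPy arrays for the samples and labels.

```python
import numpy as np

def multiedit(X, y, N=3, I=2, rng=np.random.default_rng(0)):
    """MULTIEDIT: repeatedly discard samples misclassified by 1-NN across a
    rotating random partition; stop after I consecutive iterations with no editing."""
    stable = 0
    while stable < I:
        blocks = np.array_split(rng.permutation(len(X)), N)   # diffusion
        keep = []
        for i in range(N):
            ref = blocks[(i + 1) % N]                         # rotated reference block
            for t in blocks[i]:
                d = np.linalg.norm(X[ref] - X[t], axis=1)
                if y[ref][np.argmin(d)] == y[t]:              # 1-NN classification
                    keep.append(t)                            # editing: keep the hits only
        stable = stable + 1 if len(keep) == len(X) else 0
        X, y = X[keep], y[keep]                               # confusion: pool the survivors
    return X, y
```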


The number of samples retained by the multiedit algorithm depends on the statistical structure of the problem: it discards numerous samples when the training sets are highly overlapping. Once the training set has been edited into T_M, the next step is to classify using 1-NN with R = T_M as reference set. We will denote that classifier by 1-NN (M). The 1-NN (M) classifier implements a piecewise-linear decision boundary which is the sample-based approximation to the Bayes-optimal decision boundary.

Hart's condensing algorithm is the most popular and simplest condensing algorithm (Hart 1968). It requires that the input set be previously edited. If we assume that the input set has been previously multiedited, we will denote the multiedited and condensed training set by T_MC. Let us fix some notation used in the expression of Hart's condensing algorithm. Let G be the set that contains the samples not yet classified and let |G| be the cardinality of G whenever step 2 of the algorithm is entered. Hart's condensing algorithm may be written as follows (Devijver and Kittler 1982):

Hart's condensing algorithm

1. Select a sample from T_M at random, place it into T_MC and place all the others into G.
2. For i = 1, …, |G|:
   (a) Use 1-NN with the current contents of T_MC to classify the i-th sample from G.
   (b) If it is classified correctly, the sample is returned to G; otherwise it is placed in T_MC.
3. If one complete pass is made through step 2 with no transfer from G to T_MC, or G is exhausted, then terminate; else go to step 2.

This algorithm selects for each class a set of samples that approximate the Bayes-optimal decision boundary, discarding those samples which do not contribute to defining it. The sample set T_MC is not optimal in the sense that it is not always the minimal subset, and it is highly dependent on the sample selected at random in step 1; that is, different trials give different condensed sets. The selection of the optimal condensed set is a computationally complex problem (Devijver and Kittler 1982), but Hart's condensing algorithm is considered a good approximation to the minimal set of samples. It is not possible to fix |T_MC| prior to condensing. The number of samples retained by Hart's condensing algorithm depends on many factors such as the initial sample retained, the spectral dispersion of the clusters, etc. Once the reference set T_M has been condensed, the next step is to classify using 1-NN with R = T_MC as reference set. We will denote that classifier by 1-NN (MC).
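A sketch of Hart's condensing pass under the same assumptions (brute-force 1-NN, NumPy arrays); the input would normally be the multiedited set T_M.

```python
import numpy as np

def hart_condense(X, y, rng=np.random.default_rng(0)):
    """Hart (1968): keep only the samples that the current condensed set
    cannot already classify correctly with 1-NN. X, y: NumPy arrays."""
    order = rng.permutation(len(X))
    store = [order[0]]                          # step 1: one random seed sample
    grabbag = list(order[1:])
    changed = True
    while changed and grabbag:                  # step 3: repeat complete passes
        changed = False
        remaining = []
        for idx in grabbag:                     # step 2: classify each sample in G
            d = np.linalg.norm(X[store] - X[idx], axis=1)
            if y[store][np.argmin(d)] == y[idx]:
                remaining.append(idx)           # correctly classified: back to G
            else:
                store.append(idx)               # misclassified: move to T_MC
                changed = True
        grabbag = remaining
    return X[store], y[store]
```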


3.1.2. Adaptive learning techniques
A different approach consists of adaptive learning techniques. Adaptive learning is a powerful alternative to classical editing-condensing techniques as it allows us to fix the reference set size. Now the reference set is not usually a subset of the training set. Adaptive learning algorithms can be tuned by means of a set of parameters in such a way that it is possible to supervise the learning process directly. The training samples are used to tune a fixed number of codebooks or prototypes and the reference set is called the codebook set or the prototype set. Adaptive learning is performed in two sequential phases: initialization and learning. The prototype set is initially a subset of the training set and the values of the prototypes are updated in an iterative learning process.

The initial values of the prototypes are determined in the initialization phase. When this phase concludes it is verified that R ⊂ T and |R| = p, where p is the specified number of prototypes; p is the only parameter involved in this phase. Let p_i be the number of prototypes associated with class i; then Σ_{i=1}^{J} p_i = p = |R|. The number of prototypes associated with each class, p_i, is calculated from p: it can be proportional to n_i (considering different a priori probabilities) or it can be the same for all classes (considering equal a priori probabilities).

In the learning phase the values of the prototypes are updated according to the training samples used for learning. When a sample is used to update R, a learning step is performed. The parameters involved in this phase depend on the updating method used. There are, however, two common parameters: the number of learning steps, r, and the initial value of the gain function, α(0). r imposes the number of learning iterations and α(0) imposes the amount of adjustment of the codebooks to the training samples (Kohonen 1990). The function α(·) is usually a linear function monotonically decreasing with the learning steps.

Two different adaptive learning approaches can be adopted: the first is based on the learning vector quantization (LVQ) methods proposed by Kohonen (1990) and the second is the decision surface mapping (DSM) algorithm proposed by Geva and Sitte (1991).

In LVQ learning the locations of the prototypes in the representation space approximate the underlying probability densities using a fixed number of prototypes. Kohonen (1990) proposes some updating strategies based on learning vector quantization called LVQ-1, LVQ-2 and LVQ-3. More recently he has proposed the optimized LVQ-1 (OLVQ-1) to initialize the prototype set (Kohonen 1992). LVQ-1 modifies only the closest prototype, whereas LVQ-2 and LVQ-3 modify two prototypes. LVQ-2 modifies the nearest and the next-to-nearest neighbours whenever the nearest neighbour is of a different class, and the next-nearest neighbour is of the same class, as the training sample. This updating strategy requires that the training sample fall within a window, determined by the relative distances of the training sample from the prototypes. Finally, LVQ-3 is a mixture of the former updating strategies. Kohonen (1989) shows that the 1-NN error using the LVQ-learned training set as a reference set tends to the Bayes error. As has been pointed out by several authors (e.g., Kohonen et al. 1992 and Ferri and Vidal 1992a), the results obtained by the different LVQ learning strategies are similar and it is difficult to predict which will be the best for a given problem. Considering that the LVQ-1 learning method is the simplest in terms of the number of parameters involved, we have adopted LVQ-1 in this work. In the original formulation |R| = p. It is possible, however, to modify the original formulation in such a way that the number of prototypes is not fixed a priori and can be modified throughout learning; this approach, a constructive design of LVQ classifiers, was developed by Pérez and Vidal (1993).

LVQ-1 acts only on the prototype closest to the training sample in each learning step, by using a punishment-reward criterion.
The amount of the correction depends on the gain value for the learning step in course, being greater in the first steps than in the last ones. Given that |R| = p, the set of prototypes R at iteration t can be written as R = {m_i(t) | i = 1, 2, …, p}. Let X(t) (X(t) ∈ T) be the training sample considered at iteration t and let m_c(t) (m_c(t) ∈ R) be the prototype nearest to X(t) in R. The LVQ-1 updating can be written as follows (Kohonen 1990):


if the label of m_c is the same as the label of X(t) then

    m_c(t+1) = m_c(t) + \alpha(t)\,(X(t) - m_c(t))        {reward}        (8)

if the label of m_c is not the same as the label of X(t) then

    m_c(t+1) = m_c(t) - \alpha(t)\,(X(t) - m_c(t))        {punishment}        (9)

    m_i(t+1) = m_i(t)    for i ≠ c, 1 ≤ i ≤ p        (10)

This updating strategy tends to reduce the density of prototypes around the decision boundaries; in other words, it tends to approximate the underlying probability distributions of the classes. This is due to the negative sign in the punishment expression. In each training step, t, a training sample is presented for learning, until r training steps have been performed. If r is greater than the number of available training samples, the training set is visited again, selecting additional training samples at random (Kohonen et al. 1992). When r is too large there is an overlearning effect (Cortijo and Pérez de la Blanca 1996b). It is an anomalous effect because the prototypes fit the actual training set extremely well and it is not possible to extrapolate properly the results obtained to another training set selected from the same population. It is not possible to fix an optimum value for r prior to learning. As a general rule, Kohonen (1990) points out that it is enough to learn with a value between 50 × p and 200 × p, where p is the number of prototypes selected in the initialization phase.

The function α(·) verifies 0 < α(t) < 1. It can be a constant function or monotonically decreasing with t, the latter being the best choice (Kohonen 1990). This function imposes the degree of fitting of the prototypes to the training samples: the lower the value of α(t), the finer the fitting. That is the reason why the rough adjustment is done in the first steps and the finest one in the last ones. It is recommended to fix a low value for α(0), much lower than 0.1. As a general criterion Kohonen (1990) proposes a value near 0.02. We must note, however, that the most accurate values depend on the problem at hand.
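A minimal sketch of the LVQ-1 loop of equations (8)-(10), with a linearly decreasing gain; how the prototypes and their labels are initialized (here they are simply passed in) is one of the choices discussed above, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def lvq1(X, y, protos, proto_labels, steps, alpha0=0.02,
         rng=np.random.default_rng(0)):
    """LVQ-1: move the nearest prototype toward (reward) or away from
    (punishment) each randomly drawn training sample."""
    protos = protos.astype(float).copy()
    for t in range(steps):
        alpha = alpha0 * (1.0 - t / steps)            # linearly decreasing gain
        j = rng.integers(len(X))                      # random training sample X(t)
        c = np.argmin(np.linalg.norm(protos - X[j], axis=1))
        if proto_labels[c] == y[j]:
            protos[c] += alpha * (X[j] - protos[c])   # reward, equation (8)
        else:
            protos[c] -= alpha * (X[j] - protos[c])   # punishment, equation (9)
    return protos
```

With the notation of this section, steps plays the role of r and alpha0 the role of α(0).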

The aim of DSM is to place the prototypes so as to approximate the decision boundaries in the representation space. In this case it is required that the training set be edited previously. The initial prototype set is built by picking p samples at random from T_E, where T_E must be a set previously edited from T. Let m_c(t) (m_c(t) ∈ R) be the prototype nearest to X(t) in R and let m_w(t) (m_w(t) ∈ R) be the nearest prototype to X(t) that is from the same class as X(t). The DSM updating may be written as follows (Geva and Sitte 1991):

if the label of m_c is not the same as the label of X(t) then

    m_w(t+1) = m_w(t) + \alpha(t)\,(X(t) - m_w(t))        {reward}
    m_c(t+1) = m_c(t) - \alpha(t)\,(X(t) - m_c(t))        {punishment}
    m_i(t+1) = m_i(t)    for i ≠ c, i ≠ w, 1 ≤ i ≤ p        (11)

if the label of m_c is the same as the label of X(t) then

    m_i(t+1) = m_i(t)    for 1 ≤ i ≤ p        (12)


DSM updating acts on pairs of prototypes located on both sides of the boundaries, and only if there is a misclassification. The probability of misclassification is high in locations close to the decision boundaries. Geva and Sitte propose for α(·) a function decreasing linearly with t, with α(0) = 0.3 or lower. This value, however, depends on the problem at hand (Geva and Sitte 1991). It is possible to modify the original formulation, as was pointed out for the LVQ updating, in such a way that the prototype set is built incrementally. This approach was developed by Pérez and Vidal (1993).

In both LVQ-1 and DSM, the 1-NN classifier is used with the codebook set as the reference set. We will denote by T_DSM the DSM-learned training set (R = T_DSM) and by 1-NN (DSM) the 1-NN classifier that uses T_DSM as the reference set, and by T_LVQ-1 the LVQ-1-learned training set (R = T_LVQ-1) and by 1-NN (LVQ-1) the 1-NN classifier that uses T_LVQ-1 as the reference set.
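For completeness, a sketch of the DSM counterpart of the previous LVQ-1 code, following equations (11) and (12); it assumes at least one prototype per class and, as above, is an illustration rather than the authors' implementation.

```python
import numpy as np

def dsm(X, y, protos, proto_labels, steps, alpha0=0.3,
        rng=np.random.default_rng(0)):
    """Decision surface mapping: on a misclassification, attract the nearest
    prototype of the correct class and repel the offending nearest prototype.
    Assumes at least one prototype per class (proto_labels is a NumPy array)."""
    protos = protos.astype(float).copy()
    for t in range(steps):
        alpha = alpha0 * (1.0 - t / steps)
        j = rng.integers(len(X))
        d = np.linalg.norm(protos - X[j], axis=1)
        c = np.argmin(d)                              # nearest prototype m_c
        if proto_labels[c] != y[j]:                   # update only when misclassified
            same = np.where(proto_labels == y[j])[0]
            w = same[np.argmin(d[same])]              # nearest same-class prototype m_w
            protos[w] += alpha * (X[j] - protos[w])   # reward m_w, equation (11)
            protos[c] -= alpha * (X[j] - protos[c])   # punish m_c, equation (11)
    return protos
```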

4. Data

The data used to test the performance of the classifiers are two Landsat images of Greenland landscapes and their corresponding training sets. The first image is shown in figure 1. It is a Landsat-2 MSS image of the Igaliko region, South Greenland; the digital images corresponding to MSS channels 4, 5, 6 and 7 are shown from top to bottom and from left to right. The second image is shown in figure 2. It is a Landsat-5 TM image of the Ymer Ø region, East Greenland; the digital images corresponding to TM channels 1, 2, 3, 4, 5 and 7 are shown from top to bottom and from left to right. Each image consists of 512 pixels by 512 pixels with 256 grey levels.

Figure 1. Landsat-2 MSS image. Igaliko, Greenland. MSS-2 bands 4 to 7.


Figure 2. Landsat-5 TM image. Ymer Ø, Greenland. TM-5 bands 1 to 5 and 7.

For classification we have used all the available information: the four MSS bands in Igaliko and the six TM bands in Ymer Ø. The training sets have been selected by experts (Conradsen et al. 1987) and their spectral distributions represent different problems (Cortijo 1995).

Training information relating to the Igaliko classification problem can be found in table 1. In this problem, the experts have selected 42 796 training samples distributed in five informational classes.

Table 1. Training set information. Igaliko.

Class   Informational class   Training areas   Training samples
1       Intrusives                  1              5725
2       Water                       1             11 372
3       Barren granite              1              8231
4       Plain granite               8              4191
5       Dolerite                    1             13 277
Sum                                12            |T| = 42 796


Figure 3. The training set for Igaliko. (a) Image showing the spatial distribution of the training areas. This image may be seen as a mask to be located over any band in figure 1. (b) Spectral distribution of the training samples.

In order to show that the training samples are highly overlapped in the representation space, we show in figure 3(b) the spectral distribution of the training samples in a two-dimensional representation space spanned by bands 2 and 4 of the image (the MSS-5 and MSS-7 channels). This is the two-band combination that presents the maximum Jeffreys-Matusita average distance (Swain 1978). Considering that we are plotting 42 796 training samples in the scattergram given in figure 3(b), we can see that the five ideal clusters are highly overlapped.

Training information relating to the Ymer Ø classification problem can be found in table 2. In this case we have 20 classes to discriminate and the training set size is 12 574 samples.

Table 2. Training set information. Ymer Ø.

Class   Informational class                                 Training areas   Training samples
1       Moraine                                                   3              3698
2       Delta + young alluvial fans                               2              1235
3       Old alluvial fans                                         1              6071
4       Quartzites/shales                                         1               279
5       Quartzites                                                1               714
6       Shales (black)                                            1               709
7       Quartzites (white)                                        1               255
8       Shale (red/brown), free of vegetation                     1               493
9       Quartzites (red only)                                     1                73
10      Shales (red)                                              2               266
11      Dolomite (white/yellow)                                   2               127
12      Limestone (black)                                         2               969
13      Shale (red/yellow), sunlit                                1               208
14      Shale (red/yellow), partly covered by vegetation          1               536
15      Dolomite (white)                                          1               323
16      Quartz. (yellow/brown), sunlit                            1               273
17      Quartz. (yellow/brown), shaded areas                      1               393
18      Limestone (black)                                         2               673
19      Limestone (black/grey)                                    2               389
20      Glacial deposits                                          2               354
Sum                                                              29            |T| = 12 574


Figure 4. The training set for Ymer Ø. (a) Image showing the spatial distribution of the training areas. This image may be seen as a mask to be located over any band in figure 2. (b) Spectral distribution of the training samples.

In figure 4(b) we show the spectral distribution of the training samples considering bands 4 and 6 (the TM-4 and TM-7 bands). As before, this is the two-band combination that presents the maximum Jeffreys-Matusita average distance (Swain 1978). We can see that, again, there is high overlapping between the 20 ideal clusters.
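For reference, a band pair such as the ones used in figures 3(b) and 4(b) can be ranked as sketched below. The sketch uses the Gaussian (Bhattacharyya-based) form of the Jeffreys-Matusita distance in the 0-2 convention, which is one common formulation; it is our own illustration and not necessarily the exact procedure used for the figures above.

```python
import numpy as np
from itertools import combinations

def jm_distance(m1, S1, m2, S2):
    """Jeffreys-Matusita distance between two Gaussian classes, computed
    through the Bhattacharyya distance B; ranges from 0 to 2."""
    S = 0.5 * (S1 + S2)
    diff = m1 - m2
    B = (0.125 * diff @ np.linalg.inv(S) @ diff
         + 0.5 * np.log(np.linalg.det(S)
                        / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return 2.0 * (1.0 - np.exp(-B))

def best_band_pair(samples_per_class, n_bands):
    """Return the two-band combination with maximum average pairwise JM
    distance; samples_per_class is a list of (n_i, n_bands) arrays."""
    def avg_jm(bands):
        cols = list(bands)
        stats = [(s[:, cols].mean(axis=0), np.cov(s[:, cols], rowvar=False))
                 for s in samples_per_class]
        return np.mean([jm_distance(*stats[i], *stats[j])
                        for i, j in combinations(range(len(stats)), 2)])
    return max(combinations(range(n_bands), 2), key=avg_jm)
```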

5. Methodology

In this work we have adopted test sample estimation to measure the accuracy of the classifications. The training set T is split into two disjoint sets: T^l (learning set) and T^t (test set). T^l has been built by selecting at random 2/3 of the available training samples; the remainder are placed into T^t. Only samples in T^l are used to construct the classifier. Then samples in T^t are used to estimate the classification error. Although this approach reduces the effective learning set size, if the training set size is large, as in the two classification problems considered here, this is a minor difficulty, and the test sample estimator is honest and efficient (Breiman et al. 1984). In the Igaliko dataset we have 42 796 training samples available, so we use 28 441 samples for learning and 14 355 samples for testing. In the Ymer Ø dataset we have 12 574 training samples available, distributed in the following way: 8560 samples for learning and 4014 samples for testing. The accuracies of the classifications we have performed are shown in table 5.

The estimates of the parameters involved in the parametric classifiers have been computed from T^l. We have assumed a Gaussian distribution for the probability density functions. The steps performed for each problem can be summarized as follows: firstly, the mean vectors and the covariance matrix estimates are computed from T^l (learning phase); secondly, the image is classified using these estimates (classification phase); and finally, the accuracy of the classification is computed using T^t (test phase).

For RDA learning we have sampled two grids of (λ, γ) pairs, consisting of 121 and 36 pairs, for both datasets. This means that when sampling 121 pairs we have considered the grid (0, 0), (0, 0.1), (0, 0.2), …, (0, 1), (0.1, 0), …, (1, 1), and when sampling 36 pairs we have considered the grid (0, 0), (0, 0.2), (0, 0.4), …, (0, 1), (0.2, 0), …, (1, 1), which is included in the former. In Igaliko, the estimated pairs were (0.5, 0.2) for 121 pairs and (0.4, 0.2) for 36 pairs; this means a medium regularization degree. In Ymer Ø, the estimated pairs were (0.7, 0.0) for


121 pairs and (0.8, 0.0) for 36 pairs; this means a high regularization degree. The computational cost of testing a pair is high and the accuracies of the RDA classifications performed with 36 and 121 pairs are similar, so it is preferable to sample 36 pairs, which is a value in the range (25 to 50) proposed by Friedman (1989). The accuracies of the RDA classifications reported in this work are based on the 36-point grid.

We show the methodology for each non-parametric classifier separately, as they are too different to be presented jointly. In each case we differentiate between the learning, classification and test phases.

Learning in CART consists of constructing the classification tree from T^l. We have used the standard parameters in CART: the splitting function is the Gini diversity index and we have also adopted cost-complexity pruning using cross-validation with 10 sets (Breiman et al. 1984). The splitting function should depend on the problem at hand, but the authors point out that experience demonstrates that the classifier is not very sensitive to this selection (Breiman et al. 1984). Once the classification tree is available, it is possible to classify the image and test the classification using T^t. A sketch of this configuration is given below.

The nearest-neighbour classifiers used in this work differ in the reference set used to search for the nearest neighbour; in other words, we have used a different learning strategy in each case. The 1-NN classifier is the simplest one, as its learning phase is trivial: R = T^l. The reference sets for the remaining nearest-neighbour classifiers are constructed with different learning algorithms.

The editing-condensing algorithms we have used are the multiedit algorithm (Devijver and Kittler 1982) for editing and Hart's condensing algorithm (Hart 1968) for condensing. The multiedit algorithm takes as input T^l and gives as output an edited set, T^l_M. The values of the parameters involved were N = 3 and I = 2. From our experience we can ensure that higher values do not significantly improve the results obtained in this paper, whereas the computational effort increases drastically (Cortijo 1995), so these values are generally used. The multiedit algorithm needed 47 iterations in Igaliko and 54 in Ymer Ø. Hart's condensing takes as input the previously edited learning set, T^l_M, and gives as output an edited-condensed set, T^l_MC. This algorithm does not need additional parameters. Condensing needed 5 iterations in Igaliko and 4 in Ymer Ø. As was pointed out in §3.1.1, neither |T^l_M| nor |T^l_MC| can be fixed prior to learning. It is expected that multiedit will discard numerous samples in both problems because of the highly overlapping training sets. This is confirmed in tables 3 and 4, where we show the editing and condensing information for the two datasets. We must note that the condensed set size is very low in both cases, so many classes are represented with only a few training samples.
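The CART configuration described above (Gini splitting with cost-complexity pruning selected by 10-fold cross-validation) can be approximated with scikit-learn's DecisionTreeClassifier; this is a stand-in for, not a reproduction of, the original CART implementation used in the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def fit_cart(X_learn, y_learn):
    """Grow a Gini tree, then choose the cost-complexity pruning level
    (ccp_alpha) that maximizes 10-fold cross-validated accuracy."""
    path = DecisionTreeClassifier(criterion='gini').cost_complexity_pruning_path(
        X_learn, y_learn)
    best_alpha, best_score = 0.0, -np.inf
    for alpha in np.unique(np.clip(path.ccp_alphas, 0, None)):
        tree = DecisionTreeClassifier(criterion='gini', ccp_alpha=alpha)
        score = cross_val_score(tree, X_learn, y_learn, cv=10).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(criterion='gini',
                                  ccp_alpha=best_alpha).fit(X_learn, y_learn)
```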

Table 3. Learning, edited and edited-condensed set sizes. Igaliko.

Class      T^l      T^l_M    T^l_MC
1          3806       458        6
2          7442       267        4
3          5463       492       11
4          2796        34        2
5          8834       490       16
Sum       28 441     1741       39


Table 4. Learning, edited and edited-condensed set sizes. Ymer Ø.

Class      T^l      T^l_M    T^l_MC
1          2464      1453       15
2           843       704        4
3           413       158        5
4           196         0        0
5           480        19        1
6           476        17        2
7           178         0        0
8           344       227        5
9            52         0        0
10          187         0        0
11           94        34        1
12          656       183       10
13          144        78        4
14          369        61        2
15          227        72        3
16          192       110        6
17          274       148        4
18          453         0        0
19          271       112        4
20          247        22        2

The classification of the images is performed using these reference sets, with 1-NN (M) being the classifier which uses R = T^l_M as the reference set and 1-NN (MC) the classifier which uses R = T^l_MC as the reference set.

The adaptive learning algorithms adopted in this paper are LVQ-1 learning (Kohonen 1990) and DSM learning (Geva and Sitte 1991). Both algorithms allow us to fix the reference set size prior to learning. The number of prototypes, p, is a crucial parameter because further learning will be done over the set of prototypes selected in the initialization phase, so it determines the quality of learning and consequently the accuracy of the final 1-NN classification. This is the reason why it must be determined very precisely. As a general rule, it is not possible to determine a priori the optimum value for p, but it should be chosen so that overlearning in initialization is avoided. We say that there is overlearning in initialization (Cortijo and Pérez de la Blanca 1996b) when p_i > n_i, in other words, when the number of prototypes associated with a class is greater than the number of available training samples for that class. The parameters that control the learning phase depend on the actual learning algorithm. From our experience (Cortijo and Pérez de la Blanca 1996b) we have shown that they are not as relevant for the accuracy of the classifications as the prototype set size is. In any case, if quality is mandatory, their fine-tuning is required to obtain a good learning accuracy. In order to automate the selection of good values for the parameters involved in the adaptive learning algorithms, it is possible to estimate accurate values for them, including the reference set size, with two algorithms proposed by the authors (Cortijo and Pérez de la Blanca 1996b), in such a way that we obtain high-accuracy classifications. When applying those algorithms the estimated parameters were the following. In Igaliko, p̂ = 735 (p̂_i = 147 for i = 1, 2, …, 5), α̂(0) = 0.04 and r̂ = 91 000. In Ymer Ø,


p̂ = 840 (p̂_i = 42 for i = 1, 2, …, 20), α̂(0) = 0.1 and r̂ = 79 000 (Cortijo and Pérez de la Blanca 1996b). For DSM learning we have used the default values (Geva and Sitte 1991); the parameters for DSM learning in both cases were p_i = 50 and α(0) = 0.3.

LVQ-1 learning is performed from T^l and gives as output T^l_LVQ-1. DSM learning is performed from the edited set T^l_M and gives as output T^l_DSM. The classification phase is performed using these reference sets, with 1-NN (LVQ-1) being the classifier which uses R = T^l_LVQ-1 as the reference set and 1-NN (DSM) the classifier which uses R = T^l_DSM as the reference set.

6. Results and conclusions

In table 5 we show the accuracy of the classifications performed on the two datasets. We show, in columns, the name of the spectral classifier and the accuracy of the spectral classifications for Igaliko and Ymer Ø.

Table 5. Accuracy of the classifications.

Classifier       Igaliko (%)   Ymer Ø (%)
ML                  73.51         61.92
RDA                 78.97         64.29
CART                80.66         62.35
1-NN                74.61         78.50
1-NN (M)            77.76         65.67
1-NN (MC)           77.08         63.23
1-NN (DSM)          77.50         64.55
1-NN (LVQ-1)        79.07         68.18

We should point out that the ML classifier is the worst in terms of accuracy in both cases. The reason for its poor accuracy is the high overlapping between the clusters of training samples in the representation space in both problems; in this case the quadratic decision boundaries do not accurately define the Bayes acceptance regions corresponding to each class. The accuracy given by the RDA classifier is higher than the accuracy given by the ML classifier. This must always be true, as the ML classifier is a particular case of the RDA classifier. We can see that adopting the RDA classifier is always a better choice, in terms of accuracy, than the ML classifier.

CART gives the highest accuracy in the classification of Igaliko, whereas it is not so powerful in the classification of Ymer Ø, where it is only better than the ML classifier. This is due to the high overlapping and the numerous classes to be discriminated in Ymer Ø, which make it difficult to build an accurate classification tree. The computational effort associated with CART is mainly influenced by the learning step and is a function of the training set size, but we must note that it is a relatively low-cost step.

If we focus on nearest-neighbour classification rules, we must note the increase in accuracy obtained using 1-NN (M) with respect to 1-NN in Igaliko. This is the usual behaviour, as the multiedit algorithm discards samples located inside clusters of a different class; in this way, those anomalous samples do not contribute to the misclassification of new samples. This general tendency is not followed in the classification of Ymer Ø because the multiedit algorithm discards all the


samples belonging to five classes. The consequence is a decrease in the mean accuracy of the classification. In both cases the accuracy of the 1-NN (MC) classifications is slightly lower than the accuracy of the 1-NN (M) classifications, whereas the computational cost of classification decreases dramatically. This also holds for the 1-NN (DSM) classifications, as they are based on the multiedited sets. In both cases the 1-NN (LVQ-1) classifications present a good trade-off between the accuracy of the classification and the computational cost required. LVQ-1 learning is a quick process and, as an additional advantage, we can select the reference set size and tune the parameters involved in the learning step (Kohonen 1990), or they can be estimated automatically by using the two estimation algorithms proposed by the authors (Cortijo and Pérez de la Blanca 1996b).

We can conclude that in classification problems with high-overlapping training sets it is preferable to adopt a non-parametric classification rule, since it allows the decision boundaries to fit the Bayes acceptance regions more accurately than the parametric classifiers do. In any case, RDA classification is a powerful alternative to ML classification if the computational cost required for RDA learning is assumed.

Acknowledgments

This work has been supported by the Spanish Dirección General de Ciencia y Tecnología (DGCYT) under grant PB-92-0925-C02-01. We should also like to thank the IMM, Technical University of Denmark (Lyngby, Denmark) for providing the Landsat images used in this work.

References

Breiman, L., Friedman, J., Olshen, R., and Stone, C., 1984, Classification and Regression Trees (Belmont, California: Wadsworth).

Conradsen, K., Nielsen, A., Nielsen, B., Pedersen, J., and Thyrsted, T., 1987, The use of structural and spectral enhancement of remote sensing data in ore prospecting – East Greenland case study. IMM Technical Report (Lyngby: Institute of Mathematical Modelling, The Technical University of Denmark).

Cortijo, F. J., 1995, A Comparative Study of Classification Methods for Multispectral Images. PhD thesis, DECSAI, Universidad de Granada, Spain. Electronic edition available at http://decsai.ugr.es/~cb/entrada_tesis.html.

Cortijo, F. J., Pérez de la Blanca, N., Molina, R., and Abad, J., 1995, On the combination of non-parametric nearest neighbor classification and contextual correction. Pattern Recognition and Image Analysis, Proceedings of the VI Spanish Symposium on Pattern Recognition and Image Analysis, Córdoba, Spain, 3–7 April 1995, A. Calvo and R. Medina (eds), pp. 503–510.

Cortijo, F. J., and Pérez de la Blanca, N., 1996a, Image classification using non-parametric classifiers and contextual information. Proceedings of the XVIII ISPRS (International Society for Photogrammetry and Remote Sensing) Congress, Vienna, 9–19 July 1996, K. Kraus and P. Waldhäusl (eds), Vol. XXXI, Part B3, pp. 120–124.

Cortijo, F. J., and Pérez de la Blanca, N., 1996b, Automatic estimation of the LVQ-1 parameters. Applications to multispectral image classification. Proceedings of the 13th International Conference on Pattern Recognition, Vienna, 25–30 August 1996 (Washington, D.C.: I.E.E.E. Computer Society Press), pp. 346–350.

Devijver, P., 1986, On the editing rate of the multiedit algorithm. Pattern Recognition Letters, 4, 9–12.

Devijver, P., and Kittler, J., 1982, Pattern Recognition. A Statistical Approach (London: Prentice Hall International).

Duda, R. O., and Hart, P. E., 1973, Pattern Classification and Scene Analysis (New York: John Wiley & Sons).


Ferri, F., and Vidal, E., 1992a, Small sample size effects in the use of editing techniques. Proceedings of the 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 30 August–3 September 1992 (Washington, D.C.: I.E.E.E. Computer Society Press), pp. 607–610.

Ferri, F., and Vidal, E., 1992b, Comparative study of consistency-based and adaptative methods for selecting reduced sets of prototypes for nearest neighbors classifiers. In Advances in Pattern Recognition and Applications, edited by F. Casacuberta and A. Sanfeliu (Singapore: World Scientific), pp. 12–23.

Friedman, J., 1989, Regularized discriminant analysis. Journal of the American Statistical Association, 84, 165–175.

Geva, S., and Sitte, J., 1991, Adaptive nearest neighbor pattern classification. I.E.E.E. Transactions on Neural Networks, 2, 318–322.

Hart, P., 1968, The condensed nearest neighbour rule. I.E.E.E. Transactions on Information Theory, 14, 515–516.

Hughes, G. F., 1968, On the mean accuracy of statistical pattern recognizers. I.E.E.E. Transactions on Information Theory, 14, 55–63.

Kohonen, T., 1990, The self-organizing map. Proceedings of the I.E.E.E., 78, 1464–1480.

Kohonen, T., Kangas, J., Laaksonen, J., and Torkkola, K., 1992, LVQ-PAK: The learning vector quantization program package, version 2.1. LVQ Programming Team Technical Report (Helsinki: Laboratory of Computer and Information Science, Helsinki University of Technology).

Kohonen, T., 1995, Self-Organizing Maps (Berlin: Springer-Verlag).

Lachenbruch, P. A., 1979, Note on initial misclassification effects on the quadratic discriminant function. Technometrics, 21, 129–132.

Parzen, E., 1962, On estimation of probability density functions and mode. Annals of Mathematical Statistics, 33, 1065–1076.

Pérez, J. C., and Vidal, E., 1993, Constructive design of LVQ and DSM classifiers. In Lecture Notes in Computer Science, edited by J. Mira, J. Cabestany and A. Prieto (Berlin: Springer-Verlag), pp. 334–339.

Short, R. D., and Fukunaga, K., 1981, The optimal distance measure for nearest neighbor classification. I.E.E.E. Transactions on Information Theory, 27, 622–627.

Swain, P. H., 1978, Fundamentals of pattern recognition in remote sensing. In Remote Sensing: The Quantitative Approach, edited by P. H. Swain and S. M. Davis (New York: McGraw Hill), pp. 136–187.

Vidal, E., 1986, An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognition Letters, 4, 145–157.