
Finding Optimal Neural Networks for Land Use Classification

Horst Bischof and Ales Leonardis†

Pattern Recognition and Image Processing Group, Vienna University of Technology, Treitlstraße 3/1832, A-1040 Vienna, Austria
{bis,ales}[email protected]

Abstract

In this letter we present a fully automatic and computationally efficient algorithm, based on the Minimum Description Length (MDL) principle, for optimizing multilayer perceptron classifiers. We demonstrate the method on the problem of multispectral Landsat image classification and compare the results with a hand-designed multilayer perceptron and a Gaussian maximum likelihood classifier: our method produces better classification accuracy with a smaller number of hidden units.

1 Introduction

The number of applications of neural networks to remote sensing problems (especially classification) has been increasing steadily in the last few years (e.g., see [1, 2, 3, 4]). It has been demonstrated that in many cases neural networks perform considerably better than classical methods, e.g., [1]. However, to achieve this superior performance, the neural networks need to be carefully designed. This includes both the design of the network topology and the choice of input/output representation. The remote sensing specialist (i.e., the end user) is usually not a neural network specialist; therefore the design of the classifier should be automated as much as possible. In this paper a step towards this goal is presented.

Let us consider the standard three-layer multilayer perceptron (MLP). Given the training set $TS = \{(\mathbf{x}^p, \mathbf{t}^p) = ([x_1^p, \ldots, x_L^p]^T, [t_1^p, \ldots, t_N^p]^T) \mid 1 \le p \le q\}$ consisting of $q$ examples, we have to

1. determine the number of hidden units $M$, and

2. estimate the parameters (weights) $w_{ij}$.

The activation of the $j$-th output unit is given by:

$$o_j^p = \sigma\Big(\sum_{i=1}^{M} w_{ij}\, r_i(\mathbf{x}^p)\Big) = \sigma\Big(\sum_{i=1}^{M} w_{ij}\, \sigma\Big(\sum_{k=1}^{L} w_{ki}\, x_k^p\Big)\Big), \qquad (1)$$

* This work was supported by a grant from the Austrian National Fonds zur Förderung der wissenschaftlichen Forschung (No. S7002MAT).
† Also with the Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.


where $\sigma$ is the standard sigmoid function ($\sigma(x) = \frac{1}{1+e^{-x}}$) and $w_{ij}$ is the weight between units $i$ and $j$. In order to determine the weights, we train the network to minimize the usual sum of squared errors:

$$E = \sum_{p=1}^{q} \sum_{i=1}^{N} (o_i^p - t_i^p)^2. \qquad (2)$$
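To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of the forward pass of such a three-layer MLP and of the sum-of-squared-errors criterion; the array names and shapes are illustrative and not part of the original formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, W_in, W_out):
    """Eq. (1): X is (q, L), W_in is (L, M), W_out is (M, N)."""
    R = sigmoid(X @ W_in)    # hidden-unit activations r_i(x^p)
    O = sigmoid(R @ W_out)   # output activations o_j^p
    return R, O

def sse(O, T):
    """Eq. (2): sum of squared errors over all patterns and output units."""
    return np.sum((O - T) ** 2)
```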

Minimizing $E$ is a nonlinear optimization problem of high dimensionality, and the likelihood of being trapped in a local minimum can be quite high. It is now widely accepted, both from the theoretical and the experimental point of view, that the degrees of freedom (i.e., the number of independent weights) influence the performance of a neural network considerably, e.g., [5]. Cross-validation is a procedure that is very often used to determine the number of hidden units: many different networks with varying numbers of hidden units are trained, they are then tested on an independent test set, and the network with the best performance on the test set is chosen (a minimal sketch of this baseline is given at the end of this section). The two main disadvantages of this procedure are that it is very time consuming and that it requires a large amount of data. There also exists a variety of procedures that determine the network size by either pruning or growing hidden units (e.g., [6, 7]). The algorithm we present here falls into the class of pruning methods. However, we place it on a firm theoretical basis by using information-theoretic measures to evaluate the complexity of the network. The resulting algorithm is fully automatic and computationally efficient.

The approach involves two procedures: adaptation (training) and selection. The first procedure adaptively changes the weights of the network; the selection procedure eliminates some of the hidden units. By iteratively combining these two procedures we achieve a controlled way of training and modifying neural networks, which balances accuracy, learning time, and the complexity of the resulting network. In addition, we do not require a separate test set.

The structure of the paper is as follows: in the next section we introduce the optimization principle and describe the objective function (selection procedure), as well as how we optimize it. Section 2.2 presents the complete algorithm. In Section 3 we present the experimental results and compare our method to a hand-designed network and a Gaussian classifier. Finally, we give conclusions and an outlook on further research.
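For comparison, the cross-validation baseline mentioned above can be sketched as follows; train_mlp and accuracy are hypothetical helpers (a full training routine and an evaluation routine), and the candidate sizes are illustrative.

```python
# Cross-validation over the number of hidden units: train one network per
# candidate size and keep the one that does best on an independent test set.
def select_by_cross_validation(train_data, test_data, train_mlp, accuracy,
                               candidate_sizes=(2, 4, 8, 16, 32)):
    best_net, best_acc = None, -1.0
    for m in candidate_sizes:
        net = train_mlp(train_data, hidden_units=m)   # expensive: full training per size
        acc = accuracy(net, test_data)
        if acc > best_acc:
            best_net, best_acc = net, acc
    return best_net
```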

2 Optimizing the Complexity of MLPs

2.1 Selection

Let us consider the task of the network as "encoding" the training set, i.e., encoding the input-output relation. The major building blocks of the "encoder" are the hidden units of the network. Considering a neural network from this point of view, we can ask for the shortest possible encoding of the training set by the neural network. It is clear that the length of the encoding depends both on the size of the network and on the network error. For example, a network with an excessive number of hidden units produces no error on the training set, but its encoding is long because of the large number of weights that have to be specified for the hidden units. On the other hand, a network that is too small causes many errors, which also results in a long encoding, this time because the encoding must include the deviations between the desired and the actual output. A formalization of this reasoning leads to the principle of Minimum Description Length (MDL) [8], which is the basis of our selection procedure.

The task of the selection procedure is to obtain a subset of hidden units from a larger set of hidden units such that the output performance is preserved. Selection is performed by optimizing an objective function which can be tied to the principle of Minimum Description Length. In the next two subsections we describe the objective function, which encompasses the information about the competing hidden units, and the optimization procedure, which selects a set of hidden units.
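Schematically, and only as an illustration (this is the generic two-part MDL decomposition, not the exact derivation of [9]), the description length to be minimized can be written as

$$L(\mathrm{TS}, \mathrm{net}) \;=\; \underbrace{L(\mathrm{net})}_{\text{bits for the hidden units and weights}} \;+\; \underbrace{L(\mathrm{TS} \mid \mathrm{net})}_{\text{bits for the residual output errors}},$$

so a network with too many hidden units inflates the first term, while a network that is too small inflates the second.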

2.1.1 Optimization Function

The objective function that encompasses the information about the competing hidden units has the following form¹ (Eq. (3)):

$$F(\mathbf{m}) = \mathbf{m}^T \mathbf{C}\, \mathbf{m} = \mathbf{m}^T \begin{bmatrix} c_{11} & \cdots & c_{1M} \\ \vdots & \ddots & \vdots \\ c_{M1} & \cdots & c_{MM} \end{bmatrix} \mathbf{m}. \qquad (3)$$

¹ This function actually denotes the savings in the length of the description, and it has been directly derived from the Minimum Description Length Principle (MDL); by maximizing the savings we in fact minimize the length of the description. Due to lack of space we cannot present a complete derivation of this objective function (see [9]).

The vector $\mathbf{m}^T = [m_1, m_2, \ldots, m_M]$ denotes a set of hidden units, where $m_i$ is a presence variable taking the value 1 if hidden unit $i$ is present in the resulting network and 0 if it is absent. The diagonal terms of the matrix $\mathbf{C}$ express the cost-benefit value of a particular hidden unit $i$:

$$c_{ii} = K_1 n_i - K_2 \varepsilon_i - K_3 N_i, \qquad (4)$$

where $N_i$ is the cost of specifying a hidden unit, $n_i$ is the summed activation of the hidden unit ($n_i = \sum_{p=1}^{q} r_i(\mathbf{x}^p)$), and $\varepsilon_i$ is the error of that hidden unit ($\varepsilon_i = \sum_{p=1}^{q} |\partial E / \partial r_i^p|$). The coefficients $K_1$, $K_2$, and $K_3$ adjust the contribution of the three terms. They can be determined automatically [9]: $K_1$ is related to the average cost of describing the activation (in bits), $K_2$ to the average cost of specifying the error, and $K_3$ to the average cost of specifying a hidden unit. More pragmatically, we can determine suitable values for $K_1$, $K_2$, and $K_3$ by considering a few limiting cases. Since we are interested only in the relative comparison of possible descriptions, we can set $K_1$, which weights the summed activation term, to 1 and normalize $K_2$ and $K_3$ relative to it. If we assume for a moment that the error equals 0, $K_3$ determines the minimum summed activation that a hidden unit must have. Once we have obtained the value for $K_3$, we can determine the value of $K_2$ by fixing the maximum allowable error and the minimum activation in the network. One should note that the third term, $N_i$, is constant here; however, this formulation also allows networks that use different kinds of hidden units, e.g., sigmoids or different kinds of radial basis functions. The off-diagonal terms handle the interaction between hidden units:

$$c_{ij} = \frac{-K_1 |R_i \cap R_j| + K_2\, \varepsilon_{ij}}{2}, \qquad (5)$$

$$\varepsilon_{ij} = \max\Big(\sum_{R_i \cap R_j} r_i \varepsilon_i,\; \sum_{R_i \cap R_j} r_j \varepsilon_j\Big). \qquad (6)$$

Here $|R_i \cap R_j| = r_{ij} = \sum_{p=1}^{q} r_i^p r_j^p$ is the joint activation of hidden units $i$ and $j$, and $\varepsilon_{ij}$ is the mutual error of the hidden units defined in (6). In the current implementation we approximate $\varepsilon_{ij}$ by $\max(r_{ij}\varepsilon_i, r_{ij}\varepsilon_j)$. The objective function thus takes the interaction between different hidden units into account; however, only pairwise interactions enter the final solution. From a computational point of view it is important to notice that the matrix $\mathbf{C}$ is symmetric and, depending on the interaction of the hidden units, can be sparse or banded. All these properties of $\mathbf{C}$ can be used to reduce the computation needed to optimize $F(\mathbf{m})$.
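As an illustration of Eqs. (3)-(6), the following sketch builds the matrix C from hidden-unit activations and errors collected on the training set; the arrays R and err, the default values of K1, K2, K3, and the constant unit cost are assumptions made for the example, not values used in the paper.

```python
import numpy as np

def build_selection_matrix(R, err, K1=1.0, K2=1.0, K3=1.0, unit_cost=1.0):
    """R[p, i] = r_i(x^p), err[p, i] = |dE/dr_i^p| over the training set."""
    q, M = R.shape
    n = R.sum(axis=0)                        # summed activations n_i
    eps = err.sum(axis=0)                    # summed errors eps_i
    r_joint = R.T @ R                        # r_ij = sum_p r_i^p r_j^p (joint activation)
    # Mutual error, approximated as max(r_ij * eps_i, r_ij * eps_j) as in the paper.
    eps_mutual = np.maximum(r_joint * eps[:, None], r_joint * eps[None, :])
    C = (-K1 * r_joint + K2 * eps_mutual) / 2.0             # off-diagonal terms, Eq. (5)
    np.fill_diagonal(C, K1 * n - K2 * eps - K3 * unit_cost)  # diagonal terms, Eq. (4)
    return C

def savings(C, m):
    """Eq. (3): F(m) = m^T C m for a 0/1 presence vector m."""
    m = np.asarray(m, dtype=float)
    return m @ C @ m
```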

2.1.2 Solving the Optimization Problem

We have formulated the problem in such a way that its solution corresponds to the global extremum of the objective function. Maximizing the objective function $F(\mathbf{m})$ belongs to the class of combinatorial optimization problems (it is a quadratic Boolean problem). Since the number of possible solutions increases exponentially with the size of the problem, it is usually not tractable to explore them exhaustively; the exact solution has to be sacrificed to obtain a practical one. Various methods have been proposed for finding a "global extreme" of this class of nonlinear objective functions, among them the winner-takes-all strategy, simulated annealing, microcanonical annealing, mean-field annealing, Hopfield networks, continuation methods, and genetic algorithms [10]. We currently use the winner-takes-all (greedy) method and a Tabu search algorithm [11].
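A minimal sketch of the winner-takes-all (greedy) strategy for maximizing F(m) is given below; it repeatedly adds the hidden unit that increases the savings the most and stops when no further gain is possible (Tabu search is not shown). It operates on the matrix C sketched above.

```python
import numpy as np

def greedy_selection(C):
    """Greedy maximization of F(m) = m^T C m over 0/1 presence vectors m."""
    M = C.shape[0]
    m = np.zeros(M)
    current = 0.0
    while True:
        best_gain, best_i = 0.0, None
        for i in np.flatnonzero(m == 0):    # try switching on each absent unit
            trial = m.copy()
            trial[i] = 1.0
            gain = trial @ C @ trial - current
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:                  # no unit improves the savings any further
            return m
        m[best_i] = 1.0
        current += best_gain
```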

2.2 Complete Algorithm

We can now describe the complete algorithm:

1. Initialization: Initialize the network with random weights and a large number of hidden units (e.g., more than 20% of the number of training samples).

2. Adaptation: Adapt the network with a standard training algorithm (in particular, we use a conjugate gradient algorithm). The network is not trained to convergence (see the discussion below); usually a few (e.g., fewer than 10) epochs are sufficient.

3. Selection: Remove redundant hidden units using the selection procedure (Section 2.1).

4. If the selection procedure has not removed any of the hidden units and the changes in the weights are small, stop; otherwise go to step 2 (a code sketch of this loop is given at the end of this subsection).

This iterative approach is a very controlled way of removing redundant units. Selection is based on the relative competition among the hidden units: only those units are removed that cause a high error and overlap with others which approximate the data better. The units which remain in the network are adapted by the training procedure. To achieve a proper selection it is not necessary to train the network to convergence at each step, because the selection procedure removes only those hidden units whose omission the other units can compensate for. Since this is independent of the stage of the training, it is not critical when we invoke the selection procedure². For these reasons we achieve a computationally efficient procedure. This is in contrast to other well-known pruning algorithms such as Optimal Brain Damage [6] and Optimal Brain Surgeon [7], which always have to be retrained to convergence after a weight/unit is eliminated, because only then can reliable measures of the importance of a weight/unit be obtained. Moreover, these algorithms do not have a stopping criterion: they only provide a set of networks of decreasing complexity from which the best one has to be selected by some cross-validation procedure. One should also note that all entries of C needed for the selection can be calculated during the adaptation procedure, so this causes no additional cost. Since the matrix C is usually very sparse, the selection procedure can be implemented very efficiently, with a computational cost comparable to a single epoch of training.

² The exception is a pathological initialization, e.g., when all hidden units have the same weights or are confined to a small subspace. In these cases a few steps of training are sufficient to overcome the problem.
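The iterative adaptation/selection loop can be sketched as follows; net, train_few_epochs, and prune_hidden_units are hypothetical stand-ins for the network object, the conjugate-gradient training step, and the removal of hidden units, while build_selection_matrix and greedy_selection are the sketches from Section 2.1.

```python
import numpy as np

def optimize_mlp(net, data, train_few_epochs, build_selection_matrix, greedy_selection,
                 epochs_per_round=10, weight_tol=1e-3):
    """Steps 2-4: alternate a few epochs of training with MDL-based unit selection."""
    while True:
        # Adaptation: a few epochs only, not trained to convergence.
        weight_change, R, err = train_few_epochs(net, data, epochs=epochs_per_round)
        # Selection: keep the subset of hidden units chosen by maximizing F(m).
        C = build_selection_matrix(R, err)
        keep = greedy_selection(C).astype(bool)
        removed = int(np.count_nonzero(~keep))
        net = net.prune_hidden_units(keep)   # hypothetical pruning method on the network object
        # Stopping criterion: nothing removed and the weights have settled.
        if removed == 0 and weight_change < weight_tol:
            return net
```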

3 Experimental Results

We applied our procedure to the problem of land use classification of Landsat TM data. The data used for training and for testing the classification accuracy of the neural network were selected from a section of a Landsat TM scene (512 × 512 pixels) of the surroundings of Vienna. Figure 1(a) shows channel 4 of the test site. This data set has already been used with various other algorithms [1, 12]. In order to guarantee a fair comparison, we use the same setup for our procedure as was used for the hand-designed network. We therefore briefly describe the training data and the hand-designed network (for more details see [1]).


Hand-designed network. The aim of the classification was to distinguish between four categories: built-up land, agricultural land, forest, and water. The resulting thematic map is compared with the output of a Gaussian classification [13]. The Gaussian classifier assumes a normal distribution of the data, and a preliminary analysis indicated that this assumption was not met for the four categories; therefore, these categories were split into 12 sub-categories with spectral data of approximately normal distributions [14]. Two thematic maps of this scene were prepared by visual classification of the Landsat image, using auxiliary data from maps, aerial photographs, and field work. These two thematic maps, showing 4 and 12 land-cover classes, were considered to represent the "true" classification of the scene (see Figure 1(b)) and were used both for obtaining training information and as test data for assessing the accuracy of the automatic classification results. The training set for both the neural network and the Gaussian classifier consisted of 3,000 pixels.

The hand-designed network used for the classification is a three-layer feed-forward network with 5 hidden units trained with a conjugate gradient algorithm; the five hidden units were found by trial and error. The inputs to the network are the seven images from the Landsat TM channels (bands). One-dimensional coarse coding [15] is used as the input representation (a small sketch is given at the end of this subsection), and the output of the classification task is coded locally. The network was trained for approximately 80 epochs (complete presentations of the training set) and achieved 98.1% classification accuracy on the training set. Applying the network to the whole image leads to an average classification accuracy of 85.9%, which is slightly better than the Gaussian classifier, which achieved 84.7% correctly classified pixels (training and classification with 12 classes, merged into 4 classes after classification). The first two columns of Table 1 compare the neural network classification with the Gaussian classifier. Figure 2(a) shows the result of the neural network classification.
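As an illustration of the input representation, here is a minimal sketch of one-dimensional coarse coding, assuming each Landsat TM band value (0-255) is spread over a small set of overlapping triangular receptive fields; the number of fields, their shape, and the example pixel values are illustrative assumptions, not taken from [1] or [15].

```python
import numpy as np

def coarse_code(value, n_units=5, lo=0.0, hi=255.0):
    """Encode a scalar band value as activations of overlapping triangular fields."""
    centers = np.linspace(lo, hi, n_units)
    width = (hi - lo) / (n_units - 1)      # neighbouring fields overlap
    return np.clip(1.0 - np.abs(value - centers) / width, 0.0, 1.0)

# Seven TM bands -> concatenated coarse-coded input vector for the MLP.
pixel = [45, 30, 28, 90, 70, 12, 60]       # illustrative digital numbers
x = np.concatenate([coarse_code(v) for v in pixel])
```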

Optimized network. We applied our procedure, as described in the previous section, to the same data. The starting number of hidden units was 30 (similar results are obtained with other initial numbers of hidden units). The selection procedure was invoked 12 times, and the network was trained for a total of 120 epochs. The network ended up with 4 hidden units; the results are shown in Table 1 (third column) and Figure 2(b). They show that the fully automatic network setup performs, on average, slightly better than the hand-designed one, with a smaller number of hidden units. It should be emphasized that our primary goal was to develop a fully automatic method that can compete with a carefully hand-designed network; at this point we did not strive to improve the classification accuracy. The number of epochs is higher than for the hand-designed network because some retraining has to be performed after pruning. However, considering that in the hand design one has to try several alternatives (> 10), we also gain in terms of computational time.

In the hand design we also tested networks with 4 hidden units, but those were found to perform worse than the 5-hidden-unit network. This may be explained by the fact that those networks were trapped in a local minimum, whereas our procedure found considerably better minima. The reason is that in our case many hidden units are used at the beginning (all initialized at different positions), so a larger portion of "weight space" is examined, and the units that perform well remain in the network. On the other hand, when starting with a minimal network it is practically impossible to avoid bad initializations.

4 Conclusions and Work in Progress

We presented a method which, starting from an initially large number of hidden units in an MLP network, achieves a compact network through an iterative procedure of learning, adaptation of hidden units, and selection. Our approach effectively and systematically overcomes the problem of selecting the number of hidden units. We have demonstrated on a well-known example that our method performs better than a hand-designed network.

The proposed algorithm can be extended in several ways. We have used essentially the same algorithm to design radial basis function networks [16] and achieved similar improvements over traditional methods. We are currently adapting the algorithm to several other network types, including Gaussian mixture models trained with the EM algorithm and some unsupervised networks. We are also exploring the possibility of using our algorithm for modular neural network design and for networks which use different types of hidden units. In addition, we plan to develop an incremental variant of the algorithm in which we start with a small number of samples and hidden units, incrementally add examples and hidden units when needed, and then use the selection procedure to remove the redundant ones. Since the method proposed in this paper is quite general, it should be easy to incorporate all these extensions without changing the general paradigm.


References

[1] H. Bischof, W. Schneider, and A. Pinz, "Multispectral classification of Landsat images using neural networks," IEEE Transactions on Geoscience and Remote Sensing, vol. 30, no. 3, pp. 482–490, 1992.

[2] H. K. Greenspan and R. Goodman, "Remote sensing image analysis via a texture classification neural network," in Advances in Neural Information Processing Systems 5 (S. J. Hanson, J. D. Cowan, and C. L. Giles, eds.), pp. 425–432, Morgan Kaufmann, California, 1993.

[3] Y. Hara, R. Atkins, S. Yueh, R. Shin, and J. Kong, "Application of neural networks to radar image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 1, pp. 100–109, 1994.

[4] A. S. Solberg, A. Jain, and T. Taxt, "Multisource classification of remotely sensed data: Fusion of Landsat TM and SAR images," IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 4, pp. 768–778, 1994.

[5] V. Vapnik, E. Levin, and Y. LeCun, "Measuring the VC-dimension of a learning machine," Neural Computation, vol. 6, no. 5, pp. 851–876, 1994.

[6] Y. L. Le Cun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems 2 (D. S. Touretzky, ed.), pp. 598–605, San Mateo, CA: Morgan Kaufmann, 1988.

[7] B. Hassibi and D. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems 5 (S. J. Hanson, J. D. Cowan, and C. L. Giles, eds.), pp. 164–172, Morgan Kaufmann, 1993.

[8] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Transactions on Information Theory, vol. 30, pp. 629–636, July 1984.

[9] A. Leonardis, A. Gupta, and R. Bajcsy, "Segmentation of range images as the search for geometric parametric models," International Journal of Computer Vision, vol. 14, no. 3, pp. 253–277, 1995.

[10] A. Cichocki and R. Unbehauen, Neural Networks for Optimization and Signal Processing. New York: Wiley, 1993.

[11] F. Glover and M. Laguna, "Tabu search," in Modern Heuristic Techniques for Combinatorial Problems (C. R. Reeves, ed.), pp. 70–150, Blackwell Scientific Publications, 1993.

[12] H. Bischof, Pyramidal Neural Networks. Lawrence Erlbaum Associates, 1995.

[13] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

[14] R. Wagner, Verfolgung der Siedlungsentwicklung mit Landsat TM Bilddaten (Settlement development monitoring with Landsat TM data). PhD thesis, Institute of Surveying and Remote Sensing, University of Natural Resources, Vienna, 1991.

[15] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart, "Distributed representations," in Parallel Distributed Processing (D. Rumelhart and J. McClelland, eds.), vol. 1, pp. 77–109, Cambridge, MA: MIT Press, 1986.

[16] A. Leonardis and H. Bischof, "Complexity optimization of adaptive RBF networks," in Proceedings of the 13th International Conference on Pattern Recognition, vol. IV, pp. 654–658, IEEE Computer Society, 1996.


Figure Captions

Figure 1: (a) Channel 4 of the test image and (b) visual classification.

Figure 2: (a) Result of the neural network classification and (b) result of the optimized neural network classification (black: built-up land; dark gray: forest; light gray: water; white: agricultural area).

Table Captions

Table 1: Classification results of the Gaussian classifier (ML), the neural network (NN), and the optimized neural network (Opt. NN).






Table 1: Bischof, Leonardis

category             ML       NN (5 hidden units)   Opt. NN (4 hidden units)
built-up land        78.2%    87.5%                 85.9%
forest               89.7%    89.9%                 91.9%
water                84.7%    95.7%                 90.4%
agricultural area    74.1%    70.6%                 74.2%
average accuracy     84.7%    85.9%                 86.2%
