
Journal of Intelligent Manufacturing (1997) 8, 157–165

A practical overview of neural networks

LAURA BURKE¹* and JAMES P. IGNIZIO²

¹Department of Industrial and Manufacturing Systems Engineering, Lehigh University, Bethlehem, PA 18015, USA. ²Department of Systems Engineering, University of Virginia, Charlottesville, VA 22901, USA

Invited Paper

This paper overviews the myths and misconceptions that have surrounded neural networks in recent years. Focusing on backpropagation and the Hopfield network, we discuss the problems that have plagued practical application of these techniques, and review some of the recent progress made. Both real and perceived inadequacies of backpropagation are discussed, as well as the need for an understanding of statistics and of the problem domain in order to apply and assess the neural network properly. We consider alternatives or variants to backpropagation, which overcome some of its real limitations. The Hopfield network's poor performance on the traveling salesman problem in combinatorial optimization has colored its reception by engineers; we describe both new research in this area and promising results in other practical optimization applications. Overall, it is hoped, this paper will aid in a more balanced understanding of neural networks. They seem worthy of consideration in many applications, but they do not deserve the status of a panacea – nor are they as fraught with problems as would now seem to be implied.

Keywords: Neural networks, engineering applications

1. Introduction

Despite the close of the decade of the `new golden age' of neural networks, their credibility remains questioned by many engineers. The reason for this skepticism is actually not difficult to unearth. For the past forty-odd years the neural network approach to artificial intelligence has repeatedly earned the status of an `irresponsible little brother' in the artificial intelligence (AI) community – this in spite of the fact that neural networks preceded other AI approaches, such as expert systems. The field of neural networks owes this unfortunate status in part to the actions of enthusiastic researchers in the field and in part to the nature of the neural network itself. Researchers, starting with the likes of Frank Rosenblatt in the 1950s, tended to make excited generalizations about their work, which – when the field has to do with the modeling of human intelligence – may be picked up by the media or other less scrupulous individuals and exaggerated for effect. Such hype tends to deter serious researchers in the AI and engineering fields. Additionally, there is the problem of the neural net's inherent inability to explain itself.

*Author to whom all correspondence should be addressed.

0956-5515 © 1997 Chapman & Hall

By the 1980s, when neural networks experienced their reawakening, conventional AI researchers had made progress in the transparent modeling of intelligence – namely, expert systems. Expert systems were able to explain their reasoning. The rule-following, knowledge-based approach of these models stood in stark contrast to the emerging neural network scene, where rules were abandoned and explanations seemed impossible. So the hype and the `irresponsibility' of neural networks led to some annoyance on the part of serious researchers in, and practitioners of, AI. Statements that were made by some worsened matters: `You don't need to understand statistics to use neural networks' or `You don't need to know much about the problem to use neural networks'. These kinds of sweeping generalizations only detracted from the credibility of neural nets, and in particular from the viewpoints of statisticians – for obvious reasons – and engineers. The latter asked how a technique that seemed to require no knowledge of the problem domain could be trusted. Two additional items of confusion further slowed the reentry of neural networks. One body of literature, read by interested newcomers to the field, seemed to suggest that the backpropagation network was `the' neural network. Even now, vestiges of this confusion remain. It is an understandable misconception to hold; the excitement

surrounding neural networks that led to new funding and interest in the 1980s was in part due to backpropagation's emergence as a training algorithm, which overcame the crippling limitations of the Perceptron of the 1950s. But, at the same time, an enormous amount of research was conducted and progress was being made on neural networks quite distinct from backpropagation. Backpropagation itself underwent numerous studies that offered modifications and improvements. While backpropagation is quite flexible and powerful, its drawbacks are numerous. Moreover, there remains a gap between perceived and real limitations. Nevertheless, it remains the most ubiquitous neural network. It is simply not the only neural network. Of course, any time `the' neural network proved inadequate, sceptics perceived all neural networks as being useless. The other body of literature, also read by interested newcomers, attributed the new notoriety of neural networks to the Hopfield network, named for John Hopfield. Hopfield, a prominent physicist, attracted much note when he published his seminal work, `Neural networks and physical systems with emergent collective computational abilities', in the Proceedings of the National Academy of Sciences (Hopfield, 1982). This article described how the Hopfield network, composed of artificial neurons and capable of computation, could be implemented in simple electrical circuitry. Hopfield's work attracted further note from engineers when, teamed with David Tank, he illustrated the application of the Hopfield network to the traveling salesman problem (Hopfield and Tank, 1985). But again, when the Hopfield network proved to have severe shortcomings, neural networks were seen as failures. Thus misconceptions such as `backpropagation is the only neural network' or `Hopfield is the only neural network' and even `backpropagation and the Hopfield network are very similar' all surfaced, and still remain. To relieve the confusion, we must first disentangle backpropagation from the Hopfield network. These two neural networks represent fundamentally different techniques applied in fundamentally different instances. Rarely, if ever, is it sensible to ask, `Should we use backpropagation or the Hopfield network?' Neural networks, unfortunately, are not identical copies of the human brain, and their limitations are many. Still, properly used, their true potential may have yet to be realized. They have not come and gone, fading into obscurity after a brief, ignominious heyday. Instead, neural network users find more and better software tools available, thanks to a continued flood of research in techniques and applications. A recent survey conducted by Golden et al. (1997) cited 17 different neural network software packages in use by respondents (although two of these packages accounted for 70% of users). Furthermore, they found that over 70% of respondents felt that the `future looks bright for using neural networks to model company problems' (most respondents were from industry).

The goal of this overview is to clarify many of the misunderstandings surrounding the popular view of neural networks. It will also endeavor to distinguish between the real and the perceived inadequacies of the two most familiar networks, backpropagation and Hopfield. We debunk statements such as `You don't need statistics to use neural networks' and `You don't really need to know much about the problem'. But our first order of business is to describe the present understanding and use of empirical networks, primarily backpropagation. We elaborate on its limitations, both perceived and real, and briefly describe the alternatives (or modifications) to this widely used algorithm (according to Golden et al., 85% of neural network users report that backpropagation is the algorithm they use). After clarifying some widely held misconceptions about backpropagation, and offering insights into what may truly hinder its application, we move on to describe the present issues in application of and research into the Hopfield network. While this paper does not attempt to survey the literature exhaustively or rigorously explain findings, it does aim to explain why neural networks sometimes fail, what may be done to prevent this, and what more recent research has to offer. Hopefully, the discussion will show why both perfunctory dismissal of, and blind faith in, neural networks may be misguided.

2. Backpropagation description

The backpropagation neural network exemplifies the adaptive weighted, feedforward-type neural network that accomplishes empirical modeling. Rather than provide a detailed description of the backpropagation algorithm, we refer the reader to a number of excellent sources (Rumelhart et al., 1986; Fausett, 1994). Here we simply give an overall intuitive explanation of the concept of backpropagation. The algorithm operates on a neural network, which may look like the one pictured in Fig. 1. Input nodes, at the bottom of the figure, receive problem attributes for a particular record (input–output pair), and transmit that information to adjacent nodes by way of the connections between them. Output nodes, at the top of the figure, display the neural network's response to the particular input pattern. Connections between nodes have associated weights. Hidden nodes perform non-linear transformations that are central to the modeling capability of the network. Figure 2 illustrates the typical logistic sigmoid activation function used in backpropagation. As an example, consider the problem of predicting the flank wear on a machine tool in a turning operation (Burke and Rangwala, 1991). Let the inputs to the model be measures of cutting force and acoustic emission. Assume that these signals have been transformed so that only six attributes of a tool at a particular time are needed for input. The output associated with these readings is the measured wear on the flank of the tool at the given time.


Fig. 1. Typical backpropagation neural network structure.

The approach would then be to collect many such measurements, and repeatedly present the input vectors to the network. The network, whose weights are (usually) initially random values, will give meaningless outputs at first. But the backpropagation algorithm adjusts the weights of the network as the `training' progresses. It does this by calculating an error between each input's desired output and that which the network actually gives. Then the sum squared error over the entire training set is the error term which drives the weight change as follows:

\Delta w_i = -\frac{dE}{dw_i} \qquad (1)

where E = \sum_{j=1}^{n} (t_j - a_j)^2; w_i = weight on the ith connection; n = number of patterns; t_j = target output for pattern j; and a_j = actual network output for pattern j. Many readers will recognize such an approach as a gradient descent search method. Backpropagation indeed approximates the gradient descent method of minimizing a non-linear function; however, it does so stochastically. Instead of using the entire sum over all training patterns to calculate the weight change, the algorithm typically uses partial sums, leading to weight changes based only on partial information. This allows it to search the error surface in a stochastic manner.
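To make Equation (1) and its stochastic variant concrete, the following minimal NumPy sketch (our illustration, not the authors' code; function and variable names are hypothetical, biases are omitted for brevity) performs backpropagation weight updates for a single-hidden-layer network with logistic sigmoid units, using small random subsets of the training patterns as the `partial sums' described above:

```python
import numpy as np

def sigmoid(x):
    # Logistic activation used at the hidden and output nodes (cf. Fig. 2).
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W1, W2, X, T, lr=0.1):
    """One weight update per Eq. (1), with a learning-rate factor added:
    dW = -lr * dE/dW, where E = sum_j (t_j - a_j)^2 over the patterns in X."""
    H = sigmoid(X @ W1)           # hidden activations
    A = sigmoid(H @ W2)           # actual outputs a_j
    d_out = 2 * (A - T) * A * (1 - A)      # dE/dz at the output layer
    dW2 = H.T @ d_out
    d_hid = (d_out @ W2.T) * H * (1 - H)   # error backpropagated to hidden layer
    dW1 = X.T @ d_hid
    return W1 - lr * dW1, W2 - lr * dW2

# Toy data: six input attributes (e.g. force/acoustic features), one output (wear)
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 6)); T = rng.uniform(size=(50, 1))
W1 = rng.normal(scale=0.1, size=(6, 4)); W2 = rng.normal(scale=0.1, size=(4, 1))

for epoch in range(1000):
    # Stochastic flavour: update on a small random subset ("partial sums")
    batch = rng.choice(len(X), size=10, replace=False)
    W1, W2 = backprop_step(W1, W2, X[batch], T[batch])
```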

2.1. Real and perceived inadequacies of backpropagation

The next subsections describe various weaknesses of the backpropagation algorithm. First, we address the assertion that backpropagation is nothing more than linear regression. Next, we discuss the problem of overly long training times. In both cases, we will see that both substance and subterfuge exist.

2.1.1. Backpropagation comparisons with linear regression

One of the most often cited reasons for dismissing backpropagation is its perceived correspondence to linear regression. While there do exist similarities between the techniques, the true statistical counterpart of backpropagation is non-linear regression. In linear regression, the function being approximated is assumed to be linear in its weights. The backpropagation function, owing to the non-linear transformations performed at the hidden nodes, is non-linear in its weights. Non-linear regression is a much less accessible technique than linear regression, so very few attempts at comparison are made. On the other hand, some researchers do compare neural networks to linear regressions of higher orders: that is, wherein the function is still linear in the unknown weights but now input data may take on higher-order forms. Even in such a case, the need to determine which terms contribute to error reduction means that the process becomes considerably more time consuming than for the simple first-order linear regression approach. To suggest that backpropagation represents the same technique as linear regression is unfair. Furthermore, even when compared to its correct statistical counterpart, the stochastic nature of the backpropagation algorithm makes it unique.
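The distinction is easy to demonstrate. In the toy sketch below (our example; all names are hypothetical), scaling all weights of a one-hidden-unit sigmoid model does not scale its output, whereas any model that is linear in its weights – including higher-order polynomial regression – would scale exactly:

```python
import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))
x = np.array([1.0, 2.0])
w_hidden, w_out = np.array([0.5, -0.3]), 0.8
f = lambda wh, wo: wo * sig(x @ wh)   # one hidden unit, no bias

print(2 * f(w_hidden, w_out))         # twice the original output
print(f(2 * w_hidden, 2 * w_out))     # output with all weights doubled: differs,
                                      # so the model is non-linear in w_hidden
```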

Fig. 2. Sigmoid function, y = f(x).

2.1.2. Excessive training time

The original backpropagation algorithm indeed possesses many flaws. But much of the problem is due to inappropriate application and unrealistic expectations. This causes perceived limitations to surface, and masks the real inadequacies of backpropagation. One of the first and most often cited flaws attributed to backpropagation is the `excessively long time' it takes to train the network (that is, find the weights). While this can indeed hamper some applications, two very common reasons for long training time exist, and they are usually avoidable. The first is the use of exceptionally small learning rates, which often appeal to those who understand that the smaller the learning rate, the more closely the neural network will approximate gradient descent. While this may seem reassuring, because a stable algorithm is more likely to result, it is important to remember that the stochastic nature of the algorithm contributes to its unique advantages over conventional methods. Moreover, approaches that initialize the learning rate at high values, then decrease it gradually, generally prove stable and fast. The second reason is the fundamental approach taken by some users: drive the error to some arbitrarily low value. (What many readers may find surprising is that, in many applications where backpropagation succeeds, it requires only a few minutes to train the network. The training time obviously depends greatly on the size of the network and the quantity of training data.) The error given by a backpropagation network is one of the most misused measures of performance, as we shall next discuss.
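A decreasing learning-rate schedule of the kind just described can be as simple as the sketch below (ours; the exponential form and its constants are assumptions, not a schedule prescribed by the paper). It could drive the hypothetical backprop_step shown earlier:

```python
def learning_rate(epoch, lr0=1.0, decay=0.99, floor=0.01):
    # Start high to exploit the fast, stochastic search, then decay
    # gradually so late training approximates stable gradient descent.
    return max(lr0 * decay ** epoch, floor)

# e.g. W1, W2 = backprop_step(W1, W2, X[batch], T[batch],
#                             lr=learning_rate(epoch))
```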

2.2. Real inadequacies and possible solutions

Long training times can indeed hamper backpropagation applications. So can problems of `overtraining' and `undertraining'. We next try to clarify what these problems are and how they may be addressed. To do so requires an understanding of how to assess a neural network, bringing us back to the issue of error.

2.2.1. Error versus R²

By driving performance error to very low values, users can certainly cause long training times to result, and can also generate other problems. It turns out that an understanding of some very basic statistics is critical after all, as we shall see. The error in a backpropagation net is a function of the difference between the desired and the actual output for each training pair in the training data set. Requiring this sum squared error to reach a prespecified `low' value makes little sense. For some applications, such a value may be reached quickly and easily. In fact, causing the net to train long enough to reach such low error can lead to very poor performance on the test set: that is, poor generalization on future data. We shall have much more to say about this shortly. For now, we leave the reader with a seemingly paradoxical statement: even when they can be reached quickly, low error values can prove detrimental to the network application overall. For other applications, a low error value may require excessive time to reach – or in fact may be virtually impossible to reach. It is difficult indeed to know a priori what an acceptably low error value should be. Moreover, that level should incorporate the inherent difficulties of the problem. In statistical terms, a correlation coefficient should be computed that corrects the error value with a measure of the inherent variability of the data. This correlation coefficient, the R² statistic, is:

R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \qquad (2)

where SSE = sum of squared errors for the network output, and SST = total sum of squares of the data output (the t_j's from Equation 1). A value closer to 1 is better than a value closer to 0. Most introductory statistics textbooks discuss this and other forms of the correlation coefficient. The use of the R² statistic is not without its own problems. For the training set – the data set that is used to find the weights, or parameters, of the network – the correct statistic to use is the adjusted R² statistic:

R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\left(\frac{n-1}{n-p-1}\right) \qquad (3)

where n = number of patterns and p = number of parameters in the network. This statistic takes into account the number of parameters in the model. After all, a model that has twice as many parameters has an advantage over one with fewer parameters – and it also has the capacity to overfit the data. At least this is true in regression. Unfortunately, neural network models tend to have a large number of weights if they are to be useful. In fact, they sometimes have so many weights that the adjusted R² becomes meaningless, leading to values larger than 1 when n − p − 1 < 0. The problem is that not all the parameters of the neural network play a significant role in the model at various points in training. (Again, the neural network is not an exact duplicate of any regression model.) That is, not all weights are effective parameters (Moody, 1992). Since the role of a weight changes over the course of training, how effective it becomes depends on the length of time allotted to training. While Moody gives some formulae with which to calculate the effective number of parameters of a neural network, the usefulness of these formulae depends on certain assumptions being made about the neural network (many standard implementations of backpropagation do not satisfy the necessary assumptions) and on a fair amount of computation. Still, a statistical measure incorporating the data's inherent variability is extremely useful.
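Translated directly into code, Equations (2) and (3) look like this (a straightforward sketch; the function name and the NaN convention for the degenerate case are our choices):

```python
import numpy as np

def r2_scores(t, a, p):
    """Unadjusted and adjusted R^2 (Eqs 2 and 3) for targets t,
    network outputs a, and p network parameters (weights)."""
    sse = np.sum((t - a) ** 2)              # sum of squared errors
    sst = np.sum((t - np.mean(t)) ** 2)     # total sum of squares
    n = len(t)
    r2 = 1 - sse / sst
    # Adjusted R^2 is meaningless when n - p - 1 <= 0 (too many weights)
    r2_adj = (1 - (sse / sst) * (n - 1) / (n - p - 1)
              if n - p - 1 > 0 else float('nan'))
    return r2, r2_adj
```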

The network's ability to generalize determines its true usefulness. A test set, separate from but similar to the training data, can aid in performance assessment. The existence of a satisfactory test set dictates the ease with which the assessment may be made. The next section discusses test set issues.

2.2.2. Test set issues

Proper validation of a neural network requires a test set: a set of data similar to but distinct from the training data. For such data the unadjusted R² measure will suffice and avoid the question of the number of effective parameters. The problem that then arises is that, in order to secure the best results, a number of different architectures (quantity of hidden units, primarily) together with various training times must be used. If the R² measure is used in each case on the test set, then the measure is again biased, because the approach is assessing the effect of additional parameters – such as number of hidden units or training time. Thus, for ideal validation, a third and separate data set is required. Unfortunately, not all applications have available such a large quantity of appropriate data. Quite often the second data set alone is used, and the best R² is reported. While this estimate may be somewhat biased and optimistic, it is nonetheless a reasonably useful assessment of the network, and certainly gives more information than just the error on the training set. This of course assumes that the second set is sufficiently large and appropriately chosen. If not, consider what can happen. In an application characterized by scarcity of data, a group of engineers used a rule of thumb provided by Hecht-Nielsen in his textbook (Hecht-Nielsen, 1990): use 90% of the data for training, and 10% for testing. Unfortunately, the engineers did not read much more carefully about test set selection. What happened with their set of 60 data points was that only six were reserved for testing. After training and testing the network they discovered, to their excitement, that they could achieve a test set R² of 0.9 – quite respectable. To their horror and confusion, however, they found on alternate test sets (where the test and training data as a whole never changed) that they achieved negative (unadjusted) R² values! It goes without saying that such results are simply not acceptable. But why did it happen? With only six data points, no reasonably reliable conclusions can be drawn about the neural network. It may in fact be perfectly well trained, or it may be miserably overtrained – it is impossible to know with such sample sizes. Worse, these samples may consist of data points clustered together. If such a small sample size is necessary, then at the very least great care must be taken to handpick data points for testing – preferably reflecting the same wide range of values present in the training data. Consider, in the above example, what would have happened if the engineers had not explored the network's performance on variations of the test set. They would have supplied the client with a neural network system that might well have been useless. This kind of validation effort is critical in applying neural networks. While multiple training and testing efforts may not always be the most efficient approach, it is certainly better than no attempt. The often cited problem of long training times may be masking more serious issues regarding the neural net's application: for a significant percentage of the time, very short training periods are more than sufficient. Overtraining may be an even more serious problem than `long training time' – and the two problems may go hand in hand, as we shall see.
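The instability the engineers ran into is easy to reproduce. The toy sketch below (our construction, not the engineers' data; all numbers are arbitrary) scores the same fixed predictions against five different six-point test samples drawn from 60 points:

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=60)                   # 60 targets, as in the anecdote
a = t + rng.normal(scale=0.5, size=60)    # imperfect stand-in "network" outputs

for trial in range(5):
    idx = rng.permutation(60)[:6]         # a different 6-point test set each time
    sse = np.sum((t[idx] - a[idx]) ** 2)
    sst = np.sum((t[idx] - np.mean(t[idx])) ** 2)
    print(f"trial {trial}: R^2 = {1 - sse / sst:.2f}")
# R^2 swings widely from trial to trial: six points cannot
# reliably assess the network.
```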

2.2.3. The dangers of overtraining

As the use of the adjusted R² statistic suggests, models with a different number of parameters can be expected to perform quite differently, and the assessment of the model corrects for p, the number of parameters. The problem, of course, is the potential for overfitting. In statistics this term applies to the situation wherein a model possesses too many parameters, and fits the noise in the data rather than the underlying function. Because neural networks have so many weights, it is natural to believe that they succumb easily to this situation. However, the effective number of parameters (mentioned in Section 2.2.1) does not necessarily equal the number of weights in a neural network. It turns out that the effectiveness of a parameter depends on a number of features, especially the length of training and, of course, the data used to train. Figure 3 illustrates the situation best. Initially, the function described by the input–output relationship of the neural net is smooth and close to linear; as training proceeds, it begins to acquire `curves' as it fits higher-order features in the data. Eventually, if allowed to train too long and if sufficient parameters exist in the model, the function will resemble Fig. 4 – highly wrinkled and likely to fit the noise in the training set. This results in an ability to perform exceedingly well on the training data – almost as if it were `memorized' – at the expense of not generalizing to new and different, but similar, test data. Finally, a critical clarification deserves note. The need for sufficiently long training to activate all or most parameters, and the perhaps even more important effect of the actual training data used, preclude prediction of whether or not a neural network of a particular size will overfit the training data. Recall first that only a second test set can reveal if the neural net was overtrained on the training data – a low error value (or high R²) alone cannot tell. The discussion in the previous section is helpful in understanding how critical selection of the test set data is to the task of neural network assessment. Also, the training data itself will affect the likelihood of overfitting.

Fig. 3. Neural network approximation to y = f(x), early in training. Circles represent data points, x, and line represents approximation to output, y.


Fig. 4. Neural network approximation to y = f(x), late in training (overtraining). Circles represent data points, x, and line represents approximation to output, y.

Even when a large amount of data is available, if it does not span the input and output space well, then the neural network is not exposed to the wide range of information that it needs. It can `memorize' the relatively similar patterns it sees over and over again. Thus even quantity of data alone is not sufficient to prevent overtraining; neither is a small (but well chosen) set of training data – with respect to the number of weights – a guarantee of overfitting.
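One practical defence against overtraining – and, as the next subsection suggests, against fruitless overlong training generally – is to monitor a held-out set and halt when progress levels off. A minimal control-loop sketch (ours; the callables are hypothetical placeholders, not an API from the paper):

```python
def train_with_early_stopping(step, evaluate, patience=20, max_epochs=5000):
    """Stop when validation R^2 has not improved for `patience` epochs.
    `step` runs one training epoch; `evaluate` returns validation R^2."""
    best, best_epoch = -float('inf'), 0
    for epoch in range(max_epochs):
        step()
        score = evaluate()
        if score > best:
            best, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break   # training error may still be falling, but generalization
                    # has stopped improving: halt before overtraining
    return best
```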

2.2.4. `Undertraining'

The discussion thus far has focused on problems that arise when the practitioner causes training to proceed for too long. It is the view of these authors that understanding the situation that results under such circumstances will prevent a large percentage of the failures of neural networks, or at least minimize the frustration. But unfortunately it is also entirely possible that a network will not, in a reasonable period of time, reach an acceptable error or R² level. Overtraining is unlikely to concern the practitioner when the network is unable to capture the underlying function of the training data at all. For example, consider another real engineering application wherein a backpropagation network's R² value was negligible (near zero) between 1000 and 10 000 iterations (cycles through the training set). It is just such a situation that might prompt the practitioner to train for excessively long periods, still unable to achieve an acceptable error level on the training set. We propose here that in such a situation training should be halted at the point where progress has leveled off. Either the neural network architecture, the training data or some other as yet undiagnosed problem prevents successful training from occurring, and continued application of the algorithm really serves no purpose. Instead, the difficulty suggests that a careful review is in order. In the case of this example, alternative neural networks were applied for preprocessing the data, and then the preprocessed training set was used in backpropagation training. The new approach yielded R² values above 0.8 (compared to near-zero values before) in a few hundred iterations. It is clear, then, that much of what plagues backpropagation – its perceived limitations – is related to incompatibility between the data available and the expectations of the practitioner. Further, the myth that statistics need not intrude upon the application of neural networks can be exposed for what it is. For the most part, backpropagation still represents the most ubiquitous neural network model, but a number of real contenders have emerged. Thus backpropagation is not the only neural network. But a corollary follows immediately: there is no `one' backpropagation. Numerous variations on the backpropagation theme exist, and some `contenders' can actually be described as modified backpropagation. We next review some of these alternatives.

2.3. Alternative empirical neural networks

Much of the difficulty concerning the use of backpropagation lies in the virtual impossibility of determining the best number of hidden units to use for an application. In addition, as described above, backpropagation can sometimes simply fail – or take excessively long to train. Again, this may relate back to the problem of adequate hidden unit determination. A further disadvantage of backpropagation is its inability to `know what it doesn't know'. By this we refer to the fact that backpropagation will always give a response, without indicating that perhaps the new input is really outside its area of expertise. Thus much of the effort in improving backpropagation or pursuing alternatives has focused on these limitations. Here we sample a selected set of modifications/alternatives which demonstrate the results of these efforts. The cascade correlation algorithm (Fahlman and Lebiere, 1990) is generally treated as an extension of backpropagation. In this method, the number of hidden units is initialized at zero, and grows throughout the training of the network. When the error for the zero hidden unit network levels out, the algorithm is induced to add a hidden node. When a hidden node is added, previous hidden node connection weights are fixed, and only the new connection weights will be modified (various implementations of the algorithm attack the weight modification differently). The new node connects to input, output, and previously existing hidden nodes. This approach lies in the class of `constructive' algorithms (sketched below). Other constructive algorithms include the Upstart method (Frean, 1990), the `Tiling' algorithm (Mezard and Nadal, 1989; Yang et al., 1996) and, for classification, the RCE network (Reilly et al., 1982) and the probabilistic neural network (Specht, 1990).
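A minimal sketch of the constructive control flow, in the spirit of cascade correlation but not Fahlman and Lebiere's exact procedure (the callables are hypothetical placeholders supplied by the user):

```python
def grow_network(train_to_plateau, add_hidden_unit, target=0.05, max_units=20):
    """Start with zero hidden units; add one each time training error
    levels out while still above the target."""
    for units in range(max_units + 1):
        err = train_to_plateau()   # train the current net until its error flattens
        if err <= target:
            return units           # error small enough: stop growing
        add_hidden_unit()          # freeze existing weights; wire a new node to
                                   # inputs, outputs, and previous hidden nodes
    return max_units
```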

Weight elimination, another extension of backpropagation, is a `destructive' approach to the problem of deciding the number of hidden units (Weigend et al., 1990). In this method an `oversized' network is initialized. Some decision does have to be made on what is `oversized'; more than a sufficient number of hidden units is meant, but it is not always clear what this number should be. Once decided, the algorithm automates the process of `pruning' weights from the network by adding a decay term to the error term guiding the algorithm. Thus weights that are not proving useful are forced to decay to near-zero values. Weight elimination helps to prevent overtraining by reducing the number of parameters in the network; thus it falls into the class of destructive algorithms. The statistical counterpart to weight elimination is regularization. Other destructive algorithms include `pruning' approaches (Reed, 1993).
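The decay idea can be illustrated with a plain quadratic penalty (our sketch; Weigend et al.'s weight-elimination term has a slightly different, saturating form):

```python
import numpy as np

def decayed_error(t, a, weights, lam=1e-3):
    """Sum squared error plus a decay penalty on the weights: gradient
    descent on this combined term forces unhelpful weights toward zero,
    the 'pruning' effect described above."""
    sse = np.sum((t - a) ** 2)
    penalty = sum(np.sum(w ** 2) for w in weights)
    return sse + lam * penalty
```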

Radial basis functions are sometimes treated as alternatives to backpropagation, as they are implementable as modifications to the commonly used logistic backpropagation activation function. A radially symmetric function such as the Gaussian replaces the sigmoid at the hidden units, so the units become `localized receptive fields' (Moody and Darken, 1989). However, exploiting the advantages of radial basis functions requires a fundamentally different approach to training, in which clustering occurs before weight adaptation. These networks can train much more quickly than backpropagation, but still require the user to predetermine the `best' number of hidden units, or clusters. Thus radial basis function modifications have arisen that mirror the cascade correlation algorithm's changes to backpropagation: initialize with zero hidden units, or some minimal number, then add hidden units as needed according to an error criterion (see, for example, dynamic radial basis functions, Blevins and St Clair, 1993). There is a significant advantage to using constructive or destructive methods that automate the process of hidden unit determination. While these kinds of algorithm still require some parameter tuning, they do not require the user to experiment with a variety of architectures in terms of hidden units. If parameter settings can be made in advance, then only one network will require training. A second, separate test set will then adequately validate the network. The `overhead' time involved in training is then considerably reduced, as is the data requirement. In addition, these methods can sometimes report an indication of confidence in their answers. They may refrain from responding to input data that is significantly different from that on which they were trained (Burke, 1993), or they may actually provide an estimate of the confidence invested in the answer provided. Either way, such information can be highly practical in real applications.

2.4. Summary: empirical nets

Clearly, we hope, the above discussion demonstrates that a simple knowledge of statistics is essential to interpret neural network performance adequately. The belief that one need not understand statistics is misguided; however, it is certainly true that neural networks provide a reasonably straightforward means to tackle difficult, non-linear problems that may be appropriately solved only by complicated statistical techniques. Understanding details about the application area is also essential; the amount and quality of the data available may be estimated or understood only if a clear understanding of the process from which it is generated is available.

3. Hopfield network

The Hopfield network and its limitations are perhaps even more misunderstood than backpropagation and its rivals. As mentioned in the introduction, the Hopfield network solves fundamentally different problems than does backpropagation. Typically it is used in pattern association and combinatorial optimization applications. Many engineers scorn the Hopfield network because of its astonishingly poor performance on the traveling salesman problem – a performance that was, however, accompanied by a great deal of fanfare. The shortcomings of the original Hopfield network have continued to inspire research, and although it still may be a poor choice for solving the traveling salesman problem – we shall address this question below – it seems premature to write it off altogether. The traveling salesman problem (TSP) is a well-known combinatorial problem, which serves as a benchmark for combinatorial optimization algorithms. It is easy to state, but technically difficult to solve (it falls in the class of NP-hard problems – see Lawler et al., 1985 for a thorough discussion of the TSP and the NP class of problems). Simply stated, it is the problem of finding a minimum-cost route that begins and ends at the same point and visits each customer from a list of customer locations exactly once. Despite the apparent simplicity of the problem, the difficulty of finding the optimal solution is well understood. Further, it has application in numerous manufacturing logistics problems. The initial report of the Hopfield network application to the TSP (Hopfield and Tank, 1985) described a method that solved only ten-city TSPs, and even then could not guarantee feasibility (80% of trials converged to a feasible solution). On the other hand, it did find the optimal solution 50% of the time. While the approach held great interest for those already concerned with neural networks, to the practised engineer such results completely discredited the technology. Worse, Wilson and Pawley (1988) showed soon thereafter that even these results could not be reproduced. The key parameters of the Hopfield network, it turned out, require careful tuning to ensure reasonable results, but no helpful guidelines existed. Not surprisingly, these and similar findings led the majority of engineers with expertise in combinatorial optimization away from the Hopfield network in droves. In recent years, however, there have been advances in the Hopfield network and new insights into its previous shortcomings, which deserve note. Not only have improvements to the original network shown promise, but insights from other branches of AI and from OR itself have led to major changes in the approach, and suggest that the Hopfield network has potential both in ultimate hardware implementation and possibly even in serial simulation.
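As background for what follows, the computational element itself is simple. A minimal sketch of a discrete Hopfield update (our illustration for intuition; Hopfield and Tank's optimization network actually uses continuous-valued units):

```python
import numpy as np

def hopfield_step(W, v, theta=0.0):
    """One asynchronous update of a discrete Hopfield network, assuming
    symmetric weights W with zero diagonal: a randomly chosen unit
    switches on when its weighted input exceeds the threshold. Repeated
    updates never increase the network's energy function."""
    i = np.random.randint(len(v))         # pick one unit at random
    v[i] = 1.0 if W[i] @ v > theta else 0.0
    return v
```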

The discussion here will focus on the network's application to combinatorial optimization problems, and specifically the TSP. The net's application to the TSP continues to attract a great deal of attention, and its performance on the TSP – whether good or bad – can shed light on its overall limitations and potential.

3.1. The TSP issue

The realization that the Hopfield network performs quite poorly on the TSP has recently inspired some new reasoning as to why. First, early implementations struggled with feasibility. Hopfield networks attempt to minimize an energy function that combines the objective(s) and constraints of a combinatorial optimization problem. The energy function, like a penalty function in conventional optimization, requires the determination of weighting parameters on the terms corresponding to objective and constraints. In the early implementations the determination of these parameters was left to trial and error, often resulting in infeasibility. Now, parameters may be determined a priori to ensure feasible solutions (Beyer and Ogier, 1990).
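For concreteness, the penalty-style energy in Hopfield and Tank's original formulation takes the following general form (notation ours), with v_{Xi} = 1 when city X occupies position i in the tour, d_{XY} the distance between cities X and Y, and N the number of cities:

\[
E = \frac{A}{2}\sum_X \sum_i \sum_{j \neq i} v_{Xi} v_{Xj}
  + \frac{B}{2}\sum_i \sum_X \sum_{Y \neq X} v_{Xi} v_{Yi}
  + \frac{C}{2}\Bigl(\sum_X \sum_i v_{Xi} - N\Bigr)^2
  + \frac{D}{2}\sum_X \sum_{Y \neq X} \sum_i d_{XY}\, v_{Xi}\,(v_{Y,i+1} + v_{Y,i-1})
\]

The first three terms penalize infeasible assignments and the last measures tour length; the weighting parameters A–D are precisely the ones that early implementations had to tune by trial and error.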

Second, the neural network implementation used a formulation typically ignored by the OR community. This formulation is actually a quadratic non-linear program. Conventional OR approaches – successful ones, at least – generally use a different formulation of the TSP: a linear integer program. The problem that this formulation poses is that it includes explicit constraints to prevent subtours from arising in the solution (a subtour is a cycle of visits through a subset of the required cities). These subtour-breaking constraints do not appear to lend themselves practically to the energy function approach. Thus dropping them from the representation of the problem for the Hopfield network introduces the possibility of allowing subtours, which must then be detected and merged. Some researchers have begun to explore this representation in the Hopfield network implementation, and have found practical ways to detect and merge subtours (Joppe et al., 1990; Magent, 1996). Still other researchers propose that perhaps the problem is that the much-researched TSP is a poor benchmark for the Hopfield network. Indeed, numerous other combinatorial optimization problems have been solved by the Hopfield network with successes much more competitive with existing algorithms. Typically, the successful networks combine the Hopfield network with other search methods: that is, tabu search or self-organizing neural networks. For example, a variant of the Hopfield network incorporating self-organizing neural networks showed great promise when applied to a practical car-sequencing problem that arises in manufacturing (Smith et al., 1996). Excellent results in applications to problems of plant location (Magent, 1996) further illustrate the vast strides made in the past decade towards making the Hopfield network a viable method for real problems. Even the TSP has begun to yield to Hopfield variants. In recent experiments using the linear integer representation, a tabu search–Hopfield combination has yielded consistently near-optimal solutions for problems in the 100–200 city range (Magent, 1996). Whereas in 1985 only 10–30 city TSP problems were analysed, today problems that are orders of magnitude larger are being solved, with much more impressive outcomes. Finally, numerous researchers have appealed to the claim of hardware implementability without explicit justification, and have unfortunately further alienated many engineers. While the promise of ultimate hardware implementability remains tantalizing, and certainly would change the face of such neural network approaches (simulated on serial hardware, most Hopfield networks take hours or days to accomplish tasks that would probably take seconds in parallel hardware), it is not easy to understand just which Hopfield network variants would have this property, and when. In the meantime, newer, smarter serial software implementations seem to be paving the way for efficient implementations that utilize presently available technology (Magent, 1996).

4. Conclusions

It is as unfair to paint a rosy picture of the state of neural network research as it is to dismiss it unilaterally. In the view of these authors, several misunderstandings require clarification, and conventions must be adopted, before the technique has a chance, and even a right, to take hold. This overview has attempted to discuss many practical misunderstandings surrounding the application of backpropagation. An eagerness to ignore statistics is often at the root of problematic applications. On the other hand, many observers of the field see analogies between backpropagation and statistics, and dismiss the new technology as repackaged regression. While there are similarities between backpropagation and linear regression, the former technique has the ability to generate non-linear functional mappings. The real analogy is between backpropagation and non-linear regression, and the latter approach is significantly more difficult to apply than its linear counterpart. Still, backpropagation is essentially a stochastic approximation to non-linear regression, so it offers an accessible approach to what would otherwise be time-consuming and difficult for the non-expert statistician. While neural network software packages have certainly made these techniques accessible, it is still necessary to understand something about the application area and the data available, statistical assessment practices, and, ideally, the neural network in use. The Hopfield network has also suffered from excessive hype and relatively poor performance in applications to the TSP.

The intriguing new insights from new research that have surfaced, however, suggest that there may be problems with the representation used by neural network researchers, or even with the use of the TSP as a benchmark for such neural networks. At the same time, new approaches are yielding results orders of magnitude more quickly on problems that are orders of magnitude larger than those that first appeared a decade ago. Applications to several other practical combinatorial optimization problems also show great promise. And finally, a few researchers are beginning to address more directly the issue of how to implement the Hopfield network in practice on currently available hardware.

References

Beyer, D. and Ogier, R. (1990) The tabu learning search neural network method applied to the traveling salesman problem, unpublished technical report, SRI International, Stanford, CA.
Blevins, W. and St Clair, D. (1993) Determining the number and placement of functions for radial basis approximation networks, in Intelligent Engineering Systems through Artificial Neural Networks, Vol. 3, Dagli, C., Burke, L., Fernandez, B. and Ghosh, J. (eds), ASME Press, New York, pp. 45–50.
Burke, L. (1993) Comparing neural networks for classification: advantages of RCE-based networks, in Intelligent Engineering Systems through Artificial Neural Networks, Vol. 3, ASME Press, New York.
Burke, L. and Rangwala, S. (1991) Tool condition monitoring in metal cutting: a neural network approach. Journal of Intelligent Manufacturing (Special Issue on Neural Networks), 2, 269–280.
Fahlman, S. and Lebiere, C. (1990) The cascade-correlation learning architecture, in Advances in Neural Information Processing Systems 2, Touretzky, D. (ed.), Morgan Kaufmann, San Mateo, CA, pp. 524–532.
Fausett, L. (1994) Fundamentals of Neural Networks. Prentice-Hall, Englewood Cliffs, NJ.
Frean, M. (1990) The upstart algorithm: a method for constructing and training feedforward neural networks. Neural Computation, 2, 198–209.
Golden, B., Wasil, E., Coy, S. and Dagli, C. (1997) Neural networks in practice: survey results, in Computer Science and Operations Research: Advances in the Interface (in press).
Hecht-Nielsen, R. (1990) Neurocomputing. Addison-Wesley, Reading, MA.
Hopfield, J. J. (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.

Hopfield, J. J. and Tank, D. (1985) Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Joppe, A., Cardon, H. R. A. and Bioch, J. C. (1990) A neural network for solving the traveling salesman problem on the basis of city adjacency in the tour, in Proceedings of the International Neural Network Conference, Paris, IEEE, pp. 254–257.
Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G. and Shmoys, D. B. (1985) The Traveling Salesman Problem. John Wiley and Sons, Chichester.
Magent, M. (1996) Combining neural networks and tabu search in a fast neural network simulation for combinatorial optimization, PhD Dissertation, Lehigh University.
Mezard, M. and Nadal, J. (1989) Learning in feedforward layered networks: the tiling algorithm. Journal of Physics A: Mathematical and General, 22, 2192–2203.
Moody, J. (1992) The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems, in Advances in Neural Information Processing Systems 4, Moody, J. et al. (eds), Morgan Kaufmann, San Mateo, CA, pp. 847–854.
Moody, J. and Darken, C. (1989) Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Reed, R. (1993) Pruning algorithms – a survey. IEEE Transactions on Neural Networks, 4(5), 740–747.
Reilly, D., Cooper, L. and Elbaum, C. (1982) A neural model for category learning. Biological Cybernetics, 45, 35–41.
Rumelhart, D., Hinton, G. and Williams, R. (1986) Learning representations by back-propagating errors. Nature, 323, 533–536.
Smith, K., Palaniswami, M. and Krishnamoorthy, M. (1996) A hybrid neural approach to combinatorial optimization. Computers and Operations Research, 23(6), 597–610.
Specht, D. (1990) Probabilistic neural networks. Neural Networks, 3(1), 109–118.
Weigend, A., Huberman, B. and Rumelhart, D. (1990) Predicting the future: a connectionist approach. International Journal of Neural Systems, 1(3), 193–210.
Wilson, G. V. and Pawley, G. S. (1988) On the stability of the travelling salesman problem of Hopfield and Tank. Biological Cybernetics, 58, 63–70.
Yang, J., Parekh, R. G. and Honavar, V. G. (1996) MTiling – a constructive neural network learning algorithm for multi-category pattern classification, in Proceedings of the World Congress on Neural Networks, Lawrence Erlbaum, Hillsdale, NJ, pp. 182–187.