Estimating Prediction Intervals for Artificial Neural Networks

Lyle H. Ungar (University of Pennsylvania), Richard D. De Veaux (Williams College), and Evelyn Rosengarten (University of Pennsylvania)

Abstract

Neural networks can be viewed as nonlinear models in which the weights are parameters to be estimated. In general, two parameter estimation methods are used: nonlinear regression, corresponding to the standard backpropagation algorithm, and Bayesian estimation, in which the model parameters are considered as random variables drawn from a prior distribution that is updated based on the observed data. These two estimation methods suggest different ways of calculating prediction intervals for neural networks. We present some preliminary observations comparing the ability of the two methods to provide accurate prediction intervals.

1 Introduction

Artificial neural networks (ANNs) are being used with increasing frequency as an alternative to traditional models for a range of applications, including model-based control. Unfortunately, ANN models rarely provide any indication of the accuracy or reliability of their predictions. (One exception is radial basis function networks; see Leonard et al., 1992.) In this paper, we compare two approaches to obtaining prediction limits for ANNs: a frequentist approach, based on standard nonlinear regression theory, and a Bayesian approach, following recent work by MacKay (1992) and Neal (1994). We present preliminary comparisons of the two approaches via Monte Carlo simulation, examining the coverage probabilities of the prediction intervals, their computational costs, and practical implementation issues.

Being able to estimate the uncertainty of predictions is important for many reasons. In process control, for example, when designing controllers and verifying their performance, it is useful to know the accuracy of the model used by the controller. A particularly interesting application occurs in model switching, where the controller chooses among multiple models at each discrete control step. Model accuracy, along with model complexity, serves as a criterion for model selection.

We concentrate on the most widely used neural networks: feedforward neural networks with sigmoidal activation functions ("backpropagation networks"). These neural nets can be viewed as nonlinear models. The parameters (network weights) are typically estimated by minimizing a residual sum of squares prediction error, using gradient descent or conjugate gradient methods. Thus, the networks can be thought of as doing nonlinear regression. Standard methods exist for estimating the prediction uncertainties of nonlinear regression (see, e.g., Seber and Wild, 1989), based on local linearizations of the model. These methods have been applied to neural nets (see, e.g., Ding and Hwang, 1995), but although they work well on small problems, they are often less reliable on the larger problems typically addressed using neural networks.

Recently, a Bayesian approach has been proposed for estimating the parameters in neural nets (MacKay, 1992; Neal, 1994). The parameters in the neural network are considered as being drawn from a distribution (e.g., Gaussian) which is characterized by a set of hyperparameters (e.g., the mean and variance of the Gaussian). The parameters in the Bayesian neural network are estimated using Monte Carlo methods. The network predictions are then simply a summary (typically the mean or mode) of the posterior distribution of the parameters. The Bayesian approach provides a natural framework for estimating prediction intervals, as will be described below.

We present some preliminary observations comparing the conventional frequentist and Bayesian approaches to estimating the uncertainty of neural networks described above. The methods differ by orders of magnitude in speed, with the backpropagation method, although not fast in absolute terms, being much faster than the Bayesian approach. Estimating the parameters in the Bayesian neural network is also substantially more complex than for backpropagation networks. While it is clear from this preliminary investigation that the Bayesian methods require much more computation time and effort, it is less clear that the accuracy of the prediction intervals is improved. Clearly, much more work needs to be done on this problem.

2 Neural Nets as Nonlinear Regressors

Neural networks are simply nonlinear equations containing (many) parameters to be estimated. For example, a "backpropagation" network has the form:

$$\hat{y}_k = \sigma^{(2)}\!\left( \sum_{j=1}^{n_h} w_{jk}\, \sigma^{(1)}\!\left( \sum_{i=1}^{n} w_{ij} x_i + \theta_j \right) + \theta_k \right) \qquad (1)$$

where, typically,

$$\sigma^{(1)}(z) = \frac{1}{1 + e^{-z}}$$

and often $\sigma^{(2)}(z) = z$. The standard way of estimating these parameters is to select them to minimize the prediction error on a training data set. This is, in effect, nonlinear regression, and can be done using a variety of methods, including gradient descent methods such as backpropagation.

Viewing neural networks as nonlinear regression suggests calculating prediction intervals using the standard method for nonlinear regression. Let

$$y_i = f(x_i; \theta) + \epsilon_i \qquad (i = 1, \ldots, n)$$

represent the network in equation (1), where the ε_i are i.i.d. N(0, σ²). If one linearizes the neural network around x = x_0, and n is large enough so that θ̂ ≈ θ, we have the Taylor expansion

$$f(x_0; \hat{\theta}) \approx f(x_0; \theta) + f_0'(\hat{\theta} - \theta)$$

where

$$f_0' = \left[ \frac{\partial f(x_0; \theta)}{\partial \theta_1}, \frac{\partial f(x_0; \theta)}{\partial \theta_2}, \ldots, \frac{\partial f(x_0; \theta)}{\partial \theta_p} \right].$$

A prediction interval for y_0 can then be given by (see, e.g., Seber and Wild, 1989):

$$\hat{y}_0 \pm t_{\alpha/2,\, n-p}\, s\, \left[ 1 + f_0' (F_{\cdot}' F_{\cdot})^{-1} f_0 \right]^{1/2}$$

where

$$F_{\cdot} = \left[ \frac{\partial f(x_i; \theta)}{\partial \theta_j} \right]$$

is the n × p matrix of derivatives evaluated at the training points, s is the estimate of the residual standard deviation, and t_{α/2, n−p} is the upper α/2 quantile of the t distribution with n − p degrees of freedom, where n is the number of data points and p is the number of parameters in the model. Typically,

$$s^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - p}$$

where p, the number of degrees of freedom in the model, is problematic, since neural networks are often used with more parameters than needed, but regularized toward 0 by using only partial convergence during the training. A number of variants of the equation have been proposed, using different estimates of the covariance of the model parameters, but all give similar results (see, e.g., Donaldson and Schnabel, 1987).
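
As a concrete illustration of the bookkeeping involved, the following is a minimal sketch (not the authors' code) of the delta-method interval above, for a tiny one-hidden-layer network of the form in equation (1) fit by nonlinear least squares. The helper names, the use of scipy, the finite-difference gradient, and the choice of three hidden units (the paper uses eight) are our own assumptions.

```python
# Minimal sketch of the nonlinear-regression prediction interval of Section 2.
import numpy as np
from scipy.optimize import least_squares
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)
n, n_hidden = 50, 3
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

def net(theta, x):
    """One-hidden-layer net: sigmoidal hidden units, linear output (eq. 1)."""
    w1 = theta[:n_hidden]                      # input-to-hidden weights
    b1 = theta[n_hidden:2 * n_hidden]          # hidden biases
    w2 = theta[2 * n_hidden:3 * n_hidden]      # hidden-to-output weights
    b2 = theta[-1]                             # output bias
    h = 1.0 / (1.0 + np.exp(-(np.outer(x, w1) + b1)))
    return h @ w2 + b2

p = 3 * n_hidden + 1                           # number of parameters
fit = least_squares(lambda th: net(th, x) - y, 0.5 * rng.normal(size=p))
theta_hat = fit.x
s2 = np.sum(fit.fun ** 2) / (n - p)            # s^2 = RSS / (n - p)
F = fit.jac                                    # F. = d f(x_i; theta) / d theta_j

def prediction_interval(x0, alpha=0.05, eps=1e-6):
    """(1 - alpha) delta-method interval at a scalar input x0."""
    # f0: gradient of the prediction at x0 w.r.t. the parameters (finite diff.)
    f0 = np.array([(net(theta_hat + eps * np.eye(p)[j], x0)
                    - net(theta_hat - eps * np.eye(p)[j], x0)) / (2 * eps)
                   for j in range(p)]).ravel()
    half = (t_dist.ppf(1 - alpha / 2, n - p)
            * np.sqrt(s2 * (1.0 + f0 @ np.linalg.pinv(F.T @ F) @ f0)))
    y0 = net(theta_hat, np.atleast_1d(x0))[0]
    return y0 - half, y0 + half

print(prediction_interval(0.25))
```

Note that the width of the interval depends on n − p, which is exactly the quantity that becomes problematic for overparameterized networks, as discussed in Section 4.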

3 Neural Nets as Bayesian Estimators

Bayesians take a different view of the modeling and estimation problem: the parameters θ are drawn from a distribution. During the estimation phase, one uses observations z = {x, y} to update the (prior) distribution of θ. Then, given a distribution of θ, new z's can be estimated by taking the appropriate expectation: the weighted average over the different values of θ. More precisely, Bayesians attempt to compute the posterior density of θ given the data z:

$$p(\theta \mid z) = \frac{p(z \mid \theta)\, p(\theta)}{\int p(z \mid \theta)\, p(\theta)\, d\theta}$$

where, following MacKay (1992), p(θ|z) is the posterior, p(z|θ) is the likelihood, p(θ) is the prior, and p(z) = ∫ p(z|θ) p(θ) dθ is the evidence. Actually, θ must be defined in the context of a model, M_i. So, more precisely,

$$p(\theta \mid z, M_i) = \frac{p(z \mid \theta, M_i)\, p(\theta \mid M_i)}{p(z \mid M_i)}.$$

We can look at the posterior density of the model after viewing the data:

$$p(M_i \mid z) \propto p(z \mid M_i)\, p(M_i).$$

Assuming the prior p(M_i) is uniform, we can rank the likelihood of the different models as indicated by the evidence. In practice, one does not need to select a particular model, since predictions are made using a weighted average over the different models (e.g., different numbers of hidden nodes).

One can then form a Bayesian model of a neural net:

$$p(z \mid \theta, M) = \frac{\exp(-E_D)}{\sqrt{2\pi}}$$

and

$$p(\theta \mid \alpha, M) = \frac{\exp(-\alpha E_W)}{\sqrt{2\pi/\alpha}}$$

where E_D is the objective function being minimized:

$$E_D(\theta) = \frac{1}{2} \sum_m \sum_i \left( y_i^{(m)} - \hat{y}_i^{(m)} \right)^2$$

and E_W = ½ Σ_i w_i² is a regularization penalty, which encourages small weights and hence models with smooth predictions. The amount of regularization is controlled by α, which is estimated as part of the procedure. We then minimize

$$E(\theta) = E_D + \alpha E_W,$$

so

$$p(\theta \mid z, \alpha, M) = \frac{p(z \mid \theta, \alpha, M)\, p(\theta \mid \alpha, M)}{p(z \mid \alpha, M)} = \frac{\exp(-E(\theta))}{Z_M}.$$

One then needs to estimate the parameters in the model. One can do this in three ways:

- Analytical, involving closed-form expressions for the posterior, typically using conjugate families of densities
- Monte Carlo methods
  - Gibbs sampling
- Gaussian approximation (a sketch of the resulting interval follows this list)
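
The Gaussian-approximation route can be sketched as follows. This is a hedged illustration of the general idea (linearize the network output around the most probable weights and treat the Hessian of the regularized objective E(θ) as the posterior precision), not code from MacKay (1992); the arguments `predict`, `grad_output`, `hessian`, and `noise_var` are assumed to be supplied by the caller, e.g., via automatic differentiation.

```python
# Hedged sketch of a Gaussian (Laplace) approximation prediction interval.
import numpy as np
from scipy.stats import norm

def gaussian_approx_interval(x0, w_mp, predict, grad_output, hessian,
                             noise_var, alpha_level=0.05):
    g = grad_output(x0, w_mp)                    # d y(x0; w) / d w  at w_mp
    A = hessian(w_mp)                            # Hessian of E(theta) at w_mp
    var = noise_var + g @ np.linalg.solve(A, g)  # noise + parameter uncertainty
    z = norm.ppf(1.0 - alpha_level / 2.0)
    y0 = predict(x0, w_mp)
    return y0 - z * np.sqrt(var), y0 + z * np.sqrt(var)
```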

The first implementations of Bayesian learning (MacKay, 1992) used a Gaussian approximation for the posterior distribution of the network parameters (weights) and single-valued estimates for the hyperparameters (means and variances). More recently, a different approach to estimating the parameters in neural nets has been implemented (Neal, 1994). Neal's method is complex and computationally demanding, but seems to get excellent results. The key to his method lies in sampling from and averaging over a posterior distribution of models. This approach uses the hybrid Monte Carlo algorithm for updating the network parameters; the hyperparameters are updated separately using Gibbs sampling. The hybrid Monte Carlo algorithm combines the Metropolis algorithm with sampling based on dynamical simulation. The method is applied by first formulating the desired distribution in terms of a potential energy function. Candidate states are then found using a leapfrog discretization of Hamiltonian dynamics. A complex heuristic, based on the values of the hyperparameters, is used to select the step sizes for the leapfrog method. The main benefit of the hybrid Monte Carlo approach over other Markov chain Monte Carlo methods is that it avoids random walk behavior. In comparison to the previous Gaussian implementations of Bayesian learning, the hybrid Monte Carlo approach had significantly better results on large problems.

Once one has estimated the distributions of the parameters, predictions and prediction intervals can be estimated by integrating (marginalizing) over the parameters θ, the models M, and the amount of regularization α:

$$p(y \mid x) = \int dM \int d\alpha \int P(y \mid \theta, x, \alpha, M)\, P(\theta \mid \alpha, x, M)\, d\theta.$$

Note that the integrand is simply the product of the probability of y given the model and the parameters, times the probability of the parameters. In practice, the Monte Carlo method provides a set of estimates of y distributed according to the posterior distributions of the parameters and models. The prediction ŷ can be estimated by the median, mode, or expected value of these draws. Prediction intervals are trivially calculated by finding, e.g., the 2.5th and 97.5th percentiles of the predictions.
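
A minimal sketch of this machinery, under our own simplifying assumptions: a single hybrid Monte Carlo update (leapfrog steps plus a Metropolis accept/reject on the total energy), and a percentile-based prediction interval computed from predictive draws. This is not Neal's implementation, which also Gibbs-samples the hyperparameters and chooses leapfrog step sizes heuristically; the names `hmc_step`, `mc_prediction_interval`, `net`, `U`, `grad_U`, and the fixed step size and noise level are illustrative.

```python
# Sketch of a hybrid Monte Carlo update and Monte Carlo prediction intervals.
import numpy as np

def hmc_step(theta, U, grad_U, step_size=0.01, n_leapfrog=20, rng=None):
    """One hybrid MC update; U is the potential energy (-log posterior + const)."""
    rng = rng or np.random.default_rng()
    p0 = rng.normal(size=theta.shape)                 # sample fresh momenta
    th = theta.copy()
    p = p0 - 0.5 * step_size * grad_U(th)             # initial half momentum step
    for _ in range(n_leapfrog - 1):
        th = th + step_size * p                       # full position step
        p = p - step_size * grad_U(th)                # full momentum step
    th = th + step_size * p
    p = p - 0.5 * step_size * grad_U(th)              # final half momentum step
    # Metropolis test on the total energy H = U + kinetic energy
    h_old = U(theta) + 0.5 * p0 @ p0
    h_new = U(th) + 0.5 * p @ p
    accept = rng.uniform() < np.exp(min(0.0, h_old - h_new))
    return (th, True) if accept else (theta, False)

def mc_prediction_interval(theta_samples, noise_sd, net, x0, alpha=0.05,
                           rng=None):
    """Point prediction and (1 - alpha) interval from posterior weight draws."""
    rng = rng or np.random.default_rng()
    # In the full treatment the noise level is itself sampled; it is fixed here.
    draws = np.array([net(th, x0) + noise_sd * rng.normal()
                      for th in theta_samples])
    lo, hi = np.percentile(draws, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return draws.mean(), (lo, hi)                     # mean; median/mode also usable
```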

4 Simulation Results

How well do the Bayesian and nonlinear regression methods work? Figures 1 and 2 show a portion of a sine wave, trained on 50 data points, with 95% prediction limits calculated using the nonlinear regression method described in Section 2 and the Bayesian method described in Section 3, respectively. All calculations are done using a single hidden layer of eight nodes, sigmoidal transfer functions for the hidden layer, and a linear function for the output layer. Tables 1 and 2, respectively, give the prediction error and the coverage probabilities of nominal 95% prediction intervals, based on a test sample of 1000 points, for the nonlinear regression and Bayesian methods. Results are presented for two data sets: a sine wave y = sin(2πx) + ε, with ε being Gaussian noise with standard deviation 0.1, and a function that Friedman (1991) used to present his MARS algorithm, which has ten predictors (inputs), of which five (x_6 to x_10) do not affect the response (output):

$$y = \left( 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon \right) / 25.$$
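
For concreteness, the following is a hedged sketch of the two simulated data sets (not the authors' simulation code), using the noise levels quoted in the text: standard deviation 0.1 for the sine data, and N(0, 1) noise on the Friedman function, which becomes 1/25 = 0.04 after scaling.

```python
# Sketch of the two test data sets used in the simulations.
import numpy as np

def make_sine(n=50, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=n)
    return x, y

def make_friedman(n=50, rng=None):
    rng = rng or np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, (n, 10))          # x6..x10 do not affect y
    y = (10.0 * np.sin(np.pi * x[:, 0] * x[:, 1])
         + 20.0 * (x[:, 2] - 0.5) ** 2
         + 10.0 * x[:, 3] + 5.0 * x[:, 4]
         + rng.normal(size=n)) / 25.0
    return x, y
```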

           Regression   Bayesian
  sine        0.12        0.11
  MARS        0.31        0.29

Table 1: Prediction accuracy on the test set

The fit for the sine function is excellent: it is to data which has noise of 0.1. The fit for the MARS function is less good, as the noise is only 0.04; but the function is a complex one to fit with only 50 data points, since it has ten predictors.

           Regression   Bayesian
  sine        98.8        98.4
  MARS       100.0       100.0

Table 2: Actual coverage probabilities (%) for nominal 95% prediction intervals

Figure 1: sin(2πx) with prediction and 95% prediction intervals: regression method

Figure 2: sin(2πx) with prediction and 95% prediction intervals: Bayesian method

Note that the prediction intervals are much too broad for both methods. For the nonlinear regression method, this is easily understood. The regression formula requires dividing by n − p, where n is the number of data points (50 in this case) and p is the number of parameters in the model. For the sine function there are 25 parameters, and for the MARS function there are 97; with only 50 data points, n − p is 25 in the first case and negative in the second. Clearly the standard regression formula is not appropriate here. In fact, it is standard practice to stop the training of neural networks prior to convergence, which reduces the effective number of parameters. Even if the network is trained to convergence, as was done here, the structure of the neural network virtually guarantees that the effective number of parameters is much smaller than the actual number.

The analysis of the Bayesian case is more subtle. The model that we are using assumes a nonlinear function with additive Gaussian noise, and estimates the noise as one of the parameters. For the sine function, the noise parameter is estimated to be 0.12, which is quite close to the correct value of 0.1. This could be used to analytically calculate the prediction intervals. We have chosen instead to use Monte Carlo sampling, which extends to more complex distributions. We are still investigating why these bounds are inaccurate.

There is a further limitation of both methods as implemented: the error is assumed to have constant variance (homoscedastic). If the error standard deviation is made variable, as in Figure 3, the bounds do not reflect the variable noise for either method. This is difficult to fix in the regression case, but can be more easily incorporated in the Bayesian case by including heteroscedasticity in the model.

The data density may also vary. The nonlinear regression method handles this somewhat sensibly. When the data become very sparse, the bounds, of course, become very wide. Note that there are two data points near x = 0.9 which pull in the bounds towards the right hand side of Figure 4.

Figure 3: sin(2πx) with variable noise: regression method

Figure 4: sin(2πx) with variable data density: regression method

Figure 5: sin(2πx) with variable data density: Bayesian method

5 Conclusions

We can characterize the performance of the two different approaches to training neural networks on many dimensions: prediction accuracy, prediction interval accuracy, speed, and the number of adjustable parameters.

The nonlinear regression approach to neural network parameter estimation has a long and successful history. It is easily extended (at no extra cost) to provide prediction intervals, but requires some estimate of the effective number of parameters.

The Bayesian method of model estimation is less well known, and must still be tried on more data sets before one can be fully confident of it. In our experience, if one samples over a sufficiently large number of networks (and this can be a very computationally demanding "if"), then the prediction accuracy is excellent. There are a huge number of hyperparameters to specify (the priors on the distributions of the weights and the noise), but the results are not unreasonably sensitive to these, since virtually everything can be estimated as part of the modeling procedure. The Bayesian method has the further virtue that the regularization is explicit (which has been done for backpropagation networks as well) and that the degree of regularization can be estimated as part of the modeling procedure.

Bootstrapping is another potential method of estimating prediction intervals (see, e.g., Baxt and White, 1995), which can be done with either model. It can be very effective when sufficient data and computational power are available. This is not a trivial statement; some bootstrapping calculations have been run for six months! (White, 1993) The methods that we have presented in this paper offer the advantage that no resampling is required, so all points can be used in model estimation, and, since multiple models are not required, computation is reduced. Bootstrapping is also model-free, which can be an advantage (e.g., in capturing variable noise) or a disadvantage (e.g., explicit noise models reduce the amount of data required).

Clearly, this research is only preliminary. We are testing these methods on other data sets, examining ways to estimate the effective number of model parameters, and exploring more complex Bayesian models which include heteroscedasticity. Furthermore, the methods described above do not address extrapolation error: the prediction intervals are only meaningful where data were present to build the model. Separate methods are required for estimating the likelihood that new points are outside the region of validity of the model.

6 Acknowledgments

We thank NSF for support of this project through grants DMS-95044451 and CTS95-04407.

7 References

Baxt, W.G. and White, H. (1995) "Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction," Neural Computation, 7, 624-638.

De Veaux, R.D., Psichogios, D.C., and Ungar, L.H. (1993) "A Tale of Two Non-parametric Estimation Schemes: MARS and Neural Networks," 4th Intl. Conf. on Artificial Intelligence and Statistics, Jan. 1993.

Ding, A.A. and Hwang, J.T.G. (1995) "Construct prediction intervals using artificial neural networks," Proceedings of the Section on Physical and Engineering Sciences, 73-78.

Donaldson, J.R. and Schnabel, R.B. (1987) "Computational experience with confidence regions and confidence intervals for nonlinear least squares," Technometrics, 29(1), 67-82.

Friedman, J.H. (1991) "Multivariate adaptive regression splines," Annals of Statistics, 19(1), 1-141.

Leonard, J., Kramer, M., and Ungar, L.H. (1992) "Using radial basis functions to approximate a function and its error bounds," IEEE Transactions on Neural Networks, 3(4), 624-627.

MacKay, D.J.C. (1992) "A Practical Bayesian Framework for Backpropagation Networks," Neural Computation, 4, 448-472.

Neal, R.M. (1994) "Bayesian Learning for Neural Networks," Ph.D. Thesis, Dept. of Computer Science, University of Toronto.

Seber, G.A.F. and Wild, C.J. (1989) Nonlinear Regression. New York: Wiley.

Weigend, A.S., Mangeas, M., and Srivastava, A.N. (1995) "Nonlinear gated experts for time series: discovering regimes and avoiding overfitting," International Journal of Neural Systems, 6, 373-399.

White, H. (1993) Personal communication.

Address correspondence to Lyle H. Ungar, 311A Towne Bldg., Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 ([email protected]).