Prediction Limit Estimation for Neural Network Models

R. B. Chinnam and J. Ding
Abstract—A novel method for estimation of prediction limits for global and local approximating neural networks is presented. The method partitions the input space using self-organizing feature maps to introduce the concept of local neighborhoods, and calculates limits that indicate the extent to which one can rely on predictions for making future decisions.

Index Terms—Estimation, feedforward neural networks, prediction intervals, prediction limits, self-organizing feature maps.
I. INTRODUCTION
ARTIFICIAL neural networks have gained a lot of interest as empirical models for their powerful representational capacity and their multi-input, multi-output mapping characteristics. In fact, most feedforward networks with nonlinear nodal functions have been proved to be universal approximators. However, conventional neural networks focus primarily on accurate prediction of output values and usually do not provide any information regarding the confidence with which they make these predictions. Since prediction limits indicate the extent to which one can rely on predictions for making future decisions, it is of paramount importance to estimate these limits.

Locally generalizing networks, such as radial basis function (RBF) networks and cerebellar model arithmetic computer (CMAC) networks, have a naturally well-defined concept of local neighborhood: training data and testing data are considered "local" to a test point if they are within a limited region around the test point. Such networks have been extended in the literature to include prediction limits on network predictions. For example, the validity index (VI) network derived from RBF networks fits functions and calculates error bounds for its predictions [9]. In contrast, globally generalizing networks [such as the multilayer perceptron (MLP) network with sigmoidal nonlinearities], proven to be very effective in function approximation and time-series forecasting, are ill-defined with regard to this concept of a local neighborhood and, hence, cannot be easily extended to incorporate prediction limits. Techniques proposed in the literature, including the confidence interval prediction technique of Chryssolouris et al. [1], typically make the strong assumption of constant variance of the data in the output space.

The paper begins by giving background on the properties of self-organizing feature maps (SOFM's) and their inherent ability to partition input spaces and introduce the definition of local neighborhoods. The paper then goes on to introduce
a novel method that utilizes SOFM's to estimate prediction limits for global and local approximating neural networks.

II. BACKGROUND ON SELF-ORGANIZING FEATURE MAPS

The principal goal of the SOFM developed by Kohonen [7] is to transform an incoming signal pattern of arbitrary dimension into a one- or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion. Typically, each input pattern presented to the network one at a time consists of a localized region, or "spot," of activity against a quiet background. Each such presentation causes a corresponding localized group of neurons in the output layer of the network to become active [4], introducing the concept of a neighborhood. A brief description of the SOFM from Haykin [4] follows to illustrate this concept of local neighborhood.

Let $\mathcal{X}$ denote a spatially continuous input space, the topology of which is defined by the metric relationship of the vectors $\mathbf{x} \in \mathcal{X}$. Let $\mathcal{A}$ denote a spatially discrete output space, the topology of which is endowed by arranging a set of neurons as the computation nodes of a lattice. Let $\Phi$ denote a nonlinear transformation called a feature map, which maps the input space $\mathcal{X}$ onto the output space $\mathcal{A}$, as shown by $\Phi: \mathcal{X} \to \mathcal{A}$. This may be viewed as an abstraction that defines the location of a winning neuron $i(\mathbf{x})$ developed in response to an input vector $\mathbf{x}$. Given an input vector $\mathbf{x}$, the SOFM algorithm proceeds by first identifying a best-matching or winning neuron $i(\mathbf{x})$ in the output space $\mathcal{A}$ in accordance with the feature map $\Phi$. The synaptic weight vector $\mathbf{w}_i$ of neuron $i(\mathbf{x})$ may then be viewed as a pointer for that neuron into the input space $\mathcal{X}$.

The ability of a self-organizing feature map to: 1) provide a good approximation of the input space $\mathcal{X}$; 2) exhibit topological ordering (the spatial location of a neuron in the lattice corresponds to a particular domain or feature of input patterns); and 3) provide density matching¹ (regions in the input space $\mathcal{X}$ that have a high probability density $f_X(\mathbf{x})$ are mapped onto larger domains of the output space $\mathcal{A}$) makes it an excellent candidate to introduce the definition of a local neighborhood into feedforward neural networks. This does not preclude the use of other clustering algorithms. However, traditional clustering algorithms, such as those driven by the standard nearest-neighbor rule [10], do not exhibit the density matching property and may lead to neighborhoods with significantly different populations (i.e., numbers of data points). This will influence the accuracy of estimation of the covariance matrices for the residuals in these neighborhoods, which is critical for accurate prediction limit estimation, as discussed later.

¹To enhance the density matching property of the SOFM algorithm, we advocate the incorporation of conscience into the SOFM algorithm, as proposed by DeSieno [2].
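As an illustration of the winner selection step just described, the following is a minimal numpy sketch (an illustrative reconstruction, not the authors' code; the function name and array layout are assumptions):

```python
import numpy as np

def winning_neuron(x, W):
    """Return the index of the best-matching (winning) neuron i(x).

    x: input vector of dimension d.
    W: (l, d) array of synaptic weight vectors, one row per SOFM neuron.
    Uses the minimum-distance Euclidean criterion described in the text.
    """
    distances = np.linalg.norm(W - x, axis=1)  # d_j(x) for every neuron j
    return int(np.argmin(distances))
```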
TABLE I
CASES USED FOR EVALUATION OF THE PROPOSED PL ESTIMATION METHOD AND THEIR RESULTS
Fig. 1. Results from Case-I study; truncated exponential pdf in input space; 90% PL's. (a) Case I-A data set with 1500 data points. (b) Case I-B data set with 150 data points.
Fig. 2. Results from Case-II study. (a) Original function and data with Gaussian noise—first output, y1. (b) FFN prediction—first output, y1. (c) Estimated 90% upper and lower PL surfaces—first output, y1. (d) Percentage error in estimating 90% PL's—first output, y1.
III. INTRODUCTION OF A DEFINITION OF LOCAL NEIGHBORHOOD FOR FFN'S USING SOFM'S

Let $N$ represent the total number of training patterns spanning the entire input space $\mathcal{X}$, and let $l$ denote the number of neurons in the SOFM. Let $M_j$, the "membership" of neuron $j$ in the discrete output space $\mathcal{A}$, represent the subset of training patterns from input space $\mathcal{X}$ that activate it. This is shown by

$$M_j = \{\mathbf{x} \in \mathcal{X} : i(\mathbf{x}) = j\} \quad \text{for all } j = 1, \ldots, l. \tag{1}$$

It is also true that the sum of the memberships of the SOFM neurons in the lattice output space must equal the total number of training patterns for the SOFM, as given by

$$\sum_{j=1}^{l} |M_j| = N. \tag{2}$$

The three properties exhibited by SOFM's (discussed earlier) provide the motivation to utilize the SOFM to break the input space $\mathcal{X}$ into $l$ distinct regions (denoted by $\mathcal{X}_j$) that are mutually exclusive, and hence satisfy the following relationship:

$$\mathcal{X} = \bigcup_{j=1}^{l} \mathcal{X}_j, \qquad \mathcal{X}_j \cap \mathcal{X}_k = \emptyset \text{ for } j \neq k. \tag{3}$$

All the patterns from any given distinct region $\mathcal{X}_j$, when provided as input to the feature map $\Phi$, will activate the same output neuron $j$. This is shown by

$$\Phi(\mathbf{x}) = j \quad \text{for all } \mathbf{x} \in \mathcal{X}_j. \tag{4}$$

Thus, using SOFM's, one can introduce the concept of a "local neighborhood," the resolution depending on the number of neurons $l$ in the discrete output space.
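A short sketch of the partitioning in (1)-(4), under the same assumptions as the previous snippet (illustrative names; numpy only):

```python
import numpy as np

def memberships(X, W):
    """Compute the neighborhood memberships M_j of (1).

    X: (N, d) array of training patterns; W: (l, d) SOFM weight vectors.
    Returns a list M with M[j] holding the indices of patterns that activate
    neuron j; the lengths of these index sets sum to N, as required by (2).
    """
    # Winning neuron i(x) for every pattern: argmin_j ||x - w_j||.
    distances = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    winners = np.argmin(distances, axis=1)
    return [np.flatnonzero(winners == j) for j in range(W.shape[0])]
```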
IV. CONSTRUCTION OF PREDICTION LIMITS FOR FFN'S

From the definition of a local neighborhood introduced in Section III, input signal patterns can be associated unambiguously with one of the $l$ distinct regions $\mathcal{X}_j$. Assuming that an FFN is being used for function approximation or time-series forecasting, an estimate of the covariance matrix for the FFN model residuals within the domain of region $\mathcal{X}_j$ is given by

$$\hat{\mathbf{C}}_j = \begin{bmatrix} \hat{\sigma}_{11} & \cdots & \hat{\sigma}_{1m} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{m1} & \cdots & \hat{\sigma}_{mm} \end{bmatrix} \tag{5}$$

where $\hat{\sigma}_{pq}$ denotes the covariance between output variables $p$ and $q$, $e_p^{(k)}$ denotes the FFN model residual for output variable $p$ for pattern $k$, and $m$ denotes the number of output variables predicted by the FFN.
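A minimal sketch of estimating the per-neighborhood covariance matrices of (5), assuming the residuals have already been computed by running the trained FFN over the training set (names are illustrative):

```python
import numpy as np

def neighborhood_covariances(E, memberships):
    """Estimate C_j of (5) for each SOFM neighborhood.

    E: (N, m) array of FFN model residuals (m output variables).
    memberships: list of index arrays M_j, one per neighborhood.
    Each neighborhood needs enough members (the paper suggests 10-20
    points) for the (m, m) covariance estimate to be reliable.
    """
    return [np.cov(E[idx], rowvar=False) for idx in memberships]
```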
Fig. 2. (Continued.) Results from Case-II study. (e) Original function and data with Gaussian noise—second output, y2. (f) FFN prediction—second output, y2. (g) Estimated 90% upper and lower PL surfaces—second output, y2. (h) Percentage error in estimating 90% PL's—second output, y2.
Assuming that the residuals are independent and Gaussian distributed with a constant covariance matrix over the domain of any region $\mathcal{X}_j$, but varying from domain to domain, the $(1-\alpha)$ quantile is given by the points $\mathbf{e}$ satisfying the following condition:

$$(\mathbf{e} - \bar{\mathbf{e}}_j)^T \hat{\mathbf{C}}_j^{-1} (\mathbf{e} - \bar{\mathbf{e}}_j) \le \chi^2_{m,\alpha} \tag{6}$$

where $\chi^2_{m,\alpha}$ is the $(1-\alpha)$ quantile of the Chi-Square distribution with $m$ degrees of freedom, $\bar{\mathbf{e}}_j$ is the mean residual vector for domain $\mathcal{X}_j$, and $\hat{\mathbf{C}}_j^{-1}$ is the inverse of the matrix $\hat{\mathbf{C}}_j$.

Simulation experiments (see Section VI) have revealed that the residuals in distinct neighborhoods do tend to exhibit a Gaussian distribution as long as the true noise in the overall data set is Gaussian. In fact, if the output variables are assumed to be independent, one can even relax the Gaussian residual assumption for distinct neighborhoods and arrive at an upper limit on the true prediction limits (PL's) for each of the output variables, based on Chebyshev's theorem [6], as follows:

$$\bar{e}_p \pm \frac{\hat{\sigma}_p}{\sqrt{\alpha}} \tag{7}$$

where $\bar{e}_p$ and $\hat{\sigma}_p$ denote the mean and standard deviation, respectively, of the residual of output variable $p$. If the FFN has adequate representational capacity, the fit should not be significantly biased, and the mean residual vector can be taken to be a null vector. The above PL estimation method can also be easily extended to determine the limits of the dispersion of the mean, i.e., the range of possible values for the mean predicted value (rather than the value for a single sample).
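The following sketch evaluates the condition in (6) and the distribution-free limits of (7); it is a reconstruction under the stated assumptions, with scipy supplying the Chi-Square quantile:

```python
import numpy as np
from scipy.stats import chi2

def inside_prediction_region(e, e_bar, C, alpha=0.1):
    """Test the condition of (6): the residual vector e lies inside the
    (1 - alpha) prediction ellipsoid of its neighborhood."""
    d = e - e_bar
    m = d.size  # number of output variables
    return float(d @ np.linalg.solve(C, d)) <= chi2.ppf(1.0 - alpha, df=m)

def chebyshev_limits(e_bar_p, sigma_p, alpha=0.1):
    """Distribution-free limits in the spirit of (7): by Chebyshev's theorem,
    mean +/- sigma / sqrt(alpha) covers at least a 1 - alpha fraction."""
    half_width = sigma_p / np.sqrt(alpha)
    return e_bar_p - half_width, e_bar_p + half_width
```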
However, there exists a strong limitation with the above approach, in that the ability of the method to accurately estimate the prediction limits in distinct regions of the input space depends on the resolution of the SOFM. In addition, the method exhibits the following detrimental characteristics: 1) within the input-space neighborhood of any given SOFM neuron, the estimated prediction limits are at a constant width from the neural network prediction (the nature of the model residuals within the complete neighborhood is represented by a single covariance matrix) and 2) the prediction limits are disjoint at the boundaries of neighborhoods due to the abrupt transition from one neighborhood to the adjacent one.

V. A TRUTH FUNCTION CONCEPT TO IMPROVE PREDICTION LIMIT ESTIMATES

One approach to address the limitations (associated with the prediction limit estimation procedure) discussed in Section IV is to introduce the concept of a "truth function" to evaluate the strength of the membership of any given data point within any distinct "local neighborhood." Such a truth function facilitates a weighted approach to the estimation of PL's, where the weights are the data point's strengths in different neighborhoods. One such truth function that has shown promise is given below:
$$\text{Truth}_j(\mathbf{x}) = \frac{1/d_j(\mathbf{x})}{\displaystyle\sum_{k=1}^{l} 1/d_k(\mathbf{x})} \tag{8}$$

where $d_j(\mathbf{x})$ is the output of neuron $j$ in the SOFM.² Essentially, this truth function determines the strength for a data point in a particular neighborhood as a function that is inversely proportional to the Euclidean distance of the data point from the location (or center) of the SOFM neuron representing the neighborhood. Since the PL's are being estimated using a weighted approach, it is essential that the total membership for any data point among all neighborhoods be equal to unity. In other words, the following constraint has to be satisfied by any credible truth function:

$$\sum_{j=1}^{l} \text{Truth}_j(\mathbf{x}) = 1. \tag{9}$$

The proposed truth function [i.e., (8)] certainly meets this constraint.
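A sketch of the truth-function weights in (8) and the unity constraint in (9), using normalized inverse Euclidean distance as the text describes (the small epsilon guarding a zero distance is an added assumption):

```python
import numpy as np

def truth_values(x, W, eps=1e-12):
    """Truth-function weights of (8) for data point x.

    W: (l, d) SOFM weight vectors. The weight for neighborhood j is
    inversely proportional to the Euclidean distance d_j(x) to neuron j,
    normalized so the weights sum to one, satisfying constraint (9).
    """
    d = np.linalg.norm(W - x, axis=1) + eps  # eps avoids division by zero
    inv = 1.0 / d
    return inv / inv.sum()
```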
Simulation studies have demonstrated that this approach is effective in estimating the PL's for neural network models. These studies are discussed in Section VI.

VI. PERFORMANCE EVALUATION OF THE PROPOSED PL ESTIMATION METHOD

To evaluate the performance of the proposed PL estimation method, data sets were generated with one- and two-dimensional input spaces, as shown in Table I. In the case of the one-dimensional input space, the data sets were generated from the univariate test function taken from [9], corrupted with Gaussian noise $\varepsilon$. The mean of the Gaussian noise was chosen to be zero and the standard deviation was a decreasing function of the input $x$. In the case of the two-dimensional input space, the data set was generated from two functions, one for each of the two output variables $y_1$ and $y_2$.

²Here, the output is the Euclidean distance between the neuron and the data point in the input space. In the SOFM algorithm proposed by Kohonen [8], these outputs are further transformed onto a binary scale using the concept of similarity matching (for finding the best-matching or winning neuron, using the minimum-distance Euclidean criterion).
Fig. 3. Influence of SOFM size on PL estimation accuracy. (a) Results from Case I-A. (b) Results from Case I-B. (c) Results from Case II.
Again, the mean of the Gaussian noise was chosen to be zero, and the standard deviations are functions of the inputs $x_1$ and $x_2$, as shown in Table I. Note that the standard deviation of the noise in the first output, $y_1$, is very high near one corner of the input space, while the standard deviation of the noise in the second output, $y_2$, is very high near the opposite corner of the input space.

For Case I, the data sets were generated with 1500 data points (labeled Case I-A) and 150 data points (labeled Case
TABLE II
CHI-SQUARE TEST RESULTS FOR CASE I-A; CHECKING GAUSSIAN RESIDUAL ASSUMPTION
I-B) to study the degradation in the accuracy of calculation of prediction limits as a function of the size of the data set. For Case II, with a two-dimensional input space and a two-dimensional output space, the data set was generated with 400 data points. In all cases, multilayer perceptron networks, proven to be universal approximators in the literature [5], were used for function approximation. The configurations of the networks that proved to be effective in offering good generalization after training for 5000 epochs are also shown in Table I. In all cases, the neurons had a sigmoid nonlinearity, except for the neurons in the output layers, which were linear.

With respect to selecting the configuration for the SOFM's, two factors should be taken into consideration. First, the partition scheme should ensure that all the neighborhoods have enough data points as members (i.e., sufficiently large $|M_j|$) to accurately estimate the residual covariance matrices (in general, for statistical analysis, 10–20 data points are considered adequate for accurate estimation of any given covariance matrix; we recommend the same: $|M_j| \ge 10$ for all $j$). Second, the partition resolution sought by the SOFM should parallel the complexity of the associated function-approximating neural network. Experimental investigation has revealed that this guideline approximately translates to saying that the number of nodes in the SOFM should be no more than twice the number of nodes in the hidden layers of the function approximation neural network. This guideline partially ensures that the estimated residual covariance matrices reflect the true dispersion characteristics of the output variables and not the biases associated with the function approximation neural network predictions. Simulation studies conducted for the test cases (discussed later) strongly support the appropriateness of these two guidelines. The configurations of the SOFM's used to partition the input spaces for the test cases are shown in Table I and certainly meet the above guidelines. The SOFM training scheme suggested by Haykin [4] was used, involving 20 000 epochs, a time-varying learning rate parameter, and a time-varying neighborhood function.

Cases I-A and I-B involve a single output variable, and hence the covariance matrix reduces to a scalar estimate $\hat{s}_j^2$ of the variance of the residuals in each of the neighborhoods. In the single-output case, with the weighted truth-function approach applied to (6), the prediction limits are

$$\hat{y}(\mathbf{x}) \pm \sum_{j=1}^{l} \text{Truth}_j(\mathbf{x}) \, t_{\alpha/2,\, |M_j|-1} \, \hat{s}_j \tag{10}$$

where $t_{\alpha/2,\, |M_j|-1}$ denotes the $\alpha/2$ quantile of Student's $t$ statistic with $|M_j|-1$ degrees of freedom.
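A sketch of the weighted single-output limits in the spirit of (10), reusing the truth_values helper above (using $n_j - 1$ degrees of freedom per neighborhood is an assumption; the original expression's exact degrees of freedom did not survive extraction):

```python
import numpy as np
from scipy.stats import t as student_t

def weighted_prediction_limits(x, y_hat, W, s, n, alpha=0.1):
    """Single-output PL's in the spirit of (10).

    y_hat: FFN prediction at x; s[j], n[j]: residual standard deviation and
    membership count of neighborhood j. The neighborhood half-widths
    t * s_j are blended by the truth-function weights of (8).
    """
    truth = truth_values(x, W)
    half = sum(truth[j] * student_t.ppf(1.0 - alpha / 2.0,
                                        df=max(n[j] - 1, 1)) * s[j]
               for j in range(len(s)))
    return y_hat - half, y_hat + half
```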
Fig. 4. Normal scores plots for checking the neighborhood Gaussian residual assumption. (a) Case I-A: neighborhood of SOFM neuron one. (b) Case I-B: neighborhood of SOFM neuron one. (c) Case II: neighborhood of SOFM neuron one—first output, y1. (d) Case II: neighborhood of SOFM neuron one—second output, y2.
Fig. 1 presents the estimated 90% prediction limits for the Case-I simulation study, along with the true 90% prediction limits directly calculated using the actual residual standard deviation. From the location and the spacing of the ten SOFM neurons shown in the figure, it is evident that the SOFM was effective in density-matching the truncated exponential distribution of the input space. In spite of a 90% reduction in the size of the data set (from Case I-A to Case I-B), the results obtained with the data set containing only 150 data points were reasonably good [compare Fig. 1(a) with Fig. 1(b)]. Careful examination of Fig. 1(b) reveals that, even though there are discrepancies between the true and estimated PL's in some areas, the estimated PL's represent the data more accurately in others. For example, in Fig. 1(b), the 90% estimated PL's shown in and around one of the neighborhoods represent the nature of the residuals more accurately: in that neighborhood, some 30% of the data fall outside the true PL's, even though on average only 10% of the data are expected to fall outside the PL's. Fig. 2 presents the results for the Case-II simulation study.
It is evident from Fig. 2(d) and (h) that the proposed PL estimation method was effective even with a two-dimensional input–output space. The PL estimation accuracy for all these cases was also quantified and is shown in Table I. Note that the data points shown on all prediction limit graphs are testing data sets; hence, the fraction of the data points falling outside the limits (which should be close to 10%, given that $\alpha$ was chosen to be 0.1 in estimating the PL's) is representative of the accuracy of the proposed estimation procedure.

Fig. 3 summarizes the influence of SOFM size on PL estimation accuracy for the test cases. The results strongly support the two guidelines discussed earlier with regard to determining the configuration of the SOFM. Note that the PL estimation error in Fig. 3 associated with an SOFM containing a single neuron (which forces the whole data set to be treated as a single neighborhood) is, for each of the test cases, representative of the assumption of constant variance all over the output space, a typical assumption in statistical regression. Following the above SOFM configuration guidelines has led to as much as a 65% reduction in PL estimation error for the test cases in comparison with the constant variance assumption (see Fig. 3).

Table II presents results from Chi-Square goodness-of-fit tests [3] conducted to check the assumption of Gaussian residuals in each of the SOFM neighborhoods for Case I-A.
At a Type-I error of 0.01, all the tests indicate that the assumption is valid. Fig. 4 also presents normal scores plots [6] of the residuals for SOFM neighborhood one for the test cases.³ These plots are a graphical alternative to the Chi-Square tests for evaluating the Gaussian residual assumption for the neighborhoods. If the data come from a true Gaussian distribution, they are expected to fall on a straight line on the normal scores plot. It is evident from Fig. 4 that the residuals do in fact approximate a Gaussian distribution. All in all, the results show that the proposed prediction limit estimation procedure is accurate and effective.

³Cases I-A and I-B each involve ten neighborhoods and Case II involves 25 neighborhoods. Due to space constraints, it was decided not to show all 45 normal scores plots. All neighborhoods exhibited normal scores plots similar to those shown in Fig. 4.
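A brief sketch of checking the neighborhood Gaussian-residual assumption along the lines of Table II and Fig. 4 (scipy's D'Agostino-Pearson normality test stands in here for the paper's binned Chi-Square procedure, and probplot produces the normal-scores pairs):

```python
import numpy as np
from scipy import stats

def check_gaussian_residuals(e_j, alpha=0.01):
    """Check normality of the residuals e_j of one SOFM neighborhood.

    Returns (passed, normal_scores, ordered_residuals): 'passed' is True if
    the normality test does not reject at Type-I error alpha; the score
    pairs fall near a straight line when the residuals are Gaussian.
    """
    _, p_value = stats.normaltest(e_j)          # needs roughly n >= 8 points
    (normal_scores, ordered), _ = stats.probplot(e_j, dist="norm")
    return p_value > alpha, normal_scores, ordered
```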
VII. CONCLUSION

In contrast to locally generalizing networks, such as RBF networks and CMAC networks, that have a naturally well-defined concept of local neighborhood, globally generalizing FFN's, such as MLP networks and temporal processing networks, are not inherently capable of providing prediction intervals. A unique approach to compute prediction intervals (error bounds) for any FFN by combining the network with an SOFM was introduced.

REFERENCES

[1] G. Chryssolouris, M. Lee, and A. Ramsey, "Confidence interval prediction for neural-network models," IEEE Trans. Neural Networks, vol. 7, pp. 229–232, 1996.
[2] D. DeSieno, "Adding a conscience to competitive learning," in Proc. IEEE Int. Conf. Neural Networks, San Diego, CA, 1988, vol. 1, pp. 117–124.
[3] E. R. Dougherty, Probability and Statistics for the Engineering, Computing, and Physical Sciences. Englewood Cliffs, NJ: Prentice-Hall, 1990.
[4] S. Haykin, Neural Networks: A Comprehensive Foundation. New York, NY: Macmillan, 1994.
[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359–366, 1989.
[6] R. A. Johnson, Miller & Freund's Probability & Statistics for Engineers, 5th ed. Englewood Cliffs, NJ: Prentice-Hall, 1994.
[7] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biol. Cybern., vol. 43, pp. 59–69, 1982.
[8] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, pp. 1464–1480, 1990.
[9] J. A. Leonard, M. A. Kramer, and L. H. Ungar, "Using radial basis functions to approximate a function and its error bounds," IEEE Trans. Neural Networks, vol. 3, pp. 624–627, 1992.
[10] J. E. Moody and C. J. Darken, "Fast learning in networks of locally tuned processing units," Neural Comput., vol. 1, pp. 281–294, 1989.