Issues in Bayesian Analysis of Neural Network Models
Peter Müller
Box 90251, Duke University, Durham NC 27708-0251 and Department of Artificial Intelligence, Madrid Technical University, 28660 Madrid and
David Rios Insua
Department of Artificial Intelligence, Madrid Technical University, 28660 Madrid and CNR-IAMI, 20131 Milano
Stemming from work by Buntine and Weigend (1991) and MacKay (1992), there is a growing interest in Bayesian analysis of neural network models. We study computational approaches to the problem, suggesting an efficient Markov chain Monte Carlo scheme for inference and prediction with fixed architecture feed forward neural networks. The scheme is then extended to the variable architecture case, providing a procedure for automatic data driven choice of architecture.
1 Introduction

Neural networks (NN) constitute the central theme of a huge amount of recent research. Introductions from the physical (Muller and Reinhardt, 1988), computational (Beale and Jackson, 1990), mathematical (Amari, 1993) and statistical (Cheng and Titterington, 1994) points of view are available. Recently, Wang (1995) has suggested the importance of incorporating human knowledge in neural network models to improve their performance. This naturally leads to modeling this knowledge through prior distributions over the parameters of the model; through more flexible architectures, possibly with a variable number of nodes; and through the combination of NN models with other, more conventional, ones.

This paper discusses these issues, exploring the potential of Bayesian ideas in the analysis of NN models. Buntine and Weigend (1991) and MacKay (1992) have provided frameworks for their Bayesian analysis based on Gaussian approximations, and Neal (1993) has applied hybrid Monte Carlo ideas. Ripley (1993) and Cheng and Titterington (1994) have dwelt on the power of these ideas, especially as far as interpretation and architecture selection are concerned. See MacKay (1995) for a recent review. From a statistical modeling point of view, NN's are a special instance of mixture models. Many issues about posterior multimodality and computational strategies in NN modeling are of relevance in the wider class of mixture models. Related recent references in the Bayesian literature on mixture models include Diebolt and Robert (1994), Escobar and West (1994), Robert and Mengersen (1995), Roeder and Wasserman (1995), West (1994), West and Cao (1993), West, Muller and Escobar (1994), and West and Turner (1994).

We concentrate on approximation problems, though many of our suggestions can be translated to other areas. For those problems, NN's are viewed as highly nonlinear (semiparametric) approximators, where parameters are typically estimated by least squares. Applications of interest for practitioners include nonlinear regression, stochastic optimisation and regression metamodels for simulation output. The main issue we address here is how to undertake a Bayesian analysis of a NN model, and the uses we may make of it. Our contributions include: an evaluation of computational approaches to Bayesian analysis of NN models, including a novel Markov chain Monte Carlo scheme; a suggestion of a scheme for handling a variable architecture model and a scheme for combining NN models with more traditional models, in our case linear regression; and, more importantly, a suggestion of a scheme for automatic mixing over different network architectures and automatic data driven choice of architecture.

In Section 2, after providing a review of neural network models and a conceptual framework for their Bayesian analysis, we suggest various uses in neural network model selection. Section 3 analyses computational strategies for inference and prediction with neural networks. Per se, this is important for the neural network community, since Bayesian ideas still need to permeate the analysis of NN models. In Section 4 we study variable architecture models and their combination with linear regression, concluding with some examples and discussion. We concentrate on feed forward NN's with one hidden layer and one-dimensional output.
2 Posterior analysis of feed forward neural networks

To fix our notation, we provide a brief review of feed forward neural networks (FFNN) and their Bayesian analysis. Most work on FFNN models in approximation is based on results suggesting those models as universal approximators. For example, Cybenko (1989) shows that finite sums of the form

$$\hat y(x) = \sum_{j=1}^{M} \beta_j\,\psi(x'\gamma_j + \delta_j) \qquad (1)$$

with β_j, δ_j ∈ R, γ_j ∈ R^p and M ∈ N, are dense in C(I_p), the set of real continuous functions on the p-dimensional unit cube, where ψ is a sigmoidal function, the typical choice being the logistic

$$\psi(x) = \frac{\exp(x)}{1+\exp(x)}, \qquad (2)$$

as we shall fix in the rest of the paper. Model (1)-(2) defines a feed forward neural network (FFNN) with logistic activation functions, p input units, one hidden layer with M hidden nodes and 1 output node. The terms δ_j are designated biases and may be assimilated into the rest of the γ_j vector if we consider an additional input with constant value one, say x_1 = 1, relabeling the other variables as x_2,...,x_{p+1}.

The typical approach in FFNN research in approximation is then: given data D = {(x_1,y_1),...,(x_N,y_N)} and fixed M, choose β = (β_1,...,β_M), γ = (γ_1,...,γ_M) according to a least squares criterion min_{β,γ} Σ_{i=1}^N (y_i − ŷ(x_i))², or some variant, either via backpropagation (Rumelhart and McClelland, 1986) or some other optimisation method such as quasi-Newton or simulated annealing. The data set is frequently broken into two subsets, say {(x_1,y_1),...,(x_k,y_k)} and {(x_{k+1},y_{k+1}),...,(x_N,y_N)}; the first one is used for estimation and the second for validation.

A very important issue that has received comparatively little attention in the literature is the choice of architecture, which for the model we are considering consists mainly of choosing the number M of hidden nodes. The approximation results mentioned above are of little help, since their proofs let M → ∞ as the approximation gets better (this establishes the connection with nonparametric or semiparametric regression). Typically, a choice is made by trial and error, though there are heuristics helping in that task. We provide a coherent framework for choosing the architecture. Besides considering the conventional FFNN model (1)-(2), we shall also consider a minor modification which adds a linear term explaining a linear component or trend in the model. Its expression is

$$y = \sum_{j=1}^{M} \beta_j\,\psi(x'\gamma_j) + x'a. \qquad (3)$$
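As a concrete illustration of how (1)-(2) are evaluated, the following is a minimal sketch (array names, shapes and the example values are our own choices, not part of the original formulation), with the bias absorbed into γ via a constant first input:

```python
import numpy as np

def logistic(z):
    # psi(z) = exp(z) / (1 + exp(z)), the activation in (2)
    return 1.0 / (1.0 + np.exp(-z))

def ffnn_predict(x, beta, gamma):
    """Evaluate y_hat(x) = sum_j beta_j * psi(x' gamma_j) as in (1),
    with the bias absorbed into gamma via a constant first input x[0] = 1."""
    return np.dot(beta, logistic(gamma @ x))

# Hypothetical example: p = 2 inputs (constant + one covariate), M = 3 hidden nodes.
rng = np.random.default_rng(0)
beta = rng.normal(size=3)          # output weights beta_1, ..., beta_M
gamma = rng.normal(size=(3, 2))    # logistic coefficients gamma_j, one row per node
x = np.array([1.0, 0.5])           # x_1 = 1 is the constant input carrying the bias
print(ffnn_predict(x, beta, gamma))
```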
2.1 A Bayesian framework for FFNN models
Conceptually, a Bayesian analysis of a NN model is simple, see e.g. Cheng and Titterington (1994). We distinguish between the cases of fixed and variable architecture, which in our case amounts to having a fixed or variable number of hidden nodes. The latter subsumes the fixed architecture model by considering a degenerate distribution on the number of hidden
nodes. Vice versa, the fixed architecture model subsumes the variable architecture model in the following sense. Denote with M_m the fixed architecture model with m hidden nodes. Then, M_m contains M_k, k ≤ m, by, for example, setting γ_j = γ_k, j = k+1,...,m. This equivalence has important implications for the multimodality of the posterior distribution, as we will discuss in detail in Section 3.7.
2.1.1 Fixed architecture
In this case, the model may be viewed as y = g(β, γ, x) + ε, with g(β, γ, x) = Σ_{j=1}^M β_j ψ(x'γ_j) and M fixed. Assume that ε ~ p(·) and that we have data D = {(x_1,y_1),...,(x_N,y_N)}. The likelihood is

$$p(D\mid\beta,\gamma) = \prod_{i=1}^{N} p\left[y_i - g(\beta,\gamma,x_i)\right].$$

To complete the analysis, we assess a prior distribution p(β, γ). For inference purposes, we will be interested in the posterior distribution

$$p(\beta,\gamma\mid D) = \frac{p(\beta,\gamma)\,p(D\mid\beta,\gamma)}{\int p(\beta,\gamma)\,p(D\mid\beta,\gamma)\,d\beta\,d\gamma}$$

and various marginals and moments. For predictive purposes, we will be interested in the predictive distribution

$$p(y_{N+1}\mid D, x_{N+1}) = \int p\left[y_{N+1} - g(\beta,\gamma,x_{N+1})\right]\,p(\beta,\gamma\mid D)\,d\beta\,d\gamma,$$

possibly summarised through moments or probability regions.
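As a small illustration of how these pieces fit together numerically, the sketch below evaluates the unnormalised log posterior log p(β,γ|D), assuming Gaussian errors and independent zero-mean normal priors; these particular priors are placeholders of ours (Section 3.1 specifies the hierarchical prior actually used in the paper).

```python
import numpy as np
from scipy.stats import norm

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(beta, gamma, X, y, sigma=1.0, tau=10.0):
    """Unnormalised log p(beta, gamma | D) for the fixed-architecture model,
    assuming N(0, sigma^2) errors and independent N(0, tau^2) priors on all
    parameters (placeholder priors, not the hierarchical prior of Section 3.1)."""
    g = logistic(X @ gamma.T) @ beta              # g(beta, gamma, x_i) for all i
    log_lik = norm.logpdf(y, loc=g, scale=sigma).sum()
    log_prior = norm.logpdf(beta, scale=tau).sum() + norm.logpdf(gamma, scale=tau).sum()
    return log_lik + log_prior
```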
2.1.2 Variable architecture
In this case, the model may be viewed as y = g(β, γ, M, x) + ε. The likelihood is

$$p(D\mid\beta,\gamma,M) = \prod_{i=1}^{N} p\left[y_i - g(\beta,\gamma,M,x_i)\right].$$

The prior is p(β, γ, M). We are interested in the posterior

$$p(\beta,\gamma,M\mid D) \propto p(\beta,\gamma,M)\,p(D\mid\beta,\gamma,M)$$

and various marginals, and their moment summaries. We are also interested in the predictive density, marginalized over M,

$$p(y_{N+1}\mid D, x_{N+1}) = \int p\left[y_{N+1} - g(\beta,\gamma,x_{N+1},M)\right]\,p(\beta,\gamma,M\mid D)\,d\beta\,d\gamma\,dM,$$

and conditional on M,

$$p(y_{N+1}\mid D, x_{N+1}, M) = \int p\left[y_{N+1} - g(\beta,\gamma,x_{N+1},M)\right]\,p(\beta,\gamma\mid M, D)\,d\beta\,d\gamma,$$

possibly summarised through moments and regions.
2.2 Uses of Bayesian analysis in NN modeling
The approach we adopt here is the usual one of assuming a fixed number of hidden nodes and then undertaking some analyses to address model selection questions. In this section, we still remain at a conceptual level; we discuss implementation later. Most of the problems we mention lead to testing point null hypotheses, a point on which there has been debate in the Bayesian literature, see e.g. Berger and Delampady (1987). In our context, we could naturally view a sharp null hypothesis as an approximation to an interval null hypothesis. Some authors, see e.g. MacKay (1995), argue against model selection issues in NN research, the main interest being in prediction. On the contrary, Buntine and Weigend (1991) and others suggest that more parsimonious models improve generalisation and, hence, forecasting, since an overcomplex model may overfit the data. Moreover, since model selection in NN is done on such an ad hoc basis, we could view all these as potentially useful tools. The point of view we favor, though, is using variable architecture models, as described in Section 4.
Issues concerning model selection. Exploratory analyses may suggest the following issues.

Are we dealing with a linear model? Consider model (3). Then, we have to test whether β_j = 0, ∀j.

Are we dealing with a logistic model? We have to test whether γ_i = γ_1, ∀i > 1. A related question is:

Have we included duplicate copies of one node? As discussed in Crawford (1994), a potential problem when dealing with mixture models is identifiability. One case is that in which two terms with the same parameter are included. Therefore, we might test for γ_i = γ_j and, eventually, collapse both nodes and add the corresponding β's. We discuss the problem of node duplication in more detail in Section 3.7. It is important to consider possible node duplication before continuing with the following inference, even if this is done at an informal level of exploratory data analysis.

Have we included an excessive number of hidden nodes? One strategy would be to eliminate sequentially irrelevant hidden nodes. This strategy is close to Thodberg's (1994), based on Gaussian approximations. To do that, we could test whether β_j = 0. Note that |β_j ψ(x'γ_j)| ≤ |β_j|, so if β_j is small the node may be disregarded, whatever the values of γ_j are. We prefer the alternative strategy of assuming a variable architecture and comparing architectures according to p(M|D). We could suggest a most likely M′ satisfying p(M′|D) = max_M p(M|D). However, it may be interesting to preserve several architectures, so as to improve predictive performance, see e.g. Raftery et al. (1995).

Relevance of a bias. We mentioned that biases could be modeled by considering an additional input node with constant input one. Clearly, we may check whether a given bias is zero by testing whether γ_{j,1} = 0 (assuming the constant input node has index 1).

Relevance of a variable. Assume that the k-th variable x_k is bounded. Then, if γ_{j,k} is small ∀j, γ_{j,k} x_k is small; therefore we may test whether γ_{j,k} = 0, ∀j, concluding that the variable is irrelevant.

Issues concerning hidden nodes. From an interpretational point of view, it may be of interest to test whether β_j ≥ 0 or β_j ≤ 0, which would correspond to determining whether a node is excitatory or inhibitory.
3 Computations in Bayesian analysis of NN's

All of the above has been conceptual; we need to provide a proper computational framework. In this section, we discuss computational methods for Bayesian analysis of neural networks with fixed architecture. Buntine and Weigend (1991) and MacKay (1992, 1995) suggest using normal approximations. As we shall see, these may have some shortcomings. We favor the use of an efficient Markov chain Monte Carlo scheme we introduce. Neal (1993) uses a hybrid Monte Carlo approach and MacKay (1995) an implementation of the Gibbs sampler for NN, based on BUGS (Spiegelhalter et al., 1994). As indicated by Neal, MacKay and Spiegelhalter et al., straightforward implementations of the Gibbs sampler may be rather inefficient. We illustrate the ideas with a specific model and three examples.
3.1 The Model
We focus on the following FFNN. The model implements nonparametric regression of a response y on covariates x_1,...,x_p by using a hidden layer of M nodes with logistic activation functions. This actually corresponds to the combination of two standard statistical models: linear and logistic regression.
$$\begin{aligned}
y_i &= \sum_{j=1}^{M} \beta_j\,\psi(x_i'\gamma_j) + \epsilon_i, \quad i=1,\ldots,N,\\
\epsilon_i &\sim N(0,\sigma^2),\\
\psi(z) &= \exp(z)/(1+\exp(z)),\\
\beta_j &\sim N(\mu_\beta, \sigma_\beta^2), \quad j=1,\ldots,M,\\
\gamma_j &\sim N(\mu_\gamma, S_\gamma).
\end{aligned} \qquad (4)$$

The model is completed with priors over the hyperparameters: μ_β ~ N(a_β, A_β), μ_γ ~ N(a_γ, A_γ), σ_β^{-2} ~ Gamma(c_b/2, c_b C_b/2), S_γ^{-1} ~ Wish(c_γ, (c_γ C_γ)^{-1}) and σ^{-2} ~ Gamma(s/2, sS/2). We use an informative prior probability model because of the meaning and interpretation of the parameters. For example, the β_j's should reflect the order of magnitude of the data y_i. Typically positive and negative values for β_j would be equally likely, calling for a symmetric
prior around a_β = 0 with a standard deviation reflecting the range of plausible values for y_i. Similarly, a range of reasonable values for the logistic coefficients γ_j will be determined by the meaning of the data y_i being modeled. See the Appendix for the particular hyperparameter choices used in the implemented examples.

Alternative mixture parameterizations and prior models are discussed in the literature on posterior inference in mixture models. Robert and Mengersen (1995) and Mengersen and Robert (1995) propose a parameterization for mixture parameters like the (β_j, γ_j), using an overall baseline value and offsets for each term relative to its preceding term. Their proposal is motivated by the desire to use partially non-informative priors, and by an efficiency consideration for the posterior simulation scheme used to implement posterior inference. Roeder and Wasserman (1995) propose a partially improper prior model by defining a proper prior distribution only for conditional priors of the type p(β_j, γ_j | β_h, γ_h, h ≠ j). We chose not to use either of these parameterizations because we work with a proper prior distribution and because the partial marginalization and blocking in the posterior simulation scheme described in Section 3.6 seem sufficient to alleviate concerns about the efficiency of the posterior simulation scheme. The particular choice of normal, gamma and inverse Wishart distributions is for technical convenience. Similar hyperpriors are fairly common in Bayesian modeling, see e.g. Lavine and West (1992). In general, posterior inference is reasonably robust with respect to the choice of hyperpriors, see, for example, Berger (1990). However, if available prior information suggests different hyperpriors and hyperprior parameters, the model should of course be adjusted appropriately.
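To make the hierarchy in (4) concrete, here is a sketch that draws one set of parameters from the prior and simulates data from the model. The numerical hyperparameter values below are arbitrary placeholders of ours (the Appendix lists the values used in the examples), and S_γ and σ are kept fixed for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
M, p, N = 3, 2, 50                       # hidden nodes, inputs (incl. constant), observations

# Hyperprior draws (placeholder hyperparameter values).
mu_beta  = rng.normal(0.0, 100.0)                            # mu_beta  ~ N(a_beta, A_beta)
mu_gamma = rng.multivariate_normal(np.zeros(p), np.eye(p))   # mu_gamma ~ N(a_gamma, A_gamma)
sigma_beta2 = 1.0 / rng.gamma(1.0, 1.0)                      # sigma_beta^{-2} ~ Gamma
S_gamma = 25.0 * np.eye(p)               # kept fixed here; an (inverse) Wishart draw in the full model
sigma = 1.0                              # error s.d. kept fixed here

# Prior draws of the network parameters and simulated data.
beta  = rng.normal(mu_beta, np.sqrt(sigma_beta2), size=M)
gamma = rng.multivariate_normal(mu_gamma, S_gamma, size=M)
X = np.column_stack([np.ones(N), rng.normal(size=N)])        # constant input plus one covariate
y = (1.0 / (1.0 + np.exp(-(X @ gamma.T)))) @ beta + rng.normal(0.0, sigma, size=N)
```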
3.2 Marginal Likelihood
Before implementing formal posterior inference in the model, we consider some exploratory data analysis. We illustrate it with Example 1, described below, with one input variable and one output variable. To visually inspect features of the likelihood function we marginalise model (4) with respect to β_j, j = 1,...,M, to obtain p(D|γ, η), where γ designates the logistic parameters (γ_11, γ_12,...,γ_M1, γ_M2), and η combines all the hyperparameters (μ_β, σ_β, μ_γ, S_γ, σ). Marginalisation with respect to β is achieved by observing the following identity, which we shall use also later. For the sake of an easier notation, we only include β and γ in the notation and suppress dependence on the hyperparameters.

Lemma 1. Let z_ij = z_ij(γ) = ψ(x_i'γ_j), Z = (z_ij)_{i=1,...,N; j=1,...,M}, A = Z'Z/σ², a = Z'y/σ², C = 1/σ_β² I, c = μ_β/σ_β² 1, where 1 denotes the M-vector of ones. Let m_b(γ) = (A + C)^{-1}(a + c) and S_b(γ) = (A + C)^{-1}. Then

$$p(D\mid\gamma) = \frac{p[\beta = m_b(\gamma)]}{p[\beta = m_b(\gamma)\mid y,\gamma]}\,\prod_{i=1}^{N} p[y_i\mid \beta = m_b(\gamma),\gamma] = p[\beta = m_b(\gamma)]\,(2\pi)^{M/2}\,|S_b(\gamma)|^{1/2}\,\prod_{i=1}^{N} p[y_i\mid \beta = m_b(\gamma),\gamma].$$
Proof. Note that, conditional on γ, model (4) becomes a normal linear regression model. The posterior p(β|D, γ) takes the form of a multivariate normal distribution N[m_b(γ), S_b(γ)]. The posterior moments m_b(γ) and S_b(γ) are given, for example, in Bernardo and Smith (1994).
By Bayes' theorem, p(β|D, γ) = p(β) Π_{i=1}^N p(y_i|β, γ) / p(D|γ). Substituting β = m_b(γ) in the last equation, we obtain the expression for p(D|γ) given earlier. □

Example 1. Galaxy Data. We try to relate velocity (y_i) and radial position (x_i2) of galaxy NGC7531 at 323 different locations (Buta, 1987). The data are shown in Figure 3. Radial positions are centered and scaled to have zero mean and unit variance. A constant covariate x_i1 adds an intercept to the logistic regression terms ψ(x'γ_j) of the NN model. Figure 1a shows the marginal likelihood p(D|γ_12, γ_22, γ_11 = γ_21 = 0) under model (4) with M = 2 hidden nodes. The most notable features are multimodality and nonnormality. The global mode occurs at (0, −2). Multimodality issues have pervaded classical analysis of neural network models, since typically used algorithms lead to local optima. Multimodality and nonnormality issues are relevant in Bayesian analysis since some of the computational strategies rely heavily on assumptions of unimodality and approximate normality of the posterior. Depending on the prior adopted, these features will tend to be strengthened or softened in the posterior. The important point, though, is to be aware of them, as discussed below.
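The following is a minimal numerical sketch of the Lemma 1 computation, written in our own array notation with the hyperparameters assumed fixed at given values:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_marginal_likelihood(gamma, X, y, sigma, mu_beta, sigma_beta):
    """log p(D | gamma), with beta integrated out as in Lemma 1.
    gamma: (M, p) logistic coefficients; X: (N, p) covariates; y: (N,) responses."""
    M = gamma.shape[0]
    Z = logistic(X @ gamma.T)                        # z_ij = psi(x_i' gamma_j)
    A = Z.T @ Z / sigma**2
    a = Z.T @ y / sigma**2
    C = np.eye(M) / sigma_beta**2
    c = np.full(M, mu_beta) / sigma_beta**2
    S_b = np.linalg.inv(A + C)                       # posterior covariance of beta | D, gamma
    m_b = S_b @ (a + c)                              # posterior mean of beta | D, gamma
    log_prior = multivariate_normal.logpdf(m_b, mean=np.full(M, mu_beta),
                                           cov=sigma_beta**2 * np.eye(M))
    log_lik = norm.logpdf(y, loc=Z @ m_b, scale=sigma).sum()
    log_post = multivariate_normal.logpdf(m_b, mean=m_b, cov=S_b)   # N(m_b, S_b) at its mean
    return log_prior + log_lik - log_post            # candidate's formula from Lemma 1
```

Evaluating this function over a grid of (γ_12, γ_22) values, with γ_11 = γ_21 = 0 and the remaining quantities fixed, produces a surface of the kind shown in Figure 1a.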
3.3 The posterior distribution
Given the prior we have adopted for Example 1 (see the Appendix), the multimodality of the marginal likelihood gets partly smoothed out in the posterior distribution, albeit some nonnormality remains. Figure 1b shows the marginal posterior corresponding to the likelihood in Figure 1a. Up to multimodality due to simple relabeling, i.e. symmetric reflection along the 45 degree line, the multimodality of the likelihood has disappeared. Note, however, that unimodality of the marginal posterior distribution p(γ_12, γ_22|D, γ_11 = γ_21 = 0) does not allow any inference about unimodality of the joint posterior distribution p(β, γ|D). In fact, we will find joint posterior modes quite different from the marginal modes seen in Figure 1b. This corresponds to observations in the literature about multimodality of posterior distributions for NN parameters, for example, MacKay (1995), and, more generally, of mixture models, Crawford (1994). See the discussion in Section 3.7 on posterior multimodality.
3.4 Posterior Mode
Several computational strategies require a posterior mode, which is also a convenient summary of the posterior. Hence, we look for one by applying an iterative conditional modes (ICM) algorithm. This is a generalisation of the coordinate directions method where we consider optimising groups of coordinates rather than one at a time. Besag (1986) proposed it in a computational statistics context. Starting with an initial point, we iteratively replace β by its conditional mode keeping γ fixed, and then replace γ by its conditional mode keeping β fixed. We keep iterating until the scheme converges (to a local mode).
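A minimal sketch of this alternating scheme is given below; the helpers beta_conditional_mode (the closed-form normal linear model mode given γ, cf. Lemma 1) and log_joint are assumed to be supplied by the user, and the optimiser choice is ours.

```python
import numpy as np
from scipy.optimize import minimize

def icm(gamma0, beta_conditional_mode, log_joint, max_iter=50, tol=1e-6):
    """Iterative conditional modes: alternate the closed-form conditional mode of
    beta given gamma with a numerical conditional maximisation over gamma."""
    gamma = np.asarray(gamma0, dtype=float)
    beta = beta_conditional_mode(gamma)
    old = log_joint(beta, gamma)
    for _ in range(max_iter):
        res = minimize(lambda g: -log_joint(beta, g.reshape(gamma.shape)), gamma.ravel())
        gamma = res.x.reshape(gamma.shape)          # conditional maximisation over gamma
        beta = beta_conditional_mode(gamma)         # closed-form update of beta
        new = log_joint(beta, gamma)
        if abs(new - old) < tol:                    # stop at a (local) mode
            break
        old = new
    return beta, gamma
```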
Example 1 (ctd.). We initialize the ICM scheme with the maximum over the grid used in Figure 1b. The corresponding local mode is reported in the first row, labeled "ICM", of Table 1. Other local modes are obtained by repeating the ICM scheme with randomly generated starting points. Four of them are reported in Table 1.
3.5 Approximate posterior inference
Several commonly used methods for approximate posterior inference are based on approximate posterior normality. These kinds of strategies are used by MacKay (1992, 1995) and Buntine and Weigend (1991). In what follows, θ, p(θ), p(D|θ) and p(θ|D) designate, respectively, the parameters, the prior, the likelihood and the posterior. For large sample sizes, the posterior distribution of θ is, under appropriate conditions, approximately multivariate normal N(θ̂, Σ̂), where θ̂ is the maximum likelihood estimate and Σ̂ is the inverse of the Hessian of the negative log-likelihood evaluated at θ̂, see Bernardo and Smith (1994). In Example 1, this leads to the approximate posterior standard deviations reported in Table 2. Clearly, strategies for numerical posterior inference based on approximate posterior normality are only valid to the extent to which the sample size is large enough to appeal to such asymptotic arguments, and the conditions for asymptotic posterior normality are not violated. It could be argued that sample size would commonly not be a problem in NN applications, with typically abundant data. However, note also that NN models are usually high dimensional, therefore requiring even more data. In any case, it is difficult to check in practice the conditions ensuring the validity of the asymptotic approximation, see Smith (1991). Our plots in Figure 1 should warn against uncritical use of normal approximations in NN research.

By a similar argument, posterior moments can be approximated by Laplace integration, see Tierney, Kass and Kadane (1989). To approximate posterior expectations of the type

$$J = \frac{\int f(\theta)\,p(\theta)\,p(D\mid\theta)\,d\theta}{\int p(\theta)\,p(D\mid\theta)\,d\theta},$$

Laplace integration approximates the numerator and denominator integrands by quadratic Taylor series expansions of the logarithms around the respective modes, i.e. the numerator and denominator integrals are replaced by integrals with respect to a Gaussian kernel. Clearly, this would be exact if the posterior p(θ|D) ∝ p(θ)p(D|θ) were, in fact, a normal distribution. Otherwise the accuracy of the approximation depends on the quality of the normal approximation to the posterior distribution. See, for example, Robert (1995) for a detailed discussion.
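For positive f, the resulting fully exponential approximation can be written out explicitly; the display below is our own summary of the standard form and is not reproduced from the original text. Writing $n\,h(\theta) = \log p(\theta) + \log p(D\mid\theta)$ and $n\,h^{*}(\theta) = n\,h(\theta) + \log f(\theta)$, with $\hat\theta$, $\hat\theta^{*}$ the respective maximizers and $\Sigma$, $\Sigma^{*}$ the inverse negative Hessians of $n\,h$ and $n\,h^{*}$ at those maximizers,

$$J \;\approx\; \left(\frac{|\Sigma^{*}|}{|\Sigma|}\right)^{1/2} \exp\left\{\,n\,h^{*}(\hat\theta^{*}) - n\,h(\hat\theta)\,\right\}.$$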
Example 1 (ctd.). Table 2 lists the approximate posterior moments obtained by Laplace integration. All hyperparameters were fixed at their prior means, i.e. the integration is only over θ = (β, γ). Figure 2 (top panels) graphs marginal posterior distributions estimated by Laplace integration, implemented with XLISP-STAT routines (Tierney, 1989). The maximization to find the posterior mode was initialized at the parameter values obtained by ICM. Note the unimodality of the marginals, to be compared with the results in Section 3.6.
Another commonly used numerical posterior integration scheme is Monte Carlo integration with importance sampling. The rationale of importance sampling is to rewrite posterior integrals of the form J above as J = ∫ f(θ)[p(θ|D)/h(θ)] h(θ) dθ, where h(θ) ≈ p(θ|D) is chosen as a probability density function (p.d.f.) which mimics p(θ|D) but allows efficient random variate generation, for example a multivariate normal approximation of the joint posterior p.d.f. Using a simulated Monte Carlo sample θ_i ~ h(θ), i = 1,...,m, and writing w(θ) = p(θ|D)/h(θ) for the weights, the posterior integral J can then be approximated by

$$J \approx \hat J = \sum_{i=1}^{m} f(\theta_i)\,w(\theta_i) \Big/ \sum_{i=1}^{m} w(\theta_i).$$

For the choice of the importance sampling density h(θ), it is desirable to obtain weights w(θ) close to constant, or at least bounded. Since the weights are given by the ratio p(θ|D)/h(θ), it is thus important to choose h(θ) to dominate p(θ|D) in the tails. Common choices for h(θ) are therefore multivariate t distributions with location and scale matrix matched to the moments of the asymptotic normal approximation. See Geweke (1989) for a detailed discussion of importance sampling for posterior inference.

Example 1 (ctd.). Table 2 reports the approximate posterior moments computed by Monte Carlo integration with importance sampling. As importance sampling distribution h(θ) we used a multivariate t distribution with 10 d.f. and location and scale matrix matched with the posterior mode and the negative inverse Hessian at the mode. We used an implementation in XLISP-STAT.
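The sketch below illustrates the self-normalised estimator Ĵ with a multivariate t proposal; the function and argument names are ours and not taken from the paper's XLISP-STAT implementation.

```python
import numpy as np

def importance_sampling(f, log_post, mode, scale, m=5000, df=10, seed=0):
    """Self-normalised importance sampling estimate of E[f(theta) | D].
    `log_post(theta)` returns the unnormalised log posterior; `mode` and `scale`
    would typically come from a posterior mode and the inverse Hessian there."""
    rng = np.random.default_rng(seed)
    d = len(mode)
    # Draw theta_i from a multivariate t_df(mode, scale) via a normal/chi-square mixture.
    z = rng.multivariate_normal(np.zeros(d), scale, size=m)
    u = rng.chisquare(df, size=m) / df
    theta = mode + z / np.sqrt(u)[:, None]
    # log of the t proposal density, up to an additive constant (it cancels in the weights)
    q = np.einsum('ij,jk,ik->i', theta - mode, np.linalg.inv(scale), theta - mode)
    log_h = -0.5 * (df + d) * np.log1p(q / df)
    log_w = np.array([log_post(t) for t in theta]) - log_h
    w = np.exp(log_w - log_w.max())                  # stabilise before normalising
    w /= w.sum()
    return np.sum(w * np.array([f(t) for t in theta]))
```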
3.6 Markov chain Monte Carlo
Assessments of these and other techniques for posterior inference may be found in Robert (1995). In the specific context of NN models, it is especially cumbersome that inference from ICM, Laplace integration and importance sampling may be misled by local modes. The general problems this causes for inference are well known. One way of mitigating this is to appeal to a strategy of Buntine and Weigend (1991), which aims at finding several modes and basing the analysis on weighted mixtures of the corresponding normal approximations. Of course, we return to the same problem since we are probably leaving out undiscovered local modes. An alternative view is argued by MacKay (1995): inference from such schemes is best considered as approximate posterior inference in a submodel defined by constraining the parameters to a neighborhood of the particular local mode. Depending on the emphasis of the analysis, this might be reasonable, especially if in a final implementation our aim is to set the parameters at specific values. However, we prefer to propagate the uncertainty in the parameters, since this allows better predictions, see Raftery et al. (1995).
The dispute could be somewhat sterile since we can now appeal to Markov chain Monte Carlo (MCMC) methods to implement posterior inference. The essential idea is to obtain a sample from the posterior and base inference on that sample by, for example, replacing posterior expectations with sample means over the simulated posterior sample. The difficulty may reside in obtaining a sample from the posterior p(θ|D). The rationale of MCMC is to consider a Markov chain {θ_n} with state θ and having p(θ|D) as stationary distribution. Various ways of obtaining such chains are described in Tierney (1994), including Metropolis, Gibbs and independence chains. The strategy is to start with arbitrary values of θ, let the Markov chain run until it has practically reached convergence, say after T iterations, and use the next k observed values of the chain as an approximate posterior sample A = {θ_1,...,θ_k}.

MacKay (1995) implements an MCMC method for neural networks based on BUGS (Spiegelhalter et al., 1994), a program for Bayesian inference using the Gibbs sampler. Neal (1993) introduces a hybrid Monte Carlo method combining ideas from simulated annealing and the Metropolis algorithm. Both authors warn against the potential inefficiency of straightforward implementations of MCMC methods in Bayesian analysis of NN models. Also Besag and Green (1993), albeit in a different application context, dwell on the special care required when using MCMC in multimodal problems.

We introduce here an efficient MCMC algorithm based on partial marginalization (over the β_j's) and blocking (all β_j are jointly resampled). As illustrated in Figures 2, 5 and 6, the scheme is effective in mixing over the various local modes in the posterior distribution. Denote with β = (β_1,...,β_M) and γ = (γ_1,...,γ_M) the NN parameters, and with η = (μ_β, σ_β, μ_γ, S_γ, σ²) the hyperparameters in the model. Let θ = (β, γ, η) be the full parameter vector.

(i) Start with θ equal to some initial guess (for example the prior means). Until convergence is achieved, iterate through steps (ii)-(iv).

(ii) Given current values of (γ, η), replace the current value of β by a draw from the complete conditional p(β|γ, η, D). This is a multivariate normal distribution with moments described in Lemma 1.

(iii) Given current values of (γ, η) only (i.e., marginalizing over β), update γ by Metropolis steps: for each γ_j, j = 1,...,M, generate a "candidate" γ̃_j ~ g_j(γ_j), with g_j(γ_j) described below. Compute a(γ_j, γ̃_j) = min[1, p(D|γ̃, η)/p(D|γ, η)], where γ̃ = (γ_1,...,γ_{j−1}, γ̃_j, γ_{j+1},...,γ_M). With probability a(γ_j, γ̃_j) replace γ_j by the new candidate γ̃_j; otherwise leave γ_j unchanged. For the probing distribution g_j(·), we use a multivariate normal N(γ_j, c²C_γ) with c = 0.1. Conceptually, any probing distribution which is symmetric in its arguments, i.e. g(γ̃_j|γ_j) = g(γ_j|γ̃_j), would imply the desired posterior distribution as stationary distribution of the corresponding Markov chain. For a practical implementation, a probing distribution with acceptance rates not too close to zero or one is desirable. For a specialized setup, Gelman, Roberts and Gilks (1994) showed that acceptance rates of around 25% are optimal. We found the factor c = 0.1 by trying a few alternative choices until
we achieved acceptance rates in this range.

(iv) Given current values of (β, γ), replace the hyperparameters by a draw from the respective complete conditional posterior distributions: p(μ_β|β, σ_β) is a normal distribution, p(μ_γ|γ, S_γ) is multivariate normal, p(σ_β^{-2}|β, μ_β) is a Gamma distribution, p(S_γ^{-1}|γ, μ_γ) is Wishart, and p(σ^{-2}|β, γ, y) is Gamma, as corresponds to a normal linear model; see Bernardo and Smith (1994).

The proof of the convergence of this chain follows from arguments in Tierney (1994). To judge when the chain has practically converged we relied on a convergence diagnostic proposed by Geweke (1992). Once an approximate posterior sample {θ_1,...,θ_k} is available, we may undertake various posterior and predictive tasks, as we illustrate in the example below.
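A minimal sketch of one sweep of the Metropolis update in step (iii) is given below. The helper log_target is assumed to be supplied by the caller (for instance the Lemma 1 marginal likelihood plus the log prior of γ); all names and the default probing covariance are ours.

```python
import numpy as np

def metropolis_gamma_sweep(gamma, log_target, rng, c=0.1, C=None):
    """For each node j, propose gamma_j~ ~ N(gamma_j, c^2 C) and accept with
    probability min(1, target(gamma~)/target(gamma)), with beta marginalised out
    inside `log_target(gamma)`."""
    M, p = gamma.shape
    if C is None:
        C = np.eye(p)                       # placeholder probing covariance
    current = log_target(gamma)
    for j in range(M):
        proposal = gamma.copy()
        proposal[j] = rng.multivariate_normal(gamma[j], (c ** 2) * C)
        new = log_target(proposal)
        if np.log(rng.uniform()) < new - current:
            gamma, current = proposal, new  # accept candidate gamma_j~
    return gamma
```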
Example 1 (ctd.). Figures 2 and 3 show some aspects of the posterior inference in Example 1. The bottom two panels in Figure 2 show the estimated marginal posterior distributions for β_1 and γ_12. Compare with the marginal posterior distributions approximated by Laplace's method (top two panels in Figure 2). Evidently, Laplace integration got misled by a local mode. Compare also the moments reported in Table 2. Expected values reported for the normal approximation, Laplace integration and importance sampling are essentially the same and driven by the local mode found. Note the differences in standard deviations. Expected values reported for MCMC are different and away from the local mode found. Note also that more uncertainty is reported via bigger standard deviations, acknowledging again multimodality.

Similarly, we may undertake predictive inference. For example, predictive means f(x) = E(y_{N+1}|x_{N+1} = x, D) can be evaluated via

$$\hat f(x) = \hat E(y_{N+1}\mid x_{N+1}=x, D) = \frac{1}{k}\sum_{t=1}^{k} E(y_{N+1}\mid x_{N+1}=x, \theta = \theta_t).$$

Figure 3a plots the fitted curve f̂(x). In addition to estimating the nonlinear regression curve f̂, the MCMC output allows a complete probabilistic description of the involved uncertainties. Figure 3b, for example, visualizes the posterior distribution on the nonlinear regression curve induced by the posterior distribution p(θ|D).
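A small sketch of this ergodic average, assuming posterior draws of (β, γ) are already stored in arrays (names and shapes are ours):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def predictive_mean(x, beta_draws, gamma_draws):
    """f_hat(x) = (1/k) sum_t E(y | x, theta_t) under model (4):
    beta_draws has shape (k, M), gamma_draws has shape (k, M, p)."""
    vals = [np.dot(b, logistic(g @ x)) for b, g in zip(beta_draws, gamma_draws)]
    return np.mean(vals)
```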
Example 2. Old Faithful Geyser data. Azzalini and Bowman (1990) analysed a data set concerning eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming. The data set records eruption durations and intervals between subsequent eruptions, collected continuously from August 1st until August 15th, 1985. Of the original 299 observations, we removed 78 observations which were taken at night and only recorded durations as "short", "medium" or "long". Figure 4 plots the data and a fitted curve for predicting the waiting time y_{n+1} until the next eruption from the duration of the previous eruption, x_{n+1}. The nonlinear regression curve was obtained by fitting model (4) with M = 3. In addition to the single mean curve, we also obtain a distribution on the curve, illustrated in the figure by plotting E(y_{n+1}|x_{n+1}, θ_t) for several values θ_t ~ p(θ|D).
3.7 Posterior Multimodality
As in mixture models, posterior inference in NN models is typically complicated by multimodality. Crawford (1994) considers two main reasons for the occurrence of multiple modes. Multiple modes occur because prior and likelihood, and hence the posterior, are invariant with respect to arbitrary relabeling of the nodes. This problem is easily avoided by introducing an arbitrary ordering of the nodes. For example, we imposed the constraint that the γ_i2 be ordered, i.e. γ_12 ≤ γ_22 ≤ ... ≤ γ_M2. Another, more serious, source of multimodality occurs through duplication of terms in the mixture, i.e. duplication of nodes in the hidden layer.

We see duplication as a manifestation of model mixing in the following sense. Denote with M_0 the fixed architecture model (4) with M hidden nodes, and with M_m the fixed architecture model (4) with m distinct hidden nodes. Model M_0 contains M_m, m = 1,...,M, as special cases by setting, for example, γ_i = γ_m for i = m+1,...,M. While exact equality of γ's has zero posterior probability because of the continuous priors we have adopted over the parameters, approximate equality can have considerable posterior probability. Denote with p_m(θ|D) and θ̂_m the posterior distribution and the posterior mode under model M_m. The posterior distribution p(θ|D) in model M_0 can be rewritten as a mixture Σ_m Pr(m|D) p_m(θ|D), where Pr(m|D) is the posterior probability of m distinct nodes. If the terms of this mixture are spiked and well enough separated, p(θ|D) exhibits local modes corresponding to the submodels, with additional multimodality entering by the different ways of nesting M_m in M_0. Of course, remaining multimodality could also be due to the inherent nonlinearity of the model. For demonstration, we generated data from a neural network model with two distinct hidden nodes:
Example 3. Simulated Data. We simulated y_1,...,y_N from (4) with M = 2, γ_1 = (1, 1.5), γ_2 = (2, −1), β = (10, 20), N = 100 and σ = 0.1, and estimated models M_3 and M_2. The marginal posterior p(γ_22|y) shown in Figure 5 has at least three local modes. The first local mode (around −1) and the third mode (around 1.5) are due to model M_2 being contained in M_3 by duplicating node 1 in node 2 (first mode) or by duplicating node 3 (third mode). Also,
under M_3, p(γ_22|y) shows a local mode around 0. This could be due to nesting model M_2 in M_3 by setting β_2 = 0. Conditional on β_2 = 0, the conditional posterior for γ_22 would coincide with the prior, i.e. be centered around the prior mean of zero (see the Appendix).
Example 1 (ctd.). Figure 6 shows some more aspects of the posterior inference in Example 1 relating to multimodality due to node duplication. The patterns are similar to those for the simulated data. However, even under model M_2 we still see some multimodality, some of which could be due to model M_1. Note, though, that as suggested in Figure 1, there is some inherent multimodality.
We propose three alternative approaches to deal with multimodality due to node duplication.

(i) From a predictive point of view, node duplication is no issue, i.e. if the focus of the analysis is prediction, for example fitting a nonlinear regression surface, one could ignore the possibility of node duplication. However, it is important to be aware of the implications for the particular estimation scheme: routine application of any numerical posterior integration scheme based on approximate posterior normality and unimodality would be hindered. This includes widely used algorithms like direct normal approximation, Laplace integration, importance sampling and iterative Gaussian quadrature. If used, as in the illustrations in Section 3.5, inference will only be applicable to the local mode (i.e., the particular submodel) on which the normal approximation was based.

(ii) Often the multimodality can be removed by picking a lower order model, i.e. when symptoms of node duplication are noticed in the posterior distribution one could consider models with fewer than M nodes. There remains the problem of identifying multimodality due to node duplication. We propose to rely mainly on graphical summaries of the type presented in Figures 5 and 6. In particular, the marginal posterior distribution of the parameter which is chosen to arbitrarily order the nodes can be a helpful exploratory data analysis tool. First, we use the marginal posteriors to detect multimodality. We can also use them to detect whether one (or more) of the marginals seems a mixture. Then, we can use pairwise scatterplots of the posteriors and check whether the scatterplot gets close to the 45 degree line, suggesting equal γ's. If this happens, as in our case, we could try to estimate a model with a smaller number of nodes, 2 in our example, and check whether multimodality due to node duplication has been removed. More formal diagnostics could be based on testing for ||γ_i − γ_{i+1}|| < ε, for a suitably chosen ε, as an approximation to the point null hypothesis ||γ_i − γ_{i+1}|| = 0. Here ε could be chosen to ensure that |ψ(x'γ_i) − ψ(x'γ_{i+1})| remains small, so that little would be lost if both nodes were collapsed (a small sketch of such a check follows this list).

(iii) We feel the conceptually clearest and most straightforward approach is to explicitly include the number of hidden nodes M as a parameter in the model, i.e. to use variable architecture NN models. This is fairly natural since, apart from simulated examples, we do not expect a 'true' M for the NN model; hence we need to model uncertainty about it. There remains the problem of providing procedures for modeling it and a scheme for estimating the model. This is the topic of Section 4.
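As an illustration of the informal check described in (ii), the following sketch computes, from MCMC output, how often adjacent (ordered) nodes are approximately duplicated; the array layout and the tolerance are our own choices.

```python
import numpy as np

def duplication_rate(gamma_draws, eps=0.1):
    """Fraction of posterior draws in which adjacent nodes are (nearly) duplicated,
    i.e. ||gamma_i - gamma_{i+1}|| < eps, for each adjacent pair of ordered nodes.
    gamma_draws: array of shape (n_draws, M, p) of MCMC samples of gamma."""
    diffs = np.linalg.norm(np.diff(gamma_draws, axis=1), axis=2)   # (n_draws, M-1)
    return (diffs < eps).mean(axis=0)                              # one rate per adjacent pair
```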
4 Variable Architecture NN models

Our considerations in Section 3 lead us to contemplate the number M of nodes as another parameter. We provide here a scheme for modeling and estimating uncertainty about M, therefore dropping the assumption of a fixed, known architecture. We actually allow the model to "select" the size of the hidden layer by including indicators d_j, with d_j = 1 if a node is included and d_j = 0 if a node is dropped. This selection is data driven and automatically based on p(M|D). We generalise the fixed architecture model in yet another direction by including a linear regression term x'a to efficiently model level and linear effects. This would typically tend to reduce the size of the network. We will always use d_1 = 1, which would be a reasonable assumption if a linear model is not satisfactory. This corresponds to a model building strategy based on blocks, where the linear part models linear aspects of the process of interest and the neural network part models nonlinear aspects. Specifically, the model we use is:
$$\begin{aligned}
y_i &= x_i'a + \sum_{j=1}^{M^*} d_j\,\beta_j\,\psi(x_i'\gamma_j) + \epsilon_i, \quad i=1,\ldots,N,\\
\epsilon_i &\sim N(0,\sigma^2),\\
\psi(z) &= \exp(z)/(1+\exp(z)),\\
\Pr(d_j = 0\mid d_{j-1}=1) &= 1-\lambda, \quad \Pr(d_j = 1\mid d_{j-1}=1) = \lambda, \quad j=2,\ldots,M^*,\\
d_j &= 0 \ \text{ if } d_{j-1}=0,\\
\beta_j &\sim N(\mu_\beta,\sigma_\beta^2),\\
\gamma_j &\sim N(\mu_\gamma, S_\gamma).
\end{aligned} \qquad (5)$$
The model is completed with the same hyperpriors as model (4). For the sake of simplicity we use an improper constant prior on the linear regression parameters a, which play a role similar to that of the weights β. Observe that the indicators d_j are ordered decreasingly, so that d_j = 0 implies d_k = 0 for k > j. We also maintain the ordering of the logistic parameters γ_j according to γ_j2,
say, to mitigate the relabeling problem described in Section 3.7. The prior distribution for the indicators d_j allows a priori any architecture with M ≤ M* hidden nodes. It actually implies a negative binomial prior with parameter λ, truncated at M*, on the size M of the hidden layer. This prior enables efficient implementation and favors parsimony, in the sense of supporting architectures with a smaller number of hidden nodes; this is mitigated by choosing a bigger λ. Also, it is conceivable that we might be interested in ensuring that the number of hidden nodes is bigger than, say, M_1, in which case model (5) may be extended by assuming d_1 = ... = d_{M_1} = 1. Note that, alternatively, we could have used a Bin(M*, λ) prior on the size M of the hidden layer, by defining Pr(d_j = 0) = 1 − λ, Pr(d_j = 1) = λ. This enables more flexible modeling of uncertainty about the size of the hidden layer, but the implementation is less efficient.

This scheme is a natural extension of that in Section 3.6. This is yet another advantage of favoring an MCMC approach to NN modeling: other computational schemes described in Section 3 break down in the variable architecture case, whereas the simulation scheme outlined for inference in the fixed architecture model requires only minor modifications to be used for model (5). Conditional on currently imputed values for the indicators d_j, the model reduces to the fixed architecture one. Given other model parameters, the conditional posterior probabilities for d_j = 0 and d_j = 1 are easily computed. Denote with M = Σ_{j=1}^{M*} d_j the number of hidden nodes currently included in the model. By definition of the indices we always have d_j = 1, j = 1,...,M, and d_j = 0, j = M+1,...,M*. Before discussing details of the algorithm, we give an outline of the updating scheme. The scheme goes through the following sampling steps, until convergence is judged:
(i) γ_j | M, η, D, for j = 1,...,M+1;
(ii) M | γ_1,...,γ_M, γ_{M+1}, η, D;
(iii) β_1,...,β_M, a | γ_1,...,γ_M, M, η, D;
(iv) η | β_1, γ_1,...,β_M, γ_M, D.

In step (i), we marginalize over β and we include γ_{M+1}. Conditional on M and the hyperparameters, the conditional posterior on γ_{M+1} is just the N(μ_γ, S_γ) prior. The remaining γ_j's (j = 1,...,M) are updated through Metropolis steps, as in Section 3.6. Step (ii) refers to updating the number of hidden nodes. We also marginalise over β. Denote with M′ the current value of M and write γ = (γ_1,...,γ_{M′+1}). Given currently imputed values for γ and the hyperparameters η, the conditional posterior probabilities for M = M′−1, M′, M′+1 take the following form:
$$\Pr(M\mid\gamma,\eta,D) \;\propto\;
\begin{cases}
(1-\lambda)\,\prod_{i=1}^{N} p(y_i\mid M=M'-1,\gamma,\eta) & \text{for } M=M'-1,\\[2pt]
\lambda(1-\lambda)\,\prod_{i=1}^{N} p(y_i\mid M=M',\gamma,\eta) & \text{for } M=M',\\[2pt]
\lambda^{2}(1-\lambda)\,\prod_{i=1}^{N} p(y_i\mid M=M'+1,\gamma,\eta) & \text{for } M=M'+1.
\end{cases} \qquad (6)$$

For the special cases M′ = 1 and M′ = M*, replace (6) by Pr(M = M′−1|...) = 0 and Pr(M = M′+1|...) = 0, respectively. Note that p(y_i|M, γ, η) is marginalized over (β, a) using the result given in Lemma 1.
Updating M is implemented in a Metropolis step by first generating a "candidate" value M̃ for M with Pr(M̃ = M′−1) = Pr(M̃ = M′+1) = 0.5. With probability a(M′, M̃) = min(1, Pr(M̃|γ,η,D)/Pr(M′|γ,η,D)) we accept the candidate as the new value for M, i.e. we set M = M̃; else we keep the current value, i.e. M = M′. In step (iii) we sample all β_j and a jointly, drawing from the complete conditional posterior. The complete conditional posterior of β_1,...,β_M, a given γ_1,...,γ_M is a multivariate normal distribution in a straightforward linear normal regression model, as described in Lemma 1. Step (iv) is unchanged from the fixed architecture model; see step (iv) in the algorithm described in Section 3.6.

We illustrate the variable architecture model with two examples. The first one shows how model (5) may adapt to multimodalities. The second one suggests how model (5) may adapt to sharp edges. The flexibility of this model in coping with these features makes it very competitive with respect to other smoothing methods, including model (4).
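A sketch of the Metropolis move on M in step (ii) under the truncated geometric prior implied by the d_j's is given below; the helper log_marg_lik, which should return Σ_i log p(y_i|M = m, γ, η) with (β, a) integrated out (Lemma 1), is assumed to be supplied by the caller, and all names are ours.

```python
import numpy as np

def update_M(M, M_max, lam, log_marg_lik, rng):
    """One Metropolis update of the number of active hidden nodes (step ii).
    The prior ratio comes from Pr(M = m) proportional to lam^(m-1) * (1 - lam)."""
    M_new = M + rng.choice([-1, 1])                 # candidate M~ with probability 0.5 each
    if M_new < 1 or M_new > M_max:
        return M                                    # boundary: proposal has zero probability
    log_prior_ratio = (M_new - M) * np.log(lam)
    log_accept = log_prior_ratio + log_marg_lik(M_new) - log_marg_lik(M)
    if np.log(rng.uniform()) < log_accept:
        return M_new
    return M
```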
Example 4. Ozone Data. Chambers et al. (1983, pp. 346-347) report a data set with n = 111 daily observations on ozone concentration (y_i), radiation (x_i1), and temperature (x_i2). Observations with missing values were excluded, and ozone was transformed to a cube root scale. Figure 7a shows the estimated regression surface f̂(x_1, x_2) = E(y_{n+1}|x_{n+1,1} = x_1, x_{n+1,2} = x_2, y) under the variable architecture model, i.e. the predictive means are averaged over models with hidden layers of different sizes. For the particular prior used (λ = 0.80), Figure 7b shows the induced posterior distribution on the number of nodes. Note that we have moved from a truncated negative binomial prior to a posterior on the number of hidden nodes with a mode at 6. The posterior distribution suggests a size of the hidden layer between 4 and 8, with 6 the most likely size. Note also the drop in posterior probability from 4 to 5 nodes. Observe also in Figure 7a how the fitted surface adapts to the multimodality of the surface.
Example 5. Reservoir management. We apply the described methods to a case study coming from Rios Insua and Salewicz (1995). The example concerns a reservoir operation problem, complicated by the existence of multiple objectives, uncertainty about the inflows and the effect of time. The decisions to be made each month were the volumes of water to be released through the turbines and spilled, based on maximising a predictive expected utility. There was no analytic expression for this last one, so we could appeal to an optimisation method based on function evaluations only, such as the Nelder-Mead method. Alternatively, we could evaluate the expected utility at a grid of controls, fit a surface and optimise it. We illustrate this last approach, fitting a neural network model. Figure 8 shows the data and the fitted surface. Note how the NN model fitted the sharp edge in front, a case in which other smoothing methods would typically fail.
5 Discussion

Though neural network models are typically presented as black box models allowing one to model nonlinear features in problems like approximation, regression, smoothing, forecasting and classification, we believe that incorporating prior knowledge in those models should enhance their performance. This naturally calls for a Bayesian approach to neural network models. Among other advantages, this allows for the coherent incorporation of all uncertainties, including those relating to the hidden layer size. This approach, however, leads to difficult computational problems, and we have analysed the pros and cons of several computational schemes. Specifically, we have noted potential problems of approaches based on normal approximations due to nonnormality and multimodality. As an alternative, we have provided a powerful Markov chain Monte Carlo scheme which avoids those problems and permits routine Bayesian analysis of FFNN models. The scheme allows the consideration of variable architecture networks and, consequently, automates the choice of the network architecture. We have also shown that the scheme allows the combination with more conventional models such as linear regression. In summary, we have provided a general enough framework for the Bayesian analysis of FFNN models. Future work will deal with a somewhat inverse problem: how FFNN models may enhance the Bayesian toolkit.
Appendix: Implementation, Initialization and convergence diagnostic
In the examples, we have used the following initialization and hyperparameters. The covariates were standardized to have mean zero and unit variance (except, of course, for the dummy intercept x_i1 = 1). In the current implementation we keep the hyperparameters μ_β, σ_β, μ_γ and S_γ fixed at μ_β = 0, σ_β² = A_β, μ_γ = (0,...,0)′, S_γ = 25·I, where I is the identity matrix of appropriate dimension and A_β = 10000 in Examples 1, 2 and 3, A_β = 100 in Example 4 and A_β = 1 in Example 5, with the respective choices reflecting the scale of the response variable y. We simulated 10000, 10000, 3000, 1000 and 1000 iterations in Examples 1, 2, 3, 4 and 5, respectively. The decision to terminate the simulations was based on the convergence diagnostic proposed by Geweke (1992). The long simulation lengths in Examples 1 and 2 were required to obtain sufficiently large Monte Carlo posterior samples for the posterior scatter plots. In all examples, we discarded the initial 100 iterations as burn-in and saved every 10-th iteration thereafter to collect an approximate posterior Monte Carlo sample, used for
the ergodic averages. The error variance was fixed at σ² = 100, 1, 100, 0.5 and 0.5 in Examples 1, 2, 3, 4 and 5, respectively. In Examples 4 and 5, we chose λ = 0.8, M* = 20, and M = 3 initially for the prior on the number of nodes in the hidden layer. Simulations in all examples were implemented as C programs on a personal computer with a 486/100MHz processor running under BSD unix.
Acknowledgements. Research supported by grants from the National Science Foundation, CICYT and the Iberdrola Foundation. It was completed while the first author visited the Department of Artificial Intelligence, Madrid Technical University, and the second author visited CNR-IAMI. We are grateful for discussions with Fabrizio Ruggeri.
References

Amari, S. 1993. Mathematical methods of neurocomputing. In Networks and Chaos, Barndorff-Nielsen, Jensen, Kendall, eds. Chapman and Hall, London.
Azzalini, A. and Bowman, A. W. 1990. A look at some data on the Old Faithful geyser. Applied Statistics, 39, 357-365.
Beale, R. and Jackson, T. 1991. Neural Computing, Hilger.
Berger, J. 1990. Robust Bayesian analysis: sensitivity to the prior. J. Stat. Plan. Inf., 25, 303-328.
Berger, J. and Delampady, M. 1987. Testing precise hypotheses. Stat. Sci.
Bernardo, J. and Smith, A. 1994. Bayesian Theory, Wiley.
Besag, J. 1986. On the statistical analysis of dirty pictures. Jour. Roy. Stat. Soc. B, 48, 259-302.
Besag, J. and Green, P. 1993. Spatial statistics and Bayesian computation. Jour. Roy. Stat. Soc. B, 55, 25-37.
Buntine, W. and Weigend, A. 1991. Bayesian back-propagation. Complex Systems, 5, 603-643.
Buta, R. 1987. The structure and dynamics of ringed galaxies, III. The Astroph. Jour. Suppl. Ser., 64, 1-37.
Chambers, J., Cleveland, W., Kleiner, B. and Tukey, P. 1983. Graphical Methods for Data Analysis, Wadsworth, Pacific Grove.
Cheng, B. and Titterington, D. 1994. Neural networks: a review from a statistical perspective (with discussion). Stat. Science, 9, 2-54.
Crawford, S. 1994. An application of the Laplace method to finite mixture distributions. J. Am. Statist. Assoc., 89, 254-267.
Cybenko, G. 1989. Approximation by superposition of a sigmoidal function. Math. Cont. Sys. Sig., 2, 303-314.
Diebolt, J. and Robert, C. 1994. Estimation of finite mixture distributions through Bayesian sampling. J. R. Statist. Soc. B, 56, 363-375.
Escobar, M. and West, M. 1994. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, to appear.
Gelman, A., Roberts, G.O. and Gilks, W.R. 1994. Efficient Metropolis jumping rules. Technical report, University of California.
Geweke, J. 1989. Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 24, 1317-1399.
Geweke, J. 1992. Evaluating the accuracy of sampling based approaches to the calculation of posterior moments. In Bayesian Statistics 4, Bernardo, Berger, Dawid and Smith, eds., pp. 169-194. Oxford University Press, Oxford.
Lavine, M. and West, M. 1992. A Bayesian method for classification and discrimination. Can. Journ. Stats., 20, 451-461.
MacKay, D. 1992. A practical Bayesian framework for backprop networks. Neural Computation, 4, 448-472.
MacKay, D. 1995. Bayesian methods for neural networks: theory and applications. Technical report, Cavendish Lab., Cambridge Univ.
Mengersen, K. and Robert, C. 1995. Testing for mixtures: a Bayesian entropic approach. In Bayesian Statistics 5, J.O. Berger, J.M. Bernardo, A.P. Dawid, D.V. Lindley and A.F.M. Smith, eds. Oxford University Press, London (to appear).
Muller, B. and Reinhardt, J. 1991. Neural Networks. Springer-Verlag, New York.
Neal, R. 1993. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Dept. of Computer Science, University of Toronto.
Raftery, A., Madigan, D. and Volinsky, C. 1995. Accounting for model uncertainty in survival analysis improves predictive performance. In Bayesian Statistics 5, Bernardo, Berger, Dawid, Smith, eds., to appear.
Rios Insua, D. and Salewicz, K. 1995. The operation of Kariba Lake: a multiobjective decision analysis. Jour. Multic. Decision Anal. (to appear).
Ripley, B. 1993. Statistical aspects of neural networks. In Networks and Chaos, Barndorff-Nielsen, Jensen, Kendall, eds. Chapman and Hall, London.
Robert, C. 1995. The Bayesian Choice, Springer-Verlag, New York.
Robert, C. and Mengersen, K.L. 1995. Reparameterisation issues in mixture modelling and their bearing on the Gibbs sampler. Technical report, CREST, Paris.
Roeder, K. and Wasserman, L. 1995. Practical Bayesian density estimation using mixtures of normals. Technical report, Carnegie Mellon University.
Rumelhart, D. and McClelland, J. 1986. Parallel Distributed Processing, Bradford.
Smith, A. 1991. Bayesian computational methods. Phil. Trans. Roy. Soc. London, 337, 369-386.
Spiegelhalter, D., Thomas, A. and Gilks, W. 1994. BUGS Manual. MRC Biostatistics Unit, IPH, Cambridge.
Thodberg, H. 1994. Ace of Bayes: application of neural networks with pruning. Technical report, Danish Meat Research Institute.
Tierney, L. 1989. X-LISP STAT, Wiley, New York.
Tierney, L., Kass, R. and Kadane, J. 1989. Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Jour. American Statist. Association, 81, 82-86.
Tierney, L. 1994. Markov chains for exploring posterior distributions. Annals of Statistics.
Wang, Y. 1995. Unpredictability of standard back propagation neural networks. Mgt. Sci., 41, 555-559.
West, M. 1994. Models and computations for neural transmission analysis. Journal of the American Statistical Association, to appear.
West, M. and Cao, G. 1993. Assessing mechanisms of neural synaptic activity. In Bayesian Statistics in Science and Technology: Case Studies, C. Gatsonis, J. Hodges, R. Kass and N. Singpurwalla, eds. Springer-Verlag, New York.
West, M., Mueller, P. and Escobar, M.D. 1994. Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of Uncertainty: A Tribute to D. V. Lindley, A.F.M. Smith and P.R. Freeman, eds. Wiley, London.
West, M. and Turner, D.A. 1994. Deconvolution of mixtures in analysis of neural synaptic transmission. The Statistician, 43, 31-43.
Table 1: Example 1. Estimated posterior modes θ̂ computed by the iterative conditional modes algorithm, started at randomly generated initial values.

        β_1      γ_11    γ_12    β_2      γ_21    γ_22    β_3      γ_31    γ_32
ICM     67.9     3.70    2.09    163.2    0.40    4.26    124.2    -2.1    4.63
        58.64    7.09    3.99    186.81   0.38    4.10    109.71   -2.43   4.95
        41.36    8.64    -5.51   161.74   -0.09   2.46    173.18   -0.69   3.18
        135.37   0.26    0.72    270.18   -0.55   3.19    -63.82   -8.29   5.06
Table 2: Example 1. Posterior means (first row) and standard deviations (second row) estimated by normal approximation, Laplace integration, importance sampling (IS) and Markov chain Monte Carlo (MCMC).

                β_1      γ_11    γ_12    β_2      γ_21    γ_22    β_3      γ_31    γ_32
Norm. approx.   67.0     3.94    2.22    162.55   0.43    4.31    125.75   -2.12   4.66
                26.44    4.63    2.37    69.84    0.63    0.97    74.93    1.43    1.34
Laplace         68.9     6.87    3.95    151.31   0.65    3.95    149.02   -1.95   4.73
                79.44    13.90   7.13    102.13   1.16    1.58    104.41   2.81    2.61
IS              60.03    5.17    2.95    170.74   0.15    4.20    127.50   -1.46   3.68
                10.26    1.69    1.06    31.21    0.42    0.70    27.64    0.85    1.40
MCMC            66.28    4.77    -3.01   76.42    -0.77   0.58    186.61   -0.67   3.75
                111.75   5.76    3.81    157.29   4.69    3.09    153.81   4.74    1.44
(a) Likelihood. (b) Posterior.
Figure 1: Example 1. Panel (a) shows the marginal likelihood p(D|γ) for the model with M = 2 hidden neurons. The likelihood is plotted as a function of γ_12, γ_22, fixing γ_11 = γ_21 = 0. Panel (b) plots p(γ_12, γ_22 | γ_11 = γ_21 = 0, D) in the model with M = 2. The multimodality of the likelihood is smoothed out (except for multimodality due to relabeling etc.; see the discussion in Section 3.7). Non-normality remains.
Figure 2: Example 1. p(β_1|D) and p(γ_12|D) estimated by Laplace integration (top row) and by MCMC (bottom row). Notice how Laplace integration is constrained to one local mode.
(a) f̂(x) = E(y_{N+1}|x_{N+1} = x, D). (b) E(y_{N+1}|x_{N+1} = x, θ = θ_t).
Figure 3: Example 1. Estimated regression curve f̂(x) (left panel), and a few draws from the posterior distribution on the regression curve (right panel) induced by the posterior distribution p(θ|D) on the parameters. In both panels, the dots show the data points.
(a) f̂(x) = E(y_{N+1}|x_{N+1} = x, D). (b) E(y_{N+1}|x_{N+1} = x, θ = θ_t), θ_t ~ p(θ|D).
Figure 4: Example 2. Estimated regression curve (thick line) and simulations from the posterior distribution on the regression line (thin lines) for the Old Faithful geyser data. The superimposed points show the data.
Figure 5: Example 3 (simulated). Posterior distributions under model M_3 (top row) and under model M_2 (bottom row of panels). The panels on the left show the marginal posterior distributions p(γ_12|D) (solid line), p(γ_22|D) (dotted line), and p(γ_32|D) (dashed line in the top left panel). The posterior multimodality can be clearly seen in the marginal posterior distributions under M_3. Note how p(γ_22|D) takes the form of a mixture of p(γ_12|D) and p(γ_32|D). This multimodality vanishes under M_2. The panels on the right show the joint posterior distribution p(γ_12, γ_22|D) under M_3 (top) and M_2 (bottom). The line indicates the 45 degree line γ_12 = γ_22. All multimodality is removed by constraining to M = 2 nodes. The data were generated from a model with two hidden nodes. In general, some multimodality might be left due to M_1.
Figure 6: Example 1. Posterior distributions under model M_3 (top row) and M_2 (bottom). The features are very similar to those of the posterior distribution for the simulated data in the previous figure. In particular, note that p(γ_22|D) is a mixture of p(γ_12|D), p(γ_32|D) and the prior p(γ_22). Also, the second mode of p(γ_12|D) duplicates p(γ_32|D), indicating positive posterior probability for model M_1 using node 3 only. As in the simulated example, most of the multimodality from the posterior under M_3 is lost under M_2. Only some minor mode due to model M_1 remains. It can be seen as a secondary mode in p(γ_12|D) as well as a secondary mode close to the 45 degree line in p(γ_12, γ_22|D).
(a) E(y_{n+1}|D). (b) p(M|D).
Figure 7: Example 4. Fitted surface E(y_{n+1}|D) as a function of x_{1,n+1}, x_{2,n+1} (panel a) and posterior p(M|D) on the size of the hidden layer (panel b). Compare this with the NBin(0.8) prior distribution.
(a) Data. (b) Fitted NN surface.
Figure 8: Example 5. Data and fitted surface using the NN model of Section 4.