

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 8, AUGUST 2008

Deterministic Learning for Maximum-Likelihood Estimation Through Neural Networks

Cristiano Cervellera, Danilo Macciò, and Marco Muselli, Member, IEEE

Abstract—In this paper, a general method for the numerical solution of maximum-likelihood estimation (MLE) problems is presented; it adopts the deterministic learning (DL) approach to find close approximations to ML estimator functions for the unknown parameters of any given density. The method relies on the choice of a proper neural network and on the deterministic generation of samples of observations of the likelihood function, thus avoiding the problem of generating samples with the unknown density. Under mild assumptions, consistency and convergence with favorable rates to the true ML estimator function can be proved. Simulation results are provided to show the good behavior of the algorithm compared to the corresponding exact solutions.

Index Terms—Deterministic learning (DL), discrepancy, maximum-likelihood estimation (MLE), variation.

Manuscript received October 17, 2006; revised July 23, 2007; January 10, 2008; accepted January 27, 2008. First published July 15, 2008; last published August 6, 2008 (projected). C. Cervellera and D. Macciò are with the Istituto di Studi sui Sistemi Intelligenti per l'Automazione, Consiglio Nazionale delle Ricerche, Genova 16149, Italy (e-mail: [email protected]; [email protected]). M. Muselli is with the Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, Genova 16149, Italy (e-mail: [email protected]). Digital Object Identifier 10.1109/TNN.2008.2000577

I. INTRODUCTION

A central problem in statistics is the estimation of the probability density function (pdf) of a random variable starting from a sample of independent identically distributed (i.i.d.) realizations. Due to its importance, a great variety of approaches have been proposed in the literature to deal with this problem [1]–[9]. They differ from many points of view, but usually a subdivision into two great classes is performed. Nonparametric techniques [2]–[4], [9] aim to retrieve the behavior of the pdf without imposing any a priori assumption on it [10, p. 150]. As a consequence, they require a sample whose size is significantly higher than the dimension of the domain of the random variable. To improve the quality of the approximation obtained, while keeping the number of realizations needed low, several promising approaches have been proposed [11]–[15]. On the other hand, parametric techniques [3], [4], [6], [9], [16] assume that the form of the pdf is known in advance except for a vector of parameters $\theta$, which has to be estimated from the sample of realizations. In this case, a smaller sample size can achieve good performance if the form of the pdf is properly selected. An interesting analysis of the pros and cons of parametric and nonparametric methods can be found in [17]. To take advantage of both approaches while reducing the negative aspects, a new class of methods, named semiparametric techniques, has been introduced, which considers models

having a number of parameters that does not grow with the sample size, though greater than that involved in parametric techniques. In this way, a sufficient flexibility in the form of the pdf is guaranteed, while keeping the computational cost low. Density estimation methods using neural networks [18]–[21] or support vector machines [22] fall into this last class of techniques.

Nevertheless, when some insight about the form of the pdf is available, parametric techniques offer the most valid and efficient approach to density estimation. One of the most used methods inside this class is maximum-likelihood estimation (MLE) [6], [16], where the unknown vector of parameters is estimated by maximizing the likelihood of obtaining the random sample at hand through a pdf of the considered form. More formally, if $p(x, \theta)$ is the expression of the pdf, where $x \in X \subseteq \mathbb{R}^d$ and $\theta \in \Theta$, the MLE approach consists in solving the following problem:
$$\theta^o(x^n) \in \arg\max_{\theta \in \Theta} L(x^n, \theta) \quad \text{for all } x^n \in X^n \qquad (1)$$
where $x^n = (x_1, \ldots, x_n)$ is the random sample, $\Theta \subset \mathbb{R}^s$ is a compact set, and $L(x^n, \theta)$ is the likelihood function, defined as the joint pdf computed in $x^n$. Given that the random sample is i.i.d., we can write $L$ in the following form:
$$L(x^n, \theta) = \prod_{i=1}^{n} p(x_i, \theta).$$

The function $\theta^o(\cdot)$ that solves (1) for any $x^n$ is the maximum-likelihood (ML) estimator corresponding to the family of densities $p(x, \theta)$. This formally corresponds to a functional optimization problem in which we look for a function of the random sample that maximizes the likelihood function at any point of the domain. If we were able to derive such a function, we could obtain an ML estimate for any given random sample drawn under any value of the unknown parameters, i.e., we would have an ML estimator for the family of density functions of the form $p(x, \theta)$. In particular, such a function would allow us to obtain an ML estimate for any new sample without resorting to a (in general, nonlinear) optimization. As a consequence, such an estimator would be well suited, for instance, to real-time contexts where an immediate estimate of the parameters is required and there is no time for computations.

In general, analytical solutions to functional optimization problems of this kind cannot be obtained, except in simple cases (for instance, by setting the derivative of the likelihood function to zero and solving the corresponding equations). Therefore, techniques capable of obtaining approximate solutions have to be employed.

In this paper, we present a method based on deterministic learning (DL) for the derivation of approximate ML estimators that yield an estimate of the parameters of the unknown density for any random sample of fixed size, drawn under any actual value of the parameters.
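As a simple illustration of the closed-form case mentioned above (a standard textbook example, not taken from this paper), consider an exponential density $p(x, \lambda) = \lambda e^{-\lambda x}$, $x \ge 0$: setting the derivative of the log-likelihood to zero yields the ML estimator explicitly
$$\frac{\partial}{\partial \lambda} \sum_{i=1}^{n} \ln\!\big(\lambda e^{-\lambda x_i}\big) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \quad\Longrightarrow\quad \hat{\lambda}(x^n) = \frac{n}{\sum_{i=1}^{n} x_i}.$$
For most families of densities no such explicit expression exists, which is precisely the situation addressed by the approximation scheme developed in the following sections.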



DL was first introduced in the context of function estimation from data [23], [24]. In that context, we look for the best element within a class of neural network models, in terms of some risk functional that measures the distance from the true function, based on an empirical cost computed over a random sample of input/output observations. In typical function estimation frameworks [25], [26], samples are generated by an external source according to an unknown probability distribution. DL deals instead with the case where the sample space can be actively sampled in a deterministic way. Consequently, the results obtained are not subject to a confidence interval. The accuracy of the estimation is measured by two specific quantities, discrepancy and variation. In particular, it is proved that observations chosen according to a special family of deterministic sequences, called low-discrepancy sequences [27], [28] and typically employed in quasi-Monte Carlo integration, lead to an almost linear rate of convergence of the estimation error when commonly employed neural network structures, such as feedforward sigmoidal networks and radial basis function networks, are used. Low-discrepancy sequences have also been successfully employed for the design of state observers [29].

For the purposes of the MLE problem addressed in this work, we first show how problem (1) can be cast in a DL context; then we apply the DL methodology, based on low-discrepancy sequences, to obtain an approximate solution. Convergence results are provided, in terms of how close the approximation obtained by the DL method is to the solution given by the "true" estimator that maximizes the likelihood function, under some mild regularity assumptions on the involved functions. Such results depend on the discrepancy of the sample of observations. In particular, when low-discrepancy sequences, such as $(t, s)$-sequences [27, p. 47], are employed, convergence can be achieved at an almost linear rate as the number of observations grows. This makes the proposed approach particularly suited to problems with a high-dimensional input space. We notice that, since the approximate ML estimator can be obtained from observations based on a deterministic sampling of the sample space, the proposed approach allows us to avoid the problem of generating samples with the original density $p(x, \theta)$.

This paper is organized as follows. In Section II, the basic algorithm for the solution of the MLE problem is introduced, together with its connections to the theory of DL. Section III contains an analysis of the convergence properties of the method, while Section IV is devoted to experimental results regarding the application of the proposed approach to actual examples of pdfs. Finally, Appendixes I and II contain definitions and proofs of theorems.

II. MAXIMUM-LIKELIHOOD ESTIMATION AS A DETERMINISTIC LEARNING PROBLEM

In order to solve the functional optimization problem (1), we need to find a function that maximizes the likelihood at any point of the sample space. For the purpose of applying the results of DL, we first show how this can be achieved through the maximization of the likelihood in integral form.


Since the method is based on sampling, we need to define a compact set in which the random variable can take values. Obviously, we need to choose this set in such a way that the probability of the random variable taking values outside it is small.

For the sake of simplifying the notation, and without loss of generality, assume in the following that $d = 1$, i.e., $X \subset \mathbb{R}$, and consider the compact set $X^n \times \Theta \subset \mathbb{R}^{n+s}$, where $n$ is the number of observations in the sample and $s$ is the number of parameters in $\theta$. The approach described in this section can be extended in a straightforward manner to problems where $d > 1$. Then, define the function $\theta^o: X^n \to \Theta$ as the one that maximizes the likelihood for each $x^n \in X^n$
$$\theta^o(x^n) \in \arg\max_{\theta \in \Theta} L(x^n, \theta)$$

which, in general, may not be unique. Denote the set of the maximizer functions (i.e., the ML estimators) with
$$\Theta^o = \big\{\theta^o : \theta^o(x^n) \in \arg\max_{\theta \in \Theta} L(x^n, \theta) \ \text{for all } x^n \in X^n\big\}$$
and consider a generic $\theta^o \in \Theta^o$. Without loss of generality, we prove the main results by assuming in the following that $X^n$ is the $n$-dimensional unit cube $[0, 1]^n$. The results can be easily extended to any subinterval of the form $\prod_{i=1}^{n}[a_i, b_i]$ by simple scaling, and to more complex subsets of $\mathbb{R}^n$ by suitable transformations [28, p. 39].

If $\gamma$ belongs to a set $\Gamma$ of functions mapping $X^n$ into $\Theta$, denote with $\gamma^o$ the maximizer over the set $\Gamma$ of
$$V(\gamma) = \int_{X^n} L\big(x^n, \gamma(x^n)\big)\, dx^n.$$

Then, $\gamma^o$ is a function belonging to $\Gamma$ that maximizes the integral of $L$ over $X^n$. Again, $\gamma^o$ is possibly not unique; let $\Gamma^o$ be the collection of the possible maximizers $\gamma^o$ of the integral $V$.

Notice that we have $\Theta^o \subseteq \Gamma^o$.

Theorem 2.1: For all $\gamma^o \in \Gamma^o$ there exists $\theta^o \in \Theta^o$ such that $\gamma^o = \theta^o$ almost everywhere.

We can obtain the equivalence between $\Theta^o$ and $\Gamma^o$ in the case of suitably "well-behaved" functions $L$, as follows.

Assumption 2.1: $L$ is such that, for each $x^n \in X^n$, we can choose $\theta^o(x^n) \in \arg\max_{\theta \in \Theta} L(x^n, \theta)$ for which there exists a set $B \subseteq X^n$ with $\mu(B) > 0$ such that $L(\cdot\,, \theta^o(x^n))$ is continuous on $B$, where $\mu$ denotes the Lebesgue measure.

Theorem 2.2: If Assumption 2.1 holds, we have $\Theta^o = \Gamma^o$.

Theorems 2.1 and 2.2 assert that we can focus the attention on the maximizer of the integral of the likelihood over the space $X^n$, which allows us to state the MLE problem in the context of DL.


In such a framework, since we are, in general, unable to solve the maximization of the integral analytically, we search for the maximum over a set $\Gamma_M$ of neural network models, possibly including $\gamma^o$ or an $\varepsilon$-close approximation to it. The maximum is typically obtained through a learning procedure. In particular, the procedure for obtaining the approximate ML estimator corresponds to the following functional optimization problem: find $\hat{\gamma} \in \Gamma_M$ such that $V(\hat{\gamma}) = \max_{\gamma \in \Gamma_M} V(\gamma)$.

Following the approach of DL, we want to achieve the maximization of $V$ over $\Gamma_M$ through the maximization of an "empirical" estimate of $V$ based on a finite sample of observations of $L$. To this purpose, we define $P_N = \{x^n_1, \ldots, x^n_N\}$ as a sample of $N$ discretization points of $X^n$, and
$$V_N(\gamma) = \frac{1}{N}\sum_{j=1}^{N} L\big(x^n_j, \gamma(x^n_j)\big). \qquad (2)$$
The problem is thus reduced to: find $\hat{\gamma}_N \in \Gamma_M$ such that $V_N(\hat{\gamma}_N) = \max_{\gamma \in \Gamma_M} V_N(\gamma)$. In the case of neural networks, the maximization of $V_N$ reduces to an optimization procedure in a finite-dimensional space of parameters, i.e., models in $\Gamma_M$ have the form $\gamma(x^n, w)$, where $w$ is a set of parameters that have to be optimized.

Thus, the method relies on two key points: 1) the choice of a suitable class $\Gamma_M$ of networks and 2) a proper discretization $P_N$ of the $n$-dimensional state space. We point out that the purpose of the discretization is the maximization of (2); therefore, we do not need to extract samples in a way that is related to the unknown original density $p(x, \theta)$.

A. Example

Let us consider the classical problems of estimating the mean $\mu$ and the covariance matrix $\Sigma$ of a $d$-dimensional Gaussian random variable, whose density function has the form
$$p(x, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right). \qquad (3)$$
Concerning the estimate of $\mu$, the ML solution is an "arithmetic mean estimator" of the form
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (4)$$
while for $\Sigma$ the MLE is
$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^{\top}. \qquad (5)$$
We want to show that our approach can achieve exactly the solution defined in (4) and (5) by choosing the following set of admissible functions:
$$\gamma_\mu(x^n) = \sum_{i=1}^{n} a_i x_i \qquad (6)$$
for the mean, and
$$\gamma_\Sigma(x^n) = \sum_{i=1}^{n} b_i \big(x_i - \gamma_\mu(x^n)\big)\big(x_i - \gamma_\mu(x^n)\big)^{\top} \qquad (7)$$
for the covariance matrix. In the previous definitions, $x_i$, $i = 1, \ldots, n$, are the sample points, whereas $a = (a_1, \ldots, a_n)$ and $b = (b_1, \ldots, b_n)$ are vectors of parameters. The likelihood function is
$$L(x^n, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i, \mu, \Sigma). \qquad (8)$$
To simplify the maximization task, it is convenient to employ the log-likelihood function, i.e.,
$$\Lambda(x^n, \mu, \Sigma) = \ln L(x^n, \mu, \Sigma) = \sum_{i=1}^{n} \ln p(x_i, \mu, \Sigma). \qquad (9)$$
In fact, it can be easily seen that the points of maximum of $L$ and $\Lambda$ coincide.

We have seen that, for the solution of the MLE problem, DL leads to the definition of the integral
$$\int_{X^n} \Lambda\big(x^n, \gamma(x^n)\big)\, dx^n \qquad (10)$$
and eventually to the maximization of the empirical estimate of (10) based on the finite sample of observations of $\Lambda$
$$V_N(\gamma) = \frac{1}{N}\sum_{j=1}^{N} \Lambda\big(x^n_j, \gamma(x^n_j)\big) \qquad (11)$$
where we recall that $P_N = \{x^n_1, \ldots, x^n_N\}$. With a little abuse of notation, we have denoted with $V_N$ also the empirical estimate for the log-likelihood function. Notice that $\gamma$ varies in $\Gamma_M$ as the coefficient vectors $a$ and $b$ vary, i.e., we have turned the infinite-dimensional maximization problem into a finite-dimensional problem in the space of the coefficients of the linear and the quadratic combinations. The method then leads us to determine $\max_{a,\,b} V_N$, which can be solved analytically by first setting
$$\frac{\partial V_N}{\partial a_i} = 0, \qquad i = 1, \ldots, n. \qquad (12)$$
Equation (12) can be rewritten in the following form:
(13)
By the linearity of (12) with respect to the coefficients $a_i$, the solution is independent of $\Sigma$, so (13) can be simply solved by setting $a_i = 1/n$, for $i = 1, \ldots, n$, regardless of the other quantities involved. This solution yields $\gamma_\mu(x^n) = \frac{1}{n}\sum_{i=1}^{n} x_i$, which is the correct ML estimator defined by (4). As for the covariance matrix estimate, we set
(14)
where $\Sigma^{-1}$ and $\Sigma^{-2}$ denote the inverse and the squared inverse of the matrix $\Sigma$, respectively, which leads to
(15)
Equation (15) can be simply solved by setting $a_i = 1/n$ and $b_i = 1/n$, $i = 1, \ldots, n$, which yields the solution expressed by (5).

This example shows that the method can lead to the exact analytical solution provided the choice of the class $\Gamma_M$ is particularly favorable, e.g., when the true optimal estimator is an element of $\Gamma_M$ (a condition generally referred to as the zero-error case in the statistical learning literature). When this condition is not true, we will see in the following that a suitably "rich" class is still enough to obtain a sufficiently close ML estimate of the unknown parameters as the number of samples grows.
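To make the empirical criterion concrete, the following sketch (ours, not the authors' code) evaluates the estimate (11) for the mean part of the example over a Sobol' discretization of $[0,1]^n$, with the variance treated as known for simplicity; the names and settings (`emp_log_lik`, `n = 8`, `sigma = 0.1`) are illustrative assumptions.

```python
# Sketch: empirical integral criterion (11) for the linear mean estimator (6),
# evaluated on a deterministic Sobol' discretization of the sample space [0, 1]^n.
import numpy as np
from scipy.stats import qmc

n = 8          # observations per sample (input dimension of the estimator), assumed
N = 1024       # number of discretization points, assumed
sigma = 0.1    # variance treated as known here; only the mean is estimated

points = qmc.Sobol(d=n, scramble=False).random(N)    # deterministic sample of [0, 1]^n

def mean_estimator(xn, a):
    """Linear-combination estimator of the mean, as in (6): sum_i a_i * x_i."""
    return xn @ a

def emp_log_lik(a):
    """Empirical estimate (11) of the integral of the Gaussian log-likelihood."""
    mu = mean_estimator(points, a)                    # one estimate per discretization point
    resid = points - mu[:, None]
    ll = -0.5 * np.sum(resid**2, axis=1) / sigma**2 - n * np.log(sigma * np.sqrt(2 * np.pi))
    return ll.mean()

uniform = np.full(n, 1.0 / n)        # arithmetic-mean coefficients a_i = 1/n
print(emp_log_lik(uniform))          # should not be beaten by any other coefficient vector
print(emp_log_lik(np.eye(n)[0]))     # e.g., "use only the first observation" scores lower
```

Maximizing `emp_log_lik` over the coefficient vector returns the arithmetic-mean weights, mirroring the analytical conclusion drawn from (12)–(15).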


III. CONSISTENCY ANALYSIS

In this section, we discuss how the choice of a class $\Gamma_M$ and of a sampling scheme for $X^n$ influences the performance of the method, in terms of how close we can get to the true function that solves the MLE problem. More specifically, we prove the consistency of the method, i.e., the property of asymptotically obtaining the best element inside the class as the number of sample points grows.

In DL [23], some deterministic algorithm is chosen to generate the sample of points $P_N = \{x^n_1, \ldots, x^n_N\}$, where $x^n_j \in X^n$ and $X^n$ is the $n$-fold Cartesian product space of $X$; since the behavior of the algorithm is fixed a priori, the obtained sequence is uniquely determined and is not the realization of some random variable. For this purpose, we can introduce a function $\Psi$, which acts as a deterministic input sequence selector. Then, the particular sequence can be written as $P_N = \Psi(N)$. Accordingly, let $x^n_j$ be the $j$th single point of the sequence.

Denote with $V_{HK}(f)$ the variation in the sense of Hardy and Krause [30] over $X^n$ of a function $f$, and with $D^*_N(P_N)$ the star discrepancy [27, p. 14] of the sample $P_N$. Definitions of variation in the sense of Hardy and Krause and of star discrepancy can be found in Appendix I.

For the purpose of proving the consistency of the method, we introduce the following hypotheses.

Assumption 3.1: The pdf $p(x, \theta)$ is Lipschitz in $\theta$ for every $x \in X$, i.e., there exists $\kappa > 0$ such that
$$|p(x, \theta') - p(x, \theta'')| \le \kappa\, \|\theta' - \theta''\| \quad \text{for all } \theta', \theta'' \in \Theta.$$

Assumption 3.2: The class of models $\Gamma_M$ satisfies the following two conditions.
1) For any $\gamma^o \in \Gamma^o$ and any $\varepsilon > 0$, the class $\Gamma_M$ contains an element whose distance from $\gamma^o$ is not larger than $\varepsilon$.
2) Every $\gamma \in \Gamma_M$ is Lipschitz in $x^n$, i.e., there exists $\kappa_\gamma > 0$ such that
$$\|\gamma(x'^n) - \gamma(x''^n)\| \le \kappa_\gamma\, \|x'^n - x''^n\|$$
for any $x'^n, x''^n \in X^n$.

Assumption 3.3: The sequence of points $P_N$ satisfies
$$\lim_{N \to \infty} D^*_N(P_N) = 0. \qquad (16)$$

Item 1) of Assumption 3.2 states that the class of models is endowed with the universal approximation property. Most of the times, this means that the complexity of $\Gamma_M$ is represented by some parameter $\nu$, and it is possible to approximate any suitably "regular" function by properly increasing $\nu$. This is the case, for instance, of most classes of neural networks, where $\nu$ is the number of hidden units (see, e.g., [10, p. 517]). Sometimes $\Gamma_M$ can be chosen so that $\gamma^o \in \Gamma_M$. This condition is generally referred to as the zero-error hypothesis. In this case, we obviously have that the distance in item 1) is equal to zero.

Define $\hat{\gamma}_N = \arg\max_{\gamma \in \Gamma_M} V_N(\gamma)$, i.e., the estimator obtained by the maximization of (2). Then, we can prove the following.

Theorem 3.1: Suppose Assumptions 3.1, 3.2, and 3.3 hold. Then, for any $\varepsilon > 0$ and any $\gamma^o \in \Gamma^o$, there exists $\bar{N}$ such that, for any $N \ge \bar{N}$, we have $V(\gamma^o) - V(\hat{\gamma}_N) \le \varepsilon$.


Furthermore, the rate of convergence of $V(\gamma^o) - V(\hat{\gamma}_N)$ is of the same order as that of the discrepancy $D^*_N(P_N)$ in (16).

An analysis of the proof (see Appendix II) reveals that, even when item 1) of Assumption 3.2 is not verified, the method still provides convergence to the "best" element inside the class $\Gamma_M$, provided the involved functions are sufficiently well behaved and the discrepancy of the sample of points used to discretize the state space converges to zero as its size grows.

As to the discrepancy of the sample $P_N$, we note that the rate of convergence of $D^*_N(P_N)$ controls that of the error. Therefore, it is essential to choose sampling schemes with favorable discrepancy rates. To this purpose, a simple choice amounts to employing random sampling based on i.i.d. sequences drawn with uniform distribution. It is possible to prove [28, p. 19] that, with such a choice, the rate of convergence of the discrepancy is about $O(N^{-1/2})$.

A better choice for $P_N$ is given by the use of the so-called low-discrepancy sequences [27, p. 23], which are a family of deterministic sequences that attain an almost linear rate of convergence for the discrepancy. Examples of low-discrepancy sequences are the Niederreiter sequence, the Halton sequence, the Hammersley sequence, the Sobol' sequence, and the Faure sequence [27], [28]. In [27, p. 47], a common general framework is presented for the construction of low-discrepancy sequences. In particular, $(t, s)$-sequences that generalize many of the aforementioned techniques are discussed, together with the most relevant theoretical properties. It can be shown that it is possible to construct $(t, s)$-sequences that satisfy
$$D^*_N(P_N) = O\!\left(\frac{(\log N)^n}{N}\right).$$
With such a choice, the rate of convergence of the error is given by $O\!\left(\frac{(\log N)^n}{N}\right)$.

IV. EXPERIMENTAL RESULTS

In this section, we present some numerical examples to evaluate the performance of the method and discuss some implementation issues. In particular, we show the application of the algorithm to the problem of estimating the unknown parameters of a log-normal random variable and of a Rayleigh random variable. In both examples, $\Gamma_M$ has been chosen as the class of one-hidden-layer feedforward neural networks with sigmoidal activation function, i.e., mappings of the form
$$\gamma(x^n, w) = \sum_{j=1}^{\nu} c_j\, \sigma\big(a_j^{\top} x^n + b_j\big) \qquad (17)$$
where $\sigma$ is the chosen activation function, and $c_j \in \mathbb{R}$, $a_j \in \mathbb{R}^n$, $b_j \in \mathbb{R}$, for $j = 1, \ldots, \nu$. In this case, $w$ collects all the parameters $c_j$, $a_j$, and $b_j$. The outputs of the obtained estimators are compared with the exact ML estimates, which can be determined analytically in these illustrative cases.

1) Log-Normal Random Variable: We recall that the log-normal density is the pdf of a random variable whose logarithm is normally distributed. It has the following general form:
$$p(x, m, \sigma) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\ln x - \ln m)^2}{2\sigma^2}\right) \qquad (18)$$
for $x > 0$, where $m$ is the median of the variable and $\sigma$ is the standard deviation of its logarithm. The correct ML estimator for the median is expressed by
$$\hat{m} = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right). \qquad (19)$$
We want to retrieve an approximate value for $\hat{m}$ by using the approach proposed in this paper, where the neural networks described by (17), with a fixed number of hidden units, form the class of admissible functions. Consider vectors of 8 elements as inputs for the neural network. The optimal weights have been determined by optimizing the empirical criterion (2) on a sufficiently rich set of 8-dimensional points. For our simulations, we considered a discretization based on a low-discrepancy sequence (specifically, a Sobol' sequence; see [31]), scaled to fit the 8-dimensional hypercube. In particular, four different sizes of the sequence have been considered, each time adding 1000 points to the previous sequence, in order to show the improvement of the results as $N$ grows. The minimization is computed using the standard fminunc routine of the Matlab Optimization Toolbox.

The results have been evaluated by generating 100 8-dimensional log-normal distributed samples with standard deviation $\sigma$, for different values of the median, and by comparing the ML solution given by (19) with the output of the neural network for each sample. Fig. 1 shows boxplots of the absolute error (AE) obtained by repeating the test for different values of $N$. It is worth noting that the performance of the neural estimator improves as the number $N$ of discretization points increases, in accordance with the theory presented in this paper, whereas the absolute error becomes almost independent of the value of the median.

2) Rayleigh Random Variable: This problem concerns a Rayleigh pdf with unknown parameter $b$
$$p(x, b) = \frac{x}{b^2}\exp\!\left(-\frac{x^2}{2b^2}\right), \quad x \ge 0. \qquad (20)$$
The ML solution to this problem is analytically found to be
$$\hat{b} = \sqrt{\frac{1}{2n}\sum_{i=1}^{n} x_i^2}. \qquad (21)$$
In this case, the approximate solution has been searched for by using a neural network as in (17). Vectors of size $n = 10$ have been used as inputs. In order to optimize the empirical criterion (2), we have employed again discretizations based on points coming from a Sobol' sequence, suitably scaled to fit the input space, corresponding to a 10-dimensional hypercube.
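The overall procedure for this example can be sketched in a few lines (ours, not the authors' implementation): the network size, sample-space bounds, optimizer, and test scale are assumptions, a softplus output keeps the estimated parameter positive, and a generic quasi-Newton routine plays the role of Matlab's fminunc.

```python
# Deterministic-learning sketch for the Rayleigh example: train a one-hidden-layer
# sigmoidal network (17) by maximizing the empirical log-likelihood criterion over a
# Sobol' discretization, then compare with the exact ML estimator (21).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import qmc

n, nu, N = 10, 20, 2048          # input size, hidden units, discretization points (assumed)
lo, hi = 0.05, 3.0               # assumed bounds of the compact sample space

X = lo + (hi - lo) * qmc.Sobol(d=n, scramble=False).random(N)   # deterministic sample, (N, n)

def unpack(w):
    A = w[:nu * n].reshape(nu, n)                # input-to-hidden weights
    b = w[nu * n:nu * n + nu]                    # hidden biases
    c = w[nu * n + nu:nu * n + 2 * nu]           # hidden-to-output weights
    return A, b, c, w[-1]                        # last entry: output bias

def network(w, inputs):
    """Sigmoidal network (17) with a softplus output so the estimated parameter stays positive."""
    A, b, c, c0 = unpack(w)
    h = expit(inputs @ A.T + b)
    return np.logaddexp(0.0, h @ c + c0) + 1e-3

def neg_empirical_loglik(w):
    """Negative empirical criterion: average Rayleigh log-likelihood over the Sobol' points."""
    bhat = network(w, X)[:, None]
    ll = np.log(X) - 2.0 * np.log(bhat) - X**2 / (2.0 * bhat**2)
    return -ll.sum(axis=1).mean()

rng = np.random.default_rng(0)
w0 = 0.1 * rng.standard_normal(nu * n + 2 * nu + 1)
res = minimize(neg_empirical_loglik, w0, method="L-BFGS-B")

x_test = rng.rayleigh(scale=0.8, size=(1, n))               # one test sample inside the range
b_ml = np.sqrt((x_test**2).sum() / (2 * n))                 # exact ML estimate (21)
print(f"exact MLE {b_ml:.3f}   network {network(res.x, x_test)[0]:.3f}")
```

The two printed values should be of comparable magnitude for test samples whose entries fall inside the training range; as in the paper's experiments, accuracy improves as $N$ grows.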


Fig. 2 describes simulation results obtained by applying the proposed algorithm to 100 10-sized Rayleigh distributed samples with different parameters $b$, and by considering the absolute deviation from the standard ML estimate given by (21). The procedure for the evaluation of the results is the same as previously described for the log-normal case. Again, we can see that the solution gets better as $N$ increases. However, it would seem that no dramatic improvement is to be expected for larger values of $N$; a moderate size already seems a good choice in terms of accuracy.

A. Comments

The tests show how, both in the case of the log-normal and of the Rayleigh density, the method based on DL leads to a very close approximation of the true ML estimator for different values of the unknown parameters, with the advantage of performing all the relevant computations offline.

Regarding the actual implementation, we note that the method shares many of the issues typical of neural networks. For instance, overtraining may happen when the structure of the network is too "rich" with respect to the available number of samples used to optimize its parameters. In order to detect overtraining for a given size $N$ of discretization points and a given dimension $n$ of the input vector, we can increase the number of neural units until the performance of the estimation starts to deteriorate. As an example, we have considered the case of the Rayleigh distribution, with fixed values of $N$ and $n$, and performed simulations progressively increasing the number $\nu$ of neural units. The performance of the estimator has been evaluated through a mean absolute error (MAE), computed as the average deviation of the network output from the true ML estimator (21) over 100 random samples extracted from the Rayleigh distribution, each with a random parameter $b$ in a fixed range. The results are depicted in Fig. 3. In this case, we notice that the "optimal" size is reached at about $\nu = 50$, and that the network starts to overfit for values of $\nu$ greater than 50.

Another parameter that plays an important role in the accuracy of the estimation is the dimension $n$ of the input vector (i.e., the number of observations we use for the ML estimation), since an ML estimator is expected to improve as the number of observations grows. Furthermore, we cannot employ the same network optimized with $n$-dimensional samples for estimating the density function when a longer sample is available, as the dimension of the input to the network needs to be fixed. Yet, we cannot use an arbitrarily large $n$, because the complexity of the network has to grow with $n$ in order to keep the same level of performance, in terms of condition 1) in Assumption 3.2, and this eventually leads to an unfeasible computational burden for the optimization of the empirical estimate (2) in very high-dimensional settings. In fact, as is widely observed, the presence of many local minima and plateaus in the behavior of (2), viewed as a function of the network parameters, makes it difficult to attain values close to the global optimum. In most cases, several searches, starting from different initial points, have to be performed to retrieve a satisfying collection of parameters. Thus, a tradeoff between the desired accuracy of the MLE and the computational requirements of the method (also considering the fact that obtaining observations in real-world applications may not be straightforward or costless) must be considered.
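The evaluation protocol just described can be sketched as follows (our illustration, not the authors' code); a simple method-of-moments estimator stands in for the trained network, and the parameter range is an assumption.

```python
# Sketch of the MAE protocol: average absolute deviation from the exact ML estimator (21)
# over 100 random Rayleigh samples, each generated with a random parameter b.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 10, 100
errors = []
for _ in range(trials):
    b = rng.uniform(0.5, 2.0)                      # assumed range for the true parameter
    x = rng.rayleigh(scale=b, size=n)
    b_ml = np.sqrt((x**2).sum() / (2 * n))         # exact ML estimator (21)
    b_candidate = x.mean() / np.sqrt(np.pi / 2)    # stand-in for the network output
    errors.append(abs(b_candidate - b_ml))
print("MAE with respect to the exact MLE:", np.mean(errors))
```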


Nevertheless, it has been proved that the complexity of many commonly employed nonlinear approximators (e.g., the number of basis functions in parameterized architectures such as feedforward neural networks and radial basis function networks) need not grow exponentially with $n$ if the function we need to approximate satisfies suitable regularity conditions (see, e.g., [32] and the references therein), thus avoiding the so-called curse of dimensionality and eventually making the tradeoff less dramatic.

In practice, we can evaluate how fast the complexity of the network must grow with $n$ by considering the "optimal" number $\nu$ of neural units for different values of $n$ while keeping $N$ fixed, i.e., the number of neural units after which the performance deteriorates as previously described. As an example, we have considered again the Rayleigh distribution and performed a series of simulations to compare both the MAE (this time computed as the average deviation of the network output from the true parameter over 100 random samples extracted from the Rayleigh distribution, each with a random parameter in a fixed range) and the "optimal" number of neural units as $n$ increases. Then, the same test has been performed for the log-normal distribution. The results are shown in Figs. 4 and 5. In the plots on the right, the error given by the true ML estimator (21) is reported as well for reference. We can see that the growth of $\nu$ with $n$ is almost linear in this example, which suggests that the algorithm can be used also for higher dimensional settings.

The dimension of the input vector also affects sampling. In fact, it is clear that the number of points necessary to cover the sample space well needs to grow as $n$ grows, leading also in this case to a tradeoff between accuracy and computational burden. Yet, it was proved in Section III that, for a given dimension $n$, the number of samples needed to estimate the best element inside our class of approximators grows almost linearly when low-discrepancy sequences are employed. This theoretically makes low-discrepancy sequences a very good choice in high-dimensional settings, especially with respect to full-factorial grids (where the number of points grows exponentially with $n$) and also with respect to i.i.d. uniform random sequences, where $N$ typically needs to grow quadratically (see, e.g., [28, p. 19]). It is also important to remark that the same points of a "good" low-discrepancy sequence can be employed for different densities (either with the same domain or with a transformation). Furthermore, we can increase the size of a sequence just by adding more points, without having to recompute all the previous ones. In any case, it is worth noting that the simulations show how very close approximations of the true MLE can be obtained with small and computationally manageable values of $n$, $N$, and $\nu$.

V. CONCLUSION AND FURTHER DEVELOPMENTS

The DL framework has been successfully applied to perform parametric estimation of pdfs through the ML approach. In particular, a neural network having a number of inputs equal to the size of the sample adopted for the estimation is properly trained to maximize the likelihood function. The output of the network is the desired value of the parameter for the unknown probability density, i.e., the neural network provides an ML estimator for the considered family of densities.


Fig. 1. Estimated MAE for the simulations.

The employment of low-discrepancy sequences for the construction of the training set to be used in the learning phase allows us to prove some theoretical results ensuring the consistency of the considered approach. We have also provided extensive simulations concerning the parameter estimation of two significant pdfs, showing that the proposed method allows us to approximate the general form of the ML estimator with a good level of accuracy.

Current work is devoted to tailoring the technique to treat more complex probabilistic frameworks. As an example, we present some guidelines and preliminary results for the estimation of parameters of mixtures of densities, i.e., density functions corresponding to convex combinations of "basis" densities, having the form
$$f(x, \theta) = \sum_{k=1}^{K} \alpha_k\, p_k(x)$$

where $\alpha_k \ge 0$, $\sum_{k=1}^{K} \alpha_k = 1$, and every $p_k$ is a density function. The parameters to be estimated are the real coefficients $\alpha_k$ and possibly the unknown parameters of the functions $p_k$. For the sake of our example, consider the estimation of the parameters of the following mixture of Gaussians:
$$f(x) = \alpha_1\, p_1(x) + \alpha_2\, p_2(x)$$

where the functions $p_1$ and $p_2$ are normal densities with means $\mu_1$ and $\mu_2$, respectively, and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Assume that the variances are known. Then, we want to estimate the mixture coefficients and the means $\mu_1$ and $\mu_2$.

In order to apply the method presented in this work to such a problem, we have to make sure that the constraint $\sum_{k=1}^{K} \alpha_k = 1$ is satisfied. To this purpose, we can design the class of neural networks in such a way that the outputs corresponding to the coefficients are automatically scaled in order for their sum to be 1. Specifically, if $o_k$ is the component of the network output corresponding to $\alpha_k$, we can define the normalized output $\alpha_k = o_k / \sum_{j=1}^{K} o_j$ and use this modified model in the optimization of the empirical likelihood. Operatively, it is convenient to estimate only $K - 1$ coefficients and determine the $K$th one by imposing $\sum_{k=1}^{K} \alpha_k = 1$ in the maximization of the likelihood. In particular, when $K = 2$, it is convenient to estimate $\alpha_1$ using a network with a sigmoidal output function, so that the output is automatically between 0 and 1, and then take $\alpha_2 = 1 - \alpha_1$ in the computation of the likelihood.

Concerning the example defined above, we have used a model composed of three neural networks, one for each parameter to be estimated (i.e., the mixture coefficient, $\mu_1$, and $\mu_2$, respectively), instead of using a single network with three outputs, which would have been more complex. Each neural network has ten neural units. The sample size has been taken equal to ten, and the discretization of the 10-dimensional sample space was made using points in a fixed range. In order to evaluate the performance of the model obtained, we have generated 100 10-dimensional samples from a mixture of Gaussians with fixed values of the parameters, and compared the output of the networks with an MLE obtained by pointwise maximizing the likelihood function with respect to the parameters for each sample, through nonlinear programming.
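The two normalizations described above can be written down directly; the following lines (ours, not part of the paper) are a minimal sketch.

```python
# Mapping raw network outputs to valid mixture coefficients alpha_k.
import numpy as np

def normalize_by_sum(outputs):
    """Scale K nonnegative outputs (e.g., from sigmoidal output units) so they sum to 1."""
    o = np.asarray(outputs, dtype=float)
    return o / o.sum()

def sigmoid_pair(raw_output):
    """K = 2 case: one sigmoidal output gives alpha_1 in (0, 1); alpha_2 = 1 - alpha_1."""
    alpha = 1.0 / (1.0 + np.exp(-raw_output))
    return np.array([alpha, 1.0 - alpha])

print(normalize_by_sum([2.0, 1.0, 1.0]))   # -> [0.5, 0.25, 0.25]
print(sigmoid_pair(0.0))                   # -> [0.5, 0.5]
```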


Fig. 2. Estimated MAE for the simulations.

Fig. 3. Overtraining detection.

Fig. 6 shows boxplots of the absolute error obtained for the 100 samples. It can be noticed that the approximation of the ML estimator of all the parameters is accurate, corresponding to an average error of about 2%.
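For reference, the per-sample MLE used in this comparison can be obtained along the following lines (our sketch, not the authors' code); the component standard deviations and the sample generator are assumptions.

```python
# Pointwise reference MLE for one observed sample of a two-component Gaussian mixture,
# obtained by nonlinear programming on the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_lik(params, x, s1=1.0, s2=1.0):
    """Negative log-likelihood of the mixture with known standard deviations s1, s2."""
    a, m1, m2 = params
    a = np.clip(a, 1e-6, 1 - 1e-6)            # keep the mixture weight strictly inside (0, 1)
    mix = a * norm.pdf(x, m1, s1) + (1 - a) * norm.pdf(x, m2, s2)
    return -np.log(mix).sum()

rng = np.random.default_rng(1)
labels = rng.uniform(size=10) < 0.3
x = np.where(labels, rng.normal(-1.0, 1.0, 10), rng.normal(2.0, 1.0, 10))

res = minimize(neg_log_lik, x0=[0.5, -0.5, 1.5], args=(x,),
               bounds=[(0.0, 1.0), (-5.0, 5.0), (-5.0, 5.0)])
print(res.x)    # estimated (alpha_1, mu_1, mu_2) for this particular 10-point sample
```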


Fig. 4. Tradeoff between the input sample size n and the optimal number ν of neural units for the Rayleigh distribution.

The good performances obtained in this example suggest that the proposed method is still effective for complex structures such as mixtures of densities. Future work in this case will be aimed at coping efficiently with the increased offline computational burden due to the "multiplication" of parameters to be estimated, with respect to the case of single distributions.

APPENDIX I
DISCREPANCY, VARIATION, AND LOW-DISCREPANCY SEQUENCES

Consider $X^n = [0, 1]^n$ and a sample $P_N = \{x^n_1, \ldots, x^n_N\} \subset X^n$. If $c_B$ is the characteristic function of a subset $B \subseteq X^n$ (i.e., $c_B(x^n) = 1$ if $x^n \in B$ and $c_B(x^n) = 0$ otherwise), define
$$A(B; P_N) = \sum_{j=1}^{N} c_B(x^n_j)$$
which indicates the number of points of $P_N$ belonging to $B$.

Fig. 5. Tradeoff between the input sample size n and the optimal number ν of neural units for the log-normal distribution.

Definition I.1: If $\mathcal{B}$ is the family of all the closed subintervals of $X^n$ of the form $\prod_{i=1}^{n}[a_i, b_i]$, the discrepancy $D_N(P_N)$ is defined as
$$D_N(P_N) = \sup_{B \in \mathcal{B}} \left| \frac{A(B; P_N)}{N} - \mu(B) \right| \qquad (22)$$
where $\mu(B)$ is the Lebesgue measure of $B$.

Definition I.2: The star discrepancy $D^*_N(P_N)$ of the sequence $P_N$ is defined by (22), where $\mathcal{B}$ is the family of all the closed subintervals of $X^n$ of the form $\prod_{i=1}^{n}[0, b_i]$.

A classic result [33] states that the following properties are equivalent.
1) $P_N$ is uniformly distributed in $X^n$, i.e., $\lim_{N \to \infty} A(B; P_N)/N = \mu(B)$ for all the subintervals $B$ of $X^n$.
2) $\lim_{N \to \infty} D_N(P_N) = 0$.
3) $\lim_{N \to \infty} D^*_N(P_N) = 0$.
The smaller the discrepancy or the star discrepancy is, the more uniformly distributed is the sequence of points in the space.
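In dimension 1 the star discrepancy can be computed exactly, which makes Definition I.2 easy to check numerically; the following sketch (ours, not from the paper) uses the standard closed form for sorted points.

```python
# Exact star discrepancy in dimension 1: D*_N = max_i max(i/N - x_(i), x_(i) - (i-1)/N).
import numpy as np

def star_discrepancy_1d(points):
    x = np.sort(np.asarray(points, dtype=float))
    N = len(x)
    i = np.arange(1, N + 1)
    return max(np.max(i / N - x), np.max(x - (i - 1) / N))

print(star_discrepancy_1d((np.arange(8) + 0.5) / 8))                    # evenly spread: 1/(2N)
print(star_discrepancy_1d(np.random.default_rng(0).uniform(size=8)))    # typically larger
```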


Fig. 6. Estimated MAE for the simulations.

For each vertex of a given subinterval $J = \prod_{i=1}^{n}[a_i, b_i]$ of $X^n$, we can define a binary label by assigning "0" to every coordinate equal to $a_i$ and "1" to every coordinate equal to $b_i$. For every function $f$, we define $\Delta(f; J)$ as the alternating sum of $f$ computed at the vertices of $J$, i.e.,
$$\Delta(f; J) = \sum_{v \in E(J)} f(v) - \sum_{v \in O(J)} f(v)$$
where $E(J)$ is the set of vertices with an even number of "1"s in their label, and $O(J)$ is the set of vertices with an odd number of "1"s.

Definition I.3: The variation of $f$ on $X^n$ in the sense of Vitali is defined by [27, p. 19]
$$V^{(n)}(f) = \sup_{\mathcal{P}} \sum_{J \in \mathcal{P}} |\Delta(f; J)| \qquad (23)$$
where $\mathcal{P}$ is any partition of $X^n$ into subintervals. If the partial derivatives of $f$ are continuous on $X^n$, it is possible to write $V^{(n)}(f)$ in an easier way as [27, p. 19]
$$V^{(n)}(f) = \int_{0}^{1}\!\!\cdots\!\int_{0}^{1} \left|\frac{\partial^n f}{\partial u_1 \cdots \partial u_n}\right| du_1 \cdots du_n \qquad (24)$$
where $u_i$ is the $i$th component of $u$.

For $1 \le k \le n$ and $1 \le i_1 < i_2 < \cdots < i_k \le n$, let $V^{(k)}(f; i_1, \ldots, i_k)$ be the variation in the sense of Vitali of the restriction of $f$ to the $k$-dimensional face $\{(u_1, \ldots, u_n) \in X^n : u_j = 1 \text{ for } j \ne i_1, \ldots, i_k\}$.

Definition I.4: The variation of $f$ on $X^n$ in the sense of Hardy and Krause is defined by [27, p. 19]
$$V_{HK}(f) = \sum_{k=1}^{n} \sum_{1 \le i_1 < \cdots < i_k \le n} V^{(k)}(f; i_1, \ldots, i_k). \qquad (25)$$
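As a quick numerical sanity check of formula (24) (ours, not part of the paper), the variation in the sense of Vitali of a smooth function of two variables can be approximated by integrating the absolute mixed partial derivative on a grid.

```python
# Midpoint-rule approximation of (24) in dimension n = 2.
import numpy as np

def vitali_variation_2d(mixed_partial, grid=400):
    """Approximate the integral of |d^2 f / (du1 du2)| over [0, 1]^2."""
    t = (np.arange(grid) + 0.5) / grid
    U1, U2 = np.meshgrid(t, t, indexing="ij")
    return np.abs(mixed_partial(U1, U2)).mean()      # mean over cells equals the integral

# f(u1, u2) = sin(pi*u1) * sin(pi*u2)  ->  mixed partial = pi^2 cos(pi*u1) cos(pi*u2)
approx = vitali_variation_2d(lambda u1, u2: np.pi**2 * np.cos(np.pi * u1) * np.cos(np.pi * u2))
print(approx)    # should be close to the exact value pi^2 * (2/pi)^2 = 4
```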

A. Low-Discrepancy Sequences

Low-discrepancy sequences are special families of deterministic sequences that provide deterministic generation of uniformly scattered points, aiming at keeping the discrepancy of the resulting sets (taken as finite portions of the sequences) small. Intuitively, we can obtain sampling points that are well uniformly spread over $X^n$ by subdividing the set into "basic" subsets, and then making sure that every basic subset contains a number of points of the sequence that is somehow proportional to the volume of the subset itself. Then, define an elementary interval in base $b$ over $X^n$ as an $n$-dimensional subinterval of the form
$$E = \prod_{i=1}^{n}\left[\frac{c_i}{b^{d_i}}, \frac{c_i + 1}{b^{d_i}}\right)$$

for integers $d_i \ge 0$ and $0 \le c_i < b^{d_i}$. A $(t, m, n)$-net in base $b$ on $X^n$ is a set of $b^m$ points such that every elementary interval in base $b$ of volume $b^{t-m}$ contains exactly $b^t$ points. Accordingly, a $(t, n)$-sequence in base $b$ is an infinite sequence of points such that, for all the integers $k \ge 0$ and $m > t$, the finite sequence $x_{kb^m}, \ldots, x_{(k+1)b^m - 1}$ is a $(t, m, n)$-net in base $b$. The key feature of such sequences is that it is possible to construct $(t, n)$-sequences leading to samples that satisfy deterministically [27, p. 95]
$$D^*_N(P_N) = O\!\left(\frac{(\log N)^n}{N}\right).$$

Therefore, the discrepancy of samples obtained through such $(t, n)$-sequences converges at an almost linear rate in the number of points.


For a comparison, it can be proved [34] that the discrepancy of a sample of $N$ points i.i.d. with a uniform distribution on $[0, 1]^n$, typical of Monte Carlo methods, is
$$D^*_N(P_N) = O\!\left(\sqrt{\frac{\log\log N}{N}}\right)$$
with probability one, which corresponds to a quadratic rate.

There are many ways of constructing $(t, n)$-sequences, all leading to different sampling schemes. In general, the procedure is not straightforward and varies from method to method. As an example, we briefly describe the construction of a Halton sequence [28, p. 27]. For any natural number $k$, there is a unique representation in base $b$ of the form
$$k = \sum_{j=0}^{m} \delta_j b^j$$
where $b$ is a natural number and $0 \le \delta_j < b$. Then, we can define the radical inverse of $k$ with base $b$ as
$$\varphi_b(k) = \sum_{j=0}^{m} \delta_j\, b^{-j-1}.$$
Consider $n$ distinct prime numbers $b_1, \ldots, b_n$. Then, the $k$th point of the Halton sequence is defined as
$$x_k = \big(\varphi_{b_1}(k), \varphi_{b_2}(k), \ldots, \varphi_{b_n}(k)\big).$$
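A quick way to see the practical effect of such constructions (our illustration, using SciPy's quasi-Monte Carlo module, not part of the paper) is to compare deterministic Halton points with i.i.d. uniform points through the L2-star discrepancy, a computable proxy for $D^*_N$.

```python
# Halton points versus i.i.d. uniform points in dimension 10.
import numpy as np
from scipy.stats import qmc

n = 10
rng = np.random.default_rng(0)
for N in (256, 1024, 4096):
    halton = qmc.Halton(d=n, scramble=False).random(N)   # first N Halton points
    iid = rng.uniform(size=(N, n))                       # Monte Carlo sample
    d_h = qmc.discrepancy(halton, method="L2-star")
    d_i = qmc.discrepancy(iid, method="L2-star")
    print(f"N={N:5d}   Halton {d_h:.2e}   i.i.d. uniform {d_i:.2e}")
# The Halton values should decay faster than the i.i.d. ones as N grows.
```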

Other methods to obtain $(t, n)$-sequences, together with their properties, can be found, for instance, in [27] and [28]. Furthermore, many software tools and libraries are available today on the web for the generation of low-discrepancy sequences.

APPENDIX II
PROOFS

Proof of Theorem 2.1: Suppose there exists $\gamma^o \in \Gamma^o$ such that, for all $\theta^o \in \Theta^o$, a set $B \subseteq X^n$ with $\mu(B) > 0$ can be found where $L(x^n, \gamma^o(x^n)) < L(x^n, \theta^o(x^n))$ for all $x^n \in B$. Then, we have $V(\gamma^o) < V(\theta^o)$ for all $\theta^o \in \Theta^o$, and the definition of $\gamma^o$ as a maximizer of the integral is not verified.

Proof of Theorem 2.2: 1) $\theta^o$ is obviously a maximizer of the likelihood for all $x^n$, which implies $\Theta^o \subseteq \Gamma^o$. 2) Suppose $\gamma^o$ is a maximizer of the integral, but $\gamma^o \notin \Theta^o$. Then, for all $\theta^o \in \Theta^o$, there exists $\bar{x}^n$ such that $L(\bar{x}^n, \gamma^o(\bar{x}^n)) < L(\bar{x}^n, \theta^o(\bar{x}^n))$. Assumption 2.1 implies that there is $B \subseteq X^n$ with $\mu(B) > 0$ such that $L(x^n, \gamma^o(x^n)) < L(x^n, \theta^o(x^n))$ for all $x^n \in B$. Given that $\mu(B) > 0$, this implies $V(\gamma^o) < V(\theta^o)$, which contradicts the hypothesis that $\gamma^o$ is a maximizer. Thus, $\Gamma^o \subseteq \Theta^o$.

Proof of Theorem 3.1: First, we notice that Assumption 3.1 implies that also $L$ is Lipschitz, i.e., there exists $\kappa_L > 0$ such that
$$|L(x^n, \theta') - L(x^n, \theta'')| \le \kappa_L\, \|\theta' - \theta''\|$$
for any random sample $x^n \in X^n$. Next, for any $\gamma \in \Gamma_M$, we can write
$$V(\gamma^o) - V(\hat{\gamma}_N) = \big[V(\gamma^o) - V(\gamma)\big] + \big[V(\gamma) - V_N(\gamma)\big] + \big[V_N(\gamma) - V_N(\hat{\gamma}_N)\big] + \big[V_N(\hat{\gamma}_N) - V(\hat{\gamma}_N)\big]$$
where $V$ is the integral criterion and $V_N$ is its empirical estimate. Denote with $\varepsilon$ the term of Assumption 3.2, and define $\bar{\gamma}$ as the function in $\Gamma_M$ that attains the minimum of the distance from $\gamma^o$ (possibly not unique). If Assumption 3.2 holds, we can choose $\gamma = \bar{\gamma}$, obtaining
(26)
The value of the error term depends both on the class $\Gamma_M$ and on the discretization of the sample space.

Now, item 2) of Assumption 3.2 implies that the family $\Gamma_M$ is such that each $L(\cdot\,, \gamma(\cdot))$ has bounded variation in the sense of Hardy and Krause over $X^n$. This allows us to prove, along the lines of [23], that
$$\sup_{\gamma \in \Gamma_M} |V(\gamma) - V_N(\gamma)| \to 0 \quad \text{as } N \to \infty.$$
In particular, following the proof of [23, Th. 4], based on the Koksma–Hlawka inequality [30], we can prove that
$$|V(\gamma) - V_N(\gamma)| \le V_{HK}\big(L(\cdot\,, \gamma(\cdot))\big)\, D^*_N(P_N)$$
with a rate of convergence that is equal to that of (16). Therefore, we can choose $\bar{N}$ such that, for all $N \ge \bar{N}$, both
(27)
and
(28)
hold. Thus, combining (27) and (28) with the fact that, by definition of $\hat{\gamma}_N$, $V_N(\hat{\gamma}_N) \ge V_N(\bar{\gamma})$, we can write

From this last result and (26), the assertion of the theorem follows. ACKNOWLEDGMENT The authors would like to thank the anonymous reviewers for their valuable comments that improved this paper. REFERENCES [1] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London, U.K.: Chapman & Hall, 1986. [2] J. S. Simonoff, Smoothing Methods in Statistics. Berlin, Germany: Springer-Verlag, 1996. [3] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation. New York: Springer-Verlag, 2001. [4] P. P. B. Eggermont and V. N. LaRiccia, Maximum Penalized Likelihood Estimation: Volume I: Density Estimation. Berlin, Germany: Springer-Verlag, 2001. [5] D. W. Scott and S. R. Sain, “Multivariate density estimation,” in Data Mining and Data Visualization: Handbook of Statistics, C. R. Rao, E. J. Wegman, and J. L. Solka, Eds. Amsterdam, The Netherlands: Elsevier, 2005, vol. 24. [6] J. Fan and Q. Yao, Nonlinear Time Series: Nonparametric and Parametric Methods. Berlin, Germany: Springer-Verlag, 2005. [7] C. Wang and W. Wang, “Links between PPCA and subspace methods for complete Gaussian density estimation,” IEEE Trans. Neural Netw., vol. 17, no. 3, pp. 789–792, May 2006. [8] C. Costantinopoulos and A. Likas, “Unsupervised learning of Gaussian mixtures based on variational component splitting,” IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 745–755, May 2007. [9] X. Hong, S. Chen, and C. J. Harris, “A forward-constrained regression algorithm for Sparse Kernel density estimation,” IEEE Trans. Neural Netw., vol. 19, no. 1, pp. 193–198, Jan. 2008. [10] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1997. [11] C. Kooperberg and C. J. Stone, “A study of logspline density estimation,” Comput. Statist. Data Anal., vol. 12, pp. 327–347, 1991. [12] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard, “Density estimation by wavelet thresholding,” Ann. Statist., vol. 24, pp. 508–539, 1996. [13] N. L. Hjort and M. C. Jones, “Locally parametric nonparametric density estimation,” Ann. Statist., vol. 24, pp. 1619–1647, 1996. [14] B. U. Park, W. C. Kim, and M. C. Jones, “On local likelihood density estimation,” Ann. Statist., vol. 30, pp. 1480–1495, 2001. [15] C. Fraley and A. Raftery, “Model-based clustering, discriminant analysis, and density estimation,” J. Amer. Statist. Assoc., vol. 97, pp. 611–631, 2002. [16] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Second ed. New York: Wiley, 2001. [17] D. W. Scott, “Multivariate density estimation and visualization,” in Handbook of Computational Statistics: Concepts and Methods, J. E. Gentle, W. Händle, and Y. Mori, Eds. Berlin, Germany: SpringerVerlag, 2004, pp. 517–538. [18] J.-X. Wu and C. Chan, “Maximum likelihood density estimation by means of an artificial neural network,” J. Inf. Sci. Eng., vol. 7, pp. 153–172, 1991. [19] D. S. Modha and Y. Fainman, “A learning law for density estimation,” IEEE Trans. Neural Netw., vol. 5, no. 3, pp. 519–523, May 1994. [20] C. M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford Univ. Press, 1995.


[21] A. Likas, “Probability density estimation using artificial neural networks,” Comput. Phys. Commun., vol. 135, pp. 167–175, 2001. [22] J. Weston, A. Gammerman, M. O. Stitson, V. Vapnik, V. Vovk, and C. Watkins, “Support vector density estimation,” in Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999, pp. 293–305. [23] C. Cervellera and M. Muselli, “Deterministic design for neural network learning: An approach based on discrepancy,” IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 533–544, May 2004. [24] C. Cervellera and M. Muselli, “Deterministic learning and an application in optimal control,” in Advances in Imaging and Electron Physics, P. W. Hawkes, Ed. New York: Elsevier, 2006, vol. 140, pp. 61–118. [25] V. N. Vapnik, Statistical Learning Theory. New York: Wiley, 1995. [26] M. Vidyasagar, A Theory of Learning and Generalization. London, U.K.: Springer-Verlag, 1997. [27] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia, PA: SIAM, 1992. [28] K.-T. Fang and Y. Wang, Number-Theoretic Methods in Statistics. London, U.K.: Chapman & Hall, 1994. [29] A. Alessandri, C. Cervellera, and M. Sanguineti, “Design of asymptotic estimators: An approach based on neural networks and nonlinear programming,” IEEE Trans. Neural Netw., vol. 18, no. 1, pp. 86–96, Jan. 2007. [30] E. Hlawka, “Funktionen von Beschränkter variation in der theorie der Gleichverteilung,” Ann. Mat. Pura Appl., vol. 54, pp. 325–333, 1961. [31] I. M. Sobol’, “The distribution of points in a cube and the approximate evaluation of integrals,” Zh. Vychisl. Mat. i Mat. Fiz., vol. 7, pp. 784–802, 1967. [32] R. Zoppoli, M. Sanguineti, and T. Parisini, “Approximating networks and extended Ritz method for the solution of functional optimization problems,” J. Optim. Theory Appl., vol. 112, pp. 403–439, 2002. [33] L. Kuipers and H. Niederreiter, Uniform Distribution of Sequences. New York: Wiley, 1974. [34] K. L. Chung, “An estimate concerning the Kolmogoroff limit distribution,” Trans. Amer. Math. Soc., vol. 67, pp. 36–50, 1949. Cristiano Cervellera received the M.Sc. degree in electronic engineering and the Ph.D. degree in electronic engineering and computer science from the University of Genoa, Italy, in 1998 and 2002, respectively. Since 2002, he has been a Researcher at the Genoa branch of the Institute of Intelligent Systems for Automation of the Italian National Research Council. His research interests include neural networks, machine learning, optimal control, and number-theoretic methods for optimization and approximation.

Danilo Macciò received the M.Sc. degree in telecommunication engineering from the University of Genoa, Italy, in 2005. Currently, he is working towards the Ph.D. degree at the Genoa branch of the Institute of Intelligent Systems for Automation of the Italian National Research Council. His research interests include neural networks, nonlinear observers, and numeric solutions of functional optimization problems.

Marco Muselli (M’97) received the M.Sc. degree in electronic engineering from the University of Genoa, Italy, in 1985. In 1988, he joined the Institute of Electronic Circuits, now Genoa branch of the Istituto di Elettronica e di Ingegneria dell’Informazione e delle Telecomunicazioni, of the Italian National Research Council, where he is currently Principal Scientist. His research interests include neural networks, machine learning, global optimization, mathematical statistics, and probability theory. His activity is mainly centered on the development of new efficient training methods for connectionist systems and on the theoretical analysis of their convergence properties.
