IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 60, NO. 5, MAY 2012


Exact Sample Conditioned MSE Performance of the Bayesian MMSE Estimator for Classification Error—Part I: Representation

Lori A. Dalton, Student Member, IEEE, and Edward R. Dougherty, Fellow, IEEE

Abstract—In recent years, biomedicine has been faced with difficult high-throughput small-sample classification problems. In such settings, classifier error estimation becomes a critical issue because training and testing must be done on the same data. A recently proposed error estimator places the problem in a signal estimation framework in the presence of uncertainty, permitting a rigorous solution optimal in a minimum-mean-square error sense. The uncertainty in this model is relative to the parameters of the feature-label distributions, resulting in a Bayesian approach to error estimation. Closed form solutions are available for two important problems: discrete classification with Dirichlet priors and linear classification of Gaussian distributions with normal-inverse-Wishart priors. In this work, Part I of a two-part study, we introduce the theoretical mean-square-error (MSE) conditioned on the observed sample of any estimate of the classifier error, including the Bayesian error estimator, for both Bayesian models. Thus, Bayesian error estimation has a unique advantage in that its mathematical framework naturally gives rise to a practical expected measure of performance given an observed sample. In Part II of the study we examine consistency of the error estimator, demonstrate various MSE properties, and apply the conditional MSE to censored sampling.

Index Terms—Bayesian estimation, classification, error estimation, genomics, minimum mean-square estimation, small samples.

I. INTRODUCTION

TWO RECENT developments in error estimation for classification come together in this two-part study: 1) analysis of error-estimation performance, in particular, with regard to the root-mean-square (RMS) error between the true and estimated errors; and 2) the use of prior distributional knowledge in error estimation, in particular, Bayesian minimum-mean-square-error (MMSE) estimation (since RMS is the square root of MSE, we use the two interchangeably, with MSE appearing mainly in the equations to avoid square roots). These statistical developments have been driven by the advent of high-throughput genomic and proteomic technologies that present the biological and medical communities with difficult small-sample classification problems, specifically, error estimation, a salient issue in genomic signal processing [1].

When an error estimate is reported, it implicitly carries with it the properties of the error estimator, and these properties characterize the goodness of the estimate. Epistemologically, when a classifier is designed and an error estimate computed, model validity, and hence the degree to which the model has meaning, rests with the properties of the error estimator, in particular, the RMS or some other specified measure of validity [2]. If no distributional assumptions are made, then very little, or perhaps nothing, can be said about the precision of the estimate. In the rare instances in which performance bounds are known in the absence of any assumptions on the feature-label distribution, those bounds are so loose as to be virtually worthless for small samples [3], [4].

Full knowledge of the joint behavior of an error estimator and the true error is characterized by their joint distribution. Partial information is contained in their mixed moments, in particular, the second mixed moment. Since we are interested in measuring estimator accuracy, specifically, the RMS, we desire knowledge of the second-order moments, marginal and mixed. Historically, analytic study has mainly focused on the first and second marginal moments of the estimated error for linear discriminant analysis (LDA) in the Gaussian model or for multinomial discrimination [5]–[11]; however, marginal knowledge regarding the error estimator does not provide the kind of joint probabilistic knowledge required for the assessment of estimation accuracy.

Recent work has aimed at characterizing joint behavior, via representation of the joint distribution and the second-order moments. For multinomial discrimination, exact representations of the second-order moments, both marginal and mixed, for the true error and the resubstitution and leave-one-out estimators have been obtained [12]. For LDA, in the univariate Gaussian model the exact marginal distributions for both the resubstitution and leave-one-out estimators have been found, and in the multivariate model with a common known covariance matrix quasi-binomial approximations to the distributions of the resubstitution and leave-one-out estimators have been discovered [13]. Regarding the joint distribution between the true and estimated errors for LDA, the exact joint distributions for both resubstitution and leave-one-out have been found in the univariate Gaussian model, and approximations have been found in the multivariate model with a common known covariance matrix [14]. In regard to the RMS, whereas one could utilize the approximate representations to find approximate moments via
integration in the multivariate model with a common known covariance matrix, more accurate approximations, including the second-order mixed moment and the RMS, can be achieved via asymptotically exact analytic expressions using a double asymptotic approach, where both sample size and dimensionality approach infinity at a fixed rate between the two [15]. Such finite-sample approximations from the double asymptotic method have long been known to show good accuracy [16], [17].

In practice, the actual feature-label distribution is unknown. If we assume that it comes from an uncertainty class $\Theta$ of distributions and we have an expression for the RMS for each distribution in $\Theta$, then to be assured that the RMS is bounded by some desired level of accuracy, say $\epsilon$, we require that $\mathrm{RMS}(\theta) \leq \epsilon$ for every $\theta \in \Theta$. If we do not assume an uncertainty class as prior knowledge, then we cannot practically bound the RMS. One alternative might be to estimate the feature-label distribution from the data; however, this is not feasible with small samples. Another alternative would be to employ some distribution-free bounds on the RMS; however, these are only known in rare instances and, when known, are too loose for small-sample problems. Thus, a model-free approach would leave us without a measure of error-estimation accuracy, thereby rendering the resulting classifier model (classifier and error estimate) epistemologically unsound.

Having recognized that modeling assumptions (an uncertainty class) must be postulated when the sample is small to achieve an acceptable RMS, say, by determining a required sample size to ensure that $\mathrm{RMS}(\theta) \leq \epsilon$, we can go a step further and assume a prior distribution on the uncertainty class, take a Bayesian minimum-mean-square-error (MMSE) approach, and thereby guarantee that the average RMS across the model family for the resulting error estimator is minimal among all possible error estimators. This is done in [18], [19], where a parameterized family of class-conditional feature distributions is assumed, a prior distribution is applied to the parameters of the model, and this prior, along with the observed data, is used to compute an unbiased, MMSE estimate of classification error. An advantage of this approach, besides achieving minimum RMS across the model family, is that it depends only on the designed classifier, not the classification rule used to design the classifier. In particular, it is independent of the feature-selection method.

The transition from an unstructured uncertainty class to a prior distribution governing the parameters defining the uncertainty class is not uncommon in signal processing. For instance, assuming uncertainty in the second-order statistics of a random process originally led to a minimax theory of robust optimal linear filtering [20]–[22], whereas subsequently a prior distribution was assumed to govern the uncertainty class, thereby leading to a Bayesian theory of robust linear filtering [23]. In genomic signal processing, the first analysis of robust control for gene regulatory networks assumed an uncertainty class without a prior distribution, thereby resulting in a minimax theory of robust control [24]; subsequently it was assumed that a prior distribution governed the uncertainty class and a Bayesian theory of robust control was developed [25] (see also [26]).

In the present study, we derive analytical expressions for the RMS (MSE) performance of the Bayesian error estimator conditioned on the sample. Uncertainty is relative to the parameters in
the feature-label distribution, which is fundamentally different from the RMS relative to the sampling distribution for a fixed feature-label distribution. The latter approach does not address performance for a fixed sample because, absent an underlying framework, nothing is known given a sample. Furthermore, we show that the conditional MSE of an arbitrary error estimator can be decomposed into the MSE of the Bayesian error estimator plus an easily calculable residual term. Thus, the closed-form analytical results presented here may be easily modified for any error estimator under the Bayesian model.

Consider a typical application, where we are given a specific sample to train a classifier. We are interested in estimating the error rate of our designed classifier, as well as the validity and properties of this estimate. The Bayesian framework presented in [18] not only enables us to find an MMSE estimate of the classifier's true error, but also makes it possible to study properties of the error estimate and true error conditioned on the precise sample and classifier used. Under the Bayesian model, the sample conditions the uncertainty, and different samples condition it to different extents. Average results over the sampling distribution are difficult to apply in this way because they only address average performance under a classification scheme. This work is distinct because it takes into consideration a family of distributions and estimates performance using the best knowledge available on the parameters of the distribution: the posterior probabilities.

Although the RMS is computed relative to a fixed sample, we may still observe trends in performance as we increase sample size. For example, suppose we have a sequence of sample points drawn from an unknown fixed distribution. Starting with the first, say, $n$ points in this sequence, we may calculate the RMS of the Bayesian error estimator and find it to be relatively high. Although the prior is fixed, as we observe more sample points, the posterior distribution of the parameters will become tighter around the true distribution parameters. In this way, the Bayesian error estimate will be closer to the true error (both are changing since the sample is changing), and this will be reflected in the RMS. Thus, although the conditional RMS is calculated for a fixed sample of size $n$, as we increase $n$ by acquiring more sample points, the conditional RMS will tend to zero if the true distribution is in the family of distributions in our model. This will be discussed in some detail in a section on the consistency of the Bayesian error estimator in Part II of the study, and we will take advantage of it in a censored sampling application.
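To make this concentration effect concrete, the following minimal sketch (not from the paper; it isolates only the class-prior probability $c$ under a uniform prior, with hypothetical balanced class counts) shows how one ingredient of the conditional MSE derived later, the posterior variance of $c$, shrinks as the sample grows.

```python
# Minimal illustration (not from the paper): posterior variance of the
# class-0 prior probability c under a uniform Beta(1, 1) prior.
# With n0 class-0 points out of n, the posterior is Beta(n0 + 1, n1 + 1),
# so Var(c | S_n) = (n0 + 1)(n1 + 1) / ((n + 2)^2 (n + 3)).

def posterior_var_c(n0: int, n1: int) -> float:
    """Posterior variance of c given the class counts (uniform prior on c)."""
    n = n0 + n1
    return (n0 + 1) * (n1 + 1) / ((n + 2) ** 2 * (n + 3))

for n in (10, 50, 250, 1250):
    # Hypothetical balanced samples; the variance decays roughly like 1/n.
    print(n, posterior_var_c(n // 2, n - n // 2))
```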

II. REVIEW OF BAYESIAN MMSE CLASSIFIER ERROR ESTIMATION

Consider a binary classification problem with class labels 0 and 1. Let $S_n$ be a sample of size $n$ drawn from a sample space. Also let $n_0$ and $n_1$ denote the number of sample points in class 0 and class 1, respectively. From the sample we design, via a classification rule, a classifier, $\psi$, which is to predict the label of future sample points. The true error of $\psi$ is the probability of mislabeling a sample point, which can be decomposed as
$$\varepsilon = c\,\varepsilon^0 + (1 - c)\,\varepsilon^1,$$
where $c$ is the a priori probability that a sample point is from class 0, and where $\varepsilon^0$ and $\varepsilon^1$ are the probabilities that the classifier mislabels a class 0 point and a class 1 point, respectively. In practice, the feature-label distribution is unknown, so the true error must be estimated via error estimation rules. Classical classifier error estimation methods, such as cross-validation and bootstrap, are typically model-free heuristic methods. In contrast, Bayesian error estimation [18], [19] defines a mathematical framework that rigorously quantifies uncertainty in our knowledge. We assume a parametric model on the feature-label distribution, assign prior probabilities to the unknown distribution parameters, and update these to posterior probabilities using the observed sample. Under this model, the Bayesian error estimate is the MMSE estimate of the true error, which is equivalent to the conditional expectation of the true error given the observed measurements.

We denote the parameters of class $y$ by $\theta_y$ and the class-conditional distribution by $f_{\theta_y}(x \mid y)$. Hence, the feature-label distribution is completely characterized by three parameters: $c$, $\theta_0$, and $\theta_1$. We denote the complete collection of parameters by $\theta = \{c, \theta_0, \theta_1\}$. To facilitate analytic representations, we assume these three parameters are independent prior to observing the data and denote their marginal priors by $\pi(c)$, $\pi(\theta_0)$, and $\pi(\theta_1)$. After observing the sample, independence is preserved and the sample is used to update the priors to posterior densities, $\pi^*(c)$, $\pi^*(\theta_0)$, and $\pi^*(\theta_1)$. For instance, given a uniform prior on $c$ from 0 to 1, it can be shown that

$$\mathrm{E}[c \mid S_n] = \frac{n_0 + 1}{n + 2}, \qquad (1)$$
where $\mathrm{E}[\cdot \mid S_n]$ is shorthand notation for the conditional expectation given the sample. A more general class of priors on $c$ is considered in [18]. In addition, we may write the posterior distributions for $\theta_y$ as
$$\pi^*(\theta_y) \propto \pi(\theta_y) \prod_{i=1}^{n_y} f_{\theta_y}\!\left(x_i^y \mid y\right),$$

where $x_i^y$ is the $i$th sample point in class $y$ and the product on the right is called the "likelihood function." The constant of proportionality is found by normalizing the integral of $\pi^*(\theta_y)$ to 1. When the prior density is proper, this follows from Bayes' rule; if the prior is improper, this is taken as a definition.

Given a random sample, $S_n$, and a designed classifier, $\psi$, the Bayesian MMSE estimator (a function of $S_n$) for the true classifier error (a function of $\theta$ and $S_n$) is given by

$$\hat{\varepsilon} = \mathrm{E}[\varepsilon \mid S_n] = \mathrm{E}[c \mid S_n]\,\mathrm{E}[\varepsilon^0 \mid S_n] + \left(1 - \mathrm{E}[c \mid S_n]\right)\mathrm{E}[\varepsilon^1 \mid S_n], \qquad (2)$$
where we have used the posterior independence between $c$, $\theta_0$, and $\theta_1$, and we also define $\hat{\varepsilon}^y = \mathrm{E}[\varepsilon^y \mid S_n]$, which may be viewed as the posterior expectation of the error contributed by class $y$. From this point forward, we will suppress the dependence on the sample in several quantities to avoid cumbersome notation. That is, we will write, for example, $\hat{\varepsilon}$ instead of $\hat{\varepsilon}(S_n)$ and $\hat{\varepsilon}^y$
instead of $\hat{\varepsilon}^y(S_n)$. However, the reader should keep in mind that these quantities are always functions of the sample. If we further assume a uniform prior on the a priori probability, $c$, from 0 to 1, then the Bayesian MMSE error estimator can be simplified to
$$\hat{\varepsilon} = \frac{n_0 + 1}{n + 2}\,\hat{\varepsilon}^0 + \frac{n_1 + 1}{n + 2}\,\hat{\varepsilon}^1.$$
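As a minimal numerical sketch of the expression above (the function name and the numeric values of the posterior class-conditional expected errors are hypothetical; Sections IV and V show how those moments are computed in closed form):

```python
def bayesian_error_estimate(n0: int, n1: int, eps0_hat: float, eps1_hat: float) -> float:
    """Bayesian MMSE error estimate assuming a uniform prior on c.

    eps0_hat and eps1_hat stand for the posterior means E[eps^y | S_n];
    here they are simply given (hypothetical) numbers.
    """
    n = n0 + n1
    expected_c = (n0 + 1) / (n + 2)   # posterior mean of c, cf. (1)
    return expected_c * eps0_hat + (1 - expected_c) * eps1_hat

# Hypothetical sample with 12 class-0 points and 8 class-1 points.
print(bayesian_error_estimate(12, 8, eps0_hat=0.18, eps1_hat=0.25))
```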

Note that the Bayesian error estimator is a training-data error estimator, meaning that no sample points are held out for error estimation and the entire sample set is used to estimate the true error. For a fixed classifier and a fixed distribution parameter, $\theta_y$, the class-$y$ error, $\varepsilon^y$, is deterministic. Hence,
$$\hat{\varepsilon}^y = \mathrm{E}[\varepsilon^y \mid S_n] = \int_{\Theta_y} \varepsilon^y(\theta_y)\,\pi^*(\theta_y)\,d\theta_y, \qquad (3)$$
where $\Theta_y$ is the parameter space for $\theta_y$. When the prior probabilities are improper, this is called the generalized Bayesian error estimator. The Bayesian error estimator has been solved for two important classification problems: discrete classification with Dirichlet priors [18] and linear classification of Gaussian distributions with normal-inverse-Wishart priors [19]. These will be reviewed in Sections IV and V, respectively.

III. CONDITIONAL MSE OF ERROR ESTIMATORS

There are two sources of randomness in the Bayesian model. The first is the sample, which also randomizes the designed classifier and its true error. Almost all current results on error estimator performance are averaged over random samples, which demonstrates performance relative to a fixed classification rule. The second source of randomness, which is the focus of this paper, is uncertainty in the underlying feature-label distribution. The Bayesian error estimator addresses this second source of randomness, naturally giving rise to a practical expected measure of performance given a fixed sample and classifier.

We fix the sample and consider the conditional MSE, which is exactly the objective function optimized by the Bayesian MMSE error estimator. According to MMSE estimation theory, we may apply the orthogonality principle:
$$\mathrm{MSE}(\hat{\varepsilon} \mid S_n) = \mathrm{E}\!\left[(\varepsilon - \hat{\varepsilon})^2 \mid S_n\right] = \mathrm{Var}(\varepsilon \mid S_n),$$

where we have used the definition of the Bayesian error estimator given in (2). That is, the conditional MSE of the Bayesian error estimator is equivalent to the variance of the true error. Thanks to the posterior independence between $c$, $\theta_0$, and $\theta_1$, we may expand this, via the basic variance identity, to
$$\mathrm{MSE}(\hat{\varepsilon} \mid S_n) = \mathrm{E}\!\left[\mathrm{Var}(\varepsilon \mid c, S_n) \mid S_n\right] + \mathrm{Var}\!\left(\mathrm{E}[\varepsilon \mid c, S_n] \mid S_n\right).$$


Further decomposing the inner expectation and variance, we have
$$\mathrm{MSE}(\hat{\varepsilon} \mid S_n) = \mathrm{E}\!\left[c^2 \mid S_n\right]\mathrm{Var}(\varepsilon^0 \mid S_n) + \mathrm{E}\!\left[(1-c)^2 \mid S_n\right]\mathrm{Var}(\varepsilon^1 \mid S_n) + \mathrm{Var}(c \mid S_n)\left(\hat{\varepsilon}^0 - \hat{\varepsilon}^1\right)^2, \qquad (4)$$
where $\hat{\varepsilon}^0$ and $\hat{\varepsilon}^1$ are defined in (3), and where we have employed our shorthand notation for expectations conditioned on the sample. Therefore, finding the MSE of the Bayesian error estimator boils down to finding the posterior variances of $\varepsilon^0$ and $\varepsilon^1$. Furthermore,
$$\mathrm{Var}(\varepsilon^y \mid S_n) = \mathrm{E}\!\left[(\varepsilon^y)^2 \mid S_n\right] - \left(\hat{\varepsilon}^y\right)^2.$$

Combining this with (4) yields
$$\mathrm{MSE}(\hat{\varepsilon} \mid S_n) = \mathrm{E}\!\left[c^2 \mid S_n\right]\!\left(\mathrm{E}\!\left[(\varepsilon^0)^2 \mid S_n\right] - (\hat{\varepsilon}^0)^2\right) + \mathrm{E}\!\left[(1-c)^2 \mid S_n\right]\!\left(\mathrm{E}\!\left[(\varepsilon^1)^2 \mid S_n\right] - (\hat{\varepsilon}^1)^2\right) + \mathrm{Var}(c \mid S_n)\left(\hat{\varepsilon}^0 - \hat{\varepsilon}^1\right)^2. \qquad (5)$$

The variance and expectations related to the variable $c$ depend on our prior model for $c$, but are straightforward to find analytically. For example, if the prior distribution of $c$ is beta with hyperparameters $\alpha$ and $\beta$, which is the case if $c$ has a uniform prior as seen in (1), where $\alpha = \beta = 1$, then the posterior of $c$ is also beta with hyperparameters $n_0 + \alpha$ and $n_1 + \beta$. Then
$$\mathrm{E}[c \mid S_n] = \frac{n_0 + \alpha}{n + \alpha + \beta}$$
and
$$\mathrm{Var}(c \mid S_n) = \frac{(n_0 + \alpha)(n_1 + \beta)}{(n + \alpha + \beta)^2\,(n + \alpha + \beta + 1)}.$$
Hence,
$$\mathrm{E}\!\left[c^2 \mid S_n\right] = \mathrm{Var}(c \mid S_n) + \mathrm{E}[c \mid S_n]^2$$
and
$$\mathrm{E}\!\left[(1-c)^2 \mid S_n\right] = \mathrm{Var}(c \mid S_n) + \left(1 - \mathrm{E}[c \mid S_n]\right)^2,$$
so that every quantity involving $c$ in the conditional MSE (5) is available in closed form. Therefore, the conditional MSE for fixed samples is solved if we can find the first moment of the true error used in the definition of the Bayesian error estimator, $\hat{\varepsilon}^y = \mathrm{E}[\varepsilon^y \mid S_n]$, and the second moment, $\mathrm{E}[(\varepsilon^y)^2 \mid S_n]$, for both classes, $y \in \{0, 1\}$.

Having evaluated the conditional MSE of the Bayesian error estimator, it is easy to find analogous results for an arbitrary error estimate. Let $\hat{\varepsilon}_\bullet$ be a constant number representing an error estimate evaluated from the given sample. Then
$$\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n) = \mathrm{E}\!\left[(\varepsilon - \hat{\varepsilon}_\bullet)^2 \mid S_n\right] = \mathrm{E}\!\left[(\varepsilon - \hat{\varepsilon})^2 \mid S_n\right] + (\hat{\varepsilon} - \hat{\varepsilon}_\bullet)^2 = \mathrm{MSE}(\hat{\varepsilon} \mid S_n) + (\hat{\varepsilon} - \hat{\varepsilon}_\bullet)^2, \qquad (6)$$
the last equality following from (2). Thus, if we solve the conditional MSE of the Bayesian error estimator, $\mathrm{MSE}(\hat{\varepsilon} \mid S_n)$, it is trivial to evaluate the conditional MSE of any error estimator, $\mathrm{MSE}(\hat{\varepsilon}_\bullet \mid S_n)$, under the Bayesian model. Further, (6) clearly shows that the conditional MSE of the Bayesian error estimator lower bounds the conditional MSE of any other error estimator.

IV. THE DISCRETE CASE

We first solve the conditional MSE for the discrete classification problem defined in [18] with $b$ bins. Let $p_i$ and $q_i$, $i = 1, \ldots, b$, be the class-conditional probabilities for each bin for class 0 and 1, respectively. Similarly, let $U_i$ and $V_i$, $i = 1, \ldots, b$, be the number of sample points observed in each bin for class 0 and 1, respectively. The $U_i$'s and $V_i$'s are outcomes of multinomial sampling distributions with parameters $(n_0, p_1, \ldots, p_b)$ and $(n_1, q_1, \ldots, q_b)$, respectively. Further, suppose we are given an arbitrary discrete classifier, $\psi$, assigning a class to each bin. This classifier may be trained using the discrete histogram classification rule, but this is not necessary. The bin probabilities, $(p_1, \ldots, p_b)$ and $(q_1, \ldots, q_b)$, are both members of the "standard $(b-1)$-simplex," which is the set of all sequences of length $b$ whose terms are nonnegative and add to one. We formally define the model parameters $\theta_0 = [p_1, \ldots, p_{b-1}]$ and $\theta_1 = [q_1, \ldots, q_{b-1}]$ for class 0 and class 1, respectively. Note that the last bin probability is not needed in the model parameters of either class because $p_b = 1 - \sum_{i=1}^{b-1} p_i$ and $q_b = 1 - \sum_{i=1}^{b-1} q_i$. The conjugate prior for the multinomial distribution used to model the bin probabilities in either class is given by a generalized beta distribution known as the Dirichlet distribution,
$$\pi(\theta_0) \propto \prod_{i=1}^{b} p_i^{\alpha_i^0 - 1}, \qquad \pi(\theta_1) \propto \prod_{i=1}^{b} q_i^{\alpha_i^1 - 1},$$
where we require the hyperparameters $\alpha_i^y$, $i = 1, \ldots, b$, $y \in \{0, 1\}$, to satisfy $\alpha_i^y > 0$. Note that if $\alpha_i^y = 1$ for all bins, $i$, and both classes, $y$, then this reduces to a flat prior. Furthermore, the Dirichlet prior for class $y$ is mathematically equivalent to a likelihood resulting from $\sum_{i=1}^{b} (\alpha_i^y - 1)$ class-$y$ observations, with $\alpha_i^y - 1$ observations in bin $i$. It has been shown that the posteriors, $\pi^*(\theta_0)$ and $\pi^*(\theta_1)$, are also Dirichlet distributions with updated hyperparameters $U_i + \alpha_i^0$ and $V_i + \alpha_i^1$ [27]. Furthermore,
$$\hat{\varepsilon}^0 = \sum_{i=1}^{b} \mathbb{1}_{\psi(i)=1}\,\frac{U_i + \alpha_i^0}{n_0 + \sum_{j=1}^{b} \alpha_j^0} \qquad (7)$$
and
$$\hat{\varepsilon}^1 = \sum_{i=1}^{b} \mathbb{1}_{\psi(i)=0}\,\frac{V_i + \alpha_i^1}{n_1 + \sum_{j=1}^{b} \alpha_j^1}, \qquad (8)$$
where $\mathbb{1}_A$ is an indicator function equal to one if $A$ is true and zero otherwise.
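The discrete model lends itself to a direct numerical check. The sketch below (hypothetical bin counts, flat Dirichlet priors, and an arbitrary bin classifier `psi`; the variable names are my own) computes the Bayesian error estimate from (7) and (8), obtains the second moments through the aggregation property of the Dirichlet distribution (a sum of Dirichlet components is Beta distributed), which is one route to the closed forms derived below as (9) and (10), assembles the conditional MSE as in (5), and cross-checks the result by Monte Carlo sampling from the posteriors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete model: 5 bins, a 0/1 classifier per bin, observed
# bin counts U (class 0) and V (class 1), and flat Dirichlet hyperparameters.
psi    = np.array([0, 0, 1, 1, 0])     # label assigned to each bin
U      = np.array([6, 3, 1, 0, 2])     # class-0 counts per bin
V      = np.array([1, 1, 4, 5, 1])     # class-1 counts per bin
alpha0 = np.ones(5)                    # flat prior, class 0
alpha1 = np.ones(5)                    # flat prior, class 1
n0, n1 = U.sum(), V.sum()

def beta_agg_moments(post, mask):
    """Mean and second moment of sum(post[mask]) when post are Dirichlet params."""
    s, t = post[mask].sum(), post.sum()      # aggregated component is Beta(s, t - s)
    return s / t, s * (s + 1) / (t * (t + 1))

# First and second posterior moments of the class-conditional errors (cf. (7)-(10)).
m0, m0_2 = beta_agg_moments(U + alpha0, psi == 1)   # class 0 sent to label 1
m1, m1_2 = beta_agg_moments(V + alpha1, psi == 0)   # class 1 sent to label 0

# Posterior moments of c under a uniform prior (cf. (1)).
a, b = n0 + 1, n1 + 1
Ec = a / (a + b)
Vc = a * b / ((a + b) ** 2 * (a + b + 1))

eps_hat = Ec * m0 + (1 - Ec) * m1                    # Bayesian error estimate, cf. (2)
mse = ((Vc + Ec ** 2) * (m0_2 - m0 ** 2)             # conditional MSE, cf. (5)
       + (Vc + (1 - Ec) ** 2) * (m1_2 - m1 ** 2)
       + Vc * (m0 - m1) ** 2)
print("analytic   :", eps_hat, mse)

# Monte Carlo cross-check by sampling from the posteriors.
c = rng.beta(a, b, 200_000)
p = rng.dirichlet(U + alpha0, 200_000)
q = rng.dirichlet(V + alpha1, 200_000)
eps = c * p[:, psi == 1].sum(axis=1) + (1 - c) * q[:, psi == 0].sum(axis=1)
print("monte carlo:", eps.mean(), ((eps - eps_hat) ** 2).mean())
```

The two printed lines should agree up to Monte Carlo error.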


Following a similar method as that used in the Appendix of [18] to derive (7) and (8), we may also evaluate the second moments of the true errors contributed by each class. In particular, for class 0, the required integral has been solved in Lemma B.1 of [18] and, after simplification using properties of the gamma function, yields
$$\mathrm{E}\!\left[(\varepsilon^0)^2 \mid S_n\right] = \frac{\sum_{i=1}^{b}\sum_{j=1}^{b} \mathbb{1}_{\psi(i)=1}\,\mathbb{1}_{\psi(j)=1}\,\left(U_i + \alpha_i^0\right)\left(U_j + \alpha_j^0 + \mathbb{1}_{i=j}\right)}{\left(n_0 + \sum_{k=1}^{b}\alpha_k^0\right)\left(n_0 + \sum_{k=1}^{b}\alpha_k^0 + 1\right)}. \qquad (9)$$
Similar results can be found for class 1:
$$\mathrm{E}\!\left[(\varepsilon^1)^2 \mid S_n\right] = \frac{\sum_{i=1}^{b}\sum_{j=1}^{b} \mathbb{1}_{\psi(i)=0}\,\mathbb{1}_{\psi(j)=0}\,\left(V_i + \alpha_i^1\right)\left(V_j + \alpha_j^1 + \mathbb{1}_{i=j}\right)}{\left(n_1 + \sum_{k=1}^{b}\alpha_k^1\right)\left(n_1 + \sum_{k=1}^{b}\alpha_k^1 + 1\right)}. \qquad (10)$$
Combining (7), (8), (9), and (10) with (5) specifies the conditional MSE of the Bayesian error estimator in the discrete model.

V. THE GAUSSIAN CASE WITH LINEAR CLASSIFICATION

We next study the Gaussian model defined in [19]. Each sample point consists of $p$ multivariate Gaussian features. For each class, $y$, we define the distribution parameter $\theta_y = [\mu_y, \lambda_y]$, where $\mu_y$ is the mean of the class-conditional distribution and $\lambda_y$ is a parameter that specifies the covariance, $\Sigma_y(\lambda_y)$. The introduction of $\lambda_y$ enables us to force a structure on the covariance without explicitly showing its dependence on $\lambda_y$; that is, we simply write $\Sigma_y$ instead of $\Sigma_y(\lambda_y)$. We assume $\Sigma_y$ is invertible with probability 1, and for invertible $\Sigma_y$ our priors take a normal-inverse-Wishart form with hyperparameters $\nu_y$, $m_y$, $\kappa_y$, and $S_y$. That is, the mean conditioned on the covariance is Gaussian with mean $m_y$ and covariance $\Sigma_y / \nu_y$. If $\lambda_y = \Sigma_y$, with a parameter space consisting of all positive-definite matrices, then the marginal distribution of the covariance is an inverse-Wishart distribution. In general, the hyperparameters $m_y$ and $S_y$ can be viewed as targets for the mean and the shape of the covariance, respectively. Furthermore, the larger $\nu_y$ is, the more localized the prior is about $m_y$, and the larger $\kappa_y$ is, the less the shape of $\Sigma_y$ is allowed to wiggle. The hyperparameters may have different restrictions depending on the assumed model; these will be discussed in more detail in the following subsections. However, in general we minimally require that $\nu_y \geq 0$, $m_y$ is a length-$p$ real vector, $\kappa_y$ is a real number (for linear classification we may further require $\kappa_y$ to be an integer to utilize a closed-form solution for the Bayesian error estimator), and $S_y$ is a nonnegative definite $p \times p$ matrix.

Given $n_y$ observed sample points in class $y$, the posterior probability of $\theta_y$ has the same form, with updated hyperparameters given by
$$\nu_y^* = \nu_y + n_y, \qquad m_y^* = \frac{\nu_y m_y + n_y \hat{\mu}_y}{\nu_y + n_y}, \qquad \kappa_y^* = \kappa_y + n_y, \qquad S_y^* = S_y + (n_y - 1)\hat{\Sigma}_y + \frac{\nu_y n_y}{\nu_y + n_y}\left(\hat{\mu}_y - m_y\right)\left(\hat{\mu}_y - m_y\right)^T,$$
where $\hat{\mu}_y$ and $\hat{\Sigma}_y$ are the sample mean and sample covariance of points from class $y$, respectively. Although it is not mandatory for the prior to be a proper density (in the general covariance
model, where $\lambda_y = \Sigma_y$, the prior is proper if $\nu_y > 0$, $S_y$ is positive definite, and $\kappa_y > p - 1$), the posterior must be proper to evaluate the Bayesian error estimator (i.e., we must have $\nu_y^* > 0$, $S_y^*$ positive definite, and $\kappa_y^* > p - 1$). If the prior is proper, then the posterior is also guaranteed to be proper.

For nonlinear classifiers, the integral in (3) must be approximated using Monte Carlo methods. However, for linear classifiers, i.e., classifiers of the form
$$\psi(x) = \begin{cases} 0 & \text{if } a^T x + b \leq 0, \\ 1 & \text{otherwise}, \end{cases}$$
with some constant vector $a$ and constant scalar $b$, we have closed-form Bayesian error estimators for three models: fixed covariance, scaled identity covariance, and arbitrary covariance.

If the designed classifier is constant, that is, if $a = 0$, then the true error, $\varepsilon^y$, is deterministically zero or one, depending on the sign of $b$. In this special case, the conditional MSE is found trivially: $\varepsilon^y$ has zero posterior variance, so that from (4) we have
$$\mathrm{MSE}(\hat{\varepsilon} \mid S_n) = \mathrm{Var}(c \mid S_n),$$
which is the posterior variance of the a priori class probability. Thus, in the remainder of this section we assume $a \neq 0$.

We will present closed-form expressions for the conditional MSE of Bayesian error estimators under Gaussian distributions with linear classification for all three models. Note that the second moments we require in (5) may be written as
$$\mathrm{E}\!\left[(\varepsilon^y)^2 \mid S_n\right] = \int_{\Lambda_y} \left( \int_{\mathbb{R}^p} \left(\varepsilon^y(\mu_y, \Sigma_y)\right)^2 \pi^*(\mu_y \mid \lambda_y)\, d\mu_y \right) \pi^*(\lambda_y)\, d\lambda_y, \qquad (11)$$
where $\mathbb{R}^p$ is the parameter space for $\mu_y$ and $\Lambda_y$ is the parameter space for $\lambda_y$.

A. The Bayesian Error Estimator for Fixed Covariances

For a fixed (invertible) covariance, $\Sigma_y$, we require $\nu_y^* > 0$ to ensure that the posterior of $\mu_y$ is proper. It was shown in [19] that
$$\hat{\varepsilon}^y = \Phi\!\left( (-1)^y\, \frac{a^T m_y^* + b}{\sqrt{\left(1 + \frac{1}{\nu_y^*}\right) a^T \Sigma_y a}} \right), \qquad (12)$$
where $\Phi$ is the standard normal cumulative distribution function. To find the conditional MSE, the outer integral in the definition of the second moment (11) is not necessary because the current model has a fixed covariance. Hence, we need only solve the inner integral, which is given by
$$\int_{\mathbb{R}^p} \left(\varepsilon^y(\mu_y)\right)^2 f\!\left(\mu_y;\, m_y^*,\, \Sigma_y / \nu_y^*\right) d\mu_y,$$
where $f(\mu; m, \Sigma)$ is a Gaussian density with mean $m$ and covariance $\Sigma$. This integral is simplified to a well-behaved single integral in Lemma 1 of the Appendix. The final result is (13). Combining (12) and (13) with (5) defines the conditional MSE of the Bayesian error estimator under the fixed covariance model.

B. The Bayesian Error Estimator for Identity Covariances

In this model, we assume $\Sigma_y$ is a scaled identity covariance matrix, that is, $\Sigma_y = \sigma_y^2 I_p$, where $I_p$ is the $p \times p$ identity matrix. Under this model, it has been shown [19] that $\hat{\varepsilon}^y$ takes the closed form (14), where $\mathrm{I}_x(a, b)$ is the regularized incomplete beta function. Furthermore, $\sigma_y^2$ has an inverse-gamma distribution whose shape and scale parameters are determined by the posterior hyperparameters. Hence, we require $\nu_y^* > 0$ to ensure that the posterior of $\mu_y$ is proper and, additionally, we require the shape and scale parameters to be positive to ensure that the posterior of $\sigma_y^2$ is proper; equivalently, $\nu_y^*$, $\kappa_y^*$, and $S_y^*$ must be positive definite.

To evaluate the second moment of the true error for scaled identity covariances, we use the previous result from Lemma 1 for the inner integral, so that (11) reduces to


where $\mathrm{IG}(\cdot\,; \kappa, s)$ is the inverse-gamma distribution with shape parameter $\kappa$ and scale parameter $s$. This integral is solved in Lemma 2 in the Appendix. The final result is (15), with the auxiliary quantity appearing in (15) defined, over the relevant parameter ranges, by (16), where $F_1$ is an Appell hypergeometric function. Combining (14) and (15) with (5) defines the conditional MSE of the Bayesian error estimator under the scaled identity covariance model. Closed-form expressions for both the Appell-type function and the regularized incomplete beta function for integer or half-integer values of their parameters are discussed in Section V-D.

C. The Bayesian Error Estimator for General Covariances

Finally, in the general covariance model we assume $\lambda_y = \Sigma_y$. That is, $\Sigma_y$ is an arbitrary covariance matrix and the parameter space contains all positive definite matrices. It has been shown in [19] that the Bayesian error estimator is given by (17). Further, the posterior of $\Sigma_y$ is an inverse-Wishart distribution,
$$\pi^*(\Sigma_y) = \mathrm{IW}\!\left(\Sigma_y;\, \kappa_y^*, S_y^*\right),$$
whose normalizing constant involves $\Gamma_p$, the multivariate gamma function. For a proper posterior, we require $\kappa_y^* > p - 1$ and $S_y^*$ positive definite. Using the same method as in the previous section, note that the outer integral in (11) is now taken against the inverse-Wishart density $\mathrm{IW}(\Sigma_y; \kappa_y^*, S_y^*)$. This integral is solved in Lemma 3 of the Appendix. The result is (18), where the auxiliary function is as defined at the end of the previous section. Combining (17) and (18) with (5) defines the conditional MSE of the Bayesian error estimator under the general covariance model. Again note that closed-form expressions for both the Appell-type function and the regularized incomplete beta function for integer or half-integer values of their parameters are discussed in the next section.

D. Closed Form Expressions for the Required Functions

The solutions proposed in the previous sections utilize two Euler integrals. The first is the incomplete beta function, which is in general defined by
$$\mathrm{I}_x(a, b) = \frac{1}{B(a, b)} \int_0^x t^{a-1} (1 - t)^{b-1}\, dt$$
for $0 \leq x \leq 1$, $a > 0$, and $b > 0$, where $B(a, b)$ is the beta function and normalizes $\mathrm{I}_x(a, b)$ so that $\mathrm{I}_1(a, b) = 1$. In our application, note that we only need to evaluate $\mathrm{I}_x(a, b)$ for particular half-integer arguments. The second integral is the auxiliary function introduced in (16), defined through the Appell hypergeometric function $F_1$, which has the Euler-type integral representation
$$F_1(a; b_1, b_2; c; x, y) = \frac{\Gamma(c)}{\Gamma(a)\,\Gamma(c - a)} \int_0^1 t^{a-1} (1 - t)^{c - a - 1} (1 - xt)^{-b_1} (1 - yt)^{-b_2}\, dt,$$
defined for $c > a > 0$. Although these integrals do not have closed-form solutions for arbitrary parameters, in this section we provide exact expressions for both for positive integer values of the relevant parameter. Restricting this parameter to be an integer guarantees that these equations may be applied, so that Bayesian error estimators and their conditional MSE for the Gaussian model with linear classification may be evaluated exactly using finite sums of common single-variable functions.

In [19], a closed-form solution for the regularized incomplete beta function was found for the required arguments and positive integers; the result, (19), has one expression when the integer parameter is odd and another when it is even.
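Before turning to the remaining closed form in Lemma 4, a numerical cross-check of the kind of quantity these expressions evaluate can be useful. The sketch below is not the paper's closed form: it is a Monte Carlo approximation of the Bayesian error estimate and conditional MSE for the simplest case, the fixed-covariance model of Section V-A, with every concrete value (covariance, posterior hyperparameters, classifier coefficients, class counts) chosen hypothetically and with the sign convention psi(x) = 0 when a'x + b <= 0 assumed, as in the classifier definition above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical fixed-covariance setup (all numbers are placeholders).
Sigma   = np.array([[1.0, 0.3], [0.3, 1.0]])          # known, fixed covariance
a_vec   = np.array([1.0, -1.0])                        # linear classifier coefficients
b_off   = 0.2
m_star  = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]   # posterior means of mu_y
nu_star = [12.0, 9.0]                                      # posterior precision counts
n0, n1  = 11, 8

sigma_a = np.sqrt(a_vec @ Sigma @ a_vec)

# Sample mu_y from its posterior N(m_star_y, Sigma / nu_star_y), compute the
# class-conditional error Phi(+/- (a'mu + b) / sigma_a), and average.
N = 100_000
moments = []
for y in (0, 1):
    mu = rng.multivariate_normal(m_star[y], Sigma / nu_star[y], size=N)
    z = (mu @ a_vec + b_off) / sigma_a
    errs = norm.cdf(z) if y == 0 else norm.cdf(-z)     # class 0 errs when a'x+b > 0
    moments.append((errs.mean(), (errs ** 2).mean()))

(m0, m0_2), (m1, m1_2) = moments
aB, bB = n0 + 1, n1 + 1                                # Beta posterior of c, uniform prior
Ec = aB / (aB + bB)
Vc = aB * bB / ((aB + bB) ** 2 * (aB + bB + 1))

eps_hat = Ec * m0 + (1 - Ec) * m1                      # cf. (2)
mse = ((Vc + Ec ** 2) * (m0_2 - m0 ** 2)               # cf. (5)
       + (Vc + (1 - Ec) ** 2) * (m1_2 - m1 ** 2)
       + Vc * (m0 - m1) ** 2)
print("Bayesian error estimate:", eps_hat, "  conditional MSE:", mse)
```

Under the fixed-covariance model, these Monte Carlo values should converge to the exact answers produced by (12), (13), and (5).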


Furthermore, in Lemma 4 of the Appendix we demonstrate that the auxiliary function has closed-form solutions for the required arguments and positive integer parameter, with the final result again taking one form when the parameter is odd and another when it is even, and we may apply (19) to evaluate the regularized incomplete beta function in the sum.

VI. CONCLUSION

A major advantage of Bayesian error estimation over classical classifier error estimation schemes is that it articulates a mathematical model to generate an estimator that is theoretically optimal in the mean-square sense. Moreover, other benefits emerge: the Bayesian MMSE error estimator is theoretically unbiased, and the priors can be tailored to target certain properties, for example, to obtain best performance in moderately difficult classification problems with Bayes errors in the mid range [18]. This paper explores another unique advantage of Bayesian error estimation, which is that its mathematical model naturally gives rise to the RMS of an error estimate conditioned on the actual observed sample. Prior to this work, the RMS for non-hold-out error estimators has always been considered by averaging over the sampling distribution, and nothing could be said about performance for a particular sample. In contrast, the conditional RMS proposed in this paper formally defines a very practical measure of the expected performance of an error estimate given a fixed sample.

In the second part of this two-part study we shall characterize the consistency of the Bayesian error estimator, conditioned upon the sample, and demonstrate consistency under very mild assumptions that hold for both the discrete and Gaussian models we have studied. We will show how the sample-conditioned RMS can be used for censored sampling, thereby conditioning the sample size on the desired accuracy of the error estimator, and we will apply censored sampling to genomic classification. We will also present simulations to examine the performance characteristics of Bayesian error estimation in relation to the prior distribution and sample size.

APPENDIX
THEORETICAL RESULTS ON THE SECOND MOMENT OF THE BAYESIAN ERROR ESTIMATOR UNDER THE GAUSSIAN MODEL

Lemma 1: Let $y$ be a class label. Also let $m$ be a mean vector with $p$ features, let $\Sigma$ be an invertible covariance matrix, and let $g(x) = a^T x + b$, where $a$ is a nonzero length-$p$ vector and $b$ is a scalar. Then the integral of interest reduces to a single well-behaved integral, where $f(x; m, \Sigma)$ is a Gaussian density with mean $m$ and covariance $\Sigma$, and $\mathbb{1}_A$ is an indicator function equal to one if $A$ holds and zero otherwise.
Proof: Call this integral $I$. We have that

Since is an invertible covariance matrix, we can use singular value decomposition to write with . Next consider the linear change of variables, . We have that

Define

and

, and note that

. Then

Next consider a change of variables the vector to the vector matrix,

and

, where rotates . Since is a rotation

is an identity matrix. Let the first


element in the vector to

be called . Then this integral simplifies

Let


. Then finally

On the other hand, if

, then from (20)

This reduces the problem to a three dimensional space. Now consider the following rotation of the coordinate system The first integral is easily solved using the result for , and for the second integral we use the same polar transformation and -substitution as before This rotates the vector

to the vector . To determine the new region of integration, note in the coordinate system the region and of integration is defined by two restrictions: . In the new coordinate system, the first restriction is

Equivalently

And similarly for the other restriction

We have designed our new coordinate system to make the variable independent from these restrictions. Hence, it may be integrated out of our original integral, which may be simplified to

(20) If using

, then we convert to polar coordinates, and to obtain

,

This may be simplified by realizing that a component of this integral is equivalent to an alternate representation for the Gaussian CDF function [28]. We first break the integral into two parts, and then use symmetry in the integrand to simplify the result


Lemma 2: Let , , , and . Let $\mathrm{IG}(\cdot\,; \kappa, s)$ be the inverse-gamma distribution with shape parameter $\kappa$ and scale parameter $s$, and let $\mathbb{1}_A$ be an indicator function equal to one if $A$ is true and zero otherwise. Then

We next focus on the inner integral in the second term. Call this integral . We have

where we have solved this integral by noting it is essentially an inverse-gamma distribution. Thus our original integral is where is the regularized incomplete beta function, defined for , and , and is given by an Appell hypergeometric function, , such that and

(22) For the final integral, consider the substitution We have

, and . Proof: Call this integral M. When . For , we obtain that

.

for

, it is easy to show

This is essentially a one-dimensional Euler-type integral representation of Appell’s hypergeometric function, . In other words

Finally, from the identity [29] The integral in the first term has already been solved in Lemma D.1 of [19]. We have that we have

For convenience, we modify this result slightly

(21), where the indicator is one if the stated condition holds and zero otherwise. This intermediate result will be used in Lemma 3.

Combining this result with (22) completes the proof.


Lemma 3: Let column vector, matrix. Also let tion with parameters tion equal to one if

, , and

, be a nonzero be a positive definite be an inverse-Wishart distribuand and be an indicator funcand zero otherwise. Then

where the outer integration is over all positive definite matrices, is the regularized incomplete beta function, and is defined in the statement of Lemma 2. Proof: Call this integral . If , it is easy to show . Otherwise, if then we have that


Since is nonzero, with a simple reordering of the dimensions we can guarantee . The value of is unchanged by such a redefinition, so without loss of generality assume is invertible. Consider the change of variables, . Since is invertible, is positive definite if and only if is also. Furthermore, the Jacobian determinant of this transformation is [30], [31]. Note , where the subscript 11 indexes the upper left element of a matrix, and we have

Since the integrand now depends on only one parameter in , namely , the other parameters can be integrated out. It can be shown that for any inverse-Wishart random variable, , with density , the marginal distribution of is also an inverse-Wishart distribution with density [32]. In one dimension, this is equiva. lent to the inverse-gamma distribution In this case,

, so

The integral in the first term has been solved in Lemma E.1 of [19]. We have that

where we have defined

Define the following constant matrix:

Note , , and this integral is exactly the same as (21) so we apply Lemma 2 to complete the proof.


Lemma 4: Let be a positive integer, and . Then the function defined in the statement of Lemma 2 can be expressed as

The first integral is an incomplete beta function, and the second is again an Appell function, so that

if

if

is odd

if

is even

A property of the regularized incomplete beta function is , hence

where

Proof: If , then we have . The solution for in the statement of this lemma applies for this case. For , to solve for half integer values we first focus on the Appell function, . Define and , and note that . For any real number , we have the definition

With some manipulation we have

By induction, for any positive integer

We apply this to the definition of in the statement of Lemma 2 to decompose into one of two Appell functions with known solutions. In particular

if

if

is odd

if

is even.

After some simplification In the first integral, let

. We have

if

if

is odd

if

is even


Finally, to evaluate

it can be shown that

and

With further simplification, we obtain the result in the statement of the lemma.

REFERENCES

[1] E. R. Dougherty, A. Datta, and C. Sima, "Research issues in genomic signal processing," IEEE Signal Process. Mag., vol. 22, no. 6, pp. 46–68, Jun. 2005.
[2] E. R. Dougherty and U. Braga-Neto, "Epistemology of computational biology: Mathematical models and experimental prediction as the basis of their validity," Biol. Syst., vol. 14, no. 1, pp. 65–90, 2006.
[3] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition, ser. Stochastic Modelling and Applied Probability. Berlin, Germany: Springer-Verlag, 1996.
[4] E. R. Dougherty, A. Zollanvari, and U. M. Braga-Neto, "The illusion of distribution-free small-sample classification in genomics," Current Genomics, 2011.
[5] M. Hills, "Allocation rules and their error rates," J. Roy. Statist. Soc. Series B (Methodol.), vol. 28, no. 1, pp. 1–31, 1966.
[6] D. Foley, "Considerations of sample and feature size," IEEE Trans. Inf. Theory, vol. IT-18, no. 5, pp. 618–626, Sep. 1972.
[7] M. J. Sorum, "Estimating the expected probability of misclassification for a rule based on the linear discriminant function: univariate normal case," Technometrics, vol. 15, pp. 329–339, 1973.
[8] G. J. McLachlan, "An asymptotic expansion of the expectation of the estimated error rate in discriminant analysis," Austr. J. Statist., vol. 15, pp. 210–214, 1973.
[9] M. Moran, "On the expectation of errors of allocation associated with a linear discriminant function," Biometrika, vol. 62, no. 1, pp. 141–148, 1975.
[10] M. Goldstein and E. Wolf, "On the problem of bias in multinomial classification," Biometrics, vol. 33, pp. 325–31, Jun. 1977.
[11] A. Davison and P. Hall, "On the bias and variability of bootstrap and crossvalidation estimates of error rates in discrimination problems," Biometrika, vol. 79, pp. 274–284, 1992.
[12] U. Braga-Neto and E. R. Dougherty, "Exact correlation between actual and estimated errors in discrete classification," Pattern Recognit. Lett., vol. 31, no. 5, pp. 407–412, Apr. 2010.
[13] A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, "On the sampling distribution of resubstitution and leave-one-out error estimators for linear classifiers," Pattern Recognit., vol. 42, no. 11, pp. 2705–2723, Nov. 2009.
[14] A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, "On the joint sampling distribution between the actual classification error and the resubstitution and leave-one-out error estimators for linear classifiers," IEEE Trans. Inf. Theory, vol. 56, no. 2, pp. 784–804, Feb. 2010.
[15] A. Zollanvari, U. M. Braga-Neto, and E. R. Dougherty, "Analytic study of performance of error estimators for linear discriminant analysis," IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4238–4255, Sep. 2011.
[16] F. Wyman, D. Young, and D. Turner, "A comparison of asymptotic error rate expansions for the sample linear discriminant function," Pattern Recognit., vol. 23, pp. 775–783, 1990.
[17] V. Pikelis, "Comparison of methods of computing the expected classification errors," Autom. Remote Contr., vol. 5, pp. 59–63, 1976.
[18] L. A. Dalton and E. R. Dougherty, "Bayesian minimum mean-square error estimation for classification error—Part I: Definition and the Bayesian MMSE error estimator for discrete classification," IEEE Trans. Signal Process., vol. 59, no. 1, pp. 115–129, Jan. 2011.
[19] L. A. Dalton and E. R. Dougherty, "Bayesian minimum mean-square error estimation for classification error—Part II: The Bayesian MMSE error estimator for linear classification of Gaussian distributions," IEEE Trans. Signal Process., vol. 59, no. 1, pp. 130–144, Jan. 2011.
[20] H. V. Poor, "On robust Wiener filtering," IEEE Trans. Autom. Control, vol. 25, no. 4, pp. 531–536, Apr. 1980.


[21] K. S. Vastola and H. V. Poor, "Robust Wiener-Kolmogorov theory," IEEE Trans. Inf. Theory, vol. IT-30, no. 2, pp. 316–327, Mar. 1984.
[22] S. Verdu and H. V. Poor, "On minimax robustness: A general approach and applications," IEEE Trans. Inf. Theory, vol. IT-30, no. 2, pp. 328–340, Mar. 1984.
[23] A. M. Grigoryan and E. R. Dougherty, "Bayesian robust optimal linear filters," Signal Process., vol. 81, no. 12, pp. 2503–2521, Dec. 2001.
[24] R. Pal, A. Datta, and E. R. Dougherty, "Robust intervention in probabilistic Boolean networks," IEEE Trans. Signal Process., vol. 56, no. 3, pp. 1280–1294, Mar. 2008.
[25] R. Pal, A. Datta, and E. R. Dougherty, "Bayesian robustness in the control of gene regulatory networks," IEEE Trans. Signal Process., vol. 57, no. 9, pp. 3667–3678, Sep. 2009.
[26] A. Nilim and L. E. Ghaoui, "Robust control of Markov decision processes with uncertain transition matrices," Operations Res., vol. 53, no. 5, pp. 780–798, 2005.
[27] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Bayesian Data Analysis, 2nd ed. Boca Raton, FL: CRC Press, 2004.
[28] J. W. Craig, "A new, simple and exact result for calculating the probability of error for two-dimensional signal constellations," in Proc. IEEE Military Commun. Conf. (MILCOM '91), McLean, VA, Nov. 1991, pp. 571–575.
[29] L. J. Slater, Generalized Hypergeometric Functions. Cambridge, U.K.: Cambridge Univ. Press, 1966.
[30] S. F. Arnold, The Theory of Linear Models and Multivariate Analysis. New York: Wiley, 1981.
[31] A. M. Mathai and H. J. Haubold, Special Functions for Applied Scientists. New York: Springer-Verlag, 2008.
[32] K. E. Muller and P. W. Stewart, Linear Model Theory: Univariate, Multivariate, and Mixed Models. Hoboken, NJ: Wiley, 2006.

Lori A. Dalton (S'10) received the B.Sc. and M.Sc. degrees in electrical engineering from Texas A&M University, College Station, in 2001 and 2002, respectively. She is currently working towards the Ph.D. degree in electrical engineering at Texas A&M University. Her current research interests include pattern recognition, classification, clustering, and error estimation. Ms. Dalton was awarded an NSF Graduate Research Fellowship in 2001, and she was awarded the Association of Former Students Distinguished Graduate Student Masters Research Award in 2003.

Edward R. Dougherty (M’05–SM’09) received the M.Sc. degree in computer science from the Stevens Institute of Technology, Hoboken, NJ, and the Ph.D. degree in mathematics from Rutgers University, Piscataway, NJ, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology, Finland. He is currently a Professor in the Department of Electrical and Computer Engineering at Texas A&M University, College Station, where he holds the Robert M. Kennedy ’26 Chair in Electrical Engineering and is Director of the Genomic Signal Processing Laboratory. He is also CoDirector of the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ, and is an Adjunct Professor in the Department of Bioinformatics and Computational Biology at the University of Texas M. D. Anderson Cancer Center, Houston, TX. He is author of 16 books, editor of five others, and author of 250 journal papers. He has contributed extensively to the statistical design of nonlinear operators for image processing and the consequent application of pattern recognition theory to nonlinear image processing. His current research in genomic signal processing is aimed at diagnosis and prognosis based on genetic signatures and using gene regulatory networks to develop therapies based on the disruption or mitigation of aberrant gene function contributing to the pathology of a disease. Dr. Dougherty is a Fellow of SPIE, has received the SPIE President’s Award, and served as the Editor of the SPIE/IS&T Journal of Electronic Imaging. At Texas A&M University, he has received the Association of the Former Students Distinguished Achievement Award in Research. He was named Fellow of the Texas Engineering Experiment Station and Halliburton Professor of the Dwight Look College of Engineering.