Empirical Bayes linear regression with unknown model order

Digital Signal Processing 18 (2008) 236–248 www.elsevier.com/locate/dsp

Empirical Bayes linear regression with unknown model order ✩ Yngve Selén a,∗ , Erik G. Larsson b,1 a Department of Information Technology, Uppsala University, P.O. Box 337, SE-751 05 Uppsala, Sweden b School of EE, Communication Theory, Royal Institute of Technology, Osquldas väg 10, SE-100 44 Stockholm, Sweden

Available online 30 March 2007

Abstract

We study maximum a posteriori probability model order selection for linear regression models, assuming Gaussian distributed noise and coefficient vectors. For the same data model, we also derive the minimum mean-square error coefficient vector estimate. The approaches are denoted BOSS (Bayesian order selection strategy) and BPM (Bayesian parameter estimation method), respectively. In their simplest form, both BOSS and BPM require a priori knowledge of the distribution of the coefficients. However, under the assumption that the coefficient variance profile is smooth, we derive "empirical Bayesian" versions of our algorithms which estimate the coefficient variance profile from the observations and thus require little or no information from the user. We show in numerical examples that the estimators can outperform several classical methods, including the well-known AICc and BIC for model order selection.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Linear regression; Empirical Bayes; MMSE estimation; Parameter estimation; Order selection

1. Introduction

1.1. Problem formulation

Consider the linear regression model

y = Xh + ε,   (1)

where y ∈ R^N is the vector of observed data, X = [x_1 ... x_n] ∈ R^{N×n} is a known matrix of n regressors {x_j}_{j=1}^n, h = [h_1 ... h_n]^T ∈ R^n is the unknown vector of linear regression coefficients² (we will also sometimes call h the parameter vector), and ε ∼ N(0, σ²I) is a length-N vector of zero-mean Gaussian white noise with variance σ². We call (1) the full model and assume that the data are generated by a model of the form

M_k: y = X_k h_k + ε,   (2)
✩ This work was partly supported by the Swedish Research Council (VR). A conference version of this article has been accepted to ICASSP 2007. Parts of this paper were also presented at EUSIPCO 2006. * Corresponding author. Fax: +46 18 511925. E-mail addresses: [email protected] (Y. Selén), [email protected] (E.G. Larsson). 1 E.G. Larsson is a Royal Swedish Academy of Sciences Research Fellow supported by a grant from the Knut and Alice Wallenberg Foundation. 2 The empirical Bayesian estimators to be presented in Section 3 require n  N , but the theory presented in Sections 1 and 2 does not need this assumption.

1051-2004/$ – see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.dsp.2007.03.005


where n_min ≤ k ≤ n, X_k = [x_1 ... x_k] (i.e., X_k consists of the first k columns of X), and h_k = [h_1 ... h_k]^T. Furthermore, we make the assumption that the coefficients h_j are zero-mean independent Gaussian random variables, h_j ∼ N(0, γ_j²). In other words, h_k ∼ N(0, Γ_k), where Γ_k = diag[γ_1² ... γ_k²]. We assume that the model order k and the variances {γ_j²}_{j=1}^k, σ² are unknown. We consider the following two classical, interrelated problems:

(1) The model order selection problem: to find the correct order k, given X and y.
(2) The parameter estimation problem: to estimate h as accurately as possible when the order k is unknown.

1.2. Related work

Bayesian solutions to the above two problems, under the Gaussianity assumption on the coefficients and the noise, are available in the literature. In, e.g., [1], the maximum likelihood (ML) and maximum a posteriori (MAP) model order selection algorithms for the current model were derived, although not numerically evaluated. In [2], the minimum mean square error (MMSE) estimate of a frequency function was derived; within the same framework, it is easy to derive the MMSE estimate of h. In [3,4], derivations of the MAP model order selection algorithm and the MMSE estimate of h were presented. These derivations will be the basis of the empirical Bayes method that we propose, and they will be summarized in Sections 2.1 and 2.2.

Classical (i.e., non-Bayesian, or "orthodox") approaches to model selection include the information criteria, which will be used for comparative purposes in our numerical examples in Section 4. However, they are not based on the Bayesian paradigm and are thus less related to the current approach than the methods discussed above. A short review of the information criterion approach is given in Appendix A.

The problem of determining the number of signal components in a mixture is fundamentally important also for many other types of data models: for example, in array processing for determining the number of sources impinging on an array of sensors [5], or in line spectrum analysis for determining the number of sinusoids in noise [6]. In the present article, however, we are only concerned with the linear regression data model (1).

1.3. Contributions of this work

For the Bayesian approaches in the above subsection it is generally assumed that the noise variance σ² and the coefficient variances {γ_j²}_{j=1}^n are known. This assumption is hardly realistic in applications. The goal of the present article is to present methods which do not require knowledge of σ² and {γ_j²}_{j=1}^n. To this end we take an empirical Bayes approach: we estimate σ², {γ_j²}_{j=1}^n from the data and then use the resulting estimates as if they were the true values; see Section 3.

Note that the models in (2) that we consider are nested, in the sense that a lower order model can be obtained as a special case of a higher order model (by setting certain coefficients to zero). This type of model is common in signal processing applications, such as finite-impulse-response (FIR) filter identification or the estimation of polynomial coefficients. By way of contrast, it is common in statistical data analysis to consider sparse models, for which the true model can consist of any subset of the regressors {x_k}_{k=1}^n. We have treated linear regression for sparse models in [3] where we, among other things, proposed an empirical Bayesian technique tailored to sparse models. However, for the nonsparse unknown model order problem, the technique in [3] is less sophisticated and flexible than the one we provide in this paper.

2. Bayesian model order selection and parameter estimation

2.1. Optimal model order selection

Here we review the MAP probability model order selection algorithm for the problem posed in Section 1, assuming known σ², {γ_j²}_{j=1}^n. This model selection rule has been derived previously in [1]. In the remainder of the paper this specific model selection algorithm will be denoted by BOSS (Bayesian order selection strategy).
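Before deriving the selection rule, the following is a minimal NumPy sketch of how a data realization of the form (1)-(2) could be generated; the problem sizes, the noise level, the variance profile gamma2 and the random seed are hypothetical choices made only for illustration.

    import numpy as np

    # Hypothetical setup: h_j ~ N(0, gamma_j^2) for j <= k, noise ~ N(0, sigma2*I).
    rng = np.random.default_rng(0)
    N, n, n_min = 50, 30, 1
    sigma2 = 0.1                                   # noise variance sigma^2
    gamma2 = np.exp(-0.05 * np.arange(n))          # assumed smooth variance profile gamma_j^2
    X = rng.standard_normal((N, n))                # full regressor matrix, model (1)
    k_true = rng.integers(n_min, n + 1)            # unknown true order of model M_k in (2)
    h = np.zeros(n)
    h[:k_true] = rng.normal(0.0, np.sqrt(gamma2[:k_true]))
    y = X[:, :k_true] @ h[:k_true] + np.sqrt(sigma2) * rng.standard_normal(N)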


Using Bayes' theorem we obtain an expression for the model posterior probabilities:

P(M_k | y) = P(M_k) p(y | M_k) / p(y).

Since p(y) is independent of the model M_k, the model order which gives the highest posterior probability is

k̂_MAP = arg max_{k = n_min, ..., n} P(M_k) p(y | M_k).   (3)

If nothing is known about the model prior probabilities P(M_k), we will assume that they are equal (this is common practice [7]) and, of course, that they sum up to one:

P(M_k) = 1 / (n − n_min + 1),   k = n_min, ..., n.   (4)

Furthermore, under the assumption that M_k is the data generating model we have y | M_k ∼ N(0, Q_k), where

Q_k = X_k Γ_k X_k^T + σ²I,   Γ_k = diag[γ_1² ... γ_k²].

So,

p(y | M_k) = (2π)^{−N/2} |Q_k|^{−1/2} exp( −(1/2) y^T Q_k^{−1} y ).   (5)

We obtain BOSS by using (4) and (5) to compute (3). One can show (see, e.g., the discussion around Eq. (37) in [8]) that the order with the highest a posteriori probability is also the most likely to be the correct order.

Note that (using the identity |I + AB| = |I + BA|)

|Q_k| = |X_k Γ_k X_k^T + σ²I| = σ^{2N} |Γ_k X_k^T X_k / σ² + I|   (6)

and (using the matrix inversion lemma)

Q_k^{−1} = (X_k Γ_k X_k^T + σ²I)^{−1} = (1/σ²) I − (1/σ⁴) X_k (Γ_k^{−1} + (1/σ²) X_k^T X_k)^{−1} X_k^T.   (7)

The expressions in (6) and (7) can be used to boost the computational efficiency: X_k^T X_k is much faster to evaluate than X_k X_k^T if N ≫ n (N ≥ n will be a necessary condition for our empirical Bayesian methods to work; see Section 3). One should also exploit the fact that Γ_k is diagonal. The computational complexity of (6) can be further reduced by noting that

X_k^T X_k = [X_{k−1}  x_k]^T [X_{k−1}  x_k] = [ X_{k−1}^T X_{k−1},  X_{k−1}^T x_k ;  x_k^T X_{k−1},  x_k^T x_k ]

and that X_{k−1}^T X_{k−1} is evaluated when computing p(y | M_{k−1}). Also, a well-known result on the inverse of partitioned matrices (see, e.g., Lemma A.2 in [9]) can be used to compute (7) iteratively. Define Z_k = (Γ_k^{−1} + (1/σ²) X_k^T X_k). Then

Z_k^{−1} = [ Z_{k−1}^{−1},  0_{k−1} ;  0_{k−1}^T,  0 ]
         + [ −(1/σ²) Z_{k−1}^{−1} X_{k−1}^T x_k ;  1 ] [ −(1/σ²) x_k^T X_{k−1} Z_{k−1}^{−1},  1 ] / ( 1/γ_k² + x_k^T x_k/σ² − (1/σ⁴) x_k^T X_{k−1} Z_{k−1}^{−1} X_{k−1}^T x_k ),

so the matrix inversion in (7) needs to be computed only once, namely for k = n_min.
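To illustrate this subsection, the following is a minimal NumPy sketch of BOSS for known σ² and {γ_j²}: it evaluates log p(y | M_k) from (5) directly (without the more efficient recursions (6)-(7)) and selects the MAP order as in (3)-(4). The function names and the direct O(N³) evaluation are our own illustrative choices, not the paper's implementation.

    import numpy as np

    def log_evidence(y, X, gamma2, sigma2, k):
        """log p(y | M_k) for the Gaussian model (5), evaluated directly."""
        Xk = X[:, :k]
        Qk = Xk @ np.diag(gamma2[:k]) @ Xk.T + sigma2 * np.eye(len(y))
        _, logdet = np.linalg.slogdet(Qk)
        quad = y @ np.linalg.solve(Qk, y)
        return -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet + quad)

    def boss(y, X, gamma2, sigma2, n_min=1):
        """MAP order selection (3); under the uniform prior (4) this is argmax_k p(y | M_k)."""
        n = X.shape[1]
        orders = np.arange(n_min, n + 1)
        logev = np.array([log_evidence(y, X, gamma2, sigma2, k) for k in orders])
        return orders[np.argmax(logev)]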


2.2. The MMSE estimate of h

In this section we describe what we will denote the BPM (Bayesian parameter estimation method) by deriving the MMSE estimate of h under the assumption that one of the models {M_k}_{k=n_min}^n in (2) generated the data. The MMSE estimate equals the conditional mean [9]:

ĥ_MMSE = E[h | y] = Σ_{k=n_min}^n P(M_k | y) E[h | y, M_k] = ( Σ_{k=n_min}^n P(M_k) p(y | M_k) E[h | y, M_k] ) / ( Σ_{k=n_min}^n P(M_k) p(y | M_k) ),   (8)

where we used

P(M_k | y) = P(M_k) p(y | M_k) / p(y) = P(M_k) p(y | M_k) / ( Σ_{j=n_min}^n P(M_j) p(y | M_j) ).

We can use (4) and (5) in (8). What remains to evaluate (8) and get the BPM is then to compute E[h | y, M_k]. Clearly, assuming that the model M_k generated the data, h_j = 0 for j > k, so it is sufficient to find E[h_k | y, M_k]. Under M_k, h_k and y are jointly Gaussian:

[ y ;  h_k ] | M_k ∼ N( 0, [ Q_k,  X_k Γ_k ;  Γ_k X_k^T,  Γ_k ] ).

Applying a standard result (Lemma B.17 in [9], for example), the conditional mean evaluates to

E[h_k | y, M_k] = Γ_k X_k^T Q_k^{−1} y.   (9)
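Correspondingly, a minimal sketch of the resulting BPM estimator obtained from (8) and (9), again for known σ² and {γ_j²} and reusing the hypothetical log_evidence helper from the sketch in Section 2.1, might look as follows.

    import numpy as np

    def bpm(y, X, gamma2, sigma2, n_min=1):
        """MMSE estimate (8): posterior-weighted sum of the conditional means (9), zero-padded to length n."""
        n = X.shape[1]
        orders = np.arange(n_min, n + 1)
        logev = np.array([log_evidence(y, X, gamma2, sigma2, k) for k in orders])
        w = np.exp(logev - logev.max())
        w /= w.sum()                                   # P(M_k | y) under the uniform prior (4)
        h_mmse = np.zeros(n)
        for wk, k in zip(w, orders):
            Xk = X[:, :k]
            Qk = Xk @ np.diag(gamma2[:k]) @ Xk.T + sigma2 * np.eye(len(y))
            h_mmse[:k] += wk * (np.diag(gamma2[:k]) @ Xk.T @ np.linalg.solve(Qk, y))   # (9)
        return h_mmse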

We thus obtain the BPM by inserting (4), (5) and (9) into (8).

2.3. A note on complexity

The computational costs of BOSS and BPM are dominated by the determinant in (6) (an O(k³) operation) and the matrix inverse in (7) (an O(kN²) operation). Both (6) and (7) have to be computed for k = n_min, ..., n, so the total complexity of BOSS and BPM is O(mn(n² + N²)), where m = n − n_min + 1 is the total number of models considered. For comparison, the classical information criterion approach to model order selection (see Appendix A) has, in the scenario considered here, a computational complexity of O(mn²(N + n)).

3. Empirical Bayesian estimation

The weakness of BOSS and BPM is that they require knowledge of {γ_j²}_{j=1}^n and σ². In practice, these variances are likely to be unknown. This problem can be dealt with in different ways. One possibility is to assign hierarchical distributions to the unknown variances and set the resulting hyperparameters to some values. For example, one may assume that σ² ∼ D(a), where D denotes some distribution and a is a hyperparameter. The resulting expressions, corresponding to (3) and (8), can typically not be evaluated in closed form. Thus, one has to resort to numerical approaches, such as Markov chain Monte Carlo (MCMC) methods. For a related example, see [10]. This type of numerical approach is generally computationally intensive. Furthermore, it is not clear at what hierarchical level one should stop the hyperparameterization or what values should be assigned to the hyperparameters at the final level: in the above example, should a be set to a specific value (and if so, to what value?), or should we assign yet another distribution for a?

Instead, we propose to use an empirical Bayes approach, in which the unknown prior (hyper)parameters are estimated from the available data. The so-obtained estimates of {γ_j²}_{j=1}^n, σ² can then be used for inference, i.e., be inserted into (3) and (8), as if they were the true values. This is somewhat against the true spirit of Bayesian statistics, as the variances {γ_j²}_{j=1}^n, σ² are a priori parameters, which should not depend on the data. Nevertheless, estimation of {γ_j²}_{j=1}^n, σ² appears to be an attractive, pragmatic way of handling the situation when these parameters are completely unknown.


The idea behind the empirical Bayes methods to be presented herein is to replace rather specific a priori knowledge about the data (the true values of {γ_j²}_{j=1}^n, σ²) by much less specific a priori knowledge. To this end, we will base our estimates of {γ_j²}_{j=1}^n, σ² on the ML/LS (least squares) estimator

ĥ_ML,k = ĥ_LS,k = (X_k^T X_k)^{−1} X_k^T y   (10)

of the full model (1) (i.e., k := n in (10)). Note that (10) requires N ≥ n when k = n and that X is full rank. These conditions are necessary for our empirical Bayes methods to work.

The estimation of σ² is straightforward: an unbiased, consistent estimate of σ² can be obtained by taking [4]

σ̂² = (1/(N − n)) ||y − X ĥ_LS,n||².   (11)

The estimation of {γ_j²}_{j=1}^n is more challenging. There are two distinct problems that we need to cope with. First, only a single data realization is available for estimation of the variances {γ_j²}_{j=1}^n. In fact, h is not directly available but has to be estimated, as in (10), from y. To put it differently, only one estimated sample ĥ_j is available (note that we assumed the coefficients in h to be independent) for the estimation of the variance γ_j². This makes the problem of estimating {γ_j²}_{j=1}^n from y ill-posed. Second, for some coefficients, there may be no data at all available for the estimation of their variance. Indeed, if the true order is k, there are no data available for estimation of γ_j², j > k, since then h_j = 0. If this problem were disregarded we would likely end up with severely underestimated γ̂_j² for j > k, since the estimates would be based on ĥ_j ≈ 0. This would give a high risk of overestimating the model order and in turn result in poor estimates of h_j, j > k.

From the above reasoning, it is obvious that we need to introduce some a priori information on the variances {γ_j²}_{j=1}^n. In [4] it was assumed that γ_j² = γ², j = 1, ..., n. Herein, we take a more general approach, as detailed below. We also note that any estimates of {γ_j²}_{j=1}^n are likely to deviate rather much from the true values. Fortunately, BOSS and BPM appear to be relatively robust against mismatched values of {γ_j²}_{j=1}^n (see, e.g., [4] and the numerical examples herein).

To cope with the first of the above-mentioned problems we make the (in many applications reasonable) assumption that {γ_j²}_{j=1}^n is a smooth sequence, i.e., that γ_j² − γ_{j−1}² does not vary a lot with j. In the next subsections we describe two ways of implementing this assumption. Our proposed approaches require little or no a priori information.

To cope with the second problem above, we impose a threshold b, such that always γ̂_j² ≥ b > 0. We select the threshold b as a fraction of the largest squared-magnitude element in the full model LS estimate ((10) with k := n), i.e.,

b := b_f max{ |ĥ_LS,1|², |ĥ_LS,2|², ..., |ĥ_LS,n|² }.   (12)

As a rule of thumb, we recommend b_f = 0.1. The parameter b_f can be viewed as a priori knowledge expressing the difference in magnitude between the smallest and largest variance of the linear regression coefficients. It can also be viewed as a robustification parameter which guarantees that none of the variances {γ_j²}_{j=1}^n will be severely underestimated. This suggested approach is clearly ad hoc, but more systematic ways of dealing with the problem are not obvious. Also, we have found in simulations that the choice of b_f is not critical for performance (see Section 4). This is probably related to the relative insensitivity of BOSS/BPM to mismatches in {γ_j²}_{j=1}^n, as mentioned above.
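As a small illustration of (10)-(12), a sketch of the full-model LS estimate, the noise variance estimate and the threshold b (with a hypothetical function name and the recommended default b_f = 0.1) could be:

    import numpy as np

    def ls_sigma2_threshold(y, X, b_f=0.1):
        """Full-model LS estimate (10) with k := n, noise variance estimate (11), and threshold b from (12)."""
        N, n = X.shape
        h_ls, *_ = np.linalg.lstsq(X, y, rcond=None)         # (10), requires N >= n and full-rank X
        sigma2_hat = np.sum((y - X @ h_ls) ** 2) / (N - n)    # (11)
        b = b_f * np.max(h_ls ** 2)                            # (12)
        return h_ls, sigma2_hat, b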
We will now describe two different ways of incorporating smoothness constraints on {γ_j²}_{j=1}^n.

3.1. Estimation of {γ_j²}_{j=1}^n by parameterization

In this approach we assume that {γ_j²}_{j=1}^n can be expressed as a linear combination of a relatively small number of known basis vectors, as follows:

γ = Ψα,   (13)

where γ = [γ_1² ... γ_n²]^T, Ψ is a known n × r matrix and α is an unknown vector of length r. The columns of Ψ should be a basis for "likely" variance profiles for {γ_j²}_{j=1}^n and generally r should be small compared to n. (In a communication channel identification application, e.g., it might be known that {γ_j²}_{j=1}^n is exponentially decaying [11], and Ψ could then contain one or a few columns with different degrees of exponential decay.)


If nothing at all is known about {γ_j²}_{j=1}^n, then one possibility is to construct Ψ from the first r basis functions in a discrete Fourier series expansion of γ (in this case, r is a user parameter):

[Ψ]_{i,1} = 1,
[Ψ]_{i,j} = cos( (j/2) (i − 1) 2π/n ),          j even,
[Ψ]_{i,j} = sin( ((j − 1)/2) (i − 1) 2π/n ),    j odd and j ≥ 3,   (14)

where i = 1, ..., n, j = 1, ..., r and r is odd. Using a truncated Fourier series as basis may seem ad hoc at first glance, but we find it attractive for many reasons: the basis functions are orthogonal and the resulting variance profile is smooth. Additionally, by choosing the number of basis functions r, we can directly influence the amount of variation in the variance profile γ.

Now, the problem is reduced to that of estimating α. Thus, the dimensionality of the problem is reduced from n to r. Note that

E[|h_j|²] = Σ_{k=n_min}^n P(M_k) E[|h_j|² | M_k] = q_j γ_j²,   (15)

where q_j = Σ_{k=max{j, n_min}}^n P(M_k), because under M_k, h_j = 0 for j > k, and h_j has variance γ_j² if j ≤ k. By replacing E[|h_j|²] in the above equation by the squared magnitude of the corresponding element of the full model LS estimate ((10) with k := n), |ĥ_LS,j|², and using (13), we can estimate α from the following constrained LS problem:

α̂ = arg min_α || [ |ĥ_LS,1|²  ...  |ĥ_LS,n|² ]^T − [ q_1 Ψ_{1,1} ⋯ q_1 Ψ_{1,r} ;  ⋮ ;  q_n Ψ_{n,1} ⋯ q_n Ψ_{n,r} ] α ||²   s.t.   Ψα ≥ 0.   (16)

We then get γ̂ by inserting α̂ in (13). The constraint in (16) guarantees that all γ̂_j² ≥ 0. The above problem can be easily and efficiently solved by quadratic programming (use, e.g., quadprog in Matlab).

One might ask why the threshold b is not used already in (16) (and (17) below). The reason is that the thresholds γ ≥ 0 and γ̂_j² ≥ b serve different purposes: γ ≥ 0 used in (16) and (17) takes care of the physical requirement that the variances must be non-negative, whereas the threshold b is a way to cope with the lack of data for estimation of {γ̂_j²}_{j=k+1}^n, where k is the true order (see the discussion above). In this manner, we first estimate the variances using the constraint γ ≥ 0, and then we impose the threshold b to make sure that no element in γ̂ is severely underestimated (see Section 4, Example 3). We found that this way of imposing thresholds on γ̂ gave the best numerical performance.
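The following sketch illustrates this parameterization approach: it builds the truncated Fourier basis (14), forms the weights q_j from (15) under the uniform prior (4), solves the constrained LS problem (16) with a general-purpose solver (SLSQP, used here as a stand-in for a dedicated QP solver), and finally applies (13) and the threshold b from (12). The function names and the solver choice are our own illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def fourier_basis(n, r=3):
        """Truncated Fourier basis (14); r is odd."""
        i = np.arange(n)                                   # corresponds to (i - 1) in (14)
        Psi = np.ones((n, r))
        for j in range(2, r + 1):                          # column index j = 2, ..., r
            if j % 2 == 0:
                Psi[:, j - 1] = np.cos((j // 2) * i * 2.0 * np.pi / n)
            else:
                Psi[:, j - 1] = np.sin(((j - 1) // 2) * i * 2.0 * np.pi / n)
        return Psi

    def order_weights(n, n_min=1):
        """q_j from (15) under the uniform model prior (4)."""
        j = np.arange(1, n + 1)
        return (n - np.maximum(j, n_min) + 1) / (n - n_min + 1)

    def estimate_gamma2_parametric(h_ls, Psi, q, b):
        """Solve (16) for alpha subject to Psi @ alpha >= 0, then apply (13) and the threshold b."""
        target = h_ls ** 2                                 # |h_LS,j|^2, stand-in for E[|h_j|^2]
        A = q[:, None] * Psi                               # rows q_j * Psi_{j,:}
        cons = {"type": "ineq", "fun": lambda a: Psi @ a}  # gamma = Psi @ alpha >= 0
        a0 = np.full(Psi.shape[1], target.mean())
        res = minimize(lambda a: np.sum((target - A @ a) ** 2), a0,
                       method="SLSQP", constraints=[cons])
        return np.maximum(Psi @ res.x, b)                  # gamma_hat from (13), floored at b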

3.2. Estimation of {γ_j²}_{j=1}^n by penalization

The problem we would like to solve (that of estimating {γ_j²}_{j=1}^n using the LS estimate ĥ_LS,n from a single data realization) is ill-posed. Tikhonov regularization [12] is a commonly used method for solving such problems. This regularization consists of an additive penalty term which depends on the parameter of interest (see below). This penalty can be designed to, e.g., shrink the estimate toward zero, or limit its variability. Using the a priori assumption that {γ_j²}_{j=1}^n is a smooth sequence, we impose a penalty on its second-order difference (using higher order differences is also possible) and estimate it from

γ̂ = arg min_{γ ≥ 0} || γ − [ |ĥ_LS,1|²  ...  |ĥ_LS,n|² ]^T ||² + λ ||Lγ||².   (17)

Here

L = [ 1  −2  1  0  ⋯  0 ;  0  1  −2  1  ⋯  0 ;  ⋯ ;  0  ⋯  0  1  −2  1 ] ∈ R^{(n−2)×n}


is the second-order difference matrix and λ ≥ 0 is a user parameter which determines the amount of smoothing.³ Equation (17) can be efficiently solved using quadratic programming. λ should be set to a low value if {γ_j²}_{j=1}^n is believed to have large variations, and to a high value if {γ_j²}_{j=1}^n is believed to be very smooth (the extreme case λ = ∞ forces γ̂_j² − γ̂_{j−1}² to be constant with j). Alternatively, λ can be selected in a fully automatic manner (i.e., without any user parameters) by using, e.g., generalized cross-validation (GCV) [13]. An efficient implementation of the GCV for the problem under consideration is given in [14].
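For illustration, (17) can be rewritten as a nonnegativity-constrained linear least-squares problem and solved with, e.g., scipy.optimize.lsq_linear. The sketch below (a hypothetical function, with a user-chosen λ rather than GCV) applies the threshold b from (12) afterwards, as described above.

    import numpy as np
    from scipy.optimize import lsq_linear

    def estimate_gamma2_penalized(h_ls, lam, b):
        """Solve (17): min_{gamma >= 0} ||gamma - t||^2 + lam * ||L gamma||^2, with t_j = |h_LS,j|^2."""
        n = len(h_ls)
        L = np.zeros((n - 2, n))
        for i in range(n - 2):
            L[i, i:i + 3] = [1.0, -2.0, 1.0]               # second-order difference operator
        A = np.vstack([np.eye(n), np.sqrt(lam) * L])        # stacked LS form of the two terms in (17)
        t = np.concatenate([h_ls ** 2, np.zeros(n - 2)])
        gamma2 = lsq_linear(A, t, bounds=(0.0, np.inf)).x   # gamma >= 0
        return np.maximum(gamma2, b)                         # then impose the threshold b from (12)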

3.3. Multiple coefficient vector realizations

As mentioned in the beginning of this section, the main difficulty in estimating {γ_j²}_{j=1}^n comes from the fact that only one realization of h is available. However, if m realizations of h, {h^{(i)}}_{i=1}^m, were available, with the same {γ_j²}_{j=1}^n and σ² (the regressor matrices, however, need not be the same) and of the (unknown) orders {k^{(i)}}_{i=1}^m, the smoothing approach described in the above two subsections may not be necessary. Assuming m such realizations of h are available, γ_j² can be estimated using (15), where we replace E[|h_j|²] by the mean of the squared magnitudes of the full model LS estimates' ((10) with k := n) jth elements:

γ̂_j² = ( (1/m) Σ_{i=1}^m |ĥ^{(i)}_LS,j|² ) / ( Σ_{k=max{j, n_min}}^n P(M_k) ),   j = 1, ..., n.   (18)

If smoothing or thresholding of the variance profile is desired (e.g., if m is relatively small), it is straightforward to use the above parameterization or penalization approach directly on the estimates from (18). Similarly, to estimate σ² one can simply average the estimates from (11) for the different realizations of h:

σ̂² = (1/m) Σ_{i=1}^m (1/(N − n)) ||y^{(i)} − X^{(i)} ĥ^{(i)}_LS,n||²,   (19)

where X^{(i)} is the regressor matrix for the ith realization of h.

4. Numerical examples

We evaluate the performances of the methods by means of Monte Carlo simulations. The performance of BOSS (3) is measured in terms of the percentage of correctly selected orders,

(1/M) Σ_{m=1}^M δ_{k̂^{(m)} − k^{(m)}} × 100 [%],

where δ_t denotes the discrete-time unit impulse and k̂^{(m)}, k^{(m)} denote the estimated and true orders for realization number m, respectively. For the evaluation of BPM (8) we use the empirical MSE of the coefficient estimates,

(1/M) Σ_{m=1}^M || ĥ^{(m)} − h^{(m)} ||²,

where ĥ^{(m)} and h^{(m)} denote the estimated and true coefficient values, respectively, for realization number m (if the vectors have different lengths, they can be zero-padded to the same length). M is the total number of Monte Carlo runs and we choose M = 10⁵.

For each Monte Carlo trial we generate data from a model M_k where the order k is chosen uniformly at random between n_min = 1 and n = 30 (these limits are supplied to the estimators). We set N = 50. The true variance profiles {γ_j²}_{j=1}^n are constructed from (13), where the first column of Ψ consists of ones, [1 ... 1]^T, the second is exponentially decaying, [ν_1 e^{−0.1} ... ν_1 e^{−0.1n}]^T, and the third is exponentially increasing, [ν_2 e^{0.1} ... ν_2 e^{0.1n}]^T. The normalization factors ν_1 and ν_2 are set such that all columns in Ψ have a 2-norm equal to n. The vector α has independent squared N(0, 1) elements (i.e., each element is generated from a χ²(1) distribution).⁴ Finally, the so generated {γ_j²}_{j=1}^n are normalized such that ||γ||² = 1. Naturally, these choices are somewhat arbitrary (as any specific numerical example

³ By looking at the matrix (I + λL^T L)^{−1} in the closed-form solution of the unconstrained version of (17), γ̂_unconstr = (I + λL^T L)^{−1} [ |ĥ_LS,1|²  ...  |ĥ_LS,n|² ]^T, one can get an idea of how λ affects the smoothing.
⁴ This means that α has positive elements. The constraint α ≥ 0 is stronger than the constraint γ = Ψα ≥ 0 used in (16) and (17). (Note that α ≥ 0 implies that γ = Ψα ≥ 0 for the Ψ we consider.) However, we choose not to exploit the prior knowledge that α ≥ 0 when using our empirical Bayesian methods described in Section 3, even if doing so would likely increase their performance. The reason for not exploiting this information is that we want to avoid tailoring the data in our numerical examples too much to our methods.


Fig. 1. Model order selection performance for uncorrelated regressors. (The small plot is a closeup.)

has to be), but having compared with other examples we believe that it shows a fair comparison of the considered methods.

We consider the following methods: (a) the well-known corrected (for short data sequences) information criterion by Akaike (AICc) [15] for order selection, and LS (10) with the AICc order for parameter estimation; (b) the well-known Bayesian information criterion (BIC) [16] for order selection, and LS (10) with the BIC order as well as the BIC model weighted estimate (A.2) (see Appendix A) for parameter estimation; (c) BOSS/BPM (3)/(8) with knowledge of the true {γ_j²}_{j=1}^n, σ²; (d) empirical BOSS/BPM using σ̂² from (11) and estimating {γ_j²}_{j=1}^n as in Section 3.1 with the true Ψ; (e) empirical BOSS/BPM using σ̂² from (11) and estimating {γ_j²}_{j=1}^n as in Section 3.1 with Ψ consisting of the first r = 3 discrete Fourier basis functions from (14); (f) empirical BOSS/BPM using σ̂² from (11) and estimating {γ_j²}_{j=1}^n as in Section 3.2 with λ chosen using GCV [14]. A short review of the information criteria AIC, AICc and BIC and of BIC model weighting can be found in Appendix A.

For all empirical Bayes methods we use b_f = 0.1 in (12) to compute the lower threshold b for {γ̂_j²}_{j=1}^n, such that all γ̂_j² < b from (16) and (17) are set to b instead. In simulations not detailed here we observed that the exact choice of b_f is not critical for the performance.⁵

4.1. Example 1: Uncorrelated regressors

First we consider an example in which all elements of X are i.i.d. N(0, 1). In Fig. 1 we study the model order selection performances and in Fig. 2 we study the coefficient estimation performances. We observe that the empirical BOSS/BPM methods give a better performance than AICc and BIC do (generally about 0.5 to 1 dB better, and sometimes significantly more). Also, the differences between the different empirical Bayesian methods are very small. When estimating σ² and {γ_j²}_{j=1}^n, the model selection performance drops about 1 dB below that of the optimal BOSS where these parameters are known. The corresponding loss in coefficient estimation performance for empirical BPM is about 0.3 dB.

For this example we also show, in Fig. 3, two representative realizations of the coefficient variances {γ_j²}_{j=1}^n together with their estimates, obtained using the approaches in Section 3 when σ² = −10 dB. As expected, the estimated variances sometimes deviate rather much from their true values (especially for indices larger than the true order k),

⁵ b_f = 0.01, 0.05, and 0.3 gave results similar to those shown here.


Fig. 2. Coefficient estimation performance for uncorrelated regressors. (The small plot is a closeup.)

Fig. 3. Estimation of γ for two sample data realizations in Example 1 with σ² = −10 dB (plotted as lines with squares and lines with x-marks, respectively). The true order for the first data realization (lines with squares) was k = 8, and for the second data realization (lines with x-marks) the true order was k = 20.

which reduces performance as compared to BOSS/BPM using the true variances. However, some features of the variance profiles are captured by the variance estimates, and this is enough for the empirical BOSS/BPM methods to outperform AIC/BIC.

4.2. Example 2: Correlated regressors

We now let the regressors be correlated, such that x_i = (z + v_i)/√2, where z and {v_i}_{i=1}^n are i.i.d. N(0, I) vectors. Thus, E{x_i x_j^H} = [(1 + δ_{i−j})/2] I. In Fig. 4 we study the model order selection performances and in Fig. 5 we study


Fig. 4. Model order selection performance for correlated regressors. (The small plot is a closeup.)

Fig. 5. Coefficient estimation performance for correlated regressors. (The small plot is a closeup.)

the coefficient estimation performances. Also for this example, the empirical versions of BOSS and BPM are able to improve upon BIC and AICc. The different empirical Bayesian methods show similar performance: for model selection, about 1 dB below the optimal BOSS, which exploits full knowledge of σ² and {γ_j²}_{j=1}^n, and for coefficient estimation about 0.1 to 0.4 dB below the optimal BPM. The BIC model weighting approach (A.2) for coefficient estimation improves by about 0.2 dB upon LS using the BIC model order.

4.3. Example 3: Variance profile mismatch

We also study how BOSS and BPM are affected by mismatch in the variance profile γ. This is interesting because, as we discussed in Section 3 and as can be seen in Fig. 3, γ is difficult to estimate accurately.


Fig. 6. Variance profile mismatch. Note the scale on the y-axis.

The data are generated as in Example 1 (with uncorrelated regressors). However, this time we supply BOSS and BPM with mismatched γ-values such that γ_used = c·γ_true, where c is varied. The empirical Bayesian variants, included for comparison, estimate γ from the data and are thus unaffected by c. The results for order selection are shown in Fig. 6. The coefficient estimation results are similar and are not shown here.

We remark that the performances of BOSS/BPM are rather robust to mismatched γ-values. Furthermore, they seem less sensitive to the use of too large values for γ than to the use of too small γ-values. This observation motivates the use of the threshold b as a "robustification" parameter, which serves the purpose of avoiding (harmful) severely underestimated values in γ.

4.4. Example 4: Multiple coefficient vector realizations

Finally, we consider the same setting as in Example 1, but this time with 10 coefficient vector realizations for each variance profile γ. We use the methodology described in Section 3.3 to compute the empirical Bayes BOSS and BPM. In Fig. 7 we show the model selection results. Compared to Example 1, the empirical BOSS has improved: it now performs about 0.2 dB away from the optimal BOSS (in Example 1 the difference was about 1 dB). The corresponding coefficient estimation plot is omitted for brevity, but is relatively similar to Fig. 2. The empirical BPM now performs less than 0.1 dB below the optimal BPM method.

5. Conclusions

We have studied linear regression with an unknown model order, assuming zero-mean Gaussian noise and a zero-mean Gaussian distributed coefficient vector. Under this model we have derived empirical Bayesian versions of the maximum a posteriori probability model order selector and the MMSE coefficient vector estimate. These empirical Bayesian methods have been shown to outperform the classical approaches AICc and BIC, both in model order selection and in coefficient vector estimation examples. Results similar to those shown have also been obtained from many other numerical examples, which have to be omitted here due to space constraints. Since our methods have relatively low computational complexity and show good performance, we consider them attractive alternatives in the context of the model (2). At www.it.uu.se/katalog/ys/software/empBOSS_BPM we have posted a free software implementation (in Matlab) of the methods presented here.

Appendix A. Information criteria for model order selection

A classical and commonly used group of criteria for model selection is the information criterion (IC) group. In our numerical examples in Section 4 we use some of the IC described here for comparison with BOSS and BPM.


Fig. 7. Model order selection performance when there are 10 coefficient vector realizations for each γ . (The small plot is a closeup.)

There exist many different IC [17], but the most frequently used are the IC of Akaike (AIC) [18] and the Bayesian IC (BIC) [16]. The IC can be derived by minimizing the Kullback–Leibler (KL) discrepancy function [19] between the probability density function (pdf) of the data, p(y), and the pdf of a model of the data, p̂(y):

D(p, p̂) = ∫ p(y) log( p(y) / p̂(y) ) dy.

The model for which this discrepancy (or "distance") is minimized is selected. However, the KL discrepancy cannot be directly calculated since many of the required quantities are unavailable [8] (e.g., p(y) is unknown). To overcome these problems, approximations and asymptotic theory must be used. Depending on the types of assumptions and approximations, various IC can be obtained.

The IC have a common form: they select the model which minimizes

−2 ln p(y | ĥ_k,ML) + kη(·),   (A.1)

where p(y | ĥ_k,ML) is the pdf of the measured data given the ML parameter vector estimate of order k, ĥ_k,ML ((10) for the model (2)). The first term in (A.1) is a measure of how well the model fits the data; by adding more parameters the model will fit better and the value of this term will decrease. For the linear regression problem (2), the first term equals [8]

−2 ln p(y | ĥ_k,ML) = N ln σ̂_k² + constant,

where σ̂_k² = ||y − X_k ĥ_k,ML||²/N is the ML estimate of the noise variance. The second term in (A.1), kη(·), is usually called a "penalty" term and it increases with increasing k. (The function η(·) is specific to the particular IC being used; it typically depends on k and N.) The role of this penalty is to penalize models with many parameters (which have a low value of −2 ln p(y | ĥ_k,ML)). In other words, it ensures that a simple model is preferred unless a more complex model fits the data significantly better. The function η(·) takes on the values η = 2 for AIC and η(N) = ln N for BIC.

All information criteria we are aware of, and which fit into this framework, are asymptotic in the sense that they are derived assuming N → ∞. This clearly limits the use of the IC for the most difficult model order selection problems, namely those where only a few data samples are available. However, there exists a bias-corrected small-sample version of AIC, AICc [20], which is applicable for certain types of models, such as the linear regression models under study. It uses η(k, N) = 2N/(N − k − 2).
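As an illustration of (A.1) for the linear regression model (2), the following sketch selects the order minimizing the AIC, AICc or BIC cost using the ML/LS fit; the function name and loop structure are our own.

    import numpy as np

    def ic_order(y, X, criterion="bic", n_min=1):
        """Select the order minimizing N*ln(sigma_hat_k^2) + k*eta, cf. (A.1)."""
        N, n = X.shape
        best_k, best_val = n_min, np.inf
        for k in range(n_min, n + 1):
            Xk = X[:, :k]
            h_ml, *_ = np.linalg.lstsq(Xk, y, rcond=None)
            sigma2_ml = np.sum((y - Xk @ h_ml) ** 2) / N      # ML noise variance estimate
            eta = {"aic": 2.0,
                   "aicc": 2.0 * N / (N - k - 2),
                   "bic": np.log(N)}[criterion]
            val = N * np.log(sigma2_ml) + k * eta
            if val < best_val:
                best_k, best_val = k, val
        return best_k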


The IC are often used in parameter estimation problems when the model order is unknown. Then, the order is first selected with an IC, and the ML parameter estimate for that model order is used. Note, however, that this ML parameter estimate was obtained without taking the model uncertainty into consideration [17]. This generally degrades the accuracy of the parameter estimate, although the decrease is often quite small [21]. An alternative to the above approach is multi-modeling, or model weighting [17,21], where the model uncertainty is accounted for. In this approach the ML estimates of the different models are averaged in a weighted manner, the weights being estimates of the model posterior probabilities:

ĥ = Σ_{k=n_min}^n P̂(M_k | y) ĥ_ML,k.   (A.2)

The model posterior probabilities P(M_k | y) can be estimated using, e.g., AIC or BIC (different priors P(M_k) give different IC):

P̂(M_k | y) = e^{−XIC_k/2} / Σ_{j=n_min}^n e^{−XIC_j/2},   (A.3)

where XIC_j is the value of the chosen IC (such as AIC or BIC) for the model M_j. A uniform prior, P(M_j) = p, ∀j, leads to BIC in (A.3). More detailed reviews of IC-based model selection criteria can be found in [8,22].
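A minimal sketch of BIC model weighting according to (A.2)-(A.3), used as comparison method (b) in Section 4 (the function name is ours), could be:

    import numpy as np

    def bic_model_weighting(y, X, n_min=1):
        """Model-weighted estimate (A.2) with BIC-based posterior weights (A.3)."""
        N, n = X.shape
        orders = np.arange(n_min, n + 1)
        bic = np.empty(len(orders))
        h_ml = np.zeros((len(orders), n))                    # zero-padded ML estimates per order
        for idx, k in enumerate(orders):
            hk, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
            bic[idx] = N * np.log(np.sum((y - X[:, :k] @ hk) ** 2) / N) + k * np.log(N)
            h_ml[idx, :k] = hk
        w = np.exp(-0.5 * (bic - bic.min()))
        w /= w.sum()                                         # P_hat(M_k | y) from (A.3)
        return w @ h_ml                                      # (A.2)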

References

[1] F. Gustafsson, H. Hjalmarsson, Twenty-one ML estimators for model selection, Automatica 31 (1995) 1377–1392.
[2] H. Hjalmarsson, F. Gustafsson, Composite modeling of transfer functions, IEEE Trans. Automat. Control 40 (5) (1995) 820–832.
[3] E.G. Larsson, Y. Selén, Linear regression with a sparse parameter vector, IEEE Trans. Signal Process. 55 (2) (2007) 451–460.
[4] Y. Selén, E.G. Larsson, Parameter estimation and order selection for linear regression problems, in: 14th EUSIPCO, Florence, Italy, 2006.
[5] M. Nezafat, M. Kaveh, W. Xu, Estimation of the number of sources based on the eigenvectors of the covariance matrix, in: IEEE Sensor Array and Multichannel Signal Processing Workshop Proceedings, Sitges, Spain, 2004, pp. 465–469.
[6] P. Djurić, A model selection rule for sinusoids in Gaussian white noise, IEEE Trans. Signal Process. 44 (7) (1996) 1744–1751.
[7] L. Wasserman, Bayesian model selection and model averaging, J. Math. Psychol. 44 (1) (2000) 92–107.
[8] P. Stoica, Y. Selén, Model-order selection—A review of information criterion rules, IEEE Signal Process. Mag. 21 (4) (2004) 36–47.
[9] T. Söderström, P. Stoica, System Identification, Prentice Hall International, London, UK, 1989.
[10] C. Févotte, S.J. Godsill, Sparse linear regression in unions of bases via Bayesian variable selection, IEEE Signal Process. Lett. 13 (7) (2006) 441–444.
[11] A.F. Molisch, Wireless Communications, Wiley–IEEE Press, New York, 2005.
[12] A.E. Tikhonov, Solution of incorrectly formulated problems and the regularization method, Soviet Math. Dokl. 4 (1963) 1035–1038.
[13] G.H. Golub, M. Heath, G. Wahba, Generalized cross validation as a method for choosing a good ridge parameter, Technometrics 21 (2) (1979) 215–223.
[14] P.C. Hansen, Analysis of discrete ill-posed problems by means of the L-curve, SIAM Rev. 34 (4) (1992) 561–580.
[15] J.E. Cavanaugh, Unifying the derivations for the Akaike and corrected Akaike information criteria, Stat. Probab. Lett. 33 (2) (1997) 201–208.
[16] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (1978) 461–464.
[17] K.P. Burnham, D.R. Anderson, Model Selection and Multi-Model Inference, Springer, New York, 2002.
[18] H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control 19 (1974) 716–723.
[19] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1) (1951) 79–86.
[20] C. Hurvich, C. Tsai, A corrected Akaike information criterion for vector autoregressive model selection, J. Time Ser. Anal. 14 (1993) 271–279.
[21] P. Stoica, Y. Selén, J. Li, Multi-model approach to model selection, Digital Signal Process. 14 (5) (2004) 399–412.
[22] A.D. Lanterman, Schwarz and Wallace and Rissanen: Intertwining themes in theories of model order estimation, Int. Stat. Rev. 69 (2001) 185–212.

Yngve Selén received the M.Sc. degree in engineering physics from Uppsala University, Sweden (2002) and the Technical Licentiate degree in signal processing from Uppsala University (2004). He is currently a Ph.D. student of signal processing at Uppsala University. His research interests include Bayesian modeling, beamforming and magnetic resonance spectroscopy. Erik G. Larsson is an Associate Professor of communication theory. He joined KTH in 2005, and was previously an Assistant Professor at the George Washington University in Washington, DC, USA. He has published some 40 papers on the topics of signal processing and communications, he is co-author of the textbook Space–Time Block Coding for Wireless Communications (Cambridge Univ. Press, 2003) and he holds several patents on wireless technology. He is an Associate Editor for the IEEE Transactions on Signal Processing and the IEEE Signal Processing Letters and a member of the IEEE Signal Processing Society SAM technical committee.