
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 1, JANUARY 2007

A New Approach to Variable Selection Using the TLS Approach

Jean-Jacques Fuchs, Member, IEEE, and Sébastien Maria

Abstract—The problem of variable selection is one of the most important model selection problems in statistical applications. It is also known as the subset selection problem and arises when one wants to explain the observations or data adequately by a subset of possible explanatory variables. The objective is to identify the factors of importance and to include only the variables that contribute significantly to the reduction of the prediction error. Numerous selection procedures have been proposed for the classical multiple linear regression model. We extend one of the most popular methods developed in this context, the backward selection procedure, to a more general class of models. In the basic linear regression model, errors are present on the observations only; if errors are present on the regressors as well, one gets the errors-in-variables model, which for Gaussian noise becomes the total least squares (TLS) model. This is the context considered here.

Index Terms—Least squares (LS) problem, matrix perturbation, stepwise regression, Student test, subset selection, total least squares (TLS) problem.

I. INTRODUCTION

Variable selection is an important topic in regression analysis [1]–[3]. Its aim is to select, among a large number of potential regressors, those that are most relevant. In the literature, different models have already been proposed; by far the most widely used is the linear regression model, where attention is restricted to normal linear models, i.e., it is assumed that additive Gaussian perturbations are present on the observations. More formally, one observes an observation vector and a set of potential regressors. One wants to model the linear relationship between the observations and a subset of the regressors, but one has to decide which subset to use. We denote by the regression matrix the matrix having the potential regressors as columns; selecting variables is then equivalent to finding the good subset of its columns. We write

(1)

with the vector of the exact regression coefficients, among which there are some zeros, and the measurement error vector. This is the basic least squares (LS) model. We assume that there are more observations than potential regressors and that the regression matrix has full column rank. Ideally, to get the best subset according to a given criterion, one has to examine all the possible models, but the computational complexity of such an exhaustive search makes this approach impractical for even reasonable numbers of regressors.

Manuscript received July 12, 2005; revised March 10, 2006. The associate editor coordinating the review of this paper and approving it for publication was Dr. Daniel Fuhrman. The authors are with the Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes I, 35042 Rennes Cedex, France (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TSP.2006.882105

To circumvent this difficulty, many algorithms have been proposed in the literature to reduce the number of models to be compared. They are in general based on the residual sum of squares (RSS), which provides a partial ordering of the models and allows one to eliminate large numbers of models from consideration [4]. For a given complexity, i.e., number of regressors, one keeps the model which yields the smallest RSS. Once attention is reduced to a reasonable set, there is still the need for selection criteria. Besides AIC [5] and BIC [6], which are familiar to this community, one should cite cross-validation approaches [3] and Mallows's statistic [7], currently the most used selection criterion. These approaches are of a quite different type from the one we investigate below, and we will not consider their extensions to the new model we introduce.

An alternative that also presents a good compromise between computation time and efficiency is provided by the stepwise procedures [1]–[3]. These include the forward selection procedure, the backward elimination procedure (BEP), and stepwise regression. All these procedures add or remove variables one at a time until some stopping rule is satisfied. In this paper we will consider the BEP, in which the idea is to begin with the complete model and to remove the least relevant regressor at each iteration until a preselected confidence level tells us that no further removal is justified. Let us mention that all these procedures can be far from optimal and that no global confidence level or overall power can be guaranteed [1], [2], [8].

In this paper, in order to allow for a more accurate model of reality, we replace the basic linear regression model (1), in which only the measurements are assumed to be corrupted by errors, by the so-called errors-in-variables model [9], in which the regression matrix itself is contaminated by errors. The model becomes

(2)

where the vector and the matrix are known or observed, and an error vector and an error matrix model the perturbations. When these errors are assumed to be Gaussian, (2) is known as the total least squares (TLS) model [10], [11]. We will call (2) the TLS or basic TLS model in the sequel and distinguish it from the case where some regressors are perfectly known, which we will call the generalized TLS model (4). We propose to perform the different investigations needed to extend the BEP to the generalized TLS model (4), which includes (2) as a special case. In [12] a similar approach is proposed from a numerical analysis point of view.


The original contributions are thus: i) to obtain the maximum likelihood estimates (MLE) of all the parameters in the generalized model (4); ii) to evaluate their statistical properties using matrix perturbation results; iii) to prove that one can indeed consider the generalized model (4) as a basic model (2) with a structured noise covariance matrix; and iv) to develop the corresponding test and associated thresholds.

This paper is organized as follows. The known results on the LS problem and the backward algorithm are described in Section II. The general TLS model is detailed in Section III, together with the maximum likelihood estimates of the regression coefficients and their statistical properties. In Section IV, the extension to the TLS model and the corresponding algorithm are proposed. Then, in Section V, some simulation results are presented, both to validate the statistical analysis and to evaluate the efficiency of the proposed procedure. We conclude in Section VI.

II. THE BASIC LINEAR REGRESSION CASE

A. The LS Model

In the basic multiple regression model (1) with Gaussian noise on the observations, the maximum likelihood (ML) estimate of the vector of weights is obtained by solving the LS problem, i.e., minimizing the RSS

(3)

where the squared Euclidean norm of the residual is used. The optimum is attained at the LS estimate, obtained by applying the pseudoinverse of the regression matrix, assumed to have full column rank, to the observations. Denoting the residual sum of squares at the optimum, the following properties are well known [3], [13]:
• the LS estimate is an unbiased Gaussian estimate of the true weight vector;
• its covariance matrix is the noise variance times the inverse of the Gram matrix of the regressors;
• the RSS divided by the noise variance follows a Chi-square distribution whose number of degrees of freedom is the number of observations minus the number of regressors, so that the RSS divided by this number of degrees of freedom is an unbiased estimate of the noise variance;
• the LS estimate and the RSS are independent.
It is worth noting that all these results are nonasymptotic: they hold for a finite number of observations and an arbitrary signal-to-noise ratio (SNR). While the ML estimate [14] of the noise variance is biased for a finite number of observations, the one proposed above is unbiased. The last properties are most easily established by resorting to geometric arguments. All these properties are used below to justify the BEP applied to the LS model (1).
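For illustration, the following minimal NumPy sketch (ours, not part of the original paper) computes the LS estimate, the unbiased noise-variance estimate, and the normalized weights that the Student test of the next subsection relies on; the function and variable names are our own.

```python
import numpy as np

def ls_estimate(A, b):
    """LS estimate, unbiased noise-variance estimate and normalized weights.

    A : (n, m) regressor matrix, assumed full column rank with n > m
    b : (n,)   observation vector
    Returns (x_hat, s2, t) where t[i] is x_hat[i] divided by its estimated
    standard deviation, the quantity used in the Student test below.
    """
    n, m = A.shape
    x_hat = np.linalg.lstsq(A, b, rcond=None)[0]        # pseudoinverse solution
    rss = np.sum((b - A @ x_hat) ** 2)                  # residual sum of squares
    s2 = rss / (n - m)                                  # unbiased estimate of the noise variance
    cov = s2 * np.linalg.inv(A.T @ A)                   # estimated covariance of x_hat
    t = x_hat / np.sqrt(np.diag(cov))                   # normalized weights
    return x_hat, s2, t
```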

B. BEP Using the Student Test

After the full weight vector has been obtained, the BEP decides which regressors should be kept by removing them one at a time. At any given stage it has to decide whether an additional column can be removed from the current regressor matrix or, equivalently, whether an additional coefficient can be declared zero. Somewhat surprisingly, there are two approaches that lead to strictly equivalent tests for the LS model [2], [3], [13]; this is due to the linear structure of the model. One approach, known as the Fisher test, proposes to remove or not a regressor, i.e., a column of the regression matrix, by evaluating its contribution to the RSS: it tests the increase of the RSS following its deletion, and if the increase is too small the regressor is deleted. The other approach, known as the Student test, is the one we will extend to the TLS model in the sequel. Let us detail it.

In the Student test approach, the emphasis is put on the value of the components of the estimated weight vector. In the BEP based on this test, one starts with the full model with all its weights and decides whether one of them should be declared zero. If this is the case, one deletes the associated column and starts anew, as if the true full initial model had only the remaining components. The idea is to use the statistical properties of the LS estimate stated above to suitably normalize each estimated weight and to compare it to a threshold corresponding to a given confidence level, computed under the assumption that the true weight is zero. The Student t-distribution [3], [13], which was derived as the distribution of a random variable arising from statistical inference for the mean of a normal distribution, is the natural choice: if a standard Gaussian random variable is divided by the square root of an independent Chi-square random variable divided by its number of degrees of freedom, the ratio follows a Student t-distribution with that number of degrees of freedom. From the properties of the LS estimate and of the RSS enumerated above, it follows that the random variable associated with each weight, namely the estimated weight divided by its estimated standard deviation,

follows precisely a Student t-distribution with the corresponding number of degrees of freedom when the true weight is zero. Note that this is only true for the full model, i.e., for the first step of the BEP; in the following steps it is only "conditionally" true, i.e., provided the preceding decisions were correct. The BEP using the Student test is then as follows: start with the full model.
1) Compute the current complete model estimate and the associated normalized weights.
2) Test the component with the lowest normalized weight:
• if it is below the threshold, remove the corresponding regressor from the current regressor matrix, decrease the model order by one, and return to 2);
• else stop and accept the current (reduced) model.
The threshold is obtained from the quantile function of the Student t-distribution with the appropriate number of degrees of freedom at the chosen confidence parameter. More simply, the threshold is such that the probability of the normalized weight exceeding it is small (for instance 10%) if the true value of the weight is zero.
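A minimal sketch of this backward elimination loop follows (ours; the two-sided reading of the threshold and the use of scipy.stats.t.ppf are assumptions consistent with the 90% confidence setting used in the simulations).

```python
import numpy as np
from scipy.stats import t as student_t

def ls_bep(A, b, confidence=0.9):
    """Backward elimination procedure with the Student test (sketch).

    Starts from the full model and removes, one at a time, the regressor whose
    normalized weight is the smallest, as long as it falls below the Student
    threshold. Returns the indices of the retained columns and the last estimate.
    """
    n = b.shape[0]
    kept = list(range(A.shape[1]))
    x_hat = np.zeros(0)
    while kept:
        Ak = A[:, kept]
        m = len(kept)
        x_hat = np.linalg.lstsq(Ak, b, rcond=None)[0]
        s2 = np.sum((b - Ak @ x_hat) ** 2) / (n - m)          # unbiased noise variance
        sd = np.sqrt(s2 * np.diag(np.linalg.inv(Ak.T @ Ak)))  # std of each weight
        t_stats = np.abs(x_hat) / sd
        i = int(np.argmin(t_stats))                           # least significant weight
        threshold = student_t.ppf(0.5 + confidence / 2, n - m)  # two-sided threshold
        if t_stats[i] < threshold:
            kept.pop(i)                                       # remove the regressor and iterate
        else:
            break
    return kept, x_hat
```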

III. THE GENERAL TLS MODEL

A. The New Model

In model (2), we introduce the possibility for the regressors to be corrupted by additive noise, just like the observations. For obvious practical reasons,1 we generalize this model to the case where some regressors are perfectly known and the others are subject to noise.

1 In multiple linear regression one systematically introduces a column of ones in the regression matrix to allow for an intercept parameter. This is but one example of a perfectly known regressor.



The basic TLS model (2) thus becomes the general TLS model [11], [15]

(4)

where one knows the matrix of exact regressors and one knows or observes the observation vector and the perturbed regression matrix. We assume the columns of the complete regressor matrix to be linearly independent. This model can also be rewritten row-wise.

We assume that the observations are independent random variables. The errors on the observations and on the rows of the perturbed regressor matrix are also assumed to be independent zero-mean Gaussian noises, with a common scalar variance and a covariance matrix proportional to it, respectively. We further assume that the proportionality matrix is positive definite and known. This last assumption is necessary since this matrix is not identifiable, i.e., the maximum likelihood function can be made infinite by a proper choice of this matrix when it is assumed unknown. In practice it is taken as a diagonal matrix with positive components.

B. The Parameter Estimates

In (4), the quantities to be estimated are the regression coefficients, the exact values of the perturbed regressors, and the noise variance. Counting them shows that, for the estimation problem to have a solution, the number of observations must exceed the total number of regressors. Under the standing Gaussian assumptions, the ML estimates of the parameters are then part of the optimum of

(5)

or, in matrix notation, of the minimization of the corresponding squared Frobenius norm. This optimization problem is separable and the minimum with respect to the coefficients of the exactly known regressors is attained at the LS solution, i.e., at a projection onto the range space of the known regressors. Before we replace these coefficients by their optimum, let us define the projection onto the orthogonal complement of this range space and apply it to the observation vector and to the noisy regressor matrices. Then, after substitution, we get

(6)

Here the minimum of the last term in (6) is attained at the corresponding projection, so that this term vanishes after optimization with respect to the exact regressor values. We rewrite the first two terms by introducing a full column rank orthogonal matrix whose columns span the orthogonal complement of the range space of the known regressors, and we get

(7)

The purpose of these transformations is to reduce the number of unknowns to their minimum and to make them independent. The last step is now to put this optimization problem (7) into a form where it appears that its optimum is deduced from the singular value decomposition of a given augmented matrix. In order to prepare this last step, we state what it would be for the basic model (2), in order to mimic its application to (7). For the basic model (2), the counterpart of the current formulation (7) of the optimization problem would read as the minimization of the sum of the squared norms of the perturbations on the observations and on the regressors subject to the perturbed linear relation, and, setting up the augmented matrix built from the perturbed regressors and observations, it can be rewritten as

(8)

This last formulation can further be transformed into a rank-reduction problem: one seeks the perturbation matrix of minimal Frobenius norm that, when subtracted from the augmented matrix, makes its rank decrease by one. It is then well known [10], [11], [16] that the optimal perturbation matrix is equal to the rank-one matrix associated with the smallest singular triplet of the augmented matrix and that the estimate of the regression coefficients is deduced from the corresponding right singular vector. To apply the same development to (7), we define the analogous augmented matrix and rewrite (7) as a constrained minimization

(9)

that is to be compared to (8). We, therefore, define the perturbed augmented matrix and deduce the optimal solution, the ML estimate of the regression coefficients, from its smallest right singular vector by scaling its last component, assumed nonzero, which gives

(10)

Having obtained the ML estimate of the coefficients of the noisy regressors, the estimates of the coefficients of the known regressors and of the exact regressor values follow easily. They lead in turn to an estimate of the noise variance. But in order to develop the extension of the BEP to the general TLS case, we also need an estimate of this variance and the statistical properties of these estimates. The ML estimate of the noise variance is proportional to the square of the minimal singular value of the augmented matrix [14]. However, this estimate is too biased to be of any use (remember that in the LS case the ML estimate of the noise variance is also, slightly, biased) and we will indicate in Section III-C that an unbiased estimate of the noise variance in the general TLS model case is

(11)

a result that is compatible with (and deducible from) the comparison of the number of observations and the number of unknowns in our problem.
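To make the SVD-based solution concrete, here is a small sketch (ours) for the basic model (2): the coefficient estimate is read off the smallest right singular vector of the augmented matrix, and the smallest singular value provides a noise-variance estimate; the degrees-of-freedom normalization shown is our reading of (11), whose exact expression is not reproduced in this copy.

```python
import numpy as np

def tls_estimate(A, b):
    """Basic TLS estimate via the SVD of the augmented matrix [A b] (sketch).

    Assumes i.i.d. Gaussian noise of equal variance on b and on every entry of A.
    Returns the coefficient estimate and a noise-variance estimate; the
    normalization by n - p is an assumed, not the paper's exact, expression.
    """
    n, p = A.shape
    M = np.column_stack([A, b])                  # augmented matrix
    _, s, Vt = np.linalg.svd(M, full_matrices=False)
    v = Vt[-1]                                   # right singular vector of the smallest singular value
    if abs(v[-1]) < 1e-12:
        raise ValueError("degenerate TLS problem: last component of v is zero")
    x_hat = -v[:p] / v[-1]                       # scale by the (nonzero) last component
    sigma2_hat = s[-1] ** 2 / (n - p)            # noise-variance estimate (assumed normalization)
    return x_hat, sigma2_hat
```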

C. The Statistical Properties of the Estimates

For simplicity and to ease the presentation, we develop these results for the basic TLS model (2). The modifications one has to introduce in order to extend them to the generalized TLS model (4) are rather simple as far as the estimate of the noise variance is concerned. The deduction of the statistical properties of the estimate of the regression coefficients is more cumbersome; we shall show in Section IV how to get rid of the distinction between exactly known and noisy regressors in practice.

Since the estimate (10) is the ML estimate, one can obtain its statistical properties by evaluating the Fisher information matrix and inverting it [14], but the corresponding values are easier to obtain by using (first-order) matrix perturbation analysis [17]–[19]. It is also in this way that we will obtain (11). Indeed, (10) tells us that the estimate of the regression coefficients is deduced from the smallest right singular vector of a perturbed matrix, so it is possible to use matrix perturbation results to analyze its properties. For ease of reference, we recall the definitions and notations we use for the basic TLS model (2) and its generic row: the exact regressor matrix has full column rank, and the noises on the observations and on the rows of the regressor matrix are, respectively, scalar and vector white Gaussian noises with known covariance structure; the observed augmented matrix is then the sum of the exact augmented matrix and of a perturbation matrix. Introducing the normalized perturbation matrix, one has

(12)

The observed augmented matrix is thus the sum of the exact but unknown matrix and of the perturbation. The components of the normalized perturbation are of order one since they are independent zero-mean normal variables with unit variance. One can thus expect that, for a sufficiently small noise level, the elements of the singular value decomposition (SVD) of the observed matrix are close to those of the exact matrix. This is what matrix perturbation analysis allows one to confirm.

Let us denote the SVDs of the exact and of the observed augmented matrices, with square orthogonal matrices of left and right singular vectors and diagonal matrices of singular values. We are only interested in the way the single zero singular value of the exact augmented matrix and its right singular vector, introduced earlier, are perturbed. The perturbation theory for these elements presents some difficulties: since a zero singular value can only increase, its expansion is of a different nature than the expansions of nonzero singular values. The key is to work with the cross-product matrix. One can establish [16], [19] the following results.

Proposition 1: For a sufficiently small noise level, the following expansions are valid

(13)

(14)

where the pseudo-inverse of the exact augmented matrix appears. The first relation leads to the estimate (11) of the common noise variance; the second will allow us to get the covariance matrix of the ML estimate (10). The domain of validity of these expansions is difficult to obtain and to characterize for a finite number of observations, but this is not a surprise since the corresponding statistical analysis would only be valid asymptotically in the number of observations. Matrix perturbation analysis relies on a so-called well-separateness assumption [16], [18], which in this context requires the smallest nonzero singular value of the exact matrix to be large with respect to the noise standard deviation. Since this singular value generically increases with the number of observations, well-separateness always holds for a sufficiently large number of observations. But it may also hold for a finite number of observations at an adequate "SNR," a condition that would be difficult to obtain from a statistical analysis point of view. We evaluate this possibility, which validates the proposed approach, in Section V-B-1, where we check that it holds provided an observable singular value ratio is greater than, say, 3.

We have defined the SVDs with square orthogonal matrices; let us introduce a more compact form by separating the singular vectors associated with the nonzero singular values from the remaining ones. Equations (13) and (14) in the proposition can then be rewritten accordingly. Since the components of the normalized perturbation are independent random variables, one can show that the perturbed smallest singular value and the associated right singular vector have the following properties.

Proposition 2: The following properties hold up to first order in the perturbation, for a sufficiently small noise level:
• the perturbed singular vector is close to a Gaussian random vector with mean the exact singular vector and a covariance matrix given by the expansion (14);
• the ML estimate deduced from it via (10) has mean the true coefficient vector and the corresponding covariance matrix;
• the squared smallest singular value, suitably normalized, is close to a Chi-square random variable;
• the two quantities are independent.


Proof: Items 1 and 3 are easy consequences of the preceding proposition. To establish item 2, we first use (10) to relate the first-order variation of the estimate to the first-order variation of the singular vector; this relation, written in matrix form, and the expansion (14) then give the covariance of the estimate. Indeed, from (14) one deduces a full-rank factorization of the covariance of the singular vector perturbation, and carrying this factorization through the relation above leads to item 2. From the orthogonality between the singular vector perturbation and the singular value perturbation, one deduces the decorrelation between the two estimates which, since they are close to normal for a small enough noise level, implies independence, thus item 4.

Remark: In Section V-B-1, we check by simulations that these properties actually hold as soon as the observable singular value ratio is greater than, say, 3; i.e., the "for a sufficiently small noise level" prerequisite that we impose in Proposition 2, which in the statistical literature [14] is replaced by "for a sufficiently large number of observations," can indeed be replaced by this perfectly checkable condition.

In this section, we have considered the basic TLS model (2) rather than (4) to ease the already cumbersome presentation and development; this means that we have ignored the possibility of exactly known regressors offered by the general TLS model (4). In the next section, we show that it is indeed possible to circumvent this additional complexity by introducing an approximate model.

IV. AN APPROXIMATE MODEL

In the general TLS model (4), which after the different substitutions becomes (7), we make a distinction between the perfectly known regressors and the imperfectly known ones. While this distinction is necessary from a variable selection point of view, it is quite cumbersome in the estimation procedure and in the evaluation of the statistical properties of the estimates. In this section, we introduce an approximate model which allows us to handle in the same way both types of regressors present in (4), and we investigate the validity of this approach.

To ease the presentation, we assume that the covariance matrix of the noise perturbing the rows of the noisy regressor matrix is diagonal. In the approximate model, we consider that the rows of the exactly known regressor matrix are similarly perturbed by a noise with a diagonal covariance matrix scaled by a factor which, taken much smaller than one, allows us to monitor the degree of approximation. To justify this approach, we state the following proposition, which is established in the Appendix.

Proposition 3: Assume we observe the observation vector and a regressor matrix whose exact counterpart has full column rank, the rows of the regressor noise being independent with a known covariance matrix proportional to a small factor. Then, as this factor decreases to zero, the ML estimate of the weights converges towards the LS estimate.

Though expected, this is not a trivial observation and it is a first step towards the justification of our approximation. To proceed further, we need to remember from (12) that, prior to the application of the SVD, we need to scale the different parts of the augmented matrix in order to have all the components of the perturbation matrix in (12) share the same standard deviation. This is quite logical: the SVD treats all the components equally and, for the decomposition to make sense, the perturbations must have the same standard deviation. It is then clear that if the observations and the noisy regressors are contaminated by noise with a given variance and the nominally exact regressors by a noise with a much smaller variance, we must perform the SVD of the correspondingly rescaled augmented matrix, and that this will lead to the estimate of the augmented vector of unknowns.

This augmented TLS model has the basic TLS structure (2) and the statistical properties of the associated estimates can be obtained from Proposition 2 in Section III-C. The estimates are deduced from the "smallest" right singular vector of the rescaled augmented matrix and verify a relation involving the smallest singular value of this matrix. Replacing this singular value by its value, one gets after some manipulations

(15)

The covariance of this estimate can now be deduced from Proposition 2, item 2, which in turn gives, using expressions for the inverse of partitioned matrices [20], explicit expressions involving the matrix defined above relation (7). We should compare these covariances to those that result from the full analysis of the true ML procedure applied to (4); we do not detail this analysis, which yields the exact covariances.

For the covariance of one block of coefficients, the difference is negligible as soon as one takes a small enough value of the approximation parameter. For the covariance of the other block, the difference is also quite small, since the relevant correction factor is quite often around one-half or one-third. It remains to check the values of the estimates themselves; this is indeed more important to justify the approximation. If we neglect the terms involving the approximation parameter, which is reasonable when it is small, we get from (15) that the estimate of one block of coefficients has exactly the same expression as the true ML estimate; for the other block, the unique difference lies in the value of the smallest singular value, which in the approximate procedure is not identical to the one obtained in the exact procedure. Comparing the traces of the corresponding matrices in both cases indicates that the difference is quite small for a small approximation parameter. We therefore recommend applying this approximate procedure; see also Section V-B-2. This is the algorithm we detail in Section V-A; a sketch of the construction of its rescaled augmented matrix is given below.
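The following fragment (ours) sketches this construction: the exactly known regressors are treated as if they were perturbed with a standard deviation eps times that of the noisy ones, so their columns are divided by eps before the common SVD; the name eps, the column ordering, and the optional whitening matrix are illustrative assumptions.

```python
import numpy as np

def scaled_augmented_matrix(X, A, b, eps=1e-3, R_inv_sqrt=None):
    """Rescaled augmented matrix of the approximate model (sketch).

    X : (n, q) exactly known regressors, modeled as perturbed with std eps * sigma
    A : (n, p) noisy regressors, perturbed with std sigma (optionally whitened)
    b : (n,)   observations, perturbed with std sigma
    All columns are rescaled so that the fictitious perturbations share the same
    standard deviation, which is what the common SVD implicitly requires.
    """
    if R_inv_sqrt is not None:
        A = A @ R_inv_sqrt            # whiten the regressor noise if its covariance is known
    return np.column_stack([X / eps, A, b])
```

After the SVD, the coefficients found for the rescaled columns of X must be divided by eps to recover the weights of the original model, as done in the sketch given after the detailed algorithm of Section V-A.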

V. TLS BEP

A. Detailed Algorithm

We give in this section the exhaustive description of the TLS-BEP using the approximate model of Section IV. The model used for this algorithm is (4), where one block of the regression matrix is observed (and noisy) and the other is the matrix of known regressors. The estimate of the weights is deduced from the estimate of the augmented vector of unknowns, which is itself deduced from the SVD of the rescaled augmented matrix. We recommend taking a small value for the approximation parameter; this is what we used in the simulations.


Implementation of the algorithm:
1) Initialization:
• of the rescaled augmented matrix;
• of the problem dimension (number of columns in the current regressor matrix);
• of the remaining constants of the procedure.
2) Step of the algorithm:
• Compute the SVD of the current rescaled augmented matrix.
• Identify the right singular vector associated with the smallest singular value.
• Compute the estimate of the augmented vector of unknowns by scaling this singular vector by its last component, assumed nonzero.
• Compute the remaining estimates from the appropriate leading rows and columns of the matrices involved.
• Compute the estimate of the noise standard deviation.
• Test all regressors with the Student test: compute all the Student parameters, i.e., each estimated weight divided by its estimated standard deviation.
• Test the regressor with the lowest Student parameter:
— if it is below the threshold, remove the corresponding regressor from the current regressor matrix (equivalently, remove the corresponding column of the augmented matrix); consequently, remove the corresponding row and column of the associated matrices, decrease the problem dimension by one, and return to 2);
— else stop and accept the current reduced model.
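The loop below (ours) sketches the whole procedure under stated assumptions: the eps-scaling of Section IV for the known regressors, an assumed n − m normalization for the noise-variance estimate, and a standard first-order TLS covariance approximation in place of the paper's elided expressions for the Student parameters.

```python
import numpy as np
from scipy.stats import t as student_t

def tls_bep(X, A, b, confidence=0.9, eps=1e-3):
    """TLS backward elimination with the Student test (sketch of Section V-A).

    X : (n, q) exactly known regressors, A : (n, p) noisy regressors, b : (n,).
    Known regressors are handled through the eps-scaling of the approximate model.
    Returns the indices (into the columns of [X A]) of the retained regressors
    and their estimated weights.
    """
    n = b.shape[0]
    q, p = X.shape[1], A.shape[1]
    scales = np.concatenate([np.full(q, 1.0 / eps), np.ones(p)])  # known columns scaled up
    C = np.column_stack([X, A])
    kept = list(range(q + p))
    coef = np.zeros(0)
    while kept:
        m = len(kept)
        Ck = C[:, kept] * scales[kept]                 # rescaled regressor block
        M = np.column_stack([Ck, b])
        _, s, Vt = np.linalg.svd(M, full_matrices=False)
        v = Vt[-1]                                     # smallest right singular vector
        c_scaled = -v[:-1] / v[-1]                     # scaled-model coefficients
        coef = c_scaled * scales[kept]                 # undo the column scaling
        sigma2 = s[-1] ** 2 / (n - m)                  # assumed noise-variance normalization
        # first-order covariance approximation of the scaled-model coefficients
        G = Ck.T @ Ck - s[-1] ** 2 * np.eye(m)
        cov = sigma2 * (1.0 + c_scaled @ c_scaled) * np.linalg.inv(G)
        sd = np.sqrt(np.diag(cov)) * scales[kept]      # std of the unscaled weights
        t_stats = np.abs(coef) / sd
        i = int(np.argmin(t_stats))                    # least significant regressor
        threshold = student_t.ppf(0.5 + confidence / 2, n - m)
        if t_stats[i] < threshold:
            kept.pop(i)                                # remove it and iterate
        else:
            break
    return kept, coef
```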

B. Simulation Results

It is difficult to argue about the advantage of a new approach on real data since the truth is unknown; we therefore present mostly results on simulated data. At first, we check the conditions that validate the statistical analysis we performed in Section III-C and thus justify the use of TLS-BEP. Later, we validate the use of the approximate model introduced in Section IV. Then, we compare the performance of TLS-BEP and LS-BEP on two sets of simulated data before we present the results obtained by both approaches on the Hald data set [1], the data set most frequently used in this context.


1) Conditions for TLS-BEP Validity: The proposed approach relies on the validity of the statistical analysis performed in Section III-C, which justifies the use of the Student test. The statistical analysis is valid asymptotically, as usual, and in Section III-C we indicated that this means that the smallest nonzero singular value of the (partially unknown) regressor matrix must be large with respect to the (unknown) standard deviation of the noise. We illustrate here that the asymptotic analysis is indeed valid as soon as the ratio of the two smallest singular values of a perfectly known matrix is greater than, say, 3. We check that, if this condition holds, the probability for the Student parameter to be below the threshold, which declares the true parameter to be zero, is indeed close to its expected value, i.e., the confidence parameter used when setting the threshold. We use the basic TLS model with a given number of observations and potential regressors; only one regressor does not participate in the regression. The components of the exact regressor matrix are chosen from a uniform distribution in the interval [0, 3]. All components of the true weight vector are equal to one except for the component of the nonparticipating regressor, which is equal to zero. We apply the test procedure described in Section V-A to this model with a threshold equal to 1.7109, a function of the number of degrees of freedom, fixed in order to obtain a confidence parameter equal to 0.9 (90%).

We denote the two smallest singular values of the observed augmented matrix. In Fig. 1, we present the percentage of good detections for one thousand noise realizations for each value of their ratio taken between 1 and 11, for three cases. Note that the different values of the ratio are obtained by adequately tuning the noise standard deviation, which acts on the SNR of the scenario. One can conclude that the analysis we performed is valid if this ratio, which is perfectly observable, is greater than 3, whatever the way this is attained, i.e., here, the value of the noise standard deviation. This means that the application of TLS-BEP is justified whenever this condition is checked at the different steps of the procedure.

Fig. 1. Evolution of performance for TLS-BEP subject to the well-separateness condition. (Color version available online at http://ieeexplore.ieee.org.)

2) Validity of the Approximate Model: The aim of this section is to investigate the role of the approximation parameter. Indeed, we will show that as soon as one takes the recommended value of Section V-A or smaller, the difference between both TLS-BEPs is negligible. The model used for the complete TLS-BEP algorithm is (4),

where the observed matrix of noisy regressors is 30 × 3 and the matrix of known regressors is 30 × 3 as well. The components of both matrices are uniformly distributed samples taken between 0 and 3. The number of true regressors is four; their weight components are equal to one and the remaining ones are equal to zero. The model used for the approximate TLS-BEP algorithm (see Section IV) is an aggregated form of the previous model: it has the form of the basic TLS model (2), where the covariance matrix of the noise perturbing the regressors is block-diagonal. The idea is to model the perfectly known regressors by a noise having a variance that can be made small at will by tuning the approximation parameter.

We take noise levels that correspond to a domain in which TLS-BEP is valid; see Section V-B-1. The tests are done with a Student test at the 90% confidence level. It appears that the performance of the complete TLS-BEP, in which the two types of regressors are estimated and tested separately, is the same as that of the approximate TLS-BEP described in Section V-A when the approximation parameter is taken at the recommended value. On 1000 realizations, both TLS-BEPs systematically took identical decisions and retrieved the relevant regressors in 82% of the realizations. The relative deviation between the estimates obtained by both versions of TLS-BEP has a mean value of 0.0021 over the 1000 realizations. We thus recommend using the approximate procedure, and we denote it as TLS-BEP in the sequel.

Remark: The 82% rate of good detection is not unexpected and is even somewhat predictable. Indeed, the TLS-BEP proceeds stepwise, as any BEP, by removing potentially absent regressors one at a time. In this simulation there are two false regressors to be removed; at each step the threshold was fixed to have 90% of good decisions, and the 82% rate has thus to be compared with 0.9 × 0.9 = 0.81.

3) Performance of TLS-BEP: Here, we compare the performance of TLS-BEP and LS-BEP on a set of simulated data. We use the basic TLS model


with a given number of observations and potential regressors; only one regressor does not participate in the regression. We further take moderate noise levels. The components of the exact regressor matrix are chosen from a uniform distribution in the interval [0, 4]. The true weight components are 5, 2, and 0. We apply the test procedure described in Section V-A to this model with a threshold fixed to obtain a confidence parameter equal to 0.9 (90%). We observe that TLS-BEP retrieves the good regressors in 86% of the 10 000 noise realizations we performed, whereas LS-BEP retrieves them in only 40% of them, a very bad performance for LS-BEP. This is of course a pretty unfair comparison, since the model is perfectly matched to the TLS-BEP approach. It is nevertheless interesting to note that LS-BEP is quite sensitive to even moderate noise on the regressors; here the Gaussian noise variance on the regressors is 1/9. In this case, in almost all of the 60% of wrong decisions it made, LS-BEP retains only one regressor. The LS estimate of the weight whose true value is 2 is biased, and in about half of the simulations it is declared equal to zero.

4) A More Realistic Comparison Between TLS-BEP and LS-BEP: We present a further comparison of the performance of TLS-BEP and LS-BEP on a more realistic set of simulated data. We consider the case where one observes a sum of sinusoids with jitter on the sampling period. The nominal sampling period is taken equal to one, but the true sampling instants are jittered: each true instant is the nominal one plus a coefficient, to be fixed, times an independent random variable uniformly distributed in [−0.5, 0.5]. This can, for instance, represent the effect of a fluctuating array in underwater acoustics [25]. The model is then built with an unobserved regression matrix whose columns are sinusoids sampled at the jittered instants, which change for each noise realization, and an observed regression matrix whose columns are the same sinusoids sampled at the regular or nominal instants, which is constant. The perturbation matrix in (4) is now equal to the difference between the two and is thus far from being Gaussian; it no longer satisfies the assumptions made to develop our selection scheme.

We take a given number of observations and three potential sinusoidal regressors, whose frequency parameters are taken equal to 3, 2, and 1 and whose initial phases are uniformly distributed. As earlier, we take the true weights equal to 5, 2, and 0, i.e., the last regressor does not participate. We further take the coefficient in the jitter equal to 1. We apply the test procedure described in Section V-A to this model with a threshold fixed to obtain a confidence parameter equal to 0.9 (90%). We observe that TLS-BEP retrieves the good regressors 64% of the time over the 10 000 noise and jitter realizations we performed, whereas LS-BEP retrieves them only 48% of the time. The results are similar for other parameter values. TLS-BEP thus outperforms LS-BEP, though less significantly than in Section V-B-3.
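As an illustration only, the following sketch (ours) generates nominal and jittered sinusoidal regressor matrices of the kind described above; the frequency values, the number of observations and the use of sine waves are placeholders for the settings elided from this copy.

```python
import numpy as np

def sinusoid_regressors(n=30, freqs=(3.0, 2.0, 1.0), jitter_coef=1.0, rng=None):
    """Nominal and jittered sinusoidal regressors (sketch of Section V-B-4).

    The nominal sampling instants are 0, 1, ..., n-1 (unit sampling period); the
    true instants add jitter_coef times a uniform [-0.5, 0.5] variable. Phases
    are drawn uniformly. Returns (A_observed, A_unobserved).
    """
    rng = np.random.default_rng() if rng is None else rng
    t_nom = np.arange(n, dtype=float)
    t_true = t_nom + jitter_coef * rng.uniform(-0.5, 0.5, size=n)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=len(freqs))
    A_obs = np.column_stack([np.sin(w * t_nom + ph) for w, ph in zip(freqs, phases)])
    A_jit = np.column_stack([np.sin(w * t_true + ph) for w, ph in zip(freqs, phases)])
    return A_obs, A_jit
```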


TABLE I LS-BEP AND TLS-BEP APPLIED TO HALD DATA SET

This poorer performance of TLS-BEP is indeed due to the fact that the model is no longer matched to the theory. It is also in part due to the fact that the singular value ratio of Section V-B-1 is, in this example, equal on average to 1.5, which is well below the threshold of 3 above which the theory behind TLS-BEP is justified.

5) The Hald Data Set: We now present the selection made by TLS-BEP on real data: the most often used Hald data set [1, Ch. 6 and Appendix B], which corresponds to an engineering application in chemometrics. This is a data set with 13 observations and four potential regressors. Tests are done with a Student test at a 90% confidence level. Since the measurements made to get the components of the regressors are of the same kind as those made to get the observations, the TLS model seems quite natural and justified, and this suggests assuming that all regressors and the observations are perturbed by noise with the same variance. The model we use is thus the basic TLS model; the estimate of the noise standard deviation is deduced from the data once the model is fixed, and similarly the observed value of the ratio considered in Section V-B-2 can be computed.

In Table I, we present the results of LS-BEP and TLS-BEP on this data set. LS-BEP declares two variables irrelevant, whereas TLS-BEP selects three regressors. Remember that the truth is unknown, so we cannot argue about which model is the true one. Indeed, in the literature and in [1, Ch. 6], most selection schemes agree with LS-BEP. Of course, if we diminish the assumed noise level on the regressors, which was taken equal to that on the observations above, it is possible to make TLS-BEP select two regressors just as LS-BEP does. This follows from Proposition 3 and, for the Hald data set, it is the case for a small enough value.

VI. CONCLUSION

The BEP is a useful and powerful algorithm that allows one to select variables in multiple linear regression schemes. It has been developed and is used for LS models, where it is assumed that (Gaussian) errors are only present in the observation vector that one wants to explain.

In many practical situations, the components of some regressors are essentially of the same nature as the components of the observation vector. There is thus no reason to consider these regressors to be known exactly, as is done in the standard LS model. More generally, it seems natural, depending upon the type of the regressors, to consider that some are subject to noise


while others (e.g., those having integer values) are known exactly, and to develop selection procedures that take these features into account. Since the linear regression model in which Gaussian noise is assumed to be present not only on the observations but also on part of the regressors is known as the TLS model, we call TLS-BEP the BEP extended to this type of model.

To develop the procedure, we have analyzed the statistical properties of the corresponding ML parameter vector estimate using results from matrix perturbation analysis and developed a Student test that allows one to decide whether a component of the estimated vector should be declared equal to zero. We also developed a simplified version of the TLS-BEP that handles perfectly known regressors and perturbed ones in the same way. It is this version that is detailed and that should be used for a quick implementation. In commercial software one would of course prefer the exact and complete implementation, since in some extreme cases the approximate version might lead to badly conditioned matrices.

Further investigations are underway to extend this work to the underdetermined case, where there are more regressors than observations. This is presently an active domain of research concerned with the recovery of sparse representations in the presence of noise [21]–[24]. Evaluating the performance of this type of approach seems, however, quite difficult: such approaches generally rely on the minimization of criteria that are no longer convex when allowing for perturbations in the regressors.

APPENDIX

Here, we establish that, when the variance of the noise perturbing the regressors tends to zero, the TLS estimate of the weights converges towards the basic LS estimate, as one might expect. More formally, we prove the following.

Proposition A: Assume we observe the observation vector and a regressor matrix whose exact counterpart has full column rank, the rows of the regressor noise being independent with a known covariance matrix proportional to a factor that decreases to zero. Then, as this factor decreases to zero, the ML estimate of the weights converges towards the LS estimate.

Proof: Let us introduce the rescaled augmented matrix and the associated vector. The ML estimate of the weights is deduced from the smallest right singular vector of this matrix, which is also the eigenvector associated with the smallest eigenvalue of the corresponding cross-product matrix. Denoting this smallest eigenvalue, i.e., the square of the smallest singular value, it is easy to deduce that [10]

which leads, introducing the dependencies on the scaling factor, to

Let us now show that the square of the smallest singular value remains bounded as the scaling factor decreases to zero. Let us denote the normalized smallest eigenvector of the constant cross-product matrix; we partition it into its leading components and its last one, and one then has

and

where the last term is a constant independent of the scaling factor. Since, moreover, the remaining quantities behave regularly as the factor decreases to zero, it follows that the ML estimate of the weights converges towards the LS estimate.
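Proposition A is easy to check numerically; the following small experiment (ours, with arbitrary data) shows the distance between the TLS estimate computed from the rescaled augmented matrix and the LS estimate shrinking as the regressor-noise scale decreases.

```python
import numpy as np

def tls_with_scale(A, b, eps):
    """TLS estimate when the regressor noise has std eps times that of b (sketch).

    The regressor block is divided by eps so that all perturbations of the
    augmented matrix share the same standard deviation before the SVD.
    """
    p = A.shape[1]
    v = np.linalg.svd(np.column_stack([A / eps, b]))[2][-1]
    return -(v[:p] / v[p]) / eps                 # undo the column scaling

rng = np.random.default_rng(0)
A = rng.uniform(0.0, 3.0, size=(40, 3))
b = A @ np.array([1.0, 1.0, 0.0]) + 0.1 * rng.standard_normal(40)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
for eps in (1.0, 1e-1, 1e-2, 1e-3):
    print(eps, np.linalg.norm(tls_with_scale(A, b, eps) - x_ls))
```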

REFERENCES

[1] N. Draper and H. Smith, Applied Regression Analysis. New York: Wiley, 1981.
[2] A. Miller, Subset Selection in Regression. London, U.K.: Chapman & Hall, 1990.
[3] G. A. F. Seber and A. J. Lee, Linear Regression Analysis, 2nd ed. New York: Wiley, 2003.
[4] E. I. George and D. P. Foster, "Calibration and empirical Bayes variable selection," Biometrika, vol. 87, no. 4, pp. 731–747, 2000.
[5] H. Akaike, "A new look at the statistical model identification," IEEE Trans. Autom. Control, vol. 19, pp. 716–723, 1974.
[6] G. Schwarz, "Estimating the dimension of a model," Ann. Statist., vol. 6, no. 2, pp. 461–464, 1978.
[7] C. L. Mallows, "Some comments on C_p," Technometrics, vol. 15, pp. 661–675, 1973.
[8] M. L. Thompson, "Selection of variables in multiple regression, parts I and II," Int. Statist. Rev., vol. 46, pp. 1–19 and 129–146, 1978.
[9] W. A. Fuller, Measurement Error Models. New York: Wiley, 1987.
[10] G. H. Golub and C. F. Van Loan, "An analysis of the total least squares problem," SIAM J. Numer. Anal., vol. 17, no. 6, pp. 883–893, 1980.
[11] S. Van Huffel and J. Vandewalle, The Total Least Squares Problem. New York: SIAM, 1991.
[12] ——, "Subset selection using the TLS approach," Linear Algebra Appl., vol. 88, pp. 695–714, Apr. 1987.
[13] L. L. Scharf, Statistical Signal Processing: Detection, Estimation and Time Series Analysis. Reading, MA: Addison-Wesley, 1991.
[14] L. J. Gleser, "Estimation in a multivariate "errors in variables" regression model: Large sample results," Ann. Statist., vol. 9, no. 1, pp. 24–44, 1981.
[15] B. E. Dunne and G. A. Williamson, "QR-based TLS and mixed LS-TLS algorithms with applications to adaptive IIR filtering," IEEE Trans. Signal Process., vol. 51, no. 2, pp. 386–394, Feb. 2003.
[16] G. W. Stewart and J. G. Sun, Matrix Perturbation Theory. New York: Academic, 1990.
[17] J. H. Wilkinson, The Algebraic Eigenvalue Problem. Oxford, U.K.: Oxford Univ. Press, 1965.
[18] G. H. Golub and C. F. Van Loan, Matrix Computations. Baltimore, MD: The Johns Hopkins Univ. Press, 1983.
[19] J. J. Fuchs, "Rectangular Pisarenko method applied to source localization," IEEE Trans. Signal Process., vol. 44, no. 10, pp. 2377–2383, Oct. 1996.
[20] H. Lutkepohl, Handbook of Matrices. New York: Wiley, 1996.
[21] B. D. Rao et al., "Subset selection in noise based on diversity measure minimization," IEEE Trans. Signal Process., vol. 51, no. 3, pp. 760–770, Mar. 2003.
[22] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B, vol. 58, pp. 267–288, 1996.
[23] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc. B, vol. 67, pp. 301–320, 2005.
[24] M. Nafie, A. H. Tewfik, and M. Ali, "Deterministic and iterative solutions to subset selection problems," IEEE Trans. Signal Process., vol. 50, no. 7, pp. 1591–1601, Jul. 2002.
[25] S. A. Vorobyov, A. B. Gershman, and Z. Q. Luo, "Robust adaptive beamforming using worst-case performance optimization: A solution to the signal mismatch problem," IEEE Trans. Signal Process., vol. 51, no. 2, pp. 313–323, Feb. 2003.


Jean-Jacques Fuchs (M'81) was born in France in 1950. He graduated from the École Supérieure d'Électricité, Paris, France, in 1973, and received the M.S. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1974. He received the Thèse d'Etat in adaptive control and identification from the Université de Rennes I in 1982.
After a short period in industry with Thomson-C.S.F., he joined the Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA) in 1976. Since 1983, he has been a professor with the Université de Rennes I. His research interests are in signal processing; he is currently involved in array processing and sparse representations.


Sébastien Maria was born on March 16, 1980, in Aix-en-Provence, France. He received the Diplôme d'Ingénieur and the M.S. degrees from the Institute of Computer Science and Communication (IFSIC), Rennes, France, in 2003. He is currently working towards the Ph.D. degree at the Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Rennes.
His current research interests include statistical signal processing, sparse signal representation, convex optimization, sensor array processing, and radar.