Report SSP-2010-1:
Optimum Joint Detection and Estimation
George V. Moustakides
Statistical Signal Processing Group
Department of Electrical & Computer Engineering
University of Patras, GREECE
Contents

1 Joint Hypothesis Testing and Isolation 1
  1.1 Introduction 1
  1.2 Randomized Decision Rules and Classical Hypothesis Testing 2
    1.2.1 Neyman-Pearson Binary Hypothesis Testing 3
    1.2.2 Bayesian Multiple Hypothesis Testing 4
  1.3 Combined Hypothesis Testing and Isolation 5
    1.3.1 Optimality of GLRT 6
    1.3.2 Combined Neyman-Pearson and Bayesian Hypothesis Testing 8

2 Joint Hypothesis Testing and Estimation 10
  2.1 Introduction 10
    2.1.1 Optimum Bayesian Estimation 10
    2.1.2 Variations 11
  2.2 Combined Neyman-Pearson Hypothesis Testing and Bayesian Estimation 12
    2.2.1 Known Parameters under H0 12
    2.2.2 Conditional Cost 13
  2.3 Examples 15
    2.3.1 MAP Detection/Estimation 15
    2.3.2 MMSE Detection/Estimation 15
    2.3.3 Median Detection/Estimation 16
  2.4 Conclusion 17
  2.5 Acknowledgment 17
1 Joint Hypothesis Testing and Isolation

1.1 Introduction
In binary hypothesis testing, when hypotheses are composite or the corresponding data pdfs contain unknown parameters, one can use the well known generalized likelihood ratio test (GLRT) to reach a decision. This test has the very desirable characteristic of performing simultaneous detection and estimation in the case of parameterized pdfs, or combined detection and isolation in the case of composite hypotheses. Although the GLRT has been known for many years and has served as the decision tool in numerous applications, only asymptotic optimality results are currently available to support it. In this work we introduce a novel, finite sample size, detection/estimation formulation for the problem of hypothesis testing with unknown parameters and a corresponding detection/isolation setup for the case of composite hypotheses. The resulting optimum scheme has a GLRT-like form which is closely related to the criterion we employ for the parameter estimation or isolation part. When this criterion is selected in a very specific way we recover the well known GLRT of the literature, while alternative criteria yield interesting novel tests. Our mathematical derivations are surprisingly simple considering that they solve a problem that has been open for more than half a century.

Consider a random data vector X ∈ R^N and two composite hypotheses H0, H1 defined as
\[
H_i: X \sim f_{ik}(X) \ \text{with prior probability } \pi_{ik},\quad k=1,\ldots,K_i,\ i=0,1, \tag{1.1}
\]
where fik(X) are pdfs and “∼” means “distributed according to”. Under each hypothesis Hi the data pdf can take one out of the Ki possible forms fi1(X), ..., fiKi(X) with corresponding prior probabilities πi1, ..., πiKi. The classical approach for distinguishing between the two composite hypotheses consists in forming, for each hypothesis, the mixture pdf
\[
f_i(X) = \sum_{k=1}^{K_i} \pi_{ik}\, f_{ik}(X), \tag{1.2}
\]
and then, for any realization X of the random vector X, applying the likelihood ratio test
\[
\frac{f_1(X)}{f_0(X)} = \frac{\sum_{k=1}^{K_1}\pi_{1k}f_{1k}(X)}{\sum_{k=1}^{K_0}\pi_{0k}f_{0k}(X)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda, \tag{1.3}
\]
to make a decision. According to (1.3) we decide in favor of H1 when the likelihood ratio exceeds the threshold λ; in favor of H0 when the likelihood ratio falls below the threshold; and we perform a randomized decision between the two possibilities every time the likelihood ratio coincides with the threshold. Even though this decision scheme is optimum (in more than one sense), it can decide only between the two main hypotheses. There are clearly applications where one is interested in specifying the actual pdf that generates the data vector X. In other words, in addition to deciding the main hypothesis, we could also attempt to fine-tune our decision mechanism by isolating the pdf that is responsible for the observed data X. This goal clearly calls for a joint detection/isolation strategy. A possible approach for solving the combined problem is with the help of GLRT, that is, by applying the following test
\[
\frac{\max_{1\le k\le K_1} f_{1k}(X)}{\max_{1\le k\le K_0} f_{0k}(X)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda, \tag{1.4}
\]
which is equivalent to
\[
\frac{f_{1\hat k_1}(X)}{f_{0\hat k_0}(X)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda \tag{1.5}
\]
\[
\hat k_i = \arg\max_{1\le k\le K_i} f_{ik}(X),\quad i=0,1. \tag{1.6}
\]
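As a quick illustration, the GLRT in (1.4)-(1.6) is straightforward to implement. The sketch below is ours, not part of the report, and the Gaussian candidate densities (means and variances) are purely hypothetical; indices are 0-based in the code and ties on the threshold are given to H1 rather than randomized.

```python
import math

def gauss(x, mean, var):
    """Scalar Gaussian pdf, used here as a hypothetical candidate density."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def glrt(x, f1_list, f0_list, lam):
    """GLRT of (1.4)-(1.6): pick the most likely pdf under each hypothesis
    (isolation, eq. (1.6)) and compare their ratio to the threshold (eq. (1.4)).
    Returns (decision, isolated 0-based index)."""
    k1 = max(range(len(f1_list)), key=lambda k: f1_list[k](x))  # k-hat_1
    k0 = max(range(len(f0_list)), key=lambda k: f0_list[k](x))  # k-hat_0
    decision = 1 if f1_list[k1](x) >= lam * f0_list[k0](x) else 0
    return decision, (k1 if decision == 1 else k0)

# Hypothetical composite hypotheses: two candidate means under each.
f0s = [lambda x: gauss(x, -1.0, 1.0), lambda x: gauss(x, 0.0, 1.0)]
f1s = [lambda x: gauss(x, 2.0, 1.0), lambda x: gauss(x, 4.0, 1.0)]
print(glrt(3.2, f1s, f0s, lam=1.0))  # decides H1 and isolates the mean-4 pdf
```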
We observe that GLRT performs two simultaneous decisions: with (1.5) it decides between the two main hypotheses H0, H1 and, at the same time, with (1.6) it isolates the most likely pdf under each hypothesis. A significantly more interesting situation arises when under each hypothesis we have parameterized pdfs. Suppose that under hypothesis Hi, i = 0, 1, the data vector satisfies X ∼ fi(X|θi), where the parameter vector θi is assumed to be a realization of a corresponding random vector ϑi which is distributed according to the prior pdf πi(θi). A test for composite hypotheses would form the two mixture pdfs fi(X) = ∫ fi(X|θi)πi(θi)dθi and then apply the likelihood ratio on the resulting densities. Again, as before, this approach is unable to propose an estimate for the parameter vector θi that generates the observed data X. We realize that the isolation problem has now turned into a parameter estimation problem; consequently, if our goal is to perform, simultaneously, detection and parameter estimation, a possibility could be to apply the GLRT
\[
\frac{\sup_{\theta_1} f_1(X|\theta_1)}{\sup_{\theta_0} f_0(X|\theta_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda, \tag{1.7}
\]
or equivalently
\[
\frac{f_1(X|\hat\theta_1)}{f_0(X|\hat\theta_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda \tag{1.8}
\]
\[
\hat\theta_i = \arg\sup_{\theta_i} f_i(X|\theta_i),\quad i=0,1. \tag{1.9}
\]
With this test we decide between the two hypotheses providing, at the same time through (1.9), maximum likelihood estimates of the desired parameters. The first asymptotic optimality result for GLRT can be traced back to 1943 in the work of Wald [1], while subsequent results can be found in [2, 3, 4, 5]. A thorough analysis of this subject exists in [6, Chapter 22] and additional references in [7]. We should also mention a series of results [8]-[13] addressing the asymptotic optimality property of GLRT for special classes of processes. Finally, in [14] GLRT is related to the uniformly most powerful invariant (UMPI) test and conclusions about its asymptotic optimality are drawn from this connection. As far as applications are concerned, the literature dealing with GLRT is enormous, indicating the significant practical usefulness of this very simple decision mechanism. Despite GLRT’s extreme popularity, no finite sample size optimality result has been developed so far to support it. It is exactly this gap we intend to fill with our current work. Of course, it is unrealistic to expect that GLRT will turn out to be finite-sample-size-optimum with respect to some known criterion. The only chance we have to prove this type of optimality is by introducing a new performance measure. The measure we intend to adopt, we believe, makes a lot of sense and is tailored to the fact that GLRT performs simultaneous detection/isolation or detection/estimation. Furthermore, with our analysis we will not only provide the missing optimality theory for GLRT but also offer novel GLRT-like alternatives which might turn out to be more suitable for certain applications than the existing test.
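For contrast with the GLRT, the classical mixture likelihood ratio test of (1.2)-(1.3) can be sketched as follows. The code and the Gaussian mixtures in it are hypothetical illustrations, not taken from the report; ties on the threshold are resolved in favor of H1 instead of being randomized.

```python
import math

def gauss(x, mean, var):
    """Scalar Gaussian pdf (hypothetical candidate density)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_lrt(x, f1_list, pi1, f0_list, pi0, lam):
    """Likelihood ratio test (1.3) applied on the mixture pdfs of (1.2)."""
    f1 = sum(p * f(x) for p, f in zip(pi1, f1_list))  # mixture f_1(X)
    f0 = sum(p * f(x) for p, f in zip(pi0, f0_list))  # mixture f_0(X)
    return 1 if f1 >= lam * f0 else 0

# Hypothetical subhypotheses: H0 mixes N(0,1), N(0,4); H1 mixes N(2,1), N(3,1).
f0s = [lambda x: gauss(x, 0.0, 1.0), lambda x: gauss(x, 0.0, 4.0)]
f1s = [lambda x: gauss(x, 2.0, 1.0), lambda x: gauss(x, 3.0, 1.0)]
print(mixture_lrt(2.5, f1s, [0.5, 0.5], f0s, [0.5, 0.5], lam=1.0))
```

Note that, unlike the GLRT, this test returns only the main decision: it cannot isolate which subhypothesis produced the data.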
1.2 Randomized Decision Rules and Classical Hypothesis Testing
Before introducing our main results let us first revisit two classical problems from hypothesis testing theory, namely binary hypothesis testing in the Neyman-Pearson sense and multiple hypothesis testing in the Bayesian sense. We would like to develop the corresponding familiar optimum detection strategies by working with the class of randomized decision rules. Randomized tests tend to be easier to optimize than their deterministic counterparts because they involve optimization of functions, as opposed to the optimization of (decision) sets needed in deterministic tests. The reason we insist on the two classical hypothesis testing problems is that we intend to propose a new combined version that will produce GLRT in a natural way. Furthermore, as we mentioned, we pay special attention to the class of randomized tests instead of the conventional deterministic class because with the former it is straightforward to develop the desired optimum decision strategy.

1.2.1 Neyman-Pearson Binary Hypothesis Testing
Consider a random data vector X that takes values in R^N and two hypotheses H0: X ∼ f0(X); H1: X ∼ f1(X), where fi(X) denotes the pdf of the data vector X under hypothesis Hi. For every realization X we must come up with a decision D ∈ {0, 1}. Given X, with a randomized decision rule our decision D is a random variable. Therefore let δ0(X), δ1(X) denote the probabilities of our decision D being 0 and 1 respectively. It is clear that the two probabilities must be complementary, i.e. δ0(X) + δ1(X) = 1, and functions of the observation vector X. A randomized decision rule is completely specified once these two functions are known. A decision D is reached with the help of a random selection game where we select D = 0 with probability δ0(X) and D = 1 with probability δ1(X) using, for example, an unfair coin tossing procedure. The class of randomized decision rules is richer than the class of deterministic strategies. Indeed, we recall that a deterministic strategy is defined with the help of two complementary sets A0, A1 ⊆ R^N, where A1 = A0^c and superscript “c” denotes complement, and we decide in favor of Hj whenever X ∈ Aj, j = 0, 1. Deterministic strategies always make the same decision for the same data vector X, unlike their randomized counterparts where the decision depends on the outcome of the random game. A deterministic strategy can be viewed as a randomized rule by selecting δj(X) = 1_{Aj}(X), where 1_A(X) is the indicator function of the set A. Note that whenever X ∈ Aj the deterministic rule selects Hj; its randomized version on the other hand selects Hj with probability δj(X) = 1_{Aj}(X) = 1 which, of course, is the equivalent of a deterministic decision. The advantage of using randomized rules is that we work with functions instead of sets (which is the practice with deterministic strategies). This considerably facilitates the understanding of the proofs by a reader who is more familiar with function optimization than with set optimization.
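The random selection game described above can be made concrete in a few lines. This is our own sketch (not from the report); it shows both a genuinely randomized rule and a deterministic rule embedded as the special case δj(X) = 1_{Aj}(X).

```python
import random

def randomized_decision(delta1):
    """Play the selection game: return D = 1 with probability delta1 and
    D = 0 with the complementary probability 1 - delta1."""
    return 1 if random.random() < delta1 else 0

def deterministic_as_randomized(x, A1):
    """A deterministic rule (decide H1 iff X lies in the set A1) viewed as the
    randomized rule whose delta_1(X) equals the indicator function of A1."""
    return randomized_decision(1.0 if A1(x) else 0.0)

# With delta_1(X) in {0, 1} the "game" always returns the same decision.
print(deterministic_as_randomized(0.7, lambda x: x > 0.5))  # always 1
print(deterministic_as_randomized(0.3, lambda x: x > 0.5))  # always 0
```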
Let us now attempt to solve the binary hypothesis testing problem in the sense of Neyman-Pearson. We are seeking a randomized rule [δ0(X), δ1(X)] that maximizes the probability of detection P(D = 1|H1) subject to the constraint that the false alarm probability P(D = 1|H0) does not exceed a prescribed level α ∈ [0, 1]. We can immediately see that
\[
P(D=j|H_i) = \int \delta_j(X)\, f_i(X)\, dX. \tag{1.10}
\]
Using the Lagrange multiplier technique, we can transform the constrained optimization problem into an unconstrained one as follows
\[
\max_{\delta_1(X)} \int \delta_1(X)\big[f_1(X) - \lambda f_0(X)\big]\, dX, \tag{1.11}
\]
where λ ≥ 0 is the Lagrange multiplier. Since 0 ≤ δ1(X) ≤ 1 (we recall that δ1(X) is a probability) we conclude that the optimum δ1^o(X) is
\[
\delta_1^{o}(X) = \begin{cases} 1 & \text{when } f_1(X) - \lambda f_0(X) > 0\\ \gamma(X) & \text{when } f_1(X) - \lambda f_0(X) = 0\\ 0 & \text{when } f_1(X) - \lambda f_0(X) < 0, \end{cases} \tag{1.12}
\]
where γ(X) is an arbitrary probability. This rule is of course equivalent to the classical likelihood ratio test: select H1 with probability 1 (therefore deterministically) when f1(X)/f0(X) > λ; favor H0 when f1(X)/f0(X) < λ; and decide randomly, with probability γ(X) in favor of H1 (and therefore with probability 1 − γ(X) in favor of H0), whenever the likelihood ratio coincides with the threshold λ. Threshold λ and randomization probability γ(X) are selected so that the likelihood ratio test meets the false alarm constraint with equality. The proof of existence of suitable values for λ and γ(X) (the latter is usually set to a constant) for any level α ∈ [0, 1], and of the optimality of the resulting test, can be found in any basic textbook on hypothesis testing (see for example [15, Page 22]). We observe that under the richer class of randomized rules we still obtain the classical likelihood ratio test as our optimum detection scheme. It should be noted that although randomization does not improve the optimum (deterministic) rule, this is not necessarily the case when randomization is applied to suboptimum tests (see for example [16], where the introduction of noise transforms a deterministic test into a randomized one and improves performance).

1.2.2 Bayesian Multiple Hypothesis Testing
Consider now the case where the random data vector X satisfies K hypotheses of the form Hk: X ∼ fk(X) with corresponding prior probability πk, where k = 1, ..., K. Here the decision D takes values in the set {1, ..., K}, while the randomized decision mechanism is comprised of K complementary probabilities δ1(X), ..., δK(X), with δl(X) ≥ 0, δ1(X) + ··· + δK(X) = 1, and δl(X) denoting the probability of selecting D = l using a random selection game. For a Bayesian formulation we also need to specify a collection of costs Clk, k, l = 1, ..., K, where Clk expresses the cost of deciding in favor of Hl (i.e. D = l) when the true hypothesis is Hk. The goal is to select the randomized decision strategy, namely the probabilities δl(X), in order to minimize the average cost. If we denote the latter by C and recall (1.10), we can write
\begin{align}
C &= \sum_{l=1}^{K}\sum_{k=1}^{K} C_{lk}\, P(D=l \,\&\, H_k) = \sum_{l=1}^{K}\sum_{k=1}^{K} C_{lk}\, P(D=l|H_k)\,\pi_k \tag{1.13}\\
&= \int \sum_{l=1}^{K} \delta_l(X)\Big\{\sum_{k=1}^{K} C_{lk} f_k(X)\pi_k\Big\} dX = \int \sum_{l=1}^{K} \delta_l(X)\, D_l(X)\, dX \tag{1.14}\\
&\ge \int \sum_{l=1}^{K} \delta_l(X)\, \min_l D_l(X)\, dX = \int \min_l D_l(X)\Big\{\sum_{l=1}^{K}\delta_l(X)\Big\} dX \tag{1.15}\\
&= \int \min_l D_l(X)\, dX, \tag{1.16}
\end{align}
where the functions Dl(X) are defined as
\[
D_l(X) = \sum_{k=1}^{K} C_{lk}\, f_k(X)\,\pi_k. \tag{1.17}
\]
In the previous derivations, inequality (1.15) is true because δl(X) ≥ 0, while (1.16) is a consequence of the same functions being complementary. The final integral in (1.16) is independent of the decision strategy; therefore it constitutes a lower bound on the performance of any randomized rule. Furthermore this lower bound is always attainable by the following decision rule, which is thereby optimum:
\[
\delta_k^{o}(X) = \begin{cases} 1 & \text{when } k = \arg\min_l D_l(X)\\ 0 & \text{otherwise.} \end{cases} \tag{1.18}
\]
The previous relation is the randomized version of the well known (deterministic) Bayesian optimum decision strategy
\[
D = \arg\min_{1\le l\le K} D_l(X). \tag{1.19}
\]
Clearly, if more than one index attains the same minimum, then we randomize among them with arbitrary complementary probabilities. We also recall the very interesting special case Clk = 1 when l ≠ k and Cll = 0, for which the average cost C becomes the probability of making an erroneous decision. For this case the decision rule (1.19) is equivalent to
\[
D = \arg\max_{1\le l\le K} \pi_l f_l(X) = \arg\max_{1\le l\le K} \frac{\pi_l f_l(X)}{\sum_{k=1}^{K}\pi_k f_k(X)}. \tag{1.20}
\]
In other words we select the hypothesis with the maximum a posteriori probability (MAP). Again, we observe that we obtain the classical optimum detection scheme of the deterministic setup. In the next section we are going to combine the previous two results and propose a new performance measure which will be optimized by GLRT.
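A minimal sketch of the Bayesian rule (1.17)-(1.19), with hypothetical uniform densities and the 0-1 cost matrix that reduces it to the MAP rule (1.20); the code is ours, not part of the report.

```python
def bayes_decision(x, pdfs, priors, cost):
    """Optimum Bayes rule (1.19): D = argmin_l D_l(X), with D_l(X) as in (1.17).
    cost[l][k] is C_lk; the returned index is 0-based."""
    K = len(pdfs)
    post = [pdfs[k](x) * priors[k] for k in range(K)]               # pi_k f_k(X)
    D = [sum(cost[l][k] * post[k] for k in range(K)) for l in range(K)]
    return min(range(K), key=lambda l: D[l])

# 0-1 costs (C_lk = 1 for l != k, C_ll = 0): minimizing D_l is MAP selection.
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# Hypothetical uniform densities on nested intervals, equal priors.
pdfs = [lambda x: 1.0 if 0 <= x < 1 else 0.0,
        lambda x: 0.5 if 0 <= x < 2 else 0.0,
        lambda x: 0.25 if 0 <= x < 4 else 0.0]
print(bayes_decision(0.5, pdfs, [1/3, 1/3, 1/3], zero_one))  # narrowest pdf wins
```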
1.3 Combined Hypothesis Testing and Isolation
Let us return to the binary case and assume that each hypothesis is composite. In other words, under each hypothesis we have more than one possible data pdf, each with a known prior probability. For notational simplicity we are going to regard each such possibility as a different subhypothesis. Therefore we are going to say that H0 is comprised of the subhypotheses H0k, k = 1, ..., K0, where under H0k: X ∼ f0k(X) with prior probability π0k. Similarly H1 has the subhypotheses H1k, k = 1, ..., K1, where under H1k: X ∼ f1k(X) with prior probability π1k. Probabilities πik, k = 1, ..., Ki, i = 0, 1, are the prior probabilities of the subhypotheses given that the main hypothesis Hi is true. Consequently π01 + ··· + π0K0 = π11 + ··· + π1K1 = 1. If we simply wish to decide between H0 and H1 then, as was mentioned in the Introduction, we apply the test depicted in (1.3). If however our goal is, in addition to this decision, to isolate the specific subhypothesis which is responsible for the observed data vector X, then we need to formulate the problem differently. Note that a randomized rule capable of selecting between subhypotheses requires the definition of K0 + K1 complementary probabilities
\[
\delta_{01}(X),\ldots,\delta_{0K_0}(X),\ \delta_{11}(X),\ldots,\delta_{1K_1}(X), \tag{1.21}
\]
where for the randomization probabilities δjl(X), j = 0, 1 and l = 1, ..., Kj, we have δjl(X) ≥ 0 and
\[
\big[\delta_{01}(X)+\cdots+\delta_{0K_0}(X)\big] + \big[\delta_{11}(X)+\cdots+\delta_{1K_1}(X)\big] = 1. \tag{1.22}
\]
A key point in developing our methodology consists in observing that it is possible to write
\[
\delta_{jl}(X) = \delta_j(X)\, q_{jl}(X), \tag{1.23}
\]
where
\[
\delta_j(X) = \delta_{j1}(X)+\cdots+\delta_{jK_j}(X) \quad\text{and}\quad q_{jl}(X) = \frac{\delta_{jl}(X)}{\delta_j(X)},\quad j=0,1;\ l=1,\ldots,K_j. \tag{1.24}
\]
This alternative form of the randomization probabilities involves the following set of functions
\[
\delta_0(X),\ \delta_1(X),\ q_{01}(X),\ldots,q_{0K_0}(X),\ q_{11}(X),\ldots,q_{1K_1}(X), \tag{1.25}
\]
for which, because of (1.22), (1.23), (1.24), we have
\[
\delta_0(X)+\delta_1(X) = q_{01}(X)+\cdots+q_{0K_0}(X) = q_{11}(X)+\cdots+q_{1K_1}(X) = 1. \tag{1.26}
\]
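The factorization (1.23)-(1.24) is easy to check numerically. The sketch below, with hypothetical probability values for K0 = K1 = 2, recovers the total probabilities δj and the conditional probabilities qjl from a one-step assignment δjl:

```python
def split_two_step(delta_jl):
    """Factor one-step probabilities delta_{jl} into (delta_j, q_{jl}),
    following (1.23)-(1.24).  delta_jl maps (j, l) -> probability."""
    delta = {j: sum(p for (jj, l), p in delta_jl.items() if jj == j)
             for j in (0, 1)}
    q = {(j, l): p / delta[j] for (j, l), p in delta_jl.items() if delta[j] > 0}
    return delta, q

# Hypothetical one-step randomization probabilities, summing to 1.
one_step = {(0, 1): 0.1, (0, 2): 0.3, (1, 1): 0.2, (1, 2): 0.4}
delta, q = split_two_step(one_step)
print(delta[1])   # total probability of deciding H1
print(q[(1, 2)])  # conditional probability of isolating H12 given H1
```

Multiplying the two factors back together, delta[j] * q[(j, l)], recovers the original one-step probabilities, which is exactly the equivalence of (1.21) and (1.25).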
Actually δj(X), j = 0, 1, expresses the total randomization probability of selecting hypothesis Hj, whereas qjl(X) becomes the conditional probability of selecting subhypothesis Hjl given that we have selected the main hypothesis Hj. The two different sets of randomization probabilities depicted in (1.21) and (1.25) suggest two different randomized games for the combined detection/isolation problem. With the help of the probabilities δjl(X) in (1.21), the decision mechanism involves a single step which directly selects a specific subhypothesis. In other words we simultaneously detect and isolate. This approach is similar to the multiple hypothesis testing problem considered previously. Now, if we use the alternative set in (1.25) then the detection/isolation process is completed in two steps, since it involves two different decisions, namely D1 for detection and D2 for isolation. Specifically:

Step 1: We first make a decision D1 ∈ {0, 1} using the randomization probabilities δ0(X), δ1(X) and decide between the two main hypotheses H0, H1.

Step 2: Given that in the first step we decided D1 = j, that is, in favor of the main hypothesis Hj, we continue with the isolation part and select D2 ∈ {1, ..., Kj} using the randomization probabilities qjl(X), thus isolating one of the subhypotheses Hjl. The second randomized decision must be (conditionally on X) independent of the one applied in the first step.

The fact that in Step 2 the randomized selection game is independent of Step 1 allows us to write the probabilities δjl(X) in the product form appearing in (1.23). We would like to emphasize that the two randomized decision procedures, that is, the first based on (1.21) and the second using (1.25), are perfectly equivalent. Indeed from (1.21) we obtain (1.25) by applying (1.24), while we obtain (1.21) from (1.25) by using (1.23). The basic difference between the two decision strategies is that the second method respects the grouping of the subhypotheses while the first disregards this property completely. It is in fact this grouping of the second decision mechanism that will give rise to the desired test. We should also mention that it is not equally straightforward to come up with the alternative decision mechanism by working solely with deterministic instead of randomized tests. This fact justifies the use of the larger class of randomized rules.

1.3.1 Optimality of GLRT

Let us demonstrate the usefulness of the alternative decision mechanism presented above by introducing a simple detection/isolation problem which leads directly to the optimality of the classical GLRT. For our two-step decision process, consider the two probabilities P(Correct-detection/isolation|H1) and P(Miss-detection/isolation|H0). Following a Neyman-Pearson approach we are interested in maximizing P(Correct-detection/isolation|H1) subject to the constraint that P(Miss-detection/isolation|H0) is no larger than a prescribed level. The following theorem addresses this problem explicitly and introduces the corresponding optimum solution.

Theorem 1.1: Consider the class Jα of all detection/isolation tests that satisfy the constraint
\[
P(\text{Miss-detection/isolation}|H_0) \le \alpha, \tag{1.27}
\]
where αmin ≤ α ≤ 1, with
\[
\alpha_{\min} = 1 - \int \max_{1\le k\le K_0}\{\pi_{0k}f_{0k}(X)\}\, dX. \tag{1.28}
\]
Then the test, within the class Jα, that maximizes the probability P(Correct-detection/isolation|H1) is given by:

Step 1: The optimum strategy for deciding between the two main hypotheses H0 and H1 is the GLRT
\[
\frac{\max_{1\le k\le K_1}\{\pi_{1k}f_{1k}(X)\}}{\max_{1\le k\le K_0}\{\pi_{0k}f_{0k}(X)\}} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda, \tag{1.29}
\]
where, whenever the left-hand side coincides with the threshold, we perform a randomization between the two hypotheses and select H1 with probability γ.

Step 2: If in Step 1 we decide in favor of hypothesis Hi (i.e. D1 = i) then the optimum isolation strategy is
\[
D_2 = \arg\max_{1\le k\le K_i}\{\pi_{ik}f_{ik}(X)\}. \tag{1.30}
\]
If more than one index attains the same maximum we perform an arbitrary randomization among them. Threshold λ and randomization probability γ of Step 1 must be selected so that the constraint in (1.27) is satisfied with equality.

Proof: Note that P(Miss-detection/isolation|H0) = 1 − P(Correct-detection/isolation|H0), therefore the constraint is equivalent to P(Correct-detection/isolation|H0) ≥ 1 − α. Furthermore
\[
P(\text{Correct-detection/isolation}|H_i) = \sum_{k=1}^{K_i} P(\text{Correct-detection/isolation}|H_{ik})\,\pi_{ik} \tag{1.31}
\]
with
\[
P(\text{Correct-detection/isolation}|H_{ik}) = \int \delta_i(X)\, q_{ik}(X)\, f_{ik}(X)\, dX. \tag{1.32}
\]
To solve the constrained optimization problem, let λ ≥ 0 be a Lagrange multiplier and, as in the classical Neyman-Pearson case, define the corresponding unconstrained version. With the help of (1.31) and (1.32) we can write
\begin{align}
&P(\text{Correct-detection/isolation}|H_1) + \lambda\, P(\text{Correct-detection/isolation}|H_0) \nonumber\\
&\quad= \int \delta_1(X)\Big\{\sum_{k=1}^{K_1} q_{1k}(X)\pi_{1k}f_{1k}(X)\Big\} dX + \lambda\int \delta_0(X)\Big\{\sum_{k=1}^{K_0} q_{0k}(X)\pi_{0k}f_{0k}(X)\Big\} dX \tag{1.33}\\
&\quad\le \int \delta_1(X)\max_{1\le k\le K_1}\{\pi_{1k}f_{1k}(X)\}\, dX + \lambda\int \delta_0(X)\max_{1\le k\le K_0}\{\pi_{0k}f_{0k}(X)\}\, dX \tag{1.34}\\
&\quad= \int \Big[\delta_1(X)\max_{1\le k\le K_1}\{\pi_{1k}f_{1k}(X)\} + \delta_0(X)\,\lambda\max_{1\le k\le K_0}\{\pi_{0k}f_{0k}(X)\}\Big] dX \tag{1.35}\\
&\quad\le \int \max\Big\{\max_{1\le k\le K_1}\{\pi_{1k}f_{1k}(X)\},\ \lambda\max_{1\le k\le K_0}\{\pi_{0k}f_{0k}(X)\}\Big\} dX. \tag{1.36}
\end{align}
Inequality (1.34) is valid because the functions qik(X), k = 1, ..., Ki, are nonnegative and complementary, and (1.36) is true because the same properties hold for δi(X), i = 0, 1. Note that the final expression constitutes an upper bound on the performance of any detection/isolation rule. Furthermore this upper bound is attainable by a specific detection/isolation strategy. Indeed we have equality in (1.34) when the isolation probabilities are selected as
\[
q_{ik}^{o}(X) = \begin{cases} 1 & \text{if } k = \arg\max_{1\le l\le K_i}\{\pi_{il}f_{il}(X)\}\\ 0 & \text{otherwise,} \end{cases} \tag{1.37}
\]
and we randomize if more than one index attains the same maximum. This optimum isolation process is the randomized equivalent of (1.30). Similarly we have equality in (1.36) when we select the detection probabilities to be
\[
\delta_1^{o}(X) = \begin{cases} 1 & \text{if } \max_{1\le j\le K_1}\{\pi_{1j}f_{1j}(X)\} > \lambda\max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}\\ \gamma & \text{if } \max_{1\le j\le K_1}\{\pi_{1j}f_{1j}(X)\} = \lambda\max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}\\ 0 & \text{otherwise,} \end{cases} \tag{1.38}
\]
and δ0^o(X) = 1 − δ1^o(X). Clearly this optimum detection procedure is the equivalent of (1.29). As far as the false alarm constraint is concerned, let us define the following sets
\[
A(\lambda) = \Big\{X: \frac{\max_{1\le j\le K_1}\{\pi_{1j}f_{1j}(X)\}}{\max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}} > \lambda\Big\},\qquad
B(\lambda) = \Big\{X: \frac{\max_{1\le j\le K_1}\{\pi_{1j}f_{1j}(X)\}}{\max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}} = \lambda\Big\}. \tag{1.39}
\]
For the test introduced above, we can then write
\begin{align}
P(\text{Miss-detection/isolation}|H_0)
&= 1 - \int_{A(\lambda)} \max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}\, dX - \gamma\int_{B(\lambda)} \max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}\, dX \nonumber\\
&\ge 1 - \int_{A(\lambda)\cup B(\lambda)} \max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}\, dX \nonumber\\
&\ge 1 - \int \max_{1\le j\le K_0}\{\pi_{0j}f_{0j}(X)\}\, dX = \alpha_{\min}. \tag{1.40}
\end{align}
The lower bound αmin is clearly attainable in the limit by selecting γ = 1 and letting λ → 0. Also the miss-detection/isolation probability is bounded from above by 1, and we can see that this value can also be attained in the limit by selecting γ = 0 and letting λ → ∞. Existence of a suitable threshold λ and a randomization probability γ that assure validity of the false alarm constraint with equality, as well as optimality of the resulting test in the desired sense, can be easily demonstrated following exactly the same steps as in the classical Neyman-Pearson case¹. This concludes the proof.

We realize that in order to apply the test in (1.29) we need knowledge of the prior probabilities πik. Whenever this information is not available we can consider equiprobable subhypotheses under each main hypothesis and select πik = 1/Ki. Under this assumption the optimum test in (1.29) reduces to the classical form of GLRT depicted in (1.5) (after absorbing the two prior probabilities into the threshold). Finally, we should mention that if hypothesis H0 is simple, or if under hypothesis H0 we are not interested in the isolation problem (in which case we can treat it as simple by forming the mixture density), then P(Miss-detection/isolation|H0) becomes the usual false alarm probability, with corresponding αmin = 0. In other words the false alarm probability can take any value in the interval [0, 1] as in the classical Neyman-Pearson case.

Remark 1.1: We observe that the optimum test, under each main hypothesis, selects the most appropriate subhypothesis with the help of a MAP isolation rule, exactly as in (1.20). The interesting point is that this selection is performed independently of the other hypothesis and of the corresponding detection strategy. This is clearly a very desirable property since it separates the isolation from the detection problem. In our developments we are going to obtain suitable conditions that can guarantee the same characteristic for the extended detection/isolation problem introduced next.

¹ In the proof we simply replace the pdfs fi(X) with the functions max_{1≤k≤Ki}{πik fik(X)}. Even though these functions are not densities, the proof goes through without change.
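Theorem 1.1 translates directly into code: Step 1 is the prior-weighted GLRT (1.29) and Step 2 is the MAP isolation (1.30). The sketch below is ours, with hypothetical Gaussian densities and priors; ties on the threshold are given to H1 rather than randomized.

```python
import math

def gauss(x, mean, var):
    """Scalar Gaussian pdf (hypothetical candidate density)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def optimum_detect_isolate(x, f1s, pi1, f0s, pi0, lam):
    """Two-step optimum rule of Theorem 1.1; D2 is a 1-based subhypothesis index."""
    w1 = [p * f(x) for p, f in zip(pi1, f1s)]  # pi_{1k} f_{1k}(X)
    w0 = [p * f(x) for p, f in zip(pi0, f0s)]  # pi_{0k} f_{0k}(X)
    d1 = 1 if max(w1) >= lam * max(w0) else 0        # Step 1, eq. (1.29)
    w = w1 if d1 == 1 else w0
    d2 = max(range(len(w)), key=lambda k: w[k]) + 1  # Step 2, eq. (1.30)
    return d1, d2

f0s = [lambda x: gauss(x, 0.0, 1.0)]
f1s = [lambda x: gauss(x, 2.0, 1.0), lambda x: gauss(x, 2.5, 1.0)]
# With priors 0.9/0.1 the MAP isolation picks the mean-2 pdf at X = 2.4,
# although the plain likelihood slightly favors the mean-2.5 pdf.
print(optimum_detect_isolate(2.4, f1s, [0.9, 0.1], f0s, [1.0], lam=1.0))
```

The example illustrates how the priors πik distinguish this optimum rule from the equiprobable-case GLRT of (1.5).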
1.3.2 Combined Neyman-Pearson and Bayesian Hypothesis Testing
The previous results are directly extendable to a more general formulation where we impose costs on combinations of decisions and (sub)hypotheses. We should however emphasize that we are interested in preserving the grouping of the two sets of subhypotheses defined in the previous subsection, since this is the key idea that produces the GLRT. Therefore suppose that C_{jl}^{ik} denotes the cost of deciding in favor of subhypothesis Hjl (i.e. D1 = j, D2 = l) when the true subhypothesis is Hik. For the indexes we have i, j ∈ {0, 1} while k ∈ {1, ..., Ki} and l ∈ {1, ..., Kj}. Let us now consider the average cost C^i given that the main hypothesis Hi, i = 0, 1, is true. We have
\begin{align}
C^i &= \sum_{k=1}^{K_i}\sum_{j=0}^{1}\sum_{l=1}^{K_j} C_{jl}^{ik}\, P(D_1=j \,\&\, D_2=l \,\&\, H_{ik}|H_i) \nonumber\\
&= \sum_{k=1}^{K_i}\Bigg(\sum_{l=1}^{K_0} C_{0l}^{ik}\, P(D_1=0\,\&\,D_2=l|H_{ik}) + \sum_{l=1}^{K_1} C_{1l}^{ik}\, P(D_1=1\,\&\,D_2=l|H_{ik})\Bigg)\pi_{ik} \tag{1.41}\\
&= \int\Bigg\{\delta_0(X)\sum_{l=1}^{K_0} q_{0l}(X)\, D^{i}_{0l}(X) + \delta_1(X)\sum_{l=1}^{K_1} q_{1l}(X)\, D^{i}_{1l}(X)\Bigg\} dX, \nonumber
\end{align}
where we define \(D_{jl}^{i}(X) = \sum_{k=1}^{K_i} C_{jl}^{ik} f_{ik}(X)\,\pi_{ik}\).
By following a Neyman-Pearson-like approach we propose to minimize C^1 under the constraint that C^0 does not exceed some prescribed value. As we realize, within each main hypothesis we employ a Bayesian formulation, whereas across main hypotheses the formulation is of Neyman-Pearson type. With this specific setup we maintain the required grouping of subhypotheses mentioned before, a fact that will produce alternatives to the GLRT. In the next theorem we define explicitly the optimization problem of interest and offer the corresponding general optimum solution.

Theorem 1.2: Consider the class Jα of detection/isolation tests that satisfy C^0 ≤ α; then the test that minimizes the cost C^1 within the class Jα is given by
\[
\Big[D^{1}_{0\hat l_0}(X) - D^{1}_{1\hat l_1}(X)\Big] \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda\Big[D^{0}_{1\hat l_1}(X) - D^{0}_{0\hat l_0}(X)\Big], \tag{1.42}
\]
with the corresponding isolation process satisfying
\[
\hat l_j = \arg\min_{1\le l\le K_j}\big[D^{1}_{jl}(X) + \lambda D^{0}_{jl}(X)\big],\quad j=0,1. \tag{1.43}
\]
Threshold λ ≥ 0 and the randomization probability γ are selected so that the resulting test satisfies the constraint with equality.

Proof: Consider the unconstrained problem of minimizing C^1 + λC^0, where λ ≥ 0 is a Lagrange multiplier. We can then write
\begin{align}
C^1 + \lambda C^0
&= \int\Bigg\{\delta_0(X)\sum_{l=1}^{K_0} q_{0l}(X)\big[D^{1}_{0l}(X)+\lambda D^{0}_{0l}(X)\big] + \delta_1(X)\sum_{l=1}^{K_1} q_{1l}(X)\big[D^{1}_{1l}(X)+\lambda D^{0}_{1l}(X)\big]\Bigg\} dX \tag{1.44}\\
&\ge \int\Big\{\delta_0(X)\min_{1\le l\le K_0}\big[D^{1}_{0l}(X)+\lambda D^{0}_{0l}(X)\big] + \delta_1(X)\min_{1\le l\le K_1}\big[D^{1}_{1l}(X)+\lambda D^{0}_{1l}(X)\big]\Big\} dX \tag{1.45}\\
&= \int\Big\{\delta_0(X)\big[D^{1}_{0\hat l_0}(X)+\lambda D^{0}_{0\hat l_0}(X)\big] + \delta_1(X)\big[D^{1}_{1\hat l_1}(X)+\lambda D^{0}_{1\hat l_1}(X)\big]\Big\} dX \tag{1.46}\\
&\ge \int \min\Big\{D^{1}_{0\hat l_0}(X)+\lambda D^{0}_{0\hat l_0}(X),\ D^{1}_{1\hat l_1}(X)+\lambda D^{0}_{1\hat l_1}(X)\Big\} dX. \tag{1.47}
\end{align}
We have equality in (1.45) whenever the isolation procedure satisfies (1.43) and equality in (1.47) whenever detection is according to (1.42). If threshold λ and randomization probability γ are such that the false alarm constraint is satisfied with equality, it is then straightforward to show that the corresponding combined scheme is indeed optimum in the sense that it minimizes C^1 within the class Jα. This concludes the proof.
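The optimum rule of Theorem 1.2 can be sketched directly from (1.42)-(1.43). The tiny example below (K0 = 1, K1 = 2, 0-1 costs, and constant "densities" standing in for the values fik(X) at the observed X) is entirely hypothetical; D2 is returned 1-based.

```python
def general_detect_isolate(x, pdfs, priors, cost, lam):
    """Theorem 1.2: isolation via (1.43), then detection via (1.42).
    pdfs[i][k], priors[i][k] give f_{ik} and pi_{ik};
    cost[i][j][l][k] is C_{jl}^{ik}.  Returns (D1, D2)."""
    def D(i, j, l):  # D_{jl}^i(X) = sum_k C_{jl}^{ik} f_{ik}(X) pi_{ik}
        return sum(cost[i][j][l][k] * pdfs[i][k](x) * priors[i][k]
                   for k in range(len(pdfs[i])))

    # eq. (1.43): l-hat_j minimizes D_{jl}^1 + lam * D_{jl}^0
    lhat = {j: min(range(len(pdfs[j])),
                   key=lambda l, j=j: D(1, j, l) + lam * D(0, j, l))
            for j in (0, 1)}
    # eq. (1.42): compare the two sides of the detection test
    lhs = D(1, 0, lhat[0]) - D(1, 1, lhat[1])
    rhs = lam * (D(0, 1, lhat[1]) - D(0, 0, lhat[0]))
    d1 = 1 if lhs >= rhs else 0
    return d1, lhat[d1] + 1

pdfs = [[lambda x: 0.2], [lambda x: 0.5, lambda x: 0.3]]   # f_{0k}, f_{1k} at X
priors = [[1.0], [0.5, 0.5]]
cost = [[[[0]], [[1], [1]]],            # i = 0: C_{jl}^{0k}
        [[[1, 1]], [[0, 1], [1, 0]]]]   # i = 1: C_{jl}^{1k} (0-1 costs)
print(general_detect_isolate(0.0, pdfs, priors, cost, lam=1.0))
```

Raising λ tightens the constraint on C^0 and pushes the decision toward H0, which the sketch reproduces.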
Remark 1.2: Regarding the allowable values of the level α we have αmin ≤ α ≤ αmax. Under this general setting it is possible to find an expression only for the lower end αmin. It is easily seen that
\[
C^0 \ge \int \min\Big\{\min_{1\le l\le K_0} D^{0}_{0l}(X),\ \min_{1\le l\le K_1} D^{0}_{1l}(X)\Big\} dX = \alpha_{\min}. \tag{1.48}
\]
Furthermore, this value αmin is attainable by the optimum test, as one can verify by letting λ → ∞. Unfortunately we cannot obtain a similar expression for the upper limit αmax of C^0, since it is not clear whether the cost C^0(λ) of the optimum scheme is a monotone function of λ. Of course we can always say that αmax = sup_{λ≥0} C^0(λ), but the practical usefulness of this conclusion is minimal.

Remark 1.3: From (1.43) we understand that the isolation process under each hypothesis (expressed through the corresponding minimization) takes into account the statistics of the other hypothesis and also depends on the detection rule through the threshold λ.
We recall that in GLRT this is not the case, since we simply use a MAP selection that depends neither on the other hypothesis nor on the threshold λ. In order to obtain the same desirable property under this more general setup it is sufficient to assume that²
\[
C^{1k}_{0l} = C^{1k}_{0};\qquad C^{0k}_{1l} = C^{0k}_{1}. \tag{1.49}
\]
This, in turn, yields
\[
D^{1}_{0l}(X) = D^{1}_{0}(X);\qquad D^{0}_{1l}(X) = D^{0}_{1}(X), \tag{1.50}
\]
making (1.42) equivalent to
\[
D^{1}_{0}(X) - \min_{1\le l\le K_1} D^{1}_{1l}(X) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \lambda\Big[D^{0}_{1}(X) - \min_{1\le l\le K_0} D^{0}_{0l}(X)\Big], \tag{1.51}
\]
while the isolation process (1.43) simplifies to
\[
\hat l_i = \arg\min_{1\le l\le K_i} D^{i}_{il}(X). \tag{1.52}
\]
With the conditions in (1.49) the isolation process simplifies considerably since under hypothesis Hi it involves only the Bayes cost C^{ik}_{il}; in other words, the cost we would use if we had only the isolation problem, exactly as in Subsection 1.2.2. Consequently isolation under each hypothesis becomes independent of the isolation of the other hypothesis and also independent of the detection process, thus matching the property observed in GLRT. Although it is possible to offer several intriguing examples for our general problem, it seems more interesting to postpone this presentation until Section 2.3, after we consider the problem of combined detection/estimation.
² In GLRT this property holds since C^{1k}_{0l} = C^{0k}_{1l} = 1.
2 Joint Hypothesis Testing and Estimation

2.1 Introduction
A vastly more interesting problem arises when we combine hypothesis testing with parameter estimation. Therefore, suppose that under Hi, i = 0, 1, the corresponding data pdfs have the form fi(X|θi), where θi are parameters with prior pdf πi(θi). As mentioned in the Introduction, if we simply desire to discriminate between H0 and H1 then we can form the mixture pdfs fi(X) = ∫ fi(X|θi)πi(θi)dθi and apply the likelihood ratio test. When however our goal is to perform simultaneous detection and parameter estimation, then we need to develop techniques similar to the ones presented in the previous section, in particular Subsection 1.3.2. Before proceeding with this extension let us first discuss the notion of a randomized estimator by revisiting the problem of optimum Bayesian estimation.

2.1.1 Optimum Bayesian Estimation
As in hypothesis testing, let $X \in \mathbb{R}^N$ be a random data vector distributed according to a pdf $f(X|\theta)$. We assume that $\theta$ is a realization of a random parameter vector $\vartheta$ for which a known prior pdf $\pi(\theta)$ is available. Given a realization $X$ of the data vector, we would like to come up with a parameter estimate $\hat\theta$. Following a Bayesian approach, if $\theta$ is the true parameter vector and $\hat\theta$ the corresponding estimate, this generates a cost $C(\hat\theta,\theta)$. Our goal is to propose an estimation strategy which minimizes the average cost. This problem is very similar to the Bayesian multiple hypothesis testing problem treated in Subsection 1.2.2. We recall that in hypothesis testing there was a finite number of hypotheses and an equal number of possible decisions (selections). Here, loosely speaking, each possible value of $\theta$ corresponds to a possible hypothesis; consequently our “decision” $\hat\theta$ and the true parameter vector $\theta$ can take a continuum of values. We also recall that in the case of finitely many possibilities a randomized decision rule was defined with the help of a corresponding finite set of complementary probabilities $\delta_l(X)$. If we wish to extend this idea, we need to assign to each possible selection $\hat\theta$ a probability which is a function of $X$. Since $\hat\theta$ takes a continuum of values, to each $\hat\theta$ we can assign, in principle, a differential probability $\delta(\hat\theta|X)\,d\hat\theta$. This suggests that the equivalent of the probabilities $\delta_l(X)$ is now a probability density function $\delta(\hat\theta|X)$, that is, a function satisfying $\delta(\hat\theta|X)\ge 0$ and $\int\delta(\hat\theta|X)\,d\hat\theta = 1$. Clearly the notion of a randomized estimator is not new. Bayesian approaches make use of such entities, as one can verify by consulting [17, Page 65]. The posterior parameter pdf given the data $X$ constitutes the most common randomized estimator used in practice. Here, however, we need the general definition where any parameter pdf can play the role of an estimator.
As becomes clear from the previous discussion, a randomized estimator is completely specified once we define the pdf $\delta(\hat\theta|X)$. At this point it is interesting to mention how we can produce an actual estimate $\hat\theta$. We recall that in the previous section our decision was the outcome of a random selection game. Following the same idea here, we need to generate a realization of a random variable distributed according to $\delta(\hat\theta|X)$. This realization becomes our estimate! Although randomized estimates might seem even more awkward than randomized decisions, they nevertheless constitute their natural extension. Despite the seemingly counter-intuitive form of the proposed estimation mechanism, we must point out that randomized estimators unify the two problems of hypothesis testing and estimation in a straightforward manner.
Indeed, as we will be able to verify shortly, we obtain the corresponding optimum schemes by applying exactly the same methodology. Finally, we should also add that the class of randomized estimators is richer than the class of their deterministic counterparts. This is because any deterministic estimator of the form $\hat\theta = G(X)$, where $G(X)$ is a deterministic function of $X$, can be modeled as a randomized estimator having the pdf $\delta(\hat\theta|X) = \mathrm{Dirac}(\hat\theta - G(X))$. In other words, the pdf assigns all its probability mass to the selection $\hat\theta = G(X)$.

Let us now look for the optimum estimator within the class of randomized estimators that minimizes the expected cost. If we call the latter $\mathcal{C}$ we can write
$$
\begin{aligned}
\mathcal{C} &= \iiint C(\hat\theta,\theta)\,\delta(\hat\theta|X)\,f(X|\theta)\,\pi(\theta)\,d\theta\,d\hat\theta\,dX
= \iint \delta(\hat\theta|X)\left[\int C(\hat\theta,\theta)f(X|\theta)\pi(\theta)\,d\theta\right]d\hat\theta\,dX\\
&= \iint \delta(\hat\theta|X)\,D(\hat\theta,X)\,d\hat\theta\,dX
\ \ge\ \iint \delta(\hat\theta|X)\inf_U D(U,X)\,d\hat\theta\,dX
= \int \inf_U D(U,X)\left[\int \delta(\hat\theta|X)\,d\hat\theta\right]dX\\
&= \int \inf_U D(U,X)\,dX,
\end{aligned}
\tag{2.1}
$$
where we defined $D(U,X) = \int C(U,\theta)f(X|\theta)\pi(\theta)\,d\theta$. The last integral in (2.1) constitutes a lower bound on the performance of any randomized estimator. This lower bound is attainable if we select
$$\delta(\hat\theta|X) = \mathrm{Dirac}\Big(\hat\theta - \arg\inf_U D(U,X)\Big), \tag{2.2}$$
provided that $\arg\inf_U D(U,X)$ is a usual function¹ of $X$. It is clear that if the infimum is attained by a single function of $X$, the resulting optimum estimator is purely deterministic. When, however, we have more than one choice, we can randomize among them with arbitrary randomization probabilities and the resulting estimator will be randomized. By comparing the previous derivations with Eqs. (1.13)-(1.16) of Subsection 1.2.2 we realize that the corresponding steps are completely analogous.
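To make the optimization in (2.1)-(2.2) concrete, here is a small numerical sketch with a toy setup of my own choosing (not from the text): for the squared-error cost $C(U,\theta)=(U-\theta)^2$ with a $N(0,1)$ prior and a $N(\theta,1)$ likelihood, minimizing $D(U,X)$ over a grid of candidate estimates recovers the posterior mean, as classical Bayesian theory predicts.

```python
import numpy as np

# Toy illustration (my own setup) of (2.1)-(2.2): with squared-error cost
# C(U, theta) = (U - theta)^2, a N(0,1) prior and a N(theta,1) likelihood,
# the minimizer of D(U, X) is the posterior mean, here X/2.
theta = np.linspace(-10.0, 10.0, 4001)
prior = np.exp(-theta**2 / 2)                    # pi(theta), unnormalized
X = 1.7
w = np.exp(-(X - theta)**2 / 2) * prior          # f(X|theta) pi(theta), unnormalized
D = np.array([np.trapz((u - theta)**2 * w, theta) for u in theta])
u_star = theta[np.argmin(D)]                     # arg inf_U D(U, X) of (2.2)
post_mean = np.trapz(theta * w, theta) / np.trapz(w, theta)
# u_star and post_mean both equal X/2 = 0.85 up to grid resolution
```

The unnormalized weights suffice because normalization constants do not affect the location of the minimizer.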
2.1.2 Combined Neyman-Pearson Hypothesis Testing and Bayesian Estimation
In this part we are going to extend the result obtained in Subsection 1.3.2. Suppose again that the data vector $X$ under hypothesis $H_i$, $i=0,1$, satisfies $X \sim f_i(X|\theta_i)$ where $\theta_i$ is a realization of a random parameter vector $\vartheta_i$ with prior pdf $\pi_i(\theta_i)$. When a realization $X$ of the data vector is available we would like to decide between $H_0$ and $H_1$ and also estimate the corresponding parameter vector. A randomized detection/estimation structure will be comprised of the following set of functions
$$\delta_0(X),\ \delta_1(X),\ q_0(\hat\theta_0|X),\ q_1(\hat\theta_1|X), \tag{2.3}$$
that are the equivalent of (1.25). These functions are nonnegative, satisfying
$$\delta_0(X) + \delta_1(X) = \int q_0(\hat\theta_0|X)\,d\hat\theta_0 = \int q_1(\hat\theta_1|X)\,d\hat\theta_1 = 1, \tag{2.4}$$
which corresponds to (1.26). The two probabilities $\delta_j(X)$ are complementary while the two functions $q_j(\hat\theta_j|X)$ are pdfs with respect to $\hat\theta_j$. Our randomized detection/estimation strategy involves again two steps. In Step 1, with probabilities $\delta_j(X)$, $j=0,1$, we decide between the two main hypotheses $H_j$, while in Step 2, given that the Step 1 decision was $\mathsf{D}_1 = j$, we provide a parameter estimate $\hat\theta_j$ using the randomized estimator $q_j(\hat\theta_j|X)$.

Let us now develop the equivalent of our results in Subsection 1.3.2. This will become our starting point for considering various special cases that will give rise to interesting novel GLR-type tests. Denote by $C_j^i(\hat\theta_j,\theta_i)$ the cost of providing the parameter estimate $\hat\theta_j$ after having decided that the main hypothesis is $H_j$, when the true

¹For simplicity we assume that the infimum is in fact a minimum, in other words that there exists (at least one) function $\hat\theta = G(X)$ that attains the minimal value. In the opposite case we need to become more technical and introduce the notion of $\epsilon$-optimality, with estimation strategies whose performance is $\epsilon$-close to the optimum.
main hypothesis is $H_i$ and the corresponding true parameter value is $\theta_i$. If $\mathcal{C}^i$ denotes the average cost given that hypothesis $H_i$ is true, then we have the following expression for this quantity, which is the equivalent of (1.41):
$$\mathcal{C}^i = \int \left[\delta_0(X)\int q_0(\hat\theta_0|X)\,D_0^i(\hat\theta_0,X)\,d\hat\theta_0 + \delta_1(X)\int q_1(\hat\theta_1|X)\,D_1^i(\hat\theta_1,X)\,d\hat\theta_1\right]dX. \tag{2.5}$$
We define $D_j^i(U,X) = \int C_j^i(U,\theta_i)f_i(X|\theta_i)\pi_i(\theta_i)\,d\theta_i$. Consider now the problem of optimizing $\mathcal{C}^1$ among all detection/estimation schemes that satisfy the constraint that $\mathcal{C}^0$ is no larger than a prescribed value. The next theorem defines this problem explicitly and provides the corresponding optimum solution.

Theorem 2.1: Consider the class $\mathcal{J}_\alpha$ of detection/estimation tests that satisfy $\mathcal{C}^0 \le \alpha$; then the test that minimizes the cost $\mathcal{C}^1$ within the class $\mathcal{J}_\alpha$ is given by
$$D_0^1(\hat\theta_0,X) - D_1^1(\hat\theta_1,X) \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda\left[D_1^0(\hat\theta_1,X) - D_0^0(\hat\theta_0,X)\right], \tag{2.6}$$
with the corresponding estimates defined by
$$\hat\theta_j = \arg\inf_U\left[D_j^1(U,X) + \lambda D_j^0(U,X)\right],\quad j = 0,1. \tag{2.7}$$
Proof: The proof is analogous to the proof of Theorem 2, with the sums replaced by integrals.

Remark 2.1: For the level $\alpha$ we have $\alpha_{\min} \le \alpha \le \alpha_{\max}$ and, as in the discrete case, we have an expression only for the lower bound
$$\alpha_{\min} = \int \min\left\{\min_U D_0^0(U,X),\ \min_U D_1^0(U,X)\right\}dX. \tag{2.8}$$
Remark 2.2: Following the same lines as Theorem 2, by assuming $C_0^1(U,\theta) = C_0^1(\theta)$ and $C_1^0(U,\theta) = C_1^0(\theta)$ we obtain $D_0^1(U,X) = D_0^1(X)$ and $D_1^0(U,X) = D_1^0(X)$. Under this assumption the optimum test in (2.6) simplifies to
$$D_0^1(X) - \inf_U D_1^1(U,X) \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda\left[D_1^0(X) - \inf_U D_0^0(U,X)\right], \tag{2.9}$$
and the optimum parameter estimate becomes
$$\hat\theta_j = \arg\inf_U D_j^j(U,X). \tag{2.10}$$
The important consequence of this simplification is that the estimation part, under each hypothesis, reduces to the optimum Bayes estimator, which is independent of the other hypothesis and of the detection rule.
2.2 Variations
In this section we present two variations of the same idea that might prove interesting for applications. In both cases the resulting estimator under each hypothesis is the optimum Bayes estimator, exactly as in GLRT. We start with the case where the parameters are known under $H_0$, a scenario that is quite frequent in practice.

2.2.1 Known Parameters under H0
Let $f(X|\theta)$ be a pdf with $\theta$ a parameter vector. Suppose that under $H_0$ we have $\theta = 0$, whereas under $H_1$ the vector $\theta$ follows a prior pdf $\pi(\theta)$. We would like to test $H_0$ against $H_1$, but whenever we decide in favor of $H_1$ we would also like to provide an estimate $\hat\theta$ for the corresponding parameter vector $\theta$. Since parameter estimation is needed only under $H_1$, a combined detection/estimation scheme will be comprised of the functions $\delta_0(X), \delta_1(X), q_1(\hat\theta|X)$ that satisfy $\delta_j(X)\ge 0$, $j=0,1$, $q_1(\hat\theta|X)\ge 0$, and $\delta_0(X)+\delta_1(X) = \int q_1(\hat\theta|X)\,d\hat\theta = 1$. The two probabilities $\delta_0(X), \delta_1(X)$ will be used in the first step to decide
between the two main hypotheses, while $q_1(\hat\theta|X)$ will be employed in the second step to provide the required estimate for $\theta$ every time we decide in favor of $H_1$.

Regarding the Bayesian cost, we define $C(\hat\theta,\theta)$ to be the cost of providing an estimate $\hat\theta$ when the true value is $\theta$. Of course this cost makes sense only under $H_1$. Consequently, if the true hypothesis is $H_1$ with parameter $\theta$ and we decide in favor of $H_1$ with parameter estimate $\hat\theta$ then, as we said, the cost is $C(\hat\theta,\theta)$. If however we are under $H_1$ with parameter value $\theta$ and we decide in favor of $H_0$, then this is like selecting $\hat\theta = 0$. Hence, it makes sense to assign to this event the cost $C(0,\theta)$. Using these observations it is straightforward to compute the average cost under $H_1$, which takes the form
$$\mathcal{C}^1 = \iint \delta_1(X)\,D(\hat\theta,X)\,q_1(\hat\theta|X)\,d\hat\theta\,dX + \int \delta_0(X)\,D(0,X)\,dX \tag{2.11}$$
with $D(U,X) = \int C(U,\theta)f(X|\theta)\pi(\theta)\,d\theta$. For this special problem we propose to minimize the average cost $\mathcal{C}^1$ under $H_1$ and at the same time control the false alarm probability under $H_0$. The next theorem presents explicitly the problem of interest and introduces the corresponding optimum solution.

Theorem 4: Consider the class $\mathcal{J}_\alpha$ of detection/estimation procedures with false alarm probability not exceeding the level $\alpha$. Then within the class $\mathcal{J}_\alpha$ the test that minimizes the average cost $\mathcal{C}^1$ is given by
$$\frac{D(0,X) - \inf_U D(U,X)}{f(X|0)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda. \tag{2.12}$$
Threshold $\lambda \ge 0$ and randomization probability $\gamma$ are selected so that the false alarm constraint is satisfied with equality.

Proof: The false alarm probability under $H_0$ is given by $\mathsf{P}(\mathsf{D}_1 = 1|H_0) = \int \delta_1(X)f(X|0)\,dX$. If $\lambda \ge 0$ is a Lagrange multiplier, then we are interested in minimizing the combination $\mathcal{C}^1 + \lambda \mathsf{P}(\mathsf{D}_1 = 1|H_0)$. Using (2.11) we have
$$
\begin{aligned}
\mathcal{C}^1 + \lambda \mathsf{P}(\mathsf{D}_1 = 1|H_0)
&= \iint \delta_1(X)D(\hat\theta,X)q_1(\hat\theta|X)\,d\hat\theta\,dX + \int \delta_0(X)D(0,X)\,dX + \lambda\int \delta_1(X)f(X|0)\,dX \qquad (2.13)\\
&\ge \int \left\{\delta_1(X)\left[\inf_U D(U,X) + \lambda f(X|0)\right] + \delta_0(X)D(0,X)\right\}dX \qquad (2.14)\\
&\ge \int \min\left\{\inf_U D(U,X) + \lambda f(X|0),\ D(0,X)\right\}dX. \qquad (2.15)
\end{aligned}
$$
We have equality in (2.14) whenever the estimator $q_1(\hat\theta|X)$ is the optimum Bayesian estimator, and equality in (2.15) whenever our decision between the two main hypotheses is made according to (2.12).

2.2.2 Conditional Cost
A slightly different and in some sense more general approach is to assume that under $H_0$ we have $X \sim f_0(X)$ and under $H_1$ the data satisfy $X \sim f_1(X|\theta)$, with the parameter vector having the prior $\pi(\theta)$. Here, as before, we assume that under $H_0$ the data pdf is completely known, but it does not necessarily correspond to a specific selection of the parameter vector $\theta$ of $f_1(X|\theta)$.

Regarding Bayesian costs, we only define the cost function $C(\hat\theta,\theta)$ expressing the cost of providing the estimate $\hat\theta$ when the true value is $\theta$. Clearly this cost makes sense whenever the true hypothesis is $H_1$ and with our detection scheme we also decide in favor of $H_1$. Again our decision mechanism involves $\delta_0(X), \delta_1(X), q_1(\hat\theta|X)$, since there is no estimation under $H_0$. Here, however, we are interested in computing the average cost under $H_1$ conditioned on the event that we have selected the main hypothesis correctly, namely
$$
\mathcal{C} = \mathsf{E}[C(\hat\theta,\theta)|H_1,\mathsf{D}_1 = 1]
= \frac{\iiint C(\hat\theta,\theta)\,\delta_1(X)\,q_1(\hat\theta|X)\,f_1(X|\theta)\,\pi(\theta)\,d\theta\,d\hat\theta\,dX}{\iint \delta_1(X)\,f_1(X|\theta)\,\pi(\theta)\,d\theta\,dX}
= \frac{\iint D(\hat\theta,X)\,\delta_1(X)\,q_1(\hat\theta|X)\,d\hat\theta\,dX}{\int \delta_1(X)\,f_1(X)\,dX} \tag{2.16}
$$
where $D(U,X) = \int C(U,\theta)f_1(X|\theta)\pi(\theta)\,d\theta$ and $f_1(X) = \int f_1(X|\theta)\pi(\theta)\,d\theta$ is the mixture pdf. In other words, we consider the average (estimation) cost conditioned on the event that we have correctly detected the main hypothesis. We can now attempt to minimize $\mathcal{C}$ and at the same time control the false alarm probability. This setup makes a lot of sense, since the false alarm constraint assures acceptable performance of the detection part while the conditional cost minimization provides the best possible estimator whenever we have correctly decided in favor of $H_1$. The next theorem solves exactly this problem.

Theorem 5: Consider the class $\mathcal{J}_\alpha$ of detection/estimation procedures for which $\mathsf{P}(\mathsf{D}_1 = 1|H_0) \le \alpha$ with $\alpha \in [0,1]$; then the optimum test that minimizes the cost $\mathcal{C}$ within the class $\mathcal{J}_\alpha$ is given by
$$\frac{\rho f_1(X) - \inf_U D(U,X)}{f_0(X)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda, \tag{2.17}$$
while the corresponding estimator is
$$\hat\theta = \arg\inf_U D(U,X). \tag{2.18}$$
Parameter $\rho$, threshold $\lambda \ge 0$ and randomization probability $\gamma$ are selected so that the false alarm constraint is satisfied with equality, that is,
$$\int_{\mathscr{A}(\rho,\lambda)} f_0(X)\,dX + \gamma\int_{\mathscr{B}(\rho,\lambda)} f_0(X)\,dX = \alpha, \tag{2.19}$$
but we also need the following equation to hold:
$$\int_{\mathscr{A}(\rho,\lambda)}\left[\rho f_1(X) - \inf_U D(U,X)\right]dX + \gamma\int_{\mathscr{B}(\rho,\lambda)}\left[\rho f_1(X) - \inf_U D(U,X)\right]dX = 0. \tag{2.20}$$
Here $\mathscr{A}(\rho,\lambda)$ and $\mathscr{B}(\rho,\lambda)$ are the two subsets of $\mathbb{R}^N$ on which the statistic in (2.17) exceeds and equals the threshold $\lambda$, respectively.

Proof: Assume for the moment the existence of $\rho, \lambda, \gamma$ that solve the system of equations (2.19) and (2.20), and call $\delta_0^o(X), \delta_1^o(X)$ the randomization probabilities associated with the test in (2.17). Consider now any test in the class $\mathcal{J}_\alpha$ and let us perform the following manipulations:
$$
\begin{aligned}
&\iint \delta_1(X)D(\hat\theta,X)q_1(\hat\theta|X)\,d\hat\theta\,dX - \rho\int \delta_1(X)f_1(X)\,dX + \lambda\alpha \qquad (2.21)\\
&\quad\ge \iint \delta_1(X)D(\hat\theta,X)q_1(\hat\theta|X)\,d\hat\theta\,dX - \rho\int \delta_1(X)f_1(X)\,dX + \lambda\int \delta_1(X)f_0(X)\,dX \qquad (2.22)\\
&\quad\ge \int \delta_1(X)\inf_U D(U,X)\,dX - \rho\int \delta_1(X)f_1(X)\,dX + \lambda\int \delta_1(X)f_0(X)\,dX \qquad (2.23)\\
&\quad= \int \delta_1(X)\left[\inf_U D(U,X) - \rho f_1(X) + \lambda f_0(X)\right]dX \qquad (2.24)\\
&\quad\ge \int \min\left\{\inf_U D(U,X) - \rho f_1(X) + \lambda f_0(X),\ 0\right\}dX \qquad (2.25)\\
&\quad= \int \delta_1^o(X)\left[\inf_U D(U,X) - \rho f_1(X)\right]dX + \lambda\int \delta_1^o(X)f_0(X)\,dX \qquad (2.26)\\
&\quad= \lambda\alpha. \qquad (2.27)
\end{aligned}
$$
Up to (2.25) the steps are straightforward. Eq. (2.26) expresses the fact that the lower bound is attained by the proposed detection/estimation scheme of (2.17), (2.18). Finally, (2.27) is a consequence of $\rho, \lambda, \gamma$ solving the two equations (2.19), (2.20). Comparing (2.21) with (2.27) we conclude that for any test in the class $\mathcal{J}_\alpha$ we have $\mathcal{C} \ge \rho$, with equality whenever the test coincides with the one proposed by the theorem. For the proof to be complete we need to show that there exists a combination of $\rho, \lambda, \gamma$ that solves the two equations. For simplicity we assume that the set $\mathscr{B}(\rho,\lambda)$ has zero probability with respect to $f_0(X)$, which allows us to select $\gamma = 0$.
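For intuition, here is a numerical sketch of the statistic (2.17) and the estimator (2.18) under a toy Gaussian model of my own choosing (not from the text): $f_0 = N(0,1)$, $f_1(X|\theta) = N(\theta,1)$, prior $\pi = N(0,1)$, squared-error cost. In practice $\rho$ and $\lambda$ must be pinned down by solving (2.19)-(2.20); the value of $\rho$ below is an illustrative constant only.

```python
import numpy as np

# Numerical sketch of the statistic (2.17) and estimator (2.18) for a toy
# Gaussian model: f0 = N(0,1), f1(X|theta) = N(theta,1), prior pi = N(0,1),
# squared-error cost. rho is illustrative; (2.19)-(2.20) would fix it.
theta = np.linspace(-12.0, 12.0, 4801)
pi_th = np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi)

def statistic(X, rho):
    w = np.exp(-(X - theta)**2 / 2) / np.sqrt(2 * np.pi) * pi_th  # f1(X|th) pi(th)
    f1 = np.trapz(w, theta)                                        # mixture pdf f1(X)
    Dvals = np.array([np.trapz((u - theta)**2 * w, theta) for u in theta])
    theta_hat = theta[np.argmin(Dvals)]   # estimator (2.18): the posterior mean here
    f0 = np.exp(-X**2 / 2) / np.sqrt(2 * np.pi)
    return (rho * f1 - Dvals.min()) / f0, theta_hat

T, th = statistic(2.0, rho=1.0)
# th is approximately X/2 = 1.0; T is compared against lambda in (2.17)
```

In this particular model the posterior variance is constant, so the statistic reduces to a scaled likelihood ratio, which makes the detection part easy to sanity-check.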
2.3 Examples

In this section we present a number of interesting examples by selecting various forms for the cost functions. We concentrate on the best known costs encountered in classical Bayesian estimation theory. We start with the MAP estimate, which demonstrates the optimality of GLRT.

2.3.1 MAP Detection/Estimation
Consider the following combination of cost functions:
$$C_0^1(U,\theta) = C_1^0(U,\theta) = 1;\qquad C_0^0(U,\theta) = C_1^1(U,\theta) = \begin{cases} 0 & \|U-\theta\| \le \Delta\\ 1 & \text{otherwise.}\end{cases} \tag{2.28}$$
We recall from classical Bayesian estimation theory (see [15, Page 145]) that, as $\Delta \to 0$ and assuming sufficient smoothness of the pdfs, this specific selection of costs leads to MAP parameter estimation under each main hypothesis. Indeed, we have²
$$D_i^i(U,X) \approx 1 - V_\Delta^i f_i(X|U)\pi_i(U), \tag{2.29}$$
where $V_\Delta^i$ is the volume of a hypersphere of radius $\Delta$ (which can be different for each hypothesis if the two parameter vectors are not of the same length). Substituting into (2.10) yields
$$\frac{\sup_U f_1(X|U)\pi_1(U)}{\sup_U f_0(X|U)\pi_0(U)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda\,\frac{V_\Delta^0}{V_\Delta^1} = \lambda', \tag{2.30}$$
and the optimum estimator under each hypothesis is the MAP estimator
$$\hat\theta_i = \arg\sup_U f_i(X|U)\pi_i(U). \tag{2.31}$$
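The MAP test (2.30)-(2.31) can be sketched numerically; the scalar Gaussian models below are my own illustrative choices, not from the text: under $H_i$, $f_i(X|\theta) = N(\theta,1)$ with priors $\pi_0 = N(-2,1)$ and $\pi_1 = N(2,1)$, so the hypersphere volumes coincide and $\lambda' = \lambda$.

```python
import numpy as np

# Illustrative sketch of the MAP test (2.30)-(2.31) with scalar Gaussians:
# under Hi, f_i(X|theta) = N(theta,1) with priors pi_0 = N(-2,1),
# pi_1 = N(2,1); equal volumes V make lambda' = lambda.
U = np.linspace(-10.0, 10.0, 4001)
norm = lambda x, m: np.exp(-(x - m)**2 / 2) / np.sqrt(2 * np.pi)

def map_test(X, lam=1.0):
    g1 = norm(X, U) * norm(U, 2.0)    # f_1(X|U) pi_1(U)
    g0 = norm(X, U) * norm(U, -2.0)   # f_0(X|U) pi_0(U)
    theta1 = U[np.argmax(g1)]         # MAP estimates (2.31)
    theta0 = U[np.argmax(g0)]
    decide_H1 = g1.max() / g0.max() > lam   # test (2.30)
    return decide_H1, theta1, theta0

d, t1, t0 = map_test(3.0)
# posterior modes are (X+2)/2 = 2.5 and (X-2)/2 = 0.5; X = 3 favors H1
```

A grid search is used here only for transparency; in this conjugate Gaussian setup the maximizers are of course available in closed form.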
Similarly, for the special case of Subsection 2.2.1, if we define
$$C(U,\theta) = \begin{cases} 0 & \|U-\theta\| \le \Delta\\ 1 & \text{otherwise,}\end{cases} \tag{2.32}$$
then $D(U,X) \approx 1 - V_\Delta f(X|U)\pi(U)$ and the optimum test in (2.12) takes the form
$$\frac{\sup_U f(X|U)\pi(U)}{f(X|0)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \frac{\lambda}{V_\Delta} = \lambda', \tag{2.33}$$
with the optimum estimator being $\hat\theta = \arg\sup_U f(X|U)\pi(U)$. In both tests (2.30) and (2.33) the threshold $\lambda'$ (and the corresponding randomization probabilities) is selected to satisfy the false alarm constraint with equality. If the prior pdfs $\pi_i(\theta_i), \pi(\theta)$ are unknown and are replaced by the uniform density, we obtain the classical form of GLRT. If we now consider the discrete version of the problem and assume that $\theta_i$ can take values only in a finite set $\mathcal{V}_i = \{\theta_{i1},\ldots,\theta_{iK_i}\}$ with corresponding prior probabilities $\pi_{i1},\ldots,\pi_{iK_i}$, then we recover the GLRT.

2.3.2 MMSE Detection/Estimation
Let us now develop the first test that can be used as an alternative to GLRT. Consider the costs
$$C_0^1(U,\theta) = C_0^1(\theta);\quad C_1^0(U,\theta) = C_1^0(\theta);\quad C_0^0(U,\theta) = C_1^1(U,\theta) = \|U-\theta\|^2, \tag{2.34}$$
where $C_0^1(\theta), C_1^0(\theta)$ are functions to be specified in the sequel. Due to this selection, the estimation part is independent of the detection part. Under each main hypothesis the optimum estimator is selected by minimizing the corresponding mean square error. Consequently the optimum estimator is the conditional mean of the parameter vector given the data vector $X$ (see [15, Page 143]). Specifically we have
$$\hat\theta_i = \mathsf{E}[\theta_i|X,H_i] = \frac{\int \theta_i f_i(X|\theta_i)\pi_i(\theta_i)\,d\theta_i}{\int f_i(X|\theta_i)\pi_i(\theta_i)\,d\theta_i},\quad i = 0,1. \tag{2.35}$$
The corresponding optimum test, after substituting into (2.10), takes the form
$$A_1(X) \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda A_0(X) \tag{2.36}$$

²The approximate equality becomes exact as $\Delta \to 0$.
where
$$
\begin{aligned}
A_0(X) &= \|\hat\theta_0\|^2 f_0(X) + \int\left[C_1^0(\theta_0) - \|\theta_0\|^2\right]f_0(X|\theta_0)\pi_0(\theta_0)\,d\theta_0\\
A_1(X) &= \|\hat\theta_1\|^2 f_1(X) + \int\left[C_0^1(\theta_1) - \|\theta_1\|^2\right]f_1(X|\theta_1)\pi_1(\theta_1)\,d\theta_1\\
f_i(X) &= \int f_i(X|\theta_i)\pi_i(\theta_i)\,d\theta_i.
\end{aligned}
\tag{2.37}
$$
Selecting $C_0^1(\theta_1) = \|\theta_1\|^2$ and $C_1^0(\theta_0) = \|\theta_0\|^2$ simplifies the test considerably, yielding
$$\frac{\|\hat\theta_1\|^2\int f_1(X|\theta_1)\pi_1(\theta_1)\,d\theta_1}{\|\hat\theta_0\|^2\int f_0(X|\theta_0)\pi_0(\theta_0)\,d\theta_0} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda. \tag{2.38}$$
We recognize in the second ratio the statistic used to decide optimally between the two main hypotheses. By including the first ratio of the two squared-norm estimates, the test performs simultaneous optimum detection and estimation.

For the special case of Subsection 2.2.1, it is easy to verify that the corresponding test takes the form
$$\|\hat\theta\|^2\,\frac{\int f(X|\theta)\pi(\theta)\,d\theta}{f(X|0)} \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda, \tag{2.39}$$
where $\hat\theta = \mathsf{E}[\theta|X,H_1] = \int \theta f(X|\theta)\pi(\theta)\,d\theta \big/ \int f(X|\theta)\pi(\theta)\,d\theta$.
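A small numerical sketch of the MMSE test (2.39) follows; the scalar Gaussian model is an assumption of mine for illustration: $f(X|\theta) = N(\theta,1)$, prior $\pi = N(0,1)$, with the known value $\theta = 0$ under $H_0$.

```python
import numpy as np

# Sketch of the MMSE test (2.39) for an assumed scalar Gaussian model:
# f(X|theta) = N(theta,1), prior pi = N(0,1), known value theta = 0 under H0.
theta = np.linspace(-12.0, 12.0, 4801)
pi_th = np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi)

def mmse_statistic(X):
    w = np.exp(-(X - theta)**2 / 2) / np.sqrt(2 * np.pi) * pi_th
    mix = np.trapz(w, theta)                       # int f(X|th) pi(th) dth
    theta_hat = np.trapz(theta * w, theta) / mix   # posterior mean, = X/2 here
    f_X0 = np.exp(-X**2 / 2) / np.sqrt(2 * np.pi)
    return theta_hat**2 * mix / f_X0, theta_hat

T, th = mmse_statistic(2.0)
# th is approximately 1.0; T is compared against lambda as in (2.39)
```

The statistic is the usual likelihood ratio of the mixture against $f(X|0)$, weighted by the squared norm of the MMSE estimate, exactly the structure of (2.39).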
In both tests (2.36) and (2.39), if the priors are not known and are replaced by uniform densities, we obtain tests that are the equivalent of GLRT for the MMSE criterion.

2.3.3 Median Detection/Estimation
As our final example we present the case of median estimation, where $\theta_i, \hat\theta_i, \theta, U$ are scalars and we select the cost functions as follows:
$$C_0^1(U,\theta) = C_0^1(\theta);\quad C_1^0(U,\theta) = C_1^0(\theta);\quad C_0^0(U,\theta) = C_1^1(U,\theta) = |U-\theta|. \tag{2.40}$$
As in the previous examples, the estimation part is independent of the detection part and, under each hypothesis, it is the optimum Bayes estimator. Consequently, for this cost function the optimum estimator is the conditional median [15, Page 143]
$$\hat\theta_i = \arg\left\{y : \mathsf{P}(\theta_i \le y|X,H_i) = \frac{\int_{-\infty}^{y} f_i(X|\theta_i)\pi_i(\theta_i)\,d\theta_i}{\int f_i(X|\theta_i)\pi_i(\theta_i)\,d\theta_i} = \frac{1}{2}\right\},\quad i = 0,1. \tag{2.41}$$
The optimum test, as before, becomes
$$A_1(X) \;\overset{H_1}{\underset{H_0}{\gtrless}}\; \lambda A_0(X) \tag{2.42}$$
2.4: Conclusion
where A0 (X) =
Z h
i C10 (θ0 ) + θ0 sgn(θˆ0 − θ0 ) f0 (θ0 |X)π0 (θ0 )dθ0
A1 (X) =
Z h
i C01 (θ1 ) + θ1 sgn(θˆ1 − θ1 ) f1 (θ1 |X)π1 (θ1 )dθ1 .
(2.43)
If additionally we select C01 (θ1 ) = |θ1 | and C10 (θ0 ) = |θ0 | then the optimum test takes the very convenient form R θˆ1 0 R θˆ0 0
θ1 f1 (X|θ1 )π1 (θ1 )dθ1
H1
θ0 f0 (X|θ0 )π0 (θ0 )dθ0
H0
T λ.
(2.44)
For the special case of Subsection 2.2.1 the corresponding optimum test becomes R θˆ 0
θf (X|θ)π(θ) dθ H1 T λ, f (X|0) H
(2.45)
0
while the optimum estimator is θˆ = arg{y : P(θ ≤ y|X, H1 ) = 0.5}. Finally when the priors are selected to be uniforms, we obtain a test that is the alternative to GLRT but tuned to the specific Bayesian criterion we employ in the estimation part.
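The median test (2.45) can also be sketched numerically; the scalar Gaussian model is again my own assumed illustration: $f(X|\theta) = N(\theta,1)$, prior $\pi = N(0,1)$, with $\theta = 0$ under $H_0$, so the posterior median coincides with $X/2$.

```python
import numpy as np

# Sketch of the median test (2.45) for an assumed scalar Gaussian model:
# f(X|theta) = N(theta,1), prior pi = N(0,1), theta = 0 under H0.
theta = np.linspace(-12.0, 12.0, 4801)
pi_th = np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi)

def median_statistic(X):
    w = np.exp(-(X - theta)**2 / 2) / np.sqrt(2 * np.pi) * pi_th
    cdf = np.cumsum(w)
    cdf /= cdf[-1]
    theta_hat = theta[np.searchsorted(cdf, 0.5)]   # posterior median, = X/2 here
    if theta_hat >= 0:                             # int_0^theta_hat th f pi dth
        mask = (theta >= 0) & (theta <= theta_hat)
        num = np.trapz((theta * w)[mask], theta[mask])
    else:
        mask = (theta >= theta_hat) & (theta <= 0)
        num = -np.trapz((theta * w)[mask], theta[mask])
    f_X0 = np.exp(-X**2 / 2) / np.sqrt(2 * np.pi)
    return num / f_X0, theta_hat

T, th = median_statistic(2.0)
# th is approximately 1.0; T is compared against lambda as in (2.45)
```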
2.4 Conclusion
We have introduced a novel detection/isolation and a corresponding detection/estimation formulation of the composite hypothesis testing problem by properly combining the well known Neyman-Pearson and Bayesian methodologies. By adopting an alternative two-step realization of the classical randomized decision schemes, we were able to demonstrate that the popular GLRT is finite-sample-size optimum under the combined formulation, provided that the Bayesian cost is properly selected. An important side product of this new approach is that the selection of other Bayesian costs, such as the mean square error or the mean absolute error, leads to interesting novel GLR-type tests that can be used as alternatives to the conventional GLRT.
2.5 Acknowledgment
The author would like to thank his good friend, Prof. Igor Nikiforov from the Université de Technologie de Troyes (UTT), France, for enlightening discussions.
Bibliography

[1] A. Wald, “Tests of statistical hypotheses concerning several parameters when the number of observations is large,” Trans. American Math. Society, vol. 54, no. 3, pp. 426-482, Nov. 1943.
[2] L. Le Cam, “On some asymptotic properties of maximum likelihood estimates and related Bayes estimates,” Univ. Calif. Publ. Statistics, vol. 1, pp. 277-329, Univ. of California Press, Berkeley and Los Angeles, 1953.
[3] W. Hoeffding, “Asymptotically optimal tests for multinomial distributions,” Ann. Math. Stat., vol. 36, pp. 369-401, 1965.
[4] J. Neyman, H. Chernoff and D.G. Chapman, “Discussion of Hoeffding's paper,” Ann. Math. Stat., vol. 36, pp. 401-408, 1965.
[5] L. Weiss, “Neyman's C(α) test as a GLRT test,” Stat. & Prob. Letters, vol. 15, no. 2, pp. 121-124, Sept. 1992.
[6] M. Kendall, A. Stuart and S. Arnold, Advanced Theory of Statistics, Classical Inference and the Linear Model, 6th ed., vol. 2A, Hodder Arnold Publication, New York, 1999.
[7] D.R. Cox and D.V. Hinkley, Theoretical Statistics, Chapman and Hall, New York, 1974.
[8] J. Ziv, “On classification with empirically observed statistics and universal data compression,” IEEE Trans. Inf. Theory, vol. 34, no. 2, pp. 278-286, March 1988.
[9] M. Gutman, “Asymptotically optimal classification for multiple tests with empirically observed statistics,” IEEE Trans. Inf. Theory, vol. 35, no. 2, pp. 401-408, March 1989.
[10] N. Merhav, M. Gutman and J. Ziv, “On the estimation of the order of a Markov chain and universal data compression,” IEEE Trans. Inf. Theory, vol. 35, pp. 1014-1019, Sept. 1989.
[11] N. Merhav, “The estimation of the model order in exponential families,” IEEE Trans. Inf. Theory, vol. 35, pp. 1109-1114, Sept. 1989.
[12] J. Ziv and N. Merhav, “Estimating the number of states of a finite-state source,” IEEE Trans. Inf. Theory, vol. 37, pp. 61-65, Jan. 1992.
[13] O. Zeitouni, J. Ziv and N. Merhav, “When is the generalized likelihood ratio test optimal?” IEEE Trans. Inf. Theory, vol. 38, no. 5, pp. 1597-1602, Sept. 1992.
[14] J.R. Gabriel and S.M. Kay, “On the relationship between the GLRT and UMPI tests for the detection of signals with unknown parameters,” IEEE Trans. Signal Process., vol. 53, no. 11, Nov. 2005.
[15] H.V. Poor, An Introduction to Signal Detection and Estimation, 2nd ed., Springer, New York, 1994.
[16] D. Rousseau, G.V. Anand and F. Chapeau-Blondeau, “Noise-enhanced nonlinear detector to improve signal detection in non-Gaussian noise,” Signal Process., vol. 86, no. 11, pp. 3456-3465, Nov. 2006.
[17] C.P. Robert, The Bayesian Choice, 2nd ed., Springer, 2007.