BAYES-OPTIMAL SEQUENTIAL MULTI-HYPOTHESIS TESTING IN EXPONENTIAL FAMILIES
arXiv:1506.08915v1 [cs.IT] 30 Jun 2015
JUE WANG

Abstract. Bayesian sequential testing of multiple simple hypotheses is a classical sequential decision problem, but the optimal policy is computationally intractable in general, because the posterior probability space grows exponentially in the number of hypotheses (i.e., the curse of dimensionality in the state space). We consider a specialized problem in which observations are drawn from the same exponential family. By reconstructing the posterior probability vector from the natural sufficient statistic, it is shown that the intrinsic dimension of the posterior probability space cannot exceed the number of parameters governing the exponential family or the number of hypotheses, whichever is smaller. For the univariate exponential families commonly used in practice, the probability space is one- or two-dimensional in most cases. Hence, the optimal policy is attainable with only moderate computation. Geometric interpretations and illustrative examples are presented. Simulation studies suggest that the optimal policy can substantially outperform the existing method. The results are also extended to the sequential sampling control problem.
1. Introduction

Sequential multi-hypothesis testing is a generalization of standard statistical hypothesis testing that accounts for sequential observations and multiple alternative hypotheses. After obtaining new observations, the decision maker can stop and accept one of multiple hypotheses about the underlying statistical distribution, or wait for more observations in the hope of improving the accuracy of future decisions. The goal is to identify the true hypothesis as quickly as possible and with a desired accuracy, which can often be translated into minimizing the expected cost incurred by accepting an incorrect hypothesis and by making more observations. This is a classical sequential decision-making problem. It involves a trade-off between identification accuracy and time delay, which arises in a vast array of applications including medical diagnostics (Kabasawa and Kaihara, 1981), supervised machine learning (Fu, 1968), network security (Jung et al., 2004), as well as educational testing, physiological monitoring, clinical trials, and military target recognition; see Tartakovsky et al. (2014) for a comprehensive discussion. The study of sequential hypothesis testing originated with Wald (1945), who proposed a Bayes-optimal procedure for binary simple hypotheses called the sequential probability ratio test (SPRT). In SPRT, the decision maker observes independent and identically distributed (iid.) samples of a statistical distribution, one at a time, and calculates the probability ratio reflecting how likely one distribution (hypothesis) is true when compared to the other. If this ratio strongly favors one hypothesis, then she should stop and accept that hypothesis. Otherwise, she can keep observing. Wald and Wolfowitz (1950) considered the generalization of SPRT to multiple simple hypotheses, but found that the Bayes-optimal policy is extremely difficult to implement in practice, even for only three hypotheses.
The structure of the optimal policy, on the other hand, is well understood (Blackwell and Girschik, 1979). Numerous heuristics (e.g., Armitage (1950), Paulson (1962), Simons (1967), Lorden (1977)) and asymptotically optimal procedures (e.g., Baum and Veeravalli (1994), Dragalin et al. (1999, 2000)) have been studied; the optimal policy itself, however, has remained beyond practical reach, as noted by a recent review: "The problem of sequential testing of many hypotheses is substantially more difficult than that of testing two hypotheses. For multiple-decision testing problems, it is usually very difficult, if even possible, to obtain optimal solutions. . . . A substantial part of the development of sequential multihypothesis testing in the past several decades has been directed toward the study of suboptimal procedures, basically multihypothesis modifications of a sequential probability ratio test, for iid data models." (Tartakovsky et al. (2014), Chapter 1)

Date: June 26, 2015. Postdoctoral Fellow, Queen's School of Business, Queen's University, Kingston, Ontario, Canada, K7L 3N6, [email protected]. Previous affiliation: Department of Mechanical & Industrial Engineering (PhD Candidate), University of Toronto, [email protected].
One can view the sequential multi-hypothesis testing problem as a special type of partially observable Markov decision process (POMDP) with an identity transition matrix (Naghshvar and Javidi, 2013). The main difficulty in generalizing the optimal policy stems from the curse of dimensionality in dynamic programming. In the presence of two hypotheses, it suffices to consider the posterior probability of just one hypothesis, which is a scalar. But in the face of more than two hypotheses, one must consider a posterior probability vector (belief vector). The size of the posterior probability space (belief space) increases exponentially in the number of hypotheses, making the problem notoriously difficult to solve (Papadimitriou and Tsitsiklis, 1987). We consider a specialization in which the observation distributions come from the same exponential family. Indeed, exponential families play a central role in statistical theory and include many parametric distributions commonly used in practice. We find that the integration of sequential hypothesis testing and exponential families gives rise to a unique property, which allows one to reconstruct the N-dimensional posterior probability vector from an M-dimensional natural sufficient statistic, where N is the number of hypotheses and M is the dimension of the exponential family. As a result, the intrinsic dimension of the belief space (denoted by r) is bounded as r ≤ min{N, M}. For many univariate exponential-family distributions commonly used in practice, such as the normal, Poisson, and binomial (see Table 1), we have M ≤ 2. For these distributions, the dynamic programming problem can be reformulated in a space of lower dimension, and the optimal policy can be found with only moderate computation, even when the number of hypotheses is large. This solution method does not require conjugate priors and is compatible with any discrete prior distribution.
It also gives rise to decision regions that are non-stationary and prior-dependent, in contrast to those generated by the standard belief-vector approach considered by Wald and Wolfowitz (1950), among others. Numerical experiments suggest that the optimal policy can substantially outperform the popular suboptimal method when the hypotheses are difficult to differentiate and when the penalties for delay are high.

This paper is organized as follows. Section 2 briefly reviews the relevant literature. Section 3 describes the problem, its standard formulation, and the solution procedure. Section 4 describes the belief-vector reconstruction and the reformulation of the optimality equation, illustrated with applications to open problems. Section 5 compares the performance of the optimal solution with the existing suboptimal procedure. Section 6 extends the results to sampling control problems. The summary and discussion are given in Section 7.

2. Literature Review

The relevant literature is substantial; we can only provide a sketch here. A comprehensive review is available in Tartakovsky et al. (2014). In general, there are two streams: the Bayes-optimal policy and suboptimal policies. The stream on the optimal policy emphasizes the geometric structure of the acceptance regions in the belief space. The stream on suboptimal policies focuses on practical solution procedures, which can be further divided into heuristic policies and asymptotically optimal policies. Note that "multi-hypothesis testing" in this paper refers to the "identification" problem and should not be confused with the multiple testing problem or the multi-armed bandit problem.
Optimal policy. The Bayes-optimal policy was first examined by Wald and Wolfowitz (1950), who formulated the problem in the belief space and showed that the optimal acceptance region for each hypothesis is convex and contains a vertex of the probability simplex. Tartakovsky (1988) represented this structure in terms of a conditional control limit policy. Dayanik et al. (2008) integrated sequential multi-hypothesis testing with change detection and characterized the geometric properties of the acceptance regions in the belief space, but the inherent complexity renders the optimal implementation rather impractical. In this paper, we seek a practical solution method rather than structural results.

Suboptimal policy. Most heuristic approaches are based on the parallel implementation of multiple pairwise SPRTs. To test three hypotheses concerning the mean of a normal distribution, Sobel and Wald (1949) constructed two SPRTs for two different pairs of hypotheses and specified a series of heuristic decision rules. Armitage (1950) extended this procedure to a general number of hypotheses. A different modification of the acceptance regions was given by Simons (1967). Representative procedures along this line are compared by Eisenberg (1991). However, these methods were developed without much consideration of optimality. Baum and Veeravalli (1994) proposed an intuitively appealing approach called the M-ary sequential probability ratio test (MSPRT), a test based on decoupled likelihood ratios. It has been shown (Veeravalli and Baum, 1995; Dragalin et al., 1999) that MSPRT is asymptotically optimal as the observation costs approach zero or as the probabilities of incorrect selection approach zero. These limiting situations arise when one can afford to obtain a substantial amount of information before making the final selection, or when the alternative hypotheses are easily distinguishable from each other.
Asymptotically optimal solutions are also the foundation for recent developments in sequential joint detection and identification (Lai, 2000), decentralized sensing (Wang and Mei, 2011), as well as sampling control (Chernoff, 1959; Nitinawarat et al., 2013; Naghshvar and Javidi, 2013). Note that dynamic programming is generally not involved in suboptimal policies, whereas it is almost inevitable in the search for the optimal policy.

Sequential testing with exponential families. Many heuristic policies have been developed for the normal distribution (Sobel and Wald, 1949; Armitage, 1950; Simons, 1967), yet none claims optimality. As mentioned earlier, the Sobel-Wald procedure, as well as its extensions, is based on multiple SPRTs operated simultaneously, in which one must specify coordination rules to manage potential conflicts among the parallel tests. One approach that does not involve multiple SPRTs was proposed by Billard and Vagholkar (1969) for testing three hypotheses about the normal mean. However, it prohibits accepting any hypothesis at an early stage, which contradicts the optimal policy found in Section 4.3.1 of this paper. Detailed reviews of heuristic procedures targeting specific exponential families can be found in Ghosh and Sen (1991). To the best of our knowledge, no optimal solution method that scales to multiple hypotheses about exponential families is available in the literature.

Contributions and limitations. The main contribution of this paper is to show that a practical optimal solution method, whose computational complexity does not grow with the number of hypotheses, is possible in many practice-relevant cases. This method is based on reconstructing the high-dimensional belief vector from the low-dimensional natural sufficient statistic, a technique not commonly used in the partially observable Markov decision process literature. The limitation of the proposed method is tied to the exponential family assumption. If the observations are multivariate or are drawn from high-dimensional exponential families, the method would still suffer from the curse of dimensionality in most cases (see Section 7). It is also not directly applicable to distributions coming from different exponential families. In these situations, one should turn to suboptimal methods.
3. Preliminaries

Consider a sequence of iid. observations {Y_1, Y_2, ...}, continuous or discrete, with probability density (or mass) function f defined on Y ⊂ R^D. This distribution is unknown, but there is a finite number of distinct hypotheses about it; more specifically,

H_i : f = f_i, i = 0, ..., N,

where {f_0, f_1, ..., f_N} are known distributions and one of them is equal to f. At time k, after observing the sequence {Y_1, ..., Y_k}, we must choose an action among the following alternatives: stop and accept hypothesis H_i, where i ∈ N ≜ {0, ..., N}, or wait until the next period and make a new observation, Y_{k+1}. The decision process is terminated if we choose to stop. In Section 6, we will consider a more general problem with multiple sampling modes.

We hope to identify the true distribution with a desirable accuracy as quickly as possible. A sequential policy δ = (τ, d) consists of a stopping time τ with respect to the historical observations and an acceptance decision rule d taking values in the set N. The decision process terminates at time τ, when we stop observing and, if d = i, accept hypothesis H_i. We let ∆ denote the set of admissible policies in which the stopping and acceptance decisions are based on the information available at time τ.

Suppose that hypothesis H_i is true (namely, the actual distribution is f = f_i). If we stop and accept hypothesis H_j, then a termination cost a_ij ≥ 0 is incurred, where a_ij = 0 if i = j (no penalty for a correct identification) and a_ij ≥ 0 if i ≠ j (penalty for misidentification). If we wait, then an observation cost c_i ≥ 0 is incurred per period. Before any observation is obtained, some prior belief about the true hypothesis is available. Let 0 ≤ θ_i ≤ 1 denote the prior probability that hypothesis i is true; clearly, Σ_{i=0}^N θ_i = 1. Let θ = (θ_0, θ_1, ..., θ_N) be the prior belief vector in the N-dimensional belief space S^N ≜ {Π = (π_0, π_1, ..., π_N) ∈ [0, 1]^{N+1} | π_0 + π_1 + ⋯ + π_N = 1}. Our objective is to find the Bayes-optimal policy, given the prior belief, that minimizes the total expected cost over an infinite horizon:

R* = inf_{δ=(τ,d)∈∆} E^δ [ Σ_{i=0}^N ( τ c_i I{f = f_i} + I{τ < ∞} a_{i,d} I{f = f_i} ) ],

where I{·} is the indicator function and a_{i,d} is the termination cost of accepting H_d when H_i is true. The standard approach solves this problem by dynamic programming on the belief space, with a value function V(Π) defined on S^N satisfying an optimality equation referred to below as (1).

The observation distributions are assumed to belong to the same M-parameter exponential family, with densities of the form

f(y; α) = h(y) exp{ η^T(α) t(y) − B(α) },

where α denotes the distribution's parameters, η(α) ∈ R^M is the natural parameter vector, t(y) is the natural sufficient statistic, B(α) is the log-partition term, and h(y) is the base measure; hypothesis H_i corresponds to parameter α_i. There exist exponential families with M > 2, although they are not as frequently used in practice as those in Table 1. The results of this paper are based on the following assumption:

Assumption 1. The sequential observations are independent and identically distributed (iid.), drawn from the same exponential family.

The iid. assumption is standard in sequential models. As mentioned earlier, exponential families have been extensively studied in sequential hypothesis testing and are widely used in practice. Thus, this assumption is considered standard in the literature and has important practical roots.
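Concretely, the natural-parameter form above can be sanity-checked numerically. The following sketch (illustrative, not from the paper) verifies it for the normal family, where t(y) = (y, y²), η = (μ/σ², −1/(2σ²)), B = μ²/(2σ²) + ln σ, and h(y) = 1/√(2π):

```python
import math

# Sanity check of f(y; alpha) = h(y) * exp{ eta(alpha)^T t(y) - B(alpha) }
# for the normal family N(mu, sigma^2). Illustrative, not from the paper.
def normal_natural_pdf(y, mu, var):
    eta1, eta2 = mu / var, -1.0 / (2.0 * var)        # natural parameters
    B = mu * mu / (2.0 * var) + 0.5 * math.log(var)  # log-partition term
    h = 1.0 / math.sqrt(2.0 * math.pi)               # base measure
    return h * math.exp(eta1 * y + eta2 * y * y - B)

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# The two forms agree pointwise.
for y in (40.0, 45.0, 52.0, 60.0):
    assert abs(normal_natural_pdf(y, 45.0, 25.0) - normal_pdf(y, 45.0, 25.0)) < 1e-12
```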
4. Optimal Solution Method

In this section, we first show that the belief vector can be reconstructed from the natural sufficient statistic. Then, we describe the optimal solution method and illustrate it with examples.

4.1. Belief vector reconstruction.

Definition 2. Define the natural difference matrix as

H = [ η^T(α_1) − η^T(α_0) ; ... ; η^T(α_N) − η^T(α_0) ],

whose i-th row is η^T(α_i) − η^T(α_0), where η(α_i) is the natural parameter vector of distribution f_i, i ∈ N.

The natural difference matrix H is an N-by-M matrix, where N is the number of hypotheses and M is the number of parameters in the exponential family. Let r = rank(H) denote its rank, so r ≤ min{N, M}. Since any matrix has a rank factorization, we can always find two matrices, L and U, such that H = LU, where L is an N-by-r matrix of full column rank and U is an r-by-M matrix of full row rank (the rank factorization is not unique, but any one of them can be chosen). The full-row-rank matrix U will be used to construct the minimal sufficient statistic:

Definition 3. For a sequence of observations Y_1, ..., Y_k following the distribution f with natural sufficient statistic t(·), we define the minimal cumulative natural sufficient statistic as x^k = U Σ_{m=1}^k t(Y_m).

Note that x^k is an r-dimensional vector, which can also be viewed as an r-dimensional projection of the cumulative sum of the natural sufficient statistic vector. We will call it the "sufficient statistic" for short from now on. We use Π^k(Y; θ) ≜ (π_0^k, ..., π_N^k) to denote the belief vector given the iid. observations Y = (Y_1, ..., Y_k) and the prior belief θ = (θ_0, ..., θ_N). The following proposition shows that the (N+1)-dimensional belief vector Π^k(Y; θ) can be reconstructed from the r-dimensional vector x^k.
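Definitions 2 and 3 can be sketched numerically. The following (with illustrative normal-mean parameters, not the paper's) builds H, obtains one rank factorization H = LU from the thin SVD, and shows the additive update of x^k:

```python
import numpy as np

# A sketch of Definitions 2-3 for three normal-mean hypotheses (known
# variance); parameters are illustrative. For N(mu, var), the natural
# parameter vector is eta = (mu/var, -1/(2*var)) with t(y) = (y, y^2).
mus, var = np.array([45.0, 55.0, 60.0]), 25.0
eta = np.column_stack([mus / var, np.full(len(mus), -1.0 / (2.0 * var))])
H = eta[1:] - eta[0]               # natural difference matrix, N-by-M

r = np.linalg.matrix_rank(H)       # intrinsic dimension, r <= min(N, M)

# One rank factorization H = L U via the thin SVD (any rank
# factorization works equally well).
W, s, Vt = np.linalg.svd(H)
L, U = W[:, :r] * s[:r], Vt[:r, :]
assert np.allclose(L @ U, H)

# The statistic x^k = U * sum_m t(Y_m) updates by simple addition.
t = lambda y: np.array([y, y * y])
x = np.zeros(r)
for y in [58.0, 52.0, 41.0, 57.0]:
    x = x + U @ t(y)
```

Here r = 1 because the second natural-parameter component is common to all hypotheses, so the scalar x^k carries all the information in the full observation history.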
Proposition 1 (Belief vector reconstruction). Under Assumption 1, the belief vector Π^k(Y; θ) can be reconstructed from the minimal cumulative natural sufficient statistic x^k through a mapping T^k : R^r → S^N. That is, Π^k(Y; θ) = T^k(x^k; θ), where T^k ≜ (T_0^k, ..., T_N^k),

π_0^k = T_0^k(x^k; θ) ≜ [ Σ_{i=1}^N (θ_i/θ_0) exp( D_i^k(x^k) ) + 1 ]^{−1},

π_j^k = T_j^k(x^k; θ) ≜ θ_j exp( D_j^k(x^k) ) / [ Σ_{i=1}^N θ_i exp( D_i^k(x^k) ) + θ_0 ],  j = 1, ..., N,

D_i^k(x^k) ≜ e_i L x^k − k ( B(α_i) − B(α_0) ),

and e_i is an N-dimensional unit row vector with 1 at the i-th component.

Proof. We first show that π_0^k can be reconstructed from x^k. By the definition of T_0^k, we have

T_0^k(x^k; θ) = T_0^k( U Σ_{m=1}^k t(Y_m); θ )
= [ Σ_{i=1}^N (θ_i/θ_0) exp{ D_i^k( U Σ_{m=1}^k t(Y_m) ) } + 1 ]^{−1}
= [ Σ_{i=1}^N (θ_i/θ_0) exp{ e_i H Σ_{m=1}^k t(Y_m) − k( B(α_i) − B(α_0) ) } + 1 ]^{−1}
= [ Σ_{i=1}^N (θ_i/θ_0) exp{ ( η^T(α_i) − η^T(α_0) ) Σ_{m=1}^k t(Y_m) − k( B(α_i) − B(α_0) ) } + 1 ]^{−1}
= [ Σ_{i=1}^N (θ_i/θ_0) exp{ η^T(α_i) Σ_{m=1}^k t(Y_m) − k B(α_i) } / exp{ η^T(α_0) Σ_{m=1}^k t(Y_m) − k B(α_0) } + 1 ]^{−1}
= [ Σ_{i=1}^N (θ_i/θ_0) Π_{m=1}^k h(Y_m) exp{ η^T(α_i) t(Y_m) − B(α_i) } / Π_{m=1}^k h(Y_m) exp{ η^T(α_0) t(Y_m) − B(α_0) } + 1 ]^{−1}
= [ Σ_{i=1}^N (θ_i/θ_0) Π_{m=1}^k f(Y_m; α_i) / Π_{m=1}^k f(Y_m; α_0) + 1 ]^{−1}
= [ Σ_{i=1}^N π_i^k / π_0^k + 1 ]^{−1} = π_0^k,

where the third equality follows from the definition of D_i^k and H = LU, the fourth equality follows from the definition of H, the fifth and sixth equalities involve algebraic manipulation and the reconstruction of the exponential family density, the seventh equality follows from the definition of the exponential family and the iid. assumption, the eighth equality follows from Bayes' rule, and the last equality follows because Σ_{i=0}^N π_i^k = 1.

Using the same technique, the other components can also be reconstructed from x^k. For j = 1, ..., N, we have

T_j^k(x^k; θ) = T_j^k( U Σ_{m=1}^k t(Y_m); θ )
= θ_j exp{ D_j^k( U Σ_{m=1}^k t(Y_m) ) } / [ Σ_{i=1}^N θ_i exp{ D_i^k( U Σ_{m=1}^k t(Y_m) ) } + θ_0 ]
= θ_j exp{ e_j L U Σ_{m=1}^k t(Y_m) − k( B(α_j) − B(α_0) ) } / [ Σ_{i=1}^N θ_i exp{ e_i L U Σ_{m=1}^k t(Y_m) − k( B(α_i) − B(α_0) ) } + θ_0 ]
= θ_j exp{ ( η^T(α_j) − η^T(α_0) ) Σ_{m=1}^k t(Y_m) − k( B(α_j) − B(α_0) ) } / [ Σ_{i=1}^N θ_i exp{ ( η^T(α_i) − η^T(α_0) ) Σ_{m=1}^k t(Y_m) − k( B(α_i) − B(α_0) ) } + θ_0 ]
= [ (θ_j/θ_0) Π_{m=1}^k f_j(Y_m) / Π_{m=1}^k f_0(Y_m) ] / [ Σ_{i=1}^N (θ_i/θ_0) Π_{m=1}^k f_i(Y_m) / Π_{m=1}^k f_0(Y_m) + 1 ]
= ( π_j^k / π_0^k ) / ( Σ_{i=1}^N π_i^k / π_0^k + 1 ) = π_j^k.

By now we have shown that T^k(x^k; θ) = (T_0^k(x^k; θ), ..., T_N^k(x^k; θ)) = (π_0^k, ..., π_N^k) = Π^k(Y; θ), thereby completing the proof. □
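The identity proved above can be checked numerically in the normal-mean case, where B(α_i) − B(α_0) = (μ_i² − μ_0²)/(2σ²). The following sketch (illustrative parameters) compares direct Bayes updating with the reconstruction T^k(x^k; θ):

```python
import numpy as np

# Check Proposition 1 for the normal-mean case: the posterior from direct
# Bayes updating equals T^k(x^k; theta) computed from the scalar statistic
# x^k = sum_m y_m. Parameters are illustrative.
mus, var = np.array([45.0, 55.0, 60.0]), 25.0
theta = np.array([1/3, 1/3, 1/3])
ys = np.array([58.0, 52.0, 41.0, 57.0])

# Direct Bayes: posterior proportional to prior times likelihood.
lik = np.exp(-np.sum((ys[None, :] - mus[:, None])**2, axis=1) / (2 * var))
post_direct = theta * lik / np.sum(theta * lik)

# Reconstruction: D_i^k(x) = (mu_i - mu_0)/var * x - k*(B_i - B_0),
# with B_i - B_0 = (mu_i^2 - mu_0^2)/(2*var) for a common known variance.
k, x = len(ys), ys.sum()
D = (mus - mus[0]) / var * x - k * (mus**2 - mus[0]**2) / (2 * var)
w = theta * np.exp(D)              # D_0 = 0, so w[0] = theta[0]
post_recon = w / w.sum()

assert np.allclose(post_direct, post_recon)
```

The agreement is exact up to floating-point error: the scalar x carries everything the full sample carries about the posterior.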
Remark 1. Proposition 1 shows that the belief vectors of interest can be represented by a sufficient statistic whose dimension is determined by the rank of the natural difference matrix, r ≤ min{N, M}. Note in Table 1 that M ≤ 2 for many distributions relevant in practice, meaning that the belief space is essentially one- or two-dimensional, even when many hypotheses are involved. A geometric interpretation is given in Figure 1. In essence, the subset of the belief space that is reachable from the initial state (i.e., the prior θ) in k periods is an r-dimensional manifold F_k embedded in the N-dimensional belief space. The nonlinear transformation T^k "curls" the sufficient-statistic space to form such a manifold with the same intrinsic dimension as itself.

4.2. Reformulating the Optimality Equation. Now we reformulate the original optimality equation (1) in the sufficient-statistic space. Unlike the belief vector, updating the sufficient statistic does not require Bayes' theorem; it only involves a simple addition, namely, x^{k+1} = x^k + U t(Y_{k+1}), by Definition 3. Consider a new value function J^k(x; θ) ≜ V(Π^k), defined by moving the time index k out of the original value function V(·) and emphasizing Π^k's dependence on θ. The optimality equation involving J^k(x; θ) is given in the following corollary:

Corollary 1. Under Assumption 1, the optimality equation (1) can be reformulated using the minimal natural sufficient statistic x, as follows:

J^k(x; θ) = min{ J_0^k(x; θ), ..., J_N^k(x; θ), J_w^k(x; θ) },  k = 1, 2, ...  (4)

J_j^k(x; θ) = Σ_{i=0}^N T_i^k(x; θ) a_ij,  j = 0, ..., N,

J_w^k(x; θ) = Σ_{i=0}^N T_i^k(x; θ) c_i + ∫_Y J^{k+1}( x + U t(y) ) Σ_{i=0}^N T_i^k(x; θ) dF_i(y).  (5)
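As a concrete, deliberately small illustration, a finite-horizon version of (4)-(5) can be solved by backward induction on a discretized grid. The sketch below uses illustrative normal-mean parameters, a truncation horizon K (stopping is forced at the horizon), and a simple quadrature grid for the integral; it is not the paper's implementation:

```python
import numpy as np

# A minimal backward-induction sketch of the reformulated optimality
# equation (4)-(5) for the normal-mean case (r = 1). Parameters are
# illustrative, not the paper's.
mus, var = np.array([45.0, 55.0, 60.0]), 25.0
theta = np.array([1/3, 1/3, 1/3])
c = np.array([0.5, 0.2, 0.3])                             # observation costs
A = np.array([[0., 2., 5.], [3., 0., 6.], [4., 7., 0.]])  # a_ij

def T(k, x):
    """Posterior T^k(x; theta) from the scalar statistic x = sum of y's."""
    D = (mus - mus[0]) / var * x - k * (mus**2 - mus[0]**2) / (2 * var)
    w = theta * np.exp(D - D.max())
    return w / w.sum()

K = 15                                        # truncation horizon
ygrid = np.linspace(20.0, 85.0, 261)          # roughly +/- 5 sigma
dy = ygrid[1] - ygrid[0]
dens = np.exp(-(ygrid[None, :] - mus[:, None])**2 / (2 * var)) \
       / np.sqrt(2 * np.pi * var)             # f_i(y) on the grid

grids = {k: np.linspace(20.0 * k, 85.0 * k, 200) for k in range(1, K + 1)}
J = {}
for k in range(K, 0, -1):                     # backward induction
    post = np.array([T(k, x) for x in grids[k]])
    Jk = (post @ A).min(axis=1)               # best stopping cost
    if k < K:
        mix = post @ dens                     # predictive density of y
        nxt = np.array([np.interp(x + ygrid, grids[k + 1], J[k + 1])
                        for x in grids[k]])
        wait = post @ c + (nxt * mix).sum(axis=1) * dy
        Jk = np.minimum(Jk, wait)             # waiting vs. stopping
    J[k] = Jk
```

The waiting region at each k is read off as the grid points where the continuation value beats every stopping value; its boundaries are the non-stationary acceptance intervals discussed below.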
Remark 2. The function J^k(x; θ) is defined on the r-dimensional state space where x resides. Since r ≤ 2 in many real settings, we can solve these one- or two-dimensional problems by discretizing the state space and performing value iteration with truncation (Blackwell and Girschik, 1979). Note also that J^k(x; θ) explicitly depends on both the time k and the prior θ (as opposed to the original value function V(Π), which is independent of k and θ). Accordingly, the acceptance regions become non-stationary and prior-dependent. We compare and contrast the two approaches in Table 2, although the decisions generated by them are the same.

Table 2. Belief-vector vs. sufficient-statistic approach

                                        Belief-vector aprch.   Sufficient-statistic aprch.
  Dimension of the state space          N                      r ≤ min{N, M}
  State variable                        Π^k                    x^k
  Value function                        V(Π^k)                 J^k(x^k; θ)
  Stationary acceptance regions         Yes                    No
  Prior-dependent acceptance regions    No                     Yes
  Prior-dependent state                 Yes                    No

Figure 1. Illustration of the reachable belief space. [Panels: the belief space (left) and the minimal natural sufficient statistic space (right); the transformation T^k maps the statistic space onto the reachable manifolds F_1, F_2, shown for (N = 3, r = 1) and (N = 4, r = 2).]

4.3. Applications to Open Problems. We now apply the sufficient-statistic approach to test hypotheses concerning the normal distribution. Various suboptimal procedures have been developed for these seemingly basic problems, such as Sobel and Wald (1949), Armitage (1950), Simons (1967), and Billard and Vagholkar (1969), reviewed by Eisenberg (1991). However, no procedure has been shown to be both optimal and scalable to a large number of hypotheses.

4.3.1. Testing the mean of a normal distribution. We begin with a standard problem of testing simple hypotheses about the mean of a normal distribution, assuming that the variances are known. Suppose
that independent scalar observations y = {Y_1, ..., Y_k} are sequentially drawn from one of N + 1 univariate normal distributions f_i = N(μ_i, σ²), i = 0, ..., N, differing in the mean. We are concerned with the hypotheses H_i : Y_k ∼ N(μ_i, σ²), i = 0, ..., N. We have obtained the prior belief vector θ = (θ_0, ..., θ_N) reflecting our initial knowledge. The first step is to find the natural difference matrix

H = [ η^T(α_1) − η^T(α_0) ; ... ; η^T(α_N) − η^T(α_0) ] = [ (μ_1 − μ_0)/σ², 0 ; ... ; (μ_N − μ_0)/σ², 0 ].

Clearly, the rank of this matrix is one, i.e., r = 1. A rank factorization of H is L = ( (μ_1 − μ_0)/σ², ..., (μ_N − μ_0)/σ² )^T, U = (1, 0). The minimal natural sufficient statistic is the cumulative sum of observations, x^k = Σ_{m=1}^k y_m. The transformation T^k(x; θ) can be specialized as

T_0^k(x; θ) = [ Σ_{i=1}^N (θ_i/θ_0) exp{ x (μ_i − μ_0)/σ² − k (μ_i² − μ_0²)/(2σ²) } + 1 ]^{−1},

T_j^k(x; θ) = θ_j exp{ x (μ_j − μ_0)/σ² − k (μ_j² − μ_0²)/(2σ²) } / [ Σ_{i=1}^N θ_i exp{ x (μ_i − μ_0)/σ² − k (μ_i² − μ_0²)/(2σ²) } + θ_0 ],  j = 1, ..., N.

The optimality equation can be obtained by specializing (4) using the T^k(x) defined above, the natural sufficient statistic t(y) = (y, y²)^T, and the corresponding normal distribution functions F_i(y).

Example 1: Illustrative Example. Consider the example with the parameters: (μ_0, μ_1, μ_2) = (45, 55, 60), σ² = 25, Π^0 = (θ_0, θ_1, θ_2) = (1/3, 1/3, 1/3), (c_0, c_1, c_2) = (0.5, 0.2, 0.3), a_01 = 2, a_02 = 5, a_10 = 3, a_12 = 6, a_20 = 4, a_21 = 7, a_00 = a_11 = a_22 = 0. The simple hypotheses to be tested are H_0 : Y_k ∼ N(45, 25), H_1 : Y_k ∼ N(55, 25), H_2 : Y_k ∼ N(60, 25). Four independent realizations of Y_k, {y_1, y_2, y_3, y_4} = {58, 52, 41, 57}, are sequentially generated from the distribution N(55, 25).

We first describe the standard belief-vector approach. Based on the prior and the sequential observations, we find the posterior belief vectors: Π^1 = (0.019, 0.466, 0.515), Π^2 = (0.013, 0.721, 0.266), Π^3 = (0.399, 0.593, 0.008), Π^4 = (0.039, 0.949, 0.012). The first three belief vectors lie in the waiting region, but the fourth vector falls in the acceptance region for hypothesis H_1, as shown in Figure 2 (upper panel). Clearly, the sample path depends on the prior, but the acceptance regions do not, and they remain fixed over time.

Next, we describe the proposed sufficient-statistic approach. The sequence of sufficient statistics is x^1 = 58, x^2 = 110, x^3 = 151, x^4 = 208. We compare each statistic with the acceptance intervals shown in Figure 2 (lower panel) and find that it is optimal to wait until the fourth period. The sequence of actions is, undoubtedly, the same as in the belief-vector approach, but the decision process now lives in one dimension. Note that the sufficient statistic is independent of the prior, while the acceptance intervals depend on both the prior and time.

Remark 3. To implement this approach in the real world, it is often desirable to use the sample average ( Σ_{m=1}^k y_m )/k as the statistic, so that the size of the state space does not increase with k. The sufficient statistic ( Σ_{m=1}^k y_m )/k has been used by heuristic methods (Sobel and Wald, 1949; Billard and Vagholkar, 1969). However, the decision rules draw a sharp distinction between these heuristics and the optimal policy. For example, the Sobel-Wald procedure combines multiple SPRTs, whereas the optimal method only requires a single test. The Billard-Vagholkar procedure prohibits accepting any hypothesis in the first few periods, but the optimal policy allows stopping even at the first period.
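The numbers in Example 1 are easy to reproduce (a sketch): x^k is the running sum of the observations, and the posteriors follow from the specialized T^k above.

```python
import numpy as np

# Reproducing Example 1: x^k is the running sum of observations, and the
# posteriors follow from T^k(x; theta) for the normal-mean case.
mus, var = np.array([45.0, 55.0, 60.0]), 25.0
theta = np.array([1/3, 1/3, 1/3])
ys = [58.0, 52.0, 41.0, 57.0]

xs, posts, x = [], [], 0.0
for k, y in enumerate(ys, start=1):
    x += y                                    # x^k = x^{k-1} + y_k
    D = (mus - mus[0]) / var * x - k * (mus**2 - mus[0]**2) / (2 * var)
    w = theta * np.exp(D)
    xs.append(x)
    posts.append(w / w.sum())

# xs == [58, 110, 151, 208]; posts[3] concentrates on H1, matching
# Pi^4 = (0.039, 0.949, 0.012) to within 10^-3.
```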
[Figure 2 panels: upper, the belief-vector approach, with stationary acceptance regions (Accept H_0, Accept H_1, Accept H_2, Wait) in the belief simplex and the sample path Π^0, Π^1, ..., Π^4 driven by the sequential data y = {y_1, y_2, y_3, y_4}; lower, the sufficient-statistic approach, with non-stationary acceptance intervals on the minimal natural sufficient statistic x^k = Σ_{m=1}^k y_m plotted against k = 1, ..., 4.]
Figure 2. Comparison of the belief-vector approach and the sufficient-statistic approach in Example 1.

Example 2: Flexible priors. Although using the natural sufficient statistic as the state variable is not new in Bayesian dynamic programming, most existing methods rely on natural conjugate priors. By way of contrast, the proposed method does not require conjugate priors; in fact, it is compatible with arbitrary nonzero prior beliefs. We illustrate this flexibility using the following example with ten hypotheses, a size that cannot be efficiently solved by the belief-vector approach. The parameters in this example are μ_i = 40 + 5i, σ_i² = σ² = 100, c_i = 0.02 − 0.01i, a_ij = |i − j| + 0.5[max(j − i, 0)]², for i = 0, ..., 9. To illustrate the flexibility of prior selection, we choose a non-conjugate prior distribution involving a trigonometric function, θ_i = (sin(i) + 1.5)/16.4 for i = 0, ..., 9, as shown in Figure 3-a. This bimodal prior implies that H_1 and H_7 are the most likely hypotheses, whereas H_3 and H_4 are the least likely. After collecting a series of observations, the posterior probabilities for k = 1, 5, 10, 15, 21 are shown in the same figure. Figure 3-b illustrates the optimal acceptance intervals on the sufficient statistic x^k for each k. These intervals are similar to those in Figure 2. Indeed, increasing the number of hypotheses no longer requires going to higher dimensions; it only adds more intervals to this chart. It is clear that the waiting intervals gradually shrink as k increases, because the uncertainty decreases as the information accumulates.

4.3.2. Testing both mean and variance of a normal distribution. When we are testing both the mean and the variance, the cumulative sum of observations (y) may not be sufficient, because the variance information is often better captured by the squared observations (y²). Suppose that we sequentially observe y = {Y_1, ..., Y_k} drawn from one of N + 1 > 2 normal distributions f_i = N(μ_i, σ_i²), i = 0, ..., N.
The goal is to find the true distribution by testing the hypotheses H_i : Y_k ∼ N(μ_i, σ_i²), i = 0, ..., N. As before, we first find the natural difference matrix

H = [ η^T(α_1) − η^T(α_0) ; ... ; η^T(α_N) − η^T(α_0) ]
  = [ (σ_0²μ_1 − σ_1²μ_0)/(σ_0²σ_1²), (σ_1² − σ_0²)/(2σ_0²σ_1²) ; ... ; (σ_0²μ_N − σ_N²μ_0)/(σ_0²σ_N²), (σ_N² − σ_0²)/(2σ_0²σ_N²) ].
[Figure 3 panels: (a) the prior θ_j and the posteriors π_j^k for k = 1, 5, 10, 15, 21, plotted against the hypothesis index j = 0, ..., 9, with the realized statistics x^1(y) = 65, x^5(y) = 301, x^10(y) = 609, x^15(y) = 938, x^21(y) = 1284; (b) the optimal actions (Accept H_0, ..., Accept H_9, Wait) as acceptance intervals on x^k over time k.]
Figure 3. Example 2: (a) belief vectors and (b) acceptance intervals on the sufficient statistic. Next, we define ζi , (σ02 µi − σi2 µ0 )/(σi2 − σ02 ) and examine two cases: Case I: ζi ’s are identical. If ζi = ζ for all i = 1, . . . , N , the matrix H has rank one (r = 1). Thus, σ 2 µ −σ 2 µ T σ 2 µ −σ 2 µ the problem is in one dimension. A rank factorization is L = 0 σ12 σ21 0 , . . . , 0 σN2 σ2 N 0 , U = 0 1 P 0 N P 2 /ζ). (1, 1/ζ). The sufficient statistic is the scalar xk = U km=1 t(Ym ) = (1, 1/ζ) = km=1 (ym + ym Identical ζi ’s can arise when the means are identical but the variances differ (ζi = µ0 − µ), or when the means are different but the variances are identical (e.g., Example 1-2). But it can also appear when both mean and variance are different, for instance, when (µ0 , µ1 , µ2 , µ3 ) = (0, 1, 2, 3), (σ02 , σ12 , σ22 , σ32 ) = (1, 2, 3, 4), in which the rank of the matrix H is still one, and the P 2 ). corresponding sufficient statistic is still a scalar given by xk = km=1 (ym + ym Case II: ζi ’s are non-identical. If ζi ’s are different, we have r = 2. A rank factorization is L = Pk H, U = I (the identity matrix). In this case, the sufficient statistic is xk = m=1 t(Ym ) = T Pk Pk 2 k m=1 ym . The transformation T (x; θ) can be specialized as m=1 ym , N X n σ2µ − σ2µ h µ 2 σ 2 − µ2 σ 2 σ io −1 θi σi2 − σ02 i 0 i 0 i i 0 i 0 = exp x + x − k + ln + 1 , 1 2 θ σ0 σ02 σi2 2σ02 σi2 2σ02 σi2 i=1 0 n σ2 µ −σ2 µ o µ2 σ2 −µ2 σ2 σ 2 −σ 2 0 j σ θj exp 0 σ2 σ2j x1 + 2σj 2 σ20 x2 − k j 2σ0 2 σ20 j + ln σ0j 0 j 0 j n 2 0 j2 Tjk (x1 , x2 ; θ) = P , j = 1, . . . , N. µ2i σ02 −µ20 σi2 o σi2 −σ02 σ0 µi −σi µ0 N σi θ exp x + x − k + ln + θ i 1 2 0 2 2 2 2 2 2 i=1 σ0 σ σ 2σ σ 2σ σ
T0k (x1 , x2 ; θ)
0 i
0 i
0 i
P P 2 . We used the notation x = (x1 , x2 ) in the above expressions, where x1 = km=1 ym , x2 = km=1 ym The two-dimensional form of the finite-horizon optimality equation can be obtained by specializing (4) using the T k (x) defined as above, t(y) = (y, y 2 )T , as well as the corresponding normal distribution function Fi (y). Example 3. Consider five hypotheses about normal distribution fi , i = 0, . . . , 4, differing in both mean and variance. For the distribution fi , the mean is given by µi = 30 + 4(i − 1)3/2 and the variance follows the expression σi2 = 74 − i. The prior is a uniform distribution given by θi = 1/5. The costs are ci = 0.02 + 0.01i and aij = |i − j|/6 + [max(i − j, 0)]2 /12. The sufficient statistic space has a dimension r(H) = 2. Examples of the acceptance regions are shown in Figure 4 for k = 1
and k = 4. The horizontal axis is the cumulative sum of observations x_1 = Σ_{m=1}^k y_m, and the vertical axis is the cumulative sum of squared observations x_2 = Σ_{m=1}^k y_m².

[Figure 4. The acceptance regions for Example 3: the regions for accepting H_0 through H_4 and the "Wait" region, shown in the (x_1, x_2) plane at k = 1 and k = 4.]
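As a sanity check on the two-dimensional reconstruction, the sketch below (illustrative means, variances, and prior, not the values of Example 3; `posterior_from_stat` and `posterior_direct` are hypothetical helper names) computes the belief vector from (x_1, x_2, k) using the D_i^k formula above and compares it with the posterior computed directly from the raw sample.

```python
import numpy as np

# Reconstruct the belief vector for hypotheses N(mu_i, s2_i) from the
# 2-D sufficient statistic x = (sum y_m, sum y_m^2) and the time index k.
def posterior_from_stat(x1, x2, k, mu, s2, theta):
    mu, s2, theta = map(np.asarray, (mu, s2, theta))
    D = ((s2[0] * mu - s2 * mu[0]) / (s2[0] * s2) * x1
         + (s2 - s2[0]) / (2 * s2[0] * s2) * x2
         - k * ((mu**2 * s2[0] - mu[0]**2 * s2) / (2 * s2[0] * s2)
                + 0.5 * np.log(s2 / s2[0])))
    w = theta * np.exp(D)            # D[0] = 0, so w[0] = theta[0]
    return w / w.sum()

# Direct Bayes posterior from the raw sample, for comparison.
def posterior_direct(y, mu, s2, theta):
    mu, s2, theta = map(np.asarray, (mu, s2, theta))
    loglik = np.array([np.sum(-0.5 * (y - m)**2 / v - 0.5 * np.log(v))
                       for m, v in zip(mu, s2)])
    w = theta * np.exp(loglik - loglik.max())
    return w / w.sum()

rng = np.random.default_rng(0)
mu, s2, theta = [0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [1/3, 1/3, 1/3]
y = rng.normal(mu[1], np.sqrt(s2[1]), size=6)
p_stat = posterior_from_stat(y.sum(), (y**2).sum(), len(y), mu, s2, theta)
p_dir = posterior_direct(y, mu, s2, theta)
print(np.allclose(p_stat, p_dir))    # True
```

The agreement illustrates the point of the reformulation: the full belief vector, of any length N + 1, is recoverable from a statistic of fixed dimension two (plus k).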
5. Comparison with MSPRT
From a practical point of view, it is important to know the magnitude of improvement that the optimal policy can provide over existing suboptimal policies. Recent developments are mainly based on asymptotically optimal policies; a good benchmark is the M-ary sequential probability ratio test (MSPRT) of Baum and Veeravalli (1994). This procedure is known to be asymptotically optimal as the observation costs (or identification errors) approach zero (Dragalin et al., 1999, 2000). Consider the case with three hypotheses about the mean of a normal distribution, namely, H_i : f_i ∼ N(µ_i, σ²), i = 0, 1, 2. Suppose that the observation costs are identical, i.e., c_i = c, and the termination costs are zero-one, namely, a_ij = 1 if i ≠ j and a_ij = 0 if i = j. In this context, the MSPRT defines a series of Markov times τ_i = inf{k : π_i^k ≥ A_i}, where π_i^k is the posterior probability of hypothesis i and A_i is the corresponding constant threshold. The MSPRT stopping time is the minimum Markov time τ = min_i τ_i, and the acceptance decision rule is d = i if τ = τ_i. We perform simulation studies comparing the performance of the MSPRT with that of the optimal policy for different combinations of observation cost c and means µ_i. In this experiment, we enumerate all combinations of MSPRT thresholds and use the minimum cost as the benchmark. Each simulation is run long enough that the width of the 95% confidence interval for the estimated average cost is less than 0.001. The estimated average costs and the MSPRT's percentage loss from optimal are shown in Table 3. We observe in Table 3 that the sub-optimality of the MSPRT grows as the observation cost c increases, or as the differences in the means µ_i shrink. Such observations are consistent with asymptotic optimality.
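A minimal sketch of the MSPRT stopping rule described above, assuming unit-variance normal observations, a uniform prior, and hand-picked (not optimized) thresholds A_i:

```python
import numpy as np

# MSPRT sketch: stop at the first k for which some posterior pi_i^k
# crosses its threshold A_i, and accept that hypothesis.
def msprt(mu, thresholds, true_idx, rng, max_steps=10_000):
    mu = np.asarray(mu, float)
    post = np.full(len(mu), 1.0 / len(mu))        # uniform prior
    for k in range(1, max_steps + 1):
        y = rng.normal(mu[true_idx], 1.0)          # observation under truth
        post = post * np.exp(-0.5 * (y - mu)**2)   # Bayes update, sigma = 1
        post /= post.sum()
        hit = np.flatnonzero(post >= thresholds)
        if hit.size:                               # tau = min_i tau_i
            return int(hit[0]), k                  # accepted hypothesis, delay
    return int(post.argmax()), max_steps

rng = np.random.default_rng(1)
d, k = msprt([0.0, 0.8, 1.6], [0.95, 0.95, 0.95], true_idx=2, rng=rng)
print(d, k)
```

Averaging the realized cost c·k + a_{i,d} over many such runs, for a grid of thresholds, is how the MSPRT benchmark in Table 3 can be estimated.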
They suggest that the optimal policy is most desirable when the hypotheses are difficult to differentiate from one another, or when we cannot afford many observations for fear of increasing the response delay. Nevertheless, the MSPRT gives a good approximation when the hypotheses are relatively easy to differentiate and the observation cost is low; in these situations, one may argue that the MSPRT serves as a satisfactory alternative. Incidentally, for the cases in Table 3, the computation time of the optimal policy ranges from 17.9 to 24.6 seconds in the MATLAB environment on a desktop computer with two 3.4 GHz Intel Core i7 processors. This is far from prohibitive for applications of hypothesis testing.
Table 3. Comparison of the total cost between the optimal policy and MSPRT.

(µ_1 − µ_0)/σ = 0.2, (µ_2 − µ_0)/σ = 0.4
c           0.5     0.4     0.3     0.2     0.1     0.05    0.01
Optimal     1.328   1.232   1.140   1.027   0.936   0.904   0.898
MSPRT       6.749   5.534   4.329   3.240   1.999   1.430   1.198
Error (%)   408.2   349.2   279.7   215.5   113.5   58.22   33.48

(µ_1 − µ_0)/σ = 0.4, (µ_2 − µ_0)/σ = 0.6
c           0.5     0.4     0.3     0.2     0.1     0.05    0.01
Optimal     1.344   1.230   1.147   1.045   0.945   0.898   0.814
MSPRT       4.181   3.315   2.821   2.140   1.452   1.148   0.961
Error (%)   211.0   169.5   145.9   104.8   53.70   27.81   18.04

(µ_1 − µ_0)/σ = 0.8, (µ_2 − µ_0)/σ = 1.4
c           0.5     0.4     0.3     0.2     0.1     0.05    0.01
Optimal     1.332   1.199   1.101   1.023   0.957   0.911   0.883
MSPRT       1.803   1.582   1.439   1.185   1.029   0.968   0.906
Error (%)   35.39   31.99   30.84   15.89   7.497   6.300   2.569

(µ_1 − µ_0)/σ = 1.0, (µ_2 − µ_0)/σ = 2.0
c           0.5     0.4     0.3     0.2     0.1     0.05    0.01
Optimal     1.280   1.161   1.120   0.980   0.976   0.889   0.791
MSPRT       1.287   1.167   1.124   0.981   0.977   0.889   0.792
Error (%)   0.546   0.516   0.330   0.101   0.092   0.067   0.037
6. The Sampling Control Problem

We now extend the main results to the sequential multi-hypothesis testing problem with sampling control, in which one can adaptively choose among multiple alternative sampling modes with different diagnostic powers and costs. This subject was initiated by Chernoff (1959) and remains a vibrant area of research (Nitinawarat et al., 2013; Naghshvar and Javidi, 2013). Consider a set of hypotheses H_i, i ∈ N, among which only one is true. The decision maker seeks the true hypothesis by conducting sequential sampling and observation. At decision epoch k, the decision maker can either accept one of the hypotheses and terminate the decision process, or choose a sampling mode a_k from the sampling action set A = {1, . . . , K}. When the true hypothesis is H_i, the action a ∈ A generates an observation Y_k ∈ Y ⊂ R^D with probability density (or mass) function f_i^a. We assume that the functions f_i^a, a ∈ A, i ∈ N, are known and that the observations Y_k are independent conditional on the action and the hypothesis. A sequential policy δ = (τ, A_τ, d) contains the stopping time τ, the sequential sampling actions A_τ = {a_1, . . . , a_{τ−1}}, and the acceptance decision rule d : A_τ × {Y_1, . . . , Y_{τ−1}} → N. Let θ = (θ_0, θ_1, . . . , θ_N) ∈ S^N be the prior belief vector. Suppose that f_j^a belongs to an exponential family with natural parameter α_j^a and natural sufficient statistic t, namely, f_j^a(y) = f(y; α_j^a) = h(y) exp{ η^T(α_j^a) t(y) − B(α_j^a) }. Let Y^k = (Y_1, . . . , Y_k) be the observation sequence generated by the sampling-mode sequence A^k = (a_1, . . . , a_k) ∈ A^k. Let Ω_a ≜ {m ∈ {1, . . . , k} : a_m = a} be the set of decision periods (up to time k < τ) at which the sampling mode a ∈ A is used. Further, let k_a ∈ N be the cardinality of Ω_a, representing the total number of times the sampling mode a ∈ A is used before stopping. Clearly, ∪_{a∈A} Ω_a = {1, . . . , k} and Σ_{a∈A} k_a = k for k < τ.
Table 4. Multivariate distributions in the exponential family (D ≥ 2)

Distribution          α                η^T(α)                   t^T(y)                     B(α)                                          M
Categorical           (p_1, …, p_D)    (ln p_1, …, ln p_D)      (I{y=1}, …, I{y=D})        0                                             D
Dirichlet             (α_1, …, α_D)    (α_1 − 1, …, α_D − 1)    (ln y_1, …, ln y_D)        Σ_{i=1}^D ln Γ(α_i) − ln Γ(Σ_{i=1}^D α_i)     D
Multivariate normal   (µ, Σ)           (Σ^{−1}µ, −½Σ^{−1})      (y, yy^T)                  ½ µ^T Σ^{−1} µ + ½ ln |Σ|                     2D
Multinomial           (p_1, …, p_D)    (ln p_1, …, ln p_D)      (y_1, …, y_D)              0                                             D
Define an N × (MK + K) natural difference matrix as follows:

H_s = [ η^T(α_1^1) − η^T(α_0^1),  · · · ,  η^T(α_1^K) − η^T(α_0^K),  B(α_0^1) − B(α_1^1),  · · · ,  B(α_0^K) − B(α_1^K) ]
      [           ⋮                             ⋮                             ⋮                             ⋮           ]
      [ η^T(α_N^1) − η^T(α_0^1),  · · · ,  η^T(α_N^K) − η^T(α_0^K),  B(α_0^1) − B(α_N^1),  · · · ,  B(α_0^K) − B(α_N^K) ].

Let r_s = rank(H_s) denote its rank, so r_s ≤ min{N, MK + K}. Consider a rank factorization H_s = L_s U_s, where L_s is an N-by-r_s matrix of full column rank and U_s is an r_s-by-(MK + K) matrix of full row rank.

Definition 4. Define the minimal cumulative natural sufficient statistic as

x_s^k = U_s ( Σ_{m∈Ω_1} t^T(Y_m),  . . . ,  Σ_{m∈Ω_K} t^T(Y_m),  k_1,  . . . ,  k_K )^T.
Let Π^k(Y^k, A^k; θ) be the belief vector conditional on the observation sequence Y^k, the sampling-mode sequence A^k, the prior belief θ, and the time index k. For brevity, we use π_i^k to denote the (i + 1)th component of this belief vector. The following result shows that this belief vector can be reconstructed from the minimal cumulative natural sufficient statistic x_s^k defined above.

Proposition 2. There is a mapping T^k : R^{r_s} → S^N such that Π^k(Y^k, A^k; θ) = T^k(x_s^k; θ). More specifically, T^k = (T_0^k, . . . , T_N^k), with

π_0^k = T_0^k(x_s^k; θ) = [ Σ_{i=1}^N (θ_i/θ_0) exp{D_i^k(x_s^k)} + 1 ]^{−1},

π_j^k = T_j^k(x_s^k; θ) = θ_j exp{D_j^k(x_s^k)} / ( Σ_{i=1}^N θ_i exp{D_i^k(x_s^k)} + θ_0 ),   j = 1, . . . , N,

where D_i^k(x_s^k) = e_i L_s x_s^k.
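A numerical sketch of this reconstruction, using two hypothetical Bernoulli sampling modes (for a Bernoulli(p) observation, η(p) = ln(p/(1 − p)), t(y) = y, B(p) = −ln(1 − p)); the parameter values and the cumulative statistic below are made up for illustration. Any rank factorization H_s = L_s U_s works, and here one is obtained from the SVD. The check confirms that e_i H_s applied to the full cumulative statistic is recoverable from the compressed statistic x_s^k.

```python
import numpy as np

# Bernoulli exponential-family pieces.
def eta(p): return np.log(p / (1 - p))
def B(p): return -np.log(1 - p)

p = np.array([[0.2, 0.5, 0.8],     # mode 1: success prob. under H0, H1, H2
              [0.3, 0.4, 0.7]])    # mode 2
# Row i = 1..N: (eta_i^1 - eta_0^1, eta_i^2 - eta_0^2,
#                B_0^1 - B_i^1,     B_0^2 - B_i^2)
Hs = np.column_stack([eta(p[0, 1:]) - eta(p[0, 0]),
                      eta(p[1, 1:]) - eta(p[1, 0]),
                      B(p[0, 0]) - B(p[0, 1:]),
                      B(p[1, 0]) - B(p[1, 1:])])
U, S, Vt = np.linalg.svd(Hs)
r = int((S > 1e-12).sum())                  # r_s <= min{N, MK + K}
Ls, Us = U[:, :r] * S[:r], Vt[:r]           # rank factorization Hs = Ls @ Us

# Full statistic (sum of t(Y) per mode, counts per mode) and its compression.
full = np.array([3.0, 1.0, 5.0, 4.0])       # (sum mode 1, sum mode 2, k1, k2)
xs = Us @ full                               # minimal statistic, length r
print(np.allclose(Hs @ full, Ls @ xs))       # True: D_i recoverable from xs
```

With the D_i in hand, the belief vector follows from the formulas of Proposition 2 exactly as in the single-mode case.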
Remark 4. When the number of sampling-control actions is small compared with the number of hypotheses, it may be beneficial to use the sufficient-statistic approach. Further, if there are many control actions but each action causes a systematic change in the observation distributions (see Example 4), then the rank of the natural difference matrix may still be low and the sufficient-statistic approach is still preferred. When there is only one sampling mode, i.e., K = 1, the sufficient statistic becomes x_s^k = U_s( Σ_{m∈Ω_1} t^T(Y_m), k_1 )^T = ( (x^k)^T, k )^T, consisting of the r-dimensional sufficient statistic x^k introduced in Definition 3 and the time index k (which accounts for the non-stationary acceptance regions). Thus, the classical problem discussed in Section 4 is a special case of the sampling control problem with K = 1.
Example 4: Adaptive sample size. For a random variable X, suppose that the population distribution of X is normal, N(µ, σ²), with known variance σ² but unknown mean µ. To test multiple simple hypotheses about the mean, namely H_i : N(µ_i, σ²), i = 0, . . . , N, the decision maker can take multiple samples at once, or one by one, before accepting a hypothesis. Generally speaking, taking multiple samples at once is not equivalent to taking the same number of samples sequentially, because the latter allows one to stop at any time, before all samples are observed. At decision period m, she can choose the sample size a_m ∈ {1, 2, . . . , K} and observe the sample average X̄ ≜ (1/a_m) Σ_{ℓ=1}^{a_m} X_ℓ. Clearly X̄ ∼ N(µ, σ²/a_m); if hypothesis H_i is true, then X̄ ∼ N(µ_i, σ²/a_m), which implies f_i^{a_m} = N(µ_i, σ²/a_m), i = 0, . . . , N. Note that an action causes a global change in the variances under all hypotheses. For convenience, let µ_0 = 0. The natural difference matrix becomes

H_s = [ µ_1/σ, 0, √2 µ_1/σ, 0, · · · , √K µ_1/σ, 0,   −µ_1²/(2σ²), −2µ_1²/(2σ²), · · · , −Kµ_1²/(2σ²) ]
      [   ⋮                                                 ⋮                                          ]
      [ µ_N/σ, 0, √2 µ_N/σ, 0, · · · , √K µ_N/σ, 0,   −µ_N²/(2σ²), −2µ_N²/(2σ²), · · · , −Kµ_N²/(2σ²) ],

whose rank is two (unless the µ_i's are identical). A minimal natural sufficient statistic is x_s^k = ( Σ_{a=1}^K √a Σ_{m∈Ω_a} Y_m,  Σ_{a=1}^K a k_a )^T, in which Σ_{a=1}^K a k_a is the total number of samples taken up to period k.
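Under the convention µ_0 = 0 (with illustrative values for the remaining means, not taken from the paper), the rank-two claim can be checked numerically: whatever the per-mode scaling, the η-difference columns are all proportional to the vector (µ_1, . . . , µ_N) and the B-difference columns to (µ_1², . . . , µ_N²).

```python
import numpy as np

# Columns proportional to mu_i (eta differences) and mu_i^2 (B differences),
# one pair per sample-size action a = 1..K; rank is 2 when the mu_i differ.
mu = np.array([1.0, 2.0, 3.5, 5.0])   # mu_1..mu_N, with mu_0 = 0 (illustrative)
sigma2, K = 2.0, 4
cols = [a * mu / sigma2 for a in range(1, K + 1)]             # eta differences
cols += [-a * mu**2 / (2 * sigma2) for a in range(1, K + 1)]  # B differences
Hs = np.column_stack(cols)            # N x 2K matrix
print(np.linalg.matrix_rank(Hs))      # 2
```

So even with many sample-size actions, the state space of the reformulated dynamic program remains two-dimensional (plus the time bookkeeping).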
7. Concluding Remarks

The generalization of Wald's SPRT to multiple hypotheses has been widely discussed. The structure of the optimal policy is well understood, but the optimal policy itself is difficult to implement in general. We find that, for exponential families, it is possible to devise an efficient solution method that is scalable to a large number of simple hypotheses without assuming conjugate priors in most practical cases. The method reconstructs the belief vector using the natural sufficient statistic of the exponential family and reformulates the original dynamic program in a low-dimensional space, whose dimensionality is determined by the rank of the natural difference matrix. The resulting control policy is distinct from the standard belief-vector approach in that the acceptance regions are non-stationary and prior-dependent. The optimal solution is particularly desirable when the alternative hypotheses are difficult to differentiate and a quick decision has to be made. The natural sufficient statistic has been used as the state variable in some online learning problems (often under conjugate priors), and the fundamental theory of sufficient statistics is well established in sequential models (Zacks, 1971, Chapter 2.8). But the sufficient statistic by itself does not give a direct solution to the multi-hypothesis testing problem. To solve this open problem, one needs to use a low-dimensional sufficient statistic to reconstruct the high-dimensional belief vector and reformulate the optimality equation. This across-dimension reconstruction technique is the key to the solution. One must be aware that the proposed method becomes less efficient for multivariate distributions in general. Some commonly used multivariate exponential-family distributions are listed in Table 4, from which we observe that M ≤ 2D, where D is the dimension of the distribution. Thus, the dimension of the sufficient statistic, r ≤ min{N, M} ≤ min{N, 2D}, may be large in certain applications, especially when the matrix H is full-rank. In these situations, suboptimal methods should be used.
Appendix

Proof of Proposition 2.

Proof. By Definition 4 and the rank factorization H_s = L_s U_s,

D_i^k(x_s^k) = e_i L_s x_s^k
             = e_i L_s U_s ( Σ_{m∈Ω_1} t^T(Y_m), . . . , Σ_{m∈Ω_K} t^T(Y_m), k_1, . . . , k_K )^T
             = e_i H_s ( Σ_{m∈Ω_1} t^T(Y_m), . . . , Σ_{m∈Ω_K} t^T(Y_m), k_1, . . . , k_K )^T
             = Σ_{a=1}^K { [η^T(α_i^a) − η^T(α_0^a)] Σ_{m∈Ω_a} t(Y_m) − [B(α_i^a) − B(α_0^a)] k_a }
             = Σ_{m=1}^k [η^T(α_i^{a_m}) − η^T(α_0^{a_m})] t(Y_m) − Σ_{m=1}^k [B(α_i^{a_m}) − B(α_0^{a_m})].

Therefore,

exp{D_i^k(x_s^k)} = Π_{m=1}^k h(Y_m) exp{η^T(α_i^{a_m}) t(Y_m) − B(α_i^{a_m})} / Π_{m=1}^k h(Y_m) exp{η^T(α_0^{a_m}) t(Y_m) − B(α_0^{a_m})}
                  = Π_{m=1}^k f(Y_m; α_i^{a_m}) / Π_{m=1}^k f(Y_m; α_0^{a_m}),

so that, by Bayes' rule, (θ_i/θ_0) exp{D_i^k(x_s^k)} = π_i^k/π_0^k. Substituting into the definition of T_0^k,

T_0^k(x_s^k; θ) = [ Σ_{i=1}^N (θ_i/θ_0) exp{D_i^k(x_s^k)} + 1 ]^{−1} = [ Σ_{i=1}^N π_i^k/π_0^k + 1 ]^{−1} = π_0^k.

Similarly, for j = 1, . . . , N,

T_j^k(x_s^k; θ) = θ_j exp{D_j^k(x_s^k)} / ( Σ_{i=1}^N θ_i exp{D_i^k(x_s^k)} + θ_0 )
                = (π_j^k/π_0^k) / ( Σ_{i=1}^N π_i^k/π_0^k + 1 )
                = π_j^k.  ∎
References

P. Armitage. Sequential analysis with more than two alternative hypotheses, and its relation to discriminant function analysis. J. R. Stat. Soc. Ser. B, 12(1):137–144, 1950.
C. W. Baum and V. V. Veeravalli. A sequential procedure for multi-hypothesis testing. IEEE Trans. Inform. Theory, 40(6):1994–2007, 1994.
L. Billard and M. K. Vagholkar. A sequential procedure for testing a null hypothesis against a two-sided alternative hypothesis. J. R. Stat. Soc. Ser. B, 31(2):285–294, 1969.
D. Blackwell and M. A. Girshick. Theory of Games and Statistical Decisions. Reprint of the John Wiley 1954 edition. Dover, New York, 1979.
H. Chernoff. Sequential design of experiments. Ann. Math. Statist., 30(3):755–770, 1959.
S. Dayanik, C. Goulding, and H. V. Poor. Bayesian sequential change diagnosis. Math. Oper. Res., 33:475–496, 2008.
V. P. Dragalin, A. G. Tartakovsky, and V. V. Veeravalli. Multihypothesis sequential probability ratio tests, Part I: Asymptotic optimality. IEEE Trans. Inform. Theory, 45(11):2448–2461, 1999.
V. P. Dragalin, A. G. Tartakovsky, and V. V. Veeravalli. Multihypothesis sequential probability ratio tests, Part II: Accurate asymptotic expansions for the expected sample size. IEEE Trans. Inform. Theory, 46(4):1366–1383, 2000.
B. Eisenberg. Multihypothesis problems. In B. K. Ghosh and P. K. Sen, editors, Handbook of Sequential Analysis, pages 229–243. Marcel Dekker, New York, 1991.
K. S. Fu. Sequential Methods in Pattern Recognition and Learning. Academic Press, New York, 1968.
B. K. Ghosh and P. K. Sen. Handbook of Sequential Analysis. Marcel Dekker, New York, 1991.
J. Jung, V. Paxson, A. W. Berger, and H. Balakrishnan. Fast portscan detection using sequential hypothesis testing. In Proceedings of the IEEE Symposium on Security and Privacy, pages 211–225. IEEE, May 2004.
K. Kabasawa and S. Kaihara. A sequential diagnostic model for medical questioning. Med. Inform., 6(3):175–185, 1981.
T. L. Lai. Sequential multiple hypothesis testing and efficient fault detection-isolation in stochastic systems. IEEE Trans. Inform. Theory, 46(2):595–608, 2000.
G. Lorden. Nearly-optimal sequential tests for finitely many parameter values. Ann. Statist., 5(1):1–21, 1977.
M. Naghshvar and T. Javidi. Active sequential hypothesis testing. Ann. Statist., 41(6):2703–2738, 2013.
S. Nitinawarat, G. Atia, and V. V. Veeravalli. Controlled sensing for multihypothesis testing. IEEE Trans. Autom. Control, 58(10):2451–2464, 2013.
C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Math. Oper. Res., 12:441–450, 1987.
E. Paulson. A sequential decision procedure for choosing one of k hypotheses concerning the unknown mean of a normal distribution. Ann. Math. Statist., 34(2):549–554, 1962.
G. Simons. A sequential three hypothesis test for determining the mean of a normal population with known variance. Ann. Math. Statist., 38(5):1365–1375, 1967.
M. Sobel and A. Wald. A sequential decision procedure for choosing one of three hypotheses concerning the unknown mean of a normal distribution. Ann. Math. Statist., 20(4):502–522, 1949.
A. Tartakovsky. Sequential testing of many simple hypotheses with independent observations. Probl. Peredachi Inf., 24(4):53–66, 1988.
A. Tartakovsky, I. Nikiforov, and M. Basseville. Sequential Analysis: Hypothesis Testing and Changepoint Detection, volume 136 of Monographs on Statistics & Applied Probability. Chapman and Hall/CRC, 2014.
V. V. Veeravalli and C. W. Baum. Asymptotic efficiency of a sequential multihypothesis test. IEEE Trans. Inform. Theory, 41(6):1994–1997, 1995.
A. Wald. Sequential tests of statistical hypotheses. Ann. Math. Statist., 16(2):117–186, 1945.
A. Wald and J. Wolfowitz. Bayes solutions of sequential decision problems. Ann. Math. Statist., 21(1):82–99, 1950.
Y. Wang and Y. Mei. Asymptotic optimality theory for decentralized sequential multihypothesis testing problems. IEEE Trans. Inform. Theory, 57(10):7068–7083, 2011.
S. Zacks. The Theory of Statistical Inference. Wiley, New York, 1971.