EXPECTATION PROPAGATION FOR ESTIMATING THE PARAMETERS OF THE BETA DISTRIBUTION

Zhanyu Ma and Arne Leijon
Sound and Image Processing Lab, KTH - Royal Institute of Technology, Stockholm, Sweden
{zhanyu, leijon}@kth.se

ABSTRACT

Parameter estimation for the beta distribution is analytically intractable due to the integral expression in the normalization constant. For maximum likelihood estimation, numerical methods can be used to calculate the parameters. For Bayesian estimation, we can utilize different approximations to the posterior distribution of the parameters. A previously reported method based on the variational inference (VI) framework gives the posterior mean of the parameters analytically, but its factorized approximating distribution neglects the correlation between the parameters. We now propose a method based on the expectation propagation (EP) framework that approximates the posterior distribution analytically and captures the correlation between the parameters. Compared to the VI based method, the EP based algorithm performs better with small amounts of data and is more stable.

Index Terms— Beta Distribution, Expectation Propagation, Variational Inference, Importance Sampling, Laplace Approximation

1. INTRODUCTION

The beta distribution is a continuous distribution defined on the interval [0, 1]. For data with a compact support range, the beta distribution describes the data better than the conventional Gaussian distribution [1, 2, 3]. Estimating the parameters is the central task in applying the beta distribution in machine learning (e.g., image classification, image segmentation). Since the normalization constant (the beta function) of the beta distribution is defined by an integral expression, exact parameter estimation is analytically intractable.

The maximum likelihood (ML) estimation of the parameters of the beta distribution has been well studied [4, 5]. For Bayesian estimation, Bouguila et al. [1] proposed a practical algorithm based on Gibbs sampling to simulate the posterior distribution of the parameters. An analytical method to approximate the posterior distribution of the parameters was proposed in [3] using the framework of variational inference (VI) [6, 7]. Since the approximation in [3] neglected the correlation between the parameters, this analytical method only reported the posterior mean of the parameters and therefore worked better with a larger amount of data.

Expectation propagation (EP) was introduced by Minka in [8, 9]. The EP framework is an improved version of assumed density filtering (ADF), a one-pass technique for approximating the posterior distribution in Bayesian analysis [10, 11]. As an online learning method, the performance of ADF depends crucially on the order of the input sequence. To overcome this drawback, EP extends ADF by passing messages through all the factors and refining each factor based on all the other factors. The approximation in each step is iteratively refined by later observations, and this updating procedure removes the effect of the input sequence order. Within this robust framework, we propose an algorithm based on EP for approximating the posterior distribution of the beta distribution parameters. Using the absolute evidence difference ratio (AEDR) as the criterion, the performance of EP and VI for posterior distribution approximation is compared for different amounts of training data.

2. BETA DISTRIBUTION AND PARAMETER ESTIMATION

The beta distribution is a family of continuous probability distributions defined on the interval [0, 1] with two positive real parameters. The probability density function (PDF) of the beta distribution is

Beta(x; u, v) = \frac{1}{\mathrm{beta}(u, v)} x^{u-1} (1 - x)^{v-1}    (1)

where beta(u, v) is the beta function

\mathrm{beta}(u, v) = \frac{\Gamma(u)\Gamma(v)}{\Gamma(u + v)}    (2)

and Γ(·) is the gamma function defined as

\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t} \, dt .    (3)
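For reference, the density in (1)-(2) can be evaluated directly; the following is a minimal sketch (assuming NumPy/SciPy, with arbitrary test values and function names of our own) that checks the expression against the library implementation:

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import beta as beta_dist

def beta_pdf(x, u, v):
    """Beta(x; u, v) evaluated via (1)-(2), working in the log domain for stability."""
    log_norm = gammaln(u) + gammaln(v) - gammaln(u + v)   # ln beta(u, v), cf. (2)
    return np.exp((u - 1) * np.log(x) + (v - 1) * np.log(1 - x) - log_norm)

x = np.linspace(0.01, 0.99, 5)
assert np.allclose(beta_pdf(x, 3.0, 8.0), beta_dist(3.0, 8.0).pdf(x))
```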

For Bayesian analysis, the conjugate prior distribution for the beta distribution [3] is

p(u, v) = \frac{1}{C} \left[ \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right]^{\nu_0} e^{-\alpha_0 (u - 1)} e^{-\beta_0 (v - 1)}    (4)

where α0, β0, ν0 are free positive parameters and C is a normalization factor (a function of α0, β0, ν0) such that \int_0^{\infty} \int_0^{\infty} p(u, v) \, du \, dv = 1. Then, with N i.i.d. observations X = {x1, . . . , xN}, we obtain the posterior distribution of u, v as

p(u, v | X) = \frac{p(X | u, v)\, p(u, v)}{\int_0^{\infty} \int_0^{\infty} p(X | u, v)\, p(u, v) \, du \, dv}
            = \frac{\left[ \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right]^{\nu_N} e^{-\alpha_N (u - 1)} e^{-\beta_N (v - 1)}}{\int_0^{\infty} \int_0^{\infty} \left[ \frac{\Gamma(u + v)}{\Gamma(u)\Gamma(v)} \right]^{\nu_N} e^{-\alpha_N (u - 1)} e^{-\beta_N (v - 1)} \, du \, dv}    (5)

where¹

\alpha_N = \alpha_0 - \sum_{n=1}^{N} \ln x_n , \quad \beta_N = \beta_0 - \sum_{n=1}^{N} \ln (1 - x_n) , \quad \nu_N = \nu_0 + N .

¹ To prevent infinite quantities in the practical implementation, we assign ε1 to xn when xn = 0 and 1 − ε2 to xn when xn = 1. Both ε1 and ε2 are small positive real numbers.

The posterior distribution is not analytically tractable in practical problems because of the integral expression. Stochastic techniques (e.g., Gibbs sampling [1]) could be utilized to calculate it numerically. In this paper, we propose an algorithm based on EP to approximate the posterior distribution and compare it with the method based on the VI framework.
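As a concrete illustration of the update above, here is a minimal sketch (assuming NumPy/SciPy; the clipping constant plays the role of ε1 and ε2 in the footnote, and the function names are ours) that computes α_N, β_N, ν_N and the unnormalized log posterior in (5):

```python
import numpy as np
from scipy.special import gammaln

def posterior_hyperparameters(x, alpha0, beta0, nu0, eps=1e-10):
    """Update the conjugate-prior hyperparameters with N i.i.d. observations."""
    x = np.clip(x, eps, 1.0 - eps)          # avoid ln(0) at the boundaries, cf. footnote
    alpha_N = alpha0 - np.sum(np.log(x))
    beta_N  = beta0  - np.sum(np.log1p(-x))
    nu_N    = nu0 + len(x)
    return alpha_N, beta_N, nu_N

def log_unnormalized_posterior(u, v, alpha_N, beta_N, nu_N):
    """ln p(u, v | X) up to the intractable normalizing integral, cf. (5)."""
    return (nu_N * (gammaln(u + v) - gammaln(u) - gammaln(v))
            - alpha_N * (u - 1.0) - beta_N * (v - 1.0))
```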

2.1. Posterior Approximation with Variational Inference

In [3], we proposed an algorithm using the VI framework [6, 7] to approximate the posterior distribution of the parameters of the beta distribution. The VI approach minimizes the Kullback-Leibler (KL) divergence KL(q ‖ p) between the approximating distribution q(Θ) and the true posterior distribution p(Θ|X). The optimized approximate distribution is

q^{*}(\Theta) = \arg\min_{q(\Theta)} \mathrm{KL}(q \,\|\, p) = \arg\min_{q(\Theta)} \int q(\Theta) \ln \frac{q(\Theta)}{p(\Theta | X)} \, d\Theta    (6)

where Θ = [u, v]^T. Furthermore, we factorized q(Θ) into a product of two gamma distributions as q(u, v) = Gam(u) Gam(v). With some further approximations, the optimal approximation to the posterior distribution can then be calculated analytically and iteratively. However, this factorization neglects the correlation between u and v. The method can therefore give a good estimate of the posterior mean of the parameters, but not a good approximation of the complete posterior distribution. A good posterior approximation is still reached with a larger amount of data, because the true posterior is then sharply peaked at its mean and the correlation has only a small effect on its shape. Given a smaller amount of data, the true posterior spreads over a larger area, and the factorized approximating distribution cannot describe its shape correctly.
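The following is not the VI algorithm of [3]; it is only an illustrative sketch (with arbitrary grid bounds and function names of our own) that evaluates the exact posterior (5) on a grid and reports the correlation coefficient between u and v, i.e., the structure that a factorized q(u)q(v) cannot represent:

```python
import numpy as np
from scipy.special import gammaln

def grid_posterior_correlation(alpha_N, beta_N, nu_N, u_max=15.0, v_max=30.0, n=400):
    """Normalize p(u, v | X) on a finite grid and return corr(u, v) under it."""
    u = np.linspace(1e-3, u_max, n)
    v = np.linspace(1e-3, v_max, n)
    U, V = np.meshgrid(u, v, indexing="ij")
    logp = (nu_N * (gammaln(U + V) - gammaln(U) - gammaln(V))
            - alpha_N * (U - 1.0) - beta_N * (V - 1.0))
    p = np.exp(logp - logp.max())
    p /= p.sum()
    mu_u, mu_v = (p * U).sum(), (p * V).sum()
    cov = (p * (U - mu_u) * (V - mu_v)).sum()
    sd_u = np.sqrt((p * (U - mu_u) ** 2).sum())
    sd_v = np.sqrt((p * (V - mu_v) ** 2).sum())
    return cov / (sd_u * sd_v)
```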

2.2. Posterior Approximation with Expectation Propagation

2.2.1. Expectation Propagation

To handle the case with smaller amounts of data and (approximately) capture the correlation between u and v, we propose an algorithm based on the EP framework. The EP technique was originated by Minka [8, 9, 12]. It is an improved version of ADF such that the result from EP does not depend on the order of the input sequence, and it is suitable for approximating a wider class of distributions. The posterior distribution p(Θ|X) is considered as a product of factors

p(\Theta | X) = \frac{\prod_{i=0}^{N} f_i(\Theta)}{\int \prod_{i=0}^{N} f_i(\Theta) \, d\Theta} = \frac{f_0(\Theta) \prod_{i=1}^{N} f_i(\Theta)}{\int f_0(\Theta) \prod_{i=1}^{N} f_i(\Theta) \, d\Theta}    (7)

where f_0(Θ) is the prior distribution and f_i(Θ) = p(x_i | Θ), i ∈ {1, . . . , N}, is the likelihood of the ith observation. EP approximates the posterior distribution p(Θ|X) by a distribution q(Θ) factorized as

q(\Theta) = \frac{\prod_{i=0}^{N} \tilde{f}_i(\Theta)}{\int \prod_{i=0}^{N} \tilde{f}_i(\Theta) \, d\Theta} .    (8)

By minimizing the KL divergence KL(p ‖ q), the optimal q(Θ) is obtained. Within the EP framework, we remove one factor \tilde{f}_j(Θ) to get

q^{/j}(\Theta) = \frac{q(\Theta)}{\tilde{f}_j(\Theta)} = \prod_{i=0, i \neq j}^{N} \tilde{f}_i(\Theta) .    (9)

A new unnormalized distribution is then obtained by replacing \tilde{f}_j(Θ) with the true likelihood f_j(Θ):

g(\Theta) = f_j(\Theta) \, q^{/j}(\Theta) = f_j(\Theta) \prod_{i=0, i \neq j}^{N} \tilde{f}_i(\Theta) .    (10)

By the technique of moment matching, a new approximation q^{new}(Θ) is obtained by setting the sufficient statistics of q^{new}(Θ) equal to those of the unnormalized g(Θ). With this updated approximation, the new factor \tilde{f}_j^{new}(Θ) is

\tilde{f}_j^{new}(\Theta) \propto \frac{q^{new}(\Theta)}{q^{/j}(\Theta)} .    (11)

With the steps above, we can iteratively update each factor based on all the other factors until convergence. In contrast with VI, EP minimizes KL(p ‖ q) instead of KL(q ‖ p).

2.2.2. Gaussian Assumptions

Theoretically, any distribution from the exponential family could be used as the factor \tilde{f}_i(Θ). For convenience of calculation, we choose each factor to be a two-dimensional Gaussian distribution \tilde{f}_i(\Theta) = \mathcal{N}(\Theta; \tilde{m}_i, \tilde{\Sigma}_i), so that the approximation is itself a Gaussian:

q(\Theta) = \frac{\prod_{i=0}^{N} \mathcal{N}(\Theta; \tilde{m}_i, \tilde{\Sigma}_i)}{\int \prod_{i=0}^{N} \mathcal{N}(\Theta; \tilde{m}_i, \tilde{\Sigma}_i) \, d\Theta} = \mathcal{N}(\Theta; m, \Sigma)    (12)

where

m = \Sigma \sum_{i=0}^{N} \tilde{\Sigma}_i^{-1} \tilde{m}_i , \quad \Sigma^{-1} = \sum_{i=0}^{N} \tilde{\Sigma}_i^{-1} .    (13)

Here we use the fact that \mathcal{N}(x; a, A)\,\mathcal{N}(x; b, B) \propto \mathcal{N}(x; c, C), where c = C (A^{-1} a + B^{-1} b) and C^{-1} = A^{-1} + B^{-1}. Removing the jth component from q(Θ), we have

q^{/j}(\Theta) \propto \mathcal{N}(\Theta; m^{/j}, \Sigma^{/j})    (14)

where

m^{/j} = \Sigma^{/j} \sum_{i=0, i \neq j}^{N} \tilde{\Sigma}_i^{-1} \tilde{m}_i = \Sigma^{/j} \left( \Sigma^{-1} m - \tilde{\Sigma}_j^{-1} \tilde{m}_j \right) , \quad \left( \Sigma^{/j} \right)^{-1} = \sum_{i=0, i \neq j}^{N} \tilde{\Sigma}_i^{-1} = \Sigma^{-1} - \tilde{\Sigma}_j^{-1} .
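A minimal sketch of (13) and (14) in code (assuming NumPy; ms and Sigmas hold the site means and covariances, and the function names are ours):

```python
import numpy as np

def combine_factors(ms, Sigmas):
    """q(Theta) = prod_i N(Theta; m_i, Sigma_i) up to scale: precision-weighted sum, cf. (13)."""
    Lam = sum(np.linalg.inv(S) for S in Sigmas)
    eta = sum(np.linalg.inv(S) @ m for m, S in zip(ms, Sigmas))
    Sigma = np.linalg.inv(Lam)
    return Sigma @ eta, Sigma

def cavity(m, Sigma, m_j, Sigma_j):
    """Remove factor j from q: q^{/j} ~ N(m_cav, Sigma_cav), cf. (14)."""
    Lam_cav = np.linalg.inv(Sigma) - np.linalg.inv(Sigma_j)
    Sigma_cav = np.linalg.inv(Lam_cav)
    m_cav = Sigma_cav @ (np.linalg.inv(Sigma) @ m - np.linalg.inv(Sigma_j) @ m_j)
    return m_cav, Sigma_cav
```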

2.2.3. Moment Matching

The unnormalized distribution in (10) is now written as

g(\Theta) = \mathrm{Beta}(x_j; \Theta) \, \mathcal{N}(\Theta; m^{/j}, \Sigma^{/j}) .    (15)

To obtain q^{new}(Θ), we match the 1st and 2nd moments of q^{new}(Θ) to those of g(Θ). From (15), we see that the unnormalized distribution is a product of a beta distribution and a Gaussian, which makes it difficult to calculate the moments analytically. To evaluate the 1st and 2nd moments approximately, we apply the importance sampling (IS) technique.

Firstly, we approximate g(Θ) in (15) with a Gaussian via the Laplace approximation. The logarithm of g(Θ) (omitting the constant) is

\ln g(\Theta) = -\frac{1}{2} \left( \Theta - m^{/j} \right)^{T} \left( \Sigma^{/j} \right)^{-1} \left( \Theta - m^{/j} \right) - \ln \mathrm{beta}(u, v) + (u - 1) \ln x_j + (v - 1) \ln (1 - x_j) .    (16)

The first and second derivatives with respect to Θ are

\frac{\partial \ln g(\Theta)}{\partial \Theta} = \begin{bmatrix} \partial \ln g(\Theta) / \partial u \\ \partial \ln g(\Theta) / \partial v \end{bmatrix} = - \left( \Sigma^{/j} \right)^{-1} \left( \Theta - m^{/j} \right) + \begin{bmatrix} \psi(u + v) - \psi(u) + \ln x_j \\ \psi(u + v) - \psi(v) + \ln (1 - x_j) \end{bmatrix}

\frac{\partial^2 \ln g(\Theta)}{\partial \Theta^2} = \begin{bmatrix} \partial^2 \ln g(\Theta) / \partial u^2 & \partial^2 \ln g(\Theta) / \partial u \partial v \\ \partial^2 \ln g(\Theta) / \partial v \partial u & \partial^2 \ln g(\Theta) / \partial v^2 \end{bmatrix} = - \left( \Sigma^{/j} \right)^{-1} + \begin{bmatrix} \psi'(u + v) - \psi'(u) & \psi'(u + v) \\ \psi'(u + v) & \psi'(u + v) - \psi'(v) \end{bmatrix}    (17)

where ψ(·) is the digamma function and ψ′(·) the trigamma function. By setting the first derivative equal to 0, we can obtain the mode \breve{\Theta} numerically. The Laplace approximation to g(Θ) is then

\hat{g}(\Theta) = \frac{1}{2\pi |C|^{1/2}} \exp\!\left( -\frac{1}{2} \left( \Theta - \breve{\Theta} \right)^{T} C^{-1} \left( \Theta - \breve{\Theta} \right) \right) , \quad C^{-1} = - \left. \frac{\partial^2 \ln g(\Theta)}{\partial \Theta^2} \right|_{\Theta = \breve{\Theta}} .    (18)

Secondly, we generate L samples from \hat{g}(\Theta), and for each sample Θ_l we calculate r_l = g(\Theta_l) / \hat{g}(\Theta_l) and w_l = r_l / \sum_{l=1}^{L} r_l. Following the principles of IS, the moments of g(Θ) are approximated as

\mathrm{E}[\Theta] \approx \sum_{l=1}^{L} w_l \Theta_l , \quad \mathrm{E}[\Theta \Theta^{T}] \approx \sum_{l=1}^{L} w_l \Theta_l \Theta_l^{T} .    (19)

Finally, by moment matching, the parameters of q^{new}(Θ) are obtained from

m^{new} = \mathrm{E}[\Theta] , \quad \Sigma^{new} + m^{new} (m^{new})^{T} = \mathrm{E}[\Theta \Theta^{T}] .    (20)

The Laplace approximation only uses the mode and Hessian of g(Θ), and IS works with both normalized and unnormalized distributions. Hence, even though g(Θ) is an unnormalized distribution, the results obtained above are the same as those for a normalized g(Θ).
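A possible implementation sketch of the Laplace-plus-IS moment computation in (16)-(19) (assuming NumPy/SciPy; the Newton iteration count, the sample size L, and the positivity safeguards are our own choices, and x_j is assumed to lie strictly inside (0, 1)):

```python
import numpy as np
from scipy.special import betaln, digamma, polygamma
from scipy.stats import multivariate_normal

def laplace_mode(x_j, m_cav, Sigma_cav, n_iter=50):
    """Newton search for the mode of ln g(Theta), using the gradient and Hessian in (17)."""
    P_cav = np.linalg.inv(Sigma_cav)
    theta = np.maximum(m_cav.copy(), 1e-2)            # start near the cavity mean, inside u, v > 0
    for _ in range(n_iter):
        u, v = theta
        grad = -P_cav @ (theta - m_cav) + np.array([
            digamma(u + v) - digamma(u) + np.log(x_j),
            digamma(u + v) - digamma(v) + np.log(1.0 - x_j)])
        hess = -P_cav + np.array([
            [polygamma(1, u + v) - polygamma(1, u), polygamma(1, u + v)],
            [polygamma(1, u + v), polygamma(1, u + v) - polygamma(1, v)]])
        theta = np.maximum(theta - np.linalg.solve(hess, grad), 1e-6)   # stay in the positive quadrant
    return theta, np.linalg.inv(-hess)                # mode and Laplace covariance C, cf. (18)

def moments_by_importance_sampling(x_j, m_cav, Sigma_cav, L=2000, rng=np.random.default_rng(0)):
    """E[Theta] and E[Theta Theta^T] under g(Theta), cf. (19)."""
    mode, C = laplace_mode(x_j, m_cav, Sigma_cav)
    samples = rng.multivariate_normal(mode, C, size=L)
    log_g = np.full(L, -np.inf)
    ok = np.all(samples > 0, axis=1)                  # Beta(x_j; u, v) is undefined outside u, v > 0
    u, v = samples[ok, 0], samples[ok, 1]
    diff = samples[ok] - m_cav
    log_g[ok] = (-0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma_cav), diff)
                 - betaln(u, v) + (u - 1) * np.log(x_j) + (v - 1) * np.log(1 - x_j))
    log_w = log_g - multivariate_normal(mode, C).logpdf(samples)   # ln r_l
    log_w -= log_w.max()                              # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()                                      # normalized importance weights w_l
    E_theta = w @ samples
    E_outer = (samples * w[:, None]).T @ samples
    return E_theta, E_outer
```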

2.2.4. \tilde{f}_j^{new}(\Theta) and Convergence

With q^{new}(Θ), the updated jth factor \tilde{f}_j^{new}(Θ) is obtained as

\tilde{f}_j^{new}(\Theta) = \mathcal{N}(\Theta; \tilde{m}_j^{new}, \tilde{\Sigma}_j^{new})    (21)

where

\tilde{m}_j^{new} = \tilde{\Sigma}_j^{new} \left( (\Sigma^{new})^{-1} m^{new} - \left( \Sigma^{/j} \right)^{-1} m^{/j} \right) , \quad \left( \tilde{\Sigma}_j^{new} \right)^{-1} = (\Sigma^{new})^{-1} - \left( \Sigma^{/j} \right)^{-1} .    (22)
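In code, the factor refresh in (21)-(22) is again a division of Gaussians in natural parameters; a minimal sketch (assuming NumPy), where the caller decides what to do when the resulting covariance is not positive definite (see the convergence discussion below):

```python
import numpy as np

def update_factor(m_new, Sigma_new, m_cav, Sigma_cav):
    """Divide q_new by the cavity to refresh factor j, cf. (21)-(22).
    The result may have a non positive definite covariance; the caller should then skip the update."""
    Lam_j = np.linalg.inv(Sigma_new) - np.linalg.inv(Sigma_cav)
    Sigma_j = np.linalg.inv(Lam_j)
    m_j = Sigma_j @ (np.linalg.inv(Sigma_new) @ m_new - np.linalg.inv(Sigma_cav) @ m_cav)
    return m_j, Sigma_j
```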

Unlike ADF, which is a one-pass method, EP repeats the above steps over all the factors \tilde{f}_j(Θ) until convergence. After convergence, the approximation to the posterior distribution is not affected by the order of the input sequence. From (22), we note that the factor covariance matrix \tilde{\Sigma}_j^{new} might not be positive definite in some iterations, which makes \tilde{f}_j^{new}(Θ) nonexistent. If this happens, the factor is simply kept unchanged within that iteration and the procedure continues with the next factor; the unchanged factor is updated in a later sweep.
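Putting the pieces together, one possible overall EP loop is sketched below. It reuses the moments_by_importance_sampling helper from Section 2.2.3; the initialization, sweep count, and skip rule for non positive definite sites are our own choices and not necessarily those used in the experiments:

```python
import numpy as np

def ep_beta_posterior(x, m0, Sigma0, n_sweeps=10):
    """A possible EP sweep structure for the beta posterior (Sections 2.2.1-2.2.4).
    The prior factor f0 is kept fixed; each observation gets one Gaussian site."""
    N = len(x)
    site_P = [np.zeros((2, 2)) for _ in range(N)]   # site precisions (start flat)
    site_h = [np.zeros(2) for _ in range(N)]        # site precision-mean products
    P0, h0 = np.linalg.inv(Sigma0), np.linalg.inv(Sigma0) @ m0
    for _ in range(n_sweeps):
        for j in range(N):
            P, h = P0 + sum(site_P), h0 + sum(site_h)          # current q, cf. (13)
            P_cav, h_cav = P - site_P[j], h - site_h[j]        # cavity q^{/j}, cf. (14)
            Sigma_cav = np.linalg.inv(P_cav)
            m_cav = Sigma_cav @ h_cav
            E1, E2 = moments_by_importance_sampling(x[j], m_cav, Sigma_cav)  # cf. (19)
            Sigma_new = E2 - np.outer(E1, E1)                  # cf. (20)
            P_j = np.linalg.inv(Sigma_new) - P_cav             # new site precision, cf. (22)
            if np.any(np.linalg.eigvalsh(P_j) <= 0):
                continue                                       # keep the old site for this sweep
            site_P[j] = P_j
            site_h[j] = np.linalg.inv(Sigma_new) @ E1 - h_cav  # new site precision-mean, cf. (22)
    P = P0 + sum(site_P)
    Sigma = np.linalg.inv(P)
    return Sigma @ (h0 + sum(site_h)), Sigma                   # posterior mean and covariance
```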

3. EXPERIMENTAL RESULTS AND DISCUSSION

To verify the EP based algorithm, we apply it to approximate the posterior distribution for synthetic data generated from a beta distribution and compare it with the VI based algorithm.

3.1. Approximation to the Posterior Distribution

Choosing a Gaussian as the approximating posterior distribution has two advantages: 1) the Gaussian is easy to handle analytically; 2) a Gaussian can capture the correlation between u and v quite well. Fig. 1 shows the true posterior distribution, the approximation via EP, and the approximation via VI. The comparison shows that EP yields a more reasonable and better approximation than VI, although VI still gives a good point estimate of the parameters.

[Fig. 1. The true posterior, the posterior via EP, and the posterior via VI. The distributions were obtained with data generated from a beta distribution Beta(x; 3, 8). Upper row: N = 20; lower row: N = 100.]

3.2. Model Comparison

To evaluate the performance of the different approximations, we compare the EP based algorithm to the VI based algorithm. Considering

p_m(X) = \int p(X | \Theta) \, p_m^{\mathrm{Post}}(\Theta) \, d\Theta , \quad m \in \{ \mathrm{EP}, \mathrm{VI} \}    (23)

as the model likelihood given the posterior distribution, we evaluate the fitness of the approximating posterior distribution to the true posterior distribution by the absolute evidence difference ratio (AEDR), calculated as

\mathrm{AEDR}_m = \frac{\left| \bar{p}_m(X) - \bar{p}_{\mathrm{True}}(X) \right|}{\bar{p}_{\mathrm{True}}(X)} , \quad m \in \{ \mathrm{EP}, \mathrm{VI} \}    (24)

where \bar{p}_*(X) = \sqrt[N]{p_*(X)} is the geometric mean of the model likelihood over the N observation samples. Furthermore, 20 rounds of simulation were run, and the geometric mean of the 20 values of p_*(X) was taken as the exact likelihood. The AEDR measures the distance between the likelihood evaluated via a specific approximation and the true model likelihood of the training data; a better approximation yields a smaller AEDR. The true model likelihood is calculated by importance sampling and serves as the reference for the comparison.

To draw a general conclusion, we used different parameter settings for the beta distribution, and for each setting we generated different amounts of data for evaluation. Fig. 2 shows the AEDR comparison between EP and VI. For smaller amounts of data, EP is significantly better. This is because, with a small amount of data, the posterior distribution is not concentrated; EP adapts the approximation to the true distribution, whereas VI only focuses around the posterior mean of the true distribution. As the amount of data increases, the AEDRs of both EP and VI decrease toward 0 and the difference between the two frameworks becomes negligible. EP also gives more stable results (the AEDR never exceeded 0.5%) than VI across different amounts of data.

[Fig. 2. AEDR for different parameter settings ((u, v) = (2, 2), (2, 10), (3, 8), (8, 8)), plotted against log10 N. We generated several sets of data with N = {10, 15, 20, 25, 30, 35, 40, 50, 100, 200, 300, 400, 500} for each beta distribution.]

For real-world data that contain more complex relations, we can apply the beta mixture model to fit the data as in [2, 3] and use the algorithm proposed in this paper to estimate the parameter distributions.
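A minimal sketch of how the model likelihood (23) and the AEDR (24) could be estimated from a Gaussian posterior approximation (assuming NumPy/SciPy; the Monte Carlo sample size is arbitrary, and the reference log likelihood is assumed to come from a separate, more accurate estimator):

```python
import numpy as np
from scipy.special import betaln

def log_model_evidence(x, m, Sigma, S=5000, rng=np.random.default_rng(1)):
    """Monte Carlo estimate of ln p_m(X) = ln of the integral of p(X | Theta) q(Theta) dTheta, cf. (23)."""
    theta = rng.multivariate_normal(m, Sigma, size=S)
    theta = theta[np.all(theta > 0, axis=1)]          # discard draws outside u, v > 0
    u, v = theta[:, 0:1], theta[:, 1:2]
    loglik = np.sum((u - 1) * np.log(x) + (v - 1) * np.log(1 - x) - betaln(u, v), axis=1)
    return np.log(np.mean(np.exp(loglik - loglik.max()))) + loglik.max()   # log-mean-exp

def aedr(x, log_p_m, log_p_true):
    """Absolute evidence difference ratio on geometric-mean likelihoods, cf. (24)."""
    N = len(x)
    pm, pt = np.exp(log_p_m / N), np.exp(log_p_true / N)
    return abs(pm - pt) / pt
```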

4. CONCLUSION

We proposed an algorithm based on the EP framework for approximating the posterior distribution of the beta distribution parameters. By assigning each factor a Gaussian distribution, EP captures the correlation between the parameters and yields a better posterior distribution approximation than VI. With smaller amounts of data, EP performs better than VI according to the AEDR criterion, and its overall performance across different amounts of data is more stable than that of VI.


10

Fig. 2. AEDR with different parameter settings. We generated several sets of data with N = {10, 15, 20, 25, 30, 35, 40, 50, 100, 200, 300, 400, 500} for each beta distribution. 5. REFERENCES [1] N. Bouguila, D. Ziou, and E. Monga, “Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications,” Statistics and Computing, vol. 16, pp. 215–225, 2006. [2] Z. Ma and A. Leijon, “Beta mixture models and the application to image classification,” in Proceedings of International Conference on Image Processing, 2009. [3] Z. Ma and A. Leijon, “Bayesian estimation of beta mixture models with variational inference,” To be submitted., 2009. [4] R. Gnanadesikan, R. S. Pinkham, and L. P. Hughes, “Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics,” Technometrics, vol. 9, pp. 607–620, 1967. [5] F. Cribari-Neto and K. L. P. Vasconcellos, “Nearly unbiased maximum likelihood estimation for the beta distribution,” Journal of Statistical Computation and Simulation, vol. 72, pp. 107–118, 2002. [6] T. S. Jaakkola and M. I. Jordan, “Bayesian parameter estimation via variational methods,” Statistics and Computing, vol. 10, pp. 25–37, 2000. [7] T. S. Jaakkola, “Tutorial on variational approximation methods,” in Advances in Mean Field Methods, M. Opper and D. Saad, Eds., pp. 129–159. MIT Press, 2001. [8] T. P. Minka, A family of algorithms for approximate Bayesian inference, Ph.D. thesis, Massachusetts Institute of Technology, 2001. [9] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 2001, pp. 362–369. [10] S. L. Lauritzen, “Propagation of probabilities, means and variances in mixed graphical association models,” Journal of the American Statistical Association, vol. 87, pp. 1098–1108, 1992. [11] M. Opper, “A Bayesian approach to on-line learning,” On-line learning in neural networks, pp. 363–378, 1999. [12] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.