Improving Convergence of Divergence Functional Ensemble Estimators

Kevin R. Moon∗, Kumar Sricharan†, Kristjan Greenewald∗, Alfred O. Hero III∗
∗ EECS Dept., University of Michigan, {krmoon,greenewk,hero}@umich.edu
† Xerox PARC, [email protected]

arXiv:1601.06884v1 [cs.IT] 26 Jan 2016
Abstract—Recent work has focused on the problem of nonparametric estimation of divergence functionals. Many existing approaches are restrictive in their assumptions on the density support or require difficult calculations at the support boundary which must be known a priori. We derive the MSE convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets where knowledge of the support boundary is not required. We then generalize the theory of optimally weighted ensemble estimation to derive two estimators that achieve the parametric rate when the densities are sufficiently smooth. The asymptotic distribution of these estimators and some guidelines for tuning parameter selection are provided. Based on the theory, we propose an empirical estimator of Rényi-α divergence that outperforms the standard kernel density plug-in estimator, especially in high dimension.
I. INTRODUCTION

Information divergence is a measure of the difference between probability distributions and has many applications in the fields of information theory, statistics, signal processing, and machine learning. Some applications involving divergences include estimating the decay rates of error probabilities [1], estimating bounds on the Bayes error [2]–[7], testing the hypothesis that two sets of samples come from the same probability distribution [8], clustering [9]–[11], feature selection and classification [12]–[14], blind source separation [15], [16], image segmentation [17]–[19], extending machine learning algorithms to distributional features [20]–[23], and steganography [24]. Additionally, mutual information and entropy are both special cases of divergences and have been used in some of the above applications as well as others such as determining channel capacity [1], fMRI data processing [25], intrinsic dimension estimation [26], [27], and texture classification and image registration [28]. For many more applications of divergence measures, see [29].

An important subset of information divergences is the family of f-divergences [30], [31]. This family includes the well-known Kullback-Leibler (KL) divergence [32], the Rényi-α divergence [33], the Hellinger-Bhattacharyya distance [34], [35], the Chernoff-α divergence [5], and the total variation distance.

We consider the problem of estimating divergence functionals when only two finite populations of independent and identically distributed (i.i.d.) samples are available from two d-dimensional distributions that are unknown, nonparametric, and smooth. This paper greatly improves upon the results reported in [7], [36], which explored k-nearest neighbor (k-nn) based estimators of f-divergences. We focus on plug-in kernel density estimators (KDEs) of general divergence functionals. We establish a central limit theorem for the divergence estimator distribution and derive more general conditions on the required smoothness of the densities for the mean squared error (MSE) convergence rates. Specifically, we derive an ensemble estimator that achieves the parametric rate when the densities are at least s ≥ (d + 1)/2 times differentiable, whereas [36] requires s ≥ d. We extend these estimators to more general bounded density support sets in R^d, whereas the proofs in [36] restricted the estimator to compactly supported densities with no boundary conditions (e.g. a support set equal to the surface of a torus), which is unnecessarily restrictive. Finally, we use a leave-one-out approach that uses all of the data for both density estimation and integral approximation, in contrast to [7], [36]–[40], which use a less efficient data-splitting approach.

Other divergence functional estimators with parametric convergence rates have recently been published. Nguyen et al [41] proposed an f-divergence estimator that estimates the likelihood ratio of the two densities by solving a convex optimization problem and then plugs it into the divergence formulas. For this method they prove that the minimax MSE convergence rate is parametric (O(1/N)) when the likelihood ratio is at least d/2 times differentiable. However, this estimator is restricted to true f-divergences instead of the broader class of divergence functionals considered in this paper. Furthermore, solving the convex problem of [41] can be more computationally demanding than our approach when N is large.
Other recent work [39], [40], [42], [43] has focused on estimating various divergence functionals by using an optimal KDE. These estimators achieve parametric convergence rates when the densities have at least d [42], [43] or d/2 [39], [40] derivatives. However, these optimal KDEs are difficult to construct when the density support set is bounded as they require knowledge of the support boundary and difficult computations at the boundary. Additionally, numerical integration is required for some functionals, which can be computationally difficult. Other important functionals are excluded from the estimation framework in [39], [43], including some that bound the Bayes error [2], [4], [6]. In contrast, our method applies to a large class of divergence functionals, does not require knowledge of the support boundary, and avoids numerical integration. The asymptotic distributions of the estimators in [39], [41]–[43] are currently unknown. Kandasamy et al [40] prove a central limit theorem for their data-splitting estimator but do not prove similar results for their leave-one-out estimator.

(This work was partially supported by NSF grant CCF-1217880 and a NSF Graduate Research Fellowship to the first author under Grant No. F031543.)

In Section II, we derive MSE convergence rates for kernel density plug-in divergence functional estimators. We derive the asymptotic distribution of the weighted ensemble estimators in Section III, which enables us to perform hypothesis testing. We then generalize the theory of optimally weighted ensemble entropy estimation developed in [37] to obtain two divergence functional estimators with an MSE convergence rate of O(1/N), where N is the sample size, when the densities are at least d or (d + 1)/2 times differentiable. These estimators apply to general divergence functionals and are simpler to implement than other estimators that also achieve the parametric rate. Based on the theory derived in these sections, we then provide some guidelines for selecting the tuning parameters in Section IV, followed by experimental validation for the special case of Rényi-α divergence estimation in Section V. Bold face type is used for random variables and random vectors. The conditional expectation given a random variable Z is denoted E_Z.

II. THE DIVERGENCE FUNCTIONAL WEAK ESTIMATOR

We focus on estimating functionals of the form
$$G(f_1, f_2) = \int g\left(f_1(x), f_2(x)\right) f_2(x)\, dx, \qquad (1)$$
where g is a smooth functional, and f_1 and f_2 are smooth d-dimensional probability densities. If g(f_1(x), f_2(x)) = g(f_1(x)/f_2(x)), g is convex, and g(1) = 0, then G(f_1, f_2) defines the family of f-divergences. Some common divergences that belong to this family include the KL divergence (g(t) = −ln t), the Rényi-α divergence (g(t) = t^α), and the total variation distance (g(t) = |t − 1|). In this work, we do not require g to be convex nor g(1) = 0, which allows us to consider a broader class of functionals.

We use a kernel density plug-in estimator of the divergence functional in (1). Assume that N i.i.d. realizations {Y_1, ..., Y_N} are available from f_1 and N i.i.d. realizations {X_1, ..., X_N} are available from f_2. Let h > 0 be the kernel bandwidth for the density estimators. Let K(·) be a symmetric, product kernel function with bounded support in each dimension and ||K||_∞ < ∞. We focus on finite support kernels for simplicity in the proofs although it is likely that our results extend to some infinitely supported kernels as well. The KDEs are

$$\tilde f_{1,h}(X_j) = \frac{1}{N h^d}\sum_{i=1}^{N} K\left(\frac{X_j - Y_i}{h}\right), \qquad \tilde f_{2,h}(X_j) = \frac{1}{M h^d}\sum_{\substack{i=1\\ i\neq j}}^{N} K\left(\frac{X_j - X_i}{h}\right),$$
where M = N − 1. G(f_1, f_2) is then approximated as

$$\tilde G_h = \frac{1}{N}\sum_{i=1}^{N} g\left(\tilde f_{1,h}(X_i), \tilde f_{2,h}(X_i)\right). \qquad (2)$$
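As an illustration of Eq. (2), the following Python sketch computes the leave-one-out KDE plug-in estimate with a uniform product kernel. This is a minimal sketch under stated assumptions, not the authors' code: the function name loo_kde_plugin_divergence, the bandwidth value, and the Rényi-α example g(x, y) = (x/y)^α are illustrative choices.

```python
import numpy as np

def loo_kde_plugin_divergence(X, Y, h, g):
    """Leave-one-out kernel density plug-in estimate of G(f1, f2) in Eqs. (1)-(2).

    X : (N, d) array of samples from f2;  Y : (N, d) array of samples from f1.
    g : vectorized callable g(f1_hat, f2_hat).
    A uniform product kernel K(u) = 1{|u_j| <= 1/2 for all j} is used for simplicity.
    """
    N, d = X.shape
    # f1 is estimated at each X_i from all N samples of f1.
    KY = np.all(np.abs((X[:, None, :] - Y[None, :, :]) / h) <= 0.5, axis=2).astype(float)
    f1_hat = KY.sum(axis=1) / (N * h ** d)
    # f2 is estimated at each X_i from the other N-1 samples of f2 (leave-one-out).
    KX = np.all(np.abs((X[:, None, :] - X[None, :, :]) / h) <= 0.5, axis=2).astype(float)
    np.fill_diagonal(KX, 0.0)                      # exclude X_i from its own estimate
    f2_hat = KX.sum(axis=1) / ((N - 1) * h ** d)
    return np.mean(g(f1_hat, f2_hat))

if __name__ == "__main__":
    # Renyi-alpha divergence integral: g(x, y) = (x / y)**alpha, i.e. g(t) = t**alpha.
    rng = np.random.default_rng(0)
    N, d, alpha = 1000, 2, 0.5
    Y = rng.uniform(0, 1, size=(N, d))             # samples from f1 (toy example)
    X = rng.uniform(0, 1, size=(N, d))             # samples from f2 (toy example)
    g = lambda f1, f2: (f1 / np.maximum(f2, 1e-12)) ** alpha
    print(loo_kde_plugin_divergence(X, Y, h=0.3, g=g))
```

The leave-one-out structure appears only in the f̃_{2,h} estimate, where the diagonal of the kernel matrix is zeroed so that X_i does not contribute to its own density estimate.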
Similar to [7], [36], [37], the principal assumptions we make on the densities f_1 and f_2 and the functional g are that: 1) f_1, f_2 are s times differentiable and g is infinitely differentiable; 2) f_1 and f_2 have common bounded support set S; 3) f_1 and f_2 are strictly lower bounded on S. The smoothness assumptions on the densities are relaxed compared to [7], [36], [37]. However, we assume stricter conditions on the smoothness of g to enable us to achieve good rates without knowledge of the boundary of the support set. We also assume 4) that the support is smooth with respect to the kernel K(u) in the sense that the expectation of the area outside of S wrt any random variable u with smooth distribution is a smooth function of the bandwidth h. It is not necessary for S to have smooth contours with no edges as this assumption is satisfied when S = [−1, 1]^d and when K(x) is the uniform rectangular kernel. See Appendix A for more details on all of the assumptions.

Theorem 1: For infinitely differentiable g, the bias of the plug-in estimator \tilde G_h is given by

$$B\left[\tilde G_h\right] = \sum_{j=1}^{s} c_{10,j} h^j + c_{11}\frac{1}{N h^d} + O\left(h^s + \frac{1}{N h^d}\right). \qquad (3)$$
If g(x, y) has k, l-th order mixed derivatives $\frac{\partial^{k+l} g(x,y)}{\partial x^k \partial y^l}$ that depend on x, y only through x^α y^β for some α, β ∈ R, then for any positive integer λ ≥ 2, the bias is

$$B\left[\tilde G_h\right] = \sum_{j=1}^{s} c_{10,j} h^j + \sum_{q=1}^{\lambda/2}\sum_{j=0}^{s} c_{11,q,j}\frac{h^j}{\left(N h^d\right)^q} + O\left(h^s + \frac{1}{\left(N h^d\right)^{\lambda/2}}\right). \qquad (4)$$
Divergence functionals that satisfy the mixed derivatives condition required for (4) include the KL divergence and the Rényi-α divergence. Obtaining higher order terms for other divergence functionals is left for future work.

Theorem 2: Assume that the functional g is Lipschitz continuous in both arguments with constant C_g. Then the variance of the plug-in estimator \tilde G_h is bounded by

$$V\left[\tilde G_h\right] \le \frac{11 C_g^2 \|K\|_\infty^2}{N},$$

where ||K||_∞ is the ℓ_∞ norm of the kernel K in the KDE.

From Theorems 1 and 2, it is clear that we require h → 0 and N h^d → ∞ for \tilde G_h to be unbiased, while the variance of the plug-in estimator depends primarily on the number of samples. The dominating terms in the bias are Θ(h) and Θ(1/(N h^d)). If no attempt is made to correct the bias, the optimal choice of h in terms of minimizing the MSE is h* = Θ(N^{−1/(d+1)}), resulting in a dominant bias term of Θ(N^{−1/(d+1)}). Note that this differs from the standard result for the optimal KDE bandwidth for minimizing the MSE of the density estimate itself, which is Θ(N^{−1/(d+4)}) for the symmetric uniform kernel [44].
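For concreteness, the bandwidth claim above follows from balancing the two dominant bias terms against each other; the following is just a restatement of the rates in Theorems 1 and 2 with only the leading terms kept.

```latex
\[
\mathrm{MSE}\big(\tilde{G}_h\big) \approx
\underbrace{\Big(c_{10,1}\,h + c_{11}\,\tfrac{1}{N h^{d}}\Big)^{2}}_{\text{squared bias}}
\;+\; \underbrace{O\!\Big(\tfrac{1}{N}\Big)}_{\text{variance}}.
\]
% The squared bias is smallest when the two terms decay at the same rate:
\[
h \asymp \frac{1}{N h^{d}}
\;\Longleftrightarrow\; h^{d+1} \asymp \frac{1}{N}
\;\Longleftrightarrow\; h^{*} = \Theta\!\big(N^{-1/(d+1)}\big),
\]
% leaving a dominant bias of Theta(N^{-1/(d+1)}) and hence an MSE of
% Theta(N^{-2/(d+1)}), far slower than the parametric O(1/N) rate for moderate d.
```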
A. Proof Sketches of Theorems 1 and 2
To prove the expressions for the bias, the bias is first decomposed into two parts by adding and subtracting g(E_Z\tilde f_{1,h}(Z), E_Z\tilde f_{2,h}(Z)) within the expectation: a "bias" term and a "variance" term. Applying a Taylor series expansion on these terms results in expressions that depend on powers of B_Z[\tilde f_{i,h}(Z)] := E_Z\tilde f_{i,h}(Z) − f_i(Z) and \tilde e_{i,h}(Z) := \tilde f_{i,h}(Z) − E_Z\tilde f_{i,h}(Z), respectively. A point X ∈ S is defined to be in the interior of S (denoted S_I) if for all Y ∉ S, K((X − Y)/h) = 0. Applying KDE properties and a Taylor series expansion of the densities gives, for Z ∈ S_I,

$$B_Z\left[\tilde f_{i,h}(Z)\right] = \sum_{j=1}^{\lfloor s/2\rfloor} c_{i,j}(Z)\, h^{2j} + O(h^s). \qquad (5)$$
A point X ∈ S is near the boundary of S (denoted S_B) if it is not in S_I. If the boundary is sufficiently smooth, then for Z restricted to S_B and arbitrary γ with sup_{x,y} |γ(x, y)| < ∞ (Appendix C),

$$E\left[\gamma\left(f_1(Z), f_2(Z)\right) B_Z^t\left[\tilde f_{i,h}(Z)\right]\right] = \sum_{j=1}^{s} c_{4,i,j,t}\, h^j + o(h^s).$$

Combining this result with (5) gives an expression for the terms involving B_Z[\tilde f_{i,h}(Z)].

For the terms involving \tilde e_{i,h}(Z), define the random variable V_i(Z) = K((X_i − Z)/h) − E_Z K((X_i − Z)/h). Then by the binomial theorem and properties of KDEs,

$$E_Z\left[V_i^j(Z)\right] = h^d\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,j,m}(Z)\, h^{2m} + O\left(h^{2d}\right).$$
These expressions can then be used to simplify E_Z[\tilde e_{2,h}^q(Z)]. For example, E_Z[\tilde e_{2,h}^2(Z)] = E_Z[V_i^2(Z)]/(N h^{2d}). A similar approach can be used for E_Z[\tilde e_{1,h}^q(Z)]. For general g, we only have that ∂^{k+l} g(E_Z\tilde f_{1,h}(Z), E_Z\tilde f_{2,h}(Z))/(∂x^k ∂y^l) = ∂^{k+l} g(f_1(Z), f_2(Z))/(∂x^k ∂y^l) + o(1). Thus we can only guarantee the bias expansion up to the O(1/(N h^d)) term in (3). However, if the mixed derivatives of g(x, y) have the form x^α y^β, then we can apply the generalized binomial theorem to the derivatives of g evaluated at E_Z\tilde f_{i,h}(Z) to obtain (4).
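For concreteness, the q = 2 case quoted above is a one-line computation (it is carried out in full in Appendix C): conditioned on Z the V_i(Z) are i.i.d. with zero mean, so all cross terms vanish.

```latex
\[
\tilde{e}_{2,h}(Z) = \frac{1}{N h^{d}} \sum_{i=1}^{N} V_i(Z)
\quad\Longrightarrow\quad
\mathbb{E}_Z\!\left[\tilde{e}_{2,h}^{2}(Z)\right]
= \frac{1}{N^{2} h^{2d}} \sum_{i=1}^{N} \mathbb{E}_Z\!\left[V_i^{2}(Z)\right]
= \frac{\mathbb{E}_Z\!\left[V_i^{2}(Z)\right]}{N h^{2d}}.
\]
```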
The proof of the variance bound uses the Efron-Stein inequality, which bounds the variance by the expected squared difference between the plug-in estimator computed on the original samples and on the samples with one point replaced by an independent copy. It can be shown that if X_1 and X_1' are drawn independently from f_2, then

$$E\left[\left(\tilde f_{i,h}(X_1) - \tilde f_{i,h}(X_1')\right)^2\right] \le 2\|K\|_\infty^2, \qquad E\left[\left(\tilde f_{i,h}(X_j) - \tilde f'_{i,h}(X_j)\right)^2\right] \le \frac{2\|K\|_\infty^2}{(N-1)^2},$$

where we denote \tilde f'_{i,h} and \tilde G'_h as the estimators with X_1 replaced with X_1'. Applying the Lipschitz condition and Jensen's inequality with these results gives

$$E\left[\left(\tilde G_h - \tilde G'_h\right)^2\right] \le 20\, C_g^2\|K\|_\infty^2 / N^2.$$

A similar result can be derived when Y_1 and Y_1' are instead considered. The Efron-Stein inequality finishes the proof. The full proofs of both theorems are given in Appendices C and D.

III. WEIGHTED ENSEMBLE ESTIMATION
Theorem 1 shows that when the dimension of the data is not small, the bias of the plug-in estimator \tilde G_h decreases very slowly as a function of sample size, resulting in large MSE. However, by applying the theory of optimally weighted ensemble estimation, developed in [37] for entropy estimation, we can take a weighted sum of an ensemble of estimators and decrease the bias via an appropriate choice of weights.

We form an ensemble of estimators by choosing different values of h. Choose l̄ = {l_1, ..., l_L} to be real positive numbers that index h(l_i). Thus the parameter l indexes over different kernel bandwidths for the KDEs. Define w := {w(l_1), ..., w(l_L)} and \tilde G_w := Σ_{l∈l̄} w(l) \tilde G_{h(l)}. The key to reducing the MSE is to choose the weight vector w to reduce the lower order terms in the bias without substantially increasing the variance.

We first present the asymptotic distribution of \tilde G_w, which enables us to perform inference tasks like hypothesis testing on the divergence functional. The proof is based on the Efron-Stein inequality and an application of Slutsky's Theorem. Details are given in Appendix E.

Theorem 3: Assume that the functional g is Lipschitz in both arguments with constant C_g. Further assume that h = o(1), N → ∞, and N h^d → ∞. Then for fixed L, the asymptotic distribution of the weighted ensemble estimator \tilde G_w is

$$\Pr\left(\left(\tilde G_w - E\left[\tilde G_w\right]\right)\Big/\sqrt{V\left[\tilde G_w\right]} \le t\right) \to \Pr(S \le t),$$

where S is a standard normal random variable.

The theory of optimally weighted ensemble estimation is a general theory originally presented by Sricharan et al [37] that can be applied to many estimation problems as long as the bias and variance of the estimator can be expressed in a specific way. We generalize the conditions required to apply the theory. Let l̄ = {l_1, ..., l_L} be a set of index values, e.g., kernel widths, and N the number of samples available. For an indexed ensemble of estimators {Ê_l}_{l∈l̄} of a parameter E, the weighted ensemble estimator with weights w = {w(l_1), ..., w(l_L)} satisfying Σ_{l∈l̄} w(l) = 1 is defined as Ê_w = Σ_{l∈l̄} w(l) Ê_l. Ê_w is asymptotically unbiased if the estimators {Ê_l}_{l∈l̄} are asymptotically unbiased. Consider:
• C.1 For each l ∈ l̄, the bias of Ê_l is given by

$$B\left[\hat E_l\right] = \sum_{i\in J} c_i\,\psi_i(l)\,\phi_{i,d}(N) + O\left(\frac{1}{\sqrt{N}}\right),$$

where c_i are constants depending on the underlying density, J = {i_1, ..., i_I} is a finite index set with I < L, and ψ_i(l) are basis functions depending only on the parameter l and not the sample size.

• C.2 For each l ∈ l̄, the variance of Ê_l is given by

$$V\left[\hat E_l\right] = c_v\frac{1}{N} + o\left(\frac{1}{N}\right).$$

Theorem 4: Assume conditions C.1 and C.2 hold for an ensemble of estimators {Ê_l}_{l∈l̄}. Then there exists a weight vector w_0 such that

$$E\left[\left(\hat E_{w_0} - E\right)^2\right] = O\left(\frac{1}{N}\right).$$
The weight vector w_0 is the solution to the following convex optimization problem:

$$\begin{aligned}
\min_{w}\quad & \|w\|_2\\
\text{subject to}\quad & \sum_{l\in\bar l} w(l) = 1,\\
& \gamma_w(i) = \sum_{l\in\bar l} w(l)\,\psi_i(l) = 0,\; i\in J. \qquad (6)
\end{aligned}$$
Proof: A more restrictive version of Theorem 4 was originally presented in [37] with the stricter condition of φ_{i,d}(N) = N^{−i/(2d)}. The proof of our generalized version (Theorem 4) is sketched here. From condition C.1, the bias of the weighted estimator is

$$B\left[\hat E_w\right] = \sum_{i\in J} c_i\,\gamma_w(i)\,\phi_{i,d}(N) + O\left(\frac{\sqrt{L}\,\|w\|_2}{\sqrt{N}}\right).$$

The variance of the weighted estimator is bounded as

$$V\left[\hat E_w\right] \le \frac{L\,\|w\|_2^2}{N}. \qquad (7)$$

The optimization problem in Eq. (6) zeroes out the lower-order bias terms and limits the ℓ_2 norm of the weight vector w to limit the variance contribution. This results in an MSE rate of O_w(1/N) when the dimension d is fixed and when L is fixed independently of the sample size N. Furthermore, a solution to Eq. (6) is guaranteed to exist as long as L > I and the vectors a_i = [ψ_i(l_1), ..., ψ_i(l_L)] are linearly independent.

To achieve the parametric rate O(1/N) in MSE convergence it is not necessary that γ_w(i) = 0, i ∈ J. Solving the following convex optimization problem in place of the optimization problem in Theorem 4 retains the O(1/N) rate:

$$\begin{aligned}
\min_{w}\quad & \epsilon\\
\text{subject to}\quad & \sum_{l\in\bar l} w(l) = 1,\\
& \|w\|_2^2 \le \eta,\\
& \left|\gamma_w(i)\, N^{\frac{1}{2}}\,\phi_{i,d}(N)\right| \le \epsilon,\; i\in J, \qquad (8)
\end{aligned}$$
where the parameter η is chosen to achieve a trade-off between bias and variance. Instead of forcing γ_w(i) = 0, the relaxed optimization problem uses the weights to decrease the bias terms at the rate of O_w(1/√N), yielding an MSE of O_w(1/N). We refer to the distributional functional estimators obtained using this theory as Optimally Weighted Distributional Functional (ODin) estimators.

Sricharan et al [37] applied a more restrictive version of Theorem 4 to obtain an entropy estimator with convergence rate O_w(1/N). We apply the same theory to obtain a divergence functional estimator with the same asymptotic rate. Let h(l) = l N^{−1/(2d)}. From Theorem 1, ψ_i(l) = l^i, i = 1, ..., d. Note that if s ≥ d, then we are left with O(l^d/√N) in addition to the terms in the sum. To obtain a uniform bound on the bias with respect to w and l̄, we also include the function ψ_{d+1}(l) = l^{−d} in the optimization problem. The bias of the resulting base estimator satisfies condition C.1 with φ_{i,d}(N) = N^{−i/(2d)} for i = 1, ..., d and φ_{d+1,d}(N) = N^{−1/2}. The variance also satisfies condition C.2. The optimal weight w_0 is found by using Eq. (8) to obtain a plug-in divergence functional estimator \tilde G_{w_0,1} with an MSE convergence rate of O_{w_0}(1/N) as long as L ≥ d and s ≥ d. Otherwise, we can only guarantee the rate up to O_{w_0}(1/N^{s/d}). We refer to this estimator as the ODin1 estimator.

We define another weighted ensemble estimator that imposes less restrictive assumptions on the densities' smoothness by letting h(l) decrease at a faster rate. Let h(l) = l N^{−1/(d+1)}. From Theorem 1, if g(x, y) has mixed partial derivatives of the form x^α y^β, then the bias has terms proportional to l^{j−dq} N^{−(j+q)/(d+1)} where j, q ≥ 0 and j + q > 0. Theorem 4 can then be applied to this ensemble of estimators. Let φ_{j,q,d}(N) = N^{−(j+q)/(d+1)} and ψ_{j,q}(l) = l^{j−dq}. Let

$$J = \left\{\{j, q\}\,\middle|\, 0 < j + q \le (d+1)/2,\; q\in\{0, 1, \dots, \lambda/2\},\; j\in\{0, 1, \dots, s\}\right\}. \qquad (9)$$

Then from Eq. (4), the bias of \tilde G_{h(l)} satisfies condition C.1. If L > |J|, then Theorem 4 can be applied to obtain the optimal weight vector. The corresponding estimator \tilde G_{w_0,2} achieves the parametric convergence rate if λ ≥ d + 1 and if s ≥ (d + 1)/2. This is more in line with existing estimators that achieve parametric rates of MSE convergence when s ≥ d/2 [39]. \tilde G_{w_0,2} is referred to as the ODin2 estimator and is summarized in Algorithm 1.

These improvements come at a cost in the number of kernel parameters L. If s ≥ (d + 1)/2, then the size of J is on the order of d²/8. This may lead to increased variance of the ensemble estimator for high dimensions (see Eq. (7)). Future work is also required to extend ODin2 to other functionals of interest.
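Below is a sketch of how the relaxed weight optimization in Eq. (8) can be solved numerically. The paper does not prescribe a particular solver; cvxpy and the helper name odin_weights are our own illustrative choices, and the ODin1-style basis in the example block is just one possible instantiation.

```python
import numpy as np
import cvxpy as cp

def odin_weights(l_values, psi, phi, N, eta):
    """Sketch of the relaxed weight optimization in Eq. (8).

    l_values : 1-D array of the L index values in l-bar.
    psi      : list of callables psi_i(l), one per i in J.
    phi      : list of callables phi_{i,d}(N), matching psi.
    Returns the weight vector w0 over l-bar.
    """
    L = len(l_values)
    w = cp.Variable(L)
    eps = cp.Variable(nonneg=True)
    constraints = [cp.sum(w) == 1, cp.sum_squares(w) <= eta]
    for psi_i, phi_i in zip(psi, phi):
        # |gamma_w(i) * sqrt(N) * phi_{i,d}(N)| <= eps for each i in J
        gamma = cp.sum(cp.multiply(w, psi_i(l_values)))
        constraints.append(cp.abs(gamma * np.sqrt(N) * phi_i(N)) <= eps)
    cp.Problem(cp.Minimize(eps), constraints).solve()
    return np.asarray(w.value)

if __name__ == "__main__":
    # ODin1-style basis for illustration: psi_i(l) = l**i, phi_{i,d}(N) = N**(-i/(2d)),
    # plus the extra pair psi_{d+1}(l) = l**(-d), phi_{d+1,d}(N) = N**(-1/2).
    d, N, L = 4, 5000, 50
    l_values = np.linspace(1.5, 3.0, L)
    idx = list(range(1, d + 1))
    psi = [lambda l, i=i: l ** float(i) for i in idx] + [lambda l: l ** (-float(d))]
    phi = [lambda n, i=i: n ** (-i / (2.0 * d)) for i in idx] + [lambda n: n ** (-0.5)]
    w0 = odin_weights(l_values, psi, phi, N, eta=3.0)
    print(w0.sum(), np.linalg.norm(w0))
```

For ODin2, the same routine applies with ψ_{j,q}(l) = l^{j−dq} and φ_{j,q,d}(N) = N^{−(j+q)/(d+1)} over the index set J of Eq. (9).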
Algorithm 1 Optimally weighted ensemble estimator of divergence functionals
Input: η, L positive real numbers l̄, samples {Y_1, ..., Y_N} from f_1, samples {X_1, ..., X_N} from f_2, dimension d, function g, kernel K
Output: The optimally weighted divergence estimator \tilde G_{w_0,2}
1: Solve for w_0 using Eq. (8) with φ_{j,q,d}(N) = N^{−(j+q)/(d+1)} and basis functions ψ_{j,q}(l) = l^{j−dq}, l ∈ l̄, and {j, q} ∈ J defined in Eq. (9)
2: for all l ∈ l̄ do
3:   h(l) ← l N^{−1/(d+1)}
4:   for i = 1 to N do
5:     \tilde f_{1,h(l)}(X_i) ← (1/(N h(l)^d)) Σ_{j=1}^{N} K((X_i − Y_j)/h(l)),  \tilde f_{2,h(l)}(X_i) ← (1/((N−1) h(l)^d)) Σ_{j≠i} K((X_i − X_j)/h(l))
6:   end for
7:   \tilde G_{h(l)} ← (1/N) Σ_{i=1}^{N} g(\tilde f_{1,h(l)}(X_i), \tilde f_{2,h(l)}(X_i))
8: end for
9: \tilde G_{w_0,2} ← Σ_{l∈l̄} w_0(l) \tilde G_{h(l)}
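The sketch below strings the pieces of Algorithm 1 together: it builds the index set J of Eq. (9), computes a weight vector, and averages the per-bandwidth plug-in estimates. For brevity it solves the stricter problem (6) in closed form (the minimum-norm solution of the linear constraints) rather than the relaxed problem (8) used in the paper, and the function names plugin_estimate and odin2_estimate are hypothetical.

```python
import numpy as np

def plugin_estimate(X, Y, h, g):
    """Uniform-product-kernel, leave-one-out plug-in estimate of Eq. (2)."""
    N, d = X.shape
    KY = np.all(np.abs((X[:, None, :] - Y[None, :, :]) / h) <= 0.5, axis=2).astype(float)
    KX = np.all(np.abs((X[:, None, :] - X[None, :, :]) / h) <= 0.5, axis=2).astype(float)
    np.fill_diagonal(KX, 0.0)                      # leave-one-out: drop X_i from its own estimate
    f1 = KY.sum(axis=1) / (N * h ** d)
    f2 = KX.sum(axis=1) / ((N - 1) * h ** d)
    return np.mean(g(f1, f2))

def odin2_estimate(X, Y, g, l_values, lam=None):
    """Sketch of Algorithm 1 with weights from the stricter problem (6):
    the minimum-norm w satisfying sum_l w(l) = 1 and gamma_w(j, q) = 0 for {j, q} in J."""
    N, d = X.shape
    lam = d + 1 if lam is None else lam            # lambda >= d + 1 gives the parametric rate (Sec. III)
    # Index set J of Eq. (9) and basis functions psi_{j,q}(l) = l^(j - d q).
    J = [(j, q)
         for q in range(0, lam // 2 + 1)
         for j in range(0, int(np.ceil((d + 1) / 2)) + 1)
         if 0 < j + q <= (d + 1) / 2]
    A = np.vstack([np.ones(len(l_values))] +
                  [l_values ** float(j - d * q) for (j, q) in J])
    b = np.zeros(A.shape[0]); b[0] = 1.0
    w0, *_ = np.linalg.lstsq(A, b, rcond=None)     # minimum-norm solution of A w = b
    h = l_values * N ** (-1.0 / (d + 1))           # h(l) = l N^{-1/(d+1)}
    return sum(w * plugin_estimate(X, Y, hl, g) for w, hl in zip(w0, h))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, d, alpha = 500, 3, 0.5
    X = rng.uniform(0, 1, (N, d))                  # samples from f2 (toy example)
    Y = rng.uniform(0, 1, (N, d))                  # samples from f1 (toy example)
    g = lambda f1, f2: (f1 / np.maximum(f2, 1e-12)) ** alpha
    print(odin2_estimate(X, Y, g, l_values=np.linspace(1.5, 3.0, 50)))
```

With L = 50 bandwidths and d = 3 the constraint system is underdetermined, so a minimum-norm w satisfying Σ_l w(l) = 1 and γ_w(j, q) = 0 exists and np.linalg.lstsq returns it.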
IV. TUNING PARAMETER SELECTION

The optimization problem in Eq. (8) has the parameters η, L, and l̄. The parameter η provides an upper bound on the norm of the weight vector, which gives an upper bound on the constant in the variance of the ensemble estimator. If all the constants in Eq. (3) or Eq. (4) and an exact expression for the variance of the ensemble estimator were known, then η could be chosen to minimize the MSE. Since the constants are unknown, by applying Eq. (8), the resulting MSE of the ensemble estimator is O(ε²/N) + O(Lη²/N), where the two terms come from the bias and variance, respectively. Since there is a tradeoff between η and ε, setting η = ε/√L would minimize these terms in theory. In practice, we find that the variance of the ensemble estimator is less than the upper bound of Lη²/N and setting η = ε/√L is therefore too restrictive. Setting η = ε instead works well in practice.

For fixed L, the set of kernel widths l̄ can in theory be chosen by minimizing ε in Eq. (8) over l̄ in addition to w. However, this results in a nonconvex optimization problem since w does not lie in the non-negative orthant. A parameter search may not be practical as ε generally decreases as the size and spread of l̄ increases. This decrease in ε does not always correspond to a decrease in MSE as high and low values of h(l) can lead to inaccurate density estimates. Denote by l_min the minimum value of l such that \tilde f_{i,h(l_min)}(X_j) > 0 for all i = 1, 2, and denote the diameter of the support S by D. To ensure the density estimates are bounded away from zero, we require that min(l̄) ≥ l_min. The weights in w_0 are generally largest for the smallest values of l̄, so min(l̄) should also be sufficiently larger than l_min to render an adequate estimate. Similarly, max(l̄) should be sufficiently smaller than D as high bandwidth values lead to high bias. The remaining l̄ values are chosen to be equally spaced between min(l̄) and max(l̄). Large L increases the similarity of the bandwidth values h(l) and basis functions ψ_i(l), which results in a negligible decrease in the bias. Hence L should be chosen large enough to decrease the bias but small enough so that the h(l) values are sufficiently distinct (typically 30 ≤ L ≤ 60).

V. EXPERIMENTS: RÉNYI-α DIVERGENCE

To validate our theory, we estimated the Rényi-α divergence integral between two truncated multivariate Gaussian distributions with varying dimension and sample sizes. The densities have means μ̄_1 = 0.7 · 1̄_d, μ̄_2 = 0.3 · 1̄_d and covariance matrices 0.1 · I_d, where 1̄_d is a d-dimensional vector of ones and I_d is the d-dimensional identity matrix. We used α = 0.5 and restricted the Gaussians to the unit cube.

The top figure in Fig. 1 shows the MSE (200 trials) of the uniform kernel plug-in estimator, the two proposed optimally weighted estimators, and a linear combination of ODin1 and ODin2, \tilde G_ρ = (1 − ρ)\tilde G_{w_0,1} + ρ\tilde G_{w_0,2}, for various dimensions and sample sizes. l̄, L, and ρ are tuned to minimize the MSE. The bandwidth for the plug-in estimator is the value from those used in the ODin2 ensemble that minimizes the MSE. Note that for d = 4, the plug-in estimator performs comparably with the optimally weighted estimators. However, for d = 7, 10, the plug-in estimator performs considerably worse. This demonstrates the strength of ensemble estimation, as the weighted sum of these poor estimators results in a very good estimator.

The bottom figure in Fig. 1 shows the corresponding average estimates with standard error bars compared to the true values. ODin1 has smaller variance than ODin2 when d = 4 and slightly larger variance when d = 10. The average values for ρ are 0.16, 0.59, 0.61 when d = 4, 7, 10, which indicates a preference for ODin1 for low dimension and a slight preference for ODin2 for higher dimensions. Paired t-tests on the MSE (125 trials) of the two methods indicate that the differences are statistically significant (see Table I).
[Figure 1: log-log plot of MSE versus sample size N (top) and average Rényi-α integral estimates versus sample size N (bottom) for the Kernel, ODin1, ODin2, and combined estimators; see the caption below.]
Figure 1. (Top) Log-log plot of MSE of the uniform kernel plug-in ("Kernel"), the two proposed optimally weighted estimators (ODin1 and ODin2), and the optimal linear combination of ODin1 and ODin2 for various dimensions and sample sizes. (Bottom) Plot of the average value of the same estimators with standard error bars compared to the true values being estimated when d = 4, 10. The proposed weighted ensemble estimators match the theoretical rate (average log-log slope is −0.98 and −0.99 for ODin1 and ODin2, respectively) and perform better than the plug-in estimator for high dimensions.

Table I
P-VALUES OF PAIRED T-TESTS OF ODIN1 VS. ODIN2 MSE (N = 1300).

                 Dim. 4   Dim. 10
ODin1 > ODin2       1        0
ODin1 < ODin2       0        1

APPENDIX A
ASSUMPTIONS

(A.5) Boundary smoothness: for any polynomial p_x(u) : R^d → R of order q ≤ r = ⌊s⌋ with coefficients that are r − q times differentiable wrt x,

$$\int_{x\in S}\left(\int_{u:K(u)>0,\, x+uh\notin S} p_x(u)\, du\right)^t dx = v_t(h),$$

where v_t(h) admits the expansion

$$v_t(h) = \sum_{i=1}^{r-q} e_{i,q,t}\, h^i + o\left(h^{r-q}\right),$$

for some constants e_{i,q,t}. Assumption A.5 appears to be complex but is actually satisfied by a simple case:

Theorem 5: Assumption A.5 is satisfied when S = [−1, 1]^d and when K(x) is the uniform rectangular kernel; that is, K(x) = 1 for all x with ||x||_1 ≤ 1/2.
APPENDIX B
PROOF OF THEOREM 5

Consider a uniform rectangular kernel K(x) that satisfies K(x) = 1 for all x such that ||x||_1 ≤ 1/2. Also consider the family of probability densities f with rectangular support S = [−1, 1]^d. We will show that S satisfies the following smoothness condition (A.5): for any polynomial p_x(u) : R^d → R of order q ≤ r = ⌊s⌋ with coefficients that are r − q times differentiable wrt x,

$$\int_{x\in S}\left(\int_{u:\|u\|_1\le\frac{1}{2},\, x+uh\notin S} p_x(u)\, du\right)^t dx = v_t(h), \qquad (10)$$

where v_t(h) has the expansion

$$v_t(h) = \sum_{i=1}^{r-q} e_{i,q,t}\, h^i + o\left(h^{r-q}\right).$$
Note that the inner integral forces the x's under consideration to be boundary points via the constraint x + uh ∉ S.

A. Single Coordinate Boundary Point

We begin by focusing on points x which are boundary points by virtue of a single coordinate x_i such that x_i + u_i h ∉ S. WLOG, assume that x_i + u_i h > 1. The inner integral in Eq. 10 can then be evaluated first wrt all coordinates other than i. Since all of these coordinates lie within the support, the inner integral over these coordinates will amount to integration of the polynomial p_x(u) over a symmetric d − 1 dimensional rectangular region |u_j| ≤ 1/2 for all j ≠ i. This yields a function Σ_{m=1}^{q} p̃_m(x) u_i^m where the coefficients p̃_m(x) are each r − q times differentiable wrt x.

With respect to the u_i coordinate, the inner integral will have limits from (1 − x_i)/h to 1/2 for some 1 > x_i > 1 − h/2. Integrating the resulting polynomial in u_i yields

$$\int_{\frac{1-x_i}{h}}^{\frac{1}{2}}\sum_{m=1}^{q}\tilde p_m(x)\, u_i^m\, du_i = \sum_{m=1}^{q}\frac{\tilde p_m(x)}{m+1}\left(\frac{1}{2^{m+1}} - \left(\frac{1-x_i}{h}\right)^{m+1}\right). \qquad (11)$$

Raising the right hand side of Eq. 11 to the power of t results in an expression of the form

$$\sum_{j=0}^{qt}\check p_j(x)\left(\frac{1-x_i}{h}\right)^j, \qquad (12)$$
where the coefficients p̌_j(x) are r − q times differentiable wrt x. Integrating Eq. 12 over all the coordinates in x other than x_i results in an expression of the form

$$\sum_{j=0}^{qt}\bar p_j(x_i)\left(\frac{1-x_i}{h}\right)^j, \qquad (13)$$

where again the coefficients p̄_j(x_i) are r − q times differentiable wrt x_i. Note that since the other coordinates of x other than x_i are far away from the boundary, the coefficients p̄_j(x_i) are independent of h. To evaluate the integral of Eq. 13, consider the r − q term Taylor series expansion of p̄_j(x_i) around x_i = 1. This will yield terms of the form

$$\int_{1-h/2}^{1}\frac{(1-x_i)^{j+k}}{h^k}\, dx_i = -\left.\frac{(1-x_i)^{j+k+1}}{h^k (j+k+1)}\right|_{x_i=1-h/2}^{x_i=1} = \frac{h^{j+1}}{(j+k+1)\,2^{j+k+1}},$$

for 0 ≤ j ≤ r − q and 0 ≤ k ≤ qt. Combining terms results in the expansion v_t(h) = Σ_{i=1}^{r-q} e_{i,q,t} h^i + o(h^{r-q}).

B. Multiple Coordinate Boundary Point
The case where multiple coordinates of the point x are near the boundary is a straightforward extension of the single boundary point case, so we only sketch the main ideas here. As an example, consider the case where 2 of the coordinates are near the boundary. Assume WLOG they are x_1 and x_2 and that x_1 + u_1 h > 1 and x_2 + u_2 h > 1. The inner integral in Eq. 10 can again be evaluated first wrt all coordinates other than 1 and 2. This yields a function Σ_{m,j=1}^{q} p̃_{m,j}(x) u_1^m u_2^j where the coefficients p̃_{m,j}(x) are each r − q times differentiable wrt x. Integrating this wrt u_1 and u_2 and then raising the result to the power of t yields a double sum similar to Eq. 12. Integrating this over all the coordinates in x other than x_1 and x_2 gives a double sum similar to Eq. 13. Then a Taylor series expansion of the coefficients and integration over x_1 and x_2 yields the result.
APPENDIX C
PROOF OF THEOREM 1

We prove generalized versions of Theorems 1, 2, and 3 where the densities may have different bandwidths h_1 and h_2 and numbers of samples N_1 and N_2. The bias of the base estimator \tilde G_{h_1,h_2} can be expressed as

$$\begin{aligned}
B\left[\tilde G_{h_1,h_2}\right] &= E\left[g\left(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)\right) - g\left(f_1(Z), f_2(Z)\right)\right] \\
&= E\left[g\left(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)\right) - g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right)\right] \\
&\quad + E\left[g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right) - g\left(f_1(Z), f_2(Z)\right)\right], \qquad (14)
\end{aligned}$$
where Z is drawn from f_2. We find bounds for these terms by using Taylor series expansions. The Taylor series expansion of g(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)) around f_1(Z) and f_2(Z) is

$$g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right) = \sum_{i=0}^{\infty}\sum_{j=0}^{\infty}\left.\frac{\partial^{i+j} g(x,y)}{\partial x^i\,\partial y^j}\right|_{\substack{x=f_1(Z)\\ y=f_2(Z)}}\frac{B_Z^i\left[\tilde f_{1,h_1}(Z)\right] B_Z^j\left[\tilde f_{2,h_2}(Z)\right]}{i!\, j!}, \qquad (15)$$
where B_Z^j[\tilde f_{i,h_i}(Z)] := (E_Z\tilde f_{i,h_i}(Z) − f_i(Z))^j. This expansion can be used to control the second term in Eq. 14. To accomplish this, we require an expression for E_Z\tilde f_{i,h_i}(Z) − f_i(Z) = B_Z[\tilde f_{i,h_i}(Z)].

To obtain an expression for B_Z[\tilde f_{i,h_i}(Z)], we consider separately the cases when Z is in the interior of the support S or when Z is near the boundary of the support. A point X ∈ S is defined to be in the interior of S if for all Y ∉ S, K((X − Y)/h_i) = 0. A point X ∈ S is near the boundary of the support if it is not in the interior. Denote the regions in the interior and near the boundary wrt h_i as S_{I_i} and S_{B_i}, respectively.

Lemma 1: Let Z be a realization of the density f_2 independent of \tilde f_{i,h_i} for i = 1, 2. Assume that the densities f_1 and f_2 belong to Σ(s, L). Then for Z ∈ S_{I_i},

$$E_Z\tilde f_{i,h_i}(Z) = f_i(Z) + \sum_{j=\nu/2}^{\lfloor s/2\rfloor} c_{i,j}(Z)\, h_i^{2j} + O(h_i^s). \qquad (16)$$
Proof: Obtaining the lower order terms in Eq. 16 is a common result in kernel density estimation. However, since we also require the higher order terms, we present the proof here. Additionally, some of the results in the proof will be useful later. From the linearity of the KDE, we have that if X is drawn from f_i and is independent of Z, then

$$E_Z\tilde f_{i,h_i}(Z) = E_Z\left[\frac{1}{h_i^d} K\left(\frac{X-Z}{h_i}\right)\right] = \frac{1}{h_i^d}\int K\left(\frac{x-Z}{h_i}\right) f_i(x)\, dx = \int K(t)\, f_i(t h_i + Z)\, dt, \qquad (17)$$

where the last step follows from the substitution t = (x − Z)/h_i. Since the density f_i belongs to Σ(s, K), using multi-index notation we can expand it as

$$f_i(t h_i + Z) = f_i(Z) + \sum_{|\alpha|\le\lfloor s\rfloor}\frac{D^\alpha f_i(Z)}{\alpha!}(t h_i)^\alpha + O\left(\|t h_i\|^s\right), \qquad (18)$$

where α! = α_1! α_2! ... α_d! and Y^α = Y_1^{α_1} Y_2^{α_2} ... Y_d^{α_d}. Combining Eqs. 17 and 18 gives

$$E_Z\tilde f_{i,h_i}(Z) = f_i(Z) + \sum_{|\alpha|\le\lfloor s\rfloor}\frac{D^\alpha f_i(Z)}{\alpha!}\, h_i^{|\alpha|}\int t^\alpha K(t)\, dt + O(h_i^s) = f_i(Z) + \sum_{j=\nu/2}^{\lfloor s/2\rfloor} c_{i,j}(Z)\, h_i^{2j} + O(h_i^s),$$

where the last step follows from the fact that K is symmetric and of order ν. To obtain a similar result for the case when Z is near the boundary of S, we use assumption A.5.
Lemma 2: Let γ(x, y) be an arbitrary function satisfying sup_{x,y} |γ(x, y)| < ∞. Let S satisfy the boundary smoothness conditions. Assume that the densities f_1 and f_2 belong to Σ(s, L) and let Z be a realization of the density f_2 independent of \tilde f_{i,h_i} for i = 1, 2. Let h' = min(h_1, h_2). Then

$$E\left[\mathbf{1}_{\{Z\in S_{B_i}\}}\,\gamma\left(f_1(Z), f_2(Z)\right) B_Z^t\left[\tilde f_{i,h_i}(Z)\right]\right] = \sum_{j=1}^{r} c_{4,i,j,t}\, h_i^j + o(h_i^r), \qquad (19)$$

$$E\left[\mathbf{1}_{\{Z\in S_{B_1}\cap S_{B_2}\}}\,\gamma\left(f_1(Z), f_2(Z)\right) B_Z^t\left[\tilde f_{1,h_1}(Z)\right] B_Z^q\left[\tilde f_{2,h_2}(Z)\right]\right] = \sum_{j=0}^{r-1}\sum_{i=0}^{r-1} c_{4,j,i,q,t}\, h_1^j h_2^i\, h' + o\left(h'^{\,r}\right). \qquad (20)$$
Proof: For fixed X near the boundary of S, we have

$$\begin{aligned}
E\left[\tilde f_{i,h_i}(X)\right] - f_i(X) &= \frac{1}{h_i^d}\int_{Y\in S} K\left(\frac{X-Y}{h_i}\right) f_i(Y)\, dY - f_i(X) \\
&= \left[\frac{1}{h_i^d}\int_{Y: K\left(\frac{X-Y}{h_i}\right)>0} K\left(\frac{X-Y}{h_i}\right) f_i(Y)\, dY - f_i(X)\right] - \frac{1}{h_i^d}\int_{Y\notin S} K\left(\frac{X-Y}{h_i}\right) f_i(Y)\, dY \\
&= T_{1,i}(X) - T_{2,i}(X).
\end{aligned}$$

Note that in T_{1,i}(X), we are extending the integral beyond the support of the density f_i. However, by using the same Taylor series expansion method as in the proof of Lemma 1, we always evaluate f_i and its derivatives at the point X which is within the support of f_i. Thus it does not matter how we define an extension of f_i since the Taylor series will remain the same. Thus T_{1,i}(X) results in an identical expression to that obtained from Eq. 16. For the T_{2,i}(X) term, we expand it using multi-index notation as

$$T_{2,i}(X) = \frac{1}{h_i^d}\int_{Y\notin S} K\left(\frac{X-Y}{h_i}\right) f_i(Y)\, dY = \int_{u: h_i u + X\notin S,\, K(u)>0} K(u)\, f_i(X + h_i u)\, du = \sum_{|\alpha|\le r}\frac{h_i^{|\alpha|}}{\alpha!}\int_{u: h_i u + X\notin S,\, K(u)>0} K(u)\, D^\alpha f_i(X)\, u^\alpha\, du + o(h_i^r).$$

Recognizing that the |α|th derivative of f_i is r − |α| times differentiable, we can apply assumption A.5 to obtain the expectation of T_{2,i}(X) wrt X:

$$\begin{aligned}
E\left[T_{2,i}(X)\right] &= \int_X\frac{1}{h_i^d}\int_{Y\notin S} K\left(\frac{X-Y}{h_i}\right) f_i(Y)\, dY\, f_2(X)\, dX \\
&= \sum_{|\alpha|\le r}\frac{h_i^{|\alpha|}}{\alpha!}\int_X\int_{u: h_i u + X\notin S,\, K(u)>0} K(u)\, D^\alpha f_i(X)\, u^\alpha\, du\, f_2(X)\, dX + o(h_i^r) \\
&= \sum_{|\alpha|\le r}\frac{h_i^{|\alpha|}}{\alpha!}\left(\sum_{1\le|\beta|\le r-|\alpha|} e_{\beta, r-|\alpha|}\, h_i^{|\beta|} + o\left(h_i^{r-|\alpha|}\right)\right) + o(h_i^r) \\
&= \sum_{j=1}^{r} e_j\, h_i^j + o(h_i^r).
\end{aligned}$$

Similarly, we find that

$$E\left[\left(T_{2,i}(X)\right)^t\right] = \int_X\left(\frac{1}{h_i^d}\int_{Y\notin S} K\left(\frac{X-Y}{h_i}\right) f_i(Y)\, dY\right)^t f_2(X)\, dX = \int_X\left(\sum_{|\alpha|\le r}\frac{h_i^{|\alpha|}}{\alpha!}\int_{u: h_i u + X\notin S,\, K(u)>0} K(u)\, D^\alpha f_i(X)\, u^\alpha\, du\right)^t f_2(X)\, dX = \sum_{j=1}^{r} e_{j,t}\, h_i^j + o(h_i^r).$$

Combining these results gives

$$\begin{aligned}
E\left[\mathbf{1}_{\{Z\in S_{B_i}\}}\,\gamma\left(f_1(Z), f_2(Z)\right)\left(E_Z\tilde f_{i,h_i}(Z) - f_i(Z)\right)^t\right] &= E\left[\gamma\left(f_1(Z), f_2(Z)\right)\left(T_{1,i}(Z) - T_{2,i}(Z)\right)^t\right] \\
&= E\left[\gamma\left(f_1(Z), f_2(Z)\right)\sum_{j=0}^{t}\binom{t}{j}\left(T_{1,i}(Z)\right)^j\left(-T_{2,i}(Z)\right)^{t-j}\right] \\
&= \sum_{j=1}^{r} c_{4,i,j,t}\, h_i^j + o(h_i^r),
\end{aligned}$$

where the constants are functionals of the kernel, γ, and the densities. Equation 20 can be proved in a similar manner.

Applying Lemmas 1 and 2 to Eq. 15 gives

$$E\left[g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right) - g\left(f_1(Z), f_2(Z)\right)\right] = \sum_{j=1}^{r}\left(c_{4,1,j}\, h_1^j + c_{4,2,j}\, h_2^j\right) + \sum_{j=0}^{r-1}\sum_{i=0}^{r-1} c_{5,i,j}\, h_1^j h_2^i\, h' + o\left(h_1^r + h_2^r\right). \qquad (21)$$
For the first term in Eq. 14, the truncated Taylor series expansion of g(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)) around E_Z\tilde f_{1,h_1}(Z) and E_Z\tilde f_{2,h_2}(Z) gives

$$g\left(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)\right) = \sum_{i=0}^{\lambda}\sum_{j=0}^{\lambda}\left.\frac{\partial^{i+j} g(x,y)}{\partial x^i\,\partial y^j}\right|_{\substack{x=E_Z\tilde f_{1,h_1}(Z)\\ y=E_Z\tilde f_{2,h_2}(Z)}}\frac{\tilde e_{1,h_1}^i(Z)\,\tilde e_{2,h_2}^j(Z)}{i!\, j!} + o\left(\tilde e_{1,h_1}^\lambda(Z) + \tilde e_{2,h_2}^\lambda(Z)\right), \qquad (22)$$
where \tilde e_{i,h_i}(Z) := \tilde f_{i,h_i}(Z) − E_Z\tilde f_{i,h_i}(Z). To control the first term in Eq. 14, we require expressions for E_Z[\tilde e_{i,h_i}^j(Z)].

Lemma 3: Let Z be a realization of the density f_2 that is in the interior of the support and is independent of \tilde f_{i,h_i} for i = 1, 2. Let n(q) be the set of integer divisors of q including 1 but excluding q. Then,

$$E_Z\left[\tilde e_{i,h_i}^q(Z)\right] = \begin{cases}\displaystyle\sum_{j\in n(q)}\frac{1}{\left(N_i h_i^d\right)^{q-j}}\sum_{m=0}^{\lfloor s/2\rfloor} c_{6,i,q,j,m}(Z)\, h_i^{2m} + O\!\left(\frac{1}{N_i}\right), & q\ge 2,\\ 0, & q = 1,\end{cases} \qquad (23)$$

$$E_Z\left[\tilde e_{1,h_1}^q(Z)\,\tilde e_{2,h_2}^l(Z)\right] = \begin{cases}\left(\displaystyle\sum_{i\in n(q)}\frac{1}{\left(N_1 h_1^d\right)^{q-i}}\sum_{m=0}^{\lfloor s/2\rfloor} c_{6,1,q,i,m}(Z)\, h_1^{2m}\right)\!\times\!\left(\displaystyle\sum_{j\in n(l)}\frac{1}{\left(N_2 h_2^d\right)^{l-j}}\sum_{t=0}^{\lfloor s/2\rfloor} c_{6,2,l,j,t}(Z)\, h_2^{2t}\right) + O\!\left(\frac{1}{N_1}+\frac{1}{N_2}\right), & q, l\ge 2,\\ 0, & q = 1 \text{ or } l = 1,\end{cases} \qquad (24)$$

where c_{6,i,q,j,m} is a functional of γ, f_1, and f_2.

Proof: Define the random variable V_i(Z) = K((X_i − Z)/h_2) − E_Z K((X_i − Z)/h_2). This gives

$$\tilde e_{2,h_2}(Z) = \tilde f_{2,h_2}(Z) - E_Z\tilde f_{2,h_2}(Z) = \frac{1}{N_2 h_2^d}\sum_{i=1}^{N_2} V_i(Z).$$

Clearly, E_Z V_i(Z) = 0. From Eq. 17, we have for integer j ≥ 1

$$E_Z\left[K^j\left(\frac{X_i - Z}{h_2}\right)\right] = h_2^d\int K^j(t)\, f_2(t h_2 + Z)\, dt = h_2^d\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,j,m}(Z)\, h_2^{2m},$$

where the constants c_{3,2,j,m} depend on the density f_2, its derivatives, and the moments of the kernel K^j. Note that since K is symmetric, the odd moments of K^j are zero for Z in the interior of the support. However, all even moments may now be nonzero since K^j may now be nonnegative. By the binomial theorem,

$$\begin{aligned}
E_Z\left[V_i^j(Z)\right] &= \sum_{k=0}^{j}\binom{j}{k} E_Z\left[K^k\left(\frac{X_i - Z}{h_2}\right)\right]\left(-E_Z\left[K\left(\frac{X_i - Z}{h_2}\right)\right]\right)^{j-k} \\
&= \sum_{k=0}^{j}\binom{j}{k}\left(h_2^d\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,k,m}(Z)\, h_2^{2m}\right) O\left(h_2^{d(j-k)}\right) \\
&= h_2^d\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,j,m}(Z)\, h_2^{2m} + O\left(h_2^{2d}\right).
\end{aligned}$$
We can use these expressions to simplify E_Z[\tilde e_{2,h_2}^q(Z)]. As an example, let q = 2. Then since the X_i's are independent,

$$E_Z\left[\tilde e_{2,h_2}^2(Z)\right] = \frac{1}{N_2 h_2^{2d}} E_Z\left[V_i^2(Z)\right] = \frac{1}{N_2 h_2^d}\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,2,m}(Z)\, h_2^{2m} + O\left(\frac{1}{N_2}\right).$$

Similarly, we find that

$$E_Z\left[\tilde e_{2,h_2}^3(Z)\right] = \frac{1}{N_2^2 h_2^{3d}} E_Z\left[V_i^3(Z)\right] = \frac{1}{\left(N_2 h_2^d\right)^2}\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,3,m}(Z)\, h_2^{2m} + o\left(\frac{1}{N_2}\right).$$

For q = 4, we have

$$E_Z\left[\tilde e_{2,h_2}^4(Z)\right] = \frac{1}{N_2^3 h_2^{4d}} E_Z\left[V_i^4(Z)\right] + \frac{3(N_2 - 1)}{N_2^3 h_2^{4d}}\left(E_Z\left[V_i^2(Z)\right]\right)^2 = \frac{1}{\left(N_2 h_2^d\right)^3}\sum_{m=0}^{\lfloor s/2\rfloor} c_{3,2,4,m}(Z)\, h_2^{2m} + \frac{1}{\left(N_2 h_2^d\right)^2}\sum_{m=0}^{\lfloor s/2\rfloor} c_{6,2,2,m}(Z)\, h_2^{2m} + O\left(\frac{1}{N_2}\right).$$

The pattern is then for q ≥ 2,

$$E_Z\left[\tilde e_{2,h_2}^q(Z)\right] = \sum_{i\in n(q)}\frac{1}{\left(N_2 h_2^d\right)^{q-i}}\sum_{m=0}^{\lfloor s/2\rfloor} c_{6,2,q,i,m}(Z)\, h_2^{2m} + O\left(\frac{1}{N_2}\right).$$

For any integer q, the largest possible factor in n(q) is q/2. Thus for given q, the smallest possible exponent on the N_2 h_2^d term is q/2. This increases as q increases. A similar expression holds for E_Z[\tilde e_{1,h_1}^q(Z)], except the X_i's are replaced with Y_i, f_2 is replaced with f_1, and N_2 and h_2 are replaced with N_1 and h_1, respectively, all resulting in different constants. Then since \tilde e_{1,h_1}(Z) and \tilde e_{2,h_2}(Z) are conditionally independent given Z,

$$E_Z\left[\tilde e_{1,h_1}^q(Z)\,\tilde e_{2,h_2}^l(Z)\right] = \left(\sum_{i\in n(q)}\frac{1}{\left(N_1 h_1^d\right)^{q-i}}\sum_{m=0}^{\lfloor s/2\rfloor} c_{6,1,q,i,m}(Z)\, h_1^{2m}\right)\left(\sum_{j\in n(l)}\frac{1}{\left(N_2 h_2^d\right)^{l-j}}\sum_{t=0}^{\lfloor s/2\rfloor} c_{6,2,l,j,t}(Z)\, h_2^{2t}\right) + O\left(\frac{1}{N_1}+\frac{1}{N_2}\right).$$
Applying Lemma 3 to Eq. 22 when taking the conditional expectation given Z in the interior gives an expression of the form

$$\begin{aligned}
&\sum_{j=1}^{\lambda/2}\sum_{m=0}^{\lfloor s/2\rfloor}\left(c_{7,1,j,m}\!\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right)\frac{h_1^{2m}}{\left(N_1 h_1^d\right)^j} + c_{7,2,j,m}\!\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right)\frac{h_2^{2m}}{\left(N_2 h_2^d\right)^j}\right) \\
&\quad + \sum_{j=1}^{\lambda/2}\sum_{m=0}^{\lfloor s/2\rfloor}\sum_{i=1}^{\lambda/2}\sum_{n=0}^{\lfloor s/2\rfloor} c_{7,j,i,m,n}\!\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right)\frac{h_1^{2m} h_2^{2n}}{\left(N_1 h_1^d\right)^j\left(N_2 h_2^d\right)^i} + O\left(\frac{1}{\left(N_1 h_1^d\right)^{\lambda/2}} + \frac{1}{\left(N_2 h_2^d\right)^{\lambda/2}}\right). \qquad (25)
\end{aligned}$$

Note that the functionals c_{7,i,j,m} and c_{7,j,i,m,n} depend on the derivatives of g and on E_Z\tilde f_{i,h_i}(Z), which depends on h_i. To apply ensemble estimation, we need to separate the dependence on h_i from the constants. If we use ODin1, then it is sufficient to note that in the interior of the support, E_Z\tilde f_{i,h_i}(Z) = f_i(Z) + o(1) and therefore c(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)) = c(f_1(Z), f_2(Z)) + o(1) for some functional c. The terms in Eq. 25 then reduce to

$$c_{7,1,1,0}\left(f_1(Z), f_2(Z)\right)\frac{1}{N_1 h_1^d} + c_{7,2,1,0}\left(f_1(Z), f_2(Z)\right)\frac{1}{N_2 h_2^d} + o\left(\frac{1}{N_1 h_1^d} + \frac{1}{N_2 h_2^d}\right).$$

For ODin2, we need the higher order terms. To separate the dependence on h_i from the constants, we need more information about the functional g and its derivatives. Consider the special case where the functional g(x, y) has derivatives of the form x^α y^β with α, β < 0. This corresponds to the important cases of the KL divergence and the Rényi divergence. The generalized binomial theorem states that if $\binom{\alpha}{m} := \frac{\alpha(\alpha-1)\cdots(\alpha-m+1)}{m!}$ and if q and t are real numbers with |q| > |t|, then for any complex number α,

$$(q + t)^\alpha = \sum_{m=0}^{\infty}\binom{\alpha}{m} q^{\alpha-m} t^m. \qquad (26)$$

Since the densities are bounded away from zero, for sufficiently small h_i we have that f_i(Z) > Σ_{j=ν/2}^{⌊s/2⌋} c_{i,j}(Z) h_i^{2j} + O(h_i^s). Applying the generalized binomial theorem and Lemma 1 gives

$$\left(E_Z\tilde f_{i,h_i}(Z)\right)^\alpha = \sum_{m=0}^{\infty}\binom{\alpha}{m} f_i(Z)^{\alpha-m}\left(\sum_{j=\nu/2}^{\lfloor s/2\rfloor} c_{i,j}(Z)\, h_i^{2j} + O(h_i^s)\right)^m.$$
Since m is an integer, the exponents of the h_i terms are also integers. Thus Eq. 25 gives in this case

$$\begin{aligned}
E_Z\left[g\left(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)\right)\right] - g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right) &= \sum_{j=1}^{\lambda/2}\sum_{m=0}^{\lfloor s/2\rfloor}\left(c_{8,1,j,m}(Z)\frac{h_1^{2m}}{\left(N_1 h_1^d\right)^j} + c_{8,2,j,m}(Z)\frac{h_2^{2m}}{\left(N_2 h_2^d\right)^j}\right) \\
&\quad + \sum_{j=1}^{\lambda/2}\sum_{m=0}^{\lfloor s/2\rfloor}\sum_{i=1}^{\lambda/2}\sum_{n=0}^{\lfloor s/2\rfloor} c_{8,j,i,m,n}(Z)\frac{h_1^{2m} h_2^{2n}}{\left(N_1 h_1^d\right)^j\left(N_2 h_2^d\right)^i} \\
&\quad + O\left(\frac{1}{\left(N_1 h_1^d\right)^{\lambda/2}} + \frac{1}{\left(N_2 h_2^d\right)^{\lambda/2}} + h_1^s + h_2^s\right). \qquad (27)
\end{aligned}$$
As before, the case for Z close to the boundary of the support is more complicated. However, by using a similar technique to the proof of Lemma 2 for Z at the boundary and combining with the previous results, we find that for general g,

$$E\left[g\left(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)\right) - g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right)\right] = c_{9,1}\frac{1}{N_1 h_1^d} + c_{9,2}\frac{1}{N_2 h_2^d} + o\left(\frac{1}{N_1 h_1^d} + \frac{1}{N_2 h_2^d}\right). \qquad (28)$$
If g(x, y) has derivatives of the form x^α y^β with α, β < 0, then we can similarly obtain

$$\begin{aligned}
E\left[g\left(\tilde f_{1,h_1}(Z), \tilde f_{2,h_2}(Z)\right) - g\left(E_Z\tilde f_{1,h_1}(Z), E_Z\tilde f_{2,h_2}(Z)\right)\right] &= \sum_{j=1}^{\lambda/2}\sum_{m=0}^{r}\left(c_{9,1,j,m}\frac{h_1^m}{\left(N_1 h_1^d\right)^j} + c_{9,2,j,m}\frac{h_2^m}{\left(N_2 h_2^d\right)^j}\right) \\
&\quad + \sum_{j=1}^{\lambda/2}\sum_{m=0}^{r}\sum_{i=1}^{\lambda/2}\sum_{n=0}^{r} c_{9,j,i,m,n}\frac{h_1^m h_2^n}{\left(N_1 h_1^d\right)^j\left(N_2 h_2^d\right)^i} \\
&\quad + O\left(\frac{1}{\left(N_1 h_1^d\right)^{\lambda/2}} + \frac{1}{\left(N_2 h_2^d\right)^{\lambda/2}} + h_1^s + h_2^s\right). \qquad (29)
\end{aligned}$$

Combining Eq. 21 with either Eq. 28 or 29 completes the proof.
APPENDIX D
PROOF OF THEOREM 2

To bound the variance, we will use the Efron-Stein inequality:

Lemma 4 (Efron-Stein Inequality): Let X_1, ..., X_n, X_1', ..., X_n' be independent random variables on the space S. Then if f : S × ... × S → R, we have that

$$V\left[f(X_1, \dots, X_n)\right] \le \frac{1}{2}\sum_{i=1}^{n} E\left[\left(f(X_1, \dots, X_n) - f(X_1, \dots, X_i', \dots, X_n)\right)^2\right].$$

Suppose we have samples {X_1, ..., X_{N_2}, Y_1, ..., Y_{N_1}} and {X_1', X_2, ..., X_{N_2}, Y_1, ..., Y_{N_1}} and denote the respective estimators as \tilde G_{h_1,h_2} and \tilde G'_{h_1,h_2}. We have that
$$\begin{aligned}
\left|\tilde G_{h_1,h_2} - \tilde G'_{h_1,h_2}\right| &\le \frac{1}{N_2}\left|g\left(\tilde f_{1,h_1}(X_1'), \tilde f_{2,h_2}(X_1')\right) - g\left(\tilde f_{1,h_1}(X_1), \tilde f_{2,h_2}(X_1)\right)\right| \\
&\quad + \frac{1}{N_2}\sum_{j=2}^{N_2}\left|g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right|. \qquad (30)
\end{aligned}$$

Since g is Lipschitz continuous with constant C_g, we have

$$\left|g\left(\tilde f_{1,h_1}(X_1'), \tilde f_{2,h_2}(X_1')\right) - g\left(\tilde f_{1,h_1}(X_1), \tilde f_{2,h_2}(X_1)\right)\right| \le C_g\left(\left|\tilde f_{1,h_1}(X_1') - \tilde f_{1,h_1}(X_1)\right| + \left|\tilde f_{2,h_2}(X_1') - \tilde f_{2,h_2}(X_1)\right|\right). \qquad (31)$$

Furthermore,

$$\begin{aligned}
\tilde f_{1,h_1}(X_1') - \tilde f_{1,h_1}(X_1) &= \frac{1}{N_1 h_1^d}\sum_{i=1}^{N_1}\left(K\left(\frac{X_1' - Y_i}{h_1}\right) - K\left(\frac{X_1 - Y_i}{h_1}\right)\right) \\
\implies E\left[\left(\tilde f_{1,h_1}(X_1') - \tilde f_{1,h_1}(X_1)\right)^2\right] &\le \frac{1}{N_1 h_1^{2d}}\sum_{i=1}^{N_1} E\left[\left(K\left(\frac{X_1' - Y_i}{h_1}\right) - K\left(\frac{X_1 - Y_i}{h_1}\right)\right)^2\right], \qquad (32)
\end{aligned}$$

where the last step follows from Jensen's inequality. By making the substitutions u_i = (X_1 − Y_i)/h_1 and u_i' = (X_1' − Y_i)/h_1, this gives

$$\frac{1}{h_1^{2d}} E\left[\left(K\left(\frac{X_1' - Y_i}{h_1}\right) - K\left(\frac{X_1 - Y_i}{h_1}\right)\right)^2\right] = \frac{1}{h_1^{2d}}\int\left(K\left(\frac{X_1' - Y_i}{h_1}\right) - K\left(\frac{X_1 - Y_i}{h_1}\right)\right)^2 f_2(X_1)\, f_2(X_1')\, f_1(Y_i)\, dX_1\, dX_1'\, dY_i \le 2\|K\|_\infty^2.$$

Combining this with Eq. 32 gives

$$E\left[\left(\tilde f_{1,h_1}(X_1') - \tilde f_{1,h_1}(X_1)\right)^2\right] \le 2\|K\|_\infty^2.$$

Similarly,

$$E\left[\left(\tilde f_{2,h_2}(X_1') - \tilde f_{2,h_2}(X_1)\right)^2\right] \le 2\|K\|_\infty^2.$$

Combining these results with Eq. 31 gives

$$E\left[\left(g\left(\tilde f_{1,h_1}(X_1'), \tilde f_{2,h_2}(X_1')\right) - g\left(\tilde f_{1,h_1}(X_1), \tilde f_{2,h_2}(X_1)\right)\right)^2\right] \le 8 C_g^2\|K\|_\infty^2. \qquad (33)$$

The second term in Eq. 30 is controlled in a similar way. From the Lipschitz condition,

$$\left(g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right)^2 \le C_g^2\left(\tilde f'_{2,h_2}(X_j) - \tilde f_{2,h_2}(X_j)\right)^2 = \frac{C_g^2}{M_2^2 h_2^{2d}}\left(K\left(\frac{X_j - X_1'}{h_2}\right) - K\left(\frac{X_j - X_1}{h_2}\right)\right)^2,$$

where M_2 = N_2 − 1. The h_2^{2d} terms are eliminated by making the substitutions u_j = (X_j − X_1)/h_2 and u_j' = (X_j − X_1')/h_2 within the expectation to obtain

$$E\left[\left(g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right)^2\right] \le \frac{2 C_g^2\|K\|_\infty^2}{M_2^2} \qquad (34)$$

$$\begin{aligned}
\implies E\left[\left(\sum_{j=2}^{N_2}\left|g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right|\right)^2\right] &= \sum_{j=2}^{N_2}\sum_{i=2}^{N_2} E\left[\left|g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right|\,\left|g\left(\tilde f_{1,h_1}(X_i), \tilde f'_{2,h_2}(X_i)\right) - g\left(\tilde f_{1,h_1}(X_i), \tilde f_{2,h_2}(X_i)\right)\right|\right] \\
&\le M_2^2\, E\left[\left(g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right)^2\right] \\
&\le 2 C_g^2\|K\|_\infty^2, \qquad (35)
\end{aligned}$$

where we use the Cauchy-Schwarz inequality to bound the expectation within each summand. Finally, applying Jensen's inequality and Eqs. 33 and 35 gives

$$\begin{aligned}
E\left[\left(\tilde G_{h_1,h_2} - \tilde G'_{h_1,h_2}\right)^2\right] &\le \frac{2}{N_2^2} E\left[\left(g\left(\tilde f_{1,h_1}(X_1'), \tilde f_{2,h_2}(X_1')\right) - g\left(\tilde f_{1,h_1}(X_1), \tilde f_{2,h_2}(X_1)\right)\right)^2\right] \\
&\quad + \frac{2}{N_2^2} E\left[\left(\sum_{j=2}^{N_2}\left|g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right|\right)^2\right] \\
&\le \frac{20 C_g^2\|K\|_\infty^2}{N_2^2}.
\end{aligned}$$
Now suppose we have samples {X_1, ..., X_{N_2}, Y_1, ..., Y_{N_1}} and {X_1, ..., X_{N_2}, Y_1', Y_2, ..., Y_{N_1}} and denote the respective estimators as \tilde G_{h_1,h_2} and \tilde G'_{h_1,h_2}. Then

$$\left|g\left(\tilde f'_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right| \le C_g\left|\tilde f'_{1,h_1}(X_j) - \tilde f_{1,h_1}(X_j)\right| = \frac{C_g}{N_1 h_1^d}\left|K\left(\frac{X_j - Y_1'}{h_1}\right) - K\left(\frac{X_j - Y_1}{h_1}\right)\right|$$

$$\implies E\left[\left(g\left(\tilde f'_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right)^2\right] \le \frac{2 C_g^2\|K\|_\infty^2}{N_1^2}.$$

Thus using a similar argument as was used to obtain Eq. 35,

$$E\left[\left(\tilde G_{h_1,h_2} - \tilde G'_{h_1,h_2}\right)^2\right] \le \frac{1}{N_2^2} E\left[\left(\sum_{j=1}^{N_2}\left|g\left(\tilde f'_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right|\right)^2\right] \le \frac{2 C_g^2\|K\|_\infty^2}{N_2^2}.$$

Applying the Efron-Stein inequality gives

$$V\left[\tilde G_{h_1,h_2}\right] \le \frac{10 C_g^2\|K\|_\infty^2}{N_2} + \frac{C_g^2\|K\|_\infty^2\, N_1}{N_2^2}.$$

APPENDIX E
PROOF OF THEOREM 3
We are interested in the asymptotic distribution of

$$\sqrt{N_2}\left(\tilde G_{h_1,h_2} - E\left[\tilde G_{h_1,h_2}\right]\right) = \frac{1}{\sqrt{N_2}}\sum_{j=1}^{N_2}\left(g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right) - E_{X_j}\left[g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right]\right) + \frac{1}{\sqrt{N_2}}\sum_{j=1}^{N_2}\left(E_{X_j}\left[g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right] - E\left[g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right)\right]\right).$$

Note that by the standard central limit theorem, the second term converges in distribution to a Gaussian random variable. If the first term converges in probability to a constant (specifically, 0), then we can use Slutsky's theorem to find the asymptotic distribution. So now we focus on the first term, which we denote as V_{N_2}.

To prove convergence in probability, we will use Chebyshev's inequality. Note that E[V_{N_2}] = 0. To bound the variance of V_{N_2}, we again use the Efron-Stein inequality. Let X_1' be drawn from f_2 and denote V_{N_2} and V'_{N_2} as the sequences using X_1 and X_1', respectively. Then

$$\begin{aligned}
V_{N_2} - V'_{N_2} &= \frac{1}{\sqrt{N_2}}\left(g\left(\tilde f_{1,h_1}(X_1), \tilde f_{2,h_2}(X_1)\right) - E_{X_1}\left[g\left(\tilde f_{1,h_1}(X_1), \tilde f_{2,h_2}(X_1)\right)\right]\right) \\
&\quad - \frac{1}{\sqrt{N_2}}\left(g\left(\tilde f_{1,h_1}(X_1'), \tilde f_{2,h_2}(X_1')\right) - E_{X_1'}\left[g\left(\tilde f_{1,h_1}(X_1'), \tilde f_{2,h_2}(X_1')\right)\right]\right) \\
&\quad + \frac{1}{\sqrt{N_2}}\sum_{j=2}^{N_2}\left(g\left(\tilde f_{1,h_1}(X_j), \tilde f_{2,h_2}(X_j)\right) - g\left(\tilde f_{1,h_1}(X_j), \tilde f'_{2,h_2}(X_j)\right)\right). \qquad (36)
\end{aligned}$$
h h i2 h ii g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) − EX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) = E VX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) .
h i p If we condition on X1 , then by the standard central limit theorem Ni hdi ˜fi,hi (X1 ) − EX1 ˜fi,hi (X1 ) converges in 2 distribution to a zero mean Gaussian random variable with variance σi (X1 ) = O(1). This is true even if X1 is close to the boundary of the support of the densities. The KDEs ˜f1,h1 (X1 ) and ˜f2,h2 (X1 ) are conditionally independent given X1 as are their limiting distributions. Thus the KDEs converge jointly in distribution to a Gaussian random vector with zero mean, zero covariance, and their respective variances. By the delta method, we have that if g(x, y) is continuously differentiable with h i respect to both x and y at EX1 ˜fi,hi (X1 ) for i = 1, 2, respectively, then h i 1 1 ˜ ˜ = o(1), VX1 g f1,h1 (X1 ), f2,h2 (X1 ) = O + N1 hd1 N2 hd2 h h ii provided that Ni hdi → ∞. Thus E VX1 g ˜f1,h1 (X1 ), ˜f2,h2 (X1 ) = o(1). A similar result holds when we replace X1 with ′
X1 . For the third term in Eq. 36, 2 N 2 ′ X ˜ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) j=2
=
N2 N2 X X j=2 i=2
h i ′ ′ E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) − g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) .
There are M2 terms where i = j and we have from Appendix D (see Eq. 34) that 2 2C 2 ||K||2 ′ ˜ g ∞ ˜ ˜ ˜ . E g f1,h1 (Xj ), f2,h2 (Xj ) − g f1,h1 (Xj ), f2,h2 (Xj ) ≤ M22
Thus these terms are O X −X
1 M2
. There are M22 − M2 terms when i 6= j. In this case, we can do four substitutions of the form
uj = jh2 1 to obtain h i 4C 2 ||K||2 h2d ′ ′ g ∞ 2 E g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) − g ˜f1,h1 (Xi ), ˜f2,h2 (Xi ) ≤ . M22 Then since hd2 = o(1), we get 2 N 2 ′ X ˜ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) = o(1),
(37)
j=2
=⇒ E
′
VN2 − VN2
2
h i2 3 ˜ ˜ ˜ ˜ E g f1,h1 (X1 ), f2,h2 (X1 ) − EX1 g f1,h1 (X1 ), f2,h2 (X1 ) ≤ N2 h i2 ′ ′ ′ ′ 3 ˜ ˜ ˜ ˜ ′ + E g f1,h1 (X1 ), f2,h2 (X1 ) − EX g f1,h1 (X1 ), f2,h2 (X1 ) 1 N2 2 N2 X ′ ′ 3 g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) + E N2 j=2
= o
1 N2
.
o n ′ Now consider samples {X1 , . . . , XN2 , Y1 , . . . , YN1 } and X1 , . . . , XN2 , Y1 , . . . , YN1 and the respective sequences VN2 ′
and VN2 . Then
′
VN2 − VN2
=
N2 ′ 1 X √ g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) . N2 j=1
Using a similar argument as that used to obtain Eq. 37, we have that if hd1 = o(1) and N1 → ∞, then 2 N2 ′ X ˜ E g f1,h1 (Xj ), ˜f2,h2 (Xj ) − g ˜f1,h1 (Xj ), ˜f2,h2 (Xj ) = o(1) j=2
Applying the Efron-Stein inequality gives
2 ′ 1 =o =⇒ E VN2 − VN2 . N2 V [VN2 ] = o
Thus by Chebyshev’s inequality,
N2 + N1 N2
= o(1).
V [VN2 ] = o(1), ǫ2 h i √ ˜ h1 ,h2 − E G ˜ h1 ,h2 and therefore VN2 converges to zero in probability. By Slutsky’s theorem, N2 G converges in distribution to a zero mean Gaussian random variable with variance h ii h , V EX g ˜f1,h (X), ˜f2,h (X) Pr (|VN2 | > ǫ) ≤
1
2
where X is drawn from f2 . i h √ ˜w = ˜w −E G ˜ w where G For the weighted ensemble estimator, we wish to know the asymptotic distribution of N2 G P ˜ l∈¯ l w(l)Gh(l) . We have that i h p ˜w −E G ˜w N2 G
=
N2 X h i 1 X √ w(l) g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) − EXj g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) N2 j=1 ¯ l∈l N 2 X X X 1 EXj w(l)g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) − E w(l)g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) . +√ N2 j=1 ¯ ¯ l∈l
l∈l
The second term again converges in distribution to a Gaussian random variable by the central limit theorem. The mean and variance are, respectively, zero and h i X w(l)EX g ˜f1,h(l) (X), ˜f2,h(l) (X) . V l∈¯ l
The first term is equal to N2 h i X X 1 w(l) √ g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) − EXj g ˜f1,h(l) (Xj ), ˜f2,h(l) (Xj ) N 2 ¯ j=1
=
X
w(l)oP (1)
l∈¯ l
l∈l
=
oP (1),
where oP (1) denotes convergence to zero in probability. In the last step, we used the fact that if two random variables converge in probability to constants, then their linear combination converges in probability to the linear combination of the constants. Combining this result with Slutsky’s theorem completes the proof. R EFERENCES [1] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012. [2] H. Avi-Itzhak and T. Diep, “Arbitrarily tight upper and lower bounds on the Bayesian probability of error,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 1, pp. 89–91, 1996. [3] W. A. Hashlamoun, P. K. Varshney, and V. Samarasooriya, “A tight upper bound on the Bayesian probability of error,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 2, pp. 220–224, 1994. [4] K. Moon, V. Delouille, and A. O. Hero III, “Meta learning of bounds on the Bayes classifier error,” in IEEE Signal Processing and SP Education Workshop. IEEE, 2015. [5] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations,” The Annals of Mathematical Statistics, pp. 493–507, 1952. [6] V. Berisha, A. Wisler, A. O. Hero III, and A. Spanias, “Empirically estimable classification bounds based on a new divergence measure,” IEEE Transactions on Signal Processing, 2015. [7] K. R. Moon and A. O. Hero III, “Multivariate f-divergence estimation with confidence,” in Advances in Neural Information Processing Systems, 2014, pp. 2420–2428. [8] K. R. Moon, J. J. Li, V. Delouille, R. De Visscher, F. Watson, and A. O. Hero III, “Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis,” Journal of Space Weather and Space Climate, vol. 6, no. A2, 2016. [9] I. S. Dhillon, S. Mallela, and R. Kumar, “A divisive information theoretic feature clustering algorithm for text classification,” The Journal of Machine Learning Research, vol. 3, pp. 1265–1287, 2003. [10] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clustering with Bregman divergences,” The Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005. [11] J. Lewi, R. Butera, and L. Paninski, “Real-time adaptive information-theoretic optimization of neurophysiology experiments,” in Advances in Neural Information Processing Systems, 2006, pp. 857–864. [12] L. Bruzzone, F. Roli, and S. B. Serpico, “An extension of the Jeffreys-Matusita distance to multiclass cases for feature selection,” Geoscience and Remote Sensing, IEEE Transactions on, vol. 33, no. 6, pp. 1318–1321, 1995. [13] X. Guorong, C. Peiqi, and W. Minhui, “Bhattacharyya distance feature selection,” in Pattern Recognition, 1996., Proceedings of the 13th International Conference on, vol. 2. IEEE, 1996, pp. 195–199. [14] D. M. Sakate and D. N. Kashid, “Variable selection via penalized minimum ϕ-divergence estimation in logistic regression,” Journal of Applied Statistics, vol. 41, no. 6, pp. 1233–1246, 2014. [15] K. E. Hild, D. Erdogmus, and J. C. Principe, “Blind source separation using Renyi’s mutual information,” Signal Processing Letters, IEEE, vol. 8, no. 6, pp. 174–176, 2001. [16] M. Mihoko and S. Eguchi, “Robust blind source separation by beta divergence,” Neural computation, vol. 14, no. 8, pp. 1859–1886, 2002. [17] B. C. Vemuri, M. Liu, S. Amari, and F. Nielsen, “Total Bregman divergence and its applications to DTI analysis,” Medical Imaging, IEEE Transactions on, vol. 30, no. 2, pp. 475–483, 2011. [18] A. B. Hamza and H. 
Krim, “Image registration and segmentation by maximizing the Jensen-Rényi divergence,” in Energy Minimization Methods in Computer Vision and Pattern Recognition. Springer, 2003, pp. 147–163. [19] G. Liu, G. Xia, W. Yang, and N. Xue, “SAR image segmentation via non-local active contours,” in Geoscience and Remote Sensing Symposium (IGARSS), 2014 IEEE International. IEEE, 2014, pp. 3730–3733. [20] B. Póczos and J. G. Schneider, “On the estimation of alpha-divergences,” in International Conference on Artificial Intelligence and Statistics, 2011, pp. 609–617. [21] J. Oliva, B. Póczos, and J. Schneider, “Distribution to distribution regression,” in Proceedings of The 30th International Conference on Machine Learning, 2013, pp. 1049–1057. [22] Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur, “Two-stage sampled learning theory on distributions,” To appear in AISTATS, 2015. [23] K. R. Moon, V. Delouille, J. J. Li, R. De Visscher, F. Watson, and A. O. Hero III, “Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization,” Journal of Space Weather and Space Climate, vol. 6, no. A3, 2016. [24] V. Korzhik and I. Fedyanin, “Steganographic applications of the nearest-neighbor approach to Kullback-Leibler divergence estimation,” in Digital Information, Networking, and Wireless Communications (DINWC), 2015 Third International Conference on. IEEE, 2015, pp. 133–138. [25] B. Chai, D. Walther, D. Beck, and L. Fei-Fei, “Exploring functional connectivities of the human brain using multivariate information analysis,” in Advances in neural information processing systems, 2009, pp. 270–278. [26] K. M. Carter, R. Raich, and A. O. Hero III, “On local intrinsic dimension estimation and its applications,” Signal Processing, IEEE Transactions on, vol. 58, no. 2, pp. 650–663, 2010. [27] K. R. Moon, J. J. Li, V. Delouille, F. Watson, and A. O. Hero III, “Image patch analysis and clustering of sunspots: A dimensionality reduction approach,” in IEEE International Conference on Image Processing. IEEE, 2014.
[28] A. O. Hero III, B. Ma, O. Michel, and J. Gorman, “Applications of entropic spanning graphs,” Signal Processing Magazine, IEEE, vol. 19, no. 5, pp. 85–95, 2002. [29] M. Basseville, “Divergence measures for statistical data processing–An annotated bibliography,” Signal Processing, vol. 93, no. 4, pp. 621–633, 2013. [30] I. Csiszar, “Information-type measures of difference of probability distributions and indirect observations,” Studia Sci. MAth. Hungar., vol. 2, pp. 299–318, 1967. [31] S. M. Ali and S. D. Silvey, “A general class of coefficients of divergence of one distribution from another,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966. [32] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951. [33] A. Rényi, “On measures of entropy and information,” in Fourth Berkeley Sympos. on Mathematical Statistics and Probability, 1961, pp. 547–561. [34] E. Hellinger, “Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen.” Journal für die reine und angewandte Mathematik, vol. 136, pp. 210–271, 1909. [35] A. Bhattacharyya, “On a measure of divergence between two multinomial populations,” Sankhy¯a: The Indian Journal of Statistics, pp. 401–406, 1946. [36] K. R. Moon and A. O. Hero III, “Ensemble estimation of multivariate f-divergence,” in Information Theory (ISIT), 2014 IEEE International Symposium on. IEEE, 2014, pp. 356–360. [37] K. Sricharan, D. Wei, and A. O. Hero, “Ensemble estimators for multivariate entropy estimation,” Information Theory, IEEE Transactions on, vol. 59, no. 7, pp. 4374–4388, 2013. [38] K. Sricharan, R. Raich, and A. O. Hero, “Estimation of nonlinear functionals of densities with confidence,” IEEE Trans. Information Theory, vol. 58, no. 7, pp. 4135–4159, 2012. [39] A. Krishnamurthy, K. Kandasamy, B. Poczos, and L. Wasserman, “Nonparametric estimation of renyi divergence and friends,” in Proceedings of The 31st International Conference on Machine Learning, 2014, pp. 919–927. [40] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, and J. Robins, “Nonparametric von mises estimators for entropies, divergences and mutual informations,” in Advances in Neural Information Processing Systems, 2015, pp. 397–405. [41] X. Nguyen, M. J. Wainwright, and M. I. Jordan, “Estimating divergence functionals and the likelihood ratio by convex risk minimization,” Information Theory, IEEE Transactions on, vol. 56, no. 11, pp. 5847–5861, 2010. [42] S. Singh and B. Póczos, “Exponential concentration of a density functional estimator,” in Advances in Neural Information Processing Systems, 2014, pp. 3032–3040. [43] ——, “Generalized exponential concentration inequality for rényi divergence estimation,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 333–341. [44] B. E. Hansen, “Lecture notes on nonparametrics,” 2009.