Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Han Liu∗
Advisors: John Lafferty and Larry Wasserman
Carnegie Mellon University
June 15, 2006

ABSTRACT We consider the problem of estimating the joint density of a d-dimensional random vector X = (X1, X2, ..., Xd) when d is large (d ≫ 3). Current nonparametric density estimation methods, such as kernel density estimation or local likelihood methods, fail in this case because of the exponentially increasing amount of data required and the intractable computational cost of bandwidth selection. In this paper, with a simple modification of a previously developed nonparametric regression framework named rodeo (regularization of derivative expectation operator), we propose a computationally attractive alternative for high-dimensional density estimation. We show empirically that the density rodeo works well even for very high-dimensional problems in terms of both accuracy and efficiency. When the unknown density function satisfies a suitably defined sparsity condition, our approach avoids the curse of dimensionality and achieves an optimal convergence rate of the risk. Theoretical guarantees are provided even when the dimension is allowed to increase with the sample size. Keywords: nonparametric density estimation, sparsity, adaptive bandwidth selection, high dimensionality



Technical report for the KDD project of the Machine Learning Department at Carnegie Mellon University


1 Introduction

Consider the problem of estimating the joint density of a continuous d-dimensional random vector

X = (X1, X2, ..., Xd) ∼ F    (1)

where F is the unknown distribution with density function f(x). The objective is to construct an estimate fˆ(x) that best approximates f(x) according to some criterion. If the parametric form of the distribution is known, parametric density estimation methods can be applied. In most real applications, however, it is unlikely that the underlying distribution can be characterized by just a few parameters. In these cases, nonparametric density estimation is preferred, since it makes fewer assumptions about the true density. The nonparametric density estimation problem has been the focus of a large body of research. From a frequentist perspective, the most popular technique is Parzen and Rosenblatt's kernel-based method [1, 2], which uses fixed-bandwidth local functions (e.g. Gaussians) to interpolate the multivariate density. Hjort et al. and Loader [3, 4, 5] independently developed the local likelihood method, which corrects the boundary bias of standard kernel density estimators. Adaptive-bandwidth kernel density estimators were introduced by Scott et al. and Staniswalis [6, 7, 8]. Bandwidth selection approaches for these density estimators include cross-validation and various heuristic techniques. These methods work very well for low-dimensional problems (d ≤ 3) but are not effective for high-dimensional problems. The major difficulties are the intractable computational cost of cross-validation when a bandwidth must be selected for each dimension, and the lack of theoretical guarantees for the other heuristics. Moreover, minimax theory shows that in a d-dimensional Sobolev space of order k, the best achievable convergence rate for the mean squared error is Ropt = O(n−2k/(2k+d)), which reflects the "curse of dimensionality" when d is large. Instead of using local kernels, Friedman et al. [9] developed an exploratory projection pursuit approach, which looks for "interesting" (e.g. non-Gaussian) low-dimensional projections of the data to reveal the distribution of the original data. Stone et al. [10] proposed the log-spline model to estimate log f(x), while Silverman et al. [11] developed a penalized likelihood method for density estimation. From a Bayesian perspective, Escobar et al. [12] proposed a Bayesian density estimation method for normal mixtures in the framework of mixtures of Dirichlet processes. When the base distribution of the Dirichlet process is conjugate to the data likelihood, MCMC sampling can be derived for posterior inference. However, for very high-dimensional problems the computation is slow, which represents another form of the curse of dimensionality in the Bayesian setting. Therefore, for a very high-dimensional density estimation problem, it is desirable to exploit low-dimensional structure or sparsity in order to combat the curse of dimensionality and to develop methods that are both computationally tractable and amenable to theoretical analysis. Recently, Lafferty and Wasserman developed a nonparametric regression framework called the rodeo [13] for the regression problem Yi = m(Xi) + εi, i = 1, ..., n, where Xi = (Xi1, ..., Xid) ∈ Rd is a d-dimensional covariate. Assuming that the true function depends on only r covariates with r ≪ d, the rodeo simultaneously performs bandwidth selection and (implicit) variable selection to achieve the better minimax convergence rate O(n−4/(4+r)), as if the r relevant variables had been isolated in advance. The purpose of this paper is to extend the idea of the

rodeo to the nonparametric density estimation setting. Toward this goal, we first need to define an appropriate "sparsity" condition for density estimation. Assume that the variables are numbered so that the "relevant" dimensions correspond to 1 ≤ j ≤ r and the "irrelevant" dimensions correspond to r + 1 ≤ j ≤ d. One might first want to write f(x) ∝ f(x1, ..., xr). However, this is not a proper distribution on the irrelevant dimensions. To make it well defined, we can assume, without loss of generality, that all dimensions have compact support [0, 1]^d, that is, f(x) ∝ f(x1, ..., xr) ∏_{j=1}^d I{0 ≤ xj ≤ 1}. In this case, we deem dimensions with a uniform distribution irrelevant (or "uninteresting"), while non-uniform dimensions are relevant. In fact, we can generalize this definition further. Our sparsity specification is characterized by

f(x) ∝ g(x1, ..., xr) h(x),    where hjj(x) = o(1) for j = 1, ..., d.    (2)

Thus, we assume that the density function f(·) can be factored into two parts: the relevant component g(·) and the irrelevant component h(·); here hjj(x) is the second partial derivative of h with respect to the j-th dimension. The constraint in expression (2) may look unnatural at first sight, but it simply imposes the condition that h(·) belongs to a family of very smooth functions (e.g. the uniform distribution). We adopt a standard setup in which the function f and the dimension d are allowed to vary with the sample size n. The motivation for this definition comes both from real-world scenarios and from the needs of theoretical analysis. Empirically, many problems have this sparsity property; later in the paper we show an image processing example. Theoretically, the definition is strong enough for us to prove our main results, showing that minimax rates in the effective dimension r are achievable. In fact, we can even generalize h(·) to other distributions (e.g. Gaussian) to build a more general framework; later in the paper we use the Gaussian as a special case to illustrate this possibility. In this paper, based on the sparsity assumption defined above, we adapt the regression rodeo framework to density estimation problems, referring to the resulting adaptive bandwidth density estimation method as the density rodeo. Like the rodeo for regression, the density rodeo is built on relatively simple and theoretically well understood nonparametric density estimation techniques, such as the kernel density estimator and the local likelihood estimator, leading to methods that are simple to implement and can be used to solve very high-dimensional problems. As we show below, for any ε > 0, the density rodeo achieves the near-optimal convergence rate in risk

Rh∗ = O(n−4/(4+r)+ε).    (3)

Thus, it avoids the curse of the apparent dimensionality d by zeroing in on the effective dimension r ≪ d through bandwidth selection. Theoretical guarantees are provided even when d is allowed to increase with n. This work also illustrates the generality of the rodeo framework, showing that it is adaptable to a wide range of nonparametric inference problems. Other recent work that achieves risk bounds for density estimation in terms of a smaller effective dimension is the minimum volume set learning method of Scott and Nowak [14]; based on a similar sparsity assumption, they provide theoretical guarantees for a technique that uses dyadic trees. Another related work, from a computational perspective, is the dual-tree algorithm of Gray and Moore [15], which can also handle large sample sizes and high-dimensional problems (d = 20–30). Comparing the performance of the density rodeo with their algorithm would be interesting future work.

This paper is organized as follows. Section 2 presents the basic idea of the density rodeo. In Section 3, we derive the density rodeo for both the kernel density estimator and the local likelihood estimator and present the resulting algorithms. Section 4 gives our main theoretical results on the asymptotic running time, the selected bandwidths, and the convergence rate of the risk. Section 5 describes some variations and extensions of the basic density rodeo. Section 6 uses both synthetic and real-world datasets to test our method. Finally, Section 7 summarizes the results and suggests possible extensions for future work.

2 Methodology

The idea of the density rodeo is exactly the same as for its regression counterpart [13]. Fix a point x and let fˆH(x) denote an estimator of f(x) based on the smoothing parameter matrix H = diag(h1, ..., hd); if c is a scalar, we write h = c to mean h = (c, ..., c). The key observation is that if a dimension j is irrelevant, then changing the bandwidth hj for that dimension should cause only a small change in the estimator fˆH(x). Based on this, a sequence of decreasing bandwidths is tested to see whether an infinitesimal change in the bandwidth from its current setting leads to a significant change in the estimate. More precisely, let M(h) = E(fˆH(x)) denote the mean of fˆH(x), so that f(x) = M(0), the value at zero bandwidth. As in [13], let P = {h(t) : 0 ≤ t ≤ 1} be a smooth path through the set of smoothing parameters with h(0) = 0 and h(1) = 1. Then

f(x) = M(1) − (M(1) − M(0)) = M(1) − ∫₀¹ (dM(h(s))/ds) ds = M(1) − ∫₀¹ ⟨D(h(s)), ḣ(s)⟩ ds    (4)

where D(h) = ∇M(h) = (∂M/∂h1, ..., ∂M/∂hd)ᵀ is the gradient of M(h) and ḣ(s) = dh(s)/ds is the derivative of h(s) along the path. A biased, low-variance estimator of M(1) is fˆ1(x), and an unbiased estimator of D(h) is Z(h) = (∂fˆH(x)/∂h1, ..., ∂fˆH(x)/∂hd)ᵀ. The naive estimator fˆH(x) = fˆ1(x) − ∫₀¹ ⟨Z(s), ḣ(s)⟩ ds is identically equal to fˆ0(x), which has poor risk because of the large variance of Z(h) for small bandwidths. However, the sparsity assumption on f suggests that there should be paths along which D(h) is also sparse. Along such a path, Z(h) can be replaced by an estimator D̂(h) that makes use of the sparsity assumption. The estimate of f(x) then becomes

f̃H(x) = fˆ1(x) − ∫₀¹ ⟨D̂(s), ḣ(s)⟩ ds.    (5)

What remains is to find a sparse path and to exploit this sparsity when estimating D along that path. As in the regression case, we expect that changing the bandwidth hj for an irrelevant dimension causes only a small change in the estimator fˆH(x), while changing the bandwidth for a relevant dimension causes a large change. Thus Zj = ∂fˆH(x)/∂hj should discriminate between relevant and irrelevant dimensions. We use the same greedy search strategy as in the regression case, with a discrete bandwidth set hj ∈ B = {h0, βh0, β²h0, ...} for some 0 < β < 1. A sequence of hypothesis tests is then conducted for each dimension to determine the first h such that |Zj(h)| < λj(h) for some threshold λj, where j denotes the current dimension. This corresponds to the hard-thresholding version of the regression rodeo; a soft-thresholding version is derived later.

3 Algorithms

In this section, we present the detailed density rodeo algorithms for both the kernel density estimator and the local likelihood density estimator, which are known to be simple and to have good properties. Let x be the d-dimensional target point at which we want to evaluate the density, and let fˆH(x) denote the kernel density estimator of f(x) with bandwidth matrix H. At a given point, the kernel density estimator smooths out the contribution of each observed data point over a local neighborhood of that point. Assume that K is a standard symmetric kernel, i.e. ∫K(u)du = 1 and ∫uK(u)du = 0d, and let KH(·) = (1/det(H)) K(H⁻¹·) denote the kernel with bandwidth matrix H = diag(h1, ..., hd). Then

fˆH(x) = (1/(n det(H))) Σ_{i=1}^n K(H⁻¹(x − Xi)).    (6)

For the density rodeo algorithm below, we assume that K is a product kernel and that H is diagonal with elements h = (h1, ..., hd); therefore

fˆH(x) = (1/(n det(H))) Σ_{i=1}^n K(H⁻¹(x − Xi)) = (1/n) Σ_{i=1}^n ∏_{j=1}^d (1/hj) K((xj − Xij)/hj).    (7)
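To make equation (7) concrete, the following Python sketch (our illustration, not part of the original report; the function name kde_product_gaussian and the example data are assumptions) evaluates fˆH(x) with a product Gaussian kernel and a diagonal bandwidth matrix.

```python
import numpy as np

def kde_product_gaussian(X, x, h):
    """Product-Gaussian kernel density estimate f_H(x), as in equation (7).

    X : (n, d) array of observations X_1, ..., X_n
    x : (d,) evaluation point
    h : (d,) vector of per-dimension bandwidths (diagonal H)
    """
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    u = (x - X) / h                                    # standardized distances, shape (n, d)
    # product of 1-d Gaussian kernels, each divided by its bandwidth h_j
    log_k = -0.5 * u ** 2 - 0.5 * np.log(2 * np.pi) - np.log(h)
    return np.exp(log_k.sum(axis=1)).mean()           # average over the n observations

# Example: 200 points from a 2-d standard normal, evaluated at the origin
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
print(kde_product_gaussian(X, x=np.zeros(2), h=np.array([0.3, 0.3])))
```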

The density estimation rodeo is based on the statistics

Zj = ∂fˆH(x)/∂hj    (8)
   = (1/n) Σ_{i=1}^n ∂/∂hj [ ∏_{k=1}^d (1/hk) K((xk − Xik)/hk) ]    (9)
   ≡ (1/n) Σ_{i=1}^n Zji.    (10)

For the variance term, since

sj² = Var(Zj) = Var( (1/n) Σ_{i=1}^n Zji ) = (1/n) Var(Zj1),    (11)

we use the sample variance of the Zji to estimate Var(Zj1). For a general kernel we have

Zj = ∂fˆ(x)/∂hj = −(1/n) Σ_{i=1}^n (1/hj) [ 1 + ((xj − Xij)/hj) (d log K(u)/du)|_{u=(xj−Xij)/hj} ] ∏_{k=1}^d (1/hk) K((xk − Xik)/hk).    (12)

In the case where K is the Gaussian kernel this becomes

Zj = ∂fˆH(x)/∂hj    (13)
   = (1/(n hj³)) Σ_{i=1}^n ((xj − Xij)² − hj²) ∏_{k=1}^d (1/hk) K((xk − Xik)/hk)    (14)
   ∝ (1/n) Σ_{i=1}^n ((xj − Xij)² − hj²) ∏_{k=1}^d K((xk − Xik)/hk)    (15)
   = (1/n) Σ_{i=1}^n ((xj − Xij)² − hj²) exp( − Σ_{k=1}^d (xk − Xik)²/(2 hk²) ).    (16)

Here the constant of proportionality (1/hj³) ∏_{k=1}^d (1/hk) can safely be ignored to avoid overflow in the computation as hk → 0 for large d. (This constant is ignored only for numerical purposes; it is not ignored in the theoretical analysis below.) The hard-thresholding version of the density estimation rodeo is described in Table 1. We make use of a sequence cn satisfying dcn = Ω(log n), where we write f(n) = Ω(g(n)) if lim inf_n |f(n)/g(n)| > 0.

TABLE 1: Density Estimation Rodeo: Hard thresholding version

1. Select a parameter 0 < β < 1 and an initial bandwidth h0 that decreases slowly to zero: h0 = c0/log log n for some constant c0. Let cn be a sequence satisfying cn = O(1).
2. Initialize the bandwidths and activate all dimensions:
   (a) hj = h0, j = 1, 2, ..., d.
   (b) A = {1, 2, ..., d}.
3. While A is nonempty, do for each j ∈ A:
   (a) Compute the estimated derivative and variance: Zj and sj.
   (b) Compute the threshold λj = sj √(2 log(n cn)).
   (c) If |Zj| > λj, set hj ← βhj; otherwise remove j from A.
4. Output the bandwidths h∗ = (h1, ..., hd) and the estimator f̃(x) = fˆ_{H∗}(x).
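A minimal Python sketch of Table 1 is given below (our illustration; the names z_stat and density_rodeo_hard, the iteration cap, and the example data are assumptions). It uses the unnormalized Zj of equation (16), consistent with dropping the constant of proportionality, and the threshold λj = sj √(2 log(n cn)).

```python
import numpy as np

def z_stat(X, x, h, j):
    """Unnormalized derivative statistic Z_j of equation (16) and its standard error s_j."""
    w = np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))    # Gaussian weights per observation
    zji = ((x[j] - X[:, j]) ** 2 - h[j] ** 2) * w             # per-observation terms Z_ji
    n = len(X)
    return zji.mean(), np.sqrt(zji.var(ddof=1) / n)           # Z_j and s_j, cf. equation (11)

def density_rodeo_hard(X, x, beta=0.9, c0=1.0, cn=1.0):
    """Hard-thresholding density rodeo (Table 1) at a single evaluation point x."""
    n, d = X.shape
    h = np.full(d, c0 / np.log(np.log(n)))                    # step 1: initial bandwidths h_0
    active = set(range(d))                                    # step 2: all dimensions active
    for _ in range(500):                                      # step 3 (sweep cap added for safety)
        if not active:
            break
        for j in list(active):
            Zj, sj = z_stat(X, x, h, j)
            lam = sj * np.sqrt(2 * np.log(n * cn))            # threshold lambda_j
            if abs(Zj) > lam:
                h[j] *= beta                                  # shrink bandwidth of dimension j
            else:
                active.remove(j)                              # freeze dimension j
    return h                                                  # step 4: output h*

# Example: 2 relevant Gaussian dimensions plus 3 irrelevant uniform dimensions
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0.5, 0.1, 500), rng.normal(0.5, 0.1, 500),
                     rng.uniform(0, 1, (500, 3))])
print(density_rodeo_hard(X, x=np.full(5, 0.5)))
```

In this sketch the relevant dimensions should end with small bandwidths while the uniform dimensions keep their initial bandwidth, mirroring the behavior established in the theory below.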

The density rodeo algorithm for the local likelihood estimator is quite similar. Assuming a product Gaussian kernel, the closed form of the local likelihood estimator can be written as

fˆH(x) = [ (1/n) Σ_{i=1}^n ∏_{j=1}^d (1/hj) K((Xij − xj)/hj) ] exp( −(1/2) Σ_{k=1}^d hk² ( [ Σ_{i=1}^n exp(−(1/2) Σ_{j=1}^d ((Xij − xj)/hj)²) (Xik − xk)/hk² ] / [ Σ_{i=1}^n exp(−(1/2) Σ_{j=1}^d ((Xij − xj)/hj)²) ] )² ).    (17)

This can be viewed as a standard kernel density estimator multiplied by an exponential bias-correction term. To evaluate Zm = ∂fˆH(x)/∂hm, let fˆ(x) be the kernel density estimator

fˆ(x) = (1/n) Σ_{i=1}^n ∏_{j=1}^d (1/hj) K((Xij − xj)/hj) = (1/((√(2π))^d n h1⋯hd)) Σ_{i=1}^n exp( −(1/2) Σ_{j=1}^d ((Xij − xj)/hj)² )    (18)

and

ĝk(x) = ∂fˆ(x)/∂xk = (1/((√(2π))^d n h1⋯hd)) Σ_{i=1}^n exp( −(1/2) Σ_{j=1}^d ((Xij − xj)/hj)² ) (Xik − xk)/hk².

Then

Zm = ∂fˆH(x)/∂hm = ∂/∂hm [ fˆ(x) exp( −(1/2) Σ_{k=1}^d hk² (ĝk(x)/fˆ(x))² ) ]
   = (∂fˆ(x)/∂hm) exp( −(1/2) Σ_{k=1}^d hk² (ĝk(x)/fˆ(x))² ) + fˆ(x) exp( −(1/2) Σ_{k=1}^d hk² (ĝk(x)/fˆ(x))² ) × A    (19)

where

A = ∂/∂hm [ −(1/2) Σ_{k=1}^d hk² (ĝk(x)/fˆ(x))² ] = − Σ_{k=1}^d ( hk ĝk(x)/fˆ(x) ) ∂/∂hm ( hk ĝk(x)/fˆ(x) ).    (20)

To evaluate the last term, write B̂k = ∂/∂hm ( hk ĝk(x)/fˆ(x) ). If k ≠ m, denote Ei = exp( −(1/2) Σ_{j=1}^d ((Xij − xj)/hj)² ); then

B̂k = [ ( Σ_{i=1}^n Ei ((Xim − xm)²/hm³) ((Xik − xk)/hk) )( Σ_{i=1}^n Ei ) − ( Σ_{i=1}^n Ei (Xik − xk)/hk )( Σ_{i=1}^n Ei (Xim − xm)²/hm³ ) ] / ( Σ_{i=1}^n Ei )².    (21)

If k = m,

B̂m = [ ( Σ_{i=1}^n Ei ( (Xim − xm)³/hm⁴ − (Xim − xm)/hm² ) )( Σ_{i=1}^n Ei ) − ( Σ_{i=1}^n Ei (Xim − xm)/hm )( Σ_{i=1}^n Ei (Xim − xm)²/hm³ ) ] / ( Σ_{i=1}^n Ei )².

Finally, in the above formulas, ∂fˆ(x)/∂hm is the partial derivative of the kernel density estimator with respect to the bandwidth hm, which was derived in equation (13) for the Gaussian case.

4 Asymptotic Properties

This section presents the asymptotic properties of the resulting estimator. Our main theoretical results characterize the asymptotic running time, the selected bandwidths, and the risk of the algorithm. We assume that the underlying density function f has continuous second-order derivatives in a neighborhood of x. For notational convenience, the dimensions are numbered so that the relevant variables xj correspond to 1 ≤ j ≤ r and the irrelevant variables xj correspond to r + 1 ≤ j ≤ d. A key aspect of our analysis is that we allow the dimension d to increase with the sample size n, and we show that the algorithm achieves near-optimal minimax rates of convergence if d log d = O(log n). This hinges on a careful analysis of the asymptotic bias and variance of the estimated derivative Zj. We write Yn = ÕP(an) to mean that Yn = O(bn an) where bn is logarithmic in n. As noted earlier, we write an = Ω(bn) if lim inf_n |an/bn| > 0; similarly, an = Ω̃(bn) if an = Ω(bn cn) where cn is logarithmic in n. Also, let Hf(x) denote the Hessian matrix of f(x), let hj^(s) denote the j-th bandwidth at step s, and denote the bandwidth matrix by H^(s) = diag(h1^(s), ..., hd^(s)). We make the following assumptions to take the increasing dimension into account.

Assumption 1 (A1) Kernel assumption: K is a standard symmetric kernel, i.e. ∫K(u)du = 1 and ∫uK(u)du = 0d, and KH(·) = (1/det(H)) K(H⁻¹·) denotes the kernel with bandwidth matrix H = diag(h1, ..., hd). Then

∫ u uᵀ K(u) du = v2 Id with v2 < ∞    (22)
∫ K²(u) du = R(K) < ∞.    (23)

Assumption 2 (A2) Dimension assumption: the dimension d may increase with the sample size n and satisfies

d log d = O(log n).    (24)

This assumption allows the dimension to increase with the sample size, which makes the theoretical results stronger. However, the condition d log d = O(log n) restricts the growth rate; in fact, d log d = O(log n) is equivalent to d = O(log n/log log n). Indeed, d log d = O(log n) implies log d = O(log log n), which further implies d = O(log n/log log n). Conversely, d = O(log n/log log n) implies log d = O(log log n), and therefore d log d = O(log n).

Assumption 3 (A3) Initial bandwidth assumption: let hj^(0) denote the initial bandwidth for the j-th dimension,

hj^(0) = c0 / log log n    for j = 1, ..., d.    (25)

Combining this with assumption A2, we conclude that lim_{n→∞} n ∏_{j=1}^d hj^(0) = ∞. Since from A2 we have d = O(log n/log log n), defining ε = log log log n/log log n we obtain

n ∏_{j=1}^d hj^(0) = O( n (1/log log n)^d ) = O( n · n^{−log log log n/log log n} ) = O( n^{1−ε} ) −→ ∞.    (26)

Also, from the definition of hj^(0), it is obvious that hj^(0) −→ 0.

Assumption 4 (A4) Sparsity assumption: f(x) can be factorized into two components, f(x) ∝ g(x1, ..., xr) h(x), where hjj(x) = o(1) and r = O(1). This is our main assumption; it entails a restriction on the number of relevant dimensions.

Assumption 5 (A5) Hessian assumption: let HR(x) denote the Hessian matrix over the relevant dimensions j ≤ r. Then diag(HR(x)) is a continuous vector and

∫ tr(HRᵀ(u) HR(u)) du < ∞    (27)
fjj(x) ≠ 0    for j = 1, 2, ..., r.    (28)

As will be shown below, this assumption ensures that µj = ∂E fˆH(x)/∂hj is proportional to hj|fjj(x)| for small hj. Since we take the initial bandwidth to decrease slowly with n, A3 implies that µj(hj) ≥ c hj |fjj(x)| for some constant c > 0 and sufficiently large n.

Our main theoretical results proceed through a series of key lemmas. When proving the lemmas, we sometimes suppress the superscript s on the bandwidth matrix H^(s).

Lemma 1 Under assumptions A1–A5, suppose that x is interior to the support of f and that K_{H^(s)} is a product kernel with bandwidth matrix at step s given by H^(s) = diag(h1^(s), ..., hd^(s)). Let HR(x) denote the Hessian matrix over the relevant dimensions j ≤ r, and let fˆ_{H^(s)}(x) be the kernel density estimator

fˆ_{H^(s)}(x) = (1/(n det(H^(s)))) Σ_{i=1}^n K((H^(s))⁻¹(x − Xi)) = (1/n) Σ_{i=1}^n ∏_{j=1}^d (1/hj^(s)) K((xj − Xij)/hj^(s)).    (29)

Then

E fˆ_{H^(s)}(x) = f(x) + (1/2) v2 tr((H^(s))ᵀ HR(x) H^(s)) + oP(tr((H^(s))ᵀ H^(s)))    (30)

and

Var(fˆ_{H^(s)}(x)) = (1/(n det(H^(s)))) R(K) f(x) + oP(1/(n det(H^(s))))    (31)

where v2 and R(K) are as defined in A1. Proof: For the expectation term, n

X 1 K(H −1 (x − Xi )) (32) n det(H) i=1 Z 1 K(H −1 (u − x))f (u)du (33) det(H) Z K(u)f (x + Hu)du (34) Z 1 K(u){f (x) + uT H T ∇f (x) + uT H T Hf (x)Hu + oP (tr(uT H T Hu))}du (35) 2 1 f (x) + v2 tr(H T Hf (x)H) + oP (tr(H T H)) (36) 2 1 f (x) + v2 tr(H T HR (x)H) + oP (tr(H T H)) (37) 2

EfˆH (x) = E = = = = =

Equation (37) follows from A4. While à ! n X 1 −1 Var(fˆH (x)) = Var K(H (x − Xi )) n det(H) i=1 ¡ ¢ 1 = Var K(H −1 (x − Xi )) 2 n det(H) 1 1 = E{K2 (H −1 (x − Xi ))} − E2 {K(H −1 (x − Xi ))} n det(H)2 n det(H)2 Z 1 1 = {K(H −1 (u − x))}2 f (u)du − E2 {fˆH (x)} 2 n det(H) n Z 1 1 = {K(u)}2 f (u + Hu)du − E2 {fˆH (x)} n det(H) n µ ¶ 1 1 = R(K)f (x) + oP n det(H) n det(H)

(38) (39) (40) (41) (42) (43)

The first equality follows from the fact that the Xi's are i.i.d.; the last equality follows by a Taylor expansion.

Lemma 2 Suppose that fˆ_{H^(s)}(x) is the kernel density estimator defined in Lemma 1. For the product kernel, the following weak convergence result holds:

√(n det(H^(s))) ( fˆ_{H^(s)}(x) − E fˆ_{H^(s)}(x) ) −→d N(0, f(x) R(K)).    (44)

Proof: We use Lyapunov's central limit theorem for triangular arrays. Define

fˆH(x) = (1/n) Σ_{i=1}^n KH(x − Xi) = (1/n) Σ_{i=1}^n ∏_{j=1}^d (1/hj) K((xj − Xij)/hj) = (1/n) Σ_{i=1}^n z_in.    (45)

Then, similarly to Lemma 1,

Var(z_in) = R(K) f(x)/det(H) + oP(1/det(H)).    (46)

Then, we derive an upper bound for the third central moment min (3) as min (3) = E{|zin − E(zin )|3 }

(47)

3

≤ 8E{|zin | } © ª = 8E |KH (x − Xi )|3 µ Z 3 = 8 |K(u)| du + oP

(48) 1 det(H)2

(49)



(50)

where the inequality follows from the fact that E{|X − EX|3 } ≤ E{|X| + |EX||3 } = 8E|X|3

(51)

Therefore, the Liapunov condition is, for r ≥ 3 ³ R ´1/3 Pn n 3 du + o ( 1/r 8n |K(u)| ) P det(H)2 min (3)) ( Pni=1 ≤ ´1/2 ³ 1/r ( i=1 Var(zin )) nR(K)f (x) n + o ( P det(H) det(H) Ã ! Ã ! 1/3 n n−1/2 = O ·O det(H)2/3 det(H)−1/2 ´ ³ = O (n det(H))−1/6 −→ 0

(52)

(53) (54)

The last convergence result follows from A3, by the Liapunov’s Central Limit Theorem fˆH (x) − EfˆH (x) q −→d N (0, 1) ˆ Var(f (x))

(55)

The statement then follows directly from Lemma 1.

Lemma 3 Under assumptions A1–A5, suppose that x is interior to the support of f and that K is a product kernel with bandwidth matrix H^(s) = diag(h1^(s), ..., hd^(s)). Then

µj^(s) = ∂/∂hj^(s) E[ fˆ_{H^(s)}(x) − f(x) ] = oP(hj^(s))    for all j ∈ Rᶜ.    (56)

For j ∈ R we have

µj^(s) = ∂/∂hj^(s) E[ fˆ_{H^(s)}(x) − f(x) ] = hj^(s) v2 fjj(x) + oP(hj^(s)).    (57)

Thus, for any integer s > 0 with h^(s) = h^(0) β^s, each j > r satisfies µj^(s) = oP(hj^(s)) = oP(hj^(0)).

Remark: Special treatment is needed is x is a boundary point, we do not discuss it here. Proof: for j ∈ R, from lemma 1 1 EfˆH (x) − f (x) = v2 tr(H T HR (x)H) + oP (tr(H T H)) 2

(58)

under some regularity conditions µj =

∂ E[fˆH (x) − f (x)] = hj v2 fjj (x) + oP (hj ). ∂hj

(59)

For j ∈ Rc , if the irrelevant dimensions follow uniform distributions, the result is straightforward EfˆH (x) = E

n

d

1 XY 1 K n hj

µ

i=i j=1

= E

d Y j=1

1 K hj

µ



xj − Xij hj

xj − X1j hj

(60)

¶ (61)

µ ¶ Z Y d xj − uj 1 K f (u1 , ..., ur )du1 ...dud = hj hj j=1 µ ¶ Z Y r xj − uj 1 = K f (u1 , ..., ur )du1 ...dur + oP (tr(H T H)) hj hj

(62) (63)

j=1

The first term does not depend on hj; thus µj = oP(hj) for all j ∈ Rᶜ. If the irrelevant dimensions are defined in the generalized sense of expression (2), the proof proceeds by inspecting equation (58): when j ∈ Rᶜ, the corresponding entries of the Hessian Hf(x) are o(1), and the result follows directly.

Lemma 4 Define

C = R(K) f(x) / 4.    (64)

Then, if hj^(0) is defined as in A3,

(sj^(t))² = Var(Zj^(t)) = ( C / (n (hj^(t))²) ) ∏_{k=1}^d (1/hk^(t)) (1 + oP(1)).    (65)

Proof: Assuming that ξ ∼ N (0, 1). From lemma2 and lemma1, we could represents the kernel density estimator fˆH (x) as q (66) fˆH (x) = EfˆH (x) + Var(fˆH (x)) × ξ q 1 = f (x) + v2 tr(H T Hf (x)H) + oP (tr(H T H)) + Var(fˆH (x)) × ξ (67) 2 12

Thus, Zj = Since ∂ ∂hj

µq

∂ fˆH (x) ∂ + ∂hj ∂hj

¶ Var(fˆH (x)) =

µ

¶ µq ¶ 1 ∂ v2 tr(H T HR (x)H) + Var(fˆH (x)) × ξ 2 ∂hj

(68)

´´ ³ ∂ ³ Var fˆH (x) (69) ∂hj ˆ Var(fH (x)) ¶µ µ ¶¶ µ 1 1 1 R(K)f (x) = − q 1 + oP (70) 2 hj n det(H) n det(H)hj Var(fˆH (x)) s p 1 R(K)f (x) = − (1 + oP ( hj )) (71) 2 2 hj n det(H) 1 q 2

1

The second equality follows from lemma1, therefore à d ! Y 1 C (1 + oP (1)) . s2j = Var(Zj ) = hk nh2j

(72)

k=1

(t)

Lemma 5 Define Zj^(t) = Σ_{i=1}^n Zji^(t)/n and let (vj^(t))² be the sample variance of the Zji^(t), i = 1, ..., n. Then Zj^(t)/vj^(t) is approximately N(θ, 1), where θ = µj^(t)/sj^(t).

Proof: Since

vj² −→P sj²    (73)
Zj = (1/n) Σ_i Zji −→P µj    (74)

by the consistency of the sample variance and the weak law of large numbers, the result follows from Slutsky's theorem.

Lemma 6 Let Z ∼ N(θ, 1). Then

Pθ(|Z| ≥ t) ≤ (1/t) e^{−t²/2} + θ²/4.    (75)

Proof: Let

g(θ, t) = Pθ(|Z| ≥ t) = Φ(−t − θ) + 1 − Φ(t − θ).    (76)

Hence, g′(0) = 0 and

| ∂²g(θ, t)/∂θ² | = | (t − θ) φ(t − θ) − (t + θ) φ(t + θ) |,    (77)

so that

sup_{t,θ} |g″| ≤ 2 sup_u |u φ(u)| ≤ 1/2.    (78)

Also,

g(0) = P0(|Z| ≥ t) ≤ 2φ(t)/t = √(2/π) (1/t) e^{−t²/2} ≤ (1/t) e^{−t²/2}.    (79)

Finally,

g(θ, t) ≤ g(0) + (θ²/2) sup |g″| ≤ (1/t) e^{−t²/2} + θ²/4,    (80)

which gives the statement of the lemma.

Theorem 1 Suppose that r = O(1) and that (A1)–(A5) hold. In addition, suppose that Amin = min_{j≤r} |fjj(x)| = Ω̃(1) and Amax = max_{j≤r} |fjj(x)| = Õ(1). Then, for every ε > 0, the number of iterations Tn until the rodeo stops satisfies

P( (1/(4+r)) log_{1/β}(n^{1−ε} an) ≤ Tn ≤ (1/(4+r)) log_{1/β}(n^{1+ε} bn) ) −→ 1    (81)

where an = Ω̃(1) and bn = Õ(1). Moreover, the algorithm outputs bandwidths h∗ that satisfy

P( h∗j = hj^(0) for all j > r ) −→ 1.    (82)

Also, we have

P( hj^(0) (n bn)^{−1/(4+r)−ε} ≤ h∗j ≤ hj^(0) (n an)^{−1/(4+r)+ε} for all j ≤ r ) −→ 1,    (83)

assuming that hj is defined as in A3. Proof: In what follows, when hj = o(1), we ignore the oP (1) terms in the asymptotic expressions for sj and µj , it being understood that these hold, except on a set of probability tending to 0. First, consider j > r. Let Vt = {j > r : hj = h0 β t } be the set of irrelevant dimensions that are active at stage t > 1 of the algorithm. Then X P(|Zj | > λj , for some j ∈ Vt ) ≤ P(|Zj | > λj ) (84) j∈Vt

=

X

µ P

j∈Vt



X

à P

j∈Vt



ncn

λj |Zj | > sj sj

(85)

sj −λ2j /(2s2j ) 1 µ2j e + λj 4 s2j

d

p



2 log(dcn )

+

1 X µ2j 4 s2j

! (86) (87)

j∈Vt

From lemma 4, we have that s2j ≥

1 1 C 1 C ≥ n (h(0) )2 β 2t (h(0) )d n (h(0) )(d+2) j j j 14

(88)

and from lemma 3 we have that (0)

µj = oP (hj )

(89) (0)

Without loss of generality, in what follows, we assume that hj that X µ2j j∈Vt

s2j



d · µ2j s2j

= h0 for j = 1, ..., d. It follows

nd · µ2j hd+2 0 ≤ −→ 0 C

also, from the condition in density Rodeo’s algorithm µ√ ¶ d log n p =O n ncn 2 log(ncn )

(90)

(91)

Therefore, with probability tending to 1, hj = h0 for each j > r,meaning that the bandwidth for each irrelevant dimension is frozen in the first step in the algorithm. Now consider j ≤ r. By the previous assumption, µj ≥ chj |fjj (x)|. Without loss of generality, assume that chj fjj > 0. We claim that in iteration t of the algorithm, if à ! 1 c2 nA2min h4+d 0 t≤ log1/β (92) 4+r 8C log(ncn ) then P(hj = h0 β t , for all j ≤ r) −→ 1. To show this, first note that inequality (92) can be written as µ ¶t(4+r) 1 c2 nA2min h4+d 0 ≤ β 8C log(ncn ) Except on an event of vanishing probability, we have shown above that µ ¶d−r Y 1 1 ≤ hj h0

(93)

(94)

(95)

j>r

So on the complement of this event, if each relevant dimension is active at step s ≤ t, we have λ2j h2j

= = ≤ ≤ ≤

2s2j log(ncn ) h2j

2C log(ncn ) Y 1 hi nh4j i µ ¶(4+r)t 2C log(ncn ) 1 β nh4+d 0 c2 A2min 4 2 c fjj (x)2 4 15

(96) (97) (98) (99) (100)

which implies that cfjj (x)hj ≥ 2λj

(101)

p cfjj (x)hj − λj λj ≥ = 2 log(ncn ) sj sj

(102)

and hence

for each j ≤ r. Now, P( rodeo halts ) = P(|Zj | < λj for all j ≤ r)

(103)

≤ P(|Zj | < λj for some j ≤ r) X ≤ P(|Zj | < λj )

(104) (105)

j≤r



X

P(Zj < λj )

j≤r



X

µ P

j≤r



X

µ P

j≤r



X j≤r



µj − λj Zj − µj > sj sj

(106) ¶

Zj − µj cfjj (x)hj − λj > sj sj

µ

(107) ¶

cfjj (x)hj − λj Zj − µj |> P | sj sj

ncn

p

r 2 log(ncn )

(108) ¶ (109) (110)

The last inequality follows from the standard Miller’s inequality. Finally, summing over all iterations s ≤ t gives   o [ [ n (s) tr (s) p P |Zj | < λj  ≤ (111) ncn 2 log(ncn ) s≤t j≤r ¶ µ c2 nA2min h4+d r 0 4+r log1/β 8C log(ncn ) p ≤ (112) ncn 2 log(ncn ) ¶ µ√ log n −→ 0 = O (113) n by the density Rodeo’s algorithm. Thus, the bandwidths hj for j ≤ r satisfy, with high probability, Ã !1/(4+r) 8C log(nc ) n hj = h0 β t ≤ h0 (114) c2 A2min nh4+d 0 Ã !1/(4+r) 8C log(nc ) n = n−1/(4+r) (115) c2 A2min hd−r 0 16

Now, since d log d = O(log n), it follows that d = O(log n/ log log n) and ³ ´ 1 d = O (log log n) hd0 ³ log log log n ´ = O n log log n = O (n² )

(116) (117) (118)

For any ² > 0. In particular, with probability approaching one, the algorithm runs for a number of iterations Tn bounded from below by Tn ≥

1 log1/β (n1−² an ) 4+r

(119)

an =

c2 A2min ˜ = Ω(1). 8C log(ncn )

(120)

where

We next show that the algorithm is unlikely to reach iteration s, if s≥

1 log1/β (n1+δ bn ) 4+r

(121)

˜ for any δ > 0, where bn = O(1) will be defined below. From the argument above, we know that except on an event of vanishing probability, each relevant dimension j ≤ r has bandwidth no larger than hj ≤ h0 β (log1/β (n

1−² a

n)

)/(4+r) =

h0 1−² (n an )1/(4+r)

(122)

Thus, if relevant dimension j has bandwidth hj ≤ h0 β s , then from lemma 4 we have that s2j u2j

= ≥

s2j

(123)

2 (x)h4 v22 fjj j C 2 (x)h4 nh2 det(H) v22 fjj j j

(124) r/(4+r)



C nr(1−²)/(4+r) an 2 (x)h2 nh4 β 4s hr0 v22 fjj 0 j

=

C an 1 2 2 2 4+d 4/(4+r)+δ β 4s v2 fjj (x)hj n h0



an 1 4+d 4/(4+r)+δ 2 β 4s Amax n h0

1 hd−r 0

(125)

r/(4+r)

C

(126)

r/(4+r)

(127) (128)

where δ = r²/(4 + r). Therefore, s2j µ2j

≥ log log n 17

(129)

in case µ ¶s 1 ≥ (n1+δ bn )1/(4+r) β à !1/4 A2max log log n 1/(4+r)+δ 0 = n r/(4+r) Can à !1/4 A2max h4+d log log n 1/(4+r)+δ 0 0 ≥ n r/(4+r) Can

(130) (131)

(132)

1 ˜ which defines bn = O(1). It follows that in iteration s ≥ 4+r log1/β (n1+δ bn ), the probability of a relevant dimension having estimated derive Zj above threshold is bounded by X P(|Zj | > λj , for some j ≤ r) ≤ P(|Zj | > λj ) (133) j≤r

=

X

µ P

j≤r



X

à P

j≤r

≤ ≤

ncn

|Zj | λj > sj sj

2 log(ncn ) r

p

(134)

sj −λ2j /(2s2j ) 1 µ2j e + λj 4 s2j

r

p



ncn 2 log(ncn ) ¶ µ 1 = O log log n

+

! (135)

1 X µ2j 4 s2j

(136)

r 4 log log n

(137)

j∈Vt

+

(138)

which gives the statement of the theorem.

Theorem 2 Under the same conditions as Theorem 1, the risk Rh∗ of the density rodeo estimator satisfies

Rh∗ = ÕP( n^{−4/(4+r)+ε} )    (139)

for every ε > 0.

Proof: Given the bandwidths in expressions (82) and (83), the squared bias is

Bias²( fˆ_{H∗}(x) ) = ( Σ_{j≤r} v2 fjj(x) hj² )² + oP(tr(HᵀH))    (140)
                   = Σ_{i,j≤r} v2² fii(x) fjj(x) hi² hj² + oP(tr(HᵀH))    (141)
                   = ÕP( n^{−4/(4+r)+ε} )    (142)

by Theorem 1. Similarly, by Lemma 4, the variance is

Var( fˆ_{H∗}(x) ) = (1/n) ∏_i (1/hi) R(K) f(x) (1 + oP(1))    (143)
                  = ÕP( n^{−1+r/(4+r)+ε} )    (144)
                  = ÕP( n^{−4/(4+r)+ε} ).    (145)

The result follows from the bias-variance decomposition. These theoretical properties rest mainly on large-sample arguments. However, the experiments presented below indicate that the method can also perform well with relatively small sample sizes.

5 Extension and Variation

In this section, we discuss some extensions and variations of the hard-thresholding density rodeo defined above, including a soft-thresholding version, a global version, and a reverse bandwidth selection version. We also describe how to extend the sparsity definition in expression (2) to yield a more general framework.

5.1 Soft-thresholding version

In the hard-thresholding version defined above, we take the estimate D̂j(h) = Zj(h) I(|Zj(h)| > λj), with the result that

f̃H(x) = fˆ1(x) − ∫₀¹ ⟨D̂(s), ḣ(s)⟩ ds.    (146)

As for the regression rodeo, a soft-thresholding path can be defined instead. First, D̂ is replaced with the soft-thresholded estimate D̂j(t) = sign(Zj(h)) (|Zj(h)| − λj)+, where the index t denotes the t-th step of the algorithm. Since hj is updated multiplicatively as hj ← βhj, the differential dhj(t) is given by dhj(t) = (1 − β)hj. Using the resulting estimate of D(t) and a finite-difference approximation for h(t) leads to the algorithm detailed in Table 2; a short sketch of one update step is given after the table.

TABLE 2: Density Estimation Rodeo: Soft thresholding version

1. Select a parameter 0 < β < 1 and an initial bandwidth h0 satisfying 1 ≤ h0 ≤ log^{l/d} n for a fixed constant l, where h0 decreases slowly to zero. Let cn be a sequence satisfying dcn = Ω(log n).
2. Initialize the bandwidths and activate all covariates:
   (a) hj = h0, j = 1, 2, ..., d.
   (b) A = {1, 2, ..., d}.
   (c) Initialize the step counter, t = 1.
3. While A is nonempty:
   (a) Set dhk(t) = 0, k = 1, ..., d.
   (b) Do for each j ∈ A:
       (1) Compute the estimated derivative and variance: Zj and sj.
       (2) Compute the threshold λj = sj √(2 log(d cn)).
       (3) If |Zj| > λj, set dhj(t) = (1 − β)βhj; otherwise remove j from A.
       (4) Set D̂j(t) = sign(Zj(h)) (|Zj(h)| − λj)+.
4. Output the bandwidths h∗ = (h1, ..., hd) and the estimator f̃(x) = fˆ_{h0}(x) − Σ_{s=1}^t ⟨D̂(s), dh(s)⟩.
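The following Python sketch (our illustration of one pass of Table 2; the name soft_threshold_step and the argument layout are assumptions) shows how the soft-thresholded derivative and the bandwidth differentials are computed in a single iteration.

```python
import numpy as np

def soft_threshold_step(Z, s, h, active, beta=0.9, cn=1.0):
    """One iteration of the soft-thresholding rodeo update (Table 2, steps 3(a)-(4)).

    Z, s   : current derivative estimates Z_j and their standard errors s_j (arrays)
    h      : current bandwidth vector (an updated copy is returned)
    active : boolean mask of still-active dimensions (updated in place)
    """
    d = len(h)
    lam = s * np.sqrt(2 * np.log(d * cn))                    # lambda_j = s_j sqrt(2 log(d c_n))
    h_new = h.copy()
    dh = np.zeros_like(h)                                    # step 3(a): dh_k(t) = 0
    for j in np.where(active)[0]:                            # step 3(b)
        if np.abs(Z[j]) > lam[j]:
            dh[j] = (1 - beta) * beta * h[j]                 # step (3): dh_j(t) = (1 - beta) beta h_j
            h_new[j] = beta * h[j]                           # multiplicative bandwidth update
        else:
            active[j] = False                                # remove j from A
    D_hat = np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)    # step (4): soft-thresholded D_j(t)
    return h_new, dh, D_hat
```

The final estimator then subtracts the accumulated sum of ⟨D̂(t), dh(t)⟩ over all iterations from fˆ_{h0}(x), as in the last line of Table 2.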

5.2 Global density rodeo

Instead of the local density rodeo, which corresponds to adaptive density estimation, the idea can easily be extended to carry out global bandwidth selection, in which case each dimension uses a single fixed bandwidth. The idea is to average the test statistics over multiple evaluation points x1, ..., xm, which can be sampled from the empirical distribution of the observed data. As pointed out by Lafferty and Wasserman [13], averaging the Zj's directly leads to a statistic whose mean for relevant variables is asymptotically (hj/m) Σ_{i=1}^m fjj(xi). Because of sign changes in fjj(x), cancellations can occur, resulting in a small value of the statistic. To avoid this problem, the statistic is squared. Let x1, ..., xm denote the evaluation points and let Zj(xi) denote the derivative at the i-th evaluation point with respect to the bandwidth hj, so that

Zj(xi) = (1/n) Σ_{k=1}^n Zjk(xi),    i = 1, ..., m,  j = 1, ..., d.    (147)

Let γjk = (Zj1(xk), Zj2(xk), ..., Zjm(xk))ᵀ for k = 1, ..., n, and assume Var(γjk) = Σj. Denote Zj· = (Zj(x1), ..., Zj(xm))ᵀ; by the multivariate central limit theorem, Var(Zj·) = Σj/n ≡ Cj. Based on this, we define the test statistic

Tj = (1/m) Σ_{k=1}^m Zj(xk)²,    j = 1, ..., d,    (148)

while

sj = √(Var(Tj)) = (1/m) √(Var(Zj·ᵀ Zj·)) = (1/m) √( 2 tr(Cj²) + 4 µ̂jᵀ Cj µ̂j ),    (149)

where µ̂j denotes the estimated mean vector of Zj· (each component estimated by the corresponding Zj(xi)). For an irrelevant dimension j ∈ Rᶜ, as shown before, E Zj(xi) = oP(hj); therefore

E Tj ≈ Var(Zj(xi)).    (150)

We use sj² as an estimate of Var(Zj(xi)); therefore, we take the threshold to be

λj = sj² + 2 sj √(log(n cn)).    (151)

Several examples of this algorithm and comparisons with other methods are given in the experimental section; its theoretical properties can be analyzed in a way similar to the analysis presented above.
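The following Python sketch (ours; the function name global_rodeo_stat is an assumption, and the plug-in choice for µ̂j is one possible reading of equation (149)) computes the global statistic Tj and its standard error sj for one dimension from the per-observation derivative terms.

```python
import numpy as np

def global_rodeo_stat(Zji_per_point):
    """Global rodeo statistic T_j and standard error s_j for a single dimension j.

    Zji_per_point : (n, m) array; entry (i, k) is Z_ji evaluated at evaluation point x_k,
                    so that column means are Z_j(x_k) as in equation (147).
    """
    n, m = Zji_per_point.shape
    Zj = Zji_per_point.mean(axis=0)                      # Z_j(x_k), k = 1, ..., m
    Tj = np.mean(Zj ** 2)                                # equation (148)
    Cj = np.cov(Zji_per_point, rowvar=False) / n         # Var(Z_j.) = Sigma_j / n
    mu_hat = Zj                                          # plug-in estimate of the mean vector
    sj = np.sqrt(2 * np.trace(Cj @ Cj) + 4 * mu_hat @ Cj @ mu_hat) / m   # equation (149)
    return Tj, sj

# Threshold from equation (151): lambda_j = s_j**2 + 2 * s_j * sqrt(log(n * c_n))
```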

5.3 Bootstrapping version

In the previous examples, explicit expressions for Zj and sj could easily be derived because the density estimators have closed forms. Sometimes, however, the density estimator fˆH(x) may not have a closed-form expression. In these cases, we can still evaluate the derivative Zj numerically as

Zj = ( fˆ_{H+Δhj}(x) − fˆH(x) ) / Δhj.    (152)

A bootstrapping technique can then be applied to evaluate the variance of Zj, as follows.

Bootstrapping method to calculate sj²

1. Draw a sample X1∗, ..., Xn∗ of size n with replacement from the data X1, ..., Xn; repeat this B times. For each bootstrap sample b = 1, ..., B, compute the estimate Zj∗(b).
2. Compute the bootstrapped variance sj² = (1/B) Σ_{b=1}^B ( Zj∗(b) − Z̄j· )², where Z̄j· = (1/B) Σ_{b=1}^B Zj∗(b).
3. Output the resulting sj².

This bootstrapping version works for both the local and the global density rodeo algorithms, and thus provides a more general framework. We expect similar analytic results to hold, although this version requires more computation. The performance of the bootstrapping version is also shown in the experimental section below.
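As a concrete illustration, the sketch below (ours; the name bootstrap_sj, the default B, and the callable interface are assumptions) combines the finite-difference derivative of equation (152) with the bootstrap variance estimate. Any density estimator can be plugged in as fhat, for example the product-Gaussian estimator sketched earlier.

```python
import numpy as np

def bootstrap_sj(X, x, h, j, dh, fhat, B=200, seed=0):
    """Bootstrap estimate of Var(Z_j) for the finite-difference derivative of equation (152).

    fhat : callable fhat(X, x, h) returning the density estimate at x
    dh   : small bandwidth increment used in the finite difference for dimension j
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    z_star = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)                 # resample the data with replacement
        Xb = X[idx]
        h_plus = h.copy()
        h_plus[j] += dh                                  # perturb only the j-th bandwidth
        z_star[b] = (fhat(Xb, x, h_plus) - fhat(Xb, x, h)) / dh   # finite-difference Z_j*
    return z_star.var(ddof=1)                            # bootstrapped s_j^2
```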

5.4 Reverse density rodeo

The density rodeo algorithm described so far uses a sequence of decreasing bandwidths and estimates the optimal value by testing normal means. Conversely, we can begin with a very small bandwidth and use a sequence of increasing bandwidths to estimate the optimal value. The theoretical analysis is the same as before. An example of this algorithm is given in the experimental section.

5.5 Using other distributions as irrelevant

Instead of the sparsity definition in expression (2), we can use a general parametric distribution h(x) for the irrelevant dimensions. The key trick is to define a new semiparametric density estimator

fˆH(x) = ĥ(x) Σ_{i=1}^n KH(Xi − x) / ( n ∫ KH(u − x) ĥ(u) du ),    (153)

The partial derivative of the estimator with respect to the bandwidth hm (m = 1, ..., d) is v à  ! ! u d à d 2 Y X ˜(x) h2j (x − µ ) ∂ fˆH (x) u ∂ f j j t ˜ − ³ ´ Zm = = + K · f (x) (154) 1 + 2 exp ∂hm ∂hm σj 2 σ 2 (σ 2 + h2 )/h2 j=1

j=1

j

j

j

j

where K=

2 + h2 ) + (x − µ )2 ) hm (2(σm m m m 2 2 2 2(σm + hm )

(155)

and f˜(x) is the standard kernel density estimator as defined in equation (7). The variance of Zm could be evaluated by the previously introduced bootstrapping approach. More details about the derivation are given in the appendix. 22

6

Experimental Results

In this section, we applied the density rodeo on both synthetic and real data, including onedimensional, two-dimensional, high-dimensional and very high-dimensional examples to investigate how the density estimation rodeo performs in various conditions. The algorithms are also applied on some real-wold datasets (e.g. the Geyser data and the digits image data ). For the purpose of evaluating the algorithm performance quantitatively, we need some criterion to measure the distance between the estimated density function with the true density function. For this, we used the Hellinger distance defined as s ¶2 Z Z µq p fˆ(x) fˆ(x) − f (x) dx = 2 − 2 f (x) D(fˆkf ) = dx (156) f (x) Assuming that we have altogether m evaluation points, then the hellinger distance could be numerically calculated by the Monte Carlo integration s m X 2 fˆH (Xi ) D(fˆkf ) ≈ 2 − (157) m f (Xi ) i=1

Since this measure utilizes the property that f (x) is a density function, it’s expected to be numerically more stable than the commonly used Kullback-Leibler (KL) divergence as a loss function for evaluating the discrepancy between two density functions. In the following, we first use the simulated data, about which we have known the true distribution function, to investigate the algorithm performance. Then our algorithm is also applied on some real data for analysis and compare our results with the results gotten from the other authors.

6.1

One-dimensional examples

First, we illustrate the performance of the density Rodeo algorithm in one dimensional examples. We have conducted a series of comparative study on a list of 15 “test densities” proposed by Marron and Wand [16], which are all normal mixtures representing many different types of challenges to density estimation. Our method achieves a comparable performance as the kernel density estimation with bandwidth selected by unbiased cross-validation (from the base library of R ). Due to the space consideration, only the strongly skewed example is reported here, since it demonstrates the advantage of adaptive bandwidth selection for the local density rodeo algorithms. Example 1 (Strongly skewed density): This density is chosen to resemble to lognormal distribution, it distributes as à µ ¶ µ ¶2i ! 7 X 1 2 i 2 X∼ (158) N 3 ( ) −1 , . 8 3 3 i=0

200 samples were generated from this distribution, The estimated density functions by the local density rodeo algorithm, the global density rodeo algorithm, and kernel density estimator with bandwidth chosen by unbiased cross validation are shown in figure 1. In which, the solid line is the true density function, the dashed line illustrates the estimated densities by different 23

1.0


2

x

Figure 1: The different versions of the algorithms run on the highly skewed unimodal example. The first three plots are results for the different estimators, the last one is the fitted bandwidth for the local density rodeo algorithm. 24

0.000

0.005

0.010

0.015

0.020

methods. The local density rodeo works the best, this is because because the true density function is strongly skewed, the fixed bandwidth density estimator can not fit the smooth tail very well. The last subplot from firgure 1 illustrates the selected bandwidth for the local density rodeo method, it shows how smaller bandwidths are selected where the function is more rapidly varying. Figure 2 shows the distribution of the empirical Hellinger distances based on 100 simulations. The boxplots show that the local density rodeo works better than the remaining two methods, while the global rodeo and the unbiased cross-validation methods are comparable in this one dimensional example.

KDE

Local Rodeo

Global Rodeo

Figure 2: Highly skewed unimodal distribution: The boxplots of the empirical Heillinger’s losses on test samples of estimated densities by the three methods based on 100 simulations. In stead of showing the positive cases where the density rodeo works well, we also give out a negative example. Which is a combined Beta distribution from Loader [5]. Example 2 (Combined Beta density): The density function is a Beta mixtures on the support of [0, 1], it has a strong boundary effect on the left side, the density function is 2 1 19! 9 f (x) = 2(1 − x) + x (1 − x)9 . 3 3 9!2

(159)

altogether 500 samples were generated from this distribution, the estimated density functions by different methods are shown in figure 3. Similar as before, the solid line is the true density function, while the dashed line illustrates the estimated functions. In this case, the global rodeo 25

1.5 0.5

1.0

density

1.0 0.5

density

1.5

2.0

Local kernel density Rodeo

2.0

Unbiased Cross Validation

True density Local Rodeo

0.0

0.0

True density Unbiased C.V.

0.2

0.4

0.6

0.8

0.0

0.2

0.4

0.6

x , bandwidth = 0.0386

x

Global kernel density Rodeo

Bandwidth estimated

0.8

0.0

0.1

True density Global Rodeo

0.2

bandwidth

1.0 0.5

density

0.3

1.5

0.4

2.0

0.0

0.0

0.2

0.4

0.6

0.8

0.0

x , bandwidth = 0.0314

0.2

0.4

0.6

0.8

x

Figure 3: Results for the combined Beta distribution. The first three plots show the results for the different estimators; the last shows the fitted bandwidths for the local density rodeo.


0.0

0.4

0.8 x

Figure 4: Upper: boxplots of the empirical Hellinger losses on test samples for the three estimators, based on 100 simulations, for the combined Beta density. Lower: the local likelihood rodeo run on the combined Beta example; the lower left plot shows the estimated density and the lower right plot shows the fitted bandwidths at different evaluation points.

and the unbiased cross-validation estimate the density function well. However, the local rodeo fails to fit the right tail. From the bandwidth plot (the last subplot of Figure 3), we see that the local rodeo tends to select larger bandwidths near the right boundary, which causes the problem. The upper plot of Figure 4 shows the distribution of the empirical Hellinger distances over 100 simulations, consistent with the results in Figure 3. As pointed out by Loader [5], the local likelihood density estimator can correct boundary bias, so we expect the local likelihood rodeo developed earlier to solve this problem. The results of the local likelihood rodeo derived in Section 3 are shown in the lower plots of Figure 4; the estimated bandwidth plot shows that the boundary effect is alleviated.

6.2 Two dimensional examples

Here, two 2-dimensional examples are shown because they are easy to visualize. One example uses a synthetic dataset; the other uses real data previously analyzed by other authors. The density rodeo's performance is compared with the built-in method KDE2d (from the MASS package in R). The empirical results show that the density rodeo works better than the built-in method on the synthetic data, where the ground truth is known. For the real-world dataset, where the underlying distribution is unknown, our method gives results very similar to those of the original authors who analyzed the data.


ensio

n

(b) the KDE2d estimation

Figure 5: Perspective plots of the estimated density functions by the global density rodeo (upper) and the R built-in method KDE2d (lower) on a 2-dimensional synthetic data.



0.4

0.6

irrelevant dimension

Figure 6: Marginal distributions of the relevant and irrelevant dimensions for Example 3.

Example 3 (Combined Beta distribution with a uniform distribution as the irrelevant dimension): We simulate a 2-dimensional dataset with n = 500 points. The two dimensions are generated independently as

X1 ∼ (1/3) Beta(1, 2) + (2/3) Beta(10, 10)    (160)
X2 ∼ Uniform(0, 1).    (161)

Figure 5 shows the density estimates produced by the global density rodeo and the built-in method KDE2d. The global rodeo fits the irrelevant uniform dimension perfectly, while KDE2d fails to do so. For a quantitative comparison, we evaluated the empirical Hellinger distance between the estimated and true densities; the global rodeo outperforms KDE2d uniformly on this example. For a qualitative comparison, Figure 6 shows the numerically integrated (unnormalized) marginal distributions of the two estimators, which is consistent with the previous observations.
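A short Python sketch for generating the data of Example 3 is given below (ours; the function name and the mixture draw are assumptions).

```python
import numpy as np

def sample_example3(n=500, rng=None):
    """Generate the 2-d data of Example 3: a Beta mixture plus an irrelevant uniform."""
    rng = np.random.default_rng(rng)
    pick = rng.random(n) < 1.0 / 3.0                  # mix 1/3 Beta(1,2) with 2/3 Beta(10,10)
    x1 = np.where(pick, rng.beta(1, 2, n), rng.beta(10, 10, n))
    x2 = rng.uniform(0, 1, n)                         # irrelevant uniform dimension
    return np.column_stack([x1, x2])
```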


(a) the rodeo estimation

n

(b) the KDE2d estimation

Figure 7: Perspective plots of the density estimates produced by the global density rodeo (upper) and the R built-in method KDE2d (lower) on the geyser data.

Example 4 (Geyser data): A version of the eruption data from the "Old Faithful" geyser in Yellowstone National Park, Wyoming. This version comes from Azzalini and Bowman [17] and consists of continuous measurements from August 1 to August 15, 1985. There are two variables and 299 observations. The first variable, "duration", is the eruption time in minutes; the second, "waiting", is the waiting time until the next eruption. We


second dimension

1.0

apply the global density rodeo algorithm to this dataset. The density estimates from the rodeo algorithm and from the built-in KDE2d method (used by the original authors) are shown in Figure 7, and Figure 8 shows the corresponding contour plots. By visual inspection, our method gives an estimate very similar to that of the original authors.


0.8

1.0

first dimension

Figure 8: Contour plots for the Geyser data. Left: the results from the global density rodeo, Right: the results from the KDE2d

6.3 High dimensional examples

Example 5 (High-dimensional case): Figure 9 shows the bandwidths output by the local density rodeo for a 30-dimensional synthetic dataset with r = 5 relevant dimensions (n = 100, 30 trials). The relevant dimensions are generated as

Xi ∼ N(0.5, (0.02 i)²),    for i = 1, ..., 5,    (162)

while the irrelevant dimensions are generated as

Xi ∼ Uniform(0, 1),    for i = 6, ..., 30.    (163)

The test point is x = (1/2, ..., 1/2). The boxplot shows the bandwidths selected over the 30 trials. The bandwidths of the relevant dimensions shrink toward zero, while the bandwidths of the irrelevant dimensions remain large, indicating that the algorithm behaves as predicted by our theoretical analysis. The bandwidth plot also shows that, for the relevant dimensions, the smaller the variance, the smaller the selected bandwidth.
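The following Python sketch (ours; the function name is an assumption) generates the synthetic data used in Example 5.

```python
import numpy as np

def sample_example5(n=100, d=30, r=5, rng=None):
    """Generate the data of Example 5: r Gaussian dimensions plus d - r uniform dimensions."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0, 1, size=(n, d))                        # irrelevant dimensions
    for i in range(1, r + 1):
        X[:, i - 1] = rng.normal(0.5, 0.02 * i, size=n)       # X_i ~ N(0.5, (0.02 i)^2)
    return X
```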


X26

X30

dimensions

Figure 9: Bandwidths output by the local density rodeo for a 30-dimensional synthetic dataset (left) and boxplots over 30 trials (right).

Example 6 (Image processing): Here we apply the reverse local density rodeo to image data. The results are shown in Figure 10. The algorithm was run on 2200 grayscale images of handwritten 1s and 2s, each with 256 = 16 × 16 pixels and some unknown background noise; this is therefore a 256-dimensional density estimation problem. A test point is shown in the upper left subplot of Figure 10, and the bandwidths output by the rodeo are shown in the upper right subplot. The bandwidths at different rodeo steps (steps 10, 20, 40, 60, and 100) are shown in the lower series of plots; larger bandwidths have darker colors, and pixels with smaller bandwidths are more informative than those with larger bandwidths. Figure 10 visualizes the evolution of the bandwidths and can be viewed as a dynamic process of feature selection: the earlier a dimension's bandwidth decays, the more informative it is. The rodeo algorithm is quite efficient for this extremely high-dimensional problem. Interestingly, the early stages of the rodeo reveal that some of the 2s in the data have looped bottoms while others have straight bottoms; the test point does not have a loop. This is presumably because some 2s in the original dataset have this loop while others do not. The density rodeo can discover such characteristics automatically.



0.0

0.4

0.8

Figure 10: The image processing example. The upper plots show the test digit and the bandwidths output by the reverse density rodeo. The lower subplots show the bandwidths sampled at different rodeo steps: 10, 20, 40, 60, and 100.

6.4 Using other distributions as the irrelevant dimensions

Example 7 (Using Gaussian distributions as the irrelevant dimensions): Figure 11 shows the bandwidths output by the semiparametric rodeo (developed in Section 5) for 15-dimensional and 20-dimensional synthetic datasets with r = 5 relevant dimensions (n = 1000). When Gaussian distributions are used as the irrelevant dimensions, the relevant dimensions are generated as

Xi ∼ Uniform(0, 1),    for i = 1, ..., 5,    (164)

while the irrelevant dimensions are generated as

Xi ∼ N(0.5, (0.05 i)²),    for i = 6, ..., d.    (165)


The test point is x = (1/2, ..., 1/2). Even when Gaussian distributions are used as the irrelevant dimensions, the plot shows that the bandwidths of the relevant dimensions shrink toward zero while the bandwidths of the irrelevant dimensions remain large, exactly as expected.


15

20

25

Rodeo Step

Figure 11: Bandwidths output by the semiparametric rodeo for a 15-dimensional synthetic dataset (left) and a 20-dimensional synthetic dataset (right), using Gaussian distributions as the irrelevant dimensions.

We also applied the semiparametric rodeo to some one-dimensional examples. In the first example, we simulated 1000 one-dimensional data points with xi ∼ Uniform(0, 1). With β = 0.9, the results of the semiparametric rodeo are shown in Figure 12: the first plot shows the true density, the second shows the density estimated by the semiparametric rodeo, the lower left plot shows the estimated bandwidths at different evaluation points, and the last shows the kernel density estimate with bandwidth selected by unbiased cross-validation. By visual inspection, the density estimated by the semiparametric rodeo is quite similar to the cross-validated kernel density estimate, although the selected bandwidths are quite small in this case (≈ 0.015). In the second example, we simulated 1000 one-dimensional data points with xi ∼ N(0.5, 1). With β = 0.9, the results of the semiparametric rodeo are shown in Figure 13: the first plot shows the true density, the second shows the density estimated by the semiparametric rodeo, and the lower left plot shows the estimated bandwidths at different evaluation points. Viewing the semiparametric estimator in equation (153) as a standard kernel density estimator multiplied by a bias-correction factor, the last subplot shows the estimated correction factors at different evaluation points. The estimated density captures the shape of the target


Figure 12: Results for fitting the uniform distribution with the semiparametric density rodeo algorithm. The first plot shows the true density function, the second plot shows the density function estimated by the semiparametric rodeo algorithm, the lower left plot illustrates the estimated bandwidths at different evaluation points, and the last plot shows the density function estimated by the kernel density estimator with bandwidth selected by unbiased cross-validation.


Figure 13: Results for fitting the Gaussian distribution with the semiparametric density rodeo algorithm. The first plot shows the true density function, the second plot shows the density function estimated by the semiparametric rodeo algorithm, the lower left plot illustrates the estimated bandwidths at different evaluation points, and the last plot shows the estimated correction factors at different evaluation points.

The estimated density function captures the shape of the target density correctly, but it is not as smooth as the true function, because the estimated correction factor is discrete. The bandwidths selected for this Gaussian example are 10 times larger than in the previous uniform example, which is consistent with our theoretical analysis.

7 Conclusions

The main purpose of this work is to illustrate the generality of the rodeo framework. Under suitably defined sparsity conditions, the previously developed nonparametric regression framework is easily adapted to perform high-dimensional density estimation. The resulting method is both computationally efficient and theoretically sound. Empirical results show that our method outperforms the built-in baseline methods in many cases.

References

[1] E. Parzen. On the estimation of a probability density function and the mode. The Annals of Mathematical Statistics, 33:832-837, 1962.
[2] M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27:642-669, 1956.
[3] N. L. Hjort and M. C. Jones. Locally parametric nonparametric density estimation. The Annals of Statistics, 24:1619-1647, 1996.
[4] N. L. Hjort and I. K. Glad. Nonparametric density estimation with a parametric start. The Annals of Statistics, 23(3):882-904, 1995.
[5] C. R. Loader. Local likelihood density estimation. The Annals of Statistics, 24:1602-1618, 1996.
[6] G. R. Terrell and D. W. Scott. Variable kernel density estimation. The Annals of Statistics, 20(3):1236-1265, 1992.
[7] S. R. Sain and D. W. Scott. On locally adaptive density estimation. Journal of the American Statistical Association, 91:1525-1534, 1996.
[8] J. G. Staniswalis. Local bandwidth selection for kernel estimates. Journal of the American Statistical Association, 84:284-288, 1989.
[9] J. Friedman, W. Stuetzle, and A. Schroeder. Projection pursuit density estimation. Journal of the American Statistical Association, 79:599-608, 1984.
[10] C. J. Stone. Large sample inference for log-spline models. The Annals of Statistics, 18:717-741, 1990.
[11] B. W. Silverman. On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics, 10:795-810, 1982.


[12] M. D. Escobar and M. West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430):577-588, 1995.
[13] J. D. Lafferty and L. A. Wasserman. Rodeo: Sparse nonparametric regression in high dimensions. Advances in Neural Information Processing Systems (NIPS), 18, 2005.
[14] C. D. Scott and R. D. Nowak. Learning minimum volume sets. Journal of Machine Learning Research, 7:665-704, 2006.
[15] A. G. Gray and A. W. Moore. Very fast multivariate kernel density estimation via computational geometry. Proceedings of the ASA Joint Statistical Meetings, 2003.
[16] J. S. Marron and M. P. Wand. Exact mean integrated squared error. The Annals of Statistics, 20:712-736, 1992.
[17] A. Azzalini and A. W. Bowman. A look at some data on the Old Faithful geyser. Applied Statistics, 39:357-365, 1990.

A Derivation of the rodeo estimators (Gaussian kernel)

A.1 The local likelihood density estimator's closed form: 1-dimensional case

The likelihood is
\[
L(f, x) = \sum_{j=1}^n K\Big(\frac{X_j - x}{h}\Big)\big(a_0 + a_1 (X_j - x)\big) - n \int_{\mathcal X} K\Big(\frac{u - x}{h}\Big) \exp\big(a_0 + a_1 (u - x)\big)\, du .
\]
To find the M.L.E., we need to solve the following two equations:
\[
\frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big) = e^{a_0} \int_{\mathcal X} K\Big(\frac{u - x}{h}\Big) \exp\big(a_1 (u - x)\big)\, du \tag{166}
\]
\[
\frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)(X_i - x) = e^{a_0} \int_{\mathcal X} K\Big(\frac{u - x}{h}\Big)(u - x) \exp\big(a_1 (u - x)\big)\, du . \tag{167}
\]
For Gaussian kernels, we set $K\big(\frac{X_i - x}{h}\big) = \frac{1}{h}\,\phi\big(\frac{X_i - x}{h}\big)$, where $\phi$ is the standard Gaussian density function. Also, we define the kernel density estimator as
\[
\hat f_h(x) = \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big) \tag{168}
\]
and set
\[
\hat g_h(x) = \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)(X_i - x). \tag{169}
\]
Therefore
\[
\frac{\hat f_h(x)}{\hat g_h(x)} = \frac{\int_{\mathcal X} K\big(\frac{u - x}{h}\big) \exp\big(a_1 (u - x)\big)\, du}{\int_{\mathcal X} K\big(\frac{u - x}{h}\big)(u - x) \exp\big(a_1 (u - x)\big)\, du}. \tag{170}
\]
By the method of moment generating functions, we have
\[
\int_{\mathcal X} K\Big(\frac{u - x}{h}\Big) \exp\big(a_1 (u - x)\big)\, du = \exp\Big(\frac{1}{2} a_1^2 h^2\Big) \tag{171}
\]
\[
\int_{\mathcal X} K\Big(\frac{u - x}{h}\Big)(u - x) \exp\big(a_1 (u - x)\big)\, du = \frac{d}{d a_1} \exp\Big(\frac{1}{2} a_1^2 h^2\Big). \tag{172}
\]
Thus we get
\[
\hat a_1 = \hat g_h(x) \big/ \big(h^2 \hat f_h(x)\big). \tag{173}
\]
Plugging $\hat a_1$ into the original equation, we obtain the local linear density estimator
\[
\hat m_h(x) = e^{\hat a_0} = \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big) \exp\Big(-\frac{1}{2} \hat a_1^2 h^2\Big). \tag{174}
\]
Written in an explicit closed form,
\[
\hat m_h(x) = \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big) \exp\left(-\frac{1}{2 h^2}\left(\frac{\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)(X_i - x)}{\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)}\right)^2\right). \tag{175}
\]
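To illustrate how the closed form (175) can be evaluated in practice, here is a minimal Python sketch (not code from the original report; it assumes numpy and the Gaussian kernel $K(u) = \phi(u)/h$ used above, and the function name is ours):

\begin{verbatim}
import numpy as np

def local_likelihood_density_1d(x, X, h):
    """Closed-form local likelihood density estimate at x, following eq. (175).

    Uses the Gaussian kernel K((X_i - x)/h) = phi((X_i - x)/h) / h.
    """
    u = (X - x) / h
    K = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    f_hat = K.mean()                                   # kernel density estimate, eq. (168)
    g_hat = np.mean(K * (X - x))                       # eq. (169)
    a1_hat = g_hat / (h ** 2 * f_hat)                  # eq. (173)
    return f_hat * np.exp(-0.5 * (a1_hat * h) ** 2)    # eq. (174)

rng = np.random.default_rng(0)
sample = rng.normal(size=500)
print(local_likelihood_density_1d(0.0, sample, h=0.3))
\end{verbatim}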

A.2 The local likelihood density estimator's closed form: d-dimensional case

The multivariate likelihood is
\[
L(f, x) = \sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big)\Big(a_0 + \sum_{j=1}^d a_{1j}(X_{ij} - x_j)\Big) - n \int_{\mathcal X} \prod_{j=1}^d K\Big(\frac{u_j - x_j}{h_j}\Big) \exp\Big(a_0 + \sum_{j=1}^d a_{1j}(u_j - x_j)\Big)\, du. \tag{176}
\]
To find the M.L.E., we need to solve the following $d + 1$ equations:
\[
\frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big) = e^{a_0} \int_{\mathcal X} \prod_{j=1}^d K\Big(\frac{u_j - x_j}{h_j}\Big) \exp\Big(\sum_{j=1}^d a_{1j}(u_j - x_j)\Big)\, du \tag{177}
\]
and, for each $m = 1, \ldots, d$,
\[
\frac{1}{n}\sum_{i=1}^n (X_{im} - x_m) \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big) = e^{a_0} \int_{\mathcal X} \prod_{j=1}^d K\Big(\frac{u_j - x_j}{h_j}\Big) \exp\Big(\sum_{j=1}^d a_{1j}(u_j - x_j)\Big)(u_m - x_m)\, du.
\]
Define the kernel density estimator (assuming a product Gaussian kernel)
\[
\hat f(x) = \frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big) = \frac{1}{(\sqrt{2\pi})^d\, n\, h_1 \cdots h_d} \sum_{i=1}^n \exp\Big(-\frac{1}{2}\sum_{j=1}^d \Big(\frac{X_{ij} - x_j}{h_j}\Big)^2\Big). \tag{178}
\]
Also, we set
\[
\hat g_k(x) = \frac{\partial}{\partial x_k}\hat f(x) \tag{179}
\]
\[
= \frac{1}{(\sqrt{2\pi})^d\, n\, h_1 \cdots h_d} \sum_{i=1}^n \exp\Big(-\frac{1}{2}\sum_{j=1}^d \Big(\frac{X_{ij} - x_j}{h_j}\Big)^2\Big)\,\frac{X_{ik} - x_k}{h_k^2}. \tag{180}
\]
Using the same strategy as before,
\[
\hat a_{1j} = \hat g_j(x) \big/ \hat f(x), \qquad j = 1, \ldots, d. \tag{181}
\]

Plugging this into the original equation, we get
\[
\hat m_{h_1, \ldots, h_d}(x) = \hat f(x) \exp\left(-\frac{1}{2}\sum_{k=1}^d h_k^2 \left(\frac{\hat g_k(x)}{\hat f(x)}\right)^2\right). \tag{182}
\]
Written in an explicit closed form,
\[
\hat m_h(x) = \frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big)\, \exp\left(-\frac{1}{2}\sum_{k=1}^d h_k^2 \left(\frac{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{X_{ik} - x_k}{h_k^2}}{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)}\right)^{\!2}\right). \tag{183}
\]

A.3 The derivative for the 1-dimensional local likelihood density estimator

d d 1 m ˆ h (x) =  dh dh n Set fˆ(x) =

1 n

³

Pn

i=1 K

 d m ˆ h (x) = dh

n X

µ K

i=1 Xi −x h

Xi − x h





 1 exp − 2  2h

´ , and gˆ(x) = 

Ã

P

1 n

Pn

i=1 K

!2 

³

n i=1 K

Xi −x h

Pn

³

Xi −x h

i=1 K

´

³

´

2  (Xi − x)    ´

Xi −x h

(184)

( Xhi −x 2 )

d ˆ 1 gˆ(x)  f (x) exp − h2 (185) dh 2 fˆ(x)   !2  !2  Ã ! Ã ! Ã Ã 1 hˆ g (x) d hˆ g (x) g ˆ (x) 1 g ˆ (x) 0 2 2  − fˆ(x) exp − h  = fˆ (x) exp − h 2 2 fˆ(x) fˆ(x) fˆ(x) dh fˆ(x)  Ã Ã !2  Ã Ã ! !! g ˆ (x) 1 g ˆ (x) g ˆ (x) d  fˆ0 (x) − fˆ(x) h × h× = exp − h2 2 fˆ(x) fˆ(x) dh fˆ(x) 40

For the last term, we have
\[
\frac{d}{dh}\left(h \times \frac{\hat g(x)}{\hat f(x)}\right) = \frac{d}{dh}\left(\frac{\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)\big(\frac{X_i - x}{h}\big)}{\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)}\right) = \frac{\frac{1}{nh}\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)\big(\frac{X_i - x}{h}\big)\Big(\big(\frac{X_i - x}{h}\big)^2 - 1\Big)}{\hat f(x)} - \frac{h\, \hat g(x)\, \hat f'(x)}{\hat f^2(x)}, \tag{186}
\]
where
\[
\hat f'(x) = \frac{1}{nh}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)\left(\Big(\frac{X_i - x}{h}\Big)^2 - 1\right). \tag{187}
\]

Combining all of the above results,
\[
\frac{d}{dh}\hat m_h(x) = \exp\left(-\frac{1}{2} h^2 \left(\frac{\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)\frac{X_i - x}{h^2}}{\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)}\right)^2\right)\left[\frac{1}{nh}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)\left(\Big(\frac{X_i - x}{h}\Big)^2 - 1\right) - \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)\Big(\frac{X_i - x}{h}\Big) \cdot \frac{d}{dh}\left(h \times \frac{\hat g(x)}{\hat f(x)}\right)\right], \tag{188}
\]
where
\[
\frac{d}{dh}\left(h \times \frac{\hat g(x)}{\hat f(x)}\right) = \frac{\frac{1}{nh}\sum_{i=1}^n K\big(\frac{X_i - x}{h}\big)\big(\frac{X_i - x}{h}\big)\Big(\big(\frac{X_i - x}{h}\big)^2 - 1\Big)\hat f(x) - h\,\hat g(x)\,\hat f'(x)}{\hat f^2(x)}. \tag{189}
\]
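Because it is easy to make algebraic slips in these bandwidth derivatives, a central finite difference in $h$ provides a quick numerical check on the closed form above. A small self-contained Python sketch (illustrative only; the helper names are ours):

\begin{verbatim}
import numpy as np

def m_hat(x, X, h):
    """1-d closed form of eq. (175), as in the earlier sketch."""
    K = np.exp(-0.5 * ((X - x) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    g_over_f = np.sum(K * (X - x)) / np.sum(K)
    return K.mean() * np.exp(-g_over_f ** 2 / (2.0 * h ** 2))

def dm_dh_numeric(x, X, h, eps=1e-6):
    """Central finite difference in the bandwidth h, for checking eqs. (184)-(189)."""
    return (m_hat(x, X, h + eps) - m_hat(x, X, h - eps)) / (2.0 * eps)

rng = np.random.default_rng(1)
sample = rng.normal(size=400)
print(dm_dh_numeric(0.0, sample, h=0.4))
\end{verbatim}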

A.4 The partial derivative for the d-dimensional local likelihood density estimator

\[
Z_m = \frac{\partial}{\partial h_m}\left[\frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big)\, \exp\left(-\frac{1}{2}\sum_{k=1}^d h_k^2 \left(\frac{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{X_{ik} - x_k}{h_k^2}}{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)}\right)^{\!2}\right)\right].
\]
Here, we set
\[
\hat f(x) = \frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big) = \frac{1}{(\sqrt{2\pi})^d\, n\, h_1 \cdots h_d}\sum_{i=1}^n \exp\Big(-\frac{1}{2}\sum_{j=1}^d \Big(\frac{X_{ij} - x_j}{h_j}\Big)^2\Big) \tag{190}
\]

and
\[
\hat g_k(x) = \frac{\partial}{\partial x_k}\hat f(x) = \frac{1}{(\sqrt{2\pi})^d\, n\, h_1 \cdots h_d}\sum_{i=1}^n \exp\Big(-\frac{1}{2}\sum_{j=1}^d \Big(\frac{X_{ij} - x_j}{h_j}\Big)^2\Big)\,\frac{X_{ik} - x_k}{h_k^2}.
\]
Then
\[
Z_m = \frac{\partial}{\partial h_m}\big(\hat m_{h_1, \ldots, h_d}(x)\big) = \frac{\partial \hat f(x)}{\partial h_m}\, \exp\left(-\frac{1}{2}\sum_{k=1}^d h_k^2 \Big(\frac{\hat g_k(x)}{\hat f(x)}\Big)^2\right) + \hat f(x)\, \exp\left(-\frac{1}{2}\sum_{k=1}^d h_k^2 \Big(\frac{\hat g_k(x)}{\hat f(x)}\Big)^2\right) \times A, \tag{191}
\]
where
\[
A = \frac{\partial}{\partial h_m}\left[-\frac{1}{2}\sum_{k=1}^d h_k^2 \Big(\frac{\hat g_k(x)}{\hat f(x)}\Big)^2\right] = -\sum_{k=1}^d \left(\frac{h_k\, \hat g_k(x)}{\hat f(x)}\right) B_k, \qquad B_k = \frac{\partial}{\partial h_m}\left(\frac{h_k\, \hat g_k(x)}{\hat f(x)}\right). \tag{192}
\]
To evaluate the last term, note that
\[
h_k \times \frac{\hat g_k(x)}{\hat f(x)} = \frac{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{X_{ik} - x_k}{h_k}}{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)}.
\]

If $k \neq m$,
\[
B_k = \frac{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{(X_{im} - x_m)^2}{h_m^3}\,\frac{X_{ik} - x_k}{h_k}}{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)} - \frac{\Big(\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{X_{ik} - x_k}{h_k}\Big)\Big(\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{(X_{im} - x_m)^2}{h_m^3}\Big)}{\Big(\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\Big)^2}. \tag{193}
\]

If $k = m$,
\[
B_m = \frac{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\Big(\frac{(X_{im} - x_m)^3}{h_m^4} - \frac{X_{im} - x_m}{h_m^2}\Big)}{\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)} - \frac{\Big(\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{X_{im} - x_m}{h_m}\Big)\Big(\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\,\frac{(X_{im} - x_m)^2}{h_m^3}\Big)}{\Big(\sum_{i=1}^n \exp\big(-\frac{1}{2}\sum_{j=1}^d \big(\frac{X_{ij} - x_j}{h_j}\big)^2\big)\Big)^2}. \tag{194}
\]

Finally, as we derived before,
\[
\frac{\partial}{\partial h_m}\hat f(x) = \frac{1}{(\sqrt{2\pi})^d\, n\, h_1 \cdots h_m^2 \cdots h_d}\sum_{i=1}^n \exp\Big(-\frac{1}{2}\sum_{j=1}^d \Big(\frac{X_{ij} - x_j}{h_j}\Big)^2\Big)\left(\Big(\frac{X_{im} - x_m}{h_m}\Big)^2 - 1\right). \tag{195}
\]
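The partial derivative $Z_m$ can likewise be sanity-checked against a central finite difference in $h_m$. The following self-contained Python sketch (illustrative only; it restates the closed form (182) and the helper names are ours) does exactly that:

\begin{verbatim}
import numpy as np

def m_hat(x, X, h):
    """d-dimensional closed form of eq. (182)."""
    x, h = np.asarray(x, float), np.asarray(h, float)
    U = (X - x) / h
    w = np.exp(-0.5 * np.sum(U ** 2, axis=1))
    norm = len(X) * (2.0 * np.pi) ** (len(h) / 2.0) * np.prod(h)
    f_hat = w.sum() / norm
    g_hat = (w[:, None] * (X - x) / h ** 2).sum(axis=0) / norm
    return f_hat * np.exp(-0.5 * np.sum((h * g_hat / f_hat) ** 2))

def Z_m_numeric(x, X, h, m, eps=1e-6):
    """Central finite difference in h_m, for checking eqs. (190)-(195)."""
    h_plus, h_minus = np.array(h, dtype=float), np.array(h, dtype=float)
    h_plus[m] += eps
    h_minus[m] -= eps
    return (m_hat(x, X, h_plus) - m_hat(x, X, h_minus)) / (2.0 * eps)

rng = np.random.default_rng(4)
data = rng.uniform(size=(500, 3))
print(Z_m_numeric([0.5, 0.5, 0.5], data, h=[0.3, 0.3, 0.3], m=0))
\end{verbatim}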

A.5 The new semiparametric density estimator's closed form: 1-dimensional case

The estimator is
\[
\hat f_H(x) = \frac{\sum_{i=1}^n K_h(X_i - x)\, \hat f(x)}{n \int K_h(u - x)\, \hat f(u)\, du}. \tag{196}
\]
When both $K_h$ and $\hat f$ are Gaussian, we know that the convolution of two Gaussians is still Gaussian. Assume that
\[
K_h(u) = \frac{1}{\sqrt{2\pi h^2}}\, e^{-\frac{u^2}{2h^2}} \tag{197}
\]
and
\[
\hat f(u) = \frac{1}{\sqrt{2\pi \sigma^2}}\, e^{-\frac{(u - \mu)^2}{2\sigma^2}}. \tag{198}
\]

The denominator is proportional to an $N(\mu, h^2 + \sigma^2)$ density; therefore
\[
n \int_{-\infty}^{\infty} K_h(u - x)\, \hat f(u)\, du = \frac{n}{\sqrt{2\pi(\sigma^2 + h^2)}}\, \exp\left(-\frac{(x - \mu)^2}{2(\sigma^2 + h^2)}\right). \tag{199}
\]

Therefore, we have
\[
\frac{\hat f(x)}{n \int_{-\infty}^{\infty} K_h(u - x)\, \hat f(u)\, du} = \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\big(-\frac{(x - \mu)^2}{2\sigma^2}\big)}{\frac{n}{\sqrt{2\pi(\sigma^2 + h^2)}}\, \exp\big(-\frac{(x - \mu)^2}{2(\sigma^2 + h^2)}\big)} \tag{200}
\]
\[
= \frac{1}{n}\sqrt{1 + \frac{h^2}{\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\sigma^2(\sigma^2 + h^2)/h^2\big)}\right). \tag{201}
\]

The final estimator is
\[
\hat f_H(x) = \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)\sqrt{1 + \frac{h^2}{\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\sigma^2(\sigma^2 + h^2)/h^2\big)}\right) \tag{202}
\]
\[
= \frac{1}{n}\sum_{i=1}^n K\Big(\frac{X_i - x}{h}\Big)\sqrt{1 + \frac{h^2}{\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\frac{\sigma^4}{h^2} + \sigma^2\big)}\right). \tag{203}
\]
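As an illustration of the closed form (202)-(203), here is a minimal Python sketch; fitting the Gaussian reference parameters $\mu$ and $\sigma^2$ by the sample mean and variance is an assumption of this sketch, not something specified above:

\begin{verbatim}
import numpy as np

def semiparametric_density_1d(x, X, h):
    """Closed form of eqs. (202)-(203); N(mu, sigma^2) reference fit from the sample."""
    mu, sigma2 = X.mean(), X.var()
    kde = np.mean(np.exp(-0.5 * ((X - x) / h) ** 2)) / (h * np.sqrt(2.0 * np.pi))
    correction = np.sqrt(1.0 + h ** 2 / sigma2) * np.exp(
        -(x - mu) ** 2 / (2.0 * (sigma2 ** 2 / h ** 2 + sigma2)))
    return kde * correction

rng = np.random.default_rng(2)
sample = rng.normal(0.5, 1.0, size=1000)
print(semiparametric_density_1d(0.5, sample, h=0.2))
\end{verbatim}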

A.6 The new semiparametric density estimator's closed form: d-dimensional case

Assume that
\[
K_h(u) \sim N(0, H), \qquad \hat f(u) = N(\mu, \Sigma), \tag{204}
\]
with
\[
H = \mathrm{diag}(h_1^2, h_2^2, \ldots, h_d^2). \tag{205}
\]
The denominator is proportional to an $N(\mu, H + \Sigma)$ density; therefore
\[
n \int_{-\infty}^{\infty} K_h(u - x)\, \hat f(u)\, du = \frac{n}{\sqrt{(2\pi)^d\, |H + \Sigma|}}\, \exp\left(-\frac{(x - \mu)^T (H + \Sigma)^{-1} (x - \mu)}{2}\right). \tag{206}
\]

Therefore, we have
\[
\frac{\hat f(x)}{n \int_{-\infty}^{\infty} K_h(u - x)\, \hat f(u)\, du} = \frac{\frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\, \exp\big(-\frac{(x - \mu)^T \Sigma^{-1}(x - \mu)}{2}\big)}{\frac{n}{\sqrt{(2\pi)^d |H + \Sigma|}}\, \exp\big(-\frac{(x - \mu)^T (H + \Sigma)^{-1}(x - \mu)}{2}\big)} \tag{207}
\]
\[
= \frac{1}{n}\sqrt{\frac{|H + \Sigma|}{|\Sigma|}}\, \exp\left(-\frac{(x - \mu)^T\big(\Sigma^{-1} - (H + \Sigma)^{-1}\big)(x - \mu)}{2}\right). \tag{208}
\]

The final estimator is
\[
\hat f_H(x) = \frac{1}{n}\sum_{i=1}^n \prod_{j=1}^d K\Big(\frac{X_{ij} - x_j}{h_j}\Big)\sqrt{\frac{|H + \Sigma|}{|\Sigma|}}\, \exp\left(-\frac{(x - \mu)^T\big(\Sigma^{-1} - (H + \Sigma)^{-1}\big)(x - \mu)}{2}\right). \tag{209}
\]

A.7 The derivative for the 1-dimensional semiparametric density estimator

Assuming that $\tilde f(x)$ is the original kernel density estimator,
\[
\frac{d}{dh}\hat f_H(x) = \tilde f'(x)\sqrt{1 + \frac{h^2}{\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\sigma^2(\sigma^2 + h^2)/h^2\big)}\right) + \tilde f(x)\Big(1 + \frac{h^2}{\sigma^2}\Big)^{-1/2}\frac{h}{\sigma^2}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\sigma^2(\sigma^2 + h^2)/h^2\big)}\right) \tag{210}
\]
\[
\quad + \tilde f(x)\sqrt{1 + \frac{h^2}{\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\sigma^2(\sigma^2 + h^2)/h^2\big)}\right)\frac{(x - \mu)^2 \cdot h}{2(\sigma^2 + h^2)^2}. \tag{211}
\]
Finally,
\[
\frac{d}{dh}\hat f_H(x) = \sqrt{1 + \frac{h^2}{\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\big(\sigma^2(\sigma^2 + h^2)/h^2\big)}\right)\left[\tilde f'(x) + \frac{h}{\sigma^2 + h^2}\, \tilde f(x) + \frac{(x - \mu)^2 \cdot h}{2(\sigma^2 + h^2)^2}\, \tilde f(x)\right].
\]
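As in the local likelihood case, this closed-form derivative can be checked against a central finite difference. A short illustrative Python sketch (the Gaussian reference is fit by the sample mean and variance, which is our assumption; the helper names are ours):

\begin{verbatim}
import numpy as np

def f_H(x, X, h):
    """1-d semiparametric estimate, eqs. (202)-(203); reference N(mu, sigma^2) fit from the sample."""
    mu, s2 = X.mean(), X.var()
    kde = np.mean(np.exp(-0.5 * ((X - x) / h) ** 2)) / (h * np.sqrt(2.0 * np.pi))
    return kde * np.sqrt(1.0 + h ** 2 / s2) * np.exp(
        -(x - mu) ** 2 / (2.0 * (s2 ** 2 / h ** 2 + s2)))

def dfH_dh_numeric(x, X, h, eps=1e-6):
    """Central finite difference in h, for checking the closed-form derivative above."""
    return (f_H(x, X, h + eps) - f_H(x, X, h - eps)) / (2.0 * eps)

rng = np.random.default_rng(5)
sample = rng.normal(0.5, 1.0, size=1000)
print(dfH_dh_numeric(0.5, sample, h=0.2))
\end{verbatim}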

A.8 The derivative for the d-dimensional semiparametric density estimator

To make the derivation easier, we assume that $\Sigma$ is diagonal, that is, $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$. Therefore, we get
\[
\sqrt{\frac{|H + \Sigma|}{|\Sigma|}} = \sqrt{\prod_{j=1}^d \Big(1 + \frac{h_j^2}{\sigma_j^2}\Big)} \tag{212}
\]
and also
\[
\Sigma^{-1} - (H + \Sigma)^{-1} = \mathrm{diag}\left(\frac{h_1^2}{\sigma_1^2(\sigma_1^2 + h_1^2)}, \frac{h_2^2}{\sigma_2^2(\sigma_2^2 + h_2^2)}, \ldots, \frac{h_d^2}{\sigma_d^2(\sigma_d^2 + h_d^2)}\right). \tag{213}
\]

Assuming that $\tilde f(x)$ is the multivariate kernel density estimator as before,
\[
\hat f_H(x) = \tilde f(x)\sqrt{\prod_{j=1}^d \Big(1 + \frac{h_j^2}{\sigma_j^2}\Big)}\, \exp\left(-\sum_{j=1}^d \frac{(x_j - \mu_j)^2}{2\big(\sigma_j^2(\sigma_j^2 + h_j^2)/h_j^2\big)}\right). \tag{214}
\]

To calculate the derivative,
\[
\frac{\partial}{\partial h_m}\hat f_H(x) = \tilde f'(x)\sqrt{\prod_{j=1}^d \Big(1 + \frac{h_j^2}{\sigma_j^2}\Big)}\, \exp\left(-\sum_{j=1}^d \frac{(x_j - \mu_j)^2}{2\big(\sigma_j^2(\sigma_j^2 + h_j^2)/h_j^2\big)}\right)
\]
\[
\quad + \tilde f(x)\Big(1 + \frac{h_m^2}{\sigma_m^2}\Big)^{-1/2}\frac{h_m}{\sigma_m^2}\prod_{j \neq m}\sqrt{1 + \frac{h_j^2}{\sigma_j^2}}\, \exp\left(-\sum_{j=1}^d \frac{(x_j - \mu_j)^2}{2\big(\sigma_j^2(\sigma_j^2 + h_j^2)/h_j^2\big)}\right)
\]
\[
\quad + \tilde f(x)\sqrt{\prod_{j=1}^d \Big(1 + \frac{h_j^2}{\sigma_j^2}\Big)}\, \exp\left(-\sum_{j=1}^d \frac{(x_j - \mu_j)^2}{2\big(\sigma_j^2(\sigma_j^2 + h_j^2)/h_j^2\big)}\right)\frac{(x_m - \mu_m)^2 \cdot h_m}{2(\sigma_m^2 + h_m^2)^2}. \tag{215}
\]
To organize it into a simpler form,
\[
\frac{\partial}{\partial h_m}\hat f_H(x) = \sqrt{\prod_{j=1}^d \Big(1 + \frac{h_j^2}{\sigma_j^2}\Big)}\, \exp\left(-\sum_{j=1}^d \frac{(x_j - \mu_j)^2}{2\big(\sigma_j^2(\sigma_j^2 + h_j^2)/h_j^2\big)}\right)\left[\tilde f'(x) + \frac{h_m}{\sigma_m^2 + h_m^2}\, \tilde f(x) + \frac{(x_m - \mu_m)^2 \cdot h_m}{2(\sigma_m^2 + h_m^2)^2}\, \tilde f(x)\right].
\]
Here $\tilde f'(x)$ stands for $\partial \tilde f(x)/\partial h_m$, which was computed in (195).
