Consistency, Efficiency and Robustness of Conditional Disparity Methods

Giles Hooker and Anand Vidyashankar

November 18, 2012

Abstract

This report demonstrates the consistency and asymptotic normality of minimum-disparity estimators based on conditional density estimates. In particular, it provides $L_1$ consistency results for conditional density estimators, both unrestricted and under homoscedastic model restrictions. It defines a generic formulation of disparity estimates in conditionally-specified models and demonstrates their asymptotic consistency and normality. In particular, for regression models with more than one continuous response or covariate, we demonstrate that disparity estimators based on unrestricted conditional density estimates have a bias that is larger than $n^{-1/2}$; however, for univariate homoscedastic models conditioned on a univariate continuous covariate, this bias can be removed.
1 Framework and Assumptions
Throughout the following, we assume that we observe $\{X_{n1}(\omega), X_{n2}(\omega), Y_{n1}(\omega), Y_{n2}(\omega),\ n \geq 1\}$, i.i.d. random variables where $X_{n1}(\omega) \in \mathbb{R}^{d_x}$, $X_{n2}(\omega) \in S_x$, $Y_{n1}(\omega) \in \mathbb{R}^{d_y}$, $Y_{n2}(\omega) \in S_y$ for countable sets $S_x$ and $S_y$, with joint distribution
$$g(x_1,x_2,y_1,y_2) = P(X_2 = x_2, Y_2 = y_2)\,P(X_1 \in dx_1, Y_1 \in dy_1 \mid x_2, y_2),$$
and define the marginal and conditional densities
$$h(x_1,x_2) = \sum_{y_2 \in S_y} \int g(x_1,x_2,y_1,y_2)\,dy_1, \qquad f(y_1,y_2|x_1,x_2) = \frac{g(x_1,x_2,y_1,y_2)}{h(x_1,x_2)}$$
on the support of $(x_1,x_2)$. Further, for a scalar $Y_{n1} \in \mathbb{R}$, removing $Y_{n2}$ (or setting it to be identically 0) we define the conditional expectation
$$m(x_1,x_2) = \int y_1 f(y_1|x_1,x_2)\,dy_1$$
and note that in homoscedastic regression models we write, for some density $f^*(e)$,
$$f(y_1|x_1,x_2) = f^*(y_1 - m(x_1,x_2)). \qquad (1.1)$$
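For concreteness, one hypothetical instance of this setup: $X_{n1}$ a continuous covariate such as age ($d_x = 1$), $X_{n2}$ a binary treatment indicator ($S_x = \{0,1\}$), $Y_{n1}$ a continuous response and $Y_{n2}$ a discrete side outcome. Then $h$ is the joint density of age and treatment, and $f$ the conditional density of the two responses given both.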
Within this context, the use of disparity methods to estimate parameters in linear regression was treated in Pak and Basu (1998) from the point of view of placing a disparity on the score equations. We take a considerably more direct approach here. The following regularity structures may be assumed in the theorems below:

(D1) $g$ is bounded and continuous in $x_1$ and $y_1$.

(D2) All third derivatives of $g$ with respect to $x_1$ and $y_1$ exist, are continuous and bounded.

(D3) The support of $h$, $\mathcal{X} \subset \mathbb{R}^{d_x} \otimes S_x$, is compact and $h^- = \inf_{(x_1,x_2) \in \mathcal{X}} h(x_1,x_2) > 0$.
We note that under these conditions, continuity of $h$ and $f$ in $x_1$ and $y_1$ is inherited from $g$. In the case of homoscedastic models we also assume

(E1) $\int |\nabla_e f^*(e)|\,de < \infty$.

We will form kernel estimates for these quantities as follows:
$$\hat g_n(x_1,x_2,y_1,y_2,\omega) = \frac{1}{n c_{nx_2}^{d_x} c_{ny_2}^{d_y}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) K_y\!\left(\frac{y_1 - Y_{i1}(\omega)}{c_{ny_2}}\right) I_{x_2}(X_{i2}(\omega))\, I_{y_2}(Y_{i2}(\omega)) \qquad (1.2)$$
$$\hat h_n(x_1,x_2,\omega) = \frac{1}{n c_{nx_2}^{d_x}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega)) \qquad (1.3)$$
$$\phantom{\hat h_n(x_1,x_2,\omega)} = \sum_{y_2 \in S_y} \int_{\mathbb{R}^{d_y}} \hat g_n(x_1,x_2,y_1,y_2,\omega)\,dy_1 \qquad (1.4)$$
$$\hat f_n(y_1,y_2|x_1,x_2,\omega) = \frac{\hat g_n(x_1,x_2,y_1,y_2,\omega)}{\hat h_n(x_1,x_2,\omega)}, \qquad (1.5)$$
where $K_x$ and $K_y$ are densities on the spaces $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$ respectively. Further conditions on these are detailed below. In the case of homoscedastic regression models we further define a homoscedastic conditional density via the Nadaraya-Watson estimator:
$$\hat m_n(x_1,x_2,\omega) = \frac{\sum_{i=1}^n Y_{i1}(\omega)\, K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega))}{\sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) I_{x_2}(X_{i2}(\omega))} \qquad (1.6)$$
$$\hat f_n^*(e,\omega) = \frac{1}{n c_{ny_2}^{d_y}} \sum_{i=1}^n K_y\!\left(\frac{e - (Y_i(\omega) - \hat m_n(X_{i1}(\omega), X_{i2}(\omega)))}{c_{ny_2}}\right) \qquad (1.7)$$
$$\tilde f_n(y_1|x_1,x_2,\omega) = \hat f_n^*(y_1 - \hat m_n(x_1,x_2,\omega), \omega), \qquad (1.8)$$
with the notation $c_{ny_2}$ maintained as a bandwidth for the sake of consistency.
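As a concrete illustration of (1.2)-(1.5), the following minimal Python sketch evaluates the mixed-type conditional density estimate at a point. The Gaussian product kernel and all variable names are our own illustrative choices, not part of the development above.

```python
import numpy as np

def gaussian_kernel(u):
    """Product Gaussian kernel: a density satisfying K1-K6."""
    u = np.atleast_2d(u)
    return np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (u.shape[1] / 2)

def f_hat(y1, y2, x1, x2, X1, X2, Y1, Y2, cx, cy):
    """Conditional density estimate (1.5) = g_hat (1.2) / h_hat (1.3).

    X1: (n, dx) and Y1: (n, dy) continuous coordinates; X2, Y2: (n,) discrete
    coordinates; cx, cy: bandwidths c_{nx2}, c_{ny2}.
    """
    n, dx = X1.shape
    dy = Y1.shape[1]
    Kx = gaussian_kernel((x1 - X1) / cx)          # K_x((x1 - X_i1)/c_{nx2})
    Ky = gaussian_kernel((y1 - Y1) / cy)          # K_y((y1 - Y_i1)/c_{ny2})
    Ix = (X2 == x2).astype(float)                 # I_{x2}(X_i2)
    Iy = (Y2 == y2).astype(float)                 # I_{y2}(Y_i2)
    g_hat = np.sum(Kx * Ky * Ix * Iy) / (n * cx ** dx * cy ** dy)   # (1.2)
    h_hat = np.sum(Kx * Ix) / (n * cx ** dx)                        # (1.3)
    return g_hat / h_hat                                            # (1.5)
```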
For the sake of notational compactness, we define measures $\nu$ and $\mu$ on $\mathbb{R}^{d_x} \otimes S_x$ and $\mathbb{R}^{d_y} \otimes S_y$ respectively, given by the product of counting and Lebesgue measure, and define $y = (y_1,y_2)$ and $x = (x_1,x_2)$. Where needed, we will write for any function $F(x_1,x_2,y_1,y_2)$,
$$\sum_{x_2 \in S_x} \sum_{y_2 \in S_y} \int\!\!\int F(x_1,x_2,y_1,y_2)\,dx_1\,dy_1 = \int\!\!\int F(x,y)\,d\mu(y)\,d\nu(x). \qquad (1.9)$$
Throughout we will attempt to keep the notation explicit where possible and, in particular, will not make use of this convention in Sections 2 and 3, where the distinction between discrete-valued and continuous-valued random variables is particularly important.

Throughout we make the following assumptions on the kernels $K_x$ and $K_y$:

(K1) $K_x$ and $K_y$ are densities on $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$ respectively.

(K2) For some finite $K^+$, $\sup_{x_1 \in \mathbb{R}^{d_x}} K_x(x_1) < K^+$ and $\sup_{y_1 \in \mathbb{R}^{d_y}} K_y(y_1) < K^+$.

(K3) $\|x_1\|^{2d_x} K_x(x_1) \to 0$ as $\|x_1\| \to \infty$ and $\|y_1\|^{2d_y} K_y(y_1) \to 0$ as $\|y_1\| \to \infty$.

(K4) $K_x(x_1) = K_x(-x_1)$ and $K_y(y_1) = K_y(-y_1)$.

(K5) $\int \|x_1\|^2 K_x(x_1)\,dx_1 < \infty$ and $\int \|y_1\|^2 K_y(y_1)\,dy_1 < \infty$.

(K6) $K_x$ and $K_y$ have bounded variation and finite modulus of continuity.

We also assume the following properties of the bandwidths. These will be given in terms of the number of observations falling at each combination of values of the discrete variables:
$$n(x_2) = \sum_{i=1}^n I_{x_2}(X_{2i}(\omega)), \qquad n(y_2) = \sum_{i=1}^n I_{y_2}(Y_{2i}(\omega)), \qquad n(x_2,y_2) = \sum_{i=1}^n I_{x_2}(X_{2i}(\omega))\, I_{y_2}(Y_{2i}(\omega)),$$
where the sum is taken to be over all observations in the case that $X_2$ or $Y_2$ are singletons.

(B1) $c_{nx_2} \to 0$, $c_{ny_2} \to 0$.

(B2) $n(x_2)\, c_{nx_2}^{d_x} \to \infty$ for all $x_2 \in S_x$ and $n(x_2,y_2)\, c_{nx_2}^{d_x} c_{ny_2}^{d_y} \to \infty$ for all $(x_2,y_2) \in S_x \otimes S_y$.

(B3) $n(x_2)\, c_{nx_2}^{2d_x} \to \infty$.

(B4) $n(x_2,y_2)\, c_{nx_2}^{2d_x} c_{ny_2}^{2d_y} \to \infty$.

(B5) $\sum_{n(x_2)=1}^\infty c_{nx_2}^{-d_x} e^{-\gamma n(x_2) c_{nx_2}^{d_x}} < \infty$ for all $\gamma > 0$.

(B6) $n(y_2)\, c_{ny_2}^4 \to 0$ if $d_y = 1$ and $n(x_2)\, c_{nx_2}^4 \to 0$ if $d_x = 1$.
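As an illustration of how these conditions interact, consider a single continuous covariate ($d_x = 1$) with a bandwidth of the hypothetical form $c_{nx_2} = n(x_2)^{-1/3}$. Then B1 is immediate; $n(x_2)\, c_{nx_2} = n(x_2)^{2/3} \to \infty$ gives the first part of B2; $n(x_2)\, c_{nx_2}^2 = n(x_2)^{1/3} \to \infty$ gives B3; the B5 summand is $n(x_2)^{1/3} e^{-\gamma n(x_2)^{2/3}}$, which decays faster than any polynomial and is therefore summable; and $n(x_2)\, c_{nx_2}^4 = n(x_2)^{-1/3} \to 0$ gives the undersmoothing condition B6.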
2 Consistency Results for Conditional Densities over Spaces of Mixed Types
In this section we will provide a number of $L_1$ consistency results for kernel estimates of densities and conditional densities of multivariate random variables in which some coordinates take values in Euclidean space while others take values on a discrete set. Pointwise consistency of conditional density estimates of this form can be found in, for example, Li and Racine (2007). However, we are unaware of equivalent $L_1$ results, which will be necessary for our development of conditional disparity-based inference. Throughout, we have assumed that both the conditioning variable $x$ and the response $y$ are multivariate, with both types of coordinates. The specialization to univariate models, or to models with only discrete or only continuous variables in either $x$ or $y$ (and to unconditional densities), is readily seen to be covered by our results as well.

We begin with a result on densities for random variables taking discrete values.

Theorem 2.1. Let $\{X_n : n \geq 1\}$ denote a collection of i.i.d. discrete random variables with support $S$ where $S$ is a countable set. Let $f(t) = P(X_1 = t)$, $t \in S$, and
$$f_n(t,\omega) = \frac{1}{n} \sum_{k=1}^n I_{\{t\}}(X_k(\omega));$$
then there exists a set $B$ with $P(B) = 1$ such that for $\omega \in B$,
$$\lim_{n\to\infty} \sum_{t\in S} |f(t) - f_n(t,\omega)| = 0.$$
Proof. By the strong law of large numbers there exists a set $A_t$ such that $P(A_t) = 0$ and on $A_t^c$, $f_n(t,\omega)$ converges to $f(t)$ pointwise. Let $B = \cap_{t\in S} A_t^c$. Then $P(B) = 1$ and for $\omega \in B$, $f_n(t,\omega)$ converges to $f(t)$ pointwise for all $t$.

Notice that $\sum_{t\in S} f_n(t,\omega) = 1$. Let $D_n = \{t \in S : f(t) \geq f_n(t,\omega)\}$. Hence it follows that
$$0 = \sum_{t\in S}(f(t) - f_n(t,\omega)) = \sum_{t\in S}(f(t) - f_n(t,\omega))\, I_{D_n}(t) + \sum_{t\in S}(f(t) - f_n(t,\omega))\, I_{D_n^c}(t).$$
Hence,
$$\sum_{t\in S}(f(t) - f_n(t,\omega))\, I_{D_n}(t) = \sum_{t\in S}(f_n(t,\omega) - f(t))\, I_{D_n^c}(t). \qquad (2.10)$$
Now,
$$\sum_{t\in S}|f(t) - f_n(t,\omega)| = \sum_{t\in S}|f(t) - f_n(t,\omega)|\, I_{D_n}(t) + \sum_{t\in S}|f(t) - f_n(t,\omega)|\, I_{D_n^c}(t) = \sum_{t\in S}(f(t) - f_n(t,\omega))\, I_{D_n}(t) + \sum_{t\in S}(f_n(t,\omega) - f(t))\, I_{D_n^c}(t).$$
Hence, using (2.10), we see that
$$\sum_{t\in S}|f(t) - f_n(t,\omega)| = 2\sum_{t\in S}(f(t) - f_n(t,\omega))\, I_{D_n}(t) = 2\int_S (f(t) - f_n(t,\omega))\, I_{D_n}(t)\,d\nu(t),$$
where $\nu(\cdot)$ is the counting measure. Now fix $\omega \in B$ and set $a_n(t,\omega) = (f(t) - f_n(t,\omega))\, I_{D_n}(t)$. Notice that $0 \leq a_n(t,\omega) \leq f(t)$ and $\int_S f(t)\,d\nu(t) = 1 < \infty$. Hence, by the dominated convergence theorem applied to the counting measure space $S$, we see that
$$\lim_{n\to\infty} \sum_{t\in S}|f(t) - f_n(t,\omega)| = \lim_{n\to\infty} 2\int_S (f(t) - f_n(t,\omega))\, I_{D_n}(t)\,d\nu(t) = 2\int_S \lim_{n\to\infty}(f(t) - f_n(t,\omega))\, I_{D_n}(t)\,d\nu(t) = 0.$$
Since $P(B) = 1$ and $\lim_{n\to\infty} \sum_{t\in S}|f(t) - f_n(t,\omega)| = 0$, we have almost sure $L_1$-consistency of the empirical probabilities.
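Theorem 2.1 is easy to check numerically. The following sketch is our own illustration, using a geometric pmf as a stand-in for an arbitrary countable support; the $L_1$ error of the empirical probabilities decays as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # true pmf f(t) = p (1 - p)^t on the countable support {0, 1, 2, ...}
for n in [100, 1000, 10000]:
    X = rng.geometric(p, size=n) - 1                # i.i.d. draws from f
    tmax = X.max()
    f = p * (1 - p) ** np.arange(tmax + 1)          # f(t) for observed t
    fn = np.bincount(X, minlength=tmax + 1) / n     # empirical f_n(t)
    # the unobserved tail t > tmax contributes sum_{t > tmax} f(t) = (1-p)^(tmax+1)
    l1 = np.abs(f - fn).sum() + (1 - p) ** (tmax + 1)
    print(n, l1)
```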
Using this, we can now provide the $L_1$ convergence of densities of mixed types; pointwise and uniform convergence are included for completeness.

Theorem 2.2. Let $\{(X_{n1}, X_{n2}),\ n \geq 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\lim_{n\to\infty} \sum_{x_2\in S_x}\int_{\mathbb{R}^{d_x}} \left|\hat h_n(x_1,x_2,\omega) - h(x_1,x_2)\right| dx_1 = 0. \qquad (2.11)$$
For almost all $(x_1,x_2) \in \mathbb{R}^{d_x}\times S_x$, there exists a set $B_{(x_1,x_2)}$ with $P(B_{(x_1,x_2)}) = 1$ such that for all $\omega \in B_{(x_1,x_2)}$,
$$\hat h_n(x_1,x_2,\omega) \to h(x_1,x_2). \qquad (2.12)$$
Further, under Assumptions D3 and B5 there exists a set $B_s$ with $P(B_s) = 1$ such that for all $\omega \in B_s$,
$$\sup_{(x_1,x_2)\in\mathcal{X}} \left|\hat h_n(x_1,x_2,\omega) - h(x_1,x_2)\right| \to 0. \qquad (2.13)$$

Proof. We first re-write
$$\hat h_n(x_1,x_2,\omega) = \frac{n_{x_2}(\omega)}{n} \cdot \frac{1}{n_{x_2}(\omega)\, c_{nx_2}^{d_x}} \sum_{X_{i2}(\omega) = x_2} K_x\!\left(\frac{x_1 - X_{i1}(\omega)}{c_{nx_2}}\right) = \hat p_{x_2}(\omega)\,\tilde h_{x_2}(x_1,\omega)$$
and observe that for every $x_2$ there is a set $A_{x_2}$ with $P(A_{x_2}) = 1$ such that for $\omega \in A_{x_2}$, $n_{x_2}(\omega) \to \infty$, and hence from Devroye and Györfi (1985, Chap. 3, Theorem 1) there is a set $C_{x_2} \subset A_{x_2}$ with $P(C_{x_2}) = 1$ such that
$$\int \left|\tilde h_{x_2}(x_1,\omega) - h(x_1|x_2)\right| dx_1 \to 0.$$
We can now write
$$\sum_{x_2\in S_x}\int_{\mathbb{R}^{d_x}} \left|\hat h_n(x_1,x_2,\omega) - h(x_1,x_2)\right| dx_1 \leq \sum_{x_2\in S_x} p(x_2) \int_{\mathbb{R}^{d_x}} \left|\tilde h_{x_2}(x_1,\omega) - h(x_1|x_2)\right| dx_1 + \sum_{x_2\in S_x} |\hat p(x_2,\omega) - p(x_2)|,$$
where the second term converges for $\omega \in D$ with $P(D) = 1$ from Theorem 2.1, and the first term converges on $E = \cap_{x_2\in S_x} C_{x_2}$ from the dominated convergence theorem applied with the observation that
$$\sum_{x_2\in S_x} p(x_2) \int_{\mathbb{R}^{d_x}} \left|\tilde h_{x_2}(x_1,\omega) - h(x_1|x_2)\right| dx_1 \leq \sum_{x_2\in S_x} p(x_2) \int_{\mathbb{R}^{d_x}} \left(\tilde h_{x_2}(x_1,\omega) + h(x_1|x_2)\right) dx_1 \leq 2.$$
Hence (2.11) holds for $\omega \in B = D \cap E$ with $P(B) = 1$.

To demonstrate (2.12), we observe that (2.11) requires that for almost all $(x_1,x_2)$, $\hat h_n(x_1,x_2,\omega) - h(x_1,x_2) \to 0$. Finally, (2.13) follows by the finiteness of $S_x$ under Assumption D3 and the uniform convergence of kernel densities under Assumption B5 (see, e.g., Nadaraya, 1989, Chap. 4).

The next theorem demonstrates convergence that is uniformly $L_1$; that is, the $L_1$ distance measured in $(y_1,y_2)$ at each $(x_1,x_2)$ converges uniformly over $(x_1,x_2)$. We begin by considering only the continuous-valued random variables, where $g(x_1,y_1)$ and $\hat g_n(x_1,y_1)$ will indicate the corresponding density and kernel density estimates.

Theorem 2.3. Let $\{(X_{n1}, Y_{n1}),\ n \geq 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2, for almost all $x_1$ there exists a set $B_{x_1}$ with $P(B_{x_1}) = 1$ such that for all $\omega \in B_{x_1}$,
$$\int |\hat g_n(x_1,y_1,\omega) - g(x_1,y_1)|\,dy_1 \to 0. \qquad (2.14)$$
Further, if Assumptions D3 and B5 hold, there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sup_{x_1\in\mathcal{X}} \int |\hat g_n(x_1,y_1,\omega) - g(x_1,y_1)|\,dy_1 \to 0. \qquad (2.15)$$

Proof. For (2.14) we observe that
$$\int\!\!\int |\hat g_n(x_1,y_1,\omega) - g(x_1,y_1)|\,dy_1\,dx_1 = \int T(x_1)\,dx_1 \to 0$$
almost surely with $T(x_1) \geq 0$; see Devroye and Györfi (1985, Chap. 3, Theorem 1). Thus $T(x_1) \to 0$ for almost all $x_1$.
Turning to (2.15), we first demonstrate the result for the expectation of $\hat g_n(x_1,y_1)$. We observe that we can choose $b_{n(y_2)} \to \infty$ with $b_{n(y_2)} c_{ny_2} \to 0$ so that
$$\sup_{x_1\in\mathcal{X}} \int |E\hat g_n(x_1,y_1) - g(x_1,y_1)|\,dy_1 = \sup_{x_1\in\mathcal{X}} \int \left| \int\!\!\int K_x(u)\, K_y(v)\, g(x_1 + c_{nx_2}u,\, y_1 + c_{ny_2}v)\,du\,dv - g(x_1,y_1) \right| dy_1$$
$$\leq \sup_{x_1\in\mathcal{X}} \sup_{x_1'\in\mathcal{X}} \sup_{y_1' : \|y_1' - y_1\| \leq c_{ny_2} b_{n(y_2)}} \int |g(x_1',y_1') - g(x_1,y_1)|\,dy_1 + K_y(b_{n(y_2)}) \sup_{x_1\in\mathcal{X},\,y_1\in\mathbb{R}^{d_y}} g(x_1,y_1)$$
$$\leq M b_n^{d_y} c_{ny_2}^{d_y} + K_y(b_{n(y_2)}) \to 0.$$
We now observe that since $\mathcal{X}$ is compact, it can be given a disjoint covering. Further, since the modulus of continuity of $K_x$ is finite, for any $\epsilon$ we can take a covering
$$\mathcal{X} = \bigcup_{j=1}^{N_n} A_j$$
and define
$$(a_{n1}(x_1), \ldots, a_{nN_n}(x_1)) = \arg\min_{a_1,\ldots,a_{N_n}} \sup_{u\in\mathcal{X}} \left| \frac{1}{c_{nx_2}^{d_x}} K_x\!\left(\frac{u - x_1}{c_{nx_2}}\right) - \sum_{j=1}^{N_n} a_j(x_1)\, I_{A_j}(u) \right|$$
and
$$K_n^+(u, x_1) = \sum_{j=1}^{N_n} a_{jn}(x_1)\, I_{A_j}(u)$$
such that
$$\sup_{x_1\in\mathcal{X}} \sup_{u\in\mathcal{X}} \left| \frac{1}{c_{nx_2}^{d_x}} K_x\!\left(\frac{u - x_1}{c_{nx_2}}\right) - K_n^+(u, x_1) \right| < \frac{\epsilon}{3}$$
when $N_n = C_\epsilon / c_{nx_2}^{d_x}$ for some constant $C_\epsilon$ depending on $\epsilon$, where we also observe that for any $j$, $\sup_{x_1} a_{jn}(x_1) < K^+ / c_{nx_2}^{d_x}$. From here we define
$$\hat g_n^*(x,y,\omega) = \frac{1}{n c_{ny_2}^{d_y}} \sum_{i=1}^n K_n^+(x_1, X_{1i}(\omega))\, K_y\!\left(\frac{y_1 - Y_{1i}(\omega)}{c_{ny_2}}\right)$$
and observe that
$$\sup_{x_1\in\mathcal{X}} \int |\hat g_n(x_1,y_1,\omega) - E\hat g_n(x_1,y_1)|\,dy_1 \leq \sup_{x_1\in\mathcal{X}} \int |\hat g_n(x_1,y_1,\omega) - \hat g_n^*(x_1,y_1,\omega)|\,dy_1 + \sup_{x_1\in\mathcal{X}} \int |E\hat g_n(x_1,y_1) - E\hat g_n^*(x_1,y_1)|\,dy_1 + \sup_{x_1\in\mathcal{X}} \int |\hat g_n^*(x_1,y_1,\omega) - E\hat g_n^*(x_1,y_1)|\,dy_1$$
$$\leq \frac{2\epsilon}{3} + \sup_{x_1\in\mathcal{X}} \int |\hat g_n^*(x_1,y_1,\omega) - E\hat g_n^*(x_1,y_1)|\,dy_1.$$
We thus only need to demonstrate that
$$\sum_n P\!\left( \sup_{x_1\in\mathcal{X}} \int |\hat g_n^*(x_1,y_1,\omega) - E\hat g_n^*(x_1,y_1)|\,dy_1 > \epsilon \right) < \infty$$
and the result will follow from the Borel-Cantelli lemma. To do this we observe that $\hat g_n^*(x_1,y_1,\omega)$ is constant in $x_1$ over the partition sets $A_j$ and thus
$$\sup_{x_1\in\mathcal{X}} \int |\hat g_n^*(x_1,y_1,\omega) - E\hat g_n^*(x_1,y_1)|\,dy_1 = \max_{j\in 1,\ldots,N_n} \int |\hat g_n^*(x_{1j},y_1,\omega) - E\hat g_n^*(x_{1j},y_1)|\,dy_1 = \max_j q_j((X_{11}(\omega),Y_{11}(\omega)),\ldots,(X_{n1}(\omega),Y_{n1}(\omega)))$$
for any $x_{1j} \in A_j$. Then
$$|q_j((x_{11},y_{11}),\ldots,(x_{n1},y_{n1})) - q_j((x_{11}',y_{11}'),\ldots,(x_{n1},y_{n1}))| \leq \frac{1}{n c_{ny_2}^{d_y}} \int \left| a_{jn}(x_{11})\, K_y\!\left(\frac{y_1 - y_{11}}{c_{ny_2}}\right) - a_{jn}(x_{11}')\, K_y\!\left(\frac{y_1 - y_{11}'}{c_{ny_2}}\right) \right| dy_1 \leq \frac{2 \sup_{x_1} a_{jn}(x_1)}{n} \leq \frac{2K^+}{n c_{nx_2}^{d_x}}.$$
Hence, by the bounded difference inequality,
$$P\!\left( |q_j((X_{11},Y_{11}),\ldots,(X_{n1},Y_{n1})) - Eq_j((X_{11},Y_{11}),\ldots,(X_{n1},Y_{n1}))| > \frac{\epsilon}{2} \right) < 2 e^{-\epsilon^2 n c_{nx_2}^{2d_x} / (8K^{+2})}.$$
We now observe that
$$Eq_j((X_{11}(\omega),Y_{11}(\omega)),\ldots,(X_{n1}(\omega),Y_{n1}(\omega))) \leq \int \min\!\left( E\hat g_n^*(y_1,x_{1j}),\; E\!\left[(\hat g_n^*(y_1,x_{1j}) - E\hat g_n^*(y_1,x_{1j}))^2\right]^{1/2} \right) dy_1 \leq C_g\, n^{-1/2} c_{ny_2}^{-d_y/2} c_{nx_2}^{-d_x}$$
for some constant $C_g$, since
$$E(\hat g_n^*(y_1,x_{1j}) - E\hat g_n^*(y_1,x_{1j}))^2 \leq \frac{1}{n c_{ny_2}^{d_y}} \int a_{jn}(x_1)^2\, K_y(u)^2\, g(x_1, y_1 + c_{ny_2}v)\,du\,dv.$$
Collecting terms, for $n$ sufficiently large, $C_g\, n^{-1/2} c_{ny_2}^{-d_y/2} c_{nx_2}^{-d_x} < \epsilon/2$ by Assumption B4, and
$$P\!\left( \sup_{x_1\in\mathcal{X}} \int |\hat g_n^*(x_1,y_1,\omega) - E\hat g_n^*(x_1,y_1)|\,dy_1 > \epsilon \right) \leq 2N_n\, e^{-\epsilon^2 n c_{nx_2}^{2d_x} / (8K^{+2})} \leq \frac{2C_\epsilon}{c_{nx_2}^{d_x}}\, e^{-\epsilon^2 n c_{nx_2}^{2d_x} / (8K^{+2})},$$
which is summable under Assumption B5.
We can now readily extend the above result to the mixed data-types case.

Theorem 2.4. Let $\{(X_{n1},X_{n2},Y_{n1},Y_{n2}),\ n\geq 1\}$ be given as in Section 1. Under Assumptions D1-D2, K1-K6 and B1-B2, for almost all $(x_1,x_2)$ there exists a set $B_{(x_1,x_2)}$ with $P(B_{(x_1,x_2)}) = 1$ such that for all $\omega \in B_{(x_1,x_2)}$,
$$\sum_{y_2\in S_y} \int |\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)|\,dy_1 \to 0. \qquad (2.16)$$
Further, if Assumptions D3 and B5 hold, there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y} \int |\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)|\,dy_1 \to 0. \qquad (2.17)$$

Proof. (2.16) follows from the same arguments as (2.14) and the proof is omitted here. To demonstrate (2.17): from Assumption D3, $S_x$ is finite and it is therefore sufficient to demonstrate the result at each value of $x_2$. Writing
$$\tilde g_n(x_1,y_1|x_2,y_2,\omega) = \frac{1}{n(x_2,y_2)\, c_{nx_2}^{d_x} c_{ny_2}^{d_y}} \sum_{i=1}^n K_x\!\left(\frac{x_1 - X_{1i}(\omega)}{c_{nx_2}}\right) K_y\!\left(\frac{y_1 - Y_{1i}(\omega)}{c_{ny_2}}\right) I_{x_2}(X_{2i}(\omega))\, I_{y_2}(Y_{2i}(\omega))$$
and
$$\hat p(x_2,y_2) = \frac{n(x_2,y_2)}{n},$$
we can write
$$\sup_{x_1\in\mathcal{X}} \sum_{y_2\in S_y} \int |\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)|\,dy_1 \leq \sup_{x_1\in\mathcal{X}} \sum_{y_2\in S_y} p(x_2,y_2) \int |\tilde g_n(x_1,y_1|x_2,y_2,\omega) - g(x_1,y_1|x_2,y_2)|\,dy_1 + \sup_{x_1\in\mathcal{X}} \sum_{y_2\in S_y} |\hat p(x_2,y_2) - p(x_2,y_2)|\,\hat h_n(x_1,x_2,\omega).$$
From Theorem 2.2, $\hat h_n(x_1,x_2,\omega)$ is bounded and the second term converges almost surely. Further, the integral in the first term converges almost surely for each $y_2$ by Theorem 2.3. Using that
$$\sup_{x_1\in\mathcal{X}} \sum_{y_2\in S_y} p(x_2,y_2) \int |\tilde g_n(x_1,y_1|x_2,y_2,\omega) - g(x_1,y_1|x_2,y_2)|\,dy_1 \leq \sup_{x_1\in\mathcal{X}} \hat h_n(x_1,x_2,\omega) + h(x_1,x_2)$$
is also almost surely bounded, the result now follows from the dominated convergence theorem.

The results above can now be readily extended to equivalent $L_1$ results for conditional densities.

Theorem 2.5. Let $\{(X_{n1},X_{n2},Y_{n1},Y_{n2}),\ n\geq 1\}$ be given as in Section 1, under Assumptions D1-D2, K1-K6 and B1-B2.
1. For almost all $x = (x_1,x_2) \in \mathbb{R}^{d_x}\otimes S_x$ there exists a set $B_x$ with $P(B_x) = 1$ such that for all $\omega \in B_x$,
$$\sum_{y_2\in S_y} \int \left|\hat f_n(y_1,y_2|x_1,x_2,\omega) - f(y_1,y_2|x_1,x_2)\right| dy_1 \to 0. \qquad (2.18)$$

2. There exists a set $B_1$ with $P(B_1) = 1$ such that for all $\omega \in B_1$,
$$\sum_{x_2\in S_x}\sum_{y_2\in S_y} \int\!\!\int h(x_1,x_2) \left|\hat f_n(y_1,y_2|x_1,x_2,\omega) - f(y_1,y_2|x_1,x_2)\right| dy_1\,dx_1 \to 0. \qquad (2.19)$$

3. If, further, Assumptions D3 and B5 hold,
$$\sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y} \int \left|\hat f_n(y_1,y_2|x_1,x_2,\omega) - f(y_1,y_2|x_1,x_2)\right| dy_1 \to 0. \qquad (2.20)$$

Proof. We begin with (2.18) and observe that
$$\sum_{y_2\in S_y} \int \left|\hat f_n(y_1,y_2|x_1,x_2,\omega) - f(y_1,y_2|x_1,x_2)\right| dy_1 \leq \sum_{y_2\in S_y} \int \frac{\left|\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)\right|}{h(x_1,x_2)}\,dy_1 + \sum_{y_2\in S_y} \int \hat g_n(x_1,x_2,y_1,y_2,\omega) \left| \frac{1}{\hat h_n(x_1,x_2,\omega)} - \frac{1}{h(x_1,x_2)} \right| dy_1 \qquad (2.21)$$
$$= \frac{1}{h(x_1,x_2)} \sum_{y_2\in S_y}\int \left|\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)\right| dy_1 + \left| 1 - \frac{\hat h_n(x_1,x_2,\omega)}{h(x_1,x_2)} \right|. \qquad (2.22)$$
For the first term in (2.22), we observe that $h(x_1,x_2) > 0$ and the result follows from the first part of Theorem 2.4. For the second term, for almost all $(x_1,x_2)$, $\hat h_n(x_1,x_2,\omega) \to h(x_1,x_2)$ with probability 1 from the second part of Theorem 2.2.
To demonstrate (2.19), we observe
$$\sum_{x_2\in S_x}\sum_{y_2\in S_y} \int\!\!\int h(x_1,x_2) \left|\hat f_n(y_1,y_2|x_1,x_2,\omega) - f(y_1,y_2|x_1,x_2)\right| dy_1\,dx_1$$
$$\leq \sum_{x_2\in S_x}\sum_{y_2\in S_y} \int\!\!\int \left|\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)\right| dy_1\,dx_1 + \sum_{x_2\in S_x}\sum_{y_2\in S_y} \int\!\!\int h(x_1,x_2)\,\hat g_n(x_1,x_2,y_1,y_2,\omega) \left| \frac{1}{\hat h_n(x_1,x_2,\omega)} - \frac{1}{h(x_1,x_2)} \right| dy_1\,dx_1$$
$$= \sum_{x_2\in S_x}\sum_{y_2\in S_y} \int\!\!\int \left|\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)\right| dy_1\,dx_1 + \sum_{x_2\in S_x} \int \left|\hat h_n(x_1,x_2,\omega) - h(x_1,x_2)\right| dx_1,$$
where both terms in the final line converge to zero with probability 1 from Theorem 2.2.

For (2.20), we follow the proof of (2.18) and observe that taking a supremum for the first term in (2.22) gives
$$\sup_{(x_1,x_2)\in\mathcal{X}} \frac{1}{h(x_1,x_2)} \sum_{y_2\in S_y}\int \left|\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)\right| dy_1 \leq \frac{1}{h^-} \sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y}\int \left|\hat g_n(x_1,x_2,y_1,y_2,\omega) - g(x_1,x_2,y_1,y_2)\right| dy_1 \to 0$$
from Theorem 2.4, while the supremum over the second term converges to zero by applying Theorem 2.2.
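The conclusions of Theorem 2.5 can be visualized directly in a univariate simulation. The sketch below is our own illustration; the sinusoidal regression model and the common bandwidth $n^{-1/6}$ are arbitrary choices consistent with B1-B4. It estimates $\hat f_n(y|x_0) = \hat g_n/\hat h_n$ on a grid and reports the $L_1$ distance in $y$, which shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_hat_conditional(y_grid, x0, X, Y, cx, cy):
    """f_hat(y | x0) = g_hat(x0, y) / h_hat(x0), Gaussian kernels, dx = dy = 1."""
    Kx = np.exp(-0.5 * ((x0 - X) / cx) ** 2) / np.sqrt(2 * np.pi)
    Ky = np.exp(-0.5 * ((y_grid[:, None] - Y[None, :]) / cy) ** 2) / np.sqrt(2 * np.pi)
    g = (Ky * Kx[None, :]).mean(axis=1) / (cx * cy)
    h = Kx.mean() / cx
    return g / h

y_grid = np.linspace(-4.0, 4.0, 401)
dy = y_grid[1] - y_grid[0]
x0 = 0.3
for n in [200, 2000, 20000]:
    X = rng.uniform(-1, 1, n)
    Y = np.sin(np.pi * X) + rng.normal(0.0, 0.5, n)  # f(y|x) = N(sin(pi x), 0.5^2)
    c = n ** (-1.0 / 6.0)                            # common bandwidth: B1-B4 hold
    f_true = np.exp(-0.5 * ((y_grid - np.sin(np.pi * x0)) / 0.5) ** 2) / (0.5 * np.sqrt(2 * np.pi))
    f_est = f_hat_conditional(y_grid, x0, X, Y, c, c)
    print(n, np.abs(f_est - f_true).sum() * dy)      # L1 distance in y at x = x0
```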
3 Consistency Results for Homoscedastic Conditional Densities

In this section we study the conditional density estimates defined in Hansen (2004) for continuous-valued random variables in which a nonparametric location family is assumed: $f(y_1|x_1,x_2) = f^*(y_1 - m(x_1,x_2))$ for some $f^*$ that does not depend on $(x_1,x_2)$ and a regression function $m$. This is estimated according to (1.6)-(1.8). Here we show that essentially all the consistency properties for conditional density estimates demonstrated in Section 2 continue to hold for the estimate (1.6)-(1.8).

Lemma 3.1. Let $\{(X_{n1},X_{n2},Y_{n1}),\ n\geq 1\}$ be given as in Section 1 with the restriction (1.1). Under Assumptions D1-D3, K1-K6, B1-B2 and B5 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\sup_{(x_1,x_2)\in\mathcal{X}} |\hat m_n(x_1,x_2,\omega) - m(x_1,x_2)| \to 0.$$

Proof. We observe that for any $x_2$, $\hat m_n(x_1,x_2,\omega)$ is a Nadaraya-Watson estimator using the data restricted to $X_2(\omega) = x_2$. Since the set $S_x$ is finite, the result follows from Theorem 2.1 of Nadaraya (1989, Chap. 4).
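For intuition, here is a minimal sketch of the homoscedastic estimate (1.6)-(1.8) with univariate continuous $x$ and $y$: the Nadaraya-Watson fit supplies residuals whose kernel density estimate, recentred at $\hat m_n(x)$, gives $\tilde f_n(y|x)$. Gaussian kernels and all names are our own illustrative assumptions.

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def m_hat(x, X, Y, cx):
    """Nadaraya-Watson regression estimate (1.6), univariate continuous x."""
    w = gauss((x - X) / cx)
    return np.sum(w * Y) / np.sum(w)

def f_tilde(y, x, X, Y, cx, cy):
    """Homoscedastic conditional density (1.8): residual KDE (1.7), recentred."""
    residuals = Y - np.array([m_hat(xi, X, Y, cx) for xi in X])  # Y_i - m_hat(X_i)
    e = y - m_hat(x, X, Y, cx)            # evaluate f_hat* at e = y - m_hat(x)
    return np.mean(gauss((e - residuals) / cy)) / cy
```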
Theorem 3.1. Let $\{(X_{n1},X_{n2},Y_{n1}),\ n\geq 1\}$ be given as in Section 1 with the restriction (1.1). Under Assumptions D1-D3, E1, K1-K6, B1-B2 and B5 there exists a set $B$ with $P(B) = 1$ such that for all $\omega \in B$,
$$\int \left|\hat f_n^*(e,\omega) - f^*(e)\right| de \to 0 \qquad (3.23)$$
and
$$\sup_{(x_1,x_2)\in\mathcal{X}} \int \left|\tilde f_n(y_1|x_1,x_2,\omega) - f^*(y_1 - m(x_1,x_2))\right| dy_1 \to 0. \qquad (3.24)$$

Proof. We begin by defining $\tilde f^*(e,\tilde m)$ to be the density of $Y_i(\omega) - \tilde m(X_{i1},X_{i2})$ for any $\tilde m(x_1,x_2)$ with at least two continuous derivatives in $x_1$:
$$\tilde f^*(e,\tilde m) = \int f^*(e - z)\, f^+(z,\tilde m)\,dz, \qquad (3.25)$$
where $f^+(e,\tilde m)$ gives the density of $\tilde m(x_1,x_2) - m(x_1,x_2)$ for $(x_1,x_2)$ distributed according to $h(x_1,x_2)$. We also define its kernel density estimate
$$\hat f_n(e,\tilde m,\omega) = \frac{1}{n c_{ny_2}^{d_y}} \sum_{i=1}^n K_y\!\left(\frac{e - (Y_i(\omega) - \tilde m(X_{i1}(\omega),X_{i2}(\omega)))}{c_{ny_2}}\right), \qquad (3.26)$$
so that $\hat f_n(e,\hat m_n,\omega) = \hat f_n^*(e,\omega)$. To demonstrate (3.23), we break the integral into
$$\int \left|\hat f_n(e,\hat m_n,\omega) - f^*(e)\right| de \leq \int \left|\hat f_n(e,\hat m_n,\omega) - \tilde f^*(e,\hat m_n)\right| de + \int \left|\tilde f^*(e,\hat m_n) - f^*(e)\right| de.$$
We begin by observing that the second term converges by a mild generalization of Devroye and Lugosi (2001, Theorem 9.1), given in Lemma 3.2 below, with $G_n(e) = f^+(e,\hat m_n)$ and the observation that Lemma 3.1 implies that the support of $f^+(e,\hat m_n)$ shrinks to zero.

For the first term, we follow the arguments of Devroye and Lugosi (2001, Theorems 9.2-9.4) with $\hat m_n$ replaced by any bounded continuous $\tilde m$, and observe that the convergence can be given in terms that are independent of $\tilde m$. In particular, following Devroye and Lugosi (2001) in using the bounded difference inequality, we have that for any $g$,
$$P\!\left( \left| \int \left|\hat f_n(e,\tilde m,\omega) - g(e)\right| de - E\!\int \left|\hat f_n(e,\tilde m,\omega) - g(e)\right| de \right| > \epsilon \right) \leq 2 e^{-n\epsilon^2/2},$$
hence
$$\int \left|\hat f_n(e,\tilde m,\omega) - \tilde f^*(e,\tilde m)\right| de - E\!\int \left|\hat f_n(e,\tilde m,\omega) - \tilde f^*(e,\tilde m)\right| de \to 0$$
almost surely by the Borel-Cantelli lemma. It is thus sufficient to establish the convergence of
$$E\!\int \left|\hat f_n(e,\tilde m,\omega) - \tilde f^*(e,\tilde m)\right| de \leq \int \left|E\hat f_n(e,\tilde m,\omega) - \tilde f^*(e,\tilde m)\right| de + E\!\int \left|\hat f_n(e,\tilde m,\omega) - E\hat f_n(e,\tilde m,\omega)\right| de.$$
Proof. This proof is a generalization of Deroye and Lugosi (2001, Theorem 9.1) and follows from nearly identical arguments. First, assuming the result is true on any dense subspace of functions g, we have Z Z Z Z dx f (x − y − z)h(z)dzG (y)dy − f (x − z)h(z)dz n Z Z Z f (x − y − z)Gn (y)dy − f (x − z) h(z)dzdx ≤ Z Z Z ≤ |f (x − y) − g(x − y)| Gn (y)dydx + |f (x) − g(x)| dx Z Z + g(x − y)Gn (y)dy − g(x) dx Z ≤ 2 |f (x) − g(x)| dx + o(1) where the first integral can be made as small as desired without reference to h(z). It is therefore sufficient to prove the theorem in a dense subclass. In this case we take the class of Lipschitz densities with compact support. For any such f , let the support of f be contained in [−M ∗ M ∗ ]d and |f (y) − f (x)| ≤ Ckx − yk, x, y ∈ Rd . Then we observe that for any h(z) ∈ DL Z Z | (f (y − z) − f (x − z))h(z)dz| ≤ |f (y − z) − f (x − z)| h(z)dz ≤ |f (y) − f (x)| ≤ Ckx − yk
14
so that uniformly over DL . Z Z Z f (x − y − z)h(z)Gn (y)dzdy − f (x − z)h(z)dz dx Z Z Z ≤ f (x − y − z)Gn (y)dzdy − f (x − z) h(z)dzdx Z Z Z = f (x − y − z)Gn (y)dy − f (x − z) Gn (y)dy h(z)dzdx Z Z ≤ |f (x − y − z) − f (x − z)| h(z)Gn (y)dzdydx Z Z ≤ Ckyk|Gn (y)|dydx [−M ∗ −L−Mn ,M ∗ +L+Mn ]d
√ ≤ (2M ∗ + 2L + 2Mn )d CMn d → 0.
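The core estimate behind Lemma 3.2, that convolving a density $f$ with a mollifier $G_n$ of shrinking support perturbs it by $o(1)$ in $L_1$, is easy to check numerically. The sketch below is our own illustration: a uniform mollifier and a Gaussian $f$ are assumed.

```python
import numpy as np

# Numerical check: || f * G_n - f ||_1 -> 0 as the support [-M_n, M_n] of the
# mollifier G_n shrinks (uniform G_n and Gaussian f chosen for illustration).
grid = np.linspace(-10.0, 10.0, 4001)
d = grid[1] - grid[0]
f = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)            # a Lipschitz density
for Mn in [1.0, 0.1, 0.01]:
    Gn = np.where(np.abs(grid) <= Mn, 1.0 / (2.0 * Mn), 0.0)  # uniform on [-Mn, Mn]
    conv = np.convolve(f, Gn, mode="same") * d               # (f * G_n)(x) on grid
    print(Mn, np.abs(conv - f).sum() * d)                    # L1 error -> 0 with Mn
```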
4 Consistency of Minimum Disparity Estimators for Conditional Models
In this section we define minimum disparity estimators for the conditionally specified models based on the distributions and data defined in Section 1. A parametric model is given by $f(y_1,y_2|x_1,x_2) = \phi(y_1,y_2|x_1,x_2,\theta)$, where we assume that the $(X_{i1},X_{i2})$ are independently drawn from a distribution $h(x_1,x_2)$ which is not parametrically specified. For this model, the maximum likelihood estimator for $\theta$ given observations $(Y_i,X_i)$, $i = 1,\ldots,n$, is
$$\hat\theta_{MLE} = \arg\max_\theta \sum_{i=1}^n \log\phi(Y_{i1},Y_{i2}|X_{i1},X_{i2},\theta)$$
with attendant asymptotic variance given by the inverse of the information
$$I(\theta_0) = -\sum_{y_2\in S_y}\sum_{x_2\in S_x} \int\!\!\int \nabla^2_\theta\!\left[\log\phi(y_1,y_2|x_1,x_2,\theta_0)\right] \phi(y_1,y_2|x_1,x_2,\theta_0)\, h(x_1,x_2)\,dy_1\,dx_1$$
when the specified parametric model is correct at $\theta = \theta_0$.

In the context of disparity estimation, for every value $(x_1,x_2)$ we define the conditional disparity between $f$ and $\phi$ as
$$D(f,\phi|x_1,x_2,\theta) = \sum_{y_2\in S_y} \int C\!\left( \frac{f(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta)} - 1 \right) \phi(y_1,y_2|x_1,x_2,\theta)\,dy_1;$$
these are combined over the observed $(X_{i1},X_{i2})$ by
$$D_n(f,\theta) = \frac{1}{n}\sum_{i=1}^n D(f,\phi|X_{i1},X_{i2},\theta)$$
or
$$\tilde D_n(f,\theta) = \sum_{x_2\in S_x} \int D(f,\phi|x_1,x_2,\theta)\,\hat h_n(x_1,x_2)\,dx_1,$$
with limiting case
$$D_\infty(f,\theta) = \sum_{x_2\in S_x} \int D(f,\phi|x_1,x_2,\theta)\, h(x_1,x_2)\,dx_1.$$
We now define the conditional minimum disparity estimator as
$$\hat\theta_n^D = \arg\min_{\theta\in\Theta} D_n(\hat f_n,\theta)$$
with $\hat f_n$ defined as either (1.5) or (1.8), and $C$ a strictly convex function from $\mathbb{R}$ to $[-1,\infty)$ with a unique minimum at 0. Classical choices of $C$ include $e^{-x} - 1$, resulting in the negative exponential disparity (NED), and $(\sqrt{x+1} - 1)^2 - 1$, which corresponds to Hellinger distance (HD).

Under this definition, we first establish the existence and consistency of $\hat\theta_n^D$. To do so we note that disparity results all rely on the boundedness of $D(f,\phi|X_i,\theta)$ over $\theta$ and $f$, and a condition of the form that for any conditional densities $g$ and $f$,
$$\sup_{\theta\in\Theta} |D(g,\phi|X_{i1},X_{i2},\theta) - D(f,\phi|X_{i1},X_{i2},\theta)| \leq K \sum_{y_2\in S_y} \int |g(y_1,y_2|X_{i1},X_{i2}) - f(y_1,y_2|X_{i1},X_{i2})|\,dy_1. \qquad (4.27)$$
In the case of Hellinger distance (Beran, 1977), $D(g,\theta) < 2$ and (4.27) follows from Minkowski's inequality. For the alternate class of divergences studied in Park and Basu (2004), boundedness of $D$ is established from $\sup_{t\in[-1,\infty)} |C'(t)| \leq C^* < \infty$, which also provides
$$\int \left| C\!\left(\frac{g(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta)} - 1\right) - C\!\left(\frac{f(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta)} - 1\right) \right| \phi(y_1,y_2|x_1,x_2,\theta)\,d\mu(y)$$
$$\leq C^* \int \left| \frac{g(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta)} - \frac{f(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta)} \right| \phi(y_1,y_2|x_1,x_2,\theta)\,d\mu(y) = C^* \int |g(y_1,y_2|x_1,x_2) - f(y_1,y_2|x_1,x_2)|\,d\mu(y),$$
where we have employed the notational convention (1.9). For simplicity, we therefore use (4.27) as a condition below. In general, we will require the following assumptions:

(P1) The parameter space $\Theta$ is locally compact.

(P2) There exists $N$ such that for $n > N$ and $\theta_1 \neq \theta_2$, with probability 1, $\max_{i\in 1,\ldots,n} |\phi(y_1,y_2|X_{i1},X_{i2},\theta_1) - \phi(y_1,y_2|X_{i1},X_{i2},\theta_2)| > 0$ on a non-zero set of dominating measure in $y$.

(P3) $\phi(y_1,y_2|x_1,x_2,\theta)$ is continuous in $\theta$ for almost every $(x_1,x_2,y_1,y_2)$.

(P4) $D(f,\phi|x_1,x_2,\theta)$ is uniformly bounded over $f$, $x_1$, $x_2$ and $\theta$, and (4.27) holds.
(P5) For every $f$ there exists a compact set $S_f \subset \Theta$ and $N$ such that for $n \geq N$,
$$\inf_{\theta\in S_f^c} D_n(f,\theta) > \inf_{\theta\in S_f} D_n(f,\theta).$$
These assumptions combine those of Park and Basu (2004) for a general class of disparities with those of Cheng and Vidyashankar (2006), which relax the assumption of compactness of $\Theta$. Together, these provide the following results.

Theorem 4.1. Under Assumptions P1-P5, define
$$T_n(f) = \arg\min_{\theta\in\Theta} D_n(f,\theta) \qquad (4.28)$$
for $n = 1,\ldots,\infty$ inclusive. Then

(i) For any $f \in \mathcal{F}$ there exists $\theta \in \Theta$ such that $T_n(f) = \theta$.

(ii) For $n \geq N$, for any $\theta$, $\theta = T_n(\phi(\cdot|\cdot,\theta))$ is unique.

(iii) If $T_n(f)$ is unique and $f_m \to f$ in $L_1$ for each $x$, then $T_n(f_m) \to T_n(f)$.

Proof. (i) Existence. We first observe that it is sufficient to restrict the infimum in (4.28) to $S_f$. Let $\{\theta_m : \theta_m \in S_f\}$ be a sequence such that $\theta_m \to \theta$ as $m \to \infty$. Since
$$C\!\left( \frac{f(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta_m)} - 1 \right) \phi(y_1,y_2|x_1,x_2,\theta_m) \to C\!\left( \frac{f(y_1,y_2|x_1,x_2)}{\phi(y_1,y_2|x_1,x_2,\theta)} - 1 \right) \phi(y_1,y_2|x_1,x_2,\theta)$$
by Assumption P3, using the bound on $D(f,\phi,\theta)$ from Assumption P4 we have $D_n(f,\theta_m) \to D_n(f,\theta)$ by the dominated convergence theorem. Hence $D_n(f,t)$ is continuous in $t$ and achieves its minimum for $t \in S_f$ since $S_f$ is compact.

(ii) Uniqueness. This is a consequence of Assumption P2 and the unique minimum of $C$ at 0.

(iii) Continuity in $f$. For any sequence $f_m(\cdot|x_1,x_2) \to f(\cdot|x_1,x_2)$ in $L_1$ for every $x$ as $m \to \infty$, we have
$$\sup_{\theta\in\Theta} |D_n(f_m,\theta) - D_n(f,\theta)| \to 0 \qquad (4.29)$$
from Assumption P4. Now consider $\theta_m = T_n(f_m)$. We first observe that there exists $M$ such that for $m \geq M$, $\theta_m \in S_f$; otherwise, from (4.29) and Assumption P5,
$$D_n(f_m,\theta_m) > \inf_{\theta\in S_f} D_n(f_m,\theta),$$
contradicting the definition of $\theta_m$. Now suppose that $\theta_m$ does not converge to $\theta_0$. By the compactness of $S_f$ we can find a subsequence $\theta_{m'} \to \theta^* \neq \theta_0$, implying $D_n(f,\theta_{m'}) \to D_n(f,\theta^*)$ from Assumption P3. Combining this with (4.29) implies $D_n(f,\theta^*) = D_n(f,\theta_0)$, contradicting the assumed uniqueness of $T_n(f)$.
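To make the objects $D_n(\hat f_n,\theta)$ and $T_n$ concrete, the following sketch evaluates and minimizes a conditional Hellinger-type disparity on a grid. The Gaussian location-scale model $\phi$, the grid-based integration and all names are illustrative assumptions of ours, not the paper's specification.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def C_hellinger(delta):
    """Disparity generator yielding squared Hellinger distance; the constant
    offset -1 in the text's version of C does not affect the minimizer."""
    return (np.sqrt(delta + 1.0) - 1.0) ** 2

def D_n(theta, f_hat_vals, y_grid, X):
    """D_n(f, theta) of Section 4: average over the X_i of the conditional
    disparity. f_hat_vals[i] holds the estimated conditional density
    f_hat(. | X_i) on y_grid; phi(y|x, theta) = N(theta[0] + theta[1] x,
    theta[2]^2) is an illustrative parametric model."""
    dy = y_grid[1] - y_grid[0]
    total = 0.0
    for i, x in enumerate(X):
        phi = norm.pdf(y_grid, loc=theta[0] + theta[1] * x, scale=abs(theta[2]))
        delta = f_hat_vals[i] / np.maximum(phi, 1e-300) - 1.0
        total += np.sum(C_hellinger(delta) * phi) * dy   # D(f, phi | X_i, theta)
    return total / len(X)

# minimum disparity estimate T_n(f_hat), given f_hat_vals, y_grid and X:
# theta_hat = minimize(D_n, x0=np.array([0.0, 1.0, 1.0]),
#                      args=(f_hat_vals, y_grid, X), method="Nelder-Mead").x
```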
Theorem 4.2. Let $\{(X_{n1},X_{n2},Y_{n1},Y_{n2}),\ n\geq 1\}$ be given as in Section 1 and define
$$\theta_n^0 = \arg\min_{\theta\in\Theta} D_n(f,\theta)$$
for every $n$ including $\infty$. Further, assume that $\theta_\infty^0$ is unique in the sense that for every $\epsilon$ there exists $\delta$ such that
$$\|\theta - \theta_\infty^0\| > \epsilon \;\Rightarrow\; D_\infty(f,\theta) > D_\infty(f,\theta_\infty^0) + \delta.$$
Then, under Assumptions D1-D3, K1-K6, B1-B2 and P1-P5, with $\bar f_n$ either of the estimators (1.5) or (1.8),
$$\hat\theta_n = T_n(\bar f_n) \to \theta_\infty^0 \text{ as } n \to \infty \text{ almost surely.}$$
Similarly,
$$\tilde T_n(\bar f_n) = \arg\min_{\theta\in\Theta} \tilde D_n(\bar f_n,\theta) \to \theta_\infty^0 \text{ as } n \to \infty \text{ almost surely.}$$

Proof. First, we observe that for every $f$ it is sufficient to restrict attention to $S_f$, and that
$$\sup_{\theta\in S_f} |D_n(f,\theta) - D_\infty(f,\theta)| \to 0 \text{ almost surely} \qquad (4.30)$$
from the strong law of large numbers, the compactness of $S_f$ and the assumed continuity of $C$ and of $\phi$ with respect to $\theta$. Further,
$$\sup_{m\in\mathbb{N},\,\theta\in\Theta} \left|D_m(\bar f_n,\theta) - D_m(f,\theta)\right| \leq C^* \sup_{x\in\mathcal{X}} \int \left|\bar f_n(y_1,y_2|x_1,x_2) - f(y_1,y_2|x_1,x_2)\right| d\mu(y) \to 0 \text{ almost surely,} \qquad (4.31)$$
where the convergence is obtained from Theorem 2.5 or Theorem 3.1, depending on the form of $\bar f_n$.

Suppose that $\hat\theta_n$ does not converge to $\theta_\infty^0$; writing $\theta_0 = \theta_\infty^0$, we can then find $\epsilon > 0$ and a subsequence $\hat\theta_{n'}$ such that $\|\hat\theta_{n'} - \theta_0\| > \epsilon$ for all $n'$. However, on this subsequence,
$$D_{n'}(\hat f_{n'},\hat\theta_{n'}) = D_{n'}(\hat f_{n'},\theta_0) + (D_{n'}(f,\theta_0) - D_{n'}(\hat f_{n'},\theta_0)) + (D_\infty(f,\theta_0) - D_{n'}(f,\theta_0)) + (D_\infty(f,\hat\theta_{n'}) - D_\infty(f,\theta_0)) + (D_{n'}(f,\hat\theta_{n'}) - D_\infty(f,\hat\theta_{n'})) + (D_{n'}(\hat f_{n'},\hat\theta_{n'}) - D_{n'}(f,\hat\theta_{n'}))$$
$$\geq D_{n'}(\hat f_{n'},\theta_0) + \delta - 2\sup_{\theta\in\Theta}\left|D_{n'}(f,\theta) - D_\infty(f,\theta)\right| - 2\sup_{\theta\in\Theta}\left|D_{n'}(\hat f_{n'},\theta) - D_{n'}(f,\theta)\right|,$$
but from (4.30) and (4.31) we can find $N$ so that for $n' \geq N$,
$$\sup_{\theta\in\Theta}\left|D_{n'}(f,\theta) - D_\infty(f,\theta)\right| \leq \frac{\delta}{6} \quad\text{and}\quad \sup_{\theta\in\Theta}\left|D_{n'}(\hat f_{n'},\theta) - D_{n'}(f,\theta)\right| \leq \frac{\delta}{6},$$
contradicting the optimality of $\hat\theta_{n'}$. The proof for $\tilde T_n(\bar f_n)$ follows analogously.
5 Asymptotic Normality and Efficiency of Minimum Disparity Estimators for Conditional Models: Unrestricted Case
In this section we demonstrate the asymptotic normality and efficiency of minimum conditional disparity estimators using the unrestricted conditional density estimate $\hat f_n(y_1,y_2|x_1,x_2)$. In order to simplify some of our expressions, we introduce the following notation: for a column vector $A$ we define the matrix $A^{TT} = AA^T$. This will be particularly useful in defining information matrices. The proof techniques employed here are an extension of those developed in i.i.d. settings in Beran (1977); Tamura and Boos (1986); Lindsay (1994); Park and Basu (2004). In particular, we will require the following assumptions:

(N1) Define
$$\Psi_\theta(x_1,x_2,y_1,y_2) = \frac{\nabla_\theta \phi(y_1,y_2|x_1,x_2,\theta)}{\phi(y_1,y_2|x_1,x_2,\theta)};$$
then
$$\sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y} \int \Psi_\theta(x_1,x_2,y_1,y_2)\,\Psi_\theta(x_1,x_2,y_1,y_2)^T f(y_1,y_2|x_1,x_2)\,dy_1 < \infty$$
elementwise. Further, there exists $a_y > 0$ such that
$$\sup_{(x_1,x_2)\in\mathcal{X}} \sup_{\|t\|\leq a_y} \sup_{\|s\|\leq a_y} \sum_{y_2\in S_y} \int \Psi_\theta(x_1+s, x_2, y_1+t, y_2)^2 f(y_1,y_2|x_1,x_2)\,dy_1 < \infty,$$
$$\sup_{(x_1,x_2)\in\mathcal{X}} \sup_{\|t\|\leq a_y} \sup_{\|s\|\leq a_y} \int \left(\nabla_y \Psi_\theta(x_1+t, x_2, y_1+s)\right)^2 f(y_1|x_1,x_2)\,dy_1 < \infty$$
and
$$\sup_{(x_1,x_2)\in\mathcal{X}} \sup_{\|t\|\leq a_y} \sup_{\|s\|\leq a_y} \int \left(\nabla_x \Psi_\theta(x_1+t, x_2, y_1+s)\right)^2 f(y_1|x_1,x_2)\,dy_1 < \infty.$$

(N2) There exist sequences $b_n$ and $\alpha_n$ diverging to infinity such that

(i) $n c_{ny_2}^4 b_n^4 \to 0$ and
$$n \sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y} \int\!\!\int_{\|u\|>b_n,\,\|v\|>b_n} \Psi_\theta^2(x_1+c_{nx_2}u,\, x_2,\, y_1+c_{ny_2}v,\, y_2)\, K_y^2(u)\, K_x^2(v)\, g(x_1,x_2,y_1,y_2)\,dv\,dy_1 \to 0$$
elementwise.

(ii) $\sup_{(x_1,x_2)\in\mathcal{X}} n P(\|Y_1 - c_{ny_2}b_n\| > \alpha_n - c) \to 0$.

(iii) $$\sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y} \frac{1}{\sqrt{n c_{nx_2}^{d_x} c_{ny_2}^{d_y}}} \int_{\|y_1\|\leq\alpha_n} |\Psi_\theta(x_1,x_2,y_1,y_2)|\,dy_1 \to 0.$$

(iv) $$\sup_{(x_1,x_2)\in\mathcal{X}} \sup_{\|t\|\leq b_n} \sup_{\|s\|\leq b_n} \sup_{\|y_1\|<\alpha_n} \sum_{y_2\in S_y} \frac{g(x_1+c_{nx_2}s,\, x_2,\, y_1+c_{ny_2}t,\, y_2)}{g(x_1,x_2,y_1,y_2)} = O(1).$$

We note that if $d_x + d_y > 3$, the asymptotic bias in the Central Limit Theorem is of order $\sqrt{n}\, c_{nx_2}^2 c_{ny_2}^2$ and will not become zero when the variance is controlled via Assumption B4, $n c_{nx_2}^{2d_x} c_{ny_2}^{2d_y} \to \infty$. We will further need to restrict the variance to be of order $n^{-1} c_{nx_2}^{-2d_x} c_{ny_2}^{-2d_y}$, effectively reducing the unbiased central limit theorem to the cases where either $d_x = 1$ or $d_y = 1$. As in Tamura and Boos (1986), we also note that this bias is often small in practice. However, as we demonstrate in Section 6, in the case of a univariate homoscedastic model the use of $\tilde f_n(y|x_1,x_2)$ can remove this bias in some circumstances.
Lemma 5.3. Let $\{(X_{n1},X_{n2},Y_{n1},Y_{n2}),\ n\geq 1\}$ be given as in Section 1. Under Assumptions K1-K6, B1-B4 and N1-N2iv, for any function $J(y_1,y_2,x_1,x_2)$ satisfying the conditions on $\Psi$ in Assumptions N1-N2iv,
$$\sqrt{n} \sup_{(x_1,x_2)\in\mathcal{X}} \sum_{y_2\in S_y} \int J(y_1,y_2,x_1,x_2) \left( \sqrt{\hat f_n(y_1,y_2|x_1,x_2)} - \sqrt{\frac{E\hat g_n(x_1,x_2,y_1,y_2)}{E\hat h_n(x_1,x_2)}} \right)^2 dy_1 \to 0 \qquad (5.41)$$
in probability, and further
$$\sqrt{n} \sum_{y_2\in S_y,\,x_2\in S_x} \int\!\!\int J(y_1,y_2,x_1,x_2) \left( \sqrt{\hat f_n(y_1,y_2|x_1,x_2)} - \sqrt{\frac{E\hat g_n(x_1,x_2,y_1,y_2)}{\hat h_n(x_1,x_2)}} \right)^2 \hat h_n(x_1,x_2)\,dy_1\,dx_1 \to 0. \qquad (5.42)$$

Proof. Applying Lemma 5.1, we substitute $K_{nx}^+$ and $K_{ny}^+$ for $K_x$ and $K_y$ throughout. We begin by breaking the integral in (5.41) into the regions $\|y_1\| \geq \alpha_n$ and $\|y_1\| < \alpha_n$. For the first of these, expanding the square results in three terms.

First:
$$\sqrt{n} \sum_{y_2\in S_y} \int_{\|y_1\|>\alpha_n} J(y_1,y_2,x_1,x_2)\, \frac{E\hat g_n(x_1,x_2,y_1,y_2)}{E\hat h_n(x_1,x_2)}\,dy_1 \leq \frac{\sqrt{n}}{h^-} \sum_{y_2\in S_y} \frac{1}{c_{nx_2}^{d_x} c_{ny_2}^{d_y}} \int_{\|y_1\|>\alpha_n} \int\!\!\int J(y_1,y_2,x_1,x_2)\, K_{nx}^+\!\left(\frac{u-x_1}{c_{nx_2}}\right) K_{ny}^+\!\left(\frac{u-y_1}{c_{ny_2}}\right) g(u,x_2,v,y_2)\,du\,dv\,dy_1$$
$$\leq \frac{\sqrt{n}}{h^-}\, M \sup_{\|t\|\leq b_n} \sup_{\|s\|\leq b_n} \left( \sum_{y_2\in S_y} \int_{\|y_1\|>\alpha_n} J(y_1+c_{ny_2}s,\, y_2,\, x_1+c_{nx_2}t,\, x_2)^2\, g(x_1,x_2,y_1,y_2)\,dy_1 \right)^{1/2} + o(1),$$
where
$$M = \left( \int\!\!\int\!\!\int K_{nx}^{+2}(t)\, K_{ny}^{+2}(s)\, g(x_1 + c_{nx_2}s,\, x_2,\, y_1 + c_{ny_2}s)\,ds\,dt\,d\mu(y) \right)^{1/2};$$
the result now follows from Assumptions N1, N2ii and N2iii.

Second:
$$\sqrt{n} \sum_{y_2\in S_y} \int_{\|y_1\|>\alpha_n} J(y_1,y_2,x_1,x_2)\,\hat f_n(y_1,y_2|x_1,x_2)\,dy_1 \leq \frac{\sqrt{n}\,\sup_y K_y(y)}{\hat h_n(x_1,x_2)} \left( \sum_{y_2\in S_y} n(y_2,x_2) \sum_{i=1}^n \frac{1}{c_{nx_2}^{d_x}} K_{nx}^+\!\left(\frac{x_1-X_{1i}}{c_{nx_2}}\right) I_{x_2}(X_{2i}) \int_{\|y_1\|>\alpha_n} J(Y_{1i}+c_{ny_2}b_n,\, y_2,\, x_1,\, x_2)^2\,dy_1 \right)^{1/2} + o(1),$$
and we observe that, from Assumption N2ii, the final integral is non-zero with probability smaller than $nP(\|y_1 - c_{ny_2}b_n\| > \alpha_n) \to 0$.

Third:
$$\sqrt{n} \sum_{y_2\in S_y,\,x_2\in S_x} \int\!\!\int_{\|y_1\|>\alpha_n} J(y_1,y_2,x_1,x_2)\, \sqrt{\hat f_n(y_1,y_2|x_1,x_2)}\, \sqrt{\frac{E\hat g_n(x_1,x_2,y_1,y_2)}{h(x_1,x_2)}}\,dy_1\,dx_1 \leq \left( \sqrt{n} \sum_{y_2\in S_y} \int_{\|y_1\|>\alpha_n} J(y_1,y_2,x_1,x_2)^2\, \hat f_n(y_1,y_2|x_1,x_2)\,dy_1 \right)^{1/2},$$
reducing to the cases above.

Turning to the region $\|y_1\| < \alpha_n$, we use the identity $(\sqrt{a}-\sqrt{b})^2 \leq (a-b)^2/a$ and add and subtract $\hat g_n(x_1,x_2,y_1,y_2)/E\hat h_n(x_1,x_2)$ to obtain
$$\sqrt{n} \sum_{y_2\in S_y} \int_{\|y_1\|<\alpha_n} J(y_1,y_2,x_1,x_2) \left( \sqrt{\hat f_n(y_1,y_2|x_1,x_2)} - \sqrt{\frac{E\hat g_n(x_1,x_2,y_1,y_2)}{E\hat h_n(x_1,x_2)}} \right)^2 dy_1 \leq \frac{\sqrt{n}}{h^-} \int_{\|y_1\|<\alpha_n} J(y_1,y_2,x_1,x_2)\, \frac{(\hat g_n(x_1,x_2,y_1,y_2) - E\hat g_n(x_1,x_2,y_1,y_2))^2}{E\hat g_n(x_1,x_2,y_1,y_2)}\,dy_1.$$

$$\leq \sup_{\|\eta\|<m_y} \sup_{\|s\|\leq b_n} M \left( \int_{\|y_1\|>\alpha_n-c} J(y_1+\eta+c_{ny_2}s,\, x_1,\, x_2)^2\, f^*(y_1 - m(x_1,x_2))\,dy_1 \right)^{1/2} \to 0$$
from Assumption N2ii, with $M = \sup_y K_y$. Similarly,
$$\sqrt{n} \int_{\|y_1\|>\alpha_n-c} J(y_1,x_1,x_2)\, \hat f(e,\tilde m)\,dy_1 \leq \frac{\sup_y K_y(y)}{\sqrt{n}} \left( \sum_{i=1}^n \sup_{\|\eta\|<m_y} \sup_{\|s\|\leq b_n} \int_{\|y_1\|>\alpha_n-c} J(Y_i+\eta+c_{ny_2}s,\, x_1,\, x_2)^2\,du \right)^{1/2} \to 0.$$
On the region $\|y_1\| < \alpha_n - c$ we make use of the identity $(\sqrt{a}-\sqrt{b})^2 \leq (a-b)^2/a$ to obtain that
$$\sqrt{n} \int_{\|y_1\|\leq\alpha_n-c} J(e+m(x_1,x_2),\, x_1,\, x_2) \left( \sqrt{\hat f_n(e,\tilde m)} - \sqrt{E\hat f_n(e,\tilde m)} \right)^2 de \leq \sqrt{n} \int_{\|y_1\|\leq\alpha_n-c} J(e+m(x_1,x_2),\, x_1,\, x_2)\, \frac{\left(\hat f_n(e,\tilde m) - E\hat f_n(e,\tilde m)\right)^2}{E\hat f_n(e,\tilde m)}\,de,$$
which is smaller in expectation than
$$\frac{\sqrt{n}}{n c_{ny_2}^{d_y}} \int_{\|y_1\|\leq\alpha_n-c} |J(e+m(x_1,x_2),\, x_1,\, x_2)|\, \frac{\int K_{ny}^{+2}(s)\, \tilde f(e+c_{ny_2}s,\tilde m)\,ds}{\int K_{ny}^{+}(s)\, \tilde f(e+c_{ny_2}s,\tilde m)\,ds}\,de \leq \frac{1}{\sqrt{n c_{ny_2}^{d_y}}} \sup_{\|e\|<\alpha_n} \frac{f^*(e+2t)}{f^*(e)} \int_{\|e\|\leq\alpha_n-c} |J(e+m(x_1,x_2),\, x_1,\, x_2)|\,de.$$