Nonparametric estimation of multivariate convex-transformed densities

Submitted to the Annals of Statistics. arXiv: 0911.4151v1 [math.ST]

NONPARAMETRIC ESTIMATION OF MULTIVARIATE CONVEX-TRANSFORMED DENSITIES
By Arseni Seregin and Jon A. Wellner, University of Washington

We study estimation of multivariate densities p of the form p(x) = h(g(x)) for x ∈ R^d, for a fixed function h and an unknown convex function g. The canonical example is h(y) = e^{-y} for y ∈ R; in this case the resulting class of densities P(e^{-y}) = {p = exp(−g) : g is convex} is well known as the class of log-concave densities. Other functions h allow for classes of densities with heavier tails than the log-concave class. We first investigate when the MLE p̂ exists for the class P(h) for various choices of monotone transformations h, including both decreasing and increasing functions h. The resulting models for increasing transformations h extend the classes of log-convex densities studied previously in the econometrics literature, corresponding to h(y) = exp(y). We then establish consistency of the MLE for fairly general functions h, including the log-concave class P(e^{-y}) and many others. In a final section we provide asymptotic minimax lower bounds for estimation of p and its vector of derivatives at a fixed point x0 under natural smoothness hypotheses on h and g. The proofs rely heavily on results from convex analysis.

1. Introduction and Background.

1.1. Log-concave and r-concave densities. A probability density p on R^d is called log-concave if it can be written as p(x) = exp(−g(x)) for some convex function g : R^d → (−∞, ∞]. We let P(e^{-y}) denote the class of all log-concave densities on R^d. As shown by Ibragimov [1956], a

Research supported in part by NSF grant DMS-0503822. Research supported in part by NSF grant DMS-0804587 and NIAID grant 2R01 AI291968-04.
AMS 2000 subject classifications: Primary 62G07, 62H12; secondary 62G05, 62G20.
Keywords and phrases: consistency, log-concave density estimation, lower bounds, maximum likelihood, mode estimation, nonparametric estimation, qualitative assumptions, shape constraints, strongly unimodal, unimodal.

imsart-aos ver. 2009/08/13 file: ConvexTransfv3f.tex date: November 26, 2009


density function p on R is log-concave if and only if its convolution with any unimodal density is again unimodal. Log-concave densities have proved to be useful in a wide range of statistical problems: see Walther [2010] for a survey of recent developments and statistical applications of log-concave densities on R and R^d, and see Cule, Samworth and Stewart [2007] for several interesting applications of estimators of such densities in R^d. Because the class of multivariate log-concave densities contains the class of multivariate normal densities and is preserved under a number of important operations (such as convolution and marginalization), it serves as a valuable nonparametric surrogate or replacement for the class of normal densities. Further study of the log-concave class from this perspective has been given by Schuhmacher, Hüsler and Duembgen [2009]. On the analysis side, various isoperimetric and Poincaré type inequalities have been obtained by Bobkov [1999, 2007b], Fougères [2005], and Milman and Sodin [2008]. Log-concave densities have the slight drawback that the tails must decrease exponentially, so a number of authors, including Koenker and Mizera [2008], have proposed generalizations of the log-concave family involving r-concave densities, defined as follows. For a, b ∈ R, r ∈ R, and λ ∈ (0, 1), define the generalized mean of order r, M_r(a, b; λ), by

M_r(a, b; λ) = ((1 − λ)a^r + λb^r)^{1/r},  r ≠ 0, a, b > 0;
            = 0,                           r < 0, ab = 0;
            = a^{1−λ} b^λ,                 r = 0.

Then a density function p is r-concave on C ⊆ R^d if and only if

p((1 − λ)x + λy) ≥ M_r(p(x), p(y); λ)

for all x, y ∈ C and λ ∈ (0, 1).

We denote the class of all r-concave densities on C ⊆ R^d by P(y_+^{1/r}; C), and write P(y_+^{1/r}) when C = R^d. As noted by Dharmadhikari and Joag-Dev [1988], page 86, for r ≤ 0 it suffices to consider P(y_+^{1/r}), and it is almost immediate from the definitions that p ∈ P(y_+^{1/r}) if and only if p(x) = (g(x))^{1/r} for some convex function g from R^d to [0, ∞). For r > 0, p ∈ P(y_+^{1/r}; C) if and only if p(x) = (g(x))^{1/r} where g, mapping C into (0, ∞), is concave. These results motivate definitions of the classes P(y_+^{-s}) = {p(x) = g(x)^{-s} : g is convex} for s ≥ 0 and, more generally, for a fixed monotone function h from R to R, P(h) ≡ {p(x) = h(g(x)) : g convex}.
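The defining inequality above is easy to probe numerically. The following Python sketch (an illustration of ours, not part of the paper; the function names are our own) implements M_r and checks r-concavity of the standard normal density on a finite grid for r = −1/2; since the normal density is log-concave, it is r-concave for every r < 0 by the monotonicity of r ↦ M_r.

```python
import numpy as np

def generalized_mean(a, b, lam, r):
    """M_r(a, b; lam) as defined above (cases r != 0; r < 0 with ab = 0; r = 0)."""
    if r == 0:
        return a ** (1 - lam) * b ** lam
    if r < 0 and a * b == 0:
        return 0.0
    return ((1 - lam) * a ** r + lam * b ** r) ** (1.0 / r)

def is_r_concave(p, grid, lams, r):
    """Check p((1-lam)x + lam*y) >= M_r(p(x), p(y); lam) over a finite grid."""
    for x in grid:
        for y in grid:
            for lam in lams:
                lhs = p((1 - lam) * x + lam * y)
                rhs = generalized_mean(p(x), p(y), lam, r)
                if lhs < rhs - 1e-12:
                    return False
    return True

# Standard normal density: log-concave, hence r-concave for r < 0.
phi = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
grid = np.linspace(-3.0, 3.0, 13)
lams = np.linspace(0.1, 0.9, 9)
print(is_r_concave(phi, grid, lams, r=-0.5))  # expect True
```

A grid check of this kind is of course only a sanity test, not a proof of r-concavity.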


Such generalizations of log-concave densities and log-concave measures based on means of order r have been introduced by a series of authors, sometimes with differing terminology, apparently starting with Avriel [1972], and continuing with Borell [1975], Brascamp and Lieb [1976], Prékopa [1973], Rinott [1976], and Uhrin [1984]. A nice summary of these connections is given by Dharmadhikari and Joag-Dev [1988]. These authors also present results concerning preservation of r-concavity under a variety of operations, including products, convolutions, and marginalization. In the mathematics literature, the underlying fundamental inequality has come to be known as the Borell-Brascamp-Lieb inequality; see e.g. Cordero-Erausquin, McCann and Schmuckenschläger [2001]. For these heavy-tailed classes, development of isoperimetric and Poincaré inequalities is also underway: see e.g. Bobkov [2007a] and Bobkov and Ledoux [2009]. Despite the long-standing and currently rapid development of the properties of such classes of densities on the probability side, very little has been done from the standpoint of nonparametric estimation, especially when d ≥ 2. Nonparametric estimation of a log-concave density on R^d was initiated by Cule, Samworth and Stewart [2007]. These authors developed an algorithm for computing their estimators and explored several interesting applications. Koenker and Mizera [2008] developed a family of penalized criterion functions related to the Rényi divergence measures, and explored duality in the optimization problems. They did not succeed in establishing consistency of their estimators, but did investigate Fisher consistency.
Recently, Cule and Samworth [2009] have established consistency of the (nonparametric) maximum likelihood estimator of a log-concave density on R^d, even in a setting of model misspecification: when the true density is not log-concave, the estimator converges to the closest log-concave density to the true density in the sense of Kullback-Leibler divergence. In this paper our goal is to investigate maximum likelihood estimation in the classes P(h) corresponding to a fixed monotone (decreasing or increasing) function h. In particular, for decreasing functions h, we handle all the r-concave classes P(y_+^{1/r}) with r = −1/s and r ≤ −1/d (or s ≥ d). On the increasing side, we treat, in particular, the cases h(y) = y1_{[0,∞)}(y) and h(y) = e^y with C = R^d_+. The first of these corresponds to an interesting class of models which can be thought of as multivariate generalizations of the class of decreasing and convex densities on R_+ treated by Groeneboom, Jongbloed and Wellner [2001], while the second, h(y) = e^y, corresponds to multivariate versions of the log-convex families studied by An [1998]. Note that our increasing classes P(y_+^{1/r}, R^d_+) with r > 0 are quite different from the r-concave classes defined above and appear to be completely new, corresponding instead to r-convex densities on R^d_+.

Here is an outline of the rest of the paper. All of our main results are presented in Section 2. Subsection 2.1 gives definitions and basic properties of the transformations involved. Subsection 2.2 establishes existence of the maximum likelihood estimators for both increasing and decreasing transformations h under suitable conditions on the function h. In subsection 2.3 we give statements concerning consistency of the estimators, both in the Hellinger metric and in uniform metrics, under natural conditions. In subsection 2.4 we present asymptotic minimax lower bounds for estimation in these classes under natural curvature hypotheses. We conclude this section with a brief discussion of conjectures concerning attainability of the minimax rates by the maximum likelihood estimators. All the proofs are given in Section 3. We summarize a number of key results from convex analysis in an appendix, Section 4.

1.2. Convex-transformed density estimation. Now let X1, . . . , Xn be n independent random variables distributed according to a probability density p0 = h(g0(x)) on R^d, where h is a fixed monotone (increasing or decreasing) function and g0 is an (unknown) convex function. The probability measure on the Borel sets B_d corresponding to p0 is denoted by P0. The maximum likelihood estimator (MLE) of a log-concave density on R was introduced in Rufibach [2006] and Duembgen and Rufibach [2009]. Algorithmic aspects were treated in Rufibach [2007] and, in a more general framework, in Dümbgen, Hüsler and Rufibach [2007], while consistency with respect to the Hellinger metric was established by Pal, Woodroofe and Meyer [2007], and rates of convergence of f̂_n and F̂_n were established by Duembgen and Rufibach [2009]. Asymptotic distribution theory for the MLE of a log-concave density on R was established by Balabdaoui, Rufibach and Wellner [2009].
If C denotes the class of all closed proper convex functions g : R^d → (−∞, ∞], the estimator ĝ_n of g0 is the maximizer of the functional

L_n(g) ≡ ∫ (log h) ◦ g dP_n

over the class G(h) ⊂ C of all convex functions g such that h ◦ g is a density, where P_n is the empirical measure of the observations. The maximum likelihood estimator of the convex-transformed density p0 is then p̂_n := h(ĝ_n) when it exists and is unique. We investigate conditions for existence and uniqueness in Section 2.

2. Main results.


2.1. Definitions and basic properties. To construct the classes of convex-transformed densities of interest here, we first need to define two classes of monotone transformations. An increasing transformation h is a nondecreasing function R → R_+ such that h(−∞) = 0 and h(+∞) = +∞. We define the limit points y0 < y∞ of the increasing transformation h by:

y0 = inf{y : h(y) > 0},  y∞ = sup{y : h(y) < +∞}.

We make the following assumptions about the asymptotic behavior of the increasing transformation.
(I.1) The function h(y) is o(|y|^{-α}) for some α > d as y → −∞.
(I.2) If y∞ < +∞ then h(y) ≍ (y∞ − y)^{-β} for some β > d as y ↑ y∞.
(I.3) The function h is continuously differentiable on the interval (y0, y∞).
Note that assumption I.1 is satisfied if y0 > −∞.

Definition 2.1. For an increasing transformation h, an increasing class of convex-transformed densities, or simply an increasing model P(h) on R^d_+, is the family of all bounded densities of the form h ◦ g ≡ h(g(·)), where g is a closed proper convex function with dom g = R^d_+.

Remark 2.2. Consider a density h ◦ g from an increasing model P(h). Since h ◦ g is bounded we have g < y∞. The function g̃ = max(g, y0) is convex and h ◦ g̃ = h ◦ g. Thus we can assume that g ≥ y0.

A decreasing transformation h is a nonincreasing function R → R_+ such that h(−∞) = +∞ and h(+∞) = 0. We define the limit points y0 > y∞ of the decreasing transformation h by:

y0 = sup{y : h(y) > 0},  y∞ = inf{y : h(y) < +∞}.

We make the following assumptions about the asymptotic behavior of the decreasing transformation.
(D.1) The function h(y) is o(y^{-α}) for some α > d as y → +∞.
(D.2) If y∞ > −∞ then h(y) ≍ (y − y∞)^{-β} for some β > d as y ↓ y∞.
(D.3) If y∞ = −∞ then h(y)^γ h(−Cy) = o(1) for some γ, C > 0 as y → −∞.


(D.4) The function h is continuously differentiable on the interval (y∞, y0).
Note that assumption D.1 is satisfied if y0 < +∞. Now we define the decreasing class of densities P(h).

Definition 2.3. For a decreasing transformation h, a decreasing class of convex-transformed densities, or simply a decreasing model P(h) on R^d, is the family of all bounded densities of the form h ◦ g, where g is a closed proper convex function with dim(dom g) = d.

Remark 2.4. Consider a density h ◦ g from a decreasing model P(h). Since h ◦ g is bounded we have g > y∞. For the sublevel set C = lev_{y0} g, the function g̃ = g + δ(· | C) is convex and h ◦ g̃ = h ◦ g. Thus we can assume that lev_{y0} g = dom g.

For a monotone transformation h we denote by G(h) the class of all convex functions g such that h ◦ g belongs to the class P(h). The following lemma allows us to compare models defined by increasing or decreasing transformations h.

Lemma 2.5. Consider two decreasing (or increasing) models P(h1) and P(h2). If h1 = h2 ◦ f for some convex function f, then P(h1) ⊆ P(h2).

Proof. The argument below is for a decreasing model; for an increasing model the proof is similar. If f(x) > f(y) for some x < y, then f is decreasing on (−∞, x), f(−∞) = +∞, and therefore h2 is constant on (f(x), +∞), so we can redefine f(y) = f(x) for all y < x. Thus we can always assume that f is nondecreasing. For any convex function g, the function f ◦ g is also convex. Therefore, if p = h1 ◦ g ∈ P(h1), then p = h2 ◦ f ◦ g ∈ P(h2). ∎

In this section we discuss several examples of monotone models. First, two families based on increasing transformations h:

Example 2.6 (Log-convex densities). This increasing model is defined by h(y) = e^y and C = R^d_+. The limit points are y0 = −∞ and y∞ = ∞. Assumption I.1 holds for any α > d. These classes of densities were considered by An [1998], who established several useful preservation properties.
In particular, log-convexity is preserved under mixtures (An [1998, Proposition 3]) and under marginalization (An [1998, Remark 8, p. 361]).
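The mixture-preservation property is easy to illustrate numerically. The sketch below (our illustration, not from the paper) checks midpoint convexity of log p on a uniform grid for an equal-weight mixture of two exponential densities, each of which is log-linear and hence log-convex.

```python
import numpy as np

# Equal-weight mixture of Exp(1) and Exp(3) on R_+; by An [1998,
# Proposition 3] the mixture of log-convex densities is log-convex.
def mixture(x, rates=(1.0, 3.0)):
    return sum(r * np.exp(-r * x) for r in rates) / len(rates)

xs = np.linspace(0.01, 5.0, 200)
logp = np.log(mixture(xs))
# On a uniform grid, convexity of log p means nonnegative second differences.
second_diff = logp[:-2] - 2 * logp[1:-1] + logp[2:]
ok = bool((second_diff >= -1e-10).all())
print(ok)  # expect True
```

As before, a finite-grid check is only a sanity test of the stated preservation property.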


Example 2.7 (r-Convex densities). This family of increasing models is defined by the transforms h(y) = max(y, 0)^s = y_+^s with s > 0 and C = R^d_+. The limit points are y0 = 0 and y∞ = +∞. Assumption I.1 holds for any α > d. As noted in Section 1, the model P(y_+^{1/r}, R^d_+) corresponds to the class of r-convex densities, with r = ∞ corresponding to the log-convex densities of the previous example. For r < ∞ these classes seem not to have been previously discussed or considered, except in special cases: the case r = 1 and d = 1 corresponds to the class of decreasing convex densities on R_+ considered by Groeneboom, Jongbloed and Wellner [2001]. It follows from Lemma 2.5 that

(2.1)  P(e^y, R^d_+) ⊂ P(y_+^{s1}, R^d_+) ⊂ P(y_+^{s2}, R^d_+)  for 0 < s2 < s1 < ∞.

Now for some models based on decreasing transformations h:

Example 2.8 (Log-concave densities). This decreasing model is defined by the transform h(y) = e^{-y}. The limit points are y0 = +∞ and y∞ = −∞. Assumption D.1 holds for any α > d. Assumption D.3 holds for any C > γ > 0. Many parametric models are subsets of this model. Below we specify the convex functions g which correspond to the densities of several distributions:
1. Uniform: The density of a uniform distribution on a convex set C is log-concave: g(x) = − log(µ[C]) + δ(x | C).
2. Normal: The density of a multivariate normal distribution (µ, Σ) with Σ nonsingular is log-concave: g(x) = (1/2)(x − µ)^T Σ^{-1}(x − µ) + (1/2) log |Σ| + (d/2) log 2π.
3. Gamma: The density of the Gamma distribution (r, λ) is log-concave for r > 1: g(x) = −(r − 1) log x + λx − r log λ + log Γ(r).
4. Beta: The density of the Beta distribution (α, β) is log-concave for α, β > 1: g(x) = −(α − 1) log x − (β − 1) log(1 − x) + log B(α, β).
Gumbel, Fréchet and logistic distributions also have log-concave densities.
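For a concrete check of one of the potentials g listed above, the sketch below (ours, not from the paper; parameter values are chosen arbitrarily for illustration) verifies numerically that the Gamma potential g(x) = −(r − 1) log x + λx − r log λ + log Γ(r) is convex for r > 1, and that exp(−g) integrates to approximately one.

```python
import math
import numpy as np

def g_gamma(x, r=2.5, lam=1.5):
    """Convex potential of the Gamma(r, lam) density, so that p(x) = exp(-g(x))."""
    return -(r - 1) * math.log(x) + lam * x - r * math.log(lam) + math.lgamma(r)

xs = np.linspace(0.05, 12.0, 500)
vals = np.array([g_gamma(x) for x in xs])

# Discrete convexity check: second differences of a convex function are >= 0.
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
print(bool((second_diff >= -1e-9).all()))  # expect True

# Sanity check that exp(-g) is (approximately) a probability density.
p = np.exp(-vals)
mass = float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(xs)))
print(round(mass, 2))  # approximately 1.0
```

The same second-difference check applies verbatim to the normal and Beta potentials.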


Example 2.9 (r-Concave densities; Power-convex densities). This family of decreasing models is defined by the transforms h(y) = y_+^{-s} for s > d. The limit points are y0 = +∞ and y∞ = 0. Assumption D.1 holds for any α ∈ (d, s]. Assumption D.2 holds for β = s. As noted in Section 1, the model P(y_+^{1/r}) = P(y_+^{-s}) (with r = −1/s < 0) corresponds to the class of r-concave densities. From Lemma 2.5 we have the following inclusion:

(2.2)  P(e^{-y}) ⊂ P(y_+^{-s2}) ⊂ P(y_+^{-s1})  for s1 < s2.

The models defined by power transformations include some parametric models with heavier than exponential tails. The following examples are discussed in Borell [1975]. We use Johnson and Kotz [1972] as a reference.
1. Pareto (Johnson and Kotz [1972], 42.3): The density of a multivariate Pareto distribution (θ, a) is power-convex for s ∈ (d, a + d]:

g(x) = ( Γ(a) ∏ θ_i / Γ(a + d) )^{1/s} ( 1 − d + Σ x_i/θ_i )^{(a+d)/s}.

2. Student (Johnson and Kotz [1972], 37.3): The density of a multivariate t-distribution (d, n, µ, Σ) is power-convex for s ∈ (d, n + d]:

g(x) = ( Γ(n/2)(nπ)^{d/2} |Σ|^{1/2} / Γ((d + n)/2) )^{1/s} ( 1 + (1/n)(x − µ)^T Σ^{-1}(x − µ) )^{(d+n)/2s}.

3. Snedecor (Johnson and Kotz [1972], 40.8): The density of a multivariate F-distribution (n0, n1, . . . , nd) with ni ≥ 2 is power-convex for s ∈ (d, (n0/2) + d]:

g(x) = C(n_i, d, s) ( n0 + Σ_{i=1}^d n_i x_i )^{n/2s} ( ∏_{i=1}^d x_i^{2−n_i} )^{1/2s},

where n = Σ_{i=0}^d n_i.

Since the distributions above belong to the power-convex models only for bounded values of the parameter s, the inclusion (2.2) implies that they do not belong to the log-concave model (corresponding to s = +∞). Borell [1975] developed a framework which unifies log-concave and power-convex densities and gives an interesting characterization of these classes. Here we briefly state the main result.


Definition 2.10. Let C ⊆ R^d be an open convex set and let s ∈ R. Then we define M_s(C) as the family of all positive Radon measures µ on C such that

(2.3)  µ∗(θA + (1 − θ)B) ≥ [θ µ∗(A)^s + (1 − θ) µ∗(B)^s]^{1/s}

holds for all ∅ ≠ A, B ⊆ C and all θ ∈ (0, 1). We define M°_s(C) as the subfamily of M_s(C) consisting of the probability measures such that the affine hull of the support has dimension d. Here µ∗ is the inner measure corresponding to µ, and the cases s = 0, ∞ are defined by continuity.

Then one of the main results of Borell [1975], Prékopa [1973], and Rinott [1976] is as follows:

Theorem 2.11 (Borell, Prékopa, Rinott). For s < 0 the family M°_s(R^d) coincides with the power-convex family P(y_+^{-d+1/s}). For s = 0 the family M°_0(R^d) coincides with the log-concave family P(e^{-y}).

Finally, Corollary 3.1 of Borell [1975] says that the condition (2.3) can be relaxed:

Theorem 2.12 (Borell). Let Ω ⊆ R^d be an open convex set and let µ be a positive Radon measure on Ω. Then µ ∈ M_s(Ω) if and only if

µ( (1/2)A1 + (1/2)A2 ) ≥ [ (1/2) µ(A1)^s + (1/2) µ(A2)^s ]^{1/s}

holds for all compact (or open, or semiopen) blocks A1, A2 ⊆ Ω (i.e. rectangles with sides parallel to the coordinate axes).

Theorem 2.11 gives a special case of what has come to be known as the Borell-Brascamp-Lieb inequality; see e.g. Dharmadhikari and Joag-Dev [1988] and Brascamp and Lieb [1976]. The current terminology is apparently due to Cordero-Erausquin, McCann and Schmuckenschläger [2001].

2.2. Existence of the Maximum Likelihood Estimators. Now suppose that X1, . . . , Xn are i.i.d. with density p0(x) = h(g0(x)) for a fixed monotone transformation h and a convex function g0. As before, P_n = n^{-1} Σ_{i=1}^n δ_{X_i} is the empirical measure of the X_i's and P0 is the probability measure corresponding to p0. Then L_n(g) = P_n log h ◦ g is the log-likelihood function (divided by n), and

p̂_n ≡ argmax{L_n(g) : h ◦ g ∈ P(h)}


is the maximum likelihood estimator of p over the class P(h), assuming it exists and is unique. We also write ĝ_n for the MLE of g. We first state our main results concerning existence and uniqueness of the MLEs for the classes P(h):

Theorem 2.13. Suppose that h is an increasing transformation satisfying assumptions I.1 - I.3. Then the MLE p̂_n exists almost surely for the model P(h).

Theorem 2.14. Suppose that h is a decreasing transformation satisfying assumptions D.1 - D.4. Then the MLE p̂_n exists almost surely for the model P(h) if

n ≥ n_d ≡ d(1 + γ),                if y∞ = −∞,
        ≡ d + βd^2/(α(β − d)),     if y∞ > −∞.
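The threshold n_d is explicit and easy to evaluate. The following sketch (our transcription of the formula in Theorem 2.14, with parameter values chosen by us for illustration) computes it in both regimes.

```python
def n_threshold(d, alpha=None, beta=None, gamma=None, y_inf_finite=False):
    """Sample-size threshold n_d of Theorem 2.14 (our transcription)."""
    if not y_inf_finite:                    # y_infinity = -infinity: uses (D.3)
        return d * (1 + gamma)
    return d + beta * d ** 2 / (alpha * (beta - d))  # y_infinity > -infinity

# Log-concave model h(y) = e^{-y} in R^2: D.3 holds with gamma arbitrarily
# small, so 2(1 + gamma) can be taken close to d = 2.
print(n_threshold(2, gamma=0.5))                               # -> 3.0
# Power model h(y) = y_+^{-6} in R^2 (y_infinity = 0 > -infinity), taking
# the admissible values alpha = beta = 6:
print(n_threshold(2, alpha=6.0, beta=6.0, y_inf_finite=True))  # -> 3.0
```

In both illustrative cases a handful of observations already guarantees almost-sure existence of the MLE.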

Uniqueness of the MLE is known for the log-concave model P(e^{-y}); see e.g. Duembgen and Rufibach [2009] for d = 1 and Cule et al. [2007] for d ≥ 1. For a brief further comment see Section 2.5.

2.3. Consistency of the Maximum Likelihood Estimators. Once existence of the MLEs is assured, our attention shifts to other properties of the estimators: our main concern in this subsection is consistency. While for a decreasing model it is possible to prove consistency without any restrictions, for an increasing model we need the following assumptions about the true density h ◦ g0:
(I.4) The function g0 is bounded by some constant C < y∞.
(I.5) If d > 1 then, with |x| ≡ ∏_{j=1}^d x_j for x ∈ R^d_+,

C_g ≡ ∫_{R^d_+} log( 1/(|x| ∧ 1) ) dP0(x) < ∞.





Remark 2.15. Note that for d = 1 assumption I.5 follows from assumption I.4 and the integrability of log(1/x) at zero. This assumption also holds if P0 has finite marginal densities.
(I.6) We have:

∫_{R^d_+} (h log h) ◦ g0(x) dx < ∞.


Let H(p, q) denote the Hellinger distance between two probability measures with densities p and q with respect to Lebesgue measure on R^d:

(2.4)  H^2(p, q) ≡ (1/2) ∫_{R^d} (√p(x) − √q(x))^2 dx = 1 − ∫_{R^d} √(p(x)q(x)) dx.
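Formula (2.4) can be evaluated numerically. The sketch below (our illustration; the helper names are our own) computes H(p, q) for two univariate normal densities by quadrature and validates it against the closed form H^2 = 1 − exp(−m^2/8) for N(0, 1) versus N(m, 1).

```python
import numpy as np

def hellinger(p, q, xs):
    """H(p, q) from (2.4), computed with the trapezoid rule on the grid xs."""
    integrand = (np.sqrt(p(xs)) - np.sqrt(q(xs))) ** 2
    h2 = 0.5 * float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(xs)))
    return float(np.sqrt(h2))

phi = lambda mu: (lambda x: np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi))
xs = np.linspace(-10.0, 10.0, 4001)
m = 1.0
h_num = hellinger(phi(0.0), phi(m), xs)
h_exact = float(np.sqrt(1 - np.exp(-m ** 2 / 8)))
print(abs(h_num - h_exact) < 1e-6)  # expect True
```

The truncation to [−10, 10] is harmless here because the Gaussian tails beyond the grid are numerically negligible.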

Our main results about increasing models are as follows.

Theorem 2.16. For an increasing model P(h), where h satisfies assumptions I.1 - I.3, and for a true density h ◦ g0 which satisfies assumptions I.4 - I.6, the sequence of MLEs {p̂_n = h ◦ ĝ_n} is Hellinger consistent: H(p̂_n, p0) = H(h ◦ ĝ_n, h ◦ g0) →a.s. 0.

Theorem 2.17. For an increasing model P(h), where h satisfies assumptions I.1 - I.3, and for a true density h ◦ g0 which satisfies assumptions I.4 - I.6, the sequence of MLEs ĝ_n is pointwise consistent. That is, ĝ_n(x) →a.s. g0(x) for x ∈ ri(R^d_+), and the convergence is uniform on compacta.

The results about decreasing models can be formulated in a similar way.

Theorem 2.18. For a decreasing model P(h), where h satisfies assumptions D.1 - D.4, the sequence of MLEs {p̂_n = h ◦ ĝ_n} is Hellinger consistent: H(p̂_n, p0) = H(h ◦ ĝ_n, h ◦ g0) →a.s. 0.

Theorem 2.19. For a decreasing model P(h), with h satisfying assumptions D.1 - D.4, the sequence of MLEs ĝ_n is pointwise consistent in the following sense. Define g0* = g0 + δ(· | ri(dom g0)). Then g0* = g0 a.e., ĝ_n →a.s. g0*, and the convergence is uniform on compacta. Moreover, if dom g0 = R^d then ‖h ◦ ĝ_n − h ◦ g0‖∞ →a.s. 0.

2.4. Local Asymptotic Minimax Lower Bounds. In this section we establish local asymptotic minimax lower bounds for any estimator of several functionals of interest on the family P(h) of convex-transformed densities. We start with several general results following Jongbloed [2000], and then apply them to estimation at a fixed point and to mode estimation. First, we define the minimax risk as in Donoho and Liu [1991]:


Definition 2.20. Let P be a class of densities on R^d with respect to Lebesgue measure and let T be a functional T : P → R. For an increasing convex loss function l on R_+ we define the minimax risk as

(2.5)  R_l(n; T, P) = inf_{t_n} sup_{p∈P} E_{p^{×n}} l(|t_n(X1, . . . , Xn) − Tp|),

where t_n ranges over all possible estimators of Tp based on X1, . . . , Xn.

The main result (Theorem 1) in Jongbloed [2000] can be formulated as:

Theorem 2.21 (Jongbloed). Let {p_n} be a sequence of densities in P such that

lim sup_{n→∞} √n H(p_n, p) ≤ τ

for some density p in P. Then:

(2.6)  lim inf_{n→∞} R_l(n; T, {p, p_n}) / l( (1/4) e^{−2τ^2} |T(p_n) − T(p)| ) ≥ 1.

It will be convenient to reformulate this result in the following form:

Corollary 2.22. Suppose that for any ε > 0 small enough, there exists p_ε ∈ P such that for some r > 0:

lim_{ε→0} ε^{-1} |Tp_ε − Tp| = 1  and  lim sup_{ε→0} ε^{-r} H(p_ε, p) ≤ c.

Then there exists a sequence {p_n} such that:

(2.7)  lim inf_{n→∞} n^{1/2r} R_1(n; T, {p, p_n}) ≥ (1/(4(2re)^{1/2r})) c^{-1/r},

where R_1 is the risk which corresponds to l(x) = |x|.

Corollary 2.22 shows that, for a fixed change in the value of the functional T, a family p_ε which is closer to the true density p in Hellinger distance provides a sharper lower bound. This suggests that for a functional T which depends only on the local structure of the density we would like our family {p_ε} to deviate from p locally as well. Below, we formally define such local deviations.
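To get a feel for the constant in (2.7), the following sketch (our arithmetic illustration, not from the paper) evaluates the right-hand side for the exponent that matches the pointwise rate n^{2/(d+4)} of the next subsection; the linkage r = (d + 4)/4 is our reading of that rate, since then n^{-1/(2r)} = n^{-2/(d+4)}.

```python
import math

def jongbloed_bound(c, r):
    """Right-hand constant in (2.7): c**(-1/r) / (4 * (2*r*e)**(1/(2*r)))."""
    return c ** (-1.0 / r) / (4.0 * (2.0 * r * math.e) ** (1.0 / (2.0 * r)))

# Illustration for d = 2, so r = (d + 4)/4 = 1.5, with Hellinger constant c = 1.
d = 2
r = (d + 4) / 4
print(round(jongbloed_bound(c=1.0, r=r), 4))  # approximately 0.1242
```

Larger Hellinger constants c shrink the bound, matching the discussion above: deformations that are cheaper in Hellinger distance yield sharper lower bounds.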


Definition 2.23. We call a family of measurable functions {p_ε} a deformation of a measurable function p if p_ε is defined for any ε > 0 small enough,

lim_{ε→0} ess sup |p − p_ε| = 0,

and there exist a bounded family of real numbers r_ε and a point x0 such that:

µ[supp |p_ε(x) − p(x)|] > 0,  supp |p_ε(x) − p(x)| ⊆ B(x0, r_ε).

If in addition we have lim_{ε→0} r_ε = 0,

we say that {p_ε} is a local deformation at x0.

Since for a deformation p_ε we have µ[supp |p_ε(x) − p(x)|] > 0 for every ε > 0, there exists δ > 0 such that µ{x : |p_ε(x) − p(x)| > δ} > 0, and thus the L_r-distance from p_ε to p is positive for all ε > 0. Note that this is always true if p and p_ε are continuous at x0 and p_ε(x0) ≠ p(x0).

Now we can state our lower bound for estimation of the convex-transformed density value at a fixed point x0. This result relies on the properties of strongly convex functions as described in Appendix 4.4 and can be applied to both increasing and decreasing classes of convex-transformed densities.

Theorem 2.24. Let h be a monotone transformation, let p = h ◦ g ∈ P(h) be a convex-transformed density, and suppose that x0 is a point in ri(dom g) such that h is continuously differentiable at g(x0), h ◦ g(x0) > 0, h′ ◦ g(x0) ≠ 0, and curv_{x0} g > 0. Then, for the functional T(h ◦ g) ≡ g(x0) there exists a sequence {p_n} ⊂ P(h) such that:

(2.8)  lim inf_{n→∞} n^{2/(d+4)} R_1(n; T, {h ◦ g, p_n}) ≥ C(d) [ h ◦ g(x0)^2 curv_{x0} g / h′ ◦ g(x0)^4 ]^{1/(d+4)},

where the constant C(d) depends only on the dimension d.

Remark 2.25. If in addition g is twice continuously differentiable at x0 and ∇^2 g(x0) is positive definite, then by Lemma 4.22 we have curv_{x0} g = det(∇^2 g(x0)).


In Jongbloed [2000] lower bounds were constructed for functionals with values in R. However, it is easy to see that the proof does not change for functionals with values in an arbitrary metric space (V, s) if instead of |Tp − Tp_n| we consider s(Tp, Tp_n). We define:

(2.9)  R_s(n; T, P) = inf_{t_n} sup_{p∈P} E_{p^{×n}} s(t_n(X1, . . . , Xn), Tp),

and the analogue of Corollary 2.22 has the following form:

Corollary 2.26. Suppose that for any ε > 0 small enough, there exists p_ε ∈ P such that for some r > 0:

lim_{ε→0} ε^{-1} s(Tp_ε, Tp) = 1  and  lim sup_{ε→0} ε^{-r} H(p_ε, p) ≤ c.

Then there exists a sequence {p_n} such that:

(2.10)  lim inf_{n→∞} n^{1/2r} R_s(n; T, {p, p_n}) ≥ (1/(4(2re)^{1/2r})) c^{-1/r}.

We now consider estimation of the functional T(h ◦ g) = argmin(g) ∈ R^d for the density p = h ◦ g ∈ P(h), assuming that the minimum is unique. This is equivalent to estimation of the mode of p = h ◦ g. Construction of a lower bound for the functional T is similar to the procedure we presented for estimation of p = h ◦ g at a fixed point x0. Again, we use two opposite deformations: one is local and changes the functional value; the other is a convex combination with a fixed deformation and is negligible in the Hellinger distance computation. However, in this case the minimax rate also depends on the growth rate of g.

Theorem 2.27. Let h be a decreasing transformation, let h ◦ g ∈ P(h) be a convex-transformed density, and let the point x0 ∈ ri(dom g) be the unique global minimum of g, such that h is continuously differentiable at g(x0), h′ ◦ g(x0) ≠ 0 and curv_{x0} g > 0. In addition, assume that g is locally Hölder continuous at x0:

|g(x) − g(x0)| ≤ L ‖x − x0‖^γ


with respect to some norm ‖·‖. Then, for the functional T(h ◦ g) ≡ argmin g there exists a sequence {p_n} ⊂ P(h) such that:

(2.11)  lim inf_{n→∞} n^{2/(γ(d+4))} R_s(n; T, {p, p_n}) ≥ C(d) L^{-1/γ} [ h ◦ g(x0)^2 curv_{x0} g / h′ ◦ g(x0)^4 ]^{1/(γ(d+4))},

where the constant C(d) depends only on the dimension d, and the metric s(x, y) is defined as ‖x − y‖.

Remark 2.28. If in addition g is twice continuously differentiable at x0 and ∇^2 g(x0) is positive definite, then by Lemma 4.22 we have curv_{x0} g = det(∇^2 g(x0)), and g is locally Hölder continuous at x0 with exponent γ = 2 and any constant L > ‖∇^2 g(x0)‖.

Remark 2.29.

Since curv_{x0} g > 0, there exists a constant C such that

C ‖x − x0‖^2 ≤ |g(x) − g(x0)|

near x0, and thus we have γ ∈ (0, 2].

2.5. Conjectures concerning uniqueness of MLEs. There exist counterexamples to uniqueness for nonconvex transformations h which satisfy assumptions D.1 - D.4. They suggest that uniqueness of the MLE does not depend on the tail behavior of the transformation h, but rather on the local properties of h in neighborhoods of the optimal values ĝ_n(X_i). We conjecture that uniqueness holds for all monotone models if h is convex and h/|h′| is nondecreasing and convex. Further work on the uniqueness issues is needed.

2.6. Conjectures about rates of convergence for the MLEs. We conjecture that the (optimal) rate of convergence n^{2/(d+4)} appearing in Theorem 2.24 for estimation of p(x0) will be achieved by the MLE only for d = 2, 3. For d = 4, we conjecture that the MLE will come within a factor (log n)^{-γ} (for some γ > 0) of achieving the rate n^{1/4}, but for d > 4 we conjecture that the rate of convergence will be the suboptimal rate n^{1/d}. This conjectured rate sub-optimality raises several interesting further issues:
• Can we find alternative estimators (perhaps via penalization or sieve methods) which achieve the optimal rates of convergence?
• For interesting sub-classes, do maximum likelihood estimators remain rate optimal?


Baraud and Birgé [2009] have recently studied estimation in generalizations of our classes P(h) in which both h and g are unknown, with h : R^m → R and g : R^d → R^m. While they achieve optimal rates of estimation in their more general problem with a class of estimators based on multiple testing and model selection, their procedures will very likely have several drawbacks relative to the methods we have studied here, including (a) not belonging to the classes of densities defined by the models, and (b) difficulties in computation or implementation.

3. Proofs.

3.1. Preliminaries: properties of increasing transformations.

Lemma 3.1. Let h be an increasing transformation and let g be a closed proper convex function with dom g = R^d_+ such that

∫_{R^d_+} h ◦ g dx = C < ∞.

Then the following are true:
1. For a sublevel set lev_y g with y > y0 we have: µ[(lev_y g)^c] ≤ C/h(y).
2. For any point x0 ∈ R^d_+ and any subgradient a ∈ ∂g(x0), all coordinates of a are nonpositive. If in addition g(x0) > y0, then all coordinates of a are negative.
3. For any point x0 ∈ R^d_+ such that g(x0) > y0 we have:

h ◦ g(x0) ≤ C d! / (d^d |x0|),

where |x| ≡ ∏_{k=1}^d x_k for x ∈ R^d_+.
4. The function g reverses the partial order on R^d_+: if x1 < x2 then g(x1) ≥ g(x2), and the last inequality is strict if g(x1) > y0.
5. The supremum of g on R^d_+ is attained at 0.

Proof.

1. Since h is nondecreasing we have h(y) > 0 and: Z

C=

d R+

h ◦ gdx ≥

Z (levy

g)c

h ◦ gdx ≥ h(y)µ[(levy g)c ].


MULTIVARIATE CONVEX DENSITIES


2. Consider the linear function $l(x) = a^T(x - x_0) + g(x_0)$. We have $g \ge l$. If the vector $a$ has a nonnegative coordinate $a_i$, then consider a closed ball $B = \bar B(x_0) \subset \mathbb{R}^d_+$. If $m$ is the minimum of the function $l$ on $B$, then the minimum of the function $h \circ l$ on $B + \lambda e_i$ is equal to $h(m + \lambda a_i)$, where $e_i$ is the element of the basis which corresponds to the $i$th coordinate. For $\lambda > 0$ we have $B + \lambda e_i \subset \mathbb{R}^d_+$. If $a_i > 0$ then:
\[ \int_{\mathbb{R}^d_+} h \circ g \, dx \ge \int_{\mathbb{R}^d_+} h \circ l \, dx \ge \int_{B + \lambda e_i} h \circ l \, dx \ge \mu[B] \, h(m + \lambda a_i) \to +\infty \]
as $\lambda \to \infty$, which contradicts the assumption. If $a_i = 0$ and $g(x_0) = l(x_0) > y_0$, then we can choose the radius of the ball small enough so that $m > y_0$. Then:
\[ \int_{\mathbb{R}^d_+} h \circ g \, dx \ge \int_{\mathbb{R}^d_+} h \circ l \, dx \ge \int_K h \circ l \, dx \ge \mu[K] \, h(m) = +\infty, \]
where $K \equiv \cup_{\lambda > 0} (B + \lambda e_i)$, and this again contradicts the assumption.
3. Consider a subgradient $a \in \partial g(x_0)$. For the linear function $l(x) = a^T(x - x_0) + g(x_0)$ we have $g \ge l$ and $l(x_0) = g(x_0)$, therefore $(\operatorname{lev}_{g(x_0)} l)^c \subseteq (\operatorname{lev}_{g(x_0)} g)^c$. From the previous statement we have that $(\operatorname{lev}_{g(x_0)} l)^c \cap \mathbb{R}^d_+$ is a simplex, and using the arithmetic-geometric mean inequality we have:
\[ \mu[(\operatorname{lev}_{g(x_0)} l)^c] = \frac{|a^T x_0|^d}{d! \, |a|} \ge \frac{d^d \, |x_0|}{d!}, \]
which together with 1. proves the statement.
4. Since $x_1 \in \mathbb{R}^d_+$ and $x_1 < x_2$, we have $x_2 \in \operatorname{ri}(\mathbb{R}^d_+) = \operatorname{ri}(\operatorname{dom} g)$. For any subgradient $a \in \partial g(x_2)$ we have $g(x_1) - g(x_2) \ge a^T(x_1 - x_2) \ge 0$ by the previous statement. Now, if $g(x_1) > y_0$ then we can assume that $g(x_2) > y_0$, since otherwise the statement is trivial. In this case all coordinates of $a$ are negative and:
\[ g(x_1) - g(x_2) \ge a^T(x_1 - x_2) > 0. \]
5. From the previous statement we have $h \circ g \le h \circ g(0)$ on $\mathbb{R}^d_+$, which together with the continuity of $h \circ g$ implies the statement.
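As a concrete sanity check of part 3 (an assumed example, not from the paper): take $h(y) = e^y$ and $g(x) = -\sum_k x_k$, so that $h \circ g$ is the standard exponential density on $\mathbb{R}^d_+$ with $C = 1$. The claimed bound $h \circ g(x) \le C \, d!/(d^d |x|)$ then reduces to the arithmetic-geometric mean inequality and can be verified numerically:

```python
import math
import random

# Assumed example: h(y) = exp(y), g(x) = -(x1 + ... + xd), so that
# h∘g(x) = exp(-sum(x)) is a density on R^d_+ with C = 1.
# Lemma 3.1(3) then claims exp(-sum(x)) <= d! / (d^d * prod(x)).
d = 3
random.seed(0)
for _ in range(1000):
    x = [random.uniform(0.01, 10.0) for _ in range(d)]
    density = math.exp(-sum(x))
    bound = math.factorial(d) / (d ** d * math.prod(x))
    assert density <= bound + 1e-12
```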


Lemma 3.2. Let $h$ be an increasing transformation, $g$ a closed proper convex function on $\mathbb{R}^d_+$, and $Q$ a $\sigma$-finite Borel measure on $\mathbb{R}^d_+$. Then:
\[ \int_{\operatorname{lev}_a g} h \circ g \, dQ = \int_{-\infty}^{a} h'(y) \, Q[(\operatorname{lev}_y g)^c \cap \operatorname{lev}_a g] \, dy. \]

Proof. Using the Fubini-Tonelli theorem we have:
\begin{align*}
\int_{\operatorname{lev}_a g} h \circ g \, dQ
&= \int_{\operatorname{lev}_a g} \int_0^{h(a)} 1\{z \le h \circ g(x)\} \, dz \, dQ(x) \\
&= \int_{\operatorname{lev}_a g} \int_0^{h(a)} 1\{h^{-1}(z) \le g(x)\} \, dz \, dQ(x) \\
&= \int_{\operatorname{lev}_a g} \int_{-\infty}^{a} h'(y) \, 1\{y \le g(x)\} \, dy \, dQ(x) \\
&= \int_{-\infty}^{a} h'(y) \int_{\operatorname{lev}_a g} 1\{y \le g(x)\} \, dQ(x) \, dy \\
&= \int_{-\infty}^{a} h'(y) \, Q[(\operatorname{lev}_y g)^c \cap \operatorname{lev}_a g] \, dy.
\end{align*}
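The identity of Lemma 3.2 can be checked numerically in a simple case (an assumed example: $d = 1$, $h(y) = e^y$, $g(x) = -x$ on $[0, \infty)$, $Q$ Lebesgue measure, $a = 0$), where both sides equal 1:

```python
import math

# Assumed example: d = 1, h(y) = e^y, g(x) = -x on [0, inf), Q = Lebesgue, a = 0.
# LHS: integral of h∘g = e^{-x} over lev_0 g = [0, inf).
# RHS: integral over y < 0 of h'(y) * Q[(lev_y g)^c ∩ lev_0 g] dy;
# here (lev_y g)^c ∩ [0, inf) = [0, -y), so the integrand is e^y * (-y).
N, L = 200_000, 50.0
dx = L / N
lhs = sum(math.exp(-(i + 0.5) * dx) * dx for i in range(N))
# substitute t = -y, giving the integral of e^{-t} * t over t in (0, inf):
rhs = sum(math.exp(-(i + 0.5) * dx) * ((i + 0.5) * dx) * dx for i in range(N))
assert abs(lhs - 1.0) < 1e-3 and abs(rhs - 1.0) < 1e-3
```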

Lemma 3.3. Let $h$ be an increasing transformation and let $g$ be a polyhedral convex function with $\operatorname{dom} g = \mathbb{R}^d_+$ such that:
\[ \int_{\mathbb{R}^d_+} h \circ g \, dx < \infty. \]
Then $g(0) < y_\infty$.

Proof. For $y_\infty = +\infty$ the statement is trivial, so we assume that $y_\infty$ is finite. If $g(0) > y_\infty$ then, since $g$ is continuous, there exists a ball $B \subset \mathbb{R}^d_+$ small enough that $g > y_\infty$ on $B$, and therefore
\[ \int_{\mathbb{R}^d_+} h \circ g \, dx = \infty. \]
Let us assume that $g(0) = y_\infty$. By Lemma 4.13 there exists $a \in \partial g(0)$, and therefore $g(x) \ge l(x) \equiv a^T x + y_\infty$. Let $a_m$ be the minimum among the coordinates of the vector $a$ and $-1$. Then on $\mathbb{R}^d_+$ we have $l(x) \ge l_1(x) \equiv a_m \mathbf{1}^T x + y_\infty$, where $a_m < 0$ and thus $l_1(x) \le y_\infty$. By Lemma 3.2 we have:
\[ \int_{\mathbb{R}^d_+} h \circ g \, dx \ge \int_{\mathbb{R}^d_+} h \circ l_1 \, dx = \int_{-\infty}^{y_\infty} h'(y) \, \mu[(\operatorname{lev}_y l_1)^c \cap \mathbb{R}^d_+] \, dy. \]


The set $A_y = (\operatorname{lev}_y l_1)^c \cap \mathbb{R}^d_+$ is a simplex and:
\[ \mu[A_y] = \frac{(y_\infty - y)^d}{d! \, (-a_m)^d} \]
for $y \le y_\infty$. By assumption I.2 we have $h'(y) \asymp (y_\infty - y)^{-\beta - 1}$ as $y \uparrow y_\infty$, where $\beta > d$, and therefore:
\[ \int_{\mathbb{R}^d_+} h \circ l_1 \, dx = \int_{\mathbb{R}^d_+} h \circ g \, dx = +\infty. \]
This contradiction proves that $g(0) < y_\infty$.

Lemma 3.4. Let $h$ be an increasing transformation and let $l(x) = a^T x + b$ be a linear function such that all coordinates of $a$ are negative and $b < y_\infty$. Then:
\[ \int_{\mathbb{R}^d_+} h \circ l \, dx < \infty. \]

Proof. We have $l \le b$ on $\mathbb{R}^d_+$ and by Lemma 3.2:
\[ \int_{\mathbb{R}^d_+} h \circ l \, dx = \int_{-\infty}^{b} h'(y) \, \mu[(\operatorname{lev}_y l)^c \cap \mathbb{R}^d_+] \, dy. \]
The set $A_y = (\operatorname{lev}_y l)^c \cap \mathbb{R}^d_+$ is a simplex and:
\[ \mu[A_y] = \frac{(b - y)^d}{d! \, |-a|} \]

for $y \le b$. By assumption I.1 we have $h'(y) = o(|y|^{-\alpha - 1})$ as $y \to -\infty$ for $\alpha > d$, and therefore the integral is finite.

Lemma 3.5. Let $h$ be an increasing transformation and suppose that $K \subset \mathbb{R}^d_+$ is a compact set. Then there exists a closed proper convex function $g \in \mathcal{G}(h)$ such that $g > y_0$ on $K$.

Proof. If $y_0 = -\infty$ then consider the function $T(c)$ defined as:
\[ T(c) = \int_{\mathbb{R}^d_+} h \circ (-\mathbf{1}^T x + c) \, dx. \]
By Lemma 3.4, $T(c)$ is finite for $c < y_\infty$, and by Lemma 3.3 we conclude that $T(y_\infty) = +\infty$. By monotone convergence $T$ is left-continuous


for $c \in (-\infty, y_\infty]$, and by dominated convergence it is right-continuous for $c \in (-\infty, y_\infty)$. Since $T(-\infty) = 0$, there exists $c_1 < y_\infty$ such that $T(c_1) = 1$, and thus the linear function $l(x) = -\mathbf{1}^T x + c_1$ belongs to $\mathcal{G}(h)$. If $y_0 < -\infty$ then choose $M$ such that $\mathbf{1}^T x < M$ on $K$. Consider the function $T(c)$ defined as:
\[ T(c) = \int_{\mathbb{R}^d_+} h \circ (c(-\mathbf{1}^T x + M) + y_0) \, dx. \]
By Lemma 3.4, $T(c)$ is finite for $c < (y_\infty - y_0)/M$, and by Lemma 3.3, $T((y_\infty - y_0)/M) = +\infty$. By monotone and dominated convergence $T$ is continuous for $c \in [0, (y_\infty - y_0)/M]$. Since $T(0) = 0$, there exists $c_1 \in (0, (y_\infty - y_0)/M)$ such that $T(c_1) = 1$, so that the linear function $l(x) = c_1(-\mathbf{1}^T x + M) + y_0$ belongs to $\mathcal{G}(h)$. By construction $l > y_0$ on $K$.

3.2. Preliminaries: properties of decreasing transformations.

Lemma 3.6. Let $h$ be a decreasing transformation and $g$ a closed proper convex function such that
\[ \int_{\mathbb{R}^d} h \circ g \, dx = C < \infty. \]
Then the following are true:
1. For $y < +\infty$ the sublevel sets $\operatorname{lev}_y g$ are bounded and we have $\mu[\operatorname{lev}_y g] \le C/h(y)$;
2. The infimum of $g$ is attained at some point $x \in \mathbb{R}^d$.

Proof.

1. We have:
\[ C = \int_{\mathbb{R}^d} h \circ g \, dx \ge \int_{\operatorname{lev}_y g} h \circ g \, dx \ge h(y) \, \mu[\operatorname{lev}_y g], \]
so that $\mu[\operatorname{lev}_y g] \le C/h(y)$. The sublevel set $\operatorname{lev}_y g$ has the same dimension as $\operatorname{dom} g$ (Theorem 7.6, Rockafellar [1970]), which is $d$. By Lemma 4.1 this set is bounded when $y < y_0$. Therefore, it is enough to prove that $\operatorname{lev}_{y_0} g$ is bounded for $y_0 < +\infty$. Since $h \circ g$ is a density we have $\inf g < y_0$. If $g$ is constant on $\operatorname{dom} g$, then for all $y \in [\inf g, +\infty)$ we have $\operatorname{lev}_y g = \operatorname{lev}_{\inf g} g$, which is therefore bounded. Otherwise, we can choose $\inf g \le y_1 < y_2 < y_0$. Then $\mu[\operatorname{lev}_{y_2} g] < \infty$, and by Lemma 4.3 we have $\mu[\operatorname{lev}_{y_0} g] < \infty$. The argument above shows that $\operatorname{lev}_{y_0} g$ is also bounded.


2. This follows from the fact that $g$ is continuous and $\operatorname{lev}_y g$ is bounded and nonempty for $y > \inf g$.

Lemma 3.7. Let $h$ be a decreasing transformation, let $g$ be a closed proper convex function on $\mathbb{R}^d$, and let $Q$ be a $\sigma$-finite Borel measure on $\mathbb{R}^d$. Then:
\[ \int_{(\operatorname{lev}_a g)^c} h \circ g \, dQ = -\int_{a}^{+\infty} h'(y) \, Q[\operatorname{lev}_y g \cap (\operatorname{lev}_a g)^c] \, dy. \]

Proof. Using the Fubini-Tonelli theorem we have:
\begin{align*}
\int_{(\operatorname{lev}_a g)^c} h \circ g \, dQ
&= \int_{(\operatorname{lev}_a g)^c} \int_0^{h(a)} 1\{z \le h \circ g(x)\} \, dz \, dQ(x) \\
&= \int_{(\operatorname{lev}_a g)^c} \int_0^{h(a)} 1\{h^{-1}(z) \ge g(x)\} \, dz \, dQ(x) \\
&= -\int_{(\operatorname{lev}_a g)^c} \int_a^{+\infty} h'(y) \, 1\{y \ge g(x)\} \, dy \, dQ(x) \\
&= -\int_a^{+\infty} h'(y) \int_{(\operatorname{lev}_a g)^c} 1\{y \ge g(x)\} \, dQ(x) \, dy \\
&= -\int_a^{+\infty} h'(y) \, Q[\operatorname{lev}_y g \cap (\operatorname{lev}_a g)^c] \, dy.
\end{align*}
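The decreasing-case identity can likewise be checked numerically (an assumed example: $d = 1$, $h(y) = e^{-y}$, $g(x) = |x|$, $Q$ Lebesgue measure, $a = 0$), where both sides equal 2:

```python
import math

# Assumed example: d = 1, h(y) = e^{-y}, g(x) = |x|, Q = Lebesgue, a = 0.
# LHS: integral of e^{-|x|} over (lev_0 g)^c = {x : |x| > 0}, which is 2.
# RHS: -∫_0^∞ h'(y) Q[lev_y g ∩ (lev_0 g)^c] dy; here the intersection
# is {0 < |x| <= y}, with measure 2y, so the integrand is e^{-y} * 2y.
N, L = 200_000, 50.0
dt = L / N
lhs = 2 * sum(math.exp(-(i + 0.5) * dt) * dt for i in range(N))
rhs = sum(math.exp(-(i + 0.5) * dt) * 2 * ((i + 0.5) * dt) * dt for i in range(N))
assert abs(lhs - 2.0) < 1e-3 and abs(rhs - 2.0) < 1e-3
```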

Lemma 3.8. Let $h$ be a decreasing transformation and let $g$ be a closed proper convex function such that:
\[ \int_{\mathbb{R}^d} h \circ g \, dx < \infty. \]
Then $\inf g > y_\infty$.

Proof. Since $g$ is proper, the statement is trivial for $y_\infty = -\infty$, so we assume that $y_\infty > -\infty$. If for some $x_0$ we have $g(x_0) = y_\infty$, then there exists a ball $B \equiv B(x_0; r)$ such that $g < y_\infty + \varepsilon$ on $B$. Consider the convex function $f$ defined as:
\[ f(x) = y_\infty + (\varepsilon/r) \|x - x_0\| + \delta(x \mid B). \]
Then by convexity $f \ge g$ and:
\[ \int_{\mathbb{R}^d} h \circ g \, dx \ge \int_{\mathbb{R}^d} h \circ f \, dx. \]


We have $\mu[\operatorname{lev}_y f] = S((y - y_\infty) r/\varepsilon)^d$ for $y \in [y_\infty, y_\infty + \varepsilon]$, where $S$ is the Lebesgue measure of the unit ball $B(0; 1)$, and by Lemma 3.7 we can compute:
\[ \int_{\mathbb{R}^d} h \circ f \, dx = -S (r/\varepsilon)^d \int_{y_\infty}^{y_\infty + \varepsilon} h'(y) (y - y_\infty)^d \, dy. \]
The assumption D.2 implies:
\[ \int_{\mathbb{R}^d} h \circ g \, dx \ge \int_{\mathbb{R}^d} h \circ f \, dx = \infty, \]

which proves the statement.

Lemma 3.9. Let $h$ be a decreasing transformation. Then for any convex function $g$ such that $h \circ g$ belongs to the decreasing model $\mathcal{P}(h)$ we have:
\[ \int_{\mathbb{R}^d} [h \log h] \circ g \, dx < \infty. \]

Proof. By assumption D.1 the function $-[h \log h](y)$ is decreasing to zero as $y \to +\infty$, and we have:
\[ 0 < -[h \log h](y) < C y^{-d - \alpha'} \]
for $C$ large enough and $\alpha' \in (0, \alpha)$, as $y \to +\infty$. By Lemma 3.6 the sublevel sets $\operatorname{lev}_y g$ are bounded, and since $h \circ g \in \mathcal{P}(h)$ we have $\inf g > y_\infty$. Therefore, the integral is finite if and only if the integral
\[ \int_{(\operatorname{lev}_a g)^c} [h \log h] \circ g \, dx \]
is finite for some $a > y_\infty$. Choosing $a$ large enough and using Lemma 3.7 for the decreasing transformation $h_1(y) = y^{-d - \alpha'}$ we obtain:
\[ 0 \ge \int_{(\operatorname{lev}_a g)^c} [h \log h] \circ g \, dx \ge -C \int_{(\operatorname{lev}_a g)^c} h_1 \circ g \, dx \ge -C(d + \alpha') \int_a^{+\infty} y^{-d - \alpha' - 1} \, \mu[\operatorname{lev}_y g] \, dy. \]
By Lemma 4.3 we have $\mu[\operatorname{lev}_y g] = O(y^d)$, and therefore the last integral is finite.

Lemma 3.10. Let $h$ be a decreasing transformation and suppose that $K \subset \mathbb{R}^d$ is a compact set. Then there exists a closed proper convex function $g \in \mathcal{G}(h)$ such that $g < y_0$ on $K$.

Proof. Let $B$ be a ball such that $K \subset B$, and let $c$ be such that $h(c) = 1/\mu[B]$. Then the function $g \equiv c + \delta(\cdot \mid B)$ belongs to $\mathcal{G}(h)$.
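The construction in this proof is easy to instantiate (an assumed example, with $h(y) = e^{-y}$ so that $y_0 = +\infty$): for $d = 2$ and $B$ the ball of radius 2 about the origin, $h(c) = 1/\mu[B]$ gives $c = \log(4\pi)$, and $h \circ g = \mu[B]^{-1} 1_B$ integrates to 1, as a Monte Carlo check confirms:

```python
import math
import random

# Assumed example of the Lemma 3.10 construction: h(y) = e^{-y}, d = 2,
# B = ball of radius 2 about 0, mu[B] = 4*pi, c = log(4*pi),
# g = c + delta(. | B); then h∘g equals 1/(4*pi) on B and 0 outside.
random.seed(1)
c = math.log(4.0 * math.pi)
n, total = 200_000, 0.0
for _ in range(n):
    x1, x2 = random.uniform(-2, 2), random.uniform(-2, 2)
    if x1 * x1 + x2 * x2 <= 4.0:
        total += math.exp(-c)
integral = 16.0 * total / n  # the sampling box [-2, 2]^2 has area 16
assert abs(integral - 1.0) < 0.02
```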


3.3. Proofs for Existence Results. Before giving proofs of Theorems 2.13 and 2.14, we establish two auxiliary lemmas. A set of points $x = \{x_i\}_{i=1}^n$ in $\mathbb{R}^d$ is in general position if for any subset $x' \subseteq x$ of size $d + 1$ the Lebesgue measure of $\operatorname{conv}(x')$ is not zero.

Lemma 3.11. If $X_1, \ldots, X_n$ are i.i.d. with density $p_0 = h \circ g_0 \in \mathcal{P}(h)$ for a monotone transformation $h$, then the observations $X$ are in general position with probability 1.

Proof. Points are not in general position if at least one subset $Y$ of $X$ of size $d + 1$ belongs to a proper affine subspace of $\mathbb{R}^d$. This is true if and only if $X$, as a vector in $\mathbb{R}^{nd}$, belongs to a certain non-degenerate algebraic variety. Since with probability 1 we have $X \subset \operatorname{dom} g_0$ and by definition $\dim(\operatorname{dom} g_0) = d$, the statement follows from Okamoto [1973].

Below we assume that our observations are in general position for any $n$. For an increasing model we also assume that all $X_i$ belong to $\mathbb{R}^d_+$; this assumption holds with probability 1 since $\mu[\overline{\mathbb{R}}{}^d_+ \setminus \mathbb{R}^d_+] = 0$. If an MLE for the model $\mathcal{P}(h)$ exists, then it maximizes the functional
\[ L_n g \equiv \int (\log h) \circ g \, dP_n \]
over $g \in \mathcal{G}(h)$, where the last integral is over $\mathbb{R}^d_+$ for increasing $h$ and over $\mathbb{R}^d$ for decreasing models. The theorem below determines the form of the MLE for an increasing model. We write $\operatorname{ev}_x f = (f(x_1), \ldots, f(x_n))$ for $x = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}^d$.

Lemma 3.12. Consider an increasing transformation $h$. For any convex function $g$ with $\operatorname{dom} g = \mathbb{R}^d_+$ such that $\int_{\mathbb{R}^d_+} h \circ g \, dx \le 1$ and $L_n g > -\infty$, there exists $\tilde g \in \mathcal{G}(h)$ such that $\tilde g \ge g$ and $L_n \tilde g \ge L_n g$. The function $\tilde g$ can be chosen as a minimal element in $\operatorname{ev}_X^{-1} \tilde p$ where $\tilde p = \operatorname{ev}_X \tilde g$.

Proof. Let $p = \operatorname{ev}_X g$. Since $L_n g > -\infty$ we have $g(X_i) > y_0$ for all $1 \le i \le n$, and therefore $g > y_0$ for $x \in \operatorname{conv}(X)$. Consider any minimal element $g_1$ among convex functions in $\operatorname{ev}_X^{-1} p$ (which exists by Lemma 4.15). Then:
\[ \int_{\mathbb{R}^d_+} h \circ g_1 \, dx \le \int_{\mathbb{R}^d_+} h \circ g \, dx \le 1. \]
Since $g_1$ is polyhedral we have $g_1 = \max l_i$ for some linear functions $l_i(x) = a_i^T x + b_i$, and for each function $l_i$ there exists some facet of $g_1$ such that $g_1 = l_i$ on it.


By Lemma 4.15 the interior of the facet of $g_1$ which corresponds to $l_i$ contains some $X_{j_i} \in X$. We have $\partial g_1(X_{j_i}) = \{a_i\}$ and $g_1(X_{j_i}) = g(X_{j_i}) > y_0$. Thus by Lemma 3.1, all coordinates of $a_i$ are negative and the supremum $M$ of $g_1$ is attained at $0$. Therefore $b_i = l_i(0) \le M$. By Lemma 3.3 we have $M < y_\infty$. Thus by Lemma 3.4 the functions $h \circ (l_i + c)$ are integrable for all $c < y_\infty - M$. Since $g_1$ has only a finite number of facets, $h \circ (g_1 + c)$ is also integrable for all $c < y_\infty - M$. Finally, for $c = y_\infty - M$ the function $h \circ (g_1 + c)$ is not integrable by Lemma 3.3. The function $T(c)$ defined as
\[ T(c) \equiv \int_{\mathbb{R}^d_+} h \circ (g_1 + c) \, dx \]
is increasing, finite for $c \in [0, y_\infty - M)$ and continuous for $c \in [0, y_\infty - M]$ by monotone and dominated convergence. Since $T(0) \le 1$ and $T(y_\infty - M) = +\infty$, there exists $c_1 \in (0, y_\infty - M)$ such that $T(c_1) = 1$. Then the function $\tilde g \equiv g_1 + c_1$ satisfies the conditions of our lemma.

Theorem 3.13. If an MLE $\hat g_0$ exists for the increasing model $\mathcal{P}(h)$, then there exists an MLE $\hat g_1$ which is a minimal element in $\operatorname{ev}_X^{-1} q$ where $q = \operatorname{ev}_X \hat g_0$. In other words, $\hat g_1$ is a polyhedral convex function such that $\operatorname{dom} \hat g_1 = \mathbb{R}^d_+$, and the interior of each facet contains at least one element of $X$. If $h$ is strictly increasing on $[y_0, y_\infty]$, then $\hat g_0(x) = \hat g_1(x)$ for all $x$ such that $\hat g_0(x) > y_0$, and thus $\hat g_1$ defines the same density from $\mathcal{P}(h)$.

Proof. Let $\hat g_0$ be any MLE. Then by Lemma 3.5 applied to $K = \operatorname{conv}(X)$ it follows that $L_n \hat g_0 > -\infty$. By Lemma 3.12 there exists a function $\hat g_1 \in \mathcal{G}(h)$ such that $\hat g_1$ is a minimal element in $\operatorname{ev}_X^{-1} q_1$ where $q_1 = \operatorname{ev}_X \hat g_1$, and $\hat g_1 \ge \hat g_0$. Since $\hat g_0$ is an MLE we have $\operatorname{ev}_X \hat g_0 = \operatorname{ev}_X \hat g_1$, which together with Lemma 4.15 proves the first part of the statement. By Lemma 3.3 we have $\hat g_0 < y_\infty$ and $\hat g_1 < y_\infty$. Since $h \circ \hat g_0$ and $h \circ \hat g_1$ are continuous functions, for strictly increasing $h$ the equality
\[ \int_{\mathbb{R}^d_+} (h \circ \hat g_1 - h \circ \hat g_0) \, dx = 0 \]
implies that $\hat g_1(x) = \hat g_0(x)$ for $x$ such that $\hat g_0(x) > y_0$.

Here are the corresponding results for decreasing transformations $h$.

Lemma 3.14. Consider a decreasing transformation $h$. For any convex function $g$ such that
\[ \int_{\mathbb{R}^d} h \circ g \, dx \le 1 \]


and $L_n g > -\infty$, there exists $\tilde g \in \mathcal{G}(h)$ such that $\tilde g \le g$ and $L_n \tilde g \ge L_n g$. The function $\tilde g$ can be chosen as the maximal element in $\operatorname{ev}_X^{-1} \tilde q$ where $\tilde q = \operatorname{ev}_X \tilde g$.

Proof. Let $q = \operatorname{ev}_X g$. Since $L_n g > -\infty$ we have $g(X_i) < y_0$ for all $1 \le i \le n$, and therefore $g < y_0$ for $x \in \operatorname{conv}(X)$. Consider the maximal element $g_1$ among convex functions in $\operatorname{ev}_X^{-1} q$ (which exists and is unique by Lemma 4.14). Then:
\[ \int_{\mathbb{R}^d} h \circ g_1 \, dx \le \int_{\mathbb{R}^d} h \circ g \, dx \le 1. \]
By Lemma 3.6 there exist $x_0$ and $m > -\infty$ such that $g \ge g(x_0) = m$. By Lemma 3.8 we have $m > y_\infty$. By Lemma 4.14 we have $\operatorname{dom} g_1 = \operatorname{conv}(X)$, and therefore
\[ \int_{\mathbb{R}^d} h \circ (g_1 + c) \, dx \le h(m + c) \, \mu[\operatorname{conv}(X)] < \infty \]
for $c > y_\infty - m$. The function $T(c)$ defined as
\[ T(c) \equiv \int_{\mathbb{R}^d} h \circ (g_1 + c) \, dx \]
is decreasing in $c$, finite for $c \in (y_\infty - m, 0]$ and continuous for $c \in [y_\infty - m, 0]$ by monotone and dominated convergence. Since $T(0) \le 1$ and $T(y_\infty - m) = +\infty$, there exists $c_1 \in (y_\infty - m, 0)$ such that $T(c_1) = 1$. Then the function $\tilde g \equiv g_1 + c_1$ satisfies the conditions of our lemma.

Theorem 3.15. If the MLE $\hat g_0$ exists for the decreasing model $\mathcal{P}(h)$, then there exists another MLE $\hat g_1$ which is the maximal element in $\operatorname{ev}_X^{-1} q$ where $q = \operatorname{ev}_X \hat g_0$. In other words, $\hat g_1$ is a polyhedral convex function with the set of knots $K_n \subseteq X$ and domain $\operatorname{dom} \hat g_1 = \operatorname{conv}(X)$. If $h$ is strictly decreasing on $[y_\infty, y_0]$, then $\hat g_0(x) = \hat g_1(x)$.

Proof. Let $\hat g_0$ be any MLE. Then by Lemma 3.10 applied to $K = \operatorname{conv}(X)$ we have that $L_n \hat g_0 > -\infty$. By Lemma 3.14 there exists a function $\hat g_1 \in \mathcal{G}(h)$ such that $\hat g_1$ is the maximal element in $\operatorname{ev}_X^{-1} q_1$ where $q_1 = \operatorname{ev}_X \hat g_1$, and $\hat g_1 \le \hat g_0$. Since $\hat g_0$ is an MLE we have $\operatorname{ev}_X \hat g_0 = \operatorname{ev}_X \hat g_1$, which together with Lemma 4.14 proves the first part of the statement. By Lemma 3.8 we have $\hat g_0 \ge \hat g_1 > y_\infty$. Since $h \circ \hat g_0$ and $h \circ \hat g_1$ are continuous functions, for strictly decreasing $h$ the equality
\[ \int_{\mathbb{R}^d} (h \circ \hat g_1 - h \circ \hat g_0) \, dx = 0 \]


implies that $\hat g_1(x) = \hat g_0(x)$ for $x \in \operatorname{conv}(X)$. Therefore $\hat g_0(x) \ge y_0$ for $x \notin \operatorname{conv}(X)$. Since $\hat g_0$ is convex we have $\hat g_0 = \hat g_1$.

The bounds provided by the following key lemma are the remaining preparatory work for proving existence of the MLE in the case of increasing transformations. For an increasing model $\mathcal{P}(h)$, let us denote by $N(h, X, \varepsilon)$, for $\varepsilon > -\infty$, the family of all convex functions $g \in \mathcal{G}(h)$ such that $g$ is a minimal element in $\operatorname{ev}_X^{-1} q$ where $q = \operatorname{ev}_X g$, and $L_n g \ge \varepsilon$. By Lemma 3.5, the family $N(h, X, \varepsilon)$ is not empty for $\varepsilon > -\infty$ small enough. By construction, for $g \in N(h, X, \varepsilon)$ we have $g(X_i) > y_0$ for $X_i \in X$.

Lemma 3.16. There exist constants $c(x, X, \varepsilon)$ and $C(x, X, \varepsilon) < y_\infty$ which depend only on $x \in \mathbb{R}^d_+$, the observations $X$, and $\varepsilon$, such that for any $g \in N(h, X, \varepsilon)$ we have:
\[ c(x, X, \varepsilon) \le g(x) \le C(x, X, \varepsilon). \]

Proof. By Lemma 3.1 we have:
\[ h \circ g(X_i) \le \frac{d!}{d^d \, |X_i|}, \]
which gives the upper bounds $C(X_i, X, \varepsilon)$. By assumption we have:
\[ (\max_i h \circ g(X_i))^{n-1} \min_i h \circ g(X_i) \ge \prod_i h \circ g(X_i) \ge e^{n\varepsilon}, \]
and therefore:
\[ \min_i h \circ g(X_i) \ge \frac{e^{n\varepsilon}}{h(\max_i C(X_i, X, \varepsilon))^{n-1}}, \]
which gives the uniform lower bound $c(X_i, X, \varepsilon)$ for all $X_i \in X$. Since by Lemma 3.1 $g(0) \ge g(X_i)$, we also obtain $c(0, X, \varepsilon)$.

Now we prove that there exists $C(0, X, \varepsilon)$. Let $l$ be a linear function which defines any facet of $g$ of which $0$ is an element. By Lemma 4.15, there exists $X_a \in X$ which belongs to this facet. Then $g(0) = l(0)$ and $g(X_a) = l(X_a)$. Let us denote by $S$ the simplex $\{l = l(X_a)\} \cap \mathbb{R}^d_+$, by $S^*$ the simplex $\{l \ge l(X_a)\} \cap \mathbb{R}^d_+$, and by $l_0$ the linear function which is equal to $c \equiv \min_i c(X_i, X, \varepsilon)$ on $S$ and to $g(0)$ at $0$. By the arithmetic-geometric mean inequality (as in the proof of Lemma 3.1) we have:
\[ \mu[S^*] \ge \frac{d^d \, |X_a|}{d!}. \]


We have $l \ge l_0$, so:
\[ 1 = \int_{\mathbb{R}^d_+} h \circ g \, dx \ge \int_{S^*} h \circ l_0 \, dx. \]
By Lemma 3.2,
\[ \int_{S^*} h \circ l_0 \, dx = \mu[S^*] \int_c^{g(0)} h'(y) \left( \frac{g(0) - y}{g(0) - c} \right)^d dy \ge \frac{d^d \, |X_a|}{d!} \int_c^{y_\infty} h'(y) \, 1\{y \le g(0)\} \left( \frac{g(0) - y}{g(0) - c} \right)^d dy. \]
Consider the function $T(s)$ defined as:
\[ T(s) = \frac{d^d \, |X_a|}{d!} \int_c^{y_\infty} h'(y) \, 1\{y \le s\} \left( \frac{s - y}{s - c} \right)^d dy. \]
If $y_\infty = +\infty$, then for a fixed $y \in (c, +\infty)$ we have:
\[ h'(y) \, 1\{y \le s\} \left( \frac{s - y}{s - c} \right)^d \uparrow h'(y) \quad \text{as } s \to y_\infty, \]
and by monotone convergence we have:
\[ T(s) \uparrow \frac{d^d \, |X_a|}{d!} \int_c^{y_\infty} h'(y) \, dy = +\infty \quad \text{as } s \to y_\infty. \]
If $y_\infty < +\infty$, then for a fixed $y \in (c, y_\infty]$ we have:
\[ h'(y) \, 1\{y \le s\} \left( \frac{s - y}{s - c} \right)^d \uparrow h'(y) \left( \frac{y_\infty - y}{y_\infty - c} \right)^d \quad \text{as } s \to y_\infty, \]
and by monotone convergence we have:
\[ T(s) \uparrow \frac{d^d \, |X_a|}{d!} \int_c^{y_\infty} h'(y) \left( \frac{y_\infty - y}{y_\infty - c} \right)^d dy = +\infty \quad \text{as } s \to y_\infty, \]
by assumption I.2. Thus there exists $s_0 \in (c, y_\infty)$ such that $T(s_0) > 1$. This implies $g(0) < s_0$. Since $s_0$ depends only on $X_a$ and $\min_i c(X_i, X, \varepsilon)$, this gives an upper bound $C(0, X, \varepsilon)$.

By Lemma 3.1, for any $x_0 \in \mathbb{R}^d_+$ we can set $C(x_0, X, \varepsilon) = C(0, X, \varepsilon)$. Let $l(x) = a^T x + l(0)$ be a linear function which defines the facet of $g$ to which $x_0$ belongs. By Lemma 4.15 there exists $X_a \in X$ which belongs to this facet,


and thus $l(X_a) = g(X_a)$. By Lemma 3.1 we have $a_k < 0$ for all $k$, and by definition $l(0) \le g(0)$. We have
\[ c(X_a, X, \varepsilon) \le g(X_a) = l(X_a) = a^T X_a + l(0) \le a^T X_a + g(0), \]
and therefore
\[ a_k \ge \frac{c(X_a, X, \varepsilon) - C(0, X, \varepsilon)}{(X_a)_k}, \qquad l(0) \ge c(X_a, X, \varepsilon). \]
Now,
\[ g(x_0) = l(x_0) \ge \sum_{k=1}^d \frac{c(X_a, X, \varepsilon) - C(0, X, \varepsilon)}{(X_a)_k} \, (x_0)_k + c(X_a, X, \varepsilon). \]
Since we have only a finite number of possible choices for $X_a$, we have obtained $c(x_0, X, \varepsilon)$, which concludes the proof.

Now we are ready for the proof of Theorem 2.13.

Proof. (Theorem 2.13) By Lemma 3.5 there exists $\varepsilon$ small enough such that the family $N(h, X, \varepsilon)$ is not empty. Clearly, we can restrict MLE candidates $\hat g$ to functions in the family $N(h, X, \varepsilon)$. The set $N = \operatorname{ev}_X N(h, X, \varepsilon)$ is bounded by Lemma 3.16. Let us denote by $q^*$ a point in the closure $\bar N$ of $N$ which maximizes the continuous function
\[ L_n(q) = \frac{1}{n} \sum_{i=1}^n \log h(q_i). \]
Since $q^* \in \bar N$, there exists a sequence of functions $g_k \in N(h, X, \varepsilon)$ such that $\operatorname{ev}_X g_k$ converges to $q^*$. By Theorem 10.9, Rockafellar [1970], and Lemma 3.16 there exists a finite convex function $g^*$ on $\mathbb{R}^d_+$ such that some subsequence $g_l$ converges pointwise to $g^*$. Therefore we have $\operatorname{ev}_X g^* = q^*$. Since $X \subset \mathbb{R}^d_+$ we can assume that $g^*$ is closed. By Fatou's lemma we have:
\[ \int_{\mathbb{R}^d_+} h \circ g^* \, dx \le 1. \]
By Lemma 3.12 there exists $g \in \mathcal{G}(h)$ such that $g \ge g^*$ and $L_n g \ge L_n g^* = L_n(q^*)$. By assumption this implies $L_n g = L_n g^*$. Hence $g$ is the MLE. Finally, we have to add the almost surely clause since we assumed that the points $X_i$ belong to $\mathbb{R}^d_+$.


Before proving existence of the MLE for a decreasing transformation family, we need two lemmas.

Lemma 3.17. Consider a decreasing model $\mathcal{P}(h)$. Let $\{g_k\}$ be a sequence of convex functions from $\mathcal{G}(h)$, and let $\{n_k\}$ be a nondecreasing sequence of positive integers $n_k \ge n_d$ such that for some $\varepsilon > -\infty$ and $\rho > 0$ the following is true:
1. $L_{n_k} g_k \ge \varepsilon$;
2. if $\mu[\operatorname{lev}_{a_k} g_k] = \rho$ for some $a_k$, then $P_{n_k}[\operatorname{lev}_{a_k} g_k] < d/n_d$.
Then there exists $m > y_\infty$ such that $g_k \ge m$ for all $k$.

Proof. Suppose, on the contrary, that $m_k \to y_\infty$ where $m_k = \min g_k$. The first condition implies that $X_d \equiv \{X_1, \ldots, X_{n_d}\} \subset \operatorname{dom} g_k$, and therefore by Corollary 4.4 the function $\mu[\operatorname{lev}_y g_k]$, as a function of $y$, admits all values in the interval $[\mu[\operatorname{lev}_{m_k} g_k], \mu[\operatorname{conv}(X_d)]]$. If the second condition is true for some $\rho$ then it is also true for all $\rho' \in (0, \rho)$, and therefore we can assume that $\rho < \mu[\operatorname{conv}(X_d)]$. By Lemma 3.6 we have $\mu[\operatorname{lev}_{m_k} g_k] \to 0$, and thus there exists $a_k$ such that $\mu[\operatorname{lev}_{a_k} g_k] = \rho$ for all $k$ large enough. We define $A_k = \operatorname{lev}_{a_k} g_k$. By Lemma 3.6 we have $h(a_k) \le 1/\rho$, and therefore the sequence $\{a_k\}$ is bounded below by some $a > y_\infty$.

Consider $t_k > m_k$ such that $t_k \to y_\infty$; we will specify the exact form of $t_k$ later in the proof. Since the $a_k$ are bounded away from $y_\infty$, it follows that for $k$ large enough we will have $t_k < a_k$. Using Lemma 4.3 we obtain:
\[ \rho = \mu[A_k] \le \left( \frac{a_k - m_k}{t_k - m_k} \right)^d \mu[\operatorname{lev}_{t_k} g_k] \le \frac{1}{h(t_k)} \left( \frac{a_k - m_k}{t_k - m_k} \right)^d, \]
which implies $a_k \ge m_k + (t_k - m_k) [\rho \, h(t_k)]^{1/d}$. We have
\[ g_k \ge m_k \, 1\{A_k\} + a_k (1 - 1\{A_k\}), \]
and hence:
\begin{align*}
L_{n_k} g_k &\le P_{n_k}(A_k) \log h(m_k) + (1 - P_{n_k}(A_k)) \log h(a_k) \\
&\le P_{n_k}(A_k) \log h(m_k) + (1 - P_{n_k}(A_k)) \log h(m_k + (t_k - m_k) [\rho \, h(t_k)]^{1/d}).
\end{align*}


Case $y_\infty = -\infty$. Choose $t_k = (1 - \delta) m_k$ where $\delta \in (0, 1)$. Then starting from some $k$ we have $m_k < t_k$, $h(m_k) > 1$, $h(-C m_k) < 1$ and $\delta [\rho \, h(t_k)]^{1/d} > C + 1$. This implies:
\[ m_k + (t_k - m_k) [\rho \, h(t_k)]^{1/d} = m_k (1 - \delta [\rho \, h(t_k)]^{1/d}) \ge -C m_k, \]
and hence, with $\gamma \equiv (n_d - d)/d$:
\begin{align*}
L_{n_k} g_k &\le P_{n_k}(A_k) \log h(m_k) + (1 - P_{n_k}(A_k)) \log h(-C m_k) \\
&\le \frac{d}{n_d} \log h(m_k) + \frac{n_d - d}{n_d} \log h(-C m_k) = \frac{d}{n_d} \log \left[ h(m_k) h(-C m_k)^\gamma \right] \to -\infty.
\end{align*}
Case $y_\infty > -\infty$. Without loss of generality we can assume that $y_\infty = 0$. Choose $t_k = (1 + \delta) m_k$ where $\delta > 0$. Then:
\[ m_k + (t_k - m_k) [\rho \, h(t_k)]^{1/d} \ge m_k \, \delta [\rho \, h((1 + \delta) m_k)]^{1/d} \asymp m_k^{-\frac{\beta - d}{d}} \to +\infty, \]
which implies
\[ h\left( m_k + (t_k - m_k) [\rho \, h(t_k)]^{1/d} \right) = o\left( m_k^{\frac{\alpha (\beta - d)}{d}} \right). \]
This in turn yields
\[ \exp(L_{n_k} g_k) = o\left( m_k^{-\frac{\beta d}{n_d} + \frac{\alpha (\beta - d)(n_d - d)}{d \, n_d}} \right) = o(1). \]

Therefore in both cases we obtain $L_{n_k} g_k \to -\infty$. This contradiction concludes the proof.

For a decreasing model $\mathcal{P}(h)$, let us denote by $N(h, X, \varepsilon)$, for $\varepsilon > -\infty$, the family of all convex functions $g \in \mathcal{G}(h)$ such that $g$ is a maximal element in $\operatorname{ev}_X^{-1} q$ where $q = \operatorname{ev}_X g$, and $L_n g \ge \varepsilon$. By Lemma 3.10 the family $N(h, X, \varepsilon)$ is not empty for $\varepsilon > -\infty$ small enough. By construction, for $g \in N(h, X, \varepsilon)$ we have $g(X_i) < y_0$ for $X_i \in X$.

Lemma 3.18. For given observations $X = (X_1, \ldots, X_n)$ such that $n \ge n_d$, there exist constants $m > y_\infty$ and $M$ which depend only on the observations $X$ and $\varepsilon$, such that for any $g \in N(h, X, \varepsilon)$ we have $m \le g(x) \le M$ on $\operatorname{conv}(X)$.

Proof. Since by assumption the points $X$ are in general position, there exists $\rho > 0$ such that for any $d$-dimensional simplex $S$ with vertices from


$X$ we have $\mu[S] \ge \rho$. Then any convex set $C \subseteq \operatorname{conv}(X)$ such that $\mu[C] = \rho$ cannot contain more than $d$ points from $X$. Therefore, we have $P_n[C] \le d/n \le d/n_d$. An arbitrary sequence of functions $\{g_k\}$ from $N(h, X, \varepsilon)$ satisfies the conditions of Lemma 3.17 with $n_k \equiv n$ and the same $\varepsilon$ and $\rho$ constructed above. Therefore the sequence $\{g_k\}$ is bounded below by some constant greater than $y_\infty$. Thus the family of functions $N(h, X, \varepsilon)$ is uniformly bounded below by some $m > y_\infty$.

Consider any $g \in N(h, X, \varepsilon)$. Let $M_g$ be the supremum of $g$ on $\operatorname{dom} g$. By Theorem 32.2, Rockafellar [1970], the supremum is attained at some $X_M \in X$, and therefore $M_g < y_0$. Let $m_g$ be the minimum of $g$ on $X$. We have $h(m_g)^{n-1} h(M_g) \ge e^{n\varepsilon}$, so that
\[ h(M_g) \ge \frac{e^{n\varepsilon}}{h(m_g)^{n-1}} \ge \frac{e^{n\varepsilon}}{h(m)^{n-1}}. \]
Thus we obtain an upper bound $M$ which depends only on $m$, $X$ and $\varepsilon$.

Now we are ready for the proof of Theorem 2.14.

Proof. (Theorem 2.14) By Lemma 3.10 there exists $\varepsilon$ small enough such that the family $N(h, X, \varepsilon)$ is not empty. Clearly, we can restrict MLE candidates to the functions in the family $N(h, X, \varepsilon)$. The set $N = \operatorname{ev}_X N(h, X, \varepsilon)$ is bounded by Lemma 3.18. Let us denote by $q^*$ the point in the closure $\bar N$ of $N$ which maximizes the continuous function:
\[ L_n(q) = \frac{1}{n} \sum_{i=1}^n \log h(q_i). \]
Since $q^* \in \bar N$, there exists a sequence of functions $g_k \in N(h, X, \varepsilon)$ such that $\operatorname{ev}_X g_k$ converges to $q^*$. By Lemma 3.18 the functions $f_k = \sup_{l \ge k} g_l$ are finite convex functions on $\operatorname{conv}(X)$, and the sequence $\{f_k(x)\}$ is monotone decreasing for each $x \in \operatorname{conv}(X)$ and bounded below. Therefore $f_k \downarrow g^*$ for some convex function $g^*$, and by construction $\operatorname{ev}_X g^* = q^*$. We have:
\[ \int_{\mathbb{R}^d} h \circ f_k \, dx \le \int_{\mathbb{R}^d} h \circ g_k \, dx = 1, \]
and thus by Fatou's lemma:
\[ \int_{\mathbb{R}^d} h \circ g^* \, dx \le 1. \]


By Lemma 3.14 there exists $g \in \mathcal{G}(h)$ such that $g \le g^*$ and $L_n g \ge L_n g^* = L_n(q^*)$. By assumption this implies $L_n g = L_n g^*$. Thus the function $g$ is the MLE. Finally, we have to add the almost surely clause since we assumed that the points $X_i$ are in general position.

3.4. Proofs for Consistency Results. We begin with proofs of some technical results which we will use in the consistency arguments for both increasing and decreasing models. The main argument for proving Hellinger consistency proceeds along the lines of the proof given in the case of $d = 1$ by Pal, Woodroofe and Meyer [2007].

Lemma 3.19. Consider a monotone model $\mathcal{P}(h)$. Suppose the true density $h \circ g_0$ and the sequence of MLEs $\{\hat g_n\}$ have the following properties:
\[ \int (h \log h) \circ g_0(x) \, dx < \infty \]
and
\[ \int \log[\varepsilon + h \circ \hat g_n(x)] \, d(P_n(x) - P_0(x)) \to_{a.s.} 0 \]

for $\varepsilon > 0$ small enough. Then the sequence of MLEs is Hellinger consistent: $H(h \circ \hat g_n, h \circ g_0) \to_{a.s.} 0$.

Proof. For $\varepsilon \in (0, 1)$ we have:
\[ 0 \ge \int_{\{h \circ g_0(x) \le 1 - \varepsilon\}} \log(\varepsilon + h \circ g_0) \, dP_0 \ge \log(\varepsilon) \, P_0\{h \circ g_0(x) \le 1 - \varepsilon\} > -\infty, \]
\[ 0 \le \int_{\{h \circ g_0(x) \ge 1\}} \log(\varepsilon + h \circ g_0) \, dP_0 \le \int_{\{h \circ g_0(x) \ge 1\}} \log(2 h \circ g_0) \, dP_0 \le \int (h \log h) \circ g_0(x) \, dx + \log 2 < \infty. \]
Thus the function $\log(\varepsilon + h \circ g_0)$ is integrable with respect to the probability measure $P_0$.


We can rearrange:
\begin{align}
0 \le L_n \hat g_n - L_n g_0
&= \int \log[h \circ \hat g_n] \, dP_n - \int \log[h \circ g_0] \, dP_n \nonumber \\
&\le \int \log[\varepsilon + h \circ \hat g_n] \, dP_n - \int \log[h \circ g_0] \, dP_n \nonumber \\
&= \int \log[\varepsilon + h \circ \hat g_n] \, d(P_n - P_0) \tag{3.12} \\
&\quad + \int \log\left( \frac{\varepsilon + h \circ \hat g_n}{\varepsilon + h \circ g_0} \right) dP_0 \tag{3.13} \\
&\quad + \int \log[\varepsilon + h \circ g_0] \, dP_0 - \int \log[h \circ g_0] \, dP_n. \tag{3.14}
\end{align}
The term (3.12) converges almost surely to zero by assumption. For the term (3.13) we can apply the analogue of Lemma 1 from Pal et al. [2007]:
\[ II \equiv \int \log\left( \frac{\varepsilon + h \circ \hat g_n}{\varepsilon + h \circ g_0} \right) dP_0 \le 2 \int \sqrt{\frac{\varepsilon}{\varepsilon + h \circ g_0}} \, dP_0 - 2 H^2(h \circ \hat g_n, h \circ g_0). \]
For the term (3.14), the SLLN implies that:
\[ III = \int \log[\varepsilon + h \circ g_0] \, dP_0 - \int \log[h \circ g_0] \, dP_n \to_{a.s.} \int \log[\varepsilon + h \circ g_0] \, dP_0 - \int \log[h \circ g_0] \, dP_0 = \int \log\left( \frac{\varepsilon + h \circ g_0}{h \circ g_0} \right) dP_0. \]
Thus we have:
\[ 0 \le \liminf(I + II + III) \le_{a.s.} -\limsup 2 H^2(h \circ \hat g_n, h \circ g_0) + 2 \int \sqrt{\frac{\varepsilon}{\varepsilon + h \circ g_0}} \, dP_0 + \int \log\left( \frac{\varepsilon + h \circ g_0}{h \circ g_0} \right) dP_0. \]
This yields
\[ \limsup H^2(h \circ \hat g_n, h \circ g_0) \le_{a.s.} \int \sqrt{\frac{1}{1 + h \circ g_0/\varepsilon}} \, dP_0 + \frac{1}{2} \int \log\left( \frac{\varepsilon + h \circ g_0}{h \circ g_0} \right) dP_0 \to 0 \]

as $\varepsilon \downarrow 0$ by monotone convergence.

The next lemma allows us to obtain pointwise consistency once Hellinger consistency has been proved.

Lemma 3.20. Suppose that for a monotone model $\mathcal{P}(h)$ a sequence of MLEs $\hat g_n$ is Hellinger consistent. Then the sequence $\hat g_n$ is pointwise consistent. In other words, $\hat g_n(x) \to_{a.s.} g_0(x)$ for $x \in \operatorname{ri}(\operatorname{dom} g_0)$, and the convergence is uniform on compacta.


Proof. Let us denote by $L^0_a$ and $L^n_a$ the following sublevel sets: $L^0_a = \operatorname{lev}_a g_0$ and $L^n_a = \operatorname{lev}_a \hat g_n$. Consider $\Omega_0$ such that $\Pr[\Omega_0] = 1$ and $H^2(h \circ \hat g^\omega_n, h \circ g_0) \to 0$ for $\omega \in \Omega_0$, where $\hat g^\omega_n$ is the MLE for $\omega \in \Omega_0$. For all $\omega \in \Omega_0$ we have:
\[ \int [\sqrt{h \circ g_0} - \sqrt{h \circ \hat g_n}]^2 \, dx \ge \int_{L^0_a \setminus L^n_{a+\varepsilon}} [\sqrt{h \circ g_0} - \sqrt{h \circ \hat g_n}]^2 \, dx \ge (\sqrt{h(a)} - \sqrt{h(a+\varepsilon)})^2 \, \mu(L^0_a \setminus L^n_{a+\varepsilon}) \to 0, \]
and by Lemma 4.2 we have $\liminf \operatorname{ri}(L^0_a \cap L^n_{a+\varepsilon}) = \operatorname{ri}(L^0_a)$. Therefore $\limsup \hat g_n(x) < a + \varepsilon$ for $x \in \operatorname{ri}(L^0_a)$. Since $a$ and $\varepsilon$ are arbitrary, we have $\limsup \hat g_n \le g_0$ on $\operatorname{ri}(\operatorname{dom} g_0)$. On the other hand, we have:
\[ \int [\sqrt{h \circ g_0} - \sqrt{h \circ \hat g_n}]^2 \, dx \ge \int_{L^n_{a-\varepsilon} \setminus L^0_a} [\sqrt{h \circ g_0} - \sqrt{h \circ \hat g_n}]^2 \, dx \ge (\sqrt{h(a-\varepsilon)} - \sqrt{h(a)})^2 \, \mu(L^n_{a-\varepsilon} \setminus L^0_a) \to 0, \]
and by Lemma 4.2 we have $\limsup \operatorname{cl}(L^n_{a-\varepsilon} \cup L^0_a) = \operatorname{cl}(L^0_a)$. Therefore $\liminf \hat g_n(x) > a - \varepsilon$ for $x$ such that $g_0(x) \ge a$. Since $a$ and $\varepsilon$ are arbitrary, we have $\liminf \hat g_n \ge g_0$ on $\operatorname{dom} g_0$. Thus $\hat g_n \to g_0$ almost surely on $\operatorname{ri}(\operatorname{dom} g_0)$. By Theorem 10.8, Rockafellar [1970], the convergence is uniform on compacta $K \subset \operatorname{ri}(\operatorname{dom} g_0)$.

We need a general property of the bracketing entropy numbers.

Lemma 3.21. Let $\mathcal{A}$ be a class of sets in $\mathbb{R}^d$ such that the class $\mathcal{A} \cap [-a, a]^d$ has finite bracketing entropy with respect to Lebesgue measure $\lambda$ for all $a$ large enough:
\[ \log N_{[\,]}(\varepsilon, \mathcal{A} \cap [-a, a]^d, L_1(\lambda)) < +\infty \quad \text{for every } \varepsilon > 0. \]
Then for any Lebesgue absolutely continuous probability measure $P$ with bounded density, $\mathcal{A}$ is a Glivenko-Cantelli class:
\[ \|P_n - P\|_{\mathcal{A}} \to_{a.s.} 0. \]


Proof. Let $C$ be an upper bound for the density of $P$, and let $a$ be large enough that for the set $D \equiv [-a, a]^d$ we have $P(D) > 1 - \varepsilon/2$. By assumption the class $\mathcal{A} \cap D$ has a finite set of $\varepsilon/(2C)$-brackets $\{[L_i, U_i]\}$ in $L_1(\lambda)$. Then for any set $A \in \mathcal{A}$ there exists an index $i$ such that $L_i \subseteq A \cap D \subseteq U_i$. Therefore $L_i \subseteq A \subseteq U_i \cup D^c$ and:
\[ \|1\{U_i \cup D^c\} - 1\{L_i\}\|_{L_1(P)} \le \|1\{U_i\} - 1\{L_i\}\|_{L_1(P)} + \|1\{D^c\}\|_{L_1(P)} \le C \, \|1\{U_i\} - 1\{L_i\}\|_{L_1(\lambda)} + P(D^c) \le \varepsilon. \]
Thus the set $\{[L_i, U_i \cup D^c]\}$ is a set of $\varepsilon$-brackets for our class $\mathcal{A}$ in $L_1(P)$. This implies that $\mathcal{A}$ is a Glivenko-Cantelli class, and the statement follows from Theorem 2.4.1 of van der Vaart and Wellner [1996].

To prove consistency for increasing models we begin with a general property of lower layer sets (see Dudley [1999], Chapter 8.3).

Lemma 3.22. Let $\mathcal{LL}$ be the class of closed lower layer sets in $\mathbb{R}^d_+$ and let $P$ be a Lebesgue absolutely continuous probability measure with bounded density. Then:
\[ \|P_n - P\|_{\mathcal{LL}} \to_{a.s.} 0. \]

Proof. By Theorem 8.3.2, Dudley [1999], we have $\log N_{[\,]}(\varepsilon, \mathcal{LL} \cap [0, 1]^d, L_1(\lambda)) < +\infty$. Since the class $\mathcal{LL}$ is invariant under rescaling, the result follows from Lemma 3.21.

Note that Lemma 3.1 implies that if $h \circ g$ belongs to an increasing model $\mathcal{P}(h)$, then $(\operatorname{lev}_y g)^c$ is a lower layer set and has Lebesgue measure less than or equal to $1/h(y)$. Let us denote by $A_\delta$ the set $\{x \in \mathbb{R}^d_+ : |x| \le \delta\}$. Then by Lemma 3.1, part 3, we have:
\[ (3.15) \qquad (\operatorname{lev}_y g)^c \subset A_{c/h(y)}, \]
for $c \equiv d!/d^d$.
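The inclusion (3.15) can be probed numerically in the exponential example (assumed: $h(y) = e^y$, $g(x) = -(x_1 + x_2)$, $d = 2$, so $c = d!/d^d = 1/2$): every point of $(\operatorname{lev}_y g)^c = \{x_1 + x_2 < -y\}$ should satisfy $x_1 x_2 \le c/h(y)$:

```python
import math
import random

# (3.15) for the assumed example h(y) = e^y, g(x) = -(x1 + x2), d = 2:
# points of R^2_+ with x1 + x2 < -y must satisfy x1*x2 <= (1/2) * e^{-y}.
random.seed(2)
c = math.factorial(2) / 2 ** 2  # = 1/2
for _ in range(1000):
    y = random.uniform(-5.0, -0.1)
    s = random.uniform(0.0, -y)   # sample a point with x1 + x2 = s < -y
    x1 = random.uniform(0.0, s)
    x2 = s - x1
    assert x1 * x2 <= c * math.exp(-y) + 1e-12
```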


Proof. (Theorem 2.16) By Lemma 3.19 it is enough to show that:

∫ log[ε + h ∘ ĝn(x)] d(Pn(x) − P0(x)) →a.s. 0.

Indeed, applying Lemma 3.2 to the increasing transformation log[ε + h(y)] − log ε we obtain:

∫ log[ε + h ∘ ĝn(x)] d(Pn(x) − P0(x)) = ∫_{−∞}^{+∞} [h′(z)/(ε + h(z))] (Pn − P0)((lev_z ĝn)^c) dz
  ≤ ‖Pn − P0‖_LL ∫_{−∞}^{M} [h′(z)/(ε + h(z))] dz + ∫_{M}^{+∞} [h′(z)/(ε + h(z))] |Pn − P0|((lev_z ĝn)^c) dz
  ≤ ‖Pn − P0‖_LL log[(ε + h(M))/ε] + ∫_{M}^{+∞} [h′(z)/(ε + h(z))] (Pn + P0)((lev_z ĝn)^c) dz.

The first term converges to zero almost surely by Lemma 3.22. For the second term we will use the inclusion (3.15):

∫_{M}^{+∞} [h′(z)/(ε + h(z))] (Pn + P0)((lev_z ĝn)^c) dz ≤ ∫_{M}^{+∞} [h′(z)/(ε + h(z))] (Pn + P0)(A_{c/h(z)}) dz.

Now, we can apply Lemma 3.2 again to g_A(x) = h^{−1}(c/|x|). We have (lev_z g_A)^c = A_{c/h(z)} and therefore:

∫_{M}^{+∞} [h′(z)/(ε + h(z))] (Pn + P0)(A_{c/h(z)}) dz = ∫_{A_{c/h(M)}} log(ε + c/|x|) d(Pn + P0)
  ≤ ∫_{A_{c/h(M)}} log(2c/|x|) d(Pn + P0),

for M large enough. Assumption I.5 and the SLLN imply that:

∫_{A_{c/h(M)}} log(2c/|x|) d(Pn + P0) →a.s. 2 ∫_{A_{c/h(M)}} log(2c/|x|) dP0.

Since M is arbitrary and A_{c/h(M)} ↓ {0} as M → +∞, the result follows.

By Lemma 3.1 we have ri(R_+^d) ⊆ dom g0. Thus Theorem 2.16 and Lemma 3.20 imply Theorem 2.17.

Theorem 3.23. For an increasing model P(h) and a true density h ∘ g0 which satisfies assumptions I.4, I.5 and I.6, the sequence of MLEs ĝn is pointwise consistent. That is, ĝn(x) →a.s. g0(x) for x ∈ ri(R_+^d), and the convergence is uniform on compacta.


Finally, we prove consistency for decreasing models. We need a general property of convex sets.

Lemma 3.24. Let A be the class of closed convex sets A in R^d and let P be a Lebesgue absolutely continuous probability measure with bounded density. Then:

‖Pn − P‖_A →a.s. 0.

Proof. Let D be a convex compact set. By Theorem 8.4.2 of Dudley [1999] the class A ∩ D has a finite set of ε-brackets. Since the class A is invariant under rescaling, the result follows from Lemma 3.21.

Lemma 3.25. For a decreasing model P(h) the sequence of MLEs ĝn is almost surely uniformly bounded below.

Proof. We will apply Lemma 3.17 to the sequences ĝn and {n}. By the SLLN and Lemma 3.9 we have:

Ln ĝn ≥ Ln g0 →a.s. ∫ [h log h] ∘ g0 dx > −∞.

Therefore the sequence {Ln ĝn} is bounded away from −∞, and the first condition of Lemma 3.17 holds. Choose some a ∈ (0, d/n^d). Then for any set S such that µ[S] = ρ ≡ a/h(min g0), where min g0 is attained by Lemma 3.6, we have:

P[S] = ∫_S h ∘ g0 dx ≤ µ[S] h(min g0) = a < d/n^d.

Now, let A_n = lev_{a_n} ĝn be sets such that µ[A_n] = ρ. Then, by Lemma 3.24 we have:

|Pn[A_n] − P[A_n]| ≤ ‖Pn − P‖_A →a.s. 0,

which implies that Pn[A_n] < d/n^d almost surely for n large enough. Therefore, the second condition of Lemma 3.17 also holds, and the lemma applies to the sequence ĝn almost surely.

Proof. (Theorem 2.18) By Lemma 3.9 and Lemma 3.19 it is enough to show that:

∫ log[ε + h ∘ ĝn(x)] d(Pn(x) − P0(x)) →a.s. 0.

By Lemma 3.25 we have inf ĝn ≥ A for some A > h_∞. Therefore, by Lemma 3.7 applied to the decreasing transformation log[ε + h(y)] − log ε


it follows that:

∫ log[ε + h ∘ ĝn(x)] d(Pn(x) − P0(x)) = ∫_{A}^{+∞} [−h′(z)/(ε + h(z))] (Pn − P0)(lev_z ĝn) dz
  ≤ ‖Pn − P0‖_A ∫_{A}^{+∞} [−h′(z)/(ε + h(z))] dz
  = ‖Pn − P0‖_A log[(ε + h(A))/ε] →a.s. 0,

where the last limit follows from Lemma 3.24.

Proof. (Theorem 2.19) By Lemma 3.20, ĝn → g0 almost surely on ri(dom g0). The functions g0 and g0^* differ only on the boundary ∂ dom g0, which has Lebesgue measure zero by Lemma 4.1. Since the observations satisfy X_i ∈ ri(dom g0) almost surely, we have ĝn = +∞ on ∂ dom g0 and thus ĝn → g0^*.

Now, we assume that dom g0 = R^d. By Lemma 3.6 the function g0 has bounded sublevel sets, and therefore there exists x0 where g0 attains its minimum m. Since h ∘ g0 is a density we have h(m) > 0, and by Lemma 3.8 we have h(m) < ∞. Fix ε > 0 such that h(m) > 3ε and consider a such that h(a) < ε. The set A = lev_a g0 is bounded and by continuity g0 = a on ∂A. Choose δ > 0 such that h(a − δ) < 2ε < h(m + δ) and:

sup_{x ∈ [m, a+δ]} |h(x) − h(x − δ)| ≤ ε.

The closure Ā is compact, and thus for n large enough we have with probability one:

sup_Ā |ĝn − g0| < δ,

which implies:

sup_Ā |h ∘ ĝn − h ∘ g0| < ε,

since the range of values of g0 on Ā is [m, a]. The set ∂A is compact and therefore ĝn attains its minimum m_n on this set at some point x_n. By construction:

m_n = ĝn(x_n) > g0(x_n) − δ = a − δ > m + δ = g0(x0) + δ > ĝn(x0).

We have x0 ∈ A ∩ lev_{a−δ} ĝn and ĝn ≥ m_n > a − δ on ∂A. Thus, by convexity we have lev_{a−δ} ĝn ⊂ A, and for x ∉ Ā we have:

|h ∘ ĝn(x) − h ∘ g0(x)| ≤ h ∘ ĝn(x) + h ∘ g0(x) < h(a − δ) + h(a) < 3ε.


This shows that for any ε > 0 small enough we will have:

‖h ∘ ĝn − h ∘ g0‖_∞ < 3ε

with probability one as n → ∞. This concludes the proof.

3.5. Proofs for Lower Bound Results. We will use the following lemma for computing the Hellinger distance between a function and its local deformation:

Lemma 3.26. Let {g_ε} be a local deformation of the function g : R^d → R at the point x0, such that g is continuous at x0, and let the function h : R → R be continuously differentiable at the point g(x0). Then for any r > 0:

(3.16)    lim_{ε→0} ∫_{R^d} |g_ε(x) − g(x)|^r dx = 0,

(3.17)    lim_{ε→0} [∫_{R^d} |h ∘ g_ε(x) − h ∘ g(x)|^r dx] / [∫_{R^d} |g_ε(x) − g(x)|^r dx] = |h′ ∘ g(x0)|^r.
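A quick numerical sanity check of the ratio limit (3.17), under illustrative assumptions of our own (not taken from the paper): d = 1, g(x) = x², h(y) = e^{−y}, x0 = 0, r = 2, and the hypothetical local deformation g_ε = g + ε(1 − |x|/ε)_+ supported on B(0; ε). Here |h′(g(0))|² = 1.

```python
import math

def ratio(eps, n=20001):
    # ratio of the two integrals in (3.17) on a grid over the support B(0; eps)
    g = lambda x: x * x
    h = lambda y: math.exp(-y)
    bump = lambda x: eps * max(0.0, 1.0 - abs(x) / eps)  # hypothetical deformation
    xs = [-eps + 2 * eps * i / (n - 1) for i in range(n)]
    dx = 2 * eps / (n - 1)
    num = sum((h(g(x) + bump(x)) - h(g(x))) ** 2 for x in xs) * dx
    den = sum(bump(x) ** 2 for x in xs) * dx
    return num / den

r = ratio(1e-3)
print(r)  # close to |h'(g(0))|^2 = 1
```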

Proof. Since {g_ε} is a local deformation, for ε > 0 small enough we have:

∫_{R^d} |h ∘ g_ε(x) − h ∘ g(x)|^r dx = ∫_{B(x0; r_ε)} |h ∘ g_ε(x) − h ∘ g(x)|^r dx,
∫_{R^d} |g_ε(x) − g(x)|^r dx = ∫_{B(x0; r_ε)} |g_ε(x) − g(x)|^r dx.

Then:

∫_{B(x0; r_ε)} |g_ε − g|^r dx ≤ ess sup |g_ε − g|^r µ[B(x0; r_ε)]

implies (3.16). Let us define a sequence {a_ε}:

a_ε ≡ ess sup |g_ε − g| + sup_{x ∈ B(x0; r_ε)} |g(x) − g(x0)|.

For x ∈ B(x0; r_ε) and y ∈ [g_ε(x), g(x)] we have a.e.:

|y − g(x0)| ≤ |g_ε(x) − g(x)| + |g(x) − g(x0)| ≤ a_ε.

Using the mean value theorem we obtain:

∫_{R^d} |h ∘ g_ε(x) − h ∘ g(x)|^r dx = ∫_{R^d} |h′(y_x)|^r |g_ε(x) − g(x)|^r dx,

and therefore:

inf_{y ∈ B(g(x0); a_ε)} |h′(y)|^r ≤ [∫_{R^d} |h ∘ g_ε(x) − h ∘ g(x)|^r dx] / [∫_{R^d} |g_ε(x) − g(x)|^r dx] ≤ sup_{y ∈ B(g(x0); a_ε)} |h′(y)|^r.


Since h′ is continuous at g(x0), to prove (3.17) it is enough to show that a_ε → 0. By assumption we have:

lim_{ε→0} ess sup |g_ε − g| = 0.

Since g is continuous at x0 and r_ε → 0 we have:

lim_{ε→0} sup_{x ∈ B(x0; r_ε)} |g(x) − g(x0)| = 0.

Thus a_ε → 0, which proves (3.17).

In order to apply Corollary 2.22 we need to construct deformations that still belong to the class G. The following lemma provides a technique for constructing such deformations.

Lemma 3.27. Let {g_ε} be a local deformation of the function g : R^d → R at the point x0, such that g is continuous at x0, and let the function h : R → R be continuously differentiable at the point g(x0) with h′ ∘ g(x0) ≠ 0. Then for any fixed δ > 0 small enough, the deformation g_{θ,δ} = θ g_δ + (1 − θ) g and any r > 0 we have:

(3.18)    lim sup_{θ→0} θ^{−r} ∫_{R^d} |h ∘ g_{θ,δ}(x) − h ∘ g(x)|^r dx < ∞,

(3.19)    lim inf_{θ→0} θ^{−r} ∫_{R^d} |h ∘ g_{θ,δ}(x) − h ∘ g(x)|^r dx > 0.

Note that g_{θ,δ} is not a local deformation.

Proof. The statement follows from the argument for Lemma 3.26. For a fixed θ the family {g_{θ,ε}} is a local deformation. Thus for a_{θ,ε} defined by:

a_{θ,ε} ≡ ess sup |g_{θ,ε} − g| + sup_{x ∈ B(x0; r_ε)} |g(x) − g(x0)|,

it follows that:

[∫_{R^d} |h ∘ g_{θ,ε}(x) − h ∘ g(x)|^r dx] / [∫_{R^d} |g_{θ,ε}(x) − g(x)|^r dx] ≤ sup_{y ∈ B(g(x0); a_{θ,ε})} |h′(y)|^r,

[∫_{R^d} |h ∘ g_{θ,ε}(x) − h ∘ g(x)|^r dx] / [∫_{R^d} |g_{θ,ε}(x) − g(x)|^r dx] ≥ inf_{y ∈ B(g(x0); a_{θ,ε})} |h′(y)|^r.

For |θ| < 1 we have:

|g_{θ,ε} − g| = |θ| |g_ε − g|


and therefore a_{θ,δ} ≤ a_δ. Since a_ε → 0 and h is continuously differentiable, for all δ > 0 small enough we have:

sup_{y ∈ B(g(x0); a_{θ,δ})} |h′(y)|^r ≤ sup_{y ∈ B(g(x0); a_δ)} |h′(y)|^r < ∞,

inf_{y ∈ B(g(x0); a_{θ,δ})} |h′(y)|^r ≥ inf_{y ∈ B(g(x0); a_δ)} |h′(y)|^r > 0.

Thus for all θ we obtain:

θ^{−r} ∫_{R^d} |h ∘ g_{θ,δ}(x) − h ∘ g(x)|^r dx ≤ sup_{y ∈ B(g(x0); a_δ)} |h′(y)|^r ∫_{R^d} |g_δ(x) − g(x)|^r dx < ∞,

θ^{−r} ∫_{R^d} |h ∘ g_{θ,δ}(x) − h ∘ g(x)|^r dx ≥ inf_{y ∈ B(g(x0); a_δ)} |h′(y)|^r ∫_{R^d} |g_δ(x) − g(x)|^r dx > 0,

which proves the lemma.

Proof. (Theorem 2.24) Our statement is nontrivial only if the curvature curv_{x0} g > 0, or equivalently there exists a positive definite d × d matrix G such that the function g is locally G-strongly convex. Then this means that there exists a convex function q such that in some neighborhood O(x0) of x0 we have:

(3.20)    g(x) = ½ (x − x0)^T G(x − x0) + q(x).

The plan of the proof is the following: we introduce families of functions {D_ε(g; x0, v)} and {D_ε^*(g; x0)} and prove that these families are local deformations. Using these deformations as building blocks we construct two types of deformations, {h ∘ g_ε^+} and {h ∘ g_ε^−}, of the density h ∘ g which belong to P(h). These deformations represent positive and negative changes in the value of the function g at the point x0. After that we approximate the Hellinger distances using Lemma 3.26. Finally, applying Corollary 2.22 we obtain lower bounds which depend on G. We finish the proof by taking the supremum of the obtained lower bounds over all G ∈ SC(g; x0). Under the mild assumption of strong convexity of the function g, both deformations give the same rate and the same structure of the constant C(d). However, it is possible to obtain a larger constant C(d) for the negative deformation if we assume that g is twice differentiable. Note that by the definition of P(h) the function g is a closed proper convex function.


Fig 1. Example of the deformation D_ε(g; x0, v0).

Let us define a function D_ε(g; x0, v0) for a given ε > 0, x0 ∈ dom g and v0 ∈ ∂g(x0) as:

D_ε(g; x0, v0)(x) = max(g(x), l0(x) + ε),

where l0(x) = ⟨v0, x − x0⟩ + g(x0) is a support plane to g at x0. Since l0 + ε is a support plane to g + ε we have:

g ≤ D_ε(g; x0, v0) ≤ g + ε,

and thus dom D_ε(g; x0, v0) = dom g. As a maximum of two closed convex functions, D_ε(g; x0, v0) is a closed convex function. For a given x1 we have D_ε(g; x0, v0)(x1) = g(x1) if and only if:

(3.21)    g(x1) − ε ≥ ⟨v0, x1 − x0⟩ + g(x0).

We also define a function D_ε^*(g; x0) for a given ε > 0 and x0 ∈ dom g as the maximal convex minorant (Appendix 4.1) of the function g̃_ε defined by:

g̃_ε(x) = g(x) for x ≠ x0,    g̃_ε(x0) = g(x0) − ε.

Both functions D_ε(g; x0, v0) and D_ε^*(g; x0) are convex by construction and, as the next lemma shows, have similar properties. However, the argument for D_ε^*(g; x0) is more complicated.

Lemma 3.28. Let g be a closed proper convex function, g^* its convex conjugate, and x0 ∈ ri(dom g). Then:


Fig 2. Example of the deformation D_ε^*(g; x0).

1. D_ε^*(g; x0) is a closed proper convex function such that g − ε ≤ D_ε^*(g; x0) ≤ g and dom D_ε^*(g; x0) = dom g.
2. For a given x1 ∈ ri(dom g) we have D_ε^*(g; x0)(x1) = g(x1) if and only if there exists v ∈ ∂g(x1) such that:

(3.22)    g(x1) + ε ≤ ⟨v, x1 − x0⟩ + g(x0).

3. If y0 ∈ ∂g(x0), then x0 ∈ ∂g^*(y0) and:

D_ε(g; x0, y0) = (D_ε^*(g^*; y0))^*.

Proof. Obviously, g̃_ε ≥ g − ε. Since g − ε is a closed proper convex function, it is equal to the supremum of all linear functions l such that l ≤ g − ε. Thus g − ε ≤ D_ε^*(g; x0), which implies that D_ε^*(g; x0) is a proper convex function and dom D_ε^*(g; x0) ⊆ dom(g − ε) = dom g. By Lemma 4.10 we have D_ε^*(g; x0) ≤ g, therefore dom g ⊆ dom D_ε^*(g; x0), which proves 1.

If v ∈ ∂g(x1), then l_v(x) = ⟨v, x − x1⟩ + g(x1) is a support plane to g and l_v ≤ g. If inequality (3.22) is true, then l_v is majorized by g̃_ε and we have:

D_ε^*(g; x0)(x1) ≤ g(x1) = l_v(x1) ≤ D_ε^*(g; x0)(x1).

On the other hand, by 1 we have x1 ∈ ri(dom D_ε^*(g; x0)), hence there exists v ∈ ∂D_ε^*(g; x0)(x1), and if D_ε^*(g; x0)(x1) = g(x1), then for all x:

g(x) ≥ g̃_ε(x) ≥ D_ε^*(g; x0)(x) ≥ ⟨v, x − x1⟩ + D_ε^*(g; x0)(x1) = ⟨v, x − x1⟩ + g(x1).


Therefore v ∈ ∂g(x1). In particular:

g̃_ε(x0) = g(x0) − ε ≥ D_ε^*(g; x0)(x0) ≥ ⟨v, x0 − x1⟩ + D_ε^*(g; x0)(x1) = ⟨v, x0 − x1⟩ + g(x1),

which proves 2.

For part 3, we can represent D_ε^*(g^*; y0) as the maximal convex minorant of the function min(g^*, g^*(y0) − ε + δ(· | y0)). Since y0 ∈ ∂g(x0), by Lemma 4.10 we have x0 ∈ ∂g^*(y0) and g^*(y0) + g(x0) = ⟨y0, x0⟩. Thus:

(g^*(y0) − ε + δ(· | y0))^*(x) = ⟨y0, x⟩ − g^*(y0) + ε = ⟨y0, x − x0⟩ + g(x0) + ε = l0(x) + ε.

By Lemma 4.7 we have:

(D_ε^*(g^*; y0))^* = max(g^{**}, l0 + ε) = max(g, l0 + ε) = D_ε(g; x0, y0),

which concludes the proof of the lemma.

Since the domain of the quadratic part of the decomposition (3.20) is R^d, by Lemma 4.11 we have that for any x ∈ dom g and v ∈ ∂g(x) there exists w ∈ ∂q(x) such that:

(3.23)    v = G(x − x0) + w.

Therefore, for a point x1 in the neighborhood O(x0) where the decomposition (3.20) holds, condition (3.21) is equivalent to:

½ (x1 − x0)^T G(x1 − x0) + q(x1) − ε ≥ ⟨w0, x1 − x0⟩ + q(x0),

where w0 corresponds to v0 in the decomposition (3.23). Since ⟨w0, x − x0⟩ + q(x0) is a support plane to q(x), the inequality (3.21) is satisfied if:

½ (x1 − x0)^T G(x1 − x0) ≥ ε,

i.e. on the complement of the open ellipsoid B_G(x0, √(2ε)) defined by G with center at x0. For ε small enough this ellipsoid belongs to the neighborhood O(x0). Since |D_ε(g; x0, y0) − g| ≤ ε, this proves that the family D_ε(g; x0, y0) is a local deformation.
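To make the construction concrete, here is a small numerical sketch of our own (not from the paper) with d = 1, g(x) = x², so that G = 2, q ≡ 0, x0 = 0 and v0 = 0: the deformation D_ε(g; 0, 0)(x) = max(x², ε) should satisfy g ≤ D_ε ≤ g + ε and agree with g outside B_G(0, √(2ε)) = {x : 2x² < 2ε}, i.e. for |x| ≥ √ε.

```python
import math

eps = 0.01
g = lambda x: x * x
# the support plane to g at x0 = 0 is l0(x) = 0, so:
D = lambda x: max(g(x), eps)          # D_eps(g; 0, 0)

xs = [i / 1000.0 for i in range(-2000, 2001)]
# sandwich property g <= D <= g + eps
assert all(g(x) <= D(x) <= g(x) + eps for x in xs)
# the change is supported inside |x| < sqrt(eps)
assert all(D(x) == g(x) for x in xs if abs(x) >= math.sqrt(eps))
print("deformation checks passed")
```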


In the same way, the condition (3.22) is equivalent to:

½ (x1 − x0)^T G(x1 − x0) + q(x1) + ε ≤ ⟨G(x1 − x0) + w1, x1 − x0⟩ + q(x0),

or:

½ (x1 − x0)^T G(x1 − x0) + q(x0) − ε ≥ ⟨w1, x0 − x1⟩ + q(x1),

which is satisfied if we have:

½ (x1 − x0)^T G(x1 − x0) ≥ ε.

Since |D_ε^*(g; x0) − g| ≤ ε, this proves that the family D_ε^*(g; x0) is also a local deformation. Thus we have proved:

Lemma 3.29. Let g be a closed proper convex function, locally G-strongly convex at some x0 ∈ ri dom g, and let y0 ∈ ∂g(x0). Then the families D_ε(g; x0, y0) and D_ε^*(g; x0) are local deformations for all ε > 0 small enough. Moreover, the condition:

½ (x − x0)^T G(x − x0) ≥ ε

implies D_ε(g; x0, y0)(x) = D_ε^*(g; x0)(x) = g(x); equivalently, supp[D_ε(g; x0, y0) − g] and supp[D_ε^*(g; x0) − g] are subsets of B_G(x0, √(2ε)).

For r > 0 small enough, h′ ∘ g is nonvanishing and the decomposition (3.20) holds on B(x0; r). Let us fix some y0 ∈ ∂g(x0), some x1 ∈ B(x0; r) such that x1 ≠ x0, and some y1 ∈ ∂g(x1). We fix δ such that equation (3.18) of Lemma 3.27 holds for the transformation √h and r = 2, and also x0 ∉ B_G(x1; √(2δ)). Then by Lemma 3.29, for all ε > 0 small enough, the support sets supp[D_ε(g; x0, y0) − g] and supp[D_δ^*(g; x1) − g] do not intersect; i.e. these two deformations do not interfere.

Now, we can prove Theorem 2.24. The argument below is identical for g_ε^+ and g_ε^−, so we will give the proof only for g_ε^+. We define the deformations g_ε^+ and g_ε^− by means of the following lemma:


Lemma 3.30. For all ε > 0 small enough there exist θ_ε^+, θ_ε^− ∈ (0, 1) such that the functions g_ε^+ and g_ε^− defined by:

g_ε^+ = (1 − θ_ε^+) D_ε(g; x0, v0) + θ_ε^+ D_δ^*(g; x1),
g_ε^− = (1 − θ_ε^−) D_ε^*(g; x0) + θ_ε^− D_δ(g; x1, v1)

belong to P(h).

Proof. By dominated convergence, the function F(θ) defined by:

F(θ) = ∫ h ∘ ((1 − θ) D_ε(g; x0, v0) + θ D_δ^*(g; x1)) dx

is continuous. We have:

F(0) = ∫ h ∘ D_ε(g; x0, v0) dx,    F(1) = ∫ h ∘ D_δ^*(g; x1) dx
0 and if k < d we have µ[V] = 0.

3. Part 1 implies that it is enough to consider closed convex sets. Part 2 implies that it is enough to prove that an unbounded closed convex set of dimension d has Lebesgue measure +∞. Let A be such a set, i.e. an unbounded closed convex set. Then A contains a d-dimensional simplex D (Theorem 2.4, Rockafellar [1970]), which has non-zero Lebesgue measure. Since A is unbounded, its recession cone is non-empty (Theorem 8.4, Rockafellar [1970]), and therefore we can choose a direction v such that D + λv ⊂ A for all λ ≥ 0, which implies µ[A] = +∞.

The following lemma shows that convergence of convex sets in measure implies pointwise convergence.

Lemma 4.2. Let A be a convex set in R^d such that dim(A) = d and ri(A) ≠ ∅. Then:

1. If a sequence of convex sets B_n is such that A ⊆ B_n and lim µ[B_n \ A] = 0, then lim sup cl(B_n) = cl(A);
2. If a sequence of convex sets C_n is such that C_n ⊆ A and lim µ[A \ C_n] = 0, then lim inf ri(C_n) = ri(A).

Proof. By Lemma 4.1 we can assume that A, B_n and C_n are closed convex sets.

1. If, on the contrary, there exists a subsequence {k} such that for some x ∈ A^c we have x ∈ ∩_{k≥1} B_k, then for x_A = conv({x} ∪ A) we have:

x_A ⊆ B_k,    µ[B_k \ A] ≥ µ[x_A \ A].


Since A is closed, there exists a ball B(x) such that B(x) ∩ A = ∅. Since ri(A) ≠ ∅, there exists a ball B(x0) such that B(x0) ⊆ A for some x0 ∈ ri(A). Then for x_B = conv({x} ∪ B(x0)) we have:

x_B ⊆ x_A,    µ[x_A \ A] ≥ µ[x_B ∩ B(x)] > 0.

This contradiction implies lim sup cl(B_n) = cl(A).

2. If, on the contrary, there exists a point x ∈ ri(A) and a subsequence {k} such that x ∉ C_k for all k, then for each C_k there exists a half-space L_k such that x ∈ L_k and C_k ⊆ L_k^c. Let B(x) be a ball such that B(x) ⊆ A. We have:

µ[A \ C_k] ≥ µ[A ∩ L_k] ≥ µ[B(x) ∩ L_k] ≥ µ[B(x)]/2 > 0.

This contradiction implies ri(A) ⊆ lim inf ri(C_n).

Our next lemma shows that the Lebesgue measure of the sublevel sets of a convex function grows at most polynomially.

Lemma 4.3. Let g be a convex function and let y1 < y2 < y3 be values such that lev_{y1} g ≠ ∅. Then we have:

(4.24)    µ[lev_{y3} g] ≤ ((y3 − y1)/(y2 − y1))^d µ[lev_{y2} g].
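As an illustration of ours (not from the paper), take g(x) = ‖x‖² on R², whose sublevel set lev_y g is the disk of radius √y with area πy; the bound (4.24) can then be checked directly, e.g. for y1 = 0, y2 = 1, y3 = 4:

```python
import math

d = 2
def sublevel_area(y):
    # g(x) = ||x||^2 on R^2: lev_y g is a disk of radius sqrt(y)
    return math.pi * y if y >= 0 else 0.0

y1, y2, y3 = 0.0, 1.0, 4.0
lhs = sublevel_area(y3)                                  # 4*pi
rhs = ((y3 - y1) / (y2 - y1)) ** d * sublevel_area(y2)   # 16*pi
print(lhs <= rhs)  # True
```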

Proof. By assumption we have:

µ[lev_{y3} g] ≥ µ[lev_{y2} g] ≥ µ[lev_{y1} g] > 0.

Let us consider the set L defined as:

L = {x1 + k(x − x1) | x ∈ lev_{y2} g},

where x1 is any fixed point such that g(x1) = y1 and:

k = (y3 − y1)/(y2 − y1) > 1.

Then:

µ[L] = k^d µ[lev_{y2} g],


and therefore it is enough to prove that lev_{y3} g ⊆ L. If x3 ∈ lev_{y3} g, then for x2 = x1 + (x3 − x1)/k we have:

x3 = x1 + k(x2 − x1),
g(x2) ≤ (1 − 1/k) g(x1) + (1/k) g(x3) ≤ (1 − 1/k) y1 + (1/k) y3 = y2,

and thus x2 ∈ lev_{y2} g.

Corollary 4.4. If g is a convex function, then the function µ[lev_y g] is continuous on (inf g, sup g).

4.1. Maximal convex minorant. In this section we describe the convex function f_c which is, in some sense, the closest to a given function f.

Definition 4.5. The maximal convex minorant f_c of a proper function f is the supremum of all linear functions l such that l ≤ f.

It is possible that f does not majorize any linear function, and then f_c = −∞. If that is not the case, the following properties of the maximal convex minorant hold:

Lemma 4.6. Let f be a function and f_c ≠ −∞ its maximal convex minorant. Then:

1. f_c is a closed proper convex function;
2. if f is a proper convex function, then f_c is its closure;
3. f_c ≤ f;
4. (f_c)^*(y) = sup_{x ∈ R^d} (⟨y, x⟩ − f(x)).

Proof. This follows from Corollary 12.1.1 of Rockafellar [1970].

The maximal convex minorant allows us to see an important duality between the operations of pointwise minimum and pointwise maximum.

Lemma 4.7. Let the f_i be proper convex functions and let g = inf_i f_i be their pointwise infimum. Then (g_c)^* = sup_i f_i^*.

Proof. This follows from Theorem 16.5 of Rockafellar [1970].


4.2. Subdifferential.

Definition 4.8. The subdifferential ∂h(x) of a convex function h at the point x is the set of all vectors v which satisfy the inequality:

h(z) ≥ ⟨v, z − x⟩ + h(x)    for all z.

Obviously ∂h(x) is a closed convex set. It may be empty, but if it is not, the function h is called subdifferentiable at x.

Lemma 4.9. Let h be a proper convex function. Then for x ∈ ri dom h the subdifferential ∂h(x) is not empty.

Proof. This follows from Theorem 23.4 of Rockafellar [1970].

Lemma 4.10. Let h be a closed proper convex function. Then the following conditions on x and x^* are equivalent:

1. x^* ∈ ∂h(x);
2. l(z) = ⟨x^*, z⟩ − h^*(x^*) is a support plane for epi(h) at x;
3. h(x) + h^*(x^*) = ⟨x^*, x⟩;
4. x ∈ ∂h^*(x^*);
5. l(z) = ⟨x, z⟩ − h(x) is a support plane for epi(h^*) at x^*.
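Condition 3 of Lemma 4.10 is the equality case of the Fenchel–Young inequality h(x) + h^*(x^*) ≥ ⟨x^*, x⟩. A small numerical sketch of our own (not from the paper) with h(x) = x⁴ on R, approximating h^* on a grid: equality should hold at x^* = h′(x) = 4x³.

```python
# Fenchel-Young: h(x) + h*(y) >= x*y, with equality iff y is a subgradient of h at x
h = lambda x: x ** 4

def conjugate(y, lo=-2.0, hi=2.0, n=200001):
    # grid approximation of h*(y) = sup_x (x*y - h(x))
    xs = (lo + (hi - lo) * i / (n - 1) for i in range(n))
    return max(x * y - h(x) for x in xs)

x = 0.5
y = 4 * x ** 3            # derivative (hence subgradient) of h at x
gap = h(x) + conjugate(y) - x * y
print(gap)  # ~0: Fenchel-Young holds with equality
```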

Proof. This follows from Theorem 23.5 of Rockafellar [1970].

Lemma 4.11. Let h1 and h2 be proper convex functions such that ri dom h1 ∩ ri dom h2 ≠ ∅. Then ∂(h1 + h2) = ∂h1 + ∂h2 for all x.

Proof. This follows from Theorem 23.8 of Rockafellar [1970].

4.3. Polyhedral functions.

Definition 4.12. A polyhedral convex set is a set which can be expressed as an intersection of finitely many half-spaces. A polyhedral convex function is a convex function whose epigraph is polyhedral.

From Theorem 19.1 of Rockafellar [1970] we have that the epigraph of a polyhedral function h : R^d → R has a finite number of extremal points and faces. We call the projections of the extremal points the knots of h, and the projections of the nonvertical d-dimensional faces the facets of h. Thus the sets of knots and facets of a polyhedral function are always finite. Moreover, by


Theorem 18.3 of Rockafellar [1970], the knots are the extremal points of the facets. Finally, let {C_i} be the set of facets of a polyhedral function h; then:

dom h = ∪_i C_i,    ri(C_i) ∩ ri(C_j) = ∅ for i ≠ j,

and on dom h we have h = max(l_i), where the l_i are linear functions. For each C_i there exists l_i such that:

C_i = {x | h(x) = l_i(x)}.

Lemma 4.13. Let h be a polyhedral convex function and x ∈ dom h. Then ∂h(x) ≠ ∅.

Proof. This follows from Theorem 23.10 of Rockafellar [1970].

Lemma 4.14. For a set of points x = {x_i}_{i=1}^n with x_i ∈ R^d and any point p ∈ R^n, consider the family of all convex functions h with ev_x h = p. The unique maximal element U_x^p in this family is a polyhedral convex function with domain dom U_x^p = conv(x) and set of knots K ⊆ x.

Proof. The points (x_i, p_i) and the direction (0, 1) belong to the epigraph of any convex function h in our family, and so does the convex hull U of these points and this direction. By construction, U is the epigraph of some closed proper convex function U_x^p such that dom U_x^p = conv(x); by Theorem 19.1 of Rockafellar [1970] this function is polyhedral, by Corollary 18.3.1 of Rockafellar [1970] the set of its knots K belongs to x, and since epi(U_x^p) = U ⊆ epi(h) we have h ≤ U_x^p. On the other hand, since (x_i, p_i) ∈ U, we have p_i = h(x_i) ≤ U_x^p(x_i) ≤ p_i and therefore U_x^p(x_i) = p_i, which proves the lemma.

Lemma 4.15. For a set of points x = {x_i}_{i=1}^n, a convex set C such that x_i ∈ ri(C), and any point p ∈ R^n, consider the family of all convex functions h with ev_x h = p and C ⊆ dom h. Any minimal element L_x^p in this family is a polyhedral convex function with dom L_x^p = R^d. For each facet C of L_x^p, ri(C) contains at least one element of x.

Proof. For any function h in our family, let us consider the set of linear functions l_i such that l_i(x_i) = h(x_i) = p_i and l_i ≤ h, and which correspond


to arbitrarily chosen nonvertical support planes for epi(h) at the x_i. Then L = max(l_i) is polyhedral, and since l_j(x_i) ≤ h(x_i) = p_i we have L(x_i) = p_i. We also have dom L = R^d. If the interior of some facet C_i of L does not contain elements of x, we can exclude the corresponding linear function l_i from the maximum. For the new polyhedral function L′ = max_{j≠i} l_j we still have ev_x L′ = p. Now we repeat this procedure until the interior of each facet contains at least one element of x, and denote the function obtained by L_x^p. If a closed proper convex function h is such that ev_x h = p and h ≤ L_x^p, then for any facet C_i with corresponding linear function l_i we have h ≤ l_i on C_i, and the supremum of h on the convex set C_i is attained at an interior point x_j ∈ x. By Theorem 32.1 of Rockafellar [1970], h ≡ L_x^p on C_i. Thus h ≡ L_x^p, and L_x^p is a minimal element of our family.

Lemma 4.16. For the linear function l(x) = a^T x + b, the polyhedral set A = {l ≥ c} ∩ R_+^d is bounded if and only if all coordinates of a are negative. In this case, if b ≥ c the set A is the simplex with vertices p_i = ((c − b)/a_i) e_i and 0, where the e_i are basis vectors; otherwise, A is empty.

Proof. If a coordinate a_i is nonnegative, then the direction {λe_i}, λ > 0, belongs to the recession cone of A, and thus A is unbounded. If all coordinates a_i are negative and b ≤ c, the set A is either empty or consists of the zero vector 0. Finally, if the a_i are negative and b > c, then for x ∈ A we can define θ_i = a_i x_i/(c − b) ≥ 0. Then 1 ≥ Σ_i θ_i and x = Σ_i θ_i p_i, which proves that A is a simplex.

4.4. Strong convexity. Following Rockafellar and Wets [1998], page 565, we say that a proper convex function h : R^d → R is strongly convex if there exists a constant σ such that:

(4.25)    h(θx + (1 − θ)y) ≤ θ h(x) + (1 − θ) h(y) − ½ σ θ(1 − θ) ‖x − y‖²

for all x, y and θ ∈ (0, 1). There is a simple characterization of strong convexity:

Lemma 4.17. A proper convex function f : R^d → R is strongly convex if and only if the function f(x) − ½ σ‖x‖² is convex.

Since we need more precise control over the curvature of a convex function, we define a generalization of strong convexity based on the characterization above:


Definition 4.18. We say that a proper convex function h : R^d → R is G-strongly convex if there exist a point x0, a positive semidefinite d × d matrix G and a convex function q such that:

(4.26)    h(x) = ½ (x − x0)^T G(x − x0) + q(x)

for all x.
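As a quick numerical illustration of our own (not from the paper): h(x) = x⁴ + x² on R is G-strongly convex with G = 2 and q(x) = x⁴, since h(x) − x² = x⁴ has nonnegative second differences:

```python
# check convexity of q(x) = h(x) - (1/2)*G*x^2 via second differences on a grid
h = lambda x: x ** 4 + x ** 2
G = 2.0
q = lambda x: h(x) - 0.5 * G * x * x   # = x**4

step = 0.01
xs = [-3 + step * i for i in range(601)]
second_diffs = [q(x - step) - 2 * q(x) + q(x + step) for x in xs]
print(min(second_diffs) >= -1e-12)  # True: q is convex on the grid
```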

Obviously, strong convexity is equivalent to σI-strong convexity, where I denotes the d × d identity matrix. Note that the definition does not depend on the choice of x0.

Definition 4.19. We say that a proper convex function h : R^d → R is locally G-strongly convex at a point x0 if there exist an open neighborhood of x0, a positive semidefinite d × d matrix G and a convex function q such that (4.26) holds for any x in this neighborhood.

We can relate G-strong convexity to the Hessian of a smooth convex function:

Lemma 4.20. If a proper convex function h : R^d → R is continuously twice differentiable at x0, then h is locally (1 − ε)∇²h-strongly convex for any ε ∈ (0, 1).

The last result suggests the following definition:

Definition 4.21. For a proper convex function h : R^d → R we define the curvature curv_{x0} h at a point x0 as:

(4.27)    curv_{x0} h = sup_{G ∈ SC(h; x0)} det(G),

where SC(h; x0) is the set of all positive semidefinite matrices G such that h is locally G-strongly convex at x0.

Lemma 4.20 implies:

Lemma 4.22. If a proper convex function h : R^d → R is continuously twice differentiable at x0 and the Hessian ∇²h(x0) is positive definite, then:

(4.28)    curv_{x0} h = det(∇²h(x0)).
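For a smooth example of ours (not from the paper), h(x, y) = x² + xy + y² + x⁴ has ∇²h(0) = [[2, 1], [1, 2]] with determinant 3, so by (4.28) curv_0 h = 3; a finite-difference check:

```python
def h(x, y):
    return x * x + x * y + y * y + x ** 4

# central finite-difference Hessian of h at the origin
s = 1e-3
hxx = (h(s, 0) - 2 * h(0, 0) + h(-s, 0)) / s ** 2
hyy = (h(0, s) - 2 * h(0, 0) + h(0, -s)) / s ** 2
hxy = (h(s, s) - h(s, -s) - h(-s, s) + h(-s, -s)) / (4 * s ** 2)
det = hxx * hyy - hxy * hxy
print(det)  # approximately 3
```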


APPENDIX A: NOTATION

R = (−∞, +∞)
R̄ = [−∞, +∞]
R_+ = [0, +∞)
R̄_+ = [0, +∞]
|x| = ∏_{k=1}^d x_k, for x ∈ R_+^d
C = {f : R^d → R̄ | f a closed proper convex function}
D = {p : R^d → R | p a density}
G(h) = {h ∘ g | g ∈ C, h ∘ g ∈ D}
L_n g = P_n log h ∘ g
ev_x f = (f(x_1), . . . , f(x_n)), x_i ∈ R^d
supp(f) = {x | f(x) ≠ 0}
δ(· | C) = ∞ · 1_{C^c} + 0 · 1_C
lev_y g = {x | g(x) ≤ y}
µ[S] = Lebesgue measure of S
{f S a} = {x ∈ X | f(x) S a}
B(x_0; r) = {x : ‖x − x_0‖ < r}
B_H(x_0; r) = {x : (x − x_0)^T H(x − x_0) < r²}
curv_x h = curvature of the convex function h at the point x

ACKNOWLEDGMENTS

This research is a part of the Ph.D. dissertation of the first author at the University of Washington.


REFERENCES

An, M. Y. (1998). Logconcavity versus logconvexity: a complete characterization. J. Econom. Theory 80 350–369.
Avriel, M. (1972). r-convex functions. Math. Programming 2 309–323.
Balabdaoui, F., Rufibach, K. and Wellner, J. A. (2009). Limit distribution theory for maximum likelihood estimation of a log-concave density. Ann. Statist. 37 1299–1331.
Baraud, Y. and Birgé, L. (2009). Model selection for functions on high-dimensional spaces. Tech. rep., Université Paris VI. In Mathematisches Forschungsinstitut Oberwolfach Report No. 39/2009, Challenges in Statistical Theory: Complex Data Structures and Algorithmic Optimization.
Bobkov, S. G. (1999). Isoperimetric and analytic inequalities for log-concave probability measures. Ann. Probab. 27 1903–1921.
Bobkov, S. G. (2007a). Large deviations and isoperimetry over convex probability measures with heavy tails. Electron. J. Probab. 12 1072–1100 (electronic).
Bobkov, S. G. (2007b). On isoperimetric constants for log-concave probability distributions. In Geometric Aspects of Functional Analysis, vol. 1910 of Lecture Notes in Math. Springer, Berlin, 81–88.
Bobkov, S. G. and Ledoux, M. (2009). Weighted Poincaré-type inequalities for Cauchy and other convex measures. Ann. Probab. 37 403–427.
Borell, C. (1975). Convex set functions in d-space. Period. Math. Hungar. 6 111–136.
Brascamp, H. J. and Lieb, E. H. (1976). On extensions of the Brunn–Minkowski and Prékopa–Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. J. Functional Analysis 22 366–389.
Cordero-Erausquin, D., McCann, R. J. and Schmuckenschläger, M. (2001). A Riemannian interpolation inequality à la Borell, Brascamp and Lieb. Invent. Math. 146 219–257.
Cule, M. and Samworth, R. (2009). Theoretical properties of the log-concave maximum likelihood estimator of a multidimensional density. Tech. rep., Cambridge University. Available at arXiv:0908.4400v1.
Cule, M., Samworth, R. and Stewart, M. (2007). Maximum likelihood estimation of a multidimensional log-concave density. Available at arXiv:0804.3989v1.
Dharmadhikari, S. and Joag-Dev, K. (1988). Unimodality, Convexity, and Applications. Probability and Mathematical Statistics, Academic Press Inc., Boston, MA.
Donoho, D. L. and Liu, R. C. (1991). Geometrizing rates of convergence. II, III. Ann. Statist. 19 633–667, 668–701.
Dudley, R. M. (1999). Uniform Central Limit Theorems, vol. 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge.
Duembgen, L. and Rufibach, K. (2009). Maximum likelihood estimation of a log-concave density and its distribution function: basic properties and uniform consistency. Bernoulli 15 40–68.
Dümbgen, L., Hüsler, A. and Rufibach, K. (2007). Active set and EM algorithms for log-concave densities based on complete and censored data. Tech. rep., University of Bern. Available at arXiv:0707.4643.
Fougères, P. (2005). Spectral gap for log-concave probability measures on the real line. In Séminaire de Probabilités XXXVIII, vol. 1857 of Lecture Notes in Math. Springer, Berlin, 95–123.
Groeneboom, P., Jongbloed, G. and Wellner, J. A. (2001). Estimation of a convex function: characterizations and asymptotic theory. Ann. Statist. 29 1653–1698.
Ibragimov, I. A. (1956). On the composition of unimodal distributions. Teor. Veroyatnost. i Primenen. 1 283–288.
Johnson, N. L. and Kotz, S. (1972). Distributions in Statistics: Continuous Multivariate Distributions. Wiley, New York.
Jongbloed, G. (2000). Minimax lower bounds and moduli of continuity. Statist. Probab. Lett. 50 279–284.
Koenker, R. and Mizera, I. (2008). Quasi-concave density estimation. Unpublished.
Milman, E. and Sodin, S. (2008). An isoperimetric inequality for uniformly log-concave measures and uniformly convex bodies. J. Funct. Anal. 254 1235–1268.
Okamoto, M. (1973). Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann. Statist. 1 763–765.
Pal, J. K., Woodroofe, M. B. and Meyer, M. C. (2007). Estimating a Polya frequency function. In Complex Datasets and Inverse Problems: Tomography, Networks, and Beyond, vol. 54 of IMS Lecture Notes–Monograph Series. IMS, 239–249.
Prékopa, A. (1973). On logarithmic concave measures and functions. Acta Sci. Math. (Szeged) 34 335–343.
Rinott, Y. (1976). On convexity of measures. Ann. Probability 4 1020–1026.
Rockafellar, R. T. (1970). Convex Analysis. Princeton Mathematical Series, No. 28, Princeton University Press, Princeton, N.J.
Rockafellar, R. T. and Wets, R. J.-B. (1998). Variational Analysis, vol. 317 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin.
Rufibach, K. (2006). Log-concave density estimation and bump hunting for i.i.d. observations. Ph.D. thesis, Universities of Bern and Göttingen.
Rufibach, K. (2007). Computing maximum likelihood estimators of a log-concave density function. J. Statist. Comp. Sim. 77 561–574.
Schuhmacher, D., Hüsler, A. and Duembgen, L. (2009). Multivariate log-concave distributions as a nearly parametric model. Tech. rep., University of Bern. Available at arXiv:0907.0250v1.
Uhrin, B. (1984). Some remarks about the convolution of unimodal functions. Ann. Probab. 12 640–645.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes, with Applications to Statistics. Springer Series in Statistics, Springer-Verlag, New York.
Walther, G. (2010). Inference and modeling with log-concave distributions. Statistical Science 25, to appear.

Department of Statistics, Box 354322
University of Washington
Seattle, WA 98195-4322
E-mail: [email protected]

Department of Statistics, Box 354322 University of Washington Seattle, WA 98195-4322 E-mail: [email protected]
