arXiv:1506.00034v1 [cs.IT] 29 May 2015
Bracketing Numbers of Convex Functions on Polytopes

Charles R. Doss*
School of Statistics, University of Minnesota
Abstract

We study bracketing numbers for spaces of bounded convex functions in the $L_p$ norms. We impose no Lipschitz constraint. Previous results gave bounds when the domain of the functions is a hyperrectangle. We extend these results to the case wherein the domain is a polytope. Bracketing numbers are crucial quantities for understanding the asymptotic behavior of many statistical nonparametric estimators. Our results are of particular interest in many multidimensional estimation problems based on convexity shape constraints.

Keywords: bracketing entropy, Kolmogorov metric entropy, convex functions, convex polytope, covering numbers, nonparametric estimation, convergence rates

Mathematics Subject Classification (2010): Primary: 52A41, 41A46; Secondary: 52A27, 52B11, 52C17, 62G20
1 Introduction and Motivation
To quantify the size of an infinite-dimensional set, the pioneering work of Kolmogorov and Tihomirov (1961) studied the metric covering number of the set and its logarithm, the metric entropy. Metric entropy quantifies the amount of information it takes to recover any element of a set with a given accuracy $\epsilon$. This quantity is important in many areas of statistics and information theory; in particular, the asymptotic behavior of empirical processes, and thus of many statistical estimators, is fundamentally tied to the entropy of the class under consideration (Dudley, 1978).

In this paper, we are interested not in the metric entropy but in the related bracketing entropy for a class of functions. Let $\mathcal{F}$ be a set of functions and let $d$ be a metric on $\mathcal{F}$. We call a pair of functions $[l, u]$ a bracket if $l \le u$ pointwise. For $\epsilon > 0$, the $\epsilon$-bracketing number of $\mathcal{F}$, denoted $N_{[\,]}(\epsilon, \mathcal{F}, d)$, is the smallest $N$ such that there exist brackets $[l_i, u_i]$, $i = 1, \ldots, N$, with $d(l_i, u_i) \le \epsilon$, such that for all $f \in \mathcal{F}$ there exists $i$ with $l_i(x) \le f(x) \le u_i(x)$ for all $x$. Like metric entropies, bracketing entropies are fundamentally tied to rates of convergence of certain estimators (see, e.g., Birgé and Massart (1993), van der Vaart and Wellner (1996), van de Geer (2000)).

In this paper, we study the bracketing entropy of classes of convex functions. Our interest is motivated by the study of nonparametric estimation of functions satisfying

*Charles R. Doss, 224 Church St. SE, Minneapolis, MN 55455. Email: [email protected]
convexity restrictions, such as the least-squares estimator of a convex or concave regression function on $\mathbb{R}^d$ (e.g., Seijo and Sen (2011), Guntuboyina and Sen (2015)), possibly in the high-dimensional setting (Xu et al., 2014), or estimators of a log-concave or $s$-concave density (e.g., Seregin and Wellner (2010), Koenker and Mizera (2010), Kim and Samworth (2014), Doss and Wellner (2015a,b), among others). Bracketing entropy bounds are directly relevant for studying the asymptotic behavior of estimators in these contexts.

Let $D \subset \mathbb{R}^d$ be a convex set, let $v_1, \ldots, v_d \in \mathbb{R}^d$ be linearly independent vectors, let $B, \Gamma_1, \ldots, \Gamma_d$ be positive reals, and let $v = (v_1, \ldots, v_d)$ and $\Gamma = (\Gamma_1, \ldots, \Gamma_d)$. Then we let $\mathcal{C}(D, B, \Gamma, v)$ be the class of convex functions $\varphi$ defined on $D$ such that $|\varphi(x)| \le B$ for all $x \in D$, and such that $|\varphi(x + \lambda v_i) - \varphi(x)| \le \Gamma_i |\lambda|$ as long as $x$ and $x + \lambda v_i$ are both elements of $D$. Let $\mathcal{C}(D, B)$ be the convex functions on $D$ with uniform bound $B$ and no Lipschitz constraints. For $f : D \to \mathbb{R}$, let $L_p(f) = \big( \int_D |f(x)|^p \, dx \big)^{1/p}$ for $1 \le p < \infty$, and let $L_\infty(f) = \sup_{x \in D} |f(x)|$.

Since convex functions are Lebesgue-almost-everywhere twice differentiable, their entropies correspond to the entropy for twice-differentiable function classes, namely $\epsilon^{-d/2}$. When $D$ is the hyperrectangle $\prod^d [-1, 1]$, $B = 1$, and $\Gamma_i = 1$, Bronshtein (1976) and Dudley (1984, Chapter 8) indeed show that $\log N(\epsilon, \mathcal{C}(D, B, \Gamma), L_\infty) \lesssim \epsilon^{-d/2}$. Here, $N(\epsilon, \mathcal{F}, \rho)$ is the $\epsilon$-covering number of $\mathcal{F}$ in the metric $\rho$, i.e., the smallest number of balls of $\rho$-radius $\epsilon$ that cover $\mathcal{F}$. Bracketing entropies govern the suprema of corresponding empirical processes and thus govern the rates of convergence of certain statistical estimators. In many problems, including some of the statistical ones mentioned above, the classes that arise do not naturally have Lipschitz constraints, and so the class $\mathcal{C}(D, B, \Gamma)$ is not of immediate use.
Without Lipschitz constraints, the $L_\infty$ bracketing numbers are not bounded, but one can use the $L_p$ metrics, $1 \le p < \infty$, instead: Dryanov (2009) and Guntuboyina and Sen (2013) found bounds when $d = 1$ and $d > 1$, respectively, for metric entropies of $\mathcal{C}(D, 1)$: they found that $\log N(\epsilon, \mathcal{C}(D, 1), L_p) \lesssim \epsilon^{-d/2}$, again with $D$ a hyperrectangle. The $d = 1$ case (from Dryanov (2009)) was the fundamental building block in computing the rate of convergence of the univariate log-concave and $s$-concave MLEs in Doss and Wellner (2015a). In the corresponding statistical problems when $d > 1$, the domain of the functions under consideration is not a hyperrectangle but rather a polytope, and thus the results of Guntuboyina and Sen (2013) are not always immediately applicable; there is a need for results on more general convex domains $D$. It is not immediate that previous results will apply, since $D$ may have a complicated boundary. In this paper we indeed find bracketing entropies for all polytopes $D$, attaining the bound $\log N_{[\,]}(\epsilon, \mathcal{C}(D, B), L_p) \lesssim \epsilon^{-d/2}$ with $1 \le p < \infty$, $D$ a polytope, and $0 < B < \infty$. Note that we work with bracketing entropy rather than metric entropy. Bracketing entropies are larger than metric entropies (van der Vaart and Wellner, 1996), so bracketing entropy bounds imply metric entropy bounds of the same order. Along the way, we also generalize the
results of Bronshtein (1976) to bound the $L_\infty$ bracketing numbers of $\mathcal{C}(D, B, \Gamma)$ when $D$ is arbitrary. One of the benefits of our method is its constructive nature. We initially study only simple polytopes, and in that case we attempt to keep track of how the constants depend on $D$.

This paper is organized as follows. In Section 2 we prove bounds for the bracketing entropy of classes of convex functions with Lipschitz bounds, using the $L_\infty$ metric. We use these to prove our main result, on the bracketing entropy of classes of convex functions without Lipschitz bounds in the $L_p$ metrics, $1 \le p < \infty$, in Section 3. We defer some of the details of the proofs to Section 4.
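As a concrete toy illustration of the bracketing-number definition above (the class, grid, and tolerance here are our own choices, not objects from the paper), consider the one-parameter family $f_a(x) = a x^2$ on $[0,1]$ with the $L_\infty$ metric:

```python
import numpy as np

# Toy illustration of the bracketing-number definition (our own example): for
# the class {f_a(x) = a*x^2 : a in [0,1]} on [0,1] with the sup metric, the
# brackets [a_j*x^2, (a_j + eps)*x^2] have sup-size eps, and ceil(1/eps) of
# them cover the whole class.
eps = 0.1
lefts = np.arange(0.0, 1.0, eps)       # left endpoints a_j of the brackets
x = np.linspace(0.0, 1.0, 201)

def covered(a):
    """True if f_a lies inside at least one bracket (up to float tolerance)."""
    f = a * x**2
    for aj in lefts:
        if np.all(aj * x**2 <= f + 1e-12) and np.all(f <= (aj + eps) * x**2 + 1e-12):
            return True
    return False

n_brackets = len(lefts)                # 10 brackets of sup-size eps suffice
assert n_brackets == 10
assert all(covered(a) for a in np.linspace(0.0, 1.0, 50))
```

Here the bracketing number grows like $1/\epsilon$ because the class is one-dimensional; the point of the paper is that for the infinite-dimensional classes $\mathcal{C}(D, B)$ the growth of the log-bracketing number is $\epsilon^{-d/2}$.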
2 Bracketing with Lipschitz Constraints
If we have sets $D_i \subset \mathbb{R}^d$, $i = 1, \ldots, M$, for $M \in \mathbb{N}$, and $D \subseteq \cup_{i=1}^M D_i$, then for $\epsilon_i > 0$,

$$N_{[\,]}\Big( \Big( \sum_{i=1}^M \epsilon_i^p \Big)^{1/p}, \, \mathcal{C}(D, 1), L_p \Big) \le \prod_{i=1}^M N_{[\,]}\big( \epsilon_i, \mathcal{C}(D, 1)|_{D_i}, L_p \big), \qquad (1)$$

where, for a class of functions $\mathcal{F}$ and a set $G$, we let $\mathcal{F}|_G$ denote the class $\{f|_G : f \in \mathcal{F}\}$, where $f|_G$ is the restriction of $f$ to the set $G$. We will apply (1) to a cover of $D$ by sets $G$ with the property that

$$\mathcal{C}(D, 1)|_G \subseteq \mathcal{C}(G, 1, \Gamma)$$
for some $\Gamma < \infty$, so that we can apply bracketing results for classes of convex functions with Lipschitz bounds. Thus, in this section, we develop the needed bracketing results for such Lipschitz classes, for arbitrary domains $G$. Recall that $\mathcal{C}(D, 1, \Gamma, v)$ is the class of convex functions $\varphi$ defined on $D$, uniformly bounded by $1$ and with Lipschitz parameter $\Gamma_i$ in the direction $v_i$. When the $v_i$ are the standard basis of $\mathbb{R}^d$, we just write $\mathcal{C}(D, 1, \Gamma)$. When we have Lipschitz constraints on convex functions, we will see that the situation for forming brackets for $\mathcal{C}(D, 1, \Gamma)$ with $D \subseteq [0,1]^d$ is essentially the same as for forming brackets for $\mathcal{C}([0,1]^d, 1, \Gamma)$. For two sets $C, D \subset \mathbb{R}^d$, define the Hausdorff distance between them by

$$l_H(C, D) := \max\Big( \sup_{x \in D} \inf_{y \in C} \|x - y\|, \; \sup_{y \in C} \inf_{x \in D} \|x - y\| \Big).$$
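The Hausdorff distance just defined can be computed by brute force for finite point clouds; the following sketch (our own illustration, with an assumed discretization) checks it on two discretized squares:

```python
import numpy as np

# Brute-force Hausdorff distance between two finite point clouds, following
# the definition above: l_H(C, D) is the larger of the two one-sided
# sup-inf distances.
def hausdorff(C, D):
    d2 = np.sum((C[:, None, :] - D[None, :, :])**2, axis=2)  # pairwise squared distances
    return max(np.sqrt(d2.min(axis=1)).max(), np.sqrt(d2.min(axis=0)).max())

# Two discretized unit squares, the second shifted by 0.3 in x; the Hausdorff
# distance between the underlying squares is exactly the shift.
g = np.linspace(0.0, 1.0, 41)
square = np.array([(a, b) for a in g for b in g])
shifted = square + np.array([0.3, 0.0])
assert abs(hausdorff(square, shifted) - 0.3) < 1e-9
```

For the epigraph sets $V_B(f)$ used below, the same brute-force computation applies to discretizations of the epigraphs.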
For $B > 0$ and a convex function $f$ defined on a convex set $D$, define the epigraph $V_B(f)$ by

$$V_B(f) := \{(x_1, \ldots, x_d, x_{d+1}) : (x_1, \ldots, x_d) \in D, \; f(x_1, \ldots, x_d) \le x_{d+1} \le B\}.$$

Bronshtein (1976) found entropy estimates in the Hausdorff distance for classes of $d$-dimensional convex sets (see also Dudley (1999), Chapter 8). These entropy bounds for classes of convex sets are the main tool for Bronshtein (1976)'s entropy bounds for classes of convex functions, and they will also be the main tool in our bracketing bounds for convex functions (with Lipschitz constraints).
Theorem 2.1 (Bronshtein (1976)). For any $R > 0$ and any integer $d \ge 1$, there exist positive real numbers $c_d$ and $\epsilon_{0,d}$ such that for all $0 < \epsilon \le \epsilon_{0,d} R$ there is an $\epsilon$-cover of $\mathcal{K}_{d+1}(R)$ in the Hausdorff distance of cardinality not larger than $\exp\big( c_d (R/\epsilon)^{d/2} \big)$.

The following lemma connects the Hausdorff distance on sets of epigraphs of Lipschitz functions to the supremum distance for those functions.
Lemma 2.1. Let $G \subseteq [0,1]^d$ be any convex set and $B, \Gamma_1, \ldots, \Gamma_d > 0$. For $f, g \in \mathcal{C}(G, B, (\Gamma_1, \ldots, \Gamma_d))$,

$$\|f - g\|_\infty \le l_H(V_B(f), V_B(g)) \sqrt{1 + \sum_{i=1}^d \Gamma_i^2}.$$
Proof. For ease of notation, let $\rho = l_H(V_B(f), V_B(g))$. Fix $x \in G$ and suppose, without loss of generality, that $f(x) < g(x)$. Now, $(x, f(x)) \in V_B(f)$, so there exists $(x', y') \in V_B(g)$ such that $\|(x', y') - (x, f(x))\| \le \rho$. Since $f(x) < g(x)$, $(x, f(x))$ is outside the epigraph $V_B(g)$, so by convexity of $V_B(g)$ we may take $y' = g(x')$. Thus

$$0 \le g(x) - f(x) = g(x) - g(x') + g(x') - f(x) \le \|x - x'\| \sqrt{\Gamma_1^2 + \cdots + \Gamma_d^2} + |g(x') - f(x)|,$$

since

$$|g(x) - g(x')| = |g(x_1, \ldots, x_d) - g(x_1, \ldots, x_{d-1}, x_d') + \cdots + g(x_1, x_2', \ldots, x_d') - g(x_1', \ldots, x_d')|,$$

which is bounded above by

$$|x_d - x_d'| \Gamma_d + \cdots + |x_1 - x_1'| \Gamma_1 \le \|x - x'\| \sqrt{\Gamma_1^2 + \cdots + \Gamma_d^2}$$

by the Cauchy–Schwarz inequality. Thus, again by Cauchy–Schwarz,

$$0 \le g(x) - f(x) \le \rho \sqrt{1 + \sum_i \Gamma_i^2},$$
as desired.

Theorem 3.2 of Guntuboyina and Sen (2013) gives the following result when $D = \prod_{i=1}^d [a_i, b_i]$; we now extend it to the case of a general $D$. When we consider convex functions without Lipschitz constraints, we will partition $D$ into sets that are similar to parallelotopes. Note that if $P \subset R \subset \mathbb{R}^d$, where $R$ is a hyperrectangle and $P$ is a parallelotope defined by vectors $v_1, \ldots, v_d$, and if $A$ is a linear map with $v_1, \ldots, v_d$ as its eigenvectors (thus rescaling $P$), then $AR$ will not necessarily still be a hyperrectangle, i.e., its axes may no longer be orthogonal. Thus, we cannot argue by simple scaling arguments that bracketing numbers for $P$ scale with the lengths along the vectors $v_i$.

Theorem 2.2. Let $a_i < b_i$ and let $D \subset \prod_{i=1}^d [a_i, b_i]$ be a convex set. Let $\Gamma = (\Gamma_1, \ldots, \Gamma_d)$ and $0 < B, \Gamma_1, \ldots, \Gamma_d < \infty$. Then there exist positive constants $c \equiv c_d$ and $\epsilon_0 \equiv \epsilon_{0,d}$ such that

$$\log N_{[\,]}\big( \epsilon \, \mathrm{Vol}(D)^{1/p}, \mathcal{C}(D, B, \Gamma), L_p \big) \le \log N_{[\,]}\big( \epsilon, \mathcal{C}(D, B, \Gamma), L_\infty \big) \le c \left( \frac{B + \sum_{i=1}^d \Gamma_i (b_i - a_i)}{\epsilon} \right)^{d/2}$$

for $0 < \epsilon \le \epsilon_0 \big( B + \sum_{i=1}^d \Gamma_i (b_i - a_i) \big)$ and $p \ge 1$.
Proof. The first inequality of the theorem is elementary; we will show the second. First we note the following scaling relationship. For $f \in \mathcal{C}(D, B, \Gamma)$ we can define $\tilde f : \tilde D \to \mathbb{R}$, where $\tilde D \subseteq [0,1]^d$, by $\tilde f(t_1, \ldots, t_d) = f(a_1 + t_1(b_1 - a_1), \ldots, a_d + t_d(b_d - a_d))$. Then $\tilde f \in \mathcal{C}\big( \tilde D, B, (\Gamma_1(b_1 - a_1), \ldots, \Gamma_d(b_d - a_d)) \big)$. This shows that

$$N_{[\,]}\big( \epsilon, \mathcal{C}(\tilde D, B, (\Gamma_1(b_1 - a_1), \ldots, \Gamma_d(b_d - a_d))), L_\infty \big) = N_{[\,]}\big( \epsilon, \mathcal{C}(D, B, (\Gamma_1, \ldots, \Gamma_d)), L_\infty \big). \qquad (2)$$

Thus, we now let $a_i = 0$ and $b_i = 1$ and consider a convex domain $\tilde D \subset [0,1]^d$. It is then clear that if $f \in \mathcal{C}(\tilde D, B)$ then $V_B(f) \in \mathcal{K}_{d+1}(\sqrt{d + B^2})$, where

$$\mathcal{K}_{d+1}(R) = \{D : D \text{ is a closed, convex set}, \; D \subseteq B(0, R)\}$$

for $R > 0$. Thus, given an $\epsilon / \big( 4 \sqrt{1 + \Gamma_1^2 + \cdots + \Gamma_d^2} \big)$-cover in Hausdorff distance of $\mathcal{K}_{d+1}(R)$ of $\tilde N$ elements $\tilde V_1, \ldots, \tilde V_{\tilde N}$, we can pick $V_B(f_1), \ldots, V_B(f_N)$, for $N \le \tilde N$, such that $l_H(V_B(f_i), \tilde V_i) \le \epsilon / \big( 4 \sqrt{1 + \Gamma_1^2 + \cdots + \Gamma_d^2} \big)$, if such an $f_i \in \mathcal{C}(\tilde D, B, (\Gamma_1, \ldots, \Gamma_d))$ exists. Then from Lemma 2.1, the pairs $[f_i - \epsilon, f_i + \epsilon]$ form an $L_\infty$ bracketing set for $\mathcal{C}(\tilde D, B, (\Gamma_1, \ldots, \Gamma_d))$. Thus, by Theorem 2.1, for some positive $c, \epsilon_0$,

$$\log N_{[\,]}\big( \epsilon, \mathcal{C}(\tilde D, B, (\Gamma_1, \ldots, \Gamma_d)), L_\infty \big) \le c \left( \frac{\sqrt{(d + B^2)(1 + \Gamma_1^2 + \cdots + \Gamma_d^2)}}{\epsilon} \right)^{d/2}$$

for $0 < \epsilon \le \epsilon_0 \sqrt{(d + B^2)(1 + \Gamma_1^2 + \cdots + \Gamma_d^2)}$. Using (2), we see that

$$\log N_{[\,]}\big( \epsilon, \mathcal{C}(D, B, (\Gamma_1, \ldots, \Gamma_d)), L_\infty \big) \le c \left( \frac{\sqrt{(d + B^2)\big( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 \big)}}{\epsilon} \right)^{d/2} \qquad (3)$$

for $0 < \epsilon \le \epsilon_0 \sqrt{(d + B^2)\big( 1 + \sum_{i=1}^d \Gamma_i^2 (b_i - a_i)^2 \big)}$. It is immediate that the left side of (3) equals

$$\log N_{[\,]}\left( \frac{\epsilon}{A}, \, \mathcal{C}\Big( D, \frac{B}{A}, \Big( \frac{\Gamma_1}{A}, \ldots, \frac{\Gamma_d}{A} \Big) \Big), L_\infty \right)$$

for any $A > 0$, so that for all $A > 0$, (3) is bounded above by

$$c \left( \frac{\sqrt{(dA^2 + B^2)\big( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 / A^2 \big)}}{\epsilon} \right)^{d/2}$$

for $0 < \epsilon \le \epsilon_0 \sqrt{(dA^2 + B^2)\big( 1 + \sum_i \Gamma_i^2 (b_i - a_i)^2 / A^2 \big)}$. We pick

$$A^2 = \sqrt{\frac{B^2 \sum_{i=1}^d \Gamma_i^2 (b_i - a_i)^2}{d}},$$

which yields

$$\log N_{[\,]}\big( \epsilon, \mathcal{C}(D, B, (\Gamma_1, \ldots, \Gamma_d)), L_\infty \big) \le c \left( \frac{B + \sqrt{d \sum_i \Gamma_i^2 (b_i - a_i)^2}}{\epsilon} \right)^{d/2}$$

if $0 < \epsilon \le \epsilon_0 \big( B + \sqrt{d \sum_i \Gamma_i^2 (b_i - a_i)^2} \big)$. Since

$$\sqrt{\sum_i \Gamma_i^2 (b_i - a_i)^2} \le \sum_i \Gamma_i (b_i - a_i) \le \sqrt{d \sum_i \Gamma_i^2 (b_i - a_i)^2},$$

which are basic facts about $l_p$ norms in $\mathbb{R}^d$, we are done showing the second inequality of the theorem.
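The rebalancing step in the proof above — replacing $(B, \Gamma)$ by $(B/A, \Gamma/A)$ and optimizing over $A$ — can be checked numerically. In the sketch below (our own verification; $S$ stands for $\sum_i \Gamma_i^2 (b_i - a_i)^2$), the chosen $A^2$ collapses the bound to $B + \sqrt{dS}$:

```python
import numpy as np

# Check that with A^2 = sqrt(B^2 * S / d) the quantity
# sqrt((d*A^2 + B^2) * (1 + S/A^2)) equals B + sqrt(d*S) exactly,
# for random choices of d, B, and S.
rng = np.random.default_rng(0)
for _ in range(100):
    d = int(rng.integers(1, 10))
    B = float(rng.uniform(0.1, 5.0))
    S = float(rng.uniform(0.1, 5.0))   # plays the role of sum_i Gamma_i^2 (b_i-a_i)^2
    A2 = np.sqrt(B**2 * S / d)
    lhs = np.sqrt((d * A2 + B**2) * (1 + S / A2))
    rhs = B + np.sqrt(d * S)
    assert abs(lhs - rhs) < 1e-9 * rhs
```

Algebraically, $dA^2 + B^2 = B(B + \sqrt{dS})$ and $1 + S/A^2 = (B + \sqrt{dS})/B$, so the product of the two factors is $(B + \sqrt{dS})^2$.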
3 Bracketing without Lipschitz Constraints
In the previous section we bounded bracketing entropy for classes of functions with Lipschitz constraints. In this section we remove those Lipschitz constraints.
3.1 Notation and Assumptions
With Lipschitz constraints we could consider arbitrary domains $D$, but without Lipschitz constraints we need more restrictions. We will now require that $D$ is a polytope, and, to begin with, we also assume that $D$ is simple. We consider only the case $d \ge 2$, since the result for $d = 1$ is given in Dryanov (2009).

Assumption 1. Let $d \ge 2$ and let $D \subset \mathbb{R}^d$ be a simple convex polytope, meaning that all $(d-k)$-dimensional faces of $D$ have exactly $k$ incident facets.
It is well known that the simplicial polytopes are dense in the class of all polytopes in the Hausdorff distance. The simple polytopes are dual to the simplicial ones, and are also dense in the class of all polytopes in the Hausdorff distance (page 82 of Grünbaum (1967)). Any convex polytope with $n$ vertices can be triangulated into $O(n^{\lfloor d/2 \rfloor})$ simplices (which are simple polytopes); see, e.g., Dey and Pach (1998). Thus one can translate our results to a general polytope $D$; however, any geometric intuition provided by the constants in the bounds is then lost.

We let $D = \cap_{j=1}^N E_j$, where $E_j := \{x \in \mathbb{R}^d : \langle v_j, x \rangle \ge p_j\}$ are halfspaces with (inner) unit normal vectors $v_j$, and where $p_j \in \mathbb{R}$, for $j = 1, \ldots, N$. Let $H_j := \{x \in \mathbb{R}^d : \langle x, v_j \rangle = p_j\}$ be the corresponding hyperplanes. For $k \in \mathbb{N}$, let

$$J_k := \{(j_1, \ldots, j_k) \in \{1, \ldots, N\}^k : j_1 < \cdots < j_k\}$$

and $I_k := \{0, \ldots, A\}^k$. For $j \in J_k$, let

$$G_j = \cap_{\alpha=1}^k H_{j_\alpha}.$$
Any $G_j$, $j \in J_k$, is $(d-k)$-dimensional and so, by Fritz John's theorem (John (1948); see also Ball (1992) or Ball (1997)), contains a $(d-k)$-dimensional ellipsoid $A_j - x_j$ of maximal $(d-k)$-dimensional volume, such that

$$A_j - x_j \subset G_j - x_j \subset d(A_j - x_j) \qquad (4)$$

for some point $x_j \in G_j$. Let $e_{k+1}, \ldots, e_d$ be the orthonormal basis given by the axes of the ellipsoid $A_j - x_j$, and let $\gamma_{j,\alpha}/2$ be the radius of $A_j$ in the direction $e_\alpha$, meaning that $x_j \pm \gamma_{j,\alpha} e_\alpha / 2$ lies in the boundary of $A_j$. We will rely heavily on Fritz John's theorem to understand the size of $G_j$. Let $d_+(x, \partial G_j, e) := \inf_{K > 0} \{K : x + Ke \in \partial G_j\}$ and let

$$u := 2^{-2(p+1)^2(p+2)} \wedge \min_{k \in \{1, \ldots, d-1\}} \; \min_{j \in J_k, \; e \in \mathrm{span}\{e_{k+1}, \ldots, e_d\}} \frac{d_+(x_j, \partial G_j, e)}{L_{k,2}}, \qquad (5)$$

where

$$L_{k,2} := 1 \vee \sup_{\beta > k} \sum_{\gamma=1}^k \frac{\langle \tilde f_\gamma, v_{j_\beta} \rangle}{\langle \tilde f_\gamma, v_{j_\gamma} \rangle} \qquad (6)$$

and the $\tilde f_\gamma$ are defined in Proposition 4.2. Then let

$$0 = \delta_0 < \delta_1 < \cdots < \delta_A < u = \delta_{A+1} < \delta_{A+2} = \infty \qquad (7)$$
be a sequence to be defined later. Let $\mathrm{Lin}\, P$ be the translated affine span of $P$, i.e., the space of all linear combinations of elements of $P - x$, for any $x \in P$. Note that $\mathrm{lin}\, P$ is commonly used to refer to the linear span of $P$ rather than of $P - x$; to distinguish from this case, we use the notation "Lin" rather than "lin." For a point $x$, a set $H$, and a unit vector $v$, let

$$d(x, H, v) := \inf\{|k| : x + kv \in H\}$$

be the distance from $x$ to $H$ in direction $v$, and for a set $E$, let $d(E, H, v) := \inf_{x \in E} d(x, H, v)$. For $i = (i_1, \ldots, i_k) \in I_k$ and $j = (j_1, \ldots, j_k) \in J_k$, let

$$G_{i,j} := \{x \in D : \delta_{i_\alpha} \le d(x, H_{j_\alpha}) < \delta_{i_\alpha + 1} \text{ for } \alpha = 1, \ldots, N\}, \qquad (8)$$

where for $\alpha > k$ we let $i_\alpha = A + 1$. These sets are not parallelotopes, since for $\alpha > k$, $\delta_{i_\alpha + 1} = \infty$. However, for any $x \in G_j$, $(G_{i,j} - x) \cap \mathrm{span}\{v_{j_1}, \ldots, v_{j_\beta}\}$, for $\beta \le k$, is contained in a $\beta$-dimensional parallelotope.
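A minimal numerical instance of this setup (a hypothetical example, not from the paper): the triangle $\{x \ge 0, y \ge 0, x + y \le 1\}$ written as an intersection of halfspaces with unit inner normals, together with the signed distances $\langle v_j, x \rangle - p_j$ from which the partition in (8) is built:

```python
import numpy as np

# The triangle D = {x >= 0, y >= 0, x + y <= 1} as an intersection of
# halfspaces <v_j, x> >= p_j with unit inner normals v_j.
normals = np.array([[1.0, 0.0],                          # x >= 0
                    [0.0, 1.0],                          # y >= 0
                    [-1 / np.sqrt(2), -1 / np.sqrt(2)]]) # x + y <= 1
offsets = np.array([0.0, 0.0, -1 / np.sqrt(2)])

def hyperplane_distances(x):
    # For unit normals, <v_j, x> - p_j is the signed distance to H_j;
    # it is nonnegative exactly when x lies in the halfspace E_j.
    return normals @ x - offsets

inside = hyperplane_distances(np.array([0.25, 0.25]))
assert np.all(inside >= 0)          # the point lies in D
# distances to {x=0}, {y=0}, {x+y=1}: 0.25, 0.25, 0.5/sqrt(2)
assert np.allclose(inside, [0.25, 0.25, 0.5 / np.sqrt(2)])
```

The sets $G_{i,j}$ then collect the points of $D$ whose distances to the chosen hyperplanes fall into prescribed bands $[\delta_{i_\alpha}, \delta_{i_\alpha+1})$.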
3.2 Main Results
We want to bound the slopes of functions $f \in \mathcal{C}(D, 1)|_{G_{i,j}}$, so that we can apply bracketing bounds for convex function classes with Lipschitz bounds. Note that each $G_{i,j}$ is at distance $\delta_{i_\alpha}$ in the direction of $v_{j_\alpha}$ from $H_{j_\alpha}$, which means that if $f \in \mathcal{C}(D, 1)|_{G_{i,j}}$ then $f$ has Lipschitz constant bounded by $2/\delta_{i_\alpha}$ along the direction $v_{j_\alpha}$ towards $H_{j_\alpha}$. However, the vectors $v_{j_\alpha}$ are not orthonormal, so the distance from $G_{i,j}$ along $v_{j_\alpha}$ to a hyperplane other than $H_{j_\alpha}$ may be smaller than $\delta_{i_\alpha}$. For each $G_{i,j}$ we will find an orthonormal basis such that $G_{i,j}$ is contained in a rectangle $R$ whose axes are given by the basis and whose lengths along those axes (i.e., widths) are bounded by a constant times the width along one of the normal vectors $v_{j_\alpha}$. Furthermore, the distance from $R$ along each basis vector to $\partial D$ will be bounded by the distance from $G_{i,j}$ along $v_{j_\alpha}$ to $H_{j_\alpha}$. This will give us control of both the Lipschitz parameters and the widths corresponding to the basis, and thus control of the size of brackets for classes of convex functions.

Proposition 3.1. Let Assumption 1 hold for a convex polytope $D$. For each $k \in \{0, \ldots, d\}$, $i \in I_k$, $j \in J_k$, and each $G_{i,j}$, there is an orthonormal basis $e_{i,j} \equiv e := (e_1, \ldots, e_d)$ of $\mathbb{R}^d$ such that for any $f \in \mathcal{C}(D, B)|_{G_{i,j}}$, $f$ has Lipschitz constant $2B/\delta_{i_\alpha}$ in the direction $e_\alpha$, where $\delta_{i_\alpha} = \delta_{A+1}$ if $k + 1 \le \alpha \le d$. Furthermore, for $\alpha = 1, \ldots, k$, $e_{i,j,\alpha} \equiv e_\alpha$ satisfies $e_\alpha \in \mathrm{span}\{v_{j_1}, \ldots, v_{j_\alpha}\}$, $e_\alpha \perp \mathrm{span}\{v_{j_1}, \ldots, v_{j_{\alpha-1}}\}$, and $\langle e_\alpha, v_{j_\alpha} \rangle > 0$; and for $\alpha \in \{k+1, \ldots, d\}$, $e_\alpha \perp \mathrm{span}\{v_{j_1}, \ldots, v_{j_k}\}$.
Proof. Without loss of generality, for ease of notation we assume in this proof that $j_\beta = \beta$ for $\beta = 1, \ldots, k$, and then that

$$\delta_{i_1} \le \delta_{i_2} \le \cdots \le \delta_{i_k} \le \delta_{i_{k+1}} = \cdots = \delta_{i_N},$$

where we let $i_\alpha = A + 1$ for $k < \alpha \le N$. That is, we assume that $H_1, \ldots, H_k$ are the nearest hyperplanes to $G_{i,j}$, in order of increasing distance. To define the orthonormal basis vectors, we will use a Gram–Schmidt orthonormalization, proceeding according to increasing distances from $G_{i,j}$ to the hyperplanes $H_j$. Define $e_1 := v_1$ and for $1 < j \le k$, define $e_j$ inductively by

$$e_j \in \mathrm{span}\{v_1, \ldots, v_j\}, \quad e_j \perp \mathrm{span}\{v_1, \ldots, v_{j-1}\}, \quad \langle e_j, v_j \rangle > 0, \quad \text{and} \quad \|e_j\| = 1,$$

and let $\{e_j\}_{j=k+1}^d$ be any orthonormal basis of $\mathrm{span}\{v_1, \ldots, v_k\}^\perp$. For $\alpha \in \{1, \ldots, k\}$ and any $x \in G_{i,j}$, since $d(x, H_\alpha, v)$ is smallest when $v$ is $v_\alpha$,

$$d(x, H_\alpha, e_\alpha) \ge d(x, H_\alpha, v_\alpha) \ge \delta_{i_\alpha},$$

$$d(x, H_j, e_\alpha) \ge d(x, H_j, v_j) \ge \delta_{i_j} \ge \delta_{i_\alpha} \;\; \text{for all } N \ge j > \alpha, \quad \text{and} \quad d(x, H_j, e_\alpha) = \infty > \delta_{i_\alpha} \;\; \text{for } j < \alpha,$$

since $e_\alpha \perp \mathrm{span}\{v_1, \ldots, v_{\alpha-1}\}$. Similarly, for $\alpha \in \{k+1, \ldots, d\}$,

$$d(x, H_j, e_\alpha) \ge d(x, H_j, v_j) \ge \delta_{A+1} \;\; \text{for all } N \ge j \ge k+1, \quad \text{and} \quad d(x, H_j, e_\alpha) = \infty > \delta_{A+1} \;\; \text{for } j \le k,$$

since $e_\alpha \perp \mathrm{span}\{v_1, \ldots, v_k\}$. Thus, we have $d(G_{i,j}, H_j, e_\alpha) \ge \delta_{i_\alpha}$ for $\alpha \in \{1, \ldots, d\}$ and $j \in \{1, \ldots, N\}$. That is, we have shown

$$d(G_{i,j}, \partial D, e_\alpha) \ge \delta_{i_\alpha} \quad \text{for all } \alpha \in \{1, \ldots, d\}. \qquad (9)$$
Thus, if $f \in \mathcal{C}(D, B)|_{G_{i,j}}$, then for any $x \in G_{i,j}$, let $z_1 = x - \gamma_1 e_\alpha$ and $z_2 = x + \gamma_2 e_\alpha$, $\gamma_1, \gamma_2 > 0$, both be elements of $\partial G_{i,j}$, so that by convexity we have

$$\frac{-2B}{\delta_{i_\alpha}} \le \frac{f(z_1) - f(z_1 - \delta_{i_\alpha} e_\alpha)}{\delta_{i_\alpha}} \le \frac{f(x + k e_\alpha) - f(x)}{k} \le \frac{f(z_2 + \delta_{i_\alpha} e_\alpha) - f(z_2)}{\delta_{i_\alpha}} \le \frac{2B}{\delta_{i_\alpha}},$$

using (9). Thus, $f$ satisfies a Lipschitz constraint in the direction of $e_\alpha$.

Here is our main theorem. It gives a bracketing entropy of $\epsilon^{-d/2}$ when $D$ is a fixed simple polytope. Its proof relies on embedding $G_{i,j}$ in a rectangle $R_{i,j}$ with axes given by Proposition 3.1. We need to control the distance of $G_{i,j}$ to $\partial D$, and we need to control the size of $R_{i,j}$ in terms of the widths along its axes. Then we can use the results of Section 2 on $R_{i,j}$ and thus on $G_{i,j}$. Our study of the size of $R_{i,j}$ is somewhat lengthy, so we defer it until Section 4. The constant $S$ has an explicit form given in the proof of the theorem.

Theorem 3.1. Let Assumption 1 hold for a convex polytope $D \subseteq \prod_{i=1}^d [a_i, b_i]$, for an integer $d \ge 2$. Fix $p \ge 1$. Then for some $\epsilon_0 > 0$ and for $0 < \epsilon \le \epsilon_0 B \big( \prod_{i=1}^d (b_i - a_i) \big)^{1/p}$,

$$\log N_{[\,]}\big( \epsilon, \mathcal{C}(D, B), L_p \big) \le S \left( \frac{B \prod_{i=1}^d (b_i - a_i)^{1/p}}{\epsilon} \right)^{d/2},$$

where $S$ is a constant depending on $d$ and $D$ (and on $u$, which is fixed by (5)).
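Before the proof, a quick numerical illustration of the interior-Lipschitz mechanism from Proposition 3.1 (the function below is a hand-picked convex piecewise-affine example of our own, not an object from the paper): a convex $f$ with $|f| \le B$ is automatically Lipschitz at distance $\delta$ from the boundary, with constant at most $2B/\delta$.

```python
import numpy as np

# A convex, bounded function on [0,1]: a maximum of affine pieces whose values
# stay in [-B, B].  On the interior [delta, 1 - delta] its difference
# quotients are bounded by 2*B/delta, matching Proposition 3.1 with d = 1.
B, delta = 1.0, 0.1

def f(x):
    return np.max([-3 * x + 0.5, 0.2 * x - 0.4, 5 * x - 4.5], axis=0)

x = np.linspace(delta, 1 - delta, 401)        # interior grid
y = f(x)
quotients = np.abs(np.diff(y) / np.diff(x))   # difference quotients
assert np.abs(y).max() <= B                   # |f| <= B on the interior
assert quotients.max() <= 2 * B / delta       # Lipschitz constant <= 2B/delta
```

The steepest slope here is 5, comfortably below $2B/\delta = 20$; as $\delta \downarrow 0$ the guaranteed constant blows up, which is exactly why the proof peels $D$ into the distance bands $G_{i,j}$.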
Proof. First, we reduce to the case $D \subset [0,1]^d$ and $B = 1$ by a scaling argument. Let $A$ be an affine map taking $\prod_{i=1}^d [a_i, b_i]$ onto $[0,1]^d$, let $\tilde D$ be the image of $D$, and assume we have a bracketing cover $[\tilde l_1, \tilde u_1], \ldots, [\tilde l_N, \tilde u_N]$ of $\mathcal{C}(\tilde D, 1)$. Let $l_i := B \tilde l_i \circ A$ and similarly for $u_i$, so that $[l_1, u_1], \ldots, [l_N, u_N]$ form brackets for $\mathcal{C}(D, B)$. Their $L_p^p$ size is

$$\int_D (u_i(x) - l_i(x))^p \, dx = B^p \prod_{i=1}^d (b_i - a_i) \int_{\tilde D} (\tilde u_i(x) - \tilde l_i(x))^p \, dx.$$

Thus,

$$N_{[\,]}\Big( \epsilon B \Big( \prod_{i=1}^d (b_i - a_i) \Big)^{1/p}, \, \mathcal{C}(D, B), L_p \Big) \le N_{[\,]}\big( \epsilon, \mathcal{C}(\tilde D, 1), L_p \big),$$

so apply the theorem with $\eta = \epsilon / \big( B (\prod_{i=1}^d (b_i - a_i))^{1/p} \big)$ in place of $\epsilon$. Note that the constant $S$ depends on $\tilde D$, the version of $D$ normalized to lie in $[0,1]^d$.

We now assume $D \subset [0,1]^d$ and $B = 1$. For a sequence $a_{i,k} > 0$ (constant over $j \in J_k$), to be defined later, let $a = \big( \sum_{k=0}^d \sum_{j \in J_k, i \in I_k} a_{i,k}^p \mathrm{Vol}(G_{i,j}) \big)^{1/p}$. By Assumption 1 and since $D \subset \cup_{k=0}^d \cup_{j \in J_k, i \in I_k} G_{i,j}$,

$$N_{[\,]}\big( a, \mathcal{C}(D, 1), L_p \big) \le \prod_{k=0}^d \prod_{j \in J_k, i \in I_k} N_{[\,]}\big( a_{i,k} \mathrm{Vol}(G_{i,j})^{1/p}, \mathcal{C}(D, 1)|_{G_{i,j}}, L_p \big),$$

as in (1). Now by Lemma 4.3, we can ignore all terms with $j \in J_k \setminus J_k^D$, where $J_k^D := \{j \in J_k : \cap_{\alpha=1}^k H_{j_\alpha} \text{ is a face of } D\}$. Thus

$$\log N_{[\,]}\big( a, \mathcal{C}(D, 1), L_p \big) \le \sum_{k=0}^d \sum_{j \in J_k^D} \sum_{i \in I_k} \log N_{[\,]}\big( a_{i,k} \mathrm{Vol}(G_{i,j})^{1/p}, \mathcal{C}(D, 1)|_{G_{i,j}}, L_p \big).$$

First we compute the sum over $I_k$ for a fixed $j \in J_k$. By Proposition 3.1,

$$\mathcal{C}(D, 1)|_{G_{i,j}} \subset \mathcal{C}(G_{i,j}, 1, \Gamma_i, e), \qquad (10)$$

where $\Gamma_i = (2/\delta_{i_1}, \ldots, 2/\delta_{i_k}, 2/u, \ldots, 2/u)$. Let $R_{i,j}$ be as in (25). That is, let $\rho_{j,\alpha} = w(G_j, e_\alpha)$, let $L_{k,1}$ be given by (22), and let

$$R_{i,j} := \sum_{\alpha=1}^k \big[ -\alpha! (\delta_{i_\alpha+1} - \delta_{i_\alpha}) e_\alpha, \; \alpha! (\delta_{i_\alpha+1} - \delta_{i_\alpha}) e_\alpha \big] + \sum_{\alpha=k+1}^d \big[ -2 L_{k,1} \rho_{j,\alpha} e_\alpha, \; 2 L_{k,1} \rho_{j,\alpha} e_\alpha \big],$$

so that $G_{i,j} \subseteq x + R_{i,j}$ for any $x \in G_{i,j}$, by (26). Then by (10) (and the first inequality of Theorem 2.2) we bound

$$\sum_{i \in I_k} \log N_{[\,]}\big( a_{i,k} \mathrm{Vol}(G_{i,j})^{1/p}, \mathcal{C}(D, 1)|_{G_{i,j}}, L_p \big) \le \sum_{i \in I_k} \log N_{[\,]}\big( a_{i,k}, \mathcal{C}(G_{i,j}, 1, \Gamma_i), L_\infty \big). \qquad (11)$$

We use the trivial bracket $[-1, 1]$ for any $G_{i,j}$ where $i_\alpha = 0$ for some $\alpha \in \{1, \ldots, k\}$, and otherwise we use Theorem 2.2, which shows that (11) is bounded by

$$\sum_{i_1=1}^A \cdots \sum_{i_k=1}^A c \left( \frac{1 + \sum_{\alpha=1}^k \frac{2 \, d! \, (\delta_{i_\alpha+1} - \delta_{i_\alpha})}{\delta_{i_\alpha}} + \sum_{\alpha=k+1}^d \frac{8 L_{k,1} \rho_{j,\alpha}}{u}}{a_{i,k}} \right)^{d/2}. \qquad (12)$$

For $i \in I_k$, we let $a_{(i_1, \ldots, i_k)} = 1$ if $i_\alpha = 0$ for some $\alpha \in \{1, \ldots, k\}$, and otherwise we let

$$a_{i,k} := a_{(i_1, \ldots, i_k)} := \prod_{\beta=1}^k a_{i_\beta}, \qquad a_{i_\beta} := \epsilon^{1/k} \exp\left( -p \, \frac{(p+1)^{i_\beta - 2}}{(p+2)^{i_\beta - 1}} \log \epsilon \right), \quad \text{and}$$

$$\delta_i := \exp\left\{ p \left( \frac{p+1}{p+2} \right)^{i-1} \log \epsilon \right\} \quad \text{for } i = 1, \ldots, A,$$

with $\delta_0 = 0$ and $a_0 := 1$ (so that $A$ is the largest index with $\delta_A < u$, as in (7)). Since $L_{k,1} \ge 1$, $L_{k,2} \ge 1$, and $u \le \rho_{j,\alpha}/L_{k,2}$ by (5) for all $k, i, j$ and $\alpha = k+1, \ldots, d$, we have $\sum_{\alpha=k+1}^d 8 L_{k,1} \rho_{j,\alpha}/u \le \prod_{\alpha=k+1}^d 8 L_{k,1} \rho_{j,\alpha}/u$ (using the fact that for $a, b \ge 2$ we have $ab \ge a + b$). Similarly, $\sum_{\alpha=1}^k 2(\delta_{i_\alpha+1} - \delta_{i_\alpha})/\delta_{i_\alpha} \le \prod_{\alpha=1}^k 2\delta_{i_\alpha+1}/\delta_{i_\alpha}$, since $2\delta_{i_\alpha+1}/\delta_{i_\alpha} > 2$. Thus (12) is bounded above by

$$c (d!)^{d/2} \left( \prod_{\alpha=k+1}^d \Big( 1 + \frac{8 L_{k,1} \rho_{j,\alpha}}{u} \Big) \right)^{d/2} \sum_{i_1=1}^A \cdots \sum_{i_k=1}^A a_{i,k}^{-d/2} \prod_{\alpha=1}^k \left( \frac{2 \delta_{i_\alpha+1}}{\delta_{i_\alpha}} \right)^{d/2}, \qquad (13)$$

which is

$$c (d!)^{d/2} \left( \prod_{\alpha=k+1}^d \Big( 1 + \frac{8 L_{k,1} \rho_{j,\alpha}}{u} \Big) \right)^{d/2} \sum_{i_1=1}^A \cdots \sum_{i_k=1}^A \prod_{\beta=1}^k \left( \frac{2 \delta_{i_\beta+1}}{\delta_{i_\beta} a_{i_\beta}} \right)^{d/2}. \qquad (14)$$

For $i = 1, \ldots, A$, let $\zeta_i := \big( \epsilon^{1/k} \delta_{i+1}/(\delta_i a_i) \big)^{1/2}$, so that

$$\sum_{i_1=1}^A \cdots \sum_{i_k=1}^A \prod_{\beta=1}^k \left( \frac{2 \delta_{i_\beta+1}}{\delta_{i_\beta} a_{i_\beta}} \right)^{d/2} = 2^{kd/2} \epsilon^{-d/2} \sum_{i_1=1}^A \zeta_{i_1}^d \cdots \sum_{i_k=1}^A \zeta_{i_k}^d = \epsilon^{-d/2} 2^{kd/2} B_u^k, \qquad (15)$$

where

$$B_u := \sum_{i=1}^A \zeta_i^d \le 2 u^{d/(2(p+1)^2)}$$

by Lemma 3.1.

Next, we relate the term $c (d!)^{d/2} \big( \prod_{\alpha=k+1}^d (1 + 8 L_{k,1} \rho_{j,\alpha}/u) \big)^{d/2}$ to $\mathrm{Vol}_{d-k}(G_j)$. Recall that $A_j$ is the ellipsoid defined in (4), which has diameter (and width) $\gamma_{j,\alpha}$ in the $e_\alpha$ direction. By (4), $\rho_{j,\alpha} \le d \gamma_{j,\alpha}$. The volume of $A_j$ is $\mathrm{Vol}_{d-k}(A_j) = \big( \prod_{\alpha=k+1}^d \gamma_{j,\alpha}/2 \big) \pi^{(d-k)/2}/\Gamma((d-k)/2 + 1)$. Thus, letting $C_d := (2d)^{d-k} \Gamma((d-k)/2 + 1)/\pi^{(d-k)/2}$, we have

$$\prod_{\alpha=k+1}^d \rho_{j,\alpha} \le C_d \, \mathrm{Vol}_{d-k}(A_j) \le C_d \, \mathrm{Vol}_{d-k}(G_j).$$

Then we have shown that (14) is bounded above by

$$c_d (d!)^{d/2} 2^{kd/2} \left( 1 + \Big( \frac{8 L_{k,1}}{u} \Big)^{d-k} C_d \, \mathrm{Vol}_{d-k}(G_j) \right)^{d/2} B_u^k \cdot \epsilon^{-d/2}. \qquad (16)$$

Then, gathering the constants together into $\tilde c_d$, we have shown

$$\sum_{i \in I_k} \log N_{[\,]}\big( a_{i,k} \mathrm{Vol}(G_{i,j})^{1/p}, \mathcal{C}(D, 1)|_{G_{i,j}}, L_p \big) \le \epsilon^{-d/2} \, \tilde c_d \left( \frac{\mathrm{Vol}_{d-k}(G_j) \, L_{k,1}^{d-k}}{u^{d-k}} \right)^{d/2} u^{kd/(2(p+1)^2)}.$$

Then the cardinality of the collection of brackets covering the entire domain $D$ is given by summing over $j \in J_k$ and $k \in \{0, \ldots, d\}$.

We have computed the cardinality of the brackets. Now we bound their size. We have

$$a^p \le \sum_{k=0}^d \sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) (2 L_{k,1})^{d-k} \sum_{i \in I_k} a_{i,k}^p \prod_{\alpha=1}^k \frac{\delta_{i_\alpha+1} - \delta_{i_\alpha}}{\langle \tilde f_\alpha, v_{j_\alpha} \rangle} \qquad (17)$$

by Proposition 4.2, with the $\tilde f_\alpha$ defined there. Fixing $k$, we have

$$\sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) \sum_{i \in I_k} a_{i,k}^p \prod_{\alpha=1}^k \frac{\delta_{i_\alpha+1} - \delta_{i_\alpha}}{\langle \tilde f_\alpha, v_{j_\alpha} \rangle} \le \sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) L_{j,3}^k \sum_{i_1=0}^A \cdots \sum_{i_k=0}^A \prod_{\alpha=1}^k a_{i_\alpha}^p \delta_{i_\alpha+1} = \sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) L_{j,3}^k \sum_{i_1=0}^A a_{i_1}^p \delta_{i_1+1} \cdots \sum_{i_k=0}^A a_{i_k}^p \delta_{i_k+1}, \qquad (18)$$

where $L_{j,3} := \max_{\alpha \in \{1, \ldots, k\}} 1/\langle \tilde f_\alpha, v_{j_\alpha} \rangle$. We have

$$\sum_{\alpha=0}^A a_\alpha^p \delta_{\alpha+1} = \delta_1 + \epsilon^{p/k} \sum_{\alpha=1}^A \zeta_\alpha^2 \le \epsilon^{p/k} \Big( 1 + \sum_{\alpha=1}^A \zeta_\alpha^2 \Big) =: \epsilon^{p/k} A_u$$

(since $\delta_1 = \epsilon^p \le \epsilon^{p/k}$), where $A_u \le 1 + 2 u^{1/(p+1)^2}$ by Lemma 3.1. Thus

$$\sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) L_{j,3}^k \sum_{i_1=0}^A a_{i_1}^p \delta_{i_1+1} \cdots \sum_{i_k=0}^A a_{i_k}^p \delta_{i_k+1} \le \epsilon^p A_u^k \sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) L_{j,3}^k,$$

so by (17),

$$a \le \epsilon \left( \sum_{k=0}^d (2 L_{k,1})^{d-k} S_k^D A_u^k \right)^{1/p},$$

where $S_k^D = \sum_{j \in J_k} \mathrm{Vol}_{d-k}(G_j) L_{j,3}^k$.
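The summation steps above lean on Lemma 3.1, stated next; its content is a geometric-series bound: when successive $\zeta_\alpha$ grow by a factor of at least $R \ge 2$, the sum of their powers is dominated by the last term. The following sketch (an illustration with our own randomly generated sequences) checks the generic inequality:

```python
import numpy as np

# If zeta_1 < ... < zeta_A with zeta_{a+1}/zeta_a >= R >= 2, then for any g >= 1,
#   sum_a zeta_a^g <= (R^g / (R^g - 1)) * zeta_A^g <= 2 * zeta_A^g.
rng = np.random.default_rng(1)
for _ in range(100):
    A = int(rng.integers(2, 12))
    g = float(rng.uniform(1.0, 4.0))
    R = float(rng.uniform(2.0, 5.0))
    ratios = rng.uniform(R, 2 * R, size=A - 1)   # step ratios, all >= R >= 2
    zeta = np.cumprod(np.concatenate(([rng.uniform(0.1, 1.0)], ratios)))
    total = np.sum(zeta**g)
    assert total <= (R**g / (R**g - 1)) * zeta[-1]**g + 1e-12
    assert total <= 2 * zeta[-1]**g + 1e-12
```

In the lemma, the role of the ratio bound $R \ge 2$ is played by the constraint $u \le 2^{-2(p+1)^2(p+2)}$ in (5).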
Lemma 3.1. For any $\gamma \ge 1$, with $A$ and $u$ given by (7),

$$\sum_{\alpha=1}^A \zeta_\alpha^\gamma \le 2 u^{\gamma/(2(p+1)^2)}.$$

Proof. Taking $\epsilon \le \epsilon_0 \le 1$, we have $\zeta_\alpha \le 1$. Then for $\alpha = 2, \ldots, A$,

$$\frac{\zeta_\alpha}{\zeta_{\alpha-1}} = \exp\left\{ \frac{-p \log \epsilon}{2(p+1)^2(p+2)} \left( \frac{p+1}{p+2} \right)^{\alpha-1} \right\} \ge \exp\left\{ \frac{-p \log \epsilon}{2(p+1)^2(p+2)} \left( \frac{p+1}{p+2} \right)^{A-1} \right\} \ge \exp\left\{ \frac{-\log u}{2(p+1)^2(p+2)} \right\} =: R.$$

Then $\zeta_\alpha^\gamma (R^\gamma - 1) \le \zeta_\alpha^\gamma R^\gamma - (R \zeta_{\alpha-1})^\gamma$, so $\zeta_\alpha^\gamma \le \big( R^\gamma/(R^\gamma - 1) \big) (\zeta_\alpha^\gamma - \zeta_{\alpha-1}^\gamma)$, and thus

$$\sum_{\alpha=1}^A \zeta_\alpha^\gamma \le \zeta_1^\gamma + \frac{R^\gamma}{R^\gamma - 1} \sum_{\alpha=2}^A (\zeta_\alpha^\gamma - \zeta_{\alpha-1}^\gamma) = \zeta_1^\gamma + \frac{R^\gamma}{R^\gamma - 1} (\zeta_A^\gamma - \zeta_1^\gamma) \le \frac{R^\gamma}{R^\gamma - 1} \zeta_A^\gamma,$$

and $\zeta_A^\gamma = u^{\gamma/(2(p+1)^2)}$. Since $u \le \exp\{-2(p+1)^2(p+2) \log 2\}$ by its definition (5), $R \ge 2$, so $R^\gamma/(R^\gamma - 1) \le 2$ for any $\gamma \ge 1$.

Since simplices are simple polytopes, by triangulating any convex polytope $D$ into simplices we can extend our theorem to any polytope $D$. The constant in the bound then depends on the triangulation of $D$.

Corollary 3.1. Fix $d \ge 1$ and $p \ge 1$. Let $D \subseteq \prod_{i=1}^d [a_i, b_i]$ be any convex polytope. Then for some $\epsilon_0 > 0$ and for $0 < \epsilon \le \epsilon_0 B \big( \prod_{i=1}^d (b_i - a_i) \big)^{1/p}$,

$$\log N_{[\,]}\big( \epsilon, \mathcal{C}(D, B), L_p \big) \lesssim \left( \frac{B \prod_{i=1}^d (b_i - a_i)^{1/p}}{\epsilon} \right)^{d/2}.$$
Proof. By the same scaling argument as in the proof of Theorem 3.1, we may assume $[a_i, b_i] = [0,1]$ and $B = 1$. The $d = 1$ case is given by Dryanov (2009). Any convex polytope $D$ can be triangulated into $d$-dimensional simplices (see, e.g., Dey and Pach (1998), Rothschild and Straus (1985)). We are done by applying Theorem 3.1 to each of those simplices, by (1).
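The triangulation step in this proof can be visualized in the plane: fanning a convex polygon from one vertex produces simplices whose volumes add up exactly, as the following sketch (our own example, using the shoelace formula) checks:

```python
import numpy as np

# Fan triangulation of a convex polygon from vertex 0: the triangles
# (v0, v_i, v_{i+1}) partition the polygon, so their areas sum to its area.
def shoelace(P):
    x, y = P[:, 0], P[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

hexagon = np.array([(np.cos(t), np.sin(t)) for t in np.linspace(0, 2 * np.pi, 7)[:-1]])
fan = [np.array([hexagon[0], hexagon[i], hexagon[i + 1]]) for i in range(1, 5)]
assert abs(sum(shoelace(t) for t in fan) - shoelace(hexagon)) < 1e-12
```

Applying Theorem 3.1 to each simplex and combining via (1) gives the corollary; the number of simplices enters only the constant.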
4 Proofs: Relating $G_{i,j}$ to a Hyperrectangle

4.1 Inscribing $G_{i,j}$ in a Hyperrectangle
Theorem 2.2 shows that the bracketing entropy of $\mathcal{C}(D, B, \Gamma)$ depends on the diameters of the hyperrectangle $\prod_{i=1}^d [a_i, b_i]$ circumscribing $D$. This is part of why bounding entropies on hyperrectangular domains is more straightforward than on non-hyperrectangular domains. In this section we prove Propositions 4.1 and 4.2, which show how to embed the domains $G_{i,j}$, which partition $D$, into hyperrectangles. We used this in the proof of Theorem 3.1 so that we could apply Theorem 2.2. The support function of a convex set $D$ is, for $x \in \mathbb{R}^d$,

$$h(D, x) := \max_{d \in D} \langle d, x \rangle.$$

Then the width function is, for $\|u\| = 1$,

$$w(D, u) := h(D, u) + h(D, -u),$$

which gives the distance between supporting hyperplanes of $D$ with inner normal vectors $u$ and $-u$, respectively, and let

$$w(D) = \sup_{\|u\| = 1} w(D, u).$$
Theorem 2.2 says that the bracketing entropy of convex functions on a domain $D$ with Lipschitz constraints along directions $e_1, \ldots, e_k$ depends on $w(D, e_i)$ (since that gives the maximum "rise" in "rise over run"). In our proof of Theorem 3.1 we partitioned $D$ into sets related to parallelotopes; thus we will study the widths of parallelotopes. We know the widths of $G_{i,j}$ in the directions $v_{j_\alpha}$, which are $\delta_{i_\alpha+1} - \delta_{i_\alpha}$, by definition.

Lemma 4.1. Let $V$ be a vector space of dimension $j \in \mathbb{N}$ containing linearly independent vectors $v_1, \ldots, v_j$. Let $d_i > 0$ for $i = 1, \ldots, j$, and let $P$ be the parallelotope defined by having $w(P, v_i) = d_i$. Then $P$ satisfies

$$w(P) \le j! \max_{1 \le i \le j} d_i.$$
Proof. The proof is by induction. The case $j = 1$ is trivial. Now assume the statement holds for $j - 1$; we want to show it for $j$. For any $x, y \in \partial P$ we can find a path $x = x_0, x_1, \ldots, x_n = y$ from $x$ to $y$ such that $x_i$ and $x_{i+1}$ are elements of (the boundary of) the same facet of $P$. $P$ has $2j$ facets; if $n > j$, then we can find a path through the complementary $2j - n$ facets, so that we may assume $n \le j$. By the induction hypothesis, $\|x_{i+1} - x_i\| \le (j-1)! \max_{1 \le i \le j} d_i$, since any $(j-1)$-dimensional facet is a parallelotope lying in a hyperplane with normal vector $v_i$ and widths $d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_j$. Thus

$$\|x - y\| \le \sum_{i=1}^n \|x_i - x_{i-1}\| \le n (j-1)! \max_{1 \le i \le j} d_i \le j! \max_{1 \le i \le j} d_i,$$
as desired.

This gives a bound on the width of $G_{i,j}$ in the direction of each basis vector $e_\alpha$, $\alpha = 1, \ldots, k$, from Proposition 3.1.

Proposition 4.1. Let Assumption 1 hold for a convex polytope $D$. Fix $k \in \{0, \ldots, d\}$, $i \in I_k$, $j \in J_k$, and let $G_{i,j}$ be as in (8). Let $e_{i,j} \equiv e := (e_1, \ldots, e_d)$, with $e_\alpha \in \mathbb{R}^d$, be the orthonormal basis from Proposition 3.1. Then

$$w(G_{i,j}, e_\alpha) \le \alpha! (\delta_{i_\alpha+1} - \delta_{i_\alpha}) \quad \text{for } \alpha = 1, \ldots, k.$$

Proof. Let $\alpha \in \{1, \ldots, k\}$ and let $w(G_{i,j}, e_\alpha)$ be given by the distance between the parallel supporting hyperplanes $H^1$ and $H^2$. The distance between $H^1$ and $H^2$ is equal to the distance between $H^1 \cap A$ and $H^2 \cap A$, where $A$ is any linear subspace containing the normal vector of $H^1$ and $H^2$. Thus, let $A = \mathrm{span}\{v_{j_1}, \ldots, v_{j_\alpha}\} \ni e_\alpha$. $G_{i,j}$ is contained in a parallelotope: $G_{i,j} \subseteq \cap_{\beta=1}^k \tilde H_{j_\beta}$, where $\tilde H_{j_\beta} = \{x \in \mathbb{R}^d : \delta_{i_\beta} \le \langle x, v_{j_\beta} \rangle \le \delta_{i_\beta+1}\}$. Let $P = \cap_{\beta=1}^\alpha \tilde H_{j_\beta} \cap \mathrm{span}\{v_{j_1}, \ldots, v_{j_\alpha}\}$. Then $P$ is a parallelotope contained in the $\alpha$-dimensional vector space $V = \mathrm{span}\{v_{j_1}, \ldots, v_{j_\alpha}\}$, with widths $w(P, v_{j_\beta}) = \delta_{i_\beta+1} - \delta_{i_\beta}$ for $\beta = 1, \ldots, \alpha$. Thus we can apply Lemma 4.1 and conclude that

$$w(G_{i,j}, e_\alpha) \le w(P, e_\alpha) \le \alpha! (\delta_{i_\alpha+1} - \delta_{i_\alpha}) \quad \text{for } \alpha = 1, \ldots, k.$$

For the first inequality, we use the facts that $w\big( \cap_{\beta=1}^\alpha \tilde H_{j_\beta}, e_\alpha \big) \ge w\big( \cap_{\beta=1}^k \tilde H_{j_\beta}, e_\alpha \big)$ and that $w\big( \cap_{\beta=1}^\alpha \tilde H_{j_\beta}, e_\alpha \big) = w(P, e_\alpha)$, since the distance between any two supporting hyperplanes $H^1$ and $H^2$ of $\cap_{\beta=1}^k \tilde H_{j_\beta}$ is equal to the distance between $H^1 \cap A$ and $H^2 \cap A$, where $A$ is any linear subspace containing the normal vector of $H^1$ and $H^2$.

We will rely on the following representation of a $k$-dimensional parallelotope. For sets $A$ and $B$, let $A + B = \{a + b : a \in A, b \in B\}$.

Lemma 4.2. Let $V$ be a $k$-dimensional vector space, and let $P := \cap_{\beta=1}^k \tilde E_\beta$ be a parallelotope, where $\tilde E_\beta := \{x \in V : 0 \le \langle x, v_\beta \rangle \le d_\beta\}$ for $k$ linearly independent unit normal vectors $v_\beta$. Let $\tilde H_\beta^+ := \{x \in V : \langle x, v_\beta \rangle = d_\beta\}$. Let $\tilde f_\beta$ be the unit vector parallel to the line $\cap_{\gamma=1, \gamma \ne \beta}^k \tilde H_\gamma^+$ with $\langle \tilde f_\beta, v_\beta \rangle > 0$, for $\beta = 1, \ldots, k$. Then $0$ is a vertex of $P$ and we can write

$$P = \sum_{\beta=1}^k [0, f_\beta],$$

where $f_\beta := d_\beta \tilde f_\beta / \langle \tilde f_\beta, v_\beta \rangle$ and $[0, f_\beta] = \{\lambda f_\beta : \lambda \in [0, 1]\}$.

Proof. Let $\tilde H_\beta^- := \{x \in V : \langle x, v_\beta \rangle = 0\}$. Since the vectors $v_\beta$ are linearly independent, $\cap_{\beta=1}^k \tilde H_\beta^- = \{0\}$, and the intersection of any $k - 1$ of the hyperplanes $\tilde H_\beta^-$ gives a $1$-dimensional space, $\mathrm{span}\{\tilde f_\beta\}$. A $k$-dimensional parallelotope can be written as the set-sum of the $k$ intervals emanating from a vertex, each given by the intersection of $k - 1$ of the hyperplanes $\tilde H_\beta^-$; see page 56 of Grünbaum (1967). Note that the $f_\beta$ satisfy $\langle f_\beta, v_\beta \rangle = d_\beta$, so that $f_\beta \in \tilde H_\beta^+$; thus the $k$ intervals are given by $[0, f_\beta]$, $\beta = 1, \ldots, k$.

The next proposition combines the previous ones to bound the widths of $G_{i,j}$ (i.e., to embed $G_{i,j}$ in a hyperrectangle).

Proposition 4.2. For each $k \in \{1, \ldots, d-1\}$, $i \in I_k$, $j \in J_k$, and each $G_{i,j}$, and the basis $e$ from Proposition 3.1, for $\alpha = k+1, \ldots, d$ we have

$$w(G_{i,j}, e_\alpha) \le 2 L_{k,1} w(G_j, e_\alpha) \qquad (19)$$

and

$$\mathrm{Vol}_d(G_{i,j}) \le (2 L_{k,1})^{d-k} \mathrm{Vol}_{d-k}(G_j) \cdot \prod_{\alpha=1}^k \frac{\delta_{i_\alpha+1} - \delta_{i_\alpha}}{\langle \tilde f_\alpha, v_{j_\alpha} \rangle}, \qquad (20)$$

where $L_{k,1}$ is given by (22) and $\tilde f_\alpha$ is the unit vector with $\langle \tilde f_\alpha, v_{j_\alpha} \rangle > 0$ lying in $\mathrm{span}\{v_{j_1}, \ldots, v_{j_k}\} \cap \big( \cap_{\gamma=1, \gamma \ne \alpha}^k H_\gamma^+ \big)$, $\alpha = 1, \ldots, k$, where $H_\gamma^+ := \{y \in \mathbb{R}^d : \langle y, v_{j_\gamma} \rangle = \delta_{i_\gamma+1} - \delta_{i_\gamma}\}$.
Proof. Take $k \in \{1, \dots, d-1\}$. Let $x$ be an arbitrary fixed point, which we take to be $x \equiv x_j$ (from (4)) for definiteness. Let $z = x + \sum_{\gamma=1}^k f_{j_\gamma}$, where $f_{j_\gamma} = d_{j_\gamma} \tilde f_{j_\gamma}$ with
\[ 0 \le d_{j_\gamma} \le (\delta_{i_\gamma+1} - \delta_{i_\gamma}) / \langle \tilde f_{j_\gamma}, v_{j_\gamma}\rangle \tag{21} \]
and $\tilde f_{j_\gamma}$ is given by Lemma 4.2 for the $k$ linearly independent normal vectors $v_{j_1}, \dots, v_{j_k}$. Take an arbitrary unit vector $e \in \operatorname{span}\{e_{k+1}, \dots, e_d\}$. Let $\tilde\lambda > 0$ be such that $z + \tilde\lambda e$ is in the boundary of $G_{i,j}$, and let $\beta \in \{k+1, \dots, N\}$ be such that $\langle z + \tilde\lambda e, v_{j_\beta}\rangle$ is maximal. (That is, $v_{j_\beta}$ corresponds to the first hyperplane $z + \tilde\lambda e$ intersects for $\tilde\lambda > 0$.) Note that this means $\langle v_{j_\beta}, e\rangle < 0$. Then $\langle z + \tilde\lambda e, v_{j_\beta}\rangle = p_{j_\beta} + u$, so, with the $\delta$ sequence in (7) and $G_{i,j}$ defined for any $u > 0$, we have
\[ \tilde\lambda = \frac{p_{j_\beta} + u - \langle z, v_{j_\beta}\rangle}{\langle e, v_{j_\beta}\rangle} \le \frac{\langle x, v_{j_\beta}\rangle - p_{j_\beta} + u \sum_{\gamma=1}^k \frac{\langle \tilde f_\gamma, v_{j_\beta}\rangle}{\langle \tilde f_\gamma, v_{j_\gamma}\rangle}}{-\langle e, v_{j_\beta}\rangle}. \]
Let
\[ L_{k,1} := \sup_{e \in \operatorname{span}\{e_{k+1}, \dots, e_d\},\, \|e\| = 1} \frac{1}{-\langle e, v_{j_\beta}\rangle}, \tag{22} \]
which is finite since $G_j$ is bounded. Then
\[ \langle x, v_{j_\beta}\rangle - p_{j_\beta} \le d(x, H_{j_\beta}) \le d(x, \partial G_j, e), \]
since $H_{j_\beta}$ is the closest hyperplane to $x$ in the direction $e$. Now, by (5) and (6), we have shown
\[ \tilde\lambda \le 2 L_{k,1}\, d^+(x, \partial G_j, e), \tag{23} \]
meaning that $(G_{i,j} - z) \cap \operatorname{span}\{e_{k+1}, \dots, e_d\} \subset 2 L_{k,1} (G_j - x)$, so we can conclude that $w(G_{i,j} - z, e_\alpha) \le 2 L_{k,1}\, w(G_j, e_\alpha)$ and $w(G_{i,j}, e_\alpha) \le 2 L_{k,1}\, w(G_j, e_\alpha)$, since $\langle z, e_\alpha\rangle = 0$, for all $d_{j_\gamma}$ given by the range (21), $\alpha = k+1, \dots, d$, and $k = 1, \dots, d-1$. It then also follows that
\[ \mathrm{Vol}_d(G_{i,j}) \le (2 L_{k,1})^{d-k}\, \mathrm{Vol}_{d-k}(G_j) \cdot \mathrm{Vol}_k\Big(\sum_{\alpha=1}^k [0, f_\alpha]\Big), \tag{24} \]
where $f_\alpha = (\delta_{i_\alpha+1} - \delta_{i_\alpha}) \tilde f_\alpha / \langle \tilde f_\alpha, v_{j_\alpha}\rangle$ and $\tilde f_\alpha$ is given in the statement of the proposition. Since each $\tilde f_\alpha$ is a unit vector, the volume of the parallelotope in (24) is at most the product of its edge lengths $\|f_\alpha\| = (\delta_{i_\alpha+1} - \delta_{i_\alpha})/\langle \tilde f_\alpha, v_{j_\alpha}\rangle$, which yields (20).
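The two ingredients of the volume step above — the edge-vector decomposition of a parallelotope from Lemma 4.2, and the Hadamard-type bound of the parallelotope's volume by the product of its edge lengths used to pass from (24) to (20) — can be checked numerically. The following is an illustrative sketch only (the normals, widths, and the helper name `edge_vectors` are invented for the example, not part of the paper):

```python
import numpy as np

def edge_vectors(V, d):
    """Edge vectors f_beta of the parallelotope
    P = {x : 0 <= <x, v_beta> <= d_beta} as in Lemma 4.2.

    V : (k, k) array whose rows are linearly independent unit normals v_beta.
    d : (k,) array of nonnegative widths d_beta.
    Returns F : (k, k) array whose rows are the f_beta, so P = sum_beta [0, f_beta].
    """
    k = V.shape[0]
    F = np.zeros((k, k))
    for b in range(k):
        # f_tilde_beta spans the intersection of the level-0 hyperplanes of
        # the other k-1 normals, i.e. the null space of those rows
        others = np.delete(V, b, axis=0)
        _, _, vt = np.linalg.svd(others)   # last right-singular vector spans null space
        f_tilde = vt[-1]
        if f_tilde @ V[b] < 0:             # orient so <f_tilde_beta, v_beta> > 0
            f_tilde = -f_tilde
        F[b] = d[b] * f_tilde / (f_tilde @ V[b])
    return F

rng = np.random.default_rng(0)
k = 3
V = rng.normal(size=(k, k))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit normals
d = rng.uniform(0.5, 2.0, size=k)

F = edge_vectors(V, d)
# each edge vector satisfies <f_beta, v_beta> = d_beta and <f_beta, v_gamma> = 0
assert np.allclose(F @ V.T, np.diag(d), atol=1e-8)
# Vol_k(P) = |det F| is at most the product of the edge lengths (Hadamard's
# inequality), which is the product appearing in (20) since each f_tilde is a unit vector
vol = abs(np.linalg.det(F))
assert vol <= np.prod(np.linalg.norm(F, axis=1)) + 1e-12
```

The check mirrors the proof: the rows of `F` are the edges emanating from the vertex $0$, and the determinant bound is exactly why (24) implies (20).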
Lemma 4.3. Let Assumption 1 hold and let $G_{i,j}$ be as in (8). If $G_j = \emptyset$, then $G_{i,j} = \emptyset$.

Proof. This follows from Proposition 4.2 and its proof.

The above provides a hyperrectangle containing $G_{i,j}$. Let $A + B = \{a + b : a \in A,\, b \in B\}$ for sets $A$, $B$. Let $\rho_{j,\alpha} := w(G_j, e_\alpha)$ and then let
\[ R_{i,j} := \sum_{\alpha=1}^k \big[-\alpha!\,(\delta_{i_\alpha+1} - \delta_{i_\alpha})\, e_\alpha,\; \alpha!\,(\delta_{i_\alpha+1} - \delta_{i_\alpha})\, e_\alpha\big] + \sum_{\alpha=k+1}^d \big[-2 L_{k,1}\, \rho_{j,\alpha}\, e_\alpha,\; 2 L_{k,1}\, \rho_{j,\alpha}\, e_\alpha\big]. \tag{25} \]
Then, for any $x \in G_{i,j}$, we have shown
\[ G_{i,j} \subset x + R_{i,j}. \tag{26} \]
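The containment (26) rests on a simple fact: if $x$ lies in a convex body $K$ and $w_\alpha$ bounds the width $w(K, e_\alpha)$, then every $y \in K$ satisfies $|\langle y - x, e_\alpha\rangle| \le w_\alpha$, so $K \subset x + \sum_\alpha [-w_\alpha e_\alpha, w_\alpha e_\alpha]$; this is why the intervals in (25) are symmetric with the full (one-sided) width bounds as half-lengths. A small numerical sanity check of this fact, with invented data:

```python
import numpy as np

rng = np.random.default_rng(1)

# a random planar point cloud; widths and the containment argument depend
# only on extreme points, so the points themselves suffice for the check
pts = rng.normal(size=(20, 2))

def width(points, e):
    """Directional width w(K, e) = max_x <x,e> - min_x <x,e>."""
    proj = points @ e
    return proj.max() - proj.min()

e1, e2 = np.eye(2)                      # an orthonormal basis
w = np.array([width(pts, e1), width(pts, e2)])

# for ANY point x of the set, every y satisfies |<y - x, e_alpha>| <= w_alpha,
# i.e. the set sits inside x + sum_alpha [-w_alpha e_alpha, w_alpha e_alpha]
for x in pts:
    assert np.all(np.abs(pts - x) <= w + 1e-12)
```

In the paper's setting the half-lengths $\alpha!(\delta_{i_\alpha+1}-\delta_{i_\alpha})$ and $2L_{k,1}\rho_{j,\alpha}$ play the role of the $w_\alpha$, by Propositions 4.1 and 4.2.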
References

Ball, K. (1992). Ellipsoids of maximal volume in convex bodies. Geometriae Dedicata, 41 241–250.
Ball, K. (1997). An elementary introduction to modern convex geometry. In Flavors of Geometry (S. Levy, ed.), vol. 31 of Math. Sci. Res. Inst. Publ. Cambridge University Press, Cambridge, 1–58.

Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97 113–150.

Bronshtein, E. M. (1976). ε-entropy of convex sets and functions. Siberian Mathematical Journal, 17 393–398.

Dey, T. K. and Pach, J. (1998). Extremal problems for geometric hypergraphs. Discrete & Computational Geometry, 19 473–484.

Doss, C. R. and Wellner, J. A. (2015a). Global rates of convergence of the MLEs of log-concave and s-concave densities. arXiv:1306.1438.

Doss, C. R. and Wellner, J. A. (2015b). Inference for the mode of a log-concave density. Tech. rep., University of Washington. In preparation.

Dryanov, D. (2009). Kolmogorov entropy for classes of convex functions. Constr. Approx., 30 137–153.

Dudley, R. M. (1978). Central limit theorems for empirical measures. The Annals of Probability, 6 899–929.

Dudley, R. M. (1984). A course on empirical processes. In École d'été de probabilités de Saint-Flour, XII—1982, vol. 1097 of Lecture Notes in Math. Springer, Berlin, 1–142.

Dudley, R. M. (1999). Uniform Central Limit Theorems, vol. 63 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge.

Grünbaum, B. (1967). Convex Polytopes. With the cooperation of Victor Klee, M. A. Perles and G. C. Shephard. Pure and Applied Mathematics, Vol. 16, Interscience Publishers John Wiley & Sons, Inc., New York.

Guntuboyina, A. and Sen, B. (2013). Covering numbers for convex functions. IEEE Transactions on Information Theory, 59 1957–1965.

Guntuboyina, A. and Sen, B. (2015). Global risk bounds and adaptation in univariate convex regression. Probability Theory and Related Fields, to appear.

John, F. (1948). Extremum problems with inequalities as subsidiary conditions. In Studies and Essays Presented to R. Courant on his 60th Birthday, January 8, 1948. Interscience Publishers, Inc., New York, N. Y., 187–204.

Kim, A. K. H. and Samworth, R. J. (2014). Global rates of convergence in log-concave density estimation. arXiv:1404.2298v1.

Koenker, R. and Mizera, I. (2010). Quasi-concave density estimation. Ann. Statist., 38 2998–3027.
Kolmogorov, A. N. and Tihomirov, V. M. (1961). ε-entropy and ε-capacity of sets in function spaces. AMS Translations: Series 2, 17 277–364.

Rothschild, B. L. and Straus, E. G. (1985). On triangulations of the convex hull of n points. Combinatorica, 5 167–179.

Seijo, E. and Sen, B. (2011). Nonparametric least squares estimation of a multivariate convex regression function. The Annals of Statistics, 39 1633–1657.

Seregin, A. and Wellner, J. A. (2010). Nonparametric estimation of multivariate convex-transformed densities. Ann. Statist., 38 3751–3781. With supplementary material available online.

van de Geer, S. A. (2000). Empirical Processes in M-Estimation. Cambridge University Press.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics, Springer-Verlag, New York.

Xu, M., Chen, M. and Lafferty, J. (2014). Faithful variable screening for high-dimensional convex regression. arXiv:1411.1805v1.