Entropy Numbers of Linear Function Classes

Robert C. Williamson, Department of Engineering, Australian National University, Canberra, ACT 0200, Australia
Alex J. Smola, Department of Engineering, Australian National University, Canberra, ACT 0200, Australia
Bernhard Schölkopf, Microsoft Research Limited, St. George House, 1 Guildhall Street, Cambridge CB2 3NH, UK

Abstract

This paper collects together a miscellany of results originally motivated by the analysis of the generalization performance of the "maximum-margin" algorithm due to Vapnik and others. The key feature of the paper is its operator-theoretic viewpoint. New bounds on covering numbers for classes related to maximum margin classes are derived directly, without making use of a combinatorial dimension such as the VC-dimension. Specific contents of the paper include: a new and self-contained proof of Maurey's theorem and some generalizations, with small explicit values of the constants; bounds on the covering numbers of maximum margin classes suitable for the analysis of their generalization performance; the extension of such classes to those induced by balls in quasi-Banach spaces (such as $\ell_p$ "norms" with $0 < p < 1$); the extension of results on the covering numbers of convex hulls of basis functions to $p$-convex hulls ($0 < p \le 1$); and an appendix containing the tightest known bounds on the entropy numbers of the identity operator between $\ell_{p_1}^n$ and $\ell_{p_2}^n$ ($0 < p_1 < p_2 \le 1$).

1 Introduction

Linear classifiers have had a resurgence of interest in recent years because of the development of Support Vector machines [22, 24], which are based on Maximum Margin hyperplanes [25]. The generalization performance of support vector machines is becoming increasingly well understood via the analysis of the covering numbers of the function classes they induce. Some of this analysis has made use of entropy number techniques. In this paper we focus on the simple maximum margin case and do not consider kernel mappings at all; the effect of the kernel used in support vector machines has been analysed using similar techniques in [32, 11]. The classical maximum margin algorithm effectively works with the class of functions
$$
F := \left\{ x \mapsto w \cdot x \colon \|w\|_{\ell_2^M} \le 1,\ \|x\|_{\ell_2^M} \le 1 \right\}.
$$
(The standard notation used here is defined precisely in Definition 3 below.) The focus of the present paper is to consider what happens when different norms are used in the definition of $F$. Apart from the purely mathematical interest in developing the connection between problems of determining covering numbers of function classes and those of determining entropy numbers of operators, the results in the paper indicate the effect to be expected when different norms are used to define linear function classes in practical learning algorithms. There is a considerable body of work in the mistake-bound framework analysing learning algorithms for linear function classes under different norms; the present paper can be considered a similar exercise in the statistical learning theory framework. The following section collects all the definitions we need. All proofs in the paper are relegated to the appendices.
2 Definitions

We will make use of several notions from the theory of Banach spaces and of a generalization of these called quasi-Banach spaces. A nice general reference for Banach spaces is [33]; for quasi-Banach spaces see [8].

Definition 1 (Banach space) A Banach space $(X, \|\cdot\|_X)$ is a complete normed linear space $X$ with a norm on $X$, i.e. a map $\|\cdot\|_X \colon X \to [0, \infty)$ that satisfies
1. $\|x\|_X = 0$ if and only if $x = 0$;
2. $\|\lambda x\|_X = |\lambda|\,\|x\|_X$ for all scalars $\lambda \in \mathbb{R}$ and all $x \in X$;
3. $\|x + y\|_X \le \|x\|_X + \|y\|_X$ for all $x, y \in X$.

Definition 2 (Quasi-norm) A quasi-norm is a map like a norm which, instead of satisfying the triangle inequality (3 above), satisfies
3'. There exists a constant $C \ge 1$ such that for all $x, y \in X$, $\|x + y\|_X \le C(\|x\|_X + \|y\|_X)$.

All of the spaces considered in this paper are real. A norm (quasi-norm) induces a metric (quasi-metric) via $d(x, y) = \|x - y\|_X$. We will use $d$ to denote both the norm and the induced metric, and use $(X, d)$ to denote the induced metric space. A quasi-Banach space is a complete quasi-normed linear space.
Definition 3 ($\ell_p$ norms) Suppose $x \in \mathbb{R}^n$ with $n \in \mathbb{N}$ (or $x \in \mathbb{R}^{\mathbb{N}}$ for infinite dimensional spaces). Then
$$
\text{for } 0 < p < \infty,\quad \|x\|_{\ell_p} := \|x\|_p = \Big( \sum_{i=1}^n |x_i|^p \Big)^{1/p} \text{ (provided the sum converges)}, \qquad
\text{for } p = \infty,\quad \|x\|_{\ell_\infty} := \|x\|_\infty = \sup_i |x_i|.   \quad (1)
$$
If $\dim X = m$, we often write $\ell_p^m$ to explicitly indicate the dimension; the space $\ell_p^m$ is the set of $x \in \mathbb{R}^m$ equipped with $\|\cdot\|_{\ell_p^m}$. Given $x_1, \ldots, x_m \in \mathbb{R}^M$, write $X^m = (x_1, \ldots, x_m)$. Suppose $F$ is a class of functions defined on $\mathbb{R}^M$. The $\ell_p^m$ norm with respect to $X^m$ of $f \in F$ is defined as
$$
\|f\|_{\ell_p^{X^m}} := \|(f(x_1), \ldots, f(x_m))\|_{\ell_p^m}.
$$
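To make Definition 3 concrete, here is a small Python sketch (ours, not part of the original paper) that computes the $\ell_p$ (quasi-)norms and the empirical norm of (1), and checks the quasi-triangle inequality of Definition 2 with the standard constant $C = 2^{1/p - 1}$ for $p < 1$; the function names are ours.

```python
import numpy as np

def lp_norm(x, p):
    """l_p (quasi-)norm of Definition 3: a norm for p >= 1, a quasi-norm for 0 < p < 1."""
    x = np.asarray(x, dtype=float)
    if np.isinf(p):
        return np.abs(x).max()
    return (np.abs(x) ** p).sum() ** (1.0 / p)

def empirical_norm(f, sample, p):
    """||f||_{l_p^{X^m}}: the l_p^m norm of (f(x_1), ..., f(x_m)), cf. (1)."""
    return lp_norm([f(x) for x in sample], p)

rng = np.random.default_rng(0)
x, y, p = rng.normal(size=5), rng.normal(size=5), 0.5
# For p < 1 the triangle inequality fails, but the quasi-triangle
# inequality of Definition 2 holds with C = 2^{1/p - 1} (here C = 2).
C = 2.0 ** (1.0 / p - 1.0)
assert lp_norm(x + y, p) <= C * (lp_norm(x, p) + lp_norm(y, p))
print(empirical_norm(np.sum, rng.normal(size=(10, 3)), np.inf))
```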
For $1 \le p \le \infty$, $\|\cdot\|_{\ell_p^m}$ is a norm, and for $0 < p < 1$ it is a quasi-norm. Note that a different definition of the $\ell_p^m$ norm is used in some papers in learning theory, e.g. [28, 34]. A useful inequality in this context is the following (cf. e.g. [16, p.21]):

Theorem 4 (Hölder's inequality) Suppose $p, q \ge 1$ satisfy $\frac{1}{p} + \frac{1}{q} = 1$ and that $u \in \ell_p$ and $v \in \ell_q$. Then
$$
|u \cdot v| \le \|u\|_p \|v\|_q.
$$

Suppose $(X, d)$ is a metric space and let $S \subseteq X$. We say $A \subseteq X$ is an $\epsilon$-cover for $S$ if for all $x \in S$ there is an $a \in A$ such that $d(x, a) < \epsilon$.

Definition 5 (Covering and entropy numbers) The $\epsilon$-covering number of $S$, denoted by $N(\epsilon, S, d)$, is the size of the smallest $\epsilon$-cover of $S$. The $n$th entropy number of a set $S \subseteq X$ is defined by
$$
\epsilon_n(S) = \epsilon_n(S, d) := \inf\{\epsilon > 0 \colon N(\epsilon, S, d) \le n\}.   \quad (2)
$$
Given a class of functions $F \subseteq \mathbb{R}^{\mathcal{X}}$, the uniform covering number or $\epsilon$-growth function is
$$
N_m(\epsilon, F) := \sup_{X^m \in \mathcal{X}^m} N(\epsilon, F, \|\cdot\|_{\ell_\infty^{X^m}}).   \quad (3)
$$
Covering numbers are of considerable interest to learning theory because generalization bounds can be stated in terms of them [1, 30].
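The covering numbers of Definition 5 can be estimated for finite point sets by a simple greedy construction. The following sketch (our illustration, not from the paper) returns the size of a greedy $\epsilon$-cover, which upper-bounds $N(\epsilon, S, d)$ for the Euclidean metric.

```python
import numpy as np

def greedy_cover_size(points, eps):
    """Size of a greedy eps-cover of a finite S in (R^d, ||.||_2).

    Greedily picks an uncovered point as the next center, so the result
    upper-bounds the covering number N(eps, S, d) of Definition 5."""
    pts = np.asarray(points, dtype=float)
    uncovered = np.ones(len(pts), dtype=bool)
    centers = 0
    while uncovered.any():
        c = pts[np.argmax(uncovered)]                 # first uncovered point
        uncovered &= np.linalg.norm(pts - c, axis=1) >= eps
        centers += 1
    return centers

rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, size=(500, 2))
for eps in (1.0, 0.5, 0.25):
    print(eps, greedy_cover_size(S, eps))   # grows roughly like (1/eps)^2 here
```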
Definition 6 (Operator norm) Denote by $X = (X, d)$ a (quasi-)normed space; $U_d$ is the (closed) unit ball $U_d := \{x \in X \colon \|x\|_X \le 1\}$. Suppose $X$ and $Y$ are (quasi-)Banach spaces and $T$ is a linear operator mapping from $X$ to $Y$. Then the operator norm of $T$ is defined by
$$
\|T\| := \sup\{\|Tx\|_Y \colon x \in U_X\},   \quad (4)
$$
and $T$ is called bounded if $\|T\| < \infty$. We denote by $L(X, Y)$ the set of all bounded linear operators from $X$ to $Y$.

Definition 7 (Entropy numbers of operators) Suppose $T \in L(X, Y)$. The entropy numbers of the operator $T$ are defined by
$$
\epsilon_n(T) = \epsilon_n(T, X) := \epsilon_n(T(U_X)).   \quad (5)
$$
The dyadic entropy numbers $e_n(T)$ are defined by
$$
e_n(T) = e_n(T, d) := \epsilon_{2^{n-1}}(T) \quad \text{for } n \in \mathbb{N}.   \quad (6)
$$

The main reference for entropy numbers of operators is [7]. Many of the properties shown there for Banach spaces $X, Y$ actually carry over to quasi-Banach spaces; see e.g. [8]. The factorization theorem for entropy numbers is of considerable use:

Lemma 8 (Edmunds and Triebel [8, p.7]) Let $A, B, C$ be quasi-Banach spaces and let $S, T \in L(A, B)$ and $R \in L(B, C)$. Then
1. $\|T\| \ge e_1(T) \ge e_2(T) \ge \cdots \ge 0$;
2. for all $k, l \in \mathbb{N}$, $e_{k+l-1}(RS) \le e_k(R)\, e_l(S)$.

Finite rank operators have exponentially decaying entropy numbers:

Lemma 9 (Carl and Stephani [7, p.14,21]) Denote by $A, B$ Banach spaces, let $T \in L(A, B)$ and suppose that $\operatorname{rank}(T) \le m$. Then for all $k \in \mathbb{N}$,
$$
e_k(T) \le 4 \|T\|\, 2^{-(k-1)/m}.   \quad (7)
$$
In fact this bound is tight to within a constant factor of 4.

One operator that we will make repeated use of is the identity operator. If $A$ and $B$ are (quasi-)Banach spaces, it is defined by
$$
\mathrm{id} \colon A \to B, \qquad \mathrm{id} \colon x \mapsto x.   \quad (8)
$$
This seemingly trivial operator is of interest when $A \ne B$ because of the definition of the operator norm. As we shall see below, the entropy numbers of $\mathrm{id} \colon \ell_{p_1}^n \to \ell_{p_2}^n$ play a central role in many of the results we develop. In the following we will use $c$ to denote a positive constant (not necessarily the same at each occurrence) and $\log$ is the logarithm to base 2.
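Definition 7 can likewise be explored numerically. The sketch below (ours; a crude Monte-Carlo approximation, since only a finite sample of the unit ball is covered) estimates $\epsilon_n(T)$ for a small matrix operator by binary search over a greedy cover of the sampled image $T(U_X)$; the dyadic entropy numbers of (6) are obtained at $n = 2^{k-1}$.

```python
import numpy as np

def greedy_cover_size(points, eps):
    pts = np.asarray(points, dtype=float)
    uncovered = np.ones(len(pts), dtype=bool)
    n = 0
    while uncovered.any():
        c = pts[np.argmax(uncovered)]
        uncovered &= np.linalg.norm(pts - c, axis=1) >= eps
        n += 1
    return n

def entropy_number_estimate(T, n, samples=2000, seed=0):
    """Monte-Carlo estimate of eps_n(T) = eps_n(T(U_X)) of Definition 7
    for a matrix T acting from l_2^d to l_2^k (rough: finite sample only)."""
    rng = np.random.default_rng(seed)
    d = T.shape[1]
    w = rng.normal(size=(samples, d))
    w *= rng.uniform(size=(samples, 1)) ** (1.0 / d) / np.linalg.norm(w, axis=1, keepdims=True)
    img = w @ T.T                                     # sampled image of U_X
    lo, hi = 0.0, 2.0 * np.linalg.norm(T, 2)
    for _ in range(40):                               # binary search on eps
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if greedy_cover_size(img, mid) > n else (lo, mid)
    return hi

T = np.diag([1.0, 0.5, 0.25])
# dyadic entropy numbers e_k(T) = eps_{2^{k-1}}(T), cf. (6)
print([round(entropy_number_estimate(T, 2 ** (k - 1)), 3) for k in (1, 2, 3, 4)])
```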
3 Entropy Numbers of Linear Function Classes

Since we are mainly concerned with the capacity of linear functions of the form
$$
f(x) := w \cdot x \quad \text{with } \|x\|_A \le \Lambda_x \text{ and } \|w\|_B \le \Lambda_w,   \quad (9)
$$
we will consider bounds on entropy numbers of linear operators. (Here $\|\cdot\|_A$ and $\|\cdot\|_B$ are norms or quasi-norms.) This will allow us to deal with function classes derived from (9). In particular we will analyze the class of functions
$$
F_{p,q}^M := \left\{ x \mapsto w \cdot x \colon \|w\|_{\ell_p^M} \le 1,\ \|x\|_{\ell_q^M} \le 1 \right\}   \quad (10)
$$
where $M \in \mathbb{N}$, $p, q > 0$, and $\frac{1}{p} + \frac{1}{q} \ge 1$. More specifically we will look at the evaluation of $F_{p,q}^M$ on an $m$-sample $X^m = \{x_1, \ldots, x_m\} \subseteq \ell_q^M$ and the entropy numbers of the evaluation map in terms of the $\ell_\infty^m$ metric. Formally, we will study the entropy numbers of the operator $S_{X^m}$ defined as
$$
S_{X^m} \colon \ell_p^M \to \ell_\infty^m, \qquad S_{X^m} \colon w \mapsto (w \cdot x_1, \ldots, w \cdot x_m).
$$
The connection between $S_{X^m}$ and $F_{p,q}^M$ is given in Lemma 11 below. Since one cannot expect that all problems can be cast into (10) without proper rescaling, we will be interested in constraints on $w$ and $x_i$ such that $|w \cdot x_i| \le 1$ (and rescale later). This leads to the following straightforward corollary of Hölder's inequality:

Lemma 10 (Product bounds from Hölder's inequality) Suppose $p, q > 0$ with $\frac{1}{p} + \frac{1}{q} \ge 1$. Furthermore suppose $M \in \mathbb{N}$, $w \in \ell_p^M$, $x \in \ell_q^M$ with $\|w\|_{\ell_p^M} \le \Lambda_w$ and $\|x\|_{\ell_q^M} \le \Lambda_x$. Then
$$
\sup_{x_1, \ldots, x_m \in \Lambda_x U_{\ell_q^M}} \|S_{X^m}\| \le \Lambda_w \Lambda_x.   \quad (11)
$$

In order to avoid tedious notation we will assume that $\Lambda_w = \Lambda_x = 1$. This is no major restriction since the general results follow simply by rescaling by $\Lambda_w \Lambda_x$. Our interest in the operator $S_{X^m}$ and its entropy numbers is explained by the following lemma, which connects $e_k(S_{X^m})$ to the uniform covering numbers of $F_{p,q}^M$.

Lemma 11 (Entropy and covering numbers) Let $k \in \mathbb{N}$. If for all $X^m \in (U_{\ell_q^M})^m$ we have $e_k(S_{X^m} \colon \ell_p^M \to \ell_\infty^m) \le \epsilon$, then
$$
\log_2 N_m(\epsilon, F_{p,q}^M) \le k - 1.   \quad (12)
$$
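For $p \ge 1$ the operator norm appearing in Lemma 10 has a closed form, $\|S_{X^m}\| = \max_i \|x_i\|_{p^*}$ with $\frac{1}{p} + \frac{1}{p^*} = 1$, which makes (11) easy to check numerically. A small sketch (ours, with illustrative parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
M, m, p, q = 20, 50, 1.5, 3.0            # 1/p + 1/q = 1, so Lemma 10 applies
p_star = p / (p - 1.0)                   # conjugate exponent of p

# Sample points with ||x_i||_q <= 1, i.e. x_i in U_{l_q^M}.
X = rng.normal(size=(m, M))
X /= np.linalg.norm(X, ord=q, axis=1, keepdims=True)

# For p >= 1, ||S_{X^m}: l_p^M -> l_inf^m|| = max_i ||x_i||_{p*} (Hoelder),
# and q <= p* gives ||x_i||_{p*} <= ||x_i||_q <= 1, which is (11).
op_norm = np.linalg.norm(X, ord=p_star, axis=1).max()

# Cross-check against random w from the l_p unit sphere.
W = rng.normal(size=(100000, M))
W /= np.linalg.norm(W, ord=p, axis=1, keepdims=True)
assert np.abs(W @ X.T).max() <= op_norm + 1e-9
print(op_norm)   # <= 1, as Lemma 10 asserts
```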
3.1 The Maurey-Carl Theorem

In this section we present a special case of the famous Maurey-Carl theorem. The proof presented (in the appendix) provides a (small) explicit constant. The result is not only of fundamental importance in statistical learning theory; it is of central importance in pure mathematics. Carl and Pajor [6] prove Maurey's theorem via the "Little Grothendieck theorem", which is related to Grothendieck's "fundamental theorem of the metric theory of tensor products". Furthermore, the Little Grothendieck theorem can be proved in terms of Maurey's theorem (and thus they are formally equivalent); see [7, pages 254-267] for details. The version proved by Carl [3] (following Maurey's proof) uses a characterization of Banach spaces in terms of their Rademacher type. The latter is defined as follows.

Definition 12 (Rademacher type of Banach spaces) A Banach space $X$ is of Rademacher type $p$, $1 \le p \le 2$, if there is a constant $c > 0$ such that for every finite sequence $\{x_1, \ldots, x_n\} \subseteq X$ we have
$$
\int_0^1 \Big\| \sum_{i=1}^n r_i(t)\, x_i \Big\|_X \, dt \le c \Big( \sum_{i=1}^n \|x_i\|^p \Big)^{1/p}.   \quad (13)
$$
Here $r_i(t) = \operatorname{sgn}(\sin(2^i \pi t))$ is the $i$th Rademacher function on $[0, 1]$. The Rademacher type $p$ constant $\rho_p(X)$ is the smallest constant $c$ satisfying (13).

Theorem 13 (Maurey-Carl) Let $X$ be a Banach space of Rademacher type $p$, $1 < p \le 2$. Let $m \in \mathbb{N}$ and let $S \in L(\ell_1^m, X)$. Then there exists a constant $c$ such that for all $k \in \mathbb{N}$,
$$
e_k(S) \le c\, \rho_p(X)\, \|S\| \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1 - 1/p}.   \quad (14)
$$

It is of interest to determine an explicit value for the constant $c$. Carl [3] proved that for $S \colon \ell_1^m \to E$,
$$
\epsilon_{\binom{2m+k-1}{k}}(S) \le 4\, \rho_p(E)\, k^{-1+1/p}\, \|S\|.   \quad (15)
$$
This leads to the following corollary:

Corollary 14 (Small constants for Maurey's theorem) If $E$ is a Hilbert space, (14) holds with $p = 2$, $\rho_p(E) = 1$, and $c \le 4.4377$.

The dual version of Theorem 13, i.e. bounds on $e_k(X \to \ell_\infty^m)$, has identical formal structure to Theorem 13. We only state the Hilbert space case here.

Theorem 15 (Dual version of the Maurey-Carl theorem) Suppose $H$ is a Hilbert space, $m \in \mathbb{N}$ and $T$ is a linear operator $T \colon H \to \ell_\infty^m$. Then
$$
e_k(T \colon H \to \ell_\infty^m) \le c\, \|T\| \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/2},   \quad (16)
$$
where $c = 102.88$.

As explained in Appendix B, we suspect that a smaller value of $c$ is possible (we conjecture 1.86). Theorem 15 can be improved by taking advantage of operators with low rank via Lemma 9.

Lemma 16 (Improved dual Maurey-Carl theorem) Let $H$ be a Hilbert space, and suppose $m, k \in \mathbb{N}$ with $m \ge 4$ and $T$ is a linear operator $T \colon H \to \ell_\infty^m$. Then
$$
e_k(T \colon H \to \ell_\infty^m) \le c\, \|T\| \cdot
\begin{cases}
1 & \text{if } k \le \log m \\
\left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/2} & \text{if } \log m \le k \le m \\
8\, m^{-1/2}\, 2^{-k/m} & \text{if } m < k
\end{cases}   \quad (17)
$$
where $c = 102.88$.
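The three regimes of (17) are easy to visualise numerically. The sketch below (ours; the constant 102.88 is taken from the lemma, $\|T\| = 1$) evaluates the right-hand side of (17):

```python
import numpy as np

def dual_maurey_carl_bound(k, m, c=102.88, norm_T=1.0):
    """Right-hand side of (17), Lemma 16 (log is base 2, as in the paper)."""
    if k <= np.log2(m):
        return c * norm_T
    if k <= m:
        return c * norm_T * np.sqrt(np.log2(m / k + 1.0) / k)
    return c * norm_T * 8.0 * m ** -0.5 * 2.0 ** (-k / m)

m = 1024
for k in (5, 10, 100, 1024, 2048, 4096):
    print(k, dual_maurey_carl_bound(k, m))
# Three regimes: flat for k <= log m, sqrt(log/k) decay for log m <= k <= m,
# and exponential decay in k/m once k exceeds the rank bound m.
```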
4 Dimensionality and Sample Size

Lemma 16 already indicated that the size $m$ of the sample generating the evaluation operator $S_{X^m}$ plays a crucial role in the scaling behaviour of entropy numbers. The dimensionality $M$ of $X$ (i.e. the dimension of the samples $x_i$) also comes into play. This guides our analysis in the present section. In section 4.1 we deal with the case where $M = m$; section 4.2 deals with the situation where $M$ is polynomial in $m$. Depending on the setting of the learning problem we need bounds for the entropy numbers of the identity map between $\ell_p^m$ and $\ell_q^m$. We will use such bounds repeatedly below; we have collected together a number of them in Appendix C.

4.1 Dimensionality of X and Sample Size m are Equal

We begin with the simplest case: $X$ is a finite dimensional Hilbert space of dimensionality $M = m$, hence we will be dealing with $F_{2,2}^m$. Lemma 16 applies. It is instructive to restate this result in terms of covering numbers.
Theorem 17 (Covering numbers for $F_{2,2}^m$) There exist constants $c, c' > 0$ such that for all $m \in \mathbb{N}$ and all $\epsilon > 0$,
$$
\log N_m(\epsilon, F_{2,2}^m) \le
\begin{cases}
c\, \epsilon^{-2} \log 2m & \text{if } \epsilon > \tfrac{1}{\sqrt{m}} \\
c' \max\Big( 1,\ m \log \tfrac{c}{\epsilon \sqrt{m}} \Big) & \text{if } \epsilon \le \tfrac{1}{\sqrt{m}}.
\end{cases}   \quad (18)
$$
It is interesting to note the analogy with the Sauer-Vapnik-Chervonenkis lemma [31, 20, 1], which shows that the growth function has two regimes. We will now develop generalizations of the above result for $F_{p,q}^m$ with $(p, q) \ne (2, 2)$. An existing result in this direction is

Lemma 18 (Carl [3, p.94]) Let $S \in L(\ell_p^m, \ell_\infty^m)$, $m \in \mathbb{N}$, and let $1 < p \le 2$. Then there exists a constant $c = c(p)$ such that for all $k \in \mathbb{N}$,
$$
e_k(S) \le c\, \|S\| \left( k^{-1} \log\big( 1 + \tfrac{m}{k} \big) \right)^{1 - \frac{1}{p}}.   \quad (19)
$$
(For $k > m$ one can get a better bound along the lines of Lemma 16.) This leads to the following theorem:

Theorem 19 (Slack in $F_{p,q}^m$) Let $p > 0$, $q \ge 2$, and $\frac{1}{p} + \frac{1}{q} > 1$. Then there exist constants $c, c'$ such that with $\nu := \frac{1}{p} + \frac{1}{q} - \frac{1}{2}$ we have
$$
e_k(F_{p,q}^m) \le c \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{\nu}
\qquad \text{and} \qquad
\log N_m(\epsilon, F_{p,q}^m) \le c' \log m \left( \tfrac{1}{\epsilon} \right)^{1/\nu}.
$$
4.2 Dimensionality of X is Polynomial in the Sample Size m

Now we will consider $N_m(\epsilon, F_{p,q}^M)$ when $M > m$. We will derive results that are useful when $M$ is polynomial in $m$. With $S_{X^m} \colon \ell_p^M \to \ell_\infty^m$ defined as before, we proceed to bound $e_k(S_{X^m})$.

Lemma 20 (Slack in $F_{p,q}^M$) Let $0 < p \le 2$, $q \ge 1$, $\frac{1}{p} + \frac{1}{q} \ge 1$, and $M, m \in \mathbb{N}$. Then there exists a constant $c'$ such that
$$
e_{2k+1}(F_{p,q}^M; \ell_\infty^m) \le e_k(\mathrm{id} \colon \ell_p^M \to \ell_2^M)\; e_k(\tilde{S}_{X^m} \colon \ell_2^M \to \ell_\infty^m)   \quad (20)
$$
$$
\le c' \left( k^{-1} \log\big( \tfrac{M}{k} + 1 \big) \right)^{\frac{1}{p} - \frac{1}{2}} \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{\frac{1}{2}}.
$$

Consider now the situation that $p = 1$ and $q = 2$. From Lemma 20,
$$
e_{2k+1}(F_{1,2}^M; \ell_\infty^m) \le c' \left( k^{-1} \log\big( \tfrac{M}{k} + 1 \big) \right)^{\frac{1}{2}} \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{\frac{1}{2}}
= c'\, k^{-1} \log^{1/2}\big( \tfrac{M}{k} + 1 \big) \log^{1/2}\big( \tfrac{m}{k} + 1 \big).   \quad (21)
$$
Thus, ignoring $\log(m)$ factors, regression with $F_{1,2}^M$ has a sample complexity $m(\epsilon)$ that grows only like $\log^{1/2} M$ with the dimension $M$. Interestingly, Zhang [34] has developed a result going in the other direction: he makes use of mistake bounds to determine similar covering numbers, whereas our covering number bounds (computed directly) recover the general form of the mistake bounds when turned into batch learning results. (Note, too, that Zhang uses a normalised definition of $\|\cdot\|_{\ell_p^m}$, so care needs to be taken in comparing his results to ours.)
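The behaviour of (21) in the dimension $M$ can be illustrated directly; the following sketch (ours; the unspecified constant $c'$ is set to 1 as a placeholder) evaluates the right-hand side of (21):

```python
import numpy as np

def f12_entropy_bound(k, M, m, c=1.0):
    """Right-hand side of (21): entropy-number bound for F_{1,2}^M
    (the constant is unspecified in the text; c = 1 is a placeholder)."""
    return c / k * np.sqrt(np.log2(M / k + 1.0)) * np.sqrt(np.log2(m / k + 1.0))

m = 1000
for M in (10 ** 3, 10 ** 6, 10 ** 9):   # dimension polynomial (or worse) in m
    print(M, f12_entropy_bound(100, M, m))
# The bound grows only like sqrt(log M) in the dimension M, which is what
# makes l_1-constrained ("sparse") linear classes usable when M >> m.
```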
5 Covering Numbers of $co_p(F)$

Adaboost [10] is an algorithm related to the variants of the maximum margin algorithm considered in this paper. It outputs a hypothesis lying in the convex hull of a set of weak learners, and its generalization performance can be expressed in terms of the covering number of the convex hull of the weak learners at a scale related to the margin achieved [21]. In this section we consider the effect on the covering numbers of the class used (and hence on the generalization bounds) when the $p$-convex hull is used instead (with $p \in (0, 1)$). Variants of Adaboost can be developed which use the $p$-convex hull, and experimental results indicate that $p$ affects the generalization in a manner consistent with what is suggested by the theory below [2]. The argument below is inspired by the results in [4, 5].

We are interested in $N(\epsilon, co_p(F), \ell_\infty^m)$ when $F$ is a subset of a Hilbert space $H$ and $co_p(F)$ denotes the $p$-convex hull of $F$ (see Definition 21). Recalling the definition of $\|f\|_{\ell_q^{X^m}} = (\sum_{i=1}^m |f(x_i)|^q)^{1/q}$, for all $q > 0$ we have $\|f\|_{\ell_\infty^{X^m}} \le \|f\|_{\ell_q^{X^m}}$ and thus $N(\epsilon, F, \ell_\infty^m) \le N(\epsilon, F, \ell_2^m)$. Since $\|\cdot\|_{\ell_2^{X^m}}$ induces a Hilbert space, we will now bound $N(\epsilon, co_p(F), \ell_\infty^m)$ in terms of covering numbers with respect to the Hilbert space norm $\|\cdot\|_{\ell_2^{X^m}}$. Since the results hold regardless of $m$, we will in fact bound the uniform covering number $N_m(\epsilon, co_p(F))$.

Definition 21 ($p$-convex hull) Suppose $p > 0$ and $F$ is a set. Then the $p$-convex hull of $F$ (strictly speaking, the $p$-absolutely-convex hull) is defined by [13, chapter 6]
$$
co_p(F) = \bigcup_{n \in \mathbb{N}} \Big\{ \sum_{i=1}^n \lambda_i f_i \colon f_i \in F,\ \lambda_i \in \mathbb{R},\ \sum_{i=1}^n |\lambda_i|^p \le 1 \Big\}.
$$
As an example, consider the $p$-convex hull of the set of Heaviside functions on $[0, 1]$. It is well known that the convex hull ($p = 1$) is the set of functions of bounded variation. In Appendix D we explore the analogous situation for $0 < p < 1$.

Lemma 22 Let $0 < p \le 1$, let $A$ be a set and let $U_\epsilon(A)$ be an $\epsilon$-cover of $A$. Then $co_p(U_\epsilon(A))$ is an $\epsilon$-cover of $co_p(A)$.

Lemma 23 Let $0 < p \le 1$, $\delta > 0$, $\epsilon_1, \epsilon_2 > 0$ and $\epsilon_1 + \epsilon_2 \le \delta$. Then $N(\delta, co_p(A)) \le N(\epsilon_1, co_p(U_{\epsilon_2}(A)))$.

Lemma 24 Let $A = \{a_1, \ldots, a_n\} \subseteq H$ where $H$ is a Hilbert space. Define the linear operator $S \colon \ell_p^n \to H$ such that $S e_i = a_i$, $i = 1, \ldots, n$, where $(e_i)$ is the canonical basis of $\ell_p^n$ (e.g. $e_1 = (1, 0, \ldots, 0)$, $e_2 = (0, 1, \ldots, 0)$, etc.). Then $co_p(A) = S(U_{\ell_p^n})$.

Theorem 25 (Covering numbers of the $p$-convex hull) Let $H$ be a Hilbert space, let $F \subseteq H$ be compact, and let $B := \sup_{h \in F} \|h\|$. Then there is a constant $c$ such that
$$
\log N(\delta, co_p(F)) \le \min_{\epsilon_2 \in (0, \delta)} \min\Big\{ k \colon c B \Big( k^{-1} \log\big( 1 + \tfrac{N(\epsilon_2, F)}{k} \big) \Big)^{\frac{1}{p} - \frac{1}{2}} \le \delta - \epsilon_2 \Big\}.
$$

Suppose $N(\epsilon, F) \le \left( \frac{1}{\epsilon} \right)^d$ for some $d \in \mathbb{N}$. We can determine the rate of growth of $\log N(\delta, co_p(F))$ as follows. Neglecting the $k$ inside the log in (45) we can explicitly solve the equation. Numerical evidence suggests that the dependence of the value of $k$ on $\epsilon_2$ in (45) is not very strong, so we choose $\epsilon_2 = \delta/2$. Then a simple approximate calculation yields:

Corollary 26 Suppose $F \subseteq H$ is such that for some $d \in \mathbb{N}$, $N(\epsilon, F) \le \left( \frac{1}{\epsilon} \right)^d$. Then for $0 < p \le 1$,
$$
\log N(\delta, co_p(F)) \le c(p)\, d \left( \tfrac{1}{\delta} \right)^{\frac{2p}{2-p}} \log \tfrac{1}{\delta}.
$$
For $p = 1$ this is $O((1/\delta)^2 \log(1/\delta))$, whereas we know from [4] that the rate should be $O((1/\delta)^{\frac{2d}{d+2}})$. Of course for large $d$, the difference is negligible (asymptotically in $1/\delta$).
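Theorem 25 can be evaluated numerically in the setting of Corollary 26. The sketch below (ours) sets $B = 1$ and absorbs the unspecified constant into 1, assumes $N(\epsilon, F) = (1/\epsilon)^d$, takes logs base 2 as in the paper, and performs the double minimization over $\epsilon_2$ and $k$; the function name is ours.

```python
import numpy as np

def copF_log_covering_bound(delta, d, p, B=1.0, eps2_grid=200, k_max=10**7):
    """Evaluate the bound of Theorem 25 for N(eps, F) <= (1/eps)^d,
    minimizing over eps_2 in (0, delta) and searching for the smallest k."""
    best = np.inf
    for eps2 in np.linspace(delta / eps2_grid, delta * (1 - 1e-9), eps2_grid):
        n = (1.0 / eps2) ** d                 # covering number of the base class
        lo, hi = 1, k_max                     # smallest k with bound <= delta - eps2
        while lo < hi:
            k = (lo + hi) // 2
            val = B * (np.log2(1.0 + n / k) / k) ** (1.0 / p - 0.5)
            lo, hi = (k + 1, hi) if val > delta - eps2 else (lo, k)
        best = min(best, lo)
    return best

for delta in (0.2, 0.1, 0.05):
    print(delta, [copF_log_covering_bound(delta, d=3, p=p) for p in (1.0, 0.5)])
# Consistent with Corollary 26, the bound grows roughly like
# (1/delta)^(2p/(2-p)) * log(1/delta): more slowly for smaller p.
```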
6 Conclusions

We have computed covering numbers for a range of variants of the maximum margin algorithm. In doing so we made explicit use of an operator-theoretic viewpoint already used fruitfully in analysing the effect of the kernel in SV machines. We also analysed the covering numbers of p-convex hulls of simple classes of functions. We have seen how the now classical results for maximum margin hyperplanes can be generalized to function classes induced by different norms. The scaling behaviour of the resulting covering number bounds gives some insight into how related algorithms will perform in terms of their generalization performance. In other work [32, 11, 26] we have explored, for instance, the effect of the kernel used in support vector machines; in that case the eigenvalues of the kernel play a key role. In all this work, the viewpoint used is that the function class is the image, under the multiple evaluation map (considered as a linear operator), of a ball induced by a norm.

Gurvits [12] asked (effectively) what learning theory can do for the geometric theory of Banach spaces. It seems the assistance flows more readily in the other direction. Perhaps the one contribution learning theory has made is pointing out an interesting research direction [27] by giving an answer to Pietsch's implicit question, where he said of entropy numbers [19, p.311] that "at present we do not know any application in the real world." Now at least there is one!

Acknowledgements

We would like to thank Bernd Carl, Anja Westerhoff and Ingo Steinwart for detailed comments and much help. This work was supported by the Australian Research Council. AS was supported by the DFG, grant SM 62/1-1. Parts of this work were done while BS was visiting the Australian National University.

References

[1] M. Anthony and P. Bartlett. A Theory of Learning in Artificial Neural Networks. Cambridge University Press, 1999.
[2] J. Barnes. Capacity control in boosting using a p-convex hull. Master's thesis, Department of Engineering, Australian National University, 1999.
[3] B. Carl. Inequalities of Bernstein-Jackson-type and the degree of compactness of operators in Banach spaces. Annales de l'Institut Fourier, 35(3):79-118, 1985.
[4] B. Carl. Metric entropy of convex hulls in Hilbert spaces. Bulletin of the London Mathematical Society, 29:452-458, 1997.
[5] B. Carl, I. Kyrezi, and A. Pajor. Metric entropy of convex hulls in Banach spaces. Proceedings of the London Mathematical Society, 1999. To appear.
[6] B. Carl and A. Pajor. Gelfand numbers of operators with values in a Hilbert space. Inventiones Mathematicae, 94:479-504, 1988.
[7] B. Carl and I. Stephani. Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK, 1990.
[8] D. E. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge University Press, Cambridge, 1996.
[9] D. E. Edmunds and H. Triebel. Entropy numbers and approximation numbers in function spaces. Proceedings of the London Mathematical Society, 58:137-152, 1989.
[10] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148-156. Morgan Kaufmann, 1996.
[11] Y. G. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson. Covering numbers for support vector machines. In Proceedings of COLT99, 1999.
[12] L. Gurvits. A note on a scale-sensitive dimension of linear bounded functionals in Banach spaces. In M. Li and A. Maruoka, editors, Algorithmic Learning Theory ALT-97, LNAI-1316, pages 352-363, Berlin, 1997. Springer.
[13] H. Jarchow. Locally Convex Spaces. B. G. Teubner, 1981.
[14] H. König. Eigenvalue Distribution of Compact Operators. Birkhäuser, Basel, 1986.
[15] M. Laczkovich and D. Preiss. Φ-variation and transformation into Cⁿ functions. Indiana University Mathematics Journal, 34(2):405-424, 1985.
[16] I. J. Maddox. Elements of Functional Analysis. Cambridge University Press, Cambridge, 1970.
[17] A. M. Olevskiĭ. Homeomorphisms of the circle, modifications of functions, and Fourier series. American Mathematical Society Translations (2), 147:51-64, 1990.
[18] A. Pietsch. Operator Ideals. North-Holland, Amsterdam, 1980.
[19] A. Pietsch. Eigenvalues and s-Numbers. Cambridge University Press, Cambridge, 1987.
[20] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, 13:145-147, 1972.
[21] R. Schapire, Y. Freund, P. L. Bartlett, and W. Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998.
[22] B. Schölkopf, C. J. C. Burges, and A. J. Smola. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[23] C. Schütt. Entropy numbers of diagonal operators between symmetric Banach spaces. Journal of Approximation Theory, 40:121-128, 1984.
[24] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 349-358, Cambridge, MA, 2000. MIT Press.
[25] A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans. Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, 2000.
[26] A. J. Smola, A. Elisseeff, B. Schölkopf, and R. C. Williamson. Entropy numbers for convex combinations and MLPs. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 369-387, Cambridge, MA, 2000. MIT Press.
[27] I. Steinwart. Some estimates for the entropy numbers of convex hulls with finitely many extreme points. Technical report, University of Jena, 1999.
[28] M. Talagrand. The Glivenko-Cantelli problem, ten years later. Journal of Theoretical Probability, 9(2):371-384, 1996.
[29] H. Triebel. Interpolation Theory, Function Spaces, Differential Operators. North-Holland, Amsterdam, 1978.
[30] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[31] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.
[32] R. C. Williamson, A. J. Smola, and B. Schölkopf. Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. Technical Report 19, NeuroCOLT, http://www.neurocolt.com, 1998. Accepted for publication in IEEE Transactions on Information Theory.
[33] P. Wojtaszczyk. Banach Spaces for Analysts. Cambridge University Press, 1991.
[34] T. Zhang. Analysis of regularised linear functions for classification problems. IBM Research Report RC21572, 1999.
A Proofs

Proof (Lemma 10) If $p \le 1$, choose $p' = 1$ and $q' = \infty$; otherwise set $p' = p$ and $q' = (1 - p^{-1})^{-1}$. Since $\frac{1}{p} + \frac{1}{q} \ge 1$, in either case $p' \ge p$, $q' \ge q$ and $\frac{1}{p'} + \frac{1}{q'} = 1$. Note that by convexity $\|x\|_{r'} \le \|x\|_r$ for $0 < r \le r'$, and therefore $U_{\ell_r^M} \subseteq U_{\ell_{r'}^M}$. Thus we may apply Hölder's inequality with $p'$ and $q'$ to obtain
$$
\sup_{1 \le i \le m} |x_i \cdot w| \le \sup_{1 \le i \le m} \|x_i\|_{q'} \|w\|_{p'} \le \Lambda_x \Lambda_w,   \quad (22)
$$
which proves (11).

Proof (Lemma 11) Fix $X^m$. We have
$$
e_k(S_{X^m}) = \epsilon_{2^{k-1}}(S_{X^m}) = \epsilon_{2^{k-1}}(S_{X^m}(U_{\ell_p^M})),   \quad (23)
$$
where the last equality follows from the definition of $\epsilon_n(T)$. Since $X^m \in (U_{\ell_q^M})^m$, the set $S_{X^m}(U_{\ell_p^M})$ is exactly $F_{p,q}^M$ evaluated on $X^m$. Thus $\epsilon_{2^{k-1}}(F_{p,q}^M; \ell_\infty^{X^m}) = e_k(S_{X^m}) \le \epsilon$, and therefore $N(\epsilon, F_{p,q}^M, \ell_\infty^{X^m}) \le 2^{k-1}$. Since $X^m \in (U_{\ell_q^M})^m$ was arbitrary, we conclude that $\log_2 N_m(\epsilon, F_{p,q}^M) \le k - 1$.

Proof (Corollary 14) The fact that $p = 2$ and $\rho_p(E) = 1$ follows from the construction of Hilbert spaces [33]. What remains to be shown is the bound on the constant. Specializing (15) for Hilbert spaces and rewriting $l(m, k) := \lceil \log \binom{2m+k-1}{k} \rceil + 1$ (and conversely $k(l, m)$), we obtain
$$
e_l(S) \le 4\, k(l, m)^{-1/2} \|S\|.   \quad (24)
$$
The next step is to find a (simple) function $\tilde{k}(l, m)$ such that $4\, k(l, m)^{-1/2} \|S\| \le 4 \gamma\, \tilde{k}(l, m)^{-1/2} \|S\|$, or equivalently $\gamma^2 \ge \tilde{k}(l, m)/k(l, m)$ for all $1 \le k \le m$. We choose $\tilde{k}(l, m) = l / \log(\frac{m}{l} + 1)$. Next we have to bound $\gamma^2$. Set
$$
\phi(k, m) := \frac{\tilde{k}(l(m, k), m)}{k} = \frac{l(m, k)}{k \log\big( \frac{m}{l(m, k)} + 1 \big)}.   \quad (25)
$$
One can readily check that for any $m \in \mathbb{N}$, $\phi(k, m)$ attains its maximum value at $k = m$. Furthermore $m \mapsto \phi(m, m)$ is monotonically non-decreasing. Thus taking the limit (note that $\lim_{m \to \infty} \frac{1}{m} \log \binom{3m-1}{m} = 3 \log 3 - 2$ and that $\log := \log_2$) we obtain
$$
\lim_{m \to \infty} \phi(m, m) < 6.16615.   \quad (26)
$$
Resubstitution yields $c < 4 \sqrt{6.16615} < 9.9327$, which completes the proof.¹

¹ A (marginally) tighter version of the theorem could be stated by using $\phi$ directly to bound $e_k$: $e_k(S) \le 4\, \phi(m, m)^{1/2} \big( k^{-1} \log( \frac{m}{k} + 1 ) \big)^{1/2} \|S\|$.
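The numerical claims in the proof of Corollary 14 can be checked directly. The sketch below (ours) evaluates $\phi(m, m)$, with $l(m, k) = \lceil \log_2 \binom{2m+k-1}{k} \rceil + 1$ as we read (15) and (25), and confirms the approach to the limit 6.16615 and the resulting constant $4\sqrt{6.16615} < 9.9327$:

```python
import math

def l_mk(m, k):
    """l(m, k) = ceil(log2 binom(2m+k-1, k)) + 1, as we read (15)/(24)."""
    return math.ceil(math.log2(math.comb(2 * m + k - 1, k))) + 1

def phi(k, m):
    """phi(k, m) = l(m, k) / (k log2(m / l(m, k) + 1)), cf. (25)."""
    l = l_mk(m, k)
    return l / (k * math.log2(m / l + 1.0))

for m in (10, 100, 1000, 10000):
    print(m, round(phi(m, m), 5))          # increases towards 6.16615...
print(4 * math.sqrt(6.16615))              # < 9.9327, the constant of the proof
```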
Proof (Lemma 16) The first bound (for $k \le \log m$) follows immediately from the definition of entropy numbers via $e_k(T) \le \|T\|$. The second line of (17) is a direct consequence of Theorem 15. All that remains is the third line: for $k > m$ we factorise $T$ as $T = \mathrm{id} \circ T$ with $\mathrm{id} \colon \ell_\infty^m \to \ell_\infty^m$, and subsequently
$$
e_k(T) \le e_m(T)\; e_{k-m+1}(\mathrm{id} \colon \ell_\infty^m \to \ell_\infty^m).   \quad (27)
$$
By Lemma 9 (the identity has rank $m$),
$$
e_{k-m+1}(\mathrm{id} \colon \ell_\infty^m \to \ell_\infty^m) \le 4 \cdot 2^{-(k-m)/m} = 8 \cdot 2^{-k/m}.
$$
Theorem 15 tells us $e_m(T) \le c \|T\| m^{-1/2}$ (note $\log(\frac{m}{m} + 1) = 1$). Substituting, we are done.

Proof (Theorem 17) We distinguish between the two last bounds of Lemma 16. For $2 \le k \le m$ we can bound
$$
e_k(S_{X^m} \colon \ell_2^m \to \ell_\infty^m) \le c \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/2} \le c \left( k^{-1} \log m \right)^{1/2},   \quad (28)
$$
and hence $e_k(S_{X^m}) \le \epsilon$ as soon as $k \ge c^2 \epsilon^{-2} \log m$. Application of Lemma 11 proves the first inequality of (18).² For the second inequality simply note that the third line of Lemma 16 states (for $k \ge m$)
$$
e_k(S_{X^m} \colon \ell_2^m \to \ell_\infty^m) \le 8 c\, 2^{-k/m} m^{-1/2},   \quad (29)
$$
which is at most $\epsilon$ as soon as $k \ge \frac{m}{2} \log\big( \frac{64 c^2}{m \epsilon^2} \big)$. Rewriting the conditions on $k$ in terms of $\epsilon$ and collecting all remaining terms into the constants $c$ and $c'$ completes the proof.

² A slightly more involved and longer argument gives a better bound: there are constants $c_1, c_2, c_3 > 0$ such that $\log N(\epsilon, F_{2,2}^m, \ell_\infty^m) \le c_1 \epsilon^{-2} \log(c_2\, \epsilon^2 m)$ for $\epsilon \ge c_3/\sqrt{m}$.
Proof (Theorem 19) As before we observe that $\epsilon_k(F_{p,q}^m) = \epsilon_k(S_{X^m})$, and we will factorise the operator $S_{X^m} \colon \ell_p^m \to \ell_\infty^m$ (with $\frac{1}{r} = 1 - \frac{1}{q}$) as
$$
S_{X^m} = T_{X^m} \circ \mathrm{id}, \qquad \mathrm{id} \colon \ell_p^m \to \ell_r^m, \quad T_{X^m} \colon \ell_r^m \to \ell_\infty^m.   \quad (30)
$$
The idea is that the identity operator uses up the "slack" between $p$ and $r$ implicit in the constraint $\frac{1}{p} + \frac{1}{q} \ge 1$. As will be seen, a smaller value of $\epsilon$ is achieved for larger values of $r$. But $r$ can be no larger than $(1 - \frac{1}{q})^{-1}$ in order for us to be able to use Lemma 10. If equality is achieved in the $p, q$ constraint, then id maps $\ell_p^m$ to $\ell_p^m$ and of course nothing (additional) is gained. We will show that $\|T_{X^m}\| \le 1$ and then use Lemma 8 to bound $e_k(S_{X^m})$. The operator $T_{X^m}$ is identical to $S_{X^m}$ except that its domain is $\ell_r^m$; thus $T_{X^m} \colon w \mapsto (w \cdot x_1, \ldots, w \cdot x_m)$. By Lemma 10, since $\|x_i\|_{\ell_q^m} \le 1$ and $\frac{1}{r} + \frac{1}{q} = 1$, we have
$$
\|T_{X^m}\| = \sup_{w \in U_{\ell_r^m}} \max_{i = 1, \ldots, m} |w \cdot x_i| \le 1.
$$
By Lemma 8,
$$
e_{k-1}(S_{X^m}) = e_{k-1}(T_{X^m} \circ \mathrm{id}) \le e_{\lceil k/2 \rceil}(\mathrm{id} \colon \ell_p^m \to \ell_r^m)\; e_{\lceil k/2 \rceil}(T_{X^m}).   \quad (31)
$$
Hence, using Lemmas 11, 31 and 18 (the latter applied in its dual form to $T_{X^m}$; here $q \ge 2$ ensures the requisite type-2 property), we obtain the chain (32)-(37):
$$
e_k(S_{X^m}) \le c \left( \tfrac{2}{k} \log\big( \tfrac{2m}{k} + 1 \big) \right)^{\frac{1}{p} - \frac{1}{r}} \|T_{X^m}\| \left( \tfrac{2}{k} \log\big( \tfrac{2m}{k} + 1 \big) \right)^{\frac{1}{2}}
\le c' \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{\nu}.
$$
Here we used the product inequality for entropy numbers in (33); (35) holds since $\frac{1}{p} - \frac{1}{r} = \frac{1}{p} + \frac{1}{q} - 1$; and moreover there exists a constant $\tilde{c}$ such that $\frac{2}{k} \log(\frac{2m}{k} + 1) \le \tilde{c}\, k^{-1} \log(\frac{m}{k} + 1)$ for $k \ge 1$. Solving for $k$ then immediately gives the bound on the covering numbers.

Proof (Lemma 20) Similarly to the proof of Theorem 19, consider the factorization
$$
S_{X^m} = \tilde{S}_{X^m} \circ \mathrm{id}, \qquad \mathrm{id} \colon \ell_p^M \to \ell_2^M, \quad \tilde{S}_{X^m} \colon \ell_2^M \to \ell_\infty^m,
$$
where $\tilde{S}_{X^m}$ acts like $S_{X^m}$ but on $\ell_2^M$. Exploiting the factorization we can bound
$$
e_{2k+1}(S_{X^m} \colon \ell_p^M \to \ell_\infty^m) \le e_{k+1}(\mathrm{id} \colon \ell_p^M \to \ell_2^M)\; e_{k+1}(\tilde{S}_{X^m}) \le e_k(\mathrm{id} \colon \ell_p^M \to \ell_2^M)\; e_k(\tilde{S}_{X^m} \colon \ell_2^M \to \ell_\infty^m),
$$
which proves (20). Next we have to bound the individual terms separately. For the first factor Lemma 31 can be applied; the dual version of the Maurey-Carl theorem applies to the second term. What remains to be shown is that $\|\tilde{S}_{X^m}\| \le 1$. This, however, follows immediately from Lemma 10, in analogy to the previous section.

Proof (Lemma 22) Let $f \in co_p(A)$ be arbitrary, $f = \sum_{i=1}^N \lambda_i a_i$ with $\sum_{i=1}^N |\lambda_i|^p \le 1$, where $N$ may be infinite. For $i = 1, \ldots, N$, let $\hat{a}_i \in U_\epsilon(A)$ be such that $\|\hat{a}_i - a_i\| \le \epsilon$. (Such $\hat{a}_i$ exist by definition of $U_\epsilon(A)$.) Let $\hat{f} := \sum_{i=1}^N \lambda_i \hat{a}_i$. We have
$$
\|f - \hat{f}\| = \Big\| \sum_{i=1}^N \lambda_i a_i - \sum_{i=1}^N \lambda_i \hat{a}_i \Big\|
\le \sum_{i=1}^N |\lambda_i| \sup_j \|a_j - \hat{a}_j\|
\le \epsilon \sum_{i=1}^N |\lambda_i| \le \epsilon,   \quad (38)
$$
where the last step uses $\sum_i |\lambda_i| \le \sum_i |\lambda_i|^p \le 1$, valid since each $|\lambda_i| \le 1$ and $p \le 1$. Since $\hat{f} \in co_p(U_\epsilon(A))$, the claim follows.
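Lemma 22 is easy to verify empirically: a cover of $A$ induces a cover of $co_p(A)$ at the same scale. A small sketch (ours; the "cover" is built simply by perturbing each point of $A$ by at most $\epsilon$):

```python
import numpy as np

rng = np.random.default_rng(3)
p, eps, n, d = 0.5, 0.1, 40, 5

A = rng.normal(size=(n, d))                        # finite A in the Hilbert space R^d
# A simple eps-cover U of A: one a_hat within eps of each a (||.||_2 <= eps).
U = A + (eps / np.sqrt(d)) * rng.uniform(-1.0, 1.0, size=(n, d))

lam = rng.normal(size=n)
lam = np.sign(lam) * np.abs(lam) / (np.abs(lam) ** p).sum() ** (1.0 / p)
assert np.isclose((np.abs(lam) ** p).sum(), 1.0)   # f lies in co_p(A)

f, f_hat = lam @ A, lam @ U                        # f_hat in co_p(U), as in Lemma 22
assert np.linalg.norm(f - f_hat) <= eps + 1e-12    # ||f - f_hat|| <= eps, i.e. (38)
```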
Proof (Lemma 23) Suppose $V_1$ is an $\epsilon_1$-cover of $co_p(U_{\epsilon_2}(A))$. From Lemma 22, for any $f \in co_p(A)$ there exists $\hat{f} \in co_p(U_{\epsilon_2}(A))$ such that $\|f - \hat{f}\| \le \epsilon_2$. But for any $\hat{f} \in co_p(U_{\epsilon_2}(A))$ there exists $\hat{\hat{f}} \in V_1$ such that $\|\hat{f} - \hat{\hat{f}}\| \le \epsilon_1$. By the triangle inequality,
$$
\|f - \hat{\hat{f}}\| \le \|f - \hat{f}\| + \|\hat{f} - \hat{\hat{f}}\| \le \epsilon_1 + \epsilon_2 \le \delta.
$$
Thus $V_1$ is a $\delta$-cover of $co_p(A)$.

Proof (Lemma 24) Let $x = (x_1, \ldots, x_n) \in U_{\ell_p^n}$. Write $x = \sum_{i=1}^n x_i e_i$. Then $S x = S(\sum_{i=1}^n x_i e_i) = \sum_{i=1}^n x_i S e_i = \sum_{i=1}^n x_i a_i$. Since $x \in U_{\ell_p^n}$, $\sum_{i=1}^n |x_i|^p \le 1$. Thus $S(U_{\ell_p^n}) \subseteq co_p(A)$. Likewise, for any $f \in co_p(A)$, $f = \sum_{i=1}^n x_i a_i$ with $x = (x_1, \ldots, x_n) \in U_{\ell_p^n}$. Thus $co_p(A) \subseteq S(U_{\ell_p^n})$.

Proof (Theorem 25) Let $n := N(\epsilon_2, F)$ and let $\epsilon_{k,n,\epsilon_2} := e_k(co_p(U_{\epsilon_2}(F)))$. Thus we have $\log N(\epsilon_{k,n,\epsilon_2}, co_p(U_{\epsilon_2}(F))) \le k$, and by Lemma 23, if
$$
\epsilon_{k,n,\epsilon_2} + \epsilon_2 \le \delta,   \quad (39)
$$
then $\log N(\delta, co_p(F)) \le k$. Thus
$$
\log N(\delta, co_p(F)) \le \min_{\epsilon_2 \in (0, \delta)} \min\{k \colon \text{(39) holds}\}.   \quad (40)
$$
It remains to bound $\epsilon_{k,n,\epsilon_2}$. Write $U_{\epsilon_2}(F) = \{\hat{a}_1, \ldots, \hat{a}_n\}$ and let $S \colon \ell_p^n \to H$ be the operator of Lemma 24, so that $co_p(U_{\epsilon_2}(F)) = S(U_{\ell_p^n})$. Factorize $S$ as
$$
S = \tilde{S} \circ \mathrm{id}, \qquad \mathrm{id} \colon \ell_p^n \to \ell_1^n, \quad \tilde{S} \colon \ell_1^n \to H.
$$
Furthermore we have that $\|\tilde{S}\| = \max_{i=1,\ldots,n} \|\tilde{S} e_i\| = \max_i \|\hat{a}_i\| \le B$. Thus we obtain, using Lemma 31 for the identity and Corollary 14 (Maurey) for $\tilde{S}$,
$$
e_{2k-1}(S) \le e_k(\mathrm{id} \colon \ell_p^n \to \ell_1^n)\; e_k(\tilde{S})
\le c B \left( k^{-1} \log\big( 1 + \tfrac{n}{k} \big) \right)^{\frac{1}{p} - 1} \left( k^{-1} \log\big( 1 + \tfrac{n}{k} \big) \right)^{\frac{1}{2}}
= c B \left( k^{-1} \log\big( 1 + \tfrac{n}{k} \big) \right)^{\frac{1}{p} - \frac{1}{2}}.   \quad (45)
$$
Combining (40) and (45) concludes the proof.

B Maurey's Theorem

In this appendix we provide a proof of Maurey's theorem for operators $T \colon H \to \ell_\infty^m$ which gives an explicit value for the constant. This is considerably more work, and we get a correspondingly poorer estimate, than in the dual case of Theorem 13 for operators $S \colon \ell_1^m \to H$. The argument of this section is due to Professor Bernd Carl.

Theorem 27 Suppose $m, l \in \mathbb{N}$ and $T$ is a linear operator $T \colon \ell_2^m \to \ell_\infty^m$. Then
$$
e_l(T \colon \ell_2^m \to \ell_\infty^m) \le 102.88\, \|T\| \left( l^{-1} \log\big( \tfrac{m}{l} + 1 \big) \right)^{1/2}.   \quad (46)
$$

We compute the entropy of an operator $T \colon \ell_2^m \to \ell_\infty^m$ by factorizing it as
$$
T = \mathrm{id} \circ T \colon \ell_2^m \to \ell_p^m \to \ell_\infty^m,
$$
where $2 \le p < \infty$ is a free parameter that we will optimize over at the end. The two factors will be dealt with by using the following two propositions, respectively.
Definition 28 ($\ell$-norm) For a linear operator $T \colon \ell_2^m \to X$, define
$$
\ell(T) := \int_{\mathbb{R}^m} \|Tx\|_X \, d\gamma_m(x),
$$
where $\gamma_m$ is the canonical Gaussian measure on $\mathbb{R}^m$.

Proposition 29 For all linear $T \colon \ell_2^m \to X$ and all $k \in \mathbb{N}$, $e_k(T) \le 4\sqrt{2}\, k^{-1/2}\, \ell(T)$.

Proposition 30 (Schütt [23]) For $1 \le p < \infty$ there is a constant $B = B(p)$ such that for all $k \le m$, $e_k(\mathrm{id} \colon \ell_p^m \to \ell_\infty^m) \le B \left( k^{-1} \log( \tfrac{m}{k} + 1 ) \right)^{1/p}$; by (66) below one may take $B \le 2\,(3.70789)^{1/p}$.

The Khinchin inequality [33] states that
$$
\Big( \int_{\mathbb{R}^m} \Big| \sum_{i=1}^m x_i \alpha_i \Big|^p \, d\gamma(x_1, \ldots, x_m) \Big)^{1/p} \le \beta_p \Big( \sum_{i=1}^m |\alpha_i|^2 \Big)^{1/2},
\qquad \text{where } \beta_p = \sqrt{2} \left( \frac{\Gamma((1+p)/2)}{\Gamma(1/2)} \right)^{1/p} \le \sqrt{p}.
$$
Hence, writing $T'$ for the adjoint of $T$ and $(e_k)$ for the canonical basis,
$$
\ell(T) = \int_{\mathbb{R}^m} \|Tx\|_p \, d\gamma(x)
\le \Big( \int_{\mathbb{R}^m} \|Tx\|_p^p \, d\gamma(x) \Big)^{1/p}
= \Big( \sum_{k=1}^m \int_{\mathbb{R}^m} \Big| \sum_{i=1}^m x_i \langle T e_i, e_k \rangle \Big|^p d\gamma(x) \Big)^{1/p}
$$
$$
\le \beta_p \Big( \sum_{k=1}^m \Big( \sum_{i=1}^m |\langle T e_i, e_k \rangle|^2 \Big)^{p/2} \Big)^{1/p}
= \beta_p \Big( \sum_{k=1}^m \|T' e_k\|_2^p \Big)^{1/p}
\le \sqrt{p}\, m^{1/p} \sup_{1 \le k \le m} \|T' e_k\|_2
\le \sqrt{p}\, m^{1/p} \|T' \colon \ell_1^m \to \ell_2^m\|
= \sqrt{p}\, m^{1/p} \|T \colon \ell_2^m \to \ell_\infty^m\|.
$$
By Proposition 29, we get
$$
e_k(T \colon \ell_2^m \to \ell_p^m) \le 4\sqrt{2}\, \sqrt{p}\, m^{1/p}\, k^{-1/2}\, \|T \colon \ell_2^m \to \ell_\infty^m\|.
$$
Next, we combine the obtained bound with Schütt's result, using
$$
e_{2k-1}(T \colon \ell_2^m \to \ell_\infty^m) \le e_k(T \colon \ell_2^m \to \ell_p^m)\; e_k(\mathrm{id} \colon \ell_p^m \to \ell_\infty^m),
$$
to obtain
$$
e_{2k-1}(T \colon \ell_2^m \to \ell_\infty^m) \le 4\sqrt{2}\, \sqrt{p}\, m^{1/p}\, k^{-1/2}\, \|T\| \cdot B \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/p}.
$$
Now choose $p = 2 \log(\frac{m}{k} + 1)$. Observe $p \ge 2$ for $k \le m$. Thus
$$
e_{2k-1}(T \colon \ell_2^m \to \ell_\infty^m) \le 8\, \psi\, B\, \|T\| \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/2},
$$
where $\psi = x^{1/(2 \log(x+1))} (\log(x + 1))^{1/(2 \log(x+1))}$ with $x = m/k$. Note $k \le m$ implies $x \ge 1$. Let $\beta = \log(x + 1)$, so $x = 2^\beta - 1$. Then $\beta \ge 1$ and
$$
\psi = \psi(\beta) = \big( (2^\beta - 1)\, \beta \big)^{1/(2\beta)}.
$$
One can check that $\lim_{\beta \to \infty} \psi(\beta) = \sqrt{2}$ and that $\psi(\beta)$ has a unique maximum on $[1, \infty)$. Computing $\frac{\partial \psi(\beta)}{\partial \beta}$, setting it equal to zero and solving numerically, one finds the maximum occurs for $\beta = 3.66661119696101\ldots$, at which point $\psi(\beta) = 1.66956682\ldots =: C$. Thus
$$
e_{2k-1}(T \colon \ell_2^m \to \ell_\infty^m) \le 8 C B \|T\| \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/2},   \quad (52)
$$
and all that remains is to bound $B$, which we now do. From (66), $B \le 2\,(3.70789)^{1/p}$. But our choice of $p \ge 2$, and hence $B \le 2\,(3.70789)^{1/2} = 3.851176\ldots$. Substituting this value of $B$ into (52) along with the numerical value of $C$, we get
$$
e_{2k-1}(T \colon \ell_2^m \to \ell_\infty^m) \le 51.44\, \|T\| \left( k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/2}.
$$
Noting that $e_{2k}(T) \le e_{2k-1}(T)$, we obtain a statement valid for even as well as odd numbers. To infer a bound on $e_l$, $l \in \mathbb{N}$, we set $l = 2k$, hence $k = l/2$. Then
$$
e_l(T \colon \ell_2^m \to \ell_\infty^m) \le 51.44\, \|T\| \left( 2\, l^{-1} \log\big( \tfrac{2m}{l} + 1 \big) \right)^{1/2}.
$$
Now for $\log(m) \le l \le m$ we have $\log(\frac{2m}{l} + 2) = \log(\frac{m}{l} + 1) + 1 \le 2 \log(\frac{m}{l} + 1)$. Thus for $\log(m) \le l \le m$,
$$
e_l(T \colon \ell_2^m \to \ell_\infty^m) \le 102.88\, \|T\| \left( l^{-1} \log\big( \tfrac{m}{l} + 1 \big) \right)^{1/2},   \quad (53)
$$
which proves Theorem 27.

Conjecture on the Best Value of the Maurey Constant

A value of 102.88 is not particularly satisfying; we believe in fact it is quite loose. Our reasoning is as follows. Consider all operators $T \colon \ell_2^m \to \ell_\infty^m$ such that $\|T\| = 1$. By definition $e_1(T) \le 1$. Observe that the unit ball $U_{\ell_2^m}$ is the ellipsoid of maximum volume contained inside $U_{\ell_\infty^m}$. Since
$$
\mathrm{vol}(T(U_{\ell_2^m})) = \lim_{\epsilon \to 0} N(\epsilon, T(U_{\ell_2^m}))\, \mathrm{vol}(\epsilon\, U_{\ell_\infty^m}) = \lim_{n \to \infty} n\, \mathrm{vol}(\epsilon_n(T(U_{\ell_2^m}))\, U_{\ell_\infty^m}),
$$
where $N(\epsilon, S)$ is the covering number of $S$, and $\mathrm{vol}(T(U_{\ell_2^m}))$ is maximized over all $T$ with $\|T\| = 1$ by choosing $T = \mathrm{id}$, we have for $\|T\| = 1$
$$
e_1(T) \le e_1(\mathrm{id}) \qquad \text{and (in the volume-asymptotics sense above)} \qquad \lim_{n \to \infty} e_n(T) \le \lim_{n \to \infty} e_n(\mathrm{id}).
$$
We conjecture that for all $n \in \mathbb{N}$ and all $T$ with $\|T\| = 1$,
$$
e_n(T \colon \ell_2^m \to \ell_\infty^m) \le e_n(\mathrm{id} \colon \ell_2^m \to \ell_\infty^m).
$$
If this were true, by (55) the Maurey constant would be 1.86.
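The constants appearing in the proof of Theorem 27 can be reproduced numerically. The sketch below (ours) maximizes $\psi(\beta) = ((2^\beta - 1)\beta)^{1/(2\beta)}$ by grid search and recomputes the constants of (52) and (53):

```python
import numpy as np

def psi(beta):
    """psi(beta) = ((2^beta - 1) beta)^(1 / (2 beta)), from the proof of Theorem 27."""
    return ((2.0 ** beta - 1.0) * beta) ** (1.0 / (2.0 * beta))

betas = np.linspace(1.0, 20.0, 1_900_001)
i = np.argmax(psi(betas))
C = psi(betas[i])
print(betas[i], C)                  # ~ 3.66661 and C ~ 1.6695668
B = 2.0 * np.sqrt(3.70789)          # bound on Schuett's constant for p >= 2, via (66)
print(8.0 * C * B)                  # ~ 51.44, the constant of (52)
print(2.0 * 8.0 * C * B)            # ~ 102.88, the constant of Theorem 27 / (53)
print(psi(500.0), np.sqrt(2.0))     # psi(beta) -> sqrt(2) as beta -> infinity
```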
C Bounds on $e_k(\mathrm{id} \colon \ell_{p_1}^n \to \ell_{p_2}^n)$

We now determine bounds on $e_{k,p_1,p_2}^n := e_k(\mathrm{id}_{p_1,p_2}^n)$. These have been given in [29, 4.10.3], [18, page 172], [14, 3.c.8], [23], and [9, p.141] (see also [8, page 101]). All but the last two references only considered $p_1 \ge 1$. The most recent contribution, by Edmunds and Triebel [9, page 141], [8, page 101], subsumes all of the others and is summarized in the lemma below. For $p_1 \ge 1$, by an argument of Schütt [23], it is asymptotically optimal in $n$ and $k$.

Lemma 31 Let $0 < p_1 < p_2 \le \infty$. Then there are constants $c, c' > 0$ depending only on $p_1$ and $p_2$ such that for all $k, n \in \mathbb{N}$,
$$
e_k(\mathrm{id} \colon \ell_{p_1}^n \to \ell_{p_2}^n) \le c' \cdot
\begin{cases}
1 & \text{if } 1 \le k \le \log n \\
\left( k^{-1} \log\big( \tfrac{n}{k} + 1 \big) \right)^{\frac{1}{p_1} - \frac{1}{p_2}} & \text{if } \log n \le k \le n \\
2^{-k/n}\, n^{\frac{1}{p_2} - \frac{1}{p_1}} & \text{if } k \ge n,
\end{cases}
$$
with a matching lower bound in which $c'$ is replaced by $c$.

C.1 When $p_1 \ge 1$

It is of interest to have explicit constants. The following covering construction provides them.

Lemma 32 Let $0 < p \le 1$ and $m \in \mathbb{N}$. For $k \in \{1, \ldots, m\}$ set $\epsilon = k^{-1/p}$ and
$$
\phi_p^m(k) := \big( k(\tfrac{1}{\epsilon} + 1) + 1 \big) \binom{m}{k}.   \quad (57)
$$
Then
$$
\epsilon_{\phi_p^m(k)}(\mathrm{id} \colon \ell_p^m \to \ell_\infty^m) \le k^{-1/p}.   \quad (58)
$$

Proof Fix $\epsilon > 0$. For a given number of dimensions $k$, we determine the smallest $\epsilon$ so that $U_p^k$ can be covered by
$$
S_p^k := \bigcup_{i=1}^k \bigcup_{j = -\lceil 1/(2\epsilon) \rceil}^{\lceil 1/(2\epsilon) \rceil} \big( B_\infty(\epsilon, k) + 2 j \epsilon\, e_i \big),   \quad (56)
$$
where $e_i$ is the $i$th canonical basis vector and $B_\infty(\epsilon, k)$ is the $\ell_\infty^k$ ball of radius $\epsilon$. For $x \in U_p^k$ we have $\sum_{i=1}^k x_i^p \le 1 \Rightarrow k\, x_{\min}^p \le 1 \Rightarrow x_{\min} \le k^{-1/p}$. Setting $x_{\min} = \epsilon$ (the radius of $B_\infty(\epsilon, k)$) gives $\epsilon^p \ge 1/k \Leftrightarrow k \ge \epsilon^{-p}$. We will now set $k = \lceil \epsilon^{-p} \rceil$. Along each of the $k$ axes of $U_p^k$ we have used $(\lceil 1/(2\epsilon) \rceil - \lfloor -1/(2\epsilon) \rfloor + 1)$ cubes; we have added and subtracted one here so that we do not multiply count the box that sits at the very center of $U_p^k$. Therefore along all $k$ axes we have
$$
k(2 \lceil \tfrac{1}{2\epsilon} \rceil - 1) + 1 \le k(2(\tfrac{1}{2\epsilon} + 1) - 1) + 1 = k(\tfrac{1}{\epsilon} + 1) + 1
$$
balls; thus $|S_p^k| \le k(\frac{1}{\epsilon} + 1) + 1$. Observe that, by the construction of $k$, any point in $U_p^m \setminus U_p^k$ which is not covered by $S_p^k$ must lie within an $(m-k)$-dimensional $\epsilon$-ball on one of the $(m-k)$ principal axes of $U_p^m$ not contained in $U_p^k$. Thus, by separately covering all possible choices of $k$ axes in the manner above, we cover $U_p^m$. Since there are $\binom{m}{k}$ ways to choose the $k$ axes, we have that $\phi_p^m(k) := (k(\frac{1}{\epsilon} + 1) + 1)\binom{m}{k}$ $\epsilon$-$\ell_\infty^m$-balls cover $U_p^m$, where $\epsilon = k^{-1/p}$; in other words (58) holds.

Numerically evaluating the resulting bound for $p = 1$ yields, for all $k \le m$,
$$
e_k(\mathrm{id} \colon \ell_1^m \to \ell_\infty^m) \le 3.70789\; k^{-1} \log\big( \tfrac{m}{k} + 1 \big),   \quad (54)
$$
and a corresponding computation for the Euclidean ball gives
$$
e_l(\mathrm{id} \colon \ell_2^m \to \ell_\infty^m) \le \left( 3.45944677\; l^{-1} \log\big( \tfrac{m}{l} + 1 \big) \right)^{1/2} = 1.85996 \left( l^{-1} \log\big( \tfrac{m}{l} + 1 \big) \right)^{1/2}.   \quad (55)
$$

The following interpolation lemma follows immediately from [18, p.173].

Lemma 33 Let $1 \le p_1 \le p_2 \le \infty$. Then for all $k \in \mathbb{N}$,
$$
e_k(\mathrm{id}_{1,p_2}^m) \le 2\, e_k(\mathrm{id}_{1,\infty}^m)^{1 - 1/p_2},   \quad (63)
$$
$$
e_k(\mathrm{id}_{p_1,\infty}^m) \le 2\, e_k(\mathrm{id}_{1,\infty}^m)^{1/p_1},   \quad (64)
$$
$$
e_k(\mathrm{id}_{p_1,p_2}^m) \le 4\, e_k(\mathrm{id}_{1,\infty}^m)^{1/p_1 - 1/p_2}.   \quad (65)
$$

Combining this lemma with (54) gives, for $p \ge 1$,
$$
e_k(\mathrm{id}_{p,\infty}^m) \le 2\, e_k(\mathrm{id}_{1,\infty}^m)^{1/p} \le 2 \left( 3.70789\; k^{-1} \log\big( \tfrac{m}{k} + 1 \big) \right)^{1/p}.   \quad (66)
$$

C.2 When $p_1 < 1$

Lemma 31 and the proof of Lemma 32 suggest that a similar form of the result in Lemma 32 should be obtainable when $p_1 < 1$. However, it turns out that for $p_1 < 1$ the value of the constant obtained is quite unsatisfactory. This is because its value is dominated by the behaviour of $\phi(p_1, k, n)$ for very small $k$ and $n$ ($(k, n) = (3, 3)$). For learning applications these are uninteresting values of $k$ and $n$. Furthermore, for some applications we are actually interested in how $e_{k,p_1,\infty}^n$ behaves as a function of $p_1$ for fixed $n$; for example, if we wanted to use $p_1$ as a "capacity control knob".
In that case it is necessary to determine the dependence of the constant on $p_1$ explicitly. In doing so it turns out that in the case of $p_1 < 1$ one has to pay too high a price for the elegance of an expression of the form (54), and we end up being better served by an implicit formula, which nevertheless can be easily computed. This implicit formula does exhibit the expected behaviour in $p$ for fixed $n$. Setting $j := k^{1/p}$ (so that the radius in the construction of Lemma 32 is $1/j$), we have from (58) that
$$
e_{\bar{\phi}_p^n(j^p) + 1}(\mathrm{id} \colon \ell_p^n \to \ell_\infty^n) \le \frac{1}{2j},   \quad (67)
$$
where $\bar{\phi}_p^n(k) := \lceil \log \phi_p^n(k) \rceil$. For a given $k$ we can easily determine $e_k$ numerically: let $j_0 := \max\{j \in \mathbb{N} \colon \bar{\phi}_p^n(j^p) + 1 \le k\}$. Such a $j_0$ is unique. Then $e_k \le \frac{1}{2 j_0}$. Using this method one can plot $k = k(p, \epsilon) = \log N_m(\epsilon, F_{p,\infty}^n)$ as a function of $p^{-1}$ for various $\epsilon$. As one would expect, the log covering numbers decrease with decreasing $p$.
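The implicit formula (67) is straightforward to compute. The sketch below (ours) implements $\phi_p^m$ from (56)-(57) as we have reconstructed it, together with the numerical determination of $e_k$ just described; treat the exact constants as illustrative.

```python
import math

def phi(p, m, eps):
    """|cover| from (56)-(57): k = ceil(eps^-p) active axes, about
    k(1/eps + 1) + 1 centers per choice of axes, binom(m, k) choices."""
    k = math.ceil(eps ** -p)
    if k > m:
        return None                                  # construction needs k <= m
    return (k * (1.0 / eps + 1.0) + 1.0) * math.comb(m, k)

def entropy_number_bound(p, n, k):
    """e_k(id: l_p^n -> l_inf^n) via (67): the largest j with
    ceil(log2 phi(p, n, 1/j)) + 1 <= k gives e_k <= 1 / (2 j)."""
    j, best = 1, None
    while True:
        size = phi(p, n, 1.0 / j)
        if size is None or math.ceil(math.log2(size)) + 1 > k:
            return None if best is None else 1.0 / (2.0 * best)
        best, j = j, j + 1

n = 100
for p in (1.0, 0.75, 0.5):
    print(p, [entropy_number_bound(p, n, k) for k in (10, 20, 40)])
# For fixed n and k the bound decreases with decreasing p, as stated above.
```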
D p-Convex Hull of Heavisides

We take the following definitions from [15]. Suppose $I \subseteq \mathbb{R}$ and $f \colon I \to \mathbb{R}$. For $\lambda > 0$,
$$
V_\lambda(f, I) := \sup \sum_{i=1}^n |f(b_i) - f(a_i)|^\lambda,
$$
where $\{[a_i, b_i]\}_{i=1}^n$ is an arbitrary finite system of non-overlapping intervals with $a_i, b_i \in I$ for $i = 1, \ldots, n$. Suppose $f$ is continuous on $[a, b]$. Let $G$ be the union of all open subintervals on which $f$ is either strictly monotonic or constant. Then
$$
K = K_f = [a, b] \setminus G
$$
is the set of points of varying monotonicity. If $f$ is not continuous everywhere, we let $D$ denote the set of points of discontinuity and set $\bar{K} = K \cup D$. If $\lambda > 0$, $f$ is said to be of bounded $\lambda$-variation, written $f \in CBV_\lambda$, if $V_\lambda(f, \bar{K}) < \infty$. If $V_\lambda(f, \bar{K}) \le M$ we say $f \in CBV_\lambda(M)$. Let $H = \{H(x - \alpha) \colon \alpha \in [a, b]\}$, where $H(x)$ is the Heaviside (step) function.

Lemma 34 For any $N \in \mathbb{N}$ and $0 < \lambda \le 1$, $co_\lambda^N(H) \subseteq CBV_\lambda(1)$. Hence
$$
\bigcup_{N=1}^\infty co_\lambda^N(H) \subseteq CBV_\lambda(1).
$$

Proof For any $f \in co_\lambda^N(H)$, we can write $f$ as
$$
f(x) = \sum_{i=1}^N \lambda_i H(x - \alpha_i), \qquad \sum_{i=1}^N |\lambda_i|^\lambda \le 1,
$$
where we will assume that $\alpha_1 < \alpha_2 < \cdots < \alpha_N$. Observe that $\bar{K}_f \subseteq \{\alpha_i \colon i = 1, \ldots, N\}$. Thus
$$
V_\lambda(f, \bar{K}_f) \le \sum_{k=2}^N \Big| \sum_{i=1}^k \lambda_i H(\alpha_k - \alpha_i) - \sum_{i=1}^{k-1} \lambda_i H(\alpha_{k-1} - \alpha_i) \Big|^\lambda = \sum_{k=2}^N |\lambda_k|^\lambda \le 1.
$$
(For related results see [17] and references therein.)

The smallness of $CBV_\lambda$ is well illustrated by the following theorem from [15]. For $0 < s \le 1$ define
$$
\mathrm{Lip}(s) = \{f \colon [a, b] \to \mathbb{R} \colon \exists K > 0,\ \forall x, y \in [a, b],\ |f(y) - f(x)| \le K |x - y|^s\},
$$
and for $k \in \mathbb{N}$ and $k < s \le k + 1$,
$$
\mathrm{Lip}(s) = \{f \colon f \text{ is } k\text{-times differentiable and } f^{(k)} \in \mathrm{Lip}(s - k)\}.
$$

Theorem 36 (Laczkovich and Preiss) Let $s, \lambda > 0$ with $s = 1/\lambda$. Then $f \in CBV_\lambda$ if and only if there exists a homeomorphism $g$ of $[a, b]$ onto itself such that $f \circ g \in \mathrm{Lip}(s)$.

As a way of illustrating the "size" of $CBV_\lambda$, in a fashion that gives some additional intuition for our entropy number results determined directly, we now compute the fat-shattering dimension of $CBV_\lambda(1)$.

Proposition 35 Suppose $0 < \lambda \le 1$ and $0 < \gamma < 1$. Then
$$
\mathrm{Fat}_{CBV_\lambda(1)}(\gamma) \le 3\, (2\gamma)^{-\lambda}.
$$

Proof Suppose $\{f_1, \ldots, f_{2^m}\} \subseteq CBV_\lambda(1)$ $\gamma$-shatter the points $(x_1, \ldots, x_m) \subseteq [a, b]$ with respect to $(r_1, \ldots, r_m)$. We will show that $m \le 3\, (2\gamma)^{-\lambda}$. There must always be a sign assignment $b \in \{-1, 1\}^m$ (an alternating one) which is realized w.r.t. $(r_1, \ldots, r_m)$ by some $f_i$ such that $|f_i(x_j) - f_i(x_{j-1})| \ge 2\gamma$ for $j = 2, \ldots, m$. Thus, for $m > 1$,
$$
\sum_{j=2}^m |f_i(x_j) - f_i(x_{j-1})|^\lambda \ge (m - 1)(2\gamma)^\lambda.
$$
But by hypothesis $V_\lambda(f_i, \bar{K}_{f_i}) \le 1$, and so $(m - 1)(2\gamma)^\lambda \le 1$. Thus
$$
m \le (2\gamma)^{-\lambda} + 1 \le 3\, (2\gamma)^{-\lambda},
$$
the last step holding since $(2\gamma)^{-\lambda} \ge 2^{-\lambda} \ge \tfrac{1}{2}$ for $\gamma < 1$.