Localized Complexities for Transductive Learning

Ilya Tolstikhin¹, Gilles Blanchard², Marius Kloft³
¹ Russian Academy of Sciences, ² University of Potsdam, ³ Humboldt University of Berlin

COLT 2014
Contents:
1. New concentration inequalities for sampling without replacement
2. Application to transductive learning
Concentration inequalities

- Function of many random variables: $Q = g(X_1, \dots, X_n)$
- We want to control the random fluctuations of $Q$ around $\mathbb{E}[Q]$

We aim to show high-probability upper bounds on $Q - \mathbb{E}[Q]$ and/or $\mathbb{E}[Q] - Q$.

Independent random variables
The case when $X_1, \dots, X_n$ are independent has been very well studied and many useful results are available [Boucheron et al., 2013].
Concentration inequalities: independent random variables

Consider independent random variables $X_1, \dots, X_n$ bounded in $[0, 1]$:
$$S_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
Then for any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:

Hoeffding's inequality:
$$|S_n - \mathbb{E}[S_n]| \le \sqrt{\frac{\log(2/\delta)}{2n}};$$

Bernstein's inequality:
$$|S_n - \mathbb{E}[S_n]| \le \sqrt{\frac{2\sigma^2 \log(2/\delta)}{n}} + \frac{2\log(2/\delta)}{3n},$$
where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{Var}[X_i]$.

Message: small variance leads to better convergence rates.
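As a quick sanity check of this message, here is a minimal simulation sketch (my own illustration, not part of the talk): for i.i.d. Bernoulli(p) variables with a small p, the Bernstein bound tracks the observed deviations far more tightly than the Hoeffding bound. All parameter values below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta, trials = 1000, 0.01, 0.05, 10_000

# X_i in {0, 1} subset of [0, 1], with small variance p(1 - p)
X = rng.binomial(1, p, size=(trials, n))
deviations = np.abs(X.mean(axis=1) - p)

hoeffding = np.sqrt(np.log(2 / delta) / (2 * n))
sigma2 = p * (1 - p)  # Var[X_i], known in this toy example
bernstein = np.sqrt(2 * sigma2 * np.log(2 / delta) / n) + 2 * np.log(2 / delta) / (3 * n)

print("empirical (1 - delta)-quantile of |S_n - E[S_n]| :", np.quantile(deviations, 1 - delta))
print("Hoeffding bound                                  :", hoeffding)
print("Bernstein bound (much smaller for small sigma^2) :", bernstein)
```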
Concentration inequalities: independent random variables

Now consider an i.i.d. sequence of random variables $X_1, \dots, X_n$ taking values in $\mathcal{X}$. Let $\mathcal{F}$ be a countable class of bounded functions $f : \mathcal{X} \to [-1, 1]$ such that $\mathbb{E}[f(X_1)] = 0$. Consider the supremum of the empirical process:
$$Q_n = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(X_i).$$
Then for any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:

McDiarmid's inequality:
$$|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\log(2/\delta)}{n}};$$

Talagrand's inequality (version due to [O. Bousquet, 2002]):
$$Q_n - \mathbb{E}[Q_n] \le \sqrt{\frac{2v\log(1/\delta)}{n}} + \frac{2\log(1/\delta)}{3n},$$
where $v = \sup_{f \in \mathcal{F}} \mathrm{Var}[f(X_1)] + 2\mathbb{E}[Q_n]$.
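To make $Q_n$ concrete, here is a small Monte Carlo sketch (my own toy example, not from the talk) using the finite class $\mathcal{F} = \{f_t(x) = \mathbf{1}\{x \le t\} - t\}$ of centred, $[-1, 1]$-bounded functions for $X \sim \mathrm{Uniform}[0, 1]$; it estimates $\mathbb{E}[Q_n]$ and the size of the fluctuations that the two inequalities above control.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, delta = 500, 2000, 0.05
thresholds = np.linspace(0.1, 0.9, 17)  # indexes the finite class F

def Q_n(x):
    # f_t(x) = 1{x <= t} - t is centred for X ~ Uniform[0, 1] and takes values in [-1, 1];
    # Q_n is the supremum over t of the empirical mean of f_t.
    return max(np.mean((x <= t) - t) for t in thresholds)

samples = rng.uniform(0, 1, size=(trials, n))
Q = np.array([Q_n(x) for x in samples])
print("E[Q_n]           ~", Q.mean())
print("std of Q_n       ~", Q.std())
print("McDiarmid radius :", np.sqrt(2 * np.log(2 / delta) / n))
```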
Sampling without replacement

Now let $Z_1, \dots, Z_n$ be sampled uniformly without replacement from a given finite set $C = \{c_1, \dots, c_N\}$ with $N \ge n$.

Note: $Z_1, \dots, Z_n$ are not independent.

Motivation:
- Cross-validation procedures;
- Transductive learning;
- Randomized sequential algorithms (SGD, ...);
- Matrix completion;
- Low-rank matrix factorization (collaborative filtering, ...);
- ...
Sampling without replacement: previous results

$$S_n = \frac{1}{n}\sum_{i=1}^n Z_i.$$

[Hoeffding, 1963]: Hoeffding's and Bernstein's inequalities also hold in this setting.

[Serfling, 1974]: Moreover, for all $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$|S_n - \mathbb{E}[S_n]| \le \sqrt{\frac{N - n + 1}{N}\cdot\frac{\log(2/\delta)}{2n}}.$$

[Bardenet and Maillard, 2013]: Bernstein's inequality can be tightened in the same manner.

Message: things are more concentrated when random variables are sampled without replacement!
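The following simulation sketch (my own illustration; the population and the values of N and n are arbitrary) compares the sample mean of a fixed finite population drawn with and without replacement, together with the Hoeffding and Serfling bounds; Serfling's factor $(N - n + 1)/N$ is most visible when n is close to N.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, delta, trials = 200, 150, 0.05, 5000
C = rng.uniform(0, 1, size=N)  # fixed population, values in [0, 1]

with_repl = C[rng.integers(0, N, size=(trials, n))].mean(axis=1)
without_repl = np.array([rng.choice(C, size=n, replace=False).mean() for _ in range(trials)])

hoeffding = np.sqrt(np.log(2 / delta) / (2 * n))
serfling = np.sqrt((N - n + 1) / N * np.log(2 / delta) / (2 * n))

print("std with replacement    :", with_repl.std())
print("std without replacement :", without_repl.std())  # visibly smaller for n close to N
print("Hoeffding bound         :", hoeffding)
print("Serfling bound          :", serfling)             # tighter by sqrt((N - n + 1) / N)
```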
Sampling without replacement: previous results

$$Q_n = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(Z_i).$$

[El-Yaniv and Pechyony, 2009; Cortes et al., 2009]: for any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{N - n}{N - 1/2}\cdot\frac{1}{\Delta(n, N)}\cdot\frac{2\log(2/\delta)}{n}},$$
where $\Delta(n, N) = 1 - \frac{1}{2\max\{n, N - n\}} \approx 1$.

This inequality is a (tighter) version of McDiarmid's inequality.

Problem: there is no version of Talagrand's concentration inequality for sampling without replacement!
Our results

Let $X_1, \dots, X_n$ and $Z_1, \dots, Z_n$ be sampled with and without replacement, respectively, from $C = \{c_1, \dots, c_N\}$. Consider:
$$Q_n^{\mathrm{iid}} = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(X_i), \qquad Q_n = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(Z_i), \qquad \sigma_{\mathcal{F}}^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}[f(X_1)].$$

Theorem. For any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\sigma_{\mathcal{F}}^2 N \log(2/\delta)}{n^2}}.$$

Theorem. For any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$Q_n - \mathbb{E}[Q_n^{\mathrm{iid}}] \le \sqrt{\frac{2\big(\sigma_{\mathcal{F}}^2 + 2\mathbb{E}[Q_n^{\mathrm{iid}}]\big)\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{3n}.$$
Our results: discussion

(Old): $|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\log(2/\delta)}{n}\cdot\frac{N - n}{N - 1/2}}$;
(New 1): $|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\sigma_{\mathcal{F}}^2 N \log(2/\delta)}{n^2}}$;
(New 2): $Q_n - \mathbb{E}[Q_n^{\mathrm{iid}}] \le \sqrt{\frac{2(\sigma_{\mathcal{F}}^2 + 2\mathbb{E}[Q_n^{\mathrm{iid}}])\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{3n}$.

- (Old) does not account for the variance (Hoeffding-type);
- If $n = o(N)$, (Old) and (New 2) can outperform (New 1);
- If $n = \Omega(N)$, (New 1) outperforms (Old) for $\sigma_{\mathcal{F}}^2 \le 1/16$;
- The comparison between (New 2) and (Old) depends on $\sigma_{\mathcal{F}}^2$ and $\mathbb{E}[Q_n^{\mathrm{iid}}]$;
- $0 \le \mathbb{E}[Q_n^{\mathrm{iid}}] - \mathbb{E}[Q_n] \le 2n^3/N$.

Summary: (New 2) stays informative in all regimes of $N$ and $n$; (New 1) can give better results (at least for $n = \Omega(N)$).
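Below is a short script (my own sketch) that evaluates (Old), (New 1) and (New 2) as written above over a range of n. Note that the three bounds are not perfectly comparable: (New 2) is one-sided and centred at $\mathbb{E}[Q_n^{\mathrm{iid}}]$ rather than $\mathbb{E}[Q_n]$, and the chosen values of $\sigma_{\mathcal{F}}^2$ and $\mathbb{E}[Q_n^{\mathrm{iid}}]$ are illustrative only.

```python
import numpy as np

def old_bound(n, N, delta):
    # (Old): Hoeffding-type, no variance term
    return np.sqrt(2 * np.log(2 / delta) / n * (N - n) / (N - 0.5))

def new1_bound(n, N, delta, sigma2):
    # (New 1): variance-dependent, two-sided
    return np.sqrt(2 * sigma2 * N * np.log(2 / delta) / n**2)

def new2_bound(n, delta, sigma2, EQ_iid):
    # (New 2): Talagrand-type, one-sided and centred at E[Q_n^iid]
    return np.sqrt(2 * (sigma2 + 2 * EQ_iid) * np.log(1 / delta) / n) + np.log(1 / delta) / (3 * n)

N, delta, sigma2, EQ_iid = 10_000, 0.05, 0.01, 0.01  # illustrative values only
print("   n      (Old)     (New 1)    (New 2)")
for n in (100, 1000, 5000, 9000):
    print(f"{n:5d}  {old_bound(n, N, delta):.5f}  "
          f"{new1_bound(n, N, delta, sigma2):.5f}  {new2_bound(n, delta, sigma2, EQ_iid):.5f}")
```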
Contents:
1. New concentration inequalities for sampling without replacement
2. Application to transductive learning
Transductive learning: setting and notations

Deterministic agnostic setting:
- Finite instance space $X^N = \{X_1, \dots, X_N\} \subset \mathcal{X}$ and output space $\mathcal{Y}$;
- Class $\mathcal{H}$ of predictors $h : \mathcal{X} \to \mathcal{Y}$;
- Labelling function $\varphi : \mathcal{X} \to \mathcal{Y}$ (not necessarily in $\mathcal{H}$).

1. Sample $n \le N$ inputs $X^n \subseteq X^N$ uniformly without replacement;
2. Obtain outputs $Y^n$ for $X^n$ by applying the function $\varphi : \mathcal{X} \to \mathcal{Y}$;
3. Reveal the training set $S_n = (X^n, Y^n)$ and the $u = N - n$ test inputs $X^u$.
Transductive learning: setting and notations

Goal of the learner: based on $S_n$ and $X^u$, find a predictor in the hypothesis class $\mathcal{H}$ with minimal test error
$$L_u(h) = \frac{1}{u}\sum_{X \in X^u} \underbrace{\ell\big(h(X), \varphi(X)\big)}_{\ell_h(X)}$$
for an arbitrary bounded loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$.

- $L_N(h)$ and $\hat{L}_n(h)$ are the losses on $X^N$ and $X^n$ respectively;
- $\hat{h}_n$, $h^*_u$ and $h^*_N$ minimize $\hat{L}_n(h)$, $L_u(h)$ and $L_N(h)$ respectively;
- Excess loss: $\mathcal{E}(\hat{h}_n) = L_u(\hat{h}_n) - L_u(h^*_u)$.

Our goal: obtain tight high-probability upper bounds on $\mathcal{E}(\hat{h}_n)$.
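As a concrete illustration of steps 1-3 and of the quantities $\hat{L}_n$ and $L_u$, here is a toy sketch of the transductive protocol (the data, the nearest-centroid "learner" and all parameters are my own choices, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 1000, 300
X_full = rng.normal(size=(N, 2))                     # fixed instance set X^N
phi = (X_full[:, 0] + X_full[:, 1] > 0).astype(int)  # labelling function phi on X^N

train_idx = rng.choice(N, size=n, replace=False)     # step 1: sample X^n without replacement
test_idx = np.setdiff1d(np.arange(N), train_idx)     # X^u with u = N - n

# Steps 2-3: reveal S_n = (X^n, Y^n); the "learner" here is a nearest-centroid rule.
centroids = np.stack([X_full[train_idx][phi[train_idx] == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X_full[:, None, :] - centroids[None]) ** 2).sum(axis=-1), axis=1)

ell = (pred != phi).astype(float)                    # binary loss ell_h(X) on all of X^N
print("training error (L_n hat):", ell[train_idx].mean())
print("test error     (L_u)    :", ell[test_idx].mean())
```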
Transductive learning: previous results

- [Vapnik, 1982; Blum and Langford, 2003] present implicit bounds for the binary loss function;
- [Cortes and Mohri, 2006] obtain bounds of order $\sqrt{\hat{L}_n(\hat{h}_n)\frac{\log(n + u)}{n}}$ for regression with the quadratic loss;
- [Blum and Langford, 2003; Derbeko et al., 2004] give PAC-Bayesian bounds for transductive learning, which crucially depend on the prior;
- [El-Yaniv and Pechyony, 2006; Cortes et al., 2009] give bounds of order $n^{-1/2}$ for binary and quadratic loss functions based on algorithmic stability;
- [El-Yaniv and Pechyony, 2009] give bounds of order $n^{-1/2}$ for bounded loss functions based on global Rademacher complexities;
- [Blum and Langford, 2003] provide bounds of order $n^{-1}$ in the realizable setting (when $\varphi \in \mathcal{H}$) for the binary loss function.

Message: all bounds have the "slow" rate $O(n^{-1/2})$ under general assumptions.
Localized complexities and fast rates in the inductive setting

The inductive setting assumes that $S_n$ is sampled i.i.d. from an unknown $P$ on $\mathcal{X} \times \mathcal{Y}$.

The classic VC approach deals with uniform deviations
$$\sup_{h \in \mathcal{H}} \Big( L_N(h) - \hat{L}_n(h) \Big)$$
and provides bounds with the slow rate of $O(n^{-1/2})$.

Localized approach [Massart, 2000; Bartlett et al., 2005; Koltchinskii, 2006]: this is overpessimistic and we should study local fluctuations
$$\sup_{h \in \mathcal{H}_0} \Big( L_N(h) - \hat{L}_n(h) \Big),$$
where $\mathcal{H}_0 \subseteq \mathcal{H}$ contains functions with small variances. This often leads to fast rates of $o(n^{-1/2})$ (e.g. under Tsybakov's low-noise conditions).

Important: the localized approach is based on Talagrand's inequality.
Our results

Let $\hat{L}_n^{\mathrm{iid}}(h) = \frac{1}{n}\sum_{i=1}^n \ell_h(Z_i)$, where $Z_1, \dots, Z_n$ are sampled i.i.d. from $X^N$.

Consider the local neighbourhood of $h^*_N$ in $\mathcal{H}$:
$$\mathcal{H}(r) = \Big\{ h \in \mathcal{H} : \mathbb{E}\big[\big(\ell_h(X) - \ell_{h^*_N}(X)\big)^2\big] \le r \Big\}.$$

Theorem. Assume that there is a constant $B > 0$ such that for every $h \in \mathcal{H}$:
$$\mathbb{E}\big[\big(\ell_h(X) - \ell_{h^*_N}(X)\big)^2\big] \le B\cdot\big(L_N(h) - L_N(h^*_N)\big).$$
Assume that there is a sub-root function $\psi_n(r)$ such that
$$B\cdot\mathbb{E}\bigg[\sup_{h \in \mathcal{H}(r)} \Big( L_N(h) - \hat{L}_n^{\mathrm{iid}}(h) - \big(L_N(h^*_N) - \hat{L}_n^{\mathrm{iid}}(h^*_N)\big) \Big)\bigg] \le \psi_n(r).$$
Let $r_n^*$ be a fixed point of $\psi_n(r)$. Then with probability greater than $1 - \delta$:
$$L_N(\hat{h}_n) - L_N(h^*_N) \le 901\,\frac{r_n^*}{B} + (16 + 25B)\,\frac{\log(1/\delta)}{3n} =: \Delta_n(\delta).$$
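The fixed point $r_n^*$ can be computed numerically by iterating $r \leftarrow \psi_n(r)$, since a sub-root function is non-decreasing and $\psi_n(r)/\sqrt{r}$ is non-increasing. The parametric form of $\psi_n$ below is only a typical illustrative shape (my assumption, not the talk's), chosen to show that $r_n^*$ can decay like $1/n$, i.e. faster than $n^{-1/2}$:

```python
import math

def psi_n(r, n, d=10.0, c=2.0):
    # A common sub-root shape: psi_n(r) = c * sqrt(d * r / n) + d / n (illustrative only).
    return c * math.sqrt(d * r / n) + d / n

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    # Simple fixed-point iteration r <- psi(r); converges for sub-root psi.
    r = r0
    for _ in range(max_iter):
        r_new = psi(r)
        if abs(r_new - r) < tol:
            break
        r = r_new
    return r

for n in (100, 1000, 10_000):
    r_star = fixed_point(lambda r: psi_n(r, n))
    print(n, r_star)  # decays roughly like d / n, i.e. faster than n^(-1/2)
```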
Our results

Theorem. Under the assumptions of the previous theorem, with probability greater than $1 - \delta$:
$$L_u(\hat{h}_n) - L_u(h^*_u) \le \frac{N}{u}\big(\Delta_n(\delta) + \Delta_u(\delta)\big), \qquad \Delta_n(\delta) \sim r_n^* + n^{-1}.$$

The condition
$$\mathbb{E}\big[\big(\ell_h(X) - \ell_{h^*_N}(X)\big)^2\big] \le B\cdot\big(L_N(h) - L_N(h^*_N)\big)$$
is satisfied for many problems, including:
- the quadratic loss and a uniformly bounded convex class $\mathcal{H}$;
- the binary loss and a class $\mathcal{H}$ with finite VC-dimension, if $\varphi \in \mathcal{H}$.

For many interesting situations $r_n^*$ is of order $o(n^{-1/2})$:
- [Massart, 2000] binary loss and VC-classes: $r_n^* \sim \frac{\mathrm{VC}(\mathcal{H})\log n}{n}$;
- [Mendelson, 2003] balls in an RKHS and Lipschitz losses.
Thank you for your attention!

Many open questions:
- Can we "close the gap" in concentration inequalities?
- Can we obtain a tighter version of Talagrand's inequality? (In the way Serfling's bound tightens Hoeffding's inequality.)
- Local transductive Rademacher complexities.
- Other applications: non-asymptotic analysis of cross-validation, ...
- Can we obtain transductive bounds useful in practice?
- ...