Localized Complexities for Transductive Learning

Ilya Tolstikhin¹, Gilles Blanchard², Marius Kloft³
¹ Russian Academy of Sciences, ² University of Potsdam, ³ Humboldt University of Berlin

COLT 2014
Contents:
1. New concentration inequalities for sampling without replacement
2. Application to transductive learning
Concentration inequalities

- Function of many random variables: $Q = g(X_1, \dots, X_n)$
- We want to control the random fluctuations of $Q$ around $\mathbb{E}[Q]$

We aim to show high-probability upper bounds on $Q - \mathbb{E}[Q]$ and/or $\mathbb{E}[Q] - Q$.

Independent random variables
The case when $X_1, \dots, X_n$ are independent has been very well studied and many useful results are available [Boucheron et al., 2013].
Concentration inequalities: independent random variables

Consider independent random variables $X_1, \dots, X_n$ bounded in $[0, 1]$:
$$S_n = \frac{1}{n}\sum_{i=1}^n X_i.$$
Then for any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:

Hoeffding's inequality:
$$|S_n - \mathbb{E}[S_n]| \le \sqrt{\frac{\log(2/\delta)}{2n}};$$

Bernstein's inequality:
$$|S_n - \mathbb{E}[S_n]| \le \sqrt{\frac{2\sigma^2 \log(2/\delta)}{n}} + \frac{2\log(2/\delta)}{3n},$$
where $\sigma^2 = \frac{1}{n}\sum_{i=1}^n \mathrm{Var}[X_i]$.

Message: small variance leads to better convergence rates.
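As a quick sanity check of this message, here is a minimal simulation sketch (my own illustration, not part of the talk): for i.i.d. Bernoulli(p) variables with a small p, the Bernstein bound tracks the observed deviations far more tightly than the Hoeffding bound. All parameter values below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, delta, trials = 1000, 0.01, 0.05, 10_000

# X_i in {0, 1} subset of [0, 1], with small variance p(1 - p)
X = rng.binomial(1, p, size=(trials, n))
deviations = np.abs(X.mean(axis=1) - p)

hoeffding = np.sqrt(np.log(2 / delta) / (2 * n))
sigma2 = p * (1 - p)  # Var[X_i], known in this toy example
bernstein = np.sqrt(2 * sigma2 * np.log(2 / delta) / n) + 2 * np.log(2 / delta) / (3 * n)

print("empirical (1 - delta)-quantile of |S_n - E[S_n]| :", np.quantile(deviations, 1 - delta))
print("Hoeffding bound                                  :", hoeffding)
print("Bernstein bound (much smaller for small sigma^2) :", bernstein)
```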
Concentration inequalities: independent random variables

Now consider an i.i.d. sequence of random variables $X_1, \dots, X_n$ taking values in $\mathcal{X}$. Let $\mathcal{F}$ be a countable class of bounded functions $f : \mathcal{X} \to [-1, 1]$ such that $\mathbb{E}[f(X_1)] = 0$. Consider the supremum of the empirical process:
$$Q_n = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(X_i).$$
Then for any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:

McDiarmid's inequality:
$$|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\log(2/\delta)}{n}};$$

Talagrand's inequality (version due to [O. Bousquet, 2002]):
$$Q_n - \mathbb{E}[Q_n] \le \sqrt{\frac{2v\log(1/\delta)}{n}} + \frac{2\log(1/\delta)}{3n},$$
where $v = \sup_{f \in \mathcal{F}} \mathrm{Var}[f(X_1)] + 2\mathbb{E}[Q_n]$.
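To make $Q_n$ concrete, here is a small Monte Carlo sketch (my own toy example, not from the talk) using the finite class $\mathcal{F} = \{f_t(x) = \mathbf{1}\{x \le t\} - t\}$ of centred, $[-1, 1]$-bounded functions for $X \sim \mathrm{Uniform}[0, 1]$; it estimates $\mathbb{E}[Q_n]$ and the size of the fluctuations that the two inequalities above control.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, delta = 500, 2000, 0.05
thresholds = np.linspace(0.1, 0.9, 17)  # indexes the finite class F

def Q_n(x):
    # f_t(x) = 1{x <= t} - t is centred for X ~ Uniform[0, 1] and takes values in [-1, 1];
    # Q_n is the supremum over t of the empirical mean of f_t.
    return max(np.mean((x <= t) - t) for t in thresholds)

samples = rng.uniform(0, 1, size=(trials, n))
Q = np.array([Q_n(x) for x in samples])
print("E[Q_n]           ~", Q.mean())
print("std of Q_n       ~", Q.std())
print("McDiarmid radius :", np.sqrt(2 * np.log(2 / delta) / n))
```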
Sampling without replacement

Now let $Z_1, \dots, Z_n$ be sampled uniformly without replacement from a given finite set $C = \{c_1, \dots, c_N\}$ with $N \ge n$.

Note: $Z_1, \dots, Z_n$ are not independent.

Motivation:
- Cross-validation procedures;
- Transductive learning;
- Randomized sequential algorithms (SGD, ...);
- Matrix completion;
- Low-rank matrix factorization (collaborative filtering, ...);
- ...
Sampling without replacement: previous results

$$S_n = \frac{1}{n}\sum_{i=1}^n Z_i.$$

[Hoeffding, 1963]: Hoeffding's and Bernstein's inequalities also hold in this setting.

[Serfling, 1974]: Moreover, for all $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$|S_n - \mathbb{E}[S_n]| \le \sqrt{\frac{N - n + 1}{N}\cdot\frac{\log(2/\delta)}{2n}}.$$

[Bardenet and Maillard, 2013]: Bernstein's inequality can be tightened in the same manner.

Message: things are more concentrated when random variables are sampled without replacement!
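The following simulation sketch (my own illustration; the population and the values of N and n are arbitrary) compares the sample mean of a fixed finite population drawn with and without replacement, together with the Hoeffding and Serfling bounds; Serfling's factor $(N - n + 1)/N$ is most visible when n is close to N.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, delta, trials = 200, 150, 0.05, 5000
C = rng.uniform(0, 1, size=N)  # fixed population, values in [0, 1]

with_repl = C[rng.integers(0, N, size=(trials, n))].mean(axis=1)
without_repl = np.array([rng.choice(C, size=n, replace=False).mean() for _ in range(trials)])

hoeffding = np.sqrt(np.log(2 / delta) / (2 * n))
serfling = np.sqrt((N - n + 1) / N * np.log(2 / delta) / (2 * n))

print("std with replacement    :", with_repl.std())
print("std without replacement :", without_repl.std())  # visibly smaller for n close to N
print("Hoeffding bound         :", hoeffding)
print("Serfling bound          :", serfling)             # tighter by sqrt((N - n + 1) / N)
```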
Sampling without replacement: previous results

$$Q_n = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(Z_i).$$

[El-Yaniv and Pechyony, 2009; Cortes et al., 2009]: for any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{N - n}{N - 1/2}\cdot\frac{1}{\Delta(n, N)}\cdot\frac{2\log(2/\delta)}{n}},$$
where $\Delta(n, N) = 1 - \frac{1}{2\max\{n, N - n\}} \approx 1$.

This inequality is a (tighter) version of McDiarmid's inequality.

Problem: there is no version of Talagrand's concentration inequality for sampling without replacement!
Our results

Let $X_1, \dots, X_n$ and $Z_1, \dots, Z_n$ be sampled with and without replacement, respectively, from $C = \{c_1, \dots, c_N\}$. Consider:
$$Q_n^{\mathrm{iid}} = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(X_i), \qquad Q_n = \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(Z_i), \qquad \sigma_{\mathcal{F}}^2 = \sup_{f \in \mathcal{F}} \mathrm{Var}[f(X_1)].$$

Theorem. For any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\sigma_{\mathcal{F}}^2 N \log(2/\delta)}{n^2}}.$$

Theorem. For any $\delta \in (0, 1]$, with probability greater than $1 - \delta$:
$$Q_n - \mathbb{E}[Q_n^{\mathrm{iid}}] \le \sqrt{\frac{2\big(\sigma_{\mathcal{F}}^2 + 2\mathbb{E}[Q_n^{\mathrm{iid}}]\big)\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{3n}.$$
Our results: discussion

(Old): $|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\log(2/\delta)}{n}\cdot\frac{N - n}{N - 1/2}}$;
(New 1): $|Q_n - \mathbb{E}[Q_n]| \le \sqrt{\frac{2\sigma_{\mathcal{F}}^2 N \log(2/\delta)}{n^2}}$;
(New 2): $Q_n - \mathbb{E}[Q_n^{\mathrm{iid}}] \le \sqrt{\frac{2(\sigma_{\mathcal{F}}^2 + 2\mathbb{E}[Q_n^{\mathrm{iid}}])\log(1/\delta)}{n}} + \frac{\log(1/\delta)}{3n}$.

- (Old) does not account for the variance (Hoeffding-type);
- If $n = o(N)$, (Old) and (New 2) can outperform (New 1);
- If $n = \Omega(N)$, (New 1) outperforms (Old) for $\sigma_{\mathcal{F}}^2 \le 1/16$;
- The comparison between (New 2) and (Old) depends on $\sigma_{\mathcal{F}}^2$ and $\mathbb{E}[Q_n^{\mathrm{iid}}]$;
- $0 \le \mathbb{E}[Q_n^{\mathrm{iid}}] - \mathbb{E}[Q_n] \le 2n^3/N$.

Summary: (New 2) stays informative in all regimes of $N$ and $n$; (New 1) can give better results (at least for $n = \Omega(N)$).
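Below is a short script (my own sketch) that evaluates (Old), (New 1) and (New 2) as written above over a range of n. Note that the three bounds are not perfectly comparable: (New 2) is one-sided and centred at $\mathbb{E}[Q_n^{\mathrm{iid}}]$ rather than $\mathbb{E}[Q_n]$, and the chosen values of $\sigma_{\mathcal{F}}^2$ and $\mathbb{E}[Q_n^{\mathrm{iid}}]$ are illustrative only.

```python
import numpy as np

def old_bound(n, N, delta):
    # (Old): Hoeffding-type, no variance term
    return np.sqrt(2 * np.log(2 / delta) / n * (N - n) / (N - 0.5))

def new1_bound(n, N, delta, sigma2):
    # (New 1): variance-dependent, two-sided
    return np.sqrt(2 * sigma2 * N * np.log(2 / delta) / n**2)

def new2_bound(n, delta, sigma2, EQ_iid):
    # (New 2): Talagrand-type, one-sided and centred at E[Q_n^iid]
    return np.sqrt(2 * (sigma2 + 2 * EQ_iid) * np.log(1 / delta) / n) + np.log(1 / delta) / (3 * n)

N, delta, sigma2, EQ_iid = 10_000, 0.05, 0.01, 0.01  # illustrative values only
print("   n      (Old)     (New 1)    (New 2)")
for n in (100, 1000, 5000, 9000):
    print(f"{n:5d}  {old_bound(n, N, delta):.5f}  "
          f"{new1_bound(n, N, delta, sigma2):.5f}  {new2_bound(n, delta, sigma2, EQ_iid):.5f}")
```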
Contents:
1. New concentration inequalities for sampling without replacement
2. Application to transductive learning
Transductive learning: setting and notations

Deterministic agnostic setting:
- Finite instance space $X^N = \{X_1, \dots, X_N\} \subset \mathcal{X}$ and output space $\mathcal{Y}$;
- Class $\mathcal{H}$ of predictors $h : \mathcal{X} \to \mathcal{Y}$;
- Labelling function $\varphi : \mathcal{X} \to \mathcal{Y}$ (not necessarily in $\mathcal{H}$).

1. Sample $n \le N$ inputs $X^n \subseteq X^N$ uniformly without replacement;
2. Obtain outputs $Y^n$ for $X^n$ by applying the function $\varphi : \mathcal{X} \to \mathcal{Y}$;
3. Reveal the training set $S_n = (X^n, Y^n)$ and the $u = N - n$ test inputs $X^u$.
Transductive learning: setting and notations

Goal of the learner: based on $S_n$ and $X^u$, find a predictor in the hypothesis class $\mathcal{H}$ with minimal test error
$$L_u(h) = \frac{1}{u}\sum_{X \in X^u} \underbrace{\ell\big(h(X), \varphi(X)\big)}_{\ell_h(X)}$$
for an arbitrary bounded loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$.

- $L_N(h)$ and $\hat{L}_n(h)$ are the losses on $X^N$ and $X^n$ respectively;
- $\hat{h}_n$, $h^*_u$ and $h^*_N$ minimize $\hat{L}_n(h)$, $L_u(h)$ and $L_N(h)$ respectively;
- Excess loss: $\mathcal{E}(\hat{h}_n) = L_u(\hat{h}_n) - L_u(h^*_u)$.

Our goal: obtain tight high-probability upper bounds on $\mathcal{E}(\hat{h}_n)$.
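As a concrete illustration of steps 1-3 and of the quantities $\hat{L}_n$ and $L_u$, here is a toy sketch of the transductive protocol (the data, the nearest-centroid "learner" and all parameters are my own choices, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 1000, 300
X_full = rng.normal(size=(N, 2))                     # fixed instance set X^N
phi = (X_full[:, 0] + X_full[:, 1] > 0).astype(int)  # labelling function phi on X^N

train_idx = rng.choice(N, size=n, replace=False)     # step 1: sample X^n without replacement
test_idx = np.setdiff1d(np.arange(N), train_idx)     # X^u with u = N - n

# Steps 2-3: reveal S_n = (X^n, Y^n); the "learner" here is a nearest-centroid rule.
centroids = np.stack([X_full[train_idx][phi[train_idx] == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X_full[:, None, :] - centroids[None]) ** 2).sum(axis=-1), axis=1)

ell = (pred != phi).astype(float)                    # binary loss ell_h(X) on all of X^N
print("training error (L_n hat):", ell[train_idx].mean())
print("test error     (L_u)    :", ell[test_idx].mean())
```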
Transductive learning: previous results

- [Vapnik, 1982; Blum and Langford, 2003] present implicit bounds for the binary loss function;
- [Cortes and Mohri, 2006] obtain bounds of order $\sqrt{\hat{L}_n(\hat{h}_n)\frac{\log(n + u)}{n}}$ for regression with the quadratic loss;
- [Blum and Langford, 2003; Derbeko et al., 2004] give PAC-Bayesian bounds for transductive learning, which crucially depend on the prior;
- [El-Yaniv and Pechyony, 2006; Cortes et al., 2009] give bounds of order $n^{-1/2}$ for binary and quadratic loss functions based on algorithmic stability;
- [El-Yaniv and Pechyony, 2009] give bounds of order $n^{-1/2}$ for bounded loss functions based on global Rademacher complexities;
- [Blum and Langford, 2003] provide bounds of order $n^{-1}$ in the realizable setting (when $\varphi \in \mathcal{H}$) for the binary loss function.

Message: all bounds have the "slow" rate $O(n^{-1/2})$ under general assumptions.
Localized complexities and fast rates in the inductive setting

The inductive setting assumes that $S_n$ is sampled i.i.d. from an unknown $P$ on $\mathcal{X} \times \mathcal{Y}$.

The classic VC approach deals with uniform deviations
$$\sup_{h \in \mathcal{H}} \Big( L_N(h) - \hat{L}_n(h) \Big)$$
and provides bounds with the slow rate of $O(n^{-1/2})$.

Localized approach [Massart, 2000; Bartlett et al., 2005; Koltchinskii, 2006]: this is overpessimistic and we should study local fluctuations
$$\sup_{h \in \mathcal{H}_0} \Big( L_N(h) - \hat{L}_n(h) \Big),$$
where $\mathcal{H}_0 \subseteq \mathcal{H}$ contains functions with small variances. This often leads to fast rates of $o(n^{-1/2})$ (e.g. under Tsybakov's low-noise conditions).

Important: the localized approach is based on Talagrand's inequality.
Our results

Let $\hat{L}_n^{\mathrm{iid}}(h) = \frac{1}{n}\sum_{i=1}^n \ell_h(Z_i)$, where $Z_1, \dots, Z_n$ are sampled i.i.d. from $X^N$.

Consider the local neighbourhood of $h^*_N$ in $\mathcal{H}$:
$$\mathcal{H}(r) = \Big\{ h \in \mathcal{H} : \mathbb{E}\big[\big(\ell_h(X) - \ell_{h^*_N}(X)\big)^2\big] \le r \Big\}.$$

Theorem. Assume that there is a constant $B > 0$ such that for every $h \in \mathcal{H}$:
$$\mathbb{E}\big[\big(\ell_h(X) - \ell_{h^*_N}(X)\big)^2\big] \le B\cdot\big(L_N(h) - L_N(h^*_N)\big).$$
Assume that there is a sub-root function $\psi_n(r)$ such that
$$B\cdot\mathbb{E}\bigg[\sup_{h \in \mathcal{H}(r)} \Big( L_N(h) - \hat{L}_n^{\mathrm{iid}}(h) - \big(L_N(h^*_N) - \hat{L}_n^{\mathrm{iid}}(h^*_N)\big) \Big)\bigg] \le \psi_n(r).$$
Let $r_n^*$ be a fixed point of $\psi_n(r)$. Then with probability greater than $1 - \delta$:
$$L_N(\hat{h}_n) - L_N(h^*_N) \le 901\,\frac{r_n^*}{B} + (16 + 25B)\,\frac{\log(1/\delta)}{3n} =: \Delta_n(\delta).$$
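The fixed point $r_n^*$ can be computed numerically by iterating $r \leftarrow \psi_n(r)$, since a sub-root function is non-decreasing and $\psi_n(r)/\sqrt{r}$ is non-increasing. The parametric form of $\psi_n$ below is only a typical illustrative shape (my assumption, not the talk's), chosen to show that $r_n^*$ can decay like $1/n$, i.e. faster than $n^{-1/2}$:

```python
import math

def psi_n(r, n, d=10.0, c=2.0):
    # A common sub-root shape: psi_n(r) = c * sqrt(d * r / n) + d / n (illustrative only).
    return c * math.sqrt(d * r / n) + d / n

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    # Simple fixed-point iteration r <- psi(r); converges for sub-root psi.
    r = r0
    for _ in range(max_iter):
        r_new = psi(r)
        if abs(r_new - r) < tol:
            break
        r = r_new
    return r

for n in (100, 1000, 10_000):
    r_star = fixed_point(lambda r: psi_n(r, n))
    print(n, r_star)  # decays roughly like d / n, i.e. faster than n^(-1/2)
```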
Our results

Theorem. Under the assumptions of the previous theorem, with probability greater than $1 - \delta$:
$$L_u(\hat{h}_n) - L_u(h^*_u) \le \frac{N}{u}\big(\Delta_n(\delta) + \Delta_u(\delta)\big), \qquad \Delta_n(\delta) \sim r_n^* + n^{-1}.$$

The condition
$$\mathbb{E}\big[\big(\ell_h(X) - \ell_{h^*_N}(X)\big)^2\big] \le B\cdot\big(L_N(h) - L_N(h^*_N)\big)$$
is satisfied for many problems, including:
- the quadratic loss and a uniformly bounded convex class $\mathcal{H}$;
- the binary loss and a class $\mathcal{H}$ with finite VC-dimension, if $\varphi \in \mathcal{H}$.

For many interesting situations $r_n^*$ is of order $o(n^{-1/2})$:
- [Massart, 2000] binary loss and VC-classes: $r_n^* \sim \frac{\mathrm{VC}(\mathcal{H})\log n}{n}$;
- [Mendelson, 2003] balls in an RKHS and Lipschitz losses.
Thank you for your attention!

Many open questions:
- Can we "close the gap" in concentration inequalities?
- Can we obtain a tighter version of Talagrand's inequality? (In the way Serfling's bound tightens Hoeffding's inequality.)
- Local transductive Rademacher complexities.
- Other applications: non-asymptotic analysis of cross-validation, ...
- Can we obtain transductive bounds useful in practice?
- ...