The Rademacher Complexity of Co-Regularized Kernel Classes


David S. Rosenberg
Department of Statistics
University of California, Berkeley
Berkeley, CA 94720

Peter L. Bartlett
Department of Statistics and Computer Science Division
University of California, Berkeley
Berkeley, CA 94720

Abstract

In the multi-view approach to semi-supervised learning, we choose one predictor from each of multiple hypothesis classes, and we co-regularize our choices by penalizing disagreement among the predictors on the unlabeled data. We examine the co-regularization method used in the co-regularized least squares (CoRLS) algorithm [12], in which the views are reproducing kernel Hilbert spaces (RKHS's), and the disagreement penalty is the average squared difference in predictions. The final predictor is the pointwise average of the predictors from each view. We call the set of predictors that can result from this procedure the co-regularized hypothesis class. Our main result is a tight bound on the Rademacher complexity of the co-regularized hypothesis class in terms of the kernel matrices of each RKHS. We find that the co-regularization reduces the Rademacher complexity by an amount that depends on the distance between the two views, as measured by a data dependent metric. We then use standard techniques to bound the gap between training error and test error for the CoRLS algorithm. Experimentally, we find that the amount of reduction in complexity introduced by co-regularization correlates with the amount of improvement that co-regularization gives in the CoRLS algorithm.

1 Introduction

In the multi-view approach to semi-supervised learning, we have several classes of predictors, or views. The goal is to find a predictor in each view that performs well on the labeled data, such that all the chosen predictors give similar predictions on the unlabeled data. This approach is motivated by the assumption that each view contains a predictor that's approximately correct. Roughly speaking, predictors that are approximately correct are also approximately equal. Thus we can reduce the complexity of our learning problem by eliminating from the search space all predictors that don't have matching predictors in each of the other views. Because of this reduction in complexity, it's reasonable to expect better test performance for the same amount of labeled training data. This paper provides a more precise understanding of the ways in which the agreement constraint and the choice of views affect complexity and generalization.

Early theoretical results on the multi-view approach to semi-supervised learning, in particular [5] and the original co-training paper [3], assume that the predictors of each view are conditionally independent given the labels they try to predict. This assumption is difficult to justify in practice, yet there are scant other theoretical results to guide the choice of views. In [1], the authors present a theoretical framework for semi-supervised learning that nicely contains multi-view learning as a special case. Their complexity results are in terms of the complexity of the space of compatible predictors, which in the case of multi-view learning corresponds to those predictors that have matching predictors in the other views. To apply these results to a particular multi-view learning algorithm, one must compute the complexity of the class of compatible predictors. This problem is addressed to some extent in [6], in which they compute an upper bound on the Rademacher complexity of the space of compatible predictors. Although their results do not assume independent views, their sample complexity bound is given only as the solution to an optimization problem.

In this paper, we consider the co-regularized least squares (CoRLS) algorithm, a two-view, semi-supervised version of regularized least squares (RLS). The algorithm was first discussed in [12], and a similar algorithm was given earlier in [10]. Although CoRLS has been shown to work well in practice for both classification [12] and regression [4], many would-be users of the algorithm are deterred by the requirement of choosing two views, as this often seems an arbitrary process.

We attempt to improve this situation by showing how the choice of views affects generalization performance, even in settings where we can't make any probabilistic assumptions about the views.

In CoRLS, the two views are reproducing kernel Hilbert spaces (RKHS's), call them $\mathcal{F}$ and $\mathcal{G}$. We find predictors $f^* \in \mathcal{F}$ and $g^* \in \mathcal{G}$ that minimize an objective function of the form

$$\text{Labeled Loss}(f, g) + \text{Complexity}(f, g) + \sum_{x \in \{\text{unlabeled points}\}} [f(x) - g(x)]^2 .$$

The last term is the co-regularization, which encourages the selection of a pair of predictors $(f^*, g^*)$ that agree on the unlabeled data. We follow [4, 6] and consider the average predictor $\varphi^* := (f^* + g^*)/2$, which comes from the class

$$\mathcal{J}_{\mathrm{ave}} := \{ x \mapsto [f(x) + g(x)]/2 : (f, g) \in \mathcal{F} \times \mathcal{G} \} .$$

For typical choices of $\mathcal{F}$ and $\mathcal{G}$, this class is too large to admit uniform generalization bounds. In Section 3.1, however, we see that under a boundedness condition on the labeled loss, the complexity and co-regularization terms force $\varphi^*$ to come from a much smaller class $\mathcal{J}_\lambda$, where $\lambda \ge 0$ is the coefficient of the co-regularization term in the objective function. As $\lambda$ increases, it is clear that the size of $\mathcal{J}_\lambda$ decreases, and we'd expect this to improve generalization. We make this precise in Theorem 2, where we use standard arguments to bound the gap between training error and test error in terms of the Rademacher complexity of $\mathcal{J}_\lambda$.

The main contribution of this paper is Theorem 3, which gives an explicit expression for the Rademacher complexity of $\mathcal{J}_\lambda$, up to a small constant factor. For ordinary kernel RLS (i.e. single view and fully supervised), it is known that the squared Rademacher complexity is proportional to the trace of the kernel matrix (see e.g. [11, Thm 7.39, p. 231]). We find that for the two-view case without co-regularization (i.e. $\lambda = 0$), $\mathcal{J}_\lambda$ has squared Rademacher complexity equal to the average of the traces of the two labeled-data kernel matrices. When $\lambda > 0$, the co-regularization term reduces this quantity by an amount that depends on how different the two views are, and in particular on the average distance between the two views' representations of the labeled data, where the distance metric is determined by the unlabeled data.

In Section 2, we give a formal presentation of the CoRLS algorithm. Our results are presented in Section 3, discussed in Section 4, and proved in Section 5. In Section 6, we present an empirical investigation of whether the effect of co-regularization on test performance is correlated with its effect on Rademacher complexity.

2 Co-Regularized Least Squares

We consider the case of two views, though both the algorithm [4] and the analysis can be extended to multiple views. Our views are RKHS's $\mathcal{F}$ and $\mathcal{G}$ of functions mapping from an arbitrary space $\mathcal{X}$ to the reals. The CoRLS algorithm takes labeled points $(x_1, y_1), \ldots, (x_\ell, y_\ell) \in \mathcal{X} \times \mathcal{Y}$ and unlabeled points $x_{\ell+1}, \ldots, x_{\ell+u} \in \mathcal{X}$, and solves the following minimization problem:

$$(f^*, g^*) = \arg\min_{f \in \mathcal{F},\, g \in \mathcal{G}} \hat{L}(f, g) + \gamma_F \|f\|_{\mathcal{F}}^2 + \gamma_G \|g\|_{\mathcal{G}}^2 + \lambda \sum_{i=\ell+1}^{\ell+u} [f(x_i) - g(x_i)]^2 ,$$

for some loss functional $\hat{L}$ that depends on the labeled data, and regularization parameters $\gamma_F$, $\gamma_G$, and $\lambda$. The final output is $\varphi^* := (f^* + g^*)/2$. In [12, 4], the loss functional considered was

$$\hat{L}(f, g) = \frac{1}{2\ell} \sum_{i=1}^{\ell} \left[ (f(x_i) - y_i)^2 + (g(x_i) - y_i)^2 \right].$$

If we use this loss and set $\lambda = 0$, the objective function decouples into two single-view, fully-supervised kernel RLS regressions. We also propose the loss functional

$$\hat{L}(f, g) = \frac{1}{\ell} \sum_{i=1}^{\ell} \left[ \frac{f(x_i) + g(x_i)}{2} - y_i \right]^2$$

as one that seems natural when the final prediction function is $(f + g)/2$, as in [4, 6]. Our analysis applies to both of these loss functionals, as well as many more that depend on the labeled data only and that satisfy a boundedness condition specified in Section 3.1. Thus we take the "S" in CoRLS to refer to the squares in the complexity and co-regularization terms, which our analysis requires, rather than to the squares in the loss functional, which we don't require.
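To make the objective concrete, the following numpy sketch evaluates the two loss functionals above and the co-regularization penalty from vectors of predictions. It is an illustration only, not code from the paper, and the function name and arguments are ours.

```python
import numpy as np

def corls_objective_terms(f_lab, g_lab, y, f_unlab, g_unlab):
    """Evaluate the pieces of the CoRLS objective from prediction vectors.

    f_lab, g_lab     : predictions of f and g on the ell labeled points
    y                : labels for the labeled points
    f_unlab, g_unlab : predictions of f and g on the u unlabeled points
    """
    ell = len(y)
    # Loss functional used in [12, 4]: average of the two views' squared errors.
    loss_separate = ((f_lab - y) ** 2 + (g_lab - y) ** 2).sum() / (2 * ell)
    # Alternative loss functional: squared error of the averaged predictor.
    loss_average = (((f_lab + g_lab) / 2 - y) ** 2).sum() / ell
    # Co-regularization penalty: squared disagreement on the unlabeled points.
    disagreement = ((f_unlab - g_unlab) ** 2).sum()
    return loss_separate, loss_average, disagreement
```

Either loss term, plus the RKHS norms and $\lambda$ times the disagreement term, gives the full objective minimized by CoRLS.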

2.1 Notation and Preliminaries

We'll denote the reproducing kernels corresponding to $\mathcal{F}$ and $\mathcal{G}$ by $k_F : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $k_G : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, respectively. It's convenient to introduce notation for the span of the data in each space: $\mathcal{L}_F := \mathrm{span}\{k_F(x_i, \cdot)\}_{i=1}^{\ell+u} \subset \mathcal{F}$ and $\mathcal{L}_G := \mathrm{span}\{k_G(x_i, \cdot)\}_{i=1}^{\ell+u} \subset \mathcal{G}$. By the Representer Theorem, it's clear that $(f^*, g^*) \in \mathcal{L}_F \times \mathcal{L}_G$. That is, we can write the CoRLS solution as

$$f^*(\cdot) = \sum_{i=1}^{u+\ell} \alpha_i k_F(x_i, \cdot) \in \mathcal{L}_F \quad \text{and} \quad g^*(\cdot) = \sum_{i=1}^{u+\ell} \beta_i k_G(x_i, \cdot) \in \mathcal{L}_G$$

for $\alpha = (\alpha_1, \ldots, \alpha_{u+\ell}) \in \mathbb{R}^{u+\ell}$ and $\beta = (\beta_1, \ldots, \beta_{u+\ell}) \in \mathbb{R}^{u+\ell}$. We'll denote an arbitrary element of $\mathcal{L}_F$ by $f_\alpha = \sum_{i=1}^{u+\ell} \alpha_i k_F(x_i, \cdot)$, and similarly for elements of $\mathcal{L}_G$. Define the kernel matrices $(K_F)_{ij} = k_F(x_i, x_j)$ and $(K_G)_{ij} = k_G(x_i, x_j)$, and partition them into blocks, with $A$ and $D$ the unlabeled blocks, $B$ and $E$ the labeled blocks, and $C$, $F$ the cross blocks:

$$K_F = \begin{pmatrix} A_{u \times u} & C_{u \times \ell} \\ C'_{\ell \times u} & B_{\ell \times \ell} \end{pmatrix}, \qquad K_G = \begin{pmatrix} D_{u \times u} & F_{u \times \ell} \\ F'_{\ell \times u} & E_{\ell \times \ell} \end{pmatrix}.$$

We can now write the agreement term as

$$\sum_{i=\ell+1}^{\ell+u} [f_\alpha(x_i) - g_\beta(x_i)]^2 = \|(A\ C)\,\alpha - (D\ F)\,\beta\|^2 ,$$

and it follows from the reproducing property that $\|f_\alpha\|_{\mathcal{F}}^2 = \alpha' K_F \alpha$ and $\|g_\beta\|_{\mathcal{G}}^2 = \beta' K_G \beta$. For each of the loss functionals presented in the beginning of this section, the whole objective function is quadratic in $\alpha$ and $\beta$, and thus a solution $(f^*, g^*)$ can be found by differentiating and solving a system of linear equations. See [12, 4] for more details.
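As a concrete illustration of that last point, here is a minimal numpy sketch that solves the CoRLS problem for the squared-loss functional of [12, 4] by setting the gradient of the quadratic objective to zero. The gradient algebra is our own straightforward working and the helper name is ours, not code from the paper; for simplicity the kernel matrices here are indexed with the labeled points first, which is the opposite of the block layout shown above.

```python
import numpy as np

def solve_corls(KF, KG, y, ell, gamma_F, gamma_G, lam):
    """Solve CoRLS for the squared-loss functional of [12, 4].

    KF, KG : (ell+u) x (ell+u) kernel matrices, labeled points first.
    y      : labels for the first ell points.
    Returns the expansion coefficients (alpha, beta) of f* and g*.
    """
    n = KF.shape[0]
    KF_lab, KG_lab = KF[:ell, :], KG[:ell, :]        # rows giving predictions on labeled points
    KF_unlab, KG_unlab = KF[ell:, :], KG[ell:, :]    # rows giving predictions on unlabeled points

    # Setting the gradient of the quadratic objective to zero gives a block linear system.
    top = np.hstack([KF_lab.T @ KF_lab / ell + 2 * gamma_F * KF
                     + 2 * lam * KF_unlab.T @ KF_unlab,
                     -2 * lam * KF_unlab.T @ KG_unlab])
    bot = np.hstack([-2 * lam * KG_unlab.T @ KF_unlab,
                     KG_lab.T @ KG_lab / ell + 2 * gamma_G * KG
                     + 2 * lam * KG_unlab.T @ KG_unlab])
    lhs = np.vstack([top, bot])
    rhs = np.concatenate([KF_lab.T @ y / ell, KG_lab.T @ y / ell])
    coef = np.linalg.lstsq(lhs, rhs, rcond=None)[0]  # least squares handles rank deficiency
    return coef[:n], coef[n:]
```

Predictions of the averaged predictor on a new point then follow from the kernel expansions of $f^*$ and $g^*$.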

3 Results

3.1 Bounding the CoRLS Function Class

We assume the loss functional $\hat{L} : \mathcal{F} \times \mathcal{G} \to [0, \infty)$ satisfies $\hat{L}(f, g) \le 1$ for $f \equiv 0$ and $g \equiv 0$. That is, $\hat{L}(0, 0) \le 1$. This is true, for example, for the two loss functionals given in Section 2, provided that $\mathcal{Y} = [-1, 1]$. Assuming $\hat{L}(0, 0) \le 1$, we now derive the co-regularized function class $\mathcal{J} \subset \mathcal{J}_{\mathrm{ave}}$ from which the CoRLS predictors are drawn. (We now suppress the $\lambda$ in the notation $\mathcal{J}_\lambda$ from Section 1.)

Recall that our original problem was to minimize

$$Q(f, g) := \hat{L}(f, g) + \gamma_F \|f\|_{\mathcal{F}}^2 + \gamma_G \|g\|_{\mathcal{G}}^2 + \lambda \sum_{i=\ell+1}^{\ell+u} [f(x_i) - g(x_i)]^2$$

over $\mathcal{F} \times \mathcal{G}$. Plugging in the trivial predictors $f \equiv 0$ and $g \equiv 0$ gives the following upper bound:

$$\min_{f, g} Q(f, g) \le Q(0, 0) = \hat{L}(0, 0) \le 1 .$$

Since all terms of $Q(f, g)$ are nonnegative, we conclude that any $(f^*, g^*)$ minimizing $Q(f, g)$ is contained in

$$\mathcal{H} := \left\{ (f, g) : \gamma_F \|f\|_{\mathcal{F}}^2 + \gamma_G \|g\|_{\mathcal{G}}^2 + \lambda \sum_{i=\ell+1}^{\ell+u} [f(x_i) - g(x_i)]^2 \le 1 \right\},$$

and the final predictor for the CoRLS algorithm is chosen from the class

$$\mathcal{J} := \{ x \mapsto [f(x) + g(x)]/2 : (f, g) \in \mathcal{H} \} .$$

Note that the function classes $\mathcal{H}$ and $\mathcal{J}$ do not depend on the labeled data, and thus are deterministic after conditioning on the unlabeled data.

3.2 Setup for the Theorems

We will use empirical Rademacher complexity as our measure of the size of a function class. The empirical Rademacher complexity of a function class $\mathcal{F} = \{\varphi : \mathcal{X} \to \mathcal{Y}\}$ for a sample $x_1, \ldots, x_\ell \in \mathcal{X}$ is defined as

$$\hat{R}_\ell(\mathcal{F}) = \mathbb{E}_\sigma \left[ \sup_{\varphi \in \mathcal{F}} \left| \frac{2}{\ell} \sum_{i=1}^{\ell} \sigma_i \varphi(x_i) \right| \right],$$

where the expectation is with respect to $\sigma = \{\sigma_1, \ldots, \sigma_\ell\}$, and the $\sigma_i$ are i.i.d. Rademacher random variables (we say $\sigma$ is a Rademacher random variable if $P(\sigma = 1) = P(\sigma = -1) = 1/2$).

In our semi-supervised context, we assume that the labeled points $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ are drawn i.i.d. from a distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, but make no assumptions about the unlabeled points $x_{\ell+1}, \ldots, x_{\ell+u} \in \mathcal{X}$. Indeed, our claims are conditional on the unlabeled data, and thus remain true no matter what distribution the unlabeled points are drawn from.

For a given loss function $L : \mathcal{Y}^2 \to [0, 1]$, and for any $\varphi \in \mathcal{J}$, we are interested in bounds on the expected loss $\mathbb{E}_D L(\varphi(X), Y)$. Typically, $L$ would be the loss used to define the labeled empirical risk functional $\hat{L}$, but it need not be.
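Before stating the theorems, note that the empirical Rademacher complexity defined above is straightforward to approximate numerically when the function class is finite. The following Monte Carlo sketch, with a brute-force supremum, is our own illustration of the definition and is not part of the original analysis; the function name is ours.

```python
import numpy as np

def empirical_rademacher(predictions, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions : array of shape (m, ell); row k holds (phi_k(x_1), ..., phi_k(x_ell))
                  for the k-th function in a finite class.
    Implements E_sigma sup_phi | (2/ell) * sum_i sigma_i * phi(x_i) |.
    """
    rng = np.random.default_rng(seed)
    m, ell = predictions.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=ell)            # i.i.d. Rademacher variables
        total += np.abs(predictions @ sigma).max() * 2 / ell  # supremum over the finite class
    return total / n_draws
```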

Our generalization bound in Theorem 2 is based on the following theorem (see e.g. [11, Thm 4.9, p. 96]):

Theorem 1. Fix $\delta \in (0, 1)$, and let $\mathcal{Q}$ be a class of functions mapping from $\mathcal{X} \times \mathcal{Y}$ to $[0, 1]$. With probability at least $1 - \delta$ over the sample $(X_1, Y_1), \ldots, (X_\ell, Y_\ell)$ drawn i.i.d. from $D$, every $q \in \mathcal{Q}$ satisfies

$$\mathbb{E}_D\, q(X, Y) \le \frac{1}{\ell} \sum_{i=1}^{\ell} q(X_i, Y_i) + \hat{R}_\ell(\mathcal{Q}) + 3 \sqrt{\frac{\ln(2/\delta)}{2\ell}} .$$

The expression $\mathbb{E}_D\, q(X, Y)$ is deterministic, but unknown to us because we do not know the data generating distribution $D$. The terms $\frac{1}{\ell}\sum_{i=1}^{\ell} q(X_i, Y_i)$ and $\hat{R}_\ell(\mathcal{Q})$ are random, but with probability at least $1 - \delta$, the inequality holds for the observed values of these random quantities, and for every $q \in \mathcal{Q}$.

3.3 Theorems

In Theorem 2, we give generalization bounds for the class $\mathcal{J}$ in terms of the empirical Rademacher complexity $\hat{R}_\ell(\mathcal{J})$. In Theorem 3, we give upper and lower bounds on $\hat{R}_\ell(\mathcal{J})$ that can be written explicitly in terms of blocks of the kernel matrices $K_F$ and $K_G$.

Theorem 2. Suppose that $L : \mathcal{Y}^2 \to [0, 1]$ satisfies the following uniform Lipschitz condition: for all $y \in \mathcal{Y}$ and all $\hat{y}_1, \hat{y}_2 \in \mathcal{Y}$ with $\hat{y}_1 \ne \hat{y}_2$,

$$\frac{|L(\hat{y}_1, y) - L(\hat{y}_2, y)|}{|\hat{y}_1 - \hat{y}_2|} \le B .$$

Then conditioned on the unlabeled data, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the sample of labeled points $(X_1, Y_1), \ldots, (X_\ell, Y_\ell)$ drawn i.i.d. from $D$, we have for any predictor $\varphi \in \mathcal{J}$ that

$$\mathbb{E}_D\, L(\varphi(X), Y) \le \frac{1}{\ell} \sum_{i=1}^{\ell} L(\varphi(X_i), Y_i) + 2B \hat{R}_\ell(\mathcal{J}) + \frac{1}{\sqrt{\ell}}\left(2 + 3\sqrt{\ln(2/\delta)/2}\right).$$

Note that for $\mathcal{Y} = [-1, 1]$, the conditions of this theorem are satisfied by the loss function $L(\hat{y}, y) = (\tau(\hat{y}) - y)^2/4$, where $\tau(y) = \min(1, \max(-1, y))$.

The following theorem is the main result of the paper. Recall that $A$ and $D$ are the unlabeled kernel submatrices, $B$ and $E$ are the labeled kernel submatrices, and $C$ and $F$ involve the cross-terms.

Theorem 3. For the CoRLS function class $\mathcal{J}$,

$$\frac{1}{\sqrt{2}}\,\frac{U}{\ell} \le \hat{R}_\ell(\mathcal{J}) \le \frac{U}{\ell} ,$$

where

$$U^2 = \gamma_F^{-1}\,\mathrm{tr}(B) + \gamma_G^{-1}\,\mathrm{tr}(E) - \lambda\,\mathrm{tr}\!\left(J'(I + \lambda M)^{-1} J\right),$$

with $I$ the identity matrix, and

$$J = \gamma_F^{-1} C - \gamma_G^{-1} F \quad \text{and} \quad M = \gamma_F^{-1} A + \gamma_G^{-1} D .$$

4 Discussion

4.1 Unlabeled Data Improves the Bound

The regularization parameter $\lambda$ controls the amount that the unlabeled data constrains the hypothesis space. It's obvious from the definition of the hypothesis class $\mathcal{J}$ that if $\lambda_1 \ge \lambda_2 \ge 0$, then $\mathcal{J}_{\lambda_1} \subseteq \mathcal{J}_{\lambda_2}$, and thus $\hat{R}_\ell(\mathcal{J}_{\lambda_1}) \le \hat{R}_\ell(\mathcal{J}_{\lambda_2})$. That is, increasing $\lambda$ reduces the Rademacher complexity $\hat{R}_\ell(\mathcal{J}_\lambda)$. The amount of this reduction is characterized by the expression

$$\Delta(\lambda) := \lambda\,\mathrm{tr}\!\left(J'(I + \lambda M)^{-1} J\right)$$

from Theorem 3. When $\lambda = 0$, the algorithm ignores the unlabeled data, and the reduction is indeed $\Delta(0) = 0$. As we would expect, $\Delta(\lambda)$ is nondecreasing in $\lambda$ and has a finite limit as $\lambda \to \infty$. We collect these properties in a proposition:

Proposition 1. $\Delta(0) = 0$, $\Delta(\lambda)$ is nondecreasing on $\lambda \ge 0$, and $\lim_{\lambda\to\infty} \Delta(\lambda) = \mathrm{tr}(J'M^{-1}J)$, provided the inverse exists.

Proof. The limit claim is clear if we write the reduction as $\Delta(\lambda) = \mathrm{tr}\!\left(J'(\lambda^{-1}I + M)^{-1}J\right)$. Since $A$ and $D$ are Gram matrices, their positive combination $M$ is positive semidefinite (psd). Thus we can write $M = Q'\Lambda Q$, with diagonal $\Lambda \ge 0$ and orthogonal $Q$. Then

$$\Delta(\lambda) = \mathrm{tr}\!\left(J'Q'\left(\lambda^{-1}I + \Lambda\right)^{-1}QJ\right) = \sum_{i=1}^{\ell}\sum_{j=1}^{u} \frac{(QJ)_{ji}^2}{\lambda^{-1} + \Lambda_{jj}} .$$

From this expression, it's clear that $\Delta(\lambda)$ is nondecreasing in $\lambda$ on $(0, \infty)$. Since $\Delta(\lambda)$ is continuous at $\lambda = 0$, it's nondecreasing on $[0, \infty)$.
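The quantities in Theorem 3 and Proposition 1 are simple to compute from the kernel blocks. The sketch below is ours, purely for illustration; the function name is ours, and the block names follow the paper ($A$, $D$ unlabeled; $B$, $E$ labeled; $C$, $F$ cross blocks).

```python
import numpy as np

def complexity_reduction(A, B, C, D, E, F, gamma_F=1.0, gamma_G=1.0, lam=0.1):
    """Compute U^2 from Theorem 3 and the reduction Delta(lambda).

    A, D : u x u unlabeled kernel blocks; B, E : ell x ell labeled blocks;
    C, F : u x ell cross blocks.
    """
    u = A.shape[0]
    J = C / gamma_F - F / gamma_G                      # u x ell
    M = A / gamma_F + D / gamma_G                      # u x u, psd
    inner = np.linalg.solve(np.eye(u) + lam * M, J)    # (I + lam*M)^{-1} J
    delta = lam * np.trace(J.T @ inner)                # reduction Delta(lambda)
    U_sq = np.trace(B) / gamma_F + np.trace(E) / gamma_G - delta
    return U_sq, delta
```

Evaluating `delta` on a grid of $\lambda$ values gives a quick numerical check of the monotonicity and the finite limit asserted in Proposition 1.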



4.2 Interpretation of Improvement

Here we take some steps towards interpreting the reduction in complexity $\Delta(\lambda)$. For simplicity, take $\gamma_F = \gamma_G = 1$. Then the reduction is given by

$$\Delta(\lambda) = \lambda\,\mathrm{tr}\!\left((C - F)'(I + \lambda M)^{-1}(C - F)\right).$$

Note that the $j$th column of the matrix $C$ gives a representation of the $j$th labeled point by its $\mathcal{F}$-inner product with each of the unlabeled points. That is, for $j = 1, \ldots, \ell$ and for $i = 1, \ldots, u$, we have $C_{ij} = \langle k_F(x_{\ell+i}, \cdot), k_F(x_j, \cdot)\rangle$. Similarly, $F$ represents each labeled point by its $\mathcal{G}$-inner product with each of the unlabeled points. Since $(I + \lambda M)^{-1}$ is psd, it defines a semi-norm. Thus we can write the reduction in complexity as

$$\Delta(\lambda) = \lambda \sum_{i=1}^{\ell} \|C_{\cdot i} - F_{\cdot i}\|^2_{(I + \lambda M)^{-1}} = \sum_{i=1}^{\ell} \|C_{\cdot i} - F_{\cdot i}\|^2_{(I/\lambda + M)^{-1}} \quad (\text{for } \lambda > 0).$$

We see that $\Delta(\lambda)$ grows with the distance between the two different representations of the labeled points. For very small $\lambda$, this distance is essentially measured using the Euclidean norm. As $\lambda$ grows, the distance approaches that determined by $M^{-1}$, where $M$ is the sum of the two unlabeled data kernel matrices. Loosely summarized, the reduction is proportional to the difference between the representations of the labeled data in the two different views, where the measure of difference is determined by the unlabeled data.
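A direct way to see the semi-norm form above is to compare it numerically with the trace expression. The check below is ours (with $\gamma_F = \gamma_G = 1$ as in this section); the two returned values should agree up to floating-point error.

```python
import numpy as np

def delta_two_ways(A, C, D, F, lam):
    """Compare the trace form and the per-point distance form of Delta(lambda)."""
    u, ell = C.shape
    M = A + D
    G = np.linalg.inv(np.eye(u) + lam * M)   # psd matrix defining the semi-norm
    J = C - F
    trace_form = lam * np.trace(J.T @ G @ J)
    # Sum over labeled points of the squared (I + lam*M)^{-1}-semi-norm of C_i - F_i.
    distance_form = lam * sum((C[:, i] - F[:, i]) @ G @ (C[:, i] - F[:, i])
                              for i in range(ell))
    return trace_form, distance_form
```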

5 Proofs

5.1 Proof of Theorem 2

Define the loss class $\mathcal{Q} = \{(x, y) \mapsto L(\varphi(x), y) : \varphi \in \mathcal{J}\}$. By assumption, any function in $\mathcal{Q}$ maps into $[0, 1]$. Applying Theorem 1, we have for any $\varphi \in \mathcal{J}$, with probability at least $1 - \delta$ over the labeled sample $(X_i, Y_i)_{i=1}^{\ell}$, that

$$\mathbb{E}_D\, L(\varphi(X), Y) \le \frac{1}{\ell}\sum_{i=1}^{\ell} L(\varphi(X_i), Y_i) + \hat{R}_\ell(\mathcal{Q}) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}} .$$

The following lemma completes the proof:

Lemma 1. $\hat{R}_\ell(\mathcal{Q}) \le 2B \hat{R}_\ell(\mathcal{J}) + \frac{2}{\sqrt{\ell}}$.

Proof. Define the functions $g_y = L(0, y)$ and $h_y(\hat{y}) = L(\hat{y}, y) - L(0, y)$. Then $L(\varphi(x), y) = g_y + h_y(\varphi(x))$, and $\mathcal{Q} = g_y + h_y \circ \mathcal{J}$. Since $|g_y| \le 1$ for all $y$, we have

$$\hat{R}_\ell(\mathcal{Q}) \le \hat{R}_\ell(h_y \circ \mathcal{J}) + \frac{2}{\sqrt{\ell}} ,$$

by a property of Rademacher complexity (see e.g. [11, Thm 4.15(v), p. 101]). For all $y$, $h_y(\cdot)$ is Lipschitz with constant $B$, and $h_y(0) = 0$. Thus by the Ledoux-Talagrand contraction inequality [9, Thm 4.12, p. 112] we have

$$\hat{R}_\ell(h_y \circ \mathcal{J}) \le 2B \hat{R}_\ell(\mathcal{J}) .$$

The bound in Lemma 1 is sometimes of the right order of magnitude and sometimes quite loose. Consider the space $\mathcal{X} \times \mathcal{Y} = \mathbb{R} \times \{-1, 0, 1\}$ and the loss $L(\hat{y}, y) = (\tau(\hat{y}) - y)^2/4$, where $\tau(y) = \min(1, \max(-1, y))$. Assume $P(y = 0) = 1$. Then for any class $\mathcal{J}$ of predictors that map into $\{-1, 1\}$, with probability 1 we have $L(\varphi(x), y) = 1/4$, and thus $\hat{R}_\ell(\mathcal{Q}) = \frac{1}{2\ell}\,\mathbb{E}_\sigma\left|\sum_{i=1}^{\ell}\sigma_i\right|$. By the Kahane-Khintchine inequality (cf. Section 5.2.3), we conclude $\hat{R}_\ell(\mathcal{Q}) = \Theta(\ell^{-1/2})$. If we choose $\mathcal{J}$ small, say $\mathcal{J} = \{x \mapsto 1\}$, then $\hat{R}_\ell(\mathcal{J}) = \frac{2}{\ell}\,\mathbb{E}_\sigma\left|\sum_{i=1}^{\ell}\sigma_i\right| = \Theta(\ell^{-1/2})$, and the bound is tight. If we choose $\mathcal{J}$ as large as possible, and we assume that $x$ has a continuous distribution, then $\hat{R}_\ell(\mathcal{J}) = 2$ almost surely, and the bound is loose.
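The $\Theta(\ell^{-1/2})$ rate in the tight case of this example is easy to confirm by simulation. This small check is ours, not part of the paper; since $\mathbb{E}|\sum_i \sigma_i| \approx \sqrt{2\ell/\pi}$, the scaled estimates printed below should be roughly constant (about $2\sqrt{2/\pi} \approx 1.6$).

```python
import numpy as np

def rademacher_of_constant_class(ell, n_draws=20000, seed=0):
    """R_hat for the single-function class J = {x -> 1}: (2/ell) * E|sum_i sigma_i|."""
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, ell))
    return 2.0 / ell * np.abs(sigma.sum(axis=1)).mean()

for ell in (25, 100, 400):
    print(ell, rademacher_of_constant_class(ell) * np.sqrt(ell))
```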

5.2 Proof of Theorem 3

We prove this theorem in several steps, starting from the definition

$$\hat{R}_\ell(\mathcal{J}) = \mathbb{E}_\sigma\left[\sup_{(f,g)\in\mathcal{H}}\left|\frac{1}{\ell}\sum_{i=1}^{\ell}\sigma_i\big(f(x_i) + g(x_i)\big)\right|\right],$$

where as usual the expectation is with respect to $\sigma$. We first convert from a supremum over the function space $\mathcal{H}$ to a supremum over a finite-dimensional Euclidean space that we can solve directly. Next, we use the Kahane-Khintchine inequality to bound the expectation over $\sigma$ above and below by expectations that we can compute explicitly. Finally, with some matrix algebra we can write $\hat{R}_\ell(\mathcal{J})$ in terms of blocks of the original kernel matrices.

5.2.1 Converting to Euclidean Space

Since $(f, g) \in \mathcal{H}$ implies $(-f, -g) \in \mathcal{H}$, we can drop the absolute value. Next, note that the expression inside the supremum depends only on the values of $f$ and $g$ at the sample points. By the reproducing kernel property, it's easy to show that $f(x_j) = (\mathrm{Proj}_{\mathcal{L}_F} f)(x_j)$ and $g(x_j) = (\mathrm{Proj}_{\mathcal{L}_G} g)(x_j)$ for any sample point $x_j$. Thus the supremum in the expression for $\hat{R}_\ell(\mathcal{J})$ is unchanged if we restrict the supremum to $(f, g) \in (\mathcal{L}_F \times \mathcal{L}_G) \cap \mathcal{H}$. Applying these observations, we get

$$\hat{R}_\ell(\mathcal{J}) = \frac{1}{\ell}\,\mathbb{E}_\sigma \sup\left\{\sum_{i=1}^{\ell}\sigma_i\big(f(x_i) + g(x_i)\big) : (f, g) \in (\mathcal{L}_F \times \mathcal{L}_G) \cap \mathcal{H}\right\}.$$

Finally, we can write the set $(\mathcal{L}_F \times \mathcal{L}_G) \cap \mathcal{H}$ as

$$\left\{(f_\alpha, g_\beta) : \gamma_F\,\alpha' K_F \alpha + \gamma_G\,\beta' K_G \beta + \lambda\sum_{i=\ell+1}^{\ell+u}|f_\alpha(x_i) - g_\beta(x_i)|^2 \le 1\right\} = \left\{(f_\alpha, g_\beta) : (\alpha'\ \beta')\,N\binom{\alpha}{\beta} \le 1\right\},$$

where

$$N := \begin{pmatrix}\gamma_F K_F & 0\\ 0 & \gamma_G K_G\end{pmatrix} + \lambda K_C, \qquad K_C := \begin{pmatrix}A\\ C'\\ -D\\ -F'\end{pmatrix}\begin{pmatrix}A & C & -D & -F\end{pmatrix}.$$

Now we can write

$$\hat{R}_\ell(\mathcal{J}) = \frac{1}{\ell}\,\mathbb{E}_\sigma\left[\sup_{\alpha,\beta}\left\{\sigma'(C'\ B)\alpha + \sigma'(F'\ E)\beta \ \text{ s.t. }\ (\alpha'\ \beta')\,N\binom{\alpha}{\beta} \le 1\right\}\right].$$

5.2.2 Evaluating the Supremum

For a symmetric positive definite (spd) matrix $M$, it's easy to show that $\sup_{\alpha : \alpha' M \alpha \le 1} v'\alpha = \|M^{-1/2} v\|$. However, our matrix $N$ may not have full rank. Note that each entry of the column vector $(C'\ B)\alpha$ is an inner product between $\alpha$ and a row (or column, by symmetry) of $K_F$. Thus if $\alpha_\parallel = \mathrm{Proj}_{\mathrm{ColSpace}(K_F)}\,\alpha$, then $(C'\ B)\alpha = (C'\ B)\alpha_\parallel$. Similar reasoning shows that $(F'\ E)\beta = (F'\ E)\beta_\parallel$, for $\beta_\parallel = \mathrm{Proj}_{\mathrm{ColSpace}(K_G)}\,\beta$, and that the quadratic form $(\alpha'\ \beta')N(\alpha'\ \beta')'$ is unchanged when we replace $(\alpha, \beta)$ by $(\alpha_\parallel, \beta_\parallel)$. Thus the supremum can be rewritten as

$$\sup_{\substack{\alpha_\parallel \in \mathrm{ColSpace}(K_F)\\ \beta_\parallel \in \mathrm{ColSpace}(K_G)}}\left\{\sigma'(C'\ B)\alpha_\parallel + \sigma'(F'\ E)\beta_\parallel \ \text{ s.t. }\ (\alpha_\parallel'\ \beta_\parallel')\,N\,(\alpha_\parallel'\ \beta_\parallel')' \le 1\right\}.$$

Changing to eigenbases cleans up this expression and clears the way for substantial simplifications in later sections. Diagonalize the psd kernel matrices to get orthonormal bases for the column spaces of $K_F$ and $K_G$:

$$V' K_F V = \Sigma_F \quad \text{and} \quad W' K_G W = \Sigma_G ,$$

where $\Sigma_F$ and $\Sigma_G$ are diagonal matrices containing the nonzero eigenvalues, and the columns of $V$ and $W$ are bases for the column spaces of $K_F$ and $K_G$, respectively. Now introduce $a$ and $b$ such that $\alpha_\parallel = Va$ and $\beta_\parallel = Wb$. Applying this change of variables to the quadratic form, we get

$$(\alpha_\parallel'\ \beta_\parallel')\,N\binom{\alpha_\parallel}{\beta_\parallel} = (a'\ b')\,T\binom{a}{b},$$

where $T = \Sigma + \lambda RR'$ with

$$\Sigma := \begin{pmatrix}\gamma_F \Sigma_F & 0\\ 0 & \gamma_G \Sigma_G\end{pmatrix}, \qquad R := \begin{pmatrix}V' & 0\\ 0 & W'\end{pmatrix}\begin{pmatrix}A\\ C'\\ -D\\ -F'\end{pmatrix}.$$

The matrix $T$ is spd, since it's the sum of the spd diagonal matrix $\Sigma$ and the psd matrix $\lambda RR'$. For compactness, define

$$\mathcal{W} := (C'\ B\ \ F'\ E)\begin{pmatrix}V & 0\\ 0 & W\end{pmatrix}.$$

We can now write

$$\hat{R}_\ell(\mathcal{J}) = \frac{1}{\ell}\,\mathbb{E}_\sigma\left[\sup_{a,b}\left\{\left|\sigma'\mathcal{W}\binom{a}{b}\right| \ \text{ s.t. }\ (a'\ b')\,T\binom{a}{b} \le 1\right\}\right].$$

Since $T$ is spd, we can evaluate the supremum as described above to get

$$\hat{R}_\ell(\mathcal{J}) = \frac{1}{\ell}\,\mathbb{E}_\sigma\left\|T^{-1/2}\mathcal{W}'\sigma\right\| .$$
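The closed form for the supremum used in this step is easy to sanity-check numerically. The sketch below is ours: it compares $\|T^{-1/2}w\|$ with the value attained at the maximizer $v^* = T^{-1}w / \|T^{-1/2}w\|$ for a random spd matrix and a random vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
Q = rng.standard_normal((d, d))
T = Q @ Q.T + np.eye(d)          # a symmetric positive definite matrix
w = rng.standard_normal(d)

# Closed form: sup over {v : v'Tv <= 1} of w'v equals ||T^{-1/2} w||.
closed_form = np.sqrt(w @ np.linalg.solve(T, w))

# The maximizer v* = T^{-1} w / ||T^{-1/2} w|| is feasible and attains this value.
v_star = np.linalg.solve(T, w) / closed_form
print(closed_form, w @ v_star, v_star @ T @ v_star)   # value, value again, constraint = 1
```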

5.2.3 Bounding $\hat{R}_\ell(\mathcal{J})$ above and below

We make use of the following lemma (see [8] for a proof of the lower bound; the upper bound is Jensen's inequality):

Lemma 2 (Kahane-Khintchine inequality). For any vectors $a_1, \ldots, a_n$ in a Hilbert space and independent Rademacher random variables $\sigma_1, \ldots, \sigma_n$, we have

$$\frac{1}{\sqrt{2}}\left(\mathbb{E}\left\|\sum_{i=1}^{n}\sigma_i a_i\right\|^2\right)^{1/2} \le \mathbb{E}\left\|\sum_{i=1}^{n}\sigma_i a_i\right\| \le \left(\mathbb{E}\left\|\sum_{i=1}^{n}\sigma_i a_i\right\|^2\right)^{1/2}.$$

Taking the columns of $T^{-1/2}\mathcal{W}'$ to be the $a_i$'s, we can apply this lemma to our expression for $\hat{R}_\ell(\mathcal{J})$ to get

$$\frac{1}{\sqrt{2}}\,\frac{U}{\ell} \le \hat{R}_\ell(\mathcal{J}) \le \frac{U}{\ell} ,$$

where

$$U^2 := \mathbb{E}_\sigma\left\|T^{-1/2}\mathcal{W}'\sigma\right\|^2 = \mathbb{E}_\sigma\,\mathrm{tr}\!\left(\mathcal{W}T^{-1}\mathcal{W}'\sigma\sigma'\right) = \mathrm{tr}\!\left(\mathcal{W}T^{-1}\mathcal{W}'\right).$$

To get the second equality we expanded the squared norm, took the trace of the scalar quantity inside the expectation, and rotated the factors inside the trace. To get the last equality we interchanged the trace and the expectation and noted that $\mathbb{E}\,\sigma\sigma'$ is the identity matrix.
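A quick Monte Carlo check of Lemma 2, purely illustrative and ours: draw random vectors $a_i$, estimate the first moment of $\|\sum_i \sigma_i a_i\|$, and compare it with the two sides of the inequality (the second moment equals $\sum_i \|a_i\|^2$ exactly).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_draws = 8, 5, 50000
a = rng.standard_normal((n, d))                      # fixed vectors a_1, ..., a_n

sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))   # Rademacher draws
norms = np.linalg.norm(sigma @ a, axis=1)            # ||sum_i sigma_i a_i|| per draw

first_moment = norms.mean()
second_moment_sqrt = np.sqrt((norms ** 2).mean())
# The middle value should lie between the other two.
print(second_moment_sqrt / np.sqrt(2), first_moment, second_moment_sqrt)
```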

5.2.4 Writing our Expression in terms of the Original Kernel Matrices

It will be helpful to divide $V$ and $W$ into labeled and unlabeled parts. We note the dimensions of $V$ and $W$ are $(\ell + u) \times r_F$ and $(\ell + u) \times r_G$, where $r_F$ and $r_G$ are the ranks of $K_F$ and $K_G$, respectively. So we have

$$K_F = \begin{pmatrix}A & C\\ C' & B\end{pmatrix} = \begin{pmatrix}V_u\\ V_\ell\end{pmatrix}\Sigma_F\,(V_u'\ V_\ell') \qquad (1)$$

$$K_G = \begin{pmatrix}D & F\\ F' & E\end{pmatrix} = \begin{pmatrix}W_u\\ W_\ell\end{pmatrix}\Sigma_G\,(W_u'\ W_\ell') \qquad (2)$$

Rearranging the diagonalization, we also have

$$\begin{pmatrix}C\\ B\end{pmatrix} = \begin{pmatrix}V_u\\ V_\ell\end{pmatrix}\Sigma_F V_\ell' \qquad (3)$$

$$\begin{pmatrix}F\\ E\end{pmatrix} = \begin{pmatrix}W_u\\ W_\ell\end{pmatrix}\Sigma_G W_\ell' \qquad (4)$$

By equating blocks in these four matrix equations, we attain all the substitutions we need to write $U^2$ in terms of the original kernel submatrices $A$, $B$, $C$, $D$, $E$, and $F$. For example, by equating the top left submatrices in Equation 1, we get $A = V_u \Sigma_F V_u'$. Using these substitutions, we can write:

$$\mathcal{W} = (C'\ B\ \ F'\ E)\begin{pmatrix}V & 0\\ 0 & W\end{pmatrix} = (V_\ell \Sigma_F\ \ W_\ell \Sigma_G), \qquad R = \begin{pmatrix}V' & 0\\ 0 & W'\end{pmatrix}\begin{pmatrix}A\\ C'\\ -D\\ -F'\end{pmatrix} = \begin{pmatrix}\Sigma_F V_u'\\ -\Sigma_G W_u'\end{pmatrix}.$$

We now work on the $T^{-1}$ factor in our expression $U^2 = \mathrm{tr}(\mathcal{W}T^{-1}\mathcal{W}')$. Using the Sherman-Morrison-Woodbury formula, $(P + ZZ')^{-1} = P^{-1} - P^{-1}Z(I + Z'P^{-1}Z)^{-1}Z'P^{-1}$ provided the inverses exist [7, p. 50], we expand $T^{-1} = (\Sigma + \lambda RR')^{-1}$ as

$$T^{-1} = \Sigma^{-1} - \lambda\Sigma^{-1}R\left(I + \lambda R'\Sigma^{-1}R\right)^{-1}R'\Sigma^{-1},$$

where $I$ is the identity matrix. Since $\Sigma$ and $I + \lambda R'\Sigma^{-1}R$ are spd, our inverses exist and the expansion is justified. Substituting this expansion into our expression for $U^2$, we get

$$U^2 = \mathrm{tr}\!\left(\mathcal{W}\Sigma^{-1}\mathcal{W}'\right) - \lambda\,\mathrm{tr}\!\left(\mathcal{W}\Sigma^{-1}R\left(I + \lambda R'\Sigma^{-1}R\right)^{-1}R'\Sigma^{-1}\mathcal{W}'\right).$$

We'll have our final form once we can express $\mathcal{W}\Sigma^{-1}\mathcal{W}'$, $R'\Sigma^{-1}R$, and $R'\Sigma^{-1}\mathcal{W}'$ in terms of the original kernel matrix blocks. We have

$$\mathcal{W}\Sigma^{-1}\mathcal{W}' = (V_\ell\Sigma_F\ \ W_\ell\Sigma_G)\begin{pmatrix}\gamma_F^{-1}\Sigma_F^{-1} & 0\\ 0 & \gamma_G^{-1}\Sigma_G^{-1}\end{pmatrix}\begin{pmatrix}\Sigma_F V_\ell'\\ \Sigma_G W_\ell'\end{pmatrix} = \gamma_F^{-1}V_\ell\Sigma_F V_\ell' + \gamma_G^{-1}W_\ell\Sigma_G W_\ell' = \gamma_F^{-1}B + \gamma_G^{-1}E .$$

The last equality follows by equating submatrices in Equations 1 and 2. Using very similar steps, but with different substitutions read from Equations 1 and 2, we also get

$$R'\Sigma^{-1}R = \gamma_F^{-1}A + \gamma_G^{-1}D = M, \qquad R'\Sigma^{-1}\mathcal{W}' = \gamma_F^{-1}C - \gamma_G^{-1}F = J .$$

Putting things together, we get

$$U^2 = \mathrm{tr}\!\left(\gamma_F^{-1}B + \gamma_G^{-1}E\right) - \lambda\,\mathrm{tr}\!\left(J'(I + \lambda M)^{-1}J\right).$$
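The Sherman-Morrison-Woodbury step is the only matrix identity this argument needs, and it is easy to verify numerically. This check is ours, written directly for the form $T = \Sigma + \lambda RR'$ used above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, lam = 6, 3, 0.7
Sigma = np.diag(rng.uniform(0.5, 2.0, size=d))       # spd diagonal, as in the proof
R = rng.standard_normal((d, k))

T = Sigma + lam * R @ R.T
Sigma_inv = np.linalg.inv(Sigma)
woodbury = Sigma_inv - lam * Sigma_inv @ R @ np.linalg.inv(
    np.eye(k) + lam * R.T @ Sigma_inv @ R) @ R.T @ Sigma_inv
print(np.max(np.abs(np.linalg.inv(T) - woodbury)))   # should be ~1e-15
```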

T −1

Experiments

The objective of our experiments was to investigate whether the reduction in hypothesis space complexity due to co-regularization correlates with an improvement in test performance. We closely followed the experimental setup used in [4] on the UCI repository data sets [2]. We selected those data sets with continuous target values, between 5 and 500 examples, and at least

factor in our expression

5 features. For each of these 29 data sets, we generated

Using the Sherman-Morrison-

two views by randomly splitting the features into two

4 , we expand

T −1 = (Σ+λRR0 )−1

as

sets of equal size. To get our performance numbers, we averaged over 10 randomly chosen feature splits.

T −1 = Σ−1 − λΣ−1 R I + λR0 Σ−1 R 4

0 −1

(A + U U )

−1

=A

−1

−A

T

−1

U (I + U A

provided the inverses exist [7, p. 50].

−1

R0 Σ−1 −1

U)

T

To

evaluate the performance of each split, we performed 10-fold `inverse' cross validation, in which one fold is −1

U A

,

used as labeled data, and the other nine folds are used as unlabeled data.
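A schematic version of this protocol is sketched below. It is our own illustration under the assumptions stated in the comments (how the data are loaded and preprocessed is not specified here), and the helper names are ours.

```python
import numpy as np

def random_view_split(X, rng):
    """Split the feature columns of X into two views of (roughly) equal size."""
    perm = rng.permutation(X.shape[1])
    half = X.shape[1] // 2
    return X[:, perm[:half]], X[:, perm[half:]]

def inverse_cv_folds(n, n_folds, rng):
    """'Inverse' cross validation: one fold labeled, the remaining folds unlabeled."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        labeled = folds[k]
        unlabeled = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield labeled, unlabeled

# Usage sketch: for each of 10 random feature splits, run 10-fold inverse CV,
# train CoRLS with lambda = 0 and lambda = 1/10, and record the RMS error and
# the Rademacher complexity bound from Theorem 3 for each setting.
```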

For each data set, we used the CoRLS algorithm with loss functional

$$\hat{L}(f, g) = \frac{1}{2\ell}\sum_{i=1}^{\ell}\left[(f(x_i) - y_i)^2 + (g(x_i) - y_i)^2\right],$$

as in [12, 4]. In [4], CoRLS is compared to RLS. Here, we compare CoRLS with co-regularization parameter $\lambda = 1/10$ to the performance with $\lambda = 0$. In Figure 1, for each data set we plot the percent improvement in RMS error when going from $\lambda = 0$ to $\lambda = 1/10$ against the size of the decrease in the Rademacher complexity. The correlation between these two quantities is $r = 0.67$. The error bars extend two standard errors from the mean.

[Figure 1: The percent improvement in RMS error of the CoRLS ($\lambda = 1/10$) algorithm over the 2-view RLS ($\lambda = 0$) algorithm vs. the decrease in Rademacher complexity.]

7 Conclusions

We have given tight bounds for the Rademacher complexity of the co-regularized hypothesis space arising from two RKHS views, as well as a generalization bound for the CoRLS algorithm. While our theorems bound the gap between training and test performance, they say nothing about the absolute performance: if neither view has good predictors, then we'll have poor performance, regardless of the generalization bound. Nevertheless, experimentally we found a correlation between improved generalization bounds and improved test performance. This may suggest that for typical parameter settings, or at least for those used in [4], reduction in Rademacher complexity is a good predictor of improved performance. We leave this question for further study, as well as the question of whether our expression for Rademacher complexity can help guide the choice of views.

Acknowledgements

We gratefully acknowledge the support of the NSF under awards DMS-0434383 and DMS-01030526.

References

[1] Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In COLT, pages 111-126, June 2005.

[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[3] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92-100, 1998.

[4] Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel. Efficient co-regularised least squares regression. In ICML, pages 137-144, 2006.

[5] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In NIPS, 2001.

[6] Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor, and Sándor Szedmák. Two view learning: SVM-2K, theory and practice. In NIPS, 2005.

[7] Gene H. Golub and Charles F. Van Loan. Matrix Computations (Johns Hopkins Studies in Mathematical Sciences). The Johns Hopkins University Press, October 1996.

[8] Rafal Latala and Krzysztof Oleszkiewicz. On the best constant in the Khintchine-Kahane inequality. Studia Mathematica, 109(1):101-104, 1994.

[9] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[10] Tom Mitchell. Machine learning and extracting information from the web. NAACL Invited Talk, 2001.

[11] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, June 2004.

[12] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML, pages 824-831, 2005.