The Rademacher Complexity of Co-Regularized Kernel Classes

David S. Rosenberg
Department of Statistics
University of California, Berkeley
Berkeley, CA 94720
[email protected]

Peter L. Bartlett
Department of Statistics and Computer Science Division
University of California, Berkeley
Berkeley, CA 94720
[email protected]

Abstract
In the multi-view approach to semi-supervised learning, we choose one predictor from each of multiple hypothesis classes, and we co-regularize our choices by penalizing disagreement among the predictors on the unlabeled data. We examine the co-regularization method used in the co-regularized least squares (CoRLS) algorithm [12], in which the views are reproducing kernel Hilbert spaces (RKHS's), and the disagreement penalty is the average squared difference in predictions. The final predictor is the pointwise average of the predictors from each view. We call the set of predictors that can result from this procedure the co-regularized hypothesis class. Our main result is a tight bound on the Rademacher complexity of the co-regularized hypothesis class in terms of the kernel matrices of each RKHS. We find that the co-regularization reduces the Rademacher complexity by an amount that depends on the distance between the two views, as measured by a data dependent metric. We then use standard techniques to bound the gap between training error and test error for the CoRLS algorithm. Experimentally, we find that the amount of reduction in complexity introduced by co-regularization correlates with the amount of improvement that co-regularization gives in the CoRLS algorithm.

1 Introduction
In the multi-view approach to semi-supervised learning, we have several classes of predictors, or views. The goal is to find a predictor in each view that performs well on the labeled data, such that all the chosen predictors give similar predictions on the unlabeled data. This approach is motivated by the assumption that each view contains a predictor that's approximately correct. Roughly speaking, predictors that are approximately correct are also approximately equal. Thus we can reduce the complexity of our learning problem by eliminating from the search space all predictors that don't have matching predictors in each of the other views. Because of this reduction in complexity, it's reasonable to expect better test performance for the same amount of labeled training data. This paper provides a more precise understanding of the ways in which the agreement constraint and the choice of views affect complexity and generalization.

Early theoretical results on the multi-view approach to semi-supervised learning, in particular [5] and the original co-training paper [3], assume that the predictors of each view are conditionally independent given the labels they try to predict. This assumption is difficult to justify in practice, yet there are scant other theoretical results to guide the choice of views. In [1], the authors present a theoretical framework for semi-supervised learning that nicely contains multi-view learning as a special case. Their complexity results are in terms of the complexity of the space of compatible predictors, which in the case of multi-view learning corresponds to those predictors that have matching predictors in the other views. To apply these results to a particular multi-view learning algorithm, one must compute the complexity of the class of compatible predictors. This problem is addressed to some extent in [6], in which they compute an upper bound on the Rademacher complexity of the space of compatible predictors. Although their results do not assume independent views, their sample complexity bound is given only as the solution to an optimization problem.

In this paper, we consider the co-regularized least squares (CoRLS) algorithm, a two-view, semi-supervised version of regularized least squares (RLS). The algorithm was first discussed in [12], and a similar algorithm was given earlier in [10]. Although CoRLS has been shown to work well in practice for both classification [12] and regression [4], many would-be users of the algorithm are deterred by the requirement of choosing two views, as this often seems an arbitrary process. We attempt to improve this situation by showing how the choice of views affects generalization performance, even in settings where we can't make any probabilistic assumptions about the views.

In CoRLS, the two views are reproducing kernel Hilbert spaces (RKHS's), call them F and G. We find predictors f* ∈ F and g* ∈ G that minimize an objective function of the form

    Labeled Loss(f, g) + Complexity(f, g) + λ Σ_{x ∈ {unlabeled points}} [f(x) − g(x)]².

The last term is the co-regularization, which encourages the selection of a pair of predictors (f*, g*) that agree on the unlabeled data. We follow [4, 6] and consider the average predictor φ* := (f* + g*)/2, which comes from the class

    J_ave := { x ↦ [f(x) + g(x)]/2 : (f, g) ∈ F × G }.

For typical choices of F and G, this class is too large to admit uniform generalization bounds. In Section 3.1, however, we see that under a boundedness condition on the labeled loss, the complexity and co-regularization terms force φ* to come from a much smaller class J_λ that depends only on the unlabeled data, where λ ≥ 0 is the coefficient of the co-regularization term in the objective function. As λ increases, it is clear that the size of J_λ decreases, and we'd expect this to improve generalization. We make this precise in Theorem 2, where we use standard arguments to bound the gap between training error and test error in terms of the Rademacher complexity of J_λ.

The main contribution of this paper is Theorem 3, which gives an explicit expression for the Rademacher complexity of J_λ, up to a small constant factor. For ordinary kernel RLS (i.e. single view and fully supervised), it is known that the squared Rademacher complexity is proportional to the trace of the kernel matrix (see e.g. [11, Thm 7.39, p. 231]). We find that for the two-view case without co-regularization (i.e. λ = 0), J_λ has squared Rademacher complexity equal to the average of the traces of the two labeled-data kernel matrices. When λ > 0, the co-regularization term reduces this quantity by an amount that depends on how different the two views are, and in particular on the average distance between the two views' representations of the labeled data, where the distance metric is determined by the unlabeled data.

In Section 2, we give a formal presentation of the CoRLS algorithm. Our results are presented in Section 3, discussed in Section 4, and proved in Section 5. In Section 6, we present an empirical investigation of whether the effect of co-regularization on test performance is correlated with its effect on Rademacher complexity.
2 Co-Regularized Least Squares

We consider the case of two views, though both the algorithm [4] and the analysis can be extended to multiple views. Our views are RKHS's F and G of functions mapping from an arbitrary space X to the reals. The CoRLS algorithm takes labeled points (x_1, y_1), ..., (x_ℓ, y_ℓ) ∈ X × Y and unlabeled points x_{ℓ+1}, ..., x_{ℓ+u} ∈ X, and solves the following minimization problem:

    (f*, g*) = argmin_{f ∈ F, g ∈ G}  L̂(f, g) + γ_F ||f||²_F + γ_G ||g||²_G + λ Σ_{i=ℓ+1}^{ℓ+u} [f(x_i) − g(x_i)]²,

for some loss functional L̂ that depends on the labeled data, and regularization parameters γ_F, γ_G, and λ. The final output is φ* := (f* + g*)/2.

In [12, 4], the loss functional considered was

    L̂(f, g) = (1/2ℓ) Σ_{i=1}^{ℓ} [ (f(x_i) − y_i)² + (g(x_i) − y_i)² ].

If we use this loss and set λ = 0, the objective function decouples into two single-view, fully-supervised kernel RLS regressions. We also propose the loss functional

    L̂(f, g) = (1/ℓ) Σ_{i=1}^{ℓ} [ (f(x_i) + g(x_i))/2 − y_i ]²,

as one that seems natural when the final prediction function is (f + g)/2, as in [4, 6]. Our analysis applies to both of these loss functionals, as well as many more that depend on the labeled data only and that satisfy a boundedness condition specified in Section 3.1. Thus we take the S in CoRLS to refer to the squares in the complexity and co-regularization terms, which our analysis requires, rather than to the squares in the loss functional, which we don't require.
2.1 Notation and Preliminaries

We'll denote the reproducing kernels corresponding to F and G by k_F : X × X → R and k_G : X × X → R, respectively. It's convenient to introduce notation for the span of the data in each space: L_F := span{k_F(x_i, ·)}_{i=1}^{ℓ+u} ⊂ F and L_G := span{k_G(x_i, ·)}_{i=1}^{ℓ+u} ⊂ G. By the Representer Theorem, it's clear that (f*, g*) ∈ L_F × L_G. That is, we can write the CoRLS solution as

    f*(·) = Σ_{i=1}^{u+ℓ} α_i k_F(x_i, ·) ∈ L_F    and    g*(·) = Σ_{i=1}^{u+ℓ} β_i k_G(x_i, ·) ∈ L_G

for α = (α_1, ..., α_{u+ℓ}) ∈ R^{u+ℓ} and β = (β_1, ..., β_{u+ℓ}) ∈ R^{u+ℓ}. We'll denote an arbitrary element of L_F by f_α = Σ_{i=1}^{u+ℓ} α_i k_F(x_i, ·), and similarly for elements of L_G. Define the kernel matrices (K_F)_{ij} = k_F(x_i, x_j) and (K_G)_{ij} = k_G(x_i, x_j), and partition them into blocks corresponding to the unlabeled and labeled points:

    K_F = [ A   C ]        K_G = [ D   F ]
          [ C'  B ],              [ F'  E ],

where A and D are u×u (unlabeled), B and E are ℓ×ℓ (labeled), and C and F are u×ℓ (cross-terms). We can now write the agreement term as

    Σ_{i=ℓ+1}^{ℓ+u} [f_α(x_i) − g_β(x_i)]² = ||(A  C)α − (D  F)β||²,

and it follows from the reproducing property that ||f_α||²_F = α'K_F α and ||g_β||²_G = β'K_G β. For each of the loss functionals presented in the beginning of this section, the whole objective function is quadratic in α and β, and thus a solution (f*, g*) can be found by differentiating and solving a system of linear equations. See [12, 4] for more details.
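As a concrete illustration of this last point, the sketch below sets the gradient of the quadratic objective (with the first loss functional from Section 2) to zero and solves the resulting linear system in (α, β). It is a minimal sketch under our own conventions, not the implementation from [12, 4]: the block names follow the partition above, points are assumed to be ordered with the u unlabeled points first, and a small ridge term eps is our addition for numerical stability.

    import numpy as np

    def corls_solve(KF, KG, y, u, gamma_F=1.0, gamma_G=1.0, lam=0.1, eps=1e-8):
        """Solve CoRLS for the expansion coefficients (alpha, beta).

        KF, KG: (u+l) x (u+l) kernel matrices with the u unlabeled points first.
        y: labels for the l labeled points (the last l points).
        """
        n = KF.shape[0]
        l = n - u
        A, C, B = KF[:u, :u], KF[:u, u:], KF[u:, u:]
        D, F, E = KG[:u, :u], KG[:u, u:], KG[u:, u:]
        SF = np.hstack([C.T, B])    # labeled rows of KF: f_alpha at labeled points
        SG = np.hstack([F.T, E])
        TF = np.hstack([A, C])      # unlabeled rows of KF: f_alpha at unlabeled points
        TG = np.hstack([D, F])
        # Stationarity conditions of the quadratic objective in (alpha, beta).
        top = np.hstack([SF.T @ SF / l + 2 * gamma_F * KF + 2 * lam * TF.T @ TF,
                         -2 * lam * TF.T @ TG])
        bot = np.hstack([-2 * lam * TG.T @ TF,
                         SG.T @ SG / l + 2 * gamma_G * KG + 2 * lam * TG.T @ TG])
        lhs = np.vstack([top, bot]) + eps * np.eye(2 * n)
        rhs = np.concatenate([SF.T @ y / l, SG.T @ y / l])
        coef = np.linalg.solve(lhs, rhs)
        return coef[:n], coef[n:]   # alpha, beta

The average prediction at a new point x is then (Σ_i α_i k_F(x_i, x) + Σ_i β_i k_G(x_i, x))/2.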
3 Results

3.1 Bounding the CoRLS Function Class
We assume the loss functional L̂ : F × G → [0, ∞) satisfies L̂(f, g) ≤ 1 for f ≡ 0 and g ≡ 0. That is, L̂(0, 0) ≤ 1. This is true, for example, for the two loss functionals given in Section 2, provided that Y = [−1, 1]. Assuming L̂(0, 0) ≤ 1, we now derive the co-regularized function class J¹ ⊂ J_ave from which the CoRLS predictors are drawn.

Recall that our original problem was to minimize

    Q(f, g) := L̂(f, g) + γ_F ||f||²_F + γ_G ||g||²_G + λ Σ_{i=ℓ+1}^{ℓ+u} [f(x_i) − g(x_i)]²

over F × G. Plugging in the trivial predictors f ≡ 0 and g ≡ 0 gives the following upper bound:

    min_{f,g} Q(f, g) ≤ Q(0, 0) = L̂(0, 0) ≤ 1.

Since all terms of Q(f, g) are nonnegative, we conclude that any (f*, g*) minimizing Q(f, g) is contained in

    H := { (f, g) : γ_F ||f||²_F + γ_G ||g||²_G + λ Σ_{i=ℓ+1}^{ℓ+u} [f(x_i) − g(x_i)]² ≤ 1 },

and the final predictor for the CoRLS algorithm is chosen from the class

    J := { x ↦ [f(x) + g(x)]/2 : (f, g) ∈ H }.

Note that the function classes H and J do not depend on the labeled data, and thus are deterministic after conditioning on the unlabeled data.

¹We now suppress the λ in the J_λ of Section 1.

3.2 Setup for the Theorems

We will use empirical Rademacher complexity as our measure of the size of a function class. The empirical Rademacher complexity of a function class F = {φ : X → Y} for a sample x_1, ..., x_ℓ ∈ X is defined as

    R̂_ℓ(F) = E_σ [ sup_{φ ∈ F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i φ(x_i) | ],

where the expectation is with respect to σ = {σ_1, ..., σ_ℓ}, and the σ_i are i.i.d. Rademacher random variables². In our semi-supervised context, we assume that the labeled points (x_1, y_1), ..., (x_ℓ, y_ℓ) are drawn i.i.d. from a distribution D on X × Y, but make no assumptions about the unlabeled points x_{ℓ+1}, ..., x_{ℓ+u} ∈ X. Indeed, our claims are conditional on the unlabeled data, and thus remain true no matter what distribution the unlabeled points are drawn from.

²We say σ is a Rademacher random variable if P(σ = 1) = P(σ = −1) = 1/2.

For a given loss function L : Y² → [0, 1], and for any φ ∈ J, we are interested in bounds on the expected loss E_D L(φ(X), Y). Typically, L would be the loss used to define the labeled empirical risk functional L̂, but it need not be.
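For intuition, the empirical Rademacher complexity defined above can be approximated numerically when a function class is finite and given by its values on the sample. The following Monte Carlo sketch is an illustration only (the function name is ours, and it averages over random σ draws rather than computing the exact expectation).

    import numpy as np

    def empirical_rademacher(values, n_draws=10000, rng=None):
        """Estimate R-hat_l(F) = E_sigma sup_phi |(2/l) sum_i sigma_i phi(x_i)|.

        values: array of shape (num_functions, l); row j holds phi_j(x_1..x_l).
        """
        rng = np.random.default_rng(rng)
        n_funcs, l = values.shape
        sigma = rng.choice([-1.0, 1.0], size=(n_draws, l))   # Rademacher draws
        # For each draw, take the supremum of |(2/l) <sigma, phi>| over the class.
        sups = np.abs(sigma @ values.T * (2.0 / l)).max(axis=1)
        return sups.mean()

    # Example: the singleton class {x -> 1} has R-hat_l = (2/l) E|sum_i sigma_i|,
    # which is Theta(1/sqrt(l)), as used in the discussion after Lemma 1.
    print(empirical_rademacher(np.ones((1, 100))))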
Our generalization bound in Theorem 2 is based on the following theorem (see e.g. [11, Thm 4.9, p. 96]):
Theorem 1. Fix δ ∈ (0, 1), and let Q be a class of functions mapping from X × Y to [0, 1]. With probability at least 1 − δ over the sample (X_1, Y_1), ..., (X_ℓ, Y_ℓ) drawn i.i.d. from D, every q ∈ Q satisfies

    E_D q(X, Y) ≤ (1/ℓ) Σ_{i=1}^{ℓ} q(X_i, Y_i) + R̂_ℓ(Q) + 3 √( ln(2/δ) / (2ℓ) ).

The expression E_D q(X, Y) is deterministic, but unknown to us because we do not know the data generating distribution D. The terms (1/ℓ) Σ_{i=1}^{ℓ} q(X_i, Y_i) and R̂_ℓ(Q) are random, but with probability at least 1 − δ, the inequality holds for the observed values of these random quantities, and for every q ∈ Q.
3.3 Theorems

In Theorem 2, we give generalization bounds for the class J in terms of the empirical Rademacher complexity R̂_ℓ(J). In Theorem 3, we give upper and lower bounds on R̂_ℓ(J) that can be written explicitly in terms of blocks of the kernel matrices K_F and K_G.
Theorem 2. Suppose that L : Y² → [0, 1] satisfies the following uniform Lipschitz condition: for all y ∈ Y and all ŷ_1, ŷ_2 ∈ Y with ŷ_1 ≠ ŷ_2,

    |L(ŷ_1, y) − L(ŷ_2, y)| / |ŷ_1 − ŷ_2| ≤ B.

Then conditioned on the unlabeled data, for any δ ∈ (0, 1), with probability at least 1 − δ over the sample of labeled points (X_1, Y_1), ..., (X_ℓ, Y_ℓ) drawn i.i.d. from D, we have for any predictor φ ∈ J that

    E_D L(φ(X), Y) ≤ (1/ℓ) Σ_{i=1}^{ℓ} L(φ(X_i), Y_i) + 2B R̂_ℓ(J) + (1/√ℓ) ( 2 + 3 √( ln(2/δ)/2 ) ).

Note that for Y = [−1, 1], the conditions of this theorem are satisfied by the loss function L(ŷ, y) = (τ(ŷ) − y)²/4, where τ(y) = min(1, max(−1, y)).
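As a quick numerical check of this remark (an illustration, not part of the original analysis), the clipped squared loss maps into [0, 1] and satisfies the uniform Lipschitz condition with B = 1 when Y = [−1, 1]:

    import numpy as np

    def clipped_sq_loss(y_hat, y):
        """L(y_hat, y) = (tau(y_hat) - y)^2 / 4 with tau clipping to [-1, 1]."""
        tau = np.clip(y_hat, -1.0, 1.0)
        return (tau - y) ** 2 / 4.0

    # Check the range and the Lipschitz constant B = 1 over a grid of arguments.
    y_hat = np.linspace(-3, 3, 241)
    for y in np.linspace(-1, 1, 21):
        vals = clipped_sq_loss(y_hat, y)
        assert vals.min() >= 0 and vals.max() <= 1
        ratios = np.abs(np.diff(vals)) / np.diff(y_hat)   # |L(a,y)-L(b,y)| / |a-b|
        assert ratios.max() <= 1 + 1e-9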
The following theorem is the main result of the paper. Recall that A and D are the unlabeled kernel submatrices, B and E are the labeled kernel submatrices, and C and F involve the cross-terms.

Theorem 3. For the CoRLS function class J,

    (1/2^{1/4}) (U/ℓ) ≤ R̂_ℓ(J) ≤ U/ℓ,

where

    U² = γ_F^{-1} tr(B) + γ_G^{-1} tr(E) − λ tr[ J'(I + λM)^{-1} J ],

with I the identity matrix, and

    J = γ_F^{-1} C − γ_G^{-1} F    and    M = γ_F^{-1} A + γ_G^{-1} D.
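The theorem translates directly into a few lines of linear algebra. The sketch below is an illustration under the assumption that the kernel blocks A, B, C, D, E, F are supplied with the orientation defined in Section 2.1; the function name is ours.

    import numpy as np

    def rademacher_bounds(A, B, C, D, E, F, gamma_F, gamma_G, lam):
        """Return (lower, upper) bounds on R-hat_l(J) from Theorem 3."""
        u = A.shape[0]
        l = B.shape[0]
        J = C / gamma_F - F / gamma_G          # u x l cross-term matrix
        M = A / gamma_F + D / gamma_G          # u x u unlabeled matrix
        reduction = lam * np.trace(J.T @ np.linalg.solve(np.eye(u) + lam * M, J))
        U_sq = np.trace(B) / gamma_F + np.trace(E) / gamma_G - reduction
        U = np.sqrt(U_sq)
        return U / (2 ** 0.25 * l), U / l

    # At lam = 0 the reduction vanishes and U^2 is just
    # tr(B)/gamma_F + tr(E)/gamma_G, the two-view analogue of the
    # trace-of-kernel bound for ordinary kernel RLS.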
4 Discussion

4.1 Unlabeled Data Improves the Bound

The regularization parameter λ controls the amount that the unlabeled data constrains the hypothesis space. It's obvious from the definition of the hypothesis class J that if λ_1 ≥ λ_2 ≥ 0, then J_{λ_1} ⊆ J_{λ_2}, and thus R̂_ℓ(J_{λ_1}) ≤ R̂_ℓ(J_{λ_2}). That is, increasing λ reduces the Rademacher complexity R̂_ℓ(J_λ). The amount of this reduction is characterized by the expression

    ∆(λ) := λ tr[ J'(I + λM)^{-1} J ]

from Theorem 3. When λ = 0, the algorithm ignores the unlabeled data, and the reduction is indeed ∆(0) = 0. As we would expect, ∆(λ) is nondecreasing in λ and has a finite limit as λ → ∞. We collect these properties in a proposition:

Proposition 1. ∆(0) = 0, ∆(λ) is nondecreasing on λ ≥ 0, and lim_{λ→∞} ∆(λ) = tr(J'M^{-1}J), provided the inverse exists.

Proof. The limit claim is clear if we write the reduction as ∆(λ) = tr[ J'(λ^{-1}I + M)^{-1} J ]. Since A and D are Gram matrices, their positive combination M is positive semidefinite (psd). Thus we can write M = Q'ΛQ, with Λ diagonal and nonnegative and Q orthogonal. Then

    ∆(λ) = tr[ J'Q'(λ^{-1}I + Λ)^{-1} QJ ] = Σ_{i=1}^{ℓ} Σ_{j=1}^{u} (QJ)²_{ji} / (λ^{-1} + Λ_{jj}).

From this expression, it's clear that ∆(λ) is nondecreasing in λ on (0, ∞). Since ∆(λ) is continuous at λ = 0, it's nondecreasing on [0, ∞).
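These properties are easy to confirm numerically; the snippet below is a sanity check on synthetic matrices (not part of the proof), evaluating ∆(λ) on a grid and comparing it to the λ → ∞ limit.

    import numpy as np

    rng = np.random.default_rng(0)
    u, l = 6, 4
    J = rng.normal(size=(u, l))               # stand-in for gamma_F^-1 C - gamma_G^-1 F
    M_root = rng.normal(size=(u, u))
    M = M_root @ M_root.T + 0.1 * np.eye(u)   # psd (here invertible, so the limit exists)

    def delta(lam):
        """Reduction in squared complexity: lam * tr(J' (I + lam M)^{-1} J)."""
        return lam * np.trace(J.T @ np.linalg.solve(np.eye(u) + lam * M, J))

    grid = np.logspace(-3, 6, 60)
    vals = np.array([delta(lam) for lam in grid])
    assert np.all(np.diff(vals) >= -1e-8)          # nondecreasing, with Delta(0) = 0
    limit = np.trace(J.T @ np.linalg.solve(M, J))
    assert abs(vals[-1] - limit) <= 1e-2 * limit   # approaches tr(J' M^{-1} J)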
4.2 Interpretation of Improvement

Here we take some steps towards interpreting the reduction in complexity ∆(λ). For simplicity, take γ_F = γ_G = 1. Then the reduction is given by

    ∆(λ) = λ tr[ (C − F)'(I + λM)^{-1}(C − F) ].

Note that the j-th column of the matrix C gives a representation of the j-th labeled point by its F-inner product with each of the unlabeled points. That is, for j = 1, ..., ℓ and for i = 1, ..., u, we have C_{ij} = ⟨k_F(x_{ℓ+i}, ·), k_F(x_j, ·)⟩. Similarly, F represents each labeled point by its G-inner product with each of the unlabeled points. Since (I + λM)^{-1} is psd, it defines a semi-norm. Thus we can write the reduction in complexity as

    ∆(λ) = λ Σ_{i=1}^{ℓ} ||C_{·i} − F_{·i}||²_{(I+λM)^{-1}}
         = Σ_{i=1}^{ℓ} ||C_{·i} − F_{·i}||²_{(I/λ+M)^{-1}}    (for λ > 0).

We see that ∆(λ) grows with the distance between the two different representations of the labeled points. For very small λ, this distance is essentially measured using the Euclidean norm. As λ grows, the distance approaches that determined by M^{-1}, where M is the sum of the two unlabeled data kernel matrices. Loosely summarized, the reduction is proportional to the difference between the representations of the labeled data in the two different views, where the measure of difference is determined by the unlabeled data.
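The two expressions for ∆(λ) above are the same quantity written as a trace and as a sum of per-point distances; the short check below (synthetic matrices, γ_F = γ_G = 1, an illustration only) confirms that they agree.

    import numpy as np

    rng = np.random.default_rng(1)
    u, l, lam = 5, 3, 0.7
    C = rng.normal(size=(u, l))
    F = rng.normal(size=(u, l))
    M_root = rng.normal(size=(u, u))
    M = M_root @ M_root.T                          # psd, playing the role of A + D

    W_metric = np.linalg.inv(np.eye(u) + lam * M)  # defines the semi-norm
    trace_form = lam * np.trace((C - F).T @ W_metric @ (C - F))
    column_form = lam * sum((C[:, i] - F[:, i]) @ W_metric @ (C[:, i] - F[:, i])
                            for i in range(l))
    assert np.isclose(trace_form, column_form)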
5 Proofs

5.1 Proof of Theorem 2

Define the loss class

    Q = { (x, y) ↦ L(φ(x), y) : φ ∈ J }.

By assumption, any function in Q maps into [0, 1]. Applying Theorem 1, we have for any φ ∈ J, with probability at least 1 − δ over the labeled sample (X_i, Y_i)_{i=1}^{ℓ}, that

    E_D L(φ(X), Y) ≤ (1/ℓ) Σ_{i=1}^{ℓ} L(φ(X_i), Y_i) + R̂_ℓ(Q) + 3 √( ln(2/δ)/(2ℓ) ).

The following lemma completes the proof:

Lemma 1. R̂_ℓ(Q) ≤ 2B R̂_ℓ(J) + 2/√ℓ.

Proof. Define the functions g_y = L(0, y) and h_y(ŷ) = L(ŷ, y) − L(0, y). Then L(φ(x), y) = g_y + h_y(φ(x)), and Q = g_y + h_y ∘ J. Since |g_y| ≤ 1 for all y, we have

    R̂_ℓ(Q) ≤ R̂_ℓ(h_y ∘ J) + 2/√ℓ,

by a property of Rademacher complexity (see e.g. [11, Thm 4.15(v), p. 101]). For all y, h_y(·) is Lipschitz with constant B, and h_y(0) = 0. Thus by the Ledoux-Talagrand contraction inequality [9, Thm 4.12, p. 112] we have

    R̂_ℓ(h_y ∘ J) ≤ 2B R̂_ℓ(J).

The bound in Lemma 1 is sometimes of the right order of magnitude and sometimes quite loose. Consider the space X × Y = R × {−1, 0, 1} and the loss L(ŷ, y) = (τ(ŷ) − y)²/4, where τ(y) = min(1, max(−1, y)). Assume P(y = 0) = 1. Then for any class J of predictors that map into {−1, 1}, with probability 1 we have L(φ(x), y) = 1/4, and thus R̂(Q) = (1/(2ℓ)) E_σ |Σ_{i=1}^{ℓ} σ_i|. By the Kahane-Khintchine inequality (c.f. Section 5.2.3), we conclude R̂(Q) = Θ(ℓ^{-1/2}). If we choose J small, say J = {x ↦ 1}, then R̂_ℓ(J) = (2/ℓ) E_σ |Σ_{i=1}^{ℓ} σ_i| = Θ(ℓ^{-1/2}), and the bound is tight. If we choose J as large as possible, and we assume that x has a continuous distribution, then R̂_ℓ(J) = 2 almost surely, and the bound is loose.

5.2 Proof of Theorem 3

We prove this theorem in several steps, starting from the definition

    R̂_ℓ(J) = E_σ [ sup_{(f,g) ∈ H} (1/ℓ) | Σ_{i=1}^{ℓ} σ_i (f(x_i) + g(x_i)) | ],

where as usual the expectation is with respect to σ. We first convert from a supremum over the function space H to a supremum over a finite-dimensional Euclidean space that we can solve directly. Next, we use the Kahane-Khintchine inequality to bound the expectation over σ above and below by expectations that we can compute explicitly. Finally, with some matrix algebra we can write R̂_ℓ(J) in terms of blocks of the original kernel matrices.
5.2.1 Converting to Euclidean Space

Since (f, g) ∈ H implies (−f, −g) ∈ H, we can drop the absolute value. Next, note that the expression inside the supremum depends only on the values of f and g at the sample points. By the reproducing kernel property, it's easy to show that f(x_j) = (Proj_{L_F} f)(x_j) and g(x_j) = (Proj_{L_G} g)(x_j) for any sample point x_j. Thus the supremum in the expression for R̂_ℓ(J) is unchanged if we restrict the supremum to (f, g) ∈ (L_F × L_G) ∩ H. Applying these observations we get

    R̂_ℓ(J) = (1/ℓ) E_σ sup { Σ_{i=1}^{ℓ} σ_i (f(x_i) + g(x_i)) : (f, g) ∈ (L_F × L_G) ∩ H }.

Finally, we can write the set (L_F × L_G) ∩ H as

    { (f_α, g_β) : γ_F α'K_F α + γ_G β'K_G β + λ Σ_{i=ℓ+1}^{ℓ+u} [f_α(x_i) − g_β(x_i)]² ≤ 1 }
        = { (f_α, g_β) : (α' β') N (α' β')' ≤ 1 },

where

    N := [ γ_F K_F      0      ]  +  λ K_C ,        K_C := ( A   C   −D   −F )' ( A   C   −D   −F ).
         [    0      γ_G K_G   ]

Now we can write

    R̂_ℓ(J) = (1/ℓ) E_σ sup_{α,β} { σ'(C'  B)α + σ'(F'  E)β : (α' β') N (α' β')' ≤ 1 }.

5.2.2 Evaluating the Supremum

For a symmetric positive definite (spd) matrix M, it's easy to show that sup_{α : α'Mα ≤ 1} v'α = ||M^{-1/2}v||. However, our matrix N may not have full rank. Note that each entry of the column vector (C'  B)α is an inner product between α and a row (or column, by symmetry) of K_F. Thus if α_|| = Proj_{ColSpace(K_F)} α, then (C'  B)α = (C'  B)α_||. Similar reasoning shows that (F'  E)β = (F'  E)β_||, for β_|| = Proj_{ColSpace(K_G)} β, and that the quadratic form (α' β') N (α' β')' is unchanged when we replace (α, β) by (α_||, β_||). Thus the supremum can be rewritten as

    sup { σ'(C'  B)α_|| + σ'(F'  E)β_|| : α_|| ∈ ColSpace(K_F), β_|| ∈ ColSpace(K_G), (α_||' β_||') N (α_||' β_||')' ≤ 1 }.

Changing to eigenbases cleans up this expression and clears the way for substantial simplifications in later sections. Diagonalize the psd kernel matrices to get orthonormal bases for the column spaces of K_F and K_G:

    V'K_F V = Σ_F    and    W'K_G W = Σ_G,

where Σ_F and Σ_G are diagonal matrices containing the nonzero eigenvalues, and the columns of V and W are bases for the column spaces of K_F and K_G, respectively. Now introduce a and b such that α_|| = Va and β_|| = Wb. Applying this change of variables to the quadratic form, we get

    (α_||' β_||') N (α_||' β_||')' = (a' b') T (a' b')',

where T = Σ + λRR', with

    Σ := [ γ_F Σ_F      0      ]          R := [ V'   0  ] ( A   C   −D   −F )'.
         [    0      γ_G Σ_G   ],              [ 0    W' ]

The matrix T is spd, since it's the sum of the spd diagonal matrix Σ and the psd matrix λRR'. For compactness, define

    𝒲 := ( C'  B   F'  E ) [ V   0 ]
                           [ 0   W ].

We can now write

    R̂_ℓ(J) = (1/ℓ) E_σ sup_{a,b} { |σ' 𝒲 (a' b')'| : (a' b') T (a' b')' ≤ 1 }.

Since T is spd, we can evaluate the supremum as described above to get

    R̂_ℓ(J) = (1/ℓ) E_σ || T^{-1/2} 𝒲' σ ||.
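The supremum-evaluation step used above is the standard fact that a linear functional over an ellipsoid is maximized in closed form; the snippet below is a numerical sanity check (not part of the proof) comparing the closed form ||M^{-1/2}v|| with a brute-force search over the boundary of the ellipsoid.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    M_root = rng.normal(size=(n, n))
    M = M_root @ M_root.T + 0.1 * np.eye(n)           # spd
    v = rng.normal(size=n)

    closed_form = np.sqrt(v @ np.linalg.solve(M, v))  # ||M^{-1/2} v||
    # Brute force: random directions rescaled onto the boundary {alpha' M alpha = 1}.
    dirs = rng.normal(size=(20000, n))
    alphas = dirs / np.sqrt(np.einsum('ij,jk,ik->i', dirs, M, dirs))[:, None]
    assert (alphas @ v).max() <= closed_form + 1e-9
    # The maximizer alpha* = M^{-1} v / ||M^{-1/2} v|| lies on the boundary and attains it.
    alpha_star = np.linalg.solve(M, v) / closed_form
    assert np.isclose(alpha_star @ M @ alpha_star, 1.0)
    assert np.isclose(alpha_star @ v, closed_form)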
5.2.3 Bounding R̂_ℓ(J) Above and Below

We make use of the following lemma³:

Lemma 2 (Kahane-Khintchine inequality). For any vectors a_1, ..., a_n in a Hilbert space and independent Rademacher random variables σ_1, ..., σ_n, we have

    (1/√2) E || Σ_{i=1}^{n} σ_i a_i ||²  ≤  ( E || Σ_{i=1}^{n} σ_i a_i || )²  ≤  E || Σ_{i=1}^{n} σ_i a_i ||².

Taking the columns of T^{-1/2}𝒲' to be the a_i's, we can apply this lemma to our expression for R̂_ℓ(J) to get

    (1/2^{1/4}) (U/ℓ) ≤ R̂_ℓ(J) ≤ U/ℓ,

where

    U² := E_σ || T^{-1/2} 𝒲' σ ||²  =  E_σ tr( 𝒲 T^{-1} 𝒲' σσ' )  =  tr( 𝒲 T^{-1} 𝒲' ).

To get the second expression we expanded the squared norm, took the trace of the scalar quantity inside the expectation, and rotated the factors inside the trace. To get the last equality we interchanged the trace and the expectation and noted that E_σ σσ' is the identity matrix.

³See [8] for a proof of the lower bound. The upper bound is Jensen's inequality.
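Lemma 2 can be checked exactly for small n by enumerating all 2^n sign vectors; the snippet below is an independent sanity check (not a proof) for random vectors in R^d.

    import itertools
    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 8, 4
    a = rng.normal(size=(n, d))                   # a_1, ..., a_n in R^d

    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))  # all 2^n sigma
    sums = signs @ a                              # each row is sum_i sigma_i a_i
    norms = np.linalg.norm(sums, axis=1)
    first_moment_sq = norms.mean() ** 2           # (E ||sum sigma_i a_i||)^2
    second_moment = (norms ** 2).mean()           # E ||sum sigma_i a_i||^2

    assert second_moment / np.sqrt(2) <= first_moment_sq + 1e-9
    assert first_moment_sq <= second_moment + 1e-9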
5.2.4 Writing our Expression in terms of the Original Kernel Matrices

It will be helpful to divide V and W into labeled and unlabeled parts. We note the dimensions of V and W are (ℓ+u) × r_F and (ℓ+u) × r_G, where r_F and r_G are the ranks of K_F and K_G, respectively. So we have

    K_F = [ A   C ]  =  [ V_u ] Σ_F ( V_u'  V_ℓ' ),        (1)
          [ C'  B ]     [ V_ℓ ]

    K_G = [ D   F ]  =  [ W_u ] Σ_G ( W_u'  W_ℓ' ).        (2)
          [ F'  E ]     [ W_ℓ ]

Rearranging the diagonalization, we also have

    ( C'  B ) = V_ℓ Σ_F ( V_u'  V_ℓ' ),        (3)
    ( F'  E ) = W_ℓ Σ_G ( W_u'  W_ℓ' ).        (4)

By equating blocks in these four matrix equations, we attain all the substitutions we need to write U² in terms of the original kernel submatrices A, B, C, D, E, and F. For example, by equating the top left submatrices in Equation 1, we get A = V_u Σ_F V_u'. Using these substitutions, we can write

    𝒲' = [ V'   0  ] ( C'  B   F'  E )'  =  [ Σ_F V_ℓ' ]
         [ 0    W' ]                         [ Σ_G W_ℓ' ],

    R  = [ V'   0  ] ( A   C   −D   −F )'  =  [  Σ_F V_u' ]
         [ 0    W' ]                          [ −Σ_G W_u' ].

We now work on the T^{-1} factor in our expression U² = tr( 𝒲 T^{-1} 𝒲' ). Using the Sherman-Morrison-Woodbury formula⁴, we expand T^{-1} = (Σ + λRR')^{-1} as

    T^{-1} = Σ^{-1} − λ Σ^{-1} R ( I + λ R'Σ^{-1}R )^{-1} R'Σ^{-1}.

Since Σ and I + λR'Σ^{-1}R are spd, our inverses exist and the expansion is justified. Substituting this expansion into our expression for U², we get

    U² = tr( 𝒲 Σ^{-1} 𝒲' ) − λ tr[ 𝒲 Σ^{-1} R ( I + λ R'Σ^{-1}R )^{-1} R'Σ^{-1} 𝒲' ].

We'll have our final form once we can express 𝒲Σ^{-1}𝒲', R'Σ^{-1}R, and R'Σ^{-1}𝒲' in terms of the original kernel matrix blocks. We have

    𝒲 Σ^{-1} 𝒲' = ( V_ℓ Σ_F   W_ℓ Σ_G ) [ γ_F^{-1} Σ_F^{-1}        0         ] [ Σ_F V_ℓ' ]
                                         [       0          γ_G^{-1} Σ_G^{-1} ] [ Σ_G W_ℓ' ]
                 = γ_F^{-1} V_ℓ Σ_F V_ℓ' + γ_G^{-1} W_ℓ Σ_G W_ℓ'
                 = γ_F^{-1} B + γ_G^{-1} E.

The last equality follows by equating submatrices in Equations 1 and 2. Using very similar steps, but with different substitutions read from Equations 1 and 2, we also get

    R'Σ^{-1}R = γ_F^{-1} A + γ_G^{-1} D = M     and     R'Σ^{-1}𝒲' = γ_F^{-1} C − γ_G^{-1} F = J.

Putting things together, we get

    U² = tr( γ_F^{-1} B + γ_G^{-1} E ) − λ tr[ J'(I + λM)^{-1} J ].

⁴(A + UU')^{-1} = A^{-1} − A^{-1}U(I + U'A^{-1}U)^{-1}U'A^{-1}, provided the inverses exist [7, p. 50].
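The footnote's matrix identity is easy to confirm numerically; the following small check (random spd A and random U, an illustration only) verifies the Sherman-Morrison-Woodbury expansion used above.

    import numpy as np

    rng = np.random.default_rng(4)
    n, k = 6, 3
    A_root = rng.normal(size=(n, n))
    A = A_root @ A_root.T + np.eye(n)        # spd
    U = rng.normal(size=(n, k))

    lhs = np.linalg.inv(A + U @ U.T)
    A_inv = np.linalg.inv(A)
    rhs = A_inv - A_inv @ U @ np.linalg.inv(np.eye(k) + U.T @ A_inv @ U) @ U.T @ A_inv
    assert np.allclose(lhs, rhs)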
6 Experiments

The objective of our experiments was to investigate whether the reduction in hypothesis space complexity due to co-regularization correlates with an improvement in test performance. We closely followed the experimental setup used in [4] on the UCI repository data sets [2]. We selected those data sets with continuous target values, between 5 and 500 examples, and at least 5 features. For each of these 29 data sets, we generated two views by randomly splitting the features into two sets of equal size. To get our performance numbers, we averaged over 10 randomly chosen feature splits. To evaluate the performance of each split, we performed 10-fold 'inverse' cross validation, in which one fold is used as labeled data, and the other nine folds are used as unlabeled data.
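The skeleton below sketches this protocol (random equal-size feature splits and 10-fold 'inverse' cross validation). It is illustrative only: the data loading is omitted, the linear kernel per view, the transductive evaluation on the unlabeled folds, and the solver parameter (expected to have the signature of the corls_solve sketch from Section 2.1) are our assumptions rather than the exact pipeline used in the experiments.

    import numpy as np

    def linear_kernel(X1, X2):
        return X1 @ X2.T

    def inverse_cv_rmse(X, y, corls_solver, lam, n_splits=10, n_folds=10, seed=0):
        """Average RMSE on held-out folds over random feature splits and
        10-fold 'inverse' cross validation (one fold labeled, nine unlabeled)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        errs = []
        for _ in range(n_splits):
            feats = rng.permutation(d)
            view1, view2 = feats[: d // 2], feats[d // 2:]   # two random views
            folds = np.array_split(rng.permutation(n), n_folds)
            for fold in folds:
                lab = fold                                   # labeled fold
                unlab = np.setdiff1d(np.arange(n), lab)      # unlabeled folds
                order = np.concatenate([unlab, lab])         # unlabeled points first
                KF = linear_kernel(X[order][:, view1], X[order][:, view1])
                KG = linear_kernel(X[order][:, view2], X[order][:, view2])
                alpha, beta = corls_solver(KF, KG, y[lab], u=len(unlab), lam=lam)
                # Evaluate on the held-out (unlabeled) points; their labels are
                # withheld from the solver (an assumption about the protocol).
                pred = 0.5 * (KF[: len(unlab)] @ alpha + KG[: len(unlab)] @ beta)
                errs.append(np.sqrt(np.mean((pred - y[unlab]) ** 2)))
        return float(np.mean(errs))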
For each data set, we used the CoRLS algorithm with loss functional

    L̂(f, g) = (1/2ℓ) Σ_{i=1}^{ℓ} [ (f(x_i) − y_i)² + (g(x_i) − y_i)² ],

as in [12, 4]. In [4], CoRLS is compared to RLS. Here, we compare CoRLS with co-regularization parameter λ = 1/10 to the performance with λ = 0. In Figure 1, for each data set we plot the percent improvement in RMS error when going from λ = 0 to λ = 1/10 against the size of the decrease in the Rademacher complexity. The correlation between these two quantities is r = .67. The error bars extend two standard errors from the mean.

Figure 1: The percent improvement in RMS error of the CoRLS (λ = 1/10) algorithm over the 2-view RLS (λ = 0) algorithm vs. the decrease in Rademacher complexity.

7 Conclusions

We have given tight bounds for the Rademacher complexity of the co-regularized hypothesis space arising from two RKHS views, as well as a generalization bound for the CoRLS algorithm. While our theorems bound the gap between training and test performance, they say nothing about the absolute performance: if neither view has good predictors, then we'll have poor performance, regardless of the generalization bound. Nevertheless, experimentally we found a correlation between improved generalization bounds and improved test performance. This may suggest that for typical parameter settings, or at least for those used in [4], reduction in Rademacher complexity is a good predictor of improved performance. We leave this question for further study, as well as the question of whether our expression for Rademacher complexity can help guide the choice of views.

Acknowledgements

We gratefully acknowledge the support of the NSF under awards DMS-0434383 and DMS-01030526.

References

[1] Maria-Florina Balcan and Avrim Blum. A PAC-style model for learning from labeled and unlabeled data. In COLT, pages 111-126, June 2005.

[2] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[3] Avrim Blum and Tom M. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, pages 92-100, 1998.

[4] Ulf Brefeld, Thomas Gärtner, Tobias Scheffer, and Stefan Wrobel. Efficient co-regularised least squares regression. In ICML, pages 137-144, 2006.

[5] S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In NIPS, 2001.

[6] Jason D. R. Farquhar, David R. Hardoon, Hongying Meng, John Shawe-Taylor, and Sándor Szedmák. Two view learning: SVM-2K, theory and practice. In NIPS, 2005.

[7] Gene H. Golub and Charles F. Van Loan. Matrix Computations (Johns Hopkins Studies in Mathematical Sciences). The Johns Hopkins University Press, October 1996.

[8] Rafal Latala and Krzysztof Oleszkiewicz. On the best constant in the Khintchine-Kahane inequality. Studia Mathematica, 109(1):101-104, 1994.

[9] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[10] Tom Mitchell. Machine learning and extracting information from the web. NAACL Invited Talk, 2001.

[11] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, June 2004.

[12] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. A co-regularization approach to semi-supervised learning with multiple views. In ICML, pages 824-831, 2005.