Domain-Adversarial Neural Networks
Hana Ajakan¹, Pascal Germain¹, Hugo Larochelle², François Laviolette¹, Mario Marchand¹
¹ Département d'informatique et de génie logiciel, Université Laval, Québec, Canada
² Département d'informatique, Université de Sherbrooke, Québec, Canada
Groupe de recherche en apprentissage automatique de l'Université Laval (GRAAL)
December 13, 2014
Pascal Germain (GRAAL) · Domain-Adversarial Neural Networks · December 13, 2014 · slide 1 / 14
Outline
1. Domain Adaptation Setting
2. Theoretical Foundations
3. Neural Network for Domain Adaptation
4. Empirical Results
Our Domain Adaptation Setting
Binary classification tasks.
Input space: R^d. Labels: {0, 1}.
Two different data distributions: a source domain D_S and a target domain D_T.
A domain adaptation learning algorithm is provided with
a labeled source sample S = {(x_i^s, y_i^s)}_{i=1}^m ~ (D_S)^m,
and an unlabeled target sample T = {x_i^t}_{i=1}^m ~ (D_T)^m.
The goal is to build a classifier η : R^d → {0, 1} with a low target risk
    R_{D_T}(η) = Pr_{(x^t, y^t) ~ D_T} [ η(x^t) ≠ y^t ].
Divergence between source and target domains
Definition (Ben-David et al., 2006)
Given two domain distributions D_S and D_T, and a hypothesis class H, the H-divergence between D_S and D_T is
    d_H(D_S, D_T) = 2 sup_{η ∈ H} | Pr_{x^s ~ D_S}[η(x^s) = 1] − Pr_{x^t ~ D_T}[η(x^t) = 1] |
                  = 2 sup_{η ∈ H} ( Pr_{x^s ~ D_S}[η(x^s) = 1] + Pr_{x^t ~ D_T}[η(x^t) = 0] − 1 ).
The H-divergence measures the ability of a hypothesis class H to discriminate between the source distribution D_S and the target distribution D_T.
Bound on the target risk
Theorem (Ben-David et al., 2006)
Let H be a hypothesis class of VC dimension d. With probability 1 − δ over the choice of samples S ~ (D_S)^m and T ~ (D_T)^m, for every η ∈ H:
    R_{D_T}(η) ≤ R_S(η) + √( (4/m)( d log(2em/d) + log(4/δ) ) ) + d̂_H(S, T) + 4 √( ( d log(2m) + log(4/δ) ) / m ) + β,
with β ≥ inf_{η* ∈ H} [ R_{D_S}(η*) + R_{D_T}(η*) ].

Empirical risk on the source sample:
    R_S(η) = (1/m) Σ_{i=1}^m I[ η(x_i^s) ≠ y_i^s ].
Empirical H-divergence:
    d̂_H(S, T) = 2 max_{η ∈ H} [ (1/m) Σ_{i=1}^m I[η(x_i^s) = 1] + (1/m) Σ_{i=1}^m I[η(x_i^t) = 0] − 1 ].
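As an illustration of the empirical H-divergence formula, the sketch below (not from the slides) takes H to be one-dimensional threshold classifiers η_t(x) = I[x ≥ t] together with their complements, and evaluates the max by brute force over a threshold grid. The Gaussian samples and the grid are arbitrary choices.

```python
import numpy as np

def empirical_h_divergence(S, T, thresholds):
    # d_H(S, T) = 2 max_eta [ (1/m) sum I[eta(x_s)=1]
    #                        + (1/m) sum I[eta(x_t)=0] - 1 ]
    # with eta(x) = I[x >= t] for each threshold t; -val corresponds
    # to the complement classifier I[x < t].
    best = -np.inf
    for t in thresholds:
        val = np.mean(S >= t) + np.mean(T < t) - 1.0
        best = max(best, val, -val)
    return 2.0 * best

rng = np.random.default_rng(0)
S = rng.normal(0.0, 1.0, size=1000)   # source sample (1-D)
T = rng.normal(3.0, 1.0, size=1000)   # target sample, shifted mean
ts = np.linspace(-5.0, 8.0, 200)
print(empirical_h_divergence(S, T, ts))  # large: domains easy to tell apart
print(empirical_h_divergence(S, S, ts))  # exactly 0 for identical samples
```

For identical samples every threshold gives P(≥t) + P(<t) − 1 = 0, so the divergence is exactly zero; for well-separated samples it approaches its maximum value of 2.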
Bound on the target risk
Reading the theorem of Ben-David et al. (2006): given S and T, the target risk R_{D_T}(η) is low if
– R_S(η) is small, i.e., η ∈ H classifies the source sample S well,
– and d̂_H(S, T) is small, i.e., every η′ ∈ H is bad at discriminating between S and T.
Standard Neural Network
Consider a neural network architecture with one hidden layer:
    h(x) = sigm(b + Wx)   and   f(h(x)) = softmax(c + Vh(x)).
Given a source sample S = {(x_i^s, y_i^s)}_{i=1}^m ~ (D_S)^m, learn
    min_{W,V,b,c} (1/m) Σ_{i=1}^m −log|1 − y_i^s − f(h(x_i^s))|   (source loss),
where f(h(x)) is read as the predicted probability of class 1.
[Diagram: inputs x_1, …, x_d; hidden units h(x)_1, …, h(x)_n via weights W; output f(h(x)) via weights V.]
Stochastic training:
1. Pick an x^s ∈ S.
2. Update V towards f(h(x^s)) = y^s.
3. Update W towards f(h(x^s)) = y^s.
The hidden layer learns a representation h(·) from which the linear hypothesis f(·) can classify source examples.
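As a sanity check, the two-layer architecture above can be written in a few lines of NumPy. The layer sizes and random weights here are illustrative, not from the slides.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

# Illustrative sizes: d inputs, n hidden units, 2 output classes.
d, n = 2, 5
rng = np.random.default_rng(0)
W, b = rng.normal(size=(n, d)), np.zeros(n)   # hidden layer parameters
V, c = rng.normal(size=(2, n)), np.zeros(2)   # output layer parameters

def h(x):
    return sigm(b + W @ x)        # h(x) = sigm(b + Wx)

def f(hx):
    return softmax(c + V @ hx)    # f(h(x)) = softmax(c + V h(x))

x = rng.normal(size=d)
p = f(h(x))                       # two class probabilities, summing to 1
print(p)
```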
Domain-Adversarial Neural Network (DANN)
Empirical H-divergence:
    d̂_H(S, T) = 2 max_{η ∈ H} [ (1/m) Σ_{i=1}^m I[η(x_i^s) = 1] + (1/m) Σ_{i=1}^m I[η(x_i^t) = 0] − 1 ].
We estimate the H-divergence with a logistic regressor that models the probability that a given input (either x^s or x^t) comes from the source domain:
    o(h(x)) = sigm(d + w⊤h(x)).
Given the representation output by the hidden layer h(·):
    d̂_H( h(S), h(T) ) ≈ 2 max_{w,d} [ (1/m) Σ_{i=1}^m log o(h(x_i^s)) + (1/m) Σ_{i=1}^m log(1 − o(h(x_i^t))) − 1 ].
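The surrogate above can be maximized by plain gradient ascent on (w, d). The sketch below assumes the hidden representations are already computed (random Gaussians stand in for h(S) and h(T)); the step size and iteration count are arbitrary illustrative choices, not the authors' procedure.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def domain_surrogate(HS, HT, steps=500, lr=0.5):
    """Gradient ascent on the log-likelihood surrogate of the empirical
    H-divergence, with o(h) = sigm(d + w.h)."""
    w, d = np.zeros(HS.shape[1]), 0.0
    for _ in range(steps):
        ps, pt = sigm(d + HS @ w), sigm(d + HT @ w)
        # Gradients of mean log o(h_s) + mean log(1 - o(h_t)):
        gw = HS.T @ (1 - ps) / len(HS) - HT.T @ pt / len(HT)
        gd = np.mean(1 - ps) - np.mean(pt)
        w, d = w + lr * gw, d + lr * gd
    ps, pt = sigm(d + HS @ w), sigm(d + HT @ w)
    return 2.0 * (np.mean(np.log(ps)) + np.mean(np.log(1 - pt)) - 1.0)

rng = np.random.default_rng(0)
HS = rng.normal(0.0, 1.0, (500, 3))   # stand-in for h(S)
HT = rng.normal(2.0, 1.0, (500, 3))   # stand-in for h(T), shifted
print(domain_surrogate(HS, HT))       # larger: domains are distinguishable
print(domain_surrogate(HS, HS))       # smaller: identical domains
```

Since the logs are nonpositive, this surrogate is at most 2·(0 − 1) = −2; what matters is its relative value, with larger numbers indicating more easily distinguishable domains.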
Domain-Adversarial Neural Network (DANN)
    min_{W,V,b,c} [ (1/m) Σ_{i=1}^m −log|1 − y_i^s − f(h(x_i^s))|   (source loss)
        + λ max_{w,d} ( (1/m) Σ_{i=1}^m log o(h(x_i^s)) + (1/m) Σ_{i=1}^m log(1 − o(h(x_i^t))) ) ]   (adaptation regularizer),
where λ > 0 weights the domain adaptation regularization term.
Given a source sample S = {(x_i^s, y_i^s)}_{i=1}^m ~ (D_S)^m and a target sample T = {x_i^t}_{i=1}^m ~ (D_T)^m:
[Diagram: shared hidden layer h(x) (weights W) feeding both the label predictor f(h(x)) (weights V) and the domain regressor o(h(x)) (weights w).]
1. Pick an x^s ∈ S and an x^t ∈ T.
2. Update V towards f(h(x^s)) = y^s.
3. Update W towards f(h(x^s)) = y^s.
4. Update w towards o(h(x^s)) = 1 and o(h(x^t)) = 0.
5. Update W towards o(h(x^s)) = 0 and o(h(x^t)) = 1.
DANN finds a representation h(·) that is good on S but unable to discriminate between S and T.
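The five steps above can be sketched end to end in NumPy. This is a minimal variant with a single sigmoid output in place of the two-class softmax; the layer sizes, learning rate, λ, the toy blobs, and the epoch count are all illustrative choices, not the paper's implementation.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid, lr, lam = 2, 8, 0.1, 0.1                      # illustrative constants
W, b = rng.normal(0, 0.5, (n_hid, n_in)), np.zeros(n_hid)  # shared representation
v, c = rng.normal(0, 0.5, n_hid), 0.0                      # label predictor
w, d = np.zeros(n_hid), 0.0                                # domain regressor

def dann_step(xs, ys, xt):
    global W, b, v, c, w, d
    hs, ht = sigm(b + W @ xs), sigm(b + W @ xt)
    # Steps 2-3: update v (and W) to classify the source example.
    fs = sigm(c + v @ hs)
    e = fs - ys                          # gradient of -log|1 - ys - fs| wrt pre-activation
    gh = e * v * hs * (1 - hs)
    dv, dc = -lr * e * hs, -lr * e
    dW_lab, db_lab = -lr * np.outer(gh, xs), -lr * gh
    # Step 4: update w, d towards o(h(xs)) = 1 and o(h(xt)) = 0 (ascent).
    os_, ot = sigm(d + w @ hs), sigm(d + w @ ht)
    dw = lr * lam * ((1 - os_) * hs - ot * ht)
    dd = lr * lam * ((1 - os_) - ot)
    # Step 5: push W, b in the opposite direction (adversarial sign flip).
    gs = (1 - os_) * w * hs * (1 - hs)
    gt = -ot * w * ht * (1 - ht)
    dW_adv = -lr * lam * (np.outer(gs, xs) + np.outer(gt, xt))
    db_adv = -lr * lam * (gs + gt)
    W += dW_lab + dW_adv
    b += db_lab + db_adv
    v, c, w, d = v + dv, c + dc, w + dw, d + dd

# Toy data: two labeled source blobs; the target is the same data shifted.
Xs = np.vstack([rng.normal(-1, 0.3, (100, 2)), rng.normal(1, 0.3, (100, 2))])
Ys = np.array([0] * 100 + [1] * 100)
Xt = Xs + np.array([2.0, 0.0])
for _ in range(30):                      # Step 1: pick an (xs, ys) and an xt
    for i in rng.permutation(200):
        dann_step(Xs[i], Ys[i], Xt[rng.integers(200)])

def predict(X):
    return (sigm(c + sigm(b + X @ W.T) @ v) > 0.5).astype(int)
```

On this toy problem a domain-invariant discriminative direction exists (the second coordinate), so the adversarial updates are intended to steer h(·) toward it while steps 2-3 keep source classification accurate.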
Toy Dataset
[Figure: Standard Neural Network (NN), with one panel trained to classify source examples and one trained to classify domains; Domain-Adversarial Neural Network (DANN), showing its classification output f(h(x)) and its domain output o(h(x)).]
Amazon Reviews
Input: product review (bag of words). Output: positive or negative rating.

Dataset (source → target)   DANN    NN
books → dvd                 0.201   0.199
books → electronics         0.246   0.251
books → kitchen             0.230   0.235
dvd → books                 0.247   0.261
dvd → electronics           0.247   0.256
dvd → kitchen               0.227   0.227
electronics → books         0.280   0.281
electronics → dvd           0.273   0.277
electronics → kitchen       0.148   0.149
kitchen → books             0.283   0.288
kitchen → dvd               0.261   0.261
kitchen → electronics       0.161   0.161

Note: We use a small labeled subset of 100 target examples to select the hyperparameters.
Marginalized Stacked Denoising Autoencoders (mSDA)
Question: Can DANN be combined with other representation learning techniques for domain adaptation?
The mSDA autoencoders (Chen et al., 2012) provide a new common representation for source and target (unsupervised).
With mSDA+SVM, Chen et al. (2012) obtained state-of-the-art results on Amazon Reviews:
– Train a linear SVM on mSDA source representations.
We try mSDA+DANN:
– Train DANN on mSDA source representations and target representations.
Amazon Reviews
Input: product review (bag of words). Output: positive or negative rating.

Dataset (source → target)   mSDA+DANN   mSDA+SVM
books → dvd                 0.176       0.175
books → electronics         0.197       0.244
books → kitchen             0.169       0.172
dvd → books                 0.176       0.176
dvd → electronics           0.181       0.220
dvd → kitchen               0.151       0.178
electronics → books         0.237       0.229
electronics → dvd           0.216       0.261
electronics → kitchen       0.118       0.137
kitchen → books             0.222       0.234
kitchen → dvd               0.208       0.209
kitchen → electronics       0.141       0.138

Note: We use a small labeled subset of 100 target examples to select the hyperparameters. The noise parameter of the mSDA representations is fixed to 50%.
Future Work
Several paths to explore:
– Deeper neural network architectures.
– Multiclass / multilabel problems.
– Multi-source domain adaptation.

Thank you!