Domain-Adversarial Neural Networks

Hana Ajakan¹, Pascal Germain¹, Hugo Larochelle², François Laviolette¹, Mario Marchand¹

¹ Département d'informatique et de génie logiciel, Université Laval, Québec, Canada
² Département d'informatique, Université de Sherbrooke, Québec, Canada

Groupe de recherche en apprentissage automatique de l'Université Laval (GRAAL)

December 13, 2014


Outline

1. Domain Adaptation Setting
2. Theoretical Foundations
3. Neural Network for Domain Adaptation
4. Empirical Results


Our Domain Adaptation Setting

Binary classification tasks:
- Input space: $\mathbb{R}^d$
- Labels: $\{0, 1\}$

Two different data distributions:
- Source domain: $\mathcal{D}_S$
- Target domain: $\mathcal{D}_T$

A domain adaptation learning algorithm is provided with a labeled source sample
$$S = \{(x_i^s, y_i^s)\}_{i=1}^m \sim (\mathcal{D}_S)^m,$$
and an unlabeled target sample
$$T = \{x_i^t\}_{i=1}^m \sim (\mathcal{D}_T)^m.$$

[Figure: scatter plots of a labeled source sample and an unlabeled target sample.]

The goal is to build a classifier $\eta : \mathbb{R}^d \to \{0, 1\}$ with a low target risk

$$R_{\mathcal{D}_T}(\eta) \overset{\text{def}}{=} \Pr_{(x^t, y^t) \sim \mathcal{D}_T}\big[\eta(x^t) \neq y^t\big].$$
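To make the setting concrete, here is a minimal sketch (not from the slides; the Gaussian blobs and shift are purely illustrative) of the two samples a domain adaptation learner receives:

```python
# Illustrative data for the domain adaptation setting: a labeled source
# sample S and an unlabeled target sample T from a shifted distribution.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 2

# Source domain D_S: two labeled Gaussian blobs.
Xs = np.vstack([rng.normal(-1.0, 0.5, (m // 2, d)),
                rng.normal(+1.0, 0.5, (m // 2, d))])
ys = np.repeat([0, 1], m // 2)

# Target domain D_T: same task, but the input distribution is shifted.
# Target labels exist but are NOT given to the learner.
Xt = np.vstack([rng.normal(-1.0, 0.5, (m // 2, d)),
                rng.normal(+1.0, 0.5, (m // 2, d))]) + np.array([1.5, 0.5])
```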

Divergence between source and target domains

Definition (Ben-David et al., 2006). Given two domain distributions $\mathcal{D}_S$ and $\mathcal{D}_T$, and a hypothesis class $\mathcal{H}$, the $\mathcal{H}$-divergence between $\mathcal{D}_S$ and $\mathcal{D}_T$ is

$$d_{\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) \overset{\text{def}}{=} 2 \sup_{\eta \in \mathcal{H}} \Big| \Pr_{x^s \sim \mathcal{D}_S}\big[\eta(x^s) = 1\big] - \Pr_{x^t \sim \mathcal{D}_T}\big[\eta(x^t) = 1\big] \Big|$$
$$= 2 \sup_{\eta \in \mathcal{H}} \Big( \Pr_{x^s \sim \mathcal{D}_S}\big[\eta(x^s) = 1\big] + \Pr_{x^t \sim \mathcal{D}_T}\big[\eta(x^t) = 0\big] - 1 \Big).$$

The second form follows from the complement rule: since $\Pr_{x^t \sim \mathcal{D}_T}[\eta(x^t) = 0] = 1 - \Pr_{x^t \sim \mathcal{D}_T}[\eta(x^t) = 1]$, the quantity inside the second supremum equals the signed difference inside the first.

The $\mathcal{H}$-divergence measures the ability of a hypothesis class $\mathcal{H}$ to discriminate between the source distribution $\mathcal{D}_S$ and the target distribution $\mathcal{D}_T$.

Bound on the target risk

Theorem (Ben-David et al., 2006). Let $\mathcal{H}$ be a hypothesis class of VC dimension $d$. With probability $1 - \delta$ over the choice of samples $S \sim (\mathcal{D}_S)^m$ and $T \sim (\mathcal{D}_T)^m$, for every $\eta \in \mathcal{H}$:

$$R_{\mathcal{D}_T}(\eta) \le R_S(\eta) + \sqrt{\frac{4}{m}\Big(d \log\frac{2em}{d} + \log\frac{4}{\delta}\Big)} + \hat{d}_{\mathcal{H}}(S, T) + 4\sqrt{\frac{d \log(2m) + \log\frac{4}{\delta}}{m}} + \beta,$$

with $\beta \ge \inf_{\eta^* \in \mathcal{H}} \big[ R_{\mathcal{D}_S}(\eta^*) + R_{\mathcal{D}_T}(\eta^*) \big]$.

Empirical risk on the source sample:

$$R_S(\eta) \overset{\text{def}}{=} \frac{1}{m} \sum_{i=1}^m I\big[\eta(x_i^s) \neq y_i^s\big].$$

Empirical $\mathcal{H}$-divergence:

$$\hat{d}_{\mathcal{H}}(S, T) \overset{\text{def}}{=} 2 \max_{\eta \in \mathcal{H}} \Bigg[ \frac{1}{m} \sum_{i=1}^m I\big[\eta(x_i^s) = 1\big] + \frac{1}{m} \sum_{i=1}^m I\big[\eta(x_i^t) = 0\big] - 1 \Bigg].$$

Bound on the target risk (continued)

Recall the bound: $R_{\mathcal{D}_T}(\eta) \le R_S(\eta) + (\text{complexity terms}) + \hat{d}_{\mathcal{H}}(S, T) + \beta$.

The target risk $R_{\mathcal{D}_T}(\eta)$ is therefore low if, given $S$ and $T$:
- $R_S(\eta)$ is small, i.e., $\eta \in \mathcal{H}$ classifies the source sample well;
- $\hat{d}_{\mathcal{H}}(S, T)$ is small, i.e., every $\eta' \in \mathcal{H}$ is bad at discriminating $S$ from $T$.


Standard Neural Network

Consider a neural network architecture with one hidden layer:

$$h(x) = \mathrm{sigm}(b + Wx), \qquad f(h(x)) = \mathrm{softmax}(c + Vh(x)).$$

Given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m \sim (\mathcal{D}_S)^m$, we minimize the source loss:

$$\min_{W,V,b,c} \Bigg[ \frac{1}{m} \sum_{i=1}^m -\log\big|1 - y_i^s - f(h(x_i^s))\big| \Bigg].$$

[Figure: network diagram with inputs $x_1, \ldots, x_d$, hidden units $h(x)_1, \ldots, h(x)_n$ (weights $W$), and output $f(h(x))$ (weights $V$).]

Stochastic training:
1. Pick an example $x^s \in S$
2. Update $V$ towards $f(h(x^s)) = y^s$
3. Update $W$ towards $f(h(x^s)) = y^s$

The hidden layer learns a representation $h(\cdot)$ from which the linear hypothesis $f(\cdot)$ can classify source examples.
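As a concrete illustration, here is a minimal numpy sketch of this network and one SGD step. It assumes a sigmoid output unit, which is equivalent to the slide's two-class softmax (the loss $-\log|1 - y - f(h(x))|$ is then the usual logistic loss), and an illustrative learning rate:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def nn_sgd_step(x, y, W, b, V, c, lr=0.05):
    """One SGD step on -log|1 - y - f(h(x))| for one source example (x, y)."""
    h = sigm(b + W @ x)          # hidden representation h(x), shape (n,)
    p = sigm(c + V @ h)          # f(h(x)) = P(y = 1 | x), scalar
    g = p - y                    # gradient of the logistic loss w.r.t. the output logit
    d_a = g * V * h * (1.0 - h)  # backprop to the hidden pre-activation b + Wx
    V -= lr * g * h              # step 2: update V towards f(h(x^s)) = y^s
    c -= lr * g
    W -= lr * np.outer(d_a, x)   # step 3: update W towards f(h(x^s)) = y^s
    b -= lr * d_a
    return W, b, V, c
```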

Domain-Adversarial Neural Network (DANN)

Recall the empirical $\mathcal{H}$-divergence:

$$\hat{d}_{\mathcal{H}}(S, T) \overset{\text{def}}{=} 2 \max_{\eta \in \mathcal{H}} \Bigg[ \frac{1}{m} \sum_{i=1}^m I\big[\eta(x_i^s) = 1\big] + \frac{1}{m} \sum_{i=1}^m I\big[\eta(x_i^t) = 0\big] - 1 \Bigg].$$

We estimate the $\mathcal{H}$-divergence with a logistic regressor that models the probability that a given input (either $x^s$ or $x^t$) comes from the source domain:

$$o(h(x)) \overset{\text{def}}{=} \mathrm{sigm}(d + w^\top h(x)).$$

Given the representation output by the hidden layer $h(\cdot)$:

$$\hat{d}_{\mathcal{H}}\big(h(S), h(T)\big) \approx 2 \max_{w,d} \Bigg[ \frac{1}{m} \sum_{i=1}^m \log o(h(x_i^s)) + \frac{1}{m} \sum_{i=1}^m \log\big(1 - o(h(x_i^t))\big) - 1 \Bigg].$$
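A sketch of this estimate, assuming scikit-learn's logistic regression as the maximizer over $(w, d)$ (its default L2 penalty is an assumption of the sketch, not part of the slide's formula):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def h_divergence_estimate(HS, HT):
    """Slide's surrogate for d_H(h(S), h(T)).
    HS: (m, n) hidden representations of source points; HT: same for target."""
    X = np.vstack([HS, HT])
    # Domain labels: 1 = source, 0 = target, matching o(h(x)) on the slide.
    z = np.concatenate([np.ones(len(HS)), np.zeros(len(HT))])
    clf = LogisticRegression().fit(X, z)   # maximizes the log-likelihood over (w, d)
    o = clf.predict_proba(X)[:, 1]         # o(h(x)) = P(source | h(x))
    src, tgt = o[: len(HS)], o[len(HS):]
    return 2.0 * (np.log(src).mean() + np.log(1.0 - tgt).mean() - 1.0)
```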

Domain-Adversarial Neural Network (DANN)

$$\min_{W,V,b,c} \Bigg[ \underbrace{\frac{1}{m} \sum_{i=1}^m -\log\big|1 - y_i^s - f(h(x_i^s))\big|}_{\text{source loss}} + \lambda \max_{w,d} \underbrace{\Bigg( \frac{1}{m} \sum_{i=1}^m \log o(h(x_i^s)) + \frac{1}{m} \sum_{i=1}^m \log\big(1 - o(h(x_i^t))\big) \Bigg)}_{\text{adaptation regularizer}} \Bigg],$$

where $\lambda > 0$ weights the domain adaptation regularization term.

Given a source sample $S = \{(x_i^s, y_i^s)\}_{i=1}^m \sim (\mathcal{D}_S)^m$ and a target sample $T = \{x_i^t\}_{i=1}^m \sim (\mathcal{D}_T)^m$:

[Figure: network diagram as before, with an additional domain output $o(h(x))$ (weights $w$) branching from the hidden layer.]

Stochastic training:
1. Pick examples $x^s \in S$ and $x^t \in T$
2. Update $V$ towards $f(h(x^s)) = y^s$
3. Update $W$ towards $f(h(x^s)) = y^s$
4. Update $w$ towards $o(h(x^s)) = 1$ and $o(h(x^t)) = 0$
5. Update $W$ towards $o(h(x^s)) = 0$ and $o(h(x^t)) = 1$

DANN finds a representation $h(\cdot)$ on which $f(\cdot)$ classifies $S$ well, but which is unable to discriminate between $S$ and $T$.
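A minimal numpy sketch of one DANN update, reusing `sigm` from the standard-network sketch above. Assumptions: a sigmoid output in place of the two-class softmax (as before), a scalar-output domain regressor, and illustrative `lr` and `lam`; the regressor's bias is named `d_dom` to avoid clashing with the VC dimension $d$:

```python
def dann_sgd_step(xs, ys, xt, W, b, V, c, w, d_dom, lr=0.05, lam=0.1):
    # Forward pass on one source example and one target example.
    hs = sigm(b + W @ xs)
    ht = sigm(b + W @ xt)
    p = sigm(c + V @ hs)            # f(h(x^s)): label prediction
    o_s = sigm(d_dom + w @ hs)      # o(h(x^s)): P(domain = source)
    o_t = sigm(d_dom + w @ ht)

    # Steps 2-3: descend the source loss -log|1 - y^s - f(h(x^s))|.
    g_cls = p - ys
    d_as = g_cls * V * hs * (1.0 - hs)
    V -= lr * g_cls * hs
    c -= lr * g_cls
    W -= lr * np.outer(d_as, xs)
    b -= lr * d_as

    # Gradients of the domain logistic loss at the current point.
    g_s, g_t = o_s - 1.0, o_t
    d_as_dom = g_s * w * hs * (1.0 - hs)
    d_at_dom = g_t * w * ht * (1.0 - ht)

    # Step 4: the domain regressor (w, d_dom) descends its loss,
    # moving towards o(h(x^s)) = 1 and o(h(x^t)) = 0.
    w -= lr * lam * (g_s * hs + g_t * ht)
    d_dom -= lr * lam * (g_s + g_t)

    # Step 5: the representation (W, b) ASCENDS the same loss (reversed
    # gradient), moving towards o(h(x^s)) = 0 and o(h(x^t)) = 1, so that
    # h(.) becomes indiscriminable between domains.
    W += lr * lam * (np.outer(d_as_dom, xs) + np.outer(d_at_dom, xt))
    b += lr * lam * (d_as_dom + d_at_dom)
    return W, b, V, c, w, d_dom
```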

Toy Dataset

Standard Neural Network (NN):
[Figure: decision surfaces of a standard NN on a toy dataset; one panel trained to classify the source, one panel trained to classify domains.]

Domain-Adversarial Neural Network (DANN):
[Figure: decision surfaces of DANN on the same toy dataset; classification output $f(h(x))$ and domain output $o(h(x))$.]

Amazon Reviews

Input: product review (bag of words). Output: positive or negative rating.

Results (target-domain test error; lower is better):

Dataset                    DANN    NN
books → dvd                0.201   0.199
books → electronics        0.246   0.251
books → kitchen            0.230   0.235
dvd → books                0.247   0.261
dvd → electronics          0.247   0.256
dvd → kitchen              0.227   0.227
electronics → books        0.280   0.281
electronics → dvd          0.273   0.277
electronics → kitchen      0.148   0.149
kitchen → books            0.283   0.288
kitchen → dvd              0.261   0.261
kitchen → electronics      0.161   0.161

Note: We use a small labeled subset of 100 target examples to select the hyperparameters.

Marginalized Stacked Denoising Autoencoders (mSDA)

Question: can DANN be combined with other representation learning techniques for domain adaptation?

The mSDA autoencoders (Chen et al., 2012) provide a new common representation for source and target examples (unsupervised).

With mSDA+SVM, Chen et al. (2012) obtained state-of-the-art results on Amazon Reviews:
- Train a linear SVM on the mSDA source representations.

We try mSDA+DANN (see the sketch below):
- Train DANN on the mSDA source and target representations.
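To make the combination concrete, a pipeline sketch reusing `Xs`, `ys`, `Xt` from the first sketch. `msda_transform` is a hypothetical placeholder for an mSDA implementation (Chen et al., 2012), not a real library call:

```python
import numpy as np
from sklearn.svm import LinearSVC

# mSDA learns a common representation from source + target inputs without
# using any labels. `msda_transform` is HYPOTHETICAL; noise fixed to 50%
# as on the next slide.
X_all = np.vstack([Xs, Xt])
H_all = msda_transform(X_all, noise=0.5)
Hs, Ht = H_all[: len(Xs)], H_all[len(Xs):]

# mSDA+SVM (Chen et al., 2012): a linear SVM on the source representations.
svm = LinearSVC().fit(Hs, ys)

# mSDA+DANN: run the DANN updates sketched above with Hs (labeled) as the
# source sample and Ht (unlabeled) as the target sample.
```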

Amazon Reviews

Input: product review (bag of words). Output: positive or negative rating.

Results (target-domain test error; lower is better):

Dataset                    mSDA+DANN   mSDA+SVM
books → dvd                0.176       0.175
books → electronics        0.197       0.244
books → kitchen            0.169       0.172
dvd → books                0.176       0.176
dvd → electronics          0.181       0.220
dvd → kitchen              0.151       0.178
electronics → books        0.237       0.229
electronics → dvd          0.216       0.261
electronics → kitchen      0.118       0.137
kitchen → books            0.222       0.234
kitchen → dvd              0.208       0.209
kitchen → electronics      0.141       0.138

Note: We use a small labeled subset of 100 target examples to select the hyperparameters. The noise parameter of the mSDA representations is fixed to 50%.

Future Work

Several paths to explore:
- Deeper neural network architectures.
- Multiclass / multilabel problems.
- Multisource domain adaptation.

Thank you!