Unsupervised Riemannian Metric Learning for Histograms Using Aitchison Transformations

Tam Le, Marco Cuturi
Graduate School of Informatics, Kyoto University, Japan

ICML, Lille 2015

Unsupervised Riemannian Metric Learning for Histograms Using Aitchison Transformations

Supervised metric learning exploits labeled samples $\{x_i, y_i\}_{1 \leq i \leq m}$: [Xing et al.'02], [Davis et al.'07], [Weinberger et al.'06,'09].

Unsupervised metric learning, the setting of this work, uses the samples $\{x_i\}_{1 \leq i \leq m}$ alone: [Lebanon'06], [Wang et al.'07].

Unsupervised Riemannian Metric Learning for Histograms Using Aitchison Transformations

We combine transformations from Aitchison's geometry of the simplex [Aitchison'86] with the Fisher information geometry: the square-root map

    $f(x) = \sqrt{x}$

sends histograms onto the unit sphere, where the geodesic distance is

    $d(x, z) = \arccos\left(\sqrt{x}^{\,T} \sqrt{z}\right)$ [Amari & Nagaoka'00]
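As a quick sanity check of the formula above, here is a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def fisher_distance(x, z):
    """Fisher information geodesic distance between two histograms:
    map both to the sphere with the square-root map, then take the arc length."""
    u, v = np.sqrt(x), np.sqrt(z)
    # Clip the inner product to [-1, 1] to guard against rounding error.
    return np.arccos(np.clip(u @ v, -1.0, 1.0))

x = np.array([0.2, 0.3, 0.5])
z = np.array([0.1, 0.6, 0.3])
print(fisher_distance(x, z))  # approximately 0.31
```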

Why Focus on Histograms?

Histograms are points of the probability simplex

    $P_n = \left\{ x \in \mathbb{R}^{n+1} \mid x \geq 0 \text{ and } x^T \mathbf{1} = 1 \right\}$

Supervised Metric Learning for Histograms

Kedem et al.'12: chi-square distance composed with a learned linear map,

    $\chi^2(Lx, Lz)$, subject to $L\mathbf{1} = \mathbf{1}$, $L \geq 0$.

Cuturi & Avis'14: EMD distance with a learned ground metric $M$,

    $d_M(x, z) = \min_{X\mathbf{1} = x,\; X^T\mathbf{1} = z,\; X \geq 0} \langle X, M \rangle$,

    subject to $M_{ij} \geq 0$, $M_{ii} = 0$, $M_{ij} \leq M_{ik} + M_{kj}$, $\forall i, j, k$.
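For concreteness, the transport problem defining $d_M$ for a fixed ground metric can be solved as a small linear program. This SciPy sketch (names ours) evaluates the distance only; it does not perform the ground-metric learning of Cuturi & Avis'14:

```python
import numpy as np
from scipy.optimize import linprog

def emd(x, z, M):
    """Earth mover's distance: minimize <X, M> over transport plans X
    with row sums x, column sums z, and nonnegative entries."""
    n = len(x)
    c = M.reshape(-1)                       # transport plan X flattened row-major
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0    # X 1 = x   (row sums)
        A_eq[n + i, i::n] = 1.0             # X^T 1 = z (column sums)
    b_eq = np.concatenate([x, z])
    return linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None)).fun

x = np.array([0.5, 0.3, 0.2])
z = np.array([0.2, 0.2, 0.6])
M = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))  # |i - j|, a valid ground metric
print(emd(x, z, M))
```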

Unsupervised Metric Learning for Histograms

Lebanon'06: pull-back metric built from a family of specific transformations,

    $d_\lambda(x, z) = \arccos\left( \sqrt{\frac{x \bullet \lambda}{\langle x, \lambda \rangle}}^{\,T} \sqrt{\frac{z \bullet \lambda}{\langle z, \lambda \rangle}} \right)$, with $\lambda \in \mathrm{int}\,P_n$.

Reformulation with Aitchison's perturbation operator:

    $\lambda \oplus x = C(x \bullet \lambda)$, where $C(x) = \frac{x}{x^T \mathbf{1}}$, so that

    $d_\lambda(x, z) = \arccos\left( \sqrt{\lambda \oplus x}^{\,T} \sqrt{\lambda \oplus z} \right)$.
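A small NumPy sketch of this reformulated distance (names ours):

```python
import numpy as np

def closure(x):
    """C(x) = x / (x^T 1): rescale a positive vector onto the simplex."""
    return x / x.sum()

def lebanon_distance(x, z, lam):
    """Lebanon '06 pulled-back distance: perturb both histograms by
    lambda, then take the Fisher geodesic distance between the results."""
    u = np.sqrt(closure(x * lam))   # sqrt(lambda (+) x)
    v = np.sqrt(closure(z * lam))
    return np.arccos(np.clip(u @ v, -1.0, 1.0))
```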

Aitchison Transformation with Fisher Information Metric

Compose a transformation $\tau$ of the simplex with the square-root map $f(x) = \sqrt{x}$ and the spherical distance $d(u, v) = \arccos(u^T v)$, giving

    $d_\tau(x, z) = \arccos\left( \sqrt{\tau(x)}^{\,T} \sqrt{\tau(z)} \right)$

Aitchison Geometry

Perturbation operator:

    $\lambda \oplus x = C(x \bullet \lambda) \in \mathrm{int}\,P_n$

Powering operator:

    $t \odot x = C(x^t) \in \mathrm{int}\,P_n$

where $x, \lambda \in \mathrm{int}\,P_n$, $t \in \mathbb{R}$, and $C(x) = \frac{x}{x^T \mathbf{1}}$ [Aitchison, 1986].
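Both operators are one-liners in NumPy; a minimal sketch (names ours):

```python
import numpy as np

def closure(x):
    """C(x) = x / (x^T 1), the rescaling onto the simplex."""
    return x / x.sum()

def perturb(lam, x):
    """Perturbation operator: lambda (+) x = C(x . lambda), elementwise product."""
    return closure(x * lam)

def power(t, x):
    """Powering operator: t (.) x = C(x^t) for a scalar t."""
    return closure(x ** t)

x = np.array([0.5, 0.2, 0.3])
lam = np.array([0.3, 0.3, 0.4])
print(perturb(lam, x))               # stays in the interior of the simplex
print(power(0.6, x), power(2.0, x))
```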

Perturbation Operator

(Figure: two ternary plots of the simplex with vertices (1,0,0), (0,1,0), (0,0,1), showing a point x before and after perturbation by $\lambda = [0.3, 0.3, 0.4]$, which moves it to $[0.28, 0.34, 0.38]$.)

Powering Operator

(Figure: the effect of the powering operator $t \odot x$ on the simplex for $t = 0.6$ and $t = 2$.)

Aitchison Transformations

Generalized powering operator:

    $\alpha \otimes x = C(x^\alpha) \in \mathrm{int}\,P_n$

General Aitchison transformations:

    $\tau_{\lambda,\alpha} : \mathrm{int}\,P_n \to \mathrm{int}\,P_n$, $x \mapsto \lambda \oplus (\alpha \otimes x)$

where $\lambda \in \mathrm{int}\,P_n$ and $\alpha \in \mathbb{R}^{n+1}_{+}$.
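A self-contained sketch of the generalized powering, the full transformation $\tau_{\lambda,\alpha}$, and the distance it induces (used later in the talk); names are ours:

```python
import numpy as np

def closure(x):
    return x / x.sum()

def gen_power(alpha, x):
    """Generalized powering: alpha (x) x = C(x^alpha), with a positive
    exponent vector alpha applied coordinate-wise."""
    return closure(x ** alpha)

def aitchison_transform(lam, alpha, x):
    """General Aitchison transformation: tau(x) = lambda (+) (alpha (x) x)."""
    return closure(gen_power(alpha, x) * lam)

def learned_distance(x, z, lam, alpha):
    """Fisher geodesic distance taken after the transformation tau."""
    u = np.sqrt(aitchison_transform(lam, alpha, x))
    v = np.sqrt(aitchison_transform(lam, alpha, z))
    return np.arccos(np.clip(u @ v, -1.0, 1.0))
```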

Generalized Powering Operator

(Figure: ternary plots of $\alpha \otimes x$ for $\alpha = [1, 1, 0.5]$ and $\alpha = [1.3, 1, 0.5]$.)

General Aitchison Transformations

(Figure: ternary plot of $\lambda \oplus (\alpha \otimes x)$ for $\alpha = [0.5, 1, 2]$ and $\lambda = [0.2, 0.35, 0.45]$.)

Aitchison Transformations with Fisher Information Metric

    $d_{\lambda,\alpha}(x, z) = \arccos\left( \sqrt{\tau(x)}^{\,T} \sqrt{\tau(z)} \right)$, with $\tau(x) = \lambda \oplus (\alpha \otimes x)$,

where $\lambda \in \mathrm{int}\,P_n$ and $\alpha \in \mathbb{R}^{n+1}_{+}$.

Maximize Inverse Volume Framework: Pull-back Metric [Lebanon, 2006]

The learned metric $J$ is the pull-back of the spherical geometry through $h = f \circ \tau$, with $f(x) = \sqrt{x}$ and $d(u, v) = \arccos(u^T v)$.

Volume Element

Volume element of a Riemannian metric $J$ at point $x$:

    $\mathrm{dvol}\, J(x) = \sqrt{\det G(x)}$

where $G_{ij} = J(r_i, r_j)$ and $\{r_j\}_{1 \leq j \leq n}$ is a basis of $T_x P_n$.

Compute Gram Matrix via Push-forward Map

The map $h : P_n \to S^+_n$ induces the push-forward

    $h_* : T_x P_n \to T_{h(x)} S^+_n$, $r \mapsto \nabla h(x)\, r$,

so the Gram matrix is obtained as $J(r_i, r_j) = \langle h_* r_i, h_* r_j \rangle$.
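Combining the two slides above, the volume element can be checked numerically: push a tangent basis forward, form the Gram matrix, and take the square root of its determinant. A sketch under our own assumptions (central finite differences in place of exact differentials, and our choice of tangent basis):

```python
import numpy as np

def volume_element(h, x, eps=1e-6):
    """dvol J(x) = sqrt(det G(x)) with G_ij = <h_* r_i, h_* r_j>,
    where h_* is approximated by finite differences and the tangent
    basis r_j = e_j - e_{n+1} keeps the coordinates summing to one."""
    n1 = len(x)
    basis = np.eye(n1)[:-1] - np.eye(n1)[-1]        # n tangent vectors of P_n
    pushed = np.array([(h(x + eps * r) - h(x - eps * r)) / (2 * eps)
                       for r in basis])             # rows are h_* r_j
    G = pushed @ pushed.T                           # Gram matrix of the pull-back
    return np.sqrt(np.linalg.det(G))

x = np.array([0.2, 0.3, 0.5])
print(volume_element(np.sqrt, x))   # pull-back through the plain square-root map f
```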

Inverse Volume Element

    $\mathrm{dvol}\, J^{-1}(x) \;\propto\; \dfrac{\left( (x^{\alpha} \bullet \lambda)^T \mathbf{1} \right)^{\frac{n+1}{2}}}{\left( x^T \mathbf{1} \right) \, g\!\left( x^{\frac{\alpha}{2}} \right)}$, where $g(c) = \prod_k c_k$.

Maximize Inverse Volume [Lebanon, 2006]

The volume element summarizes the "size" of a Riemannian metric, so the inverse volume element measures its "smallness". Maximizing the inverse volume element at the observed histograms $\{x_i\}_{1 \leq i \leq m}$ therefore shrinks the metric around the data.

Unsupervised Riemannian Metric Learning

    $\max_{\lambda, \alpha} \; F = \dfrac{1}{m} \sum_{i=1}^{m} \log \dfrac{\mathrm{dvol}\, J^{-1}(x_i)}{\int_{P_n} \mathrm{dvol}\, J^{-1}(x)\, dx} \;-\; \mu \left\| \log \alpha \right\|_2^2$

    s.t. $\lambda \in \mathrm{int}\,P_n$, $\alpha \in \mathbb{R}^{n+1}_{+}$

The optimization problem is non-convex. $F$ is a maximum pseudo log-likelihood function under the model

    $p(x) = \dfrac{\mathrm{dvol}\, J^{-1}(x)}{\int_{P_n} \mathrm{dvol}\, J^{-1}(z)\, dz}$

Gradient Ascent

At iteration $t$, we update $\alpha$ and $\lambda$ as

    $\alpha_{t+1} = \Pi\!\left( \alpha_t + \frac{t_0}{\sqrt{t}} \frac{\partial F}{\partial \alpha} \right)$

    $\lambda_{t+1} = C\!\left( \lambda_t \bullet \exp\!\left( \frac{t_0}{\sqrt{t}} \frac{\partial F}{\partial \lambda} \right) \right)$

where $\Pi(\cdot)$ is the projection onto $\mathbb{R}^{n+1}_{+}$ offset by a threshold $\varepsilon = 10^{-20}$.
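One update step in NumPy, as we read the slide ($t_0$ and the gradients are supplied by the caller; names are ours):

```python
import numpy as np

EPS = 1e-20   # threshold used by the projection Pi onto the positive orthant

def ascent_step(alpha, lam, grad_alpha, grad_lam, t, t0=1.0):
    """One gradient-ascent iteration: an additive projected step for alpha
    and a multiplicative, simplex-preserving step for lambda."""
    step = t0 / np.sqrt(t)
    alpha_new = np.maximum(alpha + step * grad_alpha, EPS)    # projection Pi
    lam_new = lam * np.exp(step * grad_lam)                   # perturbation-style update
    return alpha_new, lam_new / lam_new.sum()                 # closure C
```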

Gradient

    $\dfrac{\partial F}{\partial \lambda} = \dfrac{1}{m} \sum_{i=1}^{m} \dfrac{\partial \log \mathrm{dvol}\, J^{-1}(x_i)}{\partial \lambda} \;-\; \mathbb{E}_{p(x)}\!\left[ \dfrac{\partial \log \mathrm{dvol}\, J^{-1}(x)}{\partial \lambda} \right]$

where

    $\dfrac{\partial \log \mathrm{dvol}\, J^{-1}(x)}{\partial \lambda} = \dfrac{(n+1)\, x^{\alpha}}{2\, (x^{\alpha} \bullet \lambda)^T \mathbf{1}}$

and similarly for $\dfrac{\partial F}{\partial \alpha}$.

Approximate Gradient by Contrastive Divergence [Hinton, 2002]

Approximate $\mathbb{E}_{p(x)}(\cdot)$ by drawing samples from

    $p(x) = \dfrac{\mathrm{dvol}\, J^{-1}(x)}{\int_{P_n} \mathrm{dvol}\, J^{-1}(z)\, dz}$

MCMC sampling applies since only a ratio between probabilities is required: we use the Metropolis-Hastings sampling method with a logistic normal proposal distribution [Aitchison & Shen, 1980], [Blei & Lafferty, 2006].
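A plausible sketch of such a sampler (the paper's exact implementation may differ): a Gaussian random walk in additive log-ratio coordinates yields a logistic normal proposal on the simplex, and the acceptance ratio only needs the unnormalized density. The log-ratio Jacobian correction and all names are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def alr(x):
    """Additive log-ratio transform: interior of the simplex -> R^n."""
    return np.log(x[:-1] / x[-1])

def alr_inv(y):
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

def mh_sample(p_unnorm, x0, n_steps=1000, scale=0.1):
    """Metropolis-Hastings on the simplex with a logistic normal proposal.
    p_unnorm is the unnormalized target, e.g. the inverse volume element;
    only ratios of it appear, so the normalizing integral is never needed."""
    x = x0
    for _ in range(n_steps):
        y_new = alr(x) + scale * rng.standard_normal(len(x) - 1)
        x_new = alr_inv(y_new)
        # prod(x) factors: Jacobian of the log-ratio change of variables.
        ratio = (p_unnorm(x_new) * np.prod(x_new)) / (p_unnorm(x) * np.prod(x))
        if rng.random() < ratio:
            x = x_new
    return x

# Example: sample from a Dirichlet-like unnormalized density on the simplex.
print(mh_sample(lambda x: np.prod(x ** np.array([2.0, 3.0, 4.0])),
                np.array([1 / 3, 1 / 3, 1 / 3])))
```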

Experimental Setting: k-Medoids Clustering

Datasets: MIT SCENE, UIUC SCENE, OXFORD FLOWER, CALTECH-101, 20 NEWS GROUP, REUTERS.

Baseline methods:
•  Euclidean distance (L2)
•  Total variation distance (L1)
•  Hellinger distance (Hellinger)
•  Chi-square distance (Chi2)
•  Cosine similarity (Cosine)
•  Aitchison map (ILR) + Euclidean distance
•  Perturbation operator + maximize inverse volume (pFIM) [Lebanon, 2006]

Results on k-Medoids Clustering

(Bar charts: F measure of CHI2, HEL, L1, COSINE, L2, ILR, pFIM, and our method on MIT SCENE and UIUC SCENE.)

Results on k-Medoids Clustering

(Bar charts: F measure of the same methods on OXFORD FLOWER and CALTECH-101.)

Results on k-Medoids Clustering

(Bar charts: F measure of the same methods on 20 NEWS GROUP and REUTERS.)

Experimental Setting: k-NN via Locality Sensitive Hashing [Charikar, 2002]

Datasets: CIFAR-10, MNIST-60K.

Baseline methods:
•  Euclidean distance (L2)
•  Hellinger distance (Hellinger)
•  Mahalanobis distance (LMNN)
•  Hellinger mapping with LMNN (Hellinger-LMNN)
•  Perturbation operator + maximize inverse volume (pFIM)

Results on k-NN via LSH

(Plots: k-NN accuracy on CIFAR-10 of L2, HELLINGER, LMNN, HELLINGER-LMNN, pFIM, and our method, as a function of the number of hash bits $b$ (50 to 400) and of the $\varepsilon$-value of LSH (0.2 to 2).)

Results on k-NN via LSH

(Plots: k-NN accuracy on MNIST-60K of the same methods, as a function of the number of hash bits $b$ (50 to 400) and of the $\varepsilon$-value of LSH (0.2 to 2).)

Summary

•  We propose a new unsupervised metric learning method for histograms that leverages Aitchison transformations.
•  We provide a new algorithm, based on contrastive divergence, for a key step of the maximize-inverse-volume framework.
•  The method applies to large datasets via locality sensitive hashing.
•  It improves on the performance of alternative approaches on many benchmark datasets.

Unsupervised Riemannian Metric Learning for Histograms Using Aitchison Transformations

Tam Le, Marco Cuturi
Graduate School of Informatics, Kyoto University, Japan

ICML, Lille 2015

Euclidean Geometry for the Simplex?

•  Euclidean geometry is not suited to the simplex.

(Image credit: Cuturi)

Hellinger Geometry for the Simplex?

•  The Hellinger map $r \mapsto \sqrt{r}$ is better suited to the simplex.

(Image credit: Cuturi)

F Measure

•  Precision (P) & recall (R):

    $P = \dfrac{TP}{TP + FP}$, $R = \dfrac{TP}{TP + FN}$

•  F measure:

    $F_\beta = \dfrac{(\beta^2 + 1)\, P R}{\beta^2 P + R}$, with $\beta = \sqrt{\dfrac{|D|}{|S|}}$

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives. The weighting penalizes false negatives more strongly than false positives.

Locality Sensitive Hashing to Approximate k-NN

Charikar (2002) proposed the hash function

    $h_r(\bar{x}) = \mathrm{sign}(r^T \bar{x})$

where $r$ is a random unit-length vector in $\mathbb{R}^{n+1}$, which satisfies

    $\Pr\left[ h_r(\bar{x}) = h_r(\bar{z}) \right] = 1 - \dfrac{d(x, z)}{\pi}$

We use $b$ hash functions to obtain hash keys ($b$ hash bits) for each histogram. The complexity of approximate nearest neighbor search is $O(m^{1/(1+\varepsilon)})$, where $m$ is the number of samples.

Locality Sensitive Hashing to Approximate k-NN

•  We choose $N = O(m^{1/(1+\varepsilon)})$ random permutations of the bits.
•  For each permutation, we maintain a sorted order of the bit vectors.
•  Given a query bit vector, we use a binary search on each permutation to retrieve its 2 closest bit vectors.
•  We examine the resulting 2N bit vectors and return the k nearest neighbors to the query under the Hamming distance.

Experiments: Setup & Parameters

Dataset          #Samples   #Class   Feature      Rep         #Dim   #Run
MIT Scene            1600        8   SIFT         BoF          200    100
UIUC Scene           3000       15   SIFT         BoF          200    100
OXFORD Flower        1360       17   SIFT         BoF          200    100
CALTECH-101          3060      102   SIFT         BoF          200    100
20 News Group       10000       20   BoW          LDA          200    100
Reuters              2500       10   BoW          LDA          200    100
MNIST-60K           60000       10   Normalized   Intensity    784      4
CIFAR-10            60000       10   BoW          SIFT         200      4

Riemannian Manifold

•  Manifold
   –  A space that is locally homeomorphic to a Euclidean space [Lee'02].
   –  Each point of the manifold has a neighbourhood that is homeomorphic to a Euclidean space [Wikipedia].

•  Differentiable (smooth) manifold
   –  A manifold that is locally similar enough to a linear space to allow one to do calculus [Wikipedia].

•  Riemannian manifold
   –  A differentiable manifold equipped with an inner product on each tangent space [Lee'02].
   –  The family of inner products is called a Riemannian metric.

Tangent Space

•  Tangent space $T_x M$, $x \in M$ [Lee'02]:
   –  The set of directional derivatives at $x$ operating on differentiable functions $C^\infty(M, \mathbb{R})$.
   –  Equivalently, classes of curves having the same velocity vector at $x$.

•  Illustration: the tangent space on the sphere,

    $T_x S_n = \left\{ v \in \mathbb{R}^{n+1} \;\middle|\; \sum_{i=1}^{n+1} v_i x_i = 0 \right\}$

Distance in a Riemannian Manifold

•  Length of a tangent vector $v \in T_x M$:

    $\|v\| = \sqrt{g_x(v, v)}$

•  Length of a curve $\gamma : [a, b] \to M$, where $\gamma'(t) \in T_{\gamma(t)} M$ for any $t \in (a, b)$ is the velocity vector of the curve at time $t$:

    $L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}\left( \gamma'(t), \gamma'(t) \right)} \, dt$

•  Distance between $x, y \in M$:

    $d_g(x, y) = \inf_{\gamma \in \Gamma(x, y)} L(\gamma)$

where $\Gamma(x, y)$ is the set of differentiable curves connecting $x$ and $y$.

Pull-back Metric

•  Given $(N, h)$ and a diffeomorphism $f : M \to N$, we define a metric $f^* h$ on $M$, called the pull-back metric, by the relation

    $(f^* h)_x(u, v) = h_{f(x)}(f_* u, f_* v)$

Homeomorphism

A function $f$ between two topological spaces $(X, T_X)$ and $(Y, T_Y)$ is called a homeomorphism if:
–  $f$ is a continuous bijection
–  the inverse function $f^{-1}$ is continuous