Unsupervised Riemannian Metric Learning
for Histograms Using Aitchison Transformations
Tam Le, Marco Cuturi
Graduate School of Informatics, Kyoto University, Japan
ICML, Lille 2015
Unsupervised Riemannian Metric Learning
for Histograms Using Aitchison Transformations

• Supervised metric learning: labeled data $\{(x_i, y_i)\}_{1 \le i \le m}$ [Xing et al.'02] [Davis et al.'07] [Weinberger et al.'06,'09]
• Unsupervised metric learning: samples only, $\{x_i\}_{1 \le i \le m}$ [Lebanon'06] [Wang et al.'07]
Unsupervised Riemannian Metric Learning
for Histograms Using Aitchison Transformations

[Aitchison'86]

The square-root map sends histograms to the sphere, where the Fisher information geodesic distance applies:
$$f(x) = \sqrt{x}, \qquad d(x, z) = \arccos\left(\sqrt{x}^{\,T}\sqrt{z}\right)$$
[Amari & Nagaoka'00]
Why Focus on Histograms?

Histograms are points on the probability simplex:
$$\mathcal{P}_n = \left\{ x \in \mathbb{R}^{n+1} \;\middle|\; x \ge 0 \text{ and } x^{T}\mathbf{1} = 1 \right\}$$
Supervised Metric Learning for Histograms

• Kedem et al.'12: chi-square distance composed with a linear map,
$$\chi^{2}(Lx, Lz), \qquad \text{s.t. } L^{T}\mathbf{1} = \mathbf{1},\ L \ge 0$$
• Cuturi & Avis'14: EMD distance with a learned ground metric,
$$d_{M}(x, z) = \min_{X\mathbf{1} = x,\ X^{T}\mathbf{1} = z,\ X \ge 0} \langle X, M \rangle$$
$$\text{s.t. } M_{ij} \ge 0,\ M_{ii} = 0,\ M_{ij} \le M_{ik} + M_{kj},\ \forall i, j, k$$
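To make the first baseline concrete, here is a minimal numpy sketch of the chi-square distance after a linear map, in the spirit of Kedem et al.'12 (the function names and the `eps` smoothing are ours, not from the paper):

```python
import numpy as np

def chi_square(u, v, eps=1e-12):
    # Chi-square distance between two histograms u and v.
    return 0.5 * np.sum((u - v) ** 2 / (u + v + eps))

def chi_square_mapped(L, x, z):
    # Distance after the learned linear map; if the columns of L are
    # nonnegative and sum to one, Lx and Lz remain histograms.
    return chi_square(L @ x, L @ z)
```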
Unsupervised Metric Learning for Histograms

• Lebanon'06: pull-back metric from a family of specific transformations,
$$d_{\lambda}(x, z) = \arccos\left( \sqrt{\frac{x \bullet \lambda}{\langle x, \lambda \rangle}}^{\,T} \sqrt{\frac{z \bullet \lambda}{\langle z, \lambda \rangle}} \right), \qquad \lambda \in \mathrm{int}\,\mathcal{P}_n$$
• Reformulation with Aitchison's perturbation operator:
$$\lambda \oplus x = C(\lambda \bullet x); \qquad C(x) = \frac{x}{x^{T}\mathbf{1}}; \qquad d_{\lambda}(x, z) = \arccos\left( \sqrt{\lambda \oplus x}^{\,T} \sqrt{\lambda \oplus z} \right)$$
Aitchison Transformation with Fisher Information Metric

$$f(x) = \sqrt{x}, \qquad d(u, v) = \arccos(u^{T}v)$$
$$d_{\theta}(x, z) = \arccos\left( \sqrt{\theta(x)}^{\,T} \sqrt{\theta(z)} \right)$$
Aitchison Geometry

• Perturbation operator:
$$\lambda \oplus x = C(\lambda \bullet x) \in \mathrm{int}\,\mathcal{P}_n$$
• Powering operator:
$$t \odot x = C(x^{t}) \in \mathrm{int}\,\mathcal{P}_n$$
where $x, \lambda \in \mathrm{int}\,\mathcal{P}_n$, $t \in \mathbb{R}$, and $C(x) = \frac{x}{x^{T}\mathbf{1}}$ [Aitchison, 1986]
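A minimal numpy sketch of the two operators (the function names are ours). Perturbation and powering play the roles of addition and scalar multiplication in Aitchison geometry; the closure $C$ keeps every output in the interior of the simplex:

```python
import numpy as np

def closure(x):
    # C(x) = x / (x^T 1): rescale a positive vector onto the simplex.
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(lam, x):
    # Perturbation: lam (+) x = C(lam . x), an elementwise product.
    return closure(np.asarray(lam) * np.asarray(x, dtype=float))

def power(t, x):
    # Powering: t (.) x = C(x**t), a scalar exponent t.
    return closure(np.asarray(x, dtype=float) ** t)
```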
Perturbation Operator

[Figure: the 2-simplex with vertices (1,0,0), (0,1,0), (0,0,1); perturbation λ ⊕ x with λ = [0.3, 0.3, 0.4], yielding the point [0.28, 0.34, 0.38].]
Powering Operator

[Figure: the powering operator t ⊙ x on the 2-simplex for t = 0.6 and t = 2.]
Aitchison Transformations

• Generalized powering operator:
$$\alpha \otimes x = C(x^{\alpha}) \in \mathrm{int}\,\mathcal{P}_n$$
• General Aitchison transformations:
$$\theta : \mathrm{int}\,\mathcal{P}_n \to \mathrm{int}\,\mathcal{P}_n, \qquad x \mapsto \lambda \oplus (\alpha \otimes x)$$
where $x, \lambda \in \mathrm{int}\,\mathcal{P}_n$, $\alpha \in \mathbb{R}^{n+1}_{+}$
Generalized Powering Operator

[Figure: α ⊗ x on the 2-simplex for α = [1, 1, 0.5] and α = [1.3, 1, 0.5].]
General Aitchison Transformations

[Figure: λ ⊕ (α ⊗ x) on the 2-simplex with α = [0.5, 1, 2] and λ = [0.2, 0.35, 0.45].]
Aitchison Transformations with Fisher Information Metric

$$d_{\theta}(x, z) = \arccos\left( \sqrt{\theta(x)}^{\,T} \sqrt{\theta(z)} \right), \qquad \theta(x) = \lambda \oplus (\alpha \otimes x)$$
where $\lambda \in \mathrm{int}\,\mathcal{P}_n$, $\alpha \in \mathbb{R}^{n+1}_{+}$
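Putting the pieces together, a sketch of the parametrized distance for fixed λ and α (names are ours; the clip guards arccos against rounding):

```python
import numpy as np

def closure(x):
    return np.asarray(x, dtype=float) / np.sum(x)

def theta(lam, alpha, x):
    # theta(x) = lam (+) (alpha (x) x) = C(lam . x**alpha): the outer
    # closure absorbs the normalization of the inner powering step.
    return closure(np.asarray(lam) * np.asarray(x, dtype=float) ** np.asarray(alpha))

def d_theta(lam, alpha, x, z):
    # Fisher information geodesic distance after the Aitchison map.
    c = np.dot(np.sqrt(theta(lam, alpha, x)), np.sqrt(theta(lam, alpha, z)))
    return np.arccos(np.clip(c, -1.0, 1.0))
```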
Maximize Inverse Volume Framework: Pull-back Metric
[Lebanon, 2006]

$$h = f \circ \theta, \qquad f(x) = \sqrt{x}, \qquad d(u, v) = \arccos(u^{T}v)$$

[Diagram: $h$ maps $\mathcal{P}_n$ to the sphere; $J$ is the pull-back through $h$ of the sphere's metric.]
Volume Element

Volume element of the Riemannian metric $J$ at point $x$:
$$\mathrm{dvol}\,J(x) = \sqrt{\det G(x)}$$
where $G_{ij} = J(r_i, r_j)$ and $\{r_j\}_{1 \le j \le n}$ is a basis of $T_x\mathcal{P}_n$.
Compute Gram Matrix via Push-forward Map

[Diagram: $h : \mathcal{P}_n \to \mathbb{S}^{+}_{n}$, with tangent spaces $T_x\mathcal{P}_n$ and $T_{h(x)}\mathbb{S}^{+}_{n}$.]

$$h_{*} : T_x\mathcal{P}_n \to T_{h(x)}\mathbb{S}^{+}_{n}, \qquad r \mapsto \nabla h(x)\, r$$
$$J(r_i, r_j) = \langle h_{*}r_i,\, h_{*}r_j \rangle$$
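For intuition, the Gram matrix can be approximated numerically by pushing a tangent basis through $h$ with finite differences; this is a sketch of the construction, not the paper's closed form:

```python
import numpy as np

def gram_matrix(h, x, eps=1e-6):
    # Basis of T_x P_n: sum-zero vectors r_i = e_i - e_{n+1}.
    n1 = x.shape[0]
    basis = [np.eye(n1)[i] - np.eye(n1)[-1] for i in range(n1 - 1)]
    # Push-forward h_* r ~ (h(x + eps r) - h(x - eps r)) / (2 eps).
    H = np.array([(h(x + eps * r) - h(x - eps * r)) / (2 * eps) for r in basis])
    return H @ H.T  # G_ij = <h_* r_i, h_* r_j>
```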
Inverse Volume Element

$$\mathrm{dvol}\,J^{-1}(x) \;\propto\; \frac{\left( (x^{\alpha} \bullet \lambda)^{T}\mathbf{1} \right)^{\frac{n+1}{2}}}{\left( \alpha^{-1} \right)^{T} x \;\; g\!\left( x^{\frac{\alpha}{2} - 1} \right)}
\qquad \text{where } g(c) = \prod_k c_k$$
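Reading the formula above literally (our reconstruction of the garbled slide, valid up to factors independent of $x$, which cancel in the normalized density), the log inverse volume element can be sketched as:

```python
import numpy as np

def log_inv_volume(x, lam, alpha):
    # log dvol J^{-1}(x) up to an additive constant independent of x.
    n1 = x.shape[0]  # n + 1
    s = np.dot(x ** alpha, lam)                      # (x^alpha . lam)^T 1
    return (0.5 * n1 * np.log(s)
            - np.log(np.dot(1.0 / alpha, x))         # (alpha^{-1})^T x
            - np.sum((alpha / 2 - 1) * np.log(x)))   # log g(x^{alpha/2 - 1})
```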
Maximize Inverse Volume
[Lebanon, 2006]

• The volume element summarizes the "size" of the Riemannian metric.
• The inverse volume element measures the "smallness" of the metric: maximizing it at the samples $\{x_i\}_{1 \le i \le m}$ makes the metric small around the data.
Unsupervised Riemannian Metric Learning

$$\max_{\alpha, \lambda} \; F = \frac{1}{m} \sum_{i=1}^{m} \log \frac{\mathrm{dvol}\,J^{-1}(x_i)}{\int_{\mathcal{P}_n} \mathrm{dvol}\,J^{-1}(x)\,dx} \; - \; \mu \left\| \log \alpha \right\|_{2}^{2}$$
$$\text{s.t. } \lambda \in \mathrm{int}\,\mathcal{P}_n, \quad \alpha \in \mathbb{R}^{n+1}_{+}$$

The optimization problem is non-convex. $F$ is a maximum pseudo log-likelihood function under the model
$$p(x) = \frac{\mathrm{dvol}\,J^{-1}(x)}{\int_{\mathcal{P}_n} \mathrm{dvol}\,J^{-1}(z)\,dz}$$
Gradient Ascent

At iteration $t$, we update $\alpha$ and $\lambda$:
$$\alpha_{t+1} = \Pi\left( \alpha_t + \frac{t_0}{\sqrt{t}} \frac{\partial F}{\partial \alpha} \right), \qquad \lambda_{t+1} = C\left( \lambda_t \bullet \exp\left( \frac{t_0}{\sqrt{t}} \frac{\partial F}{\partial \lambda} \right) \right)$$
where $\Pi(\cdot)$ is the projection onto $\mathbb{R}^{n+1}_{+}$ offset by a threshold $\varepsilon = 10^{-20}$.
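A sketch of one update step (the step size `t0` and the contrastive-divergence gradient estimates `grad_alpha`, `grad_lam` are assumed given):

```python
import numpy as np

EPS = 1e-20  # threshold used by the projection Pi

def ascent_step(alpha, lam, grad_alpha, grad_lam, t, t0=1.0):
    step = t0 / np.sqrt(t)
    # Additive step for alpha, projected back onto the positive orthant.
    alpha_new = np.maximum(alpha + step * grad_alpha, EPS)
    # Multiplicative step for lam, renormalized by the closure C.
    lam_new = lam * np.exp(step * grad_lam)
    return alpha_new, lam_new / lam_new.sum()
```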
Gradient

$$\frac{\partial F}{\partial \lambda} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \log \mathrm{dvol}\,J^{-1}(x_i)}{\partial \lambda} \; - \; \mathbb{E}_{p(x)}\!\left[ \frac{\partial \log \mathrm{dvol}\,J^{-1}(x)}{\partial \lambda} \right]$$
where
$$\frac{\partial \log \mathrm{dvol}\,J^{-1}(x)}{\partial \lambda} = \frac{(n+1)\, x^{\alpha}}{2\, (x^{\alpha} \bullet \lambda)^{T}\mathbf{1}}$$
and similarly for $\frac{\partial F}{\partial \alpha}$.
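Translating the λ-gradient into code (a sketch; the expectation term is replaced by a Monte Carlo average over MCMC samples, as on the next slide):

```python
import numpy as np

def dlogvol_dlam(x, lam, alpha):
    # d log dvol J^{-1}(x) / d lam = (n+1) x^alpha / (2 (x^alpha . lam)^T 1).
    xa = x ** alpha
    return (x.shape[0] * xa) / (2.0 * np.dot(xa, lam))

def grad_F_lam(data, samples, lam, alpha):
    # dF/dlam: empirical mean over the data minus the expectation under
    # p(x), the latter approximated with MCMC samples.
    d = np.mean([dlogvol_dlam(x, lam, alpha) for x in data], axis=0)
    e = np.mean([dlogvol_dlam(x, lam, alpha) for x in samples], axis=0)
    return d - e
```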
Approximate Gradient by Contrastive Divergence
[Hinton, 2002]

Approximate $\mathbb{E}_{p(x)}(\cdot)$ by drawing samples from
$$p(x) = \frac{\mathrm{dvol}\,J^{-1}(x)}{\int_{\mathcal{P}_n} \mathrm{dvol}\,J^{-1}(z)\,dz}$$
MCMC sampling applies since only ratios of probabilities are required. We use the Metropolis-Hastings sampling method with a logistic normal proposal distribution. [Aitchison & Shen, 1980] [Blei & Lafferty, 2006]
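A generic random-walk Metropolis-Hastings sketch on the simplex in logistic coordinates; this is a stand-in for the logistic normal proposal on the slide, and `log_inv_vol` is any function returning $\log \mathrm{dvol}\,J^{-1}(x)$ up to a constant (e.g. the sketch from the Inverse Volume Element slide):

```python
import numpy as np

def to_simplex(y):
    # Additive logistic transform R^n -> int P_n (n+1 coordinates).
    e = np.exp(np.append(y, 0.0))
    return e / e.sum()

def mh_sample(log_inv_vol, n, n_steps=1000, step=0.3, seed=0):
    # Targets p(x) proportional to dvol J^{-1}(x); only ratios are needed,
    # so the normalizing integral never has to be computed.
    rng = np.random.default_rng(seed)
    y = rng.normal(size=n)
    def log_target(y):
        x = to_simplex(y)
        # log |det Jacobian| of the logistic transform is sum_k log x_k.
        return log_inv_vol(x) + np.sum(np.log(x))
    lp, out = log_target(y), []
    for _ in range(n_steps):
        y_prop = y + step * rng.normal(size=n)
        lp_prop = log_target(y_prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # MH accept/reject
            y, lp = y_prop, lp_prop
        out.append(to_simplex(y))
    return out
```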
Experimental Setting: k-Medoids Clustering

Datasets: MIT SCENE, UIUC SCENE, OXFORD FLOWER, CALTECH-101, 20 NEWS GROUP, REUTERS.

Baseline methods:
• Euclidean distance (L2)
• Total variation distance (L1)
• Hellinger distance (Hellinger)
• Chi-square distance (Chi2)
• Cosine similarity (Cosine)
• Aitchison map (ILR) + Euclidean distance
• Perturbation operation + maximize inverse volume (pFIM) [Lebanon, 2006]
Results on k-Medoids Clustering

[Figure: F-measure on MIT SCENE and UIUC SCENE for CHI2, HEL, L1, COSINE, L2, ILR, pFIM, and our method.]
Results on k-Medoids Clustering

[Figure: F-measure on OXFORD FLOWER and CALTECH 101 for CHI2, HEL, L1, COSINE, L2, ILR, pFIM, and our method.]
Results on k-Medoids Clustering

[Figure: F-measure on 20 NEWS GROUP and REUTERS for CHI2, HEL, L1, COSINE, L2, ILR, pFIM, and our method.]
Experimental Setting: k-NN via Locality Sensitive Hashing
[Charikar, 2002]

Datasets: CIFAR-10, MNIST-60K.

Baseline methods:
• Euclidean distance (L2)
• Hellinger distance (Hellinger)
• Mahalanobis distance (LMNN)
• Hellinger mapping with LMNN (Hellinger-LMNN)
• Perturbation operation + maximize inverse volume (pFIM)
Results on k-NN via LSH

[Figure: accuracy on CIFAR-10 versus the number of bits b and versus the ε-value of LSH, for L2, HELLINGER, LMNN, HELLINGER-LMNN, pFIM, and our method.]
Results on k-NN via LSH

[Figure: accuracy on MNIST-60K versus the number of bits b and versus the ε-value of LSH, for L2, HELLINGER, LMNN, HELLINGER-LMNN, pFIM, and our method.]
Summary

• We propose a new unsupervised metric learning method for histograms that leverages Aitchison transformations.
• We provide a new algorithm for a key step of the maximize-inverse-volume framework, using contrastive divergence.
• The method scales to large datasets via locality sensitive hashing.
• It improves on alternative approaches on many benchmark datasets.
Unsupervised Riemannian Metric Learning
for Histograms Using Aitchison Transformations
Tam Le, Marco Cuturi
Graduate School of Informatics, Kyoto University, Japan
ICML, Lille 2015
Euclidean Geometry for Simplex?

• Euclidean geometry is not suited to the simplex.

(Image credit: Cuturi)
Hellinger Geometry for Simplex?

• The Hellinger map $r \mapsto \sqrt{r}$ is better suited to the simplex.

(Image credit: Cuturi)
F Measure

• Precision (P) & Recall (R):
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
• F measure:
$$F_{\beta} = \frac{(\beta^{2} + 1)\, P\, R}{\beta^{2} P + R}, \qquad \beta = \sqrt{\frac{|D|}{|S|}}$$
where TP: true positive, TN: true negative, FP: false positive, FN: false negative. The weight $\beta$ penalizes FN more strongly than FP.
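For reference, the formula above translates directly into a small helper (a sketch with names of our choosing):

```python
def f_measure(tp, fp, fn, beta=1.0):
    # F_beta = (beta^2 + 1) P R / (beta^2 P + R); beta > 1 weights recall,
    # i.e. penalizes false negatives more strongly than false positives.
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```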
Locality Sensitive Hashing to Approximate k-NN

Charikar (2002) proposed the hash function
$$h_r(\bar{x}) = \mathrm{sign}(r^{T}\bar{x})$$
where $r$ is a random unit-length vector in $\mathbb{R}^{n+1}$, so that
$$\Pr\left[ h_r(\bar{x}) = h_r(\bar{z}) \right] = 1 - \frac{d(x, z)}{\pi}$$
We use $b$ hash functions to obtain hash keys ($b$ hash bits) for each histogram. The complexity of approximate nearest-neighbor search is $O(m^{1/(1+\varepsilon)})$, where $m$ is the number of samples.
Locality Sensitive Hashing to Approximate k-NN

• We choose $N = O(m^{1/(1+\varepsilon)})$ random permutations of the bits.
• For each permutation, we maintain a sorted order of the bit vectors.
• Given a query bit vector, we use a binary search on each permutation to retrieve the 2 closest bit vectors.
• We examine these $2N$ bit vectors and return the k nearest neighbors to the query via Hamming distance (see the sketch below).
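A compact numpy sketch of the hashing side: random-hyperplane sign bits plus a brute-force Hamming ranking. In the actual scheme above, the permutation-and-binary-search probing replaces this linear scan with an $O(m^{1/(1+\varepsilon)})$ candidate lookup:

```python
import numpy as np

def lsh_keys(X, b, seed=0):
    # Charikar's hash: b sign bits per vector; two vectors agree on a bit
    # with probability 1 - angle(x, z) / pi.
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(b, X.shape[1]))
    R /= np.linalg.norm(R, axis=1, keepdims=True)  # unit-length directions
    return X @ R.T > 0                             # boolean keys, shape (m, b)

def knn_by_hamming(keys, query_key, k):
    # Rank stored keys by Hamming distance to the query's key.
    d = np.count_nonzero(keys != query_key, axis=1)
    return np.argsort(d)[:k]
```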
Experiments: Set-up & Parameters

Dataset         #Samples  #Class  Feature     Rep        #Dim  #Run
MIT Scene       1600      8       SIFT        BoF        200   100
UIUC Scene      3000      15      SIFT        BoF        200   100
OXFORD Flower   1360      17      SIFT        BoF        200   100
CALTECH-101     3060      102     SIFT        BoF        200   100
20 News Group   10000     20      BoW         LDA        200   100
Reuters         2500      10      BoW         LDA        200   100
MNIST-60K       60000     10      Normalized  Intensity  784   4
CIFAR-10        60000     10      BoW         SIFT       200   4
Riemannian Manifold

• Manifold
  - A space that is locally homeomorphic to a Euclidean space [Lee'02].
  - Each point of the manifold has a neighbourhood that is homeomorphic to a Euclidean space [Wikipedia].
• Differential manifold (smooth manifold)
  - A manifold that is locally similar enough to a linear space to allow one to do calculus [Wikipedia].
• Riemannian manifold
  - A differential manifold equipped with an inner product on each tangent space [Lee'02].
  - The family of inner products is called the Riemannian metric.
Tangent Space

• Tangent space $T_x M$, $x \in M$ [Lee'02]:
  - the set of directional derivatives at $x$ operating on differentiable functions $C^{\infty}(M, \mathbb{R})$;
  - equivalently, classes of curves having the same velocity vector at $x$.
• Illustration: the tangent space of the sphere,
$$T_x \mathbb{S}_n = \left\{ v \in \mathbb{R}^{n+1} \;\middle|\; \sum_{i=1}^{n+1} v_i x_i = 0 \right\}$$
Distance in a Riemannian Manifold

• Length of a tangent vector $v \in T_x M$:
$$\|v\| = \sqrt{g_x(v, v)}$$
• Length of a curve $\gamma : [a, b] \to M$, where $\gamma'(t)$ is the tangent vector in $T_{\gamma(t)}M$ for any $t \in (a, b)$ (a.k.a. the velocity vector of the curve at time $t$):
$$L(\gamma) = \int_{a}^{b} \sqrt{g_{\gamma(t)}\left(\gamma'(t), \gamma'(t)\right)}\, dt$$
• Distance between $x, y \in M$:
$$d_g(x, y) = \inf_{\gamma \in \Gamma(x, y)} \int_{a}^{b} \sqrt{g_{\gamma(t)}\left(\gamma'(t), \gamma'(t)\right)}\, dt$$
where $\Gamma(x, y)$ is the set of differentiable curves connecting $x$ and $y$.
Pull-back Metric

Given $(N, h)$ and a diffeomorphism $f : M \to N$, we define a metric $f^{*}h$ on $M$, called the pull-back metric, by the relation
$$(f^{*}h)_x(u, v) = h_{f(x)}\left( f_{*}u, f_{*}v \right)$$
Homeomorphism

A function $f$ between two topological spaces $(X, T_X)$ and $(Y, T_Y)$ is called a homeomorphism if
• $f$ is a continuous bijection, and
• the inverse function $f^{-1}$ is continuous.