Hash Bit Selection: A Unified Solution for Selection Problems in Hashing

Xianglong Liu†  Junfeng He‡∗  Bo Lang†  Shih-Fu Chang‡
† State Key Lab of Software Development Environment, Beihang University, Beijing, China
‡ Department of Electrical Engineering, Columbia University, New York, NY, USA
∗ Facebook, 1601 Willow Rd, Menlo Park, CA, USA
{xlliu, langbo}@nlsde.buaa.edu.cn  [email protected]  [email protected]

1. Proof of Theorem 1
To prove the theorem in the paper, we first prove the following proposition and lemmas. The normalized dominant set defined in the paper can be characterized and represented in terms of determinants. For $S \subseteq V$, we denote by $A_S$ the submatrix of $A$ formed by the rows and the columns indexed by the elements of $S$. Then we define the matrix $B_S$:
$$B_S = \begin{bmatrix} 0 & (\pi^{-1})^T \\ \pi^{-1} & A_S \end{bmatrix}, \tag{1}$$
and the matrix ${}^jB_S$:
$${}^jB_S = \begin{bmatrix} 0 & (\pi^{-1})^T \\ \pi^{-1} & A_S^1 \cdots A_S^{j-1}\; \mathbf{0}\; A_S^{j+1} \cdots A_S^m \end{bmatrix}, \tag{2}$$
where $S = \{i_1, \ldots, i_m\}$ with $i_1 < \ldots < i_m$, $\pi^{-1} = [\pi_{i_1}^{-1}, \ldots, \pi_{i_m}^{-1}]^T$, and $A_S^i$ is the $i$-th column of $A_S$; that is, ${}^jB_S$ is obtained from $B_S$ by zeroing the $j$-th column of $A_S$.

Proposition 1 Let $S = \{i_1, \ldots, i_m\} \subseteq V$ be a nonempty subset of vertices and assume $i_1 < \ldots < i_m$ without loss of generality. Then for any $i_h \in S$,
$$w_S(i_h) = (-1)^m \det({}^hB_S), \tag{3}$$
and
$$W(S) = (-1)^m \det(B_S). \tag{4}$$

Proof First we prove (3) holds for $m \ge 1$.

1. For $m = 1$,
$${}^1B_{\{i\}} = \begin{bmatrix} 0 & \pi_i^{-1} \\ \pi_i^{-1} & 0 \end{bmatrix}.$$
It is easy to verify that
$$w_{\{i\}}(i) = \frac{1}{\pi_i^2} = -\det({}^1B_{\{i\}}).$$

2. For $m > 1$ and $S = \{i_1, \ldots, i_m\}$ with $i_1 < \ldots < i_m$, expanding $\det({}^hB_S)$ along its $(h+1)$-th column, whose only nonzero element is $\pi_{i_h}^{-1}$ in the first row, gives
$$\det({}^hB_S) = \begin{vmatrix} 0 & \pi_{i_1}^{-1} & \cdots & \pi_{i_h}^{-1} & \cdots & \pi_{i_m}^{-1} \\ \pi_{i_1}^{-1} & a_{i_1i_1} & \cdots & 0 & \cdots & a_{i_1i_m} \\ \vdots & \vdots & & \vdots & & \vdots \\ \pi_{i_m}^{-1} & a_{i_mi_1} & \cdots & 0 & \cdots & a_{i_mi_m} \end{vmatrix} = \frac{(-1)^{h+2}}{\pi_{i_h}} \det\big(\big[\pi^{-1},\, A_S^1, \ldots, A_S^{h-1}, A_S^{h+1}, \ldots, A_S^m\big]\big).$$
Expanding the remaining determinant along the row indexed by $i_h$, we have
$$\det({}^hB_S) = -\frac{1}{\pi_{i_h}} \Big[ \sum_{i_j \in S\setminus\{i_h\}} \pi_{i_j} a_{i_hi_j} \det({}^jB_{S\setminus\{i_h\}}) + \frac{1}{\pi_{i_h}} \det(A_{S\setminus\{i_h\}}) \Big].$$
From the fact that
$$\det \begin{bmatrix} (\pi^{-1})^T\mathbf{1} & (A_S\mathbf{1})^T \\ \pi^{-1} & A_S \end{bmatrix} = 0$$
(the first row is the sum of the remaining rows), we can obtain:
$$\Big( \sum_{i_j \in S} \frac{1}{\pi_{i_j}} \Big) \det(A_S) + \sum_{i_j \in S} \Big( \sum_{i_k \in S} a_{i_ji_k} \Big) \pi_{i_j} \det({}^jB_S) = 0,$$
namely
$$\det(A_S) = -\pi_{i_h} \sum_{i_j \in S} \pi_{i_j} f(S, i_j|i_h) \det({}^jB_S).$$
Applying this identity to $S\setminus\{i_h\}$, we can rewrite $\det({}^hB_S)$ as
$$\det({}^hB_S) = -\sum_{i_j \in S\setminus\{i_h\}} \frac{\pi_{i_j}}{\pi_{i_h}} \big( a_{i_hi_j} - f(S\setminus\{i_h\}, i_j|i_h) \big) \det({}^jB_{S\setminus\{i_h\}}) = -\sum_{i_j \in S\setminus\{i_h\}} \phi_{S\setminus\{i_h\}}(i_j, i_h) \det({}^jB_{S\setminus\{i_h\}}).$$
According to the recursive definition of $w_S(i_h)$, we can conclude by induction that (3) holds for $m \ge 1$.

For (4), since $\det(B_S) = \sum_{i_h \in S} \det({}^hB_S)$, we have $W(S) = \sum_{i_h \in S} w_S(i_h) = (-1)^m \det(B_S)$. $\square$

Moreover, using the fact that
$$\det \begin{bmatrix} \pi_{i_h}^{-1} & (A_S^h)^T \\ \pi^{-1} & A_S \end{bmatrix} = 0$$
(its first row coincides with its $(h+1)$-th row, by the symmetry of $A$), we obtain an alternative way to compute $w_S(i)$:
$$w_S(i) = \sum_{j \in S\setminus\{i\}} \frac{1}{\pi_i^2} (\pi_i\pi_j a_{ij} - \pi_h\pi_j a_{hj})\, w_{S\setminus\{i\}}(j), \tag{5}$$
with $h$ an arbitrary element of $S\setminus\{i\}$, and $|S| > 1$.

Lemma 1 With $\hat A = \Pi A \Pi$, where $\Pi = \mathrm{diag}(\pi)$, the KKT equality conditions of program (7) in the paper hold if and only if
$$B_\sigma [-\lambda, \pi_{i_1}x_{i_1}, \ldots, \pi_{i_m}x_{i_m}]^T = [1, 0, \ldots, 0]^T, \tag{6}$$
where $\lambda$ is a real constant, $\sigma = \sigma(x) = \{i_1, \ldots, i_m\}$ with $i_1 < \ldots < i_m$, and
$$B_\sigma = \begin{bmatrix} 0 & \pi_{i_1}^{-1} & \cdots & \pi_{i_m}^{-1} \\ \pi_{i_1}^{-1} & & & \\ \vdots & & A_\sigma & \\ \pi_{i_m}^{-1} & & & \end{bmatrix}.$$

Proof $x \in \Delta$ satisfies the Karush-Kuhn-Tucker (KKT) conditions for program (7) if there exist $L+1$ real constants (Lagrange multipliers) $\mu_1, \ldots, \mu_L$ and $\lambda$, with $\mu_i \ge 0$ for all $i = 1, \ldots, L$, such that for all $i = 1, \ldots, L$:
$$(\hat Ax)_i - \lambda + \mu_i = 0; \qquad x_i\mu_i = 0.$$
Because of the nonnegativity of both $x_i$ and $\mu_i$, this can be restated as follows:
$$(\hat Ax)_i \begin{cases} = \lambda, & \text{if } i \in \sigma(x); \\ \le \lambda, & \text{otherwise}, \end{cases}$$
with some real constant $\lambda = x^T\hat Ax$. For $\sigma = \sigma(x) = \{i_1, \ldots, i_m\}$ with $i_1 < \ldots < i_m$, the system $B_\sigma [-\lambda, \pi_{i_1}x_{i_1}, \ldots, \pi_{i_m}x_{i_m}]^T = [1, 0, \ldots, 0]^T$ is equivalent to:
$$\begin{cases} \sum_{h=1}^m x_{i_h} = 1; \\ A_\sigma [\pi_{i_1}x_{i_1}, \ldots, \pi_{i_m}x_{i_m}]^T = \lambda [\pi_{i_1}^{-1}, \ldots, \pi_{i_m}^{-1}]^T, \end{cases}$$
namely, using $\Pi = \mathrm{diag}(\pi)$,
$$\mathbf{1}^Tx = 1; \qquad (\Pi A\Pi x)_i = \lambda, \ i \in \sigma(x).$$
By setting $\hat A = \Pi A\Pi$, we prove Lemma 1. $\square$

Lemma 2 Let $\sigma = \sigma(x)$ be the support of a vector $x \in \Delta$. Then $x$ satisfies the KKT equality conditions of program (7) in the paper if and only if
$$x_i = \begin{cases} \frac{w_\sigma(i)}{W(\sigma)}, & \text{if } i \in \sigma; \\ 0, & \text{otherwise}. \end{cases} \tag{7}$$
Moreover,
$$\frac{w_{\sigma\cup\{j\}}(j)}{W(\sigma)} = \frac{1}{\pi_j^2} \big[ (\hat Ax)_j - (\hat Ax)_i \big] = -\frac{1}{\pi_j^2} \mu_j \tag{8}$$
for all $i \in \sigma$ and $j \notin \sigma$, where the $\mu_j$ are the (nonnegative) Lagrange multipliers of program (7).

Proof The system (6), which by Lemma 1 is equivalent to the KKT equalities $(\hat Ax)_i = \lambda$, $i \in \sigma(x)$, can be treated as a linear equation problem with unknowns $\lambda$ and $x_i$, $i \in \sigma$. Since $\det(B_\sigma) \ne 0$, the problem has a unique solution; denote $\sigma = \{i_1, \ldots, i_m\}$ without loss of generality. Using Cramer's rule, we can get
$$\pi_{i_h}x_{i_h} = \frac{\pi_{i_h}\det({}^hB_\sigma)}{\det(B_\sigma)}.$$
Then, according to Proposition 1, we have
$$x_{i_h} = \frac{(-1)^m w_\sigma(i_h)}{(-1)^m W(\sigma)} = \frac{w_\sigma(i_h)}{W(\sigma)}$$
for any $1 \le h \le m$. Therefore, $x = x^\sigma$.

Using (5), we obtain:
$$\frac{w_{\sigma\cup\{j\}}(j)}{W(\sigma)} = \frac{\sum_{i_h \in \sigma} \frac{1}{\pi_j^2} (\pi_j\pi_{i_h}a_{ji_h} - \pi_{i_k}\pi_{i_h}a_{i_ki_h}) w_\sigma(i_h)}{W(\sigma)} = \frac{1}{\pi_j^2} \sum_{i_h \in \sigma} (\pi_j\pi_{i_h}a_{ji_h} - \pi_{i_k}\pi_{i_h}a_{i_ki_h})\, x^\sigma_{i_h} = \frac{1}{\pi_j^2} \big[ (\hat Ax)_j - (\hat Ax)_{i_k} \big].$$
We have the fact that $(\hat Ax)_j - (\hat Ax)_i = -\mu_j$ for all $i \in \sigma$ and $j \notin \sigma$, and $\pi_j > 0$ for all $j$. Then we can conclude that
$$\frac{w_{\sigma\cup\{j\}}(j)}{W(\sigma)} = \frac{1}{\pi_j^2} \big[ (\hat Ax)_j - (\hat Ax)_i \big] = -\frac{1}{\pi_j^2} \mu_j \le 0$$
for all $i \in \sigma$ and $j \notin \sigma$, where the $\mu_j$ are the (nonnegative) Lagrange multipliers of the quadratic programming problem in the paper. $\square$

Theorem 1 If $x^*$ is a strict local solution of program (7) with $\hat A = \Pi A\Pi$, where $\Pi = \mathrm{diag}(\pi)$, then its support $\sigma = \sigma(x^*)$ is the normalized dominant set of the graph $G = (V, E, A, \pi)$, provided that $w_{\sigma\cup\{i\}}(i) \ne 0$ for all $i \notin \sigma$.

Proof Using Lemmas 1 and 2, the proof can be completed following that of [7]. $\square$
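As a quick numerical sanity check (not part of the original proof), the linear-system characterization in Lemma 1 and the Cramer's-rule step in the proof of Lemma 2 can be verified on a small synthetic graph. The sketch below uses an arbitrary symmetric affinity matrix and positive weights, and takes $\sigma$ to be the full vertex set; note the solution of the linear system need not lie in the simplex, since Lemma 1 concerns only the KKT equality conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5

# Synthetic symmetric affinity matrix (zero diagonal, as for an affinity
# graph without self-loops) and positive vertex weights pi.
A = rng.random((m, m))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)
pi = rng.random(m) + 0.5

# Bordered matrix B_sigma from Eq. (1), taking sigma = V (full support).
B = np.zeros((m + 1, m + 1))
B[0, 1:] = 1.0 / pi
B[1:, 0] = 1.0 / pi
B[1:, 1:] = A

# Solve B_sigma [-lambda, pi_1 x_1, ..., pi_m x_m]^T = e_1 as in Eq. (6).
e1 = np.zeros(m + 1)
e1[0] = 1.0
y = np.linalg.solve(B, e1)
lam, x = -y[0], y[1:] / pi

A_hat = np.diag(pi) @ A @ np.diag(pi)          # A_hat = Pi A Pi
assert np.isclose(x.sum(), 1.0)                # 1^T x = 1
assert np.allclose(A_hat @ x, lam)             # (A_hat x)_i = lambda on sigma

# Cramer's rule as used in the proof of Lemma 2:
# pi_h x_h = det(B with column h+1 replaced by e_1) / det(B).
for h in range(m):
    Bh = B.copy()
    Bh[:, h + 1] = e1
    assert np.isclose(pi[h] * x[h], np.linalg.det(Bh) / np.linalg.det(B))
```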
Table 1. MAP (%) of bit selection over different hashing algorithms using 32-128 bits on GIST-1M.

RMMH, 500-bit pool | 32 bits | 64 bits | 128 bits
LRH [6]            |  5.85   |  10.86  |  16.78
Random             |  5.81   |  10.75  |  16.53
Greedy             |  5.74   |   9.56  |  15.65
DomSet             |  3.78   |   7.72  |  13.97
NDomSet            |  7.06   |  11.78  |  17.56
2. More experimental results

2.1. Bit selection over basic hashing method

Baselines. We still use the learned reconfigurable hashing (LRH) [6] and other naive selection methods as our baselines, and here we additionally report results over the state-of-the-art hashing method RMMH [3]. The results are shown in Table 1. As we can observe, the proposed selection (NDomSet) attains the best performance in terms of MAP in all cases with 32, 64 and 128 bits. Furthermore, since RMMH generates each hash bit independently, randomly selecting l bits acts like generating l bits using RMMH directly; comparing Random against the other bit selection methods, we can conclude that our bit selection improves the performance of RMMH the most. Although LRH obtains performance gains when using long hash codes, it fails to compete with NDomSet in terms of both accuracy and speed (roughly, NDomSet is more than five times faster than LRH).
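For reference, the support of a (local) solution of the program studied in Section 1 can be sought with replicator dynamics, the standard tool for dominant-set style simplex programs [7]. The sketch below is illustrative only: the name `ndomset_support`, the support threshold, and the construction of the bit-affinity matrix A and bit weights pi (which follow the main paper) are all placeholders here.

```python
import numpy as np

def ndomset_support(A, pi, iters=2000, tol=1e-8):
    """Support of a local solution of max x^T (Pi A Pi) x over the simplex.

    A  : (L, L) symmetric nonnegative bit-affinity matrix (not all zero)
    pi : (L,) positive bit weights; both built as in the main paper.
    """
    A_hat = pi[:, None] * A * pi[None, :]      # A_hat = Pi A Pi
    x = np.full(len(pi), 1.0 / len(pi))        # start from the barycenter
    for _ in range(iters):
        g = A_hat @ x
        x_new = x * g / (x @ g)                # replicator dynamics step
        if np.abs(x_new - x).sum() < tol:
            x = x_new
            break
        x = x_new
    return np.flatnonzero(x > 1e-6)            # sigma(x): the selected bits
```

Each replicator step keeps x on the simplex and, for symmetric nonnegative A_hat, does not decrease the quadratic objective, so the iterate settles on the support of a local solution.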
2.2. Bit selection over multiple hashing methods
Baselines. We consider the scenario using multiple hashing methods, where our baselines include the basic hashing methods with the longest codes they can generate.

This scenario derives from the fact that a number of hashing methods can only generate hash codes of limited length, due to the limited feature dimension [1, 8] or the limited number of landmarks [5]. Using multiple hashing methods, we can generate any desired number of hash bits, and our bit selection can then pick the most desirable ones from the pool of bits generated by all of them. With 128-D SIFT features [2], hashing methods like PCAH, PCAR and ITQ can generate at most 128 bits. Therefore, to use longer hash codes, say 196 bits in our experiment, we build a large bit pool with 384 bits, of which each 128 bits are generated respectively by PCAH, PCAR and ITQ [1]. We then compare all selection methods picking 196 bits against the basic hashing methods using their 128 bits. Figure 1 shows the comparison in terms of P-R curves and MAP. As we can see, bit selection methods like Greedy and NDomSet using 196 bits outperform the three basic hashing methods using their longest hash codes (i.e., 128 bits), and moreover they achieve better performance than the naive selection methods, including Random and DomSet, using the same number of hash bits. This observation demonstrates the benefit of good bit selection in this specific scenario. Note that our bit selection obtains the most significant performance gains in all cases.

[Figure 1. Performance comparison of bit selection methods over multiple hashing methods on SIFT-1M: (a) precision-recall curves of PCAH, PCAR and ITQ at 128 bits versus Random, Greedy, DomSet and NDomSet at 196 bits; (b) MAP.]
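To make the pooling step concrete, here is a minimal sketch of how such a bit pool can be assembled. The PCAH/PCAR stand-ins are simplified (sign of centered PCA projections, plus a random rotation for the PCAR-style bits), ITQ is omitted for brevity, and the data is synthetic rather than real SIFT; a selector such as the replicator-dynamics sketch above would then pick the final 196 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 128
X = rng.standard_normal((n, d)).astype(np.float32)  # stand-in for 128-D SIFT
Xc = X - X.mean(axis=0)

# PCAH-style bits: sign of projections onto the principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
bits_pcah = Xc @ Vt.T > 0                           # 128 bits per point

# PCAR-style bits: randomly rotate the PCA basis before taking signs.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
bits_pcar = Xc @ Vt.T @ R > 0                       # 128 more bits

# Pool the candidate bits from the different hashing methods; with the ITQ
# block omitted this gives two of the three 128-bit blocks (n, 256).
pool = np.hstack([bits_pcah, bits_pcar])
```

The bit weights pi and affinity A of the main paper would then be computed on this pool before running the selection.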
2.3. Bit selection over multiple bit hashing

Baselines. We employ another double-bit method using hierarchical hashing (HH) proposed in [4]. Similar to DB in the paper, we generate bits by HH over PCAR (PCAR-HH) and ITQ (ITQ-HH). A 500-bit pool is first built on GIST-1M, with 250 bits generated by PCAR-HH and the rest by ITQ-HH. Figure 2 shows the results comparing PCAR-HH, ITQ-HH, and the different bit selection methods. In Figure 2(a) the MAP of NDomSet increases dramatically when using more bits, and is consistently superior to both the double-bit hashing baselines and the selection baselines. Figure 2(b)-(d) plot the PH2 curves, recall, and precision using 32 bits. In all cases, significant performance improvements are observed, which supports our conclusion in the paper that our bit selection method can recognize bits of good quality and further improve performance over multi-bit hashing algorithms.

[Figure 2. Performance comparison of bit selection methods over multiple bit hashing on GIST-1M: (a) MAP @ 16-128 bits; (b) PH2 @ 32 bits; (c) recall @ 32 bits; (d) precision @ 32 bits.]
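For completeness, a minimal sketch of the PH2 metric as it is read here, i.e., mean precision of the points retrieved within Hamming radius 2 (a common convention in the hashing literature, assumed rather than taken from the paper). The ground-truth mask `gt` would come from the dataset's true nearest neighbors.

```python
import numpy as np

def ph2(codes_q, codes_db, gt, radius=2):
    """Mean precision of points retrieved within a Hamming ball.

    codes_q : (nq, l) boolean query codes
    codes_db: (nd, l) boolean database codes
    gt      : (nq, nd) boolean ground-truth neighbor mask
    """
    precisions = []
    for q, rel in zip(codes_q, gt):
        dist = (q[None, :] ^ codes_db).sum(axis=1)  # Hamming distances
        hits = dist <= radius
        if hits.any():                              # skip empty retrievals
            precisions.append(rel[hits].mean())
    return float(np.mean(precisions)) if precisions else 0.0
```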
References

[1] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
[2] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE TPAMI, 33(1):117-128, 2011.
[3] A. Joly and O. Buisson. Random maximum margin hashing. In CVPR, 2011.
[4] H. Liu and S. Yan. Robust graph mode seeking by graph shift. In ICML, 2010.
[5] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In ICML, 2011.
[6] Y. Mu, X. Chen, X. Liu, T.-S. Chua, and S. Yan. Multimedia semantics-aware query-adaptive hashing with bits reconfigurability. IJMIR, pages 1-12, 2012.
[7] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering. IEEE TPAMI, 29(1):167-172, 2007.
[8] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.