Hyperfeatures – Multilevel Local Coding for Visual Recognition
Ankur Agarwal & Bill Triggs GRAVIR-INRIA-CNRS, Grenoble
European Conference on Computer Vision, May 2006
Introduction
• Target problem: effectively coding features for recognition in images
• Approach: encode feature co-occurrence statistics recursively at multiple levels of abstraction
  – use local image patches described by robust descriptors
  – build a hierarchical model starting with texton-like representations
Talk outline
• Textons and co-occurrence
• Hyperfeatures
• Experimental results on classification
Texton / Bag-of-Features Approach
• Represent the image as a loose collection of individual patches
  – dense, multi-scale, sparse at salient points or edges . . .
• Encode each patch by a vector of local appearance descriptors
  – invariance to lighting, geometric perturbations . . .
• Characterize the image / region by the distribution of its patch descriptors
[Figure: codebook of patch exemplars with their occurrence counts in an image, e.g. { ×24, ×2, ×6, ×7, ×2, · · · }]
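To make the coding step concrete, here is a minimal sketch of a bag-of-features representation, assuming patch descriptors (e.g. SIFT vectors) have already been extracted; the function names and the use of NumPy / scikit-learn k-means are illustrative assumptions, not the exact implementation used in the talk.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(descriptors, n_words=200, seed=0):
    """Cluster training descriptors (n_samples x dim) into a visual vocabulary."""
    km = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    km.fit(descriptors)
    return km

def bag_of_features(descriptors, codebook):
    """Represent an image by the normalized histogram of its patch labels."""
    words = codebook.predict(descriptors)          # nearest centre per patch
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)             # distribution over visual words
```

Images or regions are then compared or classified through these histograms rather than through the raw patches.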
Capturing Spatial Coherence
• Textons are a very effective model for texture and object images, but they encode spatial coherence / object geometry only weakly.
Ways to improve this:
• Explicitly model inter-feature geometry (e.g. constellation models)
• Random fields over local labels — MRF, CRF
• Encoding local (pairwise / neighbourhood) co-occurrence of features
Co-occurrence at Multiple Levels
We would like to
• Encode the object/scene as a hierarchy of visual parts
• Capture spatial structure loosely, without precise geometry
• Provide a generic framework to include different feature coding schemes
• hierarchical, bottom-up, memory-based model
• increasing abstraction, less spatial precision in higher levels
[Figure: the hyperfeature hierarchy: image features → co-occurrence → level 1 features → co-occurrence → level 2 features → co-occurrence → level 3 features; abstraction increases with level (cortical suggestiveness)]
Existing Hierarchical Models
• Take inspiration from biological pattern recognition
• Generally alternating stages of simple and complex cells
• e.g. Neocognitron, Convolutional Neural Networks, HMAX . . .
  – Neocognitron activates a higher level if at least one cell is active
  – CNNs use a bank of convolution filters against learned templates
  – HMAX performs a max operation to retain only the dominant signal
• They are all discriminative models
Using co-occurrence, we can code descriptive statistics and allow for more sophisticated nonlinear codings
Hyperfeatures
A general principle for multi-level coding: recursively encode neighbourhood co-occurrence statistics at multiple levels of abstraction
• Level 0: bag of multiscale features
  – Each point is a base feature vector, e.g. a SIFT descriptor
• Level 1: locally collect neighbourhood statistics of these feature vectors
  – Each point is a higher level (feature) vector
• Repeat recursively for higher levels to obtain hyperfeatures
  – cf. the layers of hypercolumns in the cortex
Hyperfeatures for Image Classification
[Figure: coding pipeline: at level 0 the image features are vector quantized to form local histograms and a global histogram (bag of features); at each level l = 1, …, N the local histograms of the level below serve as input features and are again vector quantized into local histograms and a per-level global histogram]
• Global histograms from each level may be fed into a classifier
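As a rough sketch of this last step, the per-level global histograms of an image can be concatenated and passed to a linear SVM. The use of scikit-learn's LinearSVC below is an illustrative assumption; the talk only states that a linear SVM is used.

```python
import numpy as np
from sklearn.svm import LinearSVC

def stack_levels(per_level_histograms):
    """Concatenate the global histograms H^(0) .. H^(N) of one image into a single vector."""
    return np.concatenate(per_level_histograms)

def train_image_classifier(train_histograms, labels, C=1.0):
    """Train a linear SVM on stacked multilevel descriptors.

    train_histograms: one entry per image, each a list of per-level global histograms.
    labels: class label per image.
    """
    X = np.stack([stack_levels(h) for h in train_histograms])
    return LinearSVC(C=C).fit(X, np.asarray(labels))
```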
Feature codings at individual levels
Base image patch descriptors
• SIFT-like gradient orientation histograms
  – evaluated on a regular (multiscale) grid without rotation normalization
Distributional coding methods
• Vector Quantization (VQ)
• Gaussian Mixture (GM)
  – training using EM, diagonal covariance model
• Latent Dirichlet Allocation (LDA)
  – document → topic → word
[Figure: VQ centres vs. GM centres in descriptor space]
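To illustrate the difference between these codings, here is a minimal sketch of hard VQ coding versus soft Gaussian-mixture coding of a set of patch descriptors, using scikit-learn's diagonal-covariance GaussianMixture as a stand-in for the EM training mentioned above (an illustrative choice; the LDA stage is omitted).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gm_codebook(descriptors, n_centres=100, seed=0):
    """Fit a diagonal-covariance Gaussian mixture to training descriptors with EM."""
    gm = GaussianMixture(n_components=n_centres, covariance_type="diag", random_state=seed)
    return gm.fit(descriptors)

def vq_code(descriptors, gm):
    """Hard assignment: each descriptor votes for its single most likely centre."""
    hard = gm.predict(descriptors)
    hist = np.bincount(hard, minlength=gm.n_components).astype(float)
    return hist / max(hist.sum(), 1.0)

def gm_code(descriptors, gm):
    """Soft assignment: each descriptor distributes its vote by posterior responsibility."""
    resp = gm.predict_proba(descriptors)           # (n_patches, n_centres)
    hist = resp.sum(axis=0)
    return hist / max(hist.sum(), 1.0)
```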
Effect of different coding methods
Image classification using a linear SVM on hyperfeatures
[Figure: AUC (%) for VQ, VQ+LDA, GM and GM+LDA codings, with 30/30/30/30, 50/50/50/50, 100/50/50/50, 100/100/100/100 and 200/100/100/100 centres per level]
• Performance improves with increasing codebook size
• Gaussian mixtures consistently outperform vector quantization
• Performing LDA further improves results
Codebook sizes and hyperfeatures
Number of topics in LDA
• More topics always useful
• Larger vocabularies better
[Figure: AUC (%) vs. number of LDA topics (no LDA, 0.2K, 0.4K, 0.6K, 0.8K, K) for vocabulary sizes K = 50–600 words, VQ train and test curves]
How to distribute centres
• Distributing centres across levels beneficial
• Too many levels harmful
[Figure: AUC (%) for centre allocations 600/0/0/0, 300/300/0/0, 400/200/0/0, 200/200/200/0 and 150/150/150/150 per level]
Classification of object categories
• Linear SVM classifier applied to higher level features, GM coding
• Structured objects like motorbikes and cars benefit most
  – saturation after 2–3 extra levels
[Figure: Detection Error Tradeoff curves (miss rate vs. false positive rate), PASCAL VOC2005 dataset: (a) Motorbikes, (b) Cars, (c) Bicycles, (d) People; curves for hyperfeatures up to levels 0, 1, 2, 3]
Classification of textures
• Perfect classification on some textures using a simple bag of features
• Including 2 or 3 levels of hyperfeatures is beneficial for the rest of the classes
[Figure: Detection Error Tradeoff curves (miss rate vs. false positive rate), KTH-TIPS texture dataset: (a) Aluminium foil, (b) Cracker, (c) Orange peel, (d) Sponge; curves for hyperfeatures up to levels 0, 1, 2, 3]
Experiments on Line Drawings
Picture naming project in language research: classifying drawings of people from objects/scenes
• Extent of local spatial aggregation is important
• More levels needed for smaller neighbourhoods
[Figure: AUC (%) vs. number of levels (0–7) for neighbourhood sizes 3×3×3, 5×5×3, 7×7×3 and 9×9×3]
Classifying Local Image Regions
Despite spatial aggregation, high-level hyperfeatures remain local
• Construct hyperfeatures individually for each local region of the image
  – build a “mini-pyramid” for each region
• Train a patch classifier using training bounding boxes
  – patches on the object treated as positive
  – patches on background and other classes as negative
• Label each test image region for localizing object parts
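A minimal sketch of how training patches might be labelled from bounding boxes and used to train a patch-level linear classifier; the function names and the centre-inside-box criterion are hypothetical choices for illustration, not an interface specified in the talk.

```python
import numpy as np
from sklearn.svm import LinearSVC

def patch_in_box(x, y, box):
    """box = (xmin, ymin, xmax, ymax); True if the patch centre lies inside it."""
    xmin, ymin, xmax, ymax = box
    return xmin <= x <= xmax and ymin <= y <= ymax

def label_patches(patch_centres, object_boxes):
    """Patches whose centre falls inside any object bounding box are positives (+1);
    all remaining patches (background / other classes) are negatives (-1)."""
    labels = []
    for (x, y) in patch_centres:
        pos = any(patch_in_box(x, y, box) for box in object_boxes)
        labels.append(1 if pos else -1)
    return np.array(labels)

def train_patch_classifier(region_descriptors, labels):
    """region_descriptors: one stacked multilevel descriptor ('mini-pyramid') per patch."""
    return LinearSVC().fit(np.stack(region_descriptors), labels)
```

At test time, the same classifier is applied to every region's mini-pyramid to label it, which localizes object parts within the image.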
Classifying Local Image Regions – examples
Localizing object parts in images
[Figure: example test images with each local region labelled as mbike, cycle or car]
Classifying Local Image Regions – examples (2)
Localizing human body parts in images
[Figure: example test images with local regions labelled as person, localizing human body parts]
Scalability
Algorithm used in the above experiments
• Learns codings one level at a time
• Fixes the codebook for a lower level before advancing to the higher level
• Requires all training features in memory during learning
Alternative algorithm
• Learns all levels in parallel
• Processes training images sequentially to avoid large memory usage
  – uses online EM to learn Gaussian mixture codebooks
• A single pass over the training data gives very similar results to the first algorithm
• Usable on arbitrarily large datasets
Conclusions
• Hyperfeatures encode spatial information in images, but without a rigid representation
  – loose structure coded using co-occurrence statistics
• Multiple levels of recursive coding improve classification performance in many cases
  – object classes with distinctive geometric structure benefit most
Possible extensions
• Investigation of more discriminative training methods
  – more general latent aspect models that use local context
• Integration with processes like CRF-based segmentation for improved object localization
Thank you
Local region classification performance
[Figure: Detection Error Tradeoff curves (miss rate vs. false positive rate) for local region classification: (a) Motorbikes, (b) Cars, (c) Bicycles, (d) People; curves for hyperfeatures up to levels 0, 1, 2, 3]
• Adding higher levels improves discrimination of local regions.
Hyperfeatures – Combining Multiple Classifiers
Classifying each region as 1 of 4 classes (no explicit background model)
[Figure: example test images with each local region classified as mbike, cycle, car or person]
Hyperfeature coding algorithm – offline version
1. ∀(i, x, y, s): F^(0)_{ixys} ← base feature at point (x, y), scale s in image i.
2. For l = 0, …, N:
   • If learning, cluster {F^(l)_{ixys} | ∀(i, x, y, s)} to obtain a codebook of d^(l) centres in this feature space.
   • ∀i:
     – If global descriptors need to be output, code F^(l)_{i...} as a d^(l)-dimensional histogram H^(l)_i by globally accumulating votes for the d^(l) centres from all (x, y, s).
     – If l < N, ∀(x, y, s) calculate F^(l+1)_{ixys} as a d^(l)-dimensional local histogram by accumulating votes from F^(l)_{ix'y's'} over the neighbourhood N^(l+1)(x, y, s).
3. Return {H^(l)_i | ∀i, l}.
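A compact Python rendering of the offline algorithm above, as a sketch: it assumes base features arrive as one descriptor array per image with their (x, y, s) coordinates, uses k-means for the codebook (the VQ variant; GM/LDA coding would replace the vote step), and takes the neighbourhood N^(l+1)(x, y, s) to be a simple same-scale box of fixed radius, which is an illustrative choice rather than the exact neighbourhood used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def code_hyperfeatures(images, n_levels, centres_per_level, radius=2.0, seed=0):
    """Offline hyperfeature coding (VQ variant).

    images: list of (F, pos) pairs, where F is an (n_patches, dim) array of level-0
            features and pos is an (n_patches, 3) array of (x, y, s) coordinates.
    Returns one list of per-level global histograms per image.
    """
    feats = [F.copy() for F, _ in images]
    positions = [pos for _, pos in images]
    global_hists = [[] for _ in images]

    for level in range(n_levels + 1):
        d = centres_per_level[level]
        # Learn the level-l codebook from the features of all images.
        codebook = KMeans(n_clusters=d, random_state=seed, n_init=4)
        codebook.fit(np.vstack(feats))

        next_feats = []
        for i, (F, pos) in enumerate(zip(feats, positions)):
            votes = codebook.predict(F)                      # hard VQ assignment
            # Global histogram H^(l)_i over the d centres.
            H = np.bincount(votes, minlength=d).astype(float)
            global_hists[i].append(H / max(H.sum(), 1.0))

            if level < n_levels:
                # Level-(l+1) feature at each patch: local histogram of the votes
                # of its spatial neighbours (same-scale box of the given radius).
                F_next = np.zeros((len(F), d))
                for p, (x, y, s) in enumerate(pos):
                    near = (np.abs(pos[:, 0] - x) <= radius * s) & \
                           (np.abs(pos[:, 1] - y) <= radius * s)
                    local = np.bincount(votes[near], minlength=d).astype(float)
                    F_next[p] = local / max(local.sum(), 1.0)
                next_feats.append(F_next)
        feats = next_feats
    return global_hists
```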
Hyperfeature coding algorithm – online version
1. Initialization: run algorithm 1 using a very small subset (e.g. 10) of the training images.
2. Update codebook centres at all levels, ∀i:
   • For l = 0, …, N:
     – Perform one iteration of k-means to update the d^(l) centres using {F^(l)_{ixys} | ∀(x, y, s)}.
     – If l < N, ∀(x, y, s) calculate F^(l+1)_{ixys} as in algorithm 1.
3. If the centres have not converged, repeat step 2. Else ∀i:
   • For l = 0, …, N:
     – If l < N, ∀(x, y, s) calculate F^(l+1)_{ixys} as a d^(l)-dimensional local histogram by accumulating votes from F^(l)_{ix'y's'} over the neighbourhood N^(l+1)(x, y, s).
     – If global descriptors need to be output, code F^(l)_{i...} as a d^(l)-dimensional histogram H^(l)_i by globally accumulating votes for the d^(l) centres from all (x, y, s).
4. Return {H^(l)_i | ∀i, l}.
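A sketch of the streaming codebook update in step 2, processing one image at a time so that only that image's features need to be in memory. The per-centre decaying learning rate is an illustrative bookkeeping choice, not taken from the talk; the online EM for Gaussian mixture codebooks mentioned earlier would follow the same pattern with soft responsibilities instead of hard assignments.

```python
import numpy as np

class OnlineCodebook:
    """Streaming (mini-batch) k-means update of d centres, one image at a time."""

    def __init__(self, init_centres):
        self.centres = np.asarray(init_centres, dtype=float)   # (d, dim)
        self.counts = np.zeros(len(self.centres))               # votes seen per centre

    def assign(self, F):
        """Nearest-centre index for each feature in F (n_patches, dim)."""
        d2 = ((F[:, None, :] - self.centres[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    def update(self, F):
        """One k-means-style update using the features of a single image."""
        labels = self.assign(F)
        for k in np.unique(labels):
            members = F[labels == k]
            self.counts[k] += len(members)
            lr = len(members) / self.counts[k]                   # decaying step size
            self.centres[k] += lr * (members.mean(axis=0) - self.centres[k])
        return labels
```

A single training pass would loop over images, at each level calling update on that image's current features and then recomputing the next level's local histograms from the updated centres, as in step 2 above.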