Hyperfeatures – Multilevel Local Coding for Visual Recognition

Ankur Agarwal & Bill Triggs GRAVIR-INRIA-CNRS, Grenoble

European Conference on Computer Vision, May 2006

Introduction
• Target problem: effectively coding features for recognition in images
• Approach: encode feature co-occurrence statistics recursively at multiple levels of abstraction
  – use local image patches described by robust descriptors
  – build a hierarchical model starting from texton-like representations
Talk outline
• Textons and co-occurrence
• Hyperfeatures
• Experimental results on classification

Texton / Bag-of-Features Approach
• Represent the image as a loose collection of individual patches
  – sampled densely at multiple scales, or sparsely at salient points or edges
• Encode each patch by a vector of local appearance descriptors
  – invariant to lighting and geometric perturbations
• Characterize an image or region by the distribution of its patch descriptors

[Figure: an image summarized as a multiset of quantized patch appearances — { ×24, ×2, ×6, ×7, ×2, · · · }]
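The bag-of-features step above can be sketched in a few lines (a minimal illustration, assuming a codebook of centres has already been learned by clustering; `vq_histogram` and the toy sizes are hypothetical, not from the paper):

```python
import numpy as np

def vq_histogram(descriptors, centres):
    """Assign each patch descriptor to its nearest codebook centre and
    accumulate a normalised histogram of centre votes: the bag of features."""
    # squared Euclidean distance from every descriptor to every centre
    d2 = ((descriptors[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(centres)).astype(float)
    return hist / hist.sum()

# toy example: 100 random 128-D "SIFT-like" descriptors, an 8-centre codebook
rng = np.random.default_rng(0)
descs = rng.normal(size=(100, 128))
centres = rng.normal(size=(8, 128))
h = vq_histogram(descs, centres)
```

The resulting histogram discards all patch positions, which is exactly the weakness the rest of the talk addresses.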

Capturing Spatial Coherence
Textons are a very effective model for texture and object images, but they encode spatial coherence / object geometry only weakly. Ways to improve this:
• Explicitly model inter-feature geometry (e.g. constellation models)
• Random fields over local labels — MRF, CRF
• Encode local (pairwise / neighbourhood) co-occurrence of features

Co-occurrence at Multiple Levels
We would like to
• Encode an object/scene as a hierarchy of visual parts
• Capture spatial structure loosely, without precise geometry
• Provide a generic framework that admits different feature coding schemes

• A hierarchical, bottom-up, memory-based model
• Increasing abstraction and less spatial precision at higher levels

[Figure: the hyperfeature hierarchy — image features feed level-1 features, and co-occurrence coding maps each level to the next (levels 2, 3, ...), with abstraction increasing up the pyramid]

Cortical suggestiveness

Existing Hierarchical Models
• Take inspiration from biological pattern recognition
• Generally alternate stages of simple and complex cells
• e.g. Neocognitron, Convolutional Neural Networks, HMAX, ...
  – the Neocognitron activates a higher-level cell if at least one input cell is active
  – CNNs convolve with a bank of learned filter templates
  – HMAX performs a max operation to retain only the dominant signal
• They are all discriminative models
Using co-occurrence, we can code descriptive statistics and allow for more sophisticated nonlinear codings

Hyperfeatures
A general principle for multi-level coding: recursively encode neighbourhood co-occurrence statistics at multiple levels of abstraction
• Level 0: bag of multiscale features
  – each point is a base feature vector, e.g. a SIFT descriptor
• Level 1: locally collect neighbourhood statistics of these feature vectors
  – each point becomes a higher-level (feature) vector
• Repeat recursively for higher levels to obtain hyperfeatures
  – cf. the layers of hypercolumns in the cortex
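One step of this recursion can be sketched as follows (a simplified sketch using hard VQ votes on a dense grid; `code_level` and the toy sizes are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def code_level(features, centres, grid_shape, nbhd=3):
    """One hyperfeature level: VQ-code each grid point's feature vector,
    then replace it by the histogram of codes in its local neighbourhood.
    The resulting local histograms are the next level's feature vectors."""
    h, w = grid_shape
    d2 = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1).reshape(h, w)
    k, r = len(centres), nbhd // 2
    out = np.zeros((h, w, k))
    for y in range(h):
        for x in range(w):
            win = labels[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            out[y, x] = np.bincount(win.ravel(), minlength=k)
    return out.reshape(h * w, k)

# toy 2x2 grid of 1-D descriptors and a 2-centre codebook
F1 = code_level(np.array([[0.], [0.], [10.], [10.]]),
                np.array([[0.], [10.]]), grid_shape=(2, 2))
```

Iterating `code_level` with a fresh codebook per level yields the hyperfeature hierarchy; note that each output dimension equals the codebook size, not the base descriptor size.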

Hyperfeatures for Image Classification
[Figure: coding pipeline — at each level l = 0, ..., N, vector quantization produces a global histogram (at level 0, the bag of features) together with local histograms that form the next level's features]
• The global histograms from each level may be fed into a classifier

Feature codings at individual levels
Base image patch descriptors
• SIFT-like gradient orientation histograms
  – evaluated on a regular (multiscale) grid, without rotation normalization
Distributional coding methods

• Vector Quantization (VQ)
• Gaussian Mixture (GM) — trained with EM, diagonal covariance model
• Latent Dirichlet Allocation (LDA) — document → topic → word
[Figure: VQ centers vs. GM centers visualized in descriptor space]
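GM coding replaces hard VQ votes by soft posterior responsibilities; a minimal sketch under the diagonal-covariance assumption (`gm_posteriors` is an illustrative helper, not the authors' code):

```python
import numpy as np

def gm_posteriors(x, means, variances, weights):
    """Soft-assignment coding: posterior responsibility of each diagonal-
    covariance Gaussian mixture component for descriptor x. These soft
    votes replace the hard 0/1 votes of vector quantization."""
    # log N(x | mu_k, diag(var_k)) up to an additive constant shared by all k
    log_p = -0.5 * (((x - means) ** 2 / variances) + np.log(variances)).sum(axis=1)
    log_p += np.log(weights)
    log_p -= log_p.max()  # subtract the max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

p = gm_posteriors(np.array([0.1, 0.0]),
                  means=np.array([[0., 0.], [5., 5.]]),
                  variances=np.ones((2, 2)),
                  weights=np.array([0.5, 0.5]))
```

Working in log space before exponentiating avoids underflow when a descriptor is far from every component, which is common in high dimensions.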

Effect of different coding methods
Image classification using a linear SVM on hyperfeatures
[Figure: AUC(%) by coding method (VQ, VQ+LDA, GM, GM+LDA) for centres per level 30/30/30/30, 50/50/50/50, 100/50/50/50, 100/100/100/100, 200/100/100/100]
• Performance improves with increasing codebook size
• Gaussian mixtures consistently outperform vector quantization
• Performing LDA further improves the results

Codebook sizes and hyperfeatures
Number of topics in LDA
[Figure, left: AUC(%) vs. number of LDA topics (no LDA, then 0.2K up to K) for vocabulary sizes K = 50, 100, 200, 400, 600 words; VQ test and train curves]
• More topics are always useful
• Larger vocabularies are better
How to distribute centres
[Figure, right: AUC(%) for different distributions of centres across levels — 600/0/0/0, 300/300/0/0, 400/200/0/0, 200/200/200/0, 150/150/150/150]
• Distributing centres across levels is beneficial
• Too many levels are harmful

Classification of object categories
• Linear SVM classifier applied to the higher-level features, GM coding
• Structured objects like motorbikes and cars benefit most
  – performance saturates after 2–3 extra levels
[Figure: Detection-Error-Tradeoff curves (miss rate vs. false positive rate, log–log) on the PASCAL VOC2005 dataset; panels (a) Motorbikes, (b) Cars, (c) Bicycles, (d) People; curves compare hyperfeatures up to levels 0, 1, 2 and 3]

Classification of textures
• Perfect classification on some textures using a simple bag of features
• Including 2 or 3 levels of hyperfeatures is beneficial for the remaining classes
[Figure: Detection-Error-Tradeoff curves on the KTH-TIPS texture dataset; panels (a) Aluminium foil, (b) Cracker, (c) Orange peel, (d) Sponge; curves compare hyperfeatures up to levels 0, 1, 2 and 3]

Experiments on Line Drawings
Picture naming project in language research: classifying drawings of people vs. objects/scenes
• The extent of local spatial aggregation is important
• More levels are needed for smaller neighbourhoods
[Figure: AUC(%) vs. number of levels (0–7) for neighbourhood sizes 3x3x3, 5x5x3, 7x7x3 and 9x9x3]

Classifying Local Image Regions
Despite spatial aggregation, high-level hyperfeatures remain local
• Construct hyperfeatures individually for each local region of the image
  – build a “mini-pyramid” for each region
• Train a patch classifier using training bounding boxes
  – patches on the object are treated as positive
  – patches on the background and on other classes as negative
• Label each test image region to localize object parts
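The bounding-box labelling step might look like this (a hypothetical sketch; `label_patches` and the (x0, y0, x1, y1) box convention are assumptions for illustration):

```python
def label_patches(patch_centres, bbox):
    """Label each patch positive (+1) if its centre falls inside the
    training bounding box, else negative (-1): background or other class."""
    x0, y0, x1, y1 = bbox
    return [1 if (x0 <= x <= x1 and y0 <= y <= y1) else -1
            for (x, y) in patch_centres]

labels = label_patches([(5, 5), (20, 20)], bbox=(0, 0, 10, 10))
```

The resulting per-patch labels, paired with each patch's mini-pyramid of hyperfeatures, form the training set for the region classifier.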

Classifying Local Image Regions – examples
Localizing object parts in images
[Figure: test images with each local region labelled by the patch classifier — mbike, cycle and car labels cluster on the corresponding objects]

Classifying Local Image Regions – examples (2)
Localizing human body parts in images
[Figure: test images with local regions labelled person, clustering on the people in each scene]

Scalability
Algorithm used in the above experiments
• Learns the codings one level at a time
• Fixes the codebook for a lower level before advancing to the next
• Requires all training features in memory during learning
Alternative algorithm
• Learns all levels in parallel
• Processes training images sequentially to avoid large memory usage
  – uses online EM to learn the Gaussian mixture codebooks
• A single pass over the training data gives results very similar to the first algorithm
• Usable on arbitrarily large datasets
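The sequential update at the heart of the alternative algorithm can be sketched with online k-means (a simplification of the online EM the slide mentions; names and shapes are illustrative):

```python
import numpy as np

def online_kmeans_update(centres, counts, x):
    """One sequential k-means step: move the centre nearest to descriptor x
    towards it, with a step size that shrinks as the centre accumulates
    points, so each centre tracks the running mean of its assignments."""
    j = ((centres - x) ** 2).sum(axis=1).argmin()
    counts[j] += 1
    centres[j] += (x - centres[j]) / counts[j]
    return j

# feeding descriptors one at a time: no need to hold the dataset in memory
centres, counts = np.array([[0.0]]), np.array([0])
for x in ([2.0], [4.0]):
    online_kmeans_update(centres, counts, np.array(x))
```

Because only the centres and counts persist between images, memory use is independent of the dataset size, which is what makes the single-pass variant usable on arbitrarily large datasets.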

Conclusions
• Hyperfeatures encode spatial information in images, but without a rigid representation
  – loose structure is coded using co-occurrence statistics
• Multiple levels of recursive coding improve classification performance in many cases
  – object classes with distinctive geometric structure benefit most
Possible extensions
• Investigation of more discriminative training methods
  – more general latent aspect models that use local context
• Integration with processes like CRF-based segmentation for improved object localization

Thank you

Local region classification performance
[Figure: DET curves for local region classification (miss rate vs. false positive rate, roughly 0.3–0.45); panels (a) Motorbikes, (b) Cars, (c) Bicycles, (d) People; curves compare hyperfeatures up to levels 0, 1, 2 and 3]
• Adding higher levels improves the discrimination of local regions

Hyperfeatures – Combining Multiple Classifiers
Classifying each region as one of 4 classes (no explicit background model)
[Figure: test images with each local region labelled mbike, cycle, car or person by the combined classifiers]

Hyperfeature coding algorithm – offline version

1. For all (i, x, y, s): F^(0)_ixys ← base feature at point (x, y), scale s, in image i.
2. For l = 0, ..., N:
   • If learning, cluster {F^(l)_ixys | for all (i, x, y, s)} to obtain a codebook of d^(l) centres in this feature space.
   • For each image i:
     – If global descriptors need to be output, code F^(l)_i... as a d^(l)-dimensional histogram H^(l)_i by globally accumulating votes for the d^(l) centres from all (x, y, s).
     – If l < N, for all (x, y, s) calculate F^(l+1)_ixys as a d^(l)-dimensional local histogram by accumulating votes from F^(l)_ix'y's' over the neighbourhood N^(l+1)(x, y, s).
3. Return {H^(l)_i | for all i, l}.
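Under the assumptions of hard VQ votes and a square neighbourhood, the per-image loop of the offline algorithm might be implemented roughly as follows (`hyperfeature_histograms` is a hypothetical helper; the codebooks are assumed to have been clustered already):

```python
import numpy as np

def hyperfeature_histograms(F0, codebooks, grid, nbhd=3):
    """Offline hyperfeature coding for one image. F0 holds the (h*w, d)
    grid of level-0 descriptors; codebooks[l] is the level-l centre
    matrix. Returns one normalised global histogram per level."""
    h, w = grid
    F, hists = F0, []
    for C in codebooks:
        # hard-assign each point to its nearest centre at this level
        d2 = ((F[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        g = np.bincount(lab, minlength=len(C)).astype(float)
        hists.append(g / g.sum())  # global histogram H_i^(l)
        # next level's features: local histograms of codes over a window
        lab2, r, k = lab.reshape(h, w), nbhd // 2, len(C)
        nxt = np.zeros((h, w, k))
        for y in range(h):
            for x in range(w):
                win = lab2[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
                nxt[y, x] = np.bincount(win.ravel(), minlength=k)
        F = nxt.reshape(h * w, k)
    return hists

hists = hyperfeature_histograms(
    np.array([[0.], [0.], [10.], [10.]]),
    codebooks=[np.array([[0.], [10.]]), np.array([[2., 2.], [0., 4.]])],
    grid=(2, 2))
```

Note how the feature dimensionality at level l+1 is the codebook size d^(l), matching step 2 of the algorithm.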

Hyperfeature coding algorithm – online version

1. Initialization: run algorithm 1 on a very small subset (e.g. 10) of the training images.
2. Update the codebook centres at all levels. For each image i:
   • For l = 0, ..., N:
     – Perform one iteration of k-means to update the d^(l) centres using {F^(l)_ixys | for all (x, y, s)}.
     – If l < N, for all (x, y, s) calculate F^(l+1)_ixys as in algorithm 1.
3. If the centres have not converged, repeat step 2. Else, for each image i:
   • For l = 0, ..., N:
     – If l < N, for all (x, y, s) calculate F^(l+1)_ixys as a d^(l)-dimensional local histogram by accumulating votes from F^(l)_ix'y's' over the neighbourhood N^(l+1)(x, y, s).
     – If global descriptors need to be output, code F^(l)_i... as a d^(l)-dimensional histogram H^(l)_i by globally accumulating votes for the d^(l) centres from all (x, y, s).
4. Return {H^(l)_i | for all i, l}.