Modeling Feature Sharing between Object Detection and Top-down Attention

Dirk Walther (1,*), Thomas Serre (2), Tomaso Poggio (2), Christof Koch (1)

VSS 2005, poster #1046

1 Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125; *[email protected]
2 Dept. of Brain and Cognitive Sciences and McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139
Introduction

When performing visual tasks such as searching for natural objects in a cluttered background, the attention system is biased from the top down toward certain attributes of the targets. How is the task mapped to the particular features? We propose that feedback connections in an object recognition system can serve this purpose. We demonstrate a computational implementation of such a system that, once trained for detecting faces, is capable of visual search for faces.
Model Architecture

The model is based on the hierarchical feedforward model of object recognition in cortex by Riesenhuber and Poggio [1] and its extension for feature learning at S2 by Serre and Poggio [2].

[Architecture figure: a hierarchy from V1/V2 (S1, C1) through S2 and C2 to object-tuned units (VTUs, IT) and task-specific circuits for categorization and identification (PFC), alternating MAX operations with Gaussian-like, experience-dependent tuning; S2 holds a robust dictionary of shape components.]

A modified trace rule [3] selects a stable shape dictionary from snapshots of C1 activity (see also [4]).
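As an illustration of the alternating operations in this hierarchy, the sketch below (Python/NumPy; not the poster's code, and the function names, pooling size, and tuning width are our own choices) computes a C1-like MAX pooling stage over an S1 map and an S2-like Gaussian tuning stage against a learned prototype patch.

```
import numpy as np

def c1_max_pool(s1_map, pool=4):
    """C1-like stage: MAX over local neighborhoods (illustrative pooling size)."""
    h, w = s1_map.shape
    h, w = h - h % pool, w - w % pool
    blocks = s1_map[:h, :w].reshape(h // pool, pool, w // pool, pool)
    return blocks.max(axis=(1, 3))

def s2_gaussian_tuning(c1_map, prototype, sigma=1.0):
    """S2-like stage: Gaussian (bell-shaped) tuning of each C1 patch to a
    learned prototype; the response peaks where the patch matches the prototype."""
    ph, pw = prototype.shape
    h, w = c1_map.shape
    out = np.zeros((h - ph + 1, w - pw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = c1_map[i:i + ph, j:j + pw]
            out[i, j] = np.exp(-np.sum((patch - prototype) ** 2) / (2 * sigma ** 2))
    return out

# toy example: random "S1" responses and one 3x3 prototype
rng = np.random.default_rng(0)
s1 = rng.random((32, 32))
c1 = c1_max_pool(s1)
s2 = s2_gaussian_tuning(c1, prototype=rng.random((3, 3)))
print(c1.shape, s2.shape)
```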
Feature selection for top-down attention

For top-down attention, feedback connections from the abstract object representation down to the S2 level select a few S2 units. Their activity in response to a stimulus is used to bias spatial selection.
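One way to picture this selection, sketched below under our own assumptions (the weight-based ranking and all names are illustrative, not taken from the poster), is to rank S2 units by their connection strength to the face-tuned unit and sum the activity maps of the top few into a spatial bias map.

```
import numpy as np

def select_s2_units(face_weights, k=5):
    """Pick the k S2 units with the strongest (illustrative) connection weights
    to the face-tuned unit; these are the features selected by feedback."""
    return np.argsort(face_weights)[::-1][:k]

def topdown_bias_map(s2_maps, selected):
    """Combine the activity maps of the selected S2 units into a single map
    that biases spatial selection toward likely face locations."""
    bias = s2_maps[selected].sum(axis=0)
    return bias / (bias.max() + 1e-12)   # normalize for comparison across images

# toy example: 100 S2 feature maps of size 16x16 and random face weights
rng = np.random.default_rng(1)
s2_maps = rng.random((100, 16, 16))
face_weights = rng.random(100)
selected = select_s2_units(face_weights)
attention_map = topdown_bias_map(s2_maps, selected)
print(selected, attention_map.shape)
```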
Data Sets

All training and test images are hand-labeled photographs from the internet. Two sets of 100 S2 features each were learned from 200 training images: set A from the entire training images, set B only from the face regions.

The recognition model was trained on the 200 training face images and 200 non-face training images. The ROC areas for independent test sets were 0.989 for set A and 0.994 for set B.

Testing of top-down attention was done on a set of 179 images that contained between two and 20 faces, with a total of 593 faces.

To model the skin hue distribution, the skin hue values of 3947 faces in 1153 color photographs were fitted with a 2D Gaussian in the CIE diagram.
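A minimal sketch of the skin-hue model, assuming hue is expressed as two chromaticity coordinates per pixel (r/I and g/I, following the axis labels of the poster's hue plot) and that the fit is simply the sample mean and covariance of the labeled skin pixels; all names are ours.

```
import numpy as np

def fit_skin_hue(skin_rgb):
    """Fit a 2-d Gaussian to chromaticity coordinates (r/I, g/I) of skin pixels."""
    intensity = skin_rgb.sum(axis=1, keepdims=True) + 1e-12
    chroma = skin_rgb[:, :2] / intensity          # (r/I, g/I) per pixel
    return chroma.mean(axis=0), np.cov(chroma, rowvar=False)

def skin_hue_map(image_rgb, mu, cov):
    """Evaluate the fitted Gaussian at every pixel; high values = skin-like hue."""
    h, w, _ = image_rgb.shape
    px = image_rgb.reshape(-1, 3).astype(float)
    intensity = px.sum(axis=1, keepdims=True) + 1e-12
    diff = px[:, :2] / intensity - mu
    mahal = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.exp(-0.5 * mahal).reshape(h, w)

# toy example with random data standing in for labeled skin pixels and a photograph
rng = np.random.default_rng(2)
mu, cov = fit_skin_hue(rng.random((1000, 3)))
hue_map = skin_hue_map(rng.random((64, 64, 3)), mu, cov)
print(hue_map.shape)
```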
Top-down attention is compared to:
(1) the saliency-based model for bottom-up attention by Itti and Koch [5];
(2) a top-down bias based on skin hue statistics.
Fixation Analysis

How many fixations (visits to the pixels of the activation maps in order of descending activity) does it take on average to find a face? For the nth face in the image we measure the number of non-face fixations since finding the (n-1)th face in the image.

[Pie charts: fraction of faces found at the 1st, 2nd, 3rd, or later fixation for bottom-up attention, skin hue, top-down set A, and top-down set B.]
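A sketch of this fixation count, under our own assumptions about the data layout (a 2-d activation map plus an integer mask giving each face a label): pixels are visited in descending order of activity, and we record how many non-face visits occur between successive newly found faces.

```
import numpy as np

def fixations_per_face(activation_map, face_labels):
    """Visit pixels in order of descending activity; for the n-th face found,
    return the number of non-face fixations since the (n-1)-th face was found."""
    order = np.argsort(activation_map.ravel())[::-1]   # highest activity first
    labels = face_labels.ravel()[order]                # face id (0 = background)
    found, counts, misses = set(), [], 0
    for lab in labels:
        if lab == 0:
            misses += 1                                # a non-face fixation
        elif lab not in found:
            found.add(lab)
            counts.append(misses)                      # misses since previous face
            misses = 0
    return counts

# toy example: 8x8 activity map with two labeled faces
rng = np.random.default_rng(3)
act = rng.random((8, 8))
faces = np.zeros((8, 8), dtype=int)
faces[1:3, 1:3] = 1
faces[5:7, 5:7] = 2
print(fixations_per_face(act, faces))
```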
Region of Interest Analysis

The activities of units in each map are separated into face and non-face pixels (based on ground truth). Separate activity value distributions are obtained for the two regions. The area under the ROC curve for these activity value distributions, averaged over all test images, provides a performance measure for the top-down features, and for the bottom-up and skin hue maps, respectively.

[Figure: ROC analysis of pixels inside and outside of the regions of interest (here for the skin hue map); fraction of face-region vs. non-face pixels, plotted as true positives against false positives, ROC area = 0.87.]
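A sketch of the per-image ROI ROC measure, assuming each attention map comes with a binary face mask: threshold the pixel activities, trace out true and false positive rates, and integrate. scikit-learn's roc_auc_score computes the same quantity in one call.

```
import numpy as np

def roi_roc_area(activation_map, face_mask):
    """Area under the ROC curve separating face-region from non-face pixel activities."""
    act = activation_map.ravel()
    pos = face_mask.ravel().astype(bool)
    thresholds = np.sort(np.unique(act))[::-1]
    tpr = [(act[pos] >= t).mean() for t in thresholds]     # true positive rate
    fpr = [(act[~pos] >= t).mean() for t in thresholds]    # false positive rate
    tpr, fpr = np.r_[0.0, tpr, 1.0], np.r_[0.0, fpr, 1.0]
    return np.trapz(tpr, fpr)

# toy example: face-region pixels drawn with slightly higher activity
rng = np.random.default_rng(4)
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True
act = rng.normal(0, 1, (32, 32)) + 0.8 * mask
print(roi_roc_area(act, mask))   # per-image score; the poster averages over all test images
```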
Summary of Results

According to both performance measures, top-down feature sets A and B perform better than bottom-up attention. Set A does not reach the performance of skin hue, while set B performs comparably, even outperforming skin hue in the number of faces that were attended in the first fixation. This is remarkable since top-down attention uses only grayscale versions of the images.

The mean ROI ROC value correlates with the percent of faces that are found at the first fixation for the respective features. The best (highest percent first fixations) features are marked for both feature sets.

[Scatter plot: mean ROI ROC area vs. percent of faces found at the first fixation, for set A, best of set A, set B, best of set B, bottom-up, and skin hue.]
Conclusions

Features of intermediate complexity that are learned for the purpose of object recognition can be used effectively to guide top-down attention. Feedback connections in the visual hierarchy can provide a means of mapping an abstract task to a particular set of features that can be useful in solving the task.

In our computational implementation we have shown this behavior for faces. Both top-down feature sets performed better than mere bottom-up attention. Feature set A, which was derived from the entire training images, did not reach the performance of a skin hue detector. Set B, which was obtained from only the face regions of the training images, performed better than set A, reaching, and according to one performance measure even surpassing, the skin hue detector. The difference in performance between sets A and B suggests that some guidance of the selection of training regions may be beneficial. In future experiments, we will assess the benefit of using bottom-up attention to guide feature learning.
Future Work

Extend the implementation to several object categories; implement a closed-loop system that verifies attended locations using the recognition sub-system; make the system fully scale invariant.
References
1. Riesenhuber, M. and T. Poggio (1999), Nature Neuroscience, 2(11): 1019-1025.
2. Serre, T. and T. Poggio (2005), VSS poster #744.
3. Foldiak, P. (1991), Neural Computation, 3: 194-200.
4. Sigala, R., T. Serre, T. Poggio, and M. Giese (2005), VSS poster #26.
5. Itti, L., C. Koch, and E. Niebur (1998), IEEE PAMI, 20(11): 1254-1259.
Acknowledgements
Thanks to Xinpeng Huang for labeling the training and test images. The model figures are modified from Serre and Poggio, Cosyne 2005. This research is funded by grants from NSF, NIH, and NIMH.