Supplementary Material: The Role of Context for Object Detection and Semantic Segmentation in the Wild

Roozbeh Mottaghi (Stanford University), Xianjie Chen (UCLA), Xiaobai Liu (UCLA), Nam-Gyu Cho (Korea University), Seong-Whan Lee (Korea University), Sanja Fidler (University of Toronto), Raquel Urtasun (University of Toronto), Alan Yuille (UCLA)
In this paper [6], we analyze the effect of context in detection and segmentation approaches. Towards this goal, we label every pixel of the training and validation sets of the PASCAL VOC 2010 detection challenge with a semantic class. We selected PASCAL as our testbed since it has served as the benchmark for detection and segmentation in the community for years (over 600 citations and tens of teams competing in the challenges each year). Our analysis shows that our new dataset is much more challenging than existing ones (e.g., Barcelona [7], SUN [8], SIFT Flow [5]): it has higher class entropy, and fewer pixels are labeled as "stuff"; instead, they belong to a wide variety of object categories beyond the 20 PASCAL object classes. We analyze the ability of state-of-the-art methods [7, 1] to perform semantic segmentation of the most frequent classes, and show that approaches based on nearest-neighbor (NN) retrieval are significantly outperformed by approaches based on bottom-up grouping, which highlights the variability of PASCAL images. We also study the performance of contextual models for object detection, and show that existing models have a hard time dealing with PASCAL imagery. To push forward the performance in this difficult scenario, we propose a novel deformable part-based model, which exploits both local context around each candidate detection and global context at the level of the scene. As contextual features we use class-specific segmentation features inspired by the success of segDPM [4]; a rough sketch of the scoring form is given below. We show that the model significantly helps in detecting objects at all scales and is particularly effective for tiny as well as extra-large objects.

The supplementary material includes the following items:
• Plots showing the statistics for the location and frequency of context classes with respect to different object sizes.
• Additional successful and failure cases for detection with contextual information, compared with DPM [3].
• Additional successful and failure cases for segmentation with contextual information, compared with O2P [1].

Note that in a parallel paper [2] we also provide detailed annotations and analysis for object parts in PASCAL.
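As a rough sketch of the form such a context-aware score can take (the notation below is illustrative only and is not the exact formulation of the main paper; it omits, for instance, the deformation of the context boxes), the score of a candidate box $B$ in image $x$ can be written as

$$
S(x, B) \;=\; S_{\mathrm{DPM}}(x, B) \;+\; \sum_{r \in \{\mathrm{top},\,\mathrm{bottom},\,\mathrm{left},\,\mathrm{right}\}} \mathbf{w}_r^{\top}\, \phi_{\mathrm{seg}}(x, B_r) \;+\; \mathbf{w}_g^{\top}\, \phi_{\mathrm{seg}}(x),
$$

where $S_{\mathrm{DPM}}(x, B)$ is the standard DPM score, $B_r$ is a context box placed on side $r$ of $B$, $\phi_{\mathrm{seg}}(x, R)$ collects class-specific segmentation features inside region $R$ (over the whole image for the last, scene-level term), and $\mathbf{w}_r$, $\mathbf{w}_g$ are learned weights.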
References

[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[2] X. Chen, R. Mottaghi, X. Liu, N.-G. Cho, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
[4] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In CVPR, 2013.
[5] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. In CVPR, 2009.
[6] R. Mottaghi, X. Chen, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[7] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.
[8] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
1. Location and Frequency Statistics of Context Classes

The frequency of contextual categories around an object varies with the size of the object. In Figures 1–5, we show the frequency of each context class with respect to different object size percentiles. The statistics are computed within four boxes around the object (the same four context parts used in the paper, but without deformation). They represent the normalized number of pixels for each class, where the normalization is done with respect to the total number of pixels that fall inside the boxes of a particular direction. There are some interesting trends. For instance, the amount of sky in the bottom region of airplanes increases as airplanes become smaller, which shows that small airplanes typically appear in the sky. Another example is that we see more sky pixels in the top region of buses than of cars, which shows that buses are taller than cars. It is evident that the surroundings of objects have a very biased distribution, which should be exploited particularly when recognizing "difficult" or ambiguous object regions. For example, for tiny objects where little of the structure is visible, or for highly occluded objects, context should play a key role in recognition.
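The following is a minimal sketch of this statistic, assuming the context labeling is given as a 2D integer array of class indices and the object as a pixel-coordinate bounding box; the relative size of the surrounding boxes (margin_ratio) and the clipping at image boundaries are illustrative assumptions rather than the exact setup used to produce the plots.

```python
import numpy as np

def context_class_frequencies(label_map, box, margin_ratio=0.5):
    """Normalized pixel counts of context classes in four boxes around an object.

    label_map    -- 2D integer array of per-pixel context class labels
    box          -- object bounding box (x1, y1, x2, y2) in pixel coordinates
    margin_ratio -- size of each surrounding box relative to the object
                    (illustrative assumption, not the paper's setting)

    Returns a dict mapping each direction to {class_id: frequency}, where the
    frequencies are normalized by the total number of pixels that fall inside
    that direction's box.
    """
    h, w = label_map.shape
    x1, y1, x2, y2 = box
    mx = max(1, int(round((x2 - x1) * margin_ratio)))
    my = max(1, int(round((y2 - y1) * margin_ratio)))

    # Four fixed (non-deformable) boxes around the object, clipped to the image.
    regions = {
        "top":    (x1, max(0, y1 - my), x2, y1),
        "bottom": (x1, y2, x2, min(h, y2 + my)),
        "left":   (max(0, x1 - mx), y1, x1, y2),
        "right":  (x2, y1, min(w, x2 + mx), y2),
    }

    stats = {}
    for name, (rx1, ry1, rx2, ry2) in regions.items():
        patch = label_map[ry1:ry2, rx1:rx2]
        if patch.size == 0:  # box falls entirely outside the image
            stats[name] = {}
            continue
        classes, counts = np.unique(patch, return_counts=True)
        stats[name] = {int(c): count / patch.size for c, count in zip(classes, counts)}
    return stats
```

In the plots, the left and right boxes are pooled into a single left/right statistic, and the per-object statistics are aggregated within each object-size percentile bin.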
[Figure 1 plots: frequency of the most correlated context classes around aeroplane, bicycle, bird, and boat instances, with separate panels for the top, bottom, and left/right context regions; x-axis: object size percentile, y-axis: normalized frequency.]
Figure 1. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.
[Figure 2 plots: frequency of the most correlated context classes around bottle, bus, car, and cat instances, with separate panels for the top, bottom, and left/right context regions; x-axis: object size percentile, y-axis: normalized frequency.]
Figure 2. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.
[Figure 3 plots: frequency of the most correlated context classes around chair, cow, dining table, and dog instances, with separate panels for the top, bottom, and left/right context regions; x-axis: object size percentile, y-axis: normalized frequency.]
Figure 3. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.
[Figure 4 plots: frequency of the most correlated context classes around horse, motorbike, person, and potted plant instances, with separate panels for the top, bottom, and left/right context regions; x-axis: object size percentile, y-axis: normalized frequency.]
Figure 4. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.
[Figure 5 plots: frequency of the most correlated context classes around sheep, sofa, train, and tvmonitor instances, with separate panels for the top, bottom, and left/right context regions; x-axis: object size percentile, y-axis: normalized frequency.]
Figure 5. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.
2. Additional Detection Results

In this section we show additional successful and failure cases for detection with our context model (Figures 6–8).
Figure 6. The first column shows the top detection of DPM [3]. The second column shows the ground-truth context labeling and the ground-truth object box. The third column shows the context prediction result. The last column shows the result of our context-aware DPM; inferred context boxes are shown in different colors. The original 20 PASCAL classes are not shown in the prediction and ground-truth images.
Figure 7. The first column shows the top detection of DPM. The second column shows the ground-truth context labeling and the ground-truth object box. The third column shows the context prediction result. The last column shows the result of our context-aware DPM; inferred context boxes are shown in different colors. The original 20 PASCAL classes are not shown in the prediction and ground-truth images.
Figure 8. Failure cases. The first column shows the top detection of DPM. The second column shows the ground-truth context labeling. The third column shows the context prediction result. The last column shows the result of our context-aware DPM. The original 20 PASCAL classes are not shown in the prediction and ground-truth images.
3. Additional Segmentation Results

In this section we show examples where the context feature helps or hurts O2P [1] segmentation (Figures 9–10).
[Figure 9 image panels; columns: Original Image, Ground truth, O2P, O2P + context.]
Figure 9. Examples for which the simple context feature provides an improvement over O2P.
[Figure 10 image panels; columns: Original Image, Ground truth, O2P, O2P + context.]
Figure 10. Failure cases. Examples where the context feature misleads O2P.