Supplementary Material: The Role of Context for Object Detection and Semantic Segmentation in the Wild

Roozbeh Mottaghi1  Xianjie Chen2  Xiaobai Liu2  Nam-Gyu Cho3  Seong-Whan Lee3  Sanja Fidler4  Raquel Urtasun4  Alan Yuille2

1 Stanford University   2 UCLA   3 Korea University   4 University of Toronto

In this paper [6], we analyze the effect of context in detection and segmentation approaches. Towards this goal, we label every pixel of the training and validation sets of the PASCAL VOC 2010 detection challenge with a semantic class. We selected PASCAL as our testbed because it has served as the benchmark for detection and segmentation in the community for years (over 600 citations and tens of teams competing in the challenges each year). Our analysis shows that our new dataset is much more challenging than existing ones (e.g., Barcelona [7], SUN [8], SIFT Flow [5]): it has higher class entropy, and fewer pixels are labeled as “stuff”; instead, pixels belong to a wide variety of object categories beyond the 20 PASCAL object classes. We analyze the ability of state-of-the-art methods [7, 1] to perform semantic segmentation of the most frequent classes, and show that approaches based on nearest-neighbor (NN) retrieval are significantly outperformed by approaches based on bottom-up grouping, which reflects the variability of PASCAL images. We also study the performance of contextual models for object detection and show that existing models have a hard time dealing with PASCAL imagery. To push performance forward in this difficult scenario, we propose a novel deformable part-based model that exploits both local context around each candidate detection and global context at the level of the scene. As contextual features we use class-specific segmentation features inspired by the success of segDPM [4]; a schematic sketch of this kind of contextual scoring is given below. We show that the model significantly helps in detecting objects at all scales and is particularly effective for tiny as well as extra-large objects.

The supplementary material includes the following items:
• Plots showing the statistics for the location and frequency of context classes with respect to different object sizes.
• Additional success and failure cases for detection with contextual information, compared with DPM [3].
• Additional success and failure cases for segmentation with contextual information, compared with O2P [1].

Note that in a parallel paper [2] we also provide detailed annotations and analysis of object parts in PASCAL.
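To make the kind of feature concrete, the minimal sketch below (not the authors' implementation) pools class probabilities from a semantic segmentation map inside four boxes around a candidate detection as local context, pools over the whole image as global context, and adds a learned linear term to the DPM score. The box geometry (a margin of half the object size), the mean-pooling, and the linear rescoring are illustrative assumptions; in the actual model the context parts can additionally deform, and the features are class-specific segmentation features in the spirit of segDPM [4].

import numpy as np

def context_features(seg_prob, det_box, margin=0.5):
    """Pool a per-pixel class-probability map around one candidate detection.

    seg_prob: (H, W, C) array of per-pixel class probabilities produced by any
              semantic segmentation method (hypothetical input format).
    det_box:  (x1, y1, x2, y2) candidate detection box in pixel coordinates.
    Returns a 5*C vector: mean class probabilities inside the top, bottom,
    left, and right context boxes (local context) and over the whole image
    (global, scene-level context).
    """
    H, W, C = seg_prob.shape
    x1, y1, x2, y2 = det_box
    mx = int(margin * (x2 - x1))  # width of the left/right context boxes
    my = int(margin * (y2 - y1))  # height of the top/bottom context boxes

    def pool(xa, ya, xb, yb):
        xa, ya, xb, yb = max(0, xa), max(0, ya), min(W, xb), min(H, yb)
        if xb <= xa or yb <= ya:          # box falls entirely outside the image
            return np.zeros(C)
        return seg_prob[ya:yb, xa:xb].reshape(-1, C).mean(axis=0)

    local = [
        pool(x1, y1 - my, x2, y1),   # top
        pool(x1, y2, x2, y2 + my),   # bottom
        pool(x1 - mx, y1, x1, y2),   # left
        pool(x2, y1, x2 + mx, y2),   # right
    ]
    global_ctx = seg_prob.reshape(-1, C).mean(axis=0)
    return np.concatenate(local + [global_ctx])

def rescore(dpm_score, ctx_feat, w, b):
    """Context-aware score: original DPM score plus a learned linear term."""
    return dpm_score + float(w @ ctx_feat) + b

In such a sketch the weight vector w would be learned jointly with the detector (e.g., in a latent-SVM framework), but any linear rescoring scheme could be plugged in.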

References
[1] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic segmentation with second-order pooling. In ECCV, 2012.
[2] X. Chen, R. Mottaghi, X. Liu, N.-G. Cho, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 2010.
[4] S. Fidler, R. Mottaghi, A. Yuille, and R. Urtasun. Bottom-up segmentation for top-down detection. In CVPR, 2013.
[5] C. Liu, J. Yuen, and A. Torralba. Nonparametric scene parsing via label transfer. In CVPR, 2009.
[6] R. Mottaghi, X. Chen, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In CVPR, 2014.
[7] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. In ECCV, 2010.
[8] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.


1. Location and Frequency Statistics of Context Classes

The frequency of contextual categories around objects varies with object size. In Figures 1–5, we show the frequency of each context class with respect to different object size percentiles. The statistics are computed within four boxes around the object (the same as the four context parts used in the paper, but without deformation) and represent the normalized number of pixels for each class, where the normalization is by the total number of pixels that fall in the boxes of a particular direction. There are some interesting trends. For instance, the amount of sky in the bottom region of airplanes increases as airplanes become smaller, which shows that small airplanes typically appear in the sky. Another example is that we see more sky pixels in the top region of buses than of cars, which shows that buses are taller than cars. It is evident that the surroundings of objects have a very biased distribution, which should be exploited particularly when recognizing “difficult” or ambiguous object regions. For example, for tiny objects where little of the structure is visible, or for highly occluded objects, context should play a key role in recognition.
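As a concrete illustration of how these statistics can be computed, the sketch below (our own reading of the procedure, with an assumed margin of half the object size for the context boxes, since the exact box geometry is not restated here) accumulates per-class pixel counts inside the top, bottom, and left/right boxes over a set of object instances and normalizes each direction by the total number of pixels that fall in its boxes. Running it once per object category and size-percentile bin yields one point of each curve in Figures 1–5.

import numpy as np

def context_class_frequencies(samples, num_classes, margin=0.5):
    """Normalized pixel frequency of context classes per direction.

    samples: iterable of (label_map, box) pairs for one object category and one
             size-percentile bin; label_map is an (H, W) integer map of
             context-class labels in [0, num_classes) and box = (x1, y1, x2, y2).
    Returns a dict mapping direction -> length-num_classes array, normalized by
    the total number of pixels falling in that direction's boxes.
    """
    counts = {d: np.zeros(num_classes) for d in ("top", "bottom", "left/right")}

    for label_map, (x1, y1, x2, y2) in samples:
        H, W = label_map.shape
        mx = int(margin * (x2 - x1))  # width of the left/right boxes
        my = int(margin * (y2 - y1))  # height of the top/bottom boxes

        def count(direction, xa, ya, xb, yb):
            xa, ya, xb, yb = max(0, xa), max(0, ya), min(W, xb), min(H, yb)
            if xb > xa and yb > ya:
                labels = label_map[ya:yb, xa:xb].ravel()
                counts[direction] += np.bincount(labels, minlength=num_classes)[:num_classes]

        count("top",        x1, y1 - my, x2, y1)       # box above the object
        count("bottom",     x1, y2,      x2, y2 + my)  # box below the object
        count("left/right", x1 - mx, y1, x1, y2)       # box to the left
        count("left/right", x2, y1,      x2 + mx, y2)  # box to the right

    return {d: c / max(c.sum(), 1.0) for d, c in counts.items()}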

[Figure 1 plots: frequency of context classes vs. object size percentile for the aeroplane, bicycle, bird, boat, and car panels, with separate panels for the top, bottom, and left/right context regions.]

Figure 1. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.

[Figure 2 plots: frequency of context classes vs. object size percentile for the bottle, bus, car, and cat panels, with separate panels for the top, bottom, and left/right context regions.]

Figure 2. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.

[Figure 3 plots: frequency of context classes vs. object size percentile for the chair, cow, diningtable, and dog panels, with separate panels for the top, bottom, and left/right context regions.]

Figure 3. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.

[Figure 4 plots: frequency of context classes vs. object size percentile for the horse, motorbike, person, and pottedplant panels, with separate panels for the top, bottom, and left/right context regions.]

Figure 4. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.

[Figure 5 plots: frequency of context classes vs. object size percentile for the sheep, sofa, train, and tvmonitor panels, with separate panels for the top, bottom, and left/right context regions.]

Figure 5. Pixel-wise frequency of context classes in top, bottom, and left/right contextual parts. The x-axis corresponds to size percentile and the y-axis represents the frequency of appearance. Only the most correlated classes are shown.

2. Additional Detection Results

In this section we show additional successful and failure cases for detection with our context model (Figures 6–8).


Figure 6. In the first column we show the top detection of DPM [3]. The second column shows the groundtruth context labeling and the groundtruth object box. The third column shows the predicted context labeling. The last column shows the result of our context-aware DPM; inferred context boxes are shown in different colors. The original 20 PASCAL classes are not shown in the prediction and groundtruth images.


Figure 7. In the first column we show the top detection of DPM. The second column shows the groundtruth context labeling and the groundtruth object box. The third column shows the predicted context labeling. The last column shows the result of our context-aware DPM; inferred context boxes are shown in different colors. The original 20 PASCAL classes are not shown in the prediction and groundtruth images.


Figure 8. Failure cases. In the first column we show the top detection of DPM. The second column shows the groundtruth context labeling. The third column shows the predicted context labeling. The last column shows the result of our context-aware DPM. The original 20 PASCAL classes are not shown in the prediction and groundtruth images.

3. Additional Segmentation Results

In this section we show examples where the context feature helps or hurts O2P [1] segmentation (Figures 9–10).


Figure 9. Examples for which the simple context feature provides an improvement over O2P.


Figure 10. Failure cases: examples where the context feature misleads O2P.