BMVC 2007 Slides - Carnegie Mellon University

Improving Spatial Support for Objects via Multiple Segmentations
Presented at the British Machine Vision Conference, September 12, 2007
Tomasz Malisiewicz and Alexei A. Efros
The Robotics Institute, Carnegie Mellon University

in Theory

Input Image → Boundaries → Segmentation → Recognition: Person, Car#1, Car#2, Road, ...


in Practice

Input Image → Edges → Segmentation → Recognition: ?

Sliding Windows

Person, Car
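The sliding-window approach shown on these slides amounts to scanning a fixed-size window over the image and scoring each position with a classifier. A minimal sketch, where `score_fn` is a placeholder for any per-window classifier (the function names here are illustrative, not from the talk):

```python
import numpy as np

def sliding_windows(image, win_h, win_w, stride):
    """Yield (row, col, patch) for every window position."""
    H, W = image.shape[:2]
    for r in range(0, H - win_h + 1, stride):
        for c in range(0, W - win_w + 1, stride):
            yield r, c, image[r:r + win_h, c:c + win_w]

def detect(image, score_fn, win_h, win_w, stride=4, thresh=0.5):
    """Score every window; return (row, col, h, w) boxes above thresh."""
    return [(r, c, win_h, win_w)
            for r, c, patch in sliding_windows(image, win_h, win_w, stride)
            if score_fn(patch) > thresh]
```

In practice the scan is repeated over an image pyramid to handle scale, and overlapping detections are merged with non-maximum suppression; both are omitted here for brevity.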

Successes of Sliding Windows

cars, faces, pedestrians

Schneiderman & Kanade '00; Viola & Jones '04; Dalal & Triggs '05; Ferrari et al. '07


Overview
• Does spatial support matter?
• How to get good spatial support?

1. Does Spatial Support Matter?

Classify Ground-Truth Segment vs. Classify Bounding Box

Does Spatial Support Matter?
MSRC dataset: 591 images of 23 object classes, with pixel-wise segmentation masks

Does Spatial Support Matter?
Features and Classifier: Boosted Decision Tree*
*Hoiem et al. '05

Does Spatial Support Matter?

Mean classification accuracy: Bounding Box 0.655 vs. Segment 0.765

[Per-class accuracy bar chart over the MSRC classes: cat, body, dog, road, chair, book, bird, sign, flower, bike, car, face, water, airplane, sky, sheep, cow, tree, grass, building.]

2. How to get good spatial support?

• Segmentation is a natural way to obtain spatial support
• Can an off-the-shelf segmentation algorithm provide good spatial support?

Normalized Cuts (Shi & Malik)
Mean Shift (Comaniciu & Meer)
Efficient Graph-Based (Felzenszwalb & Huttenlocher)

Spatial Support

Overlap with Ground Truth: Segment #1 .825, Segment #2 .892

Best-overlapping segment from a single segmentation: Mean Shift .659, FH .567, NCuts .841
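The overlap numbers above come from comparing a segment mask against the ground-truth object mask; assuming the standard intersection-over-union measure of spatial support, a minimal sketch on boolean arrays:

```python
import numpy as np

def overlap_score(seg, gt):
    """Intersection-over-union between a segment mask and a ground-truth mask."""
    seg, gt = np.asarray(seg, bool), np.asarray(gt, bool)
    inter = np.logical_and(seg, gt).sum()
    union = np.logical_or(seg, gt).sum()
    return inter / union if union else 0.0
```

A score of 1.0 means the segment covers the object exactly; scores fall toward 0 as the segment leaks outside the object or misses parts of it.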

Evaluation*

[Plot: Best Single Segmentation score (Mean BSS) for Mean Shift, FH, and NCuts vs. Segment Soup Size, on a log scale from 1 to 1,000,000 segments.]

*Unnikrishnan et al. 2005, Ge et al. 2006
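The Best Single Segmentation score plotted above can be sketched as: for each ground-truth object, take the best overlap achieved by any segment in the soup, then average over objects. A minimal numpy version (the function name and exact averaging are my reading of the slide, not a verbatim implementation):

```python
import numpy as np

def best_single_segment_score(soup, gt_masks):
    """Mean, over ground-truth object masks, of the best
    intersection-over-union achieved by any segment in the soup."""
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0
    return float(np.mean([max(iou(s, gt) for s in soup) for gt in gt_masks]))
```

Because the score takes a max over the soup, it can only improve as more segments are added, which is why the curves in the plot rise with soup size.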

The problem with segmentation

• No single segmentation provides adequate spatial support
• Use a Soup of Segments (Hoiem et al. 2005, Russell et al. 2006)
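A soup of segments simply pools the regions produced by several segmentations run at different parameter settings. As a toy, hypothetical stand-in for running Mean Shift / FH / NCuts at many settings, the sketch below varies a single threshold over a grayscale image (the names and the thresholding scheme are illustrative only):

```python
import numpy as np

def segments_from_labels(labels):
    """Split a label image into one boolean mask per region."""
    return [labels == k for k in np.unique(labels)]

def segment_soup(image, thresholds):
    """Toy multi-segmentation: segment the image at several threshold
    levels (a stand-in for real segmenters at many parameter settings)
    and pool all resulting region masks into one soup."""
    soup = []
    for t in thresholds:
        labels = (image > t).astype(int)  # crude two-region segmentation
        soup.extend(segments_from_labels(labels))
    return soup
```

With real segmenters the soup contains hundreds of overlapping candidate regions per image; the hope is that at least one of them gives good spatial support for each object.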

[Figure from Russell et al. 2006 ("Problem summary"): given a set of input images, compute multiple segmentations per image, then sift the good segments from the bad ones for each discovered object category to infer its spatial extent.]

Ground Truth vs. Mean Shift (33), FH (24), NCuts (33)

Others, like Normalized Cuts [20] to several objects or objectwords parts. All this means lematically, polysemy –not the same word describing ilarthat segments together using “bag of words” representheir text counterparts. While some visual do capual world is much richer and noisprising after all, the visual world is much richer and nois(superpixels). Others, like Normalized Cuts [20] attempt to Recently, Hoiem et al. [13] have proposed a surprisingly Recently, Hoiem etItits al. [13] have proposed surprisingly lematically, visual polysemy – the same word describing tition an image into its constituent objects – in the general an image into constituent objects –a in the general might more appropriately be called bars and corners and might more appropriately be called textual annotations). textual annotations). see Duygulu et al. [6] for a clever joint use of segments and see Duygulu et al.tition [6] for clever joint use of segments anda et ful enough to deal with the visual data. This is not too surful enough to deal with the visual data. This isanot too suretimes not powerthe statistical text methods alone are sometimes not powerlem. It several is naivesee to expect a segmentation algorithm to parlem. is naive to expect segmentation algorithm t , many endtext up methods encodingalone simple wingtips), many others end up encoding simple oriented Duygulu et al. [6] for a clever joint use of segments and see Duygulu al. [6] for a clever join theothers statistical areoriented sometimes not powerthe statistical text methods alone are sometimes not powertextual annotations). ful enough to deal with the visual data. This is not too surlem. It isoften naive to expect a segmentation algorithm to par-or wingtips), many others up encoding oriented see Duygulu et“visual al. [6] for abut clever use ofsuccess segments and findwheels, a alone global solution, but without success (however, the statistical textsimple methods are sometimes not powerdifferent objects or object parts. 
Allend this means that tation. However, image segmentation isbars not ajoint solved probucted, virtually noiseless world of is effective ier than the human-constructed, virtually noiseless world of is effective ture high-level object parts, (e.g. eyes, airplane way of noisutilizing image segmentation sufway of noisutilizing image segmentation suf-have case, you need to have solved the its recognition alcase, you need to have solved the its recognition alven letters”. Consequently, phonemes” even letters”. Consequently, find atoglobal solution, often without (however, different objects or object parts. All this means that textual annotations). textual annotations). prising –too after all,deal the visual world richer prising –too after all, the visual world richer his is“visual not surful enough deal with the visual data. This is“visual not surRecently, Hoiemwithout etproblem al. [13] have proposed aseveral surprisingly Recently, Hoiemwithout etproblem al. [13] an image into constituent objects –the inand the general an image into constituent objects – in propo the g corners and might more appropriately be called and corners and might more appropriately be called textual annotations). textual annotations). ful enough to with the visual much data. This isand nottition too surful enough to deal with the visual much data. This isand nottition too surprising – after all, visual world is much richer and noisRecently, Hoiem et al. [13] have proposed a surprisingly tition an image into its constituent objects – in the general bars and corners might more appropriately be called textual annotations). see Duygulu et al. [6] for a clever joint use of segments and ful enough to deal with the visual data. This is not too surthe statistical text methods alone are sometimes not powertext. fering from its shortcomings. For each image, they comfering from its shortcomings. For each image, they comready! In practice, someneed approaches, like Mean-shift [4], ready! 
In practice, someneed approaches, like Mean-shift [4], probl fhvisual synonyms – several words there is [6] a proportion of joint visual synonyms – several words lem. It isprising naivesee to expect athe segmentation algorithm parwingtips), many otherssufendtext up methods encodingalone simple ier prising than the human-constructed, virtually noiseless world of ier to than the human-constructed, virtually noiseless world of richer and – after all, visual world much richer and Duygulu et al. for“visual a is clever use ofnoissegments and the areoriented sometimes not powereffective waysolved ofHoiem utilizing image without effective waysolved ofHoiem utilizing image Hoiem et al. [13] have a surprisingly Recently, Hoiem et al. [13] have a surprisingly case,proposed you toRecently, have the recognition problem alcase,proposed you toRecently, have the recognition honemes” or noisletters”. Consequently, phonemes” or letters”. Consequently, –even after“visual all, theRecently, visual world is much richer and noisprising –even after“visual all, the visual world is much richer and noiset al. [13]segmentation have a statistical surprisingly et al. [13]segment have pr ier“visual than the human-constructed, virtually noiseless world of is effective way of noisutilizing image segmentation without sufcase, you need to have solved thehuman-constructed, recognition alphonemes” or proposed even letters”. Consequently, textual annotations). prising –too after all,deal the visual world richer ful enough to deal with visual data. This is“visual not surRecently, Hoiem etproblem al. 
[13] have proposed a surprisingly multiple bywithout varying parameters of multiple bywithout varying parameters ofFor perform onlysegmentations asegmentation low-level over-segmentation ofthe the image perform onlysegmentations asegmentation low-level over-segmentation of the image ject or object part, and,synonyms more probdescribing the same or object part, and,synonyms more proban image into its constituent objects –isobject inwords the general text. text. bars and corners and might more appropriately be called noiseless world ofhuman-constructed, ier than the virtually noiseless world ofhuman-constructed, fering from itsthe shortcomings. For each image, they comfering from itsthe shortcomings. each effective way ofpute utilizing image sufeffective way ofpute utilizing image suftextual annotations). words 1.2. Grouping visual ful enough to with the visual much data. This isand nottition too surready! In practice, some approaches, like Mean-shift [4], ready! In practice, some approaches, like Mean-shi proportion of – several words there a proportion of visual – several words ier than thevisual virtually noiseless world of ier than the virtually noiseless world of effective way of utilizing image segmentation without sufeffective way of utilizing image segm text. fering from its shortcomings. For each image, they comready! In practice, some approaches, like Mean-shift [4], there is a proportion of visual synonyms – several words ier than the human-constructed, virtually noiseless world of prising – after all, the visual world is much richer and noiseffective waysolved ofThe utilizing image segmentation without sufthe segmenting algorithm. Each of thecomresulting segmentathe segmenting algorithm. Each of thecomresulting segmentaHoiem et al. richer [13] have a surprisingly (superpixels). Others, Normalized Cuts [20] attempt tovarying (superpixels). 
Others, Normalized Cuts [20] attempt ysemy –Grouping the same or word describing lematically, visual polysemy –Grouping the same word describing pute multiple by the parameters of pute multiple segmentations bytovarying fering from itsmore shortcomings. For eachlike image, they fering from itsmore shortcomings. For eachlike image, they case,proposed youtext. need toRecently, have the et recognition problem al“visual phonemes” or “visual letters”. Consequently, becomes apparent when problem of[13] visual polysemy becomes apparent when prising –even after all, theRecently, visual world is much and noisperform only afering low-level over-segmentation of the image perform only afering low-level over-segmentation of the Hoiem al. have proposed a surprisingly gpolysemy the same object object part, and, probdescribing the same object or object part, and, prob1.2.text. visual words 1.2. visual words text. fromsegmentations its shortcomings. For each image, they comfrom its shortcomings. For ea multiple bywithout varying the parameters of perform onlysegmentations asegmentation low-level over-segmentation ofdifferent the image describing the same object or object part, and,synonyms more probtions assumed to ier bebut wrong –without but the1.2. is that some tions assumed to bebut wrong –without but like theofhope is that some findisastill global often success (however, findisastill global often success (however, text. object parts. All means that several objects or visual object parts. All this means that than the human-constructed, virtually noiseless world ofhuman-constructed, fering from its shortcomings. each image, they comeffective way ofpute utilizing image sufGrouping visual words the segmenting algorithm. Each of the resulting segmentathe segmenting algorithm. Each of the re pute multiple segmentations bysolution, varying the parameters ofhope pute multiple segmentations bysolution, varying the parameters ready! In1.2. 
practice, some approaches, like Mean-shift [4], age isThe represented inthis the “bag of we consider how anFor image isThe represented in the “bag of there is a proportion of visual – several words (superpixels). Others, Normalized Cuts [20] atte (superpixels). Others, like Normalized Cuts [20] attempt to lematically, polysemy – the same word describing ly,or visual polysemy – the same word describing ier than the virtually noiseless world of Grouping visual words effective way of utilizing image segmentation without sufproblem of visual polysemy becomes apparent when problem of visual polysemy becomes apparent when pute multiple segmentations by varying the parameters of pute multiple segmentations by varyi 1.2. are Grouping visual words 1.2. Grouping visual words the segmenting algorithm. Each of the resulting segmenta(superpixels). Others, like Normalized Cuts [20] attempt to segments in some of the segmentations will be correct. For segments in some of the segmentations will be correct. For lematically, visual polysemy – the same word describing see Duygulu et al. [6] for a clever joint use of segments and see Duygulu et al. [6] for a clever joint use of segments and ods alone sometimes not powerthe statistical text methods alone are sometimes not powertext. pute multiple segmentations by varying the parameters fering from itsmore shortcomings. For each only image, they from comtions issegmenting stillbut assumed to be wrong but the hope is that some or tions issegmenting stillbut assumed to be wrong – but the segmenting algorithm. Each of the resulting segmentathe segmenting algorithm. Each of the resulting segmentaThe problem of –visual polysemy becomes apparent when All visual words in an an the image are words” document model. 
All visual words in an anof image are perform a low-level over-segmentation of the image find a global solution, often without success (ho find a global solution, often without success (however, describing the same object object part, and, probseveral different objects or object parts. All this means that fferent objects or object parts. All this means that 1.2. Grouping visual words we consider how image is represented in the “bag of we consider how image is represented in the “bag of mes apparent when The problem of visual polysemy becomes apparent when text. the algorithm. Each of the resulting segmentathe algorithm. Each of the fering its shortcomings. For each image, they comThe problem of visual becomes apparent when The problem of visual becomes apparent when example, consider the images in 1aseveral and None ofofobjects example, consider the images in 1a clever andof4.the None ofof segmen textual annotations). textual annotations). the data. This isand not too surful enough to deal with the visual data. This isand not too surtions assumed to bebut wrong –without buthow the hope is that some findisastill global often success different or object parts. All means that segments some of4.the will be correct. segments some w tions is polysemy still assumed toinbean wrong – are but the hope is figures that some tions is polysemy still assumed toinbean wrong – are but the hope is figures that some the segmenting algorithm. Each of the segmentapute multiple segmentations bysolution, varying the parameters of gram, losing all of spatial neighplaced into a(however, single histogram, losing all spatial neighwe consider how an ishope represented inthis the “bag of see Duygulu ettions al. [6] for joint use see Duygulu ettions al. [6] for clever joint use segments and the statistical text methods alone are sometimes not powerical text methods alone are sometimes not power(superpixels). 
Others, like Normalized Cuts [20] attempt to lematically, polysemy –visual the same wordbecomes describing words” document model. All visual words words” document model. All visual words ed invisual the “bag we consider an image is represented inresulting the “bag of 1.2. Grouping visual words isin still assumed tosegmentations be wrong –image but thevisual is thatFor some isin still assumed tosegmentations be wrong – but The problem of polysemy apparent when pute multiple segmentations by varying the parameters we consider how anand image issome represented inimage the “bag of we consider how anof image issome represented inimage the “bag of 1.2. Grouping visual words the segmentations are entirely correct, but most objects get the segmentations are entirely correct, but mostthe getin figures isual world is much richer noisprising – after all, the visual world is much richer and noisRecently, Hoiem et al. [13] have proposed a surprisingly Recently, Hoiem et al. example, [13] have consider proposed aobjects surprisingly segments in some of the segmentations will be correct. For see Duygulu et al. [6] for a clever joint use of segments and the statistical text methods alone are sometimes not powerexample, consider the images in figures 1 and 4. None of images segments in of the segmentations will be correct. For segments in of the segmentations will be segments correct. For Suppose ainto car ais described by This tenlosing borhood relationships. Suppose a car is described by ten tions is still assumed to be wrong – but the hope is that some the segmenting algorithm. Each of the resulting segmentatextual annotations). textual annotations). ful enough to deal with the visual data. This is not too surh to with the visual data. is not too surwords” document model. All visual words in an image are placed single histogram, all spatial and neighplaced into a single histogram, losing all spatial and neighds in deal an image are words” document model. 
All visual words in an image are find a global solution, but often without success (however, segments in some of the segmentations will be correct. For in some of the segmentation several different objects or object parts. All this means that we consider how an image is represented in the “bag of The problem of visual polysemy becomes apparent when the segmenting algorithm. Each of thewords” resulting segmentawords” document model. All visual words inway an in image document model. All visual words inway an in image The problem of visual becomes apparent when segmented correctly atare least once. idea ofare maintaining segmented correctly atare least once. idea ofare maintaining structed, virtually noiseless world of than the human-constructed, virtually noiseless world of effective offigures utilizing image segmentation without sufeffective offigures utilizing image segmentation without sufthe segmentations entirely correct, butthe most objects the segmentations entirely correct, bu example, images 1 and 4. This None of example, images 1 and 4. This None of example, consider the images in figures 1aierclever and 4.the None ofof textual annotations). ful to deal with data. This isand not too surpresence of neighthese ten words in anconsider visual words. Does the of neighthese ten words in anconsider prising –presence after all, the visual world isSuppose much richer noisafter all, the visual world isSuppose much richer noissegments some of will be correct. For tions is polysemy still assumed toinbean wrong – are but the hope is that some Recently, Hoiem et al. [13] have proposed ainsurpri Recently, Hoiem ethow al. [13] have proposed ain surprisingly borhood relationships. alosing carand isthe described by ten borhood relationships. alosing carand isthe described by ten spatial and placed into a in single histogram, losing all spatial and placed into aimages single histogram, losing allget spatial neighexample, consider the figures 1visual and 4. None of model. 
example, consider the images figur see Duygulu ettions al. [6] for joint use segments and the statistical text methods alone are sometimes not powerwords” document All visual words image we consider an enough image is represented in the “bag of is still assumed tosegmentations be wrong – but the hope isathat some placed into a single histogram, all spatial and neighplaced into single histogram, all spatial and neighmultiple segmentations until further evidence can be used multiple segmentations until further evidence can be used text. we consider how an image is represented in the “bag of fering from its shortcomings. For each image, they comfering from its shortcomings. For each image, they comsegmented correctly at least once. This idea of maintaining segmented correctly at least once. This id the segmentations are entirely correct, but most objects get the segmentations are entirely correct, but most objects get ains a car? Not necessarily, since image imply that it contains a car? Not necessarily, since the segmentations are entirely correct, but most objects get prising – after all, the visual world is much richer and noisRecently, Hoiem et al. [13] have proposed a surprisingly iera than thedescribed human-constructed, virtually noiseless world ofwordseffective hedescribed human-constructed, noiseless world ofwordseffective way of utilizing image segmentation witho way of utilizing image segmentation without sufvisual words. the presence of athese tendescribed inby an ten visual words. Does the presence of these ten in an by relationships. tenDoes virtually borhood relationships. Suppose car is by ten example, consider the images in figures 1 and 4. None of segments in some of the segmentations will be segments correct. For the segmentations are entirely correct, but most objects get the segmentations are entirely correct, borhood relationships. Suppose a car is described by ten textual annotations). ful enough to deal with the visual data. 
This is not too surplaced into a single histogram, losing all spatial and neighwords” document model. All visual words in an image are borhood Suppose car is borhood relationships. Suppose a car is described by ten in some of the segmentations will be correct.segmented For topute disambiguate is This similar approach of etof topute disambiguate is This similar approach of etof segmentations by varying theBorenstein parameters segmentations by varying theBorenstein parameters words” document model. All visual words inway an in image are segmentations until further evidence can be used segmentations until further evid segmented correctly at multiple least once. ideatomultiple ofthe maintaining correctly at multiple least once. ideatomultiple ofthe maintaining to occur together spatially, these ten words did not have to occur together spatially, words 1.2. Grouping visual words segmented correctly at least once. This idea of maintaining text. ier than the human-constructed, virtually noiseless world of fering from its shortcomings. For each image, they fering from its shortcomings. For each image, they comeffective of utilizing image segmentation without sufimage imply that it contains a car? Not necessarily, since image imply that it contains a car? Not necessarily, since ehave ten words in an visual words. Does the presence of these ten words in an segmented correctly at least once. This idea of maintaining segmented correctly at least once. Thi the segmentations are entirely correct, but most objects get example, consider the images figures 1 and 4. None of visual words. Does the of neighthese ten words in an richer prising after all, the visual world isSuppose much noisvisual words. Does the presence ofthe these ten words inevidence an visual words. Does the presence ofthe these ten words inevidence an toEach Recently, Hoiem et al.consider [13] have ainsurprisingly borhood relationships. 
alosing carand isall described by neighten placed intoof a the single losing all–presence spatial and example, theproposed images figures 1 and 4.if the None of al. [3].segmenting al. [3].segmenting algorithm. resulting segmentaalgorithm. resulting segmentatoEach disambiguate ishistogram, to the approach of of Borenstein et disambiguate isbysimilar to the approac multiple until further can be used multiple until further can of bethe used placed into a single spatial and ge. Ofvisual course, if the object andsegmentations but anywhere the image. Ofvisual course, object andsegmentations aluping polysemy becomes apparent when Thecan problem of visual polysemy becomes apparent when pute multiple segmentations varying the further paramee pute multiple segmentations bysimilar varying theit parameters multiple segmentations until further bea incar? used these ten words did have toa occur together spatially, these ten words did have toa occur together spatially, necessarily, since image imply that itimage, contains Not necessarily, since text. fering from its shortcomings. For evidence each they commultiple segmentations until evidence canby beten used multiple segmentations until segmented correctly at least once. This idea ofmost maintaining thehistogram, segmentations are ten entirely correct, but most objects getimage 1.2. Grouping words words image imply that contains a car? Not necessarily, since image imply that itnot contains car?istions Not necessarily, since image imply that itnot contains car?istions Not necessarily, since ierafurther than the human-constructed, virtually noiseless world of effective way of utilizing segmentation without sufvisual words. Does the presence of these words in an borhood relationships. 
Suppose car is described the segmentations are entirely correct, but objects get The problem now becomes one of going through a large The problem now becomes one going through a large is still assumed to be wrong – but the hope is that some is still assumed to beof wrong –ofbut the hope is that some al. [3]. al. [3]. to disambiguate similar to the approach of Borenstein et to disambiguate similar to the approach Borenstein etEach borhood relationships. Suppose a car is described by ten correlated (e.g. cars and roads or its background are highly correlated (e.g. cars and roads or mage isanywhere in the “bagOf ofapparent we consider how an problem image isanywhere represented in the “bagOf ofapparent the segmenting algorithm. of the resulting segm the segmenting algorithm. Each of the resulting segmentabutthese in the image. course, if the object and but in the image. course, if the object and together spatially, these ten words did not have to occur together spatially, to disambiguate is similar to the approach of Borenstein et to disambiguate is similar to the approach of Borenstein et to disambiguate is similar to the appro pute multiple segmentations by varying the parameters of The of visual polysemy becomes when oblem ofrepresented visual polysemy becomes when multiple segmentations until further evidence can be used segmented correctly at least once. This idea of maintaining ten words did not have to “soup” occur together spatially, these ten words did not have to “soup” occur together spatially, these ten words did not have to occur together spatially, 1.2. Grouping visual words text. fering from itssegmented shortcomings. For each image, they comimage imply that Does it contains a car? Not necessarily, since visual words. Does presence of these ten words in an correctly at least once. 
idea of maintaining ofand (overlapping) segments and tothe the ofand (overlapping) segments and tobediscover the segments in roads some ofisthe segmentations bediscover correct. For segments in roads some ofisthe segmentations correct. For al. [3]. al. [3]. The problem now becomes one ofis going through awords. large The problem now becomes one ofis goin odeling the entire image can actuand grass), then modeling the entire image can actuel. All visual words inhighly an image arein the words” document model. All visual words inhighly an image arein the visual the presence ofthe these ten words inevidence an tionsor still assumed totrying bewill wrong – but the hope tha tionsor still assumed totrying bewill wrong – but the hope that some background correlated cars its background are correlated (e.g. cars ifits the object andare buttoanywhere incows the image. Of course, ifThis the object and al. [3]. al. [3]. al. [3].segmenting algorithm. Each of the resulting segmentawe consider how an image is represented “bag of der how an image is “bag of disambiguate is similar to the approach of Borenstein et multiple segmentations until further can be used but anywhere in represented the image. Of (e.g. course, if the object and but anywhere in the image. Of course, if the object and but anywhere in the image. Of course, if the object and The problem of visual polysemy becomes apparent when pute multiple segmentations by varying the parameters of good ones. But note in “soup” athrough large image dataset with many good ones. But note in “soup” athrough large image dataset many these ten imply words did itnot have toa occur together spatially, image imply that it contains awill car? Not necessarily, since example, consider the images inof figures 1 and 4. None ofbeand example, consider the images inof figures 1 and with 4. None multiple are segmentations until(e.g. 
further evidence can be to used of (overlapping) trying toimage discover of (overlapping) and tryg Theneighproblem now becomes one ofthat, going athe large Theneighproblem now becomes one ofthat, going athe large 1.2. Grouping visual words owever, this is unlikely tomodeling scale as inthe ally help into recognition. However, this is unlikely scale as inthe ogram, losing all spatial and placed single losing all spatial and segments segmentations willofbeone correc segments segmentations correct. For that car?istions Not necessarily, since cows and grass), then image can actucows and grass), then modeling image can actucars and roads orAll itsof background correlated cars and roads orAll problem now segments becomes one of going through athe large problem now segments becomes of words” document model. visual words anentire image are ocument model. visual words anentire image are The problem now becomes one –ofbut going through aa large is still assumed to be wrong the hope is that somehistogram, its background are highly correlated (e.g. cars and roads or in some The its background are highly correlated (e.g. cars and roads or in some The al. [3]. to contains disambiguate similar to the approach Borenstein ethighly

[Example images with best-segment overlap scores: .659 / .804, .567 / .816, .841 / .862]

Quantitative Results

[Plot: Mean BSS (y-axis, 0.4–0.9) vs. Segment Soup Size (x-axis, log scale, 1 to 1,000,000); curves: MeanShift, FH, NCuts, Best Single Segmentation, MultSeg]
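The Mean BSS (best spatial support) numbers in these plots measure, for each ground-truth object, the best overlap achieved by any segment in the soup. A minimal sketch, assuming the standard intersection-over-union overlap score on boolean masks (the exact scoring details are an assumption):

```python
import numpy as np

def overlap(seg, gt):
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(seg, gt).sum()
    union = np.logical_or(seg, gt).sum()
    return inter / union if union else 0.0

def best_spatial_support(soup, gt):
    """Best overlap achieved by any segment in the soup for one object."""
    return max(overlap(seg, gt) for seg in soup)
```

Mean BSS is then this per-object best score averaged over all ground-truth objects in the dataset.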

A closer look

A closer look

Merging Segments • Enumerate all pairs/triplets of adjacent segments

• Inexpensive and fast given an adjacency graph
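The enumeration above can be sketched as follows; a minimal sketch assuming the adjacency graph is given as a dict of neighbor sets (the exact representation is an assumption):

```python
def merge_candidates(adjacency):
    """Enumerate all pairs and triples of adjacent segments.

    adjacency: dict mapping segment id -> set of neighboring segment ids.
    Returns the adjacent pairs, and the triples that form a connected
    subgraph of the adjacency graph.
    """
    pairs = set()
    for a, nbrs in adjacency.items():
        for b in nbrs:
            pairs.add(tuple(sorted((a, b))))
    triples = set()
    for a, b in pairs:
        # Extend each adjacent pair by any neighbor of either endpoint;
        # this covers every connected three-segment group exactly once.
        for c in (adjacency[a] | adjacency[b]) - {a, b}:
            triples.add(tuple(sorted((a, b, c))))
    return sorted(pairs), sorted(triples)
```

Each candidate group is then merged into a single segment and added to the soup.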

[Example images after merging: Mean Shift .815, FH .792, NCuts .830]

Quantitative Results

[Plot: Mean BSS (y-axis, 0.4–0.9) vs. Segment Soup Size (x-axis, log scale, 1 to 1,000,000); curves: MeanShift, FH, NCuts, Best Single Segmentation, MultSeg, MultSeg + 1 Merge, MultSeg + 2 Merges]

Upper-Bound: Superpixels
• Create superpixels with NCuts and K=200 (Ren & Malik 2003)
• Consider all merges of superpixels
• Infeasible in practice

Superpixel Limit: .932, .917, .825
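A quick count shows why exhaustively merging K=200 superpixels is infeasible: even ignoring the adjacency constraint, every non-empty subset of superpixels is a candidate region, and the number of subsets grows as 2^K.

```python
# Each non-empty subset of K superpixels is a candidate region:
# 2**K - 1 of them.  Fine for tiny K, hopeless at K = 200.
for k in (10, 20, 200):
    print(f"K={k}: {2**k - 1:.3g} candidate regions")
```

This is why the superpixel number reported here is an upper bound on achievable spatial support rather than a practical algorithm.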

[Plot: Quantitative Results — Mean BSS (0.4–0.9) vs. Segment Soup Size (log scale, 1 to 1,000,000); curves: MeanShift, FH, NCuts, Best Single Segmentation, MultSeg, MultSeg + 1 Merge, MultSeg + 2 Merges, SP Limit]

Upper-Bound: Rectangular Windows
• Consider the best* rectangular spatial support
• Infeasible in practice

Rectangular Limit: .682, .909, .616

[Plot: Quantitative Results — Mean BSS (0.4–0.9) vs. Segment Soup Size (log scale, 1 to 1,000,000); curves: MeanShift, FH, NCuts, Best Single Segmentation, MultSeg, MultSeg + 1 Merge, MultSeg + 2 Merges, SP Limit, BB Limit]

Viola-Jones Sliding Windows
• Generate soup of segments by sliding square windows
• Often used in practice

Square: .495, .555, .301
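The square sliding-window soup can be generated mechanically: slide squares of several sizes across the image at a fixed stride and treat each window as a candidate support region. `square_window_soup` below is a hypothetical sketch (the window sizes and stride fraction are illustrative choices, not the paper's exact settings).

```python
import numpy as np

def square_window_soup(h, w, sizes=(32, 64, 128), stride_frac=0.5):
    """Generate a soup of square windows by sliding squares of several
    sizes across an h x w image (Viola-Jones style spatial support)."""
    soup = []
    for s in sizes:
        step = max(1, int(s * stride_frac))
        for y in range(0, h - s + 1, step):
            for x in range(0, w - s + 1, step):
                mask = np.zeros((h, w), bool)
                mask[y:y + s, x:x + s] = True
                soup.append(mask)
    return soup

windows = square_window_soup(128, 128, sizes=(64,))
print(len(windows))  # 9 windows: a 3x3 grid of 64px squares at stride 32
```

Square windows are cheap to enumerate, but as the low scores above show, they rarely align well with object boundaries.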

[Plot: Comparing to Limits — Mean BSS (0.4–0.9) vs. Segment Soup Size (log scale, 1 to 1,000,000); curves: MeanShift, FH, NCuts, Best Single Segmentation, MultSeg, MultSeg + 1 Merge, MultSeg + 2 Merges, ViolaJones, SP Limit, BB Limit]

Which Segmentation Algorithm is the best?
[Plot: Mean BSS (0.4–0.9) vs. Segment Soup Size (log scale, 1 to 1,000,000); curves: MeanShift, FH, NCuts, Best Single Segmentation, MultSeg, MultSeg + 1 Merge, MultSeg + 2 Merges, ViolaJones, SP Limit, BB Limit]

[Plot: Which Segmentation Algorithm is the best? — same comparison with an added combined FH+NCuts+MeanShift curve]

Conclusions
• Correct spatial support is important for recognition
• Multiple segmentations are better than one
• Mean-Shift is better than FH or NCuts, but together they do best
• Segment merging can benefit any segmentation
• "Segment Soup" is large, but not catastrophically large

Questions?
