Generalizing Edge Detection to Contour Detection for Image Segmentation

Hongzhi Wang, University of Pennsylvania, Philadelphia 19104, USA
John Oliensis, Stevens Institute of Technology, Hoboken, NJ 07030, USA

Abstract

One approach to image segmentation defines a function of image partitions whose maxima correspond to perceptually salient segments. We extend previous approaches following this framework by requiring that our image model sharply decrease in its power to organize the image as a segment's boundary is perturbed from its true position. Instead of making segment boundaries prefer image edges, we add a term to the objective function that seeks a sharp change in fitness with respect to the entire contour's position, generalizing from edge detection's search for sharp changes in local image brightness. We also introduce a prior on the shape of a salient contour that expresses the observed multi-scale distribution of contour curvature for physical contours. We show that our new term correlates strongly with salient structure. We apply our method to real images and verify that the new term improves performance. Comparisons with other state-of-the-art approaches validate our method's advantages.

Key words: perceptual organization, image segmentation

Email addresses: [email protected] (Hongzhi Wang), [email protected] (John Oliensis)

Preprint submitted to Computer Vision and Image Understanding, February 4, 2010

1. Introduction

Segmentation, the organizing of images into distinct meaningful parts, is one of the main tasks in early vision. Most approaches choose the organizing elements to be image regions; typically, they label a region as "meaningful" when it is homogeneous with respect to a set of image features and has abrupt changes in these features at its boundary. Methods vary in how much they weight boundary change versus homogeneity within each region. Segmentation approaches also vary in their modeling explicitness. Some model the component image segments parametrically, e.g., Mumford–Shah [27] models them as constant brightness regions, while others use a nonparametric approach, e.g., mean shift [8]. Parametric approaches are more generative: the segmentation aims partly at explaining the image. Nonparametric methods distinguish regions without characterizing them and are more akin to classification. All methods assume a prior image model in the sense that they rely on a particular choice of image features.

Though recent advances in segmentation focus on new computational methods [8, 35, 33], work remains to be done on extending and refining the segmentation "goodness criteria" that govern what algorithms should aim for. In this paper, we define a new goodness criterion and present an algorithm realizing it.

A segmentation organizes the image globally; ideally, the decision about where to segment should involve the entire image.

Like many others before us, we adopt a global approach in this paper, defining a function of the image and its partitions such that a larger value indicates a more plausible segmentation and maxima correspond to salient segments. Formalized in this way, segmentation reduces to maximization, and we can address it using optimization methods.

For simplicity, we consider the problem of segmenting foreground from background according to brightness. We could use other features in addition to brightness, e.g., texture measures, but we focus on this simpler task to highlight our main contributions. We illustrate our approach in a nonparametric framework, but we could easily apply it to parametric models. Our approach could be applied in conjunction with many existing segmentation techniques.

We use the Minimal Description Length (MDL) framework, which defines the best segmentation as that achieving the best image encoding. We determine the "likelihood" of a region segment according to the brightness entropy within the region (a measure of brightness homogeneity) and the curvature entropy of the region's bounding contour (a measure of boundary smoothness). So far our approach has similarities with [21], though it differs in details such as our use of curvature.

The main novelty in our formulation is the following. The approach as described omits the knowledge that significant regions end abruptly. A segmentation approach should incorporate this information: besides seeking to divide an image into smoothly shaped regions with distinct brightness distributions, we should also prefer sharp transitions between regions. Usually, researchers achieve this by making region


boundaries prefer to follow image edges. This commits unnecessarily to a prior assignment of image edges and makes the output segmentation depend on the unreliable results of low-level edge detectors. For greater consistency with our goal of producing a global explanation of the image, we follow a different approach: at the correct location of the segmentation boundary, we require that our image explanation change sharply in power as we perturb the position of the boundary slightly. We add a term to the objective function that seeks abrupt change in the global representation with respect to the entire contour's position, generalizing from edge detection's search for sharp changes in local image brightness [5]. This global requirement on the boundary helps to overcome local ambiguities. In addition, we impose a smoothness constraint on the shape of a salient contour that expresses the observed multi-scale distribution of contour curvature for physical contours. Our discontinuity definition doesn't depend on the choice of global representation, and our experiments illustrate how other segmentation techniques with different representations could also incorporate it.

Another, more pragmatic, motivation for our approach is the following. Due to the difficulty of perceptual organization and our lack of knowledge about what criteria to apply, we expect that any objective function that we start from will be nonoptimal. Therefore, instead of restricting ourselves to our original function, we study how it correlates with correct segmentations and use properties of the function that are found to correlate well as additional criteria for segmentation. This approach is what led us to the particular definition of global derivative given below. We found experimentally that "scale derivatives" of our original MDL objective function correlate strongly with good segmentations, so we include


such a derivative in a new objective function. We use a variational method (snake) to find optimal closed contours around the most salient figure in an image. Experiments on many real images show that our new method compares well with state-of-the-art segmentation algorithms. A preliminary version of this work appeared in [38].

1.1. Related work

Leclerc [21] was one of the first to use MDL for segmentation, followed by, e.g., [42, 27]. These approaches fit explicit probability models to region statistics and are largely region based. Normalized Cuts (NC) [35] maximizes similarity within regions and dissimilarity at their boundaries but has no explicit probability model. None of these methods represents the image statistics on the boundary, and NC does not represent the boundary at all. The methods' shortcomings include no boundary smoothness requirement and inaccuracy in boundary localization; e.g., [42] reports that pixels on the borders of the distributions are likely to be misclassified, resulting in inaccurate boundaries.

Segmentation relates closely to salient contour detection [31, 4]. Contour methods model contours directly and can exploit priors on their shape such as smoothness. Most approaches rely on grouping edges [12, 11, 39, 40, 23]. A problem with these methods is that they exploit intensity or texture statistics mainly on the contour, and the neglect of region statistics makes their discontinuity detection less reliable.

Because of their complementary advantages, researchers have sought to combine region-based and edge-based methods. This is not our main aim,

but our approach does achieve this, avoiding some problems of current methods. For example, [18] uses Green's theorem to transfer region energy (of a particular type) into boundary energy. Since it is impossible to fully characterize a region without using its data explicitly, [18] achieves its streamlined computation by sacrificing flexibility in region modeling. [28, 29] model regions as Gaussian mixtures and compute edge energy from a globally learned mixture model. Their method confronts difficulties in estimating the number of Gaussian components and in avoiding local minima while learning the mixture model. This approach, and other similar methods, e.g., [25, 34, 37], measure edge energies at each pixel independently, neglecting the cumulative statistics along the contour. Thus they cannot overcome local ambiguities by enforcing consistency between different parts of the contour. Our global derivative imposes consistency across the entire boundary. The transductive-learning-based interactive segmentation method of [9] is also related in that it uses global image information to determine boundaries. It follows a simple principle favoring edge pixels located in low-density areas of the feature space. In comparison, our method is more general and allows incorporating more complicated global image statistics.

Our algorithm searches for an optimal segmentation using an active contour (snake) technique. Traditional snakes have energies defined along the boundary [19], but region-based snakes have also been proposed [6, 28, 29]. Often snakes are implemented using Level Sets, e.g., [28]. Recently, two groups [7, 36] proposed the Sobolev active contour technique, which tries to minimize the derivatives of the contour flow along the contour. Sobolev snakes are intended as regularized versions of traditional snakes. In


contrast, we consider the derivatives of our objective function along the contour normal directions and focus on perceptual organization issues rather than regularization. Our work also relates to interest point methods, e.g., [20, 22], that compute "derivatives" with respect to scale. We also use a kind of derivative in scale, and the approach of [20] can be considered a special case of ours. Our approach has similarities to MSER [26], which also seeks regions for which a small area change produces a large global brightness change.

2. Global discontinuities

To highlight our main contributions, we consider the simple problem of segmenting an image into two parts: a simply connected figure representing the most salient object in the image, and the background. (For a simple extension to multiple segments, we could apply our method recursively to the figure as in the original normalized cuts approach [35].) Finding such a segmentation is equivalent to finding the salient closed contour bounding the figure. We treat these tasks as equivalent, viewing the bounding contour as an image organizer that partitions the image into meaningful parts.

Organization is a fundamental quality of contours which has been explored thoroughly [10, 21, 27, 42]. Following MDL, we measure the organizing efficiency of a contour C by the entropy encoding length of the brightness over the foreground and background regions. To increase noise robustness, we compute the encoding lengths from kernel density estimates of the brightness probability distributions. Letting Hf be the raw brightness histogram of the figure, we define the foreground smoothed histogram as hf = Hf ∗ G, where G is a Gaussian

(with standard deviation σ = 8 in our experiments; the intensity range is [0, 255]). Normalizing hf gives the estimated probability distribution. Defining hb similarly for the background,

O(C) = −log ∏_{i∈f} [hf(Ii)/Nf] − log ∏_{i∈b} [hb(Ii)/Nb]
     = −log ∏_{i=0}^{255} [hf(i)/Nf]^{hf(i)} [hb(i)/Nb]^{hb(i)}
     = −∑_{i=0}^{255} [hf(i) log hf(i) + hb(i) log hb(i)] + Nf log Nf + Nb log Nb,   (1)

where Ii is the brightness of pixel i, and the normalization N{f,b} gives the number of pixels in the specified region. For reasons given below (Sections 2.2 and 2.3), we choose the contour itself to lie in the background. We include the background entropy in (1) since this enables us to compare foreground and background and distinguish them by the differences in their brightness distributions. Using just the figure entropy, we would only capture figures of uniform brightness. O(C) measures the negative log-likelihood in a simple segmentation model. A smaller value of O(C) indicates a more efficient image encoding, and we can consider −O(C) as giving the contour's "organizing power." A similar criterion was used in [3].
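To make the encoding length concrete, here is a minimal sketch of Eq. (1) in Python/NumPy (the paper's own implementation was in MATLAB); the mask representation, bin count, and use of SciPy's Gaussian filter are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def encoding_length(image, fig_mask, sigma=8.0):
    """Sketch of O(C) from Eq. (1): entropy encoding length of the
    brightnesses inside (figure) and outside (background) the contour,
    computed from Gaussian-smoothed histograms. image: 2-D array of
    intensities in [0, 255]; fig_mask: boolean array, True inside C."""
    def region_term(values):
        H, _ = np.histogram(values, bins=256, range=(0, 256))
        h = gaussian_filter1d(H.astype(float), sigma)  # kernel density estimate
        N = values.size
        nz = h > 0
        # one region's share of Eq. (1): -sum_i h(i) log h(i) + N log N
        return -np.sum(h[nz] * np.log(h[nz])) + N * np.log(N)
    return region_term(image[fig_mask]) + region_term(image[~fig_mask])
```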

2.1. Discontinuity in the global image representation

Typical images have sharp transitions between segments; we now modify our objective function −O(C) to reflect this. We aim to favor bounding contours such that small changes in their shape or position sharply reduce their power to explain the image. Analogously, local edge detection seeks edgels for which a small position change gives a large intensity change.

This suggests that we should choose our bounding contour such that both the organizing power −O(C) and its derivatives with respect to the contour shape should be large at the contour’s correct location. There are many ways to implement this. For simplicity, and to avoid the noise fluctuations associated with higher order derivatives, which can cause problems in searching for the maximum of the objective function, we choose our function as a linear sum of −O(C) plus its first derivative, D(C), with respect to the contour shape. At the correct location of the contour, −O(C) and D(C) should both be large. This may seem counter-intuitive: a function near its maximum normally has a small, not a large, first derivative. However, in the ideal case of a perfectly sharp transition between segments, the maximum of −O(C) occurs at a cusp singularity, which can be thought of as shaped like a ‘∧.’ See Fig. 4. At this cusp, the usual derivative isn’t small—it doesn’t exist!—and the one-sided derivatives are large. Our derivative is effectively one–sided, so it and −O(C) can be large simultaneously. In practice, for transitions that are not perfectly sharp, we confront the issue of the transition scale—the usual issue of scale which confronts any edge detection or segmentation method. See Section 2.3 below for discussion. The contour shape is a function t → (x(t), y(t)), so the derivative ∂O(C)/∂C is a functional derivative. Since functional derivatives are too unwieldy, we just compute the derivative over a low dimensional family of curves corresponding approximately to dilations. As a further simplification, we adopt the following discrete approximation. Define “dC” as a contour movement in which each contour point moves one pixel along the outward normal direction for C. We call this a positive movement; a negative movement consists



Figure 1: (a) Contours C+ and C− are generated from contour C by a positive movement and a negative movement, respectively. (b) O(C) and D(C) have complementary functions. D(C) gives better discrimination and more accurate localization of salient boundaries, while O(C) responds to a salient boundary over a larger range of deformations. In this example, the image has a uniform black background. It contains two uniform white squares, A and F. B is a subregion within F, which has the same size and shape as A. According to the coding criterion, CA and CB (the boundaries of A and B, marked in light brown) have the same coding efficiency for the whole image because both the interior image statistics and the exterior image statistics are identical. However, D(C) distinguishes contour CA because it coincides with an image boundary. On the other hand, the low value of O(CB) indicates that this contour lies at least close to a salient boundary, but we cannot use D(CB) to detect this nearby boundary.


of a one-pixel shift in the inward direction. Fig. 1(a) illustrates the contours C+ and C− obtained after positive and negative movements from contour C. We define our discrete derivative D(C) ≡ dO(C)/dC by

D(C) = O(C+) − O(C).   (2)
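A sketch of Eq. (2) under the same assumptions: for a figure represented as a raster mask, the positive movement C+ (one pixel along the outward normal) can be approximated by a one-pixel binary dilation, reusing the encoding_length sketch above.

```python
from scipy.ndimage import binary_dilation

def discrete_derivative(image, fig_mask, sigma=8.0):
    """D(C) = O(C+) - O(C) from Eq. (2): the change in encoding length
    when every contour point moves one pixel along the outward normal,
    approximated here by a one-pixel dilation of the figure mask."""
    fig_plus = binary_dilation(fig_mask)  # approximate positive movement C+
    return (encoding_length(image, fig_plus, sigma)
            - encoding_length(image, fig_mask, sigma))
```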

We seek a segmentation that maximizes −αO(C) + D(C), where α determines the relative weight of the two terms. The reason D(C) appears with a '+' is discussed below in Section 2.3. As indicated above, we intend our definition to give a rough measure of the contour's stability with respect to scale: our experiments, as well as previous work [20, 22], indicate that scale stability is an important criterion for selecting salient regions. Our positive/negative movements aren't pure scalings, since they are independent of curvature; however, our smoothness requirement (see below) makes contours prefer a locally circular shape, for which these movements do give a reasonable approximation of scaling. More important, the requirement typically makes the boundary smooth at the pixel scale, which eliminates artifacts from our discretization of the scale derivative (this, plus simplicity, is what motivated our discretization). Our definition works well in practice, producing segmentations that follow natural boundaries.

As we mentioned, the interest point detector of [20] can be considered a special case of our approach. This method detects purely circular regions with high entropy in some image property. It determines the salient scales of these regions using a measure analogous to our D(C) but differentiates the probability densities directly instead of the entropy. Section 2.3 below discusses our discrete approximation and the sign used

in the objective function in more detail. First, we give examples to explain why the extra derivative term is useful.

2.2. The perceptual meaning of D(C)

We can define D(C) as above for any contour function O(C). We analyze several O(C) functions from the literature to show that D(C) has an important perceptual meaning complementary to O(C).

Generalized simultaneous contrast. If we take O(C) as the average brightness of the figure region,

D(C) = (NC^{−1} ∑_{i∈C} Ii − Nf^{−1} ∑_{i∈f} Ii) / (Nf/NC + 1),   (3)

where NC is the number of pixels in the contour C and Nf counts the figure pixels.
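A direct transcription of Eq. (3), as a sketch; the boolean contour_mask marking the contour pixels (which lie in the background, just outside the figure) is an assumed input.

```python
import numpy as np

def mean_brightness_derivative(image, fig_mask, contour_mask):
    """D(C) of Eq. (3) when O(C) is the average figure brightness:
    the contour/figure brightness contrast, damped by the
    figure-to-contour area ratio."""
    N_C = contour_mask.sum()
    N_f = fig_mask.sum()
    contrast = image[contour_mask].mean() - image[fig_mask].mean()
    return contrast / (N_f / N_C + 1.0)
```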

We want the maximum of −αO(C) + D(C), which now amounts to a global version of a center-surround receptive field for detecting dark figures. It has large values when the central figure is dark and the surrounding contour is bright. Such patterns are more "salient" than a figure of the same brightness that lacks a contrasting contour, a kind of simultaneous contrast effect (which is often explained as caused partly by center-surround response) [1, 2]. See Figure 2a. Note that the denominator in (3) favors figures that have small area compared to the length of their boundary. Figure 2b illustrates that such figures have enhanced perceptual saliency.

Entropy encoding. For our entropy encoding definition in (1),

O(C+) = −log ∏_{i=0}^{255} [(hf(i) + hC(i))/(Nf + NC)]^{hf(i)+hC(i)} [(hb(i) − hC(i))/(Nb − NC)]^{hb(i)−hC(i)},   (4)


Figure 2: (a). Simultaneous contrast effect: although the central squares have the same brightness, the left inside square appears darker because its periphery is brighter. (b). The squares and their boundaries have the same brightness but the figure/contour ratio is smaller on the right; the right square appears darker.

where hC is the smoothed histogram for the contour. The only change from O(C) to O(C+) is that the pixels of C switch from background to figure (recall that we define C to lie in the background). To get an intuition for D(C), we assume that the figure is salient and hence has a much larger area than perimeter. This implies Nf, Nb ≫ NC and hf(i), hb(i) ≫ hC(i) for each i. Approximating log(1 + x) = x + O(x²), we get

D(C) ≈ NC ∑_{i=0}^{255} pC(i) log[pb(i)/pf(i)],   (5)

where the pa, a ∈ {C, b, f}, represent the normalized (smoothed) histograms, i.e., the probability distributions, on the contour, background, and figure. Since we seek a large, positive value for D(C), the leading NC in (5) favors a large figure. If C lies in the background, we expect pC ∼ pb, making the sum an approximate KL divergence, which ensures that D(C) in (5) is positive. Thus, maximizing D(C) favors large figures (as in Normalized Cuts), which are likely to be more significant, and contrasting figure and background distributions. For fixed pb,f, the sum favors a distribution pC(i) similar to log[pb(i)/pf(i)]. We obtain a kind of generalized simultaneous contrast effect; see Fig. 4a.
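In code, the approximation of Eq. (5) is a one-liner; p_c, p_b, p_f below are the normalized smoothed histograms, assumed strictly positive after smoothing.

```python
import numpy as np

def derivative_kl_approx(p_c, p_b, p_f, N_C):
    """Eq. (5): D(C) ~ N_C * sum_i p_C(i) log(p_b(i) / p_f(i)).
    Large when the contour statistics match the background and the
    figure and background distributions are well separated."""
    return N_C * np.sum(p_c * np.log(p_b / p_f))
```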

Fig. 3 illustrates on a real image the usefulness of including D(C). For easier visualization, we consider O(C) along a one-parameter family of curves generated by dilating the true boundary (see also Section 2.4 and Fig. 5). For this family, the minimum of O(C) occurs at a contour that doesn't correspond to any image curve; instead, its location is determined by a balance between the amount of gray assigned to foreground and background. In contrast, D(C) has a local maximum at the correct vase boundary.

Figure 3: Since the vase has an intensity histogram similar to that of the background, the encoding criterion (1) does not indicate any boundary; however, the D(C) term successfully gives a local maximum at the correct boundary location. (a) Image; (b) contours corresponding to the minimum of O(C) (red) and the maximum of D(C) (blue) in the pseudo-scale space of the vase's contour; (c) plots of O(C) and D(C) in the pseudo-scale space of the vase's contour.

Fig. 1b illustrates the different advantages of D(C) and O(C) in searching for good segment boundaries (i.e., in maximizing the objective function). D(C) is sensitive to the precise location of a salient boundary and can help to localize it accurately. But D(C) tends to be small when C is displaced from the correct boundary location (or intersects it at a large angle), so it

isn’t useful in large–scale searches for the boundary. The encoding length O(C) remains small for contours “near” the true boundary, and can be used to detect and search for this boundary from a larger range of initial contours. Recall, however, that high values for both D(C) and −O(C) are important in distinguishing true from false segment boundaries.

Figure 4: (a). Black square on grey background, with several contour positions indicated in white. Positioning C at the boundary gives the maximum value of D(C), since the figure’s brightness distribution differs maximally from those of the contour and the background. (b). A plot of O(C) over the pseudo-scale space (Section 2.4) of the salient contour shown in part (a) of this figure. O(C) is the encoding length. A perfect salient contour not only organizes the image well (small O(C)) but also indicates a big organization loss if the contour shifts position (large one-sided derivative of O(C)—or large second derivative when O(C) is smooth).

Normalized cuts. As a last example, we consider D(C) for Normalized Cuts (NC). While the objective function for NC favors a boundary with small "cut," D(C) adds a different type of criterion for boundary goodness, favoring boundary locations that are much better than their near neighbors. Given image regions M, N, Q, and with w(u, v) denoting the affinity between pixels u and v, we use the shorthands ∑_{u∈M, v∈N} w(u, v) ≡ W(M, N) and W(M, N) ± W(M, Q) ≡ W(M, N ± Q). Let V be the overall image region. Then

D(C) = W(f+C, b−C)/W(f+C, V) + W(f+C, b−C)/W(b−C, V) − W(f, b)/W(f, V) − W(f, b)/W(b, V)
     = W(C, b−f−C)/W(f+C, V) + W(C, b−f−C)/W(b−C, V) + W(f, b)W(C, V)/[W(b, V)W(b−C, V)] − W(f, b)W(C, V)/[W(f, V)W(f+C, V)]
     = W(C, b−C)/W(b−C, V) − W(C, f+C)/W(f+C, V) + [W(C, b)W(f, f) − W(C, f)W(f, b)]/[W(f, V)W(f+C, V)] + [W(C, b)W(f, b) − W(C, f)W(b, b)]/[W(b, V)W(b−C, V)]
     ≈ W(C, b)/W(b, V) − W(C, f)/W(f, V) + [W(C, b)W(f, f) − W(C, f)W(f, b)]/W²(f, V) + [W(C, b)W(f, b) − W(C, f)W(b, b)]/W²(b, V),

where the approximation follows from assuming W(b, X), W(f, X) ≫ W(C, X), in analogy to our derivation of (5). The first two terms measure whether the contour data are more similar to the background (good) or the figure. The last two terms favor figures with big W(f, f)/W(f, b) and big W(f, b)/W(b, b), respectively, thus favoring big W(f, f)/W(b, b). Overall, D(C) favors a big figure region with large contrast at the boundary.

2.3. The one-sided derivative: scale and sign

The sign of D(C) in the objective function. Since we defined the bounding contour as part of the background, we expect D(C) to achieve its maximum on the background's inner boundary. This is because the encoding length O(C) increases as C expands beyond the true boundary, and because just outside this boundary the entropy increment from the admixture of background compared to the foreground is largest. Similarly, D(C) has a large negative value when C lies just within the outer figure boundary. Hence, instead of maximizing D(C), we could locate the boundary by minimizing it; alternatively, we could maximize the second derivative D2(C) ≡ D(C) − D(C−) = O(C+) − 2O(C) + O(C−). In practice, these approaches work less well than maximizing D(C).


We choose to maximize instead of the other possibilities based on the following heuristic reasoning. In segmenting, a common situation is that the figure is a single homogeneous object (or part of one), while the background may be composed of many objects. Thus, we typically expect a peaked brightness distribution, for example uniform brightness, on the figure, with a broader brightness distribution over the background. For this common situation, one can verify (see, e.g., (5)) that the maximum of D(C) on the inner background has much larger magnitude than the minimum in the foreground; thus, it is easier to locate the correct boundary by maximizing D(C) instead of minimizing. In the reverse situation, e.g., when the image consists of a randomly shaded figure on a uniform background, D(C) has the largest magnitude at its minimum, and minimizing may be a better strategy. We expect this case to occur less often (as noted, figures tend to be homogeneous, and their relative smallness compared to the background makes them less likely to have a broad brightness distribution); hence, we target the more common situation described above, and maximize. Another reason not to minimize is that, since our discussion implies that a broad distribution inside C can make D(C) small, minimizing would increase the risk from local minima at which the contour encloses as disparate a collection of pixels as possible, rather than identifying a homogeneous region.

Scale. As with any edge detection or segmentation method (e.g., in Normalized Cuts the neighborhood size determines the detectable transition scales), the scale at which we calculate the derivative determines which transitions the algorithm


will detect. In this paper, we control the scale by smoothing and downsampling the image and then computing the derivative as described above. Note that we don't expect our method to have a strong sensitivity to scale, as long as this is chosen large enough, i.e., at least comparable to the scale of the sought transitions. This is because the size of an important segment is generally much larger than the "thickness" of the transition between segments. As we shrink or enlarge the contour C away from its correct position at the boundary, we expect −O(C) to decrease over a large range of movement whose scale is set by the segment size, while the 'flat' region around the maximum occurs over a small region scaled by the boundary thickness. Hence, the approximate '∧'-shaped cusp for −O(C) should persist for resolutions coarser than the transition scale. Also, coarsening the resolution shouldn't affect the fact that the entropy increment is largest at the true segment boundary (see the discussion at the beginning of this section). In summary, we expect that maximizing −O(C) + D(C) can give the correct contour for resolutions comparable to or somewhat coarser than the true scale of the segment boundary.

2.4. Visualization

To aid intuition, we plot O(C) and D(C). Since it is impossible to show their values for all contours, we plot them for expansions or contractions of a given selected contour, namely, the perceptually best one chosen by human observers. (For convenience, we deform the original contour using MATLAB's dilation/erosion operators; the new contours are level sets of the Manhattan distance from the original contour and hence approach diamond shapes at large scales.) We refer to this set of contours as the pseudo-scale space of the selected contour, where we use "pseudo" to emphasize that the new contours are not strict scalings of the human-selected contour.
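A sketch of how such a pseudo-scale space can be generated, with SciPy's binary morphology standing in for MATLAB's dilation/erosion (the default cross-shaped structuring element indeed yields level sets of the Manhattan distance); the step count is an illustrative assumption.

```python
from scipy.ndimage import binary_dilation, binary_erosion

def pseudo_scale_space(fig_mask, steps=20):
    """Expand and contract a reference figure mask one pixel at a time,
    yielding the family of contours over which O(C) and D(C) are plotted.
    Returns a dict mapping movement count (negative = contraction) to mask."""
    family = {0: fig_mask}
    grow = shrink = fig_mask
    for k in range(1, steps + 1):
        grow = binary_dilation(grow)     # k positive movements
        shrink = binary_erosion(shrink)  # k negative movements
        family[k], family[-k] = grow, shrink
    return family
```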

If a salient contour gives the global maximum of saliency among all possible contours, it will also show this in our simplified plot. If a contour does not give the global maximum in our plot, it cannot give the global maximum in the entire contour space either. The human-selected contour is the only meaningful one in its pseudo-scale space; the other contours simply give a sample of this contour's neighborhood. Though our expanded and contracted contours aren't strict scalings of the original, our plots do indicate how O(C) and D(C) vary as C smoothly deforms from the selected contour.

We measured O(C), D(C) and D2(C) for Berkeley's human-labeled image database, using only those images that contain at least one perceptually salient object. For consistency with our segmentation experiments below, each image is first down-sampled to 9600 pixels using a Gaussian pyramid with a standard deviation σ = 1.1. For each image: we use the human-labeled boundaries to select the most salient figure; generate the pseudo-scale space of its boundary (including contours ranging up to the size of the whole image); and plot our O(C) in (1) and its derivative D(C). Fig. 5 shows results for a few of the images.

Among the 88 tested images, there are 52 (59.1%) images where the salient contour gives the global minimum of O(C) over the pseudo-scale space. There are 57 (64.8%) and 81 (92.1%) images where the salient contours give the global maximum of D(C) and D2(C), respectively. There are 64 (72.7%) images where the salient contour has either a global minimum in O(C) or a global maximum in D(C). In other cases, the



Figure 5: Pseudo-scale space analysis. 1st row: original images; 2nd row: salient structures; 3rd row: O(C) (red oval) and D(C) (blue triangle, dotted line) in the pseudo-scale space of the salient structure, using the encoding criterion. For each contour, the x-axis gives the number of contour movements from the salient contour in its pseudo-scale space. The big ovals and triangles indicate the positions of the salient contours. O(C) and D(C) are normalized for better display; the actual value range is shown in each figure.


perceptually salient contour often corresponds to a local minimum of O(C) or a local maximum of D(C). Note that, as expected, D(C) tends to increase at larger C. For a better picture of the global behavior of our measures, in presenting these results we compensated for small localization errors in the human-labeled contour and for discretization effects by counting the original selected contour as equal to the one with minimum O(C) or maximum D(C) if it can be generated from it by 1 or 2 contour movements.

Our results show that a high D(C) or D2(C) value correlates with salient structure at least as strongly as a low value for O(C). D2(C) appears the best measure for discriminating true segment boundaries. However, its noise fluctuations make it difficult to use in optimization, while the smooth behavior of O(C) makes it a more appealing candidate. Our choice of objective function represents a compromise between the high discrimination of D2(C), D(C) and the easier global search possible with O(C), D(C).

In another test, we measured O(C) and D(C) in each image for all 88 contours selected as salient in one of the images; this compares the correct salient contour with 87 irrelevant ones. There are 38 (42.0%) images where the correct contour gives the smallest O(C) and 57 (64.8%) images where it gives the largest D(C). (Since the irrelevant contours aren't generated by contour movements, these results are for the exact contours, without our previous tolerance for small contour movements.) This again verifies the derivative's higher discrimination. With respect to the previous experiment, here we compare the true contour to about 50% more incorrect contours, resulting in lower performance for both O(C) and D(C).


Other region models. Defining O(C) as the average brightness and using the corresponding D(C) in (3), we find 26 (29.5%) images where the salient contour gives the global minimum/maximum of O(C) and 64 (72.7%) images where the salient contour gives the global minimum/maximum of D(C). For the Normalized-Cuts criterion based on intensities and intervening edges, O(C) = W(f, b)/W(f, V) + W(f, b)/W(b, V), and we compute affinities for each pixel within a 51 × 51 neighborhood. The affinity code is from the authors' Normalized Cuts toolbox. There are 42 (47.7%) images where the salient contour gives the global minimum of O(C) and 44 (50.0%) images where the salient contour gives the global maximum of D(C). Note that our definition of O(C) depends more sensitively on the contour position. All these results are for the pseudo-scale space contours. They indicate that D(C) is at least as discriminative as O(C) for these models.

Summary. These results verify that both O(C) and D(C) are useful as saliency signatures. The plots are consistent with our previous analysis: O(C) roughly locates salient figures with good noise robustness, and D(C) discriminates the true boundaries and localizes them with higher precision. By using both O(C) and D(C) we exploit their complementary advantages. As stated, in the pseudo-scale space results we neglected localization errors of 1 or 2 contour movements. By using a one-sided first derivative, and by discretizing it, we are implicitly approximating the segment boundary as a cusp at the pixel scale; as a result, we cannot use D(C) to localize the boundary to sub-pixel accuracy. This disadvantage is more than compensated by the increased simplicity of our objective function and the easier global search for the bounding contour. Once the correct bounding contour is found via

global search, refining its position to sub-pixel accuracy is a relatively much easier problem. In our framework, we could do this by implementing our approach on a refinement of the original pixel grid or by eliminating our discrete approximations.

3. Saliency measure and its Optimization

To demonstrate the usefulness of D(C), we need a complete saliency function measuring the overall goodness of a closed contour, which we define in this section.

3.1. Contour shape prior

To favor contours with plausible shapes, we exploit the observed statistics of natural contours. The curvatures of a natural contour have a highly kurtotic distribution [41], which can be modeled by a generalized Laplacian distribution p(x) = A e^{−|x/a|^b} [11]. Using the MDL framework [14, 21], we could measure the plausibility of a given contour shape by the minimum encoding length based on the generalized Laplacian distribution,

L(C) = min_{a,b} ε(a, b, C),   (6)

where ε(a, b, C) is the encoding length of C for the generalized Laplacian distribution with parameters a and b. This requires nonlinear optimization to find the optimal a and b [11]. To avoid this, we instead use the lower bound of the encoding length given by the entropy encoding. Further, since natural shapes are defined at multiple scales [30, 32], we define the contour geometry measure as

L(C) = −log [ ∏_{j=1:m} ∏_{i=−n:n} (h^j(i, C)/NC)^{h^j(i,C)} ]^{1/m},   (7)

where h^j(i, C) is the Gaussian-smoothed contour-curvature histogram of C at the j-th scale, m is the number of different scales, and n is the biggest curvature. L(C) computes the average encoding length for C over the different scales. A smaller L(C) indicates a better shape. In our experiments, we use a curvature histogram with 8 bins, and the smoothing Gaussian kernel has σ = 0.5. For simplicity, unlike with O(C), we don't consider the derivative of L(C); since contour smoothness doesn't change across scales, we expect it to make a small contribution compared to D(C).
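A minimal sketch of Eq. (7); the histogram settings follow those just given (8 bins, σ = 0.5), but the multi-scale curvature input is an assumed precomputed quantity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def shape_encoding_length(curvatures_per_scale, bins=8, sigma=0.5):
    """L(C) from Eq. (7): the entropy encoding length of the contour-
    curvature histograms, averaged over the m scales; smaller values
    indicate smoother, more plausible shapes.
    curvatures_per_scale: list of 1-D curvature arrays, one per scale."""
    total = 0.0
    for curv in curvatures_per_scale:
        H, _ = np.histogram(curv, bins=bins)
        h = gaussian_filter1d(H.astype(float), sigma)
        nz = h > 0
        N_C = curv.size
        total += -np.sum(h[nz] * np.log(h[nz])) + N_C * np.log(N_C)
    return total / len(curvatures_per_scale)
```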

3.2. Saliency function

We now define an objective function based on the above measures. Ideally, one could learn the best cue combination [16, 25], but a simple linear-combination strategy already gives very good performance. In combining our cues, we must consider the effects of scale. The encoding length O(C) grows roughly in proportion to the area of the image and contour C, while the other two functions, D(C) and L(C), grow in proportion to C's length. Assuming that the most salient figure has size proportional to that of the image, with a smooth outline, the salient contour length is proportional to 2π (area/2π)^{0.5}, where the area is that of the whole image. To compensate for this scale variance, we tune the organization function by a coefficient related to the image size, α = (2π/area)^{0.5}. α is not a free parameter because it is fixed for images of the same size. Our saliency function is

Saliency(C) = −αO(C) + D(C) − βL(C),   (8)

where β controls the smoothness prior on the contour geometry and is the only free variable.
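Combining the earlier sketches, Eq. (8) might read as follows; curvatures_per_scale is the assumed multi-scale curvature input from the shape-prior sketch above.

```python
import numpy as np

def saliency(image, fig_mask, curvatures_per_scale, beta=1.0):
    """Eq. (8): Saliency(C) = -alpha*O(C) + D(C) - beta*L(C),
    with alpha = (2*pi/area)**0.5 compensating for image size."""
    alpha = np.sqrt(2.0 * np.pi / image.size)
    return (-alpha * encoding_length(image, fig_mask)
            + discrete_derivative(image, fig_mask)
            - beta * shape_encoding_length(curvatures_per_scale))
```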

The choice of β should reflect the figure/background

contrast and the complexities of the contour shape and background. For large contrasts, a simple figure shape, and a cluttered background, one should choose β large. In other cases, a smaller β is more appropriate. We also select manually the number of scales used to encode a contour (5 in all our tests), but we have found that our results are insensitive to this number. With more scales, the convergence is smoother and easier. Smoothing the histograms in equations (1) and (7) also helps to smooth convergence.

3.3. Variational optimization

To search for the optimal salient contour based on the saliency function (8), we use a variational method. We illustrate its derivation for the MDL-based O(C) from (1). The derivative of the saliency function with respect to the contour C is

∂Saliency(C)/∂v = −α ∂O(C)/∂v + ∂D(C)/∂v − β ∂L(C)/∂v,   (9)

where v is a point on C. Each contour point is moved along the negative derivative direction for higher saliency. To compute the first two partial derivatives in (9) we have

∂O(C)/∂v = A nv,   (10)
∂O(C+)/∂v = B nv,   (11)

where nv is the outward unit normal on C and

A = ∑_{i=0}^{255} [ (1 + log hf(i)) ∂hf(i)/∂v + (1 + log hb(i)) ∂hb(i)/∂v ] − (1 + log Nf) ∂Nf/∂v − (1 + log Nb) ∂Nb/∂v,

B = ∑_{i=0}^{255} [ (1 + log(hf(i) + hC(i))) (∂hf(i)/∂v + ∂hC(i)/∂v) + (1 + log(hb(i) − hC(i))) (∂hb(i)/∂v − ∂hC(i)/∂v) ] − (1 + log(Nf + NC)) (∂Nf/∂v + ∂NC/∂v) − (1 + log(Nb − NC)) (∂Nb/∂v − ∂NC/∂v).

The partial derivatives in A and B are computed as

∂hf(i)/∂v = −∂hb(i)/∂v = [G(i − I+) + G(i − I−)]/2,   (12)
∂hC(i)/∂v = [G(i − I+) − G(i − I−)]/2,   (13)
∂Nf/∂v = −∂Nb/∂v = 1,   (14)
∂NC/∂v = [2π((Nf + 1)/π)^{0.5} − 2π((Nf − 1)/π)^{0.5}] / (2NC),   (15)

where G is the Gaussian used to smooth the image-feature histograms in (1), and I+ and I− are, respectively, the pixel intensities after a positive and a negative contour movement from v. We compute ∂L/∂v as

∂L/∂v = −(1/m) ∑_{j=1:m} [ ∑_{i=−n:n} (log h^j(i, C) + 1) ∂h^j(i, C)/∂v + (log NC + 1) ∂NC/∂v ],   (16)

∂h^j(i, C)/∂v = sign(i − (dθ/ds)^j_v) Gκ(i − (dθ/ds)^j_v) ∂(dθ/ds)^j_v/∂v,   (17)

where Gκ is the Gaussian used to smooth the contour-curvature histogram in equation (7), and (dθ/ds)^j_v is the curvature at v measured at the j-th scale. We compute ∂(dθ/ds)^j_v/∂v by transforming to the coordinate system of Fig. 6, taking the derivative of

(dθ/ds)^j_v = arctan(y0/(L − x0)) + arctan(y0/(L + x0)),   (18)

and then transforming back to the original coordinate system. (The contour points are sampled at roughly equal distances, so we neglect the arc length normalization.)

Figure 6: Curvature of a contour point v is computed using its adjacent contour points, p1 and p2. L ≡ |p1 − p2|/2.
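As a sketch, the curvature measure of Eq. (18) can be evaluated directly from the three points of Fig. 6 (v and its two neighbors p1, p2, given here as NumPy coordinate pairs).

```python
import numpy as np

def curvature(v, p1, p2):
    """(d theta/d s) at v from Eq. (18): in the frame of Fig. 6 the chord
    p1-p2 lies on the x-axis with half-length L, and the measure is the
    sum of the two angles subtended by v above the chord."""
    mid = (p1 + p2) / 2.0
    L = np.linalg.norm(p2 - p1) / 2.0
    ux = (p2 - p1) / (2.0 * L)      # unit vector along the chord
    uy = np.array([-ux[1], ux[0]])  # unit normal to the chord
    x0, y0 = np.dot(v - mid, ux), np.dot(v - mid, uy)
    return np.arctan2(y0, L - x0) + np.arctan2(y0, L + x0)
```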

We start a contour at the image edge and evolve it until it converges to a local minimum. If the contour splits, our current implementation keeps only the most salient part. We can help shrink the contour by including in (9) a negative multiple of (3.3), which adds an extra inward force. This prevents the contour from getting stuck in "flat" regions of the objective function where O(C) and D(C) are changing slowly. Since with the added force the contour no longer moves by steepest descent, we record and output the most salient contour encountered during the evolution. We have also explored a level set implementation of our approach but found no improvement in results.
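The search itself can be sketched as a greedy loop over whole-contour one-pixel movements; this simplifies the per-point variational update above, and curvature_fn (returning the multi-scale curvature arrays for a mask) is an assumed helper.

```python
from scipy.ndimage import binary_dilation, binary_erosion

def evolve(image, fig_mask, curvature_fn, beta=1.0, max_iter=500):
    """Greedy stand-in for the snake evolution: try one-pixel inward and
    outward moves, accept whichever raises Eq. (8), and remember the most
    salient contour encountered, as the evolution above does."""
    best_mask = fig_mask
    best_val = saliency(image, fig_mask, curvature_fn(fig_mask), beta)
    for _ in range(max_iter):
        improved = False
        for cand in (binary_dilation(fig_mask), binary_erosion(fig_mask)):
            if not cand.any() or cand.all():
                continue  # keep both figure and background nonempty
            val = saliency(image, cand, curvature_fn(cand), beta)
            if val > best_val:
                best_mask, best_val = cand, val
                fig_mask, improved = cand, True
        if not improved:
            break
    return best_mask
```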

4. Experiments

The main goal of these experiments is to show the usefulness of D(C) in improving the segmentation/detection of salient figures. We also compare our algorithm on real images against Normalized-Cuts (NC) using intensity and intervening boundary cues, Chan and Vese's method (CV) [6], Ratio Contour (RC) [39], and Paragios and Deriche's method (PD) [28, 29]. Several of these methods aim at a "globally optimal" segmentation, though given the uncertainty about what makes for a good segmentation, "locally optimal" segmentations may sometimes need to be considered. We view the segmentation algorithm as a way to propose candidates for salient figures. Thus, we evaluate algorithms by running them at several parameter settings (fixed in advance), computing the best segmentation at each setting, and then choosing from among these the one that best isolates the true figure. If the algorithm succeeds in finding the figure at some setting, then it has achieved its goal of proposing good candidates.

The code for NC and RC comes from the authors' home pages, while the tested CV and PD algorithms are our own implementations. CV uses region-based intensity cues; the (improved) RC code includes both region intensity and edge information; NC is a region-based method but constructs the affinity matrix using edge information, i.e., it uses both intensity cues and intervening edge cues. PD uses both edges and region intensity cues (edges are estimated via region intensity statistics). For PD, we estimated the Gaussian mixture model by the method of [15] using code available at the author's home page. This method, like PD, uses the MDL principle to estimate the number of components. In our experiments, we found that the mixture model estimation is unreliable and often overestimates the number of components. Our PD results were generated by fixing this number to be 3, 4, and 5 for each image.

Before testing, we standardized the images to 9600 pixels using a Gaussian pyramid. This reduces the variability of discretization errors as a confounding factor in our results. We ran each method with several different parameter settings and report results for the best values. For our method, we explored



Figure 7: Highest saliency segmentations obtained for various parameters. [β, force parameter, Saliency] are (a) [0.5, 5, −1177.4]; (b) [0.5, 10, −1067.7]; (c) [1, 2, −1359.2]; (d) [1, 5, −1289.8]; (e) [1, 10, −1328.1]; (f) [2, 5, −1490.0]. For fixed β, higher (less negative) saliency implies a more salient figure.

20 parameter combinations: β = 0.25, 0.5, 1, 2 and values of 1, 2, 5, 10, 20 for the extra force parameter for (3.3) (see the discussion at the end of Section 3.3). However, the extra force is not a parameter of our saliency function (8); it just controls the snake's search over contours. The extra force may cause the contour saliency to increase during optimization, but at the end the algorithm outputs the figure giving the best saliency. Repeating the search for several different force values amounts to conducting a more extensive search. We obtained the results shown by: 1) choosing the segmentation with maximum saliency (8) for each β value; 2) selecting the best of the four resulting segmentations. In this sense, the only parameter we vary for our method is β. To show the importance of the D(C) term in our objective function, we also give results for our algorithm obtained without including this term. When D(C) is not used, we set α in (8) to 1. For optimal results, we use values for β of 1, 2, 5, 10 instead of the values given previously. Fig. 7 shows results of our method for various parameter combinations.

For RC, the algorithm outputs salient segments in order of saliency. We use the 5 most salient segments, which has been shown to give nearly the

same (upper bound) performance as with more segments [17]. No parameter is varied, except that we choose the best of the 5 segments in reporting results. For NC, the only parameter is the number of segments in the segmentation. We tried from 2 to 20 segments for each image. For each parameter, the two most salient segments are chosen according to the NCut criterion. Hence, there are 38 candidate figures chosen for each image. For PD, we used 16 different value assignments for the parameters, varying the number of clusters over (2, 3, 4, 5) and the contour length weights over (0.5, 1, 2, 5). For CV we used 12 parameter assignments, varying the contour length weights over (0.5, 1, 2, 5) and the region growing parameters over (0.1, 0.2, 0.5). The parameters for PD and CV were all tested and tuned experimentally to be optimal. Since each method outputs multiple candidate salient figures, the segment shown in our results for each tested method is the one that overlaps most with the real salient object. These figures also show our O(C) and D(C) values in the scale space of the detected salient contours, with the large dot indicating the contour recovered by the various methods. Our method was implemented in MATLAB and usually took several minutes to converge on a 2 GHz PC with 512 MB of memory.



Figure 8: 1st–5th rows show results by our method, NC, CV, PD, and RC, respectively. Also shown for each method are plots of O(C) (upper) and D(C) (bottom) in the pseudo-scale space of the contour selected by the method (contour indicated by the large dot).

31

For a cleaner comparison of the different methods, we follow [17] and only use images that contain a single salient object completely within the image. From the original 88 test images, we kept the 46 images shown in Figure 9. For greater precision, we divide these 46 images into four difficulty classes (see Fig. 9) and show results separately for each class. The difficulty of segmentation is measured by the "distinctiveness" of the salient object versus its background, where distinctiveness is defined as in [24] as the χ² distance between the brightness distributions of figure and background:

χ²(pf, pb) = (1/2) ∑_{k=1}^{255} [pf(k) − pb(k)]² / [pf(k) + pb(k)].   (19)

Here pf and pb are the figure and background distributions (Gaussian smoothed as before). Distinctiveness is not the only factor affecting segmentation performance, but it is a good predictor when the background is not too cluttered, as in this comparison.

Table 1 shows quantitative results. The first column gives the χ² ranges for the four difficulty classes, ordered from most to least difficult. The Table's remaining columns show the accuracies of the segmentations computed by the various algorithms. We measure the accuracy by comparing the ground truth template to the segmentation results, using the pixel-by-pixel agreement measure of [13, 17]:

Accuracy = A_{GT∩CT} / A_{GT∪CT},   (20)

where A_{GT∩CT} is the number of pixels in the intersection of the ground-truth and computed figures and A_{GT∪CT} is the number in their union.
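Both evaluation measures are simple to compute; a sketch (pf, pb are the smoothed, normalized brightness distributions, and the masks are boolean arrays):

```python
import numpy as np

def chi2_distance(pf, pb):
    """Eq. (19): distinctiveness of figure versus background."""
    nz = (pf + pb) > 0
    return 0.5 * np.sum((pf[nz] - pb[nz]) ** 2 / (pf[nz] + pb[nz]))

def accuracy(gt_mask, computed_mask):
    """Eq. (20): pixelwise agreement (intersection over union)."""
    inter = np.logical_and(gt_mask, computed_mask).sum()
    union = np.logical_or(gt_mask, computed_mask).sum()
    return inter / union
```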

As in the visualization test, to compensate for small contour localization errors due to discretization effects, we evaluate a contour by using the best score for all contours within 2 contour movements of the original. This criterion is applied to each method.

Note that all methods improve their performance when the figures have greater distinctiveness according to our measure. Hence, despite the differences in the criteria used for segmenting, all methods perform in agreement with the statistical model of brightness used by our algorithm.

Fig. 9 shows the segmentations computed by our method with/without D(C). As we can see, compared with other methods our encoding criterion does a good job on its own in locating the salient figure. However, our global contour discontinuity measure, D(C), still greatly improves the performance, especially at intermediate distinctiveness. For low distinctiveness, O(C) gives little discrimination between figure and background. Since it correlates poorly with the figure, D(C) also conveys little information, and incorporating D(C) does not significantly improve the segmentation. For more distinctive figures, O(C) gives partial discrimination and including D(C) indeed helps to improve segmentation accuracy. For high distinctiveness, O(C) captures the foreground well, and including D(C) offers little improvement. Thus, including D(C) is most useful for intermediate distinctiveness.

Although Normalized Cuts (NC) can efficiently segment images into uniform regions, as we observed in our visualization experiment, it does not perform as well as our approach in detecting salient figures.



Figure 9: Segmentation results for our method with/without D(C) and NC. Images were divided into 4 difficulty groups based on the objects’ “distinctiveness”. See Table 1 for the χ2 range defining each group. The first number under each image is the distinctiveness. The other three numbers give the segmentation accuracies for our method, with D(C) (left) and without (middle), and the NC method (right). For NC only the best segment is shown.


                           Our    no D(C)  NC    RC    PD    CV
overall                    0.53   0.44     0.41  0.43  0.43  0.42
χ²(pf, pb) ∈ [0, 0.2)      0.32   0.29     0.30  0.34  0.24  0.23
χ²(pf, pb) ∈ [0.2, 0.4)    0.54   0.42     0.40  0.45  0.43  0.40
χ²(pf, pb) ∈ [0.4, 0.6)    0.60   0.52     0.46  0.28  0.52  0.55
χ²(pf, pb) ∈ [0.6, ∞)      0.86   0.83     0.66  0.72  0.76  0.78

Table 1: Comparison of segmentation accuracies for the various methods. Higher values indicate better accuracy.

                           NCut (w = 0)  w = 0.5  w = 1  w = 5
overall                    0.41          0.46     0.48   0.48
χ²(pf, pb) ∈ [0, 0.2)      0.30          0.36     0.42   0.44
χ²(pf, pb) ∈ [0.2, 0.4)    0.40          0.43     0.42   0.43
χ²(pf, pb) ∈ [0.4, 0.6)    0.46          0.50     0.50   0.50
χ²(pf, pb) ∈ [0.6, ∞)      0.66          0.76     0.76   0.76

Table 2: Salient figure detection accuracy using the normalized cut criterion combined with our global contour discontinuity term.


To check if our global contour discontinuity term is also useful for NC, we add D(C) as defined for NC to the usual NC objective function, using them in combination to select the candidate figures. The new region saliency is O(C) + wD(C), where w is a non-negative weight, and we use this criterion to determine the best two segments for each of the 19 different segmentations (where, as before, these segmentations are obtained using the original NC algorithm without D(C)). As shown in Tab. 2 for different values of w, using the extra D(C) term for the NC criterion also boosts the detection of salient figures for NC.

Most of our algorithm's "mistakes" come from undersegmentations. Yet the visualization experiment of Fig. 5 shows that our measures usually have maxima, at least over pseudo-scale space, at the human segmentations of the Berkeley images. This suggests that our method could discover the "correct" segmentations if the snake started from appropriate contours (for example, as discussed in the introduction, we could apply our approach recursively to resegment the undersegmented region initially found by our algorithm). As a check, we ran our method on the images for which it scored an accuracy less than 0.5. We eliminated the extra force, ensuring convergence to the nearest local maximum, and used the human segmentation as an initialization. As Fig. 10 shows, in most cases our algorithm converged to a local maximum near the human segmentation. Among the 22 test images used in this test, 16 give larger saliency values at these new maxima than at the previous contour results we obtained by initializing the contour at the boundary of the image. This suggests that our approach can find the correct segmentation in these more difficult cases also. To further verify the significance of this result, we reran the algorithm starting from "random" contours (human segmentations of other images) and found that it usually


Figure 10: Segmentation results for our algorithm with/without D(C) using initialization close to the true figure. For comparison, the best results by NC and our earlier results with initialization at the image boundary are shown. 1st column: our results with initialization at the boundary of the image; 3rd column: our results for the initializations at the salient figures shown in the 2nd column; 4th column: results without using D(C).


Thus, the algorithm is more likely to converge to a contour close to its starting point when started from the human segmentation. This suggests that local maxima are not plentiful and that their occurrence near the human segmentations is significant. Fig. 10 also shows results obtained without D(C). In most cases, using just O(C) gives poor results, while including D(C) keeps the snake close to the true contour.

5. Discussion and Conclusions

We proposed a new goodness criterion for segmenting closed figures. We demonstrated our approach for intensity-based segmentation, but it should be straightforward to extend it to incorporate other features such as texture. The demonstration task is to find a figure region within the image whose intensity distribution is distinct from that of the surrounding background. We use the Minimal Description Length (MDL) framework to measure the figure's distinctiveness, but other standard measures of segmentation goodness, e.g., normalized cuts, could also be used.

Our main contribution is to add a term to the MDL framework that incorporates our expectation that the figure/background boundary is sharp. We argue that a large, abrupt change in the global image representation induced by a segmentation is an important predictor of a figure outline. This large change is, for the global representation, the analog of the large gradient used for detecting edges. Our approach naturally combines edge and region information, and integrates statistics over the entire figure boundary, thus overcoming the local ambiguities of edge detection. It generalizes the simultaneous contrast effect from local receptive fields to global representations. In addition, we proposed a new multi-scale curvature-based prior on contour shape. We maximize our measure of segmentation goodness using the snake technique.

Our experiments on a large number of real images show the advantages of our new global discontinuity measure in accurately locating salient structures. Our approach compares well to other state-of-the-art techniques, including Normalized Cuts, Ratio Contour, and the approaches of Chan and Vese and of Paragios and Deriche.

Because of the large brightness fluctuations at boundaries, the image statistics are harder to estimate there than over uniform regions. To address this, we could explore using a “thickened” contour that, for instance, would include all background pixels within r pixels of the figure (see the sketch below). A “thickened” contour would also help to detect salient structures with a smoother transition at the boundary; by coarsening the resolution, it can extend the usefulness of our step-edge transition model and the one-sided derivative based on it.

Our approach is region-based and applies most naturally to closed contours. We can extend it to open contours as long as the contour can be considered as organizing the image (or part of it) into distinct regions. For instance, we could associate the pixels on either side of a contour with regions, identifying the “figure” as the more convex side.
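A minimal sketch of one way to build such a thickened contour, assuming a boolean figure mask. Binary dilation with SciPy's default cross-shaped structuring element gives a city-block radius r, i.e., only an approximation to "within r pixels" in the Euclidean sense:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def thickened_contour_band(figure_mask: np.ndarray, r: int = 3) -> np.ndarray:
    """Background pixels within roughly r pixels of the figure:
    dilate the figure mask r times, then remove the figure itself."""
    dilated = binary_dilation(figure_mask, iterations=r)
    return np.logical_and(dilated, np.logical_not(figure_mask))
```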

References

[1] E.H. Adelson. Perceptual Organization and the Judgement of Brightness. Science, 262:2042–2044, 1993.
[2] E.H. Adelson. Lightness Perception and Lightness Illusions. The New Cognitive Neurosciences, 2nd Ed. Cambridge, MA: MIT Press, 339–351, 2000.

[3] J. August. On the Distribution of Saliency. Medical Image Computing and Computer-Assisted Intervention (MICCAI), R.E. Ellis and T.M. Peters, eds., Part II, pp. 992–993, Montreal, 2003.
[4] A. Berengolts and M. Lindenbaum. On the Distribution of Saliency. CVPR, 2004.

[5] J. Canny. A Computational Approach to Edge Detection. IEEE Trans. PAMI, 9(6):679–698, 1986.
[6] T.F. Chan and L.A. Vese. Active Contours Without Edges. IEEE Trans. Image Processing, 10(2):266–277, 2001.
[7] G. Charpiat, R. Keriven, J.-P. Pons and O. Faugeras. Designing Spatially Coherent Minimizing Flows for Variational Problems Based on Active Contours. ICCV, 2005.
[8] D. Comaniciu and P. Meer. Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. PAMI, 24(5):603–619, 2002.
[9] O. Duchenne, J.-Y. Audibert, R. Keriven, J. Ponce and F. Ségonne. Segmentation by Transduction. CVPR, 2008.
[10] J.H. Elder. Are Edges Incomplete? IJCV, 34:97–122, 1999.
[11] J.H. Elder, A. Krupnik and L.A. Johnston. Contour Grouping with Prior Models. IEEE Trans. PAMI, 25(6):661–674, 2003.
[12] J.H. Elder and S.W. Zucker. Computing Contour Closure. ECCV, 1996.
[13] F.J. Estrada and J.H. Elder. Multi-Scale Contour Extraction Based on Natural Image Statistics. 5th Workshop on Perceptual Organization in Computer Vision, 2006.
[14] J. Feldman and M. Singh. Information Along Contours and Object Boundaries. Psychological Review, 112(1):243–252, 2005.
[15] M. Figueiredo and A.K. Jain. Unsupervised Learning of Finite Mixture Models. IEEE Trans. PAMI, 24(3):381–396, 2002.
[16] C. Fowlkes, D. Martin and J. Malik. Learning Affinity Functions for Image Segmentation: Combining Patch-based and Gradient-based Approaches. CVPR, 2003.


[17] F. Ge, S. Wang and T. Liu. Image-Segmentation Evaluation From the Perspective of Salient Object Extraction. CVPR, 2006.
[18] I.H. Jermyn and H. Ishikawa. Globally Optimal Regions and Boundaries as Minimum Ratio Weight Cycles. IEEE Trans. PAMI, 23(10):1075–1088, 2001.
[19] M. Kass, A. Witkin and D. Terzopoulos. Snakes: Active Contour Models. IJCV, 1(4):321–331, 1988.
[20] T. Kadir and M. Brady. Saliency, Scale and Image Description. IJCV, 45(2):83–105, 2001.
[21] Y.G. Leclerc. Constructing Simple Stable Descriptions for Image Partitioning. IJCV, 1989.
[22] D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 2004.
[23] S. Mahamud, L. Williams, K. Thornber and K. Xu. Segmentation of Multiple Salient Closed Contours from Real Images. IEEE Trans. PAMI, 25(4):1–12, 2003.
[24] J. Malik, S. Belongie, T. Leung and J. Shi. Contour and Texture Analysis for Image Segmentation. IJCV, 43(1), 2001.
[25] D.R. Martin, C.C. Fowlkes and J. Malik. Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues. IEEE Trans. PAMI, 26(5), 2004.
[26] J. Matas, O. Chum, M. Urban and T. Pajdla. Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. BMVC, 2002.
[27] D. Mumford and J. Shah. Optimal Approximations by Piecewise Smooth Functions and Associated Variational Problems. Comm. Pure Appl. Math., 577–685, 1989.
[28] N. Paragios and R. Deriche. Coupled Geodesic Active Regions for Image Segmentation. ECCV, 2000.
[29] N. Paragios and R. Deriche. Geodesic Active Regions and Level Set Methods for Supervised Texture Segmentation. IJCV, 223–247, 2002.
[30] X. Ren and J. Malik. A Probabilistic Multi-Scale Model for Contour Completion Based on Image Statistics. ECCV, 2002.


[31] A. Sha'ashua and S. Ullman. Structural Saliency: The Detection of Globally Salient Structures Using a Locally Connected Network. ICCV, 1988.
[32] E. Sharon, A. Brandt and R. Basri. Completion Energies and Scale. IEEE Trans. PAMI, 22(10):1117–1131, 2000.
[33] E. Sharon, A. Brandt and R. Basri. Fast Multiscale Image Segmentation. CVPR, I:70–77, 2000.
[34] E. Sharon, A. Brandt and R. Basri. Segmentation and Boundary Detection Using Multiscale Intensity Measurements. CVPR, 2001.
[35] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. IEEE Trans. PAMI, 22(8):888–905, 2000.
[36] G. Sundaramoorthi, A. Yezzi and A. Mennucci. Sobolev Active Contours. VLSM Workshop, 2005.
[37] L. Wolf, X. Huang, I. Martin and D. Metaxas. Patch-Based Texture Edges and Segmentation. ECCV, 2006.
[38] H. Wang and J. Oliensis. Salient Contour Detection Using a Global Discontinuity Measurement. POCV Workshop, 2006.
[39] S. Wang, T. Kubota, J.M. Siskind and J. Wang. Salient Closed Boundary Extraction with Ratio Contour. IEEE Trans. PAMI, 27(4):546–561, 2005.
[40] L.R. Williams and D.W. Jacobs. Stochastic Completion Fields: A Neural Model of Illusory Contour Shape and Salience. Neural Computation, 9(4):837–858, 1997.
[41] S.C. Zhu. Embedding Gestalt Laws in Markov Random Fields. IEEE Trans. PAMI, 21(11):1170–1187, 1999.
[42] S.C. Zhu and A. Yuille. Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multiband Image Segmentation. IEEE Trans. PAMI, 18(9):884–900, 1996.
