Segmentation-Aware Convolutional Nets

Adam W. Harley, Konstantinos G. Derpanis
Ryerson University
{adam.harley, kosta}@ryerson.ca

Iasonas Kokkinos
Ecole Centrale Paris
[email protected]

Abstract

This paper proposes a new deep convolutional neural network (DCNN) architecture for learning semantic segmentation. The main idea is to train the DCNN to produce internal representations that respect object boundaries. That is, for any two pixels on the same object, the DCNN is trained to produce nearly-identical internal representations; conversely, the DCNN is trained to produce dissimilar representations for pixels coming from different objects. This strategy is complementary to many others pursued in semantic segmentation, making its integration with existing systems straightforward. Experimental results show that when this approach is combined with a pre-trained state-of-the-art segmentation system, per-pixel classification accuracy improves, and the resulting segmentations are qualitatively sharper. When combined with a dense conditional random field, this approach exceeds the prior state of the art on the PASCAL VOC2012 segmentation task. Further experiments show that the internal representations learned by the network make state-of-the-art features for patch-based stereo correspondence and motion tracking.
Figure 1. Semantic segmentations produced by DCNNs can be improved by adding segmentation-aware descriptors to the network, such as boundary cues and similarity embeddings.
1. Introduction

This paper proposes a deep convolutional network (DCNN) architecture for learning semantic segmentation. Deep convolutional neural networks (DCNNs) are the method of choice for a variety of high-level vision tasks [19]. Fully-convolutional DCNNs have recently become popular for semantic segmentation, because they can be efficiently trained end-to-end for pixel-level classification [5, 16]. A weakness of DCNNs is that they tend to produce smooth, low-resolution predictions, partly due to the subsampling caused by cascaded convolution and max-pooling layers.

Many strategies have been explored for sharpening the boundaries of predictions produced by fully-convolutional DCNNs. One popular strategy is to add a dense conditional random field (CRF) to the end of the DCNN, introducing contextual information to the segmentation via long-range dependencies in the CRF [5, 14]. A complementary strategy is to reduce the subsampling effected by convolution and pooling, by using the "hole" algorithm for convolution [5]. A third strategy is to add trainable up-sampling stages to the network via "deconvolution" layers in the DCNN [16].

This paper's strategy, which is complementary to those previously explored, is to train the network to produce segmentation-aware internal representations, so that foreground pixels and background pixels within local regions can be treated differently. In particular, the aim is to improve the quality of the output of the convolutional layers, by using local pixel affinities to moderate the application of the filter weights. For instance, if a filter is centered on a "cat" pixel, and the patch includes some pixels from an occluder or the background, the goal is to apply the filter weights exclusively to the "cat" pixels in the patch. This way, the filter response is invariant to distractors in the context.

The key to accomplishing this is to have the network produce internal representations that lend themselves to pairwise comparisons, such that any pair that lies on the same object produces a high affinity measure, and pairs that straddle a boundary produce a low affinity measure. This work explores several techniques for achieving this, including deep-learned "embeddings" and intervening contours computed on deep-learned boundary cues. Similar work has been done using hand-crafted representations [23], with good results. The current work generalizes this to a fully-learned approach. The entire system is implemented within the framework of a DCNN, and can be trained end-to-end.
2. Related work

The task of semantic segmentation is to simultaneously recognize and segment objects in an image. This is in contrast to pure segmentation techniques (such as curve evolution [21] and graph cuts [12]), which rely on object proposals first, and also in contrast to pure recognition techniques (such as DCNN classifiers [11]), which seek to identify but not localize objects in images. An important aspect of semantic segmentation is that it requires features to be computed densely across the image. This is not only a computational challenge, but also places a burden on the features to be invariant to per-pixel changes in scale and occlusion. An influential approach to this problem has been the search for general "objectness" descriptors [2], which enable category-independent segmentations and region proposals [7]. State-of-the-art object hypothesis algorithms are very fast and have high recall [20]. Still, the risk of a missed detection in the front-end has motivated a long and continued history of research into sliding-window detectors [25].

Prior work most related to the current work involves segmentation-aware descriptors, which are typically hand-crafted, and DCNNs, which are typically trained without segmentation-awareness in mind.

Segmentation-aware descriptors. The purpose of a segmentation-aware descriptor is to capture the appearance of the foreground while being invariant to changes in the background or occlusions. There has been significant work on hand-crafting segmentation-aware descriptors. For instance, soft foreground-background segmentations have been used to suppress the contribution of the background in sliding-window HOG features [18]. Similarly, boundary cues [17, 22] and soft segmentation masks [13] have been used to modify SIFT-like patch descriptors to suppress contributions from pixels likely to come from the background [23]. Additionally, superpixel algorithms [1] and HOG descriptors have been used to find pixel affinities within object proposal windows, and to construct foreground- and background-focused descriptors for deformable part models [24].

DCNNs for semantic segmentation. Fully-convolutional DCNNs are effectively sliding-window detectors. Despite this fact, recent progress in open-source GPU implementations has made this approach very fast and efficient [10]. Part of the appeal of DCNNs for semantic segmentation is that they can be trained end-to-end, without the need for any hand-crafted feature representations. Still, DCNNs have many hyperparameters to consider, and careful work has gone into engineering around DCNNs'
inherent weaknesses. For semantic segmentation in particular, DCNNs' repeated subsampling (through strided convolution and max-pooling) reduces the resolution at which the DCNN can make predictions. Chen et al. [5] recently introduced the "hole" algorithm to partially address this issue (reducing the subsampling factor for a standard architecture from 32 to 8), but blurred outputs remain a problem. Trainable up-sampling stages via "deconvolution" layers in the DCNN [16] are another solution, and appear to produce sharper predictions than bilinear upsampling. A third solution, growing in popularity, is to attach a CRF to the top layer of a DCNN. This has the effect of introducing contextual information to the segmentation, via long-range dependencies in the CRF. The CRF can be trained as a separate module [5] or jointly with the DCNN [14], though in both cases at significant added computational cost.

Another approach to the subsampling issue, more in line with the current paper, is to incorporate segmentation cues into a DCNN. Dai et al. [6] recently used superpixels to generate masks for convolutional feature maps, enforcing sharp contours in their outputs. To date, this has only been implemented using a hand-crafted segmentation cue, without any learning. The current paper takes the idea of segmentation-aware DCNNs further, by replacing all hand-crafted parts with learnable variants, and by introducing new learning objectives to the DCNN to encourage segmentation-aware internal representations.

Contributions. In light of the related work, this paper makes the following contributions. First, the paper frames pairwise feature affinity computation as a convolution-like process that can be efficiently implemented in a DCNN. Second, the paper presents segmentation-aware convolutional nets: DCNNs that learn features that respect object boundaries. Third, the segmentation-aware DCNN is integrated with a state-of-the-art semantic segmentation system, and is shown to improve performance on the VOC2012 segmentation task.
3. Technical approach

This work builds on the DeepLab [5] model for semantic segmentation, and on the Holistically-nested Edge Detection (HED) [26] model for contour detection. The core of the DeepLab approach is a fully-convolutional network, initialized from the popular VGG-16 [4] object recognition model. This network has been fine-tuned in fully-convolutional fashion on semantic segmentation, and it contains several streams for processing the image at different scales. In the complete DeepLab pipeline, a CRF is trained on each output image of the network, and the CRFs produce sharpened outputs. The HED network is another fully-convolutional multi-scale network, trained to produce an image representing the probability of a boundary at each pixel.

In this work, these baseline networks make up two of the three parts in a larger network. The third part is the novel segmentation-aware network, which is the main focus of this section.
3.1. Learning segmentation embeddings

This work creates a set of convolutional layers focused on producing descriptors, or "embeddings", that can be used to calculate pixel affinities relating to semantic similarity. The goal is for pixels that share a semantic category to produce similar embeddings (i.e., a high affinity), and for pairs that do not share a semantic category to produce dissimilar embeddings (i.e., a low affinity). This goal is irrespective of the embeddings' spatial proximity, although in practice the comparisons occur within a small window.

To learn embeddings, pairwise comparisons are made between embeddings produced at different pixels. For a pair of pixels $i, j$ in an image, we compute the distance between the embeddings at those pixels, $|v_i - v_j|$, penalizing large distances for pixels with a shared label ($l_i = l_j$), and penalizing small distances for pixels with different labels ($l_i \neq l_j$). Specifically, we define a near threshold $\alpha$ and a far threshold $\beta$, and define the loss of a comparison to be

$$\ell_{ij} = \begin{cases} \max(|v_i - v_j| - \alpha, 0) & \text{if } l_i = l_j, \\ \max(\beta - |v_i - v_j|, 0) & \text{if } l_i \neq l_j, \end{cases} \quad (1)$$

where $v_i$ represents the embedding at pixel $i$, and $l_i$ represents the label at that pixel. This loss is computed at all pixels $i$ in the image $I$, and for each pixel $j$ in a neighbourhood $N(i)$ surrounding $i$, so the total loss $L$ of the network is given by

$$L = \sum_{i \in I} \sum_{j \in N(i)} \ell_{ij}. \quad (2)$$
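To make the training objective concrete, the following is a minimal NumPy sketch of the loss in Eqs. (1)-(2). The L1 distance matches the "L1 embeds." variant evaluated later; the specific values of alpha, beta, and the neighbourhood radius here are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def embedding_loss(emb, labels, alpha=0.5, beta=2.0, radius=2):
    """Pairwise hinge loss of Eqs. (1)-(2).

    emb:    (H, W, D) array of per-pixel embeddings v_i.
    labels: (H, W) integer array of per-pixel labels l_i.
    radius: half-width of the square neighbourhood N(i).
    """
    H, W, _ = emb.shape
    total = 0.0
    for y in range(H):
        for x in range(W):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < H and 0 <= nx < W):
                        continue
                    # L1 distance between the two embeddings.
                    d = np.sum(np.abs(emb[y, x] - emb[ny, nx]))
                    if labels[y, x] == labels[ny, nx]:
                        # Same label: penalize distances beyond the near threshold.
                        total += max(d - alpha, 0.0)
                    else:
                        # Different labels: penalize distances inside the far threshold.
                        total += max(beta - d, 0.0)
    return total
```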
The network is trained to minimize this loss through stochastic gradient descent.

Once these embeddings are learned, they can be used to create segmentation masks. For a pixel $i$ and a neighbouring pixel $j \in N(i)$, we define

$$m_{ij} = \exp(-\lambda |v_i - v_j|) \quad (3)$$

to be the weight applied to pixel $j$ in a mask centered on $i$, where $\lambda$ is a parameter specifying the sharpness of the mask. This parameter can be learned inside a DCNN. To improve the consistency of the masks, we rescale the distances to the range $[0, 1]$ before softening by the logistic function. We apply these masks convolutionally, so that the output at each pixel becomes

$$y_i = \sum_{j \in N(i)} m_{ij} x_j, \quad (4)$$

where $x_j$ is the value of input pixel $j$, and $y_i$ is the output of the masked convolution at that pixel. The effect is to replace each pixel with a weighted average of its neighbours, with the weights given by the affinities. Since the affinities capture semantic similarity, this is expected to improve the quality of the output. In practice, we normalize each output by the sum of the mask, so that the magnitudes across the image do not change as a function of the neighbourhood size.
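The masking step of Eqs. (3)-(4), together with the normalization just described, amounts to a normalized, affinity-weighted local average. A minimal single-channel NumPy sketch follows; the value of lambda and the neighbourhood radius are illustrative assumptions, and the per-image distance rescaling is omitted for brevity.

```python
import numpy as np

def masked_average(x, emb, lam=1.0, radius=2, eps=1e-8):
    """Eqs. (3)-(4) with normalization: replace each pixel of x with an
    affinity-weighted average of its neighbours.

    x:   (H, W) single-channel input.
    emb: (H, W, D) per-pixel embeddings.
    lam: mask sharpness (the lambda of Eq. (3)).
    """
    H, W, _ = emb.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            num, den = 0.0, 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    if not (0 <= ni < H and 0 <= nj < W):
                        continue
                    dist = np.sum(np.abs(emb[i, j] - emb[ni, nj]))
                    m = np.exp(-lam * dist)   # Eq. (3): mask weight m_ij
                    num += m * x[ni, nj]      # Eq. (4): weighted sum
                    den += m                  # mask total, for normalization
            out[i, j] = num / (den + eps)
    return out
```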
3.2. Joint masking and semantic segmentation

As an alternative to learning embeddings in a separate training process with a dedicated loss function, we also try training the embedding-enabled network directly on semantic segmentation. In this way, we attempt to learn a segmentation-aware DCNN in an end-to-end manner. Even without enforcing good distances through a dedicated loss, "embeddings" should still be learned in this end-to-end process, since they are used to produce masks: because the outputs of the embedding layers are expected to serve as material for masks, those layers should naturally learn representations that capture segmentation awareness. To help this process, we pre-train the embeddings using a dedicated loss before adding them to the end-to-end network for fine-tuning.
3.3. Contour-based affinities

We also explore the use of contour cues to generate pixel affinities. In this work, we use the learned contour cues of a state-of-the-art DCNN trained on the task, the Holistically-nested Edge Detection (HED) network [26]. For each pixel $i$, we apply the intervening contours algorithm [9] to each of its neighbours $j \in N(i)$, to determine the maximum probability of a contour lying on the line that travels from $i$ to $j$. If two pixels $i$ and $j$ lie on different objects, there is likely to be a boundary separating them; the intervening contours step returns the probability of that boundary. As with the embeddings, this step is implemented entirely within the DCNN: intervening contours are computed on the GPU with a specialized layer for the task.

Figure 2 illustrates how the embeddings are integrated in a network. We add the embedding-and-masking process to a pre-trained VGG network.
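The following is a minimal sketch of the intervening-contour computation for one pixel pair, assuming a boundary-probability map such as the HED output; the sampling density along the line is an illustrative assumption (the paper's specialized GPU layer is not shown).

```python
import numpy as np

def intervening_contour_prob(boundary, i, j, n_samples=10):
    """Maximum boundary probability on the line from pixel i to pixel j.

    boundary: (H, W) per-pixel boundary probabilities (e.g., HED output).
    i, j:     (row, col) pixel coordinates.
    """
    (r0, c0), (r1, c1) = i, j
    max_p = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        # Nearest-pixel sample along the segment from i to j.
        r = int(round(r0 + t * (r1 - r0)))
        c = int(round(c0 + t * (c1 - c0)))
        max_p = max(max_p, boundary[r, c])
    # A high value suggests an object boundary between i and j,
    # so the corresponding affinity should be low.
    return max_p
```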
4. Evaluation

For training the components of the segmentation-aware network, we use several datasets with per-pixel object labellings: BSDS500 [3], VOC2012 [8], and MSCOCO [15]. For evaluation, our baseline model is the publicly-released Deeplab-MSc-Coco-LargeFOV network [5], which approximately represents the state of the art. This model was initialized from a VGG network trained on ImageNet, then trained on the Microsoft COCO training and validation sets [15], and finally fine-tuned on the training and validation sets of Pascal VOC 2012 [8]. We augment this network with embeddings learned on the VOC training set, and with a HED network trained on the BSDS dataset.

Our initial evaluation on the VOC validation set is presented in Table 1. Results show that using either the embeddings or intervening contours on the boundaries offers approximately a 1% improvement over the Deeplab network. Combining the two segmentation cues pushes results slightly higher, to a 1.2% improvement.
Figure 2. Schematic for the segmentation-aware fully-convolutional DCNN featured in this work. The image is sent to three processing streams, which merge later: a "Deeplab" network [5], an "embeddings" network, and a "Holistically-nested Edge Detection" (HED) [26] network. Each layer of the embedding network is sent to a pairwise L1 distance layer (indicated with L1 boxes), and the layers are trained through those distances via loss layers (indicated with L boxes). Affinities computed via embedding distances are added to affinities computed via intervening contours on the HED boundaries, and this sum is multiplied with the im2col-processed output of the DeepLab network. The multiplication can be weighted, for normalized convolution. The result is finally converted back to image format via a col2im operation.
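To make the schematic concrete, here is a minimal sketch of the merge step in Figure 2, assuming the im2col-style neighbourhoods and the two affinity cues have already been computed; the array shapes and the unweighted sum of the two affinity sources are illustrative assumptions.

```python
import numpy as np

def merge_streams(patches, emb_dist, ic_affinity, lam=1.0, eps=1e-8):
    """One channel of the im2col -> mask -> col2im pipeline of Figure 2.

    patches:     (H, W, K) im2col-style neighbourhoods of the DeepLab output.
    emb_dist:    (H, W, K) averaged embedding distances (Eavg) to the K neighbours.
    ic_affinity: (H, W, K) affinities from intervening contours on HED boundaries.
    """
    # Convert embedding distances to affinities and add the contour-based cue.
    affinity = np.exp(-lam * emb_dist) + ic_affinity
    # Weighted multiplication with the im2col output (the "x" node in Figure 2),
    # then collapse the neighbour axis back to one value per pixel (col2im),
    # normalizing by the mask total (normalized convolution).
    return (affinity * patches).sum(axis=2) / (affinity.sum(axis=2) + eps)
```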
Method                                  | mean IOU (%)
----------------------------------------|-------------
Deeplab                                 | 78.220
Deeplab with 15x15 IC                   | 79.237
Deeplab with 15x15 L1 embeds.           | 79.387
Deeplab with 15x15 L2 embeds.           | 78.999
Deeplab with 15x15 IC and L1 embeds.    | 79.477

Table 1. VOC2012 validation results. "Deeplab" refers to the publicly-released multi-scale large field-of-view model from Chen et al. [5], which was trained on the trainval set of VOC2012 as well as the COCO training set.

5. Conclusion

This paper proposes and provides initial results for a new deep convolutional neural network architecture for learning semantic segmentation. Results show that augmenting a state-of-the-art semantic segmentation DCNN with "segmentation-aware" streams improves results by approximately 1% on the validation set. This is a work in progress.
References

[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. TPAMI, 34(11):2274-2282, 2012.
[2] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR, pages 73-80, 2010.
[3] P. Arbeláez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. TPAMI, 33(5):898-916, May 2011.
[4] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2014.
[6] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. arXiv, 2014.
[7] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, pages 575-588, 2010.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
[9] C. Fowlkes, D. Martin, and J. Malik. Learning affinity functions for image segmentation: Combining patch-based and gradient-based approaches. In CVPR, volume 2, pages II-54, 2003.
[10] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv, 2014.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106-1114, 2012.
[12] V. Lempitsky, A. Blake, and C. Rother. Image segmentation by branch-and-mincut. In ECCV, pages 15-29, 2008.
[13] M. Leordeanu, R. Sukthankar, and C. Sminchisescu. Efficient closed-form solution to generalized boundary detection. In ECCV, pages 516-529, 2012.
[14] G. Lin, C. Shen, I. D. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. arXiv, 2015.
[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740-755, 2014.
[16] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2014.
[17] M. Maire, P. Arbeláez, C. Fowlkes, and J. Malik. Using contours to detect and localize junctions in natural images. In CVPR, pages 1-8, 2008.
[18] P. Ott and M. Everingham. Implicit color segmentation features for pedestrian and object detection. In ICCV, pages 723-730, 2009.
[19] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR, pages 512-519, 2014.
[20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[21] M. Rousson and N. Paragios. Shape priors for level set representations. In ECCV, pages 78-92, 2002.
[22] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888-905, 2000.
[23] E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Dense segmentation-aware descriptors. In CVPR, pages 2890-2897, 2013.
[24] E. Trulls, S. Tsogkas, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Segmentation-aware deformable part models. In CVPR, pages 168-175, 2014.
[25] A. Vedaldi and A. Zisserman. Structured output regression for detection with partial truncation. In NIPS, pages 1928-1936, 2009.
[26] S. Xie and Z. Tu. Holistically-nested edge detection. In CVPR, 2015.