Efficient Object Localization Using Convolutional Networks

Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christoph Bregler
New York University
tompson/goroshin/ajain/lecun/[email protected]


Recent state-of-the-art performance on human-body pose estimation has been achieved with deep convolutional networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers, which reduce computational requirements, introduce invariance, and prevent over-training. These benefits of pooling come at the cost of reduced localization accuracy. In this paper we introduce a novel architecture that includes an efficient ‘position refinement’ model trained to estimate the joint offset location within a small region of the image. This refinement model is trained jointly, in cascade, with a state-of-the-art ConvNet model [3] to achieve improved accuracy in human joint location estimation. An overview of the detection architecture is shown in Figure 1.
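The cascade described above can be sketched in a few lines: take the argmax of the coarse heat-map as an initial joint estimate, crop a small window around it, and let a fine model supply the within-window offset. This is an illustrative NumPy sketch, not the paper's code; `fine_model`, the crop size, and the assumption that the coarse heat-map is already at image resolution are all simplifications for illustration.

```python
import numpy as np

def coarse_to_fine(coarse_heatmap, image, fine_model, crop=36):
    """Refine a coarse joint estimate with a fine model run on a local crop."""
    # Coarse (x, y): spatial argmax over the coarse heat-map.
    cy, cx = np.unravel_index(np.argmax(coarse_heatmap), coarse_heatmap.shape)
    # Crop a small window centered on the coarse estimate, clamped to the image.
    h, w = image.shape[:2]
    y0 = min(max(cy - crop // 2, 0), h - crop)
    x0 = min(max(cx - crop // 2, 0), w - crop)
    window = image[y0:y0 + crop, x0:x0 + crop]
    # The fine model predicts a heat-map over the crop; its argmax is the offset.
    fine = fine_model(window)
    fy, fx = np.unravel_index(np.argmax(fine), fine.shape)
    return x0 + fx, y0 + fy
```

With a stub `fine_model` that peaks at a known offset, the returned position is the crop origin plus that offset, which is the (Δx, Δy) refinement of Figure 1 in miniature.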

[Figure 3: Performance improvement from the cascaded model. Detection rate (%) vs. normalized distance for (a) face and (b) wrist, comparing 4x, 8x, and 16x pooling models with and without the cascade.]

[Figure 1: Overview of our cascaded architecture. A coarse heat-map model (3x256x256 input; LCN, convolution, and pooling stages) produces a coarse (x, y) estimate; the input is cropped at (x, y) and passed to a fine heat-map model, whose (Δx, Δy) refinement yields the final (x, y).]

[Figure 4: Our model's performance on two standard datasets compared to state-of-the-art. Detection rate (%) vs. normalized distance on (a) FLIC, compared against Tompson et al., Toshev et al., Jain et al., MODEC, Yang et al., and Sapp et al., and on (b) MPII, compared against Pishchulin et al.]

Inspired by the work of Tompson et al. [3], we use a multi-resolution ConvNet architecture (Figure 2) to implement a sliding-window detector with overlapping contexts to produce a coarse heat-map output. This network outputs a low-resolution, per-pixel heat-map, which represents the likelihood of a joint occurring in each spatial location.
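Decoding such a per-joint heat-map stack into joint coordinates can be done by taking the spatial argmax per joint. A minimal NumPy sketch (the 14x32x32 shape follows the figures; decoding by plain argmax is our illustrative assumption, not necessarily the paper's exact procedure):

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Return one (x, y) per joint from a (num_joints, H, W) heat-map stack."""
    n, h, w = heatmaps.shape
    # Flatten each joint's map and take its argmax index.
    flat = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (h, w))
    return np.stack([xs, ys], axis=1)  # one (x, y) row per joint
```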

Additionally, we carried out an informal user study to estimate the statistical variation of human annotators on the FLIC dataset. From this experiment we conclude that the UV error variance of our detector approaches the variance of the human annotations.

[Figure 2: Multi-resolution sliding-window detector with overlapping contexts.]

If adjacent pixels within feature maps are strongly correlated (as is normally the case in early convolution layers), then we show that i.i.d. dropout does not regularize the activations and instead merely results in an effective learning-rate decrease. We therefore introduce SpatialDropout, a modified dropout implementation, which allows us to improve upon the model of [3] by promoting activation independence across feature maps.

We use the architecture of Figure 1 as a platform to discuss and empirically evaluate the role of max-pooling layers in convolutional architectures for dimensionality reduction and for improving invariance to noise and local image transformations. The results in Figure 3 show that our cascaded architecture is able to recover spatial accuracy on the face (Figure 3a), and to a lesser extent on the wrist joint (Figure 3b), even in the presence of large pooling and sub-sampling in the coarse heat-map model. For this evaluation we use the standard PCK measure [2] on the FLIC dataset [2].

Our discriminative architecture outperforms the existing state-of-the-art on the FLIC and MPII [1] datasets. Figures 4a and 4b show the PCK and PCKh results on the FLIC and MPII datasets, respectively. Figure 5 shows a selection of joint predictions on the MPII test set. This is an extended abstract; the full paper is available at the Computer Vision Foundation webpage.
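The idea behind SpatialDropout is that dropout decisions are made per feature map rather than per activation, so strongly correlated adjacent pixels are zeroed together. A minimal NumPy sketch of the training-time forward pass (the inverted-dropout 1/(1-p) rescaling is our implementation choice for illustration):

```python
import numpy as np

def spatial_dropout(x, p, rng):
    """x: (batch, channels, H, W). Drop entire feature maps with probability p."""
    # One Bernoulli keep/drop decision per (batch, channel) pair.
    keep = rng.random((x.shape[0], x.shape[1], 1, 1)) >= p
    # The mask broadcasts across H and W, zeroing whole maps; rescale the rest.
    return x * keep / (1.0 - p)
```

Because the mask has no spatial extent, a dropped map contributes nothing anywhere, which is what restores a regularizing effect when neighboring activations are nearly identical.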

Figure 5: Our model's predicted joint positions on the MPII-human-pose database test set [1].

[1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] B. Sapp and B. Taskar. MODEC: Multimodal decomposable models for human pose estimation. In CVPR, 2013.
[3] Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.