Efficient Object Localization Using Convolutional Networks
Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christoph Bregler
New York University
Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance, and prevent overtraining. These benefits of pooling come at the cost of reduced localization accuracy. In this paper we introduce a novel architecture which includes an efficient 'position refinement' model that is trained to estimate the joint offset location within a small region of the image. This refinement model is jointly trained in cascade with a state-of-the-art ConvNet model [3] to achieve improved accuracy in human joint location estimation. An overview of the detection architecture is shown in Figure 1.
tompson/goroshin/ajain/lecun/[email protected]
Figure 3: Performance improvement from the cascaded model. Detection rate (%) vs. normalized distance on (a) face and (b) wrist, for 4x, 8x, and 16x pooling models with and without the cascade.
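The detection-rate curves in Figure 3 report, at each normalized distance threshold, the fraction of predictions that fall within that distance of the ground truth. A minimal sketch of such a PCK-style metric (not the authors' evaluation code; array shapes are assumptions):

```python
import numpy as np

def detection_rate(pred, gt, norm, threshold):
    """Fraction of predictions within `threshold` (normalized) of ground truth.

    pred, gt:  (N, 2) arrays of predicted / ground-truth joint positions.
    norm:      (N,) per-example normalization (e.g. torso or head size).
    threshold: normalized distance at which a prediction counts as correct.
    """
    dist = np.linalg.norm(pred - gt, axis=1) / norm
    return float(np.mean(dist <= threshold))
```

Sweeping `threshold` over [0, 0.2] traces out a curve like those in Figure 3.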
Figure 1: Overview of our Cascaded Architecture. A coarse heat-map model (LCN, convolution, and pooling stages applied to a 3x256x256 input, with feature maps of 128x256x256 down to 128x64x64) produces a coarse (x, y) joint estimate. The input is cropped at (x, y) and passed to a fine heat-map model (per-joint feature maps from 14x128x36x36 down to 14x128x9x9, yielding 14x36x36 heat-maps), which outputs a (Δx, Δy) refinement and the final (x, y) position.
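The cascade of Figure 1 can be sketched as follows. This is a simplified illustration, not the authors' implementation: `coarse_model` and `fine_model` are hypothetical callables standing in for the two ConvNets, and the heat-map shape is assumed:

```python
import numpy as np

def cascade_predict(image, coarse_model, fine_model, crop_size=36):
    """Sketch of the cascaded localization pipeline.

    image:        (3, H, W) input.
    coarse_model: image -> per-joint heat-maps, e.g. shape (14, 64, 64).
    fine_model:   (cropped window, joint index) -> (dx, dy) offset.
    """
    heat_maps = coarse_model(image)      # coarse per-pixel joint likelihoods
    n_joints, hm_h, hm_w = heat_maps.shape
    scale = image.shape[1] / hm_h        # heat-map -> image coordinate scale
    joints = []
    for j in range(n_joints):
        # Coarse (x, y): argmax of the j-th heat-map, mapped to image coords.
        y, x = np.unravel_index(np.argmax(heat_maps[j]), (hm_h, hm_w))
        cx, cy = x * scale, y * scale
        # Crop a small window around the coarse estimate for refinement.
        half = crop_size // 2
        crop = image[:, max(0, int(cy) - half):int(cy) + half,
                        max(0, int(cx) - half):int(cx) + half]
        dx, dy = fine_model(crop, j)     # fine model predicts the offset
        joints.append((cx + dx, cy + dy))
    return joints
```

The key property is that the expensive coarse model runs once over the full image, while the fine model only sees a small crop per joint.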
Figure 4: Our model's performance on two standard datasets compared to state-of-the-art. Detection rate (%) vs. normalized distance for (a) FLIC (this work, Tompson et al., Toshev et al., Jain et al., MODEC, Yang et al., Sapp et al.) and (b) MPII (this work 4X, Pishchulin et al.).

Inspired by the work of Tompson et al. [3], we use a multi-resolution ConvNet architecture (Figure 2) to implement a sliding window detector with overlapping contexts to produce a coarse heat-map output. This network outputs a low-resolution, per-pixel heat-map, which represents the likelihood of a joint occurring in each spatial location.
Additionally, we carried out an informal user study to estimate the statistical variation of human annotators on the FLIC dataset. From this experiment we conclude that the UV error variance of our detector approaches the variance of human annotations.
Figure 2: Multi-resolution Sliding Window Detector With Overlapping Contexts. Two input resolutions (3x256x256 and 3x128x128) are each passed through an LCN layer and three 5x5 conv + ReLU + 2x2 max-pool stages, followed by a 9x9 conv + ReLU with SpatialDropout. The low-resolution features (512x16x16) are upscaled 2x2 and summed with the high-resolution features (512x32x32), and 1x1 conv + ReLU layers reduce the result to the 14x32x32 output heat-maps.

If adjacent pixels within feature maps are strongly correlated (as is normally the case in early convolution layers), then we show that i.i.d. dropout fails to regularize the activations and instead merely results in an effective learning-rate decrease. We therefore introduce SpatialDropout, a modified dropout implementation, which allows us to improve upon the model of [3] by promoting activation independence across feature maps.

We use the architecture of Figure 1 as a platform to discuss and empirically evaluate the role of max-pooling layers in convolutional architectures for dimensionality reduction and for improving invariance to noise and local image transformations. The results in Figure 3 show that our cascaded architecture is able to recover spatial accuracy on the face (Figure 3a), and to a lesser extent on the wrist joint (Figure 3b), even in the presence of large pooling and sub-sampling in the coarse heat-map model. For this evaluation we use the standard PCK measure [2] on the FLIC dataset [2].

Our discriminative architecture outperforms the existing state-of-the-art on the FLIC and MPII [1] datasets. Figures 4a and 4b show the PCK and PCKh results on the FLIC and MPII datasets respectively. Figure 5 shows a selection of joint predictions on the MPII test set.

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.
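The SpatialDropout idea discussed above, dropping entire feature maps rather than individual activations, can be sketched in a few lines of NumPy (a training-time illustration under our own assumptions, not the paper's layer implementation):

```python
import numpy as np

def spatial_dropout(x, p, rng=None):
    """Zero entire feature maps with probability p (inverted dropout).

    x: (channels, height, width) feature tensor.
    p: probability of dropping each channel.

    Because adjacent pixels within a map are strongly correlated, zeroing
    whole maps removes the correlated signal that per-pixel (i.i.d.)
    dropout would leave largely intact.
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape[0]) >= p           # one Bernoulli draw per channel
    mask = keep[:, None, None].astype(x.dtype)   # broadcast over height, width
    return x * mask / (1.0 - p)                  # rescale so E[output] = input
```

At test time the layer is simply the identity, as with standard inverted dropout.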
Figure 5: Our Model's Predicted Joint Positions on the MPII-human-pose database test-set [1]

[1] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] B. Sapp and B. Taskar. MODEC: Multimodal decomposable models for human pose estimation. In CVPR, 2013.
[3] Jonathan Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.