Human Tracking Using Convolutional Neural Networks
Jialue Fan, Wei Xu, Ying Wu, Yihong Gong
IEEE Transactions on Neural Networks, vol. 21, no. 10, October 2010
Outline
• INTRODUCTION
• WHAT IS CNN?
• CNN TRACKING
  • Detailed Description
  • From Shift-Invariant to Shift-Variant
  • Training Procedure
  • Handling the Scale Change
• EXPERIMENTAL RESULTS
• CONCLUSION
INTRODUCTION
• The major challenge of traditional learning-based and/or tracking-by-detection methods is false-positive matches, which lead to wrong association of the tracks (drift)
INTRODUCTION • We extract both spatial and temporal structures (motion information) by considering the image pair of two consecutive frames rather than a single frame • In this paper, we use convolutional neural networks (CNNs) • Shift-invariant? → Shift-variant • Local features & Global features • Scale change? → Key Points
WHAT IS CNN?
• Fully Connected Neural Networks: what about spatial structure?
[Figure: a fully connected network applied to a human figure labeled Head, Body, Legs]
Convolutional Neural Networks (CNN)
[Figure: 28 × 28 input image]
Local receptive fields
• Each hidden neuron is connected only to a small local region of the input image
• Sliding a 5 × 5 local receptive field over a 28 × 28 input gives a 24 × 24 hidden layer (sometimes a different stride length is used)
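The 28 × 28 → 24 × 24 arithmetic above is just the valid-convolution size formula; a minimal sketch (my own illustration, not code from the paper):

```python
# Sketch: number of positions a local receptive field can take along one
# axis of the input (valid convolution, no padding).
def conv_output_size(input_size, field_size, stride=1):
    return (input_size - field_size) // stride + 1

# A 5 x 5 receptive field slid over a 28 x 28 input, stride 1:
print(conv_output_size(28, 5))            # -> 24
# With a stride length of 2 instead:
print(conv_output_size(28, 5, stride=2))  # -> 12
```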
Shared weights and bias
• All neurons in a feature map share the same weights and bias; the activation of the hidden neuron at (j, k) is
  a_{j,k} = σ(b + Σ_{l=0}^{4} Σ_{m=0}^{4} w_{l,m} x_{j+l, k+m})
  where σ is the sigmoid function, σ(z) = 1 / (1 + e^{−z})
Shared weights and bias
• To do image recognition we'll need more than one feature map
• LeNet-5 used 6 feature maps
Shared weights and bias
• A big advantage of sharing weights and biases is that it greatly reduces the number of parameters
• Fully connected: 28 × 28 (input neurons) × 30 (hidden neurons) + 30 (biases) = 23,550 parameters
• Convolutional: 5 × 5 (shared weights) × 20 (feature maps) + 20 (biases) = 520 parameters (with 30 feature maps: 5 × 5 × 30 + 30 = 780)
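The parameter counts above can be checked with a few lines (a sketch of the slide's arithmetic, not code from the paper):

```python
# Sketch: parameter counts for the two architectures on this slide.
def fully_connected_params(n_in, n_hidden):
    return n_in * n_hidden + n_hidden          # weights + one bias per neuron

def conv_params(field, n_maps):
    return field * field * n_maps + n_maps     # shared weights + one bias per map

print(fully_connected_params(28 * 28, 30))  # -> 23550
print(conv_params(5, 20))                   # -> 520
print(conv_params(5, 30))                   # -> 780
```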
Pooling layers
• Simplify the information in the output from the convolutional layer
• Max-pooling (2 × 2) condenses each 24 × 24 feature map to 12 × 12
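A minimal pure-Python sketch of 2 × 2 max-pooling (illustrative only; `max_pool_2x2` is my own helper name):

```python
# Sketch: 2x2 max-pooling on a nested-list "feature map" -- each output
# value is the maximum over one non-overlapping 2x2 block of the input.
def max_pool_2x2(fmap):
    rows, cols = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, cols, 2)]
            for i in range(0, rows, 2)]

fmap = [[1, 2, 3, 0],
        [4, 5, 6, 1],
        [0, 1, 2, 3],
        [1, 0, 4, 2]]
print(max_pool_2x2(fmap))  # -> [[5, 6], [1, 4]]
```

Applied to a 24 × 24 map this halves each dimension, giving the 12 × 12 output on the slide.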
CNN TRACKING
Detailed Description
1. Each normalized patch (48 × 128) of the image pair is split into five input feature maps (R/G/B/𝐷𝑥/𝐷𝑦)
2. 𝐶1 is a convolutional layer with 10 feature maps, each connected to a 5 × 5 neighborhood of the input → 44 × 124
   ※ 𝐶1(k, i, j) is the value at position (i, j) in the kth feature map of layer 𝐶1
3. S1 is a subsampling layer that reduces the resolution of 𝐶1
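A hedged sketch of step 1: the slide does not define 𝐷𝑥/𝐷𝑦, so this illustration assumes they are simple horizontal/vertical intensity gradients of a grayscale version of the patch; `input_maps` and `gradient_maps` are hypothetical helper names, not the authors' code.

```python
# Hedged sketch: build the five input feature maps for one patch, assuming
# Dx/Dy are horizontal/vertical gradients of the grayscale patch (the exact
# operator is not given on the slide).
def gradient_maps(gray):
    h, w = len(gray), len(gray[0])
    dx = [[gray[i][min(j + 1, w - 1)] - gray[i][j] for j in range(w)] for i in range(h)]
    dy = [[gray[min(i + 1, h - 1)][j] - gray[i][j] for j in range(w)] for i in range(h)]
    return dx, dy

def input_maps(r, g, b):
    # grayscale as the mean of the colour channels (an illustrative choice)
    h, w = len(r), len(r[0])
    gray = [[(r[i][j] + g[i][j] + b[i][j]) / 3.0 for j in range(w)] for i in range(h)]
    dx, dy = gradient_maps(gray)
    return [r, g, b, dx, dy]  # five feature maps: R/G/B/Dx/Dy

maps = input_maps([[0, 3]], [[0, 3]], [[0, 3]])
print(len(maps))  # -> 5
print(maps[3])    # -> [[3.0, 0.0]]
```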
Detailed Description
4. The global branch aims to enlarge the receptive field; the local branch aims to discover more details about local structures
Detailed Description
• η_{λ,ρ}, for ρ = 0, …, p−1
• Example: λ = 9, p = 4 → η_{9,3} = 4
Detailed Description
• 𝐾λ = p × 3 × 7 trainable kernels
• Randomly choose 10 feature maps; four-times upsampling?
• Global receptive field = 28 × 68
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998
Detailed Description
• Each unit in each feature map is connected to a 7 × 7 neighborhood of layer S1
Detailed Description
• The convolution filter that produces the probability map is a linear function followed by a sigmoid transformation (the filter is trainable)
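The "linear function followed by a sigmoid transformation" has this simple form (a sketch; `prob_unit` is an illustrative name, not the paper's notation):

```python
# Sketch: one probability-map unit -- a linear combination of its input
# features squashed through a sigmoid, so the output lies in (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_unit(features, weights, bias):
    z = bias + sum(w * f for w, f in zip(weights, features))
    return sigmoid(z)  # probability that the target is at this position

print(prob_unit([1.0, -2.0], [0.5, 0.25], 0.0))  # -> 0.5
```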
From Shift-Invariant to Shift-Variant
• Whether a location gets a high detection score should depend on the object's previous location
• The local features 𝐿𝐴 and 𝐿𝐴′ are similar, but the global feature 𝐺𝐴 is larger than 𝐺𝐴′
Training Procedure
• The offline training set includes around 20,000 samples, collected by NEC Laboratories
• We manually annotate the bounding boxes of human heads
• Given one bounding box centered at 𝑥𝑡−1, we extract 𝑁𝑡−1(𝑥𝑡−1) and 𝑁𝑡(𝑥𝑡−1) as the CNN input
Training Procedure
• The model is trained offline with standard stochastic gradient descent, minimizing the difference between the probability map output by the CNN and the target probability map
• Once the model is trained, it is fixed during tracking (unlike adaptive models)
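A hedged sketch of the training target: the slide does not specify the target probability map's exact shape, so this illustration assumes a 2-D Gaussian bump centered on the annotated location, with a squared-difference loss between maps:

```python
# Hedged sketch: a plausible target probability map (Gaussian bump at the
# annotated center) and the squared map difference that SGD would minimize.
import math

def target_map(h, w, cy, cx, sigma=2.0):
    return [[math.exp(-((i - cy) ** 2 + (j - cx) ** 2) / (2 * sigma ** 2))
             for j in range(w)] for i in range(h)]

def map_loss(pred, target):
    return sum((p - t) ** 2
               for prow, trow in zip(pred, target)
               for p, t in zip(prow, trow))

t = target_map(5, 5, 2, 2)
print(t[2][2])         # -> 1.0 (peak at the annotated center)
print(map_loss(t, t))  # -> 0.0 (a perfect prediction has zero loss)
```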
Handling the Scale Change
• In [30], a solution to scale selection integrated into the mean-shift framework is presented
• In our method, we calculate the scale by detecting the key points of the target (note: the key points are not trained independently!)
[30] R. T. Collins, “Mean-shift blob tracking through scale space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Madison, WI, Jun. 2003, pp. 234–240
Handling the Scale Change
• 𝑠𝑡: the scale of 𝑥𝑡
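One way to picture scale estimation from key points (a simplified sketch under my own assumptions; the paper's actual procedure is more involved): update the scale by the ratio of key-point distances between consecutive frames.

```python
# Hedged sketch: if the same two key points are detected in frames t-1 and
# t, the ratio of their distances gives a multiplicative scale update.
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def scale_update(s_prev, kp_prev, kp_cur):
    # kp_prev / kp_cur: the same pair of key points in frames t-1 and t
    return s_prev * dist(kp_cur[0], kp_cur[1]) / dist(kp_prev[0], kp_prev[1])

# Key points move from 10 px apart to 12 px apart -> scale grows by 1.2:
print(scale_update(1.0, [(0, 0), (0, 10)], [(0, 0), (0, 12)]))  # -> 1.2
```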
EXPERIMENTAL RESULTS
Setup and Comparison Baseline
• None of the test sequences shown here were used in training
• C++ implementation; Intel 3.6-GHz desktop; 10–15 frames/s on average (640 × 480)
• Compared with:
  1. [5] Ensemble tracker (learning-based method; does not handle scale)
  2. [23] Support vector tracker
  3. [31] Mean-shift tracker
  4. [32] Adaptive model (clustering method)
• We still use 48 × 128 image patches as the CNN input even when only part of the whole human body is visible
EXPERIMENTAL RESULTS • [5] S. Avidan, “Ensemble tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 261–271, Feb. 2007 • [23] S. Avidan, “Support vector tracking,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, Aug. 2004 • [31] D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Hilton Head, SC, Jun. 2000, pp. 142–149 • [32] M. Yang, Z. Fan, J. Fan, and Y. Wu, “Tracking non-stationary visual appearances by data-driven adaptation,” IEEE Trans. Image Process., vol. 18, no. 7, pp. 1633–1644, Jul. 2009
EXPERIMENTAL RESULTS
Impact of Using Temporal Features
[Video frames: tracking with both spatial and temporal features vs. only spatial features]
EXPERIMENTAL RESULTS
Impact of Shift-Variant Architecture
[Video frames comparing: the full CNN tracker; the CNN tracker with only the global branch; the CNN tracker with only the local branch; a shift-invariant variant; the ensemble tracker; the support vector tracker]
EXPERIMENTAL RESULTS
Drift Correction
• For the purpose of drift correction, we generate small random shifts in the training procedure
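A sketch of the idea (illustrative only; `jitter_box` is a hypothetical helper): shifted copies of the annotated bounding box are added as training samples, so the network learns to map a drifted box back onto the target.

```python
# Hedged sketch: augment training data with small random shifts of the
# annotated bounding box for drift correction.
import random

def jitter_box(box, max_shift=4, n=5, seed=0):
    """box = (x, y, w, h); returns n randomly shifted copies."""
    rng = random.Random(seed)  # seeded for reproducibility
    x, y, w, h = box
    return [(x + rng.randint(-max_shift, max_shift),
             y + rng.randint(-max_shift, max_shift), w, h)
            for _ in range(n)]

shifted = jitter_box((100, 50, 48, 128))
print(len(shifted))  # -> 5 shifted copies of the 48 x 128 box
```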
EXPERIMENTAL RESULTS
Quantitative Experiments
[Plots for the Shopping and Occlusion sequences]
EXPERIMENTAL RESULTS
Tracking with Partial Occlusion
[Video frames comparing: the CNN tracker; the ensemble tracker; the support vector tracker]
EXPERIMENTAL RESULTS
Illumination Changes
[Video frames comparing: the CNN tracker; the mean-shift tracker; the clustering method]
EXPERIMENTAL RESULTS
Scale and View Changes
[Video frames comparing: the CNN tracker; the mean-shift tracker; the clustering method]
EXPERIMENTAL RESULTS
Discussion
• In our experiments, we treat humans as the subject for tracking
• The target is related to its spatial context, which is induced by the pixels in its vicinity (shoulder, torso)
• When trained using only 1000 samples, the tracker drifts due to the cluttered background
• How to select representative training samples is nontrivial and may be our future work
CONCLUSION
• A learning method for tracking based on CNNs
• Exploits both spatial and temporal structures
• Shift-variant architecture
• Combines global and local features
• Key points solve the scale-change problem
• The main limitation: the CNN model is not designed to handle full, long-term occlusions by a distracter of the same object class
Thanks for listening!